Preprocessing the Faces

In this phase of my experiments I’ve tested the following preprocessing algorithms available in pylearn2 (a small NumPy sketch of these operations follows the list):

1) Standardization: This centers the columns of the image matrix by subtracting the per-column mean and normalizes the variance by dividing by the per-column standard deviation.

2) Global Contrast Normalization: This computes the mean and standard deviation of all the pixels in each image, subtracts the mean from every pixel, and divides by the standard deviation.

3) Whitening: ZCA whitening decorrelates the pixels and removes redundancy in the images. Whitening is one of the most widely used preprocessing tools in deep learning, and it usually improves results.

4) Standardization + Whitening: Whitening and PCA are known to work better when the data is mean-centered. Standardization centers the data, so I expect standardized + whitened images to perform better.

5) Standardization + GCN: I added this one just to see whether it improves the results.
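To make these operations concrete, here is a minimal NumPy sketch of the three basic transforms, written by hand for illustration rather than taken from pylearn2 (X is assumed to be an (n_examples, n_pixels) design matrix of flattened 48×48 images):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Per-column (per-pixel) standardization across the whole dataset."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def global_contrast_normalize(X, eps=1e-8):
    """Per-image: subtract the image's mean and divide by its standard deviation."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / (X.std(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-2):
    """ZCA whitening: decorrelate pixels while staying close to the original pixel space."""
    Xc = X - X.mean(axis=0)                                   # mean-center each pixel
    cov = np.dot(Xc.T, Xc) / Xc.shape[0]                      # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)                              # eigenvectors/eigenvalues of the covariance
    W = U.dot(np.diag(1.0 / np.sqrt(S + eps))).dot(U.T)       # ZCA whitening transform
    return Xc.dot(W)
```

In the actual experiments I apply the corresponding pylearn2 preprocessor classes through its pipeline interface rather than these hand-rolled versions; the sketch is only meant to show what each step does.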

Analysing the dataset

Before starting to work on a dataset, I find it crucial to do at least some basic statistical analysis of its properties, just to get a basic idea of the data (a short snippet for computing these statistics follows the numbers below).

Statistical properties

- First order statistics:

Mean: 124.525897579

Standard deviation: 50.7928015714

- Higher order statistics:

Kurtosis: -0.256715476375

Skewness: 0.212287734117
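Statistics like these can be computed in a few lines with NumPy and SciPy; here is a minimal sketch (the file name below is a placeholder for wherever the training pixels are stored, not the actual dataset path):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Hypothetical path: load the training images as a flat array of pixel intensities.
pixels = np.load("faces_train_pixels.npy").ravel().astype(np.float64)

print("Mean:              ", pixels.mean())
print("Standard deviation:", pixels.std())
print("Kurtosis (excess): ", kurtosis(pixels))  # SciPy reports excess kurtosis (0 for a Gaussian)
print("Skewness:          ", skew(pixels))      # 0 for a symmetric distribution

# Histogram of pixel intensities (counts per intensity bin), as plotted below.
counts, bin_edges = np.histogram(pixels, bins=256, range=(0, 255))
```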

Histogram of the dataset

[Figure: Histogram of pixel intensity values of the training set of the faces dataset.]

While working with pylearn2’s pipeline interface, I noticed a few small bugs and fixed them: https://github.com/lisa-lab/pylearn2/commit/b4d74eff8d492bf7f31cfee9a669d7b5ab9c3ad4

Beyond these statistics, the images are 48×48, single-channel (grayscale) ubyte, with intensities between 0 and 255.

In my evaluation I used an MLP with two rectifier (ReLU) hidden layers of 4000 units each, trained with a decaying learning rate and momentum.
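For reference, a model of this shape can be sketched with pylearn2's Python classes roughly as follows (my experiments are actually configured through YAML files; the irange values and the number of output classes below are assumptions for illustration):

```python
from pylearn2.models.mlp import MLP, RectifiedLinear, Softmax

# Two rectifier layers of 4000 units each on 48x48 = 2304 grayscale inputs.
# n_classes = 7 is an assumption about the contest labels, and irange is illustrative.
model = MLP(
    nvis=48 * 48,
    layers=[
        RectifiedLinear(layer_name='h0', dim=4000, irange=0.05),
        RectifiedLinear(layer_name='h1', dim=4000, irange=0.05),
        Softmax(layer_name='y', n_classes=7, irange=0.05),
    ],
)
```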

As the histogram of pixel intensities and the kurtosis and skewness values suggest, the pixels are distributed almost according to a Gaussian. This is interesting, because natural images are usually assumed to have a heavy-tailed distribution.

Preprocessing                   Training Error   Test Error
Standardization                 0.062            0.219
GCN                             0.021            0.272
ZCA Whitening                   0.000            0.337
Standardization + Whitening     0.000            0.292
Standardization + GCN           0.062            0.219

As seen from the table, ZCA leads to terrible results. I think this is due to the Gaussian-like distribution of the dataset: ZCA and PCA whitening work best when the pixel intensities follow a heavy-tailed distribution such as a Laplacian or Student's t distribution.

Here are some weights learned with different preprocessing:

[Figure: Filters learned from standardized and whitened inputs. They look so scary.]

[Figure: Filters learned from standardized and global contrast normalized inputs.]

[Figure: Filters learned from standardized inputs. The filter in the upper right corner looks very familiar.]

[Figure: Filters learned from ZCA whitened inputs. They are so ugly.]

[Figure: Filters learned from global contrast normalized inputs. These look reasonable.]


Choosing the Right Optimization Algorithm

In pylearn2, there are 3 types of optimization algorithms available:

1) BGD: Batch gradient descent

Batch gradient descent computes the gradient of the loss with respect to the parameters for every example in the training set, sums up all these gradients, and then updates the parameters.

2) SGD: Stochastic gradient descent

This class supports several schemes for adjusting the learning rate and momentum. The SGD class also supports minibatch updates, which let you compute the gradient over small minibatches. I’ve implemented a batch size adjustor (https://github.com/lisa-lab/pylearn2/pull/174), although I haven’t seen big benefits from using it so far. I’ve also started experimenting with a stepwise batch size adjustor and nonlinear batch size schedules.

3) CG: Nonlinear conjugate gradient

This is a second-order-style method that performs a line search and takes small steps toward the point the line search finds. NCG is also a batch method, but you can train it with minibatches as long as your minibatches are large enough.

In order to test the algorithms, I used an MLP with 2 hidden layers of 5000 hidden units each, with standardization as the preprocessing. I used 128 examples per minibatch for SGD, set the initial momentum to 0.7 and grew it up to 0.9, and used a norm constraint on the weights together with weight decay. In general I don’t recommend doing this; in a following post I might discuss why you shouldn’t combine a norm constraint on the weights with L1/L2 weight decay. The learning rate for SGD starts at 0.06 and decays exponentially with a decay factor of 1.00004.
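To spell out what this SGD configuration amounts to, here is a minimal NumPy sketch of a single minibatch update with momentum and an exponentially decaying learning rate (my own illustration of the update rule, not pylearn2 code; in the actual runs the momentum coefficient itself is ramped from 0.7 to 0.9 over training):

```python
import numpy as np

def sgd_minibatch_step(params, velocities, grads, t,
                       base_lr=0.06, decay=1.00004, momentum=0.9):
    """One SGD update with momentum and an exponentially decaying learning rate.

    params, velocities, grads: lists of NumPy arrays with matching shapes.
    t: number of minibatch updates performed so far.
    """
    lr = base_lr / (decay ** t)      # assumed form of the exponential decay, applied per update
    for p, v, g in zip(params, velocities, grads):
        v *= momentum                # keep a fraction of the previous velocity
        v -= lr * g                  # move against the minibatch gradient
        p += v                       # apply the velocity to the parameters
    return params, velocities
```

This is roughly what pylearn2's SGD algorithm does when combined with a momentum adjustor and an exponential learning-rate decay extension.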

The results I’ve obtained are shown in the following graphs (red line: SGD, green line: NCG, blue line: BGD):

[Figure: Training error for each algorithm. SGD converged to 0 training error after seeing only about 300k presentations.]

[Figure: Training loss for each optimization algorithm. SGD converges to 0 loss much more quickly than the other approaches.]

[Figure: The test error of SGD oscillates considerably at the beginning of training, but as the training loss approaches zero it settles at around 18% error.]

In a nutshell, SGD has better performance in terms of both generalization error and training loss. It would also be interesting to run a controlled experiment to observe the effect of momentum and a shrinking learning rate on SGD’s performance. Based on these results, I’m going to stick with SGD for the rest of my experiments.

You can access my YAML files here:

https://github.com/caglar/ift6266-project/tree/generate_dataset/contest_dataset/optimization

First Steps

In my initial tests I just used global contrast normalization, with minibatch stochastic gradient descent. The whole contest dataset contains 4198 examples; I used 4000 examples for training and 198 for validation. I used a decaying learning rate and gradually increasing momentum in all my experiments; momentum is a well-known technique for increasing the rate of convergence of SGD. I also put a norm constraint on the weights, which Geoffrey Hinton discusses in his dropout paper (Improving neural networks by preventing co-adaptation of feature detectors).
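The norm constraint simply rescales the incoming weight vector of each hidden unit whenever its L2 norm grows beyond a fixed threshold, which keeps the weights bounded even with a large learning rate. A minimal NumPy sketch of the idea (my own illustration rather than pylearn2's implementation; the threshold value here is arbitrary):

```python
import numpy as np

def apply_max_norm(W, max_norm=2.0):
    """Rescale each column of W (the incoming weight vector of one hidden unit)
    so that its L2 norm does not exceed max_norm. Applied after every update."""
    col_norms = np.sqrt((W ** 2).sum(axis=0))               # one norm per hidden unit
    scale = np.minimum(1.0, max_norm / (col_norms + 1e-7))  # only shrink, never grow
    return W * scale                                        # broadcasts over the rows of W
```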

On top of this setup, I’ve tried the following configurations:

Experiments (YAML name, hyperparameters, best valid error, best test error):

• 2layers_mlp_softmaxPool_4096hu_dropout_ext.yaml: 2 softmax-pooling layers with 4096 hidden units each, with dropout. Best valid error ~0.22.
• 2layers_mlp_RELU_4096hu_dropout_ext.yaml: 2 ReLU hidden layers with 4000 hidden units each, with dropout. Best valid error ~0.17; best test error 0.34656.
• 2layers_mlp_softmaxPool_5000hu_dropout_large_mb.yaml: 2 softmax-pooling hidden layers with large minibatches (400). Best valid error ~0.18.
• 3layers_mlp_RELU_4096hu_dropout_ext.yaml: 3 ReLU hidden layers with dropout. Best valid error 0.2.
• 1layer_mlp_RELU_4096hu_dropout_ext2.yaml: 1-layer MLP with ReLU, 4096 hidden units and dropout. Best valid error ~16.45%.
• 2layers_mlp_RELU_4096hu_dropout_ext_best.yaml: 2-layer MLP with ReLU, 4096 hidden units and dropout. Best valid error 0.2.
• 2layers_mlp_softmaxPool_4096hu_dropout_ext.yaml: 2-layer MLP with softmax pooling and 4096 hidden units. Best valid error 0.72.
• 3layers_mlp_softmaxPool_5100hu_dropout_ext.yaml: 3-layer MLP with softmax pooling and 5100 hidden units. Best valid error 0.72.
• 1layer_mlp_RELU_4096hu_dropout_ext2_larger_hidden.yaml: 1-layer MLP with softmax pooling and 4096 hidden units. Best valid error 0.18.
• 1layer_mlp_softmaxPool_6000hu_dropout_ext.yaml: 1-layer MLP with softmax pooling and 4096 hidden units. Best valid error 0.177.
• 2layers_mlp_RELU_4096hu_dropout_ext2_best.yaml: best valid error 0.72.
• 2layers_mlp_softmaxPool_4096hu_dropout_large_mb.yaml: best valid error 0.72.
• 2layers_mlp_softmaxPool_5000hu_dropout_large_mb.yaml: 2 softmax-pooling layers with 5000 hidden units each and 400 examples per minibatch. Best valid error 15.85%; best test error 0.35115.
• 2layer_mlp_RELU_4096hu_dropout_ext2_larger_hidden.pkl: best valid error 0.1604.

For softmax pooling I used 5-by-5 pools. In my experiments softmax pooling tends to overfit faster than the other setups, but in general it seems to perform better on the validation set.
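For reference, the way I think of softmax pooling: the detector units of a layer are split into small pools, a softmax is computed within each pool, and the pooled output is the softmax-weighted sum of the pool's activations. Here is a rough NumPy sketch of that formulation, using 1-D pools of 5 units for simplicity (this is my reading of the operation, not necessarily pylearn2's exact SoftmaxPool implementation):

```python
import numpy as np

def softmax_pool(h, pool_size=5):
    """One common formulation of softmax pooling: split the detector activations
    into non-overlapping pools and return, for each pool, the softmax-weighted
    sum of its activations."""
    n_units = h.shape[-1]
    assert n_units % pool_size == 0
    pools = h.reshape(-1, n_units // pool_size, pool_size)  # (batch, n_pools, pool_size)
    e = np.exp(pools - pools.max(axis=-1, keepdims=True))   # numerically stable softmax
    w = e / e.sum(axis=-1, keepdims=True)
    return (w * pools).sum(axis=-1)                         # (batch, n_pools)

# Example: pool 25 detector activations into 5 outputs per example.
pooled = softmax_pool(np.random.randn(4, 25), pool_size=5)
```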

You can access all the YAML files here (there are probably some other YAML files that I haven’t put in this table):

https://github.com/caglar/ift6266-project/tree/master/contest_dataset/supervised_mlp

My best model has 15.85 percent error on the validation set, but it performed poorly on the test set. That might be because the model overfit the training data, or because of a wrong choice of preprocessing or optimization algorithm.

Here are the plots showing the validation error, training error, and learning curve during training:

[Figure: Change in validation error during training.]

[Figure: Training error with respect to the number of examples seen by the model.]

[Figure: Change in the learning curve (cost is negative log-likelihood) during training.]


Overcoming the Faces Challenge

pylearn2 is a machine learning library written in Python, with Theano doing the work in the background. It has implementations of several machine learning algorithms, mostly focused on vision tasks and deep learning. In this blog I’m going to explain how I approached the Kaggle challenge that is part of the class, using the pylearn2 framework. I’m about 2 weeks late joining the challenge, but I’m going to try to catch the train. My current plan covers the following steps:

  • First steps: explore how different shallow learning algorithms perform on this dataset, using a very simple preprocessing algorithm called Global Contrast Normalization.
  • Choosing the right optimization algorithm: explore and discuss the effect of using different optimization algorithms on the learning problem.
  • Preprocessing: after deciding on the right optimization algorithm, discuss the preprocessing algorithms appropriate for this challenge and explain the pipeline I’m going to use.
  • Augmenting the data: augment the relatively small dataset of size 4138 with random transformations.
  • Maxout networks: try maxout networks on this challenge, with dropout and a weight norm constraint.
  • Convolutional neural networks: the final stage, in which I’ll explore how to use convolutional neural networks on this dataset.