CSC 370

Convolutional Neural Networks

 

The goal of this lab exercise is to promote a deeper understanding of convolutional neural networks (CNNs). CNNs are a powerful learning tool, and training them effectively requires large amounts of both computing power and training data, but there is nothing fundamentally mysterious about what they do. In fact, they can be programmed in Matlab using mostly basic matrix operations like multiplication and addition.

Deep Learning Toolbox

To assist in our exploration we will be using the Deep Learning Toolbox by Rasmus Berg Palm. Although this package has been superseded by others as a programming tool and is no longer maintained by the author, it is perfect for our purposes as a learning tool. When you have finished the lab exercises here, you may also wish to read a blog post that gives another person's perspective on this toolbox and what it can teach us.

Download the toolbox code from the first link above and unzip the files. You can activate all the toolbox functions by setting the working directory to the root of the folder and recursively adding all the subfolders to the path.

>> cd rasmusbergpalm-DeepLearnToolbox-5df2801
>> addpath(genpath(pwd))

The toolbox comes with demonstration code and data. The data consists of a set of segmented handwritten numerals originally taken from postal addresses, known as MNIST. This data set has become something of a benchmark for character recognition, and includes some examples that are difficult for humans to discern. Load the data and prepare it for use.

load mnist_uint8;
train_x = double(reshape(train_x',28,28,60000))/255;
test_x = double(reshape(test_x',28,28,10000))/255;
train_y = double(train_y');
test_y = double(test_y');

Each item is a 28x28 grayscale image, white on black. The images come from ten categories, namely the digits 0 through 9. 60,000 examples are available for training in the variable train_x, and an additional 10,000 examples for testing are in test_x. The category labels are stored as binary indicator variables in train_y and test_y: each column corresponds to one example and contains ten numbers, all zeros except for a one in the row of the correct category. You can browse through this data to get a feel for it. (Hit Ctrl-C in the command window to escape from the loop.)

for i = 1:100, 
  imshow(train_x(:,:,i)'); 
  title(sprintf('%d',train_y(:,i))); 
  waitforbuttonpress; 
end
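
The title in the loop above shows the raw ten-digit indicator column. To read off the numeric label instead, find the row that holds the one; digit 0 sits in row 1, and so on. A minimal sketch:

[~, idx] = max(train_y(:,1));   % row containing the 1
disp(idx - 1);                  % the digit shown in train_x(:,:,1)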

The next step is to define a neural network and train it up. The demo suggests using a "6c-2s-12c-2s" architecture. That is to say, a convolutional layer with a depth of 6, followed by downsampling by a factor of two, then a second convolutional layer with a depth of 12, and another downsampling by two. You can see the implementation of these layers in the file cnnff.m.

% rand('state',0)  % uncomment this for a consistent result
cnn.layers = {
  struct('type', 'i') %input layer
  struct('type', 'c', 'outputmaps', 6, 'kernelsize', 5) %convolution layer
  struct('type', 's', 'scale', 2) %sub sampling layer
  struct('type', 'c', 'outputmaps', 12, 'kernelsize', 5) %convolution layer
  struct('type', 's', 'scale', 2) %subsampling layer
};

This CNN is trained by pure backpropagation: the desired output signal for each training case is used to compute an error signal, which in turn adjusts the weights of the network so that the error decreases. The updates are accumulated in batches from small sets of randomly selected training examples; an epoch corresponds to the number of batches required to see every training sample once (with 60,000 training images and a batch size of 50, that is 1,200 batches per epoch). Training for just one epoch should take under two minutes, and gives an error rate of around 12% on the test set.

opts.alpha = 1;
opts.batchsize = 50;
opts.numepochs = 1;

cnn = cnnsetup(cnn, train_x, train_y);
cnn = cnntrain(cnn, train_x, train_y, opts);

[er, bad] = cnntest(cnn, test_x, test_y);
figure; plot(cnn.rL);

Exploration

While doing the exercises below, you might set up a second Matlab instance and run training for a larger number of epochs -- even 50 or 100 if you have the time. (If time is limited, you may load this file containing a CNN trained over 50 epochs.)

Now it is time to do some exploration, so as to better understand what this CNN is doing. First, let's run the evaluation by hand and take a look at some of the fields generated.

cnn = cnnff(cnn, test_x);
disp(cnn);

Details on each layer are stored in the layers field. Let's take a look at the second layer, which performs the first set of convolutions.

disp(cnn.layers{2});

According to our configuration above, each slice in this layer uses a 5x5 kernel. These are stored in the field k. Recall that these are initialized to random values at first, and then trained via backpropagation. Let's look at the numbers for the first slice.

disp(cnn.layers{2}.k{1}{1});

We can compute the first slice ourselves using this kernel, and the constant bias value (essentially a threshold) stored in the field b. The result is passed through a sigmoid function and stored in the field a. Here we look at the computation for just the very first test sample, and compare it with the stored result.

ours = convn(test_x(:,:,1),cnn.layers{2}.k{1}{1},'valid')+cnn.layers{2}.b{1};
subplot(1,2,1);
imvis(sigm(ours))
subplot(1,2,2);
imvis(cnn.layers{2}.a{1}(:,:,1))

Six slices are computed in this layer, each starting from different random initial weights and therefore each behaving differently. We can look at all six at once, again for just a single test image at a time.

imshow(test_x(:,:,1)');
figure;
for i = 1:6, subplot(1,6,i);
  imshow(cnn.layers{2}.a{i}(:,:,1)'); 
end;

These slices may differ a bit in how they handle edges, and some may be negatives of others, but they probably look more or less like modified versions of the input image. The next layer doesn't change things much, since it just replaces each 2x2 block with a single value: the average of the four pixels in the block.

for i = 1:6, subplot(1,6,i);
 imshow(cnn.layers{3}.a{i}(:,:,1)'); 
end;
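
As a sanity check on that description, we can average the 2x2 blocks of one layer-2 map ourselves and compare against the stored layer-3 activation; the difference should be essentially zero. (This is just a hand verification of the averaging, not new functionality.)

a2 = cnn.layers{2}.a{1}(:,:,1);   % 24x24 map for the first test image
ours = zeros(12,12);
for r = 1:12,
  for c = 1:12,
    ours(r,c) = mean(mean(a2(2*r-1:2*r, 2*c-1:2*c)));   % average of the 2x2 block
  end;
end;
disp(max(max(abs(ours - cnn.layers{3}.a{1}(:,:,1)))));  % should be ~0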

The next layer is the second convolutional layer, and this is where it starts to get interesting. There are twelve slices in this layer, and each one takes input from all six subsampled slices in the previous layer. Thus layers{4}.k contains 72 5x5 kernels, arranged in 6 cells of 12. The results at this layer start to look much more diverse than they did before.

for i = 1:12, subplot(2,6,i);
  imshow(cnn.layers{4}.a{i}(:,:,1)'); 
end;
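
We can repeat the earlier hand computation at this layer too. Each layer-4 slice is the sigmoid of the sum of six valid convolutions (one per layer-3 map, using the corresponding kernel in k{i}{j}) plus the bias for that slice. The sketch below rebuilds the first slice for the first test image and shows it next to the stored result.

j = 1;               % which of the twelve slices to rebuild
z = zeros(8,8);      % 12x12 inputs with a 5x5 valid convolution give 8x8
for i = 1:6,
  z = z + convn(cnn.layers{3}.a{i}(:,:,1), cnn.layers{4}.k{i}{j}, 'valid');
end;
ours = sigm(z + cnn.layers{4}.b{j});
subplot(1,2,1); imshow(ours');                        % our version
subplot(1,2,2); imshow(cnn.layers{4}.a{j}(:,:,1)');   % the stored result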

These results are subsampled again in layer 5, to give a total of 12 4x4 outputs. That's 192 numbers, which are stored in the field fv for convenience. They are fed through a single fully-connected layer to produce the ten binary indicator variables required as output. The weights and bias numbers for this step are stored in ffW and ffb respectively, and we can use them to compute the output ourselves. The index with the largest value is the predicted class.

out = sigm(cnn.ffW*cnn.fv(:,1)+cnn.ffb);
disp(out');
[~,pred] = max(out);
disp(pred-1)  % Character 0 is in position 1, etc.

You may be surprised that the output of layer 5, which looks less character-like than the earlier layers, can suffice to identify all the classes. In fact, this is not such a difficult task in most cases. Our eye can also pick up patterns in the numbers, if we display them in the right way. Separating out all the feature vectors for each class and placing them side by side, we can discern differences between the groups. You should be able to see ten clear groupings, each somewhat different from the others. The neural net architecture picks up on these differences to produce its predictions. We could just as easily use some other classification technique for this last step, such as an SVM, but the advantage of sticking with the neural network is that we can use backpropagation for training the entire way through.

bar(cnn.fv(:,1));  % this doesn't look like much
figure;
vis = []; 
[~, class_y] = max(test_y);   % recover class indices (1..10) from the indicator rows
for i = 1:10, 
  vis = [vis cnn.fv(:,find(class_y(1:1000)==i))]; 
end; 
imagesc(vis);  % this shows the contrast between classes
axis image;
colormap jet

You can compute the error rate, along with the indices of the misclassified images, using the function cnntest. The training loss recorded during the learning process is stored in the field rL, and can be shown in a plot.

[er, bad] = cnntest(cnn, test_x, test_y);
figure; plot(cnn.rL);

To Do

There are a number of different things you could explore at this point. Most obvious is the effect of more training. How many epochs does it take to reduce the error below 10%? 5%? 2%? What is the lowest you can get? You can also look at the effect of randomness in the initial conditions. Compare the patterns at layer 4 for several networks trained for one epoch from random seeds. Do they all look similar? Or are they each finding different ways to encode the patterns? If you have time, it would be interesting to do the same thing for different networks trained for 50 epochs.
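
One way to set up the seed comparison is sketched below: reuse the same layer definition, seed the generator differently before each cnnsetup call (as in the commented rand('state',0) line earlier), train each network for one epoch, and view the two sets of layer-4 responses side by side. (cnnsetup re-initializes the kernels and biases, so copying cnn.layers just reuses the architecture.)

cnnA.layers = cnn.layers;  cnnB.layers = cnn.layers;
rand('state', 1);
cnnA = cnnsetup(cnnA, train_x, train_y);
cnnA = cnntrain(cnnA, train_x, train_y, opts);
rand('state', 2);
cnnB = cnnsetup(cnnB, train_x, train_y);
cnnB = cnntrain(cnnB, train_x, train_y, opts);
cnnA = cnnff(cnnA, test_x);  cnnB = cnnff(cnnB, test_x);
for i = 1:12,
  subplot(2,12,i);    imshow(cnnA.layers{4}.a{i}(:,:,1)');
  subplot(2,12,12+i); imshow(cnnB.layers{4}.a{i}(:,:,1)');
end;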

More ambitiously, you can also try adjusting other parameters of the algorithm. What if we change the depth of the various layers? Compare the results that you get with a 4c-2s-8c-2s architecture, or perhaps 8c-2s-8c-2s or something else of your own devising. (Keep in mind that training time will go up fast if you greatly increase the depth or the number of layers.) You can adjust the learning rate (alpha) or the batch size. Do these only affect the training time required, or the final achievable result? You could also consider changing the kernel size, although this will have implications for the length of the final feature vector.
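
For reference, the only changes needed for one of these variants are the outputmaps values (plus the corresponding setup, training, and test calls); a 4c-2s-8c-2s sketch might look like this.

cnn2.layers = {
  struct('type', 'i') %input layer
  struct('type', 'c', 'outputmaps', 4, 'kernelsize', 5) %convolution layer
  struct('type', 's', 'scale', 2) %subsampling layer
  struct('type', 'c', 'outputmaps', 8, 'kernelsize', 5) %convolution layer
  struct('type', 's', 'scale', 2) %subsampling layer
};
cnn2 = cnnsetup(cnn2, train_x, train_y);
cnn2 = cnntrain(cnn2, train_x, train_y, opts);
[er2, bad2] = cnntest(cnn2, test_x, test_y);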

Another question worth looking at is how well other learning algorithms can do at classifying the patterns stored in the fv field. Try fitting an SVM and see what kind of error rate you get.
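
A quick way to run that experiment, assuming the Statistics and Machine Learning Toolbox is available for fitcecoc (a multiclass SVM), is to split the test-set feature vectors in half, train on one half, and score the other. (A fuller comparison would build fv for the training images instead.)

cnn = cnnff(cnn, test_x);            % make sure cnn.fv holds the test features
[~, labels] = max(test_y);           % class indices 1..10
labels = labels(:);
X = cnn.fv';                         % one 192-dimensional row per image
svm = fitcecoc(X(1:5000,:), labels(1:5000));
pred = predict(svm, X(5001:end,:));
err = mean(pred ~= labels(5001:end))   % error rate on the held-out half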

Further Explorations

Recent research on CNNs has suggested that a nonlinear operation like max-pooling is better than a simple subsampling layer. You could try replacing the subsampling layer with a max-pool layer (where each output pixel is the max of a 2x2 block, rather than the average), and see whether better results can be achieved. You will have to modify both cnnff and cnnbp to do this properly. In the backpropagation of error, the max-pooling layer should send the entire gradient signal to the pixel that gave the maximum value, and zero gradient to the other three pixels.
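
As a starting point, here is a minimal sketch of the forward computation for a single stack of feature maps; inside cnnff it would take the place of the 2x2 averaging done for layers of type 's'. The idx values are what a modified cnnbp would need in order to send the gradient to the winning pixel only.

a = cnn.layers{2}.a{1};              % a 24x24xN stack of feature maps
[h, w, n] = size(a);
pooled = zeros(h/2, w/2, n);
for r = 1:h/2,
  for c = 1:w/2,
    block = reshape(a(2*r-1:2*r, 2*c-1:2*c, :), 4, n);
    [m, idx] = max(block);                 % max of each 2x2 block, per image
    pooled(r,c,:) = reshape(m, 1, 1, n);
  end;
end;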

The experiments so far have all used the pre-segmented character samples from MNIST, which are all 28x28 images. You could take this architecture and turn it into a detector that can search across an image of any size for 28x28 blocks that look like a digit. Since input images can be scaled up and down, such a detector could be used to scan any text image for numeric digits of any size.

In order for your detector to work well you will need to provide it with additional training samples showing other (not-digit) content that one might expect the system to encounter, so that it learns to give no response except when a digit is present. For efficiency, you will also probably want to write a new version of cnnff that convolves the entire image with each kernel just once, rather than computing the entire CNN pipeline for every possible 28x28 subwindow.
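
Before writing that optimized version, a naive baseline is easy to sketch: slide a 28x28 window over the image, stack the windows, and push them through cnnff in one batch. Here img and step are placeholder names, and the image is assumed to be white-on-black and scaled to [0,1] like the MNIST data.

step = 4;                                  % stride between windows
[h, w] = size(img);
rows = 1:step:h-27;  cols = 1:step:w-27;
windows = zeros(28, 28, numel(rows)*numel(cols));
locs = zeros(numel(rows)*numel(cols), 2);
n = 0;
for r = rows,
  for c = cols,
    n = n + 1;
    windows(:,:,n) = img(r:r+27, c:c+27);
    locs(n,:) = [r c];
  end;
end;
net = cnnff(cnn, windows);                 % one forward pass over all windows
[score, cls] = max(net.o);                 % cls-1 is the proposed digit for each window

A real detector would also apply a threshold to score so that windows containing no digit are rejected, which is where the additional non-digit training samples come in.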

Another extension would be to take the digit-recognition architecture and apply it to some entirely different problem, for example the recognition of different coin types. It may be hard to gather enough training data for this task, but a small set of examples might be augmented by rotating the training images into a number of different positions.
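
A small augmentation sketch along those lines, assuming the Image Processing Toolbox for imrotate and a hypothetical 28x28xN stack of coin images called coins:

angles = -20:10:20;                        % illustrative rotation angles
[h, w, n] = size(coins);
aug = zeros(h, w, n*numel(angles));
for k = 1:n,
  for j = 1:numel(angles),
    aug(:,:,(k-1)*numel(angles)+j) = imrotate(coins(:,:,k), angles(j), 'bilinear', 'crop');
  end;
end;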

The Deep Learning Toolbox has demos for several other related techniques: deep belief networks, autoencoders, and ordinary neural networks. You can explore these topics as well if they are of interest to you. The README file has links to YouTube videos of several talks by Geoff Hinton and Andrew Ng that describe them in more detail.