Saturday 11 July 2015

CNN for dummies

This article relies on sources here, here, and here.

A CNN can take images directly as input, avoiding hand-crafted feature extraction and description. Suppose we have a 1000x1000 image and 1 million hidden neurons; if we connect each neuron to every pixel, we get 1000x1000x1000000 = $10^{12}$ connections. However, the structure of an image is local, so each neuron only needs to connect to a local region; at a higher level, the information collected by these neurons is combined to describe the whole image. If each local region is 10x10, each of the 1 million neurons connects to only 10x10 pixels, and we are down to $10^8$ parameters.
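The arithmetic above can be checked in a couple of lines, using the same figures as the text:

```python
# Connection counts for a 1000x1000 image with 1 million hidden neurons.
image_pixels = 1000 * 1000
hidden_neurons = 1000 * 1000

# Fully connected: every neuron sees every pixel.
fully_connected = image_pixels * hidden_neurons   # 10^12 connections

# Locally connected: every neuron sees only a 10x10 patch.
locally_connected = hidden_neurons * 10 * 10      # 10^8 parameters

print(fully_connected)    # 1000000000000
print(locally_connected)  # 100000000
```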

If the neurons connecting to these 10x10 regions all use the same weights — in other words, if the same convolution kernel is slid over the image for every neuron — then we have only 100 parameters, regardless of how many neurons there are. This weight sharing is the most significant contribution of CNNs.
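A minimal sketch of weight sharing: a single 10x10 kernel (just 100 parameters) is slid over the image, so every output neuron reuses the same weights. A 100x100 image is used here instead of 1000x1000 to keep the naive loop fast; the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((100, 100))
kernel = rng.standard_normal((10, 10))  # the only 100 parameters

def conv2d_valid(img, k):
    """Naive 'valid' 2D convolution (really cross-correlation), stride 1."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Every output position reuses the same 100 weights.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (91, 91)
print(kernel.size)        # 100 parameters, no matter how many outputs
```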

However, with the weights fixed we extract only one kind of feature, since one convolution kernel corresponds to one feature, such as an edge of a certain orientation. What if we want multiple features? We increase the number of kernels. Suppose we have 100 filters extracting different features, such as edges of different orientations; each convolution produces a feature map, so 100 kernels yield 100 feature maps, and these 100 feature maps constitute one layer of neurons. The number of parameters is merely 100 kernels x 100 parameters per kernel = 10k.
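The same bookkeeping with a kernel stack: 100 feature maps, yet only 10,000 parameters in total.

```python
import numpy as np

# One 10x10 kernel per feature map, stacked along the first axis.
kernels = np.zeros((100, 10, 10))
print(kernels.shape[0])  # 100 feature maps
print(kernels.size)      # 100 * 10 * 10 = 10000 parameters
```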

How do we determine the number of neurons? If we have a 1000x1000 image and a 10x10 filter with no overlap, i.e. a stride of 10, then the number of neurons is $\frac{1000\times 1000}{10\times 10} = 10\text{k}$. This is the count for one feature map.
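The same count via the standard output-size formula, which also covers strides other than the filter width:

```python
# Non-overlapping 10x10 filters on a 1000x1000 image, i.e. stride 10.
image_size, filter_size, stride = 1000, 10, 10
per_side = (image_size - filter_size) // stride + 1  # output width/height
neurons_per_map = per_side ** 2
print(per_side)         # 100
print(neurons_per_map)  # 10000
```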

One example of a CNN is LeNet, which has 7 layers. The input is a 32*32 image. The C1 layer is a convolution layer, which strengthens features and suppresses noise. It has 6 feature maps, in which each neuron connects to a 5*5 area of the input, so each feature map is 28*28. C1 has (5*5+1)*6 = 156 trainable parameters: 5*5 weights plus one bias per filter. The number of connections is 156*28*28 = 122,304.
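The C1 bookkeeping, spelled out:

```python
# C1 of LeNet: six 5x5 filters, each with one bias, on a 32x32 input.
num_filters, k = 6, 5
params = num_filters * (k * k + 1)          # (5*5+1)*6 = 156
out_side = 32 - k + 1                       # 28x28 feature maps
connections = params * out_side * out_side  # 156 * 28 * 28
print(params)       # 156
print(connections)  # 122304
```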

The S2 layer is a downsampling layer, which exploits local image correlation to reduce the amount of information while retaining what is useful. There are 6 feature maps of size 14*14, where each unit connects to a 2*2 region in C1. For each unit we add the 4 inputs, multiply by a trainable coefficient, add a bias, and apply a sigmoid. If the coefficient is small, the unit operates in the near-linear regime of the sigmoid, and downsampling behaves like blurring the image.
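A sketch of one S2-style map: sum each 2x2 region, multiply by a trainable coefficient, add a bias, and apply a sigmoid. The coefficient and bias values here are illustrative, not trained.

```python
import numpy as np

def subsample(fmap, coeff, bias):
    """Sum 2x2 blocks, scale, shift, then squash with a sigmoid."""
    h, w = fmap.shape
    pooled = fmap.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * pooled + bias)))

c1_map = np.ones((28, 28))                      # a stand-in C1 feature map
s2_map = subsample(c1_map, coeff=0.1, bias=0.0)
print(s2_map.shape)  # (14, 14)
```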

The C3 layer is also a convolution layer, where each feature map combines several of the feature maps in S2. The S4 layer is another downsampling layer, C5 is another convolution layer, and the F6 layer computes the dot product of its input and weights and adds a bias. Finally, the output layer consists of Euclidean radial basis function units.
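Putting the layers together, here is a rough shape walk-through of the whole network, using the classic sizes of LeNet-5 (channels, height, width):

```python
# Layer-by-layer output shapes of LeNet-5, following the description above.
shapes = [
    ("input",                  (1, 32, 32)),
    ("C1: conv 5x5, 6 maps",   (6, 28, 28)),
    ("S2: subsample 2x2",      (6, 14, 14)),
    ("C3: conv 5x5, 16 maps",  (16, 10, 10)),
    ("S4: subsample 2x2",      (16, 5, 5)),
    ("C5: conv 5x5, 120 maps", (120, 1, 1)),
    ("F6: fully connected",    (84,)),
    ("output: RBF units",      (10,)),
]
for name, shape in shapes:
    print(f"{name:24s} {shape}")
```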

Essentially, a CNN is a mapping from input to output, and it can learn a multitude of such input-output relationships without requiring a precise mathematical expression for them.
