Wednesday 17 June 2015

Compressive tracking

There are two main approaches to tracking, according to this blog.
The first is generative: the tracker builds a model of the object while tracking, learning the model from the current frame and using it as the template to search the next frame. The disadvantage of this method is that at the beginning of the video there are only a few frames to learn from, so the object must not change its appearance much; otherwise the tracker may drift.
The second is discriminative, which is essentially a binary classifier that separates the object from the background. However, normally only one positive sample and a few negative samples are used for training, so the tracker may drift when the template is noisy or poorly aligned.
Fast compressive tracking works in the following way.
1. Obtain the target representation. The image is convolved with Gaussian kernels at different scales to obtain multi-scale information about the target. To improve the speed, the Gaussian kernels are replaced by a rectangular filter bank chosen as
$$h_{i,j}(x, y) = \frac{1}{ij}\begin{cases}1, & 1 \le x \le i,\ 1 \le y \le j\\ 0, & \text{otherwise}\end{cases}$$
where i and j are the width and height of each rectangular filter (1 $\le$ i $\le$ w, 1 $\le$ j $\le$ h), and w and h are the width and height of the target window.
Convolving a w$\times$h patch with each filter yields an output that is still w$\times$h. To combine these outputs, each filtered image is reshaped into a column vector of length wh. Since there are w$\times$h filters, concatenating all the column vectors gives a feature vector of dimension (wh)$^2$, which is very high, so dimensionality reduction is required.
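To make this step concrete, here is a minimal C++ sketch (my own illustration, not the authors' code) of how one output of the rectangle filter bank can be evaluated in constant time with an integral image; the toy patch, sizes, and function names are invented for the example.

```cpp
// Minimal sketch: one response of the rectangle filter h_{i,j} via an integral image,
// so each box average costs O(1) regardless of the filter size.
#include <vector>
#include <cstdio>

// Integral image: ii[y][x] = sum of img over the rectangle [0, y) x [0, x).
std::vector<std::vector<double>> integralImage(const std::vector<std::vector<double>>& img) {
    int H = (int)img.size(), W = (int)img[0].size();
    std::vector<std::vector<double>> ii(H + 1, std::vector<double>(W + 1, 0.0));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x];
    return ii;
}

// Response of the i x j rectangle filter (value 1/(ij) inside the box) at top-left (x, y):
// the mean intensity of the i x j box, i.e. one entry of the filtered image.
double rectFilter(const std::vector<std::vector<double>>& ii, int x, int y, int i, int j) {
    double s = ii[y + j][x + i] - ii[y][x + i] - ii[y + j][x] + ii[y][x];
    return s / (double)(i * j);
}

int main() {
    // Toy 4x4 "patch"; in the tracker this would be the w x h target window.
    std::vector<std::vector<double>> patch = {
        {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
    auto ii = integralImage(patch);
    // Response of the 2x2 filter at the top-left corner: (1+2+5+6)/4 = 3.5
    std::printf("h_{2,2} at (0,0): %.2f\n", rectFilter(ii, 0, 0, 2, 2));
}
```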
2. Obtain the sparse measurement matrix. According to compressive sensing theory, a small number of randomly generated measurements can preserve the salient information in an image. A signal that can be represented with at most K nonzero coefficients is called a K-sparse signal.
Based on this blog, the Johnson-Lindenstrauss lemma states that the distances between points in a high-dimensional space can be preserved with high probability when the points are projected into a suitable lower-dimensional subspace. In other words, the relationships in the high-dimensional space are preserved in the subspace, so only a small number of measurements (of the order of the sparsity K) are needed to recover a K-sparse signal in the original space. Baraniuk proved that a random matrix satisfying the Johnson-Lindenstrauss lemma also satisfies the restricted isometry property (RIP). Thus, if R satisfies the Johnson-Lindenstrauss lemma and x is compressible or sparse, then x can be recovered from v = Rx with minimal information loss.
In this paper, the authors use a sparse random measurement matrix R to extract the salient information from the K-sparse signal, projecting the high-dimensional signal into a low-dimensional space. A random Gaussian matrix with $r_{ij}\sim N(0,1)$ could be used as R, but it is dense; instead, a very sparse random matrix is adopted, with entries

$$r_{ij} = \sqrt{\rho}\times\begin{cases}1 & \text{with probability } \frac{1}{2\rho}\\ 0 & \text{with probability } 1-\frac{1}{\rho}\\ -1 & \text{with probability } \frac{1}{2\rho}\end{cases}$$

This matrix satisfies the Johnson-Lindenstrauss lemma when $\rho$ is 1 or 3. When $\rho$ is 3, two thirds of the entries are zero and require no computation.
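As an illustration (a sketch under my own conventions, not the paper's implementation), the following samples the entries of such a matrix R row by row, storing only the nonzero ones; with a large $\rho$ almost nothing has to be stored or computed.

```cpp
// Minimal sketch: sampling the sparse measurement matrix R. Each entry is
// sqrt(rho) * {+1, 0, -1} with probabilities {1/(2*rho), 1 - 1/rho, 1/(2*rho)};
// only the nonzero entries of each row are kept.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct SparseEntry { int col; double value; };   // one nonzero element of a row of R

std::vector<std::vector<SparseEntry>> sampleR(int n, int m, double rho, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<std::vector<SparseEntry>> rows(n);
    double scale = std::sqrt(rho);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double p = u(rng);
            if (p < 0.5 / rho)       rows[i].push_back({j,  scale});  // +sqrt(rho), prob 1/(2*rho)
            else if (p < 1.0 / rho)  rows[i].push_back({j, -scale});  // -sqrt(rho), prob 1/(2*rho)
            // otherwise the entry is 0 (prob 1 - 1/rho): nothing stored, nothing computed
        }
    return rows;
}

int main() {
    std::mt19937 rng(42);
    int n = 50, m = 10000;                 // 50 compressed features from a 10000-dim x
    auto R = sampleR(n, m, m / 4.0, rng);  // rho = m/4 => about 4 nonzero entries per row
    std::printf("nonzeros in row 0: %zu\n", R[0].size());
}
```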
3. Compress the features. In the paper $\rho$ is $\frac{m}{4}$, so only c (c < 4) computations are performed for every row, and the complexity is O(cn). Multiplying this n$\times$m matrix with the high-dimensional m$\times$1 vector x results in a low-dimensional n$\times$1 vector v:
$$v_{n\times 1} = R_{n\times m}\, x_{m\times 1}$$

[Figure: graphical representation of compressing the high-dimensional vector x into the low-dimensional vector v.]
In the figure, the black entries of R are positive, the gray entries are negative, and the white entries are zero. R is evidently sparse, which reduces the computational load. The blue arrows indicate that a nonzero entry of R sensing an element of x is equivalent to convolving a rectangular filter with a certain location in the image.
Each element of the output vector v is therefore a weighted sum of the entries of x at which the corresponding row of R is nonzero, which means it combines information from several local rectangular regions, like a generalized Haar-like feature.
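A minimal sketch of the projection itself, assuming the rows of R are stored as (column, value) pairs as above; only the nonzero entries are touched, so each compressed feature costs only a handful of multiply-adds.

```cpp
// Minimal sketch: projecting the high-dimensional feature x to v = R x, touching only
// the nonzero entries of each row of R.
#include <cstdio>
#include <vector>
#include <utility>

using Row = std::vector<std::pair<int, double>>;   // nonzero entries of one row of R

std::vector<double> project(const std::vector<Row>& R, const std::vector<double>& x) {
    std::vector<double> v(R.size(), 0.0);
    for (size_t i = 0; i < R.size(); ++i)
        for (const auto& [j, rij] : R[i])          // ~c terms per row when rho = m/4
            v[i] += rij * x[j];                    // each term is one weighted rectangle sum
    return v;
}

int main() {
    // Tiny example: m = 6, n = 2; each row of R has a couple of nonzero entries.
    std::vector<Row> R = { {{0,  1.0}, {3, -1.0}},
                           {{2,  1.0}, {5,  1.0}} };
    std::vector<double> x = {2, 7, 1, 5, 9, 4};
    auto v = project(R, x);
    std::printf("v = (%.1f, %.1f)\n", v[0], v[1]);  // (2 - 5, 1 + 4) = (-3, 5)
}
```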
4. Bayesian classification.
$$H(v) = \log\!\left(\frac{\prod_{i=1}^{n} p(v_i \mid y=1)\,p(y=1)}{\prod_{i=1}^{n} p(v_i \mid y=0)\,p(y=0)}\right) = \sum_{i=1}^{n} \log\!\left(\frac{p(v_i \mid y=1)}{p(v_i \mid y=0)}\right)$$
Here v represents the feature vector, and p(y=1) and p(y=0) are the prior probabilities of the positive and negative samples; in fact p(y=1) = p(y=0), so the priors cancel. The conditional distributions p($v_{i}$|y=1) and p($v_{i}$|y=0) are assumed to be Gaussian, since random projections of high-dimensional vectors are almost always Gaussian:
$$p(v_i \mid y=1) \sim N(\mu_i^{1}, \sigma_i^{1}), \qquad p(v_i \mid y=0) \sim N(\mu_i^{0}, \sigma_i^{0})$$

The parameters are updated incrementally as tracking proceeds:

$$\mu_i^{1} \leftarrow \lambda \mu_i^{1} + (1-\lambda)\mu^{1}, \qquad \sigma_i^{1} \leftarrow \sqrt{\lambda (\sigma_i^{1})^{2} + (1-\lambda)(\sigma^{1})^{2} + \lambda(1-\lambda)(\mu_i^{1} - \mu^{1})^{2}}$$
where $\lambda$ is the learning parameter. The probability distributions of the positive and negative samples for three different compressed features are shown below:
[Figure: probability distributions of the positive and negative samples for three different compressed features, which are approximately Gaussian.]
The parameters $\mu^{1}$ and $\sigma^{1}$ (and analogously $\mu^{0}$, $\sigma^{0}$) are estimated from the positive and negative samples of the current frame:
$$\mu^{1} = \frac{1}{n}\sum_{k=0\mid y=1}^{n-1} v_i(k), \qquad \sigma^{1} = \sqrt{\frac{1}{n}\sum_{k=0\mid y=1}^{n-1}\bigl(v_i(k) - \mu^{1}\bigr)^{2}}$$
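As a compact illustration of the formulas above, here is a C++ sketch of the classification and update step (the helper structure and names are my own; the real tracker keeps one pair of Gaussians per compressed feature, for the positive and the negative class).

```cpp
// Sketch of the naive Bayes classifier H(v) with Gaussian conditionals and the
// incremental parameter update with learning parameter lambda.
#include <cmath>
#include <cstdio>
#include <vector>

struct Gauss { double mu = 0.0, sigma = 1.0; };   // one conditional p(v_i | y)

// log N(v | mu, sigma)
double logGauss(double v, const Gauss& g) {
    const double PI = 3.14159265358979323846;
    double d = (v - g.mu) / g.sigma;
    return -0.5 * d * d - std::log(g.sigma) - 0.5 * std::log(2.0 * PI);
}

// H(v) = sum_i log( p(v_i|y=1) / p(v_i|y=0) ), using p(y=1) = p(y=0)
double classify(const std::vector<double>& v,
                const std::vector<Gauss>& pos, const std::vector<Gauss>& neg) {
    double h = 0.0;
    for (size_t i = 0; i < v.size(); ++i)
        h += logGauss(v[i], pos[i]) - logGauss(v[i], neg[i]);
    return h;
}

// mu_i    <- lambda*mu_i + (1-lambda)*mu
// sigma_i <- sqrt( lambda*sigma_i^2 + (1-lambda)*sigma^2 + lambda*(1-lambda)*(mu_i - mu)^2 )
// where (mu, sigma) are estimated from this frame's samples of one compressed feature.
void update(Gauss& g, const std::vector<double>& samples, double lambda) {
    double mu = 0.0, var = 0.0;
    for (double s : samples) mu += s;
    mu /= samples.size();
    for (double s : samples) var += (s - mu) * (s - mu);
    double sigma = std::sqrt(var / samples.size());
    double newSigma = std::sqrt(lambda * g.sigma * g.sigma + (1.0 - lambda) * sigma * sigma
                                + lambda * (1.0 - lambda) * (g.mu - mu) * (g.mu - mu));
    g.mu = lambda * g.mu + (1.0 - lambda) * mu;
    g.sigma = newSigma;
}

int main() {
    // One compressed feature for brevity; the tracker uses n of them.
    std::vector<Gauss> pos = {{2.0, 1.0}}, neg = {{-2.0, 1.0}};
    std::printf("H(v = 1.5) = %.3f (positive if > 0)\n", classify({1.5}, pos, neg));
    update(pos[0], {2.2, 1.8, 2.5}, 0.85);   // lambda close to 1 forgets old frames slowly
    std::printf("updated mu = %.3f, sigma = %.3f\n", pos[0].mu, pos[0].sigma);
}
```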
Steps of the algorithm
1.1 If the frame is the first one, manually label the object with a rectangular region. Randomly generate rectangular windows used to extract the Haar-like features.
1.2 Using the current object location as the center, extract 45 positive samples within a radius of 4 pixels. Choose 50 negative samples in the annular region between two circles around the same center, one with a radius of 8 pixels and the other with a radius of 30 pixels (the sampling geometry is sketched after these steps). Compute the integral image of the input frame, and then extract the Haar-like features of the positive and negative samples. Update the classifier.
2. If the frame is not the first one, use the object location in the previous frame as the center and a radius of 25 pixels, and collect about 1100 candidate regions by brute force. Compute their Haar-like features from the integral image using the rectangular templates, and use the classifier to select the most probable candidate as the target.
Then repeat step 1.2 to collect new samples and update the classifier.
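For illustration, here is a small sketch of the sampling geometry used in steps 1.2 and 2 (my own helper, not the linked reference code). Note that the tracker randomly picks 45 positive and 50 negative samples from these regions rather than using every position.

```cpp
// Minimal sketch: collecting sample locations around a center. Positive samples come
// from a small disc, negative samples from an annulus, and detection candidates from
// a larger disc, as in the steps above.
#include <cstdio>
#include <vector>

struct Point { int x, y; };

// All pixel offsets (dx, dy) on a grid with the given step whose distance from the
// center lies in [rMin, rMax).
std::vector<Point> sampleDisc(Point center, double rMin, double rMax, int step = 1) {
    std::vector<Point> out;
    int r = static_cast<int>(rMax);
    for (int dy = -r; dy <= r; dy += step)
        for (int dx = -r; dx <= r; dx += step) {
            double d2 = double(dx) * dx + double(dy) * dy;
            if (d2 >= rMin * rMin && d2 < rMax * rMax)
                out.push_back({center.x + dx, center.y + dy});
        }
    return out;
}

int main() {
    Point c = {100, 100};                 // current target location
    auto pos  = sampleDisc(c, 0, 4);      // positive samples: radius < 4
    auto neg  = sampleDisc(c, 8, 30);     // negative samples: annulus between 8 and 30
    auto cand = sampleDisc(c, 0, 25);     // detection candidates: radius < 25
    std::printf("positive: %zu, negative: %zu, candidates: %zu\n",
                pos.size(), neg.size(), cand.size());
}
```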
The C++ code of the single-scale version is here and here.
Coarse to fine
A coarse-to-fine search is used to reduce the computational complexity.
1. Coarse grained search
center: previous object location, radius: a large radius $\gamma_{c}$ = 25, step: a large number of pixels $\Delta_{c}$  = 4.
2. Fine grained search
center: coarse-grained detected location, radius: small radius $\gamma_{f}$ = 10, step: a small number of pixels $\Delta_{f}$  = 1.
For an exhaustive search with $\gamma$ = 25 and $\Delta$ = 1, there would be 1,962 windows, whereas the coarse-to-fine search evaluates only 436 windows.
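A minimal sketch of the two-stage search, assuming a toy scoring function score(x, y) that stands in for the classifier response H(v) of the window at (x, y):

```cpp
// Minimal sketch: coarse search on a sparse grid, then fine search around the coarse result.
#include <cmath>
#include <cstdio>
#include <functional>

struct Point { int x, y; };

// Scan a disc of the given radius around `center` with the given step and return the
// position with the highest score.
Point bestInDisc(Point center, double radius, int step,
                 const std::function<double(int, int)>& score) {
    Point best = center;
    double bestScore = score(center.x, center.y);
    int r = static_cast<int>(radius);
    for (int dy = -r; dy <= r; dy += step)
        for (int dx = -r; dx <= r; dx += step) {
            if (dx * dx + dy * dy > radius * radius) continue;
            double s = score(center.x + dx, center.y + dy);
            if (s > bestScore) { bestScore = s; best = {center.x + dx, center.y + dy}; }
        }
    return best;
}

int main() {
    // Toy score peaked at (117, 94); in the tracker this would be H(v) of the window.
    auto score = [](int x, int y) { return -std::hypot(x - 117.0, y - 94.0); };
    Point prev   = {100, 100};                            // location in the previous frame
    Point coarse = bestInDisc(prev, 25, 4, score);        // gamma_c = 25, step 4
    Point fine   = bestInDisc(coarse, 10, 1, score);      // gamma_f = 10, step 1
    std::printf("coarse: (%d, %d), fine: (%d, %d)\n", coarse.x, coarse.y, fine.x, fine.y);
}
```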



