Sunday 21 June 2015

Correlation filter

There are several tracking algorithms, such as TLD, that are quite interesting. This time I am going to learn more about correlation filters.

Let's start with the Discrete Fourier Transform (DFT), following this blog. In Matlab, we can use F = fft2(f) to compute the DFT and then F2 = log(abs(F)) to visualize the result. To obtain a finer sampling of the transform, zero-pad the input with F = fft2(f,256,256). The DFT can be used for template matching, for example searching for the letter 'a' in an image full of text. The code is
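(The original listing did not survive on this page; below is a minimal reconstruction of the standard Matlab text.png example that the post appears to follow. The crop indices and the threshold of 60 come from that example and should be treated as assumptions.)

    bw = imread('text.png');                % binary text image shipped with Matlab
    a = bw(32:45, 88:98);                   % crop one letter 'a' as the template
    % correlation via FFT: convolving with the 180-degree rotated template
    % is equivalent to correlating with the template itself
    C = real(ifft2(fft2(double(bw)) .* fft2(double(rot90(a, 2)), 256, 256)));
    figure, imshow(C, []);                  % bright peaks mark candidate matches
    thresh = 60;                            % threshold picked by inspection
    figure, imshow(C > thresh);             % surviving peaks locate the letter 'a'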

Now we can study the correlation filter in more detail. According to this blog, the first paper to introduce this concept into tracking is Visual Object Tracking using Adaptive Correlation Filters (CVPR 2010), which proposes Minimum Output Sum of Squared Error (MOSSE). The correlation filter (CF) is very fast: MOSSE runs at 669 frames per second (fps), advancing trackers from real time to high speed. Meanwhile, its accuracy is also high, about 73%.

Correlation here means cross-correlation. Suppose we have two signals f and g; their cross-correlation is

$\begin{array}{l} (f\star g)(\tau )\overset{\text{def}}{=}\int_{-\infty }^{\infty }{{{f}^{*}}(t)g(t+\tau )\,dt}\qquad (1)\\ (f\star g)(n)\overset{\text{def}}{=}\sum\limits_{m=-\infty }^{\infty }{{{f}^{*}}[m]g(m+n)} \end{array}$

Intuitively, correlation measures the similarity of two functions at lag $\tau$. Consider a simple case where f and g have the same shape, but g is delayed. Their correlation reaches its maximum only when the shift $\tau$ in $g(t+\tau)$ brings them into alignment, since nothing is more similar to f than f itself.

In terms of tracking, a filter needs to be designed such that when it is applied to the target, the response is maximized.


Therefore, we can write out the equation as 
$g=h\star f \qquad (2)$
where g is the response, f is the input image, and h is the filter. g could take any form; we assume it is Gaussian in the figure above. Hence we only need to compute h. But why is CF so fast? The answer lies in the use of the FFT. Correlation in the spatial domain is equivalent to (conjugate) multiplication in the frequency domain, and vice versa, therefore
$F(h\star f)={{(Fh)}^{*}}\odot Ff \qquad (3)$
where F stands for the Fourier transform (computed with the FFT) and $\odot$ is the element-wise product. Suppose f has n pixels; then the FFT of f takes $O(n\log n)$, and evaluating Eq (3) has the same overall complexity, which is much faster than the $O({{n}^{2}})$ of direct spatial correlation. This is the essence of the paper.
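As a quick sanity check of Eq (3), here is a small sketch of mine (not from the paper): correlate an image with a circularly shifted copy of itself entirely in the frequency domain; the peak of the response recovers the shift, which is exactly the alignment intuition described above.

    f = rand(64);                                  % random "image"
    g = circshift(f, [5, 9]);                      % g is f delayed by (5, 9)
    resp = real(ifft2(conj(fft2(f)) .* fft2(g)));  % Eq (3), back to spatial domain
    [~, idx] = max(resp(:));
    [r, c] = ind2sub(size(resp), idx);             % r-1 == 5, c-1 == 9: shift recovered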
 
To compute h, let $F(f)=F$, ${{(F(h))}^{*}}={{H}^{*}}$, and $F(g)=G$; then Eq (3) gives
${{H}^{*}}=\frac{G}{F}\qquad (4)$
In reality, we have to consider m templates to accommodate appearance change. The objective function then becomes
$\underset{{{H}^{*}}}{\mathop{\min }}\,\sum\limits_{i=1}^{m}{|{{H}^{*}}{{F}_{i}}-{{G}_{i}}{{|}^{2}}} \qquad (5)$
Eq (5) is not difficult, since by the convolution theorem the computation is element-wise in the frequency domain. Hence we can solve for each element of ${{H}^{*}}$ independently, denoted $H_{w,v}^{*}$. Eq (5) becomes
$\underset{H_{w,v}^{*}}{\mathop{\min }}\,\sum\limits_{i=1}^{m}{|H_{w,v}^{*}{{F}_{w,v,i}}-{{G}_{w,v,i}}{{|}^{2}}}\qquad (6)$
We differentiate Eq (6) and set the derivative to zero. Note that differentiation with respect to a complex variable differs from the real-valued case:


$\begin{array}{l} \frac{\partial }{\partial H_{w,v}^{*}}\sum\limits_{i=1}^{m}{(H_{w,v}^{*}{{F}_{w,v,i}}-{{G}_{w,v,i}}){{(H_{w,v}^{*}{{F}_{w,v,i}}-{{G}_{w,v,i}})}^{*}}}=0\\ \Rightarrow \frac{\partial }{\partial H_{w,v}^{*}}\sum\limits_{i=1}^{m}{H_{w,v}^{*}{{F}_{w,v,i}}{{H}_{w,v}}F_{w,v,i}^{*}-H_{w,v}^{*}{{F}_{w,v,i}}G_{w,v,i}^{*}-{{H}_{w,v}}F_{w,v,i}^{*}{{G}_{w,v,i}}+{{G}_{w,v,i}}G_{w,v,i}^{*}}=0\\ \Rightarrow \sum\limits_{i=1}^{m}{{{F}_{w,v,i}}{{H}_{w,v}}F_{w,v,i}^{*}-{{F}_{w,v,i}}G_{w,v,i}^{*}}=0\\ \Rightarrow {{H}_{w,v}}=\frac{\sum\limits_{i=1}^{m}{{{F}_{w,v,i}}G_{w,v,i}^{*}}}{\sum\limits_{i=1}^{m}{{{F}_{w,v,i}}F_{w,v,i}^{*}}}\qquad (7) \end{array}$
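As a tiny numeric check of Eq (7) (my own sketch, not from the paper): for a single frequency bin (w,v), the closed form attains a lower objective value in Eq (6) than random perturbations of it.

    m = 10;                                      % number of templates
    F = randn(m,1) + 1i*randn(m,1);              % one frequency bin across m templates
    G = randn(m,1) + 1i*randn(m,1);
    H = sum(F .* conj(G)) / sum(F .* conj(F));   % the closed form of Eq (7)
    obj = @(h) sum(abs(conj(h) .* F - G).^2);    % the objective of Eq (6)
    obj(H) <= obj(H + 0.01*(randn + 1i*randn))   % prints 1: H is the minimizer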

Applying Eq (7) to every element of H, we get
$H=\frac{\sum\limits_{i=1}^{m}{{{F}_{i}}\odot G_{i}^{*}}}{\sum\limits_{i=1}^{m}{{{F}_{i}}\odot F_{i}^{*}}}\qquad (8)$

To track the target, we only need to filter the image with H; the target location is then at the maximum of the response. The filter can be updated online as
${{H}_{t}}=(1-\eta ){{H}_{t-1}}+\eta {{\hat{H}}_{t}}\qquad (9)$
where ${{H}_{t}}$ is the filter used at frame t, ${{\hat{H}}_{t}}$ is the filter computed from frame t alone by Eq (8), and $\eta$ is the learning rate. We can regularize Eq (8) by adding a small $\epsilon$ to the denominator, which also prevents division by zero. Alternatively, we can compute each ${{H}_{i}}$ separately and then take the average.
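Putting Eq (8) and Eq (9) together, here is a minimal self-contained Matlab sketch of MOSSE-style training, detection, and update. It is my own illustration: random patches stand in for real cropped frames, and the numerator and denominator of the filter are updated separately (as in the MOSSE paper), which is the same spirit as Eq (9).

    [x, y] = meshgrid(1:64, 1:64);
    g = exp(-((x - 32).^2 + (y - 32).^2) / (2 * 2^2));  % desired Gaussian response
    G = fft2(g);
    A = zeros(64);  B = zeros(64);
    for i = 1:8                           % accumulate Eq (8) over m = 8 templates
        F = fft2(rand(64));               % placeholder training patch
        A = A + G .* conj(F);             % numerator
        B = B + F .* conj(F);             % denominator
    end
    Hconj = A ./ (B + 1e-5);              % H*, with the epsilon regularizer
    Fnew = fft2(rand(64));                % a new frame (placeholder)
    resp = real(ifft2(Hconj .* Fnew));    % response map; its peak locates the target
    eta = 0.125;                          % learning rate, as in Eq (9)
    A = (1 - eta) * A + eta * (G .* conj(Fnew));
    B = (1 - eta) * B + eta * (Fnew .* conj(Fnew));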

The second blog by the same author addresses improvements on MOSSE, in particular Zhang Kaihua's ECCV 2014 paper on the STC tracker: Fast Visual Tracking via Dense Spatio-Temporal Context Learning. Zhang also proposed Compressive Tracking.


The figure above depicts the workflow of the algorithm. It is similar to MOSSE in its use of the FFT. Nevertheless, it has several original ideas as well:
1. Application of Dense Spatio-Temporal Context
2. Incorporation of probabilistic theory
3. Consideration of scale change in template updating

Dense Spatio-Temporal Context:

It may be difficult to track the target alone due to appearance change or occlusion. If we also consider the background (the spatial context), the risk of losing the target may be reduced. Take the figure above for example: if we consider the target (the yellow region) alone, the tracker may get lost when occlusion happens. In contrast, if we also consider the context (the red region), it helps us locate the target. If we further include the information contained in several frames, we get the temporal context.

Incorporation of probabilistic theory:
Suppose $\mathbf{x}\in {{\mathbb{R}}^{2}}$ is a location and o denotes the target. We define the probability that the target appears at x, i.e. the confidence map, as
$m(\mathbf{x})=P(\mathbf{x}|o)\qquad (1)$
We also define the set of context features as
${{X}^{c}}=\{\operatorname{c}(\mathbf{z})=(I(\mathbf{z}),\mathbf{z})|\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})\}$
where ${{\mathbf{x}}^{\bigstar }}$ stands for the target location and ${{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})$ represents a region about twice the size of the target, centered at ${{\mathbf{x}}^{\bigstar }}$; $I(\mathbf{z})$ is the pixel intensity at $\mathbf{z}$. In short, the set takes ${{\mathbf{x}}^{\bigstar }}$ as its center and selects a region about twice the area of the target as the feature, like the red region in the figure above. Eq (1) can then be expanded as
$m(\mathbf{x})=P(\mathbf{x}|o)=\sum\limits_{c(\mathbf{z})\in {{X}^{c}}}{P(\mathbf{x}|c(\mathbf{z}),o)P(c(\mathbf{z})|o)}\qquad (2)$
where $P(\mathbf{x}|c(\mathbf{z}),o)$ models the probability of the target being at x given the target and its context, while $P(c(\mathbf{z})|o)$ represents the probability of a context feature belonging to the target, in other words the prior probability of the target's context. The purpose of $P(c(\mathbf{z})|o)$ is to select context similar to the one surrounding the target, whereas $P(\mathbf{x}|c(\mathbf{z}),o)$ judges whether it is reasonable for the target to appear at x even when the context looks similar, which helps avoid drift.

At the first frame, the target location is known. Therefore, we can build a confidence map such that the closer a pixel is to the target, the higher its probability. The confidence map is defined as
$m(\mathbf{x})=b{{e}^{-{{\left| \frac{\mathbf{x}-{{\mathbf{x}}^{\bigstar }}}{\alpha } \right|}^{\beta }}}}\qquad (3)$
where b, $\alpha$, $\beta$ are parameters set empirically. m(x) plays the role of the response, which in MOSSE is assumed to be Gaussian. The "dense" in the title refers to this confidence map, in the sense that every pixel near the target gets a probability from Eq (3). Traditional trackers may sample randomly or with a fixed step size, but here the probability is defined densely.

For $P(c(\mathbf{z})|o)$, the paper sets
$P(c(\mathbf{z})|o)=I(\mathbf{z}){{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})\qquad (4)$
where ${{\omega }_{\sigma }}(\cdot )$ is a Gaussian weighting function, so the prior is the Gaussian-weighted intensity of the pixels near the target. Once we have $P(c(\mathbf{z})|o)$ and m(x), we can obtain $P(\mathbf{x}|c(\mathbf{z}),o)$ from Eq (2). The computation is similar to MOSSE: first express m(x) as the convolution of $P(\mathbf{x}|c(\mathbf{z}),o)$ and $P(c(\mathbf{z})|o)$, then transform to the frequency domain where the convolution becomes an element-wise product, solve for the filter, and transform back to the spatial domain. During tracking, the target is at the maximum response. Let $P(\mathbf{x}|c(\mathbf{z}),o)={{h}^{sc}}(\mathbf{x}-\mathbf{z})$.

Note that ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$ measures the distance and direction between the target location and the context, and it is not symmetric. According to the definition of convolution,
$m(\mathbf{x})=\sum\limits_{\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})}{{{h}^{sc}}(\mathbf{x}-\mathbf{z})I(\mathbf{z}){{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})}={{h}^{sc}}(\mathbf{x})\otimes \left( I(\mathbf{x}){{\omega }_{\sigma }}(\mathbf{x}-{{\mathbf{x}}^{\bigstar }}) \right)\qquad (5)$
Transforming to the frequency domain and solving for the filter, Eq (5) becomes
${{h}^{sc}}(\mathbf{x})={{F}^{-1}}\left( \frac{F\left( b{{e}^{-{{\left| \frac{\mathbf{x}-{{\mathbf{x}}^{\bigstar }}}{\alpha } \right|}^{\beta }}}} \right)}{F\left( I(\mathbf{x}){{\omega }_{\sigma }}(\mathbf{x}-{{\mathbf{x}}^{\bigstar }}) \right)} \right)\qquad (6)$
Unlike MOSSE, STC considers only the first frame when training the template ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$. During tracking, however, ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$ is updated in a similar way to MOSSE.
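Here is a minimal Matlab sketch of Eqs (3)-(6) (my own; alpha = 2.25 and beta = 1 follow the paper's reported defaults, while the random context patch and the small constant added to the denominator are assumptions):

    sz = 64;  ctr = sz / 2;
    [x, y] = meshgrid((1:sz) - ctr, (1:sz) - ctr);
    d = sqrt(x.^2 + y.^2);                % distance to the target center
    alpha = 2.25;  beta = 1;              % confidence map parameters
    m = exp(-(d / alpha).^beta);          % confidence map, Eq (3), with b = 1
    sigma = sz / 4;                       % scale of the Gaussian weight
    w = exp(-d.^2 / sigma^2);             % omega_sigma
    I = rand(sz);                         % placeholder context patch
    prior = I .* w;                       % context prior P(c(z)|o), Eq (4)
    hsc = real(ifft2(fft2(m) ./ (fft2(prior) + 1e-5)));  % Eq (6), regularized
    % tracking: confidence map for a new patch; its peak is the new target location
    Inew = rand(sz);
    conf = real(ifft2(fft2(hsc) .* fft2(Inew .* w)));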

The paper also explains how the scale is handled when the template is updated. Recall Eq (5):

$m(\mathbf{x})=\sum\nolimits_{\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})}{{{h}^{sc}}(\mathbf{x}-\mathbf{z})I(\mathbf{z}){{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})}$

where ${{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})$ is the Gaussian weight. Intuitively, it is like drawing a circle around the target: the weights are high inside the circle and low outside. If the size of the target increases, we enlarge the circle accordingly by adjusting $\sigma$. The process is illustrated as follows.

Suppose from frame t to frame t+1 the target size is multiplied by s; equivalently, the axis unit is scaled by s. For convenience, let (u,v)=(sx,sy), and assume the target is at (0,0) in frame t. Writing the confidence at the target location as an integral,
${{c}_{t}}(0,0)=\iint{h_{t}^{sc}(-x,-y)\,{{I}_{t}}(x,y)\,{{\omega }_{{{\sigma }_{t}}}}(x,y)\,dx\,dy}\qquad (8)$
since 

${{\omega }_{\sigma }}(x,y)=\alpha {{e}^{-\frac{{{x}^{2}}+{{y}^{2}}}{{{\sigma }^{2}}}}},{{\omega }_{\sigma }}(x/s,y/s)=\alpha {{e}^{-\frac{{{x}^{2}}+{{y}^{2}}}{{{(s\sigma )}^{2}}}}}$
so
${{\omega }_{\sigma }}(x/s,y/s)={{\omega }_{s\sigma }}(x,y)$
Then, substituting (u,v)=(sx,sy) into Eq (8) (so that $dx\,dy=du\,dv/{{s}^{2}}$), Eq (8) becomes
${{c}_{t}}(0,0)=\frac{1}{{{s}^{2}}}\iint{h_{t}^{sc}(-u/s,-v/s)\,{{I}_{t}}(u/s,v/s)\,{{\omega }_{s{{\sigma }_{t}}}}(u,v)\,du\,dv}\qquad (9)$
From frame t to frame t+1, substituting the coordinates,
$h_{t}^{sc}(u/s,v/s)\approx h_{t+1}^{sc}(u,v)$
${{I}_{t}}(u/s,v/s)\approx {{I}_{t+1}}(u,v)$
Eq (9) becomes
${{c}_{t}}(0,0)\approx \frac{1}{{{s}^{2}}}\iint{h_{t+1}^{sc}(-u,-v)\,{{I}_{t+1}}(u,v)\,{{\omega }_{s{{\sigma }_{t}}}}(u,v)\,du\,dv}\qquad (10)$
Suppose the target shrinks from frame t to frame t+1. The integration in Eq (10) then splits into two parts: one over the context region of frame t+1 (the red rectangle), the other over the context region of frame t (the blue rectangle) minus the red part. In math terms,
$\iint_{{{\Omega }_{t}}}{}=\iint_{{{\Omega }_{t+1}}}{h_{t+1}^{sc}(-u,-v)\,{{I}_{t+1}}(u,v)\,{{\omega }_{s{{\sigma }_{t}}}}(u,v)\,du\,dv}+\iint_{{{\Omega }_{t}}\backslash {{\Omega }_{t+1}}}{h_{t+1}^{sc}(-u,-v)\,{{I}_{t+1}}(u,v)\,{{\omega }_{s{{\sigma }_{t}}}}(u,v)\,du\,dv}\qquad (11)$
Because $\omega$ is Gaussian, the right part of Eq (11) is small and can effectively be treated as 0. Meanwhile $s{{\sigma }_{t}}={{\sigma }_{t+1}}$, hence the left part is approximately ${{c}_{t+1}}(0,0)$, and therefore
${{c}_{t}}(0,0)\approx \frac{1}{{{s}^{2}}}{{c}_{t+1}}(0,0)\quad \Rightarrow \quad s\approx \sqrt{\frac{{{c}_{t+1}}(0,0)}{{{c}_{t}}(0,0)}}\qquad (12)$
Some other tricks include averaging s over a sliding window of frames. The most interesting parts are the probabilistic formulation and the handling of window-size change. As for the context, raw intensity can be replaced by other features to increase robustness.
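A small sketch of that scale update (my own reading of the scheme; the peak values, window size, and smoothing rate are placeholders):

    lambda = 0.25;  n = 5;                    % smoothing rate and window size
    conf_peak = [1.00 1.05 1.08 1.12 1.15];   % placeholder c_t(x*) peaks per frame
    s_est = sqrt(conf_peak(2:end) ./ conf_peak(1:end-1));  % Eq (12), per frame pair
    s_bar = mean(s_est(max(1, end-n+1):end)); % sliding-window average of s
    s = 1.0;  sigma = 16;
    s = (1 - lambda) * s + lambda * s_bar;    % damped update of the scale
    sigma = s * sigma;                        % enlarge/shrink the weight window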