Monday 29 June 2015

Git don'ts

There are several things that Git does not like.

For one thing, you cannot replace an existing folder wholesale with an updated copy from elsewhere. Git complains with "Changes not staged for commit: ... (modified content)". Git does not operate by deletion and replacement; the changes have to be incremental.

For another thing, make sure there is no nested git repo inside a subfolder of the current repo.

If you want to go back to an earlier version of the code, use git checkout <commit id> instead of git revert.

Sunday 28 June 2015

Bayes theorem

Bayes theorem is fundamental to image analysis. Prof Yuille gave a talk in which he mentioned that the portrait on his slide may not actually be Bayes; in fact, Bayes theorem may not even have been developed by Bayes in the first place...

Here is an interesting blog about this theorem. In high school, we often had to solve this kind of question: there are M black balls and N white balls in a bag; what is the probability that a ball you draw is black? Bayes theorem was originally proposed to solve the reverse problem: if we do not know how many black and white balls are in the bag, but draw some at random and observe their colors, can we infer the proportion of each color in the bag?

Another example. Suppose in a school 60% of the students are boys and 40% are girls. Boys always wear pants, while half of the girls wear pants and the other half wear dresses. It is easy to calculate the probability that a student wears pants or a dress. Conversely, if you encounter someone wearing pants on campus, can you tell whether you are seeing a boy or a girl?

Let's rephrase the question. Suppose you wander around the campus and see N students wearing pants; how many of them are boys and how many are girls? It is easy: we calculate how many students in the school wear pants and how many of those are girls, and we are done. If there are U students, the number of boys wearing pants is U * P(Boy) * P(Pants|Boy) = U * 60% * 100%. Likewise, for girls the figure is U * P(Girl) * P(Pants|Girl) = U * 40% * 50%. The proportion of girls among all pants-wearing students, times N, is the number of girls you have encountered.

Apparently, the total number of students is irrelevant. The simplified answer is $P(Girl|Pants)=\frac{P(Girl)*P(Pants|Girl)}{P(Boy)*P(Pants|Boy)+P(Girl)*P(Pants|Girl)}$.

The equation above is nothing but P(Pants,Girl) over P(Pants): the share of girls, P(Pants,Girl), among all students wearing pants, P(Pants).
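
As a sanity check, the numbers above can be plugged in directly; this is just the arithmetic of the example, not any library routine:

```python
# Girls-in-pants example: P(Boy)=0.6, P(Girl)=0.4,
# P(Pants|Boy)=1.0, P(Pants|Girl)=0.5.
p_boy, p_girl = 0.6, 0.4
p_pants_given_boy, p_pants_given_girl = 1.0, 0.5

# P(Pants) by total probability: the denominator of Bayes theorem
p_pants = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl

# P(Girl|Pants) = P(Girl) * P(Pants|Girl) / P(Pants)
p_girl_given_pants = p_girl * p_pants_given_girl / p_pants
print(p_girl_given_pants)  # ~0.25: a quarter of pants-wearers are girls
```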

Another example: spelling correction. When a user enters a word that is not in the dictionary, we have to guess what they really meant, i.e. maximize P(the word we guess they meant | the word they actually entered) to find the correct word. Of course there can be more than one plausible answer: if the user enters thew, did they mean the or thaw? If our guesses are h1, h2, ... (h stands for hypothesis), they all belong to a finite hypothesis space (since the choice of words is limited). Denote the word the user actually enters by D (data); then P(h1|D), P(h2|D), ... can be written as P(h|D). By Bayes theorem, P(h|D) = $\frac{P(h)*P(D|h)}{P(D)}$.

For h1, h2, h3, ..., P(D) is the same, so we can ignore this term and only compare P(h)*P(D|h). The moral of the story: for given data, how good a guess is depends both on the probability of the guess itself (the prior) and on the probability of the guess producing the data (the likelihood). In the example above, the probability that the user wanted to enter the depends on how common the word the is in usage, and on the probability of typing thew when intending the.

The rest is simple: compute P(h)*P(D|h) for every possible hypothesis and take the maximum, which is the most probable word.
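
The argmax over hypotheses can be sketched in a few lines; the prior and likelihood numbers below are invented for illustration, not taken from any real corpus or typo model:

```python
# Score each hypothesis h for the typed word "thew" by P(h) * P(D|h).
prior = {"the": 0.05, "thaw": 0.0001}       # P(h): how common the word is
likelihood = {"the": 0.01, "thaw": 0.001}   # P("thew" | h): typo model

def posterior_score(h):
    # P(D) is the same for every h, so compare P(h) * P(D|h) only
    return prior[h] * likelihood[h]

best = max(["the", "thaw"], key=posterior_score)
print(best)  # the: a common word with a plausible typo beats a rare word
```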

Is there any alternative to Bayes? One common approach is to choose the word closest to thew, but the and thaw are both one letter away from it, so distance alone is insufficient to determine the correct word.

We also notice (maybe with help from Holmes) that e and w are adjacent on the keyboard, so it is easy for a finger to hit the wrong key. In contrast, e and a are far apart, so it is unlikely to type e for a. Hence the probability of thaw becoming thew is not very high. This is essentially the process of estimating P(D|h).

What does Bayes calculate? P(h)*P(D|h): there is an extra term P(h), the prior. Why do we need it? Suppose the user enters tlp; is it top or tip? Suppose also that top is more common than tip. This time, since both o and i are close to l, we cannot decide based on P(D|h) alone; because top is more common, it is more probable that the user meant top.

The problem with maximum likelihood is that even though a guess may fit the data well, it may still not be the right answer, simply because the guess itself is improbable a priori.

For instance, for -1, 3, 7, 11, is an arithmetic progression more plausible, or $\frac{-x^3}{11}+\frac{9}{11}x^2+\frac{23}{11}$? Of course the former. Another example: when fitting curves to points, we could use either a high-order polynomial or a straight line. For points that lie almost on a line, the residual error of the high-order polynomial is certainly lower than that of the line, but do we really want such a complex equation? The answer is that we tend to choose the hypothesis whose P(h) is much higher. Admittedly, a line cannot fit a complex pattern; that's why we need the product
P(h)*P(D|h).

The moral of the story is that measurements contain all kinds of errors, so if we pursue a perfect fit too aggressively we run into overfitting, where even the noise is included in the model, which is not wanted at all. Hence a high P(D|h) does not necessarily mean that h is good; it also depends on P(h).

Occam's Razor refers to the idea that if two theories are equally likely to explain a phenomenon, then we prefer the simpler one, which is also the more common one. 

In reality, data often carries Gaussian noise, so your measurement is probably not exactly what the model predicts. In this case we should not gild the lily and change our model, because the drift of the data away from the model cannot be explained by the limited factors inside the model. For example, the relationship between height and weight is roughly a second-order polynomial, but everyone knows height is not the only factor that affects weight.

In other words, the theory with higher P(h) is preferable; maximum likelihood, by contrast, favors the hypothesis with higher P(D|h), because it fits the data better. Bayes theorem arbitrates between the two. Suppose you toss a coin once and see heads, and you must estimate the probability that this coin gives heads. Maximum likelihood tends to assign 1, because that maximizes P(D|h); this is absurd, because the coin does have a tail side.

Our common sense is that most coins are unbiased and only a small portion are biased. We model this prior as a distribution $p(\theta)$, where $\theta$ is the probability that the coin gives heads and p is its probability density. Instead of maximizing P(D|$\theta$) alone, we maximize P(D|$\theta$)*P($\theta$); obviously $\theta = 1$ no longer makes sense, because P($\theta = 1$) = 0 and the product vanishes. In fact, we can find the extremum by differentiation.
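
A numeric sketch of this maximization. The Beta(5,5)-shaped prior peaked at 0.5 and the three all-heads tosses are my own choices for illustration; the text only requires a prior with p(1) = 0:

```python
# Grid search for the argmax of P(D|theta) and of P(D|theta)*p(theta).
def likelihood(theta, heads, tails):
    return theta**heads * (1 - theta)**tails

def prior(theta):                      # proportional to a Beta(5,5) density
    return theta**4 * (1 - theta)**4

grid = [i / 1000 for i in range(1001)]
heads, tails = 3, 0                    # three tosses, all heads

ml_est  = max(grid, key=lambda t: likelihood(t, heads, tails))
map_est = max(grid, key=lambda t: likelihood(t, heads, tails) * prior(t))

print(ml_est)   # 1.0 -- maximum likelihood says the coin never lands tails
print(map_est)  # about 0.636 -- the prior pulls the estimate toward 0.5
```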

Previously we discussed the case where we know the prior. If we do not know beforehand which hypotheses are common, we have to assume they are all equally likely; in that case MAP reduces to maximum likelihood.

There is an interesting debate between frequentists and Bayesians. The former say: let the data speak for itself (we don't give a hack about the prior). The latter say: data can carry all kinds of bias, and a prior can make the model more robust against noise. It turns out the Bayesians win, and the key to their success is that the prior is itself a result of statistics, grounded in empirical evidence.

An ambiguous sentence goes like this: The girl saw the boy with a telescope. Did the girl see the boy through a telescope, or did she see a boy holding a telescope?

The priors of the two interpretations are nearly the same, yet we tend to pick the first one. Why? Because we do not believe in coincidence. We ask ourselves: why would the boy happen to be holding a telescope, a tool for seeing, rather than a book or a football? The probability is too low! So the only probable explanation is that there is a necessity behind the "coincidence". (Sounds like what Holmes always tells Watson.)

The reasonable explanation: if we interpret the sentence as the girl seeing the boy through a telescope, the data fits perfectly; since the girl is observing the boy, her using a telescope is no longer a coincidence.

The above focuses on P(D|h): an explanation that relies on coincidence to account for the data is more contrived, so we prefer the reasonable, and thus simpler, explanation. This is the Bayesian Occam's razor: it works on P(D|h), whereas the traditional Occam's razor works on P(h).

Going back to the line-fitting example: what is the chance that a high-order polynomial generates data lying approximately on a line? Not high. Conversely, the chance that a line equation generates data lying almost on a line is much larger.

Bayesian inference has two steps: first, build models from the data; second, use the models to predict the probability of unknown events. Previously we only cared about the maximum, but the other models do not have zero probability. For instance, given the data, model 1 has probability 0.5, model 2 has 0.4, and model 3 has 0.1. If we only want to know which model is best, we choose model 1, end of story.

More often than not, though, we want to predict things, and every model makes its own prediction. Letting one model dictate the answer just because it has a higher probability is not democratic. Optimal Bayesian inference takes the weighted sum over the models. Though it sounds ideal, it is seldom used: for one thing, model construction is very time consuming; for another, the model space may be continuous, and there is no way to build infinitely many models.
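
Using the 0.5/0.4/0.1 model probabilities above, the weighted prediction looks like this (the per-model predictions themselves are invented for illustration):

```python
# Optimal Bayesian prediction as a weighted sum over models.
model_prob = [0.5, 0.4, 0.1]
prediction = [0.9, 0.2, 0.5]   # hypothetical P(event | model_i)

# Picking only the best model ignores models 2 and 3 entirely:
best_only = prediction[model_prob.index(max(model_prob))]

# The Bayesian answer weights every model by its probability:
weighted = sum(p * q for p, q in zip(model_prob, prediction))

print(best_only)  # 0.9
print(weighted)   # ~0.58 = 0.45 + 0.08 + 0.05
```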

Let us look at more examples of Bayes theorem; it is the core of machine learning. Consider segmenting the Chinese sentence 南京市长江大桥, which means the Yangtze River Bridge of Nanjing. There are two possible ways to segment it:

1. 南京市/长江大桥, which keeps the original meaning.
2. 南京/市长/江大桥, which reads as "the mayor of Nanjing, Jiang Daqiao", which is not the intended meaning.

Let's analyse it using Bayes. Let X be the sentence and Y a segmentation into words; we seek the Y that maximizes P(Y|X), which is equivalent to maximizing P(Y)*P(X|Y). In plain words: the probability of the segmentation, times the probability of the sentence being generated by that segmentation. More specifically, we can treat P(X|Y) as 1, since either segmentation generates the sentence exactly; hence we only need to maximize P(Y).

How do we calculate the probability of a word sequence w1, w2, w3, w4? By the chain rule, P(w1, w2, w3, w4) = P(w1)*P(w2|w1)*P(w3|w2,w1)*P(w4|w3,w2,w1). As the number of words grows, $P(w_{n}|w_{n-1}...w_1)$ becomes impossible to estimate. To solve this, computer scientists, as usual, make a naive assumption: the probability of a word depends only on the previous k words (k < 3). Though the assumption seems too young, too simple, the results are good and robust.

With the assumption, we rewrite P(w1, w2, w3, w4) = P(w1)*P(w2|w1)*P(w3|w2)*P(w4|w3). Since P(江大桥|市长) = 0 (no corpus has "Jiang Daqiao" following "mayor"), segmentation 2 loses, and 南京市/长江大桥 wins.
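
The bigram scoring can be sketched as follows. The probabilities are invented; the only essential value is P(江大桥|市长) = 0, since that word pair never occurs:

```python
# Toy bigram scoring of the two segmentations of 南京市长江大桥.
bigram = {
    ("<s>", "南京市"): 0.001, ("南京市", "长江大桥"): 0.0005,
    ("<s>", "南京"):   0.002, ("南京", "市长"):       0.001,
    ("市长", "江大桥"): 0.0,   # this pair never occurs in a corpus
}

def score(words):
    # P(w1)*P(w2|w1)*... with a sentence-start token <s>
    p = 1.0
    for w1, w2 in zip(["<s>"] + words, words):
        p *= bigram.get((w1, w2), 0.0)
    return p

seg1 = ["南京市", "长江大桥"]
seg2 = ["南京", "市长", "江大桥"]
print(score(seg1) > score(seg2))  # True: segmentation 1 wins
```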

In fact, the aforementioned examples are very shallow: at this level, machine learning is only concerned with superficial phenomena. Because it is shallow, the world appears more complicated; in machine-learning language, we need more features. As a result, the dimensionality increases dramatically, leading to the curse of dimensionality, and the data becomes sparse and insufficient.

Humans, on the other hand, perceive things more deeply. To avoid data sparsity, we invent all kinds of devices, such as the microscope, to delve deeper and find the more essential underlying relationships. For example, a machine needs large samples to learn that bikes have two wheels, but we know it without learning, because anything else would be unreasonable.

Machine translation can be posed as the following problem: given a sentence e and its candidate translations f, find the most reliable one. In other words, we maximize P(f|e), which is proportional to P(f)*P(e|f), favoring translations that are both common and likely to generate e.

Computing P(e|f) is not trivial: what is the probability that a given translation generates a sentence? For example, suppose e is John loves Mary, and the first French candidate f we examine is Jean aime Marie. Looking at the correspondences, one possibility is John (Jean), loves (aime), Mary (Marie). Once a correspondence is established, P(e|f) is easy to calculate as P(e|f) = P(John|Jean)*P(loves|aime)*P(Mary|Marie). We then exhaust all possible correspondences and sum these values to obtain P(e|f).
How do humans translate? Do we exhaust all possibilities? Certainly not; we adopt a bottom-up/top-down approach.

Clustering is an unsupervised kind of machine learning. The problem states: given some data points, group them into clusters. One solution assumes the points reside around k centers, each with a Gaussian distribution; we must estimate where the centers are and the Gaussian parameters.

The difference here is that the answer is continuous and thus infinite. Worse, we need the parameters to determine which points belong to which center, and we need to know which points belong to which center to calculate the parameters: a typical chicken-and-egg problem.

To break the loop, "I don't care," we say: we randomly guess some centers and adjust them iteratively. This is called Expectation Maximization (EM). In the expectation step, we use the guessed centers and parameters to determine which center each point belongs to; in the maximization step, we re-estimate the centers and parameters from the points attributed to them. Bayes is applied when computing the parameters from the data.
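
The loop can be sketched as follows; for brevity this uses hard assignments and plain means (a k-means-style simplification of full EM, with made-up 1-D data):

```python
# E-step assigns points to the nearer center; M-step recomputes the
# centers as means of the points attributed to them.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = [0.0, 1.0]                     # a deliberately bad random guess

for _ in range(10):
    # E-step: attribute each point to its nearest center
    clusters = [[], []]
    for x in data:
        k = min((0, 1), key=lambda i: abs(x - centers[i]))
        clusters[k].append(x)
    # M-step: re-estimate each center (keep it if its cluster is empty)
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print(centers)  # roughly [1.0, 9.0]
```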

Least squares is often used for linear regression. Given N points on a plane, suppose we want a line to fit them (regression is a specific case of fitting, i.e. fitting with error). How do we define the best line? Let the points be $(x_i, y_i)$ and the line y = f(x). The error between a point and its estimate is $\delta y_i = |y_i - f(x_i)|$. Least squares minimizes $(\delta y_1)^2 + ... + (\delta y_n)^2$.

Why do we use squares instead of absolute values? Use Bayes. Suppose f(xi) is the best estimate at xi, and the observed yi deviates from f(xi) because of noise, with larger deviations less probable. We can model this with a normal distribution centered at f(xi), so the probability of observing (xi, yi) is proportional to $\exp[-(\delta y_i)^2]$. We want the MAP estimate, P(h|D), which is proportional to P(h)*P(D|h), where h is the line and D the data points: we seek the line that maximizes P(h)*P(D|h).

Obviously, P(h) is the same for all lines, so we only need to consider P(D|h) = P(d1|h) * P(d2|h)..., where we assume the points are independent and multiply their probabilities. Recall that P(di|h) = $\exp[-(\delta y_i)^2]$. Thus, P(D|h) = $\exp[-(\delta y_1)^2] \cdot \exp[-(\delta y_2)^2] \cdots = \exp\{-[(\delta y_1)^2 + (\delta y_2)^2 + \cdots]\}$. Maximizing this means minimizing $(\delta y_1)^2 + (\delta y_2)^2 + \cdots$. Familiar?
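
A quick numeric check that the two criteria agree, using a one-parameter line y = a*x and toy points of my own choosing:

```python
# Maximizing prod exp(-(dy_i)^2) and minimizing sum (dy_i)^2
# pick the same slope on the same grid.
from math import exp

points = [(1, 1.1), (2, 1.9), (3, 3.2)]

def sq_error(a):
    return sum((y - a * x) ** 2 for x, y in points)

def gauss_prob(a):          # P(D|h), up to a constant factor
    p = 1.0
    for x, y in points:
        p *= exp(-(y - a * x) ** 2)
    return p

grid = [i / 1000 for i in range(500, 1500)]   # slopes 0.500 .. 1.499
a_ls  = min(grid, key=sq_error)
a_map = max(grid, key=gauss_prob)
print(a_ls == a_map)  # True: the two criteria agree
```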

Another application of Bayes is filtering out spam emails: given an email, determine whether it is spam. Denote the email by D, which comprises N words. We use h+ for spam and h- for a normal email. The problem can be expressed as:

$P(h+|D)=\frac{P(h+)*P(D|h+)}{P(D)}$
$P(h-|D)=\frac{P(h-)*P(D|h-)}{P(D)}$

The priors P(h+) and P(h-) are easy to obtain: just compute the percentages of spam and normal emails in the inbox. However, P(D|h+) is hard, since D contains N words d1, d2, ..., so P(D|h+) = P(d1,d2,...|h+), and we run into data sparsity once more. Why? Because P(d1,d2,...|h+) is the probability that the current email is exactly identical to some spam we have seen, which is hopeless: every email is different, so no matter how many spam emails you collect, none will be identical to the current one. The data is sparse, not dense.

Alternatively, we expand P(d1,d2,...|h+) = P(d1|h+) * P(d2|d1,h+) * ... For simplicity, we assume each di is independent of the preceding words; then P(d1,d2,...|h+) = P(d1|h+) * P(d2|h+) * ... This is the naive independence assumption. P(d1|h+), P(d2|h+), ... are easy to get: just count the frequency of d1, d2, ... in the spam corpus.
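
A toy scorer along these lines (all word probabilities are invented; a small floor value stands in for proper smoothing):

```python
# Naive-Bayes spam scoring: compare P(h+)*prod P(d_i|h+)
# against P(h-)*prod P(d_i|h-).
p_spam, p_ham = 0.3, 0.7                      # priors from the inbox

p_word_spam = {"free": 0.2, "money": 0.15, "meeting": 0.01}
p_word_ham  = {"free": 0.02, "money": 0.02, "meeting": 0.1}

def classify(words):
    s, h = p_spam, p_ham
    for w in words:
        s *= p_word_spam.get(w, 1e-4)   # floor for unseen words
        h *= p_word_ham.get(w, 1e-4)
    return "spam" if s > h else "ham"

print(classify(["free", "money"]))   # spam
print(classify(["meeting"]))         # ham
```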

As a side note, why do we encounter the data sparsity issue? Because we work at a superficial level, where the combinations of words are effectively infinite, so statistics at this level is plagued by sparsity. Note that even though word combinations are infinite, the variations of syntax are limited. How to model syntax is an AI problem: the higher the level, the fewer the possibilities. Many word sequences map to one sentence, and many sentences map to one syntax, indicating a hierarchical system.

Why is the simple independence assumption so powerful? Why can we bravely say that a word depends only on the previous 3 or 4 words, when in reality it may depend on the previous sentence? One theory asserts that the positive and negative effects of the assumption cancel out.

Hierarchical Bayes goes one level deeper than the basic form: it models the cause of the cause. For instance, suppose you have N coins manufactured by the same factory; you toss each one once and infer its probability of heads $\theta$. $\theta$ itself has a prior p($\theta$), which may follow a beta distribution. That is, the outcome of a toss xi follows a Bernoulli distribution with parameter $\theta$, and $\theta$ follows a beta distribution governed by a hyperparameter $\psi$. The hierarchical layers of causes emerge.

The Hidden Markov Model is a simple hierarchical Bayes model. How do we recover what the sender said from the signal received? Observing the acoustic signal o1, o2, o3, we must infer the sentence that was sent, s1, s2, s3, i.e. find the maximum of P(s1, s2, s3|o1, o2, o3). With the HMM's parameters $\lambda$ as a prior, this is P(S|O, $\lambda$).

Using Bayes theorem, P(S|O, $\lambda$) $\propto$ P(O|S)*P(S|$\lambda$), where P(O|S) represents the probability that sentence S is heard as O, i.e. the probability of S producing the observation, and P(S|$\lambda$) is the probability that the words s1, s2, s3 form a meaningful sentence, which depends on the model $\lambda$.

What is the difference between Bayesians and frequentists? Let's use our favourite coin example (Bayesians really have lots of coins). Normally p(head) = p(tail) = 0.5; however, due to wear and tear, the coin may become biased. We want to know p(head) = t and p(tail) = 1-t. Frequentists think t is some fixed value in [0, 1]. If we toss 100 times and heads appears 40 times, then t = 0.4; toss another 100 times and heads appears 42 times, then t = 0.42. Here comes the question: how can t be both 0.4 and 0.42 at the same time? In fact, t is neither 0.4 nor 0.42, but some value not far from them. All we can say is that, for example, a 95% confidence interval is [0.4-0.1, 0.4+0.1]; t is either in that interval or not. Frequentists tell us that if we repeat the experiment many times, about 95% of the intervals so constructed will contain the real t.

However, sometimes the experiment is not repeatable, or sampling the data is costly. For instance, if we are only allowed to toss 3 times, there is a fair chance that all of them come up tails, which gives t = 0. In this case there is nothing frequentists can do.

How do Bayesians solve the problem? The essential difference is that Bayesians do not treat t as a fixed value; they care about its distribution, i.e. the belief in t. Before we toss the coin, we think t lies uniformly in [0, 1], so p(t) = 1. When new data arrives as evidence, for instance three tosses all coming up tails, our belief in t changes.

Suppose each toss is independent, and here n = 3. Letting 1 represent heads and 0 tails, we have p(x1=0,x2=0,x3=0|t) = p(x1=0|t)*p(x2=0|t)*p(x3=0|t) = $(1-t)^3$ = p(D|t). The denominator is the integral over all possible values of t: m(D) = $\int_0^1 (1-t)^3 dt = \frac{1}{4}$. We call p(t) the prior, and p(t|D), obtained once the data arrives as evidence, the posterior. We update our belief in t with data, but the prior still has influence, which weakens as more data accumulates. Hence when data is scarce, the choice of prior is very important.
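
The update can be verified numerically with midpoint-rule integration (the posterior mean of 0.2 is a consequence of the resulting Beta(1,4) posterior, not something stated above):

```python
# Three-tails update: uniform prior p(t)=1, likelihood (1-t)^3,
# evidence m(D)=1/4, posterior p(t|D)=4(1-t)^3.
N = 100000
dt = 1.0 / N
grid = [(i + 0.5) * dt for i in range(N)]   # midpoint rule on [0, 1]

likelihood = [(1 - t) ** 3 for t in grid]
m_D = sum(likelihood) * dt                  # evidence, should be 1/4

posterior = [l / m_D for l in likelihood]   # p(t|D) = 4(1-t)^3
mean_t = sum(t * p for t, p in zip(grid, posterior)) * dt

print(m_D)     # about 0.25
print(mean_t)  # about 0.2: belief shifts toward tails, but not to t=0
```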

Wednesday 24 June 2015

Optical flow on UAV

Here is a good introduction to the principles behind PX4FLOW. Optical flow yields the angular rate of motion in the image, but that alone does not serve a UAV well: a slow object nearby and a fast object far away can produce the same optical flow. Hence optical flow must be combined with other sensors.


As shown in the figure above, point C is the center of the optical flow, b is the aircraft altitude measured by sonar, and $\alpha$ is the optical flow minus the pitch measured by the gyro, since the aircraft's pitching motion affects the optical-flow computation. a is the aircraft's ground speed. So given the altitude from the sonar, the pitch from the IMU, and the optical flow, we can compute the ground speed: knowing b and $\alpha$ gives a.


But when $\alpha = 90$, a and h both become infinite, and integration is needed to handle this. In that case, ground speed = (optical flow − pitch rate) × altitude. When the sonar is not accurate enough, GPS can be used to measure the ground speed instead, so that altitude = ground speed / (optical flow − pitch rate).

PX4FLOW has two working modes: 1. read out the ground speed and compute the altitude; 2. output the raw optical flow and let the aircraft compute altitude and ground speed itself.

When the aircraft is very close to the ground, optical flow = (ground speed / altitude) + pitch rate, so as the altitude shrinks, the optical flow blows up. To deal with this, one can:
1. Land at a slower speed.
2. Mount the optical-flow unit higher, for example under a wing rather than under the fuselage.
3. Use a wide-angle camera. PX4FLOW carries a camera with a 16 mm focal length; a shorter focal length gives a larger field of view and more accurate flow. Use an ordinary lens rather than a fisheye, because geometric distortion makes the flow inaccurate.
4. Use other sensors for landing.
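
The speed/altitude relations above can be sketched as follows (the units and numbers are illustrative; this is not PX4FLOW's actual code):

```python
# ground_speed = (optical_flow - pitch_rate) * altitude, and its
# inverse for the GPS-aided case. Flow and pitch rate in rad/s,
# altitude in meters -- all values invented for the demo.

def ground_speed(optical_flow, pitch_rate, altitude):
    return (optical_flow - pitch_rate) * altitude

def altitude_from_gps(ground_speed, optical_flow, pitch_rate):
    return ground_speed / (optical_flow - pitch_rate)

v = ground_speed(optical_flow=0.75, pitch_rate=0.25, altitude=10.0)
print(v)                               # 5.0 m/s
print(altitude_from_gps(v, 0.75, 0.25))  # 10.0 m, consistent
```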







Visual correspondence the lambert ambient shape space and the systematic design of feature descriptors

I could not understand the title of this paper at all; I only gathered that it is roughly about feature descriptors, so I bit the bullet and started reading, taking notes from the parts I could understand.

A triangle can be described by the coordinates of its vertices x1, x2, x3, or by a 2x3 matrix x = [x1, x2, x3]. In this sense a triangle is a point in 6-dimensional space. If we move the triangle around in space, its coordinates change, but the triangle itself does not. Shape spaces are used to express this invariance of the triangle under motion.

In a Euclidean frame under rigid motion, distances, angles, and orientations are preserved, so a transformation is described by multiplying the coordinate matrix x by a rotation matrix R and adding a translation T. If g = (R, T) denotes the rigid motion and gx the transformed coordinates, then x' = gx = Rx + T. Adding scale, g = ($\alpha$R, T) and x' = $\alpha$Rx + T, a similarity transformation. The essence of the triangle is whatever remains unchanged no matter how the coordinates are transformed.

Hence, for two triangles with coordinates x and y, we cannot simply use d(x,y) = ||x−y||: the same triangle has different coordinates in different frames, so d would not be 0. We would have to compare all the points and take the minimum distance over all possible transformations. (Has comparing two triangles become an optimization problem...)

To avoid computing all possibilities, we need canonization. For a triangle x, let x1 be the origin, i.e. x1 = 0, and shift the other points by T = −x1; every triangle is then represented as [0, x2−x1, x3−x1] = [0, x2', x3']. This removes two degrees of freedom, and every triangle is represented by the translation-invariant [x2', x3']. Next, rotation: apply R to bring x2' onto the horizontal axis. For scale, multiply by $\alpha$ so that x2' sits at distance 1 from the origin. After this process, comparing triangles is just x−y, with no minimization problem to solve; moreover, only x3' is needed to represent the triangle, reducing it from 6 dimensions to 2. (The mathematics behind the Trisolarans' dimensional strike?)

This, however, brings a problem. In the original 6-D space, the sum of triangles or a scaled triangle is still a triangle; in the new 2-D space this no longer holds. If the original triangles follow a Gaussian distribution, their 2-D representatives do not.

The canonization process is not unique: any of x1, x2, x3 could serve as the origin. Some canonizations, however, are not ideal. For instance, if we pick the leftmost point as the origin, then when the triangle rotates, the leftmost point changes and so does the canonization. A better choice is the centroid.
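
The canonization can be sketched as follows; dividing by x2' treated as a complex number performs the rotation and scaling in one step (my own implementation choice, not the paper's notation):

```python
# Translate x1 to the origin, rotate x2' onto the horizontal axis,
# scale it to unit length; the remaining x3' represents the triangle.

def canonize(x1, x2, x3):
    # translation: T = -x1
    ax, ay = x2[0] - x1[0], x2[1] - x1[1]
    bx, by = x3[0] - x1[0], x3[1] - x1[1]
    # rotation + scale: divide x3' by the complex number (ax + i*ay),
    # which sends x2' to (1, 0)
    d = ax * ax + ay * ay
    return ((bx * ax + by * ay) / d, (by * ax - bx * ay) / d)

t1 = [(0, 0), (2, 0), (1, 1)]
# the same triangle rotated 90 degrees, scaled by 3, and translated
t2 = [(5, 5), (5, 11), (2, 8)]
print(canonize(*t1))  # (0.5, 0.5)
print(canonize(*t2))  # (0.5, 0.5): identical after canonization
```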


Process Mat collectively for Fourier Transform

I am very fond of Matlab because it makes working with vectors and matrices easy. This is not the case in OpenCV: for element-wise operations on a Mat you often have to iterate through every row and column.

Here is an interesting discussion on how to do collective processing. Fortunately, in the case of the DFT it is easy to avoid iteration. What is less easy to comprehend is the OpenCV documentation on DFT, which reads like a tongue-twister.

To compute the DFT, use dft with the flags DFT_SCALE|DFT_COMPLEX_OUTPUT, as pointed out here. To compute the element-wise multiplication of two DFT results, use mulSpectrums, which gives x.*y. If you want x.*conj(y), set the last argument (conjB) to true.

Tuesday 23 June 2015

Git not ignore

I had a lot of trouble with git ignore, because most of the time I only want to include certain files, instead of ignoring some files.

Say I only want to add all the .m files and ignore the rest of my Matlab workspace. It is actually fairly simple, following this post: ignore all files by putting * at the beginning, then unignore all subdirectories with !*/, then tell git I want all the .m files with !*.m, as simple as that!

The gitignore file looks like this

*
!*.cpp
!*.h
!*.m
!*/

Also, according to this, if I am in a catkin workspace but have a subdirectory for Matlab code, I need to create another gitignore file in the matlab folder like this

*
!*.m
!*/

so that only .m files will be included.

If I want to include only the .cpp, .h, and .txt files in the src folder, then in the catkin workspace the gitignore looks like this

*
!*/
!/src/*.cpp
!/src/*.h
!/src/*.txt

Sunday 21 June 2015

Correlation filter

There are several quite stimulating tracking algorithms, such as TLD. This time I am going to learn more about correlation filters.

Let's start with the Discrete Fourier Transform (DFT), following this blog. In Matlab, F = fft2(f) computes the DFT, and F2 = log(abs(F)) visualizes the result. To increase the frequency resolution of the Fast Fourier Transform (FFT), use the zero-padded F = fft2(f,256,256) instead. The DFT can be used for template matching, for example searching for 'a' in an image full of text. The code is




Now we can study correlation filters in more detail. According to this blog, the first paper introducing this concept to tracking is Visual Object Tracking using Adaptive Correlation Filters, CVPR 2010, with the approach named Minimum Output Sum of Squared Error (MOSSE). Correlation filters (CF) are very fast: MOSSE runs at 669 frames per second (fps), advancing trackers from real time to high speed, while its accuracy is also high, about 73%.

Correlation here means cross-correlation. Suppose we have two signals f and g; their cross-correlation is

$\begin{array}{l} (f \star g)(\tau )\mathop = \limits^{def} \int_{ - \infty }^\infty {f^*(t)g(t + \tau )dt} \qquad (1)\\ (f \star g)(n)\mathop = \limits^{def} \sum\limits_{m = - \infty }^\infty {f^*[m]g(m + n)} \end{array}$

Intuitively, correlation measures the similarity of two functions at lag $\tau$. Consider the simple case where f and g have the same shape but g is delayed: their correlation reaches its maximum only when they are aligned, as in $g(t+\tau)$, since nothing is more similar to f than f itself.

In terms of tracking, a filter needs to be designed such that when it is applied on the target, the response is maximum.


Therefore, we can write out the equation as 
$g=h\text{ }\bigstar f \qquad (2)$
where g is the response, f is the input image, and h is the filter. g could take any form; we assume it is Gaussian in the figure above. Hence we only need to compute h. But why is CF so fast? The answer lies in the application of the FFT: convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain and vice versa (for correlation, one factor is conjugated), therefore
$F(h\bigstar f)={{(Fh)}^{*}}\odot Ff \qquad (3)$
where F stands for the FFT and $\odot$ is the element-wise product. Suppose f has n pixels; then the FFT of f takes $O(n\log n)$, which is also the cost of eq (3). This is much faster than other trackers, and it is the essence of the paper.
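
Eq (3) can be checked numerically on a toy 1-D signal; a naive O(n²) DFT stands in for the FFT here, and the signals are my own made-up values:

```python
# Circular cross-correlation computed directly equals the inverse
# DFT of conj(F h) ⊙ F f.
import cmath

def dft(x, inverse=False):
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(s * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

f = [1.0, 2.0, 3.0, 4.0]
h = [0.0, 1.0, 0.0, 0.0]

# direct circular cross-correlation: g[t] = sum_m h[m] * f[m + t]
direct = [sum(h[m] * f[(m + t) % 4] for m in range(4)) for t in range(4)]

F, H = dft(f), dft(h)
freq = dft([Hk.conjugate() * Fk for Hk, Fk in zip(H, F)], inverse=True)
via_dft = [v.real for v in freq]

print(direct)   # [2.0, 3.0, 4.0, 1.0]
print(via_dft)  # the same, up to floating-point rounding
```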
 
To compute h, let F(f) = F, ${(F(h))}^{*} = H^{*}$, F(g) = G, then we have
${{H}^{*}}=\frac{G}{F}\qquad (4)$
In reality, we have to consider m templates to accommodate appearance change. The objective function then becomes
$\underset{{{H}^{*}}}{\mathop{\min }}\,\sum\limits_{i=1}^{m}{|{{H}^{*}}{{F}_{i}}-{{G}_{i}}{{|}^{2}}} \qquad (5)$
Eq (5) is not difficult, since by the convolution theorem the computation is element-wise in the frequency domain. Hence we can solve for each element of ${{H}^{*}}$ separately, denoted $H_{w,v}^{*}$. Eq (5) becomes
$\underset{H_{w,v}^{*}}{\mathop{\min }}\,\sum\limits_{i=1}^{m}{|H_{w,v}^{*}{{F}_{w,v,i}}-{{G}_{w,v,i}}{{|}^{2}}}\qquad (6)$
We differentiate Eq (6) and set the derivative to zero. Note that differentiation in the complex domain differs from the real case:


$\begin{array}{l} \frac{\partial }{{\partial H_{w,v}^*}}\sum\limits_{i = 1}^m {(H_{w,v}^*{F_{w,v,i}} - {G_{w,v,i}}) \cdot } {(H_{w,v}^*  {F_{w,v,i}} - {G_{w,v,i}})^{\rm{*}}}{\rm{ = }}0\\ \Rightarrow \frac{\partial }{{\partial H_{w,v}^*}}\sum\limits_{i = 1}^m {H_{w,v}^*{F_{w,v,i}} \cdot {H_{w,v}}F_{w,v,i}^{\rm{*}} - H_{w,v}^*{F_{w,v,i}}G_{w,v,i}^{\rm{*}} - {H_{w,v}}F_{w,v,i}^{\rm{*}}G_{w,v,i}^{}{\rm{ + }}{G_{w,v,i}}G_{w,v,i}^{\rm{*}}} {\rm{ = }}0\\ \Rightarrow \sum\limits_{i = 1}^m {{F_{w,v,i}} \cdot {H_{w,v}}F_{w,v,i}^{\rm{*}} - {F_{w,v,i}}G_{w,v,i}^{\rm{*}}} {\rm{ = }}0\\ \Rightarrow {H_{w,v}}{\rm{ = }}\frac{{\sum\limits_{i = 1}^m {{F_{w,v,i}}G_{w,v,i}^{\rm{*}}} }}{{\sum\limits_{i = 1}^m {{F_{w,v,i}}F_{w,v,i}^{\rm{*}}} }} \qquad (7) \end{array}$

Applying Eq (7) to every element of H, we get
$H\text{=}\frac{\sum\limits_{i=1}^{m}{{{F}_{i}}\odot G_{i}^{\text{*}}}}{\sum\limits_{i=1}^{m}{{{F}_{i}}\odot F_{i}^{\text{*}}}}\qquad (8)$

To track a target, we only need to process the image using the filter H, and then the target location is the maximum response. The filter can be updated accordingly
${{H}_{t}}=(1-\eta ){{H}_{t-1}}+\eta H(t)\qquad (9)$
where ${H}_{t}$ is the filter obtained at frame t, and $\eta$ is the learning coefficient. We can normalize Eq (8) by adding $\epsilon$ in the denominator. We can also compute ${H}_{i}$ separately and then take the average.

The second blog by the same author addresses improvements on MOSSE, particularly Zhang Kaihua's ECCV 2014 paper on the STC tracker: Fast Visual Tracking via Dense Spatio-Temporal Context Learning. Zhang also proposed Compressive Tracking.


The figure above depicts the workflow of the algorithm. It is similar to MOSSE in its use of the FFT, but it has several original ideas as well:
1. Application of Dense Spatio-Temporal Context
2. Incorporation of probabilistic theory
3. Consideration of scale change in template updating

Dense Spatio-Temporal Context:

It may be difficult to track the target alone, due to appearance change or occlusion. If we also consider the background (the spatial context), the risk of losing the target is reduced. Take the figure above: if we track only the target (the yellow part), the tracker may get lost under occlusion; if we also consider the context (the red part), it helps us locate the target. Including the information contained in several frames gives the temporal context.

Incorporation of probabilistic theory:
Suppose $\mathbf{x}\in {{\mathbb{R}}^{2}}$ is a location and o is the target. We define the probability that o appears at x, the confidence map, as
m(x) = P(x|o) (1)
we also define the set of context as
${{X}^{c}}=\{\operatorname{c}(\mathbf{z})=(I(\mathbf{z}),\mathbf{z})|\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})\}$
where $x^{\bigstar }$ stands for the target location, ${\Omega }_{c}(x^{\bigstar })$ represents a region two times larger than the target, centered at $x^{\bigstar }$. $I(\mathbf{z})$ is the pixel value at Z. In short, the equation takes $x^{\bigstar }$ as its center and selects a region two times the area of target as the feature, like the red part in the above figure. The Eq (1) can be expressed as
m(x) = P(x|o) = $\sum\nolimits_{\operatorname{c}(\mathbf{z})\in X^{c}}{P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)P(\operatorname{c}(\mathbf{z})|o)}$ (2)
where $P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)$ accounts for the probability of the target being at x given the target and its context, while $P(\operatorname{c}(\mathbf{z})|o)$ represents the probability of a context belonging to the target, in other words the prior probability of the target's context. The purpose of $P(\operatorname{c}(\mathbf{z})|o)$ is to select contexts similar to the one surrounding the target, whereas $P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)$ considers whether it is reasonable for the target to appear here even when the context is similar, to avoid drift.

At the first frame, the target location is known. Therefore, we can build a confidence map such that the closer to the target, the higher the probability. The confidence map is defined as

$$m(\mathbf{x}) = b\,{{e}^{-{{\left| \frac{\mathbf{x}-{{\mathbf{x}}^{\bigstar }}}{\alpha } \right|}^{\beta }}}} \qquad (3)$$

where b, $\alpha, \beta$ are parameters set empirically. m(x) is the response, which in MOSSE is assumed to be Gaussian. The "dense" in the title refers to this confidence map, in the sense that for every pixel near the target we can define its probability based on Eq (3). A traditional tracker may use random sampling or define a step size for sampling, but here the probability is defined densely.
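Assuming the confidence map takes the exponential form of Eq (3), a minimal 1D sketch looks like this; the values of b, alpha, beta here are illustrative, not the paper's settings:

```python
import math

# 1D sketch of the dense confidence map, assuming the exponential form of
# Eq (3). b, alpha, beta are illustrative values only.
def confidence(x, x_star, b=1.0, alpha=2.25, beta=1.0):
    return b * math.exp(-abs((x - x_star) / alpha) ** beta)

x_star = 10
m = [confidence(x, x_star) for x in range(21)]
assert m[x_star] == max(m)  # the map peaks at the target location
assert m[8] > m[5] > m[0]   # and decays with distance, defined for every pixel
```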

For $P(\operatorname{c}(\mathbf{z})|o)$, the context prior is modeled as

$$P(\operatorname{c}(\mathbf{z})|o) = I(\mathbf{z})\,{{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }}) \qquad (4)$$

which is the Gaussian-weighted intensity of the pixels near the target. Once we have $P(\operatorname{c}(\mathbf{z})|o)$ and m(x), we can get $P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)$ from Eq (2). The computation process is similar to MOSSE: first express m(x) as the convolution of $P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)$ and $P(\operatorname{c}(\mathbf{z})|o)$, then transform to the frequency domain to use the element-wise product, after which transform back to the spatial domain. The target is at the maximum response. Let $P(\mathbf{x}|\operatorname{c}(\mathbf{z}),o)={{h}^{sc}}(\mathbf{x}-\mathbf{z})$

Note that ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$ measures the distance and direction between the target location and the context, and it is not symmetrical. According to the definition of convolution,

$$m(\mathbf{x}) = \sum\nolimits_{\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})}{{{h}^{sc}}(\mathbf{x}-\mathbf{z})I(\mathbf{z}){{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})} = {{h}^{sc}}(\mathbf{x})\otimes \left( I(\mathbf{x}){{\omega }_{\sigma }}(\mathbf{x}-{{\mathbf{x}}^{\bigstar }}) \right) \qquad (5)$$

Transforming to the frequency domain, Eq (5) becomes

$$\mathcal{F}\left( m(\mathbf{x}) \right) = \mathcal{F}\left( {{h}^{sc}}(\mathbf{x}) \right)\odot \mathcal{F}\left( I(\mathbf{x}){{\omega }_{\sigma }}(\mathbf{x}-{{\mathbf{x}}^{\bigstar }}) \right), \qquad {{h}^{sc}}(\mathbf{x}) = {{\mathcal{F}}^{-1}}\left( \frac{\mathcal{F}\left( m(\mathbf{x}) \right)}{\mathcal{F}\left( I(\mathbf{x}){{\omega }_{\sigma }}(\mathbf{x}-{{\mathbf{x}}^{\bigstar }}) \right)} \right) \qquad (6)$$
Unlike MOSSE, STC trains the template ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$ only on the first frame. During tracking, however, ${{h}^{sc}}(\mathbf{x}-\mathbf{z})$ is updated in a similar way to MOSSE.
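The learn-by-deconvolution idea can be sketched in 1D with a hand-rolled DFT. The signals below are made-up stand-ins for the confidence map and the weighted context, not real image data:

```python
import cmath

# 1D sketch of the frequency-domain solve: if m = h (*) g (circular
# convolution), then h = IDFT(DFT(m) / DFT(g)). Here g plays the role of
# I(z) * w_sigma(z - x*); both signals are made up.
def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def circ_conv(h, g):
    N = len(h)
    return [sum(h[(n - k) % N] * g[k] for k in range(N)) for n in range(N)]

g = [0.1, 0.5, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0]       # weighted context
h_true = [1.0, 0.6, 0.2, 0.0, 0.0, 0.0, 0.2, 0.6]  # "template" to recover
m = circ_conv(h_true, g)                            # confidence map

H = [M / G for M, G in zip(dft(m), dft(g))]         # divide in frequency domain
h_rec = idft(H)
assert max(abs(a - b) for a, b in zip(h_rec, h_true)) < 1e-9
```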

The paper also illustrates how the scale is updated. Recall from Eq (5) that

$m(\mathbf{x})=\sum\nolimits_{\mathbf{z}\in {{\Omega }_{c}}({{\mathbf{x}}^{\bigstar }})}{{{h}^{sc}}(\mathbf{x}-\mathbf{z})I(\mathbf{z}){{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})}$

where ${{\omega }_{\sigma }}(\mathbf{z}-{{\mathbf{x}}^{\bigstar }})$ gives Gaussian weights. Intuitively, it is like enclosing a circle around the target: the weights are higher inside the circle and lower outside. If the size of the target increases, we also enlarge the circle by adjusting $\sigma$. The process is illustrated as follows.

Suppose from frame t to t+1, the target size is multiplied by s. Equivalently, the axis unit is scaled by s. For convenience, let (u,v)=(sx,sy), and assume the target is at (0, 0) in frame t,

since 

${{\omega }_{\sigma }}(x,y)=\alpha {{e}^{-\frac{{{x}^{2}}+{{y}^{2}}}{{{\sigma }^{2}}}}},{{\omega }_{\sigma }}(x/s,y/s)=\alpha {{e}^{-\frac{{{x}^{2}}+{{y}^{2}}}{{{(s\sigma )}^{2}}}}}$
so
${{\omega }_{\sigma }}(x/s,y/s)={{\omega }_{s\sigma }}(x,y)$
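This identity can be checked numerically; a is an arbitrary amplitude:

```python
import math

# Numeric check of the identity w_sigma(x/s, y/s) = w_{s*sigma}(x, y),
# for the Gaussian weight w_sigma(x, y) = a * exp(-(x^2 + y^2) / sigma^2).
def w(sigma, x, y, a=1.0):
    return a * math.exp(-(x * x + y * y) / sigma ** 2)

s, sigma = 1.5, 4.0
for x, y in [(0, 0), (1, 2), (-3, 5)]:
    assert abs(w(sigma, x / s, y / s) - w(s * sigma, x, y)) < 1e-12
```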
Then Eq (8) becomes

From frame t to frame t+1, substituting the coordinates,
$h_{t}^{sc}(u/s,v/s)\approx h_{t+1}^{sc}(u,v)$
${{I}_{t}}(u/s,v/s)\approx {{I}_{t+1}}(u,v)$
Eq (9) becomes
Suppose the target shrinks from frame t to t+1. The integration in Eq (10) consists of two parts: one is the context in frame t+1 (red rectangle), the other is the context in frame t (blue rectangle) minus the red part. In math terms,
 
Because $\omega$ is Gaussian, the right part of Eq (11) is small and can effectively be treated as 0. Meanwhile, since $s{{\sigma }_{t}} = {{\sigma }_{t+1}}$, the left part is approximately ${{c}_{t+1}}(0,0)$,
and therefore
Other tricks include smoothing s with a sliding-window average. The most interesting parts are the probabilistic formulation and the adaptive window size. As for the context, it can be combined with other features to increase robustness.

Saturday 20 June 2015

Music related to football

As a football buff, I enjoy the background music in match replays. The most famous one is arguably "Attraction", used in Total Soccer. "Electric Romeo" is also nice; I often hear it on qq.com.

Pattern recognition using MPL, MIL, MCL

This blog is very stimulating, as the author likes to share her explanations of advanced vision algorithms and code.

This time it is about Multi-Instance Learning (MIL) and Multi-Pose Learning (MPL), proposed by Boris Babenko at UC San Diego. MIL refers to learning from a bag of instances of the object, instead of from a single labelled instance; the training samples are adjusted so that they lie in correspondence. MPL, on the other hand, means learning different poses with different classifiers: it separates the data into groups and trains a separate classifier for each group. In other words, divide and conquer.



In the figure, the right part refers to MIL, where every row constitutes the bag of training samples for one object. The left part refers to MPL, where every row contains different poses. Each color represents a class.


Compared to MPL, MIL is more widely used. Unlike boosting, MPL uses

where the single label y used in traditional boosting is replaced by a combination of the $y^k$; k indexes the pose class, and the combined label is 1 as long as at least one class label is 1.

The iterative training procedures are similar, except that MPL requires extra training for each $y^k$.

Prof Cipolla proposes Multi-Class Learning (MCL), where multiple classes are learnt jointly. It uses the Noisy-OR model, as in
The update stage in MCL is clearer than in MPL:
where $w_{ki}$ is the weight, k indexes the class, and i indexes the sample. Note that instead of labels in {-1, 1}, MCL uses {0, 1}. Therefore, if a sample is negative for a class, it will not be used in the next round of training, while positive samples for that class receive larger weights. Within class k, wrongly classified samples have their weights increased, similar to traditional boosting.
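The Noisy-OR combination itself is simple to sketch; the instance probabilities below are made up:

```python
# Sketch of the Noisy-OR model: the bag is positive if ANY instance is,
# i.e. p(y=1) = 1 - prod_i (1 - p_i). Probabilities are made up.
def noisy_or(probs):
    p = 1.0
    for pi in probs:
        p *= 1.0 - pi
    return 1.0 - p

assert abs(noisy_or([0.0, 0.0, 0.9]) - 0.9) < 1e-12  # one strong instance wins
assert noisy_or([0.1, 0.1, 0.1]) < 0.3               # all weak -> low probability
```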
 

Friday 19 June 2015

Math in vision

I often find it difficult to understand the mathematical derivations in papers, especially those published in PAMI or IJCV. Sometimes the math looks so intimidating that I only read the introduction and conclusion and skip the rest. However, as pointed out by pluskid, reading only the introduction, conclusion, and references is merely a way to give myself a sense of achievement, without actually thinking the work through.

A discussion on how to acquire the necessary math knowledge is quite stimulating. There are several ways to this end, which can be summarized as follows.

Video lectures, such as the MIT opencourseware. I learnt linear algebra from Prof. Gilbert Strang in this way.

Courses. University graduate courses provide plenty of material on the basic concepts in this field. This course is fundamental and recommended, as are this and this.

Textbooks. However, I often do not have the patience to read all the pages. I normally read only the parts relevant to my current research, because I forget the equations and concepts quickly without applying them. It is reported that A First Course in Linear Algebra is a good reference book. In my opinion, the books catering to OpenCV are really great, as they explain the math in an easy-to-follow way and are full of examples.

Papers. Most papers only include advanced math, but some of them care to elaborate more.

In short, a list of required math may be

Linear Algebra
Singular Value Decomposition
Introductory level Pattern Recognition
Principal Component Analysis
Linear Discriminant Analysis
Fourier Transform
Wavelets
Probability, Bayes rule, Maximum Likelihood, MAP
Mixtures and Expectation-Maximization Algorithm
Introductory level Statistical Learning
Support Vector Machines
Genetic Algorithms
Hidden Markov Models
Bayesian Networks
Kalman filtering

The math is critical for computer vision research, as discussed here. Essentially, to solve a problem we need to choose the most effective mathematical model to formulate it. For instance, graph theory comes in handy for segmentation. If we are not aware of the math involved, the vision algorithms we develop may appear ad hoc.

Thursday 18 June 2015

MSER

I have used MSER for feature point detection largely due to its robustness in locating homogeneous regions. In some cases color threshold does not work well due to illumination change, so it is desirable to bypass the thresholds and detect regions with the same color automatically.

Maximally Stable Extreme Region (MSER) is discussed here. A region is a part of an image where for any two pixels A and B there is a path connecting them. Assume the intensity of the image ranges from 1 to 255; for each intensity i we consider those pixels with intensity not larger than i. For example, suppose we have the 1D signal 1,2,3,4,1,2,3,4. If i = 2, the first and second pixels are connected, and the fifth and sixth pixels are connected, but the two groups are separated by the pixels with values 3 and 4.

For an intensity i, an extreme region is a region that cannot be made any larger under the current i. One i may correspond to multiple extreme regions. For instance, consider a flag with stars: every star is an extreme region, but all the stars taken together are not an extreme region, and neither is any partial star.

All the extreme regions in an image form a tree structure with parent and child nodes. The relationship between two extreme regions is either exclusive (no intersection at all) or inclusive (one lies totally inside the other), and thus they form a tree. If extreme region A contains extreme region B, with no other region in between, then A is the parent node of B. All regions share the same ancestor node. For the flag with stars, the whole flag is one extreme region, as is each star, and every pixel in a star also belongs to the flag region.

To find the extreme regions, we build the tree in a bottom-up way, listing the pixels in ascending order of intensity. Suppose the current pixel being processed is x, and all previously processed pixels already have corresponding extreme regions as well as the parent-child structure. To process x, we consider it case by case:
1. Some pixels around x are already processed, meaning those pixels are not larger than x.
2. No pixels around x are processed.
For case 2, simply build an extreme region containing only x. For case 1, suppose the processed pixels around x are y; then there are two sub-cases:
a. y < x: x will not influence the existing extreme region. We only need to create a parent node for the region containing y, to include x.
b. This case is more complicated. 
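The bottom-up pass described above can be sketched in 1D with union-find. This is a toy illustration of the region merging only, not the full MSER stability test:

```python
# Toy 1D sketch of the bottom-up pass: process pixels in ascending intensity
# order, merging already-processed neighbours with union-find. After the last
# pixel of each intensity level, record the sizes of the current regions.
def extremal_regions_1d(signal):
    parent, size = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    processed = set()
    regions = {}  # intensity level -> sorted region sizes at that threshold
    for i in sorted(range(len(signal)), key=lambda k: signal[k]):
        parent[i], size[i] = i, 1
        processed.add(i)
        for j in (i - 1, i + 1):  # case 1: a neighbour was already processed
            if j in processed:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
                    size[ri] += size[rj]
        roots = {find(k) for k in processed}
        regions[signal[i]] = sorted(size[root] for root in roots)
    return regions

r = extremal_regions_1d([1, 2, 3, 4, 1, 2, 3, 4])
assert r[2] == [2, 2]  # at threshold 2: two separate regions (the example above)
assert r[4] == [8]     # at threshold 4: everything merges into one region
```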

Latex in Blogger

To be able to input LaTeX equations in Blogger, MathJax has to be used. Although several methods are available online, only this one works for my blog.
To enable it, click Template on the left side of the Blogger edit page, then click Edit HTML, and add the following code right after the first <head>

<script src='http://cdn.mathjax.org/mathjax/latest/MathJax.js' type='text/javascript'>
MathJax.Hub.Config({
 extensions: [&quot;tex2jax.js&quot;,&quot;TeX/AMSmath.js&quot;,&quot;TeX/AMSsymbols.js&quot;],
 jax: [&quot;input/TeX&quot;, &quot;output/HTML-CSS&quot;],
 tex2jax: {
     inlineMath: [ [&#39;$&#39;,&#39;$&#39;], [&quot;\\(&quot;,&quot;\\)&quot;] ],
     displayMath: [ [&#39;$$&#39;,&#39;$$&#39;], [&quot;\\[&quot;,&quot;\\]&quot;] ],
 },
 &quot;HTML-CSS&quot;: { availableFonts: [&quot;TeX&quot;] }
});
</script>

Wednesday 17 June 2015

Compressive tracking

There are two main approaches for tracking according to this blog.
One is generative, building a model of the object while tracking. The tracker learns the model from the current frame and uses it as the template to track in the next frame. The disadvantage of this method is that at the beginning of the video there are few frames to learn from, so the object must not change much; otherwise, the tracker may drift.
The other is discriminative, which is essentially a binary classifier separating the object from the background. However, normally only one positive sample and a few negative samples are used for training, and the tracker may drift when the template is noisy or not well aligned.
Fast compressive tracking works in the following way.
1. Obtain the target. The image is convolved with Gaussian kernels at different scales to obtain multi-scale information about the target. To improve speed, the Gaussian kernels are replaced by a rectangular filter bank chosen as

where w is the width and h is the height of the window. The result is

Convolving a W$\times$H input image yields an output that is still W$\times$H. To combine these outputs, each is reshaped into a column vector. For one input image there are w$\times$h outputs, so after concatenating all the column vectors the dimension is very high and dimensionality reduction is required.
2. Obtain the sparse measurement matrix. According to compressive sensing theory, a small number of randomly generated measurements can preserve the salient information in an image. A signal that is compressible in this way is called a K-sparse signal.
Based on this blog, the Johnson-Lindenstrauss lemma shows that the distances between points in a high-dimensional space can be preserved, with high probability, when the points are projected into a lower-dimensional subspace. In other words, the relationships in the high-dimensional space can be preserved in its subspace. Thus, the measurement only needs to be performed k+1 times to recover the K-sparse signal in the original space. Baraniuk proved that a random matrix satisfying the Johnson-Lindenstrauss lemma also satisfies the restricted isometry property (RIP). Thus, if R satisfies Johnson-Lindenstrauss and x is compressible or sparse, then x can be recovered from v.
In this paper, the authors use a sparse random measurement matrix R to extract the salient information of the K-sparse signal, and thus project the high-dimensional signal into a low-dimensional space. A random matrix of the following form can be used as R

when $\rho$ is 1 or 3. When $\rho$ is 3, two thirds of the entries are zero and require no computation.
In the paper $\rho$ is $\frac{m}{4}$, so only c (< 4) computations are performed for every row, and thus the complexity is O(cn). Multiplying this n$\times$m matrix with the high-dimensional m$\times$1 vector results in a low-dimensional n$\times$1 vector

where black parts of R are positive, gray parts are negative, and white parts are 0. It is evident that R is sparse, so the computational load is decreased. The blue arrow indicates that a non-zero element of R compresses an element of x, equivalent to convolving a rectangular region at a certain location in the image.
In the output vector v, every element corresponds to the sum of the entries of x where R is non-zero, which means it contains the summed information of many local regions.
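A sketch of such a sparse random measurement matrix and projection; the entry probabilities follow the usual very-sparse random projection pattern (an assumption here, with made-up sizes):

```python
import random

# Sketch of a sparse measurement matrix: entries are +sqrt(rho) or -sqrt(rho),
# each with probability 1/(2*rho), and 0 otherwise, so about (1 - 1/rho) of
# the entries are zero and cost nothing to multiply.
def sparse_measurement_matrix(n, m, rho=3, seed=0):
    rng = random.Random(seed)
    s = rho ** 0.5

    def entry():
        u = rng.random()
        if u < 1.0 / (2 * rho):
            return s
        if u < 1.0 / rho:
            return -s
        return 0.0

    return [[entry() for _ in range(m)] for _ in range(n)]

def project(R, x):
    # v = R x: compress the high-dimensional feature x into low-dimensional v,
    # skipping the zero entries
    return [sum(r * xi for r, xi in zip(row, x) if r != 0.0) for row in R]

R = sparse_measurement_matrix(10, 100, rho=3)
v = project(R, [1.0] * 100)
assert len(v) == 10
zeros = sum(1 for row in R for e in row if e == 0.0)
assert zeros > 500  # roughly two thirds of the 1000 entries are zero
```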
4. Bayesian classification.

Here v represents the feature, and p(y=1) and p(y=0) are the prior probabilities of positive and negative samples; in fact p(y=1)=p(y=0). It is shown that p($v_{i}$|y=1) and p($v_{i}$|y=0) follow Gaussian distributions.


where $\lambda$ is the learning-rate parameter. The probability distributions in three different low-dimensional spaces are

To evaluate the positive and negative samples:

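The classifier can be sketched as a sum of Gaussian log-likelihood ratios; the means and variances below are made up rather than learned from samples:

```python
import math

# Sketch of the classifier: H(v) = sum_i log( p(v_i|y=1) / p(v_i|y=0) ) with
# Gaussian likelihoods and equal priors p(y=1) = p(y=0). All parameters are
# made up, not learned.
def log_gauss(v, mu, sigma):
    return -((v - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma)

def response(v, mu1, s1, mu0, s0):
    return sum(log_gauss(vi, a, b) - log_gauss(vi, c, d)
               for vi, a, b, c, d in zip(v, mu1, s1, mu0, s0))

mu1, s1 = [1.0, 2.0], [0.5, 0.5]    # positive-sample model
mu0, s0 = [-1.0, -2.0], [0.5, 0.5]  # negative-sample model
assert response([1.1, 1.9], mu1, s1, mu0, s0) > 0    # looks positive
assert response([-0.9, -2.1], mu1, s1, mu0, s0) < 0  # looks negative
```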
Steps of the algorithm
1.1 If the frame is the first one, manually label the object with a rectangular region, and generate rectangular windows randomly to extract Haar-like features.
1.2 Using the current object location as the center with a radius of 4 pixels, extract 45 positive samples. Choose 50 negative samples in the annulus between two circles centered at the object: one with a radius of 8 pixels, the other with a radius of 30 pixels. Compute the integral image of the original input, then extract the positive and negative samples based on the Haar-like features, and update the classifier.
2. If the frame is not the first one, use the object location in the previous frame as the center and a radius of 25 pixels to obtain 1100 candidate regions in a brute-force way. Compute the integral images of these regions and generate Haar-like features using the templates. Use the classifier to select the most probable region as the target.
Repeat 1.2.
The C++ code of this method, without the multi-scale part, is here and here.
Coarse to fine
Coarse to fine is used to reduce the computational complexity.
1. Coarse grained search
center: previous object location, radius: a large radius $\gamma_{c}$ = 25, step: a large number of pixels $\Delta_{c}$  = 4.
2. Fine grained search
center: coarse-grained detected location, radius: small radius $\gamma_{f}$ = 10, step: a small number of pixels $\Delta_{f}$  = 1.
For the exhaustive search with $\gamma_{c}$ = 25 and a step of 1 pixel, there will be 1,962 windows, whereas the coarse-to-fine search uses only 436 windows.
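The window counts can be reproduced approximately by counting grid points inside the search circles (the exact totals depend on how the circle boundary is handled):

```python
# Count candidate windows on a grid of step d inside a circular search region
# of radius r. This is an approximation of the paper's window counts.
def num_windows(radius, step):
    k = radius // step
    return sum(1
               for ix in range(-k, k + 1)
               for iy in range(-k, k + 1)
               if (ix * step) ** 2 + (iy * step) ** 2 <= radius ** 2)

exhaustive = num_windows(25, 1)  # dense search, step 1
coarse = num_windows(25, 4)      # coarse stage, step 4
fine = num_windows(10, 1)        # fine stage around the coarse result
assert exhaustive > 1500
assert coarse + fine < exhaustive  # far fewer windows overall
```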




Tuesday 16 June 2015

Opencv and Matlab image structure

I need to use OpenCV to read an image but process it in a C++ function written for a Matlab mex file. Therefore, the OpenCV Mat structure needs to be converted to the Matlab image layout. A naive way would be to allocate memory for a new image with the Matlab layout and then copy the pixel values one by one, but it is reported here that this takes 50 ms!

To speed up the process, we should dig into the two data structures and see what is different. According to this blog, Mat.data gives a pointer to the underlying data matrix, which is a 1D array with values arranged as b1,g1,r1,b2,g2,r2,… in the BGR format, row by row. To get the pixel in the i-th row and j-th column, use
b = img.data[img.step * i + 3 * j];
g = img.data[img.step * i + 3 * j + 1];
r = img.data[img.step * i + 3 * j + 2];

It is a different story for Matlab, based on the official documentation here.
In an array C, each element specifies the color of one pixel, and the image is an m-by-n grid of pixels where m is the number of rows and n is the number of columns in C. Since Matlab arrays are stored column-major, the image is stored column first instead of row first. Also, the RGB values are stored in three separate planes: C(i, j, 1) is red, C(i, j, 2) is green, and C(i, j, 3) is blue.



Git delete remote file

Sometimes I accidentally push too much stuff to the remote repository, which makes it messy. In such cases, I would like to delete the excess branches or files.

To delete a remote directory that is no longer wanted, use the commands found here
git rm -r --cached some-directory
git commit -m 'Remove the now ignored directory "some-directory"' 
git push origin master

To delete a branch, use the command found here
git push origin --delete <branchName>

Sometimes git push takes a long time to finish, and git status shows: "Your branch is ahead by X commits". In this case, the local commits may contain something wrong. To reset back to the current commit in the remote repository, use the command here
git reset --hard origin/master

Monday 15 June 2015

Mat ROI coordinate

I remember a joke in Friends, when Chandler tells Joey it is a J in joincidence, not a C in coincidence.

In OpenCV, Rect is often used when setting the ROI of a Mat. However, the coordinates can be confusing. Take an example from here: what should the lower-left quarter of an image be? It should be Rect( 0, image.rows/2, image.cols/2, image.rows/2 ), not Rect( image.rows/2, 0, image.cols/2, image.rows/2 ).

Unlike accessing elements of a Mat directly, e.g. image.at<float>(row, col), Rect takes (x, y), corresponding to (col, row). Therefore, a Rect is always set as Rect(int x, int y, int width, int height), where width is the number of columns and height is the number of rows.
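A pure-Python stand-in for the Rect(x, y, width, height) convention makes the lower-left example concrete (the helper name is hypothetical):

```python
# Pure-Python stand-in for cv::Rect(x, y, width, height): x is the column
# offset, y is the row offset. roi_ranges is a hypothetical helper, not an
# OpenCV function.
def roi_ranges(x, y, width, height):
    return range(y, y + height), range(x, x + width)

rows, cols = 100, 200
# lower-left quarter: columns start at 0, rows start at rows/2
r, c = roi_ranges(0, rows // 2, cols // 2, rows // 2)
assert (min(r), max(r)) == (50, 99)  # bottom half of the rows
assert (min(c), max(c)) == (0, 99)   # left half of the columns
```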

A good illustration of opencv image coordinate system can be found here
Bear in mind this coordinate system when converting OpenCV image coordinates to camera coordinates.