Sunday 28 June 2015

Bayes theorem

Bayes theorem is fundamental to image analysis. Prof Yuille gave a talk in which he mentioned that the portrait on his slide may not actually be of Bayes, and that Bayes theorem may not even have been developed by Bayes in the first place...

Here is an interesting blog post about this theorem. In high school, we often had to solve this kind of question: there are M black balls and N white balls in a bag; what is the probability that a ball you take out is black? Bayes theorem was originally proposed to solve the reverse problem: if we do not know how many black and white balls are in the bag, but take some out at random and observe their colors, can we infer the proportion of each color in the bag?

Another example. Suppose that in a school 60% of the students are boys and 40% are girls. Boys always wear pants, while half of the girls wear pants and the other half wear dresses. It is easy to calculate the probability that a student wears pants or a dress. Conversely, suppose you encounter someone wearing pants on campus: can you tell whether you have seen a boy or a girl?

Let's rephrase the question. Suppose you wander around the campus and see N students wearing pants; how many of them are boys and how many are girls? It is easy. We need to calculate how many students in the school wear pants, and how many among them are girls, and then we are done. If there are U students, then the number of boys wearing pants is U * P(Boy) * P(Pants|Boy) = U * 60% * 100%. Likewise, for girls the figure is U * P(Girl) * P(Pants|Girl) = U * 40% * 50%. The proportion of girls wearing pants among all students wearing pants, multiplied by N, is the number of girls you have encountered.

Apparently, the total number of students U is irrelevant. The simplified answer is $P(Girl|Pants)=\frac{P(Girl)*P(Pants|Girl)}{P(Boy)*P(Pants|Boy)+P(Girl)*P(Pants|Girl)}$.

The equation above is nothing but P(Pants, Girl) over P(Pants): the proportion of girls, P(Pants, Girl), among all students wearing pants, P(Pants).
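To make this concrete, here is a quick sketch in Python using only the numbers from the example above:

```python
# Minimal sketch of the pants/girl example above.
p_boy, p_girl = 0.6, 0.4          # priors
p_pants_given_boy = 1.0           # boys always wear pants
p_pants_given_girl = 0.5          # half of the girls wear pants

p_pants = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl
p_girl_given_pants = (p_girl * p_pants_given_girl) / p_pants

print(p_girl_given_pants)  # 0.25: a quarter of the pants-wearers are girls
```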

Another example: spelling correction. If we see a user enter a word that is not in the dictionary, we have to guess: what is this user really trying to say? That is, we want P(the word we guess she meant | the word she actually entered), and we want to maximize this probability to find the intended word. Of course, there may be more than one plausible answer. For instance, when the user enters thew, does she really want to enter the or thaw? Suppose our guesses are h1, h2, ... (h stands for hypothesis), all belonging to a finite hypothesis space (since the choice of words is limited). Suppose the word the user actually enters is D (for data). Then P(h1|D), P(h2|D), ... can all be written as P(h|D). Using Bayes theorem, we have P(h|D) = $\frac{P(h)*P(D|h)}{P(D)}$.

For h1, h2, h3, ..., P(D) remains the same, so we can ignore this term and only compare P(h)*P(D|h). The moral of the story is that, for given data, whether a guess is good or bad depends on the probability of the guess itself (the prior), as well as the probability of the guess leading to the data (the likelihood). In the example above, the probability that the user wanted to enter the depends on how frequently the word the occurs in ordinary usage, and on the probability of typing thew when the was intended.

The rest is simple. We only need to compute P(h)*P(D|h) for every possible hypothesis and find the maximum, which is the most probable word.
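Here is a minimal sketch of that argmax. The candidate list, word frequencies and typo likelihoods are made-up illustrative numbers, not estimates from any real corpus:

```python
# Hypothetical numbers for illustration only.
typed = "thew"
candidates = {
    # hypothesis: (prior P(h) from word frequency, likelihood P(typed | h))
    "the":  (0.05,   0.001),    # very common word, plausible e/w slip
    "thaw": (0.0001, 0.0001),   # rare word, and an a/e slip is less likely
}

best = max(candidates, key=lambda h: candidates[h][0] * candidates[h][1])
print(best)  # "the" wins once prior and likelihood are multiplied
```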

Is there any alternative to Bayes? One common approach is to choose the candidate closest to thew, but the and thaw are both one letter away from it. Edit distance alone is insufficient to determine the correct word.

We also notice (maybe with some help from Holmes) that e and w are close on the keyboard, so it is easy for a finger to press the wrong key. In contrast, e and a are far apart, and it is unlikely to press e instead of a. So the probability of thaw becoming thew is not very high. This is essentially the process of estimating P(D|h).

What does Bayes calculate? P(h)*P(D|h): there is an extra term P(h), the prior. Why do we need the prior? Suppose the user enters tlp; is it top or tip? Suppose also that top is more common than tip. This time, since both o and i are close to l, we cannot decide based on P(D|h) alone. Since top is more common, it is more probable that the user meant top.

Another problem with maximum likelihood is that even though a guess may fit the data well, it may still not have a high posterior probability, simply because the guess itself is very improbable a priori.

For instance, for -1 3 7 11, is an arithmetic progression more probable, or the rule $x_{n+1}=-\frac{x_n^3}{11}+\frac{9}{11}x_n^2+\frac{23}{11}$? Of course it is the former. For another example, when fitting curves to points, we could use either a high-order polynomial or a straight line. For points that lie almost on a line, the residual error of the high-order polynomial is certainly lower than that of the straight line, but do we really want to use such a complex equation? The answer is that we tend to choose the hypothesis whose P(h) is much higher. Admittedly, we cannot fit a line to a genuinely complex pattern either. That's why we need the product P(h)*P(D|h).

The moral of the story is that there are all kinds of errors in measurement. So if we seek a perfect fit too aggressively, we may run into overfitting, where even the noise is absorbed into the model, which is not needed at all. Hence a high P(D|h) does not necessarily mean that h is a good hypothesis, since it also depends on P(h).
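Here is a small illustration of the overfitting effect on synthetic, near-linear points (the data is made up for demonstration):

```python
import numpy as np

# Synthetic points that lie almost on a line (illustrative data only).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(scale=0.05, size=x.size)

line = np.polyfit(x, y, deg=1)    # simple hypothesis
poly = np.polyfit(x, y, deg=8)    # complex hypothesis, fits the noise too

for name, coeffs in (("line", line), ("degree-8", poly)):
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(name, residual)  # the degree-8 fit has a much smaller residual,
                           # yet we still prefer the line: its P(h) is far higher
```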

Occam's Razor refers to the idea that if two theories explain a phenomenon equally well, then we prefer the simpler one, which is also the more common one.

In reality, measurement noise often follows a Gaussian distribution, so your measurement is probably not exactly what the model predicts. In this case we do not need to gild the lily and change our model, because the deviation of the data from the model comes from factors not captured by the limited variables in the model. For example, the relationship between height and weight is roughly a second-order polynomial, but everyone knows height is not the only factor that affects weight.

In other words, the theory with higher P(h) is preferable. By contrast, maximum likelihood favors the hypothesis with higher P(D|h), because it fits the data better. Bayes theorem balances the two. Suppose you toss a coin, which lands either heads or tails. If you see one head, you have to estimate the probability that a toss of this coin gives heads. Maximum likelihood tends to set that figure to 1, because that maximizes P(D|h). This is unreasonable, because the coin also has a tail.

Our common sense is that most coins are unbiased, and only a small portion of them are biased. We model this prior belief as a bell-shaped distribution $p(\theta)$ concentrated around 0.5, where $\theta$ represents the probability that a coin gives heads and p is a probability density function. Instead of maximizing P(D|$\theta$) alone, we actually maximize P(D|$\theta$)*p($\theta$), and $\theta = 1$ no longer makes sense because p($\theta = 1$) is essentially zero, so the product vanishes. In fact, we can find the maximum by differentiation.
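Here is a minimal sketch of that maximization on a grid. The exact shape of the prior below (peaked at 0.5, vanishing at 0 and 1) is my own illustrative choice:

```python
import numpy as np

# Data: one toss, one head.
heads, tails = 1, 0

theta = np.linspace(0, 1, 1001)
likelihood = theta**heads * (1 - theta)**tails        # P(D | theta)
prior = theta * (1 - theta)                           # assumed prior: peaked at 0.5, zero at the endpoints (unnormalized)

ml_estimate = theta[np.argmax(likelihood)]            # 1.0: maximum likelihood
map_estimate = theta[np.argmax(likelihood * prior)]   # pulled back toward 0.5 by the prior

print(ml_estimate, map_estimate)
```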

Previously we assumed that the prior is known. Now suppose we do not know beforehand which hypotheses are common, and have to assume they are all equally likely. With a uniform prior, the best we can do is maximum likelihood.

There was an interesting debate between frequentist statisticians and Bayesians. The former say: let the data speak for itself (we do not give a hack about the prior). The latter say: data can have all kinds of bias, and a prior can make the model more robust against noise. It turns out that the Bayesians have largely won, and the key to their success is that the prior is itself a result of statistics, grounded in empirical evidence.

An ambiguous sentence goes like this: The girl saw the boy with a telescope. Did the girl see the boy using a telescope, or did she see a boy holding a telescope?

The prior of both interpretations is nearly the same, but we tend to choose the first one. Why? Because we do not believe in coincidence. We ask ourselves: why would the boy happen to hold a telescope, which is a tool for seeing, instead of a book or a football? The probability is too low! So the only plausible explanation is that there is some necessity behind the "coincidence". (Sounds like what Holmes always tells Watson.)

The reasonable explanation is that if we interpret the sentence as the girl seeing the boy by using a telescope, then the data fits perfectly: since the girl is observing the boy, it is no longer a coincidence that a telescope is mentioned.

The above focuses on P(D|h). If an interpretation has to rely on coincidence to explain the data, its likelihood is lower, so we prefer the more reasonable, and thus simpler, explanation. This is the Bayesian Occam's Razor: this razor works on P(D|h), whereas the traditional Occam's Razor works on P(h).

Going back to the line-fitting example. What is the chance that a high-order polynomial generates data lying approximately on a line? Not high. Conversely, the chance that a linear model generates data lying almost on a line is much larger.

Bayesian inference has two steps. First, build models based on the data. Second, use the models to predict the probability of unknown events. So far we have only cared about the maximum, but the other models do not necessarily have zero probability. For instance, given the data, model 1 has probability 0.5, model 2 has 0.4, and model 3 has 0.1. If we only want to know which model is best, we choose model 1, end of story.

More often than not, however, we want to predict things. Every model makes its own prediction, and letting a single model dictate the answer just because it has a somewhat higher probability is not democratic. Optimal Bayesian inference takes a weighted average of the models' predictions. Though it sounds ideal, we seldom use it: for one thing, model construction is very time consuming; for another, the model space may be continuous, and there is no way to build infinitely many models.

Let us look at more examples of Bayes theorem; it is at the core of machine learning. Consider Chinese word segmentation. The sentence is 南京市长江大桥, which means the Changjiang river bridge in Nanjing. In Chinese, there are two possible ways to segment this:

1. 南京市/长江大桥, which carries the intended meaning.
2. 南京/市长/江大桥, which would mean the mayor of Nanjing is Jiangdaqiao, which is not correct.

Let's analyse it using Bayes. Suppose X is the sentence and Y is a way to segment it; we seek the Y that maximizes P(Y|X), which is equivalent to maximizing P(Y)*P(X|Y). In plain words, this is the probability of the segmentation multiplied by the probability that the sentence is generated by that segmentation. More specifically, we can treat P(X|Y) as 1, since both segmentation 1 and segmentation 2 generate the sentence exactly. Hence we only need to maximize P(Y).

How do we calculate the probability of a sequence of words w1, w2, w3, w4? By the chain rule, P(w1, w2, w3, w4) = P(w1)*P(w2|w1)*P(w3|w2,w1)*P(w4|w3,w2,w1). With an increasing number of words it becomes difficult to estimate $P(w_{n}|w_{n-1}...w_1)$. To solve this, computer scientists, as usual, make a naive assumption: the probability of a word only depends on the previous k (k<3) words. Though the assumption seems too young, too simple, the results are good and robust.

With this assumption, we rewrite P(w1, w2, w3, w4) = P(w1)*P(w2|w1)*P(w3|w2)*P(w4|w3). Since P(Jiangdaqiao|mayor) is essentially 0, segmentation 2 scores almost nothing, and segmentation 1 (Nanjing city / Changjiang river bridge) wins.
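Here is a toy version of this comparison. The bigram probabilities are invented for illustration; real values would be estimated from a corpus:

```python
# Illustrative bigram probabilities; real values would come from a corpus.
bigram = {
    ("<s>", "南京市"): 0.01, ("南京市", "长江大桥"): 0.005,
    ("<s>", "南京"): 0.02,  ("南京", "市长"): 0.003, ("市长", "江大桥"): 0.0,
}

def score(segmentation):
    """P(w1)P(w2|w1)... approximated with bigrams, starting from <s>."""
    p = 1.0
    prev = "<s>"
    for w in segmentation:
        p *= bigram.get((prev, w), 0.0)
        prev = w
    return p

print(score(["南京市", "长江大桥"]))      # positive
print(score(["南京", "市长", "江大桥"]))  # zero, because P(江大桥 | 市长) = 0
```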

In fact, the aforementioned examples are rather shallow. At this level, machine learning only deals with surface phenomena, and because the level is shallow, the world looks more complicated from there. In machine learning terms, we need more features, so the dimensionality increases dramatically, which leads to the curse of dimensionality, and the data becomes very sparse and insufficient.

Humans, on the other hand, perceive things at a deeper level. To avoid data sparsity, we invent all kinds of devices, such as the microscope, to help us delve deeper and find the underlying, more essential relationships. For example, a machine needs large samples to learn that bikes have two wheels, but we know it without much data because anything else would be unreasonable.

Machine translation can be defined as the following problem: given a sentence e and its possible translations f, find the most reliable one. In other words, we need to maximize P(f|e), which is proportional to P(f)*P(e|f), favoring translations that are both common and likely to generate e.

Computing P(e|f) is not trivial: what is the probability that a given translation generates a sentence? For example, suppose e is John loves Mary. The first French candidate f we examine is Jean aime Marie. One possible word correspondence is John (Jean), loves (aime), Mary (Marie). Once a correspondence is established, P(e|f) is easy to compute for that correspondence, e.g. P(John|Jean)*P(loves|aime)*P(Mary|Marie). We then exhaust all correspondences and sum these values, which gives P(e|f).
How do humans translate? Do we exhaust all possibilities? Certainly not; we adopt a mix of bottom-up and top-down reasoning.
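For completeness, here is a brute-force sketch of the sum over correspondences, with a tiny hypothetical translation table (all the probabilities are made up):

```python
from itertools import permutations
from math import prod

# Hypothetical word translation probabilities P(english_word | french_word).
t = {
    ("John", "Jean"): 0.9, ("loves", "aime"): 0.8, ("Mary", "Marie"): 0.9,
    ("John", "aime"): 0.01, ("John", "Marie"): 0.01,
    ("loves", "Jean"): 0.01, ("loves", "Marie"): 0.01,
    ("Mary", "Jean"): 0.01, ("Mary", "aime"): 0.01,
}

e = ["John", "loves", "Mary"]
f = ["Jean", "aime", "Marie"]

# Sum the product of word probabilities over all one-to-one correspondences.
p_e_given_f = sum(
    prod(t[(ew, fw)] for ew, fw in zip(e, alignment))
    for alignment in permutations(f)
)
print(p_e_given_f)  # dominated by the John-Jean / loves-aime / Mary-Marie alignment
```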

Clustering is an unsupervised machine learning problem: given some data points, group them into clusters. One solution assumes that the points are scattered around k centers, each following a Gaussian distribution; we need to estimate where the centers are and the parameters of the Gaussians.

The difference here is that the answer is continuous and thus there are infinitely many candidates. Worse, we need to know the parameters to decide which points belong to which center, and we need to know which points belong to which center to estimate the parameters. This is a typical chicken-and-egg problem.

To break the loop, we simply say: I don't care. We randomly guess some centers and adjust them iteratively. This is called Expectation Maximization (EM). In the expectation step, we use the current guess of centers and parameters to determine which center each point belongs to; in the maximization step, we re-estimate the centers and parameters from the points attributed to them. Bayes theorem is what computes these attributions from the data.
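Here is a minimal EM sketch for two 1-D Gaussian clusters, on synthetic data (everything below is illustrative, not a production implementation):

```python
import numpy as np

# Toy 1-D data around two centers (illustrative only).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])

# Initial guesses for two Gaussian components.
mu = np.array([x.min(), x.max()])
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])                      # mixing weights

for _ in range(50):
    # E-step: how strongly each center "claims" each point (Bayes theorem).
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate centers and parameters from the attributed points.
    n_k = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    w = n_k / x.size

print(mu.round(2))  # approximately the two cluster centers, near 0 and 6
```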

Least squares is often used for linear regression. Given N points in the plane, suppose we want a line that fits them (regression is a special case of fitting, i.e. fitting with error). How do we define the best line? Suppose the points are (xi, yi) and the line is y = f(x). The error between a point and its estimate is $\Delta y_i = |y_i - f(x_i)|$. Least squares means minimizing $(\Delta y_1)^2 + ... + (\Delta y_n)^2$.

Why do we use squares instead of absolute values? Use Bayes theorem. Suppose f(xi) is the noise-free prediction at xi, and the observed yi deviates from f(xi) because of noise: the further away, the lower the probability. We can model this with a normal distribution centered at f(xi), so the probability of observing (xi, yi) is proportional to $\exp[-(\Delta y_i)^2]$ (up to the variance scaling). We want the MAP estimate, i.e. the h maximizing P(h|D), which is proportional to P(h)*P(D|h); here h is the line and D is the set of data points. We seek the line that makes P(h)*P(D|h) largest.

Obviously, P(h) is the same for all lines, so we only need to care about P(D|h) = P(d1|h) * P(d2|h) * ..., where we assume the points are independent and multiply their probabilities. Remember that P(di|h) = $\exp[-(\Delta y_i)^2]$. Thus P(D|h) = $\exp[-(\Delta y_1)^2] \cdot \exp[-(\Delta y_2)^2] \cdots = \exp\{-[(\Delta y_1)^2 + (\Delta y_2)^2 + \cdots]\}$. To maximize it, we need to minimize $(\Delta y_1)^2 + (\Delta y_2)^2 + \cdots$. Familiar?
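As a quick numerical check of this equivalence (on made-up data): the slope and intercept that minimize the squared error are also the ones that maximize the Gaussian likelihood.

```python
import numpy as np

# Synthetic noisy points around y = 3x + 2 (illustrative only).
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 3 * x + 2 + rng.normal(scale=0.1, size=x.size)

# Candidate lines on a small grid of (slope, intercept).
best = None
for a in np.linspace(2, 4, 201):
    for b in np.linspace(1, 3, 201):
        sq_err = np.sum((y - (a * x + b)) ** 2)   # sum of (Delta y_i)^2
        log_lik = -sq_err                         # log of prod exp[-(Delta y_i)^2]
        if best is None or log_lik > best[0]:
            best = (log_lik, a, b)

print(best[1:], np.polyfit(x, y, 1))  # the grid maximizer is close to the least-squares fit
```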

Another application of Bayes is filtering out spam emails. Given an email, determine whether it is spam. Denote the email by D, which comprises N words. Use h+ for spam and h- for normal email. The problem can be expressed as:

$P(h+|D)=\frac{P(h+)*P(D|h+)}{P(D)}$
$P(h-|D)=\frac{P(h-)*P(D|h-)}{P(D)}$

The priors P(h+) and P(h-) are easy to obtain by counting the percentages of spam and normal emails in the inbox. However, P(D|h+) is hard, since D contains N words d1, d2, ..., so P(D|h+) = P(d1,d2,...|h+). Here we encounter data sparsity once more. Why? Because P(d1,d2,...|h+) is essentially the probability that the current email is exactly identical to some spam email we have already seen, which is hopeless: every email is different. No matter how many spam emails you have collected, there is no way one of them is identical to the current email. The data is sparse, not dense.

Alternatively, we expand P(d1,d2,...|h+) as P(d1|h+) * P(d2|d1,h+) * ... For simplicity, we assume each di is independent of the previous words, so P(d1,d2,...|h+) = P(d1|h+) * P(d2|h+) * ... This is the naive independence assumption. P(d1|h+), P(d2|h+), ... are easy to get: we only need to count the frequency of d1, d2, ... in the spam we have collected.
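Here is a minimal naive Bayes scoring sketch with hypothetical word frequencies (the counts are invented for illustration):

```python
from math import log

# Hypothetical per-class word frequencies estimated from a mail corpus.
p_word = {
    "spam":   {"free": 0.05,  "winner": 0.03,   "meeting": 0.001},
    "normal": {"free": 0.005, "winner": 0.0005, "meeting": 0.02},
}
prior = {"spam": 0.3, "normal": 0.7}

def log_score(words, label):
    """log[ P(label) * prod_i P(word_i | label) ] under the independence assumption."""
    return log(prior[label]) + sum(log(p_word[label].get(w, 1e-6)) for w in words)

email = ["free", "winner"]
print(max(("spam", "normal"), key=lambda lbl: log_score(email, lbl)))  # "spam"
```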

As a side note, why do we run into data sparsity at all? Because we work at a superficial level, where the possible combinations of words are essentially infinite, so doing statistics at this level inevitably suffers from sparsity. Notice that even though the combinations of words are infinite, the variations in syntax are limited. How to model syntax is an AI problem. The higher the level, the fewer the possibilities: many word sequences map to one sentence, and many sentences map to one syntactic pattern, indicating a hierarchical system.

Why is the simple independence assumption so powerful? Why can we boldly say that a word only depends on the previous 3 or 4 words, when in reality it may depend on the previous sentence? One theory asserts that the positive and negative effects of the assumption roughly cancel out.

Hierarchical Bayes goes deeper than the basic form: it includes the cause of the cause. For instance, suppose you have N coins manufactured by the same factory; you toss each of them once and want to infer each coin's probability of heads $\theta$. Now $\theta$ itself has a prior p($\theta$), which may follow a beta distribution. That is, the outcome of a toss xi follows a Bernoulli distribution with parameter $\theta$, and the $\theta$ of each coin is drawn from a beta distribution governed by a factory-level parameter $\psi$. Layers of causes emerge.
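Here is a tiny generative sketch of that hierarchy; the beta parameters below are my own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Factory-level parameters (illustrative) controlling how coin biases are distributed.
a, b = 50, 50                               # beta(50, 50) concentrates theta near 0.5

n_coins = 10
theta = rng.beta(a, b, size=n_coins)        # each coin's probability of heads
tosses = rng.binomial(1, theta)             # one Bernoulli toss per coin

print(theta.round(2))
print(tosses)  # observed data; inference would run back up the hierarchy
```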

A Hidden Markov Model is a simple hierarchical Bayes model. How do we recover what the sender wanted to say from the message received? When we observe the signal o1, o2, o3, we have to deduce the intended sentence s1, s2, s3, i.e. find the sentence maximizing P(s1, s2, s3|o1, o2, o3). If the parameters of the HMM are $\lambda$, this is P(S|O, $\lambda$).

Using Bayes theorem, P(S|O, $\lambda$) $\propto$ P(O|S)*P(S), where P(O|S) is the probability that the sentence S is rendered as the observation O (the emission model), and P(S) is the probability that the words s1, s2, s3 form a meaningful sentence, which depends on $\lambda$.
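Here is a toy sketch of choosing S by scoring P(O|S)*P(S) over a small hypothetical candidate set (all the tables below are made up):

```python
from math import prod

# Hypothetical emission probabilities P(observed sound | intended word).
emission = {
    ("recognize", "wreck-a-nice"): 0.3, ("wreck a nice", "wreck-a-nice"): 0.4,
    ("speech", "speech"): 0.9, ("beach", "speech"): 0.1,
}
# Hypothetical language-model probabilities P(S) for whole candidate sentences.
language = {
    ("recognize", "speech"): 0.01,
    ("wreck a nice", "beach"): 0.0001,
}

observed = ["wreck-a-nice", "speech"]

def score(sentence):
    # P(O|S) * P(S): per-word emissions times the sentence prior.
    return prod(emission.get((w, o), 1e-6) for w, o in zip(sentence, observed)) * language[sentence]

best = max(language, key=score)
print(best)  # ("recognize", "speech") wins thanks to its higher language-model prior
```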

What is the difference between Bayesians and frequentists? Let's use our favourite coin example (Bayesians really do have lots of coins). Normally p(head) = p(tail) = 0.5, but due to wear and tear the coin may become biased. We want to know p(head) = t and p(tail) = 1-t. Frequentists treat t as some fixed value in [0, 1]. If we toss 100 times and heads appears 40 times, the estimate is t = 0.4; if we toss another 100 times and heads appears 42 times, the estimate becomes t = 0.42. Here comes the question: how can t be both 0.4 and 0.42 at the same time? In fact, t is neither 0.4 nor 0.42, but some value not far from them. All we can say is something like: the 95% confidence interval is [0.4-0.1, 0.4+0.1], and t is either in that interval or not. What frequentists tell us is that if you repeat the experiment many times, about 95% of such intervals will include the real t.

However, sometimes the experiment is not repeatable, or sampling data is costly. For instance, if we are only allowed to toss 3 times, there is a decent chance that all of them are tails, which gives the estimate t = 0. In this case there is not much a frequentist can do.

How do Bayesians solve the problem? The essential difference is that Bayesians do not try to pin t down to a single fixed value; they care about its distribution, i.e. our belief about t. Before we toss the coin, we may think t is uniformly distributed on [0, 1], so p(t) = 1. When new data arrives as evidence, for instance tossing 3 times and getting all tails, our belief about t changes.

Suppose each toss is independent, and here n = 3. Let 1 represent heads and 0 tails. Then p(x1=0,x2=0,x3=0|t) = p(x1=0|t)*p(x2=0|t)*p(x3=0|t) = $(1-t)^3$, which is p(D|t). The denominator is the marginal over all possible values of t: m(D) = $\int_0^1 (1-t)^3\, p(t)\, dt = \frac{1}{4}$, so the posterior is p(t|D) = $4(1-t)^3$. We call p(t) the prior, and p(t|D), obtained once the data is available as evidence, the posterior. We update our belief about t when we get data, but the prior still has influence; this influence weakens as more data arrives. Hence when data is scarce, the choice of prior is very important.
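A quick numerical check of this update on a grid (uniform prior, three tails):

```python
import numpy as np

t = np.linspace(0, 1, 100001)
dt = t[1] - t[0]
prior = np.ones_like(t)                 # uniform belief before any toss
likelihood = (1 - t) ** 3               # p(D|t) for three tails in a row

unnormalized = likelihood * prior
m_D = unnormalized.sum() * dt           # numerically ~ 1/4, the marginal m(D)
posterior = unnormalized / m_D          # p(t|D) = 4 (1-t)^3

print(m_D)                              # ~ 0.25
print((t * posterior).sum() * dt)       # posterior mean ~ 0.2, not the frequentist 0
```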
