Saturday 11 July 2015

SVM for dummies

SVM has been gaining popularity in computer vision. Here is some excellent material on the topic. There are three levels of understanding SVM, from superficial to essential.

Level 1.

Let's start with classification, an important task in data mining. Its objective is to learn a function or model, called a classifier; SVM is one supervised way of learning such a classifier.

SVM is a linear binary classifier that maximizes the margin in the feature space. Let x denote a data point with n dimensions, and y denote its class, which is either 1 or -1. The linear classifier finds a hyperplane in the n-dimensional space whose equation is $w^{T}x + b = 0$.

Why use 1 and -1, instead of, say, 23333 and 322222? The convention comes from logistic regression, whose aim is to learn a 0/1 classifier. Its input is a linear combination of the features, ranging from minus infinity to plus infinity, so we use the logistic (or sigmoid) function to map it into (0, 1), interpreted as p(y=1). Its equation is $h_{\theta}(x) = g({\theta}^{T}x) = \frac{1}{1+e^{-{\theta}^{T}x}}$, where x is an n-dimensional feature vector and g is the logistic function, which maps any input z into (0, 1).

Suppose the function gives the probability that the sample belongs to class 1, i.e. p(y=1). Then $P(y=1|x;\theta) = h_{\theta}(x)$ and $P(y=0|x;\theta)=1-h_{\theta}(x)$. When a new sample arrives, we compute $h_{\theta}(x)$: if it is larger than 0.5, we predict y = 1; otherwise y = 0.
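Here is a tiny NumPy sketch of that rule (the values of theta and x are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(theta, x):
    """Predict 1 if h_theta(x) = sigmoid(theta^T x) > 0.5, else 0."""
    return 1 if sigmoid(theta @ x) > 0.5 else 0

theta = np.array([1.0, -2.0, 0.5])   # includes theta_0 via x_0 = 1
x = np.array([1.0, 0.3, 2.0])        # x_0 = 1 prepended to the feature
print(sigmoid(theta @ x), predict_logistic(theta, x))   # ~0.80 -> class 1
```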

Examining $h_{\theta}(x)$, we find that it depends only on ${\theta}^{T}x$: if ${\theta}^{T}x > 0$, then $h_{\theta}(x) > 0.5$. g(z) is merely a mapping; the real determining factor is still ${\theta}^{T}x$. For learning, we want ${\theta}^{T}x \gg 0$ for the samples with y = 1, and ${\theta}^{T}x \ll 0$ for the samples with y = 0. Logistic regression learns the $\theta$ that achieves this.

We change the labels from y = 0/1 in logistic regression to y = -1/1, and we use w, b instead of $\theta$. In logistic regression, ${\theta}^{T}x = {\theta}_0 x_0 + {\theta}_{1}x_1 + ... + {\theta}_{n}x_n$ with $x_0 = 1$. For SVM, we use b in place of ${\theta}_0$, and write $w_{1}x_{1} + ... + w_{n}x_{n} = w^{T}x$. Thus ${\theta}^{T}x = w^{T}x + b$, and $h_{\theta}(x) = g({\theta}^{T}x) = g(w^{T}x + b)$. In other words, apart from the labels, SVM's form is no different from logistic regression's.

We only care about whether $w^{T}x + b$ is positive or negative, not about the value of g(z). So we redefine g(z) = 1 when z is non-negative, and g(z) = -1 otherwise. The reason for using -1/1 instead of 0/1 is that the label then matches the sign of $w^{T}x + b$ directly: a prediction is correct exactly when $y(w^{T}x + b) > 0$.

The classification function for SVM is $h_{w,b}(x) = f(x) = g(w^{T}x + b)$. If we are separating two clusters of data by a hyperplane, then $w^{T}x + b = 0$ indicates that x lies on the hyperplane, while $w^{T}x + b < 0$ gives f(x) = -1 and $w^{T}x + b > 0$ gives f(x) = 1. We call w the normal vector and b the intercept.
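A minimal sketch of this decision rule in NumPy (w, b and the test points are invented for illustration):

```python
import numpy as np

def svm_predict(w, b, x):
    """Return +1 if w^T x + b >= 0, else -1 (the g(z) defined above)."""
    return 1 if w @ x + b >= 0 else -1

w = np.array([2.0, -1.0])   # normal vector (illustrative values)
b = -0.5                    # intercept
print(svm_predict(w, b, np.array([1.0, 0.0])))   # w^T x + b =  1.5 -> +1
print(svm_predict(w, b, np.array([0.0, 2.0])))   # w^T x + b = -2.5 -> -1
```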

For a hyperplane $w^{T}x + b = 0$, $|w^{T}x + b|$ is proportional to the distance from x to the plane (dividing by ||w|| gives the exact Euclidean distance, as we will see below). Whether the sign of $w^{T}x + b$ agrees with the class label y tells us whether the classification is right or wrong. Thus, we can use $y(w^{T}x + b)$: its sign tells us correctness, and its magnitude tells us the confidence of the classification.

That's what I managed to learn from the source above. Another source, here, explains SVM in a much simpler way.

For every sample, we use a feature vector and a label to represent it: $D_i = (x_i, y_i)$. The (functional) margin between the sample and the hyperplane is $\delta_{i} = y_i (w^{T}x_i + b)$. Notice that if a sample belongs to the positive class, then $w^{T}x_i + b > 0$ and $y_i > 0$; otherwise both are negative. Hence $\delta_{i}$ is positive whenever the sample is correctly classified, and $\delta_{i} = |g(x_i)|$, where $g(x) = w^{T}x + b$. If we normalize w and b, i.e. use $\frac{w}{||w||}$ and $\frac{b}{||w||}$, then $\delta_{i} = \frac{1}{||w||}|g(x_i)|$. Looks familiar? This is the distance from the point $x_i$ to the hyperplane $g(x) = 0$.
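Both margins are one-liners in NumPy; a small sketch with invented numbers:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """delta_i = y_i * (w^T x_i + b); positive when x_i is correctly classified."""
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by ||w||: the Euclidean distance to the hyperplane."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0      # ||w|| = 5
x, y = np.array([2.0, 1.0]), 1         # w^T x + b = 5
print(functional_margin(w, b, x, y))   # 5.0
print(geometric_margin(w, b, x, y))    # 1.0
```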

What is ||w||? It is the norm of w, which measures the length of the vector w.

The $\delta_{i}$ after normalization is called the geometric margin, which is the Euclidean distance between the point and the hyperplane. The reason we care so much about this margin is the bound $\text{misclassification} \le \left(\frac{2R}{\delta}\right)^2$, where $\delta$ is the margin and $R = \max_{i=1...n} ||x_i||$, the largest norm among all the samples. In other words, R measures how broadly the samples are spread. The bound means that the misclassification error depends on the margin: the larger the margin, the smaller the upper bound on misclassification.

For optimization, we refer to the source here. We want to minimize misclassification, i.e. maximize the margin $\frac{1}{||w||}$ (with the functional margin of the closest points fixed to 1), so we want $\max\frac{1}{||w||}$, which is equivalent to $\min\frac{1}{2}{||w||}^2$. Since the norm is non-negative and squaring is monotonic there, we may square it and add a constant coefficient for the sake of easier differentiation. The full version of the objective is $\min\frac{1}{2}{||w||}^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1,\ i = 1...n$. Here s.t. means subject to, i.e. the minimization is carried out under the given constraints. This is a quadratic programming problem, and it is convex. A convex problem has no purely local solutions: imagine a funnel; no matter where we drop a ball, it ends up at the same lowest point, i.e. the global solution. The constraints after s.t. define a convex feasible region (a polyhedron), over which we look for the global optimum.
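Just as a sketch of what "solve the QP" means, here is the hard-margin primal fed to scipy's generic SLSQP solver on a tiny separable toy dataset I made up (a real SVM library would use a dedicated solver instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n_features = X.shape[1]

def objective(z):
    w = z[:n_features]
    return 0.5 * w @ w            # (1/2)||w||^2; b (last entry of z) is unpenalized

constraints = [
    # y_i (w^T x_i + b) - 1 >= 0 for every sample
    {"type": "ineq",
     "fun": lambda z, xi=xi, yi=yi: yi * (xi @ z[:n_features] + z[n_features]) - 1.0}
    for xi, yi in zip(X, y)
]

z0 = np.zeros(n_features + 1)     # start from w = 0, b = 0
res = minimize(objective, z0, constraints=constraints, method="SLSQP")
w, b = res.x[:n_features], res.x[n_features]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```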

This problem can be solved with Lagrange multipliers, using the KKT conditions. The Lagrangian is $L(w,b,\alpha) = \frac{1}{2}{||w||}^2 - \sum_{i=1}^{n} \alpha_i (y_i (w^T x_i + b)-1)$, with $\alpha_i \ge 0$.

To solve it, we minimize L with respect to w and b by setting the partial derivatives to zero:
$$\frac{\partial L}{\partial w} = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{n}\alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \quad \Rightarrow \quad \sum_{i=1}^{n}\alpha_i y_i = 0$$

Substituting them back into the Lagrangian, we get the dual problem
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j {x_i}^T x_j \quad \text{s.t.} \quad \alpha_i \ge 0,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$
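For completeness, the substitution step works out as follows, using ${||w||}^2 = w^T w$, $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:
$$\begin{aligned}
L(w,b,\alpha) &= \frac{1}{2}w^T w - \sum_{i=1}^{n}\alpha_i y_i w^T x_i - b\sum_{i=1}^{n}\alpha_i y_i + \sum_{i=1}^{n}\alpha_i \\
&= \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i \alpha_j y_i y_j {x_i}^T x_j - \sum_{i,j=1}^{n}\alpha_i \alpha_j y_i y_j {x_i}^T x_j - 0 + \sum_{i=1}^{n}\alpha_i \\
&= \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i \alpha_j y_i y_j {x_i}^T x_j
\end{aligned}$$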
This dual problem can be solved by SMO (sequential minimal optimization).
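In practice we rarely implement SMO by hand. As a sketch, scikit-learn's SVC (which wraps libsvm's SMO-style solver) fits a linear SVM on the same toy data I invented above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
```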

What if the data is not linearly separable? We can either use a curve, or keep a hyperplane and tolerate some error. We add a penalty for the misclassified points, proportional to how far each point is from where it should be. The problem becomes $\min\frac{1}{2} {||w||}^2 + C\sum_{i=1}^{R}\epsilon_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \epsilon_i,\ \epsilon_i \ge 0$.

When $x_i$ is on the correct side of the margin, $\epsilon_i = 0$. R is the total number of points. When C is large, misclassification of training points is rare, at the risk of over-fitting. When C is small, misclassification is tolerated more, but the model may under-fit.
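A small NumPy sketch of this soft-margin objective for a fixed (w, b), taking each $\epsilon_i$ at its smallest feasible value $\max(0, 1 - y_i(w^T x_i + b))$; all the numbers are invented:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i eps_i, with eps_i = max(0, 1 - y_i(w^T x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # eps_i, zero on the correct side
    return 0.5 * w @ w + C * slack.sum(), slack

X = np.array([[2.0, 2.0], [-2.0, -2.0], [0.2, -0.1]])   # last point sits inside the margin
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([0.5, 0.5]), 0.0

for C in (0.1, 10.0):
    value, slack = soft_margin_objective(w, b, X, y, C)
    print(f"C={C}: objective={value:.2f}, slack={slack}")
```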

Kernel function

When the data is linearly inseparable, we can separate it in a non-linear way, using curves. Alternatively, we can transform the data from the original low-dimensional space to a higher-dimensional one, and then separate it there with a hyperplane. For intuition: no two books are identical. Described only by (color, content), two books may look the same, but once we add author, page count, publisher, owner and so on, they become distinguishable.

Recall the objective function
$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j {x_i}^T x_j \quad \text{s.t.} \quad \alpha_i \ge 0,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$
We can replace ${x_i}^T x_j$ with a kernel $K(x_i, x_j)$. What this does is compute the inner product as if the data had been mapped into a higher-dimensional space, without ever constructing that mapping explicitly.

$K(x_i, x_j)$ has many forms; two common ones are $(x_i \cdot x_j + 1)^d$ and $\exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$. The former is the polynomial kernel, whereas the latter is the Gaussian (RBF) kernel, which corresponds to mapping into an infinite-dimensional space.
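Both kernels are one-liners in NumPy; in this sketch the parameters d and sigma (and the two points) are chosen arbitrarily:

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    """(x_i . x_j + 1)^d"""
    return (xi @ xj + 1.0) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    """exp(-||x_i - x_j||^2 / (2 sigma^2)) -- the RBF kernel."""
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(polynomial_kernel(xi, xj))   # (1*2 + 2*0 + 1)^2 = 9
print(gaussian_kernel(xi, xj))     # exp(-5/2) ~ 0.082
```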

Other problems

For N-class classification, we have two common methods. 1. One-vs-rest (1 vs (N-1)): train N classifiers, where the i-th classifier determines whether the data belongs to class i. 2. One-vs-one (1 vs 1): train $\frac{N(N-1)}{2}$ classifiers, where classifier (i, j) determines whether the data belongs to class i or class j. According to the author of libsvm, one-vs-one tends to work better than one-vs-rest.
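A sketch with scikit-learn's wrappers (the dataset is synthetic, just to count the underlying binary SVMs): OneVsOneClassifier builds N(N-1)/2 of them, OneVsRestClassifier builds N.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 4-class problem, only used to count the underlying binary classifiers.
X, y = make_classification(n_samples=200, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print("one-vs-one classifiers:", len(ovo.estimators_))   # 4*3/2 = 6
print("one-vs-rest classifiers:", len(ovr.estimators_))  # 4
```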

Will SVM over-fit? We can tune the penalty coefficient C. Moreover, the $\min {||w||}^2$ term is the same kind of regularizer used in regularized least squares, and it keeps the decision function smooth. Hence SVM is relatively resistant to over-fitting.
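In practice C is usually tuned by cross-validation; a sketch with scikit-learn's GridSearchCV on synthetic data (the grid of C values is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Search over the penalty coefficient C with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", grid.best_score_)
```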
