Friday 31 July 2015

Algorithm - queue, stack, linked list

Take a decryption puzzle as an example. You are given an encrypted sequence of numbers, 6 3 1 7 5 8 9 2 4. To recover the real message, we delete 6 and move 3 to the end, so the sequence becomes 1 7 5 8 9 2 4 3. Then we delete 1 and move 7 to the end, and the sequence becomes 5 8 9 2 4 3 7. We delete 5 and move 8 to the end, getting 9 2 4 3 7 8... Eventually the numbers deleted, in order, are 6 1 5 9 4 7 2 8 3, which is the real message.

A queue is a special structure that only allows elements to be removed at the head (dequeue) and new elements to be added at the tail (enqueue). When head == tail, the queue is empty.
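
Below is a minimal sketch of the decryption above in Python, using collections.deque as the queue; the printed order matches the walkthrough.

from collections import deque

def decrypt(sequence):
    """Repeatedly dequeue one number (output it), then move the next number to the tail."""
    q = deque(sequence)
    decoded = []
    while q:
        decoded.append(q.popleft())   # delete the head
        if q:
            q.append(q.popleft())     # move the new head to the tail
    return decoded

print(decrypt([6, 3, 1, 7, 5, 8, 9, 2, 4]))   # [6, 1, 5, 9, 4, 7, 2, 8, 3]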

What is stack? Suppose we have a bucket, and we can only put in one ball each time. We now put in ball 2, 1, 3. If you want to take out ball 2, then you have to take ball 3 out first, and then ball 1.

We can use a stack to determine whether a string reads the same forward and backward, i.e. whether it is a palindrome.
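
A small sketch of that palindrome check in Python, using a list as the stack:

def is_palindrome(s):
    """Push the first half onto a stack, then pop and compare against the second half."""
    n = len(s)
    stack = list(s[:n // 2])
    for ch in s[(n + 1) // 2:]:       # skip the middle character when the length is odd
        if stack.pop() != ch:
            return False
    return True

print(is_palindrome("ahaha"))   # True
print(is_palindrome("ahead"))   # False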

Now let's talk about linked lists. We often use an array to store numbers, but it is not always convenient. For example, suppose we have a sequence of numbers arranged in ascending order: 2 3 5 8 9 10 18 26 32. If we want to insert 6 and keep the sequence in ascending order, then with an array we have to shift 8 and every number after it one position to the right.

What is a linked list? It is 2-3-5-8-9-10-18-26-32, where each dash stands for a link. If we want to insert 6, we only need to relink to get 2-3-5-6-8-9-10-18-26-32; nothing else moves. Each node has two parts: one stores the data (e.g. an int), and the other stores the address of the next node (a pointer).

An alternative way to realize a linked list is to simulate it with arrays. Suppose the data array holds 2 3 5 8 9 10 18 26 32 in data[1..9]. We use another array called right to store, for each element, the index of the number to its right; initially right = 2 3 4 5 6 7 8 9 0. For instance, right[1] = 2 means the number to the right of data[1] is stored in data[2], and right[9] = 0 means there is no number to the right of data[9].

If we want to insert 6, we only need to put it at the end, data[10] = 6. Then we set right[3] = 10, meaning the number to the right of data[3] (the value 5) now lives in data[10], and we set right[10] = 4, meaning the number to the right of data[10] is in data[4] (the value 8).
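
A minimal sketch of this array-simulated linked list in Python (1-based indexing is kept to match the text; 0 marks the end of the list):

data  = [0, 2, 3, 5, 8, 9, 10, 18, 26, 32]   # data[1..9]; index 0 is unused
right = [0, 2, 3, 4, 5, 6, 7, 8, 9, 0]       # right[i] is the index of the element after data[i]

# insert 6 after data[3] (the value 5) by appending it and relinking
data.append(6)            # data[10] = 6
right.append(right[3])    # right[10] = 4, the old successor of data[3]
right[3] = 10             # data[3] now points to data[10]

# traverse from data[1] and print the list in order
i = 1
while i != 0:
    print(data[i], end=' ')   # 2 3 5 6 8 9 10 18 26 32
    i = right[i]
print()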

MonoSLAM for dummies

The basic idea behind SLAM is this: if I were a robot and I knew two fixed points, (x1, y1) and (x2, y2), I could look at point1 and my measurement would tell me it is 5 m away. Turning my head by 30 degrees, I could see point2, and the measurement would tell me it is 10 m away. Knowing my relative position with respect to point1 and point2, I can localize myself.

But here comes the problem: I have to look back at point1 and point2 every time I move forward in order to know where I am, and this kind of looking back is very stupid. So a smart PhD student came up with a solution: as soon as I know where I am with respect to point1 and point2, I add more points: point3, point4... Because I know my own coordinates, I can measure the locations of these new points (mapping) and use them afterwards for localization.

Therefore, SLAM means simultaneously estimating my own location (localization) and the coordinates of the fixed points (mapping). Because noise creeps in whether we measure distances or compute rotations and traveled distances, SLAM requires a Kalman filter.

MonoSLAM can also be viewed as real-time structure from motion: the fixed points are visual features, and measuring distance becomes matching features and then triangulating.

MonoSLAM is implemented in MATLAB here. The authors introduce MonoSLAM through the following chain of reasoning: SLAM in a 3D environment - need for 3D landmarks - camera-based method - use of visual 2D interest points - the camera gives 2D interest points - no depth information - how to get depth? - by using camera motion - how to estimate camera motion? - by using 3D landmarks.

Therefore, 3D landmark = 2D interest point + depth.

At the beginning, a few initial landmarks must be provided for initialization, otherwise the camera motion cannot be estimated. After the initialization phase, an iterative process begins. Step 1: 2D interest points must evolve into 3D landmarks when the estimate of their depth is good enough. Step 2: 3D landmarks contribute to the estimation of the state, including camera and the map.

The camera and map state is estimated through an EKF. After initialization, this filter performs a model-based prediction, compares it to the measurements (camera data), and applies a correction to the global state according to the innovation.

To use the camera data, we have to know its resolution, focal length, and CMOS sensor dimensions, so we need to perform a calibration step, for example with the MATLAB calibration toolbox.

The landmarks could be initialized manually, or by using an object of known size at the beginning.

Since we use an EKF, we need a model, and the only model we can have for the camera is a kinematic one rather than a dynamic one. This model only allows constant-velocity motion. However, the camera may not move this way, so we have to add noise to allow for changes in direction or speed. This is also necessary for the EKF to assess the confidence in its estimates.

A common noise model is the additive one, where random Gaussian noise is added to each state component. Here, however, a nonlinear, acceleration-based noise model is used. To allow several possible trajectories around the previous one, the linear and angular accelerations are modeled as zero-mean Gaussian random variables. The standard deviations of these variables let us tune how uneven the motion is allowed to be.

The prediction step randomly chooses one trajectory among these possibilities, and if necessary the correction step brings the system back on track.
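
Here is a minimal sketch of such a constant-velocity prediction with acceleration noise. The state layout, function names, and noise values are my own assumptions for illustration, not the paper's.

import numpy as np

def quat_mult(q, r):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def predict_state(p, v, q, w, dt, sigma_a=0.5, sigma_alpha=0.5, rng=None):
    """Constant-velocity motion model with zero-mean Gaussian acceleration noise.
    p: position (3,), v: linear velocity (3,), q: orientation quaternion (4,), w: angular velocity (3,)."""
    rng = np.random.default_rng() if rng is None else rng
    a     = rng.normal(0.0, sigma_a,     size=3)   # random linear acceleration impulse
    alpha = rng.normal(0.0, sigma_alpha, size=3)   # random angular acceleration impulse
    v_new = v + a * dt
    w_new = w + alpha * dt
    p_new = p + v_new * dt
    dq = np.concatenate(([1.0], 0.5 * w_new * dt))  # first-order quaternion for the small rotation w_new*dt
    q_new = quat_mult(q, dq)
    q_new /= np.linalg.norm(q_new)
    return p_new, v_new, q_new, w_new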

Landmarks are identified from interest points. The Shi-Tomasi detector is used to find salient points: image gradients are analysed over 15x15 pixel patches, and the eigenvalues of the resulting gradient (second-moment) matrix are used to assess the quality of a potential interest point, through a threshold. When a new interest point is identified, its neighborhood, namely its 15x15 patch, is attached to it as a descriptor. The patch is normalized to make detection more robust to illumination changes.
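
A small sketch of the Shi-Tomasi criterion on a single patch, in plain NumPy (my own simplification of the detector):

import numpy as np

def shi_tomasi_score(patch):
    """Return the smaller eigenvalue of the gradient (second-moment) matrix of a patch,
    e.g. a 15x15 window; a point is accepted if this score exceeds a threshold."""
    gy, gx = np.gradient(patch.astype(float))
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0]     # eigenvalues come back in ascending order

# a corner-like patch scores higher than a flat patch
corner = np.zeros((15, 15)); corner[7:, 7:] = 1.0
flat = np.zeros((15, 15))
print(shi_tomasi_score(corner) > shi_tomasi_score(flat))   # True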

Since there is no depth information in a single image, we need to use motion to recover depth. This is done using a particle filter. When an interest point is selected, we only know that it lies somewhere along a certain line called the epipolar line. To estimate where the point lies on this line, the line is uniformly sampled by a given number of particles, where each particle represents a depth hypothesis. A maximal depth hypothesis, related to the size of the environment, limits the search zone.

When an interest point has been registered, meaning that its patch and epipolar line are known, the algorithm tries to match it in subsequent pictures. An interest point without any depth information is tracked using correlation over a window. The size of the window is constant, defined by a maximal acceleration in terms of pixels. Interest points that move out of the image or are weakly matched are discarded.

During tracking, after the correlation has matched an interest point, all the particles associated with this point are re-projected into the picture. Resampling is performed to promote the hypotheses best supported by the correlation result, and 25% of the particles are resampled uniformly to avoid early convergence. When the variance of the depth hypotheses shrinks below a threshold, the estimate is considered reliable. The interest point is then assigned a depth and becomes a landmark, with a large uncertainty along the epipolar line so that the EKF can still move it easily if necessary.
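
A toy sketch of that depth-hypothesis resampling step (the likelihood function, depth range, and thresholds are illustrative assumptions, not values from the paper):

import numpy as np

def update_depth_particles(depths, weights, likelihood, rng=None,
                           uniform_frac=0.25, d_min=0.5, d_max=10.0, var_threshold=0.05):
    """Weight each depth hypothesis by how well its reprojection matched (likelihood),
    resample, and re-seed a fraction uniformly to avoid early convergence."""
    rng = np.random.default_rng() if rng is None else rng
    w = weights * likelihood(depths)
    w = w / w.sum()
    n = len(depths)
    n_uniform = int(uniform_frac * n)
    resampled = rng.choice(depths, size=n - n_uniform, p=w)   # promote good hypotheses
    reseeded = rng.uniform(d_min, d_max, size=n_uniform)      # keep exploring the epipolar line
    new_depths = np.concatenate([resampled, reseeded])
    new_weights = np.full(n, 1.0 / n)
    converged = np.var(new_depths) < var_threshold            # ready to become a landmark?
    return new_depths, new_weights, converged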

Compared to interest point management (initialization, tracking, depth estimation, and upgrade to landmark), landmark tracking is much easier. The EKF prediction gives an estimate of the camera position and orientation, as well as uncertainties on these quantities and on the landmark positions. The position and orientation of the camera relative to the landmarks allow us to predict how the landmarks will be projected onto the image plane, along with an uncertainty area. Both the uncertainty of the camera pose and the uncertainty of the landmark positions are taken into account when computing this area: the larger the uncertainty, the wider the search area in the new picture. The search is then simply performed by correlation within this area. The difference between the EKF prediction and the camera observation gives the innovation signal of the filter.

While the algorithm is running, at least one landmark must always be visible. Thus, as soon as the number of visible landmarks falls below a chosen number, a search procedure is launched to identify new interest points. This is achieved by searching for interest points in randomly drawn windows. The only constraint on these windows is that they have to be far away from the currently tracked landmarks and interest points; beyond that, no effort is spent on picking interest points in the places most useful to the algorithm.

The innovation signal is at the core of the EKF correction step, which affects the whole state vector. We previously explained how the innovation signal is obtained, without any details concerning the projection.

The simple pinhole camera model is used to perform the reprojection; no distortion correction is applied here. For a landmark yi described by its coordinates in the world frame, we have to express its position relative to the camera and then project it onto the image plane.
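
A minimal sketch of that projection, assuming a calibrated pinhole model with intrinsics (fx, fy, cx, cy) and a world-to-camera rotation R and translation t (the numbers below are arbitrary):

import numpy as np

def project_landmark(y_world, R, t, fx, fy, cx, cy):
    """Express a world-frame landmark in the camera frame, then project it with the pinhole model."""
    y_cam = R @ y_world + t                 # world frame -> camera frame
    u = fx * y_cam[0] / y_cam[2] + cx       # perspective division onto the image plane
    v = fy * y_cam[1] / y_cam[2] + cy
    return np.array([u, v])

# example: a landmark 2 m in front of a camera sitting at the world origin
print(project_landmark(np.array([0.1, -0.05, 2.0]), np.eye(3), np.zeros(3),
                       fx=500.0, fy=500.0, cx=320.0, cy=240.0))   # [345.  227.5]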

The correction is based on the innovation. A simple additive noise is finally added to the observation model, which lets us tune the confidence of the EKF in the observations. To increase the robustness of the correction, landmarks that are weakly matched (bad correlation results) or unmatched (not seen where they are predicted to be seen) over several frames are removed from the map. The EKF is very likely to get lost if the landmark association is bad.





Tuesday 28 July 2015

Reduce PDF size in Ubuntu

According to this post, we can use Ghostscript to reduce the PDF size:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=foo-compressed.pdf foo.pdf

Sunday 19 July 2015

Latex table

In a LaTeX table, if you put \caption before \begin{tabular}, the caption appears above the table; if you put \caption after \end{tabular}, the caption appears below the table, which is the norm for most papers. Moreover, \label should come after (or on the same line as) \caption, otherwise the cross-reference number will be wrong.
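
A minimal example with the caption below the tabular (just a sketch; the column spec and contents are placeholders):

\begin{table}[t]
  \centering
  \begin{tabular}{lcc}
    \hline
    Method & Metric A & Metric B \\
    \hline
    Baseline & ... & ... \\
    Ours & ... & ... \\
    \hline
  \end{tabular}
  \caption{Comparison of the two methods.}\label{tab:comparison}
\end{table}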

Wednesday 15 July 2015

Europe from Memory: The Detectives of Britain

You can often tell how famous an attraction is from its influence on the area around it. The celebrated Sherlock Holmes Museum starts its publicity right at the Tube station: arriving at Baker Street station, named after the street where Holmes lived, and seeing its period-style decoration, you feel instantly transported back to the detective's Victorian era.
 
 

Carrying this feeling of time travel out of the station exit, I was thrilled to find Holmes himself standing at the entrance, pipe in mouth, wearing his cape and deerstalker, as if pondering a case. Seeing his statue, most Holmes fans would surely be overwhelmed with excitement, heart racing, face flushed, all but dancing on the spot and itching to ask for an autograph.
 
The Sherlock Holmes Museum charges 6 pounds for admission, which works out to about 2 pounds per floor and 1.5 pounds per room, probably making it one of the most expensive museums per floor and per room.

Tuesday 14 July 2015

Something about graduate studies July 14

Here is an interesting talk on what qualities are important for a PhD. As far as I can see, the most important ones in the eyes of supervisors are curiosity and persistence.

Here is a talk on how to write a good paper. The most important thing is to make very clear what the contributions are, what problem you are solving, the context of the problem, your solution, and how your paper improves upon previous work.

Organization plays a vital part, as both that talk and this point out. A paper should start by telling the audience which problem you are addressing and why they should care about it. Moreover, you should spell out explicitly the implications of your solution. Then present the reasons why previous solutions are not satisfactory. Next, explain your own solution, and do remember to compare it with the alternatives.

What is often overlooked is that the paper should be understandable to those who read it in a hurry. To achieve this, use figures and captions to tell the story. The figure captions should be self-contained and should tell the reader what to notice about the figure.

The most common reasons for rejection include:
Promising more than what is delivered
Excluding important references
The main idea is not new
The results are too similar to previous work
The results are too different from previous work

Start writing the paper early and re-write it until it is good!

Saturday 11 July 2015

Latex flowchart figure

Many times a figure is too long to fit in a paper, such as a flowchart. In this case, we have to lay the figure out horizontally to save space, like the overview figure in the HOG paper. To do this, we need to use \begin{figure*} instead of \begin{figure}, so that the figure spans both columns. Moreover, we use [t!] to put the figure at the top of the page. Sometimes the figure appears on the page after the one where it is referenced. To solve this, place the \begin{figure*}[t!] block somewhere before the reference, e.g. on the previous page. It turns out that the figure position has nothing to do with where it is referenced; it depends on where the figure environment is included in the source.
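
A minimal sketch of such a two-column figure (the file name and caption are placeholders):

\begin{figure*}[t!]
  \centering
  \includegraphics[width=\textwidth]{flowchart.pdf}
  \caption{Overview of the processing pipeline.}\label{fig:overview}
\end{figure*}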

CNN for dummies

This article relies on sources here, here, and here.

A CNN can take images directly as input, avoiding hand-crafted feature extraction and description. Suppose we have a 1000x1000 image and 1 million hidden neurons: if we connect every neuron to every pixel, we have 1000x1000x1000000 connections. Nevertheless, image structure is local, so each neuron only needs to connect to a local region; at a higher level, the information collected by these neurons is combined to describe the whole image. If the local region is 10x10, so that each of the 1 million neurons connects to only 10x10 pixels, we are down to $10^8$ parameters.

For these 10x10 regions, if all the neurons use the same weights, in other words if every neuron convolves the image with the same convolution kernel, then we only have 100 parameters regardless of how many neurons we have. This weight sharing is the most significant contribution.

However, with fixed weights we extract only one kind of feature, since one convolution kernel corresponds to one feature, such as an edge of a certain orientation. What if we want multiple features? We increase the number of kernels. Suppose we have 100 filters extracting different features, such as edges of different orientations. Each convolution produces a feature map, so 100 kernels give 100 feature maps, and these 100 feature maps constitute one layer of neurons. The number of parameters is merely 100 kernels x 100 parameters per kernel = 10k.

How do we determine the number of neurons? If we have a 1000x1000 image, the filter is 10x10, and there is no overlap (i.e. the stride is 10), then the number of neurons is $\frac{1000\times 1000}{10\times 10} = 10k$. This is the count for a single feature map.
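
A quick sanity check of these counts in plain Python (fully connected vs. locally connected vs. shared weights):

image_pixels = 1000 * 1000
hidden_neurons = 10**6
kernel_pixels = 10 * 10
num_kernels = 100
stride = 10

fully_connected = image_pixels * hidden_neurons            # 10^12 connections
locally_connected = hidden_neurons * kernel_pixels         # 10^8 parameters
shared_weights = num_kernels * kernel_pixels               # 10^4 parameters for 100 feature maps
neurons_per_map = (1000 // stride) * (1000 // stride)      # 10^4 neurons in each feature map

print(fully_connected, locally_connected, shared_weights, neurons_per_map)
# 1000000000000 100000000 10000 10000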

One example of a CNN is LeNet, which has 7 layers. The input is a 32*32 image. The C1 layer is a convolution layer, meant to strengthen the features and suppress noise. It has 6 feature maps; each neuron in a feature map connects to a 5*5 area of the input, and each feature map has size 28*28. C1 has 156 trainable parameters: 5*5 weights plus a bias for each filter, i.e. (5*5+1)*6 = 156. The number of connections is 156*28*28.

The S2 layer is a downsampling layer, which exploits local image correlation to reduce the amount of information while retaining what is useful. It has 6 feature maps of size 14*14, where each unit connects to a 2*2 region in C1. For each unit, the 4 inputs are added, multiplied by a trainable coefficient, added to a trainable bias, and passed through a sigmoid. If the coefficient is small, the operation is roughly linear and the downsampling acts like blurring.

C3 is also a convolution layer, where each feature map is a combination of feature maps in S2. S4 is another downsampling layer, C5 is another convolution layer, F6 computes the dot product of its input and weights and adds a bias, and the output layer finally consists of Euclidean radial basis function units.
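
A rough LeNet-style sketch in PyTorch, just to make the layer sizes concrete. The sizes follow the description above, but the activations and pooling are simplified, and the original RBF output layer is replaced by a plain linear layer.

import torch
import torch.nn as nn

class LeNetSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.c1 = nn.Conv2d(1, 6, kernel_size=5)     # 32x32 -> 6 maps of 28x28, (5*5+1)*6 = 156 params
        self.s2 = nn.AvgPool2d(2)                    # 28x28 -> 14x14
        self.c3 = nn.Conv2d(6, 16, kernel_size=5)    # 14x14 -> 16 maps of 10x10
        self.s4 = nn.AvgPool2d(2)                    # 10x10 -> 5x5
        self.c5 = nn.Conv2d(16, 120, kernel_size=5)  # 5x5 -> 120 maps of 1x1
        self.f6 = nn.Linear(120, 84)
        self.out = nn.Linear(84, num_classes)        # stand-in for the original RBF output layer
        self.act = nn.Tanh()

    def forward(self, x):
        x = self.s2(self.act(self.c1(x)))
        x = self.s4(self.act(self.c3(x)))
        x = self.act(self.c5(x)).flatten(1)
        x = self.act(self.f6(x))
        return self.out(x)

print(LeNetSketch()(torch.zeros(1, 1, 32, 32)).shape)   # torch.Size([1, 10])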

Essentially, a CNN is a mapping from input to output, and it can learn a multitude of such input-output relationships without needing a precise mathematical expression for them.

SVM for dummies

SVM has gained increasing popularity in computer vision. Here is an amazing resource on this topic. There are three levels of understanding SVM, from superficial to essential.

Level 1.

Let's start with classification, an important task in data mining. Its objective is to learn a function or model, called a classifier, and SVM is one such supervised learning method.

SVM is a linear binary classifier that maximizes the margin in the feature space. Let x denote the data, which has n dimensions, and y denote the class, either 1 or -1. The linear classifier finds a hyperplane in n-dimensional space whose equation is $w^{T}x + b = 0$.

Why use 1 and -1 instead of, say, 23333 and 322222? This comes from logistic regression, whose aim is to learn a 0/1 classifier. The input is a linear combination of the features, ranging from minus infinity to plus infinity, so we use the logistic (sigmoid) function to map it to (0, 1), interpreted as p(y=1). Its equation is $h_{\theta}(x) = g({\theta}^{T}x) = \frac{1}{1+e^{-{\theta}^{T}x}}$, where x is the n-dimensional feature vector and g is the logistic function, which maps any input z to (0, 1).

Suppose the function gives the probability that the sample belongs to class 1, i.e. p(y=1). Then $P(y=1|x;\theta) = h_{\theta}(x)$ and $P(y=0|x;\theta)=1-h_{\theta}(x)$. When a new feature vector arrives, we compute $h_{\theta}(x)$: if it is larger than 0.5 then y = 1, otherwise y = 0.

Examining $h_{\theta}(x)$, we find that it only depends on ${\theta}^{T}x$: if ${\theta}^{T}x > 0$, then $h_{\theta}(x) > 0.5$. g(z) is merely a mapping; the real determining factor is still ${\theta}^{T}x$. For learning, what we want is ${\theta}^{T}x \gg 0$ for the samples with y = 1 and ${\theta}^{T}x \ll 0$ for the samples with y = 0. Logistic regression learns this $\theta$.

We replace the labels y = 0/1 of logistic regression with y = -1/1, and we use w, b instead of $\theta$. In logistic regression, ${\theta}^{T}x = {\theta}_0 + {\theta}_{1}x_1 + ... + {\theta}_{n}x_n$ with $x_0 = 1$. For SVM we pull out the bias as b, so that $w_{1}x_{1} + ... + w_{n}x_{n} = w^{T}x$. Thus ${\theta}^{T}x = w^{T}x + b$, and $h_{\theta}(x) = g({\theta}^{T}x) = g(w^{T}x + b)$. In other words, apart from the labels, SVM has the same form as logistic regression.

We only care about whether $w^{T}x + b$ is positive or negative, not about the value of g(z) itself, so we let g(z) = 1 when z is non-negative and g(z) = -1 otherwise. The reason for using -1/1 instead of 0/1 is so that the label directly matches the sign of z (to avoid confusion?).

The classification function for SVM is $h_{w,b}(x) = f(x) = g(w^{T}x + b)$. If we are separating two clusters of data with a hyperplane, then $w^{T}x + b = 0$ indicates that x is on the hyperplane, while $w^{T}x + b < 0$ means y = -1 and $w^{T}x + b > 0$ means y = 1. We call w the normal vector and b the intercept.

For a hyperplane w*x + b = 0, |w*x + b| is proportional to the distance from x to the plane (it becomes the exact distance once divided by ||w||). Whether the sign of w*x + b agrees with the class label y tells us whether the classification is right or wrong. Thus, we can use (w*x + b)*y to judge both the correctness and the confidence of the classification.

That's what I managed to learn from the source above. Another source here explains SVM in a much simpler way.

For every sample, we use a feature vector and a label to represent it: $D_i = (x_i, y_i)$. The margin between the sample and the hyperplane is $\delta_{i} = y_i (wx_i + b)$. Notice that if a sample belongs to the positive class, then $wx_i + b > 0$ and $y_i > 0$; otherwise both are negative. Hence $\delta_{i}$ is always positive for a correctly classified sample, and $\delta_{i} = |g(x_i)|$. If we normalize w and b, i.e. use $\frac{w}{||w||}$ and $\frac{b}{||w||}$, then $\delta_{i} = \frac{1}{||w||}|g(x_i)|$. Looks familiar? This is the distance from the point $x_i$ to the line g(x) = 0.

What is ||w||? It means the norm of w, measuring the length of w.

The $\delta_{i}$ after normalization is called the geometric margin, which is the Euclidean distance between the point and the hyperplane. The reason we care so much about this margin is the bound $misclassification \le (\frac{2R}{\delta})^2$, where $\delta$ is the margin and $R = \max ||x_i||, i = 1...n$, the largest norm among all the samples. In other words, R measures how spread out the samples are. The bound says that the misclassification rate is controlled by the margin: the larger the margin, the smaller the upper bound on misclassification.
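
A tiny NumPy sketch of the functional and geometric margins for a single sample (w, b, and the points are arbitrary illustrative values):

import numpy as np

def margins(w, b, x, y):
    """Return the functional margin y*(w.x + b) and the geometric margin y*(w.x + b)/||w||."""
    functional = y * (np.dot(w, x) + b)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -5.0
print(margins(w, b, np.array([2.0, 1.0]), +1))   # (5.0, 1.0): correct side, 1 unit from the plane
print(margins(w, b, np.array([0.0, 0.0]), +1))   # (-5.0, -1.0): wrong side of the plane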

For optimization, we refer to the source here. We want to minimize misclassification, so we want to maximize the margin: $\max\frac{1}{||w||}$, which is equivalent to $\min\frac{1}{2}{||w||}^2$ (since the norm is positive, we may square it and add a constant coefficient for the sake of differentiation). The full objective is $\min\frac{1}{2}{||w||}^2 \; s.t. \; y_i (w^T x_i + b) \ge 1, i = 1...n$, where s.t. means subject to, i.e. under the given constraints. This is a quadratic programming problem, and it is convex. A convex problem has no local optima: imagine a funnel; no matter where we drop a ball, it ends up at the same lowest point, i.e. the global solution. The constraints after s.t. define a convex polyhedron, and we look for the optimum over it.

This problem can be solved with Lagrange multipliers and the KKT conditions. The Lagrangian is $L(w,b,\alpha) = \frac{1}{2}{||w||}^2 - \sum_{i=1}^{n} \alpha_i (y_i (w^T x_i + b)-1)$.

To solve it, we minimize L with respect to w and b by setting the partial derivatives to zero:
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n}\alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0$$

Substituting these back into the original equation, we get the dual problem
$$\max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j {x_i}^T x_j \quad s.t. \quad \sum_{i=1}^{n}\alpha_i y_i = 0$$
This dual can be solved by SMO (sequential minimal optimization).

What if the data is not linearly separable? We can either use a curve, or keep a line and tolerate some error. We add a penalty for the misclassified points, proportional to how far each point is from where it should be. The problem becomes $\min\frac{1}{2} {||w||}^2 + C\sum_{i=1}^{R}\epsilon_i, \; s.t. \; y_i (w^T x_i + b) \ge 1 - \epsilon_i, \; \epsilon_i \ge 0$.

When $x_i$ is on the correct side, $\epsilon_i = 0$. R is the total number of points. When C is large, misclassification is rare, at the risk of over-fitting. When C is small, misclassification is more common, but the resulting model may not be correct.

Kernel function

When the data is linearly inseparable, we can separate it in a nonlinear way, using curves. Alternatively, we can transform the original low-dimensional space into a high-dimensional one and then separate the data with a hyperplane there. For instance, no two books are identical: looking only at (color, content), two books may appear the same, but once we add author, page count, publisher, owner, and so on, the two books become distinguishable.

Recall the objective function
$$\max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j {x_i}^T x_j \quad s.t. \quad \sum_{i=1}^{n}\alpha_i y_i = 0$$
We can replace ${x_i}^T x_j$ with $K(x_i,x_j)$. What this does is implicitly map the data into a high-dimensional space.

K(x_i, x_j) can take many forms; two common ones are $(x_i \cdot x_j + 1)^d$ and $\exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$. The former is the polynomial kernel, whereas the latter is the Gaussian (RBF) kernel, which maps to an infinite-dimensional space.
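
A short sketch comparing a linear and an RBF-kernel SVM with scikit-learn (the toy data and parameter values are purely illustrative):

import numpy as np
from sklearn.svm import SVC

# toy data: points inside a circle are class 1, outside are class 0 (not linearly separable)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)

linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)

print('linear accuracy:', linear_svm.score(X, y))   # poor, since no separating line exists
print('rbf accuracy:   ', rbf_svm.score(X, y))      # much better: the kernel lifts the data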

Other problems

For N-class classification, there are two methods. 1. One vs. the rest: train N classifiers, where the i-th classifier decides whether the data belongs to class i. 2. One vs. one: train $\frac{N(N-1)}{2}$ classifiers, where classifier (i,j) decides between classes i and j. According to the author of libsvm, one vs. one works better than one vs. the rest.

Will SVM overfit? We can tune the penalty coefficient C. Moreover, the $\min {||w||^2}$ term is the same kind of regularization used in least squares, and it makes the function smoother. Hence SVM is unlikely to overfit.











  


Friday 10 July 2015

Algorithm for dummies —— Sorting

As a code monkey, I am not very familiar with the fundamental algorithms. I have decided to start learning them because of this wonderful book, which urges me to learn more about algorithms.

Bucket Sort

Sorting is the first algorithm introduced, since it shows up everywhere. For instance, sorting the 5 numbers 5, 3, 5, 2, 8 in descending order gives 8 5 5 3 2. How do we do this efficiently on a computer? Very simple: a 1D array will do.

We initialize an array int a[11] with all zeros. The first number is 5, so we increment a[5] to 1. Doing this for all the numbers, we get a[2] = a[3] = a[8] = 1 and a[5] = 2, while the rest stay zero. If we print out the index of every nonzero element, repeated as many times as its value, going from index 10 down to 0 gives the descending order "8 5 5 3 2" (going from 0 up to 10 gives the ascending order "2 3 5 5 8").

This is a simplified version of bucket sort. It is analogous to having 11 buckets, indexed 0-10. Every time we see a number, we put a flag in the corresponding bucket, and at the end we only need to count how many flags are in each bucket. The purpose of a bucket is to record how many times its number appears.
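
A minimal sketch of this simplified bucket (counting) sort in Python:

def bucket_sort(numbers, max_value=10):
    """Count how many times each value 0..max_value appears, then read the buckets back in order."""
    buckets = [0] * (max_value + 1)
    for x in numbers:
        buckets[x] += 1                        # put a flag in bucket x
    result = []
    for value in range(max_value, -1, -1):     # scan the buckets from high to low for descending order
        result.extend([value] * buckets[value])
    return result

print(bucket_sort([5, 3, 5, 2, 8]))   # [8, 5, 5, 3, 2]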

Regarding complexity: we iterate m times to initialize the m buckets, n times to drop a flag for each of the n numbers, and then roughly m+n more steps to scan the buckets and print the sorted numbers, so the complexity is O(m+n+m+n) = O(2*(m+n)). We usually drop the constant factor and write it in capitals as O(M+N).

The problem with bucket sort is that we can only print out the sorted numbers; if they were test scores, we could not tell who got which score. To handle that, we turn to bubble sort.

Bubble Sort

The basic idea of bubble sort is to swap two adjacent elements whenever their order is wrong. For instance, say we want to sort 12 35 99 18 76 in descending order, so the smaller numbers go to the back. We start with 12 and 35, find 12 < 35, and swap them to get 35 12 99 18 76. We then proceed the same way: since 12 < 99, we get 35 99 12 18 76. At the end of the pass, 12 sits in the last place. This process is analogous to how bubbles rise to the surface, hence the name bubble sort.

In each pass, only one number's final position is determined. In other words, the first pass only fixes the last number, and the second pass only fixes the second-to-last one. If we have n numbers, we have to place n - 1 numbers into their correct positions, and each pass starts from the first number and compares adjacent pairs. The core of bubble sort is the double loop, giving O(N^2), which is high.
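
A minimal sketch of bubble sort in descending order, matching the walkthrough above:

def bubble_sort_desc(a):
    """Repeatedly swap adjacent elements that are out of order; each pass fixes one more position at the end."""
    a = list(a)
    n = len(a)
    for i in range(n - 1):              # n - 1 passes
        for j in range(n - 1 - i):      # the last i positions are already settled
            if a[j] < a[j + 1]:         # wrong order for a descending sort
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(bubble_sort_desc([12, 35, 99, 18, 76]))   # [99, 76, 35, 18, 12]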

Quick Sort

Suppose we are sorting 6 1 2 7 9 3 4 5 10 8. We first choose a reference number (the pivot), e.g. the first element 6. We then need to move every number smaller than 6 to its left and every number larger than 6 to its right, for example: 3 1 2 5 4 6 9 7 10 8.

In the original sequence, 6 sits in the first position, and we need to move it to some position k such that everything on its left is smaller than 6 and everything on its right is larger. To do this, we work from both ends: first search from right to left for a number smaller than 6, then search from left to right for a number larger than 6, and swap the two. For the sequence above, we use two "sentries" i and j. At first, i points to 6 and j points to 8. After searching, j stops at 5 and i stops at 7, and we swap them. Now the sequence is 6 1 2 5 9 3 4 7 10 8. The second round begins: j stops when it finds 4, i stops when it finds 9, and they swap again. The sequence is 6 1 2 5 4 3 9 7 10 8. The search continues for a third round: j finds 3, while i meets j before finding anything larger. This pass is over; we just swap 6 and 3, and the sequence becomes 3 1 2 5 4 6 9 7 10 8.

Next we deal with the subsequences on the left and right of 6. Starting with the left one, 3 1 2 5 4, and using 3 as the reference, we get 2 1 3 5 4. We then handle 2 1 and 5 4, and likewise the right part 9 7 10 8, and we end up with 1 2 3 4 5 6 7 8 9 10. In essence, each round places one reference number in its final position.
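
A compact sketch of quicksort in Python. For clarity it builds new lists instead of swapping in place with two sentries as in the walkthrough, but the idea of partitioning around a pivot is the same.

def quick_sort(a):
    """Pick the first element as the pivot, partition the rest around it, and recurse."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    left = [x for x in a[1:] if x <= pivot]    # numbers that go to the pivot's left
    right = [x for x in a[1:] if x > pivot]    # numbers that go to the pivot's right
    return quick_sort(left) + [pivot] + quick_sort(right)

print(quick_sort([6, 1, 2, 7, 9, 3, 4, 5, 10, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]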

The reason quick sort is quick is that numbers can jump across the sequence, instead of only being compared with their neighbours as in bubble sort. In the worst case it is $O(N^2)$, but on average it is $O(N\log N)$.

Example

Say we want to build a library, so we need to buy books. Every book has a unique ISBN. The librarian asks the students to report the ISBNs of their favourite books; best sellers will be reported multiple times, so the repeated ISBNs need to be removed, and then the books are sorted by ISBN.

The first input is n, the number of students. The second input is the list of ISBN numbers. The first output is k, the number of books to buy. The second output is the ISBNs in ascending order.

There are two approaches. 1. Remove the duplicates first, then sort. 2. Sort first, then skip the duplicates while printing.

For method 1, we can use bucket sort, which is $O(M+N)$. For method 2, we can use bubble sort or quick sort, and print each repeated number only once. Sorting is $O(N^2)$, and reading the input and printing the output are each O(N); dropping the 2N term, the total is $O(N^2)$.
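
A tiny sketch of the whole task in Python (deduplicate and then sort; a set plus sorted() stands in for the hand-written sorts above, and the ISBN values are made up):

isbns = [10, 32, 21, 10, 8, 21]        # ISBNs reported by the students
unique_sorted = sorted(set(isbns))     # remove duplicates, then sort ascending
print(len(unique_sorted))              # 4 books to buy
print(*unique_sorted)                  # 8 10 21 32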