
Kernels 101

(Migrated from old blog)

A kernel is any function $f(x,\bar{x}) $ that represents the scalar product of two variables; it determines distances in the feature space. Kernels have two major properties: normalization and symmetry. Let's think of an example:

Let's assume we have a non-linear boundary where we predict $y = 1 $ if

\[\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots \geq 0\]

So the hypothesis becomes

\[h_\theta(x) = \begin{cases} 1 & \text{if } \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots \geq 0 \\ 0 & \text{otherwise} \end{cases}\]
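As a minimal sketch of this thresholding step (the function name `hypothesis` and the argument layout are my own, purely illustrative):

```python
import numpy as np

def hypothesis(theta, f):
    """Predict 1 if theta_0 + theta_1*f_1 + ... + theta_n*f_n >= 0, else 0.

    theta : parameter vector, theta[0] is the intercept theta_0
    f     : feature vector (f_1, ..., f_n)
    """
    score = theta[0] + np.dot(theta[1:], f)
    return 1 if score >= 0 else 0

# Example: theta = [-1, 2, 0.5], f = [0.8, 0.1] -> score = 0.65 -> predict 1
print(hypothesis(np.array([-1.0, 2.0, 0.5]), np.array([0.8, 0.1])))
```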

We can rewrite the linear boundary equation as

\[\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \cdots \geq 0\]

where $f_1, f_2, \ldots, f_n $ are the similarities between a point $x $ and a set of landmark points.

Consider three landmarks $l^{(1)}, l^{(2)}, l^{(3)} $ taken from the training set. Given a point $x $, we can write

\[f_1 = \mathrm{Similarity}\left( x, l^{(1)} \right) = \mathrm{exp}\left( - \frac{ \vert\vert x - l^{(1)} \vert\vert^2 }{2\sigma^2} \right)\]

This similarity measure is called a kernel function, and this specific one is the Gaussian kernel.
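A minimal NumPy sketch of this similarity (the names `gaussian_kernel` and `sigma` are mine, not from any particular library):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """Gaussian similarity between a point x and a landmark l:
    exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```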

How does it work?

Case 1: When $x \approx l^{(1)} $

Then $\vert\vert x - l^{(1)} \vert\vert \approx 0 $, so the expression becomes

\[\mathrm{exp} \left( -\frac{0}{2 \sigma^2} \right) \approx 1\]

Therefore the value becomes 1, so in this feature space the point is maximally similar to the landmark. Now consider a point far away from the landmark $l^{(1)} $.

Case 2: When $x $ is very far from $l^{(1)} $

Then $\vert\vert x - l^{(1)} \vert\vert $ will be a very large number, so the function becomes

\[\mathrm{exp} \left( -\frac{ (\text{very large number})^2 }{2 \sigma^2} \right) \approx 0\]
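Using the `gaussian_kernel` sketch from above, the two cases look like this (the landmark coordinates and test points are made up for illustration):

```python
l1 = [3.0, 4.0]  # landmark l^(1)

# Case 1: x close to the landmark -> similarity near 1
print(gaussian_kernel([3.1, 3.9], l1, sigma=1.0))   # ~0.99

# Case 2: x very far from the landmark -> similarity near 0
print(gaussian_kernel([10.0, -5.0], l1, sigma=1.0)) # ~5e-29, effectively 0
```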

Thus the kernel normalizes the similarity of every pair of points to the range $[0, 1]$, and it is symmetric: $\mathrm{Sim}(x, l) = \mathrm{Sim}(l, x)$.

Where do these landmarks $l^{(i)} $ come from? They can be chosen directly from the dataset, i.e. from the values $( x^{(i)}, y^{(i)}) $. For a training example $x^{(i)} \rightarrow $

\[f_1^{(i)} = \mathrm{Sim}(x^{(i)}, l^{(1)}) \\ f_2^{(i)} = \mathrm{Sim}(x^{(i)}, l^{(2)}) \\ \vdots \\ f_i^{(i)} = \mathrm{Sim}(x^{(i)}, l^{(i)}) = 1 \text{ for the Gaussian kernel} \\ \vdots \\ f_m^{(i)} = \mathrm{Sim}(x^{(i)}, l^{(m)})\]

We can build a new feature vector with

\[f^{(i)}=\left[ \begin{array}{c}{f^{(i)}_{0}} \\ {f^{(i)}_{1}} \\ {f^{(i)}_{2}} \\ {\vdots} \\ {f^{(i)}_{m}}\end{array}\right]\]

up to $m$ because we have $m$ training examples, each one used as a landmark; $f_0^{(i)} = 1$ is the usual bias term.
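Putting it together, here is a small sketch of mapping every training example $x^{(i)}$ to its feature vector $f^{(i)}$ when the landmarks are the training points themselves. It assumes the `gaussian_kernel` function defined earlier; the name `feature_matrix` and the example data are mine:

```python
import numpy as np

def feature_matrix(X, sigma=1.0):
    """Map each x^(i) to f^(i) = (f_0, f_1, ..., f_m), using every
    training example as a landmark; f_0 = 1 is the bias feature."""
    m = X.shape[0]
    F = np.ones((m, m + 1))          # column 0 stays 1 (bias feature f_0)
    for i in range(m):
        for j in range(m):
            F[i, j + 1] = gaussian_kernel(X[i], X[j], sigma)
    return F

# Example: 3 training points in 2-D
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
print(feature_matrix(X))  # the diagonal of the kernel part is 1, i.e. f_i^(i) = 1
```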

Read about how this scales in SVMs in the post here:

Support Vector Machines

This post is licensed under CC BY 4.0 by the author.
