Notes: VC-dimension

• This is a brief summary of what we discussed this week, including some further discussion of learning halfspaces, VC-dimension, and kernels. For details, see Chapter 6 of Foundations of Data Science.
• Last time we proved the following mistake bounds for learning a halfspace:
• Perceptron: $O\left(\frac{||w^*||_2^2 ||X||_2^2} {\gamma^2}\right)$
• Winnow:  $O\left(\frac{||w^*||_1^2 ||X||_\infty^2} {\gamma^2}\log n\right)$
• We now ask whether we can improve the mistake bound, and how a mistake bound leads to a sample complexity bound in PAC learning
• Assume the examples come from a ball in $R^n$. At each round we choose a random halfspace from the remaining (consistent) set of hypotheses. The initial volume is $V$ and the final volume is $V\gamma^n$ (a $\gamma$-cone). Therefore, the mistake bound is $n\log_{e/(e-1)} \frac{1}{\gamma}$, and the expected mistake bound is $2n\log_2 \frac{1}{\gamma}$.
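As a sanity check on the Perceptron bound above, here is a minimal sketch (the toy data set is hypothetical, chosen so the bound is easy to state): unit-norm examples separable by $w^* = (1, 0)$ with margin $\gamma = 1/2$, so the bound guarantees at most $||w^*||_2^2 ||X||_2^2 / \gamma^2 = 4$ mistakes.

```python
import math
import random

def perceptron(points, labels, epochs=100):
    """Online Perceptron: update w <- w + l(x) x on each mistake."""
    w = [0.0] * len(points[0])
    mistakes = 0
    for _ in range(epochs):
        clean = True
        for x, y in zip(points, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean = False
        if clean:  # consistent with every example: no more mistakes
            break
    return w, mistakes

# Unit-norm points with |x_1| >= 1/2, labeled by sign(x_1):
# separable by w* = (1, 0) with margin gamma = 1/2.
random.seed(0)
points, labels = [], []
for _ in range(200):
    theta = random.uniform(-math.pi / 3, math.pi / 3)
    if random.random() < 0.5:
        theta += math.pi
    x = (math.cos(theta), math.sin(theta))
    points.append(x)
    labels.append(1 if x[0] > 0 else -1)

w, mistakes = perceptron(points, labels)
print(mistakes)  # the theorem guarantees at most 1/gamma^2 = 4 mistakes
```

The exact mistake count depends on the random draw, but the theorem guarantees it never exceeds 4.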

VC-dim(H): the largest $m$ such that there is a set of $m$ points which can be partitioned by $H$ in all $2^m$ different ways (such a set is said to be shattered by $H$). That is, let $H(m)$ be the maximum number of ways to partition any set of $m$ points using hypotheses in $H$; then VC-dim(H) is the largest $m$ such that $H(m)=2^m$.
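To make the definition concrete, here is a small sketch computing $H(m)$ by brute force for a standard example class (not from the notes): intervals on the line, $h_{[a,b]}(x)=+1$ iff $a \le x \le b$. Two points can be labeled in all $2^2=4$ ways, but three points cannot (the labeling $+,-,+$ is unrealizable), so the VC-dimension is 2.

```python
def interval_labelings(m):
    """All distinct +/-1 labelings of m increasing points realizable by
    hypotheses h_[a,b](x) = +1 iff a <= x <= b (plus the empty interval)."""
    labelings = {tuple([-1] * m)}            # empty interval: all -1
    for i in range(m):
        for j in range(i, m):                # +1 exactly on positions i..j
            labelings.add(tuple(1 if i <= k <= j else -1 for k in range(m)))
    return labelings

H = lambda m: len(interval_labelings(m))
print(H(2), H(3))  # prints 4 7 -> 4 = 2^2 but 7 < 2^3, so VC-dim = 2
```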

• Theorem 1: $m \ge \frac{2}{\epsilon}\left( \log H(2m) + \log \frac{1}{\delta} \right)$ examples suffice to PAC-learn H
• Theorem 2: For a set $S$ of $m \ge \frac{8}{\epsilon^2}\left( \log H(2m) + \log \frac{1}{\delta} \right)$ examples, with probability at least $1-\delta$,
• $\forall h\in H$, $|err_D(h) - err_S(h)| \le \epsilon$
• Sauer’s lemma: $H(m) \le \sum_{i=0}^{VC-dim(H)} {m \choose i} \le m^{VC-dim(H)}$
• Corollary: The number of examples needed to PAC-learn $H$ with VC-dim(H)=d is $O\left (\frac{1}{\epsilon}( d\log\frac{1}{\epsilon} + \log\frac{1}{\delta} ) \right)$
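As a quick numerical check of Sauer's lemma, we can compare the growth function of a class of known VC-dimension against the bound. For intervals on the line (a standard example, $d=2$), the number of labelings of $m$ points is $\binom{m}{2}+m+1 = \frac{m(m+1)}{2}+1$, which in fact meets the Sauer bound $\sum_{i=0}^{2}\binom{m}{i}$ with equality:

```python
from math import comb

def interval_growth(m):
    # labelings of m collinear points by intervals: pick the (possibly
    # empty) contiguous run of +1 labels
    return m * (m + 1) // 2 + 1

d = 2  # VC-dimension of intervals on the line
for m in range(1, 30):
    sauer_bound = sum(comb(m, i) for i in range(d + 1))
    assert interval_growth(m) <= sauer_bound  # equality holds here
print("Sauer's lemma verified for intervals up to m = 29")
```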
• Let us return to the case of halfspaces with margin $\gamma$
• From the analysis of the Perceptron, we know the mistake bound is $1/\gamma^2$ when $||w^*|| = ||X|| = 1$ (which can be transformed into a PAC sample complexity bound)
• It is then natural to find a halfspace with a large margin
• Maximum margin problem:
• $\max_{w,\gamma} \gamma$, such that $\ell(x^i) (w\cdot x^i) \ge \gamma$ for all $i$, and $||w||^2\le 1$
• This is equivalent to $\min_{w} ||w||^2$, such that $\ell(x^i) (w\cdot x^i) \ge 1$, $\forall i$ (rescale $w \rightarrow w/\gamma$: maximizing the margin under $||w||\le 1$ is the same as minimizing $||w||$ under margin $1$)
• What if the data is not linearly separable?
• Add slack variables $\epsilon_i \ge 0$
• $\min_{w} ||w||^2 + C\sum_i \epsilon_i$, such that $\ell(x^i) (w\cdot x^i) \ge 1-\epsilon_i$, $\epsilon_i \ge 0$, $\forall i$
• This is the problem that the Support Vector Machine (SVM) solves
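The soft-margin objective above can be minimized directly. Here is a hedged sketch (plain subgradient descent rather than the usual dual/QP solver, with no bias term and a constant step size for simplicity) on $||w||^2 + C\sum_i \max(0,\, 1-\ell(x^i)(w\cdot x^i))$, using a hypothetical 1-D data set whose max-margin solution is $w = 0.5$:

```python
def svm_subgradient(xs, ys, C=10.0, lr=0.01, steps=2000):
    """Subgradient descent on ||w||^2 + C * sum_i max(0, 1 - l(x^i)(w.x^i)).
    A minimal sketch with no bias term, matching the formulation above."""
    n = len(xs[0])
    w = [0.0] * n
    for _ in range(steps):
        g = [2.0 * wi for wi in w]            # gradient of ||w||^2
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
                for k in range(n):
                    g[k] -= C * y * x[k]      # subgradient of active hinge terms
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical 1-D separable data; the max-margin solution is w = (0.5,)
xs = [(2.0,), (3.0,), (-2.0,), (-3.0,)]
ys = [1, 1, -1, -1]
w = svm_subgradient(xs, ys)
print(all(y * w[0] * x[0] > 0 for x, y in zip(xs, ys)))  # prints True
```

With a constant step size the iterates oscillate near the optimum rather than converging exactly; a decaying step size (as in Pegasos-style solvers) tightens this.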
• For non-linearly-separable data, the mistake bound of the Perceptron is $\min_w \left( ||w ||^2 + 2\sum_i \epsilon_i \right)$, where $\epsilon_i = \max(0,\, 1-\ell(x^i)(w\cdot x^i))$ is the hinge loss of $w$ on $x^i$
• Let $L = \sum_i \epsilon_i$
• On a mistake on $x^i$, $w^* \cdot w \rightarrow w^* \cdot (w+\ell(x^i)x^i) = w^* \cdot w + \ell(x^i)(w^*\cdot x^i) \ge w^* \cdot w + 1-\epsilon_i$
• After $M$ mistakes, $w^* \cdot w\ge M - L$
• $||w||^2 \rightarrow ||w +\ell(x^i)x^i||^2 = ||w||^2 + ||x^i||^2 + 2\ell(x^i)(w \cdot x^i) \le ||w||^2 +1$  (assuming $||x||\le 1$, and using that $\ell(x^i)(w\cdot x^i) \le 0$ on a mistake)
• After $M$ mistakes, $||w||^2 \le M$
• Combining the above inequalities using $w^* \cdot w \le ||w^*|| ||w||$, we get $M - L \le ||w^*||\sqrt{M}$
• Squaring both sides and rearranging, we get $M + \frac{L^2}{M} \le ||w^*||^2 + 2L$; since $\frac{L^2}{M} \ge 0$, this gives $M \le ||w^*||^2 + 2L$
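The resulting bound can be checked numerically: run one Perceptron pass over a non-separable stream (the data set below is hypothetical), pick any fixed comparison vector $w^*$, and compare the mistake count $M$ against $||w^*||^2 + 2\sum_i \epsilon_i$:

```python
def hinge(w, x, y):
    """Hinge loss of comparison vector w on example (x, y)."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def perceptron_pass(examples):
    """One Perceptron pass; returns the number of mistakes."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    for x, y in examples:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return mistakes

# Non-separable stream (||x|| <= 1): one point on the wrong side of w*
examples = [((0.9, 0.1), 1), ((0.8, -0.3), 1), ((-0.7, 0.2), -1),
            ((0.6, 0.1), -1),   # "noisy" point relative to w* = (2, 0)
            ((-0.9, -0.2), -1)]
w_star = (2.0, 0.0)             # any fixed comparison vector works
M = perceptron_pass(examples)
L = sum(hinge(w_star, x, y) for x, y in examples)
print(M <= sum(wi ** 2 for wi in w_star) + 2 * L)  # prints True
```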
• Kernels:
• A function $k(\cdot, \cdot)$ is a legal kernel function if $k$ can be written as $k(x,y) = \phi(x)\cdot \phi(y)$ for some mapping $\phi$.
• This is equivalent to saying that, for any set of points, the matrix $[k(x^i,x^j)]$ is symmetric positive semi-definite.
• E.g. $k(x,y)=(x\cdot y)^d$  ==> $\phi(x) = vec(\otimes^d x)$
• $k(x,y) = \sum_j C_j (x\cdot y)^j$, where each $C_j>0$, is a legal kernel
• $k(x,y)=(1+x\cdot y)^d$ is a legal kernel
• $k(x,y) = e^{-||x-y||^2}$ is a legal kernel
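The polynomial-kernel example can be verified directly: for $k(x,y)=(x\cdot y)^2$, the explicit feature map lists all degree-2 products, and the two computations agree, with the kernel side costing $O(n)$ rather than $O(n^2)$. A minimal sketch:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for k(x, y) = (x . y)^2: all degree-2 products."""
    return [xi * xj for xi in x for xj in x]

x, y = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
lhs = dot(x, y) ** 2          # kernel computed directly in O(n)
rhs = dot(phi(x), phi(y))     # inner product in the n^2-dimensional space
print(abs(lhs - rhs) < 1e-9)  # prints True
```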
• Kernel Perceptron:
• $w = \sum_i \ell(x^i) x^i$, where the sum runs over the examples on which the Perceptron made a mistake
• $w\cdot x = \sum_i\ell(x^i) x^i \cdot x$
• If we want to apply a non-linear transformation $\phi$ to each $x$
• The above becomes $\sum_i\ell(x^i) \phi(x^i) \cdot \phi(x) =\sum_i\ell(x^i) k(x^i, x)$
• The algorithm simply stores all the examples (along with their labels) on which it made a mistake. To predict on a new example $x$, we just calculate the kernel values $k(x^i, x)$ for each stored $x^i$ and output the sign of $\sum_i \ell(x^i) k(x^i, x)$.
• The benefit of using kernels is that we don’t need to explicitly map each data point to the high-dimensional space and take the inner product there, which can save a lot of time.
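Putting it together, here is a minimal sketch of the kernel Perceptron described above, using the Gaussian kernel $k(x,y)=e^{-||x-y||^2}$ on a hypothetical XOR-style data set that is not linearly separable in the original space:

```python
import math

def gaussian_kernel(x, y):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

class KernelPerceptron:
    """Stores mistaken examples; predicts sign(sum_i l(x^i) k(x^i, x))."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.mistakes = []          # (example, label) pairs

    def predict(self, x):
        s = sum(l * self.kernel(xi, x) for xi, l in self.mistakes)
        return 1 if s > 0 else -1

    def update(self, x, label):
        if self.predict(x) != label:
            self.mistakes.append((x, label))

# XOR-style data: not linearly separable in the original 2-D space
data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
kp = KernelPerceptron(gaussian_kernel)
for _ in range(10):
    for x, l in data:
        kp.update(x, l)

print(all(kp.predict(x) == l for x, l in data))  # prints True
```

Note that the algorithm never forms $\phi(x)$ explicitly; for the Gaussian kernel the feature space is infinite-dimensional, so storing mistakes and evaluating $k$ is the only practical option.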