Notes: VC-dimension

  • This is a brief summary of what we talked about this week, including some further discussion of learning a halfspace, VC-dimension, and kernels. For details, you can read Chapter 6 of Foundations of Data Science .
  • Last time we proved that for learning a halfspace, we can get the following mistake bound
    • Perceptron: O\left(\frac{||w^*||_2^2 ||X||_2^2} {\gamma^2}\right)
    • Winnow:  O\left(\frac{||w^*||_1^2 ||X||_\infty^2} {\gamma^2}\log n\right)
  • We want to ask if we can improve the mistake bound, and how this can lead to a sample complexity bound in PAC learning
  • Assume the examples come from a R^n-ball.  We choose a random halfspace in the remaining (consistent) set of hypotheses at each round. The initial volume is V and the final volume is V\gamma^n (a \gamma-cone). Therefore, the mistake bound is n\log_{e/(e-1)} \frac{1}{\gamma}.  The expected mistake bound is 2\log_2 \frac{1}{\gamma}.

VC-dim(H): largest m such that there is a set of m points which can be partitioned by H in 2^m different ways. That is, let H(m) to be the maximum number of ways to partition any set of m points using hypotheses in H, then VC-dim(H) is the maximum m such that H(m)=2^m.

  • Theorem 1: m \ge \frac{2}{\epsilon}\left( \log H(2m) + \log \frac{1}{\delta} \right) examples suffice to PAC-learn H
  • Theorem 2: For a set S with m \ge \frac{8}{\epsilon^2}\left( \log H(2m) + \log \frac{1}{\delta} \right) examples
    • \forall h\in H, |err_D(h) - err_S(h)| \le \epsilon
  • Sauer’s lemma: H(m) \le \sum_{i=0}^{VC-dim(H)} {m \choose i} \le m^{VC-dim(H)}
  • Collorary: The number of examples to PAC-learn H with VC-dim(H)=d is O\left (\frac{1}{\epsilon}( d\log\frac{1}{\epsilon} + \log\frac{1}{\delta} ) \right)
  • Let us get back to the case of halfspace with margin \gamma
    • From the analysis of Perceptron, we know the mistake bound is 1/\gamma^2 (which can be transformed into a PAC complexity bound)
    • It is then natural to find a halfspace with a large margin
  • Maximum margin problem:
    • \max_{w,\gamma} \gamma, such that \ell(x^i) (w\cdot x^i) \ge \gamma, and ||w||^2\le 1, \forall i
    • This is equivalent to \min_{w} ||w||^2, such that \ell(x^i) (w\cdot x^i) \ge 1
  • What if the data is not linearly separable
    • Add some slack variables \epsilon_i
    • \min_{w} ||w||^2 + C\sum_i \epsilon_i, such that \ell(x^i) (w\cdot x^i) \ge 1-\epsilon_i
    • This is the problem that Support Vector Machine (SVM)  solves
  • For non-linearly-separable data, the mistake bound of Perceptron is \min_w \left( ||w ||^2 + 2\sum_i \epsilon_i \right)
    • Let L = \sum_i \epsilon_i
    • On a mistake, w^* \cdot w \rightarrow w^* (w+\ell(x^i)x^i) = W^* \cdot w + |w^*\cdot x^i| =w^* \cdot w + 1-\epsilon_i
    • After M mistakes, w^* \cdot w\ge M - L
    • ||w||^2 \rightarrow ||w +\ell(x^i)x^i||^2 = ||w||^2 + ||x^i||^2 + 2\ell(x^i)(w \cdot x^i) \le ||w||^2 +1  (assume ||x||\le 1)
    • After M mistakes, ||w||^2 \le M
    • Combining the above inequalities using w^* \cdot w \le ||w^*|| ||w||, we get M - L \le ||w^*||\sqrt{M}
    • Square both sides and rearrange, we get M + \frac{L^2}{M} \le ||w^*||^2 + 2L
  • Kernels:
    • A function k(\cdot, \cdot) is a legal kernel function if k can be written as k(x,y) = \phi(x)\cdot \phi(y).
    • This is equivalent to say that the matrix of k(x,y) is symmetric semi-definite.
    • E.g. k(x,y)=(x\cdot y)^d  ==> \phi(x) = vec(\otimes^d x)
    • k(x,y) = \sum_j C_j (x\cdot y)^j, C_j>0 is a legal kernel
    • k(x,y)=(1+x\cdot y)^d is a legal kernel
    • k(x,y) = e^{-||w-y||^2} is a legal kernel
  • Kernel Perceptron:
    • w = \sum_i \ell(x^i) x^i
    • w\cdot x = \sum_i\ell(x^i) x^i \cdot x
    • If we want to perform a non-linear transform \phi on each x
    • The above becomes \sum_i\ell(x^i) \phi(x^i) \cdot \phi(x) =\sum_i\ell(x^i) k(x^i, x)
    • The algorithm can simply store all the examples (along with their labels) that it made a mistake. To predict a new examples x, we just need to calculate the kernel values k(x^i, x) for each i and output the sign of \sum_i \ell(x^i) k(x^i, x).
    • The benefit of using kernel is that we don’t need to explicitly map each data to the high dimensional space and do the inner-product, which can save a lot of time.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s