- This is a brief summary of what we talked about this week, including some further discussion of learning a halfspace, VC-dimension, and kernels. For details, you can read Chapter 6 of Foundations of Data Science.
- Last time we proved that for learning a halfspace with margin γ, we can get the following mistake bounds
- Perceptron: O(1/γ²) mistakes, where γ is the ℓ₂ margin (examples in the unit ball)
- Winnow: O((log n)/γ²) mistakes, where γ is the ℓ∞ margin (‖w*‖₁ = 1, ‖x‖∞ ≤ 1)
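The Perceptron bound can be checked empirically. The following sketch (the function name and toy data are my own, not from the lecture) runs the standard update w ← w + yᵢ·xᵢ on mistakes and compares the mistake count against 1/γ²:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Online Perceptron: update w <- w + y_i * x_i on each mistake.

    Returns the learned weight vector and the total number of mistakes.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        clean = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake (or on the boundary)
                w += y_i * x_i
                mistakes += 1
                clean = False
        if clean:                            # consistent with all examples
            break
    return w, mistakes

# Toy separable data in the unit ball with margin gamma w.r.t. w* = (1, 0).
rng = np.random.default_rng(0)
gamma = 0.2
X, y = [], []
while len(X) < 200:
    x = rng.uniform(-1, 1, size=2)
    if np.linalg.norm(x) <= 1 and abs(x[0]) >= gamma:  # enforce the margin
        X.append(x)
        y.append(np.sign(x[0]))
X, y = np.array(X), np.array(y)

w, mistakes = perceptron(X, y)
print(mistakes, "mistakes; bound 1/gamma^2 =", 1 / gamma**2)
assert mistakes <= 1 / gamma**2          # the mistake bound from the notes
assert np.all(y * (X @ w) > 0)           # final w separates the data
```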

- We want to ask whether we can improve the mistake bound, and how a mistake bound can lead to a sample-complexity bound in PAC learning
- Assume the examples come from the unit d-dimensional ball and are separable with margin γ. We choose a random halfspace from the remaining (consistent) set of hypotheses at each round, where each hypothesis is represented by its unit normal vector. The initial volume of the consistent set is the whole unit sphere of directions, and the final volume is at least proportional to γ^d (it contains a γ-cone of directions around the target). Therefore, the mistake bound of the halving (majority-vote) version is log₂ of the volume ratio, i.e. O(d·log(1/γ)). The expected mistake bound of the randomized version is also O(d·log(1/γ)).
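This volume argument can be simulated in two dimensions, where the consistent set is an arc of directions and the γ-cone is an arc of width proportional to γ (the parameter choices below are my own illustrative assumptions). Drawing a fresh random consistent direction each round makes far fewer mistakes than the number of rounds:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, rounds = 0.1, 300
w_star = np.array([1.0, 0.0])           # target direction

def sample_example():
    """Uniform point on the unit circle with margin gamma w.r.t. w_star."""
    while True:
        t = rng.uniform(0, 2 * np.pi)
        x = np.array([np.cos(t), np.sin(t)])
        if abs(x @ w_star) >= gamma:
            return x

def random_consistent(X, y):
    """Rejection-sample a unit normal vector consistent with all data seen."""
    while True:
        t = rng.uniform(0, 2 * np.pi)
        h = np.array([np.cos(t), np.sin(t)])
        if len(X) == 0 or np.all(y * (X @ h) > 0):
            return h

X_seen = np.empty((0, 2))
y_seen = np.empty(0)
mistakes = 0
h = random_consistent(X_seen, y_seen)
for _ in range(rounds):
    x = sample_example()
    label = np.sign(x @ w_star)
    if np.sign(h @ x) != label:
        mistakes += 1
    X_seen = np.vstack([X_seen, x])
    y_seen = np.append(y_seen, label)
    h = random_consistent(X_seen, y_seen)   # redraw from the consistent set

print("mistakes over", rounds, "rounds:", mistakes)
assert mistakes < 30   # theory predicts only O(log(1/gamma)) in expectation
```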

VC-dim(H): the largest d such that there is a set of d points which can be partitioned by H in all 2^d different ways. That is, let H[n] be the maximum number of ways to partition any set of n points using hypotheses in H; then VC-dim(H) is the maximum n such that H[n] = 2^n.
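To make the definition concrete, here is a small check (a hypothetical 1-D example of mine, not from the lecture) using threshold functions h_a(x) = sign(x − a): n distinct points on the line can only be labeled in n + 1 ways, so H[n] = n + 1 < 2^n for n ≥ 2 and VC-dim(H) = 1:

```python
def threshold_labelings(points):
    """All distinct labelings of 1-D points achievable by thresholds
    h_a(x) = +1 if x >= a else -1, as the threshold a sweeps the line."""
    pts = sorted(points)
    labelings = set()
    # the threshold can sit in any of the n + 1 gaps between sorted points
    for k in range(len(pts) + 1):
        labelings.add(tuple(-1 if i < k else 1 for i in range(len(pts))))
    return labelings

# H[5] = 6, far below 2^5 = 32
assert len(threshold_labelings([0.3, 1.1, 2.0, 2.5, 4.2])) == 6
# one point is shattered (2 = 2^1 labelings), two points are not (3 < 4)
assert len(threshold_labelings([0.0])) == 2
assert len(threshold_labelings([0.0, 1.0])) == 3
```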

- Theorem 1: O((1/ε)·(VC-dim(H)·log(1/ε) + log(1/δ))) examples suffice to PAC-learn H
- Theorem 2: For a set S of n examples drawn i.i.d. from the underlying distribution D,
- Pr[∃ h ∈ H consistent with S but with true error err_D(h) > ε] ≤ 2·H[2n]·2^(−εn/2)

- Sauer’s lemma: H[n] ≤ Σ_{i=0}^{d} C(n, i) ≤ (en/d)^d for n ≥ d, where d = VC-dim(H)
- Corollary: The number of examples needed to PAC-learn H with VC-dim(H) = d is O((1/ε)·(d·log(1/ε) + log(1/δ)))
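Numerically, Sauer’s lemma says the growth function is polynomial in n rather than exponential once n exceeds d. A quick check of the two bounds (variable names are mine):

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's lemma: H[n] <= sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 5
exact = sauer_bound(n, d)
estimate = (e * n / d) ** d           # the (en/d)^d upper estimate, n >= d
print(exact, estimate, 2 ** n)
assert exact <= estimate              # polynomial in n ...
assert exact < 2 ** n                 # ... hence far below 2^n for n > d
```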
- Let us get back to the case of a halfspace with margin γ
- From the analysis of Perceptron, we know the mistake bound is 1/γ² (which can be transformed into a PAC sample-complexity bound)
- It is then natural to look for a halfspace with as large a margin as possible

- Maximum margin problem:
- maximize γ, such that ‖w‖ = 1, and for all i, y_i(w·x_i) ≥ γ
- This is equivalent to: minimize ‖w‖², such that y_i(w·x_i) ≥ 1 for all i

- What if the data is not linearly separable?
- Add some slack variables s_i
- minimize ‖w‖² + C·Σ_i s_i, such that y_i(w·x_i) ≥ 1 − s_i and s_i ≥ 0 for all i
- This is the optimization problem that the Support Vector Machine (SVM) solves
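At the optimum each slack is s_i = max(0, 1 − y_i(w·x_i)), so the slack problem is equivalent to minimizing a regularized hinge loss. Below is a minimal subgradient-descent sketch of that unconstrained form (the toy data, step size, and constant C are illustrative assumptions of mine, not the solver discussed in class):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i))
    by subgradient descent -- the unconstrained form of the slack problem."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # subgradient: w minus C times the sum of y_i x_i over margin violators
        viol = y * (X @ w) < 1
        grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))   # noisy: not separable
y[y == 0] = 1
w = soft_margin_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
print("train accuracy:", acc)
assert acc > 0.9   # most points classified correctly despite the noise
```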

- For non-linearly-separable data, the mistake bound of Perceptron is 1/γ² + 2L/γ, where L is the total hinge loss (at margin γ) of the best unit-norm separator
- Let w* be any unit vector, let ℓ_i = max(0, γ − y_i(w*·x_i)) be the hinge loss of w* on example i, and let L = Σ_i ℓ_i
- On a mistake, w·w* increases by y_i(x_i·w*) ≥ γ − ℓ_i
- After M mistakes, w·w* ≥ Mγ − L
- (assume ‖x_i‖ ≤ 1)
- After M mistakes, ‖w‖² ≤ M, since each mistake increases ‖w‖² by 2y_i(w·x_i) + ‖x_i‖² ≤ 1
- Combining the above inequalities using w·w* ≤ ‖w‖·‖w*‖ = ‖w‖ (Cauchy–Schwarz), we get Mγ − L ≤ √M
- Square both sides and rearrange, and we get M ≤ 1/γ² + 2L/γ
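This bound can be checked on data with flipped labels. In the sketch below (the data and the 5% noise rate are my own illustrative choices), L is the total hinge loss of a reference unit vector w* at margin γ, and one pass of the Perceptron is guaranteed to make at most 1/γ² + 2L/γ mistakes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i|| = 1
w_star = np.array([1.0, 0.0])                   # reference unit vector
y = np.sign(X @ w_star)
y[y == 0] = 1
flip = rng.random(n) < 0.05                     # 5% label noise
y[flip] *= -1

gamma = 0.1
# total hinge loss of w_star at margin gamma
L = np.sum(np.maximum(0.0, gamma - y * (X @ w_star)))

# one pass of Perceptron, counting mistakes
w = np.zeros(2)
M = 0
for x_i, y_i in zip(X, y):
    if y_i * np.dot(w, x_i) <= 0:
        w += y_i * x_i
        M += 1

bound = 1 / gamma**2 + 2 * L / gamma
print(f"mistakes M = {M}, bound = {bound:.1f}")
assert M <= bound
```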

- Kernels:
- A function K is a legal kernel function if K(x, y) can be written as φ(x)·φ(y) for some feature mapping φ.
- This is equivalent to saying that for every finite set of points x_1, …, x_n, the matrix with entries K(x_i, x_j) is symmetric positive semi-definite.
- E.g. K(x, y) = (x·y)² ⟹ φ(x) = (x_i·x_j)_{i,j}, the vector of all degree-2 monomials
- If K_1 and K_2 are legal kernels and c ≥ 0, then c·K_1 is a legal kernel
- K_1 + K_2 is a legal kernel
- K_1·K_2 is a legal kernel
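These facts can be verified on a finite sample: a legal kernel must produce a symmetric positive semi-definite Gram matrix for any set of points. A small numerical check (the helper names are mine):

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K_ij = kernel(x_i, x_j) on a finite sample."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-9):
    """Symmetric positive semi-definite: all eigenvalues >= 0 (up to tol)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))

k1 = lambda x, y: np.dot(x, y)            # linear kernel
k2 = lambda x, y: (1 + np.dot(x, y))**2   # degree-2 polynomial kernel

assert is_psd(gram(k1, X))
assert is_psd(gram(k2, X))
# closure properties: sums and products of kernels are kernels
assert is_psd(gram(lambda x, y: k1(x, y) + k2(x, y), X))
assert is_psd(gram(lambda x, y: k1(x, y) * k2(x, y), X))
# but a difference of kernels need NOT be a kernel
assert not is_psd(gram(lambda x, y: k1(x, y) - k2(x, y), X))
```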

- Kernel Perceptron:
- If we want to perform a non-linear transform φ on each example x
- The Perceptron update above becomes w ← w + y_i·φ(x_i), so w is always a sum Σ_i y_i·φ(x_i) over the mistaken examples, and the prediction on x is sign(w·φ(x)) = sign(Σ_i y_i·K(x_i, x))
- The algorithm can simply store all the examples (along with their labels) on which it made a mistake. To predict a new example x, we just need to calculate the kernel values K(x_i, x) for each stored x_i and output the sign of Σ_i y_i·K(x_i, x).
- The benefit of using a kernel is that we don’t need to explicitly map each data point to the high-dimensional space and do the inner product there, which can save a lot of time.
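Putting the pieces together, here is a minimal kernel Perceptron along the lines described above (the class and helper names are my own; the XOR-style data is an illustrative case where a quadratic kernel succeeds but no linear separator exists):

```python
import numpy as np

class KernelPerceptron:
    """Perceptron that stores its mistakes (x_i, y_i) and predicts
    sign(sum_i y_i K(x_i, x)) -- implicitly w = sum_i y_i phi(x_i)."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.support = []          # examples on which we made a mistake

    def predict(self, x):
        s = sum(y_i * self.kernel(x_i, x) for x_i, y_i in self.support)
        return 1 if s > 0 else -1

    def fit(self, X, y, epochs=20):
        for _ in range(epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                if self.predict(x_i) != y_i:
                    self.support.append((x_i, y_i))   # "add y_i phi(x_i) to w"
                    mistakes += 1
            if mistakes == 0:
                break
        return self

# XOR-like data: not linearly separable, but separable by a quadratic kernel
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])
clf = KernelPerceptron(lambda a, b: (1 + np.dot(a, b))**2).fit(X, y)
preds = [clf.predict(x) for x in X]
print(preds)
assert preds == list(y)
```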
