In supervised learning, the input is a set of labeled examples , and the goal is to learn the target function based on these examples. Here we assume and .
E.g. Conjunctions, i.e. . We can find a consistent conjunction by the following method:
Eliminate any that is set to 0 in any positive example (label ).
Eliminate any if is set to 1 in any positive example.
Take the conjunction of all that are left.
To say that our algorithms can “generalize” well for future examples, we use the PAC model.
Given i.i.d. examples from an unknown distribution , and labeled by , a PAC learning algorithm finds a hypothesis with probability at least , such that , in time polynomial in , the dimension of each example and the description length of the target hypothesis. (We will talk about VC in detail soon).
Let us show that the above algorithm PAC-learns the conjunction class.
Consider some “bad” conjunction whose error is at least . The probability that it is consistent with examples drawn from is at most .
There are possible conjunctions.
Therefore, the probability that there exists a “bad” conjunction consistent with all examples is at most .
We want this probability to be small, such that , which means . Note that we used the inequality .
Let us consider another function class, decision lists. A decision list is a function of the form: “if then , else if then , …, else ,” where is a literal (either a variable or its negation) and each .
One algorithm to find a consistent decision list is as follows. In each iteration, find a rule consistent with all the examples. Eliminate those examples that fire the rule. Continue this process until there is no examples left. Note that in each iteration at least one example will be eliminated, so the algorithm terminates in at most steps.
Note that there are at most decision list, and we can use the same method above to find the number of examples we need to PAC-learn this function class. In general, for a function class , we need .
For the function class of decision lists with at most rules. , and so . However, It remains an open problem to design an efficient algorithm that achieves this sample complexity bound.
Mistake bound model:
An example is presented in each round chosen by an adversary; the algorithm predicts its label before seeing the true label; then the true label is revealed. The goal is to minimize the number of total mistakes.
A simple (although inefficient) algorithm is to predict the majority of the predictions of all consistent hypotheses. Whenever this majority algorithm makes a mistake, it can eliminate at least half of the remaining hypotheses, so it will make at most mistakes.
We can convert an algorithm with a mistake bound of to a sample complexity bound of in PAC learning.
Run the algorithm until it produces a hypothesis that survives examples.
The probability that a hypothesis with error rate survives examples is at most
There are at most hypotheses generated by the algorithm, so, by union bound, the probability that the final hypothesis output by the above procedure has error rate is at most .
If we don’t know , we can instead produce a hypothesis that survives examples after i-th mistake. Then, by union bound, the probability that the final hypothesis output by this algorithm has error rate is at most . The sample complexity is .