PAC learning

  • In supervised learning, the input is a set of labeled examples (x_1, \ell(x_1)), \dots (x_m, \ell(x_m)) , and the goal is to learn the target function \ell based on these examples. Here we assume x\in X = \{0,1\}^n and \ell\in \mathcal{H}: X \rightarrow \{ 0,1\}.
  • E.g. \mathcal{H}= Conjunctions, i.e. \ell(x) = x_{i_1} \wedge x_{i_2} \wedge \bar{x}_{i_3}, \dots. We can find a consistent conjunction by the following method:
    1. Eliminate any x_i that is set to 0 in any positive example (label 1).
    2. Eliminate any \bar{x}_i if x_i is set to 1 in any positive example.
    3. Take the conjunction of all that are left.
  • To say that our algorithms can “generalize” well for future examples, we use the PAC model.
  • Definition: (\epsilon, \delta)-PAC-learning:
    • Given i.i.d. examples from an unknown distribution \mathcal{D}, and labeled by \ell\in\mathcal{H}, a PAC learning algorithm finds a hypothesis h with probability at least 1-\delta, such that Pr_{\mathcal{D}}\left( h(x) \neq \ell(x) \right) \le \epsilon, in time polynomial in \frac{1}{\epsilon}, \log\frac{1}{\delta}, the dimension of each example and the description length of the target hypothesis. (We will talk about VC(\mathcal{H}) in detail soon).
  • Let us show that the above algorithm PAC-learns the conjunction class.
    • Consider some “bad” conjunction whose error is at least \epsilon. The probability that it is consistent with m examples drawn from \mathcal{D} is at most (1-\epsilon)^m.
    • There are 3^n possible conjunctions.
    • Therefore, the probability that there exists a “bad” conjunction consistent with all m examples is at most 3^n(1-\epsilon)^m.
    • We want this probability to be small, such that 3^n(1-\epsilon)^m \le \delta, which means m\ge \frac{1}{\epsilon} \left[ n\ln(3) + \ln\frac{1}{\delta} \right]. Note that we used the inequality 1-x \le e^{-x}.
  • Let us consider another function class, decision lists. A decision list is a function of the form: “if \ell_1 then b_1, else if \ell_2 then b_2, …, else b_k,” where \ell_i is a literal (either a variable or its negation) and each b_i \in \{0, 1\}.
    • One algorithm to find a consistent decision list is as follows. In each iteration, find a rule consistent with all the examples. Eliminate those examples that fire the rule. Continue this process until there is no examples left. Note that in each iteration at least one example will be eliminated, so the algorithm terminates in at most m steps.
  • Note that there are at most n!4^n decision list, and we can use the same method above to find the number of examples m we need to PAC-learn this function class. In general, for a function class \mathcal{H}, we need m\ge \frac{1}{\epsilon} \left[ \ln(|\mathcal{H}|) + \ln\frac{1}{\delta} \right].
  • For the function class of decision lists with at most k rules. |\mathcal{H}| = \left( \begin{matrix} n \\ k \end{matrix} \right)4^k\cdot k! \le n^k 4^k, and so \log(|\mathcal{H}|) = O(k\log n). However, It remains an open  problem to design an efficient algorithm that achieves this sample complexity bound.
  • Mistake bound model:
    • An example is presented in each round chosen by an adversary; the algorithm predicts its label before seeing the true label; then the true label is revealed. The goal is to minimize the number of total mistakes.
    • A simple (although inefficient) algorithm is to predict the majority of the predictions of all consistent hypotheses. Whenever this majority algorithm makes a mistake, it can eliminate at least half of the remaining hypotheses, so it will make at most \log(|\mathcal{H}|) mistakes.
  • We can convert an algorithm with a mistake bound of M to a sample complexity bound of \frac{M}{\epsilon}\ln\frac{M}{\delta} in PAC learning.
    • Run the algorithm until it produces a hypothesis that survives \frac{1}{\epsilon}\ln\frac{M}{\delta} examples.
    • The probability that a hypothesis with error rate > \epsilon survives \frac{1}{\epsilon}\ln\frac{M}{\delta} examples is at most \frac{\delta}{M}
    • There are at most M hypotheses generated by the algorithm, so, by union bound, the probability that the final hypothesis output by the above procedure has  error rate > \epsilon is at most \delta.
    • If we don’t know M, we can instead produce  a hypothesis that survives \frac{1}{\epsilon}\ln\frac{(i+2)^2}{\delta} examples after i-th mistake. Then, by union bound, the probability that the final hypothesis output by this algorithm has error rate > \epsilon is at most \sum_{i=1}^\infty \frac{\delta}{(i+2)^2} \le \delta. The sample complexity is \frac{1}{\epsilon} \sum_{i=1}^M\log\frac{(i+2)^2}{\delta} \le\frac{M}{\epsilon}\log\frac{M^2}{\delta} .


  1. The bound on the size of the hypothesis space of decision lists with at most k rules should be n^k*4^k instead of n^4*4^k.

Comments are closed.