# PAC learning

• In supervised learning, the input is a set of labeled examples $(x_1, \ell(x_1)), \dots, (x_m, \ell(x_m))$, and the goal is to learn the target function $\ell$ from these examples. Here we assume $x\in X = \{0,1\}^n$ and $\ell\in \mathcal{H}$, where $\mathcal{H}$ is a class of functions from $X$ to $\{ 0,1\}$.
• E.g. $\mathcal{H}=$ conjunctions, i.e. $\ell(x) = x_{i_1} \wedge x_{i_2} \wedge \bar{x}_{i_3} \wedge \dots$. Starting from the conjunction of all $2n$ literals, we can find a consistent conjunction by the following method:
1. Eliminate any $x_i$ that is set to 0 in any positive example (label $1$).
2. Eliminate any $\bar{x}_i$ if $x_i$ is set to 1 in any positive example.
3. Take the conjunction of all literals that remain.
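The elimination procedure above can be sketched in Python (a minimal sketch; the `(index, negated)` encoding of literals is our own illustrative choice):

```python
def learn_conjunction(examples):
    """Learn a conjunction consistent with labeled examples.

    examples: list of (x, y), where x is a tuple of 0/1 values and y in {0, 1}.
    Returns the surviving literals as a set of (index, negated) pairs,
    e.g. (2, True) encodes the negated literal on variable 2.
    """
    n = len(examples[0][0])
    # Start with all 2n literals: x_i and its negation for every index.
    literals = {(i, neg) for i in range(n) for neg in (False, True)}
    for x, y in examples:
        if y == 1:  # positive example: drop every literal it violates
            literals -= {(i, False) for i in range(n) if x[i] == 0}
            literals -= {(i, True) for i in range(n) if x[i] == 1}
    return literals

def evaluate(literals, x):
    # The conjunction fires iff every surviving literal is satisfied.
    return int(all((x[i] == 0) if neg else (x[i] == 1) for i, neg in literals))
```

Note that negative examples are never used: the output is the most specific conjunction consistent with the positive examples, so it can only err by predicting $0$ on future positives.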
• To say that our algorithms can “generalize” well for future examples, we use the PAC model.
• Definition: ($\epsilon, \delta$)-PAC-learning:
• Given i.i.d. examples drawn from an unknown distribution $\mathcal{D}$ and labeled by some $\ell\in\mathcal{H}$, a PAC-learning algorithm outputs, with probability at least $1-\delta$, a hypothesis $h$ such that $\Pr_{\mathcal{D}}\left( h(x) \neq \ell(x) \right) \le \epsilon$, in time polynomial in $\frac{1}{\epsilon}$, $\log\frac{1}{\delta}$, the dimension of each example, and the description length of the target hypothesis. (We will discuss VC$(\mathcal{H})$ in detail soon.)
• Let us show that the above algorithm PAC-learns the conjunction class.
• Consider some “bad” conjunction, i.e. one whose error under $\mathcal{D}$ is at least $\epsilon$. The probability that it is consistent with $m$ i.i.d. examples drawn from $\mathcal{D}$ is at most $(1-\epsilon)^m$.
• There are $3^n$ possible conjunctions.
• Therefore, the probability that there exists a “bad” conjunction consistent with all $m$ examples is at most $3^n(1-\epsilon)^m$.
• We want this probability to be at most $\delta$. Using the inequality $1-x \le e^{-x}$, we have $3^n(1-\epsilon)^m \le 3^n e^{-\epsilon m} \le \delta$ whenever $m\ge \frac{1}{\epsilon} \left[ n\ln(3) + \ln\frac{1}{\delta} \right]$.
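As a quick numerical check of this bound (the function name is ours):

```python
import math

def conjunction_sample_size(n, eps, delta):
    """Examples sufficient so that, with probability >= 1 - delta, no
    conjunction with error > eps is consistent with all of them:
    m >= (n ln 3 + ln(1/delta)) / eps."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)
```

For instance, `conjunction_sample_size(20, 0.1, 0.05)` returns 250: the sample size grows only linearly in $n$ and logarithmically in $\frac{1}{\delta}$.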
• Let us consider another function class, decision lists. A decision list is a function of the form: “if $\ell_1$ then $b_1$, else if $\ell_2$ then $b_2$, …, else $b_k$,” where $\ell_i$ is a literal (either a variable or its negation) and each $b_i \in \{0, 1\}$.
• One algorithm to find a consistent decision list works as follows. In each iteration, find a rule “if $\ell_i$ then $b_i$” that fires on at least one remaining example and is consistent with every remaining example it fires on, append it to the list, and eliminate the examples that fire the rule. Continue this process until no examples are left. Since each iteration eliminates at least one example, the algorithm terminates in at most $m$ iterations.
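This greedy procedure (Rivest's algorithm) can be sketched as follows; the `(index, negated, output)` rule encoding is our own illustrative choice:

```python
def learn_decision_list(examples, n):
    """Greedily build a decision list consistent with the examples.

    examples: list of (x, y), x a tuple of n bits, y in {0, 1}.
    Returns (rules, default): rules is a list of (index, negated, output)
    triples, assuming some consistent decision list exists.
    """
    def fires(i, neg, x):
        return x[i] == (0 if neg else 1)

    remaining = list(examples)
    rules = []
    while remaining:
        labels = {y for _, y in remaining}
        if len(labels) == 1:            # all remaining examples agree:
            return rules, labels.pop()  # use that label as the default
        found = False
        for i in range(n):
            for neg in (False, True):
                covered = [y for x, y in remaining if fires(i, neg, x)]
                # A rule is usable if it fires on >= 1 remaining example
                # and all examples it fires on share the same label.
                if covered and len(set(covered)) == 1:
                    rules.append((i, neg, covered[0]))
                    remaining = [(x, y) for x, y in remaining
                                 if not fires(i, neg, x)]
                    found = True
                    break
            if found:
                break
        if not found:
            raise ValueError("no consistent decision list exists")
    return rules, 0  # remaining became empty; default is arbitrary

def predict(dl, x):
    rules, default = dl
    for i, neg, out in rules:
        if x[i] == (0 if neg else 1):
            return out
    return default
```

The inner search tries all $2n$ literals, so each iteration costs $O(nm)$ and the whole run costs $O(nm^2)$.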
• Note that there are at most $n!4^n$ decision lists, and we can use the same method as above to bound the number of examples $m$ needed to PAC-learn this function class. In general, for a finite function class $\mathcal{H}$, $m\ge \frac{1}{\epsilon} \left[ \ln(|\mathcal{H}|) + \ln\frac{1}{\delta} \right]$ examples suffice.
• For the function class of decision lists with at most $k$ rules, $|\mathcal{H}| \le \binom{n}{k} 4^k \cdot k! \le n^k 4^k$, so $\log(|\mathcal{H}|) = O(k\log n)$. However, it remains an open problem to design an efficient algorithm that achieves this sample complexity bound.
• Mistake bound model:
• In each round, an adversary presents an example; the algorithm predicts its label before seeing the true label; then the true label is revealed. The goal is to minimize the total number of mistakes.
• A simple (although inefficient) algorithm is to predict the majority vote of all hypotheses consistent with the examples seen so far. Whenever this majority (“halving”) algorithm makes a mistake, it eliminates at least half of the remaining hypotheses, so it makes at most $\log_2(|\mathcal{H}|)$ mistakes.
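The halving algorithm fits in a few lines (a sketch; hypotheses are represented simply as callables):

```python
def halving_run(hypotheses, stream):
    """Run the halving (majority-vote) algorithm on a labeled example stream.

    hypotheses: list of functions x -> 0/1, one of which labels the stream.
    Returns the number of mistakes made, which is at most
    log2(len(hypotheses)) since each mistake halves the version space.
    """
    consistent = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in consistent)
        prediction = int(votes * 2 >= len(consistent))  # majority vote
        if prediction != y:
            mistakes += 1
        # Keep only the hypotheses that got this example right.
        consistent = [h for h in consistent if h(x) == y]
    return mistakes
```

Note that consistent hypotheses are also discarded when the majority happens to be correct, which is why the bound depends only on the number of mistakes, not the number of rounds.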
• We can convert an algorithm with a mistake bound of $M$ into a PAC-learning algorithm with sample complexity $\frac{M}{\epsilon}\ln\frac{M}{\delta}$.
• Run the algorithm until it produces a hypothesis that survives $\frac{1}{\epsilon}\ln\frac{M}{\delta}$ examples.
• The probability that a hypothesis with error rate $> \epsilon$ survives $\frac{1}{\epsilon}\ln\frac{M}{\delta}$ examples is at most $(1-\epsilon)^{\frac{1}{\epsilon}\ln\frac{M}{\delta}} \le \frac{\delta}{M}$.
• There are at most $M$ hypotheses generated by the algorithm, so, by the union bound, the probability that the final hypothesis output by the above procedure has error rate $> \epsilon$ is at most $\delta$.
• If we don’t know $M$, we can instead wait for a hypothesis that survives $\frac{1}{\epsilon}\ln\frac{(i+2)^2}{\delta}$ examples after the $i$-th mistake. Then, by the union bound, the probability that the final hypothesis output by this algorithm has error rate $> \epsilon$ is at most $\sum_{i=1}^\infty \frac{\delta}{(i+2)^2} \le \delta$. The sample complexity is $\frac{1}{\epsilon} \sum_{i=1}^M\ln\frac{(i+2)^2}{\delta} = O\!\left(\frac{M}{\epsilon}\ln\frac{M}{\delta}\right)$.
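The known-$M$ conversion can be sketched as follows; `ConsistentLearner` is a toy online learner we invent for illustration, with mistake bound $|\mathcal{H}|-1$:

```python
import math

class ConsistentLearner:
    """Toy mistake-bounded learner: predict with the first hypothesis still
    consistent with every counterexample seen (mistake bound |H| - 1, since
    each mistake eliminates at least the current hypothesis)."""
    def __init__(self, hypotheses):
        self.hs = list(hypotheses)

    def predict(self, x):
        return self.hs[0](x)

    def update(self, x, y):
        self.hs = [h for h in self.hs if h(x) == y]

def mb_to_pac(learner, sampler, eps, delta, M):
    """Mistake-bound -> PAC conversion: keep drawing i.i.d. examples and stop
    once the current hypothesis survives (1/eps) ln(M/delta) of them in a row."""
    threshold = math.ceil(math.log(M / delta) / eps)
    survived = 0
    while survived < threshold:
        x, y = sampler()
        if learner.predict(x) == y:
            survived += 1
        else:
            learner.update(x, y)   # mistake: the learner moves on to
            survived = 0           # a new hypothesis; restart the count
    return learner
```

Each of the at most $M$ hypotheses consumes at most $\frac{1}{\epsilon}\ln\frac{M}{\delta}$ examples before it either survives or makes a mistake, which gives the stated sample complexity.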