# Notes: Statistical learning

• Random Classification Noise Model: The same as in PAC learning, except that when drawing an example, there is a probability of $\eta$ that the class label is flipped.
• Take learning disjunctions as an example:
• Let $\ell(x) = x_1 \vee \dots \vee x_k$ be the true label and $\hat{\ell}(x)$ the observed (possibly flipped) label.
• Let $S\subseteq \{1,\dots,n \}$ be the set of relevant variables of the target disjunction and $q_i = Pr(\ell(x)=1 | x_i = 1)$.
• $Pr(\hat{\ell}(x) =0 |x_i = 1) = \eta q_i + (1-\eta)(1-q_i) = \eta + (1-2\eta)(1-q_i)$, since the observed label is $0$ exactly when the true label is $1$ and was flipped, or is $0$ and was not flipped.
• For $i\in S$, we have $q_i =1$ and $Pr(\hat{\ell}(x) =0 |x_i = 1) = \eta$.
• Assume that $1-q_i \ge \frac{\epsilon}{n}$ for every $i\notin S$. Then $Pr(\hat{\ell}(x) =0 |x_i = 1)$ equals $\eta$ for $i\in S$ and is at least $\eta + (1-2\eta)\frac{\epsilon}{n}$ for $i\notin S$, so to decide whether $i\in S$ it suffices to estimate this probability (by sampling) to within half the gap, i.e. $\pm \frac{\epsilon}{2n}(1-2\eta)$, and threshold. The number of samples required is $poly(n, \frac{1}{\epsilon}, \frac{1}{1-2\eta})$.
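As a sanity check, the thresholding argument above can be simulated. Everything below (the uniform example distribution, the constants, and all names such as `S_hat`) is illustrative, not part of the original argument:

```python
import random

# Simulate learning a monotone disjunction under random classification noise.
# All names and constants here are illustrative.
random.seed(0)

n, eta, m = 20, 0.2, 20000
S = {0, 1, 2}                        # relevant variables of the target disjunction

def label(x):                        # true label: x_0 OR x_1 OR x_2
    return int(any(x[i] for i in S))

# Draw noisy examples: each label is flipped independently with probability eta.
samples = []
for _ in range(m):
    x = [random.randint(0, 1) for _ in range(n)]
    y = label(x)
    if random.random() < eta:
        y = 1 - y
    samples.append((x, y))

# Estimate Pr(observed label = 0 | x_i = 1) for every i.  For i in S this is
# about eta; for irrelevant i it exceeds eta by (1 - 2*eta) * (1 - q_i).
est = []
for i in range(n):
    bucket = [y for x, y in samples if x[i] == 1]
    est.append(sum(1 for y in bucket if y == 0) / len(bucket))

# Any threshold inside the gap separates the two cases with high probability.
threshold = eta + (1 - 2 * eta) * 0.05
S_hat = {i for i in range(n) if est[i] < threshold}
print(S_hat)
```

With these constants the gap is comfortably larger than the sampling error, so `S_hat` recovers the relevant variables with high probability.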
• Statistical Query Model:
• The SQ model can be seen as a restriction of the PAC model in which the learning algorithm cannot access the example oracle directly. Instead, an SQ-learner may only make statistical queries of the form $(\chi, \tau)$, where $\chi$ is a mapping $\chi:X\times \{0,1\}\rightarrow \{0,1\}$; the oracle returns $P_\chi = Pr_{x\in D}[\chi(x,\ell(x))=1]$ up to additive error $\tau$. So the algorithm can ask for the expectation, up to accuracy $\tau$, of any Boolean function of an example and its label. (Boolean functions can be replaced by bounded real-valued functions.)
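A minimal simulated STAT oracle makes the interface concrete. The target concept, the sample-size rule, and every name below are assumptions for illustration, not part of the model's definition:

```python
import random

# Sketch of a simulated STAT oracle: answer a query (chi, tau) with an
# estimate of Pr[chi(x, ell(x)) = 1] accurate to within tau, here by sampling.
random.seed(3)

def make_stat_oracle(draw_example):
    def stat(chi, tau):
        m = int(4 / tau ** 2)        # O(1/tau^2) samples give accuracy tau w.h.p.
        return sum(chi(*draw_example()) for _ in range(m)) / m
    return stat

def draw_example():                  # illustrative target: ell(x) = x_0 AND x_1
    x = [random.randint(0, 1) for _ in range(4)]
    return x, int(x[0] and x[1])

stat = make_stat_oracle(draw_example)
est = stat(lambda x, y: y, 0.05)     # query: how often is the label 1? (truth: 1/4)
print(est)
```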
• Algorithms in SQ-model are resistant to random classification noise.
• The proof idea is to show that each statistical query can be efficiently estimated by sampling from the noisy examples.
• Let $L = \{ x: \chi(x,0) \neq \chi(x,1)\}$ and $\bar{L} = \{ x: \chi(x,0) = \chi(x,1)\}$
• $P_\chi = P(\chi(x,\ell(x))=1 | x\in L)P(x\in L) +P(\chi(x,\ell(x))=1 | x\in \bar{L})P(x \in \bar{L})$.
• $P(x\in L)$ can be estimated by sampling from the noisy examples (ignoring the labels) and computing the fraction of samples with $\chi(x,0) \neq \chi(x,1)$.
• Note that $P(\chi(x,\ell(x))=1 | x\in \bar{L}) = P(\chi(x,\hat{\ell}(x))=1 | x\in \bar{L})$, since on $\bar{L}$ the value of $\chi$ does not depend on the label; so it can be estimated directly from the noisy examples.
• Therefore, it suffices to show $p = P(\chi(x,\ell(x))=1 | x\in L)$ can be estimated efficiently.
• $p_\eta = P(\chi(x,\hat{\ell}(x))=1 | x\in L) = p(1-\eta) + (1-p)\eta = \eta + (1-2\eta)p$. Rearranging, we get $p = \frac{p_\eta - \eta}{1-2\eta}$.
• Therefore, to obtain $p$ within tolerance $\tau$, we only need to estimate $p_\eta$ to within additive error $\tau(1-2\eta)$.
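The whole noise-correction argument can be sketched end to end. The target concept, the particular query $\chi$, and the sample sizes below are illustrative assumptions:

```python
import random

# Estimate P_chi from noisy examples via the split into L and its complement.
random.seed(1)

eta, m = 0.25, 50000

def chi(x, y):                       # example query: "label is 1 and x_0 = 1"
    return int(y == 1 and x[0] == 1)

def draw_noisy():                    # target ell(x) = x_0 OR x_1, flipped w.p. eta
    x = [random.randint(0, 1) for _ in range(5)]
    y = int(x[0] or x[1])
    if random.random() < eta:
        y = 1 - y
    return x, y

data = [draw_noisy() for _ in range(m)]

# L = {x : chi(x,0) != chi(x,1)}; membership needs no labels.
in_L    = [(x, y) for x, y in data if chi(x, 0) != chi(x, 1)]
in_Lbar = [(x, y) for x, y in data if chi(x, 0) == chi(x, 1)]

P_L   = len(in_L) / m
p_eta = sum(chi(x, y) for x, y in in_L) / len(in_L)        # uses noisy labels
p     = (p_eta - eta) / (1 - 2 * eta)                      # invert the noise on L
P_bar = sum(chi(x, y) for x, y in in_Lbar) / len(in_Lbar)  # label-independent on L-bar

P_chi_hat = p * P_L + P_bar * (1 - P_L)
print(P_chi_hat)                     # true value here is Pr(x_0 = 1) = 1/2
```

Here $L = \{x : x_0 = 1\}$ and $x_0 = 1$ forces $\ell(x) = 1$, so the true $P_\chi$ is $1/2$; the noise-corrected estimate lands close to it.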
• E.g. Perceptron in the SQ-model:
• Take labels $\ell(x)\in\{-1,+1\}$. For each coordinate $j$, ask for $p_j = E[\ell(x)\, x_j \cdot \mathbf{1}[\ell(x)(w\cdot x) \le 0]]$ to within $\pm \tau$ (a bounded real-valued query). Then update $w \leftarrow w + p$: the expected perceptron update over the misclassified region, instead of an update on a single mistake.
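The perceptron step above can be simulated with the queries answered by plain sampling in place of a real STAT oracle. The target halfspace and all constants are illustrative assumptions:

```python
import random

# Perceptron driven by (simulated) statistical queries: each step adds the
# estimated expected update E[y * x_j * 1{y (w . x) <= 0}] per coordinate,
# instead of updating on individual mistakes.  Target and constants are
# illustrative.
random.seed(2)

d, m = 3, 4000
w_star = [1.0, -2.0, 0.5]            # unknown target halfspace

def draw():
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = 1 if sum(a * b for a, b in zip(w_star, x)) >= 0 else -1
    return x, y

def sq_update(w):
    # Estimate E[y * x_j * 1{y (w . x) <= 0}] by sampling (a stand-in for a
    # real-valued statistical query with tolerance ~ 1/sqrt(m)).
    total = [0.0] * d
    for _ in range(m):
        x, y = draw()
        if y * sum(a * b for a, b in zip(w, x)) <= 0:
            for j in range(d):
                total[j] += y * x[j]
    return [t / m for t in total]

w = [0.0] * d
for _ in range(50):
    p = sq_update(w)
    w = [wj + pj for wj, pj in zip(w, p)]

# Error of the learned halfspace against the target on fresh samples.
tests = [draw() for _ in range(2000)]
err = sum((1 if sum(a * b for a, b in zip(w, x)) >= 0 else -1) != y
          for x, y in tests) / len(tests)
print(err)
```

Because each step uses an expectation rather than a single example, the per-query sampling noise plays the role of the tolerance $\tau$, and the learned $w$ aligns with the target direction over the iterations.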