Notes: Statistical learning

  • Random Classification Noise Model: The same as in PAC learning, except that when drawing an example, there is a probability of \eta that the class label is flipped.
  • Take learning disjunctions as an example:
    • Let \ell(x) = x_1 \vee \dots \vee x_k be the true label and \hat{\ell}(x) the observed (probably flipped) label.
    • Let S\subseteq \{1,\dots,n \} be the set of relevant literals in the target function and q_i = Pr(\ell(x)=1 | x_i = 1).
    • Pr(\hat{\ell}(x) =0 |x_i = 1) = \eta q_i + (1-\eta)(1-q_i) = \eta + (1-2\eta)(1-q_i)
    • For i\in S, we have q_i =1 and Pr(\hat{\ell}(x) =0 |x_i = 1) = \eta.
    • Assume for i\notin S, 1-q_i \ge \frac{\epsilon}{n}. Then, to identify if i\in S, we only need to estimate (by sampling) Pr(\hat{\ell}(x) =0 |x_i = 1) to within \pm \frac{\epsilon}{n}(1-2\eta). The number of samples required is poly(n, \frac{1}{\epsilon}, \frac{1}{1-2\eta}).
  • Statistical Query Model:
    • SQ-Model can be seen as a restriction of PAC Model, where the learning algorithm cannot access directly to the example oracle. Instead, a SQ-learner can only make statistical queries of the form (\chi, \tau), which receives P_\chi = Pr_{x\in D}[\chi(x,\ell(x))=1] within additive error \tau, where \chi is a mapping \chi:X\times \{0,1\}\rightarrow \{0,1\}. So the statistical algorithm can ask for the expectation, up to accuracy \tau of any Boolean function of an example and its label. (We can replace Boolean function with bounded real-valued function).
    • Algorithms in SQ-model are resistant to random classification noise.
      • The proof idea is to show that each statistical query can be efficiently estimated by sampling from the noisy examples.
      • Let L = \{ x: \chi(x,0) \neq \chi(x,1)\} and \bar{L} = \{ x: \chi(x,0) = \chi(x,1)\}
      • P_\chi = P(\chi(x,\ell(x))=1 | x\in L)P(x\in L) +P(\chi(x,\ell(x))=1 | x\in \bar{L})P(x \in \bar{L}).
      • P(x\in L) can be estimated by sampling from the noisy examples (ignore the labels) and calculate the fraction of \chi(x,0) \neq \chi(x,1).
      • Note that P(\chi(x,\ell(x))=1 | x\in \bar{L}) = P(\chi(x,\hat{\ell}(x))=1 | x\in \bar{L}), and so can be estimated from the noisy examples.
      • Therefore, it suffices to show p = P(\chi(x,\ell(x))=1 | x\in L) can be estimated efficiently.
      • p_\eta = P(\chi(x,\hat{\ell}(x))=1 | x\in L) = p(1-\eta) + (1-p)\eta. Rearrange, we get p = \frac{p_\eta - \eta}{1-2\eta}
      • Therefore, we only need to estimate p_\eta to within additive error of \epsilon(1-2\eta).
    • E.g. Perceptron in SQ-model
      • Ask for p = P(\ell(x) x | \ell(x) (w\cdot x) < 0) to within \pm \tau. Then update w \leftarrow w + p.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s