Weighted Majority, Winnow

(based on Avrim Blum’s notes)

Predicting from expert advice

  • In each round, each of the n experts makes a prediction in \{0,1\}
  • The learner makes a prediction based on the experts’ advice
  • The learner observes the actual outcome
  • The goal is to predict nearly as well as the best expert in hindsight
    • M:= \# mistakes by the learner
    • m:= \# mistakes by the best expert in hindsight
  • If there is a perfect expert, then by predicting the majority vote of the remaining experts and removing every erring expert in each round, M \le \log_2 n.
  • If there is no perfect expert, we can restart the above process whenever all the experts have been removed. Every expert, in particular the best one, makes at least one mistake in each completed phase, so there are at most m completed phases and M \le (m+1)\log_2 n = O(m\log n). A sketch of this halving scheme appears below.
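
Here is a minimal Python sketch of the halving scheme above (assuming a perfect expert exists; the function names and the 0/1 encoding are mine, not from the notes):

    # Halving: keep only the experts consistent with every outcome seen so far.
    # "active" is the set of indices of surviving experts; advice maps index -> {0, 1}.

    def halving_predict(active, advice):
        """Predict the majority vote of the currently active experts."""
        ones = sum(advice[i] for i in active)
        return 1 if 2 * ones >= len(active) else 0

    def halving_update(active, advice, outcome):
        """Remove every active expert whose advice disagreed with the outcome."""
        return {i for i in active if advice[i] == outcome}

    # Usage: start with active = set(range(n)); each round call halving_predict,
    # observe the outcome, then set active = halving_update(active, advice, outcome).
    # Each learner mistake removes at least half of the active experts.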

Weighted Majority

  • Set w_i=1,  \forall i\in\{1,\dots,n\},  W = \sum_i w_i
  • On a mistake, for every expert E_i who predicted incorrectly, set w_i \leftarrow \frac{w_i}{2}
  • On a mistake, the erring experts hold at least half of the total weight, and each of their weights is halved, so W drops to at most \frac{3}{4} of its previous value.
  • After M mistakes, W \le n(\frac{3}{4})^M
  • On the other hand, the best expert has weight \ge (\frac{1}{2})^m.
  • Since the total weight is at least the weight of any single expert, combining the two inequalities gives (\frac{1}{2})^m \le n(\frac{3}{4})^M, i.e., M\log_2(4/3) \le m + \log_2 n, or M \le 2.5(m + \log_2 n). A sketch of the algorithm appears below.
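
A minimal Python sketch of deterministic Weighted Majority (as in the notes, weights are only updated on rounds where the learner errs; the function names are mine):

    def wm_predict(weights, advice):
        """Predict the weighted majority vote of the experts."""
        total = sum(weights)
        weight_on_one = sum(w for w, a in zip(weights, advice) if a == 1)
        return 1 if 2 * weight_on_one >= total else 0

    def wm_update(weights, advice, outcome, prediction):
        """On a learner mistake, halve the weight of every expert that erred."""
        if prediction == outcome:
            return weights
        return [w / 2 if a != outcome else w for w, a in zip(weights, advice)]

    # Usage: weights = [1.0] * n; each round call wm_predict, observe the outcome,
    # then wm_update.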

Randomized Weighted Majority

  • Instead of predicting the majority, pick expert i randomly with probability \frac{w_i}{W} and follow its advice
  • Fix 0 < \epsilon < 1/2. In each time step, for each erring expert, set w_i \leftarrow (1-\epsilon)w_i
  • At time t, let f_t be the fraction of the total weight on the experts that make a mistake (equivalently, the probability that the learner errs at time t).
  • The total weight W at time T is n\prod_{t=1}^T (1-\epsilon f_t), so \ln W = \ln n + \sum_t \ln(1-\epsilon f_t) \le \ln n - \epsilon \sum_t f_t =\ln n - \epsilon E(M) , where we use the fact that 1+x \le e^x and E(M) = \sum_t f_t
  • On the other hand, the weight of the best expert is (1-\epsilon)^m, and W is at least this weight, so m\ln(1-\epsilon) \le \ln W \le \ln n - \epsilon E(M). Rearranging, and using -\ln(1-\epsilon) \le \epsilon + \epsilon^2 for \epsilon \le 1/2, gives E(M) \le (1+\epsilon) m + \frac{1}{\epsilon}\ln n
  • Choose \epsilon = \sqrt{\frac{\ln n}{m}}, then E(M) \le m + 2\sqrt{m\cdot \ln n}
  • Since m\le T, where T is the total number of rounds, we have \frac{E(M)}{T} \le \frac{m}{T} + 2\sqrt{\frac{\ln n}{T}}. The per-round regret \frac{E(M)-m}{T} \le 2\sqrt{\frac{\ln n}{T}} goes to 0 as T grows, so the learner's expected mistake rate approaches that of the best expert. A sketch of the randomized algorithm appears below.
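
A minimal Python sketch of Randomized Weighted Majority (eps is the \epsilon above; the function names are mine):

    import random

    def rwm_predict(weights, advice, rng=random):
        """Follow expert i with probability w_i / W."""
        (choice,) = rng.choices(advice, weights=weights, k=1)
        return choice

    def rwm_update(weights, advice, outcome, eps):
        """Each round, multiply the weight of every erring expert by (1 - eps)."""
        return [w * (1 - eps) if a != outcome else w
                for w, a in zip(weights, advice)]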

Winnow  

Learn an OR of r out of n variables

  • Initialize w_i=1,  \forall i\in\{1,\dots,n\}. Predict + if w \cdot x \ge n, otherwise -.
  • Mistake on a positive example: w_i \leftarrow 2w_i ,  \forall i: x_i =1
  • Mistake on a negative example: w_i \leftarrow \frac{w_i}{2} ,  \forall i: x_i =1 (could set w_i = 0).
  • The weights of the r relevant variables are never demoted (on a negative example, every relevant variable is 0), and each can be doubled at most 1+\log_2 n times before reaching n (a weight is only doubled while it is below n), so the number of mistakes on positive examples is at most r(1+\log_2 n).
  • For each mistake on a positive example, W increases by at most n (the doubled weights sum to w\cdot x < n); for each mistake on a negative example, W decreases by at least \frac{n}{2} (the halved weights sum to w\cdot x \ge n). Since W starts at n and never drops below zero, (#mistakes on -) \le 2(#mistakes on +) + 2.
  • Therefore, the total number of mistakes is O(r\log n). A sketch of the update rule appears below.
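
A minimal Python sketch of Winnow for the OR case (alpha = 2 is the doubling/halving rule above; the parameter is only there so the same sketch covers the (1+\epsilon) variant in the next section; the names are mine):

    def winnow_predict(weights, x, n):
        """Predict + (True) iff the weighted sum of the active variables is at least n."""
        return sum(w for w, xi in zip(weights, x) if xi == 1) >= n

    def winnow_update(weights, x, label, prediction, alpha=2.0):
        """Promote active weights on a false negative, demote them on a false positive."""
        if prediction == label:
            return weights
        factor = alpha if label else 1.0 / alpha
        return [w * factor if xi == 1 else w for w, xi in zip(weights, x)]

    # Usage: weights = [1.0] * n; for each example x call winnow_predict, observe
    # the label, then winnow_update.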

k-out-of-r majority-vote function

  • Initialize w_i=1,  \forall i\in\{1,\dots,n\}. Predict + if w\cdot x \ge n, otherwise -.
  • Mistake on a positive example: w_i \leftarrow (1+\epsilon)w_i ,  \forall i: x_i =1
  • Mistake on a negative example: w_i \leftarrow \frac{w_i}{(1+\epsilon)},  \forall i: x_i =1
  • Let M_+ be the number of mistakes on positives and M_- the number of mistakes on negatives. Then,
  • k M_+ -(k-1)M_- \le r \log_{1+\epsilon} n, since the total number of (1+\epsilon) factors that can go on the r relevant variables is at most r \log_{1+\epsilon} n.  Also,
  • n + (\epsilon n) M_+ \ge \frac{\epsilon n}{1+\epsilon}M_- since the total weight remains positive. Thus, M_- \le (1+\epsilon)M_+ + \frac{1+\epsilon}{\epsilon}. Using this in the first inequality,
  • (k -(k-1)(1+\epsilon))M_+ \le r \log_{1+\epsilon} n + (k-1)\frac{1+\epsilon}{\epsilon}
  • Setting \epsilon = \frac{1}{2(k-1)} gives M_+ = O(rk\log n) (see the quick check below), and M_- has the same bound.
  • Total #mistakes =O(rk\log n).
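
A quick check of the last step (my calculation, using \ln(1+\epsilon) \ge \epsilon/2 for 0 < \epsilon \le 1 and k \le r): with \epsilon = \frac{1}{2(k-1)}, we have k-(k-1)(1+\epsilon) = 1-(k-1)\epsilon = \frac{1}{2} and \log_{1+\epsilon} n = \frac{\ln n}{\ln(1+\epsilon)} \le \frac{2\ln n}{\epsilon} = 4(k-1)\ln n, so M_+ \le 2\left(4r(k-1)\ln n + (k-1)\frac{1+\epsilon}{\epsilon}\right) = O(rk\log n + k^2) = O(rk\log n). In the sketch above, this variant corresponds to calling winnow_update with alpha = 1+\epsilon.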

Halfspaces

  • The OR function is a halfspace: x_1+\dots+x_r \ge 1
  • The k-out-of-r majority-vote function is a halfspace: x_1+\dots+x_r \ge k
  • Consider a general halfspace w^*_1 x_1 + \dots + w^*_n x_n \ge w^*_0
  • We can scale so that all weights are integers, and assume all are
    non-negative (by introducing new variables y_i = 1-x_i).
  • Duplicate each variable W=\sum_{i=1}^n w^*_i times; the problem becomes a w^*_0-out-of-W majority-vote problem over nW variables (the relevant copies are w^*_i copies of x_i for each i), so the mistake bound is O(W\cdot w^*_0 \log(nW)) = O(W^2 \log(nW)). See the worked example below.
  • This can be generalized to any w^*, x\in R^n. The general mistake bound is O\left(\frac{||w^*||_1^2\, ||X||_\infty^2}{\gamma^2}\log n\right), where \gamma = \min_x |w^* \cdot x| and ||X||_\infty = \max_x ||x||_\infty.
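
An illustrative instance of the reduction (my example, not from the notes): the halfspace 2x_1 + 3x_2 + x_3 \ge 3 over n = 3 variables has W = 6. Making W copies of each variable gives nW = 18 variables, and the halfspace is equivalent to requiring that at least 3 of the 6 relevant copies (two copies of x_1, three of x_2, one of x_3) be 1, i.e., a 3-out-of-6 majority-vote problem, with mistake bound O(6\cdot 3\cdot \log 18) = O(W^2\log(nW)).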

Winnow vs. Perceptron

  • E.g., x\in\{0,1\}^n, so ||X||^2_2 \sim n and ||X||_\infty^2 =1. If w^* = (1,\dots,1,0,\dots,0) with k ones, then ||w^*||^2_2 = k and ||w^*||^2_1 = k^2.
    • Winnow: O(\frac{k^2\log n}{\gamma^2})
    • Perceptron: O(\frac{nk}{\gamma^2})
  • E.g., w^*=(\frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}}), then ||w^*||^2_2 = 1 and ||w^*||^2_1 = n. Assume ||X||^2_2\le 1 and ||X||_\infty^2 =1.
    • Winnow: O(\frac{n\log n}{\gamma^2})
    • Perceptron: O(\frac{1}{\gamma^2})