# Weighted Majority, Winnow

(based on Avrim Blum’s notes)

• In each round, each of the $n$ experts makes a prediction in $\{0,1\}$
• The learner makes a prediction based on the experts’ advice
• The learner observes the actual outcome
• The goal is to predict nearly as well as the best expert in hindsight
• $M:= \#$ mistakes by the learner
• $m:= \#$ mistakes by the best expert in hindsight
• If there is a perfect expert, then by predicting the majority vote of the remaining experts and removing the erring experts in each round, every learner mistake eliminates at least half of the remaining experts, so $M \le \log_2 n$.
• If there is no perfect expert, we can restart the above process (reinstating all $n$ experts) whenever all of them have been removed. Every expert, in particular the best one, makes at least one mistake between consecutive restarts, so there are at most $m+1$ runs and $M \le (m+1) \log_2 n$.
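A minimal sketch of this elimination-with-restarts scheme (the function name and the list-of-rounds input format are my own):

```python
def halving_with_restarts(expert_preds, outcomes):
    """Predict the majority vote of the surviving experts, eliminate
    every expert that errs, and restart with the full pool once all
    experts have been eliminated.

    expert_preds: one list per round of the n experts' {0,1} predictions.
    outcomes:     the true {0,1} label of each round.
    Returns the learner's mistake count M."""
    n = len(expert_preds[0])
    alive = set(range(n))
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        votes_for_1 = sum(preds[i] for i in alive)
        guess = 1 if 2 * votes_for_1 >= len(alive) else 0
        if guess != y:
            mistakes += 1
        alive -= {i for i in alive if preds[i] != y}  # drop erring experts
        if not alive:                # everyone has erred at least once:
            alive = set(range(n))    # restart with all n experts
    return mistakes
```

With a perfect expert present, the mistake count stays within $\log_2 n$ as in the first bullet; without one, each restart costs at most $\log_2 n$ extra mistakes.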

Weighted Majority

• Set $w_i=1$,  $\forall i\in\{1,\dots,n\}$,  $W = \sum_i w_i$
• On a mistake, for every expert $E_i$ who predicted incorrectly, set $w_i \leftarrow \frac{w_i}{2}$
• On a mistake, the erring experts hold at least half of the total weight, so halving them brings $W$ down to at most $\frac{3}{4}W$.
• After $M$ mistakes, $W \le n(\frac{3}{4})^M$
• On the other hand, the best expert has weight $\ge (\frac{1}{2})^m$.
• Since the total weight is at least the weight of any single expert, combining the two inequalities gives $(\frac{1}{2})^m \le n(\frac{3}{4})^M$, i.e. $M \log_2 (4/3) \le \log_2 n + m$, or $M \le 2.41(\log_2 n + m) \le 2.5(\log_2 n + m)$.
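A direct sketch of deterministic Weighted Majority as described above (names are mine; erring experts are halved only on rounds where the learner errs, which is what the analysis uses):

```python
def weighted_majority(expert_preds, outcomes):
    """Predict the side holding the larger total weight; on a learner
    mistake, halve the weight of every expert that predicted wrongly.
    Returns the learner's mistake count M."""
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if vote1 >= vote0 else 0
        if guess != y:
            mistakes += 1
            # at least half the total weight just got halved
            w = [wi / 2 if p != y else wi for wi, p in zip(w, preds)]
    return mistakes
```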

Randomized Weighted Majority

• Instead of predicting the majority, pick expert $i$ randomly with probability $\frac{w_i}{W}$ (i.e., proportional to $w_i$) and follow its advice
• Fix $\epsilon < 1/2$. In each time step, for each erring expert, set $w_i \leftarrow (1-\epsilon)w_i$
• At time $t$, let $f_t$ be the fraction of the total weight held by the experts who make a mistake; this is exactly the probability that the learner errs at time $t$, so $E(M) = \sum_t f_t$.
• The total weight $W$ at time $T$ is $n\prod_{t=1}^T (1-\epsilon f_t)$, so $\ln W = \ln n + \sum_t \ln(1-\epsilon f_t) \le \ln n - \epsilon \sum_t f_t =\ln n - \epsilon E(M)$ , where we use the fact that $1+x \le e^x$ and $E(M) = \sum_t f_t$
• On the other hand, the weight of the best expert is $(1-\epsilon)^m$. Therefore, $m\ln(1-\epsilon) \le \ln n - \epsilon E(M)$. Rearranging, $E(M) \le \frac{1}{\epsilon}\ln n + m\cdot\frac{-\ln(1-\epsilon)}{\epsilon}$, and since $-\ln(1-\epsilon) \le \epsilon + \epsilon^2$ for $\epsilon \le \frac{1}{2}$, this gives $E(M) \le (1+\epsilon) m + \frac{1}{\epsilon}\ln n$
• Choose $\epsilon = \sqrt{\frac{\ln n}{m}}$, then $E(M) \le m + 2\sqrt{m\cdot \ln n}$
• Since $m\le T$, where $T$ is the total number of rounds, we have $\frac{E(M)}{T} \le \frac{m}{T} + 2\sqrt{\frac{\ln n}{T}}$. The per-round regret, $\frac{E(M)-m}{T} \le 2\sqrt{\frac{\ln n}{T}}$, thus vanishes as $T$ grows: the learner's mistake rate approaches that of the best expert.
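A sketch of Randomized Weighted Majority (names are mine; rather than sampling an expert each round, it accumulates $E(M) = \sum_t f_t$ directly, which is the quantity the bound controls):

```python
def rwm_expected_mistakes(expert_preds, outcomes, eps):
    """Follow expert i with probability w_i / W each round.  Here we
    add up E[M] = sum_t f_t, where f_t is the fraction of weight on
    the erring experts, instead of actually sampling.  Every erring
    expert's weight is multiplied by (1 - eps)."""
    n = len(expert_preds[0])
    w = [1.0] * n
    expected_m = 0.0
    for preds, y in zip(expert_preds, outcomes):
        W = sum(w)
        err_weight = sum(wi for wi, p in zip(w, preds) if p != y)
        expected_m += err_weight / W   # P(learner errs this round)
        w = [wi * (1 - eps) if p != y else wi for wi, p in zip(w, preds)]
    return expected_m
```

For example, with two experts over four rounds where the better one errs once ($m=1$) and $\epsilon = 0.5$, the result stays below the bound $(1+\epsilon)m + \frac{1}{\epsilon}\ln n \approx 2.89$.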

Winnow

Learn an OR of $r$ out of $n$ variables

• Initialize $w_i=1$,  $\forall i\in\{1,\dots,n\}$. Predict $+$ if $w \cdot x \ge n$, otherwise $-$.
• Mistake on a positive example: $w_i \leftarrow 2w_i$ ,  $\forall i: x_i =1$
• Mistake on a negative example: $w_i \leftarrow \frac{w_i}{2}$ ,  $\forall i: x_i =1$ (could set $w_i = 0$).
• The weights of the $r$ relevant variables never decrease (a relevant variable is never on in a negative example), and each can be doubled at most $\log_2 n$ times before reaching $n$, after which it alone forces a $+$ prediction. Since every mistake on a positive example doubles at least one relevant weight, #mistakes on positive examples is at most $r\log_2 n$
• For each mistake on $+$, $W$ increases by less than $n$ (we double weights only when $w\cdot x < n$), and for each mistake on $-$, $W$ decreases by at least $\frac{n}{2}$ (we halve weights only when $w\cdot x \ge n$). Since $W$ starts at $n$ and never drops below zero, (#mistakes on $-$) $\le 2\,$(#mistakes on $+$) $+\,2$.
• Therefore, total #mistakes $=O(r\log n)$.
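The Winnow update above, as a short sketch (the function name and example format are mine):

```python
def winnow(examples):
    """Winnow for a monotone OR over {0,1}^n: predict + iff w.x >= n;
    on a positive mistake double the active weights, on a negative
    mistake halve them.  Returns (final weights, mistake count)."""
    n = len(examples[0][0])
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:
        guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if guess != y:
            mistakes += 1
            factor = 2.0 if y == 1 else 0.5
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes
```

For instance, on the target "OR of $x_1$ alone" ($r=1$) over $n=4$ variables, the sketch makes $2 = r\log_2 n$ mistakes before becoming consistent.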

k-out-of-r majority-vote function

• Initialize $w_i=1$,  $\forall i\in\{1,\dots,n\}$. Predict $+$ if $w\cdot x \ge n$, otherwise $-$.
• Mistake on a positive example: $w_i \leftarrow (1+\epsilon)w_i$ ,  $\forall i: x_i =1$
• Mistake on a negative example: $w_i \leftarrow \frac{w_i}{(1+\epsilon)}$,  $\forall i: x_i =1$
• Let $M_+$ be number of mistakes on positives and $M_-$ be the number of mistakes on negatives. Then,
• $k M_+ -(k-1)M_- \le r \log_{1+\epsilon} n$: each mistake on a positive example puts at least $k$ factors of $(1+\epsilon)$ on the relevant variables (at least $k$ of them are on), each mistake on a negative example removes at most $k-1$ such factors (fewer than $k$ relevant variables are on), and each relevant weight accumulates at most $\log_{1+\epsilon} n$ net factors.  Also,
• $n + (\epsilon n) M_+ \ge \frac{\epsilon n}{1+\epsilon}M_-$ since the total weight remains positive. Thus, $M_- \le (1+\epsilon)M_+ + \frac{1+\epsilon}{\epsilon}$. Using this in the first inequality,
• $(k -(k-1)(1+\epsilon))M_+ \le r \log_{1+\epsilon} n + (k-1)\frac{1+\epsilon}{\epsilon}$
• Setting $\epsilon = \frac{1}{2(k-1)}$ makes the left-hand coefficient $k-(k-1)(1+\epsilon) = \frac{1}{2}$, and $\log_{1+\epsilon} n = O(k\log n)$, so $M_+ = O(rk\log n)$; by the inequality above, $M_-$ has the same bound.
• Total #mistakes $=O(rk\log n)$.
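The $(1+\epsilon)$ variant only changes the multiplier; a sketch (function name, the choice $\epsilon = \frac{1}{2(k-1)}$ with $k=2$, and the toy 2-out-of-3 data set are mine):

```python
def winnow_eps(examples, eps):
    """Winnow with promotion factor (1 + eps) and demotion factor
    1/(1 + eps), suited to k-out-of-r majority-vote targets.
    Predict + iff w.x >= n; returns (final weights, mistake count)."""
    n = len(examples[0][0])
    w = [1.0] * n
    mistakes = 0
    for x, y in examples:
        guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if guess != y:
            mistakes += 1
            factor = (1 + eps) if y == 1 else 1 / (1 + eps)
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes

# toy target: 2-out-of-3 majority over x1, x2, x3 (x4 irrelevant)
data = [((1, 1, 0, 1), 1), ((1, 0, 0, 1), 0), ((0, 1, 1, 0), 1),
        ((0, 0, 1, 1), 0), ((1, 0, 1, 0), 1)]
# eps = 1/(2(k-1)) with k = 2; cycling the data exhausts the
# O(rk log n) mistake budget, after which the weights are consistent
w, m = winnow_eps(data * 200, eps=0.5)
```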

Halfspaces

• OR function is a halfspace: $x_1+\dots+x_r \ge 1$
• k-out-of-r majority-vote function is a halfspace: $x_1+\dots+x_r \ge k$
• Consider a general halfspace $w^*_1 x_1 + \dots + w^*_n x_n \ge w^*_0$
• We can scale so that all weights are integers, and assume all are non-negative (by introducing new variables $y_i = 1-x_i$).
• Duplicate each variable $W=\sum_{i=1}^n w^*_i$ times; the problem becomes a $w^*_0$-out-of-$W$ majority vote over $nW$ variables. Therefore, the mistake bound is $O(W^2 \log(nW))$ (taking $r = W$ and $k = w^*_0 \le W$ in the bound above).
• This can be generalized to any $w^*$, $x\in R^n$. The general mistake bound is $O\left(\frac{||w^*||_1^2\, ||X||_\infty^2}{\gamma^2}\log n\right)$, where $\gamma = \min_x |w^* \cdot x|$ and $||X||_\infty^2 = \max_x ||x||_\infty^2$.
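As a small worked instance of the duplication reduction (the numbers are chosen here for illustration, not taken from the notes): consider the halfspace $2x_1 + x_2 \ge 2$, so $n = 2$, $W = 2 + 1 = 3$, and $w^*_0 = 2$. Duplicating each variable $W = 3$ times gives $nW = 6$ variables, of which two copies of $x_1$ and one copy of $x_2$ are relevant, so the target becomes a $2$-out-of-$3$ majority vote:

$$2x_1 + x_2 \ge 2 \;\Longrightarrow\; \text{2-out-of-3 over } nW = 6 \text{ variables}, \qquad O(W\, w^*_0 \log(nW)) = O(W^2 \log(nW)).$$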

Winnow vs. Perceptron

• E.g. $x\in\{0,1\}^n$, so $||X||^2_2 \sim n$ and $||X||_\infty^2 = 1$. If $w= (1,\dots,1,0,\dots,0)$ ($k$ ones), then $||w||^2_2 =k$ and $||w||^2_1 = k^2$.
• Winnow: $O(\frac{k^2\log n}{\gamma^2})$
• Perceptron: $O(\frac{nk}{\gamma^2})$
• E.g. $w=(\frac{1}{\sqrt{n}}, \dots, \frac{1}{\sqrt{n}})$, then $||w||^2_2 = 1$ and $||w||^2_1 = n$. Assume $||X||^2_2\le 1$ and $||X||_\infty^2 =1$.
• Winnow: $O(\frac{n\log n}{\gamma^2})$
• Perceptron: $O(\frac{1}{\gamma^2})$