# Halfspaces, Perceptron

• A halfspace is defined by a vector $w=(w_1, \dots, w_n)$ and a threshold $b$ as follows:
• if $\sum_{i=1}^n w_i x_i > b$ then $\ell(x)=1$, else $\ell(x)=0$
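• As a quick illustration, here is a minimal sketch of this prediction rule (the example halfspace and the names `predict`, `w`, `b` are our own):

```python
import numpy as np

def predict(w, b, x):
    """Label x with 1 if w . x exceeds the threshold b, else 0."""
    return 1 if np.dot(w, x) > b else 0

# Example: the halfspace 2*x1 + x2 > 2 over {0,1}^2 labels only (1,1) positively.
w, b = np.array([2.0, 1.0]), 2.0
print([predict(w, b, np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```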
• In order to use the sample complexity bound we derived last time, which depends on $|\mathcal{H}|$, we need to show that the number of effective halfspaces is finite.
• Two halfspaces are effectively the same if their predictions agree on all $2^n$ points of $\{0,1\}^n$.
• A halfspace can be translated and rotated until its boundary passes through $n$ of the points, without changing any labels. A hyperplane in $\mathbb{R}^n$ is determined by $n$ points, and there are ${2^n}$ different points in $\{0,1\}^n$, so there are at most ${2^n \choose n} \le 2^{n^2}$ effectively different halfspaces.
• This is a loose bound. We will show a better bound next time.
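• As an illustrative sanity check (ours, not from the lecture), sampling random halfspaces over $\{0,1\}^2$ and collecting the labelings they induce recovers exactly 14 of the 16 Boolean labelings; XOR and its complement are not linearly separable:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
points = list(itertools.product([0, 1], repeat=2))  # the 2^n points, here n = 2

labelings = set()
for _ in range(100_000):
    w, b = rng.standard_normal(2), rng.standard_normal()  # a random halfspace
    labelings.add(tuple(int(np.dot(w, p) > b) for p in points))

print(len(labelings))  # 14: every labeling except XOR and its complement
```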
• A disjunction (or conjunction) is a special case of a halfspace
• E.g. $x_1 \vee \bar{x}_2 \vee \dots \vee x_r$   <=>   $x_1 + (1-x_2)+\dots +x_r \ge 1$ (a conjunction is the same with threshold $\ge r$)
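• A brute-force check of this translation for the concrete disjunction $x_1 \vee \bar{x}_2 \vee x_3$ (our choice of instance):

```python
import itertools

# Verify x1 OR (not x2) OR x3  <=>  x1 + (1 - x2) + x3 >= 1 on all 8 assignments.
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    disjunction = (x1 == 1) or (x2 == 0) or (x3 == 1)
    halfspace = x1 + (1 - x2) + x3 >= 1
    assert disjunction == halfspace
print("ok")
```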
• A decision list is a special case of a halfspace
• E.g. “if $x_1=1$  then 1, else if $x_2=0$ then 0, else 1” <=> $4x_1- 2(1-x_2) + 1 >0$ (the weights $4, 2, 1$ halve from rule to rule, so earlier rules dominate later ones)
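• The same brute-force check for the decision-list example:

```python
import itertools

# Verify the decision list against the halfspace 4*x1 - 2*(1 - x2) + 1 > 0.
for x1, x2 in itertools.product([0, 1], repeat=2):
    dl = 1 if x1 == 1 else (0 if x2 == 0 else 1)
    halfspace = int(4 * x1 - 2 * (1 - x2) + 1 > 0)
    assert dl == halfspace
print("ok")
```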
• A non-linear threshold function $\sum_{i,j} A_{ij} x_i x_j \ge b$ can be written as a halfspace by introducing new variables $y_{ij} =x_i x_j$, i.e., $\sum_{i,j}A_{ij}y_{ij}\ge b$
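• A small sketch of this substitution (the helper `lift` is ours):

```python
import numpy as np

def lift(x):
    """Map x to the vector of all products x_i * x_j, i.e. the new variables y_ij."""
    return np.outer(x, x).ravel()

# sum_ij A_ij x_i x_j >= b becomes the halfspace <A.ravel(), lift(x)> >= b.
A = np.array([[1.0, -2.0], [0.0, 3.0]])
x = np.array([1.0, 1.0])
assert np.isclose(x @ A @ x, A.ravel() @ lift(x))
```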
• We can write a halfspace as $w\cdot x>0$ by extending $x=(x_1, \dots, x_n, 1)$ and $w=(w_1, \dots, w_n, -b)$. Since rescaling $w$ or $x$ by a positive factor does not change the predictions, we can also assume the target vector and the examples are normalized: $||w^*||=1$ and $||x||=1$.
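• A sketch of this reduction (names ours); rescaling $x$ to unit length is harmless because it never changes the sign of $w\cdot x$:

```python
import numpy as np

def homogenize(w, b, x):
    """Fold the threshold into the weights: w . x > b  <=>  w_ext . x_ext > 0."""
    w_ext = np.append(w, -b)
    x_ext = np.append(x, 1.0)
    return w_ext, x_ext / np.linalg.norm(x_ext)  # unit-norm example, same sign

w, b, x = np.array([2.0, 1.0]), 2.0, np.array([1.0, 1.0])
w_ext, x_unit = homogenize(w, b, x)
assert (np.dot(w, x) > b) == (np.dot(w_ext, x_unit) > 0)
```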
• Perceptron
• $w:=0$
• On each mistake, i.e., whenever $\ell(x)(w\cdot x) \le 0$ with labels taken in $\{-1,+1\}$, update $w := w + \ell(x) x$ (a runnable sketch follows the proof below)
• Theorem: Let $w^*$ be a unit vector consistent with all labels, and let $\gamma=\min_x |w^* \cdot x|$ be its margin. The Perceptron algorithm makes at most $\frac{1}{\gamma^2}$ mistakes
• Proof:
• $\frac{w\cdot w^*}{||w||} = \cos\left( \theta(w, w^*) \right)$, using $||w^*||=1$
• On a mistake, $w' = w + \ell(x) x$.
• $w^* \cdot w' = w^* \cdot w + \ell(x) (w^* \cdot x) =w^* \cdot w + |w^* \cdot x| \ge w^* \cdot w + \gamma$, where the middle equality holds because $w^*$ labels $x$ correctly
• So after $t$ mistakes, $w^* \cdot w \ge \gamma t$
• On the other hand, $||w'||^2 = ||w||^2 + ||x||^2 + 2\ell(x)(w\cdot x) =||w||^2 + 1 -2|w\cdot x| \le ||w||^2+1$, using $||x||=1$ and the fact that on a mistake $\ell(x)(w\cdot x) = -|w\cdot x| \le 0$
• So after $t$ mistakes, $||w||^2 \le t$
• Combining the two inequalities
• $1\ge\frac{w\cdot w^*}{||w||} \ge \frac{\gamma t}{\sqrt{t}} = \gamma\sqrt{t}$
• Therefore, $t\le\frac{1}{\gamma^2}$
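• A minimal runnable sketch of the algorithm together with an empirical check of the $\frac{1}{\gamma^2}$ bound; the data-generation scheme (a random unit $w^*$ and unit examples filtered to margin at least $\gamma$) is our own illustration:

```python
import numpy as np

def perceptron(stream, n):
    """Online Perceptron with labels in {-1, +1}; returns (w, number of mistakes)."""
    w, mistakes = np.zeros(n), 0
    for x, label in stream:
        if label * np.dot(w, x) <= 0:   # a mistake (or zero margin)
            w = w + label * x           # update only on mistakes
            mistakes += 1
    return w, mistakes

rng = np.random.default_rng(0)
n, gamma = 20, 0.1
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)

# Unit-norm examples, kept only if their margin |w* . x| is at least gamma.
data = []
while len(data) < 2000:
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    if abs(np.dot(w_star, x)) >= gamma:
        data.append((x, np.sign(np.dot(w_star, x))))

_, mistakes = perceptron(data, n)
print(mistakes, "<=", 1 / gamma**2)   # the theorem guarantees at most 100 mistakes
```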
• A modified Perceptron
• $w:=$ random unit vector
• On each mistake with $|w\cdot x|>\sigma$, update $w := w + \ell(x) |w\cdot x|\,x$ (a runnable sketch follows the proof below)
• Theorem: The above algorithm updates at most $\frac{\ln n}{\sigma^2}$ times  (w.h.p.)
• Proof:
• W.l.o.g., assume all labels are positive (for a negative example, replace $x$ by $-x$)
• On an update, $w\cdot x < -\sigma < 0$, so $\ell(x)|w\cdot x| = -(w\cdot x)$ and the update is $w' = w - (w\cdot x)x$
• $w^* \cdot w' = w^* \cdot w - (w \cdot x )(w^* \cdot x)\ge w^* \cdot w$, since $w\cdot x <0$ and $w^*\cdot x>0$ make the subtracted term negative; also $w^* \cdot w \ge w^* \cdot w_0 \ge \frac{1}{\sqrt{n}}$ w.p. $\ge \frac{1}{8}$ over the random $w_0$ (restarting with fresh random vectors boosts this probability)
• $||w'||^2 = ||w||^2 + (w\cdot x)^2 ||x||^2 - 2(w\cdot x)^2 = ||w||^2 -(w\cdot x)^2 \le ||w||^2 - \sigma^2 \le ||w||^2(1-\sigma^2)$, using $||x||=1$, $|w\cdot x|>\sigma$, and $||w|| \le ||w_0|| = 1$
• After $t$ updates, $1\ge\frac{w\cdot w^*}{||w||} \ge \frac{1/\sqrt{n}}{(1-\sigma^2)^{t/2}}$; rearranging gives $(1-\sigma^2)^t \ge \frac{1}{n}$, so $t\le\frac{\ln n}{-\ln(1-\sigma^2)}\le\frac{\ln n}{\sigma^2}$ is the maximum possible number of updates.
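• A sketch of the modified algorithm under the same conventions (labels in $\{-1,+1\}$, unit-norm examples labeled by a unit target $w^*$); the data generation is again our own illustration:

```python
import numpy as np

def modified_perceptron(stream, n, sigma, rng):
    """Update only on mistakes whose margin satisfies |w . x| > sigma."""
    w = rng.standard_normal(n); w /= np.linalg.norm(w)   # random unit start
    updates = 0
    for x, label in stream:
        margin = np.dot(w, x)
        if label * margin <= 0 and abs(margin) > sigma:  # a confident mistake
            w = w + label * abs(margin) * x              # equals w - (w . x) x here
            updates += 1
    return w, updates

rng = np.random.default_rng(1)
n = 20
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)
data = []
while len(data) < 2000:
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    data.append((x, np.sign(np.dot(w_star, x))))  # label by the target halfspace

_, updates = modified_perceptron(data, n, sigma=0.1, rng=rng)
print(updates, "<=", np.log(n) / 0.1**2)   # about 300 for n = 20, sigma = 0.1
```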