# Halfspaces, Perceptron

• A halfspace is defined by a vector $w=(w_1, \dots, w_n)$ and a threshold $b$ as follows:
• if $\sum_{i=1}^n w_i x_i > b$ then $\ell(x)=1$, else $\ell(x)=0$
• In order to use the sample complexity bound we derived last time, which depends on $|\mathcal{H}|$, we need to show that the number of effective halfspaces is finite.
• Two halfspaces are effectively the same if their predictions on all $n$ points are the same.
• A halfspace can be moved till it touches $n$ points without changing any labels. There are ${2^n}$ different points in $\{0,1\}^n$, so at most ${2^n \choose n} \le 2^{n^2}$ different halfspaces.
• This is a loose bound. We will show a better bound next time.
• disjunction (and conjunction) is a special case of halfspace
• E.g. $x_1 \wedge\bar{x}_2 \wedge \dots \wedge x_r$   <=> $x_1 + (1-x_2)+\dots +x_r \ge 1$
• decision list is a special case of halfspace
• E.g. “if $x_1=1$  then 1, else if $x_2=0$ then 0, else 1” <=> $4x_1- 2(1-x_2) + 1 >0$
• A non-linear threshold function $\sum_{i,j} A_{ij} x_i x_j \ge b$ can be written as a halfspace by introducing new variables $y_{ij} =x_i x_j$, ie. $\sum_{i,j}A_{ij}y_{ij}\ge b$
• We can write a halfspace as $w\cdot x>0$, by extending $x=(x_1, \dots, x_n, 1)$ and $w=(w_1, \dots, w_n, -b)$. We can also assume $||w^*||=1$ and $||x||=1$.
• Perceptron
• $w:=0$
• On each mistake, update $w = w + \ell(x) x$
• Theorem: Let $\gamma=\min_x |w^* \cdot x|$. The Perceptron algorithm makes at most $\frac{1}{\gamma^2}$ mistakes
• Proof:
• $\frac{w\cdot w^*}{||w||} = cos\left( \theta(w, w^*) \right)$
• on a mistake, $w' = w + \ell(x) x$.
• $w^* \cdot w' = w^* \cdot w + \ell(x) (w^* \cdot x) =w^* \cdot w + |w^* \cdot x| \ge w^* \cdot w + \gamma$
• So after $t$ mistakes, $w^* \cdot w \ge \gamma t$
• On the other hand, $||w'||^2 = ||w||^2 + ||x||^2 + 2\ell(x)(w\cdot x) =||w||^2 + 1 -2|w\cdot x| \le ||w||^2+1$
• So after $t$ mistakes, $||w||^2 \le t$
• Combining the two inequalities
• $1\ge\frac{w\cdot w^*}{||w||} \ge \frac{\gamma t}{\sqrt{t}} \ge \gamma\sqrt{t}$
• Therefore, $t\le\frac{1}{\gamma^2}$
• A modified Perceptron
• $w:=$ random unit vector
• On each mistake and if $|w\cdot x|>\sigma$, update $w = w + \ell(x) |w\cdot x|x$
• Theorem: The above algorithm updates at most $\frac{\log n}{\sigma^2}$ times  (w.h.p.)
• Proof:
• w.l.o.g, assume all labels are positive (because we can negate the feature part)
• $w' = w - (w\cdot x)x$
• $w^* \cdot w' = w^* \cdot w - (w \cdot x )(w^* \cdot x)\ge w^* \cdot w \ge w^* \cdot w_0 \ge \frac{1}{\sqrt{n}}$ (w.p. $\ge \frac{1}{8}$)
• $||w'||^2 = ||w||^2 + (w\cdot x)^2 ||x||^2 - 2(w\cdot x)^2 = ||w||^2 -(w\cdot x)^2 \le ||w||^2 - \sigma^2 \le ||w||^2(1-\sigma^2)$
• After $t$ steps, $1\ge\frac{w\cdot w^*}{||w||} \ge \frac{1/\sqrt{n}}{(1-\sigma^2)^{t/2}}$, thus $t\le\frac{\ln n}{\sigma^2}$ is the maximum possible number of mistakes.
Advertisements

### 2 Comments

1. Hi,
I have a quick question about the Perceptron notes. In developing the bound for ||w’||^2, we have:
||w’||^2 = ||w||^2 + ||x||^2 + 2*l(x)*(w \cdot x)
= ||w||^2 + 1 – 2|w \cdot x|

In moving from the first equality to the second, we seem to assume that x is a unit vector. Could you mention how we justify this?

1. We assume here that each example is from the unit L2 ball. In general, without ths assumption, the bound would go up by the maximum length squared of an example, as scaling to fit within the unit ball would reduce the margin proportionately; see next set of notes.