Halfspaces, Perceptron

  • A halfspace is defined by a vector w=(w_1, \dots, w_n) and a threshold b as follows:
    • if \sum_{i=1}^n w_i x_i > b then \ell(x)=1, else \ell(x)=0
  • In order to use the sample complexity bound we derived last time, which depends on |\mathcal{H}|, we need to show that the number of effective halfspaces is finite.
    • Two halfspaces are effectively the same if their predictions on all n points are the same.
    • A halfspace can be moved till it touches n points without changing any labels. There are {2^n} different points in \{0,1\}^n, so at most {2^n \choose n} \le 2^{n^2} different halfspaces.
    • This is a loose bound. We will show a better bound next time.
  • disjunction (and conjunction) is a special case of halfspace
    • E.g. x_1 \wedge\bar{x}_2 \wedge \dots \wedge x_r   <=> x_1 + (1-x_2)+\dots +x_r \ge 1
  • decision list is a special case of halfspace
    • E.g. “if x_1=1  then 1, else if x_2=0 then 0, else 1” <=> 4x_1- 2(1-x_2) + 1 >0
  • A non-linear threshold function \sum_{i,j} A_{ij} x_i x_j \ge b can be written as a halfspace by introducing new variables y_{ij} =x_i x_j , ie. \sum_{i,j}A_{ij}y_{ij}\ge b
  • We can write a halfspace as w\cdot x>0, by extending x=(x_1, \dots, x_n, 1) and w=(w_1, \dots, w_n, -b). We can also assume ||w^*||=1 and ||x||=1.
  • Perceptron
    • w:=0
    • On each mistake, update w = w + \ell(x) x
  • Theorem: Let \gamma=\min_x |w^* \cdot x|. The Perceptron algorithm makes at most \frac{1}{\gamma^2} mistakes
  • Proof:
    • \frac{w\cdot w^*}{||w||} = cos\left( \theta(w, w^*) \right)
    • on a mistake, w' = w + \ell(x) x.
      • w^* \cdot w' = w^* \cdot w + \ell(x) (w^* \cdot x) =w^* \cdot w + |w^* \cdot x| \ge w^* \cdot w + \gamma
      • So after t mistakes, w^* \cdot w \ge \gamma t
    • On the other hand, ||w'||^2 = ||w||^2 + ||x||^2 + 2\ell(x)(w\cdot x) =||w||^2 + 1 -2|w\cdot x| \le ||w||^2+1
      • So after t mistakes, ||w||^2 \le t
    • Combining the two inequalities
      • 1\ge\frac{w\cdot w^*}{||w||} \ge \frac{\gamma t}{\sqrt{t}} \ge \gamma\sqrt{t}
      • Therefore, t\le\frac{1}{\gamma^2}
  • A modified Perceptron
    • w:= random unit vector
    • On each mistake and if |w\cdot x|>\sigma, update w = w + \ell(x) |w\cdot x|x
  • Theorem: The above algorithm updates at most \frac{\log n}{\sigma^2} times  (w.h.p.)
  • Proof:
    • w.l.o.g, assume all labels are positive (because we can negate the feature part)
    • w' = w - (w\cdot x)x
    • w^* \cdot w' = w^* \cdot w - (w \cdot x )(w^* \cdot x)\ge w^* \cdot w \ge w^* \cdot w_0 \ge \frac{1}{\sqrt{n}} (w.p. \ge \frac{1}{8})
    • ||w'||^2 = ||w||^2 + (w\cdot x)^2 ||x||^2 - 2(w\cdot x)^2 = ||w||^2 -(w\cdot x)^2 \le ||w||^2 - \sigma^2 \le ||w||^2(1-\sigma^2)
    • After t steps, 1\ge\frac{w\cdot w^*}{||w||} \ge \frac{1/\sqrt{n}}{(1-\sigma^2)^{t/2}}, thus t\le\frac{\ln n}{\sigma^2} is the maximum possible number of mistakes.


  1. Hi,
    I have a quick question about the Perceptron notes. In developing the bound for ||w’||^2, we have:
    ||w’||^2 = ||w||^2 + ||x||^2 + 2*l(x)*(w \cdot x)
    = ||w||^2 + 1 – 2|w \cdot x|

    In moving from the first equality to the second, we seem to assume that x is a unit vector. Could you mention how we justify this?


    1. We assume here that each example is from the unit L2 ball. In general, without ths assumption, the bound would go up by the maximum length squared of an example, as scaling to fit within the unit ball would reduce the margin proportionately; see next set of notes.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s