Notes: Fourier learning

  • PAC learning for arbitrary distributions is hard
    • e.g., learning DNF, f(x) = (x_1 \wedge \dots \wedge \bar{x}_k) \vee \dots, and learning Decision Trees.
    • We don’t have efficient algorithms for arbitrary distributions
    • We can make the problem easier by only considering special distributions
      • uniform distribution
      • product distribution: each x_i is independent with p_i = P[x_i = 1].
      • logconcave distribution (e.g. Gaussian)
    • We can also allow “membership queries”
      • Instead of drawing a labeled example from the distribution, we can ask the label for any point in the feature space
  • Below we describe how to learn Decision Trees under the uniform distribution
  • Assume x\in\{-1, 1\}^n and f: \{-1,1\}^n \rightarrow \{-1,1\}. Since f is defined by its outputs on the 2^n possible x, we can also represent f as a vector in \{-1, 1\}^{2^n}.
  • Define \left< f, g\right>_D = E_D[f(x)g(x)] = \sum_x D(x) f(x) g(x)
  • So \left< f, f\right>_D = E_D[f(x)^2] = 1, since f(x)^2 = 1.
  • f is a vector with 2^n coordinates:
    • standard basis: e_1, \dots, e_{2^n}
    • Suppose B is an orthonormal basis \{v_1, \dots, v_{2^n}\} , then f can be represented as \sum_{v\in B} \left< f, v \right> v
  • Parity basis:
    • For S\subseteq [n], the parity function \chi_S: \{-1,1\}^n \rightarrow \{ -1,1\} over the variables in S is \chi_S(x) = \prod_{i\in S} x_i .
    • \left< \chi_S, \chi_S \right> =1
    • For S \neq T, \left< \chi_S, \chi_T \right> = E[\prod_{i\in S} x_i \prod_{j\in T} x_j] = E[\prod_{i\in S\triangle T} x_i] = 0 under the uniform distribution, since repeated variables square to 1 and each remaining x_i has mean 0.
    • Therefore, \{ \chi_S \}  is an orthonormal basis
  • f(x) = \sum_{S\subseteq [n]} \hat{f}_S \chi_S(x) , where \hat{f}_S = \left< f, \chi_S \right> are called discrete Fourier coefficients. The following identities are immediate.
  • Lemma 1 (Parseval): \left< f, f \right> = \sum_{S \subseteq [n]} \hat{f}^2_S
  • Lemma 2 (Plancherel): \left< f, g \right>_D = \left< \hat{f}, \hat{g} \right>
    • pf: \left< f, g \right>_D = \left< \sum_S \hat{f}_S \chi_S(x), \sum_T \hat{g}_T \chi_T(x) \right>_D = \sum_{S,T} \hat{f}_S \hat{g}_T E[\chi_S(x) \chi_T(x)] = \sum_S \hat{f}_S \hat{g}_S
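For small n, the expansion and both lemmas can be checked by brute force. The sketch below (Python; the 3-bit majority function is a hypothetical test case, not from the notes) computes every \hat{f}_S under the uniform distribution and verifies Parseval.

```python
import itertools

def chi(S, x):
    """Parity chi_S(x) = prod_{i in S} x_i for x in {-1, 1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def fourier_coefficients(f, n):
    """Brute-force hat{f}_S = E_uniform[f(x) chi_S(x)] for every S subseteq [n]."""
    points = list(itertools.product([-1, 1], repeat=n))
    return {
        S: sum(f(x) * chi(S, x) for x in points) / len(points)
        for r in range(n + 1)
        for S in itertools.combinations(range(n), r)
    }

# Hypothetical example: 3-bit majority, MAJ(x) = sign(x_0 + x_1 + x_2)
maj = lambda x: 1 if sum(x) > 0 else -1
coeffs = fourier_coefficients(maj, 3)

# Parseval: <f, f> = sum_S hat{f}_S^2 = 1 since f is {-1,1}-valued
assert abs(sum(c * c for c in coeffs.values()) - 1.0) < 1e-9
```

For majority, the nonzero coefficients sit on the three singletons (each 1/2) and the full set (-1/2), which is why only a few "large" coefficients matter.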
  • We want to weak-learn f by approximately finding the Fourier coefficients
    • g(x) = \sum_S \tilde{f}_S \chi_S(x)
    • Pr(f(x) \neq sign(g(x))) \le E[(f - g)^2] = \sum_S (\hat{f}_S - \tilde{f}_S)^2
      • pf: f(x) \neq sign(g(x)) \Rightarrow |f(x) - g(x)| \ge 1, so the indicator of an error is at most (f(x) - g(x))^2; the last equality is Parseval.
    • It suffices to learn all “large” coefficients
  • Goal: Find all \hat{f}_S such that \hat{f}^2_S \ge \tau , in time poly(1/\tau, \dots ).
  • If we know the subsets with large Fourier coefficients, then we can estimate \left< f, \chi_S \right> = E[f(x) \chi_S(x)] by sampling.
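A single coefficient can be estimated from uniform random samples alone, without membership queries. A minimal sketch (the sample count 20000 is an arbitrary choice, and majority is again a hypothetical test function):

```python
import random

def estimate_coefficient(f, S, n, num_samples=20000):
    """Estimate hat{f}_S = E[f(x) chi_S(x)] from uniform random samples.
    A Chernoff bound gives accuracy eps with high probability from
    O(log(1/delta) / eps^2) samples."""
    total = 0
    for _ in range(num_samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / num_samples

random.seed(0)  # for reproducibility
maj = lambda x: 1 if sum(x) > 0 else -1
est = estimate_coefficient(maj, (0,), 3)  # true coefficient is 1/2
```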
  • Kushilevitz-Mansour (K-M) algorithm:
    • We build up a binary tree as follows.
    • Initially, the root contains the set of all possible S, where each S is identified with its 0/1 indicator vector of length n. From Lemma 1, we know \sum_{S} \hat{f}^2_S = 1.
    • Split the node into two nodes:
      • \mathcal{S}_0 = \{ S: S \text{ begins with } 0 \}
      • \mathcal{S}_1 = \{ S: S \text{ begins with } 1 \}
    • Recursively split every node whose sum of \hat{f}^2_S is larger than \tau.
      • All S in a level-i node will have the same prefix of length i
    • Max depth: n. Max width: \frac{1}{\tau}, since the weights \hat{f}^2_S at any level sum to at most 1 and each surviving node holds weight at least \tau.
    • So we only need to ask for the sum of \hat{f}^2_S at most O(\frac{n}{\tau}) times.
  • The only remaining question is how we can ask for the sum of squares of \hat{f}_S in these sets.
  • Consider the node with prefix \alpha = 0\dots 0 (k zeros), i.e. the sets S that exclude the first k variables.
  • Claim: \sum_{S\in \mathcal{S}_\alpha} \hat{f}^2_S = E[f(yx) f(zx)] , where x\in\{-1, 1\}^{n-k} and y, z \in\{-1, 1\}^k are independent uniform random vectors. (yx is the concatenation of y and x; evaluating f(yx) and f(zx) requires membership queries.)
    • Suppose f is a parity function \chi_U
      • If U agrees with \alpha (doesn’t contain the first k variables), then f(yx)=f(zx) so E[f(yx)f(zx)] = 1.
      • If U doesn’t agree with \alpha (contains some of the first k variables), then \chi_U(yx) and \chi_U(zx) are independent unbiased signs, so E[f(yx)f(zx)] = 0.
    • We know that any f can be written as a sum of parities, i.e. f = \sum_U \hat{f}_U \chi_U
      • E[f(yx)f(zx)] = \sum_{U, V} \hat{f}_U \hat{f}_V E[\chi_U(yx) \chi_V(zx)] = \sum_{U\in\mathcal{S}_\alpha} \hat{f}^2_U
  • For arbitrary prefix \alpha, we can instead ask E[f(yx)f(zx) \chi_\alpha(y) \chi_\alpha(z)], where \chi_\alpha is the parity of the subset of variables that are set to 1 in the string \alpha.
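Putting the pieces together, a toy version of K-M can be sketched as follows. Here each weight E[f(yx) f(zx) \chi_\alpha(y) \chi_\alpha(z)] is computed exactly by enumerating all membership queries, which only works for tiny n; the real algorithm estimates it by sampling. The 3-bit majority test case is again hypothetical.

```python
import itertools

def km_find_large(f, n, tau):
    """Toy Kushilevitz-Mansour: return the 0/1 indicator vectors of all
    S with hat{f}_S^2 >= tau, by recursively splitting prefixes whose
    weight is at least tau."""
    def weight(alpha):
        # sum of hat{f}_S^2 over all S whose indicator starts with alpha,
        # i.e. E[f(yx) f(zx) chi_alpha(y) chi_alpha(z)], computed exactly
        k = len(alpha)
        total, count = 0.0, 0
        for x in itertools.product([-1, 1], repeat=n - k):
            for y in itertools.product([-1, 1], repeat=k):
                for z in itertools.product([-1, 1], repeat=k):
                    chi_y = chi_z = 1
                    for i in range(k):
                        if alpha[i] == 1:
                            chi_y *= y[i]
                            chi_z *= z[i]
                    total += f(list(y + x)) * f(list(z + x)) * chi_y * chi_z
                    count += 1
        return total / count

    found, stack = [], [()]
    while stack:
        alpha = stack.pop()
        if weight(alpha) >= tau:
            if len(alpha) == n:
                found.append(alpha)
            else:
                stack.append(alpha + (0,))
                stack.append(alpha + (1,))
    return found

# On 3-bit majority, the three singletons and the full set each have
# hat{f}_S^2 = 1/4, so tau = 0.2 recovers exactly those four sets.
maj = lambda x: 1 if sum(x) > 0 else -1
found = km_find_large(maj, 3, 0.2)
```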
  • Lemma: \sum_{S: |\hat{f}_S| \le \frac{\epsilon}{\|\hat{f}\|_1}} \hat{f}^2_S \le \epsilon, where \|\hat{f}\|_1 = \sum_S |\hat{f}_S|.
    • pf: LHS \le \sum_{S: |\hat{f}_S| \le \frac{\epsilon}{\|\hat{f}\|_1}} |\hat{f}_S| \cdot \frac{\epsilon}{\|\hat{f}\|_1} \le \epsilon
  • It suffices to learn all \hat{f}_S such that |\hat{f}_S| \ge \frac{\epsilon}{\|\hat{f}\|_1}.
  • For Decision Trees with at most m leaves, \|\hat{f}\|_1 \le 2m+1.
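To sanity-check that bound, we can take a tiny tree and compute \|\hat{f}\|_1 by brute force. The 3-leaf tree below is a hypothetical example (it computes AND of two variables):

```python
import itertools

def tree(x):
    """Hypothetical decision tree with m = 3 leaves on {-1,1}^2:
    split on x[0]; the left leaf outputs -1; the right child splits
    on x[1] into leaves -1 and 1 (i.e. f = AND(x_0, x_1))."""
    if x[0] == -1:
        return -1
    return 1 if x[1] == 1 else -1

n, m = 2, 3
points = list(itertools.product([-1, 1], repeat=n))
l1 = 0.0  # ||hat{f}||_1 = sum_S |hat{f}_S|
for r in range(n + 1):
    for S in itertools.combinations(range(n), r):
        coeff = 0.0
        for x in points:
            chi = 1
            for i in S:
                chi *= x[i]
            coeff += tree(x) * chi
        l1 += abs(coeff) / len(points)

assert l1 <= 2 * m + 1  # the bound from the notes; here l1 = 2
```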
