# Notes: Fourier learning

• PAC learning under arbitrary distributions is hard
• e.g., learning DNF formulas, $f(x) = (x_1 \wedge \dots \wedge \bar{x}_k) \vee \dots$, and learning decision trees
• We don’t have efficient algorithms for these problems under arbitrary distributions
• We can make the problem easier by only considering special distributions
• uniform distribution
• product distribution: each $x_i$ is independent with $p_i = P[x_i = 1]$.
• log-concave distribution (e.g. Gaussian)
• We can also allow “membership queries”
• Instead of drawing a labeled example from the distribution, we can ask the label for any point in the feature space
• Below we describe how to learn Decision Trees under the uniform distribution
• Assume $x\in\{-1, 1\}^n$ and $f: \{-1,1\}^n \rightarrow \{-1,1\}$. Since $f$ is determined by its outputs on the $2^n$ possible inputs $x$, we can also represent $f$ as a vector in $\{-1, 1\}^{2^n}$.
• Define $\left< f, g\right>_D = E_D[f(x)g(x)] = \sum_x D(x) f(x) g(x)$
• So $\left< f, f\right>_D = 1$ (since $f(x)^2 = 1$ for every $x$)
• $f$ is a vector with $2^n$ coordinates:
• standard basis: $e_1, \dots, e_{2^n}$
• Suppose $B = \{v_1, \dots, v_{2^n}\}$ is an orthonormal basis; then $f$ can be represented as $\sum_{v\in B} \left< f, v \right> v$
• Parity basis (from here on, take $D$ to be the uniform distribution on $\{-1,1\}^n$):
• For $S\subseteq [n]$, the parity function $\chi_S: \{-1,1\}^n \rightarrow \{ -1,1\}$ over the variables in $S$ is defined by $\chi_S(x) = \prod_{i\in S} x_i$.
• $\left< \chi_S, \chi_S \right> =1$
• For $S \neq T$: $\left< \chi_S, \chi_T \right> = E_D[\prod_{i\in S} x_i \prod_{j\in T} x_j] = E_D[\chi_{S \triangle T}(x)] = 0$, since $S \triangle T \neq \emptyset$ and the coordinates are independent with mean $0$ under the uniform distribution
• Therefore, $\{ \chi_S \}$ is an orthonormal basis
• $f(x) = \sum_{S\subseteq [n]} \hat{f}_S \chi_S(x)$, where $\hat{f}_S = \left< f, \chi_S \right>$ are called discrete Fourier coefficients. The following identities are immediate.
• Lemma 1 (Parseval): $\left< f, f \right> = \sum_{S \subseteq [n]} \hat{f}^2_S$
• Lemma 2 (Plancherel): $\left< f, g \right>_D = \left< \hat{f}, \hat{g} \right>$
• pf: $\left< f, g \right>_D = \left< \sum_S \hat{f}_S \chi_S(x), \sum_T \hat{g}_T \chi_T(x) \right>_D = \sum_{S,T} \hat{f}_S \hat{g}_T E[\chi_S(x) \chi_T(x)] = \sum_S \hat{f}_S \hat{g}_S$
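As a concrete check of the expansion and of Lemma 1, here is a minimal Python sketch; the example function `f_maj3` (majority of 3 bits) and all helper names are my own choices, not from the notes. It computes every coefficient by brute-force enumeration over the $2^n$ inputs:

```python
from itertools import product, combinations

n = 3

def f_maj3(x):
    """Majority of three +/-1 bits (a hypothetical example function)."""
    return 1 if sum(x) > 0 else -1

def chi(S, x):
    """Parity chi_S(x) = prod_{i in S} x_i."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def fourier_coeffs(f, n):
    """hat{f}_S = E_x[f(x) chi_S(x)] under the uniform distribution."""
    cube = list(product([-1, 1], repeat=n))
    return {
        S: sum(f(x) * chi(S, x) for x in cube) / len(cube)
        for k in range(n + 1)
        for S in combinations(range(n), k)
    }

coeffs = fourier_coeffs(f_maj3, n)
print(sum(c * c for c in coeffs.values()))               # Parseval: 1.0
print({S: c for S, c in coeffs.items() if abs(c) > 1e-9})
# {(0,): 0.5, (1,): 0.5, (2,): 0.5, (0, 1, 2): -0.5}
```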
• We want to weak-learn $f$ by approximately finding the Fourier coefficients
• Let $g(x) = \sum_S \tilde{f}_S \chi_S(x)$ be the hypothesis built from estimated coefficients $\tilde{f}_S$
• $\Pr(f(x) \neq \mathrm{sign}(g(x))) \le E[(f(x) - g(x))^2] = \sum_S (\hat{f}_S - \tilde{f}_S)^2$
• pf: $f(x) \neq \mathrm{sign}(g(x)) \Rightarrow |f(x) - g(x)| \ge 1$, so the indicator of an error is at most $(f(x) - g(x))^2$; take expectations and apply Parseval to $f - g$
• It suffices to learn all “large” coefficients
• Goal: Find all $\hat{f}_S$ such that $\hat{f}^2_S \ge \tau$, in time poly($1/\tau, \dots$).
• If we know which subsets have large Fourier coefficients, then we can estimate $\left< f, \chi_S \right> = E[f(x) \chi_S(x)]$ by sampling, as sketched below.
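A minimal sketch of this estimation step, reusing the hypothetical `f_maj3` from above. By a standard Chernoff bound, $O(\log(1/\delta)/\epsilon^2)$ samples estimate a coefficient to within $\pm\epsilon$ with probability $1-\delta$:

```python
import random

def f_maj3(x):  # majority of three +/-1 bits (hypothetical example)
    return 1 if sum(x) > 0 else -1

def estimate_coeff(f, S, n, num_samples=100_000):
    """Estimate hat{f}_S = E[f(x) chi_S(x)] from uniform random samples."""
    total = 0
    for _ in range(num_samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / num_samples

print(estimate_coeff(f_maj3, (0,), 3))  # ~ 0.5 = hat{f}_{{1}} for majority
```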
• Kushilevitz-Mansour (K-M) algorithm:
• We build up a binary tree as follows.
• Initially, the root contains the set of all possible $S$, where each $S \subseteq [n]$ is identified with its indicator vector in $\{0,1\}^n$. Since $\left< f, f \right> = 1$, Lemma 1 gives $\sum_{S} \hat{f}^2_S = 1$.
• Split the node into two nodes:
• $\mathcal{S}_0 = \{ S: S \text{ begins with } 0 \}$
• $\mathcal{S}_1 = \{ S: S \text{ begins with } 1 \}$
• Recursively split every node whose sum of $\hat{f}^2_S$ over its set is at least $\tau$.
• All $S$ in a level-$i$ node share the same prefix of length $i$; a depth-$n$ leaf contains a single $S$.
• Max depth: $n$. Max width: $\frac{1}{\tau}$, since the node weights at any level sum to at most 1, so at most $\frac{1}{\tau}$ nodes per level can have weight $\ge \tau$.
• So we only need to ask for the sum of $\hat{f}^2_S$ over a node’s set $O(\frac{n}{\tau})$ times.
• The only question now is how can we ask for the sum of squares of $\hat{f}_S$ in these sets.
• Consider a node whose sets share the prefix $\alpha = 0\dots 0$ ($k$ zeros).
• Claim: $\sum_{S\in \mathcal{S}_\alpha} \hat{f}^2_S = E[f(yx) f(zx)]$, where $x\in\{-1, 1\}^{n-k}$ and $y, z \in\{-1, 1\}^k$ are independent uniform random vectors ($yx$ is the concatenation of $y$ and $x$).
• Suppose $f$ is a parity function $\chi_U$
• If $U$ agrees with $\alpha$ (i.e., contains none of the first $k$ variables), then $f(yx)=f(zx)$, so $E[f(yx)f(zx)] = 1$.
• If $U$ doesn’t agree with $\alpha$, then $Pr(f(yx)=f(zx)) = \frac{1}{2}$ so $E[f(yx)f(zx)] = 0$.
• We know that any f can be written as a sum of parities, i.e. $f = \sum_U \hat{f}_U \chi_U$
• $E[f(yx)f(zx)] = \sum_{U, V} \hat{f}_U \hat{f}_V E[\chi_U(yx) \chi_V(zx)] = \sum_{U\in\mathcal{S}_\alpha} \hat{f}^2_U$, since $E[\chi_U(yx) \chi_V(zx)]$ is nonzero (equal to 1) only when $U = V$ and $U$ contains none of the first $k$ variables
• For an arbitrary prefix $\alpha \in \{0,1\}^k$, we can instead ask for $E[f(yx)f(zx) \chi_\alpha(y) \chi_\alpha(z)]$, where $\chi_\alpha$ is the parity over the positions $i \le k$ with $\alpha_i = 1$. Estimating these expectations requires membership queries, since we must evaluate $f$ at the correlated points $yx$ and $zx$. A toy implementation is sketched below.
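Here is a toy sketch of the K-M tree search (all names are my own). For small $n$ it computes each prefix weight exactly by enumerating $x, y, z$ and calling $f$ directly; in the actual learning setting these expectations would be estimated from random samples obtained via membership queries:

```python
from itertools import product

def f_maj3(x):  # majority of three +/-1 bits (hypothetical example)
    return 1 if sum(x) > 0 else -1

def chi_prefix(alpha, y):
    """Parity over the positions i with alpha_i = 1."""
    p = 1
    for a, yi in zip(alpha, y):
        if a == 1:
            p *= yi
    return p

def prefix_weight(f, n, alpha):
    """E_{x,y,z}[ f(yx) f(zx) chi_alpha(y) chi_alpha(z) ]
       = sum of hat{f}_S^2 over all S whose indicator starts with alpha."""
    k = len(alpha)
    vals = []
    for y in product([-1, 1], repeat=k):
        for z in product([-1, 1], repeat=k):
            for x in product([-1, 1], repeat=n - k):
                vals.append(f(y + x) * f(z + x)
                            * chi_prefix(alpha, y) * chi_prefix(alpha, z))
    return sum(vals) / len(vals)

def km_large_coeffs(f, n, tau):
    """Return all S (as index tuples) with hat{f}_S^2 >= tau."""
    out, stack = [], [()]
    while stack:
        alpha = stack.pop()
        if prefix_weight(f, n, alpha) < tau:
            continue                      # prune: weight below this prefix < tau
        if len(alpha) == n:               # leaf: alpha is the indicator of one S
            out.append(tuple(i for i, a in enumerate(alpha) if a == 1))
        else:
            stack += [alpha + (0,), alpha + (1,)]
    return out

print(km_large_coeffs(f_maj3, 3, 0.2))
# [(0,), (1,), (2,), (0, 1, 2)] in some order; each has hat{f}_S^2 = 1/4
```

The pruning step is exactly the recursion above: a prefix is expanded only if the squared-coefficient mass below it is at least $\tau$.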
• Lemma: let $\|\hat{f}\|_1 = \sum_S |\hat{f}_S|$. Then $\sum_{S: |\hat{f}_S| \le \epsilon/\|\hat{f}\|_1} \hat{f}^2_S \le \epsilon$
• pf: LHS $\le \sum_{S: |\hat{f}_S| \le \epsilon/\|\hat{f}\|_1} |\hat{f}_S| \cdot \frac{\epsilon}{\|\hat{f}\|_1} \le \epsilon$, since $\sum_S |\hat{f}_S| = \|\hat{f}\|_1$
• So it suffices to learn all $\hat{f}_S$ with $|\hat{f}_S| \ge \frac{\epsilon}{\|\hat{f}\|_1}$, i.e., run K-M with $\tau = (\epsilon/\|\hat{f}\|_1)^2$.
• For decision trees with at most $m$ leaves, $\|\hat{f}\|_1 \le 2m+1$, so K-M learns decision trees under the uniform distribution in time polynomial in $n$, $m$, and $1/\epsilon$.
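As a sanity check of this bound, the sketch below builds a hypothetical 4-leaf decision tree on 3 variables and computes its Fourier $\ell_1$ norm by brute force:

```python
from itertools import product, combinations
from math import prod

def f_tree(x):
    """Hypothetical 4-leaf decision tree: query x[0], then x[1] or x[2]."""
    if x[0] == 1:
        return 1 if x[1] == 1 else -1
    return -1 if x[2] == 1 else 1

n = 3
cube = list(product([-1, 1], repeat=n))
# ||hat{f}||_1 = sum over all S of |E[f(x) chi_S(x)]|
l1 = sum(
    abs(sum(f_tree(x) * prod(x[i] for i in S) for x in cube) / len(cube))
    for k in range(n + 1)
    for S in combinations(range(n), k)
)
print(l1)  # 2.0, well below 2m + 1 = 9 for m = 4 leaves
```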