# Notes: Online Optimization

The online optimization problem is defined as follows:

• For each time $t = 1,\dots, T$
1. Learner makes a decision $d_t \in \mathcal{D}$
2. Observes the cost function $c_t$ and incurs cost $c_t(d_t)$
• The goal is to minimize the regret: the difference between the algorithm’s total cost and the cost of always using a single best decision $d^*$
• We now consider online linear optimization
• The decision space $\mathcal{D} \subseteq R^n$
• The cost function is a linear function: $c_t(d_t) = d_t \cdot s_t$, where $s_t\in S\subseteq R^n$
• Assume we have access to an oracle $M$, such that $M(s) = \arg\min_{d\in \mathcal{D}} d\cdot s$
• $d_t = M(s_{1:t-1})$, where $s_{1:t-1} =\sum_{i=1}^{t-1} s_i$
• This naive "follow the leader" algorithm fails even in a simple 2-dimensional case: an adversary can make it switch decisions every round, so its regret grows linearly in $T$
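As a concrete illustration (a toy instance of my own, not from the lecture): with two basis-vector decisions and alternating costs, the leader switches every round and pays almost every cost, while the best fixed decision pays about half.

```python
import numpy as np

def M(s):
    # Oracle: argmin over the decisions {e1, e2} of d . s (ties -> e1)
    return np.eye(2)[np.argmin(s)]

# Costs: s_1 = (0.5, 0), then alternating (0, 1), (1, 0), ...
T = 1000
costs = [np.array([0.5, 0.0])] + [
    np.array([0.0, 1.0]) if t % 2 == 1 else np.array([1.0, 0.0])
    for t in range(1, T)
]

cum = np.zeros(2)
ftl_cost = 0.0
for s in costs:
    ftl_cost += M(cum) @ s   # d_t = M(s_{1:t-1})
    cum += s

best_fixed = min(np.eye(2)[i] @ cum for i in range(2))
print(ftl_cost, best_fixed)  # leader pays ~T, best fixed decision ~T/2
```

After the first round, the cumulative cost always leans away from the coordinate about to be charged, so the leader picks exactly the wrong decision every time.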
• Follow the Perturbed Leader: $FPL(\epsilon)$
1. Pick $p_t$ uniformly from $[0, \frac{1}{\epsilon}]^n$
2. Choose $M(s_{1:t-1} + p_t)$
• Another variant: $FPL^*(\epsilon)$
1. Pick $p_t = x$ with density $d\mu(x) \propto e^{-\epsilon ||x||_1}$
2. Choose $M(s_{1:t-1} + p_t)$
• Assume
• $|d-d'|_1 \le D$,  $\forall d,d' \in \mathcal{D}$.
• $|d \cdot s| \le R$,  $\forall d\in \mathcal{D}, s\in S$.
• $|s_t|_1 \le A$,  $\forall t\in \{ 1,\dots, T \}$
• Theorem 1:
• Set $\epsilon = \sqrt{\frac{D}{ART}}$, then $FPL(\epsilon)$ has cost $\le OPT + \epsilon \cdot ART + \frac{D}{\epsilon}$
• Set $\epsilon = \min \{ \frac{1}{2A}, \sqrt{\frac{D(\ln n + 1)}{A\cdot OPT}} \}$ , then $FPL^*(\epsilon)$ has cost $\le (1+\epsilon)OPT + \frac{4AD(\ln n +1)}{\epsilon}$
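To see the perturbation at work, here is a sketch of $FPL(\epsilon)$ on a toy 2-dimensional instance that defeats the unperturbed leader (the instance and parameters $D = 2$, $A = R = 1$, $T = 1000$ are my own illustrative choices, with $\epsilon = \sqrt{D/(ART)}$ as in Theorem 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def M(s):
    # Oracle over the two basis-vector decisions
    return np.eye(2)[np.argmin(s)]

# Alternating costs that make the unperturbed leader pay ~T
T = 1000
costs = [np.array([0.5, 0.0])] + [
    np.array([0.0, 1.0]) if t % 2 == 1 else np.array([1.0, 0.0])
    for t in range(1, T)
]

eps = np.sqrt(2.0 / T)  # eps = sqrt(D / (A*R*T)) with D = 2, A = R = 1
cum = np.zeros(2)
fpl_cost = 0.0
for s in costs:
    p = rng.uniform(0.0, 1.0 / eps, size=2)  # fresh perturbation each round
    fpl_cost += M(cum + p) @ s               # d_t = M(s_{1:t-1} + p_t)
    cum += s

best_fixed = min(np.eye(2)[i] @ cum for i in range(2))  # OPT ~ T/2
print(fpl_cost, best_fixed)
```

The random perturbation breaks the adversarial switching pattern, and the realized cost stays close to $OPT$ plus the $O(\sqrt{DART})$ regret the theorem predicts.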
• Lemma 1:
• Consider the “Be the leader” algorithm (impossible in practice, only for analysis): use $d_t = M(s_{1:t} )$ at time $t$
• $\sum_{t=1}^T M(s_{1:t}) \cdot s_t \le M(s_{1:T}) \cdot (s_{1:T})$
• We prove this by induction on $T$ (the base case $T = 1$ holds with equality):
• $\sum_{t=1}^{T+1} M(s_{1:t}) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + M(s_{1:T+1}) \cdot s_{T+1} \le M(s_{1:T+1}) \cdot s_{1:T+1}$, where the first step is the induction hypothesis and the second uses $M(s_{1:T}) \cdot s_{1:T} \le M(s_{1:T+1}) \cdot s_{1:T}$, since $M(s_{1:T})$ minimizes $d \cdot s_{1:T}$
• Lemma 2: “Be the perturbed leader” has a small regret
• $\sum_{t=1}^{T} M(s_{1:t} + p_t) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + D\sum_{t=1}^T |p_t - p_{t-1}|_\infty$ (with the convention $p_0 = 0$).
• Proof: apply Lemma 1 to the perturbed costs $s'_t = s_t + p_t - p_{t-1}$, so that $s'_{1:t} = s_{1:t} + p_t$
• $\sum_{t=1}^T M(s'_{1:t}) \cdot s'_t \le M(s'_{1:T}) \cdot s'_{1:T}$
• $\sum_{t=1}^T M(s_{1:t} + p_t) \cdot (s_t + p_t - p_{t-1}) \le M(s_{1:T} + p_T) \cdot (s_{1:T} + p_T) \le M(s_{1:T}) \cdot (s_{1:T} + \sum_{t=1}^T (p_t - p_{t-1}))$
• $\sum_{t=1}^T M(s_{1:t} + p_t) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + \sum_{t=1}^T \left( M(s_{1:T}) - M(s_{1:t} + p_t) \right) (p_t - p_{t-1}) \le M(s_{1:T}) \cdot s_{1:T} + D\sum_{t=1}^T |p_t - p_{t-1}|_\infty$
• Lemma 3:
• The distributions of $s_{1:t-1} + p_t$ and $s_{1:t} + p_t$ are similar: both are uniform over (shifted) cubes. If they overlap on a fraction $f$ of their volume, then $E(M(s_{1:t-1} + p_t)\cdot s_t) \le E(M(s_{1:t}+p_t)\cdot s_t) + (1-f) R$
• Lemma 4:
• For any $v\in R^n$, the cubes $[0, \frac{1}{\epsilon}]^n$ and $v + [0, \frac{1}{\epsilon}]^n$ overlap in at least a $(1-\epsilon |v|_1)$ fraction.
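A quick numeric sanity check of Lemma 4 (my own, with illustrative values): the overlap of $[0, \frac{1}{\epsilon}]^n$ with its translate by $v$ has volume $\prod_i \max(0, \frac{1}{\epsilon} - |v_i|)$, so the overlap fraction is $\prod_i (1 - \epsilon|v_i|)_+ \ge 1 - \epsilon|v|_1$:

```python
import numpy as np

def overlap_fraction(v, eps):
    # Fraction of [0, 1/eps]^n shared with its translate by v:
    # the cubes overlap by (1/eps - |v_i|)_+ along each axis
    return np.prod(np.clip(1.0 - eps * np.abs(v), 0.0, None))

rng = np.random.default_rng(1)
eps = 0.1
for _ in range(100):
    v = rng.uniform(-3, 3, size=4)
    # Lemma 4: overlap fraction >= 1 - eps * |v|_1
    assert overlap_fraction(v, eps) >= 1.0 - eps * np.abs(v).sum() - 1e-12
```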
• Proof of Theorem 1:
• $E(cost(FPL(\epsilon))) =\sum_{t=1}^T E(M(s_{1:t-1}+p_t)\cdot s_t)$
• (using Lemma 3, with Lemma 4 giving $f \ge 1 - \epsilon|s_t|_1 \ge 1 - \epsilon A$) $\le \sum_{t=1}^T E(M(s_{1:t}+p_t)\cdot s_t) + \epsilon ART$
• (using Lemma 2; since the $p_t$ are identically distributed, we may couple them so that $p_1 = \dots = p_T$, making $D\sum_{t=1}^T |p_t - p_{t-1}|_\infty \le D|p_1|_\infty \le \frac{D}{\epsilon}$) $\le OPT + \epsilon ART + \frac{D}{\epsilon}$
• Online Convex Optimization
• We now consider the case where the cost functions $c^t$ are convex
• At each time $t$, choose the decision $x^t\in K$, where $K$ is a convex body of diameter $D$.
• Let $\nabla c^t(x) = \left( \frac{\partial c^t(x)}{\partial x_1}, \dots, \frac{\partial c^t(x)}{\partial x_n} \right)$ be the gradient of the cost function.
• $x^{t+1} = Proj_K \left( x^t - \eta_t \nabla c^t(x^t) \right)$
• Facts:
1. For a convex function $f$, $f(y) - f(x) \ge \nabla f(x) \cdot (y-x)$
2. $\forall a\in K$, $||Proj_K (y) - a|| \le || y -a ||$
• Theorem:
• Let $x^* = \arg\min_{x\in K}\sum_{t=1}^T c^t(x)$.  Assume $||\nabla c^t(x)|| \le G$ for all $x \in K$.
• If we set $\eta_t = \frac{D}{G\sqrt{T}}$, then $\sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le DG\sqrt{T}$.
• If we set $\eta_t =\frac{1}{\sqrt{t}}$, then $\sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le (\frac{D^2}{2} + G^2)\sqrt{T}$.
• Proof:
• Let $\phi_t = ||x^t - x^*||^2$
• $\phi_{t+1} -\phi_t = ||x^{t+1} - x^*||^2 - ||x^t - x^*||^2 = ||Proj_K \left( x^t - \eta_t \nabla c^t(x^t) \right) - x^*||^2 - ||x^t - x^*||^2 \le || x^t - \eta_t \nabla c^t(x^t) - x^*||^2 - ||x^t - x^*||^2 = \eta_t^2||\nabla c^t(x^t)||^2 - 2\eta_t \nabla c^t(x^t) (x^t - x^*)$
• Rearranging, and using $||\nabla c^t(x^t)|| \le G$: $\nabla c^t(x^t) (x^t - x^*) \le \frac{\eta_t}{2} G^2 - \frac{1}{2\eta_t} (\phi_{t+1} - \phi_t)$
• By Fact 1, we have $c^t(x^t) - c^t(x^*) \le \nabla c^t(x^t) (x^t - x^*)$
• Set $\eta_t =\eta = \frac{D}{G\sqrt{T}}$; summing over $t$, the $\phi$ terms telescope and $\phi_1 \le D^2$, so $\sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le \frac{\eta T}{2} G^2 + \frac{D^2}{2\eta} \le DG\sqrt{T}$
• Set $\eta_t = \frac{1}{\sqrt{t}}$, then $\sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le \frac{G^2}{2}\sum_{t=1}^T \eta_t + \frac{\phi_1}{2 \eta_1} + \sum_{t=2}^T \frac{\phi_t}{2} \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) \le \frac{G^2}{2}\sum_{t=1}^T \eta_t + \frac{D^2}{2\eta_T}$ (using $\phi_t \le D^2$ and telescoping)
• $\sum_{t=1}^T \frac{1}{\sqrt{t}} \le 1 + \int_{t=1}^T \frac{dt}{\sqrt{t}} \le 2\sqrt{T}$
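The fixed-step bound can be checked numerically. A minimal sketch of projected online gradient descent, under an assumed setup of my own ($K$ the unit Euclidean ball, so $D = 2$; linear costs $c^t(x) = g_t \cdot x$ with $||g_t|| = G = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)

def proj(y):
    # Euclidean projection onto the unit ball K
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

T, G, D = 1000, 1.0, 2.0
# Random unit-norm cost gradients, so ||grad c^t|| = G exactly
gs = [g / np.linalg.norm(g) for g in rng.standard_normal((T, 3))]

eta = D / (G * np.sqrt(T))
x = np.zeros(3)
alg_cost = 0.0
for g in gs:
    alg_cost += g @ x          # incur c^t(x^t) = g_t . x^t
    x = proj(x - eta * g)      # x^{t+1} = Proj_K(x^t - eta * grad)

S = sum(gs)
x_star = -S / np.linalg.norm(S)       # minimizer of x . S over the unit ball
regret = alg_cost - x_star @ S
assert regret <= D * G * np.sqrt(T)   # the theorem's DG*sqrt(T) bound
print(regret)
```

For linear costs the proof's inequalities hold pathwise, so the $DG\sqrt{T}$ bound is guaranteed on every run, not just in expectation.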
• In HW4, we prove that FPL also works for online convex optimization, by using the gradients as the linear cost functions.