Notes: Online Optimization

The online optimization problem is defined as follows:

  • For each time t = 1,\dots, T
    1. Learner makes a decision d_t \in \mathcal{D}
    2. Observes the cost function c_t and incurs cost c_t(d_t)
  • The goal is to minimize the regret: the difference between the algorithm’s total cost and the cost of always using a single best decision d^*
  • We now consider online linear optimization
    • The decision space \mathcal{D} \subseteq R^n
    • The cost function is a linear function: c_t(d_t) = d_t \cdot s_t, where s_t\in S\subseteq R^n
    • Assume we have access to an oracle M, such that M(s) = \arg min_{d\in \mathcal{D}} d\cdot s
  • A naive approach would be “follow the leader”
    • d_t = M(s_{1:t-1}), where s_{1:t-1} =\sum_{i=1}^{t-1} s_i
    • The algorithm fails even in a simple 2-dimensional case (a small simulation of such a case is sketched below)
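    • A minimal simulation of this failure, on a hypothetical 2-dimensional instance of my own choosing (not from the notes): the decisions are the two coordinate vectors e_1 and e_2, and after an initial cost of (1/2, 0) the adversary alternates cost vectors so that the leader always switches onto the coordinate that is about to be charged:

      import numpy as np

      # Hypothetical 2-dimensional instance (illustrative): costs are
      # (1/2, 0), then (0, 1), (1, 0), (0, 1), ...
      T = 1000
      costs = [np.array([0.5, 0.0])] + [
          np.array([0.0, 1.0]) if t % 2 == 0 else np.array([1.0, 0.0])
          for t in range(2, T + 1)
      ]

      def oracle(s):
          """M(s): the coordinate vector minimizing d . s."""
          d = np.zeros(2)
          d[np.argmin(s)] = 1.0
          return d

      cum = np.zeros(2)         # s_{1:t-1}
      ftl_cost = 0.0
      for s_t in costs:
          d_t = oracle(cum)     # follow the leader: best decision on the history
          ftl_cost += d_t @ s_t
          cum += s_t

      # FTL pays ~T while the best single fixed decision pays ~T/2.
      print(ftl_cost, cum.min())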
  • Follow the Perturbed leader: FPL(\epsilon)
    1. Pick p_t uniformly from [0, \frac{1}{\epsilon}]^n
    2. Choose M(s_{1:t-1} + p_t)
  • Another variant: FPL^*(\epsilon)
    1. Pick p_t = x with density d\mu(x) \propto e^{-\epsilon ||x||_1}
    2. Choose M(s_{1:t-1} + p_t)
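  • A minimal sketch of both perturbation rules in code, reusing the coordinate-vector decision set and argmin oracle from the example above (the decision set, the value of \epsilon, and the history below are illustrative assumptions, not part of the notes):

      import numpy as np

      rng = np.random.default_rng(0)
      n, eps = 2, 0.05

      def oracle(s):
          """M(s) = argmin_{d in D} d . s over D = {e_1, ..., e_n}."""
          d = np.zeros(n)
          d[np.argmin(s)] = 1.0
          return d

      def fpl_decision(cum, eps):
          """FPL(eps): perturb the history by a uniform draw from [0, 1/eps]^n."""
          p_t = rng.uniform(0.0, 1.0 / eps, size=n)
          return oracle(cum + p_t)

      def fpl_star_decision(cum, eps):
          """FPL*(eps): perturbation with density proportional to exp(-eps * ||x||_1).
          Drawing each coordinate independently from a Laplace distribution with
          scale 1/eps gives exactly this density."""
          p_t = rng.laplace(loc=0.0, scale=1.0 / eps, size=n)
          return oracle(cum + p_t)

      # One round of play against a cost vector s_t, given history cum = s_{1:t-1}:
      cum = np.array([3.0, 2.5])
      s_t = np.array([1.0, 0.0])
      print(fpl_decision(cum, eps) @ s_t, fpl_star_decision(cum, eps) @ s_t)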
  • Assume
    • |d-d'|_1 \le D,  \forall d,d' \in \mathcal{D}.
    • |d \cdot s| \le R,  \forall d\in \mathcal{D}, s\in S.
    • |s_t|_1 \le A,  \forall t\in \{ 1,\dots, T \}.
  • Theorem 1:
    • Set \epsilon = \sqrt{\frac{D}{ART}}, then FPL(\epsilon) has cost \le OPT + \epsilon \cdot ART + \frac{D}{\epsilon}
    • Set \epsilon = \min \{ \frac{1}{2A}, \sqrt{\frac{D(\ln n + 1)}{A\cdot OPT}} \} , then FPL^*(\epsilon) has cost \le (1+\epsilon)OPT + \frac{4AD(\ln n +1)}{\epsilon}
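    • For intuition, the first choice of \epsilon balances the two regret terms: \epsilon ART + \frac{D}{\epsilon} = \sqrt{DART} + \sqrt{DART} = 2\sqrt{DART}, so FPL(\epsilon)'s regret is O(\sqrt{DART})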
  • Lemma 1:
    • Consider the “Be the leader” algorithm (not implementable, since it uses s_t before it is revealed; it is only used for the analysis): use d_t = M(s_{1:t}) at time t
    • \sum_{t=1}^T M(s_{1:t}) \cdot s_t \le M(s_{1:T}) \cdot (s_{1:T})
    • We prove this by induction:
      • \sum_{t=1}^{T+1} M(s_{1:t}) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + M(s_{1:T+1}) \cdot s_{T+1} \le M(s_{1:T+1}) \cdot s_{1:T+1}, where the first inequality is the induction hypothesis and the second uses that M(s_{1:T}) minimizes d \cdot s_{1:T}, so M(s_{1:T}) \cdot s_{1:T} \le M(s_{1:T+1}) \cdot s_{1:T}
  • Lemma 2: “Be the perturbed leader” has a small regret
    • \sum_{t=1}^{T} M(s_{1:t} + p_t) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + D\sum_{t=1}^T |p_t - p_{t-1}|_\infty.
    • Proof: apply Lemma 1 to the perturbed costs s'_t = s_t + p_t - p_{t-1} (with p_0 = 0)
    • \sum_{t=1}^T M(s'_{1:t}) \cdot s'_t \le M(s'_{1:T}) \cdot s'_{1:T}
    • \sum_{t=1}^T M(s_{1:t} + p_t) \cdot (s_t + p_t - p_{t-1}) \le M(s_{1:T} + p_T) \cdot (s_{1:T} + p_T) \le M(s_{1:T}) \cdot (s_{1:T} + \sum_{t=1}^T (p_t - p_{t-1}))
    • \sum_{t=1}^T M(s_{1:t} + p_t) \cdot s_t \le M(s_{1:T}) \cdot s_{1:T} + \sum_{t=1}^T \left( M(s_{1:T}) - M(s_{1:t} + p_t) \right) (p_t - p_{t-1}) \le M(s_{1:T}) \cdot s_{1:T} + D\sum_{t=1}^T |p_t - p_{t-1}|_\infty
  • Lemma 3:
    • The distributions of s_{1:t-1} + p_t and s_{1:t} + p_t are similar: both are uniform over (shifted) cubes. If the cubes overlap on a fraction f of their volume, then E(M(s_{1:t-1} + p_t)\cdot s_t) \le E(M(s_{1:t}+p_t)\cdot s_t) + (1-f) R
  • Lemma 4:
    • For any v\in R^n, the cubes [0, \frac{1}{\epsilon}]^n and v + [0, \frac{1}{\epsilon}]^n overlap in at least a (1-\epsilon |v|_1) fraction.
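    • A quick numeric check of the overlap bound (my own illustration; the values of \epsilon and v are arbitrary): the overlap volume of the two cubes is \prod_i \max(0, \frac{1}{\epsilon} - |v_i|), so the overlap fraction is \prod_i \max(0, 1 - \epsilon|v_i|) \ge 1 - \epsilon |v|_1:

      import numpy as np

      rng = np.random.default_rng(1)
      eps, n = 0.1, 5
      for _ in range(5):
          v = rng.uniform(-2.0, 2.0, size=n)
          # exact overlap fraction of [0, 1/eps]^n and v + [0, 1/eps]^n
          frac = np.prod(np.clip(1.0 - eps * np.abs(v), 0.0, None))
          bound = 1.0 - eps * np.abs(v).sum()
          assert frac >= bound - 1e-12
          print(f"overlap fraction {frac:.3f} >= 1 - eps*|v|_1 = {bound:.3f}")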
  • Proof of Theorem 1:
    • E(cost(FPL(\epsilon))) =\sum_{t=1}^T E(M(s_{1:t-1}+p_t)\cdot s_t)
    • (using Lemma 3, and Lemma 4 with v = s_t, which gives 1 - f \le \epsilon |s_t|_1 \le \epsilon A) \le \sum_{t=1}^T E(M(s_{1:t}+p_t)\cdot s_t) + \epsilon ART
    • (using Lemma 2; for the analysis the perturbations can be coupled so that p_t = p_1 for all t, since each round's expected cost depends only on the marginal distribution of p_t, and then D\sum_{t=1}^T |p_t - p_{t-1}|_\infty = D|p_1|_\infty \le \frac{D}{\epsilon}) \le OPT + \epsilon ART + \frac{D}{\epsilon}
  • Online Convex Optimization
    • We now consider the case where the cost functions c^t are convex
    • At each time t, choose the decision x^t\in K, where K is a convex body of diameter D.
    • Let \nabla c^t(x) = \left( \frac{\partial c^t(x)}{\partial x_1}, \dots, \frac{\partial c^t(x)}{\partial x_n} \right) be the gradient of the cost function.
  • Online Gradient Descent:
    • x^{t+1} = Proj_K \left( x^t - \eta_t \nabla c^t(x^t) \right)
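    • A minimal sketch of projected online gradient descent, assuming K is the Euclidean ball of radius D/2 (so diameter D) and the costs are linear, c^t(x) = g_t \cdot x with ||g_t|| \le G; the dimension, horizon, and cost sequence below are illustrative:

      import numpy as np

      rng = np.random.default_rng(2)
      n, T = 5, 2000
      D, G = 2.0, 1.0
      eta = D / (G * np.sqrt(T))          # fixed step size from the theorem below

      def proj_ball(y, radius):
          """Euclidean projection onto K = {x : ||x|| <= radius}."""
          norm = np.linalg.norm(y)
          return y if norm <= radius else y * (radius / norm)

      # Random linear costs c^t(x) = g_t . x with ||g_t|| <= G = 1.
      grads = rng.normal(size=(T, n))
      grads /= np.maximum(np.linalg.norm(grads, axis=1, keepdims=True), 1.0)

      x = np.zeros(n)
      alg_cost = 0.0
      for g in grads:
          alg_cost += g @ x                      # incur c^t(x^t)
          x = proj_ball(x - eta * g, D / 2)      # x^{t+1} = Proj_K(x^t - eta * grad c^t(x^t))

      cum_grad = grads.sum(axis=0)
      x_star = -(D / 2) * cum_grad / np.linalg.norm(cum_grad)   # best fixed point in K
      regret = alg_cost - cum_grad @ x_star
      print(regret, D * G * np.sqrt(T))          # regret should be <= D*G*sqrt(T)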
  • Facts:
    1. For a convex function f, f(y) - f(x) \ge \nabla f(x) \cdot (y-x)
    2. \forall a\in K, ||Proj_K (y) - a|| \le || y -a ||
  • Theorem:
    • Let x^* = \arg min_{x\in K}\sum_{t=1}^T c^t(x).  Assume ||\nabla c^t|| \le G.
    • If we set \eta_t = \frac{D}{G\sqrt{T}}, then \sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le DG\sqrt{T}.
    • If we set \eta_t = \frac{1}{\sqrt{t}}, then \sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le (\frac{D^2}{2} + G^2)\sqrt{T}.
  • Proof:
    • Let \phi_t = ||x^t - x^*||^2
    • \phi_{t+1} -\phi_t = ||x^{t+1} - x^*||^2 - ||x^t - x^*||^2 = ||Proj_K \left( x^t - \eta_t \nabla c^t(x^t) \right) - x^*||^2 - ||x^t - x^*||^2 \le || x^t - \eta_t \nabla c^t(x^t) - x^*||^2 - ||x^t - x^*||^2 = \eta_t^2||\nabla c^t(x^t)||^2 - 2\eta_t \nabla c^t(x^t) \cdot (x^t - x^*), where the inequality is Fact 2 with a = x^*
    • Rearranging and using ||\nabla c^t(x^t)|| \le G: \nabla c^t(x^t) \cdot (x^t - x^*) \le \frac{\eta_t}{2} G^2 - \frac{1}{2\eta_t} (\phi_{t+1} - \phi_t)
    • By Fact 1, we have c^t(x^t) - c^t(x^*) \le \nabla c^t(x^t) \cdot (x^t - x^*)
    • Set \eta_t =\eta = \frac{D}{G\sqrt{T}}, then \sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le \frac{\eta T}{2} G^2 + \frac{D^2}{2\eta} \le DG\sqrt{T}
    • Set \eta_t = \frac{1}{\sqrt{t}}, then \sum_{t=1}^T \left[ c^t(x^t) - c^t(x^*) \right] \le \frac{G^2}{2}\sum_{t=1}^T \eta_t + \frac{\phi_1}{2 \eta_1} + \frac{1}{2}\sum_{t=2}^T \phi_t \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) \le \frac{G^2}{2}\sum_{t=1}^T \eta_t + \frac{D^2}{2\eta_T}
    • \sum_{t=1}^T \frac{1}{\sqrt{t}} \le 1 + \int_{t=1}^T \frac{dt}{\sqrt{t}} \le 2\sqrt{T}
  • In HW4, we prove that FPL also works for online convex optimization, by using the gradients as the linear cost functions.