Paper ID: 212
Title: Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies online convex optimization with dynamic regret, where regret is measured versus the sequence of optimizers of the individual loss functions (in contrast to the typical setting where regret is versus the optimizer of the sum of loss functions). The authors prove upper and lower bounds on the dynamic regret in terms of the "path variation"---the total variation/movement of the sequence of optimizers---under various feedback models and assumptions on the loss functions. This complements recent results that characterized dynamic regret using different notions of variation (functional variation, gradient variation).
Clarity - Justification: The paper is mostly well-written; the main results are stated clearly and previous work is summarized nicely. The technical proofs are easy to follow.
Significance - Justification: I think the paper contributes nicely to a recently active line of research in online learning. It gives in several cases a tight characterization of the dynamic regret in terms of the path variation, which is a rather natural regularity metric.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, the paper gives a nice and extensive set of results characterizing the dynamic regret in terms of the path variation. It is unfortunate, though, that one of the main cases of interest, namely Lipschitz functions with true gradient feedback, remains unaddressed. The technical contribution of the paper is perhaps not too substantial, but the results appear to be novel. Few questions/comments: * Is the smoothness (+vanishing gradients) of the loss functions really required for the O(V_T) result? Can you give an example showing that O(V_T) dynamic regret cannot be achieved for general convex functions? I think that addressing the non-smooth (Lipschitz) case can make the paper much stronger. * I believe that a result of Zinkevich (2003) directly implies a sqrt(V_T * T) dynamic regret bound for the Lipshitz+True gradient case, and his proof seem to be easily extended to the stochastic gradient feedback case. This should be mentioned in the introduction/related work. Additional comments: * Eqs (1), (4), (5): end sentence with a period. * Line 314: E_f[] is defined but is not used in the equation above. * Lines 334-336: funny English, please rephrase the sentence. * Line 345: "with" => "without". * Thm 3: I suggest citing "Smoothness, Low-Noise and Fast Rates" by Srebro et al. in this context, as the argument used in the proof is very similar to the one used there. * Line 520: "Besbes" => "Besbes et al.". * Assumption 5: what is a cumulative function of a random vector? Did you mean that P() is the pdf? * Thm 7: why does the theorem need Assumption 5? It seems that only a bound on the variance of the noise is required. * Sec 3.3: please define bandit and two-point feedbacks.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper studies dynamic regret bound in terms of the variation between consecutive optimal points, in a general online convex optimization setting. Both lower and upper bounds are discussed, under several different feedback models, such as full information of the loss function, a single (noisy) gradient, or a single point function value (bandit).
Clarity - Justification: In general the paper is pretty clear and well written. However, it would be more clear if a discussion of how to compare those different variation measurements is included.
Significance - Justification: The improved results over previous work are nice. However, I don't see too many new techniques in this paper. The overall contribution is incremental.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): 1. Table 1 is a good summary, but as I mentioned above, it would be nice to have a discussion to compare those different variation terms (e.g. under what circumstance is one better than the others). 2. Line 314, E_f[] is actually not used in the equation above. 3. For several places, such as the proof of Thm 6, a picture of the concrete functions would be very helpful for readers to understand the ideas. 4. I don't understand why the vanishing gradient condition is necessary. In Line 498, f_t(w_t^*) is at most f_t(w_t') just by the definition of w_t^*. I don't think the argument in Line 493-494 is necessary. 5. Line 608, V_p' is defined as a set of loss functions, but really should be defined as a set of sequences of loss functions. Also, Line 644-651, "P" should be "P_f^\pi" or "P_g^\pi"? 6. Algorithms in Sec 3.2 and Sec 3.3 both require knowing B_T to set the parameters, but one can't even figure out B_T after playing the game since the feedback is partial. So is there a way to get around this? 7. Line 773, what does "\leq" mean for sets? 8. Line 4 of Algorithm 1, the two w's are identicalâ€¦ one of the "+" must be "-". 9. In the appendix, the numberings are wrong for all three section titles (i.e. Lemma 1 should be Lemma 4, Theorem 7 and 8 should be 8 and 9 respectively). 10. It would be interesting to replace the dependence in T in some of the bounds by a finer measurement, see an example in the expert setting in Luo and Schapire, "Achieving All with No Parameters: AdaNormalHedge", 2015, Sec 5.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies the online convex optimization where the performance is regret against the best sequence of dynamic choices (equivalently, where the losses are non-negative and the performance measure is total loss). Following on previous work, the paper aims to show that exiting algorithms attain optimal regret bounds, where the bounds are in terms of the "path variation", i.e., how fast the minimizer points of the losses change. The paper studies algorithms that receive full, true gradient and noisy gradient feedback, as well as two-point bandit feedback, and establishes regret bounds for them.
Clarity - Justification: The paper has various grammatical and mathematical bugs in presentation (see detailed comments for some examples).
Significance - Justification: The paper seems interesting since it provides lower bounds and matching upper bounds. However, the paper also has numerous presentation bugs and is sloppy, and makes assumptions that seem unnecessarily restrictive.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): In summary, given the many presentation issues and the questions below, I only marginally recommend accepting this paper. Presentation and clarity: The paper repeatedly has errors in English usage, and sloppiness in math usage. For example: 1- The first sentences of Section 2.1 2- Incorrect use of "follows that" repeatedly 3- The last sentence of the Remark on page 3 is extra; that's already the meaning of what was said before.. 4- Shouldn't it be $o(T)$ rather than $O(T)$ on lines 284 and 291? 5- The statement of Proposition 1 defines $C$ twice, and misses $C_2$. Which $C$ is which, and where is $C_2$? One needs to read the proof to find out. 6- You mention $E_f$ on like 314, but there is no $E_f$ in the equation above that sentence. Please consider fixing these and all other typos, grammatical problems, and sloppy mathematics. Questions about significance and assumptions: 1- Is the vanishing gradient assumption really necessary? Can you give an example where the Theorem fails if this assumption is not satisfied? 2- The overall significance of the results is not established clearly. The comparison of the regret bounds with previous work should mention why we should care about the "path variation" rather than the other, existing types of variation?
=====