We are grateful to the reviewers for their helpful and constructive comments for improving the paper. We will revise the paper according to these comments by polishing the sentences, clarifying some notations and references and correcting some typos. We respond to the major concerns in the following. $
Q (R1, R2, R3): Is the vanishing gradient assumption really necessary? In Line 498, f_t(w_t^*) is at most f_t(w_t') just by the definition of w_t^*. I don't think the argument in Line 493-494 is necessary.
A: We are not clear at the moment whether the vanishing gradient assumption is necessary for achieving an O(V^p_T) dynamic regret by using a different analysis. We will mark it as an open problem. Nonetheless, our analysis requires this condition because that w_t’ is not necessarily in the problem domain \Omega and w_t^* is just the minimizer of f_t(w) in the domain \Omega. Therefore we do not have f_t(w_t^*)\leq f_t(w_t’) directly. When w_t^* has vanishing gradient, it implies that w_t^* is also a minimizer of f(w) in the domain R^d, thus making f_t(w_t^*)\leq f_t(w_t’). For example, we can consider f_t(w) =w^2/2 and \Omega\in[1,3]. The minimizer w_t^* in this case is 1 (which does not satisfy the vanishing gradient condition). Consider w_t = 2. Then w_t’ = 0 and f(w_t’)\leq f(w_t). Then the inequality in line 498 does not hold.
Q (R1, R2, R3): E_f[] is defined but is not used in the equation above.
A: The expectation used in the equation above is indeed the expectation over the randomness in the function. We will add the subscript _f in the equation.
Q (R1): why we should care about the "path variation" rather than the other, existing types of variation?
A: Path variation seems to be a rather natural regularity measure (also mentioned in reviewer 3’s comments), since $w_t^*$ are the optimal comparators in the definition of the dynamic regret. The regret bound based on path variation directly relates the regret with the properties of the optimal comparators. It has also been used in shifting regret analysis for tracking the best expert (Herbster & Warmuth, 1998), and drifting regret analysis (Cesa-Bianchi et al., 2012; Buchbinder et al., 2012). However, we do not intend to disregard the work on bounding the regret using other types of variation.
Q(R1): Shouldn't it be $o(T)$ rather than $O(T)$ on lines 284 and 291?
A: Yes, we will correct it.
Q (R2): it would be nice to have a discussion to compare those different variation terms
A: We will add some discussions.
Q (R2): Line 608, V_p' is defined as a set of loss functions, but really should be defined as a set of sequences of loss functions. Also, Line 644-651, "P" should be "P_f^\pi" or "P_g^\pi"?
A: Yes, V_p' is a set of sequence of loss functions. We will correct it. Line 631 should be max{P_f^\pi(X_t\leq 0), P_g^\pi(X_t>0)}\ge q 1/(4e^\beta). Line 645 should be P_f^\pi. Line 648 should be P_g^\pi. And line 651 is P_f^\pi(X_t\leq 0) + P_g^\pi(X_t>0).
Q (R2): Algorithms in Sec 3.2 and Sec 3.3 both require knowing B_T to set the parameters, but one can't even figure out B_T after playing the game since the feedback is partial. So is there a way to get around this?
A: It is still an open problem. The algorithm in Jadbabaie et al. has an automatic strategy for setting the step size, but it requires the full knowledge of the loss functions.
Q(R2): Line 773, what does "\leq" mean for sets?
A: It should be subset.
Q(R3): what is a cumulative function of a random vector? Did you mean that P() is the pdf? why does the theorem need Assumption 5
A: The cumulative function means the cumulative distribution function. In the proof of Theorem 6, the inequality in line 631 uses the Lemma 1 and 2 from (Besbes et al.) which requires Assumption 5. When the noise is Gaussian noise with bounded variance, then assumption 5 holds.