(Assigned_Reviewer_2=R1, Assigned_Reviewer_5=R2, Assigned_Reviewer_6=R3)
We would like to thank the reviewers for their constructive comments and insights. Below, we refer to Mahdavi et al. 2012 as M12.
R1) Faster empirical convergence: We reused the MP algorithm from [M12, Alg.2]. The guarantees in [M12, Th.12] indicate that the O(T^2/3) regret bound holds only for large values of T (see the comments at the end of Sec. 4.1); we believe this is why MP performs poorly. The step sizes for OGD were taken from [M12, Th.4], and we observed that OGD progresses more slowly and never reaches the boundary of the domain (numerically, our step sizes (Table 1) are larger than those of [M12] by a factor of 100 to 1000). This translates into a larger cumulative regret. We tried other choices of step sizes for OGD (see [M12, Remark 5]) and observed the same behavior.
More fundamentally, we believe that our algorithm achieves faster convergence through the conjunction of three features: a) different step sizes for x and lambda, which account for the different properties of the primal and dual problems respectively, b) adaptive step sizes (eta_t, mu_t), which lead to faster and more substantial progress in the update rules, and c) an adaptive regularization parameter (i.e., 1/theta_t, as shown in Equation (3)), which gives x more freedom to move in the first iterations (our theta_t is larger than the theta of M12, implying a smaller regularization 1/theta_t).
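To make the structure of features a)-c) concrete, the following is a minimal sketch of a primal-dual update with separate adaptive step sizes (eta_t for x, mu_t for lambda) and a time-decaying regularization 1/theta_t, on a toy one-dimensional problem. The schedules, constants, and problem instance here are hypothetical illustrations, not the values from Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
R = 2.0  # domain: x in [-R, R]

x, lam = 0.0, 0.0
cum_loss, cum_violation = 0.0, 0.0

def f_grad(x, c):   # gradient of the round-t loss f_t(x) = (x - c)^2
    return 2.0 * (x - c)

def g(x):           # long-term constraint g(x) = x - 1 <= 0
    return x - 1.0

for t in range(1, T + 1):
    # illustrative adaptive schedules (hypothetical constants)
    eta_t   = 1.0 / np.sqrt(t)   # primal step size
    mu_t    = 0.5 / np.sqrt(t)   # dual step size, deliberately different from eta_t
    theta_t = np.sqrt(t)         # regularization 1/theta_t shrinks over time,
                                 # leaving x freer to move in early rounds

    c = rng.uniform(-2.0, 2.0)   # loss parameter revealed at round t
    cum_loss += (x - c) ** 2
    cum_violation += max(g(x), 0.0)

    # gradients of the regularized Lagrangian
    # L_t(x, lam) = f_t(x) + lam * g(x) - lam**2 / (2 * theta_t)
    gx   = f_grad(x, c) + lam        # d/dx   (here g'(x) = 1)
    glam = g(x) - lam / theta_t      # d/dlam

    x   = float(np.clip(x - eta_t * gx, -R, R))  # projected primal descent
    lam = max(0.0, lam + mu_t * glam)            # projected dual ascent
```

The separate schedules let the primal iterate make large early moves while the dual variable accumulates constraint violations at its own rate.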
R2) -References: Thank you for pointing us to these references. We will integrate the suggested references into the revised version.
-Assumption (4): This technical assumption can be better understood in light of [M12, Thm 7]. In particular, the optimality conditions of the offline problem show that the dual variable lambda is inversely proportional to the norm of the gradient appearing in Assumption (4). Lower bounding the norm of the gradient therefore makes it possible to upper bound the dual variable lambda. In turn, this allows for a tradeoff between the original problem and one that is perturbed by the slack variable gamma.
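Schematically, for a single constraint and with hypothetical notation (G a bound on the loss gradients, sigma the lower bound from Assumption (4)), the mechanism is the standard stationarity argument:

$$\nabla f(x^\star) + \lambda^\star \nabla g(x^\star) = 0
\;\Longrightarrow\;
\lambda^\star = \frac{\|\nabla f(x^\star)\|}{\|\nabla g(x^\star)\|}
\le \frac{G}{\sigma},$$

so the lower bound $\|\nabla g(x^\star)\| \ge \sigma$ directly yields the upper bound on $\lambda^\star$.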
-Critical technical innovation: The main challenge in our analysis lies in the fact that we cannot simply plug in the adaptive step sizes advocated in the convex online learning literature without long-term constraints, i.e., O(1/sqrt(t)) for the convex case, or O(1/(t*c)) for the strongly convex case with parameter c. In the strongly convex case, this leads to a non-vanishing regret bound for either the loss or the constraint violation. Care must be taken to adapt the regularization parameters theta_t accordingly. We capture this non-trivial property by deriving general sufficient conditions C1-2-3, which are novel to the best of our knowledge.
R3) -Optimality: We agree that our rates could be improved upon, both in the convex and strongly convex cases. Similarly, we agree that providing a lower bound and/or a sqrt(T)-sqrt(T) guarantee would be a substantial contribution, and a sensible next step for future work.
-Details and intuitions on the choices of the parameters: Intuitively, there is a tension between (a) making as much progress as possible in x with large eta_t's, and (b) adaptively controlling the constraint violation via 1/theta_t (see Equation (3)). Setting 1/theta_t too large risks impeding the progress in x; setting it too small introduces large constraint violations. As for mu_t, making it larger implies a stronger penalization of the violations. With this intuitive picture in mind, we use conditions C1-2-3, and especially C2, to formalize all of these tradeoffs. We will clarify these points in the revised version.