To reviewer 1
We thank the reviewer for the comments, and in particular for the suggestion about the extension to smoothness and local-smoothness assumptions.
A quick calculation suggests that exploiting the smoothness of F(w) within the framework of this paper is not straightforward. In the current proof, one key step is to bound Reg_T - \sum_{t=1}^{T} \frac{\beta}{4}\|w_t-w^*\|_2^2 instead of Reg_T alone. The term \frac{\beta}{4}\|w_t-w^*\|_2^2 is then related to F(w_t)-F(w^*) using strong convexity. This technique does not extend easily to the case where F(w) is smooth (or locally smooth); at any rate, additional effort would be needed to make it work.
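For reference, the strong-convexity step alluded to above is the standard inequality (a textbook fact, stated in our notation, not a new claim of the paper): if F is \beta-strongly convex and w^* is its minimizer, then

```latex
F(w_t) - F(w^*) \;\ge\; \langle \nabla F(w^*),\, w_t - w^* \rangle
  + \frac{\beta}{2}\|w_t - w^*\|_2^2
\;\ge\; \frac{\beta}{2}\|w_t - w^*\|_2^2 ,
```

so the subtracted term \frac{\beta}{4}\|w_t-w^*\|_2^2 is at most \frac{1}{2}\bigl(F(w_t)-F(w^*)\bigr). When F is merely smooth, no such lower bound on F(w_t)-F(w^*) in terms of \|w_t-w^*\|_2^2 is available, which is why the argument does not carry over directly.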
To reviewer 3
With due respect, it seems that the reviewer has misunderstood the setting and contributions of this paper. Let us restate the contributions.
Our contribution is two-fold.
The main contribution is that we require the strong convexity assumption on F(w) (the expectation of f(w,z)) rather than on f(w,z) itself. In particular, previous works [1][2][3] showed that these algorithms converge at a rate of O(ln T/T) when the loss function f(w,z) is strongly convex. We show that the algorithms converge at a rate of O(ln T/T) as long as F(w), i.e., the expectation of the loss function, is strongly convex. This relaxation is very important. Examples include Lasso and logistic regression, where the loss functions f(w,(x,y)) are (y-w^Tx)^2 and \ln(1+\exp(-y(x^Tw))), respectively. Obviously, f(w,(x,y)) is not strongly convex in these cases, so the previous works [1][2][3] mentioned in the paper only predict the rate O(1/\sqrt{T}), which is very conservative. Since F(w)=E[f(w,(x,y))] is strongly convex, our work shows that all three algorithms can achieve the rate O(\ln T/T). This point seems to have been overlooked by the reviewer.
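As a quick numerical illustration of this distinction (our own sketch, not part of the paper; the dimension, sample size, and standard-normal feature distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Per-sample Hessian of the squared loss f(w,(x,y)) = (y - w^T x)^2
# is 2 x x^T: a rank-one matrix, so its smallest eigenvalue is 0
# and f(., (x,y)) is not strongly convex in w.
x = rng.standard_normal(d)
H_sample = 2.0 * np.outer(x, x)
print(np.linalg.matrix_rank(H_sample))        # 1
print(np.linalg.eigvalsh(H_sample).min())     # close to machine zero

# The expected Hessian 2 E[x x^T] is positive definite whenever the
# second-moment matrix of x is non-degenerate, so F(w) = E[f(w,(x,y))]
# is strongly convex even though no individual f(., z) is.
X = rng.standard_normal((10000, d))           # non-degenerate feature distribution
H_expected = 2.0 * X.T @ X / X.shape[0]
print(np.linalg.eigvalsh(H_expected).min() > 0)  # True
```

The same rank-one structure appears for the logistic loss, whose per-sample Hessian is a non-negative scalar times x x^T.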
The minor contribution is that we assume local strong convexity instead of global strong convexity of F(w).
The reviewer points out that the function f(w,(x,y))=(y-w^Tx)^2 is quadratic and its Hessian is constant. However, it is not strongly convex, since the Hessian is a rank-one matrix whose smallest eigenvalue is 0. This is exactly the situation in which our analysis is useful: f(.) is not strongly convex, but F(.) is.
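The calculation behind this claim is elementary; spelled out in our notation:

```latex
\nabla^2_w f(w,(x,y)) \;=\; \nabla^2_w \,(y - w^\top x)^2 \;=\; 2\, x x^\top ,
```

a rank-one matrix (for x \neq 0) whose smallest eigenvalue is 0 in any dimension greater than one, so f(\cdot,(x,y)) is not strongly convex. In contrast, \nabla^2_w F(w) = 2\,E[x x^\top] is positive definite, and hence F is strongly convex, whenever the second-moment matrix of x is non-degenerate.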
The reviewer stated that, due to boundedness, the logistic loss is strongly convex. This is not true: again, f(.) is not strongly convex, but F(.) is.
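Explicitly, for the logistic loss f(w,(x,y)) = \ln(1+\exp(-y\,x^\top w)) with y \in \{\pm 1\} (a standard calculation, not taken from the paper):

```latex
\nabla^2_w f(w,(x,y)) \;=\; \sigma(u)\bigl(1-\sigma(u)\bigr)\, x x^\top,
\qquad u = y\, x^\top w, \quad \sigma(u) = \frac{1}{1+e^{-u}} .
```

Boundedness of the domain only keeps the scalar factor \sigma(u)(1-\sigma(u)) away from 0; it cannot change the rank-one structure of x x^\top, so f remains non-strongly-convex in any dimension greater than one, while \nabla^2_w F(w) = E[\sigma(u)(1-\sigma(u))\, x x^\top] can be positive definite.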
We are not sure what the \mu mentioned in the review refers to, since \mu does not appear in the paper.
Lastly, we thank the reviewer for pointing out a missing reference, "Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression" (Bach 2014). We want to point out that our work is more general and does not apply only to the logistic regression problem. Moreover, Bach (2014) requires a three-times-differentiability assumption on the objective. This may limit its applicability to machine learning problems with non-differentiable regularization terms, such as logistic regression with L1-norm regularization and Lasso. In contrast, our work considers regularized optimization, which covers more machine learning formulations.
[1] Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization.
[2] Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Tewari, Ambuj. Composite objective mirror descent.
[3] Suzuki, Taiji. Dual averaging and proximal gradient descent for online alternating direction multiplier method.
To reviewer 4:
We thank the reviewer for the comments, and in particular for pointing out typos and errors, which will be duly corrected.