Paper ID: 1318
Title: Shifting Regret, Mirror Descent, and Matrices
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a framework based on mirror descent for thinking about shifting regret. It applies the results to a shifting problem with matrices.
Clarity - Justification: The text is easy to follow and well-structured. However, I am missing motivation for the problem studied.
Significance - Justification: The main result (Theorem 7) is not sufficient to warrant publication. See below.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I do not think the contribution of this paper is sufficient for publication.

The Twisted Mirror Descent framework is fully general: any arbitrary algorithm may be implemented entirely in the 'twist' $\psi_t$, ignoring the mirror descent iterates $v_t$ altogether. Yet this generality is not used, as the only examples given of 'twists' are (minor variants of) Fixed Share. This is not in itself a big deal; it could be fixed by a rewrite.

My main problem is that the matrix problem studied in Section 4 (the main result) is actually a vector problem, and the proposed algorithm for it is (a minor variation of) Fixed Share. The matrix problem proposed in this paper is not, in substance, a matrix problem. The loss matrices are assumed to be diagonal. The matrices in the learner's action space are not necessarily diagonal, but we can always simply zero their off-diagonal entries. This preserves positive definiteness, the trace, and the diagonal entries (and hence their magnitudes), and it also preserves the loss. So this is a vector problem, where we are learning an element of the positive orthant with sum bounded by $\tau$ and entries bounded by $\beta$. I do not think the paper gives sufficient motivation for why this problem is interesting.

Also, the proposed algorithm (see equation 4) is matrix Fixed Share (yes, with the minor generalisation that the sum is at most $\tau$ and that the entries are bounded). It is not surprising that the analysis for Fixed Share goes through for the von Neumann relative entropy. I do not consider this result by itself sufficient material for publication. In fact, in Online Variance Minimization by Warmuth and Kuzmin (Machine Learning, 2011), the authors mention in their conclusion, at the top of page 23, how straightforward this is.

In Theorem 7 you ask $c_\psi \ge \tau$. I do not think you can deal with $c_\psi > \tau$.
For $c_\psi > \tau$ would allow the mapping (4) to bring us outside the learner action space $K_\tau$, so I suggest setting $c_\psi = \tau$.

I do not understand why Theorem 7 is so dissimilar from the vector bound in line 508; these should be extremely close. What is going on? In the quoted vector bound the constraint is that the sum is one, whereas here the sum is at most $\tau$. This should not be a big difference, as we can always introduce another dimension to absorb the slack. So why are the norms of $U$ appearing? This seems very expensive.

Minor comments: Line 182: the subscript $_T$ on the comparator loss should be $_{q:s}$. Line 273: what is $w'$? Line 548: omit the subscript on $F_t$.
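To make the diagonalisation argument above concrete, here is a small numerical sanity check (an illustrative sketch of the reviewer's claim, not code from the paper): for a diagonal loss matrix $L$, the loss $\mathrm{tr}(LW)$ depends only on the diagonal of $W$, and replacing $W$ by its diagonal part preserves positive semi-definiteness, the trace, and the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# A PSD comparator matrix W and a diagonal loss matrix L, as in Section 4.
A = rng.standard_normal((4, 4))
W = A @ A.T                       # PSD by construction
L = np.diag(rng.random(4))        # diagonal loss matrix

# Zero the off-diagonal entries of W.
W_diag = np.diag(np.diag(W))

# Still PSD: a diagonal matrix with the (nonnegative) diagonal of a PSD matrix.
assert np.all(np.linalg.eigvalsh(W_diag) >= 0)
# Trace and diagonal entries are preserved.
assert np.isclose(np.trace(W_diag), np.trace(W))
assert np.allclose(np.diag(W_diag), np.diag(W))
# The loss tr(L W) is preserved, since L is diagonal.
assert np.isclose(np.trace(L @ W_diag), np.trace(L @ W))
```

This is exactly why the problem reduces to learning a vector in the positive orthant: only the diagonal of the learner's matrix ever enters the loss.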
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces a generalisation of mirror descent that allows casting several types of online algorithms into a unified framework. In particular, with a suitable instantiation, the bounds for the unified algorithm yield bounds for the fixed-share update for shifting targets. This leads to generalising some recent matrix prediction results to the shifting-target case.
Clarity - Justification: This is a nice self-contained paper with all the proofs included. Both the general motivation and the technical parts are presented very clearly.
Significance - Justification: This is a nice general framework that allows recovering exactly some earlier results as special cases, and also gives some totally new results in an interesting and non-trivial problem.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Some typos:
- line 329, left column: what is $w'$?
- line 399, right: what are $w$ and $u$?
- line 475, left: "=" should be $\leq$.
- line 548, right: $F_t$ should be $F$.
- line 621, right: $v$ should be $V$.
- line 719, right: I assume "=" should be $\leq$; you are also using here that $\tau \leq c_\psi$.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces the “Twisted Mirror Descent” (TMD) algorithm for performing online learning in changing environments (e.g. tracking and shifting). Twisted mirror descent is essentially just mirror descent with a second update, $w_t = \phi_t(v_t, l_1, \dots, l_{t-1})$, where $v_t$ comes from the normal mirror descent update. The paper shows how to choose $\phi_t$ in order to get meaningful bounds. The paper then shows that a couple of existing shifting problems can be solved with the TMD algorithm, and finally uses TMD to give a novel result for the problem of shifting positive semi-definite matrices.
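As a minimal sketch of this recipe (my own notation and instantiation, not the paper's code): with negative entropy as the mirror map on the simplex, the mirror descent step gives $v_{t+1}$, and a fixed-share twist $\phi$, here depending only on $v_{t+1}$, mixes it with the uniform distribution to give $w_{t+1}$.

```python
import numpy as np

def tmd_fixed_share(losses, eta=0.1, alpha=0.05):
    """Twisted mirror descent sketch on the probability simplex.

    losses: (T, n) array, row t holding the linear loss vector l_t.
    Returns the (T, n) array of weight vectors w_1, ..., w_T played.
    """
    T, n = losses.shape
    w = np.full(n, 1.0 / n)          # start at the uniform distribution
    played = []
    for t in range(T):
        played.append(w.copy())
        # Mirror descent step (exponentiated gradient) producing v_{t+1}.
        v = w * np.exp(-eta * losses[t])
        v /= v.sum()
        # Fixed-share 'twist': phi(v) = (1 - alpha) v + alpha * uniform.
        w = (1 - alpha) * v + alpha / n
    return np.array(played)

weights = tmd_fixed_share(np.random.default_rng(1).random((50, 3)))
assert np.allclose(weights.sum(axis=1), 1.0)   # iterates stay on the simplex
```

The mixing step is what lets the algorithm track shifting comparators: it keeps every coordinate bounded away from zero, so the algorithm can recover quickly when the best expert changes.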
Clarity - Justification: Some letters, like $D$, are overloaded. $L_T(u^s_q)$ is undefined, although it could only mean one thing. There is a typo in the definition of $\hat{L}_{q:s}$: it should be $\sum_{t=q}^{s} l_t(w_t)$. There is a paragraph between Theorem 7 and its proof which is unclear. I am confused by this paper’s definition of “strongly adaptive regret”: the true definition of strongly adaptive regret is the regret suffered on an interval with respect to a single predictor, not a changing sequence of predictors. Since this paper considers regret with respect to a changing sequence of predictors, I do not see how adding the notion of “strongly adaptive regret” adds anything. Also, in the preliminaries section it says “comparing the loss of the forecaster to the loss of an arbitrary predictor sequence”: this can easily be misunderstood to mean “the minimum loss out of all predictor sequences”, which, as you note above, is not possible.
Significance - Justification: It is difficult to see the significance of TMD in its most general form, since in all the examples the function $\phi$ is a function only of $v_t$ and not of the losses. The fact that TMD unifies some existing algorithms, together with the novel application to positive semi-definite matrices, is pretty significant. The main significance of this paper is that it gives a general theorem that spits out a bound given bounds on various quantities involving $R$ and $\phi$.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The claims are well-supported by theoretical analysis. I think that this is a decent paper although I have some issues with it that are given above.
=====