Paper ID: 371
Title: Minimizing the Maximal Loss: How and Why

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper concerns minimizing the maximal loss over the training data using an online algorithm. The analysis focuses on the noise-free case (where it is possible to achieve loss close to 0 for some parameter vector). It is explained why the new algorithm can be beneficial compared to a standard perceptron/SGD algorithm in such cases. The authors also provide two ways to use their algorithm in the presence of outliers: by subsampling the data, and by introducing slack variables.

Clarity - Justification:
The clarity of the paper is mixed. Most of the time, the paper is very clear and easy to follow. However, the main contribution, the FOL algorithm, is not explained sufficiently well (Section "How"). For instance, the strategy for the "p" player is never given formally (except in the proof of Theorem 1 in the supplementary materials); the reader is only referred to Auer et al., 2002. Moreover, Theorem 1 contains statements not used or explained before, such as "realizable with respect to some surrogate loss" or "we run the online boosting algorithm". The notation w_{t_j} in Theorem 1 is only explained in the pseudocode of the algorithm in Section 2.3 (the t_j are randomly sampled past rounds, so the w_{t_j} are weight vectors from the past, but this is not mentioned when the algorithm is introduced). The sampling in O(log(m)) time turns out to be quite a simple concept, but it is hard to extract from the pseudocode and should be explained in more detail. I think the "Tree" listing on Page 4 contains a mistake: the binary tree should have height h = ceil(log_2(m)), not 2 raised to that power (as long as the tree height is the length of the path from the root to the leaves).

Significance - Justification:
The paper presents a clever idea to minimize the maximal loss by means of a game between two players: one player uses any online learning algorithm with provably small regret, and the other player uses a modification of a bandit algorithm. The new idea turns out to be superior to standard algorithms in the noise-free (consistent) case and is shown to work well in practice on image recognition case studies. I found the theoretical analysis very interesting. I think the theoretical contribution to machine learning is significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper presents a novel and interesting algorithm which provably achieves a better bound than the Perceptron algorithm. The theoretical analysis is insightful and unusual (at least for me), and it is a bit unfortunate that the proof of the main result was moved to the supplementary materials. The presentation of the algorithm should be improved in the final version of the paper (see my comments in the "Clarity" section).

I am not sure the argument in Section 3.2 (typical vs. rare distributions) is convincing. It is not clear how large c is compared to epsilon. Also, if L_{avg}(h) = O(epsilon), it is likely that such an h would be a member of H_{1,epsilon}, given that examples from D_1 are much more likely than those from D_2, right? So Theorem 1 could also apply to any h with L_avg(h) = O(epsilon). It is possible that I did not get it right.

I tried to follow the proof of Theorem 1 in the supplementary materials, and there are places where, I think, it is assumed that the loss is at most 1 (line 1155 on Page 11, line 1167, line 1182), but I don't see such an assumption in Theorem 1. Can you reflect on this?
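To make the O(log(m)) sampling point from my Clarity comments concrete, here is a minimal sketch of how I understand the sum-tree idea (my own reconstruction in Python, with names of my choosing, not the paper's Tree listing):

    import random

    class SumTree:
        # Complete binary tree over the m example weights; every internal node
        # stores the sum of its two children, so the root stores the total weight.

        def __init__(self, weights):
            self.m = len(weights)
            self.size = 1                      # number of leaves, rounded up to a power of 2
            while self.size < self.m:
                self.size *= 2
            self.node = [0.0] * (2 * self.size)  # 1-indexed heap layout; leaves at size..2*size-1
            for i, w in enumerate(weights):
                self.node[self.size + i] = w
            for j in range(self.size - 1, 0, -1):  # fill internal nodes bottom-up
                self.node[j] = self.node[2 * j] + self.node[2 * j + 1]

        def update(self, i, w):
            # Change the weight of example i and refresh the path to the root.
            j = self.size + i
            self.node[j] = w
            j //= 2
            while j >= 1:
                self.node[j] = self.node[2 * j] + self.node[2 * j + 1]
                j //= 2

        def sample(self):
            # Draw an example index with probability proportional to its weight.
            u = random.random() * self.node[1]
            j = 1
            while j < self.size:               # descend from the root to a leaf
                if u <= self.node[2 * j]:
                    j = 2 * j
                else:
                    u -= self.node[2 * j]
                    j = 2 * j + 1
            return j - self.size

With such a tree, changing one example's weight and drawing an index each touch a single root-to-leaf path, so both cost O(log(m)); this is also consistent with the height fix I suggest above.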
The statement "Training examples which are realizable with respect to some surrogate loss" in Theorem 1 should be explained. I guess it means that the total surrogate loss on all examples is 0 for some parameter vector. The bound claimed in Example 1, which is an important motivation for using FOL in the noise-free case, should be derived, as it is not straightforward how it follows from Theorem 1.

The authors claim in the intro that many boosting algorithms minimize the maximal loss. I am not sure I agree with this statement. For instance, AdaBoost is known to minimize the sum of exponential losses over the sample, and LogitBoost replaces the loss with the logistic loss. It is true that boosting algorithms (as well as SVM) can be formulated to maximize the minimal margin on the data, but this is only valid when we assume the data is separable and one can achieve zero error. In this case, margin maximization serves more as a regularizer than as a loss. In almost all practical algorithms, some trade-off between the margin (regularizer) and some (average) loss on the sample is employed.

I think your approach could be quite effective for general imbalanced data sets, where the negative class dominates the positive class, similar to your case study. I think this is worth mentioning in the paper.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper develops a boosting-like procedure to adapt online learning algorithms to optimize the maximal loss on the training set. The motivations and preliminaries are developed nicely, and the procedure and accompanying theoretical guarantees are easy to understand. The experiments show scenarios where this procedure can help.

Clarity - Justification:
The paper is well written. There are a few scattered typos (e.g. Line 95: gradient). There is sufficient detail to recreate the ImageNet fraction of the dataset, but it would be more useful to provide the 28k-246k "face" dataset that they used to define distribution D_1, as well as code implementing their procedure (the text does not specify whether they used slack variables/sub-sampling to handle outliers and unrealizable w, etc., so a reference implementation would be useful).

Significance - Justification:
The algorithm is heavily influenced by EXP3.P1 and uses a clever trick to update sample distributions in log-time. Robustness to outliers is also achieved using known techniques. I found the discussion in Section 3.2 "Why: Typical vs Rare distributions" most interesting (and possibly significant). Typical domain adaptation or covariate shift approaches would define a sample distribution p (one that emphasizes the rare examples) and proceed to pick the L_avg minimizer under this distribution. FOL seems to be able to "filter" the hypothesis class using the typical samples from D_1, and then pick the best using the rare examples from D_2.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Disclaimer: I have only taken a cursory look through the appendices to verify the proofs at a high level.

The "face" prediction task appears to have been a realizable setting. Can we see the procedure in action in unrealizable scenarios? It would be interesting to see whether slacks/sub-sampling indeed ensure the procedure remains robust in practical scenarios.

It might help to clarify that in Lemma 3, the FOL algorithm is instantiated with the Perceptron as the OLA. At first glance, it seemed to be an outlandish claim (e.g.
what if the halfspace player were a particularly slow OLA, could one still guarantee additive O(1/alpha) ...).

It would be interesting to see how existing sample weighting schemes from domain adaptation/covariate shift perform in Expt 5.2 (does the theoretical drawback in terms of the sample complexity upper bound manifest in practice?). Similarly, the approach of Clarkson et al. would also be a useful baseline to compare against experimentally. For Figure 3, how many epochs were run for SGD and FOL? Could the authors contrast the proposed algorithm with Online Boosting (ICML '15) in the Related Work section -- one assumes a particular weak learnability assumption, the other assumes a low-regret w player?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the seemingly extreme objective of minimizing the maximum loss. The authors offer decent arguments for situations where one may want to do this, while also developing and analyzing a very interesting algorithm that they describe as a form of online boosting. Overall I found this paper a fascinating read. I checked the proof of Theorem 1 in detail and can vouch for its correctness (modulo some quick changes to the algorithm described below). Examples 1 and 2 are particularly interesting, as they demonstrate improved runtime complexity compared to SGD. Overall, I still worry that the setup of the problem is somewhat contrived, but the authors' motivation and experiments, combined with the freshness of the paper itself, make me positive on this work. The main concern I have is robustness, but Theorem 3 is at least a start on how to handle problems with outliers and rare (but important) examples.

Clarity - Justification:
Overall the paper is very clear, but the issues with the "Tree" algorithm seriously slowed down my reading of the paper. See the Detailed Comments for suggested fixes.

Significance - Justification:
The idea of minimizing the maximum loss is extreme and hence garners interest if one can make a passable argument for why one should do this. I have seen this argument made in the context of multi-task learning, where one might minimize the maximum risk over the tasks, but actually minimizing the maximum loss over all the points in a single task is quite refreshing. I think the authors' arguments succeed in motivating the paper by appealing to problems with rare but important examples. This paper should lead to future work that pushes robustness to the forefront, making the ideas herein more practically useful rather than primarily of theoretical interest. I have not seen the speedup technique used in the Tree algorithm before; if it is indeed new, it is a nice trick that could likely be used in many other places.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
\documentclass{article}
\usepackage{amsmath,amssymb,amsfonts}
\begin{document}

Overall I found the exposition very clear, the core idea of the paper very novel, and the motivation for minimizing the maximum loss sufficiently strong. The only major comment I have is with regard to the statement on line 610: ``we would like to think of $c$ as a constant''. Why should we believe that $c$ is larger than $\epsilon$? Can you motivate this assumption based on the `faces' example? I think you have put some pieces in place for that statement but stopped short of properly explaining when the assumption may hold true.
If we cannot reasonably believe the assumption to hold in practical situations, then Theorem 2 does not really seem useful. Beyond that, I mainly have comments on corrections the authors should make, the most important being those related to the Tree algorithm.

\subsection*{Minor comments}
\begin{itemize}
\item Lines 193--194: ``Since the loss is in $[0,1]$'' -- this seems to be the first time this assumption is made; it should be made explicit and stated earlier.
\item Statement of Theorem 1: change ``online boosting'' to ``FOL''.
\item Line 264: $\mathcal{T}$ was never defined! Also, Theorem 1 gives the average loss over $\{w_{t_1}, \ldots, w_{t_k}\}$, not over $\{w_t\}_{t \in [T]}$.
\item In the ``Tree'' algorithm box: disambiguate by specifying that the value of each internal node is set to the sum of its children, starting from bottom to top (leaves to root). You must have a mistake in the height of the tree: the height should instead be $\lceil \log_2(m) \rceil + 1$. The way you have things, the height of the tree is linear in $m$ and the last step of ``update'' costs $O(m)$. I think everything works with the proposed fix.
\item Lines 467--468: change ``the restriction of'' to ``an upper bound on''.
\item Appendix: in Section A.2.1, $\epsilon/6$ should be replaced by $\epsilon/8$ (in two places). In the math display at lines 1189--1190, change $u$ to $e_i$.
\end{itemize}
\end{document}
=====