Paper ID: 330
Title: Loss factorization, weakly supervised learning and label noise robustness

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a novel algorithmic point of view on various forms of weakly supervised learning and noise-robust classification. The main contribution of the paper is a factorization of many common loss functions in a way that admits a straightforward, principled way to deal with weak or noisy labels. Interestingly, the proposed method can in many cases be implemented by minor adjustments of SGD-like algorithms.

Clarity - Justification:
The motivation and the basic ideas of the paper are very clear and quite easy to follow. However, some of the mathematical parts are a bit more sketchy in my opinion. For instance, in the muSGD algorithm there seem to be two updates for \theta^{t+1}; the second one looks like a typo, and I did not manage to figure out the authors' intent.

Significance - Justification:
The problem the paper addresses is of very high interest in my opinion, from both the theoretical and the practical perspective. The technique presented seems applicable on the one hand and novel on the other.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In the muSGD algorithm there seem to be two updates for \theta^{t+1}; the second one looks like a typo, and I did not manage to figure out what the intent was.
Line 84: "not does".
Deep learning is only mentioned in the last two words of the paper. A more detailed discussion of the applicability (or lack thereof) of the paper's results to deep learning would be very nice to have.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper looks at the factorization of losses into different parts, one of which can be written as a linear-odd loss (LOL) that can often be gauged via a sufficient statistic of the dataset, thus turning optimization over the dataset into optimization over the sufficient statistic. The authors use the Neyman-Fisher factorization over an exponential family to obtain the LOL defined on the sufficient statistic of the dataset. They show that these LOLs admit an injection into a proper subclass of even functions. They also show Rademacher complexity results for minimizing the loss of an LOL function via its surrogate minimum over a dataset. They generalize all of these results to weakly supervised learning with noisy labels, for both symmetric and asymmetric label noise. The authors also define a notion of \epsilon-noise robustness and show that LOL losses are approximately noise robust. They perform experiments showing the performance of the \mu-SGD algorithm for minimizing LOL losses, comparing the difference from the true minimum. They also discuss the generality of the factorization (including how losses can be broken up as a sum of even and odd functions), how regularized risk minimization over an RKHS can be written as a sum of even and odd losses, the notion in the context of learning reductions, and how the optimization can be extended beyond \mu-SGD to proximal-step-based methods.
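For concreteness, the factorization described in the summary can be sketched as follows (my own notation and an illustrative special case, not the paper's exact statement): write a margin loss as l(v) = l_e(v) + l_o(v), with even part l_e(v) = (l(v) + l(-v))/2 and odd part l_o(v) = (l(v) - l(-v))/2. If the loss is linear-odd, i.e. l_o(v) = a v for a constant a, then for a linear model and labels y_i in {-1, +1}

  (1/n) \sum_i l(y_i \langle\theta, x_i\rangle) = (1/n) \sum_i l_e(\langle\theta, x_i\rangle) + a \langle\theta, \mu\rangle,  where \mu = (1/n) \sum_i y_i x_i,

so the first term is label-free and all label information is carried by the mean operator \mu (for the logistic loss, a = -1/2).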
Clarity - Justification:
The paper is clearly written and most of the mathematical proofs are detailed in the appendix. It is easy to follow the flow of the results. However, it would be good to get an idea of the strength of the results in the context of noisy labels and of how they generalize to results involving robust optimization. The experimental comparison between SGD and \mu-SGD is a good start toward comparing how these methods perform in the setting of noisy labels and robust optimization. It would be good to have more background on the robust optimization results and on how Theorem 10 applies in a real-life experimental setting.

Significance - Justification:
This paper is a strong contribution to the area of loss factorization, particularly in the setting of noisy labels and weakly supervised learning. The authors provide strong theoretical results that reduce optimization to the sufficient statistic alone for LOL losses, which encompass most commonly used losses. They show Rademacher-complexity-based theoretical results, \epsilon-robustness, and the decomposition of losses into even and odd components, besides reasonable experimental results comparing their methods to plain SGD.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The math is strong and can be followed. The importance of the results, particularly the significance of Corollary 5 (Theorem 7) and Section 5.1, should be explored in more detail in an extended version of the paper in the near future if possible.
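As background on the noise-robustness results discussed above (a back-of-the-envelope remark in my own notation, not a restatement of the paper's theorems): once the risk factors as above, the labels enter only through the mean operator \mu, and under symmetric label noise that flips each label with probability \sigma < 1/2 one has

  E[\tilde{y} x] = (1 - 2\sigma) E[y x],

so the noisy mean operator is a rescaling of the clean one and can be corrected by dividing by 1 - 2\sigma; this is the kind of mechanism that makes approximate noise robustness of LOL losses plausible.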
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides a simple constructive proof of the following result: any algorithm for minimizing a loss on fully labeled data can be turned into an equivalent algorithm for minimizing a related loss on weakly supervised data, with provable generalization and robustness (under mild decomposability assumptions on the loss function).

Clarity - Justification:
Very well written. However, it is still not light reading :)

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Thank you for submitting a very interesting paper. I enjoyed the intuition, sound theory, clean mathematical presentation, and good experimental results. Overall, an excellent paper.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper reveals an interesting and significant rule for all weakly-supervised learning problems (Theorem 3 in the paper): when the loss function l(x) is such that l_o(x) = (l(x) - l(-x))/2 is linear and the hypothesis class contains only linear-in-parameter models, the empirical risk to be minimized decomposes into two terms; the first term is an empirical risk that depends only on the raw data and is independent of the labels, and the second term is the loss l_o evaluated at the hypothesis applied to an empirical mean operator. Thus, two tiny changes make a fully-supervised optimization algorithm such as SGD a weakly-supervised algorithm (Algorithm 1 in the paper).
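To make the "two tiny changes" concrete, here is a minimal sketch of SGD on the factored objective (my own code, assuming the logistic loss, a linear model and labels in {-1, +1}; it illustrates the decomposition above rather than reproducing the authors' Algorithm 1):

import numpy as np

def mu_sgd(X, mu, lr=0.1, epochs=10, a=-0.5, seed=0):
    """SGD on the factored risk (1/n) sum_i l_e(<theta, x_i>) + a * <theta, mu>.

    X  : (n, d) array of feature vectors (no labels needed here).
    mu : (d,) estimate of the mean operator (1/n) sum_i y_i x_i,
         the only place where label information enters.
    a  : slope of the linear-odd part; a = -1/2 for the logistic loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            v = X[i] @ theta
            # Even part of the logistic loss: l_e(v) = (log(1+e^{-v}) + log(1+e^{v})) / 2,
            # whose derivative is l_e'(v) = sigmoid(v) - 1/2.
            grad_even = (1.0 / (1.0 + np.exp(-v)) - 0.5) * X[i]
            # The linear-odd term a * <theta, mu> contributes the constant gradient a * mu.
            theta -= lr * (grad_even + a * mu)
    return theta

# Fully supervised case: mu is the empirical mean of y_i * x_i; in weakly supervised
# settings one would plug in an estimate of mu instead.
# X, y = ...  (features, labels in {-1, +1})
# theta = mu_sgd(X, (y[:, None] * X).mean(axis=0))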
Clarity - Justification:
This paper is generally well-written and the English is very good at the local level. There are minor issues at the global level. The organization is basically linear, following the order in which the theoretical results are derived. However, for such a theoretical paper, the order of presentation may, or even should, differ from the order in which the ideas were developed. The author guide of this COLT recommends an author guide from a previous FOCS, and I think it may be helpful for improving this paper: http://windowsontheory.org/2014/02/09/advice-for-focs-authors/. The brief summary of contributions in the introduction is nice but not enough. Before Theorem 3 it is stated that the next theorem is the main contribution, but there are many more contributions besides this single theorem; there are very interesting results in Section 5 and even Section 6. To be honest, if I were not asked to review this paper, I might stop after finishing Section 3. I understand that the paper contains so many results that it could be split into two papers. At present the proofs are not in the main body, yet all sections except the introduction are still very technical. As pointed out by that FOCS author guide, the writing of theoretical papers should take into account a wide range of readers, and the important messages of the theoretical results should be accessible to engineers who are not experts in this area and to experts who do not have enough time to go through the paper carefully. Moreover, some sections have long, detailed discussions before the first subsection, such as Sections 2 and 5. Section 1 has a similar organization, with a very long Section 1.1. I personally do not think this is a good structure, especially for the introduction.

Significance - Justification:
This paper makes a strong connection between fully-supervised learning and weakly-supervised learning. WSL includes the cases where labels are noisy, labels are missing in semi-supervised learning, negative labels are hidden in PU learning, and labels are aggregated in multi-instance learning. According to the theoretical results in this paper, if we are happy to use linear-odd losses and linear-in-parameter models, it is extremely easy to modify a fully-supervised algorithm into its weakly-supervised counterpart. Based on this contribution, I think it is a significant paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper is technically very strong, and all claims are supported by theorems plus a few toy experiments. Many future directions are discussed in Section 6. The weakness has been mentioned in the paper, but it is not made explicit to me. The constraint on the loss is not a problem, but in my mind the constraint to linear-in-parameter models is a bit strong. This includes many models, kernelized or not, in areas such as statistical learning theory and kernel methods, while tree models and neural network models are ruled out by that constraint.
My main comment is on clarity and is given above. Besides that, the only issue I have found concerns the references and citations. What is the ordering of the references? The list is not numbered and is not sorted by family names, years, or titles. It does not seem to be the default LaTeX package, and it is really difficult to locate a paper in this order. Moreover, the command \citet should sometimes be used rather than \citep, when the citation itself is part of the sentence.

=====