Paper ID: 1096
Title: Expressiveness of Rectifier Networks

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper compares the expressivity of shallow threshold networks versus ReLU networks for real-valued inputs. The paper is fairly modest in its goals, since the expressivity of shallow networks is already well studied, and the case of threshold activations with real-valued inputs is essentially the easiest case to consider from a mathematical standpoint, followed closely by ReLUs, and last by sigmoids. Still, I did learn a couple of things reading this paper, so it probably has some value, and I would recommend acceptance provided that the authors make the revisions I'm asking for.

Clarity - Justification:

...

Significance - Justification:

The main problem with this paper is that threshold networks are not used in practice, and so it's not particularly interesting to study them. And they are not a really good analogy for sigmoids in this context, since, within a single layer, a sigmoid can simulate the linear region of the ReLU activation much more easily than a threshold can. Plus, this paper demands that the networks make no errors when simulating each other, which is not realistic in practice.

Another big issue with this paper is that it ignores much of the previous work on the topic of expressivity of neural networks. Results from the 90s establish various equivalences between activation functions, and there are more recent and highly relevant results which have also been omitted.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Line 63: This needs to be argued. Even linear networks can have vanishing/exploding gradients.

Line 91: What does "logarithmically fewer" parameters mean?

Line 123: You are leaving out a key result from 2013 ("On the Representational Efficiency of Restricted Boltzmann Machines", Theorem 17) which already proved limitations for general shallow neural networks (which include ReLU networks). This result builds on much older work considering similar questions from the 90s (e.g. "Threshold circuits of bounded depth", and the subsequent papers of W. Maass). The 2013 paper also showed that one-hidden-layer threshold nets can simulate one-hidden-layer ReLU networks over *binary* inputs, incurring only a quadratic penalty (Theorem 6). This result should be discussed, as it shows that your result depends very much on the input being R^n rather than discrete. Moreover, even for bounded inputs in [0,1]^n (say) it is possible to simulate ReLUs with thresholds relatively efficiently, using the obvious "stair-case" style construction like the one in "A comparison of the computational power of sigmoid and boolean threshold circuits" (from the 90s), while being wrong only on a subset of the input space of vanishingly small measure (I sketch what I mean below). The paper "The Power of Approximating: a Comparison of Activation Functions" from the 90s also shows that all of the reasonable activation functions are roughly equivalent in expressiveness, up to a constant-factor increase in depth. This is a serious challenge to the idea that ReLUs are fundamentally more powerful than, say, sigmoids.

Line 135: You are right that this 2014 result doesn't say anything about expressivity for classification (i.e. when the output is thresholded), since it concerns itself with real-valued outputs. The 2013 paper discussed above, however, *does* consider classification.
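To make the stair-case point concrete, here is a rough sketch of the kind of construction I have in mind (the constants $B$ and $K$ below are only illustrative). Suppose the pre-activation $t = w^\top x + b$ of a rectifier unit is bounded, say $t \in [-B, B]$, over the bounded input domain. Then

$$[t]_+ \;\approx\; \frac{B}{K} \sum_{k=1}^{K} \mathbf{1}\!\left[\, t \ge \frac{kB}{K} \,\right],$$

which replaces one rectifier unit by $K$ threshold units and is uniformly accurate to within $B/K$. Roughly speaking, with $K$ chosen large enough relative to the output weights, the resulting threshold network can only disagree with the rectifier network on inputs whose output pre-activation lies within a small margin of zero, i.e. on a thin slab around the decision boundary, which has small measure for bounded inputs.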
Line 228: So to be clear, this is a two-layer neural network with ReLU/rectifier hidden units and a threshold output unit? It seems a bit counterintuitive/ambiguous to call this a "two layer rectifier network", since people could easily use that term to describe the situation where the output unit's activation is also a ReLU.

Line 487: This proof isn't totally rigorous, although I think the basic idea is sound. For example, when you have this set S, you don't properly argue that the only possible hyperplane that can classify it correctly is the one given. You only show that one of the hyperplanes you constructed works, not that it is the only possible hyperplane that can work. Without this, it might be possible that there are some hyperplanes, not necessarily in the family you have constructed, that would work for multiple choices of S. I don't think there are, but you need to establish this rigorously.

Line 665: I had a very hard time following this section. The first two paragraphs, which seem to define a technically precise term, "hidden layer equivalence", are vague and non-rigorous. This section also seems poorly motivated; I don't know what the significance of Theorem 3 is supposed to be. You might want to consider cutting this section, or moving it to the appendix, in favor of an expanded related-work section.

Line 741: I fail to see the motivation for this section. Thresholds are not used in practice, and especially not exponentially large threshold networks. You don't need an experiment to argue why an exponentially large network is bad to use in practice; the reasons are self-evident: efficiency concerns and the inevitable overfitting that will occur when you have that many free parameters.

Line 770: Thresholds are going to be hard to optimize because they are discontinuous. A continuous approximation like tanh(cx) for large values of c is equivalent to just a regular tanh network with a very bad initialization. Initialized sensibly, a tanh network should do fine, if not on this particular task (which is tailor-made to be good for a shallow network of ReLUs), then on some other one where ReLUs might struggle. It is certainly not true that ReLUs are more powerful than tanh in general, as the theoretical analyses from the 90s clearly establish.

Line 815: How do you know that "backprop can successfully find it"?

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper compares the expressiveness of ReLU units to that of threshold units.

Clarity - Justification:

The paper is well written.

Significance - Justification:

The paper compares ReLUs with threshold units. It is not obviously useful to compare the two activations, as thresholds are virtually never used in practice. This result might pave the way to other results, though (certain properties might be easier to prove for threshold units); however, this is not immediately evident.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Interesting paper. My first comment is that the empirical evaluations are hard to trust because of the learning issues. Gradients behave much worse when using the tanh activation, especially the compressed tanh. In fact, the conventional wisdom for picking ReLUs over tanh is not that ReLUs are more expressive than tanh but rather that the gradients are better behaved. From this perspective I do not find Figure 4 useful.
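To spell out the gradient issue with the compressed tanh (just for concreteness; $c$ denotes the compression factor):

$$\tanh(cx)\to\operatorname{sgn}(x)\ \text{ as } c\to\infty, \qquad \frac{d}{dx}\tanh(cx) = c\,\bigl(1-\tanh^2(cx)\bigr),$$

so for large $c$ the derivative is essentially zero outside a window of width $O(1/c)$ around a zero pre-activation and very large inside it. Equivalently, since $\tanh(c\,(w^\top x + b)) = \tanh((cw)^\top x + cb)$, a compressed tanh unit is just an ordinary tanh unit whose incoming weights and bias have been scaled by $c$, i.e. a badly initialized tanh network, which is why I would not read much into its training behaviour.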
The other global comment I would have is that it is not clear to me that the comparison between threshold units and ReLUs is useful. My understanding of the paper is that it shows convincingly that for certain classes of problems ReLUs can be exponentially more efficient at expressing the solution than a threshold model. However, threshold models are not really used in practice. It is also not clear that this result can be used to better (or completely) characterize the classes of problems on which ReLUs are good, or to discuss the importance of depth, at least not trivially. However, the constructions in the paper are elegant, and their description is pretty well written.

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper studies the decision boundary of a neural network with one hidden layer of rectified linear units and a binary output unit. This kind of network is compared to a shallow threshold network. The paper shows that in some cases threshold networks require exponentially more units to represent the decision boundaries that can be represented with rectified linear networks.

Clarity - Justification:

Overall the paper is well written. It would be very useful if the paper included (more) intuitions behind Theorem 2. Given the theoretical nature of the paper, I think that the proof of this theorem, or a simplified version thereof, would fit very well in the main part of the paper.

Significance - Justification:

The representational power of rectifier networks is a very contemporary topic. Comparing rectifier networks with threshold networks seems to be a natural way of approaching this problem. In my opinion, the most interesting part of this paper is Theorem 2, which describes how some Boolean functions that can be represented by a ReLU network can also be represented by a threshold network, but only if the number of hidden units is exponentially larger. I did not read the proof, however. One point that is missing is a discussion of whether most functions represented by a rectifier network require many threshold units, or just a few of them. Furthermore, there are several possible extensions mentioned in the paper, such as an extension to deep networks and a formulation of Lemma 2 for a family of examples; elaborating on these could make this a much stronger contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

MINOR COMMENTS:

* Line 394: ``In the previous section we saw ... that correspond to much larger threshold networks''. This is correct, but somewhat confusing, since there are exponentially many conditions, e.g. in Figure 2, but they are all implied by the condition with $\mathcal{S}_2=\mathcal{N}$.

* Lemma 2: Usually we are interested in the asymptotics. Here it would be important to provide a class of examples, with, say, one example for each possible number of input units and of hidden units. Otherwise one could also simply consider the trivial example of a network with one hidden unit.

* Line 362: \in -> \subseteq

=====