We thank the reviewers for their insightful comments! We will incorporate the suggestions into the final version.

Specific responses:

* Reviewer 1

1. Relative sizes of function spaces: Theorem 2 shows that the functions represented by n ReLUs form a subset of the functions representable by networks with O(2^n) threshold units. This result is tight: we exhibit an n-ReLU network that requires at least 2^n - 1 threshold units (a toy illustration of this counting appears after our responses to Reviewer 1 below). However, the question of what fraction of all ReLU networks can be expressed with fewer than an exponential number of threshold units remains open. In some cases, e.g., with Boolean inputs, a quadratic number of threshold units suffices, as Reviewer 3 points out. We believe that the function space of n-ReLU networks occupies only a small portion of the space represented by O(2^n) threshold units, but quantifying this relationship exactly is unsolved.

2. Lemma 2: This is an existence statement; the example suggested by the reviewer does not contradict it. Moreover, since the original submission, we have generalized the proof to show that for any n there exists an n-unit threshold network that cannot be expressed by any (n-1)-unit ReLU network. We will add this to the final version.
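A toy illustration of the counting above (our own example with n = 2 positive units, unit weights, and output threshold 1; the construction in the paper is more general): the 2-ReLU classifier

  max(0, x_1) + max(0, x_2) >= 1

holds if and only if

  (x_1 >= 1) or (x_2 >= 1) or (x_1 + x_2 >= 1),

i.e., 2^2 - 1 = 3 threshold units (combined by an OR, which is itself a threshold unit) suffice. Correspondingly, the decision boundary is a polytope with three faces: x_1 = 1 (where x_2 <= 0), x_2 = 1 (where x_1 <= 0), and x_1 + x_2 = 1 (where x_1, x_2 >= 0), matching the 2^n - 1 count.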
* Reviewer 3

1. Comparing to thresholds: Yes, threshold activations are not used in practice. However, not only are they well understood theoretically (e.g., in terms of decision boundaries), they also emerge naturally from our analysis (Theorem 1). Studying the connection to thresholds helps us understand the expressiveness of ReLU networks. In essence, we use threshold units only as a vehicle to represent the decision boundaries of ReLU networks. For example, we show that the decision boundary of a rectifier network with n positive ReLUs is a polytope with an exponential (in n) number of faces; it just happens that this polytope is naturally characterized by threshold units.

2. Line 63: The reviewer is right: piecewise-linear activations can also suffer from vanishing/exploding gradients. Empirically, however, ReLUs appear to be more robust (e.g., Glorot et al. 2011).

3. Line 91: This simply makes the observation that since the network is smaller, the number of parameters is smaller as well.

4. References: Thanks for the valuable references, which strengthen this work. We will include detailed comparisons in the paper (by moving some content from Sections 4 and 5 to the supplementary material). Briefly:
  * Martens:2013: The main difference from our work is that Martens et al. consider only Boolean inputs and show that, under this restriction, a quadratic number of threshold units suffices. We show a tight upper bound in the general case where the inputs are real valued, so our result does not contradict theirs.
  * Dasgupta:1992 discusses the equivalence of activations when approximation is allowed and primarily focuses on real outputs. Our paper discusses the *exact* case for binary outputs.
  * Simulating ReLUs with thresholds: Good point. In fact, this is the same as Lemma 1; we will add the appropriate references.
  * Maass:1991 focuses on sigmoid versus threshold networks.

5. Line 228: Line 160 in the paper clarifies this.

6. Line 487: The decision boundary of the given rectifier network is a polytope with exactly 2^n - 1 faces. No matter how the threshold network is constructed, it needs a hidden threshold unit to represent each face of the polytope. Therefore, at least 2^n - 1 hidden units are needed, independent of the construction. We will expand the proof of Theorem 2.

7. Line 665: This section considers a different way of asking whether two networks are equivalent -- using the hidden-layer state as a multiclass classifier. This is a stronger requirement than asking only that the final outputs agree; the section gives a necessary condition for this requirement to hold.

8. Experiments: The reviewers (3 & 4) are right about the nature of the experiments. However, the purpose of this section is to demonstrate that expressiveness is not the only consideration: sample complexity and learning complexity matter as well. In Figure 4 we show that (middle column) if the threshold network cannot express the decision boundary, then accuracy suffers, and (right column) even when expressiveness is sufficient, learnability remains an issue.

* Reviewer 4

1. Comparison to thresholds: See response 1 to Reviewer 3.

2. Experiments: See response 8 to Reviewer 3.

3. The reviewer is right that we do not characterize which application problems can be solved by ReLUs. While we agree that this is an important question, it is not clear how to address it in the general case. Here, we consider the restricted setting of shallow networks with real-valued inputs and Boolean outputs. In this setting, we essentially show that ReLU networks selectively explore the much larger function space of threshold networks (a small numerical sketch of this correspondence follows below).
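To make the ReLU-to-threshold correspondence concrete, here is a minimal numerical sketch in numpy. It is our own illustration, not the construction from the paper: the variable names, random weights, and brute-force subset enumeration are assumptions made for the sake of the example. Assuming the positive output weights are absorbed into the hidden units and the output threshold t is positive, the n-ReLU classifier agrees with an OR over 2^n - 1 threshold units, one per nonempty subset of the hidden units:

```python
# Illustrative sanity check, not the paper's construction: compare an
# n-ReLU binary classifier to an OR over 2^n - 1 threshold units,
# one per nonempty subset of the hidden units.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 3, 10000            # hidden units, input dimension, test points (our choice)
W = rng.normal(size=(n, d))      # hidden-layer weights (arbitrary)
b = rng.normal(size=n)           # hidden-layer biases (arbitrary)
t = 0.5                          # positive output threshold (assumption)
X = rng.normal(size=(m, d))      # random test inputs

pre = X @ W.T + b                                   # pre-activations, shape (m, n)
relu_label = np.maximum(pre, 0.0).sum(axis=1) >= t  # n-ReLU classifier with positive output weights

# One threshold unit per nonempty subset S of the hidden units:
# the unit fires iff sum_{i in S} (w_i . x + b_i) >= t.
threshold_label = np.zeros(m, dtype=bool)
for k in range(1, n + 1):
    for S in itertools.combinations(range(n), k):
        threshold_label |= pre[:, list(S)].sum(axis=1) >= t

# The two classifiers agree on every sampled point.
assert np.array_equal(relu_label, threshold_label)
```

The subset enumeration makes the exponential blow-up explicit: for n hidden ReLUs, this simulation uses 2^n - 1 threshold units, consistent with the face count discussed in our response 6 to Reviewer 3.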