Paper ID: 439
Title: Convolutional Rectifier Networks as Generalized Tensor Decompositions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an analysis of the expressiveness of convolutional networks by means of introducing generalized tensor decompositions. This is possible by making the assumption of covering templates: given that each input instance is composed of N patches, there exists a finite set of M patches (templates) that leads to a tensor of N modes, each of dimension M, such that each entry of the tensor contains the output of the learned deep net function, and this is all the information required to express the learned function in the domain of interest. Thus, different network architectures lead to different (generalized) decompositions of this grid tensor, so the authors use known results in tensor decomposition to show, for example, whether a given network is universal by checking that the corresponding tensor decomposition can produce all possible tensors. This analysis leads to several interesting claims regarding the expressivity and depth efficiency of different architectures, such as convolutional arithmetic circuits (product pooling) and the common convolutional rectifier networks with either max or sum pooling.
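For concreteness, a minimal sketch of the grid-tensor construction described above. It is purely illustrative: the array names, the shapes, the toy score function, and the use of numpy are assumptions of this sketch, not the paper's notation.

import itertools
import numpy as np

# Illustrative setting: N patches per instance, M covering templates in R^s.
N, M, s = 4, 3, 5
templates = np.random.randn(M, s)      # the M templates

def score(patches):
    # Stand-in for the learned network's score function of N patches;
    # any real-valued function of the patches would do for this sketch.
    return float(np.sum(np.prod(patches, axis=0)))

# Grid tensor: N modes, each of dimension M; entry (d_1, ..., d_N) holds the
# score of the instance assembled from templates d_1, ..., d_N.
grid_tensor = np.empty((M,) * N)
for idx in itertools.product(range(M), repeat=N):
    grid_tensor[idx] = score(templates[list(idx)])

Under the covering-templates assumption, this finite tensor captures everything needed to compare what different architectures can express, which is what makes the reduction to tensor decompositions possible.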
Clarity - Justification:
The paper reads well, although some clarifications could be made to help the reader understand the concepts. For example, at the beginning of Sec. 4, it is stated that each instance is composed of a set of N patches in R^s. This could be confusing on a first read, as visual patches are tensors of order 2 (spatial) or 3 (+RGB), but here they appear as vectors. Similarly, the concepts of grid tensor and templates are explained very briefly. I found the explanation given at the beginning of Appendix A much clearer.

Significance - Justification:
The analytical framework based on generalized tensor decompositions and the results obtained are interesting, particularly the lack of universality of average pooling rectifier networks and the incomplete depth efficiency of max pooling rectifier networks, compared with the complete depth efficiency of product pooling networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In general I think this paper provides an interesting framework for analysing learning structures similar to the ones practitioners use. It is mostly well written, and the proofs look correct, although I have not thoroughly checked all details of the appendix.

Minor typos:
- Line 20: posses --> possess
- Line 1004: where referred --> were

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies the representational power of convolutional neural networks with rectified linear activations and average or max pooling. The topic is very contemporary. The problem is approached from a tensor decomposition perspective, which is very interesting. The paper includes a number of interesting claims regarding the universality of the discussed networks and their depth efficiency.

Clarity - Justification:
The paper is well written. The presentation of statements as claims makes it unclear whether they are conjectures or facts. In my opinion, the main technical concepts presented in the paper could be complemented by more intuition. This is especially so for the concept of covering templates, for the description of the constraints that appear in convolutional networks depending on the size of the convolutions, which are not further mentioned in the claims, and for the number log_2(N) of layers of the considered deep networks.

Significance - Justification:
The paper presents results showing that convolutional rectifier networks enjoy some form of depth efficiency. The paper goes on to argue in favour of networks with product pooling, which enjoy complete depth efficiency. The statements are very interesting. The paper makes connections to the theory of tensor decompositions and arithmetic circuits. This is an interesting way of looking at the problem and possibly a very promising one for future developments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper mentions that depth efficiency is only applicable to universal architectures. However, depth efficiency is defined as the property that the functions represented by a deep architecture require many more hidden units when represented by a shallow architecture. This extends naturally to the non-universal case.

In Section 5.2 the paper discusses independent weights across spatial locations. However, the interest of studying convolutional networks specifically lies in the additional constraints they impose beyond those of unconstrained rectifier networks. The results for shared weights are briefly mentioned in Section 5.3. It would seem natural to present these results more prominently.

Strictly speaking, depth efficiency does not necessarily mean that the functions represented by the deep network are more interesting than those represented by the shallow network. For instance, it could be that the deep network represents the same functions as the shallow network, just slightly translated. In this case, the intersection of the two function classes could be very small, but overall the two could be very similar. The discussion in Section 5.4 does not quite address this either; in fact, it discusses approximation to arbitrary accuracy, which is essentially the same as exact representation.

It would be interesting to consider "shallow efficiency" as well, in other words, whether the functions represented by a shallow network cannot be compactly represented by a deep network with 'small' layers.

It would seem natural to (also) express the results from Claims 8 and 9 in terms of the number of model parameters.

Depth efficiency alone is not the key factor of deep learning. It is also important that the classes of functions that can be represented by the deep networks are related to the problems at hand.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this submission, the authors study convolutional neural networks and shallow networks in the framework of arithmetic circuits. The authors propose Generalized Tensor Decompositions based on their Generalized Tensor Product. Next, the authors make the transition from networks to tensors with the help of the score function and a set of hierarchical tensor decompositions in equation 7, as compared to the shallow case in equation 6. The main conclusions of their work are that convolutional rectifier networks are universal with max pooling but not with average pooling. Also, the authors show that the depth efficiency of convolutional rectifier networks is weaker than that of convolutional arithmetic circuits.

Strengths:
- overall, an interesting framework to study CNNs and similar networks
- the relation to tensor decompositions

Weaknesses:
- the notation is very difficult to follow, and very often informal or incomplete
- the arguments the authors make are not very clearly formulated
- there is a complete lack of illustrations beyond the basics that could aid the authors' arguments
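To make the contrast between the shallow decomposition in equation 6 and the hierarchical one in equation 7 concrete, here is a minimal sketch. It is a deliberate simplification and an assumption of this sketch only: the paper's decompositions carry learned weights at every level, and the generalized versions replace the products below with an activation-pooling operator.

from functools import reduce
import numpy as np

M, N, Z = 3, 4, 5                        # template count, number of modes, shallow width

def outer(tensors):
    # Outer (tensor) product of a list of arrays: one set of modes per argument.
    return reduce(lambda a, b: np.tensordot(a, b, axes=0), tensors)

# Shallow, CP-like decomposition (cf. equation 6): a sum of Z rank-1 terms,
# each the outer product of N length-M vectors.
a = np.random.randn(Z, N, M)
shallow = sum(outer([a[z, i] for i in range(N)]) for z in range(Z))

# Hierarchical decomposition (cf. equation 7): vectors are merged pairwise
# over log2(N) levels, mirroring size-2 pooling windows in the deep network.
level = [np.random.randn(M) for _ in range(N)]
while len(level) > 1:
    level = [outer([level[i], level[i + 1]]) for i in range(0, len(level), 2)]
deep = level[0]

print(shallow.shape, deep.shape)         # both are tensors of order N: (M, M, M, M)

The structural point is that the hierarchical construction needs only log_2(N) pairwise-merging levels, which is where the depth-efficiency comparison between deep and shallow networks is made.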
Clarity - Justification:
See the major comments. Overall, the notation could be significantly improved, and made more consistent and easier to follow.

Significance - Justification:
Studying the family of functions that can be learned by a representation is fundamental to understanding hierarchical approaches such as CNNs. However, the obscurity of the mathematical arguments made by the authors makes the contribution hard to assess.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
1. The definition of the tensor product does seem to differ from the traditional tensor-product definitions, e.g. the one given in [A]. The authors should make this clearer. What is proposed in Definition 1 is rather an outer product on tensors.
2. Definition 2 of the generalised tensor product feels incomplete: e.g., is g(A,B) = g(B,A) for A \in R^{K x M} and B \in R^{N x O}, where K, M, N, O are not equal to each other? As far as the reviewer can tell, g(A,B) and g(B,A) would result in tensors of different dimensionalities in each mode (see the small dimension-count sketch after this list).
3. The notation of the score function in line 371 is somewhat difficult to follow. Further details such as \Omega(100) in line 406 leave the reader completely confused, while lines 415-439 are reasonably clear. The notation provided for figures 1 and 2 is very informal and poorly explained. An overall diagram of the grid tensor and of what the authors try to achieve would help the arguments.
4. The authors provide a rather dry set of claims for universality and depth efficiency. It would be better to focus on fewer claims but describe them in a manner approachable to the general reader.
5. Although the authors make a connection to tensor decompositions, it seems that the actual tensor decompositions and the mathematics behind, e.g., Tucker/CP are never used in the paper. What is the role of this connection? Tensor decompositions do provide mathematical tools, e.g. for rank; are any of these tools applicable in this work?
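Regarding item 2, a small dimension-count sketch of the point about mode ordering. The reading of the generalised tensor product below (an entrywise operator g applied in the outer-product index pattern), the helper name, and the use of numpy broadcasting are assumptions made only to illustrate the comment, not the paper's Definition 2.

import numpy as np

def gen_tensor_product(A, B, g=np.multiply):
    # Hypothetical reading: apply the scalar operator g in the same index
    # pattern as the outer product; with g = multiply this reduces to the
    # ordinary outer product of the two matrices.
    return g(A[..., None, None], B[None, None, ...])

A = np.random.randn(2, 3)               # K x M
B = np.random.randn(4, 5)               # N x O
print(gen_tensor_product(A, B).shape)   # (2, 3, 4, 5)
print(gen_tensor_product(B, A).shape)   # (4, 5, 2, 3): same entries, modes in a different order

Even when g itself is commutative, g(A,B) and g(B,A) can only agree up to a permutation of modes, which is the dimensionality mismatch the comment points out.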
Minor comments:
6. The authors 'jump' into a discussion about arithmetic circuits in the abstract without explaining what these circuits are. A few words of explanation or some rephrasing are needed.
7. Again, the abstract states that 'convolutional rectifier networks are universal with max pooling' without introducing what 'universal' means in this context. Similarly, line 133 talks about average pooling and 'universality' without introducing the concept first. The same is true for 'product pooling', which is discussed from the beginning but only defined in line 322.

Overall, the results in this paper seem interesting; however, the paper feels rushed, discussing various matters at length before they are even fully defined. It may be somewhat hard to access for a general audience - it feels as if a major rewrite could simplify the arguments made by the authors.

[A] T. G. Kolda and B. W. Bader, "Tensor Decompositions and Applications", SIAM Review, 2009.

=====