Paper ID: 992
Title: Discrete Deep Feature Extraction: A Theory and New Architectures

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper the authors introduce a rigorous definition of deep convolutional networks (DCNNs) that encompasses many standard and widely used CNN architectures. The main contribution of the paper is to formally prove a set of global and local properties of the activations at the various layers and of the extracted features (lines 466 ff.). The second contribution is to empirically evaluate DCNNs as feature extractors for SVM classifiers. The features are (a subset of) the activations at all layers of the DCNN, which is expressed as a tree-structured sequence of layer-wise feature extractors.

Clarity - Justification:
The paper is well written and well structured, although the second part of the contribution (the empirical evaluation) is disconnected from the theoretical results.

Significance - Justification:
The first part of the paper describes and proves interesting properties of the features extracted in (and by) DCNNs. Most of them seem aligned with intuitions that are widespread in the community.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The theoretical results extend those for the time-continuous case (Mallat, 2012); they are very interesting, and the reviewer has not seen these formal statements and their proofs before. The second part of the paper, the empirical evaluation, is comparatively weak, and it is not clear how to interpret its results in light of the theoretical ones.
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
In this submission, the authors study deep discrete feature extractors, which can be considered an extension of S. Mallat's work. The authors set the scene for their main theorem by describing the building blocks of their network, that is, the module-sequence. Next, the authors define and discuss sampled cartoon functions of maximal variation K and show that, because the feature extractor is Lipschitz-continuous, it is stable to additive bounded noise (e.g., eq. 13). Next, the authors show shift-invariance, that is, small translations result in small changes of the resulting feature vector. Furthermore, the authors discuss the link between deformation stability and energy preservation in the features resulting from applying the module-sequence. The authors evaluate their feature extractor on MNIST and on facial landmark detection.

Strengths:
- interesting theoretical analysis of invariance to translations and deformations
- strong links between the main theorems and Lipschitz-continuity
- the authors extend Mallat's work to the module-sequence, that is, feature extractors, non-linearities, and pooling operators on discrete signals

Weaknesses:
- over-complicated notation that is hard to follow
- lack of illustrations of the proposed methodology; e.g., even the definition of cartoon functions appears unnecessarily over-complicated, with no intuitive illustrations

Clarity - Justification:
As explained in the 'major comments', better discussion and illustrations seem needed to clarify the importance of the findings on translation- and deformation-invariance.

Significance - Justification:
There is no doubt that studies on invariances in deep architectures are of primary interest to machine learning.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
1. In line 086, the authors summarise the essence of the work in Wiatowski (2015) and state that deformation stability and translation invariance are induced by the structure of the network. Can we rely on these findings? The cited work appears to be an arXiv pre-print and as such has not been peer reviewed. Can the authors comment a bit more on this matter, if it is relevant to this manuscript?
2. The authors state that the existence of a lower bound in eq. 3 implies that the set in line 199 is complete. What definition of completeness is used here (the reviewer assumes the classical set-theoretic one), and why is it intuitively important for the further considerations, given that the authors say the bound does not have to hold for their theory to apply?
3. The authors indicate in line 457 that Figure 2 illustrates a cartoon function. Can the authors explain exactly where this function is illustrated? Figure 2 (right) indeed shows the brightness levels across a section and could perhaps be described by a set of cartoon functions, but it is not itself a cartoon function, unless the reviewer missed something.
4. The experiment reported in Table 1 seems to have no conclusions. So RBIO2.2 wavelets lowered the error rate on MNIST by 0.06%; is this a statistically significant finding? Similarly, the 'abs' and 'ReLU' non-linearities seem to work better irrespective of the pooling. How does this translate into the proposed theory on invariance to translations and deformations? Such a discussion is a must for a paper of this nature to be accepted.
5. The experiment on landmark detection and classification in section 6.2 is indeed more supportive of the proposed theory. It makes sense that the localisation task relies heavily on features of layer 1 (small translation invariance), while classification on MNIST would deem features of layers 1-3 important. Most interesting is the case of translated MNIST, where layer 3 turns out to contain the most relevant features.

Overall, the results of this work are interesting, e.g., the experiments in section 6.2. The reviewer's main concern lies with the clarity of the theorems and the lack of any illustrations that would simplify them. While the differences to Mallat's works seem reasonably clear, the main sections 5.1 (global properties) and 5.2 (local properties) require a much better description, examples, intuitive explanations, and a more extensive discussion section that would: i) discuss/recap the properties of each finding, ii) give examples and/or illustrations, iii) explain the implications. The reviewer happens to be familiar with Mallat's work, hence it was not a terrible chore to follow this paper; yet more clarity and less obscurity is needed for a general machine learning reader to appreciate it.
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper, the authors present a theory for discrete deep feature extraction. The framework is formalized, and the paper then analyzes its stability to deformations and translations. It is worth noting that the deformation stability only holds for cartoon functions.

Clarity - Justification:
The paper is well written, though occasionally not easy to follow for non-experts.

Significance - Justification:
The architecture of the network is similar to what has been presented in (Bruna and Mallat, 2013), but with extra pooling layers added.
The theoretical analysis, as far as the reviewer is concerned, is similar to the earlier work on ScatNet. Although this paper emphasizes the discrete nature of the architecture, it turns out that the earlier work can also be successfully applied to discrete inputs. Thus I would consider the contribution rather incremental.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper studies an interesting variation of the deep convolutional neural network. It proposes to use multiple layers of wavelet filters to build the network, while ensuring that the network is translation invariant and deformation stable. The paper focuses on discrete inputs, which is an improvement over the earlier papers' focus on continuous inputs. There are several concerns about the paper:
1. A key feature of ScatNet is its translation invariance. However, neither Theorem 1 nor Theorem 2 establishes translation invariance for the proposed network. Theorem 1 says that the network is translation-stable, which may not be a strong enough property. Theorem 2 says that the output will align with the pooling grid if the translation itself aligns with the grid. This appears to be a much weaker conclusion than local translation invariance.
2. In the experiments, the proposed method was not compared with any baseline approach. One should at least compare with (Bruna and Mallat, 2013) to show that the proposed method is empirically attractive.
3. If the network becomes deep, the number of features will be exponentially large. How can the user handle the computational intractability? What is the running time of the proposed method on MNIST? (A rough count illustrating this growth is sketched after the review.)
=====
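To make the feature-count concern in Review #3, point 3, concrete, here is a rough back-of-the-envelope count; the branching factor $r$ and depth $D$ are assumed illustrative values, not numbers taken from the paper. If every feature map in layer $d-1$ is passed through $r$ filters, then layer $d$ contains $r^{d}$ feature maps, and collecting the outputs of all layers into the feature vector gives

% r and D below are illustrative assumptions, not values from the paper.
\[
  \sum_{d=0}^{D} r^{d} \;=\; \frac{r^{D+1}-1}{r-1}
  \qquad \text{feature maps in total.}
\]

For example, $r = 10$ and $D = 3$ already yield $1 + 10 + 100 + 1000 = 1111$ feature maps, so the count grows exponentially with depth unless the filter tree is pruned.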