Paper ID: 335 Title: Analysis of Deep Neural Networks with Extended Data Jacobian Matrix

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a way to analyze the function learned by a neural net by splitting it into several neural nets, each with a single output unit, and stacking the Jacobian vectors of these networks evaluated at each training point (or validation point). We can then examine the singular value spectrum of this “extended data Jacobian matrix.” Intuitively, nets with a flatter, more uniform spectrum implement more complicated and rapidly varying functions. Deeper nets have higher normalized singular values for the major components, as do convnets, nets trained with dropout, and nets trained using distillation. Ensembles have smaller normalized singular values for the major components than the individual ensemble members. The paper argues that this sort of analysis provides interesting insights on neural nets that may lead to better models and a greater understanding of when models will work well. The paper provides an example algorithm that incorporates the spectrum of the extended data Jacobian matrix into the cost function.

Clarity - Justification:
There are no major clarity or reproducibility problems. However, some sections could be much clearer. The ReLU-centric presentation is somewhat confusing given the generality of the method. In Section 2.2, a comment about why the biases do not appear in the equations would also not be amiss. The authors should define the [ ] notation before its first use as well. The mathematical notation could be clearer about when quantities are vectors or scalars, given the abundance of vectors with both superscripts and subscripts; at the least, the text should be modified to make this clearer.

Significance - Justification:
The paper makes a convincing case that this way of analyzing neural networks will teach us a lot about these models and has the potential to improve how we train and design them. Although similar ideas have occurred to many researchers, I am not aware of other work that gives as impressive a set of insights. Better, more practical notions of complexity, variability, and capacity for deep neural networks will be very useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is reasonably clear and discusses a compelling new tool for analyzing neural networks. The results for nets trained with dropout do not make sense to me, and more explanation of these results should be added. Shouldn't dropout produce models with less overfitting and reduced capacity? Another idea in this vein is to run experiments on nets that have overfit badly and ones that have not overfit much. How does the spectrum change as training progresses? The language surrounding the selection of epsilon is confusing (see the sketch at the end of this review for how I am reading the construction). Is it 90% from the top or the bottom? How many singular values are discarded, 90% or 10%? I do not find the spectrum regularization method very compelling, but the paper has enough of a contribution without it to be interesting. The plots need to be fixed. They are almost unreadable in color, and in black and white, or for readers with impaired color vision, they would be useless. Some sort of line marker is needed, and many of the plots are very small.
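To make my question about epsilon concrete, here is a minimal sketch of how I am reading the construction (my own code, not the authors'): a tiny bias-free ReLU net, split per output unit, with the Jacobian rows stacked over the data. The value of epsilon and the "mean of retained values" score below are illustrative assumptions only.

# Minimal sketch of the EDJM construction as I read it: split the net into
# single-output nets, stack each one's input Jacobian over the data, and look
# at the normalized singular value spectrum. The tiny bias-free ReLU net, the
# cutoff eps, and the score definition are my own illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 200, 10, 32, 3
X  = rng.standard_normal((n, d_in))
W1 = rng.standard_normal((d_hid, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((d_out, d_hid)) / np.sqrt(d_hid)

def edjm(k):
    """EDJM of the single-output net for output unit k: one Jacobian row per example."""
    rows = []
    for x in X:
        mask = (W1 @ x > 0).astype(float)          # ReLU activation pattern at x
        rows.append(W2[k] @ (mask[:, None] * W1))  # gradient of output k w.r.t. x
    return np.vstack(rows)                         # n x d_in

eps = 0.1                                          # illustrative cutoff only
for k in range(d_out):
    s = np.linalg.svd(edjm(k), compute_uv=False)
    s_norm = s / s[0]                              # normalize by the top singular value
    score = s_norm[s_norm > eps].mean()            # flatter spectrum -> higher score
    print(f"output {k}: spectrum {np.round(s_norm, 3)}, score {score:.3f}")

If this reading is wrong, that only reinforces my point that the selection of epsilon needs a clearer description.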
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Given the difficulty of analyzing deep neural networks with built-in nonlinearities, this work considers the special case where ReLU is used everywhere, making the observation that the layer-by-layer forward computation then reduces to a per-data, per-output-dimension linear operation. With this simplification, one can analyze the so-called ‘extended data Jacobian matrix’ to quantitatively measure the capacity of the learned models. Experiments are done with three standard datasets. This also inspires a regularization method that encourages large singular values for the model weights.

Clarity - Justification:
The paper is well-written and easy to follow.

Significance - Justification:
This work complements the existing analysis and proposes a novel way to measure the capacity of different neural networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, it is a well-written paper, with sufficient experiments to back up the claims. The main source of my confusion is the weak connection between model complexity and the proposed metric (e.g., the distribution of the singular value spectrum, or the defined score). It seems to contradict the fact that the models achieving higher test accuracies are the models with strong regularization, and hence limited model capacity, especially on datasets such as MNIST, CIFAR10, and TIMIT. Those models, however, have higher scores in your experiments. This also worries me in Section 5. The transition from Eq. (13) to the equation at L759 seems quite abrupt. One is based on the per-data linear assumption, while the other is based on the actual weights. I fail to see a tight connection between the two. Could you say a few words about the computational cost of using the proposed regularization term, compared with training a model without it?
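To make the computational-cost question concrete, the kind of layer-wise penalty I believe the equation at L759 describes looks roughly like the sketch below. This is my own reconstruction for the purpose of discussing cost; the exact form of the penalty and the coefficient are assumptions, not taken from the paper. Even this crude version adds one SVD of every weight matrix to each evaluation of the objective, roughly O(m * n * min(m, n)) for an m x n layer, and I would like to see that overhead quantified against a plain training step.

# Rough reconstruction of a layer-wise spectral penalty (not the paper's exact
# Eq. (13) / L759 form): push each weight matrix towards a flat singular value
# spectrum by penalizing the gap between the normalized spectrum and 1.
import numpy as np

def spectral_penalty(weights, lam=1e-3):
    """weights: list of 2-D weight matrices; returns a scalar penalty."""
    total = 0.0
    for W in weights:
        s = np.linalg.svd(W, compute_uv=False)  # one SVD per layer per call
        s_norm = s / s[0]                       # normalize by the top singular value
        total += np.sum(1.0 - s_norm)           # zero iff the spectrum is flat
    return lam * total

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((512, 784)),
      rng.standard_normal((512, 512)),
      rng.standard_normal((10, 512))]
print(spectral_penalty(Ws))                     # would be added to the training loss

Reporting wall-clock time per epoch with and without the term would answer the question directly.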
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a method for characterizing the complexity of the nonlinearities employed by different neural network architectures, called the extended data Jacobian matrix (EDJM). This matrix is formed by linearizing the network around each data example and stacking these linear functions into a matrix. The spectrum of this matrix therefore captures the degree of variation in the linear maps needed to take each example to the network’s output. Experiments show that the spectrum of this matrix correlates with several features of a deep network, including accuracy and depth. Finally, the paper proposes a regularization technique based on these insights, which regularizes each weight matrix of a deep network towards a flatter singular value spectrum. This beats the accuracy of a dropout-trained network on one of the three datasets tried.

Clarity - Justification:
The paper is clear and easy to follow.

Significance - Justification:
The paper makes several nice points about the complexity of deep network functions, but there are several ways the arguments could be made more theoretically sound.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper makes several nice points about the complexity of deep network functions, but there are several ways the arguments could be made more theoretically sound.

First, rather than treating each output dimension independently, it would seem better to construct a Data Jacobian Tensor of dimension n x d_in x d_out. Decomposing this tensor would give a better sense of the total complexity of the learned input-output map.

Second, the normalization of the spectrum by the largest singular value is not clearly motivated. Why not use the raw spectrum? The scale would appear to be set already by the scale of y and x used during training.

Third, while the regularization scheme on the EDJM, Eqn. 13, may be sensible, the layer-wise version actually implemented has no clear relationship to it. The spectra of the individual weight matrices may say very little about the EDJM of the overall map. For instance, consider identity weight matrices at every layer with positive input examples (raw MNIST, say). With ReLU units, the units will always be switched on. Then the layer-wise spectrum is perfectly flat, but the EDJM is the exact opposite, with only one nonzero singular value (see the discussion of linear nets below, and the numerical check at the end of this review).

Fourth, it would be very useful to discuss simple limiting cases: what do the proposed metrics measure for a fully linear deep network such as the one considered in Eqn. 1? I believe the EDJM will be rank 1, with a single nonzero singular value. This also illustrates the point about normalizing the spectrum: after normalization by the maximum singular value, all linear maps will look identical according to the proposed score metric, yet clearly their performance can differ markedly. Likewise, it would be useful to discuss the regularization method proposed in Section 5 with reference to linear networks. It seems as though the layer-wise regularization of singular values will have no impact on the EDJM score as it stands.

Finally, the experiments contain intriguing insights, and most appear to be done to a high standard. However, the abstract’s claim of a “novel spectral regularization method, which improves network performance” may be a bit of an overclaim, given that it does so on only 1 of 3 datasets.

Minor comments:
Line 601: In fact, dropout encourages rows to become _more_ correlated; when a unit is dropped out, other units must redundantly carry that information in order to perform the classification correctly.
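To make the third and fourth points concrete, here is a small numerical check (my own sketch; the depth and layer width are arbitrary). With identity weights and strictly positive inputs, every ReLU stays on, each layer's spectrum is perfectly flat, yet the EDJM of any single output unit has exactly one nonzero singular value.

# Numerical check of the identity-weights example: the layer-wise spectra are
# perfectly flat, yet the per-output EDJM of the end-to-end ReLU map is rank 1.
import numpy as np

d, L, n = 8, 3, 100
W = [np.eye(d) for _ in range(L)]                 # identity weights at every layer
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(n, d))            # strictly positive inputs

# Layer-wise spectrum of each weight matrix: all ones, i.e. perfectly flat.
print(np.linalg.svd(W[0], compute_uv=False))

# EDJM of the single-output net for output unit 0: one Jacobian row per example.
rows = []
for x in X:
    h, J = x.copy(), np.eye(d)
    for Wl in W:
        pre = Wl @ h
        mask = (pre > 0).astype(float)            # always 1: units never switch off
        J = (mask[:, None] * Wl) @ J              # chain rule through the ReLU layer
        h = np.maximum(pre, 0.0)
    rows.append(J[0])                             # gradient of output unit 0 w.r.t. x
EDJM = np.vstack(rows)                            # n x d, every row identical

s = np.linalg.svd(EDJM, compute_uv=False)
print(np.round(s / s[0], 6))                      # [1, 0, 0, ...]: rank-1 EDJM

The same computation for a generic linear network (no ReLU, arbitrary weights) also yields a single nonzero singular value, which is why, after normalization by the largest singular value, all linear maps receive the same score.

=====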