We thank all the reviewers for their time and helpful comments.
R_1:
"First ... better to construct a Data Jacobian Tensor..."
This is a good point; we will add a discussion of it to the paper. We considered both approaches and chose matrix factorization over tensor factorization for the following reasons: 1) tensor factorization has identifiability issues; 2) tensor factorization is NP-hard in general; 3) the computational cost of tensor factorization in our experimental setting can be infeasible; 4) matrix factorization of EDJMs lets us investigate and compare the different linear systems that map inputs to the different output classes.
"Second, the normalization of the spectrum by the largest singular value is not clearly motivated..."
The loss for the classification tasks in our experiments is the cross-entropy between the true 1-of-k label and the softmax of the network output. Because the softmax normalizes the outputs, the loss does not pin down the absolute scale of the pre-softmax linear systems, so different networks can realize them at significantly different scales; normalizing each spectrum by its largest singular value makes the spectra comparable across networks.
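To illustrate the normalization concretely, here is a minimal numpy sketch (not the paper's implementation): a toy two-layer ReLU network is locally linear at each input, its data Jacobian there can be read off in closed form, and dividing the singular values by the largest one yields a scale-free spectrum. The network shapes and the helper `data_jacobian` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network: f(x) = W2 @ relu(W1 @ x)  (illustrative sizes)
W1 = rng.standard_normal((8, 5))
W2 = rng.standard_normal((3, 8))

def data_jacobian(x):
    # A ReLU net is locally linear, so the Jacobian at x is
    # W2 @ diag(active) @ W1, where "active" masks the live ReLU units.
    active = (W1 @ x > 0).astype(float)
    return W2 @ (W1 * active[:, None])

x = rng.standard_normal(5)
s = np.linalg.svd(data_jacobian(x), compute_uv=False)
spectrum = s / s[0]  # normalize by the largest singular value: scale-free
```

Two networks whose outputs differ only by an overall scale would produce identical `spectrum` vectors, which is the point of the normalization.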
"Third ... the layer wise version actually implemented has no clear relationship..."
We agree with the reviewer that the paper currently lacks a strong mathematical relationship between regularizing the per-layer spectra and the EDJM spectrum. We have, however, tested this hypothesis extensively, and empirically the layerwise regularization does boost the EDJM spectrum. We are also working on regularizing the network by explicitly manipulating the singular values of the EDJM itself. We chose to include these preliminary results at the end of the paper to point to directions where network analysis can lead to proposals for performance improvement.
"Fourth ... discuss simple limiting cases ... for a fully linear deep network..."
For a fully linear network, as well as for logistic regression, the EDJM has rank 1, which makes the analysis independent of the data. The intuitions and observations in this paper are based on well-trained ReLU networks. If the linear system varies little across input instances (for a linear network it is identical for all inputs, as you point out), there is no reason to analyze such a network conditioned on the data set. In practice, however, as our experiments reflect, deep networks with nonlinear units induce different linear systems for different input data points.
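This contrast can be sketched in a few lines of numpy (an illustrative toy, not the paper's code): a linear network has the same Jacobian `W2 @ W1` at every input, while a ReLU network's Jacobian changes with the activation pattern, so conditioning on the data is only informative in the latter case. Negating the input is used here simply because it flips every ReLU activation.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((6, 4))
W2 = rng.standard_normal((3, 6))

def jac_linear(x):
    # Fully linear network: the Jacobian is W2 @ W1 for every input x
    return W2 @ W1

def jac_relu(x):
    # ReLU network: the Jacobian depends on which units are active at x
    active = (W1 @ x > 0).astype(float)
    return W2 @ (W1 * active[:, None])

xa = rng.standard_normal(4)
xb = -xa  # flips every ReLU activation pattern
same_linear = np.allclose(jac_linear(xa), jac_linear(xb))  # identical
same_relu = np.allclose(jac_relu(xa), jac_relu(xb))        # differs
```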
"Finally ... given that it does so only on 1 of 3 datasets."
Although our regularization method outperforms dropout on only one dataset, it is worth noting that it outperforms the baseline (no regularization) by a clear margin, and on TIMIT the improvement is significant.
R_2:
"... It contradicts to the fact that models achieving higher test accuracies are models with strong regularizations..."
All of our analysis is conducted conditioned on a data set. This differs from unconditioned model-complexity analysis, about which we make no claims. In our experiments we use the validation set to prevent overfitting, so the models can still benefit from learning a more complex function (e.g., 4-layer vs. 1-layer). When overfitting becomes the major issue, overly complex models may have very different linear systems for different data inputs while their performance is poor.
"... The transition from Equ (13) to eq. in L759 seems quite abrupt..."
We agree; please see our reply to Reviewer 1 regarding this point. In general, our regularization (the equation at L759) serves as a computationally feasible surrogate for directly regularizing the spectra of the EDJMs. Our experiments also show that, under this regularization, the spectra of the EDJMs do change accordingly.
"... computational cost using the proposed regularization term..."
We will add a discussion of this to the paper. Since we decompose each layer into three linear layers, the training cost is roughly three times that of an ordinary layer. For inference, however, we can merge the three matrices back into a single matrix, so there is no extra cost.
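The merging step can be sketched as follows (a minimal numpy illustration under the assumption that a layer `W` is kept as three factors `U`, `S`, `V` during training; the shapes and names are hypothetical): the three matrix multiplies used during training collapse into one matrix product computed once before inference.

```python
import numpy as np

rng = np.random.default_rng(2)
# A layer decomposed into three linear layers U, S, V for training,
# so the (diagonal) singular values in S can be regularized directly.
U = rng.standard_normal((7, 4))
S = np.diag(rng.uniform(0.5, 2.0, 4))
V = rng.standard_normal((4, 5))

x = rng.standard_normal(5)
# Training-time forward pass: three matrix-vector products
y_train = U @ (S @ (V @ x))

# Inference: fold the three factors into one matrix once,
# then a single product per input -- no extra cost vs. an ordinary layer
W = U @ S @ V
y_infer = W @ x
```

The fold is exact, so inference outputs match the training-time forward pass bit-for-bit up to floating-point rounding.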
R_4:
"... some sections could be much clearer ..."
Thank you for the comments. We hope to address them in the accepted version of the paper.
"...Shouldn't dropout produce models with less overfitting and reduce capacity?"
In this paper we study network complexity conditioned on a data set, which differs from the unconditioned model complexity that dropout addresses. The suggested experiments on training progress and on overfit models are definitely of interest and will be studied thoroughly in future work.
"...Is it 90% from the top or the bottom?"
The 90% of singular values discarded in the metric are taken from the bottom: their values are very small and tend to be noise, so we keep only the top 10%. We will clarify this in the paper given the chance.
"The plots need to be fixed..."
We hope to have the opportunity to fix the plots in a final version.