We sincerely thank the reviewers for their appreciation of our work and their thoughtful comments.

Reviewer #1:

Lines 269-270: What we meant to say is that we have explored the possibility of extending traditional EBMs to ones with deep structured architectures. The reviewer is correct that the EBM is a very general framework that does not assume any particular encoding of the energy function. Considering deep structured encodings, however, is of particular practical importance, as verified by our experiments. We will rephrase this part in the final version to avoid confusion.

Prevision --> precision: it should be "precision". We will correct this in the final version.

Code: we plan to release the experimental code for public access after the paper is accepted.

Reviewer #2:

Novelty: we agree with the reviewer that our modeling approach builds upon several recent works connecting autoencoders and EBMs via score matching. However, prior to this work it was not clear in the literature 1) whether deep structured (CNN, RNN) EBMs can be successfully trained with score matching, and 2) whether density models formulated as deep EBMs are an effective approach to anomaly detection. Our work answers both questions affirmatively and bridges the gap between recent developments in EBMs (and deep learning in general) and the anomaly detection community.

Reviewer #3:

Figure 1: the thick green curve at the bottom is the energy curve, so points with energy above the threshold (i.e., with low probability) are outliers; x_2 is one such point. The thin blue curve at the top is the reconstruction error, which is also low at x_2, because x_2 happens to sit at a local maximum of the energy curve, where the gradient of the energy, and hence the reconstruction error, vanishes. x_2 is therefore considered an inlier by the reconstruction error, and is thus a false positive.

Comparison to the state-of-the-art: we have tried our best to include state-of-the-art baselines in all three settings.

For the detailed comments:

1. We agree with the reviewer's point about the local nature of score matching, which can be problematic, especially given the existence of adversarial examples. We believe that adversarial examples are intrinsically difficult to handle unless the model/algorithm is explicitly designed for them, which score matching is not (nor are other off-the-shelf methods). It would be very interesting to extend the standard score matching procedure beyond its local nature and perform correct credit assignment to data points far from the training data, but this idea probably deserves a paper of its own, so we leave it as future work.

2. We did not experience the vanishing gradient problem, possibly because we did not use very deep models (a 2-layer encoder and 4 layers in total). We hypothesize that for much deeper models this could be mitigated with batch normalization.

3. All hyperparameters are selected on a held-out validation set for each benchmark. We omitted the detailed settings due to the space limit, but we will make them public by releasing the experimental code.

4. The autoencoder scoring paper does, to some extent, provide a more general view of interpreting autoencoders as EBMs, and it can be applied to discrete data such as binary inputs. However, the main issue is that autoencoder scoring requires integrating the vector field (the reconstruction residual), which can be done in closed form only for one-layer autoencoders whose activation function has a known antiderivative.
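For concreteness (a sketch in our own notation, following the known result of the autoencoder scoring paper): for a tied-weight one-layer autoencoder with sigmoid activation, $r(x) = W^\top \sigma(Wx + b) + c$, the vector field $r(x) - x$ integrates in closed form to the energy

$$E(x) = \sum_k \mathrm{softplus}(w_k^\top x + b_k) + c^\top x - \tfrac{1}{2}\|x\|^2 + \mathrm{const},$$

because softplus is the antiderivative of the sigmoid. No such closed-form antiderivative is available once the encoder has more than one layer, which is exactly the limitation we refer to, and which score matching on deep EBMs avoids because it only requires the vector field itself.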
As for handling discrete data with deep EBMs, one simple approach would be to normalize the inputs and treat them as continuous, e.g., using TF-IDF features instead of raw bag-of-words counts. We will cite this paper and add a related discussion in the final version.
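To illustrate this preprocessing step (a minimal sketch using scikit-learn; the toy count matrix and variable names are for illustration only, not from our experiments):

    # Sketch: convert raw bag-of-words counts into continuous TF-IDF
    # features before feeding them to a continuous-input deep EBM.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    # Toy document-term count matrix (rows: documents, columns: vocabulary terms).
    bow_counts = np.array([[3, 0, 1],
                           [0, 2, 0],
                           [1, 1, 4]])

    # TfidfTransformer reweights counts by inverse document frequency and
    # L2-normalizes each row, yielding real-valued inputs.
    x_continuous = TfidfTransformer(norm="l2").fit_transform(bow_counts).toarray()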