Thank you very much for the valuable feedback. Below we describe how, if the paper is accepted, we will improve it based on the reviews.

-------- Review#1

* Table 1: We will unify the notation as suggested.

* Presentation of Section 2.1: Points are well taken. We will revise the section (including its title) based on the comments.

* "During the first read ... I thought the model used local (or even, symbolic) representations in some way (as opposed to distributed) ...": We will work on making it clear that "one-hot" as in "one-hot LSTM" refers to the type of (externally made) input to the LSTM, not to internal representations learned by the LSTM.

-------- Review#2

* Suggestion to compare with queryCategorizr: Interesting. We will look into it.

-------- Review#3

* "I'd like to see the dimensionalities of the matrices ... e.g. if V is the vocab size and d is the hidden size of the LSTM, what's W?": As the LSTM receives one word at each time step, W^{(*)} (* in {i,o,f,u}) of the one-hot LSTM would be d-by-V. A weight matrix of a one-hot convolution layer of region size 3 would be d-by-3V. Indeed, it is clearer with the dimensionalities; we will add them (a shape sketch is given at the end of this response).

* Some details are missing: To save space, in several places we ended up omitting details that appear in the references. We will try hard to put them back.
  - Distinct region sizes of CNNs are 2 and 3 on IMDB and 3 and 4 on Elec.
  - Prediction in tv-embedding learning: The weighted square loss sum_{i,j} q_{i,j} (z_i[j] - p_i[j])^2 was minimized, where i goes through the instances, z_i represents the target regions by a bow vector, p_i is the model output, and the weights q_{i,j} were set to achieve the negative sampling effect (see the sketch below).
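To make the dimensionalities discussed for Review#3 concrete, here is a minimal sketch with hypothetical sizes V and d; this code is illustrative only and is not taken from the paper.

    import numpy as np

    V = 30000      # vocabulary size (hypothetical)
    d = 500        # LSTM hidden size / number of feature maps (hypothetical)
    region = 3     # convolution region size

    # One-hot LSTM: each gate's input weight maps a one-hot word vector to d units,
    # so W^{(i)}, W^{(o)}, W^{(f)}, W^{(u)} are each d-by-V.
    W_gate = np.zeros((d, V))

    # One-hot convolution with region size 3: the filter is applied to the
    # concatenation of 3 one-hot vectors, so the weight matrix is d-by-3V.
    W_conv = np.zeros((d, region * V))

    x_t = np.zeros(V)                 # one-hot input word at time step t
    x_region = np.zeros(region * V)   # concatenated one-hot region input
    assert (W_gate @ x_t).shape == (d,)
    assert (W_conv @ x_region).shape == (d,)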
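For the weighted square loss in the last item, a minimal sketch follows; the weighting function shown (1 for observed words and for a few sampled negatives, 0 otherwise) is our illustrative assumption of how a negative-sampling effect can be obtained, not necessarily the exact scheme used in the paper.

    import numpy as np

    def weighted_square_loss(Z, P, Q):
        # Z: bow vectors of the target regions (n x V)
        # P: model outputs (n x V)
        # Q: per-element weights q_{i,j} (n x V)
        return np.sum(Q * (Z - P) ** 2)

    def negative_sampling_weights(Z, num_neg=5, seed=0):
        # Assumed weighting: observed words (z_i[j] > 0) get weight 1, and a small
        # random sample of the remaining vocabulary also gets weight 1, so that
        # most zero entries of z_i are ignored (negative-sampling effect).
        rng = np.random.default_rng(seed)
        Q = (Z > 0).astype(float)
        n, V = Z.shape
        for i in range(n):
            Q[i, rng.choice(V, size=num_neg, replace=False)] = 1.0
        return Q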