Thank you very much for the valuable feedback. Below we describe how, if the paper is accepted, we will improve it based on the reviews.

-------- Review#1

* Table 1: We will unify the notation as suggested.

* Presentation of Section 2.1: Points are well taken. We will revise the section (including its title) based on the comments.

* "During the first read ... I thought the model used local (or even, symbolic) representations in some way (as opposed to distributed) ...": We will work on making it clear that "one-hot" as in "one-hot LSTM" refers to the type of (externally made) input to the LSTM, not to internal representations learned by the LSTM.

-------- Review#2

* Suggestion to compare with queryCategorizr: Interesting. We will look into it.

-------- Review#3

* "I'd like to see the dimensionalities of the matrices ... e.g. if V is the vocab size and d is the hidden size of the LSTM, what's W?": As the LSTM receives one word at each time step, W^{(*)} (* in {i,o,f,u}) of the one-hot LSTM would be d-by-V. A weight matrix of a one-hot convolution layer of region size 3 would be d-by-3V. Indeed, it is clearer with the dimensionalities; we will add them (a shape sketch is given at the end of this response).

* Some details are missing: To save space, in several places we ended up omitting details that appear in the references. We will try hard to put them back.
  - Distinct region sizes of CNNs are 2 and 3 on IMDB and 3 and 4 on Elec.
  - Prediction in tv-embedding learning: The weighted square loss sum_{i,j} q_{i,j} (z_i[j] - p_i[j])^2 was minimized, where i goes through the instances, z_i represents the target regions by a bow vector, p_i is the model output, and the weights q_{i,j} were set to achieve the negative sampling effect (see the sketch below).
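To make the dimensionalities discussed for Review#3 concrete, here is a minimal sketch with hypothetical sizes V and d; this code is illustrative only and is not taken from the paper.

    import numpy as np

    V = 30000      # vocabulary size (hypothetical)
    d = 500        # LSTM hidden size / number of feature maps (hypothetical)
    region = 3     # convolution region size

    # One-hot LSTM: each gate's input weight maps a one-hot word vector to d units,
    # so W^{(i)}, W^{(o)}, W^{(f)}, W^{(u)} are each d-by-V.
    W_gate = np.zeros((d, V))

    # One-hot convolution with region size 3: the filter is applied to the
    # concatenation of 3 one-hot vectors, so the weight matrix is d-by-3V.
    W_conv = np.zeros((d, region * V))

    x_t = np.zeros(V)                 # one-hot input word at time step t
    x_region = np.zeros(region * V)   # concatenated one-hot region input
    assert (W_gate @ x_t).shape == (d,)
    assert (W_conv @ x_region).shape == (d,)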
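For the weighted square loss in the last item, a minimal sketch follows; the weighting function shown (1 for observed words and for a few sampled negatives, 0 otherwise) is our illustrative assumption of how a negative-sampling effect can be obtained, not necessarily the exact scheme used in the paper.

    import numpy as np

    def weighted_square_loss(Z, P, Q):
        # Z: bow vectors of the target regions (n x V)
        # P: model outputs (n x V)
        # Q: per-element weights q_{i,j} (n x V)
        return np.sum(Q * (Z - P) ** 2)

    def negative_sampling_weights(Z, num_neg=5, seed=0):
        # Assumed weighting: observed words (z_i[j] > 0) get weight 1, and a small
        # random sample of the remaining vocabulary also gets weight 1, so that
        # most zero entries of z_i are ignored (negative-sampling effect).
        rng = np.random.default_rng(seed)
        Q = (Z > 0).astype(float)
        n, V = Z.shape
        for i in range(n):
            Q[i, rng.choice(V, size=num_neg, replace=False)] = 1.0
        return Q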