Paper ID: 270
Title: Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a series of models and model combinations for text classification. Mainly, the authors propose to drop the embedding layer of a text LSTM and instead push that embedding directly into the layers of the LSTM. As the authors note, this still means there is an embedding matrix, since the first layer applied in their "one-hot LSTM" still selects a specific column from a now very large matrix inside the LSTM. The authors get great performance on several common benchmarks.

Clarity - Justification:
I liked most of the flow of the paper. I would like to see the dimensionalities of the matrices and word vectors declared to avoid confusion: e.g., if V is the vocabulary size and d is the hidden size of the LSTM, what is W? I assume that with a window of 3 words it would be 3V x d?

Significance - Justification:
The paper pushes the state of the art on some hard datasets. Well done!

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I do not like your citation style (e.g., "JZ15"); it interrupts the flow.

It is great that you plan to upload the code. I do hope the authors really do that, since there seem to be a lot of hidden tricks as well as complex model combinations in these experiments.

There are some interesting tidbits in this paper! In the section "More on one-hot CNN vs. one-hot LSTM": what are the distinct region sizes that were used?

Most important missing piece: how do you predict the words in the tv-embeddings? Do you predict multiple words with a square loss? It is quite unclear; please provide the equation in the paper.

It would be interesting to see some examples of documents that you get right that the single-word-vector models do not get right. I would also love to see how well this works on shorter documents or single sentences, e.g., on the Stanford Sentiment Treebank. You can easily ignore the tree structure and just take in each phrase.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes extending the idea of 1-hot CNNs to 1-hot LSTMs in order to produce larger, context-dependent word or region embeddings. This approach yields good performance on the task of text classification.

Clarity - Justification:
In general the paper is easy to read, and the figures make it easy to follow. Some issues: Table 1: my understanding is that all entries except the top one use pooling. Can you unify the notation? Using “…+pooling” for one entry and “…p” for the others is unnecessarily confusing. Otherwise, please clarify whether they actually differ in some manner.

Significance - Justification:
Novelty is incremental in the sense that the paper replaces part of an existing method (the CNN in 1-hot CNNs) with another existing method (an LSTM). The change is validated by the experimental results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some notes on Section 2.1: the second paragraph (lines 275-280) talks about the LSTM having word embeddings inside, which is a trivial result shared by any type of neural network that starts with a feedforward layer; "word-vector lookup table" and "projection layer" are already used interchangeably in the literature. Thus, this discussion does not contribute to the main question (why it is a good idea).
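To spell out why I consider this trivial: applying a dense layer to a one-hot input simply selects one column of the weight matrix, which is exactly a word-vector lookup. A minimal NumPy sketch of this point (the sizes and names are hypothetical, not taken from the paper):

    import numpy as np

    V, d = 30000, 500            # hypothetical vocabulary size and LSTM hidden size
    W = np.random.randn(d, V)    # weight matrix acting on the one-hot input

    word_id = 1234
    one_hot = np.zeros(V)
    one_hot[word_id] = 1.0

    # The product with a one-hot vector just picks out column `word_id` of W,
    # i.e. it behaves exactly like an embedding lookup table.
    assert np.allclose(W @ one_hot, W[:, word_id])

So the existence of an embedding inside the model is not the interesting question; why the reparameterization helps is.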
The third paragraph (lines 281-295) is similar: it discusses a trivial result and fails to answer the question posed in the section title. I do not think the main question of interest is representational power; many different architectures are universal approximators, after all, but still some work well and others do not. I think the main point that needs addressing is how word vectors change the learning dynamics / the priors assigned to different models. Is it possible that having word vectors starts you from a region of parameter space that is closer to optimal models that generalize better to new data?

The following paragraph discusses starting from untrained word vectors (and agrees that pretrained word vectors are a form of unsupervised learning), but my understanding is that the general advocacy for word vectors in the literature specifically suggests using pretrained vectors, exactly for the semi-supervised learning advantages discussed here. The main advantage is that you get good representations for free (in the sense that you can use large amounts of raw data with no labeling process) and you start your supervised search from there. In that sense, using pretrained word vectors and training one-hot LSTMs on unlabeled data in a semi-supervised fashion are not really very different approaches. Granted, the latter are not really word vectors anymore, since they do not represent words in isolation; they are more like context-dependent word embeddings. In conclusion, I think the discussion here is potentially a bit confusing or misleading.

Experiments: I enjoyed the experiments; they are very comprehensive and test orthogonal aspects of the models. The discussions of the experiments were satisfactory.

Conclusion: In general I think this paper is a good contribution; however, the presentation has some issues in places. On a first read, since the paper claims that one-hot representations are as good as (or better than) word vectors, I thought the model used local (or even symbolic) representations in some way (as opposed to distributed ones), but the model still learns distributed representations after all (though, as mentioned, I agree that they are not really word vectors). I also think that Section 2.1 needs a revision to address the other points above.
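Returning to the point about pretrained word vectors above, the workflow I have in mind is: train representations on unlabeled text, then use them to initialize the supervised model and fine-tune. A rough PyTorch sketch with made-up names and shapes (not the authors' setup):

    import numpy as np
    import torch
    import torch.nn as nn

    V, d = 30000, 300                      # hypothetical vocabulary and embedding sizes
    pretrained = np.random.randn(V, d)     # stand-in for word2vec/GloVe vectors learned from unlabeled text

    # Initialize the supervised model's lookup table from the unsupervised vectors
    # and keep fine-tuning it on the labeled data (freeze=False).
    embedding = nn.Embedding.from_pretrained(
        torch.tensor(pretrained, dtype=torch.float32), freeze=False
    )
    lstm = nn.LSTM(input_size=d, hidden_size=500, batch_first=True)

    token_ids = torch.randint(0, V, (8, 20))   # a batch of 8 documents, 20 tokens each
    outputs, _ = lstm(embedding(token_ids))    # supervised training then starts from this pretrained point

Pretraining tv-embeddings / one-hot LSTMs on unlabeled data plays the same role; the difference is only what the pretrained component represents (words in isolation vs. regions in context).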
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
Given the recent success of Convolutional Neural Networks in text categorization, the authors propose a related but more involved method that leverages LSTMs and jointly learns features for text and a linear model that uses these features for categorization. Experimental results on 3 benchmark datasets show performance improvements over methods that train word embeddings and over previous methods for text region embeddings.

Clarity - Justification:
The paper is very easy to follow, and all the concepts are well explained. There is a preliminary subsection that explains the building blocks of the proposed method, including the LSTM and its application to text categorization.

Significance - Justification:
The proposed method is a simplification of the previously proposed word2vec-LSTM model. The authors first remove the word embedding layer of the existing model and then introduce additional modifications, such as a bidirectional layer, pooling, and chopping.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While the modifications are incremental and, one may say, straightforward, I still feel the paper contributes to the field, as it shows improved performance compared to existing baselines.

One suggestion I have is to compare against the simple semi-supervised extension of word2vec proposed in: M. Grbovic, N. Djuric, V. Radosavljevic, N. Bhamidipati, "querycategorizr: A large-scale semi-supervised system for categorization of web search queries", WWW 2015. Since you know the class labels for a subset of the text, you could learn vectors for the class labels themselves by leveraging context (the previous and next sentences/regions in your document). It is very simple to implement. Since, after training, text and labels live in the same feature space, for new text you can simply find the nearest label vector in the joint feature space.
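The inference step I have in mind is just a nearest-neighbor lookup against the learned label vectors. A rough NumPy sketch with made-up shapes (not code from the cited paper):

    import numpy as np

    d, num_labels = 300, 20
    label_vecs = np.random.randn(num_labels, d)   # label embeddings learned jointly with the text
    doc_vec = np.random.randn(d)                  # embedding of a new, unlabeled document

    # Cosine similarity between the document and every label vector;
    # the predicted class is the nearest label in the joint space.
    sims = (label_vecs @ doc_vec) / (
        np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-8
    )
    predicted_label = int(np.argmax(sims))

The training side (learning the label vectors from surrounding regions) is described in the cited paper; the point is that test-time classification reduces to this lookup.

=====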