Paper ID: 322
Title: Fast k-Nearest Neighbour Search via Dynamic Continuous Indexing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper provides a near neighbor search scheme which has all the nice properties possessed by LSH, such as dynamic updates, sub-linear approximate search and fine control over the accuracy-efficiency trade-off. The algorithm uses LSH bits, but instead of simple indexing into hash tables it builds trees over the LSH indices for search. The algorithm empirically outperforms LSH; however, the comparison does not seem fair.

Clarity - Justification: The paper is easy to read.

Significance - Justification: It is known that LSH based on Hamming distance can lead to sub-linear search. Instead of building hash tables, the proposal uses trees.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper looks reasonable but far from complete.

Pros: The authors try to analyse random-projection-based LSH in terms of preserving the order of the top-k near neighbors (the probability that the top-k points are ranked ahead of non-neighbors). This is a good line of thought; however, it is not clear why it beats LSH.

Cons: The authors show sublinear guarantees, but there is no argument for why they are better than the LSH guarantee of n^{\rho}. Without an argument (or analysis) for when the proposed guarantee dk(n/k)^{1-\log\gamma} < n^{\rho}d holds, the theoretical contributions are not substantial, as obtaining sublinear guarantees with random projections is a known fact. (An illustrative numerical sketch of this comparison appears below, after Review #2's Significance section.)

Experiments: Experiments with LSH are a little tricky, as the performance is very sensitive to the choice of the K and L parameters, which depend on the operating threshold. The paper forces K and L for LSH to be the same as the m and L of the proposed algorithm in the comparison. This is not right. The collision probability affects the choice of K and L, so K and L should be tuned independently, letting every algorithm use its best settings, for a fair comparison.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper describes an algorithm for approximate nearest-neighbor search in high-dimensional Euclidean spaces. The proposed method is interesting in that it does not follow standard space-partitioning strategies, and thereby side-steps some of the inherent difficulties of that approach. The proposed method is simple enough in concept, and appears to work well (compared to LSH) on a couple of image datasets. The bulk of the paper is devoted to theoretical analysis of nearest neighbor retrieval success.

Clarity - Justification: The basic idea and motivation for this paper are clearly expressed, but the technical presentation could be improved. First, notation is not consistently applied: the same symbols get reused in several different ways (listed below). Second, the two algorithms (indexing and query) should be described in separate figures with clearly stated inputs and outputs. Third, the illustrative figures (1a--c) are quite difficult to parse, since the quantities of interest are unlabeled. Finally, there should be some explicit discussion of the query algorithm (not just its theoretical properties).

Significance - Justification: The proposed method is conceptually simple and elegant, and appears to work well (compared to LSH). While I would personally like to see a more thorough comparative evaluation against more advanced methods than plain LSH, I expect that this work could be of broad interest to the community.
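A minimal numerical sketch of the bound comparison Review #1 asks for is given below. All parameter values (n, d, k, gamma, rho) are assumed purely for illustration and are not taken from the paper, and the base-2 logarithm in the exponent is likewise an assumption, since the review does not specify the base.

import math

# Compare the two query-cost expressions quoted in Review #1:
#   proposed:  d * k * (n/k)^(1 - log2(gamma))      vs.      LSH:  n^rho * d
# All values below are assumed for illustration only (not from the paper).
n, d, k = 10**6, 128, 10
gamma = 1.5   # growth parameter from the paper's Definitions 4 & 5; assumed value
rho = 0.5     # LSH exponent at the operating approximation ratio; assumed value

proposed = d * k * (n / k) ** (1 - math.log2(gamma))
lsh = d * n ** rho

print(f"proposed ~ {proposed:.3e}   LSH ~ {lsh:.3e}   proposed < LSH: {proposed < lsh}")

Which side of the inequality wins clearly depends on gamma and rho, which is exactly the missing analysis the review points to.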
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): In Section 2, the authors claim that the number of leaves in RP-trees grows exponentially with the (ambient) dimension (line 156). This doesn't seem correct: the analysis of RP-trees depends on the doubling dimension of the dataset, not the ambient dimension. Another line of work which may be relevant to the authors (and the related work section) is the rank cover tree method [1], since it also relies on rank-ordering of distances rather than space partitions to facilitate nearest neighbor retrieval. The query algorithm is never clearly described outside of the pseudo-code. Additionally, the pseudo-code for Query does not appear to be completely parameterized; the authors claim that the algorithm allows the user to control the size of the result set, but this is not exposed. A bit of reformatting here could significantly improve readability.

Minor comments:
- Algorithm 1, line 251: "T_j" => "T_{jl}", "w_j" => "w_{jl}"
- Figure 1 (a-c): these would be easier to read with less visual clutter, clearly labeled vectors, and without relying on color (which can be problematic for color-blind individuals).
- Some notation is overloaded in many places:
  - "l" refers to a vector (Lemma 1) or a tree index (Algorithm 1)
  - "p" refers to a random projection (Lemma 1), a query point (Theorem 12), or an element of D (Algorithm 1)
  - "d" denotes dimensionality or distance

[1] Houle, Michael E., and Michael Nett. "Rank-based similarity search: Reducing the dimensional dependence." IEEE Transactions on Pattern Analysis and Machine Intelligence 37.1 (2015): 136-150.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a very interesting new data structure for approximate nearest-neighbour search and claims to avoid the unfavourable dependence on the data dimensionality by completely avoiding space partitioning. The authors present a very simple preprocessing step to index the data in a data-dependent manner and allow for easy insertion of points. Given this index, the authors present a search algorithm and theoretically analyse the failure probability of this algorithm for any query. Using insights from this analysis, the authors present search runtime bounds for a query-independent version of the search algorithm. The authors then present a query-adaptive version of the search algorithm which utilizes results from the analysis to stop the search algorithm early for "easier" queries and continue on for "harder" queries. The proposed method is then compared to vanilla LSH and the empirical results are extremely promising.

Clarity - Justification: The paper is very well written and the details presented in the paper are easy to follow. The algorithm and the analysis are clearly presented. One piece of this paper that is not clear is the three-dimensional images in Figures 1b and 1c. The idea of the images is great, but the images are a little hard to understand given the image caption and the text that refers to them. The idea being conveyed through these images is not completely clear to me. Another unclear piece is how the runtime claims made for the data-dependent version of the search algorithm are justified. The text in lines 660-666 describes this, but I feel that readers could use more elaboration as to how the stopping criterion is explicitly set so that the runtime bound in Theorem 8 is satisfied.
Significance - Justification: The authors fairly argue that space-partitioning tree data structures for nearest-neighbour search suffer from an unfavourable dependence on the data dimensionality and that LSH struggles on datasets with varying density. However, both of these methods have their strengths -- the trees adapt to the data density, while the random projections in LSH allow it to avoid the exponential dependence on the data dimensionality. Given this, the authors propose a search algorithm (along with an indexing scheme) which adapts to the data density while avoiding the exponential dependence on the dimensionality by making a key (but often ignored) observation -- while the Johnson-Lindenstrauss transform preserves all pairwise distances with high probability, nearest-neighbour search *only* needs the preservation of the relative order between the true neighbour(s) and the rest of the data points in terms of their distances to the query. In my view, the combination of the strengths of the two existing methods and the exploitation of the aforementioned key idea makes this a very significant contribution. The proposed method also gives the user complete control over the accuracy-efficiency tradeoff, which sets it apart from LSH.

However, this manuscript misses some crucial existing work. For example, the authors do not compare empirically or theoretically to the search algorithm with randomized partition trees proposed by Sanjoy Dasgupta and Kaushik Sinha (see the next section for citations). The authors also do not compare and contrast their proposed method to nearest-neighbour search with other data structures that do not rely on space partitioning, such as skip-lists (Clarkson, 2006, section 5.2.3) or cover trees (Beygelzimer et al., 2006). Related to this, the authors introduce \gamma in Definitions 4 & 5 without acknowledging the notion of expansion rate (or expansion constant) present in Karger & Ruhl, 2002.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Returning to the discussion of the relationship between \gamma and the expansion rate c from Karger & Ruhl, 2002: using the authors' notation, |B_p(2r)| \leq c |B_p(r)|. So \gamma and c might be related as \log_2 \gamma = 1 / \log_2 c. Now, the quantity \log_2 c is related to a notion of intrinsic dimension d' (the doubling dimension) of the dataset (Clarkson, 2006, sections 4.1 & 4.2), which is equal to the original data dimension d for points sitting on a uniform grid. Given this, the runtime bound in Theorem 8 can be written as O(max(dk log(n/k), dk (n/k)^{1 - 1/d'})), and for moderately large d' the runtime will be dominated by the second term. (This substitution is spelled out in a short sketch at the end of these reviews.) Since both cover trees (Beygelzimer et al., 2006) and randomized partition trees (Dasgupta & Sinha, 2013, 2015; Sinha, 2014) have results depending on this expansion rate, the authors should contrast how the proposed method has a more favourable dependence on the dimensionality (intrinsic or original) compared to these existing methods.

Aforementioned citations:
• Dasgupta, Sanjoy, and Kaushik Sinha. "Randomized partition trees for nearest neighbor search." Algorithmica 72.1 (2015): 237-263.
• Dasgupta, Sanjoy, and Kaushik Sinha. "Randomized partition trees for exact nearest neighbor search." Conference on Learning Theory (COLT), 2013.
• Sinha, Kaushik. "LSH vs Randomized Partition Trees: Which One to Use for Nearest Neighbor Search?" 13th International Conference on Machine Learning and Applications (ICMLA), IEEE, 2014.
• Beygelzimer, Alina, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." Proceedings of the 23rd International Conference on Machine Learning (ICML), ACM, 2006.
• Clarkson, Kenneth L. "Nearest-neighbor searching and metric space dimensions." Nearest-Neighbor Methods for Learning and Vision: Theory and Practice (2006): 15-59.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a new algorithm for fast k-nearest-neighbor search. The algorithm is based on a continuous indexing scheme, built from random projections, which aims to preserve neighborhood relations. The paper seems well motivated and well written. However, the paper is very far from my expertise and I have very little on which to base my review.

Clarity - Justification: I was able to follow the main story of the paper; other than that, it is very far from my expertise.

Significance - Justification: N/A

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): My area of expertise is very far from this topic; I did not bid for the paper; I have no idea why it was given to me, because I am not qualified to review it. I am therefore giving scores that aim not to negatively influence its chances of acceptance.

=====
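For reference, Review #3's substitution argument written out step by step. It assumes (as that review suggests) that \log_2 \gamma = 1 / \log_2 c, and that the exponent in the Theorem 8 bound is 1 - \log_2 \gamma, as the bound quoted in Review #1 suggests; neither assumption is established here, so this is only an illustrative sketch.

% Expansion rate c (Karger & Ruhl, 2002):   |B_p(2r)| \le c\,|B_p(r)|.
% Assumed relation suggested by Review #3:  \log_2 \gamma = 1 / \log_2 c.
% Doubling-dimension-style quantity:        d' := \log_2 c.
\[
  1 - \log_2 \gamma \;=\; 1 - \frac{1}{\log_2 c} \;=\; 1 - \frac{1}{d'} ,
\]
% so, under the assumed form of the Theorem 8 bound,
\[
  O\!\Big(\max\big(dk\log(n/k),\; dk\,(n/k)^{1-\log_2\gamma}\big)\Big)
  \;=\;
  O\!\Big(\max\big(dk\log(n/k),\; dk\,(n/k)^{1-1/d'}\big)\Big) ,
\]
% and for moderately large d' the second term dominates, as Review #3 notes.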