Focused Crawling with Scalable Ordinal Regression Solvers
Rashmin Babaria - Indian Institute of Science, INDIA
Saketha Nath Jagarlapudi - Indian Institute of Science, INDIA
Krishnan S. Kumar - Indian Institute of Science, INDIA
Sivaramakrishnan Ramanujam Kaveri - Indian Institute of Science, INDIA
Chiranjib Bhattacharyya - Indian Institute of Science, INDIA
M. Narasimha Murty - Indian Institute of Science, INDIA
In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The first main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. The second is to pose focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing it as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state of the art.