New results on information theoretic clustering
Ferdinando Cicalese · Eduardo Laber · Lucas Murtinho

Wed Jun 12th 02:35 -- 02:40 PM @ Room 102

An impurity measure I is a function that assigns a vector v to a non-negative value so that the more homogeneous v, with respect to the values of its components, the larger its impurity. We study the problem of optimizing the clustering of a set of vectors when the quality of the clustering is measured by the Entropy or the Gini impurity. Notwithstanding the wide use in relevant applications of this type of clustering, what is known in terms of optimal (approximation) algorithms and/or related complexity limitations (inapproximability) is still limited.

Our results improve the state of the art both in terms of best known approximation guarantees and inapproximability bounds: (i) we give the first polynomial time algorithm for Entropy impurity based clustering with approximation guarantee independent of the number of vectors and (ii) we show that the problem of clustering based on entropy impurity does not admit a PTAS. This also implies on an inapproximability result in information theoretic clustering for probability distributions closing a problem left open in [Chaudhury and McGregor, COLT08] and [Ackermann et al., ECCC11].

We also report experiments on a novel clustering method based on the theoretical tools leading to the above results. Tested for word clustering, our implementation produces clusterings with impurity close to those given by state of the art algorithms with the advantage of running up to 3 orders of magnitude faster.

Author Information

Ferdinando Cicalese (University of Verona)
Eduardo Laber (PUC-RIO)
Lucas Murtinho Murtinho (PUC-RJ)

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors