Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering
Sharlee Climer - Washington University in St. Louis
Weixiong Zhang - Washington University in St. Louis
Cluster analysis is a fundamental problem and technique in many areas related to machine learning. In this paper, we consider rearrangement clustering, which is the problem of finding sets of objects that share common or similar features by arranging the rows (objects) of a matrix (specifying object features) in such a way that adjacent objects are similar to each other (based on a similarity measure of the features) so as to maximize the overall similarity. Based on formulating this problem as the Traveling Salesman Problem (TSP), we develop a new TSP-based optimal clustering algorithm called TSPCluster. We overcome a flaw that is inherent in previous approaches by relaxing restrictions on dissimilarities between clusters. Our new algorithm has three important features: finding the optimal k clusters for a given k, automatically detecting cluster borders, and ascertaining a set of most viable clustering results that make good balances among maximizing the overall similarity within clusters and dissimilarity between clusters. We apply TSPCluster to cluster and display 500 genes of flowering plant Arabidopsis which are regulated under various abiotic stress conditions. We compare TSPCluster to the bond energy algorithm and two existing clustering algorithms. Our TSPCluster code is available at the authors' websites.