Workshop: Subset Selection in Machine Learning: From Theory to Applications

Geometrical Homogeneous Clustering for Image Data Reduction

Shril Mody · Janvi Thakkar · Devvrat Joshi · Siddharth Soni · Nipun Batra · Rohan Patil


In this paper, we present novel variations of an earlier approach, the homogeneous clustering algorithm, for reducing dataset size. The intuition behind the proposed approaches is to partition the dataset into homogeneous clusters and select the images that contribute significantly to accuracy. We place an important criterion on the sampled images: they should be human-readable. We propose four variations of the baseline algorithm, RHC, to achieve better accuracy: RHCKON, KONCW, CWKC, and GHCIDR. RHCKON selects the k farthest neighbours and the one nearest neighbour of the centroid of each cluster. The intuition behind RHCKON is that boundary points contribute significantly to the representation of clusters. KONCW and CWKC introduce cluster weights to RHCKON, on the premise that larger clusters contribute more than smaller ones. The final variation, GHCIDR, selects points based on the geometrical aspects of the data distribution. We performed experiments on two deep learning models: Fully Connected Networks (FCN) and VGG1. We evaluated the four variants on three datasets: MNIST, CIFAR10, and Fashion-MNIST. We found that GHCIDR gave the best accuracy of 99.35%, 81.10%, and 91.66%, with a training data reduction of 87.27%, 32.34%, and 76.80%, on MNIST, CIFAR10, and Fashion-MNIST respectively.
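The selection step described for RHCKON (k farthest points plus the one nearest point to a cluster centroid) could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes Euclidean distance on flattened image vectors, assumes the centroid is the cluster mean, and the function name `select_k_farthest_one_nearest` is hypothetical. How the homogeneous clusters are formed and how cluster weights are applied are not covered here.

```python
import numpy as np

def select_k_farthest_one_nearest(points, k):
    """Return sorted indices of the point nearest the centroid plus the k farthest.

    `points` is one cluster, shape (n_samples, n_features).
    Assumption: Euclidean distance and a mean centroid (not stated in the abstract).
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    dists = np.linalg.norm(points - centroid, axis=1)
    order = np.argsort(dists, kind="stable")  # ascending distance to centroid
    nearest = order[:1]
    # Take the farthest from the remaining points so the nearest is not picked twice.
    farthest = order[1:][-k:] if k > 0 else order[:0]
    return np.unique(np.concatenate([nearest, farthest]))

# Toy cluster of five 2-D points standing in for flattened images.
cluster = [[0, 0], [1, 0], [2, 0], [10, 0], [3, 0]]
selected = select_k_farthest_one_nearest(cluster, k=2)
```

In this toy cluster the centroid lies at (3.2, 0), so the point at index 4 is the nearest and the points at indices 3 and 0 are the two farthest. If a cluster has fewer than k + 1 points, the sketch simply returns every point in it.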