In this work, we investigate the role of data in enhancing the venerable k-means algorithm. Instead of running the traditional data-agnostic iteration, we treat data as a first-class citizen and proactively channel the significant computations towards high expressive (HE) data points, as opposed to low expressive (LE) ones. We show that LE points do not affect the convergence or the quality of the results. Our experiments revealed that real-world data contains a substantial amount of HE points, resulting in significant savings in compute, training resources (memory), and time. The concept is illustrated by the cover figure.
The main contributions are:
- kmeans-d pushes data-centric AI (DCAI) into the model-building phase itself, examining whether the downstream benefits can be as significant for a classical, well-studied algorithm like k-means.
- Depending on the data, kmeans-d achieves significant speedups (10x to 300x) while preserving algorithmic accuracy.
- The key innovation is to classify data points as either high expressive (HE), which impact the objective function significantly, or low expressive (LE), which have minimal influence on it.
- By rethinking k-means through a data lens, kmeans-d delivers superior efficiency without sacrificing properties like accuracy and convergence, paving the way for infusing data-centricity into other canonical algorithms.
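
The HE/LE idea in the contributions above can be illustrated with a minimal sketch. This page does not state how kmeans-d decides which points are HE or LE, so the rule used here is an assumption made purely for illustration: a point is treated as LE for an iteration when the gap between its nearest and second-nearest centroid is provably larger than twice the largest centroid movement, so by the triangle inequality its assignment cannot change. This is similar in spirit to bound-based accelerations of Lloyd's algorithm and is not claimed to be the authors' criterion. The function name `kmeans_he_le_sketch` and the `work` counter are hypothetical.

```python
import numpy as np


def kmeans_he_le_sketch(X, k, n_iter=50, seed=0):
    """Lloyd-style k-means that recomputes distances only for 'HE' points.

    ASSUMPTION: the HE/LE rule below (a triangle-inequality margin test) is
    illustrative only; it is not the rule defined by kmeans-d. Requires k >= 2.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.zeros(n, dtype=int)
    margin = np.zeros(n)         # lower bound on (2nd-nearest - nearest) distance
    he = np.ones(n, dtype=bool)  # first pass: every point is treated as HE
    work = 0                     # point-to-centroid distances actually computed
    iters = 0

    for _ in range(n_iter):
        iters += 1
        if he.any():
            # Full distance computation only for the HE points.
            d = np.linalg.norm(X[he][:, None, :] - centroids[None, :, :], axis=2)
            work += d.size
            labels[he] = d.argmin(axis=1)
            two_smallest = np.partition(d, 1, axis=1)[:, :2]
            margin[he] = two_smallest[:, 1] - two_smallest[:, 0]

        # Standard centroid update; an empty cluster keeps its old centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)

        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift == 0.0:
            break  # centroids stopped moving: converged

        # Each centroid moved by at most `shift`, so every margin shrinks by
        # at most 2 * shift. Points whose bound stays positive are LE.
        margin -= 2.0 * shift
        he = margin <= 0.0

    return centroids, labels, work, iters
```

Comparing `work` against `iters * len(X) * k`, the cost of recomputing every distance in every iteration, gives a rough sense of how much computation such a rule skips on a given dataset. Any savings observed with this sketch say nothing about the speedups reported for kmeans-d itself.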
Cite this article (BibTeX):
TBA


DOI (TBA)
Code
Data (TBA)