What Data-Centric AI Can Do For kmeans–a Faster, Robust kmeans-d

June 17, 2024

Parichit Sharma, Hasan Kurban, Mehmet M. Dalkilic – Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, Data Centric Machine Learning Research Workshop (DMLR), 2024 (Accepted)

In this work, we investigate the role of data in enhancing the venerable kmeans. Instead of the traditional data-agnostic iteration, data is treated as a first-class citizen, and significant computation is proactively channeled towards high expressive (HE) data (as opposed to low expressive (LE) data). We show that LE data does not affect the convergence or quality of results. Our experiments revealed that real-world data contains a substantial amount of LE points, resulting in significant savings in compute, training resources (memory), and time. The concept is illustrated by the cover figure.

The main contributions are:

kmeans-d pushes data-centric AI (DCAI) into the model-building phase itself, examining whether the benefits seen downstream can be equally significant in a classical, well-studied algorithm like k-means.
Depending on the data, kmeans-d achieves significant speedups (10x to 300x) while preserving algorithmic accuracy.
The key innovation is to classify each data point as either high expressive (HE), meaning it significantly affects the objective function, or low expressive (LE), meaning it has minimal influence.
By rethinking kmeans through a data lens, kmeans-d delivers superior efficiency without sacrificing properties such as accuracy and convergence, paving the way for infusing data-centricity into other canonical algorithms.
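
The idea of spending distance computations only on points that can still change the result can be sketched in a few lines. This summary does not state the criterion kmeans-d uses to separate HE from LE points, so the sketch below substitutes a different, well-known technique: triangle-inequality distance bounds in the style of Hamerly's accelerated k-means. The function name `kmeans_filtered`, the `he` mask, and the `recomputed` counter are hypothetical names chosen for illustration, and the sketch assumes `k >= 2`.

```python
import numpy as np

def kmeans_filtered(X, k, n_iter=50, seed=0):
    """Lloyd's k-means that skips points whose assignment provably cannot change.

    NOTE: this is NOT the kmeans-d criterion. It is a stand-in based on
    Hamerly-style distance bounds, used only to illustrate the HE/LE idea.
    Assumes k >= 2.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    # One full pass: exact distances from every point to every center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    part = np.partition(d, 1, axis=1)
    upper = part[:, 0]  # upper bound on distance to the assigned center
    lower = part[:, 1]  # lower bound on distance to any other center

    recomputed = []  # number of points re-examined in each iteration
    for _ in range(n_iter):
        # Update step: move each center to the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if shift.max() == 0:
            break  # no center moved, so the assignment is final

        # Loosen the bounds by how far the centers moved.
        upper = upper + shift[labels]
        lower = lower - shift.max()

        # Points whose bounds overlap may change cluster ("HE" in this sketch);
        # the rest keep their label without any distance computation ("LE").
        he = upper >= lower
        recomputed.append(int(he.sum()))
        if he.any():
            d = np.linalg.norm(X[he][:, None, :] - centers[None, :, :], axis=2)
            labels[he] = d.argmin(axis=1)
            part = np.partition(d, 1, axis=1)
            upper[he] = part[:, 0]
            lower[he] = part[:, 1]
    return centers, labels, recomputed
```

Calling `kmeans_filtered(X, k=3)` on a two-dimensional array `X` returns the centers, the labels, and a per-iteration count of re-examined points. Because the bounds are conservative, skipped points keep the label a full Lloyd iteration would have given them, which mirrors the claim that LE points do not affect convergence or quality. The per-iteration counts give a rough sense of how much work is avoided on a given dataset.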

Cite this article (BibTeX):

TBA
