{ "id": "2006.06856", "version": "v1", "published": "2020-06-11T22:17:16.000Z", "updated": "2020-06-11T22:17:16.000Z", "title": "Bandit-PAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits", "authors": [ "Mo Tiwari", "Martin Jinye Zhang", "James Mayclin", "Sebastian Thrun", "Chris Piech", "Ilan Shomorony" ], "comment": "18 pages", "categories": [ "cs.LG", "cs.AI", "stat.ML" ], "abstract": "Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering algorithm, $k$-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \\log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from Code.org, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.", "revisions": [ { "version": "v1", "updated": "2020-06-11T22:17:16.000Z" } ], "analyses": { "keywords": [ "multi-armed bandits", "68k pbmc single-cell rna", "genomics 68k pbmc single-cell", "bandit-pam", "linear time" ], "note": { "typesetting": "TeX", "pages": 18, "language": "en", "license": "arXiv", "status": "editable" } } }