[FEA] Out of Core KMeans Clustering#1691

Open
tarang-jain wants to merge 67 commits into rapidsai:main from tarang-jain:minibatch-kmeans
Conversation

@tarang-jain
Contributor

@tarang-jain tarang-jain commented Jan 9, 2026

This PR adds support for out-of-core (dataset resident on host) kmeans clustering. The idea is simple:

  1. Batched accumulation of centroid updates: Data is processed in batches, and per-batch centroid sums and cluster counts are accumulated until the full dataset pass has completed.
  2. MiniBatchKMeans: Centroids are updated after each randomly sampled host batch, using an online learning rate that changes with each batch.

[01/09/2026]

  1. Mathematically equivalent to the current kmeans. This PR just adds a batch-size parameter: cluster assignments and (weighted) centroid contributions are computed on batches of the dataset, and the accumulated sums are averaged into a single centroid update only once the whole dataset pass has completed. A single kmeans iteration of this path therefore matches a single iteration of regular kmeans exactly. Distinction from MiniBatchKMeans: there, the centroids are updated after each batch (faster to converge).
  2. Binary size:
    While the batched approach uses a different header, I've put the batched fit functions into the same TU as kmeans_fit, so common functions such as minClusterAndDistance instantiated with the same template parameters should not be recompiled.
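For illustration, the batched accumulation from point 1 can be sketched in NumPy. This is a minimal sketch of the math, not the PR's C++ implementation; the function and parameter names here are hypothetical:

```python
import numpy as np

def batched_lloyd_iteration(X, centroids, batch_size):
    """One Lloyd iteration where X is processed in fixed-size batches."""
    k, dim = centroids.shape
    sums = np.zeros((k, dim))  # accumulated per-cluster coordinate sums
    counts = np.zeros(k)       # accumulated per-cluster point counts
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]  # stand-in for a host-to-device copy
        # min-cluster assignment for this batch
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            mask = labels == c
            sums[c] += batch[mask].sum(axis=0)
            counts[c] += mask.sum()
    # the centroid update happens only once the full dataset pass has completed
    updated = centroids.copy()
    nonempty = counts > 0
    updated[nonempty] = sums[nonempty] / counts[nonempty, None]
    return updated
```

Because the per-cluster sums and counts are associative, the result is identical for any batch_size, which is the claimed equivalence to regular kmeans.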

[02/17/2026]
This PR also adds a GPU version of sklearn's MiniBatchKMeans, so centroids are updated at every batch (potentially faster convergence, at the cost of preparing randomly sampled batches on the CPU).
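As a rough sketch of the per-batch update style used by sklearn's MiniBatchKMeans (a count-based decaying learning rate per center; the names below are hypothetical, not this PR's API):

```python
import numpy as np

def minibatch_update(batch, centroids, counts):
    """Update centroids from one randomly sampled batch.

    counts[c] tracks how many points cluster c has absorbed so far, so the
    per-center learning rate 1/counts[c] decays as training progresses.
    """
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for x, c in zip(batch, labels):
        counts[c] += 1
        lr = 1.0 / counts[c]  # online learning rate, changes with each point
        centroids[c] = (1.0 - lr) * centroids[c] + lr * x
    return centroids, counts
```

Unlike the full-batch path above, each call moves the centroids immediately, which is why convergence can be faster but the result is no longer identical to Lloyd's algorithm.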

@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change cpp labels Jan 9, 2026
@tarang-jain tarang-jain self-assigned this Jan 9, 2026
@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Visualization of clusters on wiki_all_1M (1M × 768 fp32), 256 centroids, with CPU UMAP for dimensionality reduction:

Regular KMeans:
[image]

Full batch (Lloyd's algorithm with batched accumulation of centroid updates), batch_size = 100k:
[image]

MiniBatchKMeans:
Minibatch size = 100k, with early stopping: stop when 10 consecutive batches produce no improvement in inertia.
[image]
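The early-stopping rule above (stop after 10 consecutive non-improving batches) can be tracked with a small counter. A hypothetical sketch, not the PR's actual implementation:

```python
def make_early_stopper(patience=10):
    """Return a callable fed per-batch inertia; it returns True when we should stop."""
    state = {"best": float("inf"), "bad": 0}

    def update(inertia):
        if inertia < state["best"]:
            state["best"] = inertia  # improvement: remember it, reset counter
            state["bad"] = 0
        else:
            state["bad"] += 1        # no improvement on this batch
        return state["bad"] >= patience

    return update
```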

@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Benchmarks on wiki_all_1M (1M × 768 fp32 dataset):

| Method | n_clusters | batch_size | max_iter | Time (s) | Centroid updates | Inertia |
| --- | --- | --- | --- | --- | --- | --- |
| Regular fit | 256 | — | 100 | 1.401 | 101 | 2.748459e+06 |
| Batched FULL-BATCH | 256 | 100,000 | 100 | 32.224 | 101 | 2.751807e+06 |
| Batched MINI-BATCH | 256 | 100,000 | 100 | 7.671 | 147 | 2.844728e+06 |

@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Fit time with different batch sizes (wiki_all_1M, 1M × 768):
[image]
