[FEA] Out of Core KMeans Clustering#1691

Open
tarang-jain wants to merge 67 commits into rapidsai:main from tarang-jain:minibatch-kmeans
Conversation

@tarang-jain
Contributor

@tarang-jain tarang-jain commented Jan 9, 2026

This PR adds support for out-of-core (dataset resident on host) kmeans clustering. The idea is simple:

  1. Batched accumulation of centroid updates: Data is processed in batches, and per-batch centroid sums and cluster counts are accumulated until the full dataset pass has completed.
  2. MiniBatchKMeans: Centroids are updated after each randomly sampled host batch, using an online learning rate that changes with each batch.

[01/09/2026]

  1. Mathematically equivalent to the current kmeans. This PR just adds a batch-size parameter: cluster assignments and (weighted) centroid contributions are computed on batches of the dataset, and the accumulated sums are averaged into a single centroid update only once the whole dataset pass has completed. A single kmeans iteration of this path therefore matches a single iteration of regular kmeans exactly. Distinction from MiniBatchKMeans: there, the centroids are updated after each batch (faster to converge).
  2. Binary size:
    While the batched approach uses a different header, I've put the batched fit functions into the same TU as kmeans_fit, so common functions such as minClusterAndDistance instantiated with the same template parameters should not be recompiled.
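For illustration, the batched accumulation from point 1 can be sketched in NumPy. This is a minimal sketch of the math, not the PR's C++ implementation; the function and parameter names here are hypothetical:

```python
import numpy as np

def batched_lloyd_iteration(X, centroids, batch_size):
    """One Lloyd iteration where X is processed in fixed-size batches."""
    k, dim = centroids.shape
    sums = np.zeros((k, dim))  # accumulated per-cluster coordinate sums
    counts = np.zeros(k)       # accumulated per-cluster point counts
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]  # stand-in for a host-to-device copy
        # min-cluster assignment for this batch
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            mask = labels == c
            sums[c] += batch[mask].sum(axis=0)
            counts[c] += mask.sum()
    # the centroid update happens only once the full dataset pass has completed
    updated = centroids.copy()
    nonempty = counts > 0
    updated[nonempty] = sums[nonempty] / counts[nonempty, None]
    return updated
```

Because the per-cluster sums and counts are associative, the result is identical for any batch_size, which is the claimed equivalence to regular kmeans.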

[02/17/2026]
This PR also adds a GPU version of sklearn's MiniBatchKMeans, so centroids are updated at every batch (potentially faster convergence, at the cost of preparing randomly sampled batches on the CPU).
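As a rough sketch of the per-batch update style used by sklearn's MiniBatchKMeans (a count-based decaying learning rate per center; the names below are hypothetical, not this PR's API):

```python
import numpy as np

def minibatch_update(batch, centroids, counts):
    """Update centroids from one randomly sampled batch.

    counts[c] tracks how many points cluster c has absorbed so far, so the
    per-center learning rate 1/counts[c] decays as training progresses.
    """
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for x, c in zip(batch, labels):
        counts[c] += 1
        lr = 1.0 / counts[c]  # online learning rate, changes with each point
        centroids[c] = (1.0 - lr) * centroids[c] + lr * x
    return centroids, counts
```

Unlike the full-batch path above, each call moves the centroids immediately, which is why convergence can be faster but the result is no longer identical to Lloyd's algorithm.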

@copy-pr-bot

copy-pr-bot bot commented Jan 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change cpp labels Jan 9, 2026
@tarang-jain tarang-jain self-assigned this Jan 9, 2026
@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Visualization of clusters on wiki_all_1M (1M × 768 fp32), 256 centroids, with CPU UMAP for dimensionality reduction:

Regular KMeans:
[image]

Full batch (Lloyd's algorithm with batched accumulation of centroid updates), batch_size = 100k:
[image]

MiniBatchKMeans:
Minibatch size = 100k, with early stopping: stop when 10 consecutive batches produce no improvement in inertia.
[image]
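The early-stopping rule above (stop after 10 consecutive non-improving batches) can be tracked with a small counter. A hypothetical sketch, not the PR's actual implementation:

```python
def make_early_stopper(patience=10):
    """Return a callable fed per-batch inertia; it returns True when we should stop."""
    state = {"best": float("inf"), "bad": 0}

    def update(inertia):
        if inertia < state["best"]:
            state["best"] = inertia  # improvement: remember it, reset counter
            state["bad"] = 0
        else:
            state["bad"] += 1        # no improvement on this batch
        return state["bad"] >= patience

    return update
```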

@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Benchmarks on wiki_all_1M (1M × 768 fp32 dataset):

| Method | n_clusters | batch_size | max_iter | Time (s) | Centroid updates | Inertia |
| --- | --- | --- | --- | --- | --- | --- |
| Regular fit | 256 | — | 100 | 1.401 | 101 | 2.748459e+06 |
| Batched FULL-BATCH | 256 | 100,000 | 100 | 32.224 | 101 | 2.751807e+06 |
| Batched MINI-BATCH | 256 | 100,000 | 100 | 7.671 | 147 | 2.844728e+06 |

@tarang-jain
Contributor Author

tarang-jain commented Feb 24, 2026

Fit time with different batch sizes (wiki_all_1M, 1M × 768):
[image]
