[FEA] Out of Core KMeans Clustering #1691
Open
tarang-jain wants to merge 67 commits into rapidsai:main
Benchmarks on wiki_all_1M (1M * 768 fp32 dataset):

This PR adds support for out-of-core (dataset resident on host) k-means clustering. The idea is simple:
MiniBatchKMeans: batches are sampled randomly on the host, and centroids are updated with an online learning rate that changes with each batch. [01/09/2026]
While the batched approach has its own header, I've put the batched fit functions into the same TU as kmeans_fit. Common functions such as `minClusterAndDistance`, when instantiated with the same template params, SHOULD not recompile. [02/17/2026]
This PR also adds a GPU version of sklearn's MiniBatchKMeans, in which centroids are updated after every batch (potentially faster convergence, at the cost of preparing randomly sampled batches on the CPU).
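
For readers unfamiliar with the mini-batch scheme, here is a minimal CPU-only sketch of the idea in plain Python. It is not the code in this PR: the function names (`minibatch_kmeans`, `nearest`, `sq_dist`) are hypothetical, and the per-centroid learning rate `1 / count` is an assumption taken from the standard mini-batch k-means formulation; the schedule used here may differ. `data` stands in for the host-resident dataset, and each sampled `batch` is the part that would be copied to the GPU.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def minibatch_kmeans(data, k, batch_size, n_batches, seed=0):
    """Illustrative mini-batch k-means (hypothetical helper, not the PR's API).

    Assumption: each centroid uses the learning rate 1 / (points seen so far),
    which makes it a running mean of the points assigned to it.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct random rows of the host dataset.
    centroids = [list(p) for p in rng.sample(data, k)]
    counts = [0] * k  # number of points each centroid has absorbed so far

    for _ in range(n_batches):
        # Randomly sample one batch on the host.
        batch = rng.sample(data, batch_size)
        # Assign every batch point against the centroids as they stand
        # at the start of the batch.
        labels = [nearest(p, centroids) for p in batch]
        # Online update: the learning rate shrinks as a centroid sees more data.
        for p, j in zip(batch, labels):
            counts[j] += 1
            lr = 1.0 / counts[j]
            c = centroids[j]
            for d in range(len(c)):
                c[d] += lr * (p[d] - c[d])
    return centroids, counts


if __name__ == "__main__":
    gen = random.Random(42)
    # Two synthetic 2-D blobs standing in for a host-resident dataset.
    points = [[gen.gauss(0, 1), gen.gauss(0, 1)] for _ in range(500)]
    points += [[gen.gauss(10, 1), gen.gauss(10, 1)] for _ in range(500)]
    centers, seen = minibatch_kmeans(points, k=2, batch_size=64, n_batches=50)
    print(centers, seen)
```

Because the learning rate stays in (0, 1], every update is a convex combination of the old centroid and a data point, so centroids never leave the bounding box of the data. The trade-off noted above still applies: each iteration pays for drawing a random batch on the host before any device work can start.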