perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline#36127
Closed
airbag_deer (airbagdeer) wants to merge 1 commit into langchain-ai:master
Fixes langchain-ai#36126. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This PR has been automatically closed because you are not assigned to the linked issue. External contributors must be assigned to an issue before opening a PR for it.
Motivation
`HuggingFaceEmbeddings` can be much slower than calling `sentence_transformers` or `transformers` directly. Profiling revealed two compounding root causes, neither of which was obvious from the code alone.
Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)
`sentence_transformers.encode()` defaults to `convert_to_numpy=True`, which calls `.cpu().float().numpy()` inside every micro-batch iteration. On MPS (Apple Silicon) and CUDA, each `.cpu()` call flushes the hardware command buffer, a full synchronisation point. For 1,000 texts at `batch_size=32` that means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.
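The sync arithmetic above can be sketched without torch. The following stand-in (`FakeTensor`, `encode_numpy_path`, and `encode_tensor_path` are illustrative names, not code from this PR) counts simulated device→CPU synchronisations for the per-batch numpy path versus the concatenate-then-transfer path:

```python
# Minimal sketch (no torch required): count simulated device->CPU syncs.
# FakeTensor stands in for a torch.Tensor living on an MPS/CUDA device.

class FakeTensor:
    sync_count = 0  # class-wide counter of device->CPU synchronisations

    def __init__(self, rows):
        self.rows = rows

    def cpu(self):
        FakeTensor.sync_count += 1  # each .cpu() is a full sync point
        return self

def encode_numpy_path(n_texts, batch_size):
    """Old behaviour: convert_to_numpy=True syncs once per micro-batch."""
    out = []
    for start in range(0, n_texts, batch_size):
        batch = FakeTensor(min(batch_size, n_texts - start))
        out.append(batch.cpu())  # per-batch device->CPU transfer
    return out

def encode_tensor_path(n_texts, batch_size):
    """New behaviour: convert_to_tensor=True keeps batches on device,
    concatenates there, and transfers exactly once at the end."""
    batches = [FakeTensor(min(batch_size, n_texts - s))
               for s in range(0, n_texts, batch_size)]
    merged = FakeTensor(sum(b.rows for b in batches))  # torch.cat on device
    return merged.cpu()  # single transfer

FakeTensor.sync_count = 0
encode_numpy_path(1000, 32)
numpy_syncs = FakeTensor.sync_count  # 32 syncs for 1,000 texts

FakeTensor.sync_count = 0
encode_tensor_path(1000, 32)
tensor_syncs = FakeTensor.sync_count  # 1 sync

print(numpy_syncs, tensor_syncs)
```

Running it prints `32 1`, matching the 32-syncs-instead-of-1 figure above.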
Root cause 2 — re-embedding per storage batch (Chroma)
`Chroma.from_texts` calls `create_batches(documents=texts)` without pre-computed embeddings and then calls `add_texts()` (which calls `embed_documents()`) for each ChromaDB storage batch. For a corpus that spans multiple ChromaDB batches (default `max_batch_size` ≈ 5,461), embeddings are recomputed from scratch for each slice. `update_documents` already does the right thing (embed once, batch only the storage), but `from_texts` did not follow the same pattern.

Fixes #36126
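The difference between the two flows can be sketched with stand-ins (`CountingEmbedder`, `create_batches`, and both `from_texts_*` functions below are hypothetical simplifications, not the real Chroma API): the old flow runs the embedding model once per storage batch, the fixed flow runs it once total.

```python
# Hedged sketch: count embedding-model runs in the old from_texts flow
# versus the fixed flow that pre-computes embeddings once.

MAX_BATCH_SIZE = 5461  # approximate ChromaDB default max_batch_size

class CountingEmbedder:
    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += 1
        return [[0.0] for _ in texts]  # dummy vectors

def create_batches(items, batch_size=MAX_BATCH_SIZE):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def from_texts_old(embedder, texts):
    # Old flow: each storage batch goes through add_texts, which
    # calls embed_documents again for that slice.
    for batch in create_batches(texts):
        embedder.embed_documents(batch)  # one model run per storage batch

def from_texts_new(embedder, texts):
    # Fixed flow: embed once up front, then only batch the storage writes.
    embeddings = embedder.embed_documents(texts)
    for batch in create_batches(list(zip(texts, embeddings))):
        pass  # storage write only, e.g. _collection.add(...)

texts = ["doc"] * 12_000  # spans 3 storage batches at ~5,461 per batch

old = CountingEmbedder()
from_texts_old(old, texts)
new = CountingEmbedder()
from_texts_new(new, texts)
print(old.calls, new.calls)
```

This prints `3 1`: three model runs in the old flow versus one in the fixed flow, which is why the improvement grows with the number of storage batches.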
Changes
langchain_huggingface/embeddings/huggingface.py

- Use `convert_to_tensor=True` in `_embed` so all micro-batch outputs stay on the model's device and are `torch.cat`'d there. This reduces device→CPU transfers from N_batches to 1 (at the very end).
- Convert the final tensor with `.cpu().numpy().tolist()`: one sync, then numpy's C-implemented `tolist()`. (PyTorch's `Tensor.tolist()` is a Python-level loop and is significantly slower for 2-D float arrays.)
- Expose `batch_size: int = 32` as a first-class field (sentence-transformers already uses 32 as its default; this just surfaces it for easy tuning without knowing the internal `encode_kwargs` API). Users can still override both fields via `encode_kwargs`.

langchain_chroma/vectorstores.py

- `from_texts`: pre-compute all embeddings once with `embedding.embed_documents(texts)`, then pass `embeddings=all_embeddings` to `create_batches` and write directly via `_collection.add`, matching the pattern already used in `update_documents`.

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)
Configurations compared:

- `convert_to_numpy=True`, `batch_size=32`
- `convert_to_tensor=True`, `batch_size=32`
- `convert_to_tensor=True`, `batch_size=128`
- `sentence_transformers` (reference)

With the Chroma fix applied on top (for datasets > 5,461 docs), the combined improvement is proportional to the number of storage batches.
Backward compatibility
- `embed_documents` / `embed_query` return types are unchanged.
- Setting `encode_kwargs={"convert_to_tensor": False}` or `encode_kwargs={"convert_to_numpy": True}` restores the original numpy path (the `hasattr(embeddings, "cpu")` branch handles this correctly).
- `Chroma.from_texts` / `from_documents` signatures are unchanged; the Chroma change is functionally identical to the previous behaviour.
Areas requiring careful review
- `hasattr(embeddings, "cpu")` duck-typing: used to distinguish a `torch.Tensor` from a `numpy.ndarray` without importing torch at the module level. This should be safe, but reviewers should verify edge cases (e.g. IPEX models, custom pooling layers that return unusual objects).
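A minimal sketch of the duck-typing idea described above (the helper name `to_list` and the `FakeDeviceTensor` stand-in are assumptions for illustration; the real logic lives inside `_embed`):

```python
# Sketch: branch on .cpu() to handle both torch.Tensor-like objects and
# numpy arrays without importing torch at module level.
import numpy as np

class FakeDeviceTensor:
    """Stand-in for a torch.Tensor on MPS/CUDA (has .cpu(), like torch)."""
    def __init__(self, arr):
        self.arr = arr

    def cpu(self):
        return self  # single simulated device->CPU sync

    def numpy(self):
        return self.arr

def to_list(embeddings):
    # torch.Tensor exposes .cpu(); numpy.ndarray does not
    if hasattr(embeddings, "cpu"):
        embeddings = embeddings.cpu().numpy()  # one sync, then numpy land
    return embeddings.tolist()  # numpy's C-implemented tolist()

print(to_list(np.zeros((2, 3))))               # numpy path
print(to_list(FakeDeviceTensor(np.ones((1, 2)))))  # tensor path
```

Both branches return plain `list[list[float]]`, which is why the `embed_documents` return type is unchanged.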
- `_collection.add` bypass: `from_texts` now writes directly to `_collection.add` rather than going through `add_texts`. The metadata and document handling is delegated to `create_batches` (same as `update_documents`). Please verify this covers all the metadata edge cases that `add_texts` handled.

Test plan
- Unit tests (`tests/unit_tests/test_embeddings.py`): all passing, no network required
- Benchmark script (`libs/partners/huggingface/scripts/benchmark_embeddings.py`) confirms a 1.84× wall-clock improvement on M4