
perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline#36127

Closed
airbag_deer (airbagdeer) wants to merge 1 commit into langchain-ai:master from
airbagdeer:perf/reduce-embedding-transfer-overhead

Conversation

@airbagdeer

Motivation

HuggingFaceEmbeddings can be much slower than calling
sentence_transformers or transformers directly. Profiling revealed two
compounding root causes — neither of which was obvious from the code alone.

Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)

sentence_transformers.encode() defaults to convert_to_numpy=True, which
calls .cpu().float().numpy() inside every micro-batch iteration. On
MPS (Apple Silicon) and CUDA, each .cpu() flushes the hardware command
buffer — a full synchronisation point. For 1,000 texts at batch_size=32
that means 32 synchronisations instead of 1, adding ~640 ms of pure
transfer overhead on an M4 MacBook Air.
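The batch arithmetic above can be sketched directly (a toy model, using the 1,000-text / batch_size=32 figures quoted in the profile):

```python
import math

def device_syncs(n_texts: int, batch_size: int, per_batch_transfer: bool) -> int:
    """Count device->CPU synchronisation points for one encode() call.

    convert_to_numpy=True calls .cpu() inside every micro-batch, so each
    batch is a sync; keeping the results as device tensors defers everything
    to a single .cpu() at the very end.
    """
    n_batches = math.ceil(n_texts / batch_size)
    return n_batches if per_batch_transfer else 1

baseline_syncs = device_syncs(1000, 32, per_batch_transfer=True)   # 32 syncs
fixed_syncs = device_syncs(1000, 32, per_batch_transfer=False)     # 1 sync
```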

Root cause 2 — re-embedding per storage batch (Chroma)

Chroma.from_texts calls create_batches(documents=texts) without
pre-computed embeddings and then calls add_texts() (which calls
embed_documents()) for each ChromaDB storage batch. For a corpus
that spans multiple ChromaDB batches (default max_batch_size ≈ 5,461),
embeddings are recomputed from scratch for each slice. update_documents
already does the right thing — embed once, batch only the storage — but
from_texts did not follow the same pattern.
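A stub illustration of the difference in embedding-call counts (the class and function names here are hypothetical stand-ins, not the actual langchain-chroma code):

```python
MAX_BATCH_SIZE = 5461  # ChromaDB's default max_batch_size, per the description

class CountingEmbedder:
    """Stub embedder that counts embed_documents() invocations."""
    def __init__(self) -> None:
        self.calls = 0
    def embed_documents(self, texts):
        self.calls += 1
        return [[0.0] for _ in texts]

def old_from_texts(embedder, texts):
    # Old pattern: add_texts() -> embed_documents() once per storage batch.
    for start in range(0, len(texts), MAX_BATCH_SIZE):
        embedder.embed_documents(texts[start:start + MAX_BATCH_SIZE])

def new_from_texts(embedder, texts):
    # Fixed pattern: one embedding pass; only the storage writes are batched.
    embedder.embed_documents(texts)

texts = ["doc"] * 12_000          # spans ceil(12000 / 5461) = 3 storage batches
old, new = CountingEmbedder(), CountingEmbedder()
old_from_texts(old, texts)
new_from_texts(new, texts)
```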

Fixes #36126

Changes

langchain_huggingface/embeddings/huggingface.py

  • Default to convert_to_tensor=True in _embed so all micro-batch
    outputs stay on the model's device and are torch.cat'd there. This
    reduces device→CPU transfers from N_batches to 1 (at the very end).
  • Final conversion uses .cpu().numpy().tolist() — one sync, then
    numpy's C-implemented tolist(). (PyTorch's Tensor.tolist() is a
    Python-level loop and is significantly slower for 2-D float arrays.)
  • Add batch_size: int = 32 as a first-class field (sentence-transformers
    already uses 32 as its default; this just surfaces it for easy tuning
    without knowing the internal encode_kwargs API). Users can still
    override both fields via encode_kwargs.
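The end-of-pipeline conversion can be sketched with numpy standing in for the concatenated tensor after its single `.cpu()` transfer (a sketch of the pattern, not the actual `_embed` implementation):

```python
import numpy as np

def finalize_embeddings(batch_outputs):
    """Concatenate per-batch outputs once, then convert via numpy's
    C-implemented tolist() (stands in for torch.cat(...).cpu().numpy())."""
    stacked = np.concatenate(batch_outputs, axis=0)
    return stacked.tolist()  # list[list[float]] — the embed_documents return type

vectors = finalize_embeddings([
    np.zeros((32, 4), dtype=np.float32),   # micro-batch 1, still "on device"
    np.ones((8, 4), dtype=np.float32),     # micro-batch 2
])
```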

langchain_chroma/vectorstores.py

  • In from_texts: pre-compute all embeddings once with
    embedding.embed_documents(texts), then pass
    embeddings=all_embeddings to create_batches and write directly via
    _collection.add — matching the pattern already used in
    update_documents.
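Sketch of the new write path with stub embedding/collection objects (names hypothetical; the real code routes the slicing through create_batches):

```python
class StubEmbedding:
    def embed_documents(self, texts):
        return [[float(len(t))] for t in texts]

class StubCollection:
    """Records the size of each _collection.add-style write."""
    def __init__(self) -> None:
        self.writes = []
    def add(self, ids, documents, embeddings):
        self.writes.append(len(ids))

def from_texts_sketch(texts, embedding, collection, max_batch_size=5461):
    all_embeddings = embedding.embed_documents(texts)  # single embedding pass
    ids = [str(i) for i in range(len(texts))]
    # Only the storage writes are batched; embeddings are sliced, not recomputed.
    for start in range(0, len(texts), max_batch_size):
        end = start + max_batch_size
        collection.add(ids=ids[start:end],
                       documents=texts[start:end],
                       embeddings=all_embeddings[start:end])

col = StubCollection()
from_texts_sketch(["a"] * 12, StubEmbedding(), col, max_batch_size=5)
```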

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)

| Config | Time | vs baseline |
| --- | --- | --- |
| baseline — convert_to_numpy=True, batch_size=32 | 1.157 s | — |
| fixed — convert_to_tensor=True, batch_size=32 | 0.631 s | 1.84× faster |
| fixed — convert_to_tensor=True, batch_size=128 | 0.642 s | 1.80× faster |
| direct sentence_transformers (reference) | 0.594 s | 1.95× faster |

With the Chroma fix applied on top (for datasets > 5,461 docs) the combined
improvement is proportional to the number of storage batches.

Backward compatibility

  • embed_documents / embed_query return types are unchanged.
  • Users who explicitly pass encode_kwargs={"convert_to_tensor": False} or
    encode_kwargs={"convert_to_numpy": True} get the original numpy path
    (the hasattr(embeddings, "cpu") branch handles this correctly).
  • Chroma.from_texts / from_documents signatures are unchanged.
  • For datasets with a single ChromaDB batch (< 5,461 docs, the common case)
    the Chroma change is functionally identical to the previous behaviour.

Areas requiring careful review

  1. hasattr(embeddings, "cpu") duck-typing — used to distinguish a
    torch.Tensor from a numpy.ndarray without importing torch at the
    module level. This should be safe but reviewers should verify edge cases
    (e.g. IPEX models, custom pooling layers that return unusual objects).
  2. Chroma _collection.add bypass — from_texts now writes directly
    to _collection.add rather than going through add_texts. The metadata
    and document handling is delegated to create_batches (same as
    update_documents). Please verify this covers all the metadata edge
    cases that add_texts handled.
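The duck-typing branch in item 1 can be exercised with a test double (FakeTensor below is a stand-in, not torch):

```python
import numpy as np

class FakeTensor:
    """Test double mimicking torch.Tensor's .cpu().numpy() chain."""
    def __init__(self, array: np.ndarray) -> None:
        self._array = array
    def cpu(self) -> "FakeTensor":
        return self  # a real tensor would synchronise the device here
    def numpy(self) -> np.ndarray:
        return self._array

def to_lists(embeddings):
    if hasattr(embeddings, "cpu"):          # tensor path: one sync at the end
        embeddings = embeddings.cpu().numpy()
    return embeddings.tolist()              # numpy path falls straight through

tensor_out = to_lists(FakeTensor(np.zeros((2, 3))))
numpy_out = to_lists(np.ones((2, 3)))      # ndarrays have no .cpu()
```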

Test plan

  • 13 new unit tests (tests/unit_tests/test_embeddings.py) — all
    passing, no network required
  • All existing unit tests pass
  • Benchmark script
    (libs/partners/huggingface/scripts/benchmark_embeddings.py)
    confirms 1.84× wall-clock improvement on M4

🤖 This contribution was developed with the assistance of Claude Code
(Anthropic). The root-cause analysis, implementation, and tests were
designed collaboratively.

…mbedding pipeline

Fixes langchain-ai#36126

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions bot added labels on Mar 20, 2026: chroma (`langchain-chroma` package issues & PRs), dependencies (pull requests that update a dependency file, e.g. `pyproject.toml` or `uv.lock`), huggingface (`langchain-huggingface` package issues & PRs), integration (PR related to a provider partner package integration), performance, size: M (200-499 LOC)
@github-actions

This PR has been automatically closed because you are not assigned to the linked issue.

External contributors must be assigned to an issue before opening a PR for it. Please:

  1. Comment on the linked issue to request assignment from a maintainer
  2. Once assigned, edit your PR description and the PR will be reopened automatically


Successfully merging this pull request may close these issues.

perf: HuggingFaceEmbeddings causes excessive device-to-cpu transfers per batch
