perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline#36127
Closed
airbag_deer (airbagdeer) wants to merge 1 commit into langchain-ai:master
Fixes langchain-ai#36126. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This PR has been automatically closed because you are not assigned to the linked issue. External contributors must be assigned to an issue before opening a PR for it.
Motivation
`HuggingFaceEmbeddings` can be much slower than calling `sentence_transformers` or `transformers` directly. Profiling revealed two compounding root causes, neither of which was obvious from the code alone.
Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)
`sentence_transformers.encode()` defaults to `convert_to_numpy=True`, which calls `.cpu().float().numpy()` inside every micro-batch iteration. On MPS (Apple Silicon) and CUDA, each `.cpu()` call flushes the hardware command buffer, a full synchronisation point. For 1,000 texts at `batch_size=32` that means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.
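The sync arithmetic above can be sketched without torch. The following stand-in (`FakeTensor`, `encode_numpy_path`, and `encode_tensor_path` are illustrative names, not code from this PR) counts simulated device→CPU synchronisations for the per-batch numpy path versus the concatenate-then-transfer path:

```python
# Minimal sketch (no torch required): count simulated device->CPU syncs.
# FakeTensor stands in for a torch.Tensor living on an MPS/CUDA device.

class FakeTensor:
    sync_count = 0  # class-wide counter of device->CPU synchronisations

    def __init__(self, rows):
        self.rows = rows

    def cpu(self):
        FakeTensor.sync_count += 1  # each .cpu() is a full sync point
        return self

def encode_numpy_path(n_texts, batch_size):
    """Old behaviour: convert_to_numpy=True syncs once per micro-batch."""
    out = []
    for start in range(0, n_texts, batch_size):
        batch = FakeTensor(min(batch_size, n_texts - start))
        out.append(batch.cpu())  # per-batch device->CPU transfer
    return out

def encode_tensor_path(n_texts, batch_size):
    """New behaviour: convert_to_tensor=True keeps batches on device,
    concatenates there, and transfers exactly once at the end."""
    batches = [FakeTensor(min(batch_size, n_texts - s))
               for s in range(0, n_texts, batch_size)]
    merged = FakeTensor(sum(b.rows for b in batches))  # torch.cat on device
    return merged.cpu()  # single transfer

FakeTensor.sync_count = 0
encode_numpy_path(1000, 32)
numpy_syncs = FakeTensor.sync_count  # 32 syncs for 1,000 texts

FakeTensor.sync_count = 0
encode_tensor_path(1000, 32)
tensor_syncs = FakeTensor.sync_count  # 1 sync

print(numpy_syncs, tensor_syncs)
```

Running it prints `32 1`, matching the 32-syncs-instead-of-1 figure above.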
Root cause 2 — re-embedding per storage batch (Chroma)
`Chroma.from_texts` calls `create_batches(documents=texts)` without pre-computed embeddings and then calls `add_texts()` (which calls `embed_documents()`) for each ChromaDB storage batch. For a corpus that spans multiple ChromaDB batches (default `max_batch_size` ≈ 5,461), embeddings are recomputed from scratch for each slice. `update_documents` already does the right thing (embed once, batch only the storage), but `from_texts` did not follow the same pattern.

Fixes #36126
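The difference between the two flows can be sketched with stand-ins (`CountingEmbedder`, `create_batches`, and both `from_texts_*` functions below are hypothetical simplifications, not the real Chroma API): the old flow runs the embedding model once per storage batch, the fixed flow runs it once total.

```python
# Hedged sketch: count embedding-model runs in the old from_texts flow
# versus the fixed flow that pre-computes embeddings once.

MAX_BATCH_SIZE = 5461  # approximate ChromaDB default max_batch_size

class CountingEmbedder:
    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += 1
        return [[0.0] for _ in texts]  # dummy vectors

def create_batches(items, batch_size=MAX_BATCH_SIZE):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def from_texts_old(embedder, texts):
    # Old flow: each storage batch goes through add_texts, which
    # calls embed_documents again for that slice.
    for batch in create_batches(texts):
        embedder.embed_documents(batch)  # one model run per storage batch

def from_texts_new(embedder, texts):
    # Fixed flow: embed once up front, then only batch the storage writes.
    embeddings = embedder.embed_documents(texts)
    for batch in create_batches(list(zip(texts, embeddings))):
        pass  # storage write only, e.g. _collection.add(...)

texts = ["doc"] * 12_000  # spans 3 storage batches at ~5,461 per batch

old = CountingEmbedder()
from_texts_old(old, texts)
new = CountingEmbedder()
from_texts_new(new, texts)
print(old.calls, new.calls)
```

This prints `3 1`: three model runs in the old flow versus one in the fixed flow, which is why the improvement grows with the number of storage batches.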
Changes
langchain_huggingface/embeddings/huggingface.py

- Use `convert_to_tensor=True` in `_embed` so all micro-batch outputs stay on the model's device and are `torch.cat`'d there. This reduces device→CPU transfers from N_batches to 1 (at the very end).
- Convert the final tensor with `.cpu().numpy().tolist()`: one sync, then numpy's C-implemented `tolist()`. (PyTorch's `Tensor.tolist()` is a Python-level loop and is significantly slower for 2-D float arrays.)
- Expose `batch_size: int = 32` as a first-class field (sentence-transformers already uses 32 as its default; this just surfaces it for easy tuning without knowing the internal `encode_kwargs` API). Users can still override both fields via `encode_kwargs`.

langchain_chroma/vectorstores.py

- `from_texts`: pre-compute all embeddings once with `embedding.embed_documents(texts)`, then pass `embeddings=all_embeddings` to `create_batches` and write directly via `_collection.add`, matching the pattern already used in `update_documents`.

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)
Configurations compared:

- `convert_to_numpy=True`, `batch_size=32`
- `convert_to_tensor=True`, `batch_size=32`
- `convert_to_tensor=True`, `batch_size=128`
- `sentence_transformers` (reference)

With the Chroma fix applied on top (for datasets > 5,461 docs), the combined improvement is proportional to the number of storage batches.
Backward compatibility
- `embed_documents` / `embed_query` return types are unchanged.
- Setting `encode_kwargs={"convert_to_tensor": False}` or `encode_kwargs={"convert_to_numpy": True}` restores the original numpy path (the `hasattr(embeddings, "cpu")` branch handles this correctly).
- `Chroma.from_texts` / `from_documents` signatures are unchanged; the Chroma change is functionally identical to the previous behaviour.
Areas requiring careful review
- `hasattr(embeddings, "cpu")` duck-typing: used to distinguish a `torch.Tensor` from a `numpy.ndarray` without importing torch at the module level. This should be safe, but reviewers should verify edge cases (e.g. IPEX models, custom pooling layers that return unusual objects).
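A minimal sketch of the duck-typing idea described above (the helper name `to_list` and the `FakeDeviceTensor` stand-in are assumptions for illustration; the real logic lives inside `_embed`):

```python
# Sketch: branch on .cpu() to handle both torch.Tensor-like objects and
# numpy arrays without importing torch at module level.
import numpy as np

class FakeDeviceTensor:
    """Stand-in for a torch.Tensor on MPS/CUDA (has .cpu(), like torch)."""
    def __init__(self, arr):
        self.arr = arr

    def cpu(self):
        return self  # single simulated device->CPU sync

    def numpy(self):
        return self.arr

def to_list(embeddings):
    # torch.Tensor exposes .cpu(); numpy.ndarray does not
    if hasattr(embeddings, "cpu"):
        embeddings = embeddings.cpu().numpy()  # one sync, then numpy land
    return embeddings.tolist()  # numpy's C-implemented tolist()

print(to_list(np.zeros((2, 3))))               # numpy path
print(to_list(FakeDeviceTensor(np.ones((1, 2)))))  # tensor path
```

Both branches return plain `list[list[float]]`, which is why the `embed_documents` return type is unchanged.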
- `_collection.add` bypass: `from_texts` now writes directly to `_collection.add` rather than going through `add_texts`. The metadata and document handling is delegated to `create_batches` (same as `update_documents`). Please verify this covers all the metadata edge cases that `add_texts` handled.

Test plan
- Unit tests (`tests/unit_tests/test_embeddings.py`): all passing, no network required
- Benchmark script (`libs/partners/huggingface/scripts/benchmark_embeddings.py`) confirms a 1.84× wall-clock improvement on M4