feat: Created DocEmbedder class #5973
@ntkathole @jyejare could you please review this PR and let me know if any changes are needed?
jyejare left a comment:
Great addition, @patelchaitany. This is a milestone for Feast in RAG. Glad to see that the Embedder supports multiple types of data.
A few comments and we should be good to go.
chunker = TextChunker()
text = " ".join([f"word{i}" for i in range(200)])

chunks = chunker.load_parse_and_chunk(source=text, source_id="doc1")
I think the chunker should determine which text in the source to chunk; we should not need to feed it in manually.
Yes, the chunker can determine which field in the DataFrame holds the text. We just need to pass the column name to the chunk_dataframe function of the Chunker class.
This test only checks that load_parse_and_chunk returns the required return type.
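To make the column-name point concrete, here is a minimal sketch of a chunker that is told which field holds the text. Everything in it is hypothetical: `chunk_words`, `chunk_rows`, and the `chunk_size`/`overlap` parameters are illustrative stand-ins, not the `TextChunker` or `chunk_dataframe` API from this PR, and plain dicts stand in for DataFrame rows.

```python
from typing import Dict, Iterable, List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of chunk_size words that overlap by overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunk_rows(
    rows: Iterable[Dict[str, str]],
    text_column: str,
    id_column: str,
    **chunk_kwargs: int,
) -> List[Dict[str, object]]:
    """Chunk the text_column of every row; the caller only names the column."""
    out: List[Dict[str, object]] = []
    for row in rows:
        for i, chunk in enumerate(chunk_words(row[text_column], **chunk_kwargs)):
            out.append({"source_id": row[id_column], "chunk_index": i, "text": chunk})
    return out
```

With this shape, the caller never extracts the text itself; it passes the rows and the name of the text column, which is the behavior described in the reply above.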
def test_supported_modalities(self):
    """After init, supported_modalities returns text and image."""
    embedder = MultiModalEmbedder()
    modalities = embedder.supported_modalities()
Supported modalities can be set as a property
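As a rough illustration of this suggestion, the sketch below exposes the modalities through a read-only `@property` instead of a method call. `SketchEmbedder` and its `_modalities` attribute are hypothetical names used only for this example; they are not the `MultiModalEmbedder` implementation in this PR.

```python
from typing import List, Tuple


class SketchEmbedder:
    """Hypothetical embedder that reports its modalities via a property."""

    def __init__(self, modalities: Tuple[str, ...] = ("text", "image")) -> None:
        self._modalities = tuple(modalities)

    @property
    def supported_modalities(self) -> List[str]:
        # Return a copy so callers cannot mutate the internal state.
        return list(self._modalities)
```

Callers would then read `embedder.supported_modalities` without parentheses, and assigning to it raises `AttributeError` because no setter is defined.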
    assert embedder._image_model is None
    assert embedder._image_processor is None

def test_custom_modality_registration(self):
This test checks that when we register a new modality, calls are correctly routed to it.
But I agree we can remove this test.
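For readers unfamiliar with the behavior that test covers, the following is a small sketch of modality registration and routing. The `RoutingEmbedder` class, its `register_modality` and `embed` methods, and the `"audio"` handler are all assumptions made for illustration; the PR's actual registration API may look different.

```python
from typing import Callable, Dict, List, Sequence

# A handler takes a batch of inputs and returns one vector per input.
EmbedFn = Callable[[Sequence[object]], List[List[float]]]


class RoutingEmbedder:
    """Hypothetical embedder that dispatches on a modality name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, EmbedFn] = {}

    def register_modality(self, name: str, fn: EmbedFn) -> None:
        self._handlers[name] = fn

    def embed(self, inputs: Sequence[object], modality: str) -> List[List[float]]:
        try:
            handler = self._handlers[modality]
        except KeyError:
            raise ValueError(f"unsupported modality: {modality!r}") from None
        return handler(inputs)


# Usage: register a toy "audio" handler and confirm calls are routed to it.
embedder = RoutingEmbedder()
embedder.register_modality("audio", lambda xs: [[float(len(str(x)))] for x in xs])
vectors = embedder.embed(["ab", "cde"], modality="audio")
```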
1ef9d8e to 083eadb
083eadb to bb74079
@patelchaitany filename typo -
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

import numpy as np
Consider lazy-loading numpy too
We cannot lazy-load numpy, as it is required for type checking.
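For context on this exchange, one common Python pattern for combining a deferred import with type annotations is a `typing.TYPE_CHECKING` guard plus string annotations. The sketch below shows that general pattern only; `to_array` is a hypothetical function, and the pattern does not apply if the annotations must be resolved at runtime (for example through `typing.get_type_hints`), in which case the objection above still holds.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Seen only by static type checkers; this branch never runs at runtime.
    import numpy as np


def to_array(values: List[float]) -> "np.ndarray":
    """Convert a list of floats to an array, importing numpy on first use."""
    import numpy as np  # deferred import: only paid when the function is called

    return np.asarray(values, dtype=np.float32)
```

The string annotation keeps the module importable without numpy, at the cost of making the annotation unresolvable at runtime.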
daf292f to fcc85cd
…ng them into the FeatureView schema.
- Added BaseChunker and TextChunker classes for document chunking.
- Updated pyproject.toml to include sentence-transformers dependency.
- Created a new Jupyter notebook example for using the RAG retriever with document embedding.
Signed-off-by: Chaitany patel <patelchaitany93@gmail.com>
fcc85cd to e00ee22
What this PR does / why we need it:
This PR adds a Document Embedder capability to Feast, allowing users to go from raw documents to embeddings stored in the online vector store in a single step. It handles chunking, embedding generation, and writing the results to the online store — providing an end-to-end ingestion pipeline for RAG workflows within Feast.
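The three stages named above (chunk, embed, store) can be pictured with the toy sketch below. It is not the DocEmbedder API: `chunk`, `toy_embed`, and `ingest` are hypothetical helpers, the "embedding" is a placeholder rather than a real model, and a plain dict stands in for the online vector store.

```python
from typing import Dict, List, Tuple

# Stand-in for an online store: (document id, chunk index) -> stored record.
Store = Dict[Tuple[str, int], Dict[str, object]]


def chunk(text: str, size: int) -> List[str]:
    """Split text into consecutive, non-overlapping groups of size words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_embed(chunk_text: str) -> List[float]:
    # Placeholder vector: [word count, character count]. A real model goes here.
    return [float(len(chunk_text.split())), float(len(chunk_text))]


def ingest(doc_id: str, text: str, store: Store, size: int = 3) -> int:
    """Chunk a document, embed each chunk, write it to the store; return the chunk count."""
    pieces = chunk(text, size)
    for i, piece in enumerate(pieces):
        store[(doc_id, i)] = {"text": piece, "vector": toy_embed(piece)}
    return len(pieces)


store: Store = {}
written = ingest("doc1", "a b c d e", store, size=3)
```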
What changed:
sdk/python/feast/chunker.py
Defines the document chunking layer. Provides:
Currently only basic text chunking is implemented. There is room for improvement — future iterations can support more advanced strategies like semantic chunking, sentence-aware splitting, or format-specific chunkers (PDF, HTML, etc.).
sdk/python/feast/embedder.py
Defines the embedding generation layer. Provides:
sdk/python/feast/doc_embedder.py
The high-level orchestrator that coordinates chunking, embedding, and storage. Provides:
sdk/python/feast/__init__.py
Updated to export DocEmbedder, LogicalLayerFn, BaseChunker, TextChunker, ChunkingConfig, BaseEmbedder, MultiModalEmbedder, and EmbeddingConfig as part of Feast's public API.
Which issue(s) this PR fixes:
Create DocEmbedder class along with RAGRetriever #5426
Misc