feat : Add Vespa integration with document store and retrievers#3233
feat : Add Vespa integration with document store and retrievers#3233kudos07 wants to merge 14 commits into
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Thanks for creating the PR @kudos07! |
|
Hello @bogdankostic, I also added tests, examples, and the workflow wiring, and verified it locally with both the normal checks and a real local Vespa smoke test. I kept the scope focused on existing Vespa apps/schemas for now. |
bogdankostic
left a comment
There was a problem hiding this comment.
Thanks a lot for the PR @kudos07! I had a first look and found a few things to adapt before we merge:
1. Default ranking profiles for BM25 and embedding retrieval
Right now both _bm25_retrieval and _embedding_retrieval pass ranking=None by default, which means Vespa falls back to the schema's default rank profile (typically nativeRank). That's misleading:
_bm25_retrievalwon't actually score by BM25 unless the user has set up a rank profile that usesbm25(field)._embedding_retrievalwill return the correct nearest-neighbor candidate set, but the final ordering won't be by vector similarity unless the rank profile referencescloseness(field, embedding).
Maybe we could ship a default schema with the document store that includes index: enable-bm25 on the content field plus bm25 and semantic rank profiles. The retrievers can then default to ranking="bm25" and ranking="semantic". Users should be able to still override the schema and ranking profile names.
2. Add a docker-compose.yml and run Vespa in CI for real integration tests
Right now CI only runs unit tests with mocks. In other document store integrations (see opensearch, elasticsearch, or weaviate), we provide a docker-compose.yml that spins up a local container that can be used for integration tests. We should also update the test workflow to spin up a docker container in the CI and run integration tests there.
Let me know if anything is unclear to you or you need further input.
There was a problem hiding this comment.
Let's remove this file here and instead have integration tests.
| :param ranking: Optional Vespa ranking profile. | ||
| :param query_tensor_name: Query tensor name referenced in Vespa YQL. | ||
| :param target_hits: Optional Vespa nearest-neighbor `targetHits` value. |
There was a problem hiding this comment.
Let's give some example values and link to the Vespa docs for ranking and target_hits.
| self._document_store = document_store | ||
| self._filters = filters or {} | ||
| self._top_k = top_k | ||
| self._ranking = ranking | ||
| self._query_tensor_name = query_tensor_name | ||
| self._target_hits = target_hits |
There was a problem hiding this comment.
We can make use of default serialization if the instance variables have the same names as the init parameters.
| self._document_store = document_store | |
| self._filters = filters or {} | |
| self._top_k = top_k | |
| self._ranking = ranking | |
| self._query_tensor_name = query_tensor_name | |
| self._target_hits = target_hits | |
| self.document_store = document_store | |
| self.filters = filters | |
| self.top_k = top_k | |
| self.ranking = ranking | |
| self.query_tensor_name = query_tensor_name | |
| self.target_hits = target_hits |
| def to_dict(self) -> dict[str, Any]: | ||
| """ | ||
| Serialize the retriever to a dictionary. | ||
|
|
||
| :returns: Serialized retriever data. | ||
| """ | ||
| return default_to_dict( | ||
| self, | ||
| document_store=self._document_store.to_dict(), | ||
| filters=self._filters, | ||
| top_k=self._top_k, | ||
| ranking=self._ranking, | ||
| query_tensor_name=self._query_tensor_name, | ||
| target_hits=self._target_hits, | ||
| ) | ||
|
|
||
| @classmethod | ||
| def from_dict(cls, data: dict[str, Any]) -> "VespaEmbeddingRetriever": | ||
| """ | ||
| Deserialize the retriever from a dictionary. | ||
|
|
||
| :param data: Serialized retriever data. | ||
| :returns: Deserialized retriever. | ||
| """ | ||
| data["init_parameters"]["document_store"] = VespaDocumentStore.from_dict( | ||
| data["init_parameters"]["document_store"] | ||
| ) | ||
| return default_from_dict(cls, data) |
There was a problem hiding this comment.
We can remove these methods and make use of default serialization of components, see our docs.
There was a problem hiding this comment.
Let's make use of our DocumentStoreBaseTests class here that comes standard tests that any Document Store is expected to pass.
There was a problem hiding this comment.
This file can be removed, as it will be automatically generated once we release this integration.
| readme = "README.md" | ||
| requires-python = ">=3.10" | ||
| license = "Apache-2.0" | ||
| keywords = [] |
There was a problem hiding this comment.
Let's add some keywords here.
There was a problem hiding this comment.
Let's keep this readme minimal, see for example the opensearch readme.
| "Programming Language :: Python :: Implementation :: CPython", | ||
| "Programming Language :: Python :: Implementation :: PyPy", | ||
| ] | ||
| dependencies = ["haystack-ai", "pyvespa"] |
There was a problem hiding this comment.
We should probably add minimum versions here.
|
Hello @bogdankostic, I addressed the review feedback and the follow-up CI failures. Main changes:
Verified locally with:
|
|
Hello @bogdankostic, please do check whenever you are free. |
bogdankostic
left a comment
There was a problem hiding this comment.
Thanks for addressing the feedbacl @kudos07! I left a few in-line comments that should be addressed as well. Also, please use single backticks for in-line code blocks in doc strings.
| port: int = 8080, | ||
| cert: str | None = None, | ||
| key: str | None = None, | ||
| vespa_cloud_secret_token: str | None = None, |
| ) | ||
| return self._app | ||
|
|
||
| def to_dict(self) -> dict[str, Any]: |
There was a problem hiding this comment.
Let's be explicit here, so have something that's similar to the other document stores.
def to_dict(self) -> dict[str, Any]:
return default_to_dict(
self,
url=self.url,
port=self.port,
cert=self.cert,
...There was a problem hiding this comment.
This file should be removed.
There was a problem hiding this comment.
This file should be removed.
There was a problem hiding this comment.
This file should be removed.
| @@ -0,0 +1,15 @@ | |||
| services: | |||
| vespa: | |||
| image: vespaengine/vespa:latest | |||
There was a problem hiding this comment.
Let's pin this to the latest version.
|
Thanks! Addressed the remaining comments @bogdankostic :
Also fixed the CI app-copy issue by copying the generated test app contents into |
Related Issues
Proposed Changes:
This PR adds an initial
vespaintegration based onpyvespafor using Vespa as a Haystack document store and retrieval backend.It includes:
VespaDocumentStorefor writing, filtering, counting, fetching, deleting, and updating documents against an existing Vespa application/schemaVespaKeywordRetrieverfor lexical retrievalVespaEmbeddingRetrieverfor dense retrieval using Vespa nearest-neighbor queriesto_dict/from_dictThis first iteration is intentionally scoped to work with an existing Vespa application/schema rather than managing the full Vespa application lifecycle.
How did you test it?
hatch run test:unithatch run test:typeshatch run fmt-checkhatch run python scripts/local_keyword_smoke_test.pyThe local smoke test verifies:
VespaKeywordRetrieverNotes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.