feat : Add Vespa integration with document store and retrievers by kudos07 · Pull Request #3233 · deepset-ai/haystack-core-integrations

kudos07 · 2026-04-27T05:10:49Z

Related Issues

partially addresses Add new Vespa integration with DocumentStore and Retrievers #2281

Proposed Changes:

This PR adds an initial Vespa integration, based on pyvespa, for using Vespa as a Haystack document store and retrieval backend.

It includes:

- VespaDocumentStore for writing, filtering, counting, fetching, deleting, and updating documents against an existing Vespa application/schema
- VespaKeywordRetriever for lexical retrieval
- VespaEmbeddingRetriever for dense retrieval using Vespa nearest-neighbor queries
- serialization support via to_dict / from_dict
- usage examples for keyword and embedding retrieval
- a local smoke-test script for validating the integration against a real Vespa deployment
- repository wiring for the new integration, including workflow, labeler updates, pydoc config, and README inventory updates

This first iteration is intentionally scoped to work with an existing Vespa application/schema rather than managing the full Vespa application lifecycle.

How did you test it?

Ran unit tests with:
- hatch run test:unit
Ran type checks with:
- hatch run test:types
Ran formatting/lint checks with:
- hatch run fmt-check
Performed manual end-to-end verification against a real local Vespa deployment with:
- hatch run python scripts/local_keyword_smoke_test.py

The local smoke test verifies:

- Vespa application deployment
- document writes
- document reset/delete flow
- filtered document queries
- keyword retrieval through VespaKeywordRetriever

Notes for the reviewer

This integration assumes an existing Vespa application/schema for normal usage.
The embedding retriever expects a Vespa tensor field and an appropriate ranking profile to already exist in the target schema.
During local smoke testing, I adjusted a few Vespa-specific behaviors in the implementation:
- safer bulk delete handling
- query limits aligned with Vespa defaults
- string filter translation aligned with Vespa YQL semantics
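
The string filter point can be illustrated with a small sketch. This is not the PR's implementation: it assumes Haystack's filter dictionary shape (`field`, `operator`, `value`, and `conditions` for logical groups), assumes metadata is stored in top-level Vespa fields, and uses a hypothetical helper name `filter_to_yql`. Vespa YQL matches string fields with `contains` rather than `=`, which is presumably the kind of adjustment the note refers to.

```python
from typing import Any

# Haystack comparison operators mapped to their YQL equivalents for numbers/booleans.
_COMPARISON_OPS = {"==": "=", ">": ">", ">=": ">=", "<": "<", "<=": "<="}


def _quote(value: str) -> str:
    # Escape backslashes and double quotes so the value is a valid YQL string literal.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def filter_to_yql(flt: dict[str, Any]) -> str:
    """Translate a Haystack-style filter dict into a YQL where-clause fragment."""
    op = flt["operator"]
    if op in ("AND", "OR"):
        parts = [filter_to_yql(c) for c in flt["conditions"]]
        return "(" + f" {op.lower()} ".join(parts) + ")"
    if op == "NOT":
        parts = [filter_to_yql(c) for c in flt["conditions"]]
        return "!(" + " and ".join(parts) + ")"

    # Assumption: "meta.category" in Haystack maps to a Vespa field named "category".
    field = flt["field"].removeprefix("meta.")
    value = flt["value"]

    if isinstance(value, str):
        # Strings are matched with `contains`, not `=`.
        if op == "==":
            return f"{field} contains {_quote(value)}"
        if op == "!=":
            return f"!({field} contains {_quote(value)})"
        raise ValueError(f"Unsupported string operator: {op}")

    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        raise ValueError(f"Unsupported value type: {type(value).__name__}")

    if op == "!=":
        return f"!({field} = {literal})"
    if op in _COMPARISON_OPS:
        return f"{field} {_COMPARISON_OPS[op]} {literal}"
    raise ValueError(f"Unsupported operator: {op}")


example = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "news"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}
print(filter_to_yql(example))  # (category contains "news" and year >= 2020)
```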

Checklist

- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

socket-security · 2026-04-27T11:51:58Z

Review the following changes in direct dependencies.

Package: pyvespa@1.2.1

bogdankostic · 2026-04-27T12:23:26Z

Thanks for creating the PR @kudos07!
I will review it shortly. In the meantime, could you please make sure that the CI checks pass?

kudos07 · 2026-04-29T09:25:54Z

Hello @bogdankostic,
Implemented the first pass of the Vespa integration with a document store and keyword/embedding retrievers.

I also added tests, examples, and the workflow wiring, and verified it locally with both the normal checks and a real local Vespa smoke test. I kept the scope focused on existing Vespa apps/schemas for now.

bogdankostic

Thanks a lot for the PR @kudos07! I had a first look and found a few things to adapt before we merge:

1. Default ranking profiles for BM25 and embedding retrieval

Right now both _bm25_retrieval and _embedding_retrieval pass ranking=None by default, which means Vespa falls back to the schema's default rank profile (typically nativeRank). That's misleading:

- _bm25_retrieval won't actually score by BM25 unless the user has set up a rank profile that uses bm25(field).
- _embedding_retrieval will return the correct nearest-neighbor candidate set, but the final ordering won't be by vector similarity unless the rank profile references closeness(field, embedding).

Maybe we could ship a default schema with the document store that includes index: enable-bm25 on the content field plus bm25 and semantic rank profiles. The retrievers can then default to ranking="bm25" and ranking="semantic". Users should still be able to override the schema and ranking profile names.
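
To make the difference concrete, the request bodies for the two retrieval modes might look like the sketch below. The helper names, the field name `embedding`, and the query tensor name `query_embedding` are assumptions for illustration, not the PR's code; `yql`, `query`, `hits`, `ranking`, and `input.query(...)` are Vespa query API parameters. The point is that the rank profile is named explicitly, so scoring does not silently fall back to the schema default.

```python
from typing import Any, Optional


def bm25_query_body(query: str, top_k: int = 10, ranking: str = "bm25") -> dict[str, Any]:
    # userQuery() matches the text passed in `query`. This assumes the schema defines
    # a `default` fieldset that contains the BM25-enabled field. Results are scored by
    # BM25 only if the named rank profile actually uses bm25(<field>).
    return {
        "yql": "select * from sources * where userQuery()",
        "query": query,
        "hits": top_k,
        "ranking": ranking,
    }


def embedding_query_body(
    embedding: list[float],
    top_k: int = 10,
    ranking: str = "semantic",
    query_tensor_name: str = "query_embedding",
    target_hits: Optional[int] = None,
) -> dict[str, Any]:
    # targetHits controls how many candidates the nearest-neighbor operator exposes
    # to ranking; here it falls back to top_k when not given.
    hits_target = target_hits if target_hits is not None else top_k
    yql = (
        "select * from sources * where "
        f"{{targetHits:{hits_target}}}nearestNeighbor(embedding, {query_tensor_name})"
    )
    return {
        "yql": yql,
        # The query tensor must be declared as an input of the rank profile.
        f"input.query({query_tensor_name})": embedding,
        "hits": top_k,
        "ranking": ranking,
    }


body = embedding_query_body([0.1, 0.2], top_k=3, target_hits=100)
print(body["yql"])
```

With pyvespa, a dictionary like this would typically be passed as the `body` argument of the application's `query` method.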

2. Add a `docker-compose.yml` and run Vespa in CI for real integration tests

Right now CI only runs unit tests with mocks. In other document store integrations (see opensearch, elasticsearch, or weaviate), we provide a docker-compose.yml that spins up a local container that can be used for integration tests. We should also update the test workflow to spin up a docker container in the CI and run integration tests there.

Let me know if anything is unclear to you or you need further input.

bogdankostic · 2026-04-29T09:49:30Z

Let's remove this file here and instead have integration tests.

bogdankostic · 2026-04-29T09:54:33Z

+        :param ranking: Optional Vespa ranking profile.
+        :param query_tensor_name: Query tensor name referenced in Vespa YQL.
+        :param target_hits: Optional Vespa nearest-neighbor `targetHits` value.


Let's give some example values and link to the Vespa docs for ranking and target_hits.

bogdankostic · 2026-04-29T09:55:48Z

+        self._document_store = document_store
+        self._filters = filters or {}
+        self._top_k = top_k
+        self._ranking = ranking
+        self._query_tensor_name = query_tensor_name
+        self._target_hits = target_hits


We can make use of default serialization if the instance variables have the same names as the init parameters.

Suggested change

-        self._document_store = document_store
-        self._filters = filters or {}
-        self._top_k = top_k
-        self._ranking = ranking
-        self._query_tensor_name = query_tensor_name
-        self._target_hits = target_hits
+        self.document_store = document_store
+        self.filters = filters
+        self.top_k = top_k
+        self.ranking = ranking
+        self.query_tensor_name = query_tensor_name
+        self.target_hits = target_hits

bogdankostic · 2026-04-29T09:56:40Z

+    def to_dict(self) -> dict[str, Any]:
+        """
+        Serialize the retriever to a dictionary.
+
+        :returns: Serialized retriever data.
+        """
+        return default_to_dict(
+            self,
+            document_store=self._document_store.to_dict(),
+            filters=self._filters,
+            top_k=self._top_k,
+            ranking=self._ranking,
+            query_tensor_name=self._query_tensor_name,
+            target_hits=self._target_hits,
+        )
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "VespaEmbeddingRetriever":
+        """
+        Deserialize the retriever from a dictionary.
+
+        :param data: Serialized retriever data.
+        :returns: Deserialized retriever.
+        """
+        data["init_parameters"]["document_store"] = VespaDocumentStore.from_dict(
+            data["init_parameters"]["document_store"]
+        )
+        return default_from_dict(cls, data)


We can remove these methods and make use of default serialization of components, see our docs.
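
Why matching names matter can be shown with a simplified stand-in. The sketch below is not Haystack's implementation; `sketch_to_dict` and `DemoRetriever` are hypothetical names. It only demonstrates that a generic serializer can recover the init parameters from an instance when each parameter is stored under an attribute of the same name.

```python
import inspect
from typing import Any, Optional


def sketch_to_dict(obj: Any) -> dict[str, Any]:
    """Collect init parameters by reading same-named attributes from the instance."""
    params = inspect.signature(type(obj).__init__).parameters
    init_parameters = {}
    for name in params:
        if name == "self":
            continue
        # Raises AttributeError if the value was stored as `_name` instead of `name`.
        init_parameters[name] = getattr(obj, name)
    return {
        "type": f"{type(obj).__module__}.{type(obj).__name__}",
        "init_parameters": init_parameters,
    }


class DemoRetriever:
    # Hypothetical component: attribute names mirror the init parameter names.
    def __init__(self, top_k: int = 10, ranking: Optional[str] = None):
        self.top_k = top_k
        self.ranking = ranking


data = sketch_to_dict(DemoRetriever(top_k=5, ranking="bm25"))
print(data["init_parameters"])  # {'top_k': 5, 'ranking': 'bm25'}
```

A class that stored `self._top_k` instead would fail in this sketch, which is the motivation for the rename suggested above.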

bogdankostic · 2026-04-29T12:17:34Z

Let's make use of our DocumentStoreBaseTests class here, which comes with standard tests that any Document Store is expected to pass.

bogdankostic · 2026-04-29T12:18:40Z

This file can be removed, as it will be automatically generated once we release this integration.

bogdankostic · 2026-04-29T12:20:20Z

+readme = "README.md"
+requires-python = ">=3.10"
+license = "Apache-2.0"
+keywords = []


Let's add some keywords here.

bogdankostic · 2026-04-29T12:21:42Z

Let's keep this readme minimal, see for example the opensearch readme.

bogdankostic · 2026-04-29T12:33:41Z

+  "Programming Language :: Python :: Implementation :: CPython",
+  "Programming Language :: Python :: Implementation :: PyPy",
+]
+dependencies = ["haystack-ai", "pyvespa"]


We should probably add minimum versions here.

kudos07 · 2026-05-03T23:14:26Z

Hello @bogdankostic,

I addressed the review feedback and the follow-up CI failures.

Main changes:

- Added Docker-backed Vespa integration tests and CI coverage.
- Added bundled Vespa app schema with BM25 and semantic rank profiles.
- Defaulted keyword/embedding retrieval to bm25 / semantic.
- Added targetHits for nearest-neighbor retrieval.
- Updated serialization, docstrings, README, pydoc config, dependencies, and metadata.
- Removed the smoke-test script and changelog.
- Added/expanded DocumentStoreBase integration tests.

Verified locally with:

hatch run fmt-check .\src .\tests .\examples .\pyproject.toml
hatch run test:types
hatch run test:unit-cov-retry
VESPA_RUN_INTEGRATION_TESTS=1 hatch run test:integration-cov-append-retry

kudos07 · 2026-05-11T08:09:01Z

Hello @bogdankostic, please do check whenever you are free.

bogdankostic

Thanks for addressing the feedback @kudos07! I left a few inline comments that should be addressed as well. Also, please use single backticks for inline code in docstrings.

bogdankostic · 2026-05-14T12:04:43Z

+        port: int = 8080,
+        cert: str | None = None,
+        key: str | None = None,
+        vespa_cloud_secret_token: str | None = None,


Let's use Secret here instead of str.

bogdankostic · 2026-05-14T12:10:07Z

+            )
+        return self._app
+
+    def to_dict(self) -> dict[str, Any]:


Let's be explicit here and have something similar to the other document stores:

def to_dict(self) -> dict[str, Any]:
    return default_to_dict(
        self,
        url=self.url,
        port=self.port,
        cert=self.cert,
        ...

bogdankostic · 2026-05-14T12:14:23Z

This file should be removed.

bogdankostic · 2026-05-14T12:14:34Z

This file should be removed.

bogdankostic · 2026-05-14T12:14:43Z

This file should be removed.

bogdankostic · 2026-05-14T12:15:35Z

@@ -0,0 +1,15 @@
+services:
+  vespa:
+    image: vespaengine/vespa:latest


Let's pin this to a specific version (the current latest release) instead of the latest tag.

kudos07 · 2026-05-15T09:05:52Z

Thanks! Addressed the remaining comments @bogdankostic :

- Switched Vespa auth params (cert, key, vespa_cloud_secret_token) to Secret.
- Made to_dict() explicit using default_to_dict(...).
- Removed the tracked vespa_app files and now generate the test app in the integration fixture.
- Pinned the Vespa Docker image.
- Updated docstrings to use single backticks for inline code.

Also fixed the CI app-copy issue by copying the generated test app contents into /tmp/vespa_app before deployment.

Add Vespa integration with document store and retrievers

7bb77a8

kudos07 requested a review from a team as a code owner April 27, 2026 05:10

kudos07 requested review from bogdankostic and removed request for a team April 27, 2026 05:10

github-actions Bot added the topic:CI and type:documentation labels Apr 27, 2026

Merge branch 'main' into vespa-work

2b61f47

bogdankostic self-assigned this Apr 27, 2026

kudos07 added 4 commits April 28, 2026 19:00

fix: use stable FilterPolicy import path

0899cf7

fix: support FilterPolicy import across Haystack versions

f1e48a3

refactor: simplify Vespa retriever filter handling

2287506

ci: quote GITHUB_OUTPUT in Vespa workflow

553ac99

bogdankostic added the integration:vespa label Apr 29, 2026

bogdankostic requested changes Apr 29, 2026

kudos07 added 6 commits May 3, 2026 15:00

Address Vespa integration review feedback

9cc4fc5

Fix Vespa base integration test fixture

bb10c26

Fix Vespa test app services node

b726945

Fix Vespa integration readiness check

ccdc3d4

Normalize Vespa boolean metadata defaults in tests

e9215df

Fix Vespa integration test filter edge cases

3fc4494

kudos07 requested a review from bogdankostic May 3, 2026 23:06

bogdankostic requested changes May 14, 2026

kudos07 added 2 commits May 15, 2026 01:49

Address Vespa review follow-ups

4972305

Fix Vespa test app copy in CI

ff28cc0

kudos07 requested a review from bogdankostic May 15, 2026 09:07
