feat : Add Vespa integration with document store and retrievers #3233

Open
kudos07 wants to merge 14 commits into deepset-ai:main from kudos07:vespa-work

Conversation

@kudos07
Contributor

@kudos07 kudos07 commented Apr 27, 2026

Related Issues

Proposed Changes:

This PR adds an initial Vespa integration, based on pyvespa, for using Vespa as a Haystack document store and retrieval backend.

It includes:

  • VespaDocumentStore for writing, filtering, counting, fetching, deleting, and updating documents against an existing Vespa application/schema
  • VespaKeywordRetriever for lexical retrieval
  • VespaEmbeddingRetriever for dense retrieval using Vespa nearest-neighbor queries
  • serialization support via to_dict / from_dict
  • usage examples for keyword and embedding retrieval
  • a local smoke-test script for validating the integration against a real Vespa deployment
  • repository wiring for the new integration, including workflow, labeler updates, pydoc config, and README inventory updates

This first iteration is intentionally scoped to work with an existing Vespa application/schema rather than managing the full Vespa application lifecycle.

How did you test it?

  • Ran unit tests with:
    • hatch run test:unit
  • Ran type checks with:
    • hatch run test:types
  • Ran formatting/lint checks with:
    • hatch run fmt-check
  • Performed manual end-to-end verification against a real local Vespa deployment with:
    • hatch run python scripts/local_keyword_smoke_test.py

The local smoke test verifies:

  • Vespa application deployment
  • document writes
  • document reset/delete flow
  • filtered document queries
  • keyword retrieval through VespaKeywordRetriever

Notes for the reviewer

  • This integration assumes an existing Vespa application/schema for normal usage.
  • The embedding retriever expects a Vespa tensor field and an appropriate ranking profile to already exist in the target schema.
  • During local smoke testing, I adjusted a few Vespa-specific behaviors in the implementation:
    • safer bulk delete handling
    • query limits aligned with Vespa defaults
    • string filter translation aligned with Vespa YQL semantics
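
To illustrate the string filter point above, here is a minimal sketch of translating a Haystack-style filter dictionary into a Vespa YQL condition. It is not the code from this PR: the helper names `to_yql` and `quote`, and the choice to strip a `meta.` prefix from field names, are assumptions made for the example. In YQL, string fields are matched with `contains`, while numeric and boolean fields use comparison operators.

```python
from typing import Any

# Haystack comparison operators mapped to YQL operators for non-string values.
NUMERIC_OPERATORS = {"==": "=", ">": ">", ">=": ">=", "<": "<", "<=": "<="}


def quote(value: str) -> str:
    """Wrap a string in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_yql(filters: dict[str, Any]) -> str:
    """Translate a Haystack-style filter dict into a YQL condition (sketch)."""
    operator = filters["operator"]
    if operator in ("AND", "OR"):
        joiner = " and " if operator == "AND" else " or "
        return "(" + joiner.join(to_yql(c) for c in filters["conditions"]) + ")"

    # Assumption: metadata is stored in top-level schema fields.
    field = filters["field"].removeprefix("meta.")
    value = filters["value"]
    if isinstance(value, str):
        if operator != "==":
            raise ValueError(f"Unsupported operator for strings: {operator}")
        # How `contains` matches depends on the field's match settings.
        return f"{field} contains {quote(value)}"
    if isinstance(value, bool):
        if operator != "==":
            raise ValueError(f"Unsupported operator for booleans: {operator}")
        return f"{field} = {str(value).lower()}"
    if operator not in NUMERIC_OPERATORS:
        raise ValueError(f"Unsupported operator: {operator}")
    return f"{field} {NUMERIC_OPERATORS[operator]} {value}"
```

The resulting condition would be appended to a `select * from sources * where ...` statement. Other Haystack operators such as `in` or `NOT` are left out to keep the sketch short.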

Checklist

@kudos07 kudos07 requested a review from a team as a code owner April 27, 2026 05:10
@kudos07 kudos07 requested review from bogdankostic and removed request for a team April 27, 2026 05:10
@github-actions github-actions Bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 27, 2026
@socket-security

socket-security Bot commented Apr 27, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License
Added pyvespa@1.2.196100100100100

View full report

@bogdankostic
Contributor

Thanks for creating the PR @kudos07!
I will review it shortly. In the meantime, could you please make sure that the CI checks pass?

@bogdankostic bogdankostic self-assigned this Apr 27, 2026
@kudos07
Contributor Author

kudos07 commented Apr 29, 2026

Hello @bogdankostic,
Implemented the first pass of the Vespa integration with a document store and keyword/embedding retrievers.

I also added tests, examples, and the workflow wiring, and verified it locally with both the normal checks and a real local Vespa smoke test. I kept the scope focused on existing Vespa apps/schemas for now.

Contributor

@bogdankostic bogdankostic left a comment

Thanks a lot for the PR @kudos07! I had a first look and found a few things to adapt before we merge:

1. Default ranking profiles for BM25 and embedding retrieval

Right now both _bm25_retrieval and _embedding_retrieval pass ranking=None by default, which means Vespa falls back to the schema's default rank profile (typically nativeRank). That's misleading:

  • _bm25_retrieval won't actually score by BM25 unless the user has set up a rank profile that uses bm25(field).
  • _embedding_retrieval will return the correct nearest-neighbor candidate set, but the final ordering won't be by vector similarity unless the rank profile references closeness(field, embedding).

Maybe we could ship a default schema with the document store that includes index: enable-bm25 on the content field plus bm25 and semantic rank profiles. The retrievers can then default to ranking="bm25" and ranking="semantic". Users should still be able to override the schema and ranking profile names.
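
As a rough illustration of the proposal above, the sketch below builds Vespa query request bodies that name the rank profile explicitly instead of relying on the schema default. The function names, the `embedding` field, and the `q` query tensor name are assumptions for the example; the `bm25` and `semantic` profiles must already exist in the deployed schema. The code only builds dictionaries and does not contact Vespa.

```python
from typing import Any, Optional


def keyword_query_body(query: str, top_k: int = 10, ranking: str = "bm25") -> dict[str, Any]:
    """Build a lexical query body that selects a rank profile explicitly."""
    return {
        "yql": "select * from sources * where userQuery()",
        "query": query,
        "hits": top_k,
        # `ranking` selects the rank profile; without it Vespa uses the default.
        "ranking": ranking,
    }


def embedding_query_body(
    embedding: list[float],
    top_k: int = 10,
    ranking: str = "semantic",
    embedding_field: str = "embedding",  # assumed tensor field name
    query_tensor_name: str = "q",  # assumed query tensor name
    target_hits: Optional[int] = None,
) -> dict[str, Any]:
    """Build a nearest-neighbor query body with an explicit rank profile."""
    hits = target_hits if target_hits is not None else top_k
    yql = (
        "select * from sources * where "
        f"{{targetHits: {hits}}}nearestNeighbor({embedding_field}, {query_tensor_name})"
    )
    return {
        "yql": yql,
        "hits": top_k,
        "ranking": ranking,
        # The query tensor is passed as a rank-feature input.
        f"input.query({query_tensor_name})": embedding,
    }
```

With these defaults, a keyword query is scored by the `bm25` profile and an embedding query by the `semantic` profile, while callers can still pass a different `ranking` value.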

2. Add a docker-compose.yml and run Vespa in CI for real integration tests

Right now CI only runs unit tests with mocks. In other document store integrations (see opensearch, elasticsearch, or weaviate), we provide a docker-compose.yml that spins up a local container that can be used for integration tests. We should also update the test workflow to spin up a docker container in the CI and run integration tests there.

Let me know if anything is unclear to you or you need further input.

Comment thread integrations/vespa/pydoc/config_docusaurus.yml
Contributor

Let's remove this file here and instead have integration tests.

Comment on lines +28 to +30
:param ranking: Optional Vespa ranking profile.
:param query_tensor_name: Query tensor name referenced in Vespa YQL.
:param target_hits: Optional Vespa nearest-neighbor `targetHits` value.
Contributor

Let's give some example values and link to the Vespa docs for ranking and target_hits.

Comment on lines +36 to +41
self._document_store = document_store
self._filters = filters or {}
self._top_k = top_k
self._ranking = ranking
self._query_tensor_name = query_tensor_name
self._target_hits = target_hits
Contributor

We can make use of default serialization if the instance variables have the same names as the init parameters.

Suggested change
- self._document_store = document_store
- self._filters = filters or {}
- self._top_k = top_k
- self._ranking = ranking
- self._query_tensor_name = query_tensor_name
- self._target_hits = target_hits
+ self.document_store = document_store
+ self.filters = filters
+ self.top_k = top_k
+ self.ranking = ranking
+ self.query_tensor_name = query_tensor_name
+ self.target_hits = target_hits

Comment on lines +43 to +70
    def to_dict(self) -> dict[str, Any]:
        """
        Serialize the retriever to a dictionary.

        :returns: Serialized retriever data.
        """
        return default_to_dict(
            self,
            document_store=self._document_store.to_dict(),
            filters=self._filters,
            top_k=self._top_k,
            ranking=self._ranking,
            query_tensor_name=self._query_tensor_name,
            target_hits=self._target_hits,
        )

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "VespaEmbeddingRetriever":
        """
        Deserialize the retriever from a dictionary.

        :param data: Serialized retriever data.
        :returns: Deserialized retriever.
        """
        data["init_parameters"]["document_store"] = VespaDocumentStore.from_dict(
            data["init_parameters"]["document_store"]
        )
        return default_from_dict(cls, data)
Contributor

We can remove these methods and make use of default serialization of components, see our docs.
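
The following sketch shows why introspection-based default serialization depends on attribute names matching the init parameter names. It is a simplified stand-in written for illustration, not Haystack's implementation; `sketch_default_to_dict`, `PublicAttrs`, and `PrivateAttrs` are hypothetical names.

```python
import inspect
from typing import Any


def sketch_default_to_dict(obj: Any) -> dict[str, Any]:
    """Serialize an object by reading one attribute per init parameter."""
    cls = type(obj)
    init_parameters = {}
    for name in inspect.signature(cls.__init__).parameters:
        if name == "self":
            continue
        # Only works when the attribute has the same name as the parameter.
        init_parameters[name] = getattr(obj, name)
    return {
        "type": f"{cls.__module__}.{cls.__name__}",
        "init_parameters": init_parameters,
    }


class PublicAttrs:
    """Attributes mirror the init parameters, so introspection succeeds."""

    def __init__(self, top_k: int = 10, ranking: str = "semantic"):
        self.top_k = top_k
        self.ranking = ranking


class PrivateAttrs:
    """Underscore-prefixed attributes cannot be found by parameter name."""

    def __init__(self, top_k: int = 10):
        self._top_k = top_k
```

Calling `sketch_default_to_dict(PublicAttrs())` succeeds, while `sketch_default_to_dict(PrivateAttrs())` raises `AttributeError` because no `top_k` attribute exists. Underscore-prefixed attributes therefore need hand-written `to_dict` / `from_dict` methods, whereas matching names do not.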

Contributor

Let's make use of our DocumentStoreBaseTests class here, which comes with standard tests that any Document Store is expected to pass.

Comment thread integrations/vespa/CHANGELOG.md Outdated
Contributor

This file can be removed, as it will be automatically generated once we release this integration.

Comment thread integrations/vespa/pyproject.toml Outdated
readme = "README.md"
requires-python = ">=3.10"
license = "Apache-2.0"
keywords = []
Contributor

Let's add some keywords here.

Contributor

Let's keep this readme minimal, see for example the opensearch readme.

Comment thread integrations/vespa/pyproject.toml Outdated
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = ["haystack-ai", "pyvespa"]
Contributor

We should probably add minimum versions here.

@kudos07 kudos07 requested a review from bogdankostic May 3, 2026 23:06
@kudos07
Contributor Author

kudos07 commented May 3, 2026

Hello @bogdankostic,

I addressed the review feedback and the follow-up CI failures.

Main changes:

  • Added Docker-backed Vespa integration tests and CI coverage.
  • Added bundled Vespa app schema with BM25 and semantic rank profiles.
  • Defaulted keyword/embedding retrieval to bm25 / semantic.
  • Added targetHits for nearest-neighbor retrieval.
  • Updated serialization, docstrings, README, pydoc config, dependencies, and metadata.
  • Removed the smoke-test script and changelog.
  • Added/expanded DocumentStoreBase integration tests.

Verified locally with:

  • hatch run fmt-check .\src .\tests .\examples .\pyproject.toml
  • hatch run test:types
  • hatch run test:unit-cov-retry
  • VESPA_RUN_INTEGRATION_TESTS=1 hatch run test:integration-cov-append-retry

@kudos07
Contributor Author

kudos07 commented May 11, 2026

Hello @bogdankostic, please do check whenever you are free.

Contributor

@bogdankostic bogdankostic left a comment

Thanks for addressing the feedback @kudos07! I left a few inline comments that should be addressed as well. Also, please use single backticks for inline code in docstrings.

port: int = 8080,
cert: str | None = None,
key: str | None = None,
vespa_cloud_secret_token: str | None = None,
Contributor

Let's use Secret here instead of str.
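
The motivation for a secret type over a plain `str` can be sketched as follows: the object stores only a reference to the credential, resolves it when needed, and never writes the value into serialized output. `EnvVarSecret` is a hypothetical stand-in written for this example, not Haystack's `Secret` class, and the environment variable name `VESPA_CLOUD_SECRET_TOKEN` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class EnvVarSecret:
    """Hypothetical env-var-backed secret (not Haystack's `Secret`)."""

    env_var: str

    def resolve_value(self) -> Optional[str]:
        """Read the secret from the environment at the time of use."""
        return os.environ.get(self.env_var)

    def to_dict(self) -> dict[str, Any]:
        """Serialize only the variable name, never the secret value."""
        return {"type": "env_var", "env_var": self.env_var}


# Assumed variable name, set here only so the example runs end to end.
os.environ["VESPA_CLOUD_SECRET_TOKEN"] = "example-token"
token = EnvVarSecret("VESPA_CLOUD_SECRET_TOKEN")
serialized = token.to_dict()
```

Here `serialized` contains only the variable name, so a pipeline dumped to YAML or JSON would not leak the token, whereas a plain `str` parameter would be written out verbatim.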

)
return self._app

def to_dict(self) -> dict[str, Any]:
Contributor

Let's be explicit here, so have something that's similar to the other document stores.

  def to_dict(self) -> dict[str, Any]:
      return default_to_dict(
          self,
          url=self.url,
          port=self.port,
          cert=self.cert,
...

Contributor

This file should be removed.

Comment thread integrations/vespa/vespa_app/hosts.xml Outdated
Contributor

This file should be removed.

Contributor

This file should be removed.

Comment thread integrations/vespa/docker-compose.yml Outdated
@@ -0,0 +1,15 @@
services:
  vespa:
    image: vespaengine/vespa:latest
Contributor

Let's pin this to the latest version.

@kudos07
Contributor Author

kudos07 commented May 15, 2026

Thanks! Addressed the remaining comments @bogdankostic :

  • Switched Vespa auth params (cert, key, vespa_cloud_secret_token) to Secret.
  • Made to_dict() explicit using default_to_dict(...).
  • Removed the tracked vespa_app files and now generate the test app in the integration fixture.
  • Pinned the Vespa Docker image.
  • Updated docstrings to use single backticks for inline code.

Also fixed the CI app-copy issue by copying the generated test app contents into /tmp/vespa_app before deployment.

@kudos07 kudos07 requested a review from bogdankostic May 15, 2026 09:07
Labels

integration:vespa topic:CI type:documentation Improvements or additions to documentation

2 participants