feat: Implement AlloyDB integration with comprehensive test coverage #3229

Merged
davidsbatista merged 24 commits into deepset-ai:main from garybadwal:feat/alloydb-clean
May 5, 2026

Conversation

@garybadwal
Contributor

Related Issues

Proposed Changes

Added a new AlloyDBDocumentStore, AlloyDBEmbeddingRetriever, and AlloyDBKeywordRetriever to support Google Cloud AlloyDB as a Haystack document store backend.

  • Uses the AlloyDB Python Connector for secure, IAM-authenticated connections without manual SSL or firewall configuration.
  • Supports vector similarity search via the pgvector extension (cosine similarity, inner product, L2 distance).
  • Supports full-text keyword search using PostgreSQL's tsvector/tsquery.
  • Supports HNSW indexing for approximate nearest-neighbor search.
  • Supports password auth (default) and IAM auth (enable_iam_auth=True), with configurable IP type (PRIVATE, PUBLIC, PSC).
  • Credentials are loaded from ALLOYDB_INSTANCE_URI, ALLOYDB_USER, and ALLOYDB_PASSWORD environment variables via Secret.from_env_var.
  • Structure and SQL logic closely mirror the existing pgvector integration.
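
As a rough illustration of the three similarity modes listed above, the sketch below maps each one to the corresponding pgvector distance operator and builds an `ORDER BY` clause. The dictionary keys, the `build_order_by` helper, and the `embedding` column name are assumptions made for this example; they are not taken from the PR's code.

```python
# pgvector distance operators. The dictionary keys are hypothetical labels
# chosen for this sketch, not names confirmed by the PR.
PGVECTOR_OPERATORS = {
    "cosine_similarity": "<=>",  # cosine distance
    "inner_product": "<#>",      # negative inner product
    "l2_distance": "<->",        # Euclidean (L2) distance
}


def build_order_by(vector_function: str, column: str = "embedding") -> str:
    """Return an ORDER BY clause that ranks the nearest rows first."""
    try:
        operator = PGVECTOR_OPERATORS[vector_function]
    except KeyError:
        raise ValueError(f"Unsupported vector function: {vector_function!r}") from None
    # All three operators yield a value where smaller means closer,
    # so the default ascending order puts the best matches first.
    return f"ORDER BY {column} {operator} %s::vector"
```

The `%s` placeholder is left for the database driver to bind the query embedding, so the vector itself is never interpolated into the SQL string.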

How did you test it?

  • 35 unit tests covering document store operations, filter conversion, document converters, retrievers, and retrieval logic — all passing (hatch run test:unit).
  • Lint and type checks pass (hatch run fmt-check && hatch run test:types).
  • Integration tests included under @pytest.mark.integration (require a live AlloyDB instance with ALLOYDB_INSTANCE_URI, ALLOYDB_USER, ALLOYDB_PASSWORD set).

Notes for the reviewer

  • The src/ and tests/ structure is a direct mirror of integrations/pgvector/ — reviewers familiar with that integration should find it straightforward.
  • Async methods are intentionally not implemented — the AlloyDB Python Connector is sync-only.
  • The Connector object is lazily initialized and reused across calls; close() / __del__ ensure the background refresh thread is stopped cleanly.
  • GitHub Actions workflow uses max-parallel: 1 and no services block — integration tests require a live GCP instance.
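
The lazy-initialization note above can be pictured with a minimal sketch. `FakeConnector` and `LazyStore` are hypothetical stand-ins written for this example; the real connector and document store classes are not reproduced here, and their internals may differ.

```python
import threading


class FakeConnector:
    """Hypothetical stand-in for a connector that owns a background thread."""

    def __init__(self):
        self._stop = threading.Event()
        # The thread simply waits until close() signals it to stop.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def close(self):
        self._stop.set()
        self._thread.join()


class LazyStore:
    """Creates the connector on first use and reuses it afterwards."""

    def __init__(self, connector_factory=FakeConnector):
        self._connector_factory = connector_factory
        self._connector = None

    @property
    def connector(self):
        if self._connector is None:
            self._connector = self._connector_factory()
        return self._connector

    def close(self):
        connector = getattr(self, "_connector", None)
        if connector is not None:
            connector.close()
            self._connector = None

    def __del__(self):
        # Best-effort cleanup; errors during interpreter shutdown are ignored.
        try:
            self.close()
        except Exception:
            pass
```

Calling `close()` explicitly is the more predictable path; `__del__` only acts as a safety net, since Python does not guarantee when (or whether) finalizers run.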

Checklist

- Add `py.typed` file for type hinting support in AlloyDB document store.
- Create initial test suite for AlloyDB integration, including fixtures for document stores.
- Implement tests for document conversion functions between Haystack and PostgreSQL formats.
- Develop extensive unit tests for the AlloyDB document store, covering CRUD operations and metadata handling.
- Add filter tests to validate query capabilities of the AlloyDB document store.
- Implement embedding retrieval tests for both cosine similarity and inner product methods.
- Create keyword retrieval tests to ensure accurate document retrieval based on query strings.
- Ensure all tests handle various edge cases and validate expected outcomes.
@garybadwal garybadwal requested a review from a team as a code owner April 25, 2026 09:06
@garybadwal garybadwal requested review from sjrl and removed request for a team April 25, 2026 09:06
@github-actions github-actions Bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 25, 2026
@socket-security

socket-security Bot commented Apr 25, 2026

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | google-cloud-alloydb-connector@1.12.1 | 100 | 100 | 100 | 100 | 100 |

@davidsbatista davidsbatista changed the title Feat: Implement AlloyDB integration with comprehensive test coverage feat: Implement AlloyDB integration with comprehensive test coverage Apr 27, 2026
Comment thread integrations/alloydb/src/haystack_integrations/document_stores/alloydb/filters.py Outdated
Contributor

@davidsbatista davidsbatista left a comment

@garybadwal thanks for the contribution!

The PR already looks very good. I left a few comments; once they are addressed, I think we can merge.

@davidsbatista
Contributor

@garybadwal did you have the chance to run the integration tests against a real Google Cloud AlloyDB instance?

@garybadwal
Contributor Author

Not yet, @davidsbatista, but I will do it this Saturday, along with addressing the comments you left.

So far I have only written the code and run the unit tests. I am confident it will work, since I have worked with AlloyDB before, but I will still test the complete component against it.

@davidsbatista
Contributor

Let me know once you have tested it against a Google Cloud AlloyDB instance. If it's all good, we can then merge it and make an official release.

@garybadwal
Contributor Author

Sure, @davidsbatista, I will run these tests by tomorrow and update you.

@garybadwal
Contributor Author

garybadwal commented May 1, 2026

Hi @davidsbatista

How I tested

The integration uses the GCP google-cloud-alloydb-connector, which authenticates via Google Cloud. To exercise the same code paths against AlloyDB Omni, I temporarily monkey-patched google.cloud.alloydbconnector.Connector from tests/conftest.py (gated by ALLOYDB_LOCAL_TEST=1) so that its connect() returns a direct psycopg connection to the local instance. That patch has since been reverted; only the bug fix remains.
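
For readers unfamiliar with this kind of gated monkey-patch, here is a minimal, self-contained sketch of the idea. The `Connector` class below is a hypothetical stand-in defined inside the example (not the real google.cloud.alloydbconnector.Connector), `local_connector_patch` is a name invented for illustration, and sqlite3 replaces the psycopg connection only so that the sketch runs without a database server.

```python
import contextlib
import os
import sqlite3
from unittest import mock


class Connector:
    """Hypothetical stand-in for the cloud connector class used by the store."""

    def connect(self, instance_uri, **kwargs):
        raise RuntimeError("this stand-in never contacts a real cloud instance")


def _local_connect(self, instance_uri, **kwargs):
    # The PR returned a psycopg connection to a local AlloyDB Omni instance here.
    # sqlite3 is an assumption of this sketch, used to keep it dependency-free.
    return sqlite3.connect(":memory:")


def local_connector_patch():
    """Return a context manager that reroutes Connector.connect when the gate is set."""
    if os.environ.get("ALLOYDB_LOCAL_TEST") == "1":
        return mock.patch.object(Connector, "connect", _local_connect)
    # Without the gate, the connector is left untouched.
    return contextlib.nullcontext()
```

Gating on an environment variable keeps the patch inert in normal runs, so the default test path still goes through the unmodified connector.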

Real bug found while running against an actual AlloyDB

get_metadata_field_min_max was using MIN(meta->>field) / MAX(meta->>field). meta->>field returns text, so MIN/MAX on integer-valued metadata returned lexicographic results like "1" (string), and the haystack standard tests (test_get_metadata_field_min_max_numeric/_float/_single_value/_meta_prefix) failed against any real Postgres — local or cloud. This bug would have shipped.

The fix mirrors the pattern already used in the sister pgvector integration: use the existing get_metadata_fields_info() to infer the JSONB field's Python type, then cast (::integer / ::real) before MIN/MAX, and use COLLATE "C" for text fields.
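
The effect of comparing numbers as text, and the shape of the cast described above, can be sketched as follows. The `typed_expression` helper is hypothetical and written for this example; it is not the function from the PR, and the exact SQL in the integration may differ.

```python
# Values as they would come back from meta->>field, i.e. as text.
values_as_text = ["9", "10", "2"]

# String comparison is character by character, so "10" sorts before "2".
lexicographic_min = min(values_as_text)
lexicographic_max = max(values_as_text)

# Casting before comparing gives the numeric answer.
numeric_min = min(int(v) for v in values_as_text)
numeric_max = max(int(v) for v in values_as_text)


def typed_expression(python_type: type) -> str:
    """Return a SQL expression for a JSONB text field, cast by inferred type.

    Hypothetical helper: %s is a driver placeholder for the field name.
    """
    if python_type is int:
        return "(meta->>%s)::integer"
    if python_type is float:
        return "(meta->>%s)::real"
    # Text fields use the "C" collation so ordering is plain byte order.
    return '(meta->>%s) COLLATE "C"'


def min_max_query(python_type: type, table: str = "documents") -> str:
    """Build a MIN/MAX query; the table name is an assumption of this sketch."""
    expr = typed_expression(python_type)
    return f"SELECT MIN({expr}), MAX({expr}) FROM {table}"
```

Because the expression appears twice in the query, the field name would be bound twice when executing it.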

@garybadwal garybadwal requested a review from davidsbatista May 1, 2026 08:19
@garybadwal
Contributor Author

Hi @davidsbatista, did you get time to review the PR?

@davidsbatista
Contributor

@garybadwal It looks good.

I've added a skip for when the env vars are not set, and removed one test.

@garybadwal
Contributor Author

Thanks, @davidsbatista. Let me know if you require any further changes or if it's good to merge. Excited to share my contribution to Haystack.

Contributor

@davidsbatista davidsbatista left a comment

Looks good! I made a few adjustments.

@davidsbatista davidsbatista merged commit 0dad135 into deepset-ai:main May 5, 2026
18 checks passed
Labels

topic:CI type:documentation Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

Alloydb Integration

2 participants