
vectors - bg delete, reindex #9562

Open
alphaprinz wants to merge 1 commit into noobaa:master from alphaprinz:vectors_bg

Conversation

@alphaprinz
Contributor

@alphaprinz alphaprinz commented Apr 23, 2026

Describe the Problem

Background maintenance for vectors for containerized.

Explain the Changes

  1. DB cleaner should clean deleted vector buckets and indices.
  2. The vector index's underlying Lance index should be re-created after enough changes.

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Configurable automatic reindexing for vector indexes (enable/disable, delay, threshold).
    • Periodic background reindex scheduler to detect and run reindexes when thresholds are exceeded.
    • API and SDK endpoints to report and update "rows since index" for tracking reindex needs.
    • Integrated reindex operation for the vector storage backend.
  • Bug Fixes / Maintenance

    • System-store cleanup now includes vector-related collections.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Adds per-index row-change tracking, SDK and API hooks to update that counter on vector insert/delete, a background reindexer that triggers Lance reindexing when thresholds are exceeded, Lance reindex utilities, config flags, and DB-cleaner updates to include vector collections.

Changes

  • Configuration — config.js: Adds REINDEX_VECTOR_BUCKETS (bool), REINDEX_VECTOR_BUCKETS_DELAY (ms), and VECTOR_INDEX_ROWS_REINDEX (rows threshold).
  • Public API — src/api/bucket_api.js: Adds admin-only POST update_rows_since_index endpoint accepting vector_bucket_name, vector_index_name, op (SET or ADD).
  • Server RPC & Schema — src/server/system_services/bucket_server.js, src/server/system_services/schemas/vector_index_schema.js: Adds rows_since_index integer to the vector index schema; implements and exports the update_rows_since_index RPC to SET or ADD rows_since_index; new default field on index creation.
  • SDK surface — src/sdk/vector_sdk.js, src/sdk/bucketspace_nb.js, src/sdk/bucketspace_fs.js, src/sdk/nb.d.ts: After put/delete vector ops, the SDK calls add_rows_since_reindex (BucketSpaceNB forwards the RPC with op: 'ADD'; BucketSpaceFS contains a placeholder). Adds a type declaration.
  • Background service & wiring — src/server/bg_services/vector_buckets_reindexer.js, src/endpoint/endpoint.js, src/server/bg_workers.js: Adds the VectorBucketsReindexer background worker; conditionally registers it at startup when enabled; adds an ESLint override for bg_workers.
  • Vector utilities — src/util/vectors_util.js: Adds reindex(vector_bucket, vector_index) and LanceConn.reindex() to recreate the Lance index; resolves namespace_resource when provided as a name or an object.
  • DB cleaner — src/server/bg_services/db_cleaner.js: Includes vector_buckets and vector_indices in counts, skip logic, deletion candidate collection, and the purge payload.
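For orientation, the three flags from the Configuration row might look like the sketch below; the values are illustrative only, since the PR summary does not quote the actual defaults from config.js:

```javascript
// Illustrative defaults only -- the actual values in config.js are not
// quoted in this PR summary.
const config = {
    REINDEX_VECTOR_BUCKETS: false,               // feature flag for the bg reindexer
    REINDEX_VECTOR_BUCKETS_DELAY: 5 * 60 * 1000, // scheduler delay in ms
    VECTOR_INDEX_ROWS_REINDEX: 10000,            // rows_since_index threshold
};
```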

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant SDK
    participant BucketSpaceRPC as BucketSpace RPC
    participant Server
    participant Store as SystemStore
    participant BGWorker
    participant VUtils as VectorUtils

    Client->>SDK: put_vectors(params)
    SDK->>VUtils: put_vectors(...)
    VUtils-->>SDK: result
    SDK->>BucketSpaceRPC: add_rows_since_reindex({vector_bucket_name, vector_index_name, delta})
    BucketSpaceRPC->>Server: update_rows_since_index(op: ADD, value: delta)
    Server->>Store: locate vector_index & $inc rows_since_index
    Store-->>Server: persisted
    Server-->>BucketSpaceRPC: ack
    BucketSpaceRPC-->>SDK: ack
    SDK-->>Client: return result

    par Periodic reindex check
        BGWorker->>Store: scan vector_indices
        alt rows_since_index >= VECTOR_INDEX_ROWS_REINDEX
            BGWorker->>VUtils: reindex(vector_bucket, vector_index)
            VUtils->>VUtils: LanceConn.reindex() recreate index
            VUtils-->>BGWorker: done
            BGWorker->>Server: update_rows_since_index(op: SET, value: 0)
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • guymguym
  • jackyalbo
  • romayalon
  • tangledbytes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'vectors - bg delete, reindex' directly matches the PR's primary changes: vector bucket/index cleanup via the DB cleaner and vector index reindexing via a background service.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/api/bucket_api.js`:
- Around line 1215-1232: The handler metadata for update_rows_since_index
currently forces admin-only RPC which causes post-write bookkeeping
(add_rows_since_reindex called after put_vectors/delete_vectors) to fail for
non-admin writers; either remove or relax the auth block on
update_rows_since_index so user-authenticated writers can call it (and enforce
bucket/index authorization inside the update_rows_since_index handler), or keep
the admin restriction but change the callers (put_vectors/delete_vectors) to
call add_rows_since_reindex using an internal/admin token; update the code paths
referencing update_rows_since_index and add_rows_since_reindex and ensure
authorization checks occur in the handler (validate
vector_bucket_name/vector_index_name) if you choose to allow user auth.
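One hedged sketch of the "allow user auth, authorize in the handler" option from the comment above; `find_vector_index` and `req.has_bucket_permission` are invented stand-ins, not the repo's real lookup or permission helpers:

```javascript
// Sketch only: authorize inside the handler instead of via an admin-only
// auth block. `find_vector_index` and `req.has_bucket_permission` are
// hypothetical helpers standing in for the real server-side API.
function update_rows_since_index_authorized(req, find_vector_index) {
    const { vector_bucket_name, vector_index_name } = req.rpc_params;
    const vector_index = find_vector_index(vector_bucket_name, vector_index_name);
    if (!vector_index) throw new Error('NO_SUCH_VECTOR_INDEX');
    // enforce per-bucket authorization for non-admin callers
    if (!req.has_bucket_permission(vector_bucket_name)) {
        throw new Error('UNAUTHORIZED');
    }
    return vector_index; // caller then applies the SET/ADD update
}
```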

In `@src/endpoint/endpoint.js`:
- Around line 262-267: Guard registration of the VectorBucketsReindexer on the
presence of internal_rpc_client as well as the config flag: change the condition
that calls background_scheduler.register_bg_worker(new
VectorBucketsReindexer(...)) to require both internal_rpc_client and
config.REINDEX_VECTOR_BUCKETS (i.e., if (internal_rpc_client &&
config.REINDEX_VECTOR_BUCKETS) { ... }), ensuring the VectorBucketsReindexer is
never instantiated with client: internal_rpc_client when that client is
undefined.

In `@src/sdk/bucketspace_fs.js`:
- Around line 1467-1469: Implement the previously-noop method
add_rows_since_reindex to persistently increment the per-index
"rows_since_index" counter in the BucketSpaceFS backend: load the index metadata
for the given vector_bucket_name and vector_index_name (the same storage object
used by the existing FS config read/write helpers), validate and coerce the
incoming delta, add it to the stored rows_since_index (defaulting to 0 if
missing), then write the updated metadata back to disk; perform the
read-modify-write under the same serialized lock or atomic update pattern used
by config_fs to avoid races, and ensure this method is used by
VectorSDK.put_vectors and delete_vectors to trigger threshold-based reindexing.

In `@src/sdk/vector_sdk.js`:
- Around line 120-127: In put_vectors, move the bookkeeping call to
bs.add_rows_since_reindex so it happens before the actual vector mutation to
avoid reporting a failed write after Lance has accepted it: call
bs.add_rows_since_reindex(...) prior to invoking
vector_utils.put_vectors(this.req.vector_bucket, this.req.vector_index,
params.vectors); similarly apply the same change to the delete path that
currently calls add_rows_since_reindex after vector_utils.delete_vectors (refer
to the delete_vectors call and its add_rows_since_reindex usage) so counters are
incremented before performing the mutation.

In `@src/server/bg_services/db_cleaner.js`:
- Around line 132-151: The early-skip precheck omits vector_indices so deleted
vector indices never trigger cleanup; add a call to
system_store.count_total_docs('vector_indices') (e.g. const
total_vector_indices_count) alongside
total_accounts_count/total_buckets_count/total_vector_buckets_count/total_pools_count,
include total_vector_indices_count in the combined if condition that compares to
config.DB_CLEANER_MAX_TOTAL_DOCS, and update the dbg.log0 message (DB_CLEANER
skip message) to report total_vector_indices_count so the log reflects the full
set of stores being checked; note that system_store.find_deleted_docs already
requests 'vector_indices' so no change there.
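A sketch of the widened precheck with stubbed counts; the real code obtains each count via system_store.count_total_docs, and the skip-when-below-threshold direction is assumed from the surrounding description rather than copied from the repo:

```javascript
// Hypothetical threshold; the real value comes from config.js.
const config = { DB_CLEANER_MAX_TOTAL_DOCS: 100 };

function db_cleaner_should_skip(counts) {
    const threshold = config.DB_CLEANER_MAX_TOTAL_DOCS;
    return counts.total_accounts_count < threshold &&
        counts.total_buckets_count < threshold &&
        counts.total_vector_buckets_count < threshold &&
        counts.total_vector_indices_count < threshold && // newly included store
        counts.total_pools_count < threshold;
}
```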

In `@src/server/system_services/bucket_server.js`:
- Around line 2549-2568: In update_rows_since_index, avoid blindly using SET on
vector_indices[0].rows_since_index (which can overwrite concurrent ADDs);
instead implement a compare-and-set or use a subtracting increment so concurrent
increments are preserved: when req.rpc_params.op === 'SET' treat the value as
the number of processed rows and either (a) perform a conditional update that
sets rows_since_index only if the current stored value equals the
observed/expected value provided in req.rpc_params (CAS), or (b) convert the
reset into an $inc by -processedCount (so rows_since_index = rows_since_index -
processedCount) and call system_store.make_changes with that conditional/$inc
change; update the change object construction in update_rows_since_index
accordingly and ensure req.rpc_params carries the observed/expected value if
using CAS.
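An in-memory illustration of the CAS variant, option (a) above; the real change would go through system_store.make_changes, which is not modeled here:

```javascript
// The SET only lands if the stored counter still equals the value the
// caller observed, so ADDs that raced in between are preserved.
const store = { rows_since_index: 0 };

function add_rows(delta) {            // the ADD path ($inc)
    store.rows_since_index += delta;
}

function cas_reset(expected, value) { // the guarded SET path
    if (store.rows_since_index !== expected) return false; // caller re-reads and retries
    store.rows_since_index = value;
    return true;
}
```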

In `@src/server/system_services/schemas/vector_index_schema.js`:
- Around line 52-54: The rows_since_index schema allows negative values; update
the vector_index_schema.js definition for the rows_since_index field to enforce
non-negativity by adding a minimum constraint (and optionally a default of 0) so
the JSON/schema validator rejects negative counters—locate the rows_since_index
property in the exported schema and add a "minimum: 0" (and set "default: 0" if
desired) to ensure the counter cannot be negative.
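The fragment below shows the suggested constraint; the exact field shape in vector_index_schema.js is inferred from the comment, and the hand-rolled check merely stands in for the project's JSON-schema validator:

```javascript
// Inferred shape of the rows_since_index field with the suggested constraints.
const rows_since_index_schema = {
    type: 'integer',
    minimum: 0, // reject negative counters
    default: 0,
};

// Stand-in for the schema validator used by the project.
function is_valid_rows_since_index(value) {
    return Number.isInteger(value) && value >= rows_since_index_schema.minimum;
}
```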

In `@src/util/vectors_util.js`:
- Around line 282-287: In reindex, guard against missing or too-small Lance
tables: when calling get_table(table_name) inside reindex(vector_bucket,
vector_index), check for undefined and return/skip early if no table; when
calling table.createIndex('vector', {replace: true}), wrap it in a try/catch and
detect the known "Not enough rows" (or equivalent) error coming from
createIndex/put_vectors and treat it as non-fatal by logging/debugging and
returning without throwing so the background reindexer won't retry; keep error
handling generic enough to rethrow unexpected errors.
- Around line 362-366: The code dereferences nsr.nsfs_config.fs_root_path
without checking that nsr exists; when system_store lookup returns undefined
(stale SDK info or buckets created without namespace_resource) this causes a
TypeError—update the logic around vector_bucket.namespace_resource/resource and
nsr resolution (the nsr variable and lance_path assignment) to explicitly guard
that nsr is defined and has nsfs_config before computing lance_path, and if not
throw a clear configuration error (e.g., "Missing namespace_resource or
nsfs_config for vector bucket: <identifier>") so callers get a meaningful
message instead of a generic crash.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f62823dc-2990-47b9-98fe-70fc9fc4c60a

📥 Commits

Reviewing files that changed from the base of the PR and between 69d5a21 and 3432c26.

📒 Files selected for processing (12)
  • config.js
  • src/api/bucket_api.js
  • src/endpoint/endpoint.js
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/sdk/nb.d.ts
  • src/sdk/vector_sdk.js
  • src/server/bg_services/db_cleaner.js
  • src/server/bg_workers.js
  • src/server/system_services/bucket_server.js
  • src/server/system_services/schemas/vector_index_schema.js
  • src/util/vectors_util.js

Comment thread src/api/bucket_api.js
Comment thread src/endpoint/endpoint.js Outdated
Comment thread src/sdk/bucketspace_fs.js
Comment on lines +1467 to +1469
async add_rows_since_reindex(/*{vector_bucket_name, vector_index_name, delta}*/) {
    //TODO
}


⚠️ Potential issue | 🟠 Major

Implement the FS reindex counter instead of silently no-oping.

VectorSDK.put_vectors() and delete_vectors() now rely on this method, so the TODO makes BucketSpaceFS-backed vector indexes never accumulate rows_since_index and therefore never trigger threshold-based reindexing.

I can help wire this through config_fs with a serialized read/update/write if you want.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sdk/bucketspace_fs.js` around lines 1467 - 1469, Implement the
previously-noop method add_rows_since_reindex to persistently increment the
per-index "rows_since_index" counter in the BucketSpaceFS backend: load the
index metadata for the given vector_bucket_name and vector_index_name (the same
storage object used by the existing FS config read/write helpers), validate and
coerce the incoming delta, add it to the stored rows_since_index (defaulting to
0 if missing), then write the updated metadata back to disk; perform the
read-modify-write under the same serialized lock or atomic update pattern used
by config_fs to avoid races, and ensure this method is used by
VectorSDK.put_vectors and delete_vectors to trigger threshold-based reindexing.

Comment thread src/sdk/vector_sdk.js
Comment thread src/server/bg_services/db_cleaner.js
Comment thread src/server/system_services/bucket_server.js
Comment thread src/server/system_services/schemas/vector_index_schema.js
Comment thread src/util/vectors_util.js
Comment thread src/util/vectors_util.js

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (3)
src/endpoint/endpoint.js (1)

262-267: ⚠️ Potential issue | 🟠 Major

Guard reindexer registration on internal_rpc_client.

Line 265 can pass undefined when init_request_sdk is supplied externally or in deployments without an internal RPC client. Match the NamespaceMonitor guard above.

Proposed fix
-        if (config.REINDEX_VECTOR_BUCKETS) {
+        if (internal_rpc_client && config.REINDEX_VECTOR_BUCKETS) {
             background_scheduler.register_bg_worker(new VectorBucketsReindexer({
                 name: 'Vector_buckets_reindexer',
                 client: internal_rpc_client,
             }));
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/endpoint/endpoint.js` around lines 262 - 267, The VectorBucketsReindexer
is being registered unconditionally when config.REINDEX_VECTOR_BUCKETS is true,
but internal_rpc_client may be undefined; update the block around
background_scheduler.register_bg_worker(new VectorBucketsReindexer(...)) to
mirror the NamespaceMonitor guard by only registering the reindexer when
internal_rpc_client is truthy (and config.REINDEX_VECTOR_BUCKETS is true), so
that VectorBucketsReindexer is not constructed/passed an undefined client;
reference the symbols VectorBucketsReindexer,
background_scheduler.register_bg_worker, internal_rpc_client, and
config.REINDEX_VECTOR_BUCKETS when making the check.
src/util/vectors_util.js (2)

362-366: ⚠️ Potential issue | 🟡 Minor

Guard namespace-resource resolution before deriving lance_path.

A stale/missing namespace_resource, unloaded systems[0], or unresolved resource name currently turns into a generic TypeError at Line 366. Return a clear configuration error before dereferencing nsr.nsfs_config.

Proposed guard
                 //for vc from system store, nsr is already an object.
                 //for vc from sdk info, lookup nsr name in sysem store.namespace_resources_by_name
-                const nsr = typeof vector_bucket.namespace_resource.resource === 'object' ? vector_bucket.namespace_resource.resource :
-                    system_store.data.systems[0].namespace_resources_by_name[vector_bucket.namespace_resource.resource];
+                const resource = vector_bucket.namespace_resource?.resource;
+                const nsr = typeof resource === 'object' ? resource :
+                    system_store.data.systems[0]?.namespace_resources_by_name?.[resource];
+                if (!nsr?.nsfs_config?.fs_root_path) {
+                    throw new Error(`Missing namespace_resource or nsfs_config for vector bucket ${vector_bucket.name.unwrap()}`);
+                }
                 lance_path = nsr.nsfs_config.fs_root_path;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/util/vectors_util.js` around lines 362 - 366, The code dereferences
nsr.nsfs_config without validating namespace resolution; in the block that
computes nsr from vector_bucket.namespace_resource.resource and
system_store.data.systems[0].namespace_resources_by_name, add guards to ensure
system_store.data.systems exists and has index 0, that
vector_bucket.namespace_resource and its resource are present, and that the
resolved nsr is an object with nsfs_config and fs_root_path; if any check fails,
throw/return a clear configuration error (e.g., "Missing or unresolved
namespace_resource for vector_bucket" or similar) before assigning lance_path so
the error is explicit instead of a TypeError.

282-287: ⚠️ Potential issue | 🟠 Major

Keep background reindex from crashing on absent or too-small Lance tables.

get_table() may return undefined, and createIndex() can hit the same “Not enough rows” path handled elsewhere. Skip these cases so the background worker does not fail-loop.

Proposed guard
     async reindex(vector_bucket, vector_index) {
         dbg.log0("reindex vector_bucket =", vector_bucket.name.unwrap(), ", vector_index =", vector_index.name.unwrap());
         const table_name = vector_bucket.name.unwrap() + "_" + vector_index.name.unwrap();
         const table = await this.get_table(table_name);
-        await table.createIndex('vector', {replace: true});
+        if (!table) {
+            dbg.log1("reindex skipped, table not found =", table_name);
+            return;
+        }
+        try {
+            await table.createIndex('vector', {replace: true});
+        } catch (err) {
+            if (err.message?.includes('Not enough rows')) {
+                dbg.log1("reindex skipped, not enough rows for table =", table_name);
+                return;
+            }
+            throw err;
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/util/vectors_util.js` around lines 282 - 287, The reindex method can
throw when get_table() returns undefined or when table.createIndex('vector')
fails for small tables; modify reindex (in src/util/vectors_util.js) to first
guard if table is falsy and simply return (or log and return), then wrap the
await table.createIndex('vector', {replace: true}) call in a try/catch that
catches the "Not enough rows" condition (match by error message or specific
error class if available) and treats it as a no-op (log debug and return) while
rethrowing unexpected errors; reference the reindex function, get_table call,
and createIndex invocation when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/endpoint/endpoint.js`:
- Around line 262-267: The VectorBucketsReindexer is being registered
unconditionally when config.REINDEX_VECTOR_BUCKETS is true, but
internal_rpc_client may be undefined; update the block around
background_scheduler.register_bg_worker(new VectorBucketsReindexer(...)) to
mirror the NamespaceMonitor guard by only registering the reindexer when
internal_rpc_client is truthy (and config.REINDEX_VECTOR_BUCKETS is true), so
that VectorBucketsReindexer is not constructed/passed an undefined client;
reference the symbols VectorBucketsReindexer,
background_scheduler.register_bg_worker, internal_rpc_client, and
config.REINDEX_VECTOR_BUCKETS when making the check.

In `@src/util/vectors_util.js`:
- Around line 362-366: The code dereferences nsr.nsfs_config without validating
namespace resolution; in the block that computes nsr from
vector_bucket.namespace_resource.resource and
system_store.data.systems[0].namespace_resources_by_name, add guards to ensure
system_store.data.systems exists and has index 0, that
vector_bucket.namespace_resource and its resource are present, and that the
resolved nsr is an object with nsfs_config and fs_root_path; if any check fails,
throw/return a clear configuration error (e.g., "Missing or unresolved
namespace_resource for vector_bucket" or similar) before assigning lance_path so
the error is explicit instead of a TypeError.
- Around line 282-287: The reindex method can throw when get_table() returns
undefined or when table.createIndex('vector') fails for small tables; modify
reindex (in src/util/vectors_util.js) to first guard if table is falsy and
simply return (or log and return), then wrap the await
table.createIndex('vector', {replace: true}) call in a try/catch that catches
the "Not enough rows" condition (match by error message or specific error class
if available) and treats it as a no-op (log debug and return) while rethrowing
unexpected errors; reference the reindex function, get_table call, and
createIndex invocation when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 171beb2f-4687-4d5c-bde7-99b04252115e

📥 Commits

Reviewing files that changed from the base of the PR and between 3432c26 and ee4c4f9.

📒 Files selected for processing (12)
  • config.js
  • src/api/bucket_api.js
  • src/endpoint/endpoint.js
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/sdk/nb.d.ts
  • src/sdk/vector_sdk.js
  • src/server/bg_services/db_cleaner.js
  • src/server/bg_workers.js
  • src/server/system_services/bucket_server.js
  • src/server/system_services/schemas/vector_index_schema.js
  • src/util/vectors_util.js
✅ Files skipped from review due to trivial changes (2)
  • src/server/bg_workers.js
  • src/sdk/vector_sdk.js
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/server/system_services/schemas/vector_index_schema.js
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/server/system_services/bucket_server.js
  • src/api/bucket_api.js
  • src/server/bg_services/db_cleaner.js


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/server/bg_services/vector_buckets_reindexer.js`:
- Around line 29-30: In vector_buckets_reindexer.js replace the inconsistent
dbg.log call with dbg.log0 so logging uses the same verbosity level as the rest
of the module; locate the return path that currently calls dbg.log("Vector
Bucket Reindexer: no vector indices.") and change it to call dbg.log0 with the
same message to maintain consistent logging behavior in the Vector Bucket
Reindexer flow.
- Around line 33-53: The loop currently races and can lose progress: compute a
safe rows counter (e.g., const rows = vi.rows_since_index || 0) and only
consider reindex when rows >= config.VECTOR_INDEX_ROWS_REINDEX; wrap each
iteration in a try/catch so one failure doesn't stop the rest; call await
vector_utils.reindex(vi.vector_bucket, vi) (ensure you await the async reindex)
before updating/resetting the counter; when resetting use a decrement/INC
operation with the observed rows (call
this.client.bucket.update_rows_since_index with op: 'INC' and value: -rows) or
otherwise subtract the observed value instead of unconditionally SET 0 to avoid
dropping concurrent writes; use auth_server.make_auth_token and
system_store.data.systems[0] as before for the update call.
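The hardened loop described above can be sketched against in-memory stubs as follows; vector_utils, the client, and the index objects are simplified placeholders for the real RPC wiring and auth token handling:

```javascript
// Hypothetical threshold; the real value is config.VECTOR_INDEX_ROWS_REINDEX.
const config = { VECTOR_INDEX_ROWS_REINDEX: 100 };

async function run_reindexer(indices, vector_utils, client) {
    for (const vi of indices) {
        const rows = vi.rows_since_index || 0; // tolerate a missing counter
        if (rows < config.VECTOR_INDEX_ROWS_REINDEX) continue;
        try {
            await vector_utils.reindex(vi.vector_bucket, vi);
            // subtract only the observed rows so concurrent ADDs survive
            await client.update_rows_since_index({ op: 'INC', value: -rows });
        } catch (err) {
            // one failing index must not stop the rest of the scan
            console.warn('reindex failed for', vi.name, err.message);
        }
    }
}
```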

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ee8e48b2-3a88-40b7-aa0d-a7da9612ca80

📥 Commits

Reviewing files that changed from the base of the PR and between ee4c4f9 and 6252cf9.

📒 Files selected for processing (13)
  • config.js
  • src/api/bucket_api.js
  • src/endpoint/endpoint.js
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/sdk/nb.d.ts
  • src/sdk/vector_sdk.js
  • src/server/bg_services/db_cleaner.js
  • src/server/bg_services/vector_buckets_reindexer.js
  • src/server/bg_workers.js
  • src/server/system_services/bucket_server.js
  • src/server/system_services/schemas/vector_index_schema.js
  • src/util/vectors_util.js
✅ Files skipped from review due to trivial changes (2)
  • src/server/bg_workers.js
  • src/server/system_services/schemas/vector_index_schema.js
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/sdk/nb.d.ts
  • src/sdk/bucketspace_nb.js
  • src/endpoint/endpoint.js
  • src/sdk/bucketspace_fs.js
  • config.js
  • src/sdk/vector_sdk.js
  • src/server/bg_services/db_cleaner.js

Comment thread src/server/bg_services/vector_buckets_reindexer.js Outdated
Comment thread src/server/bg_services/vector_buckets_reindexer.js
Signed-off-by: Amit Prinz Setter <alphaprinz@gmail.com>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/server/system_services/bucket_server.js (1)

2357-2360: ⚠️ Potential issue | 🟠 Major

Complete the REST error mapping for NO_SUCH_VECTOR_INDEX.

The server now throws NO_SUCH_VECTOR_INDEX, but the REST endpoint mapping is incomplete. Three changes are needed:

  1. Add VectorError.NoSuchVectorIndex to src/endpoint/vector/vector_errors.js
  2. Map NO_SUCH_VECTOR_INDEX to the new VectorError.NoSuchVectorIndex in RPC_ERRORS_TO_VECTOR in src/endpoint/vector/vector_rest.js (lines 16-40)
  3. Without these mappings, clients currently receive a fallback InternalFailure (HTTP 500) error instead of the proper vector index not found error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/server/system_services/bucket_server.js` around lines 2357 - 2360, Add a
new VectorError variant named NoSuchVectorIndex in the vector error definitions
(VectorError.NoSuchVectorIndex) and export it from the vector_errors module,
then update the RPC_ERRORS_TO_VECTOR mapping in the vector_rest module to map
the RPC error string "NO_SUCH_VECTOR_INDEX" to VectorError.NoSuchVectorIndex so
REST endpoints convert the RpcError('NO_SUCH_VECTOR_INDEX', ...) into the
correct HTTP/vector error instead of falling back to InternalFailure.
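A hedged sketch of the three-step mapping fix; the VectorError shapes and HTTP codes below are stand-ins, since the real vector_errors.js definitions are not shown in this review:

```javascript
// Hypothetical error shapes; the real VectorError objects in
// src/endpoint/vector/vector_errors.js carry their own codes and messages.
const VectorError = {
    InternalFailure: { code: 'InternalFailure', http_code: 500 },
    NoSuchVectorIndex: { code: 'NoSuchVectorIndex', http_code: 404 },
};

// Step 2: map the RPC error string to the new REST error.
const RPC_ERRORS_TO_VECTOR = Object.freeze({
    NO_SUCH_VECTOR_INDEX: VectorError.NoSuchVectorIndex,
});

// Step 3: unmapped codes fall back to InternalFailure, which is the current
// (undesired) behavior for NO_SUCH_VECTOR_INDEX before this fix.
function map_rpc_error(rpc_code) {
    return RPC_ERRORS_TO_VECTOR[rpc_code] || VectorError.InternalFailure;
}
```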
♻️ Duplicate comments (3)
src/util/vectors_util.js (2)

282-287: ⚠️ Potential issue | 🟠 Major

Handle non-indexable tables in background reindex().

get_table() can return undefined, and createIndex() can fail for small tables; this should be treated as a skip, not a hard failure.

Suggested guard
     async reindex(vector_bucket, vector_index) {
         dbg.log0("reindex vector_bucket =", vector_bucket.name.unwrap(), ", vector_index =", vector_index.name.unwrap());
         const table_name = vector_bucket.name.unwrap() + "_" + vector_index.name.unwrap();
         const table = await this.get_table(table_name);
-        await table.createIndex('vector', {replace: true});
+        if (!table) {
+            dbg.log1("reindex skipped, table not found =", table_name);
+            return;
+        }
+        try {
+            await table.createIndex('vector', { replace: true });
+        } catch (err) {
+            if (err.message?.includes('Not enough rows')) {
+                dbg.log1("reindex skipped, not enough rows =", table_name);
+                return;
+            }
+            throw err;
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/util/vectors_util.js` around lines 282 - 287, The reindex method should
treat non-indexable or tiny tables as skippable: check the result of
get_table(table_name) and if it's undefined, log a debug/info message and return
without throwing; also wrap table.createIndex('vector', {replace: true}) in a
try/catch that catches errors from non-indexable/small tables, logs a
debug/info-level skip message (including table_name and the error) and does not
propagate the exception. Update the async reindex(vector_bucket, vector_index)
function to perform these guards around get_table and createIndex.

364-366: ⚠️ Potential issue | 🟡 Minor

Guard unresolved namespace resource before dereference.

If lookup fails, current code throws a generic TypeError instead of a clear configuration error.

Suggested guard
                 const nsr = typeof vector_bucket.namespace_resource.resource === 'object' ? vector_bucket.namespace_resource.resource :
                     system_store.data.systems[0].namespace_resources_by_name[vector_bucket.namespace_resource.resource];
+                if (!nsr?.nsfs_config?.fs_root_path) {
+                    throw new Error(`Missing namespace_resource/nsfs_config for vector bucket ${vector_bucket.name}`);
+                }
                 lance_path = nsr.nsfs_config.fs_root_path;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/util/vectors_util.js` around lines 364 - 366, The code dereferences
nsr.nsfs_config.fs_root_path without verifying nsr exists; update the lookup
around vector_bucket.namespace_resource/resource to validate that nsr is not
null/undefined (after computing nsr from
system_store.data.systems[0].namespace_resources_by_name or the inline object)
and if it's missing throw or return a clear configuration error mentioning the
missing namespace resource and the offending
vector_bucket.namespace_resource.resource value; reference the symbols
vector_bucket, namespace_resource, nsr, system_store,
namespace_resources_by_name, nsfs_config, and fs_root_path when adding the guard
and error to make the failure explicit.
src/server/system_services/bucket_server.js (1)

2560-2565: ⚠️ Potential issue | 🟠 Major

Prevent lost updates in SET path for rows_since_index.

SET can overwrite concurrent ADD updates, so row changes can be dropped and reindex timing drifts.

Suggested fix (CAS-style update)
 async function update_rows_since_index(req) {
@@
-    if (req.rpc_params.op === 'SET') {
-        change.update.vector_indices[0].rows_since_index = req.rpc_params.value;
-    } else {
+    if (req.rpc_params.op === 'SET') {
+        if (typeof req.rpc_params.expected !== 'number') {
+            throw new RpcError('BAD_REQUEST', 'expected rows_since_index is required for SET');
+        }
+        change.update.vector_indices[0].$find = {
+            _id: vector_index._id,
+            rows_since_index: req.rpc_params.expected,
+        };
+        change.update.vector_indices[0].$set = {
+            rows_since_index: req.rpc_params.value,
+        };
+    } else if (req.rpc_params.op === 'ADD') {
         change.update.vector_indices[0].$inc = {
             rows_since_index: req.rpc_params.value
         };
+    } else {
+        throw new RpcError('BAD_REQUEST', `Unsupported op ${req.rpc_params.op}`);
     }

As per coding guidelines: "**/*.{js,ts,tsx}: Review code for race conditions in async code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/server/system_services/bucket_server.js` around lines 2560 - 2565, The
current SET branch directly assigns
change.update.vector_indices[0].rows_since_index = req.rpc_params.value which
can overwrite concurrent ADD ($inc) updates; replace this with a CAS-style
update: when req.rpc_params.op === 'SET' read the current rows_since_index for
that vector index, then issue an atomic update that includes the current value
in the update filter (e.g. update ... where rows_since_index == current) to $set
it to req.rpc_params.value and retry the read/update if the conditional update
fails; alternatively, require/use an expected value from req.rpc_params (e.g.
req.rpc_params.expected) and include it in the update filter so
change.update.vector_indices[0].rows_since_index is only set if the stored value
equals expected, falling back to a retry loop if needed. Ensure you modify the
code that now sets change.update.vector_indices[0].rows_since_index and the
surrounding update execution logic to perform the conditional update instead of
a blind assignment.
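The read-then-conditionally-write loop the prompt describes can be shown with an in-memory stand-in for the DB document. The `read`/`conditional_write` callbacks below are hypothetical abstractions; in the real fix the compare step would live in the MongoDB update filter (update only where `rows_since_index` equals the value just read), not in application code.

```javascript
// Hypothetical CAS-style SET: only apply the write if the stored value
// still equals what we read; retry on a lost race with a concurrent $inc.
async function cas_set_rows_since_index(read, conditional_write, value, max_retries = 5) {
    for (let i = 0; i < max_retries; ++i) {
        const expected = await read();
        if (await conditional_write(expected, value)) return true;
    }
    return false;
}
```

With this shape, a concurrent ADD between the read and the write makes the conditional write fail, and the loop re-reads instead of silently discarding the increment.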
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/server/system_services/bucket_server.js`:
- Around line 2357-2360: Add a new VectorError variant named NoSuchVectorIndex
in the vector error definitions (VectorError.NoSuchVectorIndex) and export it
from the vector_errors module, then update the RPC_ERRORS_TO_VECTOR mapping in
the vector_rest module to map the RPC error string "NO_SUCH_VECTOR_INDEX" to
VectorError.NoSuchVectorIndex so REST endpoints convert the
RpcError('NO_SUCH_VECTOR_INDEX', ...) into the correct HTTP/vector error instead
of falling back to InternalFailure.
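The requested mapping amounts to one new error constant plus one table entry. The `VectorError` object shapes below are hypothetical placeholders (the real `vector_errors` module defines richer error objects); only the lookup pattern is the point.

```javascript
// Hypothetical sketch of the RPC-to-vector error mapping described above.
const VectorError = Object.freeze({
    NoSuchVectorIndex: { code: 'NoSuchVectorIndex', http_code: 404 },
    InternalFailure: { code: 'InternalFailure', http_code: 500 },
});

const RPC_ERRORS_TO_VECTOR = Object.freeze({
    NO_SUCH_VECTOR_INDEX: VectorError.NoSuchVectorIndex,
});

// REST layer: translate an RpcError code, defaulting to InternalFailure.
function map_rpc_error(rpc_code) {
    return RPC_ERRORS_TO_VECTOR[rpc_code] || VectorError.InternalFailure;
}
```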


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8e69b9c5-d7d3-4f41-a71e-5f050f5c5e2d

📥 Commits

Reviewing files that changed from the base of the PR and between 6252cf9 and 1994c20.

📒 Files selected for processing (13)
  • config.js
  • src/api/bucket_api.js
  • src/endpoint/endpoint.js
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/sdk/nb.d.ts
  • src/sdk/vector_sdk.js
  • src/server/bg_services/db_cleaner.js
  • src/server/bg_services/vector_buckets_reindexer.js
  • src/server/bg_workers.js
  • src/server/system_services/bucket_server.js
  • src/server/system_services/schemas/vector_index_schema.js
  • src/util/vectors_util.js
✅ Files skipped from review due to trivial changes (3)
  • src/server/system_services/schemas/vector_index_schema.js
  • src/server/bg_workers.js
  • src/api/bucket_api.js
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/sdk/bucketspace_fs.js
  • src/sdk/bucketspace_nb.js
  • src/sdk/vector_sdk.js
  • src/server/bg_services/vector_buckets_reindexer.js
  • src/server/bg_services/db_cleaner.js
