
Centralize shared utilities across benchmark backends #2040

Draft
jnke2016 wants to merge 16 commits into rapidsai:main from jnke2016:consolidate-shared-utils

Conversation

@jnke2016 (Contributor) commented Apr 25, 2026

Summary

This PR centralizes duplicated logic identified during review of the OpenSearch and Elasticsearch backend PRs. Both backends independently reimplemented common operations (vector file loading, YAML parsing, dataset lookup, DatasetConfig construction, parameter expansion, recall computation) that should be shared across all backends. This PR moves that shared logic into the framework so that backend authors focus only on backend-specific concerns.

Motivation

With the pluggable backend architecture, each new Python-native backend needs to load binary vector files, parse dataset YAML configs, look up datasets by name, expand parameter combinations, and compute recall. Without centralized utilities, each backend reimplements these independently, leading to:

  • Format drift (e.g., OpenSearch supports 5 binary formats while Elasticsearch only supports 2)
  • Bug fixes and cross-cutting changes that must be applied to every backend individually instead of in one place
  • Unnecessary code for new backend authors to write

Changes

1. Shared vector loading utility (backends/utils.py)

Added load_vectors() and dtype_from_filename() (originating from generate_groundtruth/utils.py) for reading binary vector files. Supports all formats (.fbin, .f16bin, .u8bin, .i8bin, .ibin) with subset_size support and input validation (unsupported extensions, truncated headers/payloads). The C++ backend is unaffected since it passes file paths to the subprocess.
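
For illustration, usage looks roughly like this (the example paths and the per-extension dtype mapping noted in comments are assumptions, not code from this PR):

from cuvs_bench.backends.utils import load_vectors, dtype_from_filename

# Hypothetical paths; the dtype is inferred from the file extension.
base = load_vectors("datasets/sift/base.fbin")       # float32 array, shape (n_rows, dim)
queries = load_vectors("datasets/sift/query.u8bin",  # uint8 array
                       subset_size=1000)             # read at most the first 1000 rows
print(dtype_from_filename("groundtruth.ibin"))       # e.g. int32, per the assumed mapping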

2. Promoted config-loading methods to base ConfigLoader class

Moved load_yaml_file, get_dataset_configuration, and gather_algorithm_configs from CppGBenchConfigLoader to the base ConfigLoader class. These were being reimplemented by both the OpenSearch and Elasticsearch config loaders. They are now inherited automatically. Also added a warning for invalid algorithm_configuration paths (previously silently ignored).
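
Schematically (a sketch, not the PR's exact code; the subclass name and method signatures are illustrative):

from abc import ABC

class ConfigLoader(ABC):
    # Shared helpers now live on the base class and are inherited by all loaders.
    def load_yaml_file(self, path): ...
    def get_dataset_configuration(self, dataset_name, conf): ...
    def gather_algorithm_configs(self, scripts_path, algorithm_configuration): ...

class OpenSearchConfigLoader(ConfigLoader):
    # Hypothetical subclass: inherits all three helpers and implements only
    # backend-specific logic.
    ...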

3. Dataset refactored from dataclass to regular class with transparent vector loading

The Dataset class now transparently loads vectors from file paths on first access via properties. Python-native backends just access dataset.base_vectors and get a numpy array regardless of whether vectors were provided directly or loaded from disk. The C++ backend continues to use dataset.base_file only, never triggering any loading. This was done in response to reviewer feedback that backends should not need to know about file loading utilities.
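
A minimal sketch of the lazy-loading pattern (the constructor signature and sentinel handling are simplified assumptions; only base_file, base_vectors, and load_vectors come from the PR):

import numpy as np
from cuvs_bench.backends.utils import load_vectors

class Dataset:
    def __init__(self, base_file=None, base_vectors=None):
        self.base_file = base_file        # used directly by the C++ backend
        self._base_vectors = base_vectors

    @property
    def base_vectors(self) -> np.ndarray:
        # Load from disk on first access only when vectors were not passed in.
        if self._base_vectors is None and self.base_file is not None:
            self._base_vectors = load_vectors(self.base_file)
        return self._base_vectors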

4. ConfigLoader template method pattern

ConfigLoader.load() is now a concrete method that handles the shared steps (loading dataset YAML, finding the dataset by name, constructing DatasetConfig with path resolution via build_dataset_config()) then delegates to the abstract _build_benchmark_configs() for backend-specific logic. Backend authors implement _build_benchmark_configs() and receive the DatasetConfig already built. They cannot skip or reimplement the shared steps. The orchestrator is unchanged.
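
The flow, sketched (argument names are assumptions; load(), build_dataset_config(), and _build_benchmark_configs() are the names described above, and the shared helpers are elided with ... bodies):

from abc import ABC, abstractmethod

class ConfigLoader(ABC):
    def load_yaml_file(self, path): ...
    def get_dataset_configuration(self, dataset_name, conf): ...
    def build_dataset_config(self, entry): ...

    def load(self, dataset_name, dataset_conf_file):
        conf = self.load_yaml_file(dataset_conf_file)          # shared step
        entry = self.get_dataset_configuration(dataset_name, conf)
        dataset_config = self.build_dataset_config(entry)      # path resolution
        return self._build_benchmark_configs(dataset_config)   # backend hook

    @abstractmethod
    def _build_benchmark_configs(self, dataset_config):
        """Backend-specific: build benchmark configs from a ready DatasetConfig."""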

5. Shared parameter expansion and recall computation (backends/utils.py)

Added expand_param_grid() for expanding YAML parameter specs (dict-of-lists to list-of-dicts via Cartesian product) and compute_recall() for computing recall@k from neighbors and ground truth. Both are used by Python-native backends. The C++ backend does not use these (it handles parameter expansion interleaved with constraint validation, and computes recall in the subprocess).
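
Illustrative sketches of the two helpers (exact signatures in the PR may differ):

from itertools import product
import numpy as np

def expand_param_grid(params):
    # {"nlist": [1024, 2048], "nprobe": [10]} ->
    # [{"nlist": 1024, "nprobe": 10}, {"nlist": 2048, "nprobe": 10}]
    keys = list(params)
    return [dict(zip(keys, combo)) for combo in product(*(params[k] for k in keys))]

def compute_recall(neighbors, groundtruth, k):
    # recall@k: fraction of true top-k neighbors recovered, averaged over queries.
    hits = sum(
        np.intersect1d(neighbors[i, :k], groundtruth[i, :k]).size
        for i in range(neighbors.shape[0])
    )
    return hits / float(neighbors.shape[0] * k)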

Remaining work

  • Consider consolidating pre-existing I/O duplicates across generate_groundtruth/utils.py, get_dataset/fbin_to_f16bin.py, and get_dataset/hdf5_to_fbin.py into a top-level cuvs_bench/io.py module
  • Streaming write capability (incremental batches without knowing total size upfront)

…nfiguration, gather_algorithm_configs) from CppGBenchConfigLoader to base ConfigLoader class.
@jnke2016 jnke2016 requested a review from a team as a code owner April 25, 2026 18:04
@jnke2016 jnke2016 marked this pull request as draft April 25, 2026 18:04

copy-pr-bot Bot commented Apr 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough

Introduces a backend utility function to load vectors from binary files with NumPy dtype inference based on file extension, and refactors shared YAML/dataset/algorithm configuration logic from CppGBenchConfigLoader into the abstract ConfigLoader base class for reuse across loaders.

Changes

Backend Vector Loading (python/cuvs_bench/cuvs_bench/backends/utils.py):
New load_vectors() function that infers the NumPy dtype from the binary file extension (.fbin, .f16bin, .u8bin, .i8bin, .ibin), reads row/column counts from file headers, optionally subsets the data, and constructs 2D NumPy arrays.

Configuration Loading Refactoring (python/cuvs_bench/cuvs_bench/orchestrator/config_loaders.py):
Promotes three methods from CppGBenchConfigLoader to the abstract ConfigLoader base class: load_yaml_file(), get_dataset_configuration(), and gather_algorithm_configs(). Improves dataset lookup error messaging with specific dataset names.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: 5 of 5 passed

  • Title check: ✅ Passed. The title 'Centralize shared utilities across benchmark backends' accurately reflects the main changes: introducing a shared load_vectors utility and moving shared configuration methods from CppGBenchConfigLoader to the base ConfigLoader class.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The pull request description provides comprehensive details about centralizing shared utility logic across backends, including motivation, specific changes, and remaining work.


@coderabbitai Bot left a comment
Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d1822e3-7adb-4f35-a054-0962f282eec1

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 6e91f0e.

📒 Files selected for processing (2)
  • python/cuvs_bench/cuvs_bench/backends/utils.py
  • python/cuvs_bench/cuvs_bench/orchestrator/config_loaders.py

Comment on lines +57 to +58
    ext = os.path.splitext(path)[1].lower()
    dtype = _DTYPE_FOR_EXT.get(ext, np.float32)

⚠️ Potential issue | 🟠 Major

Fail fast on unsupported vector file extensions.

Line 58 silently defaults unknown suffixes to float32, which can corrupt benchmark inputs when a filename is mistyped.

Proposed fix
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
@@
     ext = os.path.splitext(path)[1].lower()
-    dtype = _DTYPE_FOR_EXT.get(ext, np.float32)
+    if ext not in _DTYPE_FOR_EXT:
+        raise ValueError(
+            f"Unsupported vector file extension '{ext}'. "
+            f"Expected one of: {', '.join(sorted(_DTYPE_FOR_EXT.keys()))}"
+        )
+    dtype = _DTYPE_FOR_EXT[ext]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs_bench/cuvs_bench/backends/utils.py` around lines 57 - 58, The
code currently assigns dtype = _DTYPE_FOR_EXT.get(ext, np.float32) which
silently defaults unknown file extensions to float32; change this to fail fast
by looking up ext in _DTYPE_FOR_EXT and raising a clear ValueError (including
ext and path) when not found instead of defaulting. Update the logic around ext
and dtype (the ext variable and _DTYPE_FOR_EXT lookup) so unknown suffixes raise
an exception with a descriptive message to prevent corrupted benchmark inputs.

Comment on lines +60 to +66
        n_rows = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
        n_cols = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
        if subset_size is not None:
            n_rows = min(n_rows, subset_size)
        raw = f.read(n_rows * n_cols * np.dtype(dtype).itemsize)
        data = np.frombuffer(raw, dtype=dtype)
    return data.reshape(n_rows, n_cols)

⚠️ Potential issue | 🟠 Major

Validate file header/payload size and subset_size bounds.

Lines 60-66 assume complete header/payload and accept negative subset_size, which leads to non-actionable errors on bad input.

Proposed fix
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
@@
+    if subset_size is not None and subset_size < 0:
+        raise ValueError("subset_size must be >= 0")
+
     with open(path, "rb") as f:
-        n_rows = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
-        n_cols = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
+        header = f.read(8)
+        if len(header) != 8:
+            raise ValueError(f"Invalid vector file header in '{path}'")
+        n_rows, n_cols = np.frombuffer(header, dtype=np.uint32, count=2)
+        n_rows, n_cols = int(n_rows), int(n_cols)
         if subset_size is not None:
             n_rows = min(n_rows, subset_size)
-        raw = f.read(n_rows * n_cols * np.dtype(dtype).itemsize)
-        data = np.frombuffer(raw, dtype=dtype)
+        expected = n_rows * n_cols * np.dtype(dtype).itemsize
+        raw = f.read(expected)
+        if len(raw) != expected:
+            raise ValueError(
+                f"Unexpected payload size in '{path}': expected {expected} bytes, got {len(raw)}"
+            )
+        data = np.frombuffer(raw, dtype=dtype, count=n_rows * n_cols)
     return data.reshape(n_rows, n_cols)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs_bench/cuvs_bench/backends/utils.py` around lines 60 - 66, The
code that reads binary matrices (reading n_rows/n_cols and payload) must
validate header and payload lengths and ensure subset_size is non-negative and
within bounds: check that f.read(4) calls returned 4 bytes before converting to
uint32 and raise a ValueError if not; validate subset_size >= 0 and if provided
clamp it to n_rows but error if it's > n_rows; compute expected_bytes = n_rows *
n_cols * np.dtype(dtype).itemsize and verify f.read returns exactly that many
bytes (or at least enough for the requested subset) before calling
np.frombuffer, raising a clear ValueError if sizes mismatch; update the logic
around n_rows, n_cols, subset_size, dtype, np.frombuffer and f.read to use these
checks so malformed files produce actionable errors rather than cryptic crashes.

Comment on lines +234 to +243
        if algorithm_configuration:
            if os.path.isdir(algorithm_configuration):
                algos_conf_fs += [
                    os.path.join(algorithm_configuration, f)
                    for f in os.listdir(algorithm_configuration)
                    if ".json" not in f
                ]
            elif os.path.isfile(algorithm_configuration):
                algos_conf_fs.append(algorithm_configuration)
        return algos_conf_fs

⚠️ Potential issue | 🟠 Major

Do not silently ignore invalid algorithm_configuration paths.

Line 234 accepts an override input, but when it is neither a file nor a directory, the code currently falls back silently to defaults.

Proposed fix
         if algorithm_configuration:
             if os.path.isdir(algorithm_configuration):
                 algos_conf_fs += [
                     os.path.join(algorithm_configuration, f)
                     for f in os.listdir(algorithm_configuration)
                     if ".json" not in f
                 ]
             elif os.path.isfile(algorithm_configuration):
                 algos_conf_fs.append(algorithm_configuration)
+            else:
+                raise FileNotFoundError(
+                    f"algorithm_configuration path does not exist: "
+                    f"{algorithm_configuration}"
+                )
         return algos_conf_fs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs_bench/cuvs_bench/orchestrator/config_loaders.py` around lines 234
- 243, The code silently ignores invalid algorithm_configuration values when
they are neither files nor directories; update the block handling
algorithm_configuration (the variable algorithm_configuration that populates
algos_conf_fs) to detect the invalid path and raise a clear exception (e.g.,
raise ValueError) or log-and-raise with the provided path and its type so
callers know the override is invalid instead of silently falling back to
defaults; implement this in the same conditional after the
os.path.isdir/os.path.isfile checks where algos_conf_fs is modified.

jnke2016 added 13 commits April 25, 2026 18:17
…idate header/payload size, warn on invalid algorithm config paths.
…stry.py to match current BenchmarkBackend abstract interface
…ansparent vector loading, and base ConfigLoader inherited methods
…taset lookup, DatasetConfig construction) and delegate backend-specific logic to abstract _build_benchmark_configs().
@jrbourbeau (Member) left a comment


Thanks @jnke2016 for moving these to a shared place. Overall LGTM -- just left a couple minor (nonblocking) comments

        # it was passed in or loaded from disk.
        self._base_vectors = base_vectors if base_vectors is not None else np.empty((0, 0))
        self._query_vectors = query_vectors if query_vectors is not None else np.empty((0, 0))
        self._groundtruth_neighbors = groundtruth_neighbors
jrbourbeau (Member)

Why is _groundtruth_neighbors treated differently here than _base_vectors and _query_vectors?

Comment on lines +341 to +344
if ".json" not in f
and "constraint" not in f
and ".py" not in f
and "__pycache__" not in f
jrbourbeau (Member)

Are these supposed to be yaml files? If so, checking for that extension (.yaml + .yml) might be better. Maybe the situation is more complex though
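
For reference, the extension-based filter being suggested would look roughly like this (a sketch reusing names from the quoted snippet, not code from the PR):

algos_conf_fs += [
    os.path.join(algorithm_configuration, f)
    for f in os.listdir(algorithm_configuration)
    if f.endswith((".yaml", ".yml"))
]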

