Speedup available() for minio backends #555

Draft
hagenw wants to merge 1 commit into main from speedup-available

@hagenw hagenw commented Apr 7, 2026

Speed up audb.available() for S3/MinIO backends by no longer checking whether the header file exists for each version. For Artifactory backends, the list of available versions is likewise based on the available folders, without checking for the existence of the header file.

This approach is riskier, because a version folder that lacks a header file would still be reported as available, but it is considerably faster.

Execution time:

| Backend     | Current | Proposed |
|-------------|---------|----------|
| artifactory | 2.834s  | 2.834s   |
| minio       | 35.483s | 10.904s  |
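
To put the timings in relation, the speedup factor can be computed directly from the numbers reported above. This is only a small arithmetic sketch; the `speedup` helper and the `timings` dictionary are illustrative names, not part of audb.

```python
# Timings in seconds, copied from the execution time table.
timings = {
    "artifactory": {"current": 2.834, "proposed": 2.834},
    "minio": {"current": 35.483, "proposed": 10.904},
}


def speedup(current: float, proposed: float) -> float:
    """Return how many times faster the proposed run is."""
    return current / proposed


for backend, t in timings.items():
    factor = speedup(t["current"], t["proposed"])
    saved = t["current"] - t["proposed"]
    print(f"{backend}: {factor:.2f}x faster, {saved:.3f}s saved")
# artifactory: 1.00x faster, 0.000s saved
# minio: 3.25x faster, 24.579s saved
```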

Summary by Sourcery

Optimize database discovery for object-storage backends when listing available databases and versions.

Enhancements:

  • Speed up S3/MinIO available-database listing by inferring versions from object subfolders instead of checking for header file existence.
  • Align artifactory backend version discovery to use available folders without validating header files for each version.

sourcery-ai Bot commented Apr 7, 2026

Reviewer's Guide

Optimize audb.available() for S3/MinIO and Artifactory backends by deriving available database versions from listed folders instead of probing for header file existence, significantly reducing calls and latency on remote storage.

Sequence diagram for the optimized audb.available() on MinIO backend

sequenceDiagram
    actor User
    participant AudbAPI
    participant BackendInterface
    participant Repository
    participant MinioClient

    User->>AudbAPI: call available()
    AudbAPI->>BackendInterface: get_backend(repository)
    BackendInterface-->>AudbAPI: backend

    AudbAPI->>BackendInterface: is_s3_or_minio_backend()
    BackendInterface-->>AudbAPI: True

    AudbAPI->>MinioClient: list_objects(repository.name)
    MinioClient-->>AudbAPI: top_level_objects
    loop for each obj in top_level_objects
        AudbAPI->>AudbAPI: name = obj.object_name without trailing slash
        AudbAPI->>MinioClient: list_objects(repository.name, obj.object_name)
        MinioClient-->>AudbAPI: sub_folders
        loop for each sub_folder in sub_folders
            AudbAPI->>AudbAPI: version = sub_folder.object_name.split("/")[1]
            AudbAPI->>AudbAPI: check version not in [attachment, media, meta]
            alt version is valid database version
                AudbAPI->>AudbAPI: add_database(name, version, repository)
            else version is attachment, media, or meta
                AudbAPI-->>AudbAPI: skip
            end
        end
    end

    AudbAPI-->>User: return list of available databases and versions

Flow diagram for backend-specific version discovery in audb.available()

flowchart TD
    A["Start audb.available()"] --> B["Get backend for repository"]
    B --> C{"Backend is S3 or MinIO?"}

    C -- "Yes" --> D["Call client.list_objects(repository.name) (top level objects)"]
    D --> E["For each obj: derive name from obj.object_name"]
    E --> F["Call client.list_objects(repository.name, obj.object_name) (sub_folders)"]
    F --> G["For each sub_folder: version = sub_folder.object_name.split('/')[1]"]
    G --> H{"version in [attachment, media, meta]?"}
    H -- "Yes" --> I["Skip version"]
    H -- "No" --> J["add_database(name, version, repository)"]
    I --> K["Next sub_folder/obj"]
    J --> K
    K --> L["All objects processed"]

    C -- "No" --> M["backend_interface.ls('/')"]
    M --> N["For each (path, version)"]
    N --> O{"path endswith HEADER_FILE?"}
    O -- "Yes" --> P["add_database(name from path, version, repository)"]
    O -- "No" --> Q["Skip entry"]
    P --> R["Next entry"]
    Q --> R
    R --> L

    L --> S["Return list of available databases and versions"]
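
The "No" branch of the flowchart can be sketched as a filter over (path, version) entries. This is a minimal sketch under assumptions: the header file name `db.yaml`, the path layout `/<name>/db.yaml`, the function name `databases_from_listing`, and the sample entries are all hypothetical stand-ins for what backend_interface.ls('/') might return.

```python
HEADER_FILE = "db.yaml"  # assumed header file name


def databases_from_listing(entries):
    """Collect (name, version) pairs from (path, version) entries.

    Only entries whose path ends with the header file are kept.
    """
    found = []
    for path, version in entries:
        if not path.endswith(HEADER_FILE):
            continue  # skip everything that is not a header file
        # Assumed layout "/<name>/db.yaml": the name is the first path part
        name = path.split("/")[1]
        found.append((name, version))
    return found


# Hypothetical listing, standing in for backend_interface.ls("/")
entries = [
    ("/mydb/db.yaml", "1.0.0"),
    ("/mydb/db.yaml", "1.1.0"),
    ("/mydb/meta/files.csv", "1.1.0"),
]
print(databases_from_listing(entries))
# [('mydb', '1.0.0'), ('mydb', '1.1.0')]
```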

File-Level Changes

Change: Optimize S3/MinIO backend path discovery by avoiding per-version header existence checks and relying solely on listed sub-folders.
Details:
  • Iterate over top-level objects from backend._client.list_objects(repository.name) and compute database name from object_name
  • Query sub-folders for each database with backend._client.list_objects(repository.name, obj.object_name) instead of recursively scanning entire bucket
  • Derive version from each sub-folder path, skip reserved folders (attachment, media, meta), and register all remaining versions via add_database() without calling backend.exists()
Files: audb/core/api.py

Change: Align Artifactory backend version detection logic with the folder-based approach used for S3/MinIO by relying on backend_interface.ls('/') results.
Details:
  • Retain code path using backend_interface.ls('/') for non-S3/MinIO backends, implicitly treating folder listings as source of truth for available versions
  • Remove dependence on HEADER_FILE existence checks for determining available versions, as header lookups are no longer used in the optimized path
Files: audb/core/api.py
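
The folder-based discovery described for the S3/MinIO path can be sketched as follows. This is not the code from the PR: `FakeClient` and `FakeObject` are hypothetical in-memory stand-ins for a client whose list_objects(bucket_name, prefix) yields objects with an `object_name` attribute, and `discover_versions` is an illustrative name. The sample keys include a version folder without a header file to show the risk mentioned in the PR description.

```python
from dataclasses import dataclass

RESERVED = {"attachment", "media", "meta"}


@dataclass
class FakeObject:
    """Stand-in for the objects yielded by list_objects()."""

    object_name: str


class FakeClient:
    """Hypothetical in-memory client with a MinIO-like list_objects()."""

    def __init__(self, keys):
        self.keys = keys

    def list_objects(self, bucket_name, prefix=""):
        # Non-recursive listing: yield each distinct entry one level below prefix
        seen = []
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            head, sep, _ = key[len(prefix):].partition("/")
            entry = prefix + head + sep
            if entry not in seen:
                seen.append(entry)
                yield FakeObject(entry)


def discover_versions(client, bucket_name):
    """Return (name, version) pairs derived from folder names only."""
    found = []
    for obj in client.list_objects(bucket_name):
        name = obj.object_name.rstrip("/")
        for sub in client.list_objects(bucket_name, obj.object_name):
            version = sub.object_name.split("/")[1]
            if version not in RESERVED:
                # No header file check: every remaining folder counts
                found.append((name, version))
    return found


client = FakeClient(
    [
        "mydb/0.9.0/partial.tmp",  # version folder without header file
        "mydb/1.0.0/db.yaml",
        "mydb/media/1.0.0/a.wav",
        "mydb/meta/1.0.0/files.csv",
        "other/2.0.0/db.yaml",
    ]
)
print(discover_versions(client, "repo"))
# [('mydb', '0.9.0'), ('mydb', '1.0.0'), ('other', '2.0.0')]
```

Note that in this sketch version 0.9.0 is reported although its folder contains no header file, which is the trade-off accepted for the faster listing.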

hagenw commented Apr 22, 2026

I checked that, for the repository audb.Repository('audb-internal', 's3.dualstack.eu-north-1.amazonaws.com', 's3'), audb.available() returns the same number of entries on main and on this branch.
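
An equal number of entries does not by itself show that the entries are the same, so a stricter check could compare the (name, version) pairs from both runs. The following is only a sketch: `compare_listings` is a hypothetical helper, and the two sample lists are made-up data standing in for pairs extracted from the audb.available() results on each branch.

```python
def compare_listings(main, branch):
    """Return the entries found only on main and only on the branch."""
    main_set, branch_set = set(main), set(branch)
    only_main = sorted(main_set - branch_set)
    only_branch = sorted(branch_set - main_set)
    return only_main, only_branch


# Made-up (name, version) pairs for illustration
main_entries = [("mydb", "1.0.0"), ("mydb", "1.1.0")]
branch_entries = [("mydb", "1.0.0"), ("mydb", "1.1.0")]

only_main, only_branch = compare_listings(main_entries, branch_entries)
print(only_main, only_branch)
# [] []
```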
