feat(prepro): Host name validation #6203

Merged
maverbiest merged 67 commits into main from prepro-hostname-validation
Apr 9, 2026

Conversation

@maverbiest maverbiest commented Mar 27, 2026

Adding host validation to preprocessing

Implementation

Host validation is implemented across three new processing functions:

validate_host takes unvalidated input, calls out to the taxonomy service to validate it, and returns a taxon ID when validation succeeds. If the input casts to an int, it is assumed to be a taxon ID. Otherwise, it is assumed to be a scientific name. Responses to successful validations are cached in taxon_cache (see below).

It is possible that multiple taxa have the same scientific name. In these cases,
we return the tax_id of the most generic taxon (i.e., the one that's closest to
the root of the taxonomy).

If the taxon ID or host name does not exist, we return None and add a warning (if is_insdc_ingest_group) or an error (for everyone else).

    @staticmethod
    def validate_host(
        input_data: InputMetadata,
        output_field: str,
        input_fields: list[str],
        args: FunctionArgs,
    ) -> ProcessingResult:
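
The decision logic described for validate_host can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `Taxon`, `classify_host_input`, `pick_most_generic`, `severity_for_missing_host`, and the `depth` field (distance from the taxonomy root) are hypothetical names, and the real function queries the taxonomy service rather than working on in-memory records.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    """Hypothetical minimal taxon record; `depth` is the distance from the root."""

    tax_id: int
    scientific_name: str
    depth: int


def classify_host_input(raw: str) -> tuple[str, int | str]:
    """Treat input that casts to int as a taxon ID, anything else as a scientific name."""
    text = raw.strip()
    try:
        return ("tax_id", int(text))
    except ValueError:
        return ("scientific_name", text)


def pick_most_generic(candidates: list[Taxon]) -> int | None:
    """Return the tax_id of the candidate closest to the root, or None if there is none."""
    if not candidates:
        return None
    return min(candidates, key=lambda taxon: taxon.depth).tax_id


def severity_for_missing_host(is_insdc_ingest_group: bool) -> str:
    """Unknown hosts are a warning for the INSDC ingest group and an error otherwise."""
    return "warning" if is_insdc_ingest_group else "error"
```

For example, `classify_host_input("9606")` yields `("tax_id", 9606)`, while `classify_host_input("Homo sapiens")` is routed to the scientific-name lookup.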

scientific_name_from_id takes a validated taxon ID and maps it to a scientific name. Responses to successful validations are cached in taxon_cache (see below).

    @staticmethod
    def scientific_name_from_id(
        input_data: InputMetadata,
        output_field: str,
        input_fields: list[str],
        args: FunctionArgs,
    ) -> ProcessingResult:

common_name_from_id takes a validated taxon ID and maps it to a common name. If the input taxon has a common name itself, that name is returned. If it does not, the common name of the nearest ancestor that has one is returned instead. Responses to successful validations are cached in common_name_cache (see below).

    @staticmethod
    def common_name_from_id(
        input_data: InputMetadata,
        output_field: str,
        input_fields: list[str],
        args: FunctionArgs,
    ) -> ProcessingResult:
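
The ancestor fallback for common names amounts to a walk towards the root of the taxonomy. The sketch below shows that idea on plain dictionaries. `nearest_common_name`, `parent_of`, and `common_names` are hypothetical names chosen for illustration, and the PR itself resolves this through the taxonomy service rather than local lookups.

```python
from __future__ import annotations


def nearest_common_name(
    tax_id: int,
    parent_of: dict[int, int],
    common_names: dict[int, str],
) -> str | None:
    """Walk from tax_id towards the root and return the first common name found."""
    current: int | None = tax_id
    seen: set[int] = set()
    while current is not None and current not in seen:
        if current in common_names:
            return common_names[current]
        seen.add(current)  # guards against cycles, e.g. a root that is its own parent
        current = parent_of.get(current)
    return None
```

With the made-up tree `parent_of = {30: 20, 20: 10, 10: 10}` and `common_names = {20: "mammals"}`, taxon 30 has no common name of its own, so the name of its parent 20 is returned.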

Caching

Successful requests to the taxonomy service are cached. Since the number of distinct host organisms in a dataset is expected to be relatively small, this reduces the number of network requests needed at the cost of keeping a small cache in memory.

The caching approach is implemented in the new class RequestCache:

class RequestCache:
    """Class for caching requests to external services during preprocessing.

    Keys are the fully formatted URLs that have already been used to make successful requests.
    Values are requests.Response as they were returned by the service.
    """

This is now used to cache requests to the taxonomy service:

taxonomy_cache = RequestCache(max_size=64)
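
To make the idea of a size-bounded, URL-keyed cache concrete, here is a minimal standalone sketch. It is not the RequestCache from this PR: only the `max_size` parameter appears in the description above, while the class name `BoundedUrlCache`, the `get`/`put` methods, and the least-recently-used eviction policy are assumptions made for illustration.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Any


class BoundedUrlCache:
    """Illustrative stand-in for RequestCache: URL -> response, at most max_size entries."""

    def __init__(self, max_size: int = 64) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries: OrderedDict[str, Any] = OrderedDict()

    def get(self, url: str) -> Any | None:
        """Return the cached response for url, or None on a cache miss."""
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url: str, response: Any) -> None:
        """Store a response; the caller is expected to pass only successful ones."""
        self._entries[url] = response
        self._entries.move_to_end(url)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A caller would check `get(url)` first and only issue the network request, followed by `put(url, response)`, on a miss with a successful response.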

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://prepro-hostname-validatio.loculus.org

@claude (bot) added the preprocessing and deployment labels Mar 27, 2026
@maverbiest added the preview label Mar 27, 2026
@maverbiest added and removed the preview label Mar 27, 2026
@maverbiest added the preview label Mar 30, 2026
@anna-parker

We tested this on staging and it is working well. These functions should only be used in PPX after the originally submitted metadata has been migrated to handle cases where the scientific name is invalid. We should wait until #6254 is merged to do this.

@maverbiest maverbiest merged commit 9329058 into main Apr 9, 2026
41 of 42 checks passed
@maverbiest maverbiest deleted the prepro-hostname-validation branch April 9, 2026 14:42
Labels

deployment: Code changes targeting the deployment infrastructure
preprocessing: Issues related to the preprocessing component
preview: Triggers a deployment to argocd

2 participants