
feat: add create_fasta_and_index tool #11

Open
ameynert wants to merge 4 commits into main from am_10_create_fasta_and_index

Conversation

@ameynert
Collaborator

@ameynert ameynert commented Apr 10, 2026

Summary

  • Ports create_fasta_and_index.py from human-diversity-reference/scripts as a defopt-compatible toolkit tool
  • Generates DivRef FASTA sequences with flanking reference context and a DuckDB index for use by remap_divref
  • Reuses get_haplo_sequence and split_haplotypes from divref.haplotype
  • Fixes a loop-variable closure bug (B023), renames chr loop variable to chrom to avoid shadowing the built-in, and replaces print/typer.echo with logging
  • Adds tests/tools/test_create_fasta_and_index.py with happy-path tests for get_haplo_sequence and split_haplotypes (Hail JVM test marked skip)
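The B023 fix mentioned above is the classic late-binding closure pattern: a function defined inside a loop captures the loop variable by reference, so every closure sees the final value. A minimal illustration (the names here are illustrative, not the tool's actual code):

```python
# Buggy pattern flagged by ruff B023: each lambda captures `chrom` by
# reference, so all of them report the last value of the loop.
callbacks = []
for chrom in ["chr1", "chr2", "chr3"]:
    callbacks.append(lambda: chrom)

assert [cb() for cb in callbacks] == ["chr3", "chr3", "chr3"]

# Fix: bind the current value at definition time via a default argument.
fixed = []
for chrom in ["chr1", "chr2", "chr3"]:
    fixed.append(lambda chrom=chrom: chrom)

assert [cb() for cb in fixed] == ["chr1", "chr2", "chr3"]
```

The default-argument binding is evaluated once per loop iteration, which is what makes each closure keep its own value.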

Test plan

  • uv run --directory divref poe check-all passes
  • uv run --directory divref pytest tests/tools/test_create_fasta_and_index.py — all non-skip tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added CLI command to generate FASTA outputs and searchable indexes from haplotype data for remapping workflows.
    • Added CLI subcommand to run downstream remapping workflows.
  • Tests

    • Added test coverage for haplotype sequence generation edge cases.

@ameynert ameynert had a problem deploying to github-actions-snakemake-linting April 10, 2026 23:11 — with GitHub Actions Failure
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f901fb75-ef60-47a7-a8f1-0dad00d02163

📥 Commits

Reviewing files that changed from the base of the PR and between ee5ccc4 and 2fc7a93.

📒 Files selected for processing (1)
  • divref/divref/tools/create_fasta_and_index.py
✅ Files skipped from review due to trivial changes (1)
  • divref/divref/tools/create_fasta_and_index.py

📝 Walkthrough

Walkthrough

Adds a new CLI subcommand create_fasta_and_index (and registers remap_divref), implements a pipeline to produce FASTA files and a DuckDB index from haplotype Hail tables, and adds unit tests for haplotype sequence edge cases.

Changes

Cohort / File(s) Summary
CLI Tool Registration
divref/divref/main.py
Imports create_fasta_and_index and remap_divref and appends them to the internal _tools list passed to defopt.run(), exposing new subcommands.
Haplotype → FASTA + Index Tool
divref/divref/tools/create_fasta_and_index.py
New module adding create_fasta_and_index and helpers: Hail init/load, haplotype filtering/merge, window split & dedupe, sequence construction (get_haplo_sequence), checkpoint/export to TSV, FASTA writing (single or per-contig), and DuckDB index creation with sequences and metadata tables.
Unit Tests
divref/tests/tools/test_create_fasta_and_index.py
New tests for get_haplo_sequence edge cases (SNP/ins/del) that mock hail.get_sequence and validate haplotype sequence output via hl.eval.

Sequence Diagram(s)

sequenceDiagram
    box rgba(200,200,255,0.5)
    participant CLI
    end
    box rgba(200,255,200,0.5)
    participant Hail
    end
    box rgba(255,200,200,0.5)
    participant FS
    end
    box rgba(255,255,200,0.5)
    participant DuckDB
    end

    CLI->>Hail: load haplotypes table & gnomAD VA
    CLI->>Hail: register reference FASTA
    Hail->>Hail: compute per-haplotype metrics, filter, split windows
    Hail->>Hail: dedupe, build sequences & variant strings, assign sequence_id
    Hail->>FS: export TSV / DataFrame
    CLI->>FS: write FASTA file(s) (single or per-contig)
    CLI->>DuckDB: create/replace .duckdb and insert sequences + metadata

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I hopped through tables, fields, and strands,
Turned haplotypes into shiny bands.
With Hail I pranced and DuckDB sang,
FASTA footprints stitched each little tang.
A rabbit cheers — index made, bells rang!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new create_fasta_and_index tool to the CLI.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.




@ameynert ameynert force-pushed the am_09_compute_variation_ratios branch from 0fe6bba to 2e3ca4f Compare April 10, 2026 23:17
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from 19bf206 to 6f51106 Compare April 10, 2026 23:17
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 10, 2026 23:17 — with GitHub Actions Inactive
@ameynert ameynert force-pushed the am_09_compute_variation_ratios branch from 2e3ca4f to c0ec4e9 Compare April 16, 2026 23:40
Base automatically changed from am_09_compute_variation_ratios to main April 16, 2026 23:54
ameynert and others added 3 commits April 16, 2026 16:54
Port create_fasta_and_index.py from human-diversity-reference/scripts as a
defopt-compatible toolkit tool. Generates DivRef FASTA sequences with
flanking reference context and a DuckDB index for use by remap_divref.
Reuses get_haplo_sequence and split_haplotypes from divref.haplotype.
Renames the chr loop variable to chrom to avoid shadowing the built-in,
fixes a loop-variable closure bug (B023), and replaces print/typer.echo
with logging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/tools/__init__.py to create the package and
test_create_fasta_and_index.py with happy-path tests for
get_haplo_sequence and split_haplotypes. The Hail JVM test
is marked skip; the remaining tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from 6f51106 to eafe72a Compare April 16, 2026 23:54
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 16, 2026 23:54 — with GitHub Actions Inactive
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
divref/divref/tools/create_fasta_and_index.py (1)

55-180: Consider splitting this orchestration into smaller helpers.

The function currently mixes Hail transforms, export, FASTA writing, and DuckDB materialization in one block. Extracting pipeline stages into small helpers would improve readability and reduce risk during future edits.

As per coding guidelines: “Extract logic into small–medium functions with clear inputs/outputs”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/create_fasta_and_index.py` around lines 55 - 180, The
current create_fasta_and_index orchestration mixes Hail transforms (ht building
with split_haplotypes, get_haplo_sequence), export (ht.select().export -> df),
FASTA writing, and DuckDB materialization in one long block; refactor by
extracting at least four helpers with clear inputs/outputs: (1)
build_haplotype_table(haplotypes_table_path, gnomad_va_file, window_size,
frequency_cutoff, merge, version_str) that performs the Hail pipeline (reading
tables, annotations, split_haplotypes, get_haplo_sequence, checkpoint) and
returns the final ht or path to checkpointed HT, (2) export_ht_to_tsv_bgz(ht,
output_base, file_suffix, pops_legend) which does the ht.select(...).export and
returns the polars DataFrame, (3) write_fasta_files(df, output_base,
file_suffix, split_contigs) which writes FASTA files (the split_contigs logic),
and (4) create_duckdb_index(df, output_base, file_suffix, window_size,
pops_legend, version_str) which creates the duckdb file and tables; wire these
helpers together in the original routine, preserve existing names like
split_haplotypes, get_haplo_sequence, ht, pops_legend, and ensure each helper
has minimal side effects and explicit return values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/tools/create_fasta_and_index.py`:
- Around line 17-29: Add unit tests exercising the new public function
create_fasta_and_index: one happy-path test that supplies valid
haplotypes_table_path, gnomad_va_file, reference_fasta, window_size, output_base
and version_str (and exercise both merge=True and split_contigs=True variants)
asserting expected output files are created and indexed; and at least one
error-path test that calls create_fasta_and_index with invalid inputs (e.g.,
non-existent haplotypes_table_path or an out-of-range frequency_cutoff like -0.1
or >1.0) and asserts it raises the expected exception. Use the
create_fasta_and_index symbol to locate the implementation and mock or create
temporary HailPath-like test fixtures (tmp_dir/output_base) so tests are
deterministic; ensure coverage includes the merge branch and invalid-parameter
validation branches.
- Around line 67-75: The calculation fraction_phased = ht.max_empirical_AF /
ht.min_variant_frequency can divide by zero; update the computation in the block
that computes fraction_phased and annotates estimated_gnomad_AF so you guard
against ht.min_variant_frequency == 0 (or missing) by computing fraction_phased
with a conditional (e.g., use hl.if_else or equivalent to return a safe default
or missing when min_variant_frequency <= 0), then use that guarded
fraction_phased when annotating estimated_gnomad_AF and ensure downstream
filtering (ht = ht.filter(ht.estimated_gnomad_AF >= frequency_cutoff)) behaves
correctly with the chosen default/missing value; reference the fraction_phased
variable, ht.min_variant_frequency, ht.max_empirical_AF, the annotate call that
sets estimated_gnomad_AF, and the subsequent filter to implement this fix.
- Around line 177-179: The three con.execute calls misuse f-string
interpolation: replace direct interpolation with parameterized queries for
window_size and version_str (use bound parameters in con.execute) and serialize
pops_legend before storing (e.g., json.dumps(pops_legend) to create a safe SQL
text value or convert to a valid SQL array literal) then pass that serialized
string as a bound parameter when creating the pops_legend table; update the
CREATE TABLE statements that reference window_size, pops_legend, and VERSION to
accept the bound parameters and use the serialized pops_legend so you avoid
malformed SQL and injection risks.
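The division-by-zero guard requested for `fraction_phased` can be illustrated in plain Python. This is an analogue, not the real Hail code; in Hail the same shape would use `hl.if_else` over the table fields:

```python
def guarded_fraction_phased(max_empirical_af, min_variant_frequency):
    # Plain-Python analogue of the suggested guard. In Hail this would be
    # hl.if_else(ht.min_variant_frequency > 0,
    #            ht.max_empirical_AF / ht.min_variant_frequency,
    #            hl.missing(hl.tfloat64)).
    # None stands in for a Hail missing value here.
    if min_variant_frequency is None or min_variant_frequency <= 0:
        return None
    return max_empirical_af / min_variant_frequency

assert guarded_fraction_phased(0.02, 0.01) == 2.0
assert guarded_fraction_phased(0.02, 0.0) is None
```

In Hail, a row whose guarded value is missing fails the subsequent `estimated_gnomad_AF >= frequency_cutoff` filter (filter predicates that evaluate to missing drop the row), which is the downstream behavior the comment asks to verify.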


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7d920d4-9ace-4ed2-a631-69763bcc9114

📥 Commits

Reviewing files that changed from the base of the PR and between 316c402 and eafe72a.

📒 Files selected for processing (4)
  • divref/divref/main.py
  • divref/divref/tools/create_fasta_and_index.py
  • divref/tests/tools/__init__.py
  • divref/tests/tools/test_create_fasta_and_index.py

Comment on lines +17 to +29
def create_fasta_and_index(
    *,
    haplotypes_table_path: HailPath,
    gnomad_va_file: HailPath,
    reference_fasta: HailPath,
    window_size: int,
    output_base: HailPath,
    version_str: str,
    merge: bool = False,
    frequency_cutoff: float = 0.005,
    split_contigs: bool = False,
    tmp_dir: HailPath = "/tmp",
) -> None:
Contributor


⚠️ Potential issue | 🟠 Major

Add direct tests for the new public CLI function (create_fasta_and_index).

This PR adds a new public function, but the added tests target get_haplo_sequence only. Please add at least one happy-path and one error-path test for this function (e.g., invalid input path / invalid cutoff / merge-path behavior).

As per coding guidelines: “**/*.py: New public functions require at least one happy-path test + one error case; bug fixes should include regression test”.
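An error-path test of the kind requested could look like the sketch below. Since the real function needs Hail, this example tests a hypothetical stand-in validator, `validate_frequency_cutoff`, to keep the sketch self-contained; the real test would call `create_fasta_and_index` with the bad value and assert the raised exception:

```python
def validate_frequency_cutoff(frequency_cutoff):
    # Hypothetical validation helper, assumed for this sketch only.
    if not 0.0 <= frequency_cutoff <= 1.0:
        raise ValueError(
            f"frequency_cutoff must be in [0, 1], got {frequency_cutoff}"
        )
    return frequency_cutoff

def test_rejects_out_of_range_cutoff():
    # Each out-of-range value must raise; reaching the AssertionError
    # means the validator silently accepted a bad cutoff.
    for bad in (-0.1, 1.5):
        try:
            validate_frequency_cutoff(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad}")

test_rejects_out_of_range_cutoff()
```

With pytest, the try/except loop collapses to `pytest.raises(ValueError)` around each call.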


@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 03:24 — with GitHub Actions Inactive
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from ee5ccc4 to eafe72a Compare April 17, 2026 03:27
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 03:27 — with GitHub Actions Inactive
- Guard against division by zero when min_variant_frequency <= 0; log warning
  with count of removed haplotypes
- Fix SQL interpolation in DuckDB metadata tables: use parameterized queries
  for window_size and version_str, serialize pops_legend with json.dumps
- Refactor monolithic function into four helpers: build_haplotype_table,
  export_ht_to_dataframe, write_fasta_files, create_duckdb_index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
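The parameterized-query fix described in this commit follows the pattern below, shown with sqlite3 (stdlib) as a stand-in since DuckDB's Python `con.execute` accepts `?` placeholders the same way. The table layout and values here are illustrative, not the tool's real schema:

```python
import json
import sqlite3

pops_legend = ["afr", "amr", "eas", "nfe", "sas"]  # illustrative values
window_size = 10000
version_str = "v1"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (key TEXT, value TEXT)")
# Bound parameters instead of f-string interpolation: no malformed SQL,
# no injection risk.
con.execute("INSERT INTO metadata VALUES (?, ?)", ("window_size", str(window_size)))
con.execute("INSERT INTO metadata VALUES (?, ?)", ("version", version_str))
# Serialize the list so it round-trips as a single safe text value.
con.execute("INSERT INTO metadata VALUES (?, ?)", ("pops_legend", json.dumps(pops_legend)))

row = con.execute(
    "SELECT value FROM metadata WHERE key = ?", ("pops_legend",)
).fetchone()
assert json.loads(row[0]) == pops_legend
```

Only the SQL text is static; every runtime value travels as a bound parameter, and `json.dumps`/`json.loads` handle the list round-trip.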
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 04:15 — with GitHub Actions Inactive