
feat: add create_fasta_and_index tool #11

Open
ameynert wants to merge 4 commits into main from am_10_create_fasta_and_index

Conversation

@ameynert
Collaborator

@ameynert ameynert commented Apr 10, 2026

Summary

  • Ports create_fasta_and_index.py from human-diversity-reference/scripts as a defopt-compatible toolkit tool
  • Generates DivRef FASTA sequences with flanking reference context and a DuckDB index for use by remap_divref
  • Reuses get_haplo_sequence and split_haplotypes from divref.haplotype
  • Fixes a loop-variable closure bug (B023), renames chr loop variable to chrom to avoid shadowing the built-in, and replaces print/typer.echo with logging
  • Adds tests/tools/test_create_fasta_and_index.py with happy-path tests for get_haplo_sequence and split_haplotypes (Hail JVM test marked skip)
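The B023 fix mentioned above is the classic late-binding closure pattern: a function defined inside a loop captures the loop variable by reference, so every closure sees the final value. A minimal illustration (the names here are illustrative, not the tool's actual code):

```python
# Buggy pattern flagged by ruff B023: each lambda captures `chrom` by
# reference, so all of them report the last value of the loop.
callbacks = []
for chrom in ["chr1", "chr2", "chr3"]:
    callbacks.append(lambda: chrom)

assert [cb() for cb in callbacks] == ["chr3", "chr3", "chr3"]

# Fix: bind the current value at definition time via a default argument.
fixed = []
for chrom in ["chr1", "chr2", "chr3"]:
    fixed.append(lambda chrom=chrom: chrom)

assert [cb() for cb in fixed] == ["chr1", "chr2", "chr3"]
```

The default-argument binding is evaluated once per loop iteration, which is what makes each closure keep its own value.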

Test plan

  • uv run --directory divref poe check-all passes
  • uv run --directory divref pytest tests/tools/test_create_fasta_and_index.py — all non-skip tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added CLI command to generate FASTA outputs and searchable indexes from haplotype data for remapping workflows.
    • Added CLI subcommand to run downstream remapping workflows.
  • Tests

    • Added test coverage for haplotype sequence generation edge cases.

@ameynert ameynert had a problem deploying to github-actions-snakemake-linting April 10, 2026 23:11 — with GitHub Actions Failure
@coderabbitai
Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f901fb75-ef60-47a7-a8f1-0dad00d02163

📥 Commits

Reviewing files that changed from the base of the PR and between ee5ccc4 and 2fc7a93.

📒 Files selected for processing (1)
  • divref/divref/tools/create_fasta_and_index.py
✅ Files skipped from review due to trivial changes (1)
  • divref/divref/tools/create_fasta_and_index.py

📝 Walkthrough

Walkthrough

Adds a new CLI subcommand create_fasta_and_index (and registers remap_divref), implements a pipeline to produce FASTA files and a DuckDB index from haplotype Hail tables, and adds unit tests for haplotype sequence edge cases.

Changes

Cohort / File(s) Summary
CLI Tool Registration
divref/divref/main.py
Imports create_fasta_and_index and remap_divref and appends them to the internal _tools list passed to defopt.run(), exposing new subcommands.
Haplotype → FASTA + Index Tool
divref/divref/tools/create_fasta_and_index.py
New module adding create_fasta_and_index and helpers: Hail init/load, haplotype filtering/merge, window split & dedupe, sequence construction (get_haplo_sequence), checkpoint/export to TSV, FASTA writing (single or per-contig), and DuckDB index creation with sequences and metadata tables.
Unit Tests
divref/tests/tools/test_create_fasta_and_index.py
New tests for get_haplo_sequence edge cases (SNP/ins/del) that mock hail.get_sequence and validate haplotype sequence output via hl.eval.

Sequence Diagram(s)

sequenceDiagram
    box rgba(200,200,255,0.5)
    participant CLI
    end
    box rgba(200,255,200,0.5)
    participant Hail
    end
    box rgba(255,200,200,0.5)
    participant FS
    end
    box rgba(255,255,200,0.5)
    participant DuckDB
    end

    CLI->>Hail: load haplotypes table & gnomAD VA
    CLI->>Hail: register reference FASTA
    Hail->>Hail: compute per-haplotype metrics, filter, split windows
    Hail->>Hail: dedupe, build sequences & variant strings, assign sequence_id
    Hail->>FS: export TSV / DataFrame
    CLI->>FS: write FASTA file(s) (single or per-contig)
    CLI->>DuckDB: create/replace .duckdb and insert sequences + metadata

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I hopped through tables, fields, and strands,
Turned haplotypes into shiny bands.
With Hail I pranced and DuckDB sang,
FASTA footprints stitched each little tang.
A rabbit cheers — index made, bells rang!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a new create_fasta_and_index tool to the CLI.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.




@ameynert ameynert force-pushed the am_09_compute_variation_ratios branch from 0fe6bba to 2e3ca4f Compare April 10, 2026 23:17
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from 19bf206 to 6f51106 Compare April 10, 2026 23:17
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 10, 2026 23:17 — with GitHub Actions Inactive
@ameynert ameynert force-pushed the am_09_compute_variation_ratios branch from 2e3ca4f to c0ec4e9 Compare April 16, 2026 23:40
Base automatically changed from am_09_compute_variation_ratios to main April 16, 2026 23:54
ameynert and others added 3 commits April 16, 2026 16:54
Port create_fasta_and_index.py from human-diversity-reference/scripts as a
defopt-compatible toolkit tool. Generates DivRef FASTA sequences with
flanking reference context and a DuckDB index for use by remap_divref.
Reuses get_haplo_sequence and split_haplotypes from divref.haplotype.
Renames the chr loop variable to chrom to avoid shadowing the built-in,
fixes a loop-variable closure bug (B023), and replaces print/typer.echo
with logging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/tools/__init__.py to create the package and
test_create_fasta_and_index.py with happy-path tests for
get_haplo_sequence and split_haplotypes. The Hail JVM test
is marked skip; the remaining tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from 6f51106 to eafe72a Compare April 16, 2026 23:54
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 16, 2026 23:54 — with GitHub Actions Inactive
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
divref/divref/tools/create_fasta_and_index.py (1)

55-180: Consider splitting this orchestration into smaller helpers.

The function currently mixes Hail transforms, export, FASTA writing, and DuckDB materialization in one block. Extracting pipeline stages into small helpers would improve readability and reduce risk during future edits.

As per coding guidelines: “Extract logic into small–medium functions with clear inputs/outputs”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/tools/create_fasta_and_index.py` around lines 55 - 180, The
current create_fasta_and_index orchestration mixes Hail transforms (ht building
with split_haplotypes, get_haplo_sequence), export (ht.select().export -> df),
FASTA writing, and DuckDB materialization in one long block; refactor by
extracting at least four helpers with clear inputs/outputs: (1)
build_haplotype_table(haplotypes_table_path, gnomad_va_file, window_size,
frequency_cutoff, merge, version_str) that performs the Hail pipeline (reading
tables, annotations, split_haplotypes, get_haplo_sequence, checkpoint) and
returns the final ht or path to checkpointed HT, (2) export_ht_to_tsv_bgz(ht,
output_base, file_suffix, pops_legend) which does the ht.select(...).export and
returns the polars DataFrame, (3) write_fasta_files(df, output_base,
file_suffix, split_contigs) which writes FASTA files (the split_contigs logic),
and (4) create_duckdb_index(df, output_base, file_suffix, window_size,
pops_legend, version_str) which creates the duckdb file and tables; wire these
helpers together in the original routine, preserve existing names like
split_haplotypes, get_haplo_sequence, ht, pops_legend, and ensure each helper
has minimal side effects and explicit return values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/tools/create_fasta_and_index.py`:
- Around line 17-29: Add unit tests exercising the new public function
create_fasta_and_index: one happy-path test that supplies valid
haplotypes_table_path, gnomad_va_file, reference_fasta, window_size, output_base
and version_str (and exercise both merge=True and split_contigs=True variants)
asserting expected output files are created and indexed; and at least one
error-path test that calls create_fasta_and_index with invalid inputs (e.g.,
non-existent haplotypes_table_path or an out-of-range frequency_cutoff like -0.1
or >1.0) and asserts it raises the expected exception. Use the
create_fasta_and_index symbol to locate the implementation and mock or create
temporary HailPath-like test fixtures (tmp_dir/output_base) so tests are
deterministic; ensure coverage includes the merge branch and invalid-parameter
validation branches.
- Around line 67-75: The calculation fraction_phased = ht.max_empirical_AF /
ht.min_variant_frequency can divide by zero; update the computation in the block
that computes fraction_phased and annotates estimated_gnomad_AF so you guard
against ht.min_variant_frequency == 0 (or missing) by computing fraction_phased
with a conditional (e.g., use hl.if_else or equivalent to return a safe default
or missing when min_variant_frequency <= 0), then use that guarded
fraction_phased when annotating estimated_gnomad_AF and ensure downstream
filtering (ht = ht.filter(ht.estimated_gnomad_AF >= frequency_cutoff)) behaves
correctly with the chosen default/missing value; reference the fraction_phased
variable, ht.min_variant_frequency, ht.max_empirical_AF, the annotate call that
sets estimated_gnomad_AF, and the subsequent filter to implement this fix.
- Around line 177-179: The three con.execute calls misuse f-string
interpolation: replace direct interpolation with parameterized queries for
window_size and version_str (use bound parameters in con.execute) and serialize
pops_legend before storing (e.g., json.dumps(pops_legend) to create a safe SQL
text value or convert to a valid SQL array literal) then pass that serialized
string as a bound parameter when creating the pops_legend table; update the
CREATE TABLE statements that reference window_size, pops_legend, and VERSION to
accept the bound parameters and use the serialized pops_legend so you avoid
malformed SQL and injection risks.
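The division-by-zero guard requested for `fraction_phased` can be illustrated in plain Python. This is an analogue, not the real Hail code; in Hail the same shape would use `hl.if_else` over the table fields:

```python
def guarded_fraction_phased(max_empirical_af, min_variant_frequency):
    # Plain-Python analogue of the suggested guard. In Hail this would be
    # hl.if_else(ht.min_variant_frequency > 0,
    #            ht.max_empirical_AF / ht.min_variant_frequency,
    #            hl.missing(hl.tfloat64)).
    # None stands in for a Hail missing value here.
    if min_variant_frequency is None or min_variant_frequency <= 0:
        return None
    return max_empirical_af / min_variant_frequency

assert guarded_fraction_phased(0.02, 0.01) == 2.0
assert guarded_fraction_phased(0.02, 0.0) is None
```

In Hail, a row whose guarded value is missing fails the subsequent `estimated_gnomad_AF >= frequency_cutoff` filter (filter predicates that evaluate to missing drop the row), which is the downstream behavior the comment asks to verify.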


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7d920d4-9ace-4ed2-a631-69763bcc9114

📥 Commits

Reviewing files that changed from the base of the PR and between 316c402 and eafe72a.

📒 Files selected for processing (4)
  • divref/divref/main.py
  • divref/divref/tools/create_fasta_and_index.py
  • divref/tests/tools/__init__.py
  • divref/tests/tools/test_create_fasta_and_index.py

Comment on lines +17 to +29
def create_fasta_and_index(
    *,
    haplotypes_table_path: HailPath,
    gnomad_va_file: HailPath,
    reference_fasta: HailPath,
    window_size: int,
    output_base: HailPath,
    version_str: str,
    merge: bool = False,
    frequency_cutoff: float = 0.005,
    split_contigs: bool = False,
    tmp_dir: HailPath = "/tmp",
) -> None:
Contributor


⚠️ Potential issue | 🟠 Major

Add direct tests for the new public CLI function (create_fasta_and_index).

This PR adds a new public function, but the added tests target get_haplo_sequence only. Please add at least one happy-path and one error-path test for this function (e.g., invalid input path / invalid cutoff / merge-path behavior).

As per coding guidelines: “**/*.py: New public functions require at least one happy-path test + one error case; bug fixes should include regression test”.
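An error-path test of the kind requested could look like the sketch below. Since the real function needs Hail, this example tests a hypothetical stand-in validator, `validate_frequency_cutoff`, to keep the sketch self-contained; the real test would call `create_fasta_and_index` with the bad value and assert the raised exception:

```python
def validate_frequency_cutoff(frequency_cutoff):
    # Hypothetical validation helper, assumed for this sketch only.
    if not 0.0 <= frequency_cutoff <= 1.0:
        raise ValueError(
            f"frequency_cutoff must be in [0, 1], got {frequency_cutoff}"
        )
    return frequency_cutoff

def test_rejects_out_of_range_cutoff():
    # Each out-of-range value must raise; reaching the AssertionError
    # means the validator silently accepted a bad cutoff.
    for bad in (-0.1, 1.5):
        try:
            validate_frequency_cutoff(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad}")

test_rejects_out_of_range_cutoff()
```

With pytest, the try/except loop collapses to `pytest.raises(ValueError)` around each call.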


@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 03:24 — with GitHub Actions Inactive
@ameynert ameynert force-pushed the am_10_create_fasta_and_index branch from ee5ccc4 to eafe72a Compare April 17, 2026 03:27
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 03:27 — with GitHub Actions Inactive
- Guard against division by zero when min_variant_frequency <= 0; log warning
  with count of removed haplotypes
- Fix SQL interpolation in DuckDB metadata tables: use parameterized queries
  for window_size and version_str, serialize pops_legend with json.dumps
- Refactor monolithic function into four helpers: build_haplotype_table,
  export_ht_to_dataframe, write_fasta_files, create_duckdb_index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
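The parameterized-query fix described in this commit follows the pattern below, shown with sqlite3 (stdlib) as a stand-in since DuckDB's Python `con.execute` accepts `?` placeholders the same way. The table layout and values here are illustrative, not the tool's real schema:

```python
import json
import sqlite3

pops_legend = ["afr", "amr", "eas", "nfe", "sas"]  # illustrative values
window_size = 10000
version_str = "v1"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (key TEXT, value TEXT)")
# Bound parameters instead of f-string interpolation: no malformed SQL,
# no injection risk.
con.execute("INSERT INTO metadata VALUES (?, ?)", ("window_size", str(window_size)))
con.execute("INSERT INTO metadata VALUES (?, ?)", ("version", version_str))
# Serialize the list so it round-trips as a single safe text value.
con.execute("INSERT INTO metadata VALUES (?, ?)", ("pops_legend", json.dumps(pops_legend)))

row = con.execute(
    "SELECT value FROM metadata WHERE key = ?", ("pops_legend",)
).fetchone()
assert json.loads(row[0]) == pops_legend
```

Only the SQL text is static; every runtime value travels as a bound parameter, and `json.dumps`/`json.loads` handle the list round-trip.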
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 17, 2026 04:15 — with GitHub Actions Inactive