Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Adds a new CLI subcommand.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(200,200,255,0.5)
        participant CLI
    end
    rect rgba(200,255,200,0.5)
        participant Hail
    end
    rect rgba(255,200,200,0.5)
        participant FS
    end
    rect rgba(255,255,200,0.5)
        participant DuckDB
    end
    CLI->>Hail: load haplotypes table & gnomAD VA
    CLI->>Hail: register reference FASTA
    Hail->>Hail: compute per-haplotype metrics, filter, split windows
    Hail->>Hail: dedupe, build sequences & variant strings, assign sequence_id
    Hail->>FS: export TSV / DataFrame
    CLI->>FS: write FASTA file(s) (single or per-contig)
    CLI->>DuckDB: create/replace .duckdb and insert sequences + metadata
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Force-pushed from 0fe6bba to 2e3ca4f
Force-pushed from 19bf206 to 6f51106
Force-pushed from 2e3ca4f to c0ec4e9
Port create_fasta_and_index.py from human-diversity-reference/scripts as a defopt-compatible toolkit tool. Generates DivRef FASTA sequences with flanking reference context and a DuckDB index for use by remap_divref. Reuses get_haplo_sequence and split_haplotypes from divref.haplotype. Renames the chr loop variable to chrom to avoid shadowing the built-in, fixes a loop-variable closure bug (B023), and replaces print/typer.echo with logging. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/tools/__init__.py to create the package and test_create_fasta_and_index.py with happy-path tests for get_haplo_sequence and split_haplotypes. The Hail JVM test is marked skip; the remaining tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 6f51106 to eafe72a
Actionable comments posted: 3
🧹 Nitpick comments (1)
divref/divref/tools/create_fasta_and_index.py (1)
55-180: Consider splitting this orchestration into smaller helpers.

The function currently mixes Hail transforms, export, FASTA writing, and DuckDB materialization in one block. Extracting pipeline stages into small helpers would improve readability and reduce risk during future edits.

As per coding guidelines: “Extract logic into small–medium functions with clear inputs/outputs”.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@divref/divref/tools/create_fasta_and_index.py` around lines 55 - 180, The current create_fasta_and_index orchestration mixes Hail transforms (ht building with split_haplotypes, get_haplo_sequence), export (ht.select().export -> df), FASTA writing, and DuckDB materialization in one long block; refactor by extracting at least four helpers with clear inputs/outputs: (1) build_haplotype_table(haplotypes_table_path, gnomad_va_file, window_size, frequency_cutoff, merge, version_str) that performs the Hail pipeline (reading tables, annotations, split_haplotypes, get_haplo_sequence, checkpoint) and returns the final ht or path to checkpointed HT, (2) export_ht_to_tsv_bgz(ht, output_base, file_suffix, pops_legend) which does the ht.select(...).export and returns the polars DataFrame, (3) write_fasta_files(df, output_base, file_suffix, split_contigs) which writes FASTA files (the split_contigs logic), and (4) create_duckdb_index(df, output_base, file_suffix, window_size, pops_legend, version_str) which creates the duckdb file and tables; wire these helpers together in the original routine, preserve existing names like split_haplotypes, get_haplo_sequence, ht, pops_legend, and ensure each helper has minimal side effects and explicit return values.
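The four-stage split suggested in this comment could be staged roughly as below. This is a minimal sketch, not the PR's actual code: plain dicts stand in for the Hail Table rows, ordinary paths stand in for HailPath values, and the field names (`sequence_id`, `sequence`, `estimated_gnomad_AF`) are illustrative; only the helper names follow the review's suggestion.

```python
import tempfile
from pathlib import Path


def build_haplotype_table(rows, frequency_cutoff):
    """Stage 1: the Hail pipeline, reduced here to the frequency filter."""
    return [r for r in rows if r["estimated_gnomad_AF"] >= frequency_cutoff]


def export_to_tsv(rows, tsv_path):
    """Stage 2: export the filtered table as TSV."""
    cols = ["sequence_id", "sequence", "estimated_gnomad_AF"]
    lines = ["\t".join(cols)]
    lines += ["\t".join(str(r[c]) for c in cols) for r in rows]
    Path(tsv_path).write_text("\n".join(lines) + "\n")


def write_fasta_files(rows, fasta_path):
    """Stage 3: one FASTA record per haplotype sequence."""
    with open(fasta_path, "w") as fh:
        for r in rows:
            fh.write(f">{r['sequence_id']}\n{r['sequence']}\n")


def create_fasta_and_index(rows, out_dir, frequency_cutoff=0.005):
    """Wire the stages together; each helper has explicit inputs/outputs."""
    out = Path(out_dir)
    kept = build_haplotype_table(rows, frequency_cutoff)
    export_to_tsv(kept, out / "divref.tsv")
    write_fasta_files(kept, out / "divref.fa")
    return kept
```

The point of the shape is that each stage can be unit-tested in isolation; a fourth `create_duckdb_index` helper would follow the same pattern for the DuckDB materialization.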
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@divref/divref/tools/create_fasta_and_index.py`:
- Around line 17-29: Add unit tests exercising the new public function
create_fasta_and_index: one happy-path test that supplies valid
haplotypes_table_path, gnomad_va_file, reference_fasta, window_size, output_base
and version_str (and exercise both merge=True and split_contigs=True variants)
asserting expected output files are created and indexed; and at least one
error-path test that calls create_fasta_and_index with invalid inputs (e.g.,
non-existent haplotypes_table_path or an out-of-range frequency_cutoff like -0.1
or >1.0) and asserts it raises the expected exception. Use the
create_fasta_and_index symbol to locate the implementation and mock or create
temporary HailPath-like test fixtures (tmp_dir/output_base) so tests are
deterministic; ensure coverage includes the merge branch and invalid-parameter
validation branches.
- Around line 67-75: The calculation fraction_phased = ht.max_empirical_AF /
ht.min_variant_frequency can divide by zero; update the computation in the block
that computes fraction_phased and annotates estimated_gnomad_AF so you guard
against ht.min_variant_frequency == 0 (or missing) by computing fraction_phased
with a conditional (e.g., use hl.if_else or equivalent to return a safe default
or missing when min_variant_frequency <= 0), then use that guarded
fraction_phased when annotating estimated_gnomad_AF and ensure downstream
filtering (ht = ht.filter(ht.estimated_gnomad_AF >= frequency_cutoff)) behaves
correctly with the chosen default/missing value; reference the fraction_phased
variable, ht.min_variant_frequency, ht.max_empirical_AF, the annotate call that
sets estimated_gnomad_AF, and the subsequent filter to implement this fix.
- Around line 177-179: The three con.execute calls misuse f-string
interpolation: replace direct interpolation with parameterized queries for
window_size and version_str (use bound parameters in con.execute) and serialize
pops_legend before storing (e.g., json.dumps(pops_legend) to create a safe SQL
text value or convert to a valid SQL array literal) then pass that serialized
string as a bound parameter when creating the pops_legend table; update the
CREATE TABLE statements that reference window_size, pops_legend, and VERSION to
accept the bound parameters and use the serialized pops_legend so you avoid
malformed SQL and injection risks.
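The parameterized-query fix described above can be sketched like this. sqlite3 is used only because it ships with Python; DuckDB's `con.execute` accepts the same positional `?` placeholders, so the pattern carries over. The table layout and values here are illustrative, not the tool's actual schema.

```python
import json
import sqlite3

# Hypothetical values; in the real tool these come from CLI arguments.
window_size = 150
version_str = "v1.0"
pops_legend = ["afr", "amr", "eas", "nfe", "sas"]

con = sqlite3.connect(":memory:")  # duckdb.connect(...) has the same shape

# Identifiers cannot be bound, so the DDL stays a literal string;
# only the *values* are passed as bound parameters.
con.execute("CREATE TABLE metadata (key TEXT, value TEXT)")
con.execute("INSERT INTO metadata VALUES (?, ?)", ("window_size", str(window_size)))
con.execute("INSERT INTO metadata VALUES (?, ?)", ("version", version_str))
# Serialize the legend instead of interpolating a Python list repr into SQL.
con.execute(
    "INSERT INTO metadata VALUES (?, ?)", ("pops_legend", json.dumps(pops_legend))
)

stored = dict(con.execute("SELECT key, value FROM metadata").fetchall())
```

Reading the legend back is then a plain `json.loads(stored["pops_legend"])`, with no risk of malformed SQL from quotes or brackets in the interpolated text.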
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7d920d4-9ace-4ed2-a631-69763bcc9114
📒 Files selected for processing (4)
- divref/divref/main.py
- divref/divref/tools/create_fasta_and_index.py
- divref/tests/tools/__init__.py
- divref/tests/tools/test_create_fasta_and_index.py
```python
def create_fasta_and_index(
    *,
    haplotypes_table_path: HailPath,
    gnomad_va_file: HailPath,
    reference_fasta: HailPath,
    window_size: int,
    output_base: HailPath,
    version_str: str,
    merge: bool = False,
    frequency_cutoff: float = 0.005,
    split_contigs: bool = False,
    tmp_dir: HailPath = "/tmp",
) -> None:
```
Add direct tests for the new public CLI function (create_fasta_and_index).
This PR adds a new public function, but the added tests target get_haplo_sequence only. Please add at least one happy-path and one error-path test for this function (e.g., invalid input path / invalid cutoff / merge-path behavior).
As per coding guidelines: “**/*.py: New public functions require at least one happy-path test + one error case; bug fixes should include regression test”.
Force-pushed from ee5ccc4 to eafe72a
- Guard against division by zero when min_variant_frequency <= 0; log warning with count of removed haplotypes
- Fix SQL interpolation in DuckDB metadata tables: use parameterized queries for window_size and version_str, serialize pops_legend with json.dumps
- Refactor monolithic function into four helpers: build_haplotype_table, export_ht_to_dataframe, write_fasta_files, create_duckdb_index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
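The division-by-zero guard described in the commit above can be expressed in plain Python as below. The real code operates on Hail expressions (something like `hl.if_else(ht.min_variant_frequency > 0, ..., hl.missing(hl.tfloat64))`), but the semantics mirrored here are the same: return missing rather than divide, and let the downstream frequency filter drop the missing rows. The function names are illustrative, taken from the field names in the review comment.

```python
def safe_fraction_phased(max_empirical_af, min_variant_frequency):
    """Guarded ratio: missing (None) instead of a ZeroDivisionError."""
    if min_variant_frequency is None or min_variant_frequency <= 0:
        return None  # treated as missing, like hl.missing in the Hail version
    return max_empirical_af / min_variant_frequency


def keep_haplotype(estimated_gnomad_af, frequency_cutoff):
    """Downstream filter: missing values fail the >= comparison,
    matching how Hail's filter drops missing predicates."""
    return estimated_gnomad_af is not None and estimated_gnomad_af >= frequency_cutoff
```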
Summary
- Ports create_fasta_and_index.py from human-diversity-reference/scripts as a defopt-compatible toolkit tool
- Generates DivRef FASTA sequences and a DuckDB index for use by remap_divref
- Reuses get_haplo_sequence and split_haplotypes from divref.haplotype
- Renames the chr loop variable to chrom to avoid shadowing the built-in, and replaces print/typer.echo with logging
- Adds tests/tools/test_create_fasta_and_index.py with happy-path tests for get_haplo_sequence and split_haplotypes (Hail JVM test marked skip)

Test plan
- uv run --directory divref poe check-all passes
- uv run --directory divref pytest tests/tools/test_create_fasta_and_index.py: all non-skip tests pass

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests