Conversation
📝 Walkthrough
This pull request introduces several enhancements across various components. In the conda recipe, a new dependency (…)

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant SNK as Snakemake Workflow
    participant PR as Partitioning Rules
    participant DM as Demultiplex Rule
    participant MP as Merge Rules
    participant AQ as Assess Quality
    participant WS as Workflow Summary
    User->>SNK: Trigger workflow
    SNK->>PR: Execute partition_reads & partition_index (using seqkit)
    PR->>DM: Supply partitioned read files and schema
    DM->>MP: Process demultiplexing and output partitioned files
    MP->>AQ: Merge partition outputs and barcode logs
    AQ->>WS: Generate workflow summary with consolidated logs
    WS->>User: Return final summary
```
Actionable comments posted: 5
🧹 Nitpick comments (9)
harpy/view.py (1)
17-39: Consider removing the unused `stdscr` variable or referencing it meaningfully.
Currently, `stdscr = curses.initscr()` is never used, which triggers a linter warning. If no usage is intended, removing the assignment can improve clarity.

```diff
- stdscr = curses.initscr()
+ curses.initscr()
```

🧰 Tools
🪛 Ruff (0.8.2)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
harpy/scripts/demultiplex_gen1.py (2)
2-4: Use context managers for file operations.
Opening the log file with a plain `open()` can be replaced with a context manager for clearer resource handling. Also consider restoring `sys.stdout` and `sys.stderr` afterwards to avoid side effects.

```diff
- f = open(snakemake.log[0], "w")
- sys.stderr = sys.stdout = f
+ with open(snakemake.log[0], "w") as f:
+     sys.stderr = sys.stdout = f
+     # rest of logic
+ # Possibly restore sys.stderr/stdout here or rely on context manager scope
```

🧰 Tools
🪛 Ruff (0.8.2)
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
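The restore step this comment asks for can be handled automatically by `contextlib`'s redirect helpers, which put the original streams back on exit. A minimal sketch — the `run.log` path here is a stand-in, not the workflow's actual `snakemake.log[0]`:

```python
import sys
from contextlib import redirect_stdout, redirect_stderr

# Both streams point at the log file only inside the block;
# on exit they are restored to their previous values automatically.
with open("run.log", "w") as log, redirect_stdout(log), redirect_stderr(log):
    print("demultiplexing started")
    print("warning: odd record", file=sys.stderr)

print(open("run.log").read().splitlines())
```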
119-165: Consider consolidating file operations into a single context manager.
The approach works, but opening multiple output files in separate dictionary comprehensions can be brittle if an exception arises midway. A single context manager block can mitigate partial resource management issues.

```diff
- R1_output = {sample: open(...) for sample in samples}
- R2_output = {sample: open(...) for sample in samples}
...
- for r1_rec, r2_rec, i1_rec, i2_rec in zip(...):
+ with ExitStack() as stack:
+     R1_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     R2_output = {sample: stack.enter_context(open(...)) for sample in samples}
+     # proceed
```

🧰 Tools
🪛 Ruff (0.8.2)
119-119: Undefined name `snakemake` (F821)
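The `ExitStack` idea above can be sketched as a self-contained example; the sample names are hypothetical, not the workflow's real schema entries:

```python
from contextlib import ExitStack

samples = ["sample_1", "sample_2"]  # hypothetical sample names

# ExitStack guarantees every file entered so far is closed,
# even if opening a later one raises.
with ExitStack() as stack:
    r1_out = {s: stack.enter_context(open(f"{s}.R1.fq", "w")) for s in samples}
    r2_out = {s: stack.enter_context(open(f"{s}.R2.fq", "w")) for s in samples}
    r1_out["sample_1"].write("@r1\nACGT\n+\n!!!!\n")

# all four handles are closed once the block exits
print(all(f.closed for f in list(r1_out.values()) + list(r2_out.values())))
```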
harpy/snakefiles/demultiplex_gen1 copy.smk (3)
12-20: File logging approach is consistent, but confirm file existence.
Removing `logger.logfile` on success/error is fine if it is guaranteed to exist. Otherwise, consider checking before removal to avoid exceptions.
21-39: Schema parsing approach is functional, but watch for multiple barcode letters.
Similar to the Python script, ensure lines with different segment letters do not create unintended merges. The `keep_unknown` logic is helpful for out-of-scope samples.
79-88: Compression step is straightforward.
Using `gzip` is a minimal approach. Consider parallel compression if dealing with large files, but otherwise this is acceptable.

harpy/snakefiles/demultiplex_gen1.smk (2)
112-122: Finalize or remove the commented-out merge command.
The `default_target: True` setting indicates a main rule, but the actual merging shell command is commented out, suggesting incomplete functionality.
226-226: Clarify the default target intention.
`#default_target: True` was commented out. If this is meant to be the final rule, consider re-enabling or removing it completely.

harpy/_conda.py (1)
38-38: Consider pinning `seqkit` to a specific version.
While adding `seqkit` supports new partitioning features, a version pin (e.g., `bioconda::seqkit=2.3.1`) could help maintain reproducibility.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- harpy/_conda.py (1 hunks)
- harpy/reports/align_bxstats.qmd (4 hunks)
- harpy/scripts/demultiplex_gen1.py (3 hunks)
- harpy/scripts/demultiplex_gen1.py.bak (0 hunks)
- harpy/snakefiles/demultiplex_gen1 copy.smk (1 hunks)
- harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
- harpy/view.py (2 hunks)
💤 Files with no reviewable changes (1)
- harpy/scripts/demultiplex_gen1.py.bak
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py
2-2: Use a context manager for opening files (SIM115)
2-2: Undefined name `snakemake` (F821)
83-83: Undefined name `snakemake` (F821)
84-84: Undefined name `snakemake` (F821)
85-85: Undefined name `snakemake` (F821)
86-86: Undefined name `snakemake` (F821)
87-87: Undefined name `snakemake` (F821)
88-88: Undefined name `snakemake` (F821)
89-89: Undefined name `snakemake` (F821)
90-90: Undefined name `snakemake` (F821)
91-91: Undefined name `snakemake` (F821)
92-92: Undefined name `snakemake` (F821)
93-93: Undefined name `snakemake` (F821)
94-94: Undefined name `snakemake` (F821)
95-95: Undefined name `snakemake` (F821)
109-109: Use a context manager for opening files (SIM115)
110-110: Use a context manager for opening files (SIM115)
119-119: Undefined name `snakemake` (F821)
harpy/view.py
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (24)
harpy/view.py (3)
6-10: Imports look consistent and essential.
These imports address gzip handling, terminal color checking (via curses), syntax highlighting, and paging. The static analysis warning that `echo_via_pager` is unused appears to be a false positive since it is used at line 129.

🧰 Tools
🪛 Ruff (0.8.2)
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
110-120: Color-based formatter selection is clear and well-structured.
The logic for determining the best formatter based on `n_colors` is straightforward and concise. No further concerns here.
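A hedged sketch of how such a selection could work — the thresholds and formatter names below are standard Pygments aliases used for illustration, not necessarily what `harpy/view.py` actually does:

```python
import curses

def detect_colors() -> int:
    """Query terminfo for the terminal's color count; -1 if unknown."""
    try:
        curses.setupterm()
        return curses.tigetnum("colors")
    except curses.error:
        return -1

def pick_formatter(n_colors: int) -> str:
    # Pygments formatter aliases: 256-color, 8/16-color, or plain text
    if n_colors >= 256:
        return "terminal256"
    if n_colors > 0:
        return "terminal"
    return "text"

print(pick_formatter(detect_colors()))
```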
133-133: Blue-colored title for better visual distinction.
Changing the title to `"[bold blue] File viewed"` helps highlight the file's name. This styling choice is consistent.

harpy/scripts/demultiplex_gen1.py (2)
25-43: Check for multi-letter schemas.
If the schema file references multiple distinct leading letters (e.g., "A", "B"), `id_letter = code_letters.pop()` will only capture one letter and discard the others. Ensure this is intentional or handle all letters if needed.
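One way to make the single-letter assumption explicit is to fail fast on mixed schemas. A sketch with hypothetical schema lines, not the script's actual parser:

```python
def parse_id_letter(schema_lines):
    """Return the single segment letter, refusing mixed-letter schemas."""
    code_letters = {line.split()[0][0] for line in schema_lines}
    if len(code_letters) != 1:
        raise ValueError(f"schema mixes segment letters: {sorted(code_letters)}")
    return code_letters.pop()

print(parse_id_letter(["C01\tsample_1", "C02\tsample_2"]))  # C
```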
83-101: Snakemake references look valid.
All references to `snakemake.params` and the `bar_codes` dictionary are consistent. However, this script will only run under Snakemake. If needed, add a guard or note clarifying that requirement.

🧰 Tools
🪛 Ruff (0.8.2)
83-83: Undefined name `snakemake` (F821)
84-84: Undefined name `snakemake` (F821)
85-85: Undefined name `snakemake` (F821)
86-86: Undefined name `snakemake` (F821)
87-87: Undefined name `snakemake` (F821)
88-88: Undefined name `snakemake` (F821)
89-89: Undefined name `snakemake` (F821)
90-90: Undefined name `snakemake` (F821)
91-91: Undefined name `snakemake` (F821)
92-92: Undefined name `snakemake` (F821)
93-93: Undefined name `snakemake` (F821)
94-94: Undefined name `snakemake` (F821)
95-95: Undefined name `snakemake` (F821)
harpy/snakefiles/demultiplex_gen1 copy.smk (10)
1-2: Container spec is straightforward.
Using a Docker container ensures reproducible environments. No issues noted.
3-4: Imports for OS and logging are appropriate.
These modules support directory manipulation and log handling across the workflow.
6-11: Config parameters read logically, but watch out for missing keys.
The code directly accesses `config[...]` keys. Make sure you handle KeyError or provide defaults if these keys are not guaranteed.
40-42: Storing parsed samples in a list is clear.
Fulfills the requirement for enumerating sample names from the schema dictionary.
43-52: Rule `barcode_segments` references a script that isn't shown.
Ensure `haplotag_acbd.py` is available and tested. Provide a docstring or usage note describing the expected outputs of this script.
53-78: Rule `demultiplex` effectively ties input segments to the script.
Make sure the placeholders `{sample}` match what `demultiplex_gen1.py` expects. The usage of `temp()` outputs is correct for ephemeral intermediates.
89-119: Rule `assess_quality`: fallback to minimal output on error is helpful.
Generates a pseudo-Falco result if the primary command fails, preventing workflow breaks down the line. This is a practical approach.
120-142: Rule `report_config` dynamically generates the MultiQC config.
Using `yaml.dump` with `default_flow_style=False` and `sort_keys=False` is a good style choice for readability. No issues found.
143-159: Rule `quality_report` properly aggregates QA outputs.
The collected fastqc data and forced override (`--force`) look correct. The redirect of errors to `multiqc.log` ensures consistent debugging.
160-192: Workflow summary is thorough, but confirm existence of config keys.
The summary references `config['workflow_call']` directly. Ensure it always exists. Otherwise, guard with a default or a conditional check.

harpy/snakefiles/demultiplex_gen1.smk (5)
19-21: Wildcard constraints look good.
The pattern for `part = "\d+"` and `FR = "[12]"` aligns well with Snakemake usage.
229-229: Dependency on merged barcode logs is coherent.
Referencing `barcode_logs` in `workflow_summary` aligns well with the newly introduced partitioning and merging steps.
133-155: Validate the barcode log structure.
Your merge logic assumes four columns (`barcode`, `total`, `correct`, `corrected`). Consider additional checks or error handling in case of unexpected file formats.

❌ Verification inconclusive
🏁 Script executed:

```bash
#!/bin/bash
# Optionally verify the presence of the expected columns in all barcode logs
rg 'Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads' $(fd barcodes.log)
```

Length of output: 78
Manual Verification Needed: Barcode Log Structure Check
The executed command didn't find any logs with the expected header. This likely means that the header ("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads") is not present in the repository's static content but created at runtime by the merge rule. Please manually verify that the barcode log files generated at runtime contain the expected four columns. Additionally, consider implementing extra checks or error handling in the merge rule (e.g., validating that each line splits into exactly four fields) to guard against unexpected file formats.
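The extra check suggested here could look like the following — a sketch against the four-column format described in the review, not the rule's actual code:

```python
def parse_barcode_log(lines):
    """Parse a barcode log, validating the header and the 4-field rows."""
    header, *rows = lines
    expected = ["Barcode", "Total_Reads", "Correct_Reads", "Corrected_Reads"]
    if header.split() != expected:
        raise ValueError(f"unexpected header: {header!r}")
    parsed = {}
    for n, row in enumerate(rows, start=2):
        fields = row.split()
        if len(fields) != 4:
            raise ValueError(f"line {n}: expected 4 fields, got {len(fields)}")
        barcode, *counts = fields
        parsed[barcode] = [int(c) for c in counts]
    return parsed

log = [
    "Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads",
    "A01C02B11D09\t120\t110\t5",
]
print(parse_barcode_log(log))
```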
77-85: Check naming consistency for index partitioning.
Ensure that the same split logic used in `partition_reads` also applies cleanly to the index reads, matching the output pattern `reads.I{FR}.part_{part}.fq.gz`.

❌ Verification inconclusive
🏁 Script executed:

```bash
#!/bin/bash
# Confirm index partition files follow the naming scheme
fd fastq_chunks -e gz | rg 'reads.I1.part_\d{3}\.fq\.gz|reads.I2.part_\d{3}\.fq\.gz'
```

Length of output: 86
Action Required: Manual Verification Needed for Index Partitioning File Naming
The automated command did not return any files matching the `reads.I{FR}.part_{part}.fq.gz` pattern, so we cannot confirm consistency from the current repository state. Please manually verify (post-pipeline run) that index partition files, if generated, follow this naming convention.
55-76: Confirm the filename pattern matches seqkit output.
By default, `seqkit split2` appends `_part_001` to the original file name. Verify that it correctly produces `reads.R{FR}.part_{part}.fq.gz` as expected by the rule output.

❌ Verification inconclusive
🏁 Script executed:

```bash
#!/bin/bash
# Example verification: list the generated partition files and validate their naming
fd fastq_chunks -e gz --exec ls -l {}
```

Length of output: 41
Filename Pattern Mismatch: Verify & Update Expected Pattern
The rule's collect pattern currently expects files named like `reads.R{FR}.part_{part}.fq.gz`, but by default, seqkit split2 appends `_part_001` (e.g. producing `reads.R1_part_001.fq.gz`). This discrepancy suggests that the pattern in the Snakefile may not correctly capture the files produced by seqkit. Please manually verify the actual output naming from seqkit split2 and update the collect pattern if necessary.
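The mismatch can be checked mechanically. Both patterns below are the candidates named in this comment and remain hypothetical until verified against a real seqkit run:

```python
import re

# Pattern the Snakefile rule expects vs. the naming this comment
# attributes to seqkit split2's default behavior.
rule_pattern   = re.compile(r"reads\.R(?P<FR>[12])\.part_(?P<part>\d{3})\.fq\.gz$")
seqkit_default = re.compile(r"reads\.R(?P<FR>[12])_part_(?P<part>\d{3})\.fq\.gz$")

name = "reads.R1_part_001.fq.gz"
m = seqkit_default.match(name)
print(m.group("FR"), m.group("part"))  # 1 001
print(rule_pattern.match(name))        # None: the rule's pattern would miss it
```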
harpy/reports/align_bxstats.qmd (4)
35-59: Robust handling of empty inputs.
Returning a zero-filled data frame when `tb` is empty prevents errors and improves clarity for downstream usage.
60-103: Gracefully handle an empty `multiplex_df`.
If `multiplex_df` is empty (e.g., all barcodes have fewer than 2 reads), mean and standard error computations can result in `NaN`. Consider a fallback or guard to avoid warnings.
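The guard can be as simple as special-casing samples smaller than two. A sketch in Python; the report itself is a Quarto document, so this only illustrates the fallback logic, not the actual report code:

```python
import statistics

def mean_and_se(values):
    """Mean and standard error, with 0.0 fallbacks instead of NaN."""
    if len(values) < 2:
        return (values[0] if values else 0.0), 0.0
    return statistics.mean(values), statistics.stdev(values) / len(values) ** 0.5

print(mean_and_se([]))          # (0.0, 0.0)
print(mean_and_se([4.0, 6.0]))  # (5.0, 1.0)
```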
127-127: Renaming "SEM" to "SE" enhances consistency.A straightforward terminology tweak that improves clarity.
150-150: Updated column titles align with new fields.
The additional columns for coverage in bp neatly map to the updated return values in `process_input()`.
Actionable comments posted: 0
🧹 Nitpick comments (1)
harpy/view.py (1)
17-38: LGTM! Minor improvement suggested.
The function correctly handles terminal color detection. However, the `stdscr` variable is not used beyond initialization.
Apply this diff to simplify the code:

```diff
- stdscr = curses.initscr()
+ curses.initscr()
```

🧰 Tools
🪛 Ruff (0.8.2)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- harpy/reports/align_bxstats.qmd (4 hunks)
- harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
- harpy/view.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- harpy/reports/align_bxstats.qmd
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/view.py
10-10: `click.echo_via_pager` imported but unused. Remove unused import: `click.echo_via_pager` (F401)
19-19: Local variable `stdscr` is assigned to but never used. Remove assignment to unused variable `stdscr` (F841)
🔇 Additional comments (9)
harpy/view.py (2)
121-127: LGTM!
The function efficiently handles both gzipped and regular files, uses generators for memory efficiency, and correctly applies syntax highlighting.
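The pattern described here — one lazy reader for both plain and gzipped text — can be sketched as follows (an illustration, not the actual `harpy/view.py` implementation):

```python
import gzip
from pathlib import Path

def read_lines(path: str):
    """Yield lines lazily from a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        yield from fh

# demo files
Path("demo.txt").write_text("alpha\nbeta\n")
with gzip.open("demo.txt.gz", "wt") as gz:
    gz.write("alpha\nbeta\n")

print(list(read_lines("demo.txt")) == list(read_lines("demo.txt.gz")))  # True
```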
110-130: LGTM!
The changes correctly handle terminal color detection, select the appropriate formatter, and use `click.echo_via_pager` for displaying content.

harpy/snakefiles/demultiplex_gen1.smk (7)
19-21: LGTM!
The wildcard constraints correctly limit `FR` to 1 or 2 and `part` to digits.
43-43: LGTM!
The list comprehension correctly formats part numbers and limits the range to prevent excessive partitioning.
55-75: LGTM!
The rule efficiently partitions reads using `seqkit split2` and creates symbolic links to avoid copying large files.
77-84: LGTM!
The rule efficiently reuses `partition_reads` logic for index files.
88-91: LGTM!
The input paths correctly point to partitioned files, and the `I2` input path has been fixed.
112-120: LGTM!
The rule efficiently merges partitioned files using a simple `cat` command.
132-153: LGTM!
The rule efficiently aggregates barcode statistics while maintaining the original file format.
Actionable comments posted: 1
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
132-154: Enhance error handling and memory efficiency in barcode log merging.
The barcode log merging could benefit from better error handling and memory optimization.

```diff
 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # skip first row of column names
+                _ = bc_log.readline()
+                for line in bc_log:
+                    try:
+                        barcode,total,correct,corrected = line.split()
+                        bc_stats = [int(total), int(correct), int(corrected)]
+                        bc_dict[barcode] = list(map(sum, zip(bc_stats, bc_dict[barcode])))
+                    except ValueError as e:
+                        print(f"Warning: Skipping malformed line in {i}: {line.strip()}", file=sys.stderr)
+        except IOError as e:
+            print(f"Error reading {i}: {e}", file=sys.stderr)
+            raise
-            # skip first row of column names
-            _ = bc_log.readline()
-            for line in bc_log:
-                barcode,total,correct,corrected = line.split()
-                bc_stats = [int(total), int(correct), int(corrected)]
-                if barcode not in bc_dict:
-                    bc_dict[barcode] = bc_stats
-                else:
-                    bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
```
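Stripped of the Snakemake context, the merge suggested here reduces to element-wise summation per barcode. A self-contained sketch with made-up tallies:

```python
from collections import defaultdict

# Hypothetical per-partition tallies: {barcode: [total, correct, corrected]}
partitions = [
    {"A01C02": [100, 90, 4], "A05C01": [10, 9, 1]},
    {"A01C02": [50, 48, 2]},
]

# defaultdict removes the "first occurrence vs. accumulate" branch:
# every barcode starts at [0, 0, 0] and is simply summed into.
merged = defaultdict(lambda: [0, 0, 0])
for part in partitions:
    for barcode, stats in part.items():
        merged[barcode] = [a + b for a, b in zip(merged[barcode], stats)]

print(dict(merged))  # {'A01C02': [150, 138, 6], 'A05C01': [10, 9, 1]}
```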
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
19-21: LGTM! Well-defined wildcard constraints.
The constraints ensure that `FR` only accepts 1 or 2 (for read pairs) and `part` only accepts digits, which is essential for the partitioning logic.
43-43: Consider using consistent naming for partition limits.
The zero-padding to 3 digits (`{i:03d}`) implies a maximum of 999 parts, which matches the limit in `min(workflow.cores, 999)`. However, this magic number should be defined as a constant for better maintainability.

```diff
+MAX_PARTITIONS = 999
-fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
+fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, MAX_PARTITIONS) + 1)]
```
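The constant-plus-helper version behaves like this (a sketch; `cores` here is an ordinary integer standing in for Snakemake's `workflow.cores`):

```python
MAX_PARTITIONS = 999  # matches the {i:03d} zero-padding width

def part_labels(cores: int) -> list[str]:
    """One zero-padded label per partition, capped at MAX_PARTITIONS."""
    return [f"{i:03d}" for i in range(1, min(cores, MAX_PARTITIONS) + 1)]

print(part_labels(4))          # ['001', '002', '003', '004']
print(len(part_labels(2000)))  # 999, the cap holds
```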
86-111: LGTM! Input paths are correctly defined.
The rule now correctly uses partitioned inputs and includes the schema parameter. The I2 input path issue from the past review has been addressed.
224-256: LGTM! Workflow summary correctly includes barcode logs.
The workflow summary has been appropriately updated to include the barcode logs in its input dependencies.
Actionable comments posted: 0
♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
55-77: 🛠️ Refactor suggestion — Address platform compatibility and error handling.
- Using symlinks may not work on Windows systems.
- Missing error handling for the seqkit command.
- Missing directory creation for the output path.

Apply this diff to improve robustness:

```diff
 shell:
     """
+    mkdir -p {params.outdir}
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
-    seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2 > {log}
+    if ! seqkit split2 --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}; then
+        echo "Error: seqkit split2 failed" >&2
+        exit 1
+    fi
     """
```
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
140-162: Enhance merge_barcode_logs robustness.
The current implementation might have memory issues with a large number of files and lacks error handling.
Consider this improved implementation:

```diff
 run:
+    from contextlib import ExitStack
     bc_dict = {}
-    for i in input.bc:
-        with open(i, "r") as bc_log:
+    try:
+        with ExitStack() as stack:
+            for i in input.bc:
+                bc_log = stack.enter_context(open(i, "r"))
                 # skip first row of column names
                 _ = bc_log.readline()
                 for line in bc_log:
                     barcode,total,correct,corrected = line.split()
                     bc_stats = [int(total), int(correct), int(corrected)]
                     if barcode not in bc_dict:
                         bc_dict[barcode] = bc_stats
                     else:
                         bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
-    with open(output.log, "w") as f:
+            with open(output.log, "w") as f:
                 f.write("Barcode\tTotal_Reads\tCorrect_Reads\tCorrected_Reads\n")
                 for k,v in bc_dict.items():
                     f.write(k + "\t" + "\t".join([str(i) for i in v]) + "\n")
+    except Exception as e:
+        logger.error(f"Error processing barcode logs: {str(e)}")
+        raise
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
19-21: LGTM! Well-defined wildcard constraints.
The constraints appropriately restrict FR to [1,2] and part to digits.
43-43: LGTM! Well-implemented chunk numbering.
The implementation correctly synchronizes chunk numbering with the partition limit and uses zero-padded formatting.
90-115: LGTM! Correctly configured demultiplex rule.
The input paths are properly configured, and temporary files are correctly marked.
232-264: LGTM! Workflow summary correctly updated.
The workflow summary now properly includes the barcode logs in its dependencies.
Actionable comments posted: 2
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
140-162: Improve robustness of barcode log merging.
The barcode log merging could be enhanced for better error handling and memory efficiency.
Consider these improvements:

```diff
 run:
+    from collections import defaultdict
-    bc_dict = {}
+    bc_dict = defaultdict(lambda: [0, 0, 0])
     for i in input.bc:
-        with open(i, "r") as bc_log:
+        try:
+            with open(i, "r") as bc_log:
+                # validate header
+                header = bc_log.readline().strip().split()
+                if header != ["Barcode", "Total_Reads", "Correct_Reads", "Corrected_Reads"]:
+                    raise ValueError(f"Invalid header in {i}")
-            # skip first row of column names
-            _ = bc_log.readline()
             for line in bc_log:
                 barcode,total,correct,corrected = line.split()
                 bc_stats = [int(total), int(correct), int(corrected)]
-                if barcode not in bc_dict:
-                    bc_dict[barcode] = bc_stats
-                else:
-                    bc_dict[barcode] = list(map(lambda x,y: x+y, bc_stats, bc_dict[barcode]))
+                bc_dict[barcode] = [x + y for x, y in zip(bc_stats, bc_dict[barcode])]
+        except Exception as e:
+            raise RuntimeError(f"Error processing {i}: {str(e)}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (3)
harpy/snakefiles/demultiplex_gen1.smk (3)
19-21: LGTM! Well-defined wildcard constraints.
The constraints appropriately restrict FR to read pairs and part to numeric values.
232-264: LGTM! Workflow summary properly integrates barcode logs.
The changes correctly incorporate the barcode logs into the workflow summary.
43-43: Document and verify the partition limit assumptions.
The implementation assumes:
- A maximum of 999 partitions (implied by 3-digit zero-padding)
- The partition format matches seqkit's output format
Please verify:
- That seqkit uses the same zero-padding format
- Whether 999 is a reasonable limit for the target use case
❌ Verification inconclusive
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check seqkit's output format
# Create a small test file
echo -e "@test\nACGT\n+\n!!!!" > test.fq
# Run seqkit with 2 parts and check the naming
seqkit split2 -1 test.fq -2 test.fq -p 2 -O . 2>&1 | grep "part"
```

Length of output: 92
Manual verification needed: Confirm seqkit naming and partition limit assumptions
The initial test did not produce output, so we couldn't verify whether seqkit uses three-digit zero-padding or whether the 999 limit is appropriate. Please manually verify that:
- Seqkit's output filename format for partitions indeed uses three-digit, zero-padded values (to match the `{i:03d}` format).
- A cap of 999 partitions is both reflected by seqkit and reasonable for our intended use case.
Actionable comments posted: 0
♻️ Duplicate comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
55-77: ⚠️ Potential issue — Enhance platform compatibility and error handling in the partition_reads rule.
The rule has potential issues with symlinks and error handling.
Apply these changes to improve robustness:

```diff
 shell:
     """
-    ln -sr {input.r1} {output.r1}
-    ln -sr {input.r2} {output.r2}
+    cp {input.r1} {output.r1}
+    cp {input.r2} {output.r2}
+    trap 'rm -f {output.r1} {output.r2}' ERR
     seqkit split2 -f --quiet -1 {output.r1} -2 {output.r2} -p {params.chunks} -j {threads} -O {params.outdir} -e .gz 2> {log}
     """
```
🧹 Nitpick comments (1)
harpy/snakefiles/demultiplex_gen1.smk (1)
43-43: Optimize the fastq_parts list generation.
Consider using a list comprehension with f-strings for better performance:

```diff
-fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
+fastq_parts = [f"{i:03d}" for i in range(1, min(workflow.cores, 999) + 1)]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
harpy/snakefiles/demultiplex_gen1.smk (4 hunks)
🔇 Additional comments (4)
harpy/snakefiles/demultiplex_gen1.smk (4)
93-117: Input validation needed for schema file.
The rule now includes schema as input but lacks validation.
Run this script to check schema file validation in the Python script:

```bash
#!/bin/bash
# Check schema file validation in demultiplex_gen1.py
echo "Searching for schema file validation:"
rg "schema.*=.*" "scripts/demultiplex_gen1.py"
echo "Searching for file existence checks:"
rg "os\.path\.exists" "scripts/demultiplex_gen1.py"
```
119-130: LGTM! Efficient implementation of merge_partitions.
The rule uses `cat` for efficient concatenation of files.
143-164: LGTM! Well-structured barcode log merging.
The implementation:
- Uses efficient data structures
- Handles file I/O properly
- Maintains consistent output format
235-267: LGTM! Comprehensive workflow summary.
The rule properly includes the new barcode logs and maintains good documentation practices.
This PR attempts to use the divide-and-conquer demultiplexing strategy proposed by @bpenaud, made idiomatic for Snakemake.
Additionally:
- `align_bxstats.qmd` to handle single-sample runs and column names
- `harpy view` to avoid `subprocess` calls and circumvent the stdout buffer warning that was printed to the terminal.
New Features
Bug Fixes
Chores