@nneune Great job as always! Technically looks good. I assume Emma or Richard will review the science. My only complaints are a few typos (see below), and that the dataset path might not accommodate future needs: there will permanently be only one dataset under

I've been trying these new AI-assisted dataset reviews. Posting in case useful:
Testing

Try in Nextclade Web:

Science

Background on the pathogen, its classification, epidemiology, and the reference strains used in this dataset. Provides context for evaluating dataset design decisions.

Coxsackievirus A16 biology and classification

Coxsackievirus A16 (CVA16) is a member of species Enterovirus A in the family Picornaviridae. It is one of the major causative agents of hand, foot, and mouth disease (HFMD), primarily affecting children under 5 (Sun et al., J Clin Microbiol 2014). The positive-sense single-stranded RNA genome is approximately 7,410 nt, encoding a single polyprotein cleaved into structural proteins (VP4, VP2, VP3, VP1) and non-structural proteins (2A, 2B, 2C, 3A, 3B, 3C, 3D), flanked by 5' and 3' UTRs (Xu et al., Front Microbiol 2025).

The reference genome in this dataset, G-10 (GenBank U05876.1, 7413 nt), is the prototype CVA16 strain isolated in South Africa in 1951. It represents genotype A, the sole member of this clade (Sun et al., J Clin Microbiol 2014). As the README correctly notes, G-10 differs substantially from currently circulating strains, which belong to genotype B sublineages. VP1 is the standard molecular target for enterovirus typing and subgenogroup classification (Sun et al., J Clin Microbiol 2014), consistent with the dataset's

Subgenogroup classification

CVA16 phylogeny based on VP1 defines genotypes A, B, and D (Sun et al., J Clin Microbiol 2014). Genotype B is divided into B1 and B2, with B1 further split into B1a, B1b, and B1c clusters at 6.6-8.0% genetic distance (Zeng et al., Viruses 2025). Recombinant forms (sometimes labeled C-F) are described in Han et al., Virus Evol 2024.

The dataset tree includes clades: A, B1, B1a, B1b, B1c, C, D, E, F, RFs, and unassigned. The README describes clades C-F as "recombinant forms" that cluster with the prototype strain (clade A), also known as B2, B3, and D in alternative nomenclatures. This is consistent with the recombination-driven genotype evolution described in Han et al., Virus Evol 2024.

B1a and B1b co-circulated globally for decades; B1b became dominant in some regions after 2020 (Xu et al., Front Microbiol 2025). B1c, first reported in Southeast Asia and Europe after 2000, surged in China since 2023-2024 (Zeng et al., Viruses 2025) and was recently detected in Thailand for the first time in 2023 (Taoma et al., Microbiol Resour Announc 2025). The tree reflects this diversity with B1a (733 nodes), B1b (483), B1c (195), B1 (80), and smaller representation of D (43), C (5), RFs (4), F (3), E (3), and A (1).

ENPEN and enterovirus Nextclade datasets

The European Non-Polio Enterovirus Network (ENPEN), under the European Society for Clinical Virology, coordinates enterovirus surveillance across 20+ European countries (Harvala et al., Microorganisms 2021). A 2025 study in The Lancet Regional Health - Europe analyzed 63,659 samples from 48 countries (2015-2022), with ENPEN contributing 85% of typed non-polio enterovirus data (Harvala et al., Lancet Reg Health Eur 2025).

This CVA16 dataset is the second ENPEN enterovirus dataset for Nextclade, following EV-D68. The build pipeline is available at enterovirus-phylo/nextclade_a16, adapted from the EV-D68 pipeline template. The same team (Neuner-Jehle, Gonzalez-Sanchez, Hodcroft) maintains both datasets.

Blocking issues

Issues affecting scientific correctness, data integrity, or user-facing accuracy. These block adoption of the dataset until addressed.

🔴 H1. Dataset path naming inconsistent with sibling dataset

The dataset path The EV-D68 dataset is already released (3 versions) and its flat convention (no reference accession suffix) is locked. The reference suffix question is moot for CVA16 if ENPEN follows the same pattern.

Effect: Once released, the path cannot be renamed.
Fix: Consider

Non-blocking issues

Cosmetic issues, minor inconsistencies, and documentation improvements. Fix if time allows.

🟡 M1. Typo "Cocksackievirus" in README

Fix: Change "Cocksackievirus" to "Coxsackievirus" on line 28.

🟡 M2. `reference name` attribute duplicates pathogen name

In The EV-D68 dataset uses the FASTA header description: The

Fix: Change

🟡 M3. Consider `experimental` flag given testing status

The PR description states: "It needs to be tested further by ENPEN and others." The dataset has no If the dataset is intended for broader testing before full release, setting

Fix: Consider adding

🟡 M4. Three files missing trailing newlines

The following files lack a trailing newline:

Fix: Add a trailing newline to each file.

🔵 L1. No citation section in README

The EV-D68 README includes a "Citation" section with a recommended citation. The CVA16 README omits this.

Fix: Add a citation section referencing the dataset authors and the workflow repository, consistent with the EV-D68 README.

Notes

Observations that require no action: correct design decisions, positive patterns, comparisons with related work, and future improvement ideas.
Nextclade CLI run

Nextclade CLI run via Docker (

Reference sequence

The reference (

Example sequences (35 total)

Clade assignments: B1b (12), B1a (9), B1c (7), B1 (5), C (1), D (1). All 35 received a clade assignment.

Private mutations: min=0, max=377, mean=51.5. 5/35 exceed the threshold of 120. Three sequences (PX448982, PX448985, PX448978) have >200 private mutations, suggesting they are divergent from the nearest tree node.

Frameshifts: 4/35 have frameshifts. No

These are large frameshifts spanning most of each CDS. They are more likely sequencing artifacts or incomplete sequences than biological frameshifts. If they are expected in the example set, adding

SNP clusters: 5/35 flagged as bad, 3/35 as mediocre. The

Stop codons: 1/35 has a premature stop codon.

Missing data: 0 across all 35 sequences.
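For anyone wanting to reproduce the private-mutation summary above from a Nextclade TSV, a minimal Python sketch follows. It assumes the TSV exposes the columns `seqName` and `qc.privateMutations.total`; those column names are assumptions about the output layout and should be checked against the actual file. The helper name `summarize_private_mutations` and the example rows are made up for illustration and are not the values from this run. Only the threshold of 120 is taken from the review.

```python
import csv
import io
from statistics import mean

# Assumed column names in the Nextclade TSV output; adjust to the real file.
NAME_COL = "seqName"
PRIVATE_COL = "qc.privateMutations.total"


def summarize_private_mutations(tsv_text, threshold=120):
    """Summarize per-sequence private mutation counts and list sequences over the threshold."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {row[NAME_COL]: float(row[PRIVATE_COL]) for row in rows}
    values = list(counts.values())
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 1),
        "over_threshold": sorted(name for name, c in counts.items() if c > threshold),
    }


# Made-up rows for illustration only; not the values from this review.
EXAMPLE_TSV = (
    "seqName\tqc.privateMutations.total\n"
    "example_1\t0\n"
    "example_2\t45\n"
    "example_3\t377\n"
)

if __name__ == "__main__":
    print(summarize_private_mutations(EXAMPLE_TSV))
```

Listing the sequence names over the threshold, rather than only counting them, makes it easier to decide which example sequences to inspect by hand.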
Claude is being a bit too dramatic here 😆

Oh wow, the AI did a great job at summarizing the science behind CVA16. For the "dataset path naming" inconsistency, CVA16 is actually the official name and not CV-A16 (see Simmonds et al., 2020). I'll correct the typos!
nneune left a comment
The same issues exist as with the EV-D68 dataset. Divergence is not a valid QC label, and the nucMutLabelMapReverse is deprecated.
…ral squence instead of prototype reference.
Pull request overview
This PR adds an ENPEN Nextclade dataset for Coxsackievirus A16 (CVA16) and wires it into the ENPEN collection so it appears in the dataset server index.
Changes:
- Adds a new dataset under `data/enpen/enterovirus/cva16/` (reference, annotation, config, tree, examples, README/CHANGELOG).
- Updates the ENPEN collection ordering to include `enpen/enterovirus/cva16`.
- Adds corresponding generated artifacts under `data_output/`, including the dataset bundle and an entry in `data_output/index.json`.
Reviewed changes
Copilot reviewed 10 out of 18 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| data/enpen/collection.json | Adds CVA16 to ENPEN dataset ordering. |
| data/enpen/enterovirus/cva16/pathogen.json | Introduces CVA16 dataset configuration and QC/clade metadata. |
| data/enpen/enterovirus/cva16/reference.fasta | Adds the CVA16 reference sequence used for alignment/mutation calling. |
| data/enpen/enterovirus/cva16/genome_annotation.gff3 | Adds genome annotation for coding regions. |
| data/enpen/enterovirus/cva16/README.md | Documents dataset scope/reference choice and usage notes. |
| data/enpen/enterovirus/cva16/CHANGELOG.md | Adds initial release notes under ## Unreleased. |
| data/enpen/enterovirus/cva16/sequences.fasta | Adds example sequences for the dataset. |
| data/enpen/enterovirus/cva16/tree.json | Adds the reference phylogeny used for placement/clade assignment. |
| data_output/index.json | Registers the CVA16 dataset in the generated dataset index. |
| data_output/enpen/enterovirus/cva16/unreleased/reference.fasta | Generated reference FASTA for the unreleased dataset version. |
| data_output/enpen/enterovirus/cva16/unreleased/genome_annotation.gff3 | Generated genome annotation for the unreleased dataset version. |
| data_output/enpen/enterovirus/cva16/unreleased/pathogen.json | Generated pathogen config for the unreleased dataset version. |
| data_output/enpen/enterovirus/cva16/unreleased/tree.json | Generated tree for the unreleased dataset version. |
| data_output/enpen/enterovirus/cva16/unreleased/sequences.fasta | Generated example sequences for the unreleased dataset version. |
| data_output/enpen/enterovirus/cva16/unreleased/README.md | Generated README displayed to dataset users. |
| data_output/enpen/enterovirus/cva16/unreleased/CHANGELOG.md | Generated changelog displayed to dataset users. |
>ancestral_sequence
TTAAAACAGCCTGTGGGTTGTTCCCACCCACAGGGCCCACTGGGCGCTAGCACACTGATT
The generated data_output dataset uses U05876.1 as the reference (see FASTA header), which conflicts with the data/ dataset’s reference ID (ancestral_sequence). Since data_output/ should be produced from data/, please resolve the reference choice in data/ and rerun ./scripts/rebuild to regenerate data_output/enpen/enterovirus/cva16/unreleased/* consistently.
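One way to catch this kind of drift before committing is to compare the first FASTA header in the source and generated copies. The Python sketch below is illustrative only: the helper names `first_fasta_id` and `references_match` are made up here, and the two paths are the ones listed in the file summary of this PR.

```python
from pathlib import Path


def first_fasta_id(text):
    """Return the ID (first whitespace-delimited token) of the first FASTA record."""
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            return fields[0] if fields else ""
    raise ValueError("no FASTA header found")


def references_match(source, generated):
    """True when both FASTA files start with the same reference ID."""
    return first_fasta_id(Path(source).read_text()) == first_fasta_id(Path(generated).read_text())


if __name__ == "__main__":
    src = Path("data/enpen/enterovirus/cva16/reference.fasta")
    out = Path("data_output/enpen/enterovirus/cva16/unreleased/reference.fasta")
    # Only run the comparison when both files are present in the working tree.
    if src.exists() and out.exists():
        if references_match(src, out):
            print("reference IDs are consistent")
        else:
            print("reference IDs differ; rerun ./scripts/rebuild")
```

This only compares IDs, not sequence content, so it would not detect a case where the header matches but the sequence was changed.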
##sequence-region U05876.1 1 7413
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
In this dataset, the GFF3 seqid is U05876.1, but the reference FASTA header in reference.fasta is ancestral_sequence. Nextclade expects the genome annotation seqid values to match the reference sequence ID; otherwise CDS translation/annotation lookup can fail. Please make the FASTA ID and all first-column GFF3 IDs consistent (either rename the FASTA header to U05876.1 or change the GFF3 seqid/##sequence-region to ancestral_sequence).
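The mismatch described here can be checked mechanically. The Python sketch below collects the record IDs from a FASTA file and the seqids from a GFF3 file (both the `##sequence-region` pragma and column 1 of each feature line) and reports any seqid with no matching FASTA record. The function names are hypothetical, and the inline FASTA and GFF3 strings are abbreviated copies of the files in this thread, not the full files.

```python
def fasta_ids(fasta_text):
    """Collect record IDs: the first token after '>' on each header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff3_seqids(gff_text):
    """Collect seqids from ##sequence-region pragmas and column 1 of feature lines."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            fields = line.split()
            if len(fields) > 1:
                ids.add(fields[1])
        elif line.strip() and not line.startswith("#"):
            # GFF3 feature lines are tab-separated; the seqid is the first column.
            ids.add(line.split("\t")[0])
    return ids


def unmatched_seqids(fasta_text, gff_text):
    """Seqids used in the annotation that have no matching FASTA record."""
    return gff3_seqids(gff_text) - fasta_ids(fasta_text)


# Abbreviated stand-ins for reference.fasta and genome_annotation.gff3.
FASTA = ">ancestral_sequence\nTTAAAACAGCCTGTGGGTTG\n"
GFF3 = (
    "##gff-version 3\n"
    "##sequence-region U05876.1 1 7413\n"
    "U05876.1\tGenbank\tCDS\t751\t957\t.\t+\t.\tName=VP4\n"
)

if __name__ == "__main__":
    print(unmatched_seqids(FASTA, GFF3))
    print(unmatched_seqids(FASTA, GFF3.replace("U05876.1", "ancestral_sequence")))
```

An empty result means every seqid in the annotation has a FASTA record with the same ID, which is the condition the comment asks for.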
##sequence-region U05876.1 1 7413
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193

##sequence-region ancestral_sequence 1 7413
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704
ancestral_sequence Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10
ancestral_sequence Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69
ancestral_sequence Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323
ancestral_sequence Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565
ancestral_sequence Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862
ancestral_sequence Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012
ancestral_sequence Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111
ancestral_sequence Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440
ancestral_sequence Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526
ancestral_sequence Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548
ancestral_sequence Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731
ancestral_sequence Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193
Is this step really necessary?
No, not necessary. I mean it would be nice to have for consistency (and also in pathogen.json), but there are dozens of datasets which have these values all over the place. Don't bother. Hope users will understand. Might add an automated check later.
This dataset uses the [Static Inferred Ancestor](https://github.com/enterovirus-phylo/nextclade_a16/blob/master/resources/inferred-root.fasta) instead of the historical G-10 prototype sequence ([U05876.1](https://www.ncbi.nlm.nih.gov/nuccore/U05876)). It is intended for broad subgenogroup classification, mutation quality control, and phylogenetic analysis of CVA16 diversity.

*Note: The G-10 reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages.

To address this, the dataset is *rooted* on a Static Inferred Ancestor, a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can be used as an alternative for mutation calling.
The dataset README states the dataset uses a “Static Inferred Ancestor” instead of the G-10 prototype, but the accompanying genome annotation (U05876.1) and the generated data_output/ dataset currently indicate G-10/U05876.1 as the reference. Please clarify which reference sequence is intended and update README and dataset metadata consistently (README, pathogen.json attributes, reference FASTA header/accession).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request introduces a new Nextclade dataset for Coxsackievirus A16 (CVA16), based on the reference strain "G-10". It provides all the essential files and documentation required for lineage classification, phylogenetic analysis, and quality control of CVA16 sequences. The dataset is tailored for broad subgenogroup assignment and includes detailed metadata, genome annotation, and configuration for Nextclade compatibility.
It needs to be tested further by ENPEN and others.