Initial upload of CVA16 dataset #412
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,6 +23,7 @@ | |
| ] | ||
| }, | ||
| "dataset_order": [ | ||
| - "enpen/enterovirus/ev-d68" | ||
| + "enpen/enterovirus/ev-d68", | ||
| + "enpen/enterovirus/cva16" | ||
| ] | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| ## Unreleased | ||
|
|
||
| Initial release of a Coxsackievirus A16 dataset for lineage classification! | ||
|
|
||
| Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| # Coxsackievirus A16 dataset | ||
|
|
||
| | Key | Value | | ||
| |----------------------|-----------------------------------------------------------------------| | ||
| | authors | [Nadia Neuner-Jehle](https://eve-lab.org/people/nadia-neuner-jehle/), [Alejandra González-Sánchez](https://www.vallhebron.com/en/professionals/alejandra-gonzalez-sanchez), [Emma B. Hodcroft](https://eve-lab.org/people/emma-hodcroft/), [ENPEN](https://escv.eu/european-non-polio-enterovirus-network-enpen/) | | ||
| | name | Coxsackievirus A16 | | ||
| | reference | [Static Inferred Ancestor](https://github.com/enterovirus-phylo/nextclade_a16/blob/master/resources/inferred-root.fasta) | | ||
| | workflow | https://github.com/enterovirus-phylo/nextclade_a16 | | ||
| | path | `enpen/enterovirus/cva16` | | ||
| | clade definitions | A–F | | ||
|
|
||
| ## Scope of this dataset | ||
|
|
||
| This dataset uses the [Static Inferred Ancestor](https://github.com/enterovirus-phylo/nextclade_a16/blob/master/resources/inferred-root.fasta) instead of the historical G-10 prototype sequence ([U05876.1](https://www.ncbi.nlm.nih.gov/nuccore/U05876)). It is intended for broad subgenogroup classification, mutation quality control, and phylogenetic analysis of CVA16 diversity. | ||
|
|
||
| *Note: The G-10 reference differs substantially from currently circulating strains.* This is common for enterovirus datasets, in contrast to some other virus datasets (e.g., seasonal influenza), where the reference is updated more frequently to reflect recent lineages. | ||
|
|
||
| To address this, the dataset is *rooted* on a Static Inferred Ancestor, a phylogenetically reconstructed ancestral sequence near the tree root. This provides a stable reference point that can be used as an alternative for mutation calling. | ||
|
|
||
| ## Features | ||
|
|
||
| This dataset supports: | ||
|
|
||
| - Assignment of subgenotypes | ||
| - Phylogenetic placement | ||
| - Sequence quality control (QC) | ||
|
|
||
| ## Subgenogroups of Coxsackievirus A16 | ||
|
|
||
| Subgenogroups B1a, B1b and B1c represent the major phylogenetic divisions of CVA16 and are commonly used in virological surveillance and the literature. They are defined based on phylogenetic clustering and do not necessarily reflect antigenic differences. | ||
|
|
||
| In recent years, additional recombinant forms have been identified and labeled C-F (also referred to as B2, B3, and D). These recombinant forms cluster with the prototype strain (clade A). | ||
|
|
||
| Overall, these designations are based on phylogenetic structure and characteristic mutations, and are widely used in molecular epidemiology, similar to subgenotype systems for other enteroviruses. Unlike influenza (H1N1, H3N2) or SARS-CoV-2, there is no universally standardized global lineage nomenclature for enteroviruses; naming instead follows conventions established in published studies and surveillance practices. | ||
|
|
||
| ## Related Enteroviruses | ||
|
|
||
| CVA16 is closely related to other EV-A viruses, including EV-A71, EV-A120, and CVA5. If you are not certain that your sequences contain only CVA16, we recommend using the "[Multiple Datasets](https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-web/getting-started.html#multi-dataset-mode)" tab instead of "Single Dataset". | ||
|
|
||
| This prevents Nextclade from forcing sequences to align to the CVA16 reference tree. For example, EV-A71 sequences may still align and receive a clade assignment (often near recombinant forms). | ||
|
|
||
| Please be cautious when working with short genes or fragments (e.g., 5'UTR sequences). These regions can be highly conserved across EV-A viruses, making genogroup and subgenogroup assignment prone to errors. In addition, such fragments may originate from recombinant genomes. Recombination is common in enteroviruses, and when analyzing only a fragment, this may go undetected. | ||
|
|
||
| If you are unsure how to proceed, please contact us. We are happy to assist. | ||
|
|
||
| ## Reference types | ||
|
|
||
| This dataset includes several reference points used in analyses: | ||
| - *Static Inferred Ancestor:* Reconstructed ancestral sequence inferred with an outgroup, representing the likely founder of CVA16. Serves as a stable reference. | ||
|
|
||
| - *Parent:* The nearest ancestral node of a sample in the tree, used to infer branch-specific mutations. | ||
|
|
||
| - *Clade founder:* The inferred ancestral node defining a clade (e.g., B1a, B2). Mutations "since clade founder" describe changes that define that clade. | ||
|
|
||
| - *Reference:* RefSeq or similarly established prototype sequence. Here G-10 (U05876.1). | ||
|
|
||
| - *Tree root:* Corresponds to the root of the tree, it may change in future updates as more data become available. | ||
|
|
||
| All references use the coordinate system of the G-10 sequence. | ||
|
|
||
| ## Issues & Contact | ||
| - For questions or suggestions, please [open an issue](https://github.com/enterovirus-phylo/nextclade_a16/issues) or email: eve-group[at]swisstph.ch | ||
|
|
||
| ## What is a Nextclade dataset? | ||
|
|
||
| A Nextclade dataset includes the reference sequence, genome annotations, tree, clade definitions, and QC rules. Learn more in the [Nextclade documentation](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html). | ||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,17 @@ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ##gff-version 3 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| #!gff-spec-version 1.21 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| #!processor NCBI annotwriter | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ##sequence-region U05876.1 1 7413 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines +4 to +17
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ##sequence-region U05876.1 1 7413 | |
| ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704 | |
| U05876.1 Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10 | |
| U05876.1 Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69 | |
| U05876.1 Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323 | |
| U05876.1 Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565 | |
| U05876.1 Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862 | |
| U05876.1 Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012 | |
| U05876.1 Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111 | |
| U05876.1 Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440 | |
| U05876.1 Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526 | |
| U05876.1 Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548 | |
| U05876.1 Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731 | |
| U05876.1 Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193 | |
| ##sequence-region ancestral_sequence 1 7413 | |
| ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31704 | |
| ancestral_sequence Genbank region 1 7413 . + . ID=U05876.1:1..7413;Dbxref=taxon:31704;gb-acronym=CV-A16;gbkey=Src;mol_type=genomic RNA;nat-host=Homo sapiens;strain=G-10 | |
| ancestral_sequence Genbank CDS 751 957 . + . Name=VP4;gbkey=Prot;product=VP4;ID=id-AAA50478.1:1..69 | |
| ancestral_sequence Genbank CDS 958 1719 . + . Name=VP2;gbkey=Prot;product=VP2;ID=id-AAA50478.1:70..323 | |
| ancestral_sequence Genbank CDS 1720 2445 . + . Name=VP3;gbkey=Prot;product=VP3;ID=id-AAA50478.1:324..565 | |
| ancestral_sequence Genbank CDS 2446 3336 . + . Name=VP1;gbkey=Prot;product=VP1;ID=id-AAA50478.1:566..862 | |
| ancestral_sequence Genbank CDS 3337 3786 . + . Name=2A;product=2A;gbkey=Prot;ID=id-AAA50478.1:863..1012 | |
| ancestral_sequence Genbank CDS 3787 4083 . + . Name=2B;product=2B;gbkey=Prot;ID=id-AAA50478.1:1013..1111 | |
| ancestral_sequence Genbank CDS 4084 5070 . + . Name=2C;product=2C;gbkey=Prot;ID=id-AAA50478.1:1112..1440 | |
| ancestral_sequence Genbank CDS 5071 5328 . + . Name=3A;product=3A;gbkey=Prot;ID=id-AAA50478.1:1441..1526 | |
| ancestral_sequence Genbank CDS 5329 5394 . + . Name=3B;product=3B;gbkey=Prot;ID=id-AAA50478.1:1527..1548 | |
| ancestral_sequence Genbank CDS 5395 5943 . + . Name=3C;product=3C;gbkey=Prot;ID=id-AAA50478.1:1549..1731 | |
| ancestral_sequence Genbank CDS 5944 7329 . + . Name=3D;product=3D;gbkey=Prot;ID=id-AAA50478.1:1732..2193 |
Is this step really necessary?
No, not necessary. I mean it would be nice to have for consistency (and also in pathogen.json), but there are dozens of datasets which have these values all over the place. Don't bother. Hope users will understand. Might add an automated check later.
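An automated check of this kind could look roughly like the sketch below. It is a minimal Python illustration, not part of this PR or of Nextclade: the helper names (`fasta_ids`, `gff_seqids`, `seqid_mismatches`, `make_gff`) and the sample strings are hypothetical, GFF3 rows are assumed to be tab-separated, and the reference FASTA header is assumed to be `>ancestral_sequence`, as in the suggested change. It only reports sequence IDs that appear in the annotation but not in the FASTA.

```python
def fasta_ids(fasta_text):
    """Collect the first word of every FASTA header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def gff_seqids(gff_text):
    """Collect seqids from ##sequence-region directives and feature rows."""
    ids = set()
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) >= 2:
                ids.add(parts[1])
        elif line.strip() and not line.startswith("#"):
            # Column 1 of a GFF3 feature row is the seqid.
            ids.add(line.split("\t")[0])
    return ids


def seqid_mismatches(gff_text, fasta_text):
    """Return seqids used in the GFF3 that have no matching FASTA record."""
    return gff_seqids(gff_text) - fasta_ids(fasta_text)


def make_gff(seqid):
    """Build a tiny hypothetical annotation with one directive and one CDS row."""
    cds = "\t".join([seqid, "Genbank", "CDS", "751", "957", ".", "+", ".", "Name=VP4"])
    return "\n".join(["##gff-version 3", f"##sequence-region {seqid} 1 7413", cds]) + "\n"


# Placeholder sequence; only the header matters for this check.
FASTA = ">ancestral_sequence\nACGT\n"

# Annotation still named after G-10: reported as a mismatch.
print(seqid_mismatches(make_gff("U05876.1"), FASTA))
# Annotation renamed as suggested: nothing reported.
print(seqid_mismatches(make_gff("ancestral_sequence"), FASTA))
```

Returning the set of unmatched IDs rather than a boolean keeps the report usable across many datasets, where the interesting part is which name is out of place.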
The dataset README states the dataset uses a “Static Inferred Ancestor” instead of the G-10 prototype, but the accompanying genome annotation (`U05876.1`) and the generated `data_output/` dataset currently indicate G-10/U05876.1 as the reference. Please clarify which reference sequence is intended and update README and dataset metadata consistently (README, `pathogen.json` attributes, reference FASTA header/accession).
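Whichever name the reference ends up with, the CDS coordinates in the annotation can be sanity-checked mechanically, since the README states that all references share the G-10 coordinate system. The following is a minimal Python sketch under stated assumptions, not a Nextclade feature: the function names are hypothetical, rows are assumed to be tab-separated and listed in genome order, adjacent mature peptides are assumed to abut, and the amino-acid range is read from the `ID=...:start..end` suffix used in the annotation above. It checks that each CDS length is a multiple of three, agrees with its stated amino-acid range, and starts right after the previous CDS.

```python
import re


def parse_cds(gff_text):
    """Return (seqid, name, start, end, aa_len) for each CDS row."""
    rows = []
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # The ID attribute ends in ":<aa_start>..<aa_end>" in this annotation.
        m = re.search(r":(\d+)\.\.(\d+)$", attrs.get("ID", ""))
        aa_len = int(m.group(2)) - int(m.group(1)) + 1 if m else None
        rows.append((cols[0], attrs.get("Name", "?"), int(cols[3]), int(cols[4]), aa_len))
    return rows


def check_cds(rows):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    prev_end = {}  # last CDS end seen per seqid
    for seqid, name, start, end, aa_len in rows:
        nt_len = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if nt_len % 3 != 0:
            problems.append(f"{name}: length {nt_len} is not a multiple of 3")
        if aa_len is not None and aa_len * 3 != nt_len:
            problems.append(f"{name}: {nt_len} nt does not match {aa_len} aa")
        if seqid in prev_end and start != prev_end[seqid] + 1:
            problems.append(f"{name}: starts at {start}, expected {prev_end[seqid] + 1}")
        prev_end[seqid] = end
    return problems


def cds_row(seqid, start, end, name, aa_range):
    """Build one tab-separated CDS row in the style of the annotation."""
    attrs = f"Name={name};gbkey=Prot;product={name};ID=id-AAA50478.1:{aa_range}"
    return "\t".join([seqid, "Genbank", "CDS", str(start), str(end), ".", "+", ".", attrs])


# First three CDS rows from the annotation in this PR.
SAMPLE = "\n".join([
    "##gff-version 3",
    cds_row("U05876.1", 751, 957, "VP4", "1..69"),
    cds_row("U05876.1", 958, 1719, "VP2", "70..323"),
    cds_row("U05876.1", 1720, 2445, "VP3", "324..565"),
])

print(check_cds(parse_cds(SAMPLE)))
```

The check only covers internal consistency of the annotation; it does not confirm that the coordinates are correct for the inferred ancestral sequence itself.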