Assign clades with Nextclade #259
Copies the Nextclade "subclade" column to a new "subclade_nextclade" field and adds this new column to the Auspice config JSON for H3N2 HA as a proof of concept for exporting Nextclade-based clade annotations alongside the `augur clades` annotations. Since the Nextclade annotation of subclade already exists in the metadata we pass to `augur export`, we just need to rename that column to avoid conflicting with the "subclade" attribute that also appears in the node data JSON inputs to `augur export`. To avoid breaking any downstream logic that relies on the Nextclade column name of "subclade", I've chosen to duplicate the original column to a new column with the desired name.

Related to #254
Related to #131
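As a minimal sketch of the column duplication described above, the following copies a "subclade" column to "subclade_nextclade" while leaving the original in place. The helper name `duplicate_column` and the sample strain names are hypothetical, and the actual workflow may well use a different tool for this step.

```python
import csv
import io


def duplicate_column(rows, source, target):
    """Return new rows with `source` copied to `target`; `source` is kept."""
    duplicated = []
    for row in rows:
        if target in row:
            raise ValueError(f"column '{target}' already exists")
        new_row = dict(row)  # copy so the input rows are not modified
        new_row[target] = row[source]
        duplicated.append(new_row)
    return duplicated


# Hypothetical metadata; real metadata has many more columns.
metadata_tsv = (
    "strain\tsubclade\n"
    "A/Example/1/2024\tJ.2\n"
    "A/Example/2/2024\tJ.2.2\n"
)
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
renamed = duplicate_column(rows, "subclade", "subclade_nextclade")
```

Keeping the original column means anything downstream that still reads "subclade" from the metadata continues to work.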
Implements an alternate approach to assigning clade labels to internal nodes and tips using Nextclade instead of `augur clades`. This approach works by 1) exporting ancestral and tip sequences from `augur ancestral`, 2) assigning clade and subclade to those sequences with Nextclade, and 3) converting the table of clade labels per node to a node data JSON file. The logic for the first two steps already existed. The third step required an expansion of an existing script that converts data frames into node data such that this script can also export branch labels. With this approach, clade assignments no longer require us to download clade definitions from the clade nomenclature repo and the accuracy of the assignments no longer relies on the composition of the input sequences.

Closes #254
Updates the logic of the workflow and the table-to-node-data script to
handle the fact that we want to select a subset of metadata/Nextclade
columns that correspond to historical clade names for HA ("clade") and
currently maintained clade names for HA ("subclade") and NA ("clade")
which have different column names, different node attribute names, and
different branch labels. In an ideal world, we might drop the historical
HA clades and standardize the names of our Nextclade, node, and branch
attributes to "clade" across all segments, but for now, we need this
additional complexity in the workflow to handle the complexity in names.
    # Using a preorder traversal, find the first node in the tree with
    # each distinct value.
    for node in tree.find_clades():
        node_value = value_by_node.get(node.name)
        if node_value in branch_values:
            if node.name not in branches:
                branches[node.name] = {"labels": {}}
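The fragment above can be fleshed out as a self-contained sketch of the table-to-node-data step. It is not the script from this PR: the tree is a hypothetical parent-to-children dictionary standing in for the Bio.Phylo tree, and the function name, attribute name, and branch label name are assumptions chosen for illustration. The output follows the node data JSON layout with `nodes` and `branches` keys.

```python
import json


def preorder(node, children):
    """Yield node names in preorder (parent before its children)."""
    yield node
    for child in children.get(node, []):
        yield from preorder(child, children)


def table_to_node_data(value_by_node, children, root, attribute, branch_label):
    """Build node data with per-node attributes and branch labels.

    A branch label is placed on the first node, in preorder, that carries
    each distinct value.
    """
    nodes = {name: {attribute: value} for name, value in value_by_node.items()}
    branches = {}
    seen_values = set()
    for name in preorder(root, children):
        value = value_by_node.get(name)
        if value is not None and value not in seen_values:
            seen_values.add(value)
            branches[name] = {"labels": {branch_label: value}}
    return {"nodes": nodes, "branches": branches}


# Hypothetical tree and clade table (node name -> subclade).
children = {
    "root": ["NODE_1", "NODE_2"],
    "NODE_1": ["tipA", "tipB"],
    "NODE_2": ["tipC"],
}
value_by_node = {
    "root": "J",
    "NODE_1": "J.2",
    "tipA": "J.2",
    "tipB": "J.2",
    "NODE_2": "J",
    "tipC": "J",
}

node_data = table_to_node_data(
    value_by_node, children, "root", attribute="subclade", branch_label="Subclade"
)
print(json.dumps(node_data["branches"], indent=2))
```

Because only the first node per value is labeled, a value that arises independently in a second part of the tree gets no label there, which is why the order of traversal matters.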
augur clades guarantees that each clade will be mono/para-phyletic (although how it guarantees that is a little questionable!). While the nextclade approach should always result in the same structure, there aren't any guarantees. I'd suggest keeping track of all "start nodes" (nodes whose parent state differs from their own) in order to catch any polyphyletic calls and error / warn if encountered.
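One way the suggested check could look, as a rough sketch rather than a proposed implementation: collect every "start node" per clade value and flag any value that starts more than once. The function names and the parent-lookup dictionary are hypothetical, and the example tree is made up (a parent labeled J.2 whose two children are each labeled J.2.2).

```python
from collections import defaultdict


def find_start_nodes(parent_by_node, value_by_node):
    """Map each value to the nodes whose parent carries a different value."""
    starts = defaultdict(list)
    for node, value in value_by_node.items():
        parent = parent_by_node.get(node)
        # The root (no parent) always counts as a start node for its value.
        if parent is None or value_by_node.get(parent) != value:
            starts[value].append(node)
    return dict(starts)


def values_with_multiple_origins(parent_by_node, value_by_node):
    """Return values that begin at more than one node in the tree."""
    starts = find_start_nodes(parent_by_node, value_by_node)
    return {
        value: sorted(nodes) for value, nodes in starts.items() if len(nodes) > 1
    }


# Hypothetical subtree: NODE_1 is labeled J.2, both children are J.2.2.
parent_by_node = {
    "NODE_1": None,
    "NODE_2": "NODE_1",
    "NODE_3": "NODE_1",
    "tipA": "NODE_2",
    "tipB": "NODE_3",
}
value_by_node = {
    "NODE_1": "J.2",
    "NODE_2": "J.2.2",
    "NODE_3": "J.2.2",
    "tipA": "J.2.2",
    "tipB": "J.2.2",
}

suspect = values_with_multiple_origins(parent_by_node, value_by_node)
for value, nodes in suspect.items():
    print(f"WARNING: '{value}' starts at multiple nodes: {', '.join(nodes)}")
```

Whether to warn or raise an error on a flagged value is a policy decision left open here.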
Good call, @jameshadfield! I ran into this issue when testing a method of assigning clades with `augur traits`. It produced the following subtree for recent H3N2 HA sequences, where an ancestral node has the mutations for J.2.2 but gets annotated as J.2, and its two immediate children get annotated as J.2.2.
I can imagine that this same outcome could occur with this Nextclade approach, though. I think the `augur clades` approach of picking the largest matching subclade makes sense, at least for branch labels.
I think nextstrain/augur#1837 may solve this, depending on how the |




Description of proposed changes
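Implements an alternate approach to assigning clade labels to internal nodes and tips using Nextclade instead of `augur clades`. This approach works by:
1. exporting ancestral and tip sequences from `augur ancestral`,
2. assigning clade and subclade to those sequences with Nextclade, and
3. converting the table of clade labels per node to a node data JSON file.

The logic for the first two steps already existed. The third step required an expansion of an existing script that converts data frames into node data such that this script can also export branch labels.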
With this approach, clade assignments no longer require us to download clade definitions from the clade nomenclature repo and the accuracy of the assignments no longer relies on the composition of the input sequences. The benefit of this approach over one that relies on discrete trait analysis (DTA) with `augur traits` is that we get more deterministic clade labels for internal nodes that reflect the inferred ancestral sequence instead of a single inferred ancestral trait.
This PR builds on work I prototyped in #258, but I think this approach is a better long-term solution in that it provides a full replacement of the existing subclades functionality with internal node coloring and branch labels.
Examples
The following is an example of a small H3N2 HA dataset with subclades assigned by `augur clades`:

Then this example shows the same dataset with subclades assigned by Nextclade with this PR, showing how `augur clades` missed that several samples actually map to subclade J.2.4 (the same issue noted in #254):

Related issue(s)
Replaces #258
Closes #254
Closes #131
Closes #91
Checklist