Contributor

@huddlej huddlej commented Oct 8, 2025

Description of proposed changes

Implements an alternate approach to assigning clade labels to internal nodes and tips using Nextclade instead of augur clades. This approach works by:

  1. exporting ancestral and tip sequences from augur ancestral
  2. assigning clade and subclade to those sequences with Nextclade
  3. converting the table of clade labels per node to a node data JSON file

The logic for the first two steps already existed. The third step required an expansion of an existing script that converts data frames into node data such that this script can also export branch labels.
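As a rough illustration of the third step, the following sketch converts a per-node table into a node data dictionary with a top-level "nodes" key. It is a minimal example and not the script from this PR: the function name `table_to_node_data`, the `seqName` column, the node names, and the strain name are all assumptions made for the example. Branch labels additionally need the tree topology and are left out here.

```python
import csv
import io
import json


def table_to_node_data(rows, node_column, value_column, attribute):
    """Build a node data dict from rows of a per-node table.

    Hypothetical helper for illustration; not the PR's actual script.
    """
    node_data = {"nodes": {}}
    for row in rows:
        value = row[value_column]
        if value:  # skip nodes without an assignment
            node_data["nodes"][row[node_column]] = {attribute: value}
    return node_data


# Toy Nextclade-like table: one row per tip or internal node.
# Column names and values are assumptions for this example.
table = io.StringIO(
    "seqName\tsubclade\n"
    "NODE_0000001\tJ.2\n"
    "NODE_0000002\tJ.2.2\n"
    "A/Example/1/2025\tJ.2.2\n"
)
rows = list(csv.DictReader(table, delimiter="\t"))
node_data = table_to_node_data(rows, "seqName", "subclade", "subclade")
print(json.dumps(node_data, indent=2))
```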

With this approach, clade assignments no longer require us to download clade definitions from the clade nomenclature repo, and the accuracy of the assignments no longer depends on the composition of the input sequences. The benefit of this approach over one that relies on discrete trait analysis (DTA) with augur traits is that we get more deterministic clade labels for internal nodes that reflect the inferred ancestral sequence instead of a single inferred ancestral trait.

This PR builds on work I prototyped in #258, but I think this approach is a better long-term solution in that it provides a full replacement of the existing subclades functionality with internal node coloring and branch labels.

Examples

The following is an example of a small H3N2 HA dataset with subclades assigned by augur clades:

image 11

The next example shows the same dataset with subclades assigned by Nextclade with this PR. It shows that augur clades missed several samples that actually map to subclade J.2.4 (the same issue noted in #254):

image 10

Related issue(s)

Replaces #258
Closes #254
Closes #131
Closes #91

Checklist

  • Checks pass
  • Update changelog

Copies the Nextclade "subclade" column to a new "subclade_nextclade"
field and adds this new column to the Auspice config JSON for H3N2 HA as
a proof of concept for exporting Nextclade-based clade annotations
alongside the `augur clades` annotations. Since the Nextclade annotation
of subclade already exists in the metadata we pass to `augur export`, we
just need to rename that column to avoid conflicting with the "subclade"
attribute that also appears in the node data JSON inputs to `augur
export`. To avoid breaking any downstream logic that relies on the
Nextclade column name of "subclade", I've chosen to duplicate the
original column to a new column with the desired name.

Related to #254
Related to #131

Implements an alternate approach to assigning clade labels to internal
nodes and tips using Nextclade instead of `augur clades`. This approach
works by 1) exporting ancestral and tip sequences from augur ancestral,
2) assigning clade and subclade to those sequences with Nextclade, and
3) converting the table of clade labels per node to a node data JSON
file. The logic for the first two steps already existed. The third step
required an expansion of an existing script that converts data frames
into node data such that this script can also export branch labels.

With this approach, clade assignments no longer require us to download
clade definitions from the clade nomenclature repo and the accuracy of
the assignments no longer relies on the composition of the input
sequences.

Closes #254

Updates the logic of the workflow and the table-to-node-data script to
select a subset of metadata/Nextclade columns: historical clade names
for HA ("clade") and currently maintained clade names for HA
("subclade") and NA ("clade"). These have different column names,
different node attribute names, and different branch labels. In an
ideal world, we might drop the historical HA clades and standardize the
names of our Nextclade, node, and branch attributes to "clade" across
all segments, but for now, we need this additional complexity in the
workflow to handle the inconsistent names.
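
One way to picture the per-segment mapping described in this commit message is a small lookup table from Nextclade column to node attribute and branch label. This is only a sketch: the Nextclade column names come from the message above, but the node attribute names, branch label names, and the `CladeAnnotation` and `annotate` helpers are hypothetical and may not match the workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CladeAnnotation:
    nextclade_column: str  # column in the Nextclade/metadata table
    node_attribute: str    # hypothetical node attribute name
    branch_label: str      # hypothetical branch label name


# Hypothetical mapping; attribute and label names are assumptions.
ANNOTATIONS_BY_SEGMENT = {
    "ha": [
        CladeAnnotation("clade", "clade_membership", "clade"),
        CladeAnnotation("subclade", "subclade", "Subclade"),
    ],
    "na": [
        CladeAnnotation("clade", "subclade", "Subclade"),
    ],
}


def annotate(row, segment):
    """Select and rename the clade columns of one table row for a segment."""
    attributes = {}
    for annotation in ANNOTATIONS_BY_SEGMENT[segment]:
        value = row.get(annotation.nextclade_column)
        if value:
            attributes[annotation.node_attribute] = value
    return attributes


# Toy rows; the HA "clade" value is a placeholder.
ha_attributes = annotate({"clade": "historical-clade", "subclade": "J.2.2"}, "ha")
na_attributes = annotate({"clade": "D"}, "na")
print(ha_attributes)
print(na_attributes)
```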
Comment on lines 81 to 87
# Using a preorder traversal, find the first node in the tree with
# each distinct value.
for node in tree.find_clades():
    node_value = value_by_node.get(node.name)
    if node_value in branch_values:
        if node.name not in branches:
            branches[node.name] = {"labels": {}}
Member

augur clades guarantees that each clade will be mono/para-phyletic (although how it guarantees that is a little questionable!). While the Nextclade approach should always result in the same structure, there aren't any guarantees. I'd suggest keeping track of all "start nodes" (nodes whose parent state differs from their own) in order to catch any polyphyletic calls and error / warn if encountered.
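
A minimal sketch of the "start nodes" check suggested above, assuming a toy `Node` class rather than the Bio.Phylo tree used in the script. The names `find_start_nodes` and `polyphyletic_values` and the toy tree are hypothetical. A value with more than one start node was called in more than one place in the tree and could trigger a warning or error.

```python
from collections import defaultdict


class Node:
    """Toy tree node (assumption; stands in for the script's tree nodes)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def find_start_nodes(root, value_by_node):
    """Map each value to the nodes whose parent has a different value."""
    start_nodes = defaultdict(list)
    stack = [(root, None)]
    while stack:
        node, parent_value = stack.pop()
        value = value_by_node.get(node.name)
        if value is not None and value != parent_value:
            start_nodes[value].append(node.name)
        # reversed() keeps a left-to-right preorder with a LIFO stack.
        for child in reversed(node.children):
            stack.append((child, value))
    return dict(start_nodes)


def polyphyletic_values(start_nodes):
    """Return only the values that start at more than one node."""
    return {value: names for value, names in start_nodes.items() if len(names) > 1}


# Toy tree in which J.2.2 is called on two separate lineages.
tree = Node("root", [
    Node("A", [Node("tip1")]),
    Node("B", [Node("C", [Node("tip2")]), Node("tip3")]),
])
value_by_node = {
    "root": "J.2", "A": "J.2.2", "tip1": "J.2.2",
    "B": "J.2", "C": "J.2.2", "tip2": "J.2.2", "tip3": "J.2",
}
start_nodes = find_start_nodes(tree, value_by_node)
problems = polyphyletic_values(start_nodes)
for value, names in problems.items():
    print(f"WARNING: {value} starts at multiple nodes: {', '.join(names)}")
```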

Contributor Author

Good call, @jameshadfield! I ran into this issue when testing a method of assigning clades with augur traits. That method produced the following subtree for recent H3N2 HA sequences, where an ancestral node has the mutations for J.2.2 but gets annotated as J.2 while its two immediate children get annotated as J.2.2.

image 2

I can imagine that this same outcome could occur with this Nextclade approach, though. I think the augur clades approach of picking the largest matching subclade makes sense at least for branch labels.
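
To make the "largest matching subclade" idea concrete for branch labels, here is a hedged sketch that labels, for each value, the start node with the most descendant tips. This is not Augur's implementation; the `Node` class, `count_tips`, `pick_branch_labels`, and the toy tree are assumptions for illustration, and tips are counted regardless of their own assigned value.

```python
class Node:
    """Toy tree node (assumption; stands in for the script's tree nodes)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def count_tips(node):
    """Count all tips descending from a node (a tip counts itself)."""
    if not node.children:
        return 1
    return sum(count_tips(child) for child in node.children)


def pick_branch_labels(root, value_by_node):
    """For each value, pick the start node with the most descendant tips."""
    best = {}  # value -> (tip count, node name)
    stack = [(root, None)]
    while stack:
        node, parent_value = stack.pop()
        value = value_by_node.get(node.name)
        if value is not None and value != parent_value:
            tips = count_tips(node)
            # Strict comparison: ties go to the first node in preorder.
            if value not in best or tips > best[value][0]:
                best[value] = (tips, node.name)
        for child in reversed(node.children):
            stack.append((child, value))
    return {value: name for value, (_, name) in best.items()}


# Toy tree: J.2.2 starts at A (3 tips) and again at C (1 tip).
tree = Node("root", [
    Node("A", [Node("t1"), Node("t2"), Node("t3")]),
    Node("B", [Node("C", [Node("t4")]), Node("t5")]),
])
value_by_node = {
    "root": "J.2", "A": "J.2.2", "t1": "J.2.2", "t2": "J.2.2", "t3": "J.2.2",
    "B": "J.2", "C": "J.2.2", "t4": "J.2.2", "t5": "J.2",
}
labels = pick_branch_labels(tree, value_by_node)
print(labels)
```

With this choice, node colors can still follow the per-node assignments while each branch label appears only once, on the largest matching clade.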

Contributor Author

huddlej commented Oct 17, 2025

Another example issue with augur clades annotations that appeared recently in our 6-month trees is when older clades have died out and their clade-defining mutations reappear in new clades, those new subclades get incorrectly labeled with the old clade names. The following screenshot from an H3N2 HA tree shows an example where a subclade of J.2.2 gets labeled as J.3 because it has the same HA2:V176I substitution.

image

With the Nextclade-based clade assignments, we get the following result where that newer subclade is labeled as part of J.2.2 and only the handful of J.3 strains in the tree get properly labeled with that clade.

image

@jameshadfield
Member

> Another example issue with augur clades annotations that appeared recently in our 6-month trees is when older clades have died out and their clade-defining mutations reappear in new clades, those new subclades get incorrectly labeled with the old clade names. The following screenshot from an H3N2 HA tree shows an example where a subclade of J.2.2 gets labeled as J.3 because it has the same HA2:V176I substitution.

I think nextstrain/augur#1837 may solve this, depending on how the clades.tsv was constructed. P.S. https://next.nextstrain.org/seasonal-flu/h3n2/ha/2y also shows J.3 as a subclade of J.2.2.

Contributor Author

huddlej commented Oct 22, 2025

Testing this again today with 6-month trees for H1N1pdm HA and NA, I found no issues with the HA assignments (all were consistently monophyletic), but the NA assignments had many polyphyletic branch labels for subclade D. To show this, I've annotated the branches in the tree below with the subclade and node name, which reveals multiple nodes labeled as subclade D.

image

The thickest branch descending from the MRCA of the tree above has the NA:I264T substitution that defines subclade D. Unfortunately, that substitution is homoplasic in both this tree and the Nextclade dataset tree. When I map that node's inferred sequence to the NA tree with Nextclade, it maps to a small subclade of C.5.3 which happens to have this same substitution. As a result, many descendants of that node (which have few or no additional substitutions) also map to C.5.3.

The public 6-month H1N1pdm HA tree (shown below) doesn't suffer from this issue, since Augur assigns labels to the largest matching clade with the defining substitutions.

image

The root issue here is that our NA subclades aren't defined by unique substitutions. Still, this incorrect assignment of NA subclades highlights a major limitation of the Nextclade-based clade assignment approach.
