Assign clades with Nextclade #259

huddlej · 2025-10-08T23:11:39Z

Description of proposed changes

Implements an alternate approach to assigning clade labels to internal nodes and tips using Nextclade instead of augur clades. This approach works by:

1. exporting ancestral and tip sequences from augur ancestral
2. assigning clade and subclade to those sequences with Nextclade
3. converting the table of clade labels per node to a node data JSON file

The logic for the first two steps already existed. The third step required an expansion of an existing script that converts data frames into node data such that this script can also export branch labels.
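
For readers unfamiliar with the node data format, the third step can be sketched roughly as follows. This is a minimal illustration and not the actual contents of scripts/table_to_node_data.py: the function names, the "name" column, the "Subclade" label, and the dict-based tree are all assumptions made for the example. It records one attribute per node and attaches a branch label to the first node in a preorder traversal that carries each distinct value.

```python
import json


def preorder(root, children):
    """Yield node names so that each parent comes before its children."""
    stack = [root]
    while stack:
        name = stack.pop()
        yield name
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(children.get(name, [])))


def table_to_node_data(rows, root, children, column, attribute, label):
    """Build a node data dict from rows keyed by a hypothetical "name" column."""
    value_by_node = {row["name"]: row[column] for row in rows if row.get(column)}

    node_data = {
        "nodes": {name: {attribute: value} for name, value in value_by_node.items()},
        "branches": {},
    }

    # Label only the first node (in preorder) that carries each distinct value.
    labeled_values = set()
    for name in preorder(root, children):
        value = value_by_node.get(name)
        if value is not None and value not in labeled_values:
            labeled_values.add(value)
            node_data["branches"][name] = {"labels": {label: value}}

    return node_data


# Toy tree and table (hypothetical node names and clade calls).
children = {"NODE_0": ["NODE_1", "A"], "NODE_1": ["B", "C"]}
rows = [
    {"name": "NODE_0", "subclade": "J.2"},
    {"name": "NODE_1", "subclade": "J.2.2"},
    {"name": "A", "subclade": "J.2"},
    {"name": "B", "subclade": "J.2.2"},
    {"name": "C", "subclade": "J.2.2"},
]
node_data = table_to_node_data(rows, "NODE_0", children, "subclade", "subclade", "Subclade")
print(json.dumps(node_data, indent=2))
```

In this toy tree, only NODE_0 and NODE_1 receive branch labels, while every node gets a "subclade" attribute for coloring.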

With this approach, clade assignments no longer require us to download clade definitions from the clade nomenclature repo, and the accuracy of the assignments no longer relies on the composition of the input sequences. The benefit of this approach over one that relies on discrete trait analysis (DTA) with augur traits is that we get more deterministic clade labels for internal nodes that reflect the inferred ancestral sequence instead of a single inferred ancestral trait.

This PR builds on work I prototyped in #258, but I think this approach is a better long-term solution in that it provides a full replacement of the existing subclades functionality with internal node coloring and branch labels.

Examples

The following is an example of a small H3N2 HA dataset with subclades assigned by augur clades:

Then this example shows the same dataset with subclades assigned by Nextclade with this PR, showing how augur clades missed that several samples actually map to subclade J.2.4 (the same issue noted in #254):

Related issue(s)

Replaces #258
Closes #254
Closes #131
Closes #91

Checklist

Checks pass
Update changelog

Copies the Nextclade "subclade" column to a new "subclade_nextclade" field and adds this new column to the Auspice config JSON for H3N2 HA as a proof of concept for exporting Nextclade-based clade annotations alongside the `augur clades` annotations. Since the Nextclade annotation of subclade already exists in the metadata we pass to `augur export`, we just need to rename that column to avoid conflicting with the "subclade" attribute that also appears in the node data JSON inputs to `augur export`. To avoid breaking any downstream logic that relies on the Nextclade column name of "subclade", I've chosen to duplicate the original column to a new column with the desired name.

Related to #254
Related to #131
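
As an illustration of the column duplication described in this commit message, the following is a minimal sketch using only the standard library. The function name and the in-memory TSV handling are assumptions for the example; the workflow itself may perform this step with a different tool.

```python
import csv
import io


def duplicate_column(tsv_text, source="subclade", target="subclade_nextclade"):
    """Return TSV text with `source` copied into a new `target` column.

    The original column is kept so that downstream logic relying on the
    original name keeps working.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [target]

    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[target] = row[source]
        writer.writerow(row)

    return output.getvalue()


# Toy metadata with hypothetical strain names.
metadata = "strain\tsubclade\nstrain_a\tJ.2\nstrain_b\tJ.2.4\n"
print(duplicate_column(metadata))
```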

Implements an alternate approach to assigning clade labels to internal nodes and tips using Nextclade instead of `augur clades`. This approach works by 1) exporting ancestral and tip sequences from augur ancestral, 2) assigning clade and subclade to those sequences with Nextclade, and 3) converting the table of clade labels per node to a node data JSON file. The logic for the first two steps already existed. The third step required an expansion of an existing script that converts data frames into node data such that this script can also export branch labels. With this approach, clade assignments no longer require us to download clade definitions from the clade nomenclature repo and the accuracy of the assignments no longer relies on the composition of the input sequences.

Closes #254

Updates the workflow and the table-to-node-data script to select the subset of metadata/Nextclade columns that correspond to historical clade names for HA ("clade") and currently maintained clade names for HA ("subclade") and NA ("clade"). These have different column names, different node attribute names, and different branch labels. In an ideal world, we might drop the historical HA clades and standardize the names of our Nextclade, node, and branch attributes to "clade" across all segments, but for now, we need this additional complexity in the workflow to handle the complexity in names.

jameshadfield · 2025-10-13T19:41:05Z

scripts/table_to_node_data.py

+            # Using a preorder traversal, find the first node in the tree with
+            # each distinct value.
+            for node in tree.find_clades():
+                node_value = value_by_node.get(node.name)
+                if node_value in branch_values:
+                    if node.name not in branches:
+                        branches[node.name] = {"labels": {}}


augur clades guarantees that each clade will be mono/para-phyletic (although how it guarantees that is a little questionable!). While the nextclade approach should always result in the same structure, there aren't any guarantees. I'd suggest keeping track of all "start nodes" (nodes whose parent state differs from their own) in order to catch any polyphyletic calls and error / warn if encountered.
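
A minimal sketch of the suggested check, assuming the tree is available as a child-to-parent mapping and the clade calls as a node-to-value mapping (both hypothetical structures, not what the script currently uses). A value with more than one start node has arisen more than once in the tree, which indicates a polyphyletic call.

```python
from collections import defaultdict


def find_start_nodes(parent_by_node, value_by_node):
    """Group, by value, the nodes whose parent state differs from their own."""
    start_nodes = defaultdict(list)
    for node, value in value_by_node.items():
        parent = parent_by_node.get(node)
        # The root has no parent, so it always starts its own value.
        parent_value = value_by_node.get(parent) if parent is not None else None
        if value != parent_value:
            start_nodes[value].append(node)
    return dict(start_nodes)


def find_polyphyletic_values(parent_by_node, value_by_node):
    """Return values that start at more than one node in the tree."""
    return {
        value: sorted(nodes)
        for value, nodes in find_start_nodes(parent_by_node, value_by_node).items()
        if len(nodes) > 1
    }


# Toy tree in which J.2.2 appears on two separate branches.
parent_by_node = {"NODE_1": "NODE_0", "A": "NODE_0", "B": "NODE_1", "C": "NODE_1"}
value_by_node = {"NODE_0": "J.2", "NODE_1": "J.2", "A": "J.2.2", "B": "J.2.2", "C": "J.2"}

for value, nodes in find_polyphyletic_values(parent_by_node, value_by_node).items():
    print(f"WARNING: {value} starts at multiple nodes: {', '.join(nodes)}")
```

Whether to warn or raise an error on a non-empty result is a policy choice left to the caller.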

Good call, @jameshadfield! I ran into this issue when testing a method of assigning clades with augur traits. It produced the following subtree for recent H3N2 HA sequences, where an ancestral node has the mutations for J.2.2 but gets annotated as J.2, and its two immediate children get annotated as J.2.2.

I can imagine that this same outcome could occur with this Nextclade approach, though. I think the augur clades approach of picking the largest matching subclade makes sense at least for branch labels.
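
One possible reading of "largest matching subclade" is the candidate node with the most descendant tips. The sketch below applies that rule to choose a single node to carry the branch label when a value starts at several nodes. The rule, the function names, and the dict-based tree are assumptions for illustration and do not describe how augur clades is implemented.

```python
def count_tips(node, children):
    """Count the tips descending from `node` (a tip counts as one)."""
    total = 0
    stack = [node]
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            total += 1
    return total


def pick_label_node(candidate_nodes, children):
    """Pick the candidate with the most descendant tips (first one wins ties)."""
    return max(candidate_nodes, key=lambda node: count_tips(node, children))


# Toy tree: J.2.2 starts at both tip "A" and internal node "NODE_1".
children = {"NODE_0": ["NODE_1", "A"], "NODE_1": ["B", "C"]}
print(pick_label_node(["A", "NODE_1"], children))
```

In the toy tree, NODE_1 has two descendant tips and A has one, so the label goes to NODE_1.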

huddlej · 2025-10-17T20:05:19Z

Another issue with augur clades annotations appeared recently in our 6-month trees: when older clades have died out and their clade-defining mutations reappear in new clades, those new subclades get incorrectly labeled with the old clade names. The following screenshot from an H3N2 HA tree shows an example where a subclade of J.2.2 gets labeled as J.3 because it has the same HA2:V176I substitution.

With the Nextclade-based clade assignments, we get the following result where that newer subclade is labeled as part of J.2.2 and only the handful of J.3 strains in the tree get properly labeled with that clade.

jameshadfield · 2025-10-19T21:44:54Z

> Another issue with augur clades annotations appeared recently in our 6-month trees: when older clades have died out and their clade-defining mutations reappear in new clades, those new subclades get incorrectly labeled with the old clade names. The following screenshot from an H3N2 HA tree shows an example where a subclade of J.2.2 gets labeled as J.3 because it has the same HA2:V176I substitution.

I think nextstrain/augur#1837 may solve this, depending on how the clades.tsv was constructed. P.S. https://next.nextstrain.org/seasonal-flu/h3n2/ha/2y also shows J.3 as a subclade of J.2.2.

huddlej · 2025-10-22T20:41:32Z

Testing this again today with 6-month trees for H1N1pdm HA and NA, I found no issues with the HA assignments (all were consistently monophyletic), but the NA assignments had many polyphyletic branch labels for the subclade D. To show this, I've annotated the branches in the tree below by subclade and node name, which shows the multiple nodes labeled as subclade D.

The thickest branch descending from the MRCA of the tree above has the NA:I264T substitution that defines subclade D. Unfortunately, that substitution is homoplasic in this tree and the Nextclade dataset tree. When I map that node's inferred sequence to the NA tree with Nextclade, it maps to a small subclade of C.5.3 which happens to have this same substitution. As a result, many other descendant nodes of that node (which don't have any or many additional substitutions) also map to C.5.3.

The public 6-month H1N1pdm HA tree (shown below) doesn't suffer from this issue, since Augur assigns labels to the largest matching clade with the defining substitutions.

The root issue here is that our NA subclades aren't defined by unique substitutions. Still, this incorrect assignment of NA subclades highlights a major limitation of the Nextclade-based clade assignment approach.
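
As a rough way to check for substitutions like NA:I264T that arise on more than one branch, the following sketch counts how many branches carry each substitution. The input structure (branch name mapped to a list of "GENE:mutation" strings) and the node names are hypothetical. In practice, the substitutions per branch would need to be collected from the workflow's ancestral reconstruction outputs.

```python
from collections import Counter


def find_repeated_substitutions(substitutions_by_branch):
    """Return substitutions that occur on more than one branch, with counts."""
    counts = Counter(
        substitution
        for substitutions in substitutions_by_branch.values()
        # Deduplicate within a branch so each branch counts at most once.
        for substitution in set(substitutions)
    )
    return {substitution: n for substitution, n in counts.items() if n > 1}


# Toy example with hypothetical node names and a second, made-up substitution.
substitutions_by_branch = {
    "NODE_1": ["NA:I264T"],
    "NODE_7": ["NA:I264T", "NA:S200N"],
    "NODE_9": [],
}
print(find_repeated_substitutions(substitutions_by_branch))
```

A clade-defining substitution that shows up in this result is a candidate for the kind of ambiguous assignment described above.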

huddlej added 4 commits October 2, 2025 14:20

Drop subclade coloring per tip from Nextclade

0448d58

jameshadfield reviewed Oct 13, 2025


WIP: Label polyphyletic branches

1a1df0f

joverlee521 mentioned this pull request Dec 24, 2025

Enforce hierarchy in clade assignments nextstrain/public#34

Open
