WIP, ENH: add pairwise feature scatterplots by adamwitmer · Pull Request #31 · lanl/ldrd_neat_ml

adamwitmer · 2026-04-02T02:53:08Z

This PR begins to add infrastructure for generating pairwise scatterplots of image features, in response to the reviewer request for manuscript revisions tracked in #26:

> For each feature, please plot their distributions in the SI, partitioned into each class (two phase vs one phase) to see if there are meaningful patterns we can spot with our eyes. I imagine that a simple rule based on the number and size of bubbles could suffice instead of the machine learning model. This will help us see which features tend to separate the class more. Or you can make a pairplot in seaborn where it shows the relationship between the features and color/shape the points according to class. I just think more insights into your data are needed. The feature importance study is helpful. I was going to suggest that but see you have it.

The plotting function takes the aggregated feature dataframe as input, along with an optional user-supplied list of features to plot. If the user provides no features, the function selects the 3 feature pairs with the greatest separation, measured by the distance between the centroids of their normalized feature distributions. It then draws scatter plots of the raw data for those feature pairs together with their convex hulls, and counts the points that fall in the overlapping region of the convex hulls.
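A minimal sketch of how the pair-ranking step could work, not the code in this PR. It assumes two classes, min-max normalization, and that the centroids being compared are the per-class centroids within each 2D feature pair; the name `rank_feature_pairs` and its parameters are hypothetical.

```python
from itertools import combinations

import numpy as np


def rank_feature_pairs(X, y, names, top_k=3):
    """Rank feature pairs by the distance between the two class centroids.

    Hypothetical helper: X is (n_samples, n_features), y holds two class
    labels, names lists the feature names in column order.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    if len(labels) != 2:
        raise ValueError("this sketch assumes exactly two classes")

    # Assumption: min-max normalize each feature so that distances are
    # comparable across features with different units.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # constant features end up with zero separation
    Xn = (X - X.min(axis=0)) / span

    # Per-feature difference between the two class centroids.
    diff = Xn[y == labels[0]].mean(axis=0) - Xn[y == labels[1]].mean(axis=0)

    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        # Euclidean distance between the centroids in the (i, j) plane.
        dist = float(np.hypot(diff[i], diff[j]))
        scores.append((names[i], names[j], dist))

    # Largest centroid distance first.
    scores.sort(key=lambda t: t[2], reverse=True)
    return scores[:top_k]
```

Because the distance is computed on normalized values while the scatter plots use raw data, the ranking is not affected by feature units.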

For full disclosure: I used an LLM to help me write the functions for determining the top feature pairs by calculating the distance between their centroids, as well as for plotting the convex hull and finding the points in the overlapping regions.
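For reference, here is a rough sketch of the convex hull overlap count, written with NumPy only; it is not the code in this PR, which may well rely on a library such as `scipy.spatial` instead. The function names `convex_hull`, `in_hull`, and `count_overlap` are hypothetical. A point lies in the overlapping region exactly when it is inside both hulls, so it is enough to count the points of each class that fall inside the other class's hull.

```python
import numpy as np


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(map(tuple, np.asarray(points, dtype=float))))
    if len(pts) < 3:
        raise ValueError("need at least 3 distinct points")

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return np.array(lower[:-1] + upper[:-1])


def in_hull(points, hull):
    """Boolean mask: True where a point is inside or on a CCW hull."""
    points = np.asarray(points, dtype=float)
    start = hull
    edge = np.roll(hull, -1, axis=0) - start
    rel = points[:, None, :] - start[None, :, :]
    # For a counter-clockwise polygon, interior points are to the left
    # of every edge, so every cross product is non-negative.
    cross = edge[None, :, 0] * rel[:, :, 1] - edge[None, :, 1] * rel[:, :, 0]
    return np.all(cross >= -1e-12, axis=1)


def count_overlap(class_a, class_b):
    """Return (# of A points inside hull of B, # of B points inside hull of A)."""
    hull_a, hull_b = convex_hull(class_a), convex_hull(class_b)
    a_in_b = int(in_hull(class_a, hull_b).sum())
    b_in_a = int(in_hull(class_b, hull_a).sum())
    return a_in_b, b_in_a
```

This sketch treats points on a hull boundary as inside and does not handle collinear classes, where the hull degenerates to a segment.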

This is a WIP branch with some TODO items as follows:

- decide which features would be best represented by the pairplot for the manuscript, i.e. raw features from the analysis step and/or "aggregate" features from the training step (my intuition is to use the raw features, as implied in the reviewer's comment, but plotting all of them is probably not the best option because there are >30 individual features).
- fix the handling of pass-through arguments and data structures for the scatterplot generation in relation to existing infrastructure.

tylerjereddy · 2026-04-04T01:47:04Z

> "aggregate" features from the training step

I think using the aggregate features was discussed as preferable in person, since that is what the ML models use to train on and predict phase status.

I believe the in-person discussion resulted in the suggestion of selecting a very small number of top features from FIC, showing their pair plots only, and trying to demonstrate that there is still overlap between 1-phase and 2-phase points in their plotting/variable space. One visual suggestion for doing that was to plot the convex hull of one of the phases and show that at least a few points from the other phase fall inside that hull.

I'm not going to read the diff here until it is presented for review and taken out of WIP. This is quite last minute indeed, and I probably can no longer guarantee a review before submission of comments.

adamwitmer · 2026-04-16T23:07:45Z

I pushed (two) commits here to update the functionality that was used for generating the manuscript figures. I also added an admonition to the PR description that I used an LLM to help me write the functions for calculating the distance between the centroids of feature pairs, plotting the convex hulls of the feature distributions, and determining the points that lie within the overlapping regions.

I still need to perform a self-review of this PR before presenting it for formal review, which includes addressing the remaining TODO item in the description, as well as:

- checking for improvements in overall logic
- modifying the plotting test to check that the calculation of overlapping points is correct
- re-running the workflow on glycan to make sure that I can generate the figures used in the manuscript
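One possible way to check the overlap calculation in a test, sketched under my own assumptions rather than taken from the existing test suite: build a small fixture whose overlap count is known by construction, and compare against an independent brute-force oracle. The oracle uses the fact that, in 2D, a point is in the convex hull of a set exactly when some triangle of three points from that set contains it. The names `brute_force_in_hull` and `oracle_overlap` are hypothetical.

```python
from itertools import combinations


def _cross(o, u, v):
    return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])


def _in_triangle(p, a, b, c):
    """True if p is inside or on the non-degenerate triangle (a, b, c)."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # p is inside when it is on the same side of all three edges.
    return not (has_neg and has_pos)


def brute_force_in_hull(p, pts):
    """Slow but simple hull membership test, meant only as a test oracle."""
    return any(
        _in_triangle(p, a, b, c)
        for a, b, c in combinations(pts, 3)
        if _cross(a, b, c) != 0  # skip collinear triples
    )


def oracle_overlap(class_a, class_b):
    """Return (# of A points in hull of B, # of B points in hull of A)."""
    a_in_b = sum(brute_force_in_hull(p, class_b) for p in class_a)
    b_in_a = sum(brute_force_in_hull(p, class_a) for p in class_b)
    return a_in_b, b_in_a


# Fixture: two overlapping squares, [0, 2]^2 and [1, 3]^2, each with one
# extra interior point. By inspection, (2, 2) and (1.5, 1.5) from the
# first class sit inside the second square, and (1, 1) from the second
# class sits inside the first square.
ONE_PHASE = [(0, 0), (2, 0), (2, 2), (0, 2), (1.5, 1.5)]
TWO_PHASE = [(1, 1), (3, 1), (3, 3), (1, 3), (2.5, 2.5)]
EXPECTED = (2, 1)
```

A test could then assert that the count reported by the plotting function matches both `EXPECTED` and `oracle_overlap(ONE_PHASE, TWO_PHASE)`.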

Adam John Witmer added 3 commits April 1, 2026 20:45

ENH, TST, MAINT: add seaborn pairplot

f97c270

CI: temporarily point CI towards nidhin_train_inference_rebase

9bb9117

TST: add seaborn baseline image asset

385ece3

tylerjereddy added the enhancement label Apr 4, 2026

ajwitmer added 2 commits April 16, 2026 15:53

ENH, TST, MAINT: generate pairwise scatterplots

e2b1774

TST, MAINT: PR #31 revisions

78c0e09

* modify feature description for accuracy
* "cook" agg_df feature dataset to standardize plot outputs

adamwitmer changed the title ~~WIP, ENH: add seaborn pairplot~~ WIP, ENH: add pairwise feature scatterplots Apr 16, 2026

adamwitmer mentioned this pull request Apr 16, 2026

ENH: tracking changes to ML manuscript (la-2026-00306e) #26

Open

9 tasks

adamwitmer wants to merge 5 commits into nidhin_train_inference_rebase from awitmer_seaborn
