WIP, ENH: add pairwise feature scatterplots #31

Open
adamwitmer wants to merge 5 commits into nidhin_train_inference_rebase from awitmer_seaborn

adamwitmer (Collaborator) commented Apr 2, 2026

This PR begins adding infrastructure for generating pairwise scatterplots of image features, in response to the reviewer request for manuscript revisions in #26:

> For each feature, please plot their distributions in the SI, partitioned into each class (two phase vs one phase) to see if there are meaningful patterns we can spot with our eyes. I imagine that a simple rule based on the number and size of bubbles could suffice instead of the machine learning model. This will help us see which features tend to separate the class more. Or you can make a pairplot in seaborn where it shows the relationship between the features and color/shape the points according to class. I just think more insights into your data are needed. The feature importance study is helpful. I was going to suggest that but see you have it.

The plotting function takes as input the aggregated feature dataframe and an optional user-supplied list of features to plot. If the user provides no features, the function selects the 3 feature pairs with the greatest class separation, measured as the distance between the centroids of the normalized feature distributions. It then draws scatter plots of the raw data for those top feature pairs, overlays their convex hulls, and counts the points that fall inside the region where the convex hulls overlap.
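For illustration, the centroid-based ranking described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `rank_feature_pairs`, the `class_col` parameter, and the choice of min-max scaling as the normalization are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def rank_feature_pairs(df, class_col, features=None, top_n=3):
    """Rank feature pairs by the distance between the two class centroids.

    Hypothetical helper: features are min-max scaled first (an assumed
    normalization) so that no feature dominates because of its units.
    """
    if features is None:
        features = [c for c in df.columns if c != class_col]

    values = df[features].astype(float)
    # Avoid division by zero for constant columns.
    span = (values.max() - values.min()).replace(0, 1.0)
    normed = (values - values.min()) / span

    # One centroid (mean vector) per class, in normalized units.
    centroids = normed.groupby(df[class_col]).mean()
    if len(centroids) != 2:
        raise ValueError("expected exactly two classes")
    diff = centroids.iloc[0] - centroids.iloc[1]

    scores = []
    for f1, f2 in combinations(features, 2):
        # Euclidean distance between the class centroids in the (f1, f2) plane.
        scores.append((f1, f2, float(np.hypot(diff[f1], diff[f2]))))

    # Largest centroid distance first.
    scores.sort(key=lambda item: item[2], reverse=True)
    return scores[:top_n]
```

Calling `rank_feature_pairs(agg_df, "label")` on a dataframe with one class column and numeric feature columns would return up to three `(feature_1, feature_2, distance)` tuples; the column name `"label"` is likewise a placeholder. Note that centroid distance ignores the spread of each class, so a pair can rank highly and still show substantial overlap.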

For full disclosure: I used an LLM to help me write the functions for determining the top feature pairs by calculating the distance between their centroids, as well as for plotting the convex hulls and finding the points in the overlapping regions.

This is a WIP branch with some TODO items as follows:

  • decide which features would be best represented by the pairplot for the manuscript, i.e. raw features from the analysis step and/or "aggregate" features from the training step (my intuition is to use the raw features, as implied in the reviewer's comment, but plotting all of them is probably not the best option because there are >30 individual features).
  • fix the handling of pass-through arguments and data structures for the scatterplot generation so that it fits the existing infrastructure.

tylerjereddy added the enhancement label Apr 4, 2026
tylerjereddy (Collaborator) commented

> "aggregate" features from the training step

I think using the aggregate features was discussed in person as preferable, since that is what the ML models use to train and predict phase status.

I believe the in-person discussion resulted in the suggestion of selecting a very small number of top features from FIC, showing their pair plots only, and trying to demonstrate that there is still overlap between 1-phase and 2-phase points in their plotting/variable space. One visual suggestion for doing that was to plot the convex hull of one of the phases and show that at least a few points from the other phase fall inside that hull.
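As a rough illustration of that hull check, the sketch below counts how many points from one phase fall inside the convex hull of the other. It is only an assumed approach, not the code under review: the helper name `count_points_inside_hull` is hypothetical, and it uses `scipy.spatial.Delaunay.find_simplex`, which returns -1 for query points outside the triangulated hull.

```python
import numpy as np
from scipy.spatial import Delaunay


def count_points_inside_hull(hull_points, query_points):
    """Count query points lying inside the convex hull of hull_points.

    Hypothetical helper: hull_points would be the (n, 2) feature values of
    one phase, query_points the (m, 2) feature values of the other phase.
    """
    hull_points = np.asarray(hull_points, dtype=float)
    query_points = np.asarray(query_points, dtype=float)

    # The Delaunay triangulation tiles the convex hull of hull_points, so a
    # query point is inside the hull exactly when it lands in some triangle.
    triangulation = Delaunay(hull_points)
    inside = triangulation.find_simplex(query_points) >= 0
    return int(inside.sum()), inside
```

The returned boolean mask can be used to highlight the offending points on the scatter plot, which makes the overlap visible as well as countable.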

I'm not going to read the diff here until it is presented for review and taken out of WIP. This is quite last minute indeed, and I probably can no longer guarantee a review before submission of comments.

* modify feature description for accuracy
* "cook" agg_df feature dataset to standardize plot outputs
adamwitmer changed the title from "WIP, ENH: add seaborn pairplot" to "WIP, ENH: add pairwise feature scatterplots" Apr 16, 2026
adamwitmer (Collaborator, Author) commented Apr 16, 2026

I pushed two commits here to update the functionality that was used for generating the manuscript figures. I also added a note to the PR description disclosing that I used an LLM to help me write the functions for calculating the distance between the centroids of feature pairs, plotting the convex hulls of the feature distributions, and determining the points that lie within the overlapping regions.

I still need to self-review this PR before presenting it for formal review. That includes addressing the remaining TODO item in the description, as well as:

  • checking for improvements in overall logic
  • modifying the plotting test to check that the calculation of overlapping points is correct
  • re-running the workflow on glycan to make sure that I can generate the figures used in the manuscript
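One way the overlap-count test could be set up is with a hand-built geometry whose answer is known in advance. The sketch below is an assumption about how such a test might look, not the existing test in this branch: `in_hull` and `count_overlap_points` are hypothetical stand-ins for the PR's functions, and "overlapping points" is taken to mean points from either class that lie inside both convex hulls.

```python
import numpy as np
from scipy.spatial import ConvexHull


def in_hull(points, hull, tol=1e-12):
    """Boolean mask of points satisfying every facet inequality of hull."""
    # Each row of hull.equations is [normal..., offset]; points inside the
    # hull satisfy normal . x + offset <= 0 for every facet.
    normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
    return np.all(points @ normals.T + offsets <= tol, axis=1)


def count_overlap_points(class_a, class_b):
    """Count points from either class that lie inside both convex hulls."""
    class_a = np.asarray(class_a, dtype=float)
    class_b = np.asarray(class_b, dtype=float)
    hull_a, hull_b = ConvexHull(class_a), ConvexHull(class_b)
    all_points = np.vstack([class_a, class_b])
    both = in_hull(all_points, hull_a) & in_hull(all_points, hull_b)
    return int(both.sum())


def test_count_overlap_points_known_geometry():
    # Class A fills a wide rectangle [0, 4] x [1, 2]; class B fills a tall
    # rectangle [1.5, 2.5] x [0, 3]. The hulls cross in [1.5, 2.5] x [1, 2],
    # and no hull corner of either class lies inside the other hull.
    class_a = [[0, 1], [4, 1], [4, 2], [0, 2], [2, 1.5], [0.5, 1.5]]
    class_b = [[1.5, 0], [2.5, 0], [2.5, 3], [1.5, 3],
               [2, 1.2], [2, 2.5], [1.8, 1.7]]
    # Inside both hulls: (2, 1.5) from A, plus (2, 1.2) and (1.8, 1.7) from B.
    assert count_overlap_points(class_a, class_b) == 3
```

Keeping every hull corner outside the other hull avoids boundary cases, so the expected count does not depend on floating-point tolerance.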
