WIP, ENH: add pairwise feature scatterplots#31
adamwitmer wants to merge 5 commits into nidhin_train_inference_rebase from
I think we agreed in person that using the aggregate features is preferable, since those are what the ML models use to train and predict phase status. I believe that discussion also produced the suggestion to select a very small number of top features from FIC, show only their pair plots, and try to demonstrate that the 1-phase and 2-phase points still overlap in that plotting/variable space. One visual suggestion was to plot the convex hull of one phase and show that at least a few points from the other phase fall inside that hull. I'm not going to read the diff here until it is taken out of WIP and presented for review. This is quite last minute, and I probably can no longer guarantee a review before submission of comments.
* modify feature description for accuracy
* "cook" agg_df feature dataset to standardize plot outputs
I pushed two commits here to update the functionality that was used to generate the manuscript figures. I also added an admonition to the PR description noting that I used an LLM to help me write the functions that calculate the distance between the centroids of feature pairs, plot the convex hulls of the feature distributions, and determine which points lie within the overlapping regions. I still need to self-review this PR before presenting it for formal review; that includes addressing the remaining TODO item in the description, as well as:
This PR begins to add infrastructure for generating pairwise scatterplots of image features in relation to the reviewer request for manuscript revisions at #26:
The plotting function takes the aggregated feature dataframe as input, plus an optional user-supplied list of features to plot. If the user provides no features, the function selects the 3 feature pairs with the greatest separation, measured by the distance between the centroids of their normalized feature distributions. It then draws scatterplots of the raw data for those top feature pairs together with their convex hulls, and counts the points that fall in the overlapping region of the hulls.
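To make the pair-selection step concrete, here is a minimal sketch of one way to rank feature pairs by centroid distance. It is not the code in this PR: the function name `top_feature_pairs`, the `label_col` parameter, the use of min-max normalization, and the reading of "centroids" as one centroid per phase label are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def top_feature_pairs(df, label_col, n_pairs=3):
    """Rank feature pairs by the distance between the two class centroids.

    Hypothetical helper: assumes `df` holds numeric feature columns plus one
    label column with exactly two classes (e.g. 1-phase and 2-phase).
    """
    features = [c for c in df.columns if c != label_col]
    # Min-max normalize (an assumption) so no feature dominates by scale;
    # constant columns get a span of 1 to avoid division by zero.
    span = (df[features].max() - df[features].min()).replace(0, 1)
    normed = (df[features] - df[features].min()) / span
    # One centroid (row of per-feature means) per class label.
    centroids = normed.groupby(df[label_col]).mean()
    if len(centroids) != 2:
        raise ValueError("expected exactly two classes")
    diff = centroids.iloc[0] - centroids.iloc[1]
    # Euclidean distance between the centroids in each 2-D feature plane.
    scored = [
        (float(np.hypot(diff[a], diff[b])), a, b)
        for a, b in combinations(features, 2)
    ]
    scored.sort(reverse=True)
    return [(a, b) for _, a, b in scored[:n_pairs]]
```

Called on a dataframe with a phase label column, this returns the `n_pairs` column-name pairs whose class centroids sit farthest apart after normalization. Centroid distance ignores the spread of each class, so a high-ranking pair can still show substantial overlap.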
For full disclosure: I used an LLM to help me write the functions that determine the top feature pairs by calculating the distance between their centroids, plot the convex hulls, and find the points in the overlapping regions.
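For readers unfamiliar with the hull-overlap idea, the following is a small dependency-free sketch of how such a count could work. It is an illustration, not the implementation in this branch: the helper names (`cross`, `convex_hull`, `in_hull`, `count_overlap`) are hypothetical, and it uses Andrew's monotone chain algorithm, which may differ from whatever hull routine the PR actually calls.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def in_hull(point, hull):
    """True if point lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate hull: treated as empty in this sketch
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], point) >= 0
        for i in range(len(hull))
    )


def count_overlap(points_a, points_b):
    """Count points from either class that lie in both classes' hulls."""
    hull_a, hull_b = convex_hull(points_a), convex_hull(points_b)
    # A point is always inside its own class hull, so only the other is tested.
    a_in_b = sum(in_hull(p, hull_b) for p in points_a)
    b_in_a = sum(in_hull(p, hull_a) for p in points_b)
    return a_in_b + b_in_a
```

Here `points_a` and `points_b` would be the (x, y) values of one feature pair for each phase. A nonzero count means at least some points of one phase fall inside the other phase's hull, which is the overlap the pair plots are meant to show.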
This is a WIP branch with some TODO items as follows:
analysis step and/or "aggregate" features from the training step (my intuition is to use the raw features, as the reviewer's comment implies, but plotting all the features is probably not the best option because there are >30 individual features.)