Conversation
- Add preprocessing.py for enhancing input images
- Add detection.py for OpenCV blob detection
- Add input data for testing OpenCV detection
- Add scripts for testing OpenCV detection
- Add raw_processed.png to baseline for test
- Add raw_detection.png to baseline for test
- Modified pyproject.toml to pull .tiff and .png
- Removed fixtures from test scripts
- Use importlib.resources to retrieve images
- Reduced the tolerances for compare_images()
- Added ruff.toml to enforce no mutable default args
- Added .gitattributes with image files
- …rocessed.png, neat_ml/tests/data/images/raw.tiff, neat_ml/tests/data/images_Processed/raw.tiff: convert to Git LFS
- Modified preprocessing.py to remove mutable defaults
- Updated .gitattributes to define image paths
- Added lib_workflow.py to neat_ml/workflow
- Added test_workflow.py to test lib_workflow.py
- Modified run_workflow.py to import lib_workflow
- Modified test yaml file to remove results
- Removed unnecessary git lfs installation commands from .gitlab-ci.yml
- Edited run_workflow.py and test_workflow.py to fix mypy errors
- Modified the test scripts to include pytest.raises(match..)
- Removed extraneous code in detection.py
- Added test method for checking a warning message
- Added bubblesam.py and SAM.py into neat_ml/bubblesam
- Added stage_bubblesam() into run_workflow.py
- Added test_bubblesam.py and associated baseline images
- Modified .gitlab-ci.yml, mypy.ini and pyproject.toml to incorporate sam2 module
- Updated README.md with sam-2 installation instructions
- Added bubblesam_detection_test.yaml for functional test
- Created neat_ml/workflow and added lib_workflow.py
- Added test_workflow.py to test lib_workflow.py
- Added git config --global --add safe.directory to fix the dubious ownership error
- Added neat_ml/analysis module for bubble analysis
- Added test_analysis.py for testing analysis module
- Updated README.md with instructions to run OpenCV and BubbleSAM workflow with detection and analysis
- Added neat_ml/model for ML
- Added neat_ml/utils for plotting
- Added neat_ml/phase_diagram for plotting phase diagram
- Added test scripts and baseline images
- Modified run_workflow.py for training, inference and feature importance
- Updated the test_workflow.py to incorporate new tests
- Updated README.md with command-line example with train, infer, plot
- Removed extraneous code from test scripts
- Modified baseline images to ensure the tolerance to be within 1e-4
- Added ci.yml
- Added LANL copyright assertion ID
```
# Or broader by type (covers future files anywhere)
*.tif filter=lfs diff=lfs merge=lfs -text
*.tiff filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
```
As noted in gh-8, we can't really use LFS with LANL GitHub org.
Indeed, if you try to check this branch out locally you'll get an error similar to:
Error downloading object: neat_ml/tests/baseline/circles_filtered_contours.png (1d6b164): Smudge error: Error downloading neat_ml/tests/baseline/circles_filtered_contours.png (1d6b164340ac27d913e2552b3109cb620f5befec2816b50183e51b1816f8dcda): [1d6b164340ac27d913e2552b3109cb620f5befec2816b50183e51b1816f8dcda] Object does not exist on the server: [404] Object does not exist on the server
As in the other DR project, you can work around it to some degree with export GIT_LFS_SKIP_SMUDGE=1.
But if I check the branch out that way locally, I then end up with other issues: e.g., `python -m pytest -n 4` results in 6 failing and 5 erroring tests.
Details
FAILED neat_ml/tests/test_detection.py::test_visual_regression_debug_overlay - FileNotFoundError: Unable to read image file: /Users/treddy/github_projects/ldrd_n...
FAILED neat_ml/tests/test_feature_importance.py::test_plot_feat_import_consensus_image - PIL.UnidentifiedImageError: cannot identify image file '/Users/treddy/github_proje...
FAILED neat_ml/tests/test_preprocessing.py::test_process_directory_single_image - cv2.error: OpenCV(4.10.0) /Users/xperience/GHA-Actions-OpenCV/_work/opencv-python/...
FAILED neat_ml/tests/test_phase_diagram.py::test_construct_phase_diagram_image - PIL.UnidentifiedImageError: cannot identify image file '/Users/treddy/github_proje...
FAILED neat_ml/tests/test_train.py::test_plot_roc - PIL.UnidentifiedImageError: cannot identify image file '/Users/treddy/github_proje...
FAILED neat_ml/tests/test_feature_importance.py::test_compare_methods_end_to_end - AssertionError:
ERROR neat_ml/tests/test_bubblesam.py
ERROR neat_ml/tests/test_workflow.py
ERROR neat_ml/tests/test_analysis.py::test_voronoi_qhull_error
ERROR neat_ml/tests/test_analysis.py::test_calculate_nnd_stats_warns_on_no_finite_distances
ERROR neat_ml/tests/test_analysis.py::test_calculate_graph_metrics_warns_on_exception
That seems like something @adamwitmer could help with--finding a sustainable solution to binary asset storage here so I/others can run the tests easily on this branch, etc.
```yaml
push:
  branches: [ main ]
pull_request: # run for all PRs (parity with GitLab MR rule)
  types: [opened, synchronize, reopened, ready_for_review]
```
Purge this out--seems unnecessary. In general, this CI config file has far too much duplication/complexity, and merge conflicts need to be resolved against the more concise CI config I added in gh-9.
```
setup.cfg
**/requirements*.txt
neat_ml/sam2/pyproject.toml
neat_ml/sam2/setup.cfg
```
let's not use caching for now--it is just extra complexity we don't need--this is a very small project that probably won't see much GitHub activity once published.
```shell
set -euxo pipefail
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git config --global --add safe.directory "$GITHUB_WORKSPACE/neat_ml/sam2"
git submodule update --init --recursive
```
this all seems superfluous--we are already using submodules: recursive in the checkout action above, so why is this repeated here?
```shell
openpyxl jinja2 python-ternary pytest shap interpret \
types-tqdm types-Pillow scikit-image lightgbm lime \
opencv-python-headless==4.10.0.84 \
pyyaml types-pyyaml torch
```
this is verbose and gets repeated below--I think we should just list the dependencies in pyproject.toml and use the standard approach of installing them with pip from that file...
```yaml
- name: Run end-to-end script
  run: |
    set -euxo pipefail
    python main.py
```
Let's just delete the functional test job above entirely--it hasn't passed for ages and just times out at about an hour internally on LISDI GitLab. I also don't think it tests much of the "new stuff" for the paper anyway. Using pytest tests is also generally preferable of course.
```
setup.cfg
**/requirements*.txt
neat_ml/sam2/pyproject.toml
neat_ml/sam2/setup.cfg
```
get rid of caching for this small team/project
```shell
set -euxo pipefail
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git config --global --add safe.directory "$GITHUB_WORKSPACE/neat_ml/sam2"
git submodule update --init --recursive
```
Again, submodules: recursive should be handled above. Way too much repetition in this file. Using a matrix approach like in gh-9 probably makes sense when resolving merge conflicts.
```shell
openpyxl jinja2 python-ternary pytest shap interpret \
types-tqdm types-Pillow scikit-image lightgbm lime \
opencv-python-headless==4.10.0.84 \
pyyaml types-pyyaml torch
```
move these long repeated lists of deps to the appropriate parts of pyproject.toml
```yaml
- git config --global --add safe.directory "$CI_PROJECT_DIR/neat_ml/sam2"
- python -m pip install -U mypy ruff pandas-stubs "numpy>=1.26.3,<=2.1.3" matplotlib xgboost pandas scikit-learn openpyxl jinja2 python-ternary pytest shap interpret types-tqdm types-Pillow scikit-image lightgbm lime opencv-python-headless==4.10.0.84 pyyaml types-pyyaml pytest-mock torch
- git submodule update --init --recursive
- pip install -e ./neat_ml/sam2/
```
just delete this file--we're not going to continue to use LISDI GitLab long-term (it has a subscription fee, etc.); the code should be open for others to see in any case
```shell
conda activate ldrd_neat_ml

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
```
This is different than the steps the CI is following; if we have an updated requirements.txt file we could use that in CI instead of having multiple lists of deps needing to be maintained in different locations.
```markdown
## Installation

Download the package from the following GitLab repository:
```
Let's just rewrite the instructions for the public GitHub--that's all anyone is going to care about at publication time since they cannot access our internal GitLab.
```python
# @@ -0,0 +1,144 @@
from pathlib import Path
from typing import Any, Dict, List, Optional
```
there shouldn't be any need to import Dict and List, can just use the builtin list and dict types now
```python
    if not graph.has_edge(i, j):
        dist = float(np.linalg.norm(points[i] - points[j]))
        graph.add_edge(int(i), int(j), distance=dist)
except Exception as exc:
```
Let's try to avoid capturing the generic Exception class for the usual reasons
```python
)
ax.add_patch(rect)
ax.axis("off")
plt.savefig(output_path, bbox_inches='tight')
```
Here and elsewhere--let's try to use fig, ax objects instead of the "global" plt approach, for the usual reasons.
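A minimal sketch of the object-oriented alternative (the `save_overlay` helper and its contents are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts/tests
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def save_overlay(output_path: str) -> None:
    """Sketch of the fig/ax API: no hidden global pyplot state."""
    fig, ax = plt.subplots()
    ax.add_patch(Rectangle((0.1, 0.1), 0.3, 0.2, fill=False, edgecolor="red"))
    ax.axis("off")
    fig.savefig(output_path, bbox_inches="tight")
    plt.close(fig)  # release the figure explicitly
```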
```python
print(f"[INFO] scale_pos_weight={spw:.3f} | train neg/pos={np.bincount(y_train)}")

param_names = list(_PARAM_GRID.keys())
param_values = list(_PARAM_GRID.values())
```
let's avoid using "module globals" like _PARAM_GRID, instead generally passing variables in through functions arguments to avoid spaghetti logic...
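One way to do that (a sketch, not the PR's code) is to pass the grid in as an argument and expand it locally:

```python
from itertools import product
from typing import Any

def iter_param_combos(param_grid: dict[str, list[Any]]):
    """Yield one dict per combination; the grid is passed in, not read from a module global."""
    names = list(param_grid)
    for values in product(*param_grid.values()):
        yield dict(zip(names, values))

combos = list(iter_param_combos({"max_depth": [3, 5], "lr": [0.1]}))
```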
```python
val_proba_best = val_proba

assert best_model is not None, "No model was fitted!"
assert val_proba_best is not None, "Validation probabilities were not calculated!"
```
avoid plain assert in production code, for the usual reasons
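For instance (names hypothetical), replacing the asserts with explicit exceptions that survive `python -O`:

```python
class TrainingError(RuntimeError):
    """Illustrative error type for fit failures."""

def check_fit_results(best_model, val_proba_best):
    """Explicit checks are not stripped under -O, unlike bare assert."""
    if best_model is None:
        raise TrainingError("No model was fitted!")
    if val_proba_best is None:
        raise TrainingError("Validation probabilities were not calculated!")
    return best_model, val_proba_best
```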
```python
plt.ylabel("True-Positive Rate")
plt.legend()
plt.tight_layout()
plt.savefig(out_png, dpi=300)
```
Let's avoid the "global"/MATLAB-like plt usage, instead favoring the object-oriented fig, ax approach..
```python
plt.tight_layout()
plt.savefig(out_png, dpi=300)
plt.close()
print(f"[INFO] ROC curve saved -> {out_png}")
```
logging not print() for production output
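A sketch of the standard pattern (the logger name is illustrative):

```python
import logging

# module-level logger; the name "neat_ml.train" is illustrative
logger = logging.getLogger("neat_ml.train")

def report_roc_saved(out_png: str) -> None:
    # lazy %-formatting defers string work until the record is actually emitted
    logger.info("ROC curve saved -> %s", out_png)

# the *application* entry point, not the library, opts in to output, e.g.:
# logging.basicConfig(level=logging.INFO)
```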
```python
auc = roc_auc_score(y_val, val_proba)
print(f"[GRID] params={params} | val ROC-AUC={auc:.4f}")

if auc > best_auc:
```
this manual checking shouldn't be necessary--sklearn has built-in grid searching functions/classes... let's not reinvent the wheel and create unnecessary reviewer burden with manually coded loops
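To illustrate, a minimal self-contained sketch of replacing the manual loop with `GridSearchCV` (the estimator, grid, and data here are illustrative, not the PR's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)

# GridSearchCV handles the loop, scoring, and best-model tracking for us
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5]},
    scoring="roc_auc",  # same metric the manual loop tracked
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_  # refit on all data by default
best_params = search.best_params_
```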
@adamwitmer since this PR hasn't been touched in months and I've mentioned this a few times in person now, I'll mention it in writing next -- please make sure that:
- hyperparameter optimization is justified as useful for OpenCV and SAM2 approaches (the model performs better after hyperopt vs. before)
- hyperopt is performed in a standard way rather than reinventing the wheel -- this comment is from October 2025; I was hoping it might be addressed a little faster than this, since a few iterations are likely needed and the manuscript revisions are due quite soon..
```python
_PARAM_GRID: Dict[str, List[int | float | None]] = {
    # Random-Forest
    "ensemble__rf__n_estimators": [500],
    "ensemble__rf__max_depth": [None],
```
There was a problem hiding this comment.
I don't get it--the Random Forest hyperparameter grid seems pointless--there is no work to be done here with a single entry for each of only two hyperparameters. I also already indicated that n_estimators shouldn't be in here for RandomForest since the trend of "larger is better" always holds true modulo noise in the tail of the distribution.
It feels like we're basically just "pretending" to do hyperparameter optimization, and I haven't seen any indication that it is helpful...
XGB below also probably has too few hyperparameters for it to matter (and brute force may take a while vs. Bayesian search approach...)
I'll revisit this after @adamwitmer cleans things up a bit and gets the tests/CI running/passing without LFS, so I can start doing more thorough reviews with the testsuite working locally.

@adamwitmer paper revisions are due in a few weeks max, and it could very well take me a few weeks to review this even if it were ready for review now, so it is definitely way past due to have this ready for review given the December 2025 deadline.
Summary
This MR introduces a compact ML toolkit for phase diagram construction, including model training, inference, explainability, and plotting:
What’s included
Design highlights
- `VotingClassifier` over `RandomForestClassifier` and `XGBClassifier` with soft voting.
- `preprocess()` (and again inside the pipeline for robustness when new features are missing at inference).
- `StandardScaler`.
- `scale_pos_weight = (#neg / #pos)` for XGBoost, computed on the training (and refit on train+val) target.
- `RANDOM_STATE=42` across learners.
- `joblib` bundle containing the fitted pipeline, feature list, validation metrics, and best parameters.
`neat_ml/model/train.py`

Public API
- `preprocess(df, target, exclude=None) -> (X, y)`
- `train_with_validation(X_train, y_train, X_val, y_val)`: final refit on train+val; returns `(final_model_pipeline, metrics_dict, best_params_dict, val_proba_best)`.
- `plot_roc(y_true, y_prob, out_png, label="Validation")`
- `save_model_bundle(model, features, metrics, best_params, path)`: writes `{ "model": Pipeline, "features": List[str], "metrics": Dict[str, float], "best_params": Dict[str, Any] }`.

Model pipeline
Hyperparameter grid (72 total)
- Random Forest: `n_estimators=[500]`, `max_depth=[None]`
- XGBoost: `n_estimators=[10,20,50,100,200,400]` × `learning_rate=[0.05,0.1]` × `max_depth=[3,4,5,8,10,20]`

`neat_ml/model/inference.py`

Public API
- `save_predictions(df, y_prob, out_csv)`: adds `Pred_Prob` (positive-class probability) and `Pred_Label` (`1` if `>=0.5`, else `0`).
- `run_inference(model_in, data_csv, target: Optional[str], exclude_cols: list[str], pred_csv)`: loads the `joblib` bundle, reads `data_csv`, runs `preprocess` (with optional `target` to enable label alignment, but no evaluation/ROC is performed here), adds any missing training features as all-NaN columns, reorders to match the saved feature list, and computes probabilities with the bundled pipeline; writes results via `save_predictions(...)`.

`neat_ml/model/feature_importance.py`

Public API
- `compare_methods(model, X, y, out_dir, top=20)`: writes `shap_summary.png` and returns mean |SHAP| importances; writes `ebm_importance.png` & `ebm_importance.csv`; writes `feature_importance_comparison.csv` and `feature_importance_comparison.png`, ranked by mean rank across methods; writes `feat_imp_consensus.png`.
- `feature_importance_consensus(pos_class_feat_imps, feature_names, top_feat_count)`: returns `(ranked_names, ranked_counts, num_models)`.
- `plot_feat_import_consensus(ranked_names, ranked_counts, num_models, top_feat_count, out_dir)`
- `get_k_best_scores(X, y, k, metrics)`: `SelectKBest` scores (raw vectors) for various scoring functions.

Dependencies
`shap`, `interpret` (glassbox), and `lime`. Currently imported at module import time (i.e., they behave as hard dependencies). See TODOs for lazy import to make them optional.

`neat_ml/utils/utils.py`

Public API & behavior
- `_axis_ranges(df1, df2, x_col, y_col, pad=2) -> ([0,x_max],[0,y_max])`: shared axis ranges across two frames (integer bounds).
- `GMMWrapper(gmm, x_comp, x_features)`: uses a `GaussianMixture` trained on 1D phase features to classify any (x,y) composition point by nearest-neighbor mapping back to the observed feature domain.
- `_standardise_labels(cluster_labels, x_comp)`: deterministic remap of raw GMM labels to the convention `0 = Two Phase` (triangle-up, aquamarine), `1 = Single Phase` (square, light-steel-blue).
- `extract_boundary_from_contour(z, xs, ys, level=0.5)`: from a grid `z`, returns the longest (x,y) contour at threshold `level` (e.g., the phase boundary).
- `plot_gmm_decision_regions(df, x_col, y_col, phase_col, ax, xrange, yrange, ..., plot_regions=True, ...) -> (gmm, std_labels, boundary_points)`: fits on `phase_col`, projects decisions onto the composition grid (`xrange`×`yrange`), optionally fills regions and draws the decision boundary; returns the model, standardized labels for observed points, and extracted boundary coordinates.
- `plot_gmm_composition_phase(df, x_col, y_col, phase_col, ax, point_cmap=None)`: fits on `phase_col`, scatters (x,y) points with standardized phase labels and a legend.

Styling helpers
- `_set_axis_style(ax, xrange, yrange)`: equal aspect, integer tick formatting, thicker spines/ticks.

Usage examples
1) Train -> validate -> bundle
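The code for this example did not survive the export. The following is a hypothetical, self-contained stand-in for the described flow (soft-voting ensemble, validation ROC-AUC, `joblib` bundle) built from plain scikit-learn pieces, since the `neat_ml` API is not importable here; `LogisticRegression` stands in for `XGBClassifier`:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# soft-voting ensemble, mirroring the "VotingClassifier with soft voting" design
model = make_pipeline(
    StandardScaler(),
    VotingClassifier(
        [("rf", RandomForestClassifier(random_state=42)),
         ("lr", LogisticRegression(max_iter=1000))],
        voting="soft",
    ),
)
model.fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]
metrics = {"val_roc_auc": roc_auc_score(y_val, val_proba)}

# bundle the fitted pipeline plus metadata in one joblib artifact, as described
joblib.dump({"model": model, "features": list(range(X.shape[1])),
             "metrics": metrics}, "model_bundle.joblib")
```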
2) Run inference on a new CSV
3) Explainability & consensus
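The original snippet is also missing here. Since `shap`/`interpret`/`lime` may not be installed, this sketch uses scikit-learn's `permutation_importance` as a stand-in for the multi-method ranking that `compare_methods` performs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)

# rank features by mean importance, as the consensus plot ranks across methods
ranking = np.argsort(result.importances_mean)[::-1]
top = [f"feat_{i}" for i in ranking[:3]]  # feature names are synthetic
```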
MR checklist
- `print(...)`