
WIP, ENH: Add scripts to analyze BubbleSAM and OpenCV detection results #25

Open
adamwitmer wants to merge 18 commits into nidhin_bubblesam_detection_backup from nidhin_data_analysis_rebase

Conversation

@adamwitmer
Collaborator

This branch has been rebased against PR #3 to show only changes related to #4. It is a WIP branch intended to be presented for review and updated after merging PR #3.

Nidhin Thomas and others added 16 commits February 28, 2026 14:07
Add preprocessing.py for enhancing input images
Add detection.py for OpenCV blob detection
Add input data for testing OpenCV detection
Add scripts for testing OpenCV detection
Add raw_processed.png to baseline for test
Add raw_detection.png to baseline for test
Modified pyproject.toml to pull .tiff and .png
Removed fixtures from test scripts
Use importlib.resources to retrieve images
Reduced the tolerances for compare_images()
Added ruff.toml to enforce no mutable default args
…rocessed.png, neat_ml/tests/data/images/raw.tiff, neat_ml/tests/data/images_Processed/raw.tiff: convert to Git LFS
Added bubblesam.py and SAM.py into neat_ml/bubblesam
Added stage_bubblesam() into run_workflow.py
Added test_bubblesam.py and associated baseline images
Modified .gitlab-ci.yml, mypy.ini and pyproject.toml to incorporate sam2 module
Updated README.md with sam-2 installation instructions

Added bubblesam_detection_test.yaml for functional test
Created neat_ml/workflow and added lib_workflow.py
Added test_workflow.py to test lib_workflow.py
Added neat_ml/analysis module for bubble analysis
Added test_analysis.py for testing analysis module
Updated README.md with instructions to run OpenCV and BubbleSAM workflow with detection and analysis
Added .github/workflows/ci.yml
Added LANL copyright assertion ID
* fix typing
* remove unnecessary monkeypatching
* fix/update tests
* purge unnecessary test/functions
* fix function logic
* add in-line comments
* fix docstrings
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic
* parameterize test for error outputs
* add networkx to requirements.txt
* remove unused test assets
* fix README
* update pyproject.toml
@tylerjereddy
Collaborator

You should probably temporarily update the CI config so that it runs against the target branch.

@tylerjereddy
Collaborator

That said, this probably won't change a whole lot. I can generally provide about one detailed review per week in any case, so presenting everything for review a few months past due, and all at once, probably isn't sustainable.

Maybe the team will give you a bit more time, though considerable grace has already been provided. In any case, I'll do what I can to get you timely feedback.

Comment thread .github/workflows/ci.yml
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
# calculate the graph using the KDTree to find the
# closest points within radius set by `param`
# https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
if method == "radius" and isinstance(param, (int, float)) and param > 0:
Collaborator Author
potential area of improvement here and below if we vectorize these operations vs. looping.
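As a hedged illustration of the idea (a standalone sketch, not the PR's code; the radius and point count here are made up), `KDTree.query_pairs` can return an index array directly, so the per-pair distances collapse into a single vectorized norm call:

```python
# Illustrative sketch: vectorizing radius-graph distances with query_pairs.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(1)
points = rng.random((100, 2))
tree = KDTree(points)

# output_type="ndarray" yields an (M, 2) array of index pairs within r
pairs = tree.query_pairs(r=0.1, output_type="ndarray")

# One vectorized norm call instead of a Python loop over pairs
dists = np.linalg.norm(points[pairs[:, 0]] - points[pairs[:, 1]], axis=1)

# Looped equivalent, for comparison
looped = np.array([np.linalg.norm(points[i] - points[j]) for i, j in pairs])
assert np.allclose(dists, looped)
```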

@adamwitmer Mar 3, 2026
Collaborator Author

I used line_profiler to measure the performance improvement from vectorizing the graph metric operations in calculate_graph_metrics with the following diff (the associated test, test_calculate_graph_metrics, still passes with these changes):

Diff
diff --git a/neat_ml/analysis/data_analysis.py b/neat_ml/analysis/data_analysis.py
index b295195..da98418 100644
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -326,31 +326,82 @@ def calculate_graph_metrics(
         # https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
         if method == "radius" and isinstance(param, (int, float)) and param > 0:
             tree = KDTree(points)
-            for i, j in tree.query_pairs(r=param):
-                dist = np.linalg.norm(points[i] - points[j])
-                graph.add_edge(i, j, distance=dist)
+            # 1. Get the pairs as a set and convert to a (M, 2) numpy array
+            pairs = np.array(list(tree.query_pairs(r=param)))
+
+            if pairs.size > 0:
+                # 2. Extract coordinates for all 'i' and 'j' indices at once
+                p_i = points[pairs[:, 0]]
+                p_j = points[pairs[:, 1]]
+
+                # 3. Vectorized distance calculation
+                dists = np.linalg.norm(p_i - p_j, axis=1)
+
+                # 4. Bulk add to graph
+                weighted_edges = [
+                    (int(pairs[idx, 0]), int(pairs[idx, 1]), {"distance": float(dists[idx])})
+                    for idx in range(len(pairs))
+                ]
+                graph.add_edges_from(weighted_edges)
         # alternatively calculate the graph using the KDTree
         # to find the k-nearest neighbors of the points as determined
         # by the input `param`
         elif method == "knn" and isinstance(param, int) and param > 0:
             k = min(param, n_nodes - 1)
             tree = KDTree(points)
+            # dists and idxs are shape (n_nodes, k + 1)
             dists, idxs = tree.query(points, k=k + 1)
-            for i in range(n_nodes):
-                for nn_idx in range(1, k + 1):
-                    j = idxs[i, nn_idx]
-                    if not graph.has_edge(i, j):
-                        dist = dists[i, nn_idx]
-                        graph.add_edge(i, j, distance=dist)
+
+            # 1. Slice off the first column (the node itself, distance 0)
+            neighbor_indices = idxs[:, 1:]
+            neighbor_dists = dists[:, 1:]
+
+            # 2. Create the source indices (0,0,0, 1,1,1, etc.) to match neighbor_indices
+            source_indices = np.repeat(np.arange(n_nodes), k)
+
+            # 3. Flatten both into 1D arrays
+            u = source_indices
+            v = neighbor_indices.flatten()
+            d = neighbor_dists.flatten()
+
+            # 4. Construct edge tuples: (source, target, {'distance': value})
+            # We use a list comprehension here as it's the fastest way to format for NetworkX
+            weighted_edges = [
+                (int(u[i]), int(v[i]), {"distance": float(d[i])}) 
+                for i in range(len(u))
+            ]
+
+            # 5. Add all edges at once
+            graph.add_edges_from(weighted_edges)
         # alternatively calculate the graph using Delaunay triangulation
         # https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.Delaunay.html
         elif method == "delaunay" and n_nodes >= 3:
             tri = Delaunay(points)
-            for simplex in tri.simplices:
-                for i, j in zip(simplex, np.roll(simplex, -1)):
-                    if not graph.has_edge(i, j):
-                        dist = np.linalg.norm(points[i] - points[j])
-                        graph.add_edge(i, j, distance=dist)
+
+            # 1. Get all edges from all simplices
+            # tri.simplices has shape (N, 3). We extract pairs (0,1), (1,2), and (2,0)
+            edges = np.concatenate([
+                tri.simplices[:, [0, 1]],
+                tri.simplices[:, [1, 2]],
+                tri.simplices[:, [2, 0]]
+            ])
+
+            # 2. Sort edges to ensure (i, j) is the same as (j, i) and remove duplicates
+            edges = np.sort(edges, axis=1)
+            unique_edges = np.unique(edges, axis=0)
+
+            # 3. Vectorized distance calculation
+            p1 = points[unique_edges[:, 0]]
+            p2 = points[unique_edges[:, 1]]
+            distances = np.linalg.norm(p1 - p2, axis=1)
+
+            # 4. Add all edges to the graph at once
+            # This is much faster than calling add_edge in a loop
+            weighted_edges = [
+                (int(u), int(v), {"distance": float(d)})
+                for (u, v), d in zip(unique_edges, distances)
+            ]
+            graph.add_edges_from(weighted_edges)
     except Exception as exc:
         warnings.warn(f"Graph construction ({method}) failed: {exc}")
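As a side note on the bulk-insertion pattern the diff relies on: networkx's `add_edges_from` accepts `(u, v, attr_dict)` tuples, so a single call can replace a per-edge `add_edge` loop. A minimal sketch with toy edges (assuming networkx is installed):

```python
# Illustrative sketch of bulk edge insertion with per-edge attributes.
import networkx as nx

graph = nx.Graph()
weighted_edges = [
    (0, 1, {"distance": 1.0}),
    (1, 2, {"distance": 2.5}),
]
# One call adds all edges with their attribute dicts
graph.add_edges_from(weighted_edges)

assert graph[0][1]["distance"] == 1.0
assert graph.number_of_edges() == 2
```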

I ran the workflow on this branch with the following input yaml file using each of the three methods (i.e. radius, knn and delaunay) with and without the changes in the diff.

input.yaml
roots:
  work: test
  results: test
  model: test

inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib

datasets:

  - id: PEO8K_Sodium_Citrate_Ph_2nd
    method: BubbleSAM
    role: infer
    class: Ph
    time_label: 2nd
    composition_cols:
      - "Sodium Citrate (wt%)"
      - "PEO 8 kg/mol (wt%)"
    analysis:
      input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
      composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
      per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary.csv
      aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate.csv
      group_cols:
        - Group
        - Label
        - Time
        - Class
      graph_method: radius
      graph_param: 30

There were no significant performance improvements for the radius (11.74 vs. 12.87) and knn (13.76 vs. 13.79) methods (new vs. old, respectively); however, performance did improve for the delaunay method (31.90 vs. 60.10), with the main bottleneck of the old method being the following loop:

   352   2495964        2e+10   7996.4     33.2                  for i, j in zip(simplex, np.roll(simplex, -1)):

which may warrant adopting the vectorized change(s) for the delaunay method.
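To sanity-check that the vectorized delaunay edge extraction matches the looped version, a standalone sketch (random points, illustrative only, not the PR's code) can compare the two resulting edge sets:

```python
# Compare looped vs. vectorized extraction of unique Delaunay edges.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.random((200, 2))
tri = Delaunay(points)

# Looped version: walk each simplex and deduplicate via a set
looped = set()
for simplex in tri.simplices:
    for i, j in zip(simplex, np.roll(simplex, -1)):
        looped.add(tuple(sorted((int(i), int(j)))))

# Vectorized version: stack the three edge slices, sort each pair, dedupe
edges = np.concatenate([
    tri.simplices[:, [0, 1]],
    tri.simplices[:, [1, 2]],
    tri.simplices[:, [2, 0]],
])
edges = np.sort(edges, axis=1)
unique_edges = np.unique(edges, axis=0)

# Both approaches yield the same undirected edge set
assert looped == {tuple(e) for e in unique_edges.tolist()}
```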

Collaborator

I should mention: please disclose usage of AI tools/LLMs whenever you accept suggestions from them. We generally have to be cautious because LLM-based code may have licensing issues. I do expect you to disclose what you use LLMs for and to ensure that the provenance of the code you place in the repo is license-compatible (which is very hard to ensure when pasting from an LLM).

The expectation is also that you understand what each line of code does.

Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_workflow.py Outdated
Comment thread run_workflow.py Outdated
Comment thread neat_ml/workflow/lib_workflow.py
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing

3 participants