
WIP, ENH: Add scripts to analyze BubbleSAM and OpenCV detection results #25

Open
adamwitmer wants to merge 18 commits into nidhin_bubblesam_detection_backup from nidhin_data_analysis_rebase

Conversation

@adamwitmer
Collaborator

This branch has been rebased against PR #3 to show only changes related to #4. It is a WIP branch intended to be presented for review and updated after merging PR #3.

Nidhin Thomas and others added 16 commits February 28, 2026 14:07
Add preprocessing.py for enhancing input images
Add detection.py for OpenCV blob detection
Add input data for testing OpenCV detection
Add scripts for testing OpenCV detection
Add raw_processed.png to baseline for test
Add raw_detection.png to baseline for test
Modified pyproject.toml to pull .tiff and .png
Removed fixtures from test scripts
Use importlib.resources to retrieve images
Reduced the tolerances for compare_images()
Added ruff.toml to enforce no mutable default args
…rocessed.png, neat_ml/tests/data/images/raw.tiff, neat_ml/tests/data/images_Processed/raw.tiff: convert to Git LFS
Added bubblesam.py and SAM.py into neat_ml/bubblesam
Added stage_bubblesam() into run_workflow.py
Added test_bubblesam.py and associated baseline images
Modified .gitlab-ci.yml, mypy.ini and pyproject.toml to incorporate sam2 module
Updated README.md with sam-2 installation instructions

Added bubblesam_detection_test.yaml for functional test
Created neat_ml/workflow and added lib_workflow.py
Added test_workflow.py to test lib_workflow.py
Added neat_ml/analysis module for bubble analysis
Added test_analysis.py for testing analysis module
Updated README.md with instructions to run OpenCV and BubbleSAM workflow with detection and analysis
Added .github/workflows/ci.yml
Added LANL copyright assertion ID
* fix typing
* remove unnecessary monkeypatching
* fix/update tests
* purge unnecessary test/functions
* fix function logic
* add in-line comments
* fix docstrings
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic
* parameterize test for error outputs
* add networkx to requirements.txt
* remove unused test assets
* fix README
* update pyproject.toml
@tylerjereddy
Collaborator

You should probably temporarily update the CI config so that it runs against the target branch.

@tylerjereddy
Collaborator

That said, this probably won't change a whole lot. I can generally provide about one detailed review per week in any case, so presenting everything for review a few months past due, and all at once, probably isn't sustainable.

Maybe the team will give you a bit more time, though considerable grace has already been provided. In any case, I'll do what I can to get you timely feedback.

Comment thread .github/workflows/ci.yml
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py Outdated
# calculate the graph using the KDTree to find the
# closest points within radius set by `param`
# https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
if method == "radius" and isinstance(param, (int, float)) and param > 0:
Collaborator Author
potential area of improvement here and below if we vectorize these operations vs. looping.
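As a hedged illustration of the idea (a standalone sketch, not the PR's code; the radius and point count here are made up), `KDTree.query_pairs` can return an index array directly, so the per-pair distances collapse into a single vectorized norm call:

```python
# Illustrative sketch: vectorizing radius-graph distances with query_pairs.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(1)
points = rng.random((100, 2))
tree = KDTree(points)

# output_type="ndarray" yields an (M, 2) array of index pairs within r
pairs = tree.query_pairs(r=0.1, output_type="ndarray")

# One vectorized norm call instead of a Python loop over pairs
dists = np.linalg.norm(points[pairs[:, 0]] - points[pairs[:, 1]], axis=1)

# Looped equivalent, for comparison
looped = np.array([np.linalg.norm(points[i] - points[j]) for i, j in pairs])
assert np.allclose(dists, looped)
```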

@adamwitmer Mar 3, 2026
Collaborator Author

I used line_profiler to measure the performance improvement from vectorizing the graph metric operations in calculate_graph_metrics with the following diff (the associated test, test_calculate_graph_metrics, still passes with these changes):

Diff
diff --git a/neat_ml/analysis/data_analysis.py b/neat_ml/analysis/data_analysis.py
index b295195..da98418 100644
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -326,31 +326,82 @@ def calculate_graph_metrics(
         # https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
         if method == "radius" and isinstance(param, (int, float)) and param > 0:
             tree = KDTree(points)
-            for i, j in tree.query_pairs(r=param):
-                dist = np.linalg.norm(points[i] - points[j])
-                graph.add_edge(i, j, distance=dist)
+            # 1. Get the pairs as a set and convert to a (M, 2) numpy array
+            pairs = np.array(list(tree.query_pairs(r=param)))
+
+            if pairs.size > 0:
+                # 2. Extract coordinates for all 'i' and 'j' indices at once
+                p_i = points[pairs[:, 0]]
+                p_j = points[pairs[:, 1]]
+
+                # 3. Vectorized distance calculation
+                dists = np.linalg.norm(p_i - p_j, axis=1)
+
+                # 4. Bulk add to graph
+                weighted_edges = [
+                    (int(pairs[idx, 0]), int(pairs[idx, 1]), {"distance": float(dists[idx])})
+                    for idx in range(len(pairs))
+                ]
+                graph.add_edges_from(weighted_edges)
         # alternatively calculate the graph using the KDTree
         # to find the k-nearest neighbors of the points as determined
         # by the input `param`
         elif method == "knn" and isinstance(param, int) and param > 0:
             k = min(param, n_nodes - 1)
             tree = KDTree(points)
+            # dists and idxs are shape (n_nodes, k + 1)
             dists, idxs = tree.query(points, k=k + 1)
-            for i in range(n_nodes):
-                for nn_idx in range(1, k + 1):
-                    j = idxs[i, nn_idx]
-                    if not graph.has_edge(i, j):
-                        dist = dists[i, nn_idx]
-                        graph.add_edge(i, j, distance=dist)
+
+            # 1. Slice off the first column (the node itself, distance 0)
+            neighbor_indices = idxs[:, 1:]
+            neighbor_dists = dists[:, 1:]
+
+            # 2. Create the source indices (0,0,0, 1,1,1, etc.) to match neighbor_indices
+            source_indices = np.repeat(np.arange(n_nodes), k)
+
+            # 3. Flatten both into 1D arrays
+            u = source_indices
+            v = neighbor_indices.flatten()
+            d = neighbor_dists.flatten()
+
+            # 4. Construct edge tuples: (source, target, {'distance': value})
+            # We use a list comprehension here as it's the fastest way to format for NetworkX
+            weighted_edges = [
+                (int(u[i]), int(v[i]), {"distance": float(d[i])}) 
+                for i in range(len(u))
+            ]
+
+            # 5. Add all edges at once
+            graph.add_edges_from(weighted_edges)
         # alternatively calculate the graph using Delaunay triangulation
         # https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.Delaunay.html
         elif method == "delaunay" and n_nodes >= 3:
             tri = Delaunay(points)
-            for simplex in tri.simplices:
-                for i, j in zip(simplex, np.roll(simplex, -1)):
-                    if not graph.has_edge(i, j):
-                        dist = np.linalg.norm(points[i] - points[j])
-                        graph.add_edge(i, j, distance=dist)
+
+            # 1. Get all edges from all simplices
+            # tri.simplices has shape (N, 3). We extract pairs (0,1), (1,2), and (2,0)
+            edges = np.concatenate([
+                tri.simplices[:, [0, 1]],
+                tri.simplices[:, [1, 2]],
+                tri.simplices[:, [2, 0]]
+            ])
+
+            # 2. Sort edges to ensure (i, j) is the same as (j, i) and remove duplicates
+            edges = np.sort(edges, axis=1)
+            unique_edges = np.unique(edges, axis=0)
+
+            # 3. Vectorized distance calculation
+            p1 = points[unique_edges[:, 0]]
+            p2 = points[unique_edges[:, 1]]
+            distances = np.linalg.norm(p1 - p2, axis=1)
+
+            # 4. Add all edges to the graph at once
+            # This is much faster than calling add_edge in a loop
+            weighted_edges = [
+                (int(u), int(v), {"distance": float(d)})
+                for (u, v), d in zip(unique_edges, distances)
+            ]
+            graph.add_edges_from(weighted_edges)
     except Exception as exc:
         warnings.warn(f"Graph construction ({method}) failed: {exc}")
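As a side note on the bulk-insertion pattern the diff relies on: networkx's `add_edges_from` accepts `(u, v, attr_dict)` tuples, so a single call can replace a per-edge `add_edge` loop. A minimal sketch with toy edges (assuming networkx is installed):

```python
# Illustrative sketch of bulk edge insertion with per-edge attributes.
import networkx as nx

graph = nx.Graph()
weighted_edges = [
    (0, 1, {"distance": 1.0}),
    (1, 2, {"distance": 2.5}),
]
# One call adds all edges with their attribute dicts
graph.add_edges_from(weighted_edges)

assert graph[0][1]["distance"] == 1.0
assert graph.number_of_edges() == 2
```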

I ran the workflow on this branch with the following input yaml file using each of the three methods (i.e. radius, knn and delaunay) with and without the changes in the diff.

input.yaml
roots:
  work: test
  results: test
  model: test

inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib

datasets:

  - id: PEO8K_Sodium_Citrate_Ph_2nd
    method: BubbleSAM
    role: infer
    class: Ph
    time_label: 2nd
    composition_cols:
      - "Sodium Citrate (wt%)"
      - "PEO 8 kg/mol (wt%)"
    analysis:
      input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
      composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
      per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary.csv
      aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate.csv
      group_cols:
        - Group
        - Label
        - Time
        - Class
      graph_method: radius
      graph_param: 30

There were no significant performance improvements for the radius (11.74 vs. 12.87) and knn (13.76 vs. 13.79) methods (new vs. old, respectively); however, performance did improve for the delaunay method (31.90 vs. 60.10), with the main bottleneck of the old method being the following loop:

   352   2495964        2e+10   7996.4     33.2                  for i, j in zip(simplex, np.roll(simplex, -1)):

which may warrant adopting the vectorized change(s) for the delaunay method.
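To sanity-check that the vectorized delaunay edge extraction matches the looped version, a standalone sketch (random points, illustrative only, not the PR's code) can compare the two resulting edge sets:

```python
# Compare looped vs. vectorized extraction of unique Delaunay edges.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.random((200, 2))
tri = Delaunay(points)

# Looped version: walk each simplex and deduplicate via a set
looped = set()
for simplex in tri.simplices:
    for i, j in zip(simplex, np.roll(simplex, -1)):
        looped.add(tuple(sorted((int(i), int(j)))))

# Vectorized version: stack the three edge slices, sort each pair, dedupe
edges = np.concatenate([
    tri.simplices[:, [0, 1]],
    tri.simplices[:, [1, 2]],
    tri.simplices[:, [2, 0]],
])
edges = np.sort(edges, axis=1)
unique_edges = np.unique(edges, axis=0)

# Both approaches yield the same undirected edge set
assert looped == {tuple(e) for e in unique_edges.tolist()}
```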

Collaborator

I should mention: please disclose usage of AI tools/LLMs whenever you accept suggestions from them. We generally have to be cautious because LLM-based code may have licensing issues. I do expect you to disclose what you use LLMs for and to ensure that the provenance of the code you place in the repo is license-compatible (which is very hard to ensure when pasting from an LLM).

The expectation is also that you understand what each line of code does.

Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/analysis/data_analysis.py
Comment thread neat_ml/analysis/data_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_analysis.py Outdated
Comment thread neat_ml/tests/test_workflow.py Outdated
Comment thread run_workflow.py Outdated
Comment thread neat_ml/workflow/lib_workflow.py
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing

3 participants