WIP, ENH: Add scripts to analyze BubbleSAM and OpenCV detection results #25
adamwitmer wants to merge 18 commits into nidhin_bubblesam_detection_backup from
Conversation
- Add preprocessing.py for enhancing input images
- Add detection.py for OpenCV blob detection
- Add input data for testing OpenCV detection
- Add scripts for testing OpenCV detection
- Add raw_processed.png to baseline for test
- Add raw_detection.png to baseline for test
- Modified pyproject.toml to pull .tiff and .png
- Removed fixtures from test scripts
- Use importlib.resources to retrieve images
- Reduced the tolerances for compare_images()
- Added ruff.toml to enforce no mutable default args
- …rocessed.png, neat_ml/tests/data/images/raw.tiff, neat_ml/tests/data/images_Processed/raw.tiff: convert to Git LFS
- Added bubblesam.py and SAM.py into neat_ml/bubblesam
- Added stage_bubblesam() into run_workflow.py
- Added test_bubblesam.py and associated baseline images
- Modified .gitlab-ci.yml, mypy.ini and pyproject.toml to incorporate sam2 module
- Updated README.md with sam-2 installation instructions
- Added bubblesam_detection_test.yaml for functional test
- Created neat_ml/workflow and added lib_workflow.py
- Added test_workflow.py to test lib_workflow.py
- Added neat_ml/analysis module for bubble analysis
- Added test_analysis.py for testing analysis module
- Updated README.md with instructions to run the OpenCV and BubbleSAM workflows with detection and analysis
- Added .github/workflows/ci.yml
- Added LANL copyright assertion ID
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic

* parameterize test for error outputs
* add networkx to requirements.txt

* remove unused test assets
* fix README
* update pyproject.toml
You should probably temporarily update the CI config so that it runs against the target branch.
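As a minimal sketch of that suggestion (assuming GitHub Actions, consistent with the later `.github/workflows/ci.yml` commit; the exact trigger layout of the repo's workflow file is not shown here, so this excerpt is illustrative):

```yaml
# Hypothetical excerpt of .github/workflows/ci.yml: temporarily trigger CI
# for pushes and pull requests targeting the review branch as well as main.
on:
  push:
    branches: [main, nidhin_bubblesam_detection_backup]
  pull_request:
    branches: [main, nidhin_bubblesam_detection_backup]
```

The extra branch entry can be dropped again once the PR is merged.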
That said, this probably won't change a whole lot; I can generally provide about one detailed review per week in any case, so presenting everything for review a few months past due, all at once, probably isn't sustainable. Maybe the team will give you a bit more time, though considerable grace has already been provided. In any case, I'll do what I can to get you timely feedback.
    # calculate the graph using the KDTree to find the
    # closest points within radius set by `param`
    # https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
    if method == "radius" and isinstance(param, (int, float)) and param > 0:
Potential area of improvement here and below if we vectorize these operations instead of looping.
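To make the suggestion concrete, here is a minimal standalone sketch of the vectorized radius method: one `query_pairs` call, one batched distance computation, and a single `add_edges_from` instead of per-edge `add_edge` calls. The function name `radius_graph` and its signature are illustrative, not from the repo.

```python
import numpy as np
import networkx as nx
from scipy.spatial import KDTree


def radius_graph(points: np.ndarray, radius: float) -> nx.Graph:
    """Connect all point pairs closer than `radius`, vectorized."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(points)))
    tree = KDTree(points)
    # query_pairs returns index pairs (i, j), i < j, within `radius`
    pairs = np.array(sorted(tree.query_pairs(r=radius)))
    if pairs.size:
        # one vectorized norm over every pair at once
        dists = np.linalg.norm(points[pairs[:, 0]] - points[pairs[:, 1]], axis=1)
        graph.add_edges_from(
            (int(i), int(j), {"distance": float(d)})
            for (i, j), d in zip(pairs, dists)
        )
    return graph
```

The loop over edge tuples remains only to format attribute dicts for NetworkX; the distance arithmetic itself is fully batched.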
I used line_profiler to measure the performance improvement from vectorizing the graph metric operations in calculate_graph_metrics with the following diff (the associated test, test_calculate_graph_metrics, still passes with these changes):
Diff
diff --git a/neat_ml/analysis/data_analysis.py b/neat_ml/analysis/data_analysis.py
index b295195..da98418 100644
--- a/neat_ml/analysis/data_analysis.py
+++ b/neat_ml/analysis/data_analysis.py
@@ -326,31 +326,82 @@ def calculate_graph_metrics(
# https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.spatial.KDTree.html
if method == "radius" and isinstance(param, (int, float)) and param > 0:
tree = KDTree(points)
- for i, j in tree.query_pairs(r=param):
- dist = np.linalg.norm(points[i] - points[j])
- graph.add_edge(i, j, distance=dist)
+ # 1. Get the pairs as a set and convert to a (M, 2) numpy array
+ pairs = np.array(list(tree.query_pairs(r=param)))
+
+ if pairs.size > 0:
+ # 2. Extract coordinates for all 'i' and 'j' indices at once
+ p_i = points[pairs[:, 0]]
+ p_j = points[pairs[:, 1]]
+
+ # 3. Vectorized distance calculation
+ dists = np.linalg.norm(p_i - p_j, axis=1)
+
+ # 4. Bulk add to graph
+ weighted_edges = [
+ (int(pairs[idx, 0]), int(pairs[idx, 1]), {"distance": float(dists[idx])})
+ for idx in range(len(pairs))
+ ]
+ graph.add_edges_from(weighted_edges)
# alternatively calculate the graph using the KDTree
# to find the k-nearest neighbors of the points as determined
# by the input `param`
elif method == "knn" and isinstance(param, int) and param > 0:
k = min(param, n_nodes - 1)
tree = KDTree(points)
+ # dists and idxs are shape (n_nodes, k + 1)
dists, idxs = tree.query(points, k=k + 1)
- for i in range(n_nodes):
- for nn_idx in range(1, k + 1):
- j = idxs[i, nn_idx]
- if not graph.has_edge(i, j):
- dist = dists[i, nn_idx]
- graph.add_edge(i, j, distance=dist)
+
+ # 1. Slice off the first column (the node itself, distance 0)
+ neighbor_indices = idxs[:, 1:]
+ neighbor_dists = dists[:, 1:]
+
+ # 2. Create the source indices (0,0,0, 1,1,1, etc.) to match neighbor_indices
+ source_indices = np.repeat(np.arange(n_nodes), k)
+
+ # 3. Flatten both into 1D arrays
+ u = source_indices
+ v = neighbor_indices.flatten()
+ d = neighbor_dists.flatten()
+
+ # 4. Construct edge tuples: (source, target, {'distance': value})
+ # We use a list comprehension here as it's the fastest way to format for NetworkX
+ weighted_edges = [
+ (int(u[i]), int(v[i]), {"distance": float(d[i])})
+ for i in range(len(u))
+ ]
+
+ # 5. Add all edges at once
+ graph.add_edges_from(weighted_edges)
# alternatively calculate the graph using Delaunay triangulation
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.Delaunay.html
elif method == "delaunay" and n_nodes >= 3:
tri = Delaunay(points)
- for simplex in tri.simplices:
- for i, j in zip(simplex, np.roll(simplex, -1)):
- if not graph.has_edge(i, j):
- dist = np.linalg.norm(points[i] - points[j])
- graph.add_edge(i, j, distance=dist)
+
+ # 1. Get all edges from all simplices
+ # tri.simplices has shape (N, 3). We extract pairs (0,1), (1,2), and (2,0)
+ edges = np.concatenate([
+ tri.simplices[:, [0, 1]],
+ tri.simplices[:, [1, 2]],
+ tri.simplices[:, [2, 0]]
+ ])
+
+ # 2. Sort edges to ensure (i, j) is the same as (j, i) and remove duplicates
+ edges = np.sort(edges, axis=1)
+ unique_edges = np.unique(edges, axis=0)
+
+ # 3. Vectorized distance calculation
+ p1 = points[unique_edges[:, 0]]
+ p2 = points[unique_edges[:, 1]]
+ distances = np.linalg.norm(p1 - p2, axis=1)
+
+ # 4. Add all edges to the graph at once
+ # This is much faster than calling add_edge in a loop
+ weighted_edges = [
+ (int(u), int(v), {"distance": float(d)})
+ for (u, v), d in zip(unique_edges, distances)
+ ]
+ graph.add_edges_from(weighted_edges)
except Exception as exc:
             warnings.warn(f"Graph construction ({method}) failed: {exc}")

I ran the workflow on this branch with the following input YAML file, using each of the three methods (i.e. radius, knn, and delaunay) with and without the changes in the diff.
input.yaml
roots:
work: test
results: test
model: test
inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib
datasets:
- id: PEO8K_Sodium_Citrate_Ph_2nd
method: BubbleSAM
role: infer
class: Ph
time_label: 2nd
composition_cols:
- "Sodium Citrate (wt%)"
- "PEO 8 kg/mol (wt%)"
analysis:
input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary.csv
aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate.csv
group_cols:
- Group
- Label
- Time
- Class
graph_method: radius
  graph_param: 30

There were no significant performance improvements for the radius (11.74 vs. 12.87) or knn (13.76 vs. 13.79) methods (new vs. old, respectively); however, performance did improve with the delaunay method (31.90 vs. 60.10), with the main bottleneck of the old method being the following loop:
   352   2495964        2e+10   7996.4     33.2          for i, j in zip(simplex, np.roll(simplex, -1)):

which may warrant adopting the vectorized change(s) for the delaunay method.
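For reference, the delaunay vectorization from the diff can be lifted into a standalone sketch: extract all three edges per simplex in bulk, canonicalize and deduplicate them with NumPy, then compute every distance in one call. The function name `delaunay_graph` is illustrative, not from the repo.

```python
import numpy as np
import networkx as nx
from scipy.spatial import Delaunay


def delaunay_graph(points: np.ndarray) -> nx.Graph:
    """Build a Delaunay-triangulation graph with vectorized edge extraction."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(points)))
    tri = Delaunay(points)
    # each simplex (a, b, c) contributes edges (a, b), (b, c), (c, a)
    edges = np.concatenate([
        tri.simplices[:, [0, 1]],
        tri.simplices[:, [1, 2]],
        tri.simplices[:, [2, 0]],
    ])
    # sort each pair so (i, j) and (j, i) collapse under np.unique
    edges = np.unique(np.sort(edges, axis=1), axis=0)
    dists = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)
    graph.add_edges_from(
        (int(i), int(j), {"distance": float(d)})
        for (i, j), d in zip(edges, dists)
    )
    return graph
```

This replaces the per-simplex `zip(simplex, np.roll(simplex, -1))` loop flagged by the profiler with array operations, which is where the measured delaunay speedup came from.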
I should mention: please disclose usage of AI tools/LLMs whenever you accept suggestions from them. We generally have to be cautious because LLM-based code may have licensing issues. I do expect you to disclose what you use LLMs for and to ensure that the provenance of the code you place in the repo is license compatible (which is very hard to ensure when pasting from an LLM).
Certainly, the expectation is also that you understand what each line of code does as well.
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing
This branch has been rebased against PR #3 to show only changes related to #4. It is a WIP branch intended to be presented for review and updated after merging PR #3.