
ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4

Open
nidhinthomas-ai wants to merge 20 commits into lanl:main from nidhinthomas-ai:nidhin_data_analysis

Conversation

@nidhinthomas-ai nidhinthomas-ai commented Sep 25, 2025

This pull request introduces a new library module, neat_ml/analysis, which analyzes bubble detection results from either BubbleSAM or OpenCV. It also adds an accompanying test suite.

What’s inside

  • Analysis Module (neat_ml/analysis/): A new module dedicated to post-processing and analyzing bubble detection data.
    • data_analysis.py: A script that extracts bubble features, computes a wide range of spatial metrics, and generates summary reports.
    • tests/test_analysis.py: Unit tests for the analysis module to ensure correctness and stability.
  • Workflow Integration (workflow/lib_workflow.py):
    • Modified to include a new wrapper function, stage_analyze_features(), which integrates the data_analysis.py script into the main processing workflow.
  • Detailed Description of data_analysis.py

The data_analysis.py script is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:

  1. Scan and Load: Recursively scans a directory for result files from either OpenCV (*_bubble_data.parquet.gzip) or BubbleSAM (*_masks_filtered.parquet.gzip).
  2. Parse Metadata: Extracts experimental metadata (e.g., offset, position, label) directly from the filenames.
  3. Per-Image Analysis: For each image, it calculates a suite of metrics:
    • Basic Blob Stats: Number of bubbles, mean/median/std of area and radius.
    • Coverage: Percentage of the image area covered by bubbles.
    • Nearest-Neighbor Distance (NND): Mean and median distance between the closest bubbles.
    • Voronoi Tessellation: Statistics on the area of Voronoi cells, which describe the local space around each bubble.
    • Graph-Based Metrics: Models the spatial relationship between bubbles as a graph and computes network properties (e.g., connectivity, clustering).
  4. Output Generation: Saves the results into two CSV files:
    • per_image_metrics.csv: A detailed report with all calculated metrics for every single image.
    • aggregated_metrics.csv: A summary report where metrics are aggregated (min, max, median, std) across different experimental groups.
  • Key Methods
    • full_analysis(): The main entry point that orchestrates the entire pipeline—from processing the input directory to saving the final per-image and aggregated CSV reports.
    • process_directory(): Manages file discovery, loading, and iteration. It identifies the correct parsing function based on the specified mode ('OpenCV' or 'BubbleSAM') and compiles all per-image results into a single DataFrame.
    • calculate_all_spatial_metrics(): A wrapper function that computes the full set of spatial metrics for a single image's data. It calls the specialized functions below.
    • calculate_nnd_stats(): Computes the mean and median nearest-neighbor distances between bubble centroids using a KD-Tree for efficient querying. This helps quantify how clustered the bubbles are.
    • calculate_voronoi_stats(): Performs Voronoi tessellation on the bubble centroids and calculates statistics (mean, median, std) on the resulting cell areas. This provides insight into the spatial organization and density of the bubbles.
    • calculate_graph_metrics(): Constructs a spatial graph where bubbles are nodes and edges connect nearby bubbles. It supports three construction methods:
      • radius: Connects all bubbles within a given search radius.
      • knn: Connects each bubble to its k-nearest neighbors.
      • delaunay: Creates a graph based on Delaunay triangulation, connecting "natural" neighbors.
      It then calculates network metrics such as average node degree, clustering coefficient, and properties of the largest connected component (LCC).
    • calculate_summary_statistics(): Aggregates the per-image metrics DataFrame. It groups the data by specified columns (e.g., 'Label', 'Time') and computes summary statistics (min, max, median, std) for all numerical metrics.
    • merge_composition_data(): An optional utility to merge the metrics DataFrame with an external CSV file (e.g., containing sample composition data) using a shared key like a UniqueID.
    • parse_filename() and load_df(): Helper functions tailored to parse metadata from the specific filename conventions and load the data formats produced by OpenCV and BubbleSAM.
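A minimal sketch of the nearest-neighbor computation described above, assuming an (N, 2) centroid array and SciPy's cKDTree (the function name matches the PR, but the exact signature and return type here are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def calculate_nnd_stats(centroids: np.ndarray) -> dict:
    """Mean/median nearest-neighbor distance between bubble centroids.

    `centroids` is assumed to be an (N, 2) array of (x, y) positions;
    this is an illustrative sketch, not the PR's actual implementation.
    """
    if len(centroids) < 2:
        return {"nnd_mean": np.nan, "nnd_median": np.nan}
    tree = cKDTree(centroids)
    # k=2 because the closest point to each centroid is itself (distance 0).
    dists, _ = tree.query(centroids, k=2)
    nnd = dists[:, 1]
    return {"nnd_mean": float(np.mean(nnd)), "nnd_median": float(np.median(nnd))}
```

A KD-Tree makes the query O(N log N) rather than the O(N^2) of a brute-force pairwise distance matrix, which matters for dense bubble fields.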
  • Runtime Tip

The analysis typically uses the radius graph method with a parameter of 30, which defines the maximum distance (in pixels) at which two bubbles are considered connected in the spatial graph. Users can adjust this parameter for their own datasets.
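For illustration, a radius graph of the kind described above could be built with SciPy and networkx roughly as follows (a sketch under stated assumptions, not the PR's actual calculate_graph_metrics(); that function also supports 'knn' and 'delaunay' modes):

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def build_radius_graph(centroids: np.ndarray, radius: float = 30.0) -> nx.Graph:
    """Connect every pair of bubbles whose centroids lie within `radius` pixels."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(centroids)))
    tree = cKDTree(centroids)
    # query_pairs returns all index pairs (i, j), i < j, within the radius.
    graph.add_edges_from(tree.query_pairs(r=radius))
    return graph
```

From such a graph, metrics of the kind the script reports follow directly, e.g. `nx.average_clustering(G)` for the clustering coefficient and `max(nx.connected_components(G), key=len)` for the largest connected component.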

@adamwitmer adamwitmer mentioned this pull request Oct 10, 2025
5 tasks
@tylerjereddy tylerjereddy added the enhancement New feature or request label Oct 15, 2025
Comment threads (5) on neat_ml/analysis/data_analysis.py (outdated)

adamwitmer commented Feb 10, 2026

Initial TODO items for reviewing this branch:

  • read/understand diff/PR (read line-by-line, probing for weaknesses)
  • check for unnecessary complexity; areas for improvement/simplification
  • rebase branch against main (and push backup branch)
  • fix issues with git lfs (i.e. image storage with zenodo https://github.com/lanl/ldrd_neat_ml_images)
  • fix github CI
  • run test-suite and check test coverage
  • run branch according to README.md instructions
  • copy remaining review comments from gitlab
  • perform detailed code review
  • address all review comments
  • triple check diff
    • 1st check
    • 2nd check
    • 3rd check

Comment threads: .github/workflows/ci.yml (1); neat_ml/bubblesam/bubblesam.py (1); neat_ml/analysis/data_analysis.py (14, of which 13 outdated); neat_ml/tests/test_analysis.py (6, outdated); README.md (1, outdated); run_workflow.py (1)
@adamwitmer

Regarding the leftover comment from the WIP branch #25 (comment): as mentioned in person, I changed the logic there to differ from the LLM copy/paste I referenced previously, and instead vectorized the operations in calculate_graph_metrics based on a better understanding of what the code is actually doing. I also updated the accompanying test to be more robust to those changes, because although the tests were passing, the output on actual data was not the same before and after the previous change.

I verified this by checking the outputs on real data (on glycan) before and after vectorizing the function, using the following input YAML config file:

test.yaml
roots:
  work: test
  results: test
  model: test

inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib

datasets:

  - id: PEO8K_Sodium_Citrate_Ph_2nd
    method: BubbleSAM 
    role: infer            
    class: Ph
    time_label: 2nd
    composition_cols:
      - "Sodium Citrate (wt%)"
      - "PEO 8 kg/mol (wt%)"
    analysis:
      input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
      composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
      per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_old.csv
      aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate_old.csv
      group_cols:
        - Group
        - Label
        - Time
        - Class
      graph_method: knn
      graph_param: 1

By looking at the aggregate CSV file outputs from the incantation:

python run_workflow.py --config test.yaml --steps analysis
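A before/after comparison of such aggregate CSVs can be done with a tolerance-based frame check in pandas; a sketch (the helper name is illustrative and not part of the PR):

```python
import pandas as pd

def frames_match(a: pd.DataFrame, b: pd.DataFrame, tol: float = 1e-9) -> bool:
    """True if two metric tables agree within a numeric tolerance."""
    try:
        # check_exact=False compares floats with rtol/atol rather than bitwise.
        pd.testing.assert_frame_equal(a, b, check_exact=False, atol=tol)
        return True
    except AssertionError:
        return False

# Usage (filenames illustrative; the "_old" suffix follows the config above):
# old = pd.read_csv("..._Summary_Aggregate_old.csv")
# new = pd.read_csv("..._Summary_Aggregate_new.csv")
# assert frames_match(old, new)
```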

Comment threads: neat_ml/analysis/data_analysis.py (2, outdated); neat_ml/tests/test_analysis.py (4, of which 3 outdated); README.md (2); neat_ml/workflow/lib_workflow.py (1, outdated)
@adamwitmer

@tylerjereddy I have completed my initial review checklist and addressed all self-review comments including those from the WIP branch #25, so this PR should be ready for your review now, thanks.

@tylerjereddy

@adamwitmer I'll make a note to review this on Friday, March 20th.

The delay is because I'm not seeing enough activity on the ASC polymer project--once you've caught up with ~2 days of work from last week + 2 days from next week (~32 hours of very solid effort on that project), I'll see if it looks like the projects are being balanced. I can't pick up the slack over on that project just because you want to get this project caught up on its delay; otherwise I'm assuming the burden of the delay on my own.

adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* fix typing
* remove unnecessary monkeypatching
* fix/update tests
* purge unnecessary test/functions
* fix function logic
* add in-line comments
* fix docstrings
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* parameterize test for error outputs
* add networkx to requirements.txt
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* update/fix README
* add progress bar for parquet processing
* privatize helper functions in `data_analysis`
* parametrize tests
* remove private function from `as_steps_set`
* fix/remove typing
* fix docs
adamwitmer added a commit that referenced this pull request Mar 16, 2026
* update README with more information pertaining to analysis step
* remove redundant test
* fix docs
Comment threads (2) on neat_ml/analysis/data_analysis.py (outdated)

@tylerjereddy tylerjereddy left a comment


  • the test line coverage looks "ok" for the Python modules touched in the PR; I did have some concern about robust numerical testing beyond trivial cases
  • I don't see any indication of how long it takes to run the full analysis for both OpenCV and SAM2 for a given condition. Can we iterate/repeat if we need to? Have you?
  • I also don't see an indication of whether running such a full analysis allows for reproduction of the results in the paper. I think that's one of the main priorities re: others being able to reproduce what we do, so it would be good to state that clearly here.
  • I don't see a clear indication of whether we can extract the same features for both OpenCV and SAM2. Do both design matrices (the rows and columns of parsed data) match between the two in terms of the numbers and identities of the features (columns)? That should be clearly communicated, possibly with a brief confirmation in, e.g., the README.

from tqdm.auto import tqdm
from pyarrow.lib import ArrowInvalid

__all__: Sequence[str] = [

I don't think we need this type hint--they're mostly useful for function signatures.

composition_df : pd.DataFrame
The external table with composition data.
cols_to_add : Sequence[str]
A list of columns from composition_df to add to summary_df.

Sequence rather than list?

re.IGNORECASE | re.VERBOSE,
)
match = _RE.match(fname)
if not match:

What's the real-world scenario where we have an invalid filename and we don't want to error out? In a normal processing workflow, shouldn't the relevant files all match the expected naming format? If you expect to have, e.g., OpenCV and SAM2 raw data at the same path, maybe clarify that? How did you regenerate the original results produced for the paper?

Is there not a risk that we silently ignore a problem and end up with incorrect final aggregated data?
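A strict variant along the lines of this suggestion might look as follows (the regex and function name are hypothetical illustrations, not the PR's actual `_RE` pattern):

```python
import re

# Hypothetical pattern for illustration only; the PR's real _RE differs.
_RE = re.compile(
    r"(?P<label>\w+)_offset(?P<offset>\d+)_pos(?P<position>\d+)",
    re.IGNORECASE,
)

def parse_filename_strict(fname: str) -> dict:
    """Raise instead of silently skipping, so a bad input cannot be
    dropped unnoticed and corrupt the final aggregated data."""
    match = _RE.match(fname)
    if not match:
        raise ValueError(
            f"Filename {fname!r} does not match the expected naming convention"
        )
    return match.groupdict()
```

Failing fast like this trades robustness to stray files for an explicit guarantee that every file contributing to the aggregates was parsed.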

Loads a parquet file and converts it to the standard blob schema.

BubbleSAM parquet files store 'area' and 'bbox'. This function computes the
'center' and 'radius' to make it compatible with downstream analysis.

It would be good to clarify here if the OpenCV input data is already formatted appropriately, so that this function is not useful for OpenCV?
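The conversion the docstring describes might be sketched as below, assuming bbox = (x, y, width, height) and an equivalent-circle radius derived from the mask area (the PR's actual schema, column layout, and function name may differ):

```python
import math
import pandas as pd

def to_blob_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'center' and 'radius' columns from BubbleSAM's 'area'/'bbox'.

    Assumes bbox = (x, y, width, height); 'radius' is the radius of a
    circle with the same area as the mask. OpenCV outputs that already
    carry these columns would not need this conversion.
    """
    out = df.copy()
    # Center of the bounding box as a proxy for the bubble centroid.
    out["center"] = out["bbox"].apply(lambda b: (b[0] + b[2] / 2, b[1] + b[3] / 2))
    # Equivalent-circle radius: area = pi * r^2  =>  r = sqrt(area / pi).
    out["radius"] = (out["area"] / math.pi) ** 0.5
    return out
```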

Parameters
----------
parquet_path : Path
The path to the `parquet.gzip` file.

Isn't method missing from your parameter declarations in the docstring here? If this function only does something useful for SAM2 and just passes the dataframe through for OpenCV, that seems useful to clarify up front.

wf.stage_analyze_features(ds, paths={})
assert "Analysis input_dir" in caplog.text

def test_stage_analyze_features_raises_when_composition_csv_missing(

Nothing is being "raised" here (no errors); correct the nomenclature.

},
}
wf.stage_analyze_features(ds, paths={})
assert "Composition CSV" in caplog.text

Many of these tests are copy/paste-like and could probably be condensed; it is a ton of reading for the reviewer, and some look like mistaken duplicates, as I note elsewhere.

assert "Composition CSV" in caplog.text


def test_stage_analyze_features_raises_when_no_detection_outputs_opencv(

Nothing is being "raised" here; correct the nomenclature.

"graph_param": 1
},
}
wf.stage_analyze_features(ds, paths={})

I suspect you could condense the tests more than you have; very copy-pastey.

caplog.set_level(logging.WARNING)
ds = {"id": "AN1", "method": "OpenCV", "time_label": "T01", "analysis": {}}
wf.stage_analyze_features(ds, paths={})
assert "No analysis input_dir provided" in caplog.text

What's going on here? This test looks identical to test_stage_analyze_features_raises_when_input_dir_unavailable(), and pretty much identical to test_stage_analyze_features_errors_when_input_dir_unavailable().

This drains enormous amounts of review time trying to sort out why these tests exist, even after reading the code a few times.


Noting here that I need to do a better job of double checking for unnecessary duplication before presenting PR's for review to avoid adding excessive burden to the reviewer.

@tylerjereddy

@adamwitmer I'll ask that you respond to each individual review comment in this PR and other remaining PRs, given the delays, to discourage rushed work and encourage attention to detail, which seems to hold the reviews up a lot.
