
ENH: Add scripts to analyze BubbleSAM and OpenCV detection results#4

Open
nidhinthomas-ai wants to merge 20 commits into lanl:main from nidhinthomas-ai:nidhin_data_analysis

Conversation

@nidhinthomas-ai nidhinthomas-ai commented Sep 25, 2025

This pull request introduces a new library module, neat_ml/analysis, which analyzes bubble detection results from either BubbleSAM or OpenCV. It also adds an accompanying test suite.

What’s inside

  • Analysis Module (neat_ml/analysis/): A new module dedicated to post-processing and analyzing bubble detection data.
    • data_analysis.py: A script that extracts bubble features, computes a wide range of spatial metrics, and generates summary reports.
    • tests/test_analysis.py: Unit tests for the analysis module to ensure correctness and stability.
  • Workflow Integration (workflow/lib_workflow.py):
    • Modified to include a new wrapper function, stage_analyze_features(), which integrates the data_analysis.py script into the main processing workflow.
  • Detailed Description of data_analysis.py

The data_analysis.py script is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:

  1. Scan and Load: Recursively scans a directory for result files from either OpenCV (*_bubble_data.parquet.gzip) or BubbleSAM (*_masks_filtered.parquet.gzip).
  2. Parse Metadata: Extracts experimental metadata (e.g., offset, position, label) directly from the filenames.
  3. Per-Image Analysis: For each image, it calculates a suite of metrics:
    • Basic Blob Stats: Number of bubbles, mean/median/std of area and radius.
    • Coverage: Percentage of the image area covered by bubbles.
    • Nearest-Neighbor Distance (NND): Mean and median distance between the closest bubbles.
    • Voronoi Tessellation: Statistics on the area of Voronoi cells, which describe the local space around each bubble.
    • Graph-Based Metrics: Models the spatial relationship between bubbles as a graph and computes network properties (e.g., connectivity, clustering).
  4. Output Generation: Saves the results into two CSV files:
    • per_image_metrics.csv: A detailed report with all calculated metrics for every single image.
    • aggregated_metrics.csv: A summary report where metrics are aggregated (min, max, median, std) across different experimental groups.
  • Key Methods
    • full_analysis(): The main entry point that orchestrates the entire pipeline—from processing the input directory to saving the final per-image and aggregated CSV reports.
    • process_directory(): Manages file discovery, loading, and iteration. It identifies the correct parsing function based on the specified mode ('OpenCV' or 'BubbleSAM') and compiles all per-image results into a single DataFrame.
    • calculate_all_spatial_metrics(): A wrapper function that computes the full set of spatial metrics for a single image's data. It calls the specialized functions below.
    • calculate_nnd_stats(): Computes the mean and median nearest-neighbor distances between bubble centroids using a KD-Tree for efficient querying. This helps quantify how clustered the bubbles are.
    • calculate_voronoi_stats(): Performs Voronoi tessellation on the bubble centroids and calculates statistics (mean, median, std) on the resulting cell areas. This provides insight into the spatial organization and density of the bubbles.
    • calculate_graph_metrics(): Constructs a spatial graph where bubbles are nodes and edges connect nearby bubbles. It supports three construction methods:
      • radius: Connects all bubbles within a given search radius.
      • knn: Connects each bubble to its k-nearest neighbors.
      • delaunay: Creates a graph based on Delaunay triangulation, connecting "natural" neighbors.
      It then calculates network metrics such as average node degree, clustering coefficient, and properties of the largest connected component (LCC).
    • calculate_summary_statistics(): Aggregates the per-image metrics DataFrame. It groups the data by specified columns (e.g., 'Label', 'Time') and computes summary statistics (min, max, median, std) for all numerical metrics.
    • merge_composition_data(): An optional utility to merge the metrics DataFrame with an external CSV file (e.g., containing sample composition data) using a shared key like a UniqueID.
    • parse_filename() and load_df(): Helper functions tailored to parse metadata from the specific filename conventions and load the data formats produced by OpenCV and BubbleSAM.
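A minimal sketch of the nearest-neighbor computation described above, assuming an (N, 2) centroid array and SciPy's cKDTree (the function name matches the PR, but the exact signature and return type here are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def calculate_nnd_stats(centroids: np.ndarray) -> dict:
    """Mean/median nearest-neighbor distance between bubble centroids.

    `centroids` is assumed to be an (N, 2) array of (x, y) positions;
    this is an illustrative sketch, not the PR's actual implementation.
    """
    if len(centroids) < 2:
        return {"nnd_mean": np.nan, "nnd_median": np.nan}
    tree = cKDTree(centroids)
    # k=2 because the closest point to each centroid is itself (distance 0).
    dists, _ = tree.query(centroids, k=2)
    nnd = dists[:, 1]
    return {"nnd_mean": float(np.mean(nnd)), "nnd_median": float(np.median(nnd))}
```

A KD-Tree makes the query O(N log N) rather than the O(N^2) of a brute-force pairwise distance matrix, which matters for dense bubble fields.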
  • Runtime Tip

The analysis typically uses the radius graph method with a parameter of 30, which defines the maximum distance (in pixels) at which two bubbles are considered connected in the spatial graph. Users can adjust this parameter for their own datasets.
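For illustration, a radius graph of the kind described above could be built with SciPy and networkx roughly as follows (a sketch under stated assumptions, not the PR's actual calculate_graph_metrics(); that function also supports 'knn' and 'delaunay' modes):

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def build_radius_graph(centroids: np.ndarray, radius: float = 30.0) -> nx.Graph:
    """Connect every pair of bubbles whose centroids lie within `radius` pixels."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(centroids)))
    tree = cKDTree(centroids)
    # query_pairs returns all index pairs (i, j), i < j, within the radius.
    graph.add_edges_from(tree.query_pairs(r=radius))
    return graph
```

From such a graph, metrics of the kind the script reports follow directly, e.g. `nx.average_clustering(G)` for the clustering coefficient and `max(nx.connected_components(G), key=len)` for the largest connected component.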

@adamwitmer adamwitmer mentioned this pull request Oct 10, 2025
5 tasks
@tylerjereddy tylerjereddy added the enhancement New feature or request label Oct 15, 2025
Comment threads (5) on neat_ml/analysis/data_analysis.py (outdated)

adamwitmer commented Feb 10, 2026

Initial TODO items for reviewing this branch:

  • read/understand diff/PR (read line-by-line, probing for weaknesses)
  • check for unnecessary complexity; areas for improvement/simplification
  • rebase branch against main (and push backup branch)
  • fix issues with git lfs (i.e. image storage with zenodo https://github.com/lanl/ldrd_neat_ml_images)
  • fix github CI
  • run test-suite and check test coverage
  • run branch according to README.md instructions
  • copy remaining review comments from gitlab
  • perform detailed code review
  • address all review comments
  • triple check diff
    • 1st check
    • 2nd check
    • 3rd check

Comment threads: .github/workflows/ci.yml (1); neat_ml/bubblesam/bubblesam.py (1); neat_ml/analysis/data_analysis.py (14, of which 13 outdated); neat_ml/tests/test_analysis.py (6, outdated); README.md (1, outdated); run_workflow.py (1)
@adamwitmer

Regarding the leftover comment from the WIP branch #25 (comment): as mentioned in person, I changed the logic there to differ from the LLM copy/paste I referenced previously, and instead vectorized the operations in calculate_graph_metrics based on a better understanding of what the code is actually doing. I also updated the accompanying test to be more robust to those changes, because although the tests were passing, the output on actual data was not the same before and after the previous change.

I verified this by checking the outputs on real data (on glycan) before and after vectorizing the function, using the following input YAML config file:

test.yaml
roots:
  work: test
  results: test
  model: test

inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib

datasets:

  - id: PEO8K_Sodium_Citrate_Ph_2nd
    method: BubbleSAM 
    role: infer            
    class: Ph
    time_label: 2nd
    composition_cols:
      - "Sodium Citrate (wt%)"
      - "PEO 8 kg/mol (wt%)"
    analysis:
      input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
      composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
      per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_old.csv
      aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate_old.csv
      group_cols:
        - Group
        - Label
        - Time
        - Class
      graph_method: knn
      graph_param: 1

By looking at the aggregate CSV file outputs from the incantation:

python run_workflow.py --config test.yaml --steps analysis
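A before/after comparison of such aggregate CSVs can be done with a tolerance-based frame check in pandas; a sketch (the helper name is illustrative and not part of the PR):

```python
import pandas as pd

def frames_match(a: pd.DataFrame, b: pd.DataFrame, tol: float = 1e-9) -> bool:
    """True if two metric tables agree within a numeric tolerance."""
    try:
        # check_exact=False compares floats with rtol/atol rather than bitwise.
        pd.testing.assert_frame_equal(a, b, check_exact=False, atol=tol)
        return True
    except AssertionError:
        return False

# Usage (filenames illustrative; the "_old" suffix follows the config above):
# old = pd.read_csv("..._Summary_Aggregate_old.csv")
# new = pd.read_csv("..._Summary_Aggregate_new.csv")
# assert frames_match(old, new)
```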

Comment threads: neat_ml/analysis/data_analysis.py (2, outdated); neat_ml/tests/test_analysis.py (4, of which 3 outdated); README.md (2); neat_ml/workflow/lib_workflow.py (1, outdated)
@adamwitmer

@tylerjereddy I have completed my initial review checklist and addressed all self-review comments including those from the WIP branch #25, so this PR should be ready for your review now, thanks.

@tylerjereddy

@adamwitmer I'll make a note to review this on Friday, March 20th.

The delay is because I'm not seeing enough activity on the ASC polymer project--once you've caught up with ~2 days of work from last week + 2 days from next week (~32 hours of very solid effort on that project), I'll see if it looks like the projects are being balanced. I can't pick up the slack over on that project just because you want to get this project caught up on its delay; otherwise I'm assuming the burden of the delay on my own.

adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* fix typing
* remove unnecessary monkeypatching
* fix/update tests
* purge unnecessary test/functions
* fix function logic
* add in-line comments
* fix docstrings
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* parameterize test for error outputs
* add networkx to requirements.txt
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing
adamwitmer pushed a commit that referenced this pull request Mar 16, 2026
* update/fix README
* add progress bar for parquet processing
* privatize helper functions in `data_analysis`
* parametrize tests
* remove private function from `as_steps_set`
* fix/remove typing
* fix docs
adamwitmer added a commit that referenced this pull request Mar 16, 2026
* update README with more information pertaining to analysis step
* remove redundant test
* fix docs
Comment threads (2) on neat_ml/analysis/data_analysis.py (outdated)

@tylerjereddy tylerjereddy left a comment


  • the test line coverage looks "ok" for the Python modules touched in the PR; I did have some concern about robust numerical testing beyond trivial cases
  • I don't see any indication of how long it takes to run the full analysis for both OpenCV and SAM2 for a given condition. Can we iterate/repeat if we need to? Have you?
  • I also don't see an indication of whether running such a full analysis allows for reproduction of the results in the paper. I think that's one of the main priorities re: others being able to reproduce what we do, so it would be good to state that clearly here.
  • I don't see a clear indication of whether we can extract the same features for both OpenCV and SAM2. Do both design matrices (the rows and columns of parsed data) match between the two in terms of the numbers and identities of the features (columns)? That should be clearly communicated, possibly with a brief confirmation in, e.g., the README.

from tqdm.auto import tqdm
from pyarrow.lib import ArrowInvalid

__all__: Sequence[str] = [

I don't think we need this type hint--they're mostly useful for function signatures.

composition_df : pd.DataFrame
The external table with composition data.
cols_to_add : Sequence[str]
A list of columns from composition_df to add to summary_df.

Sequence rather than list?

re.IGNORECASE | re.VERBOSE,
)
match = _RE.match(fname)
if not match:

What's the real-world scenario where we have an invalid filename and we don't want to error out? In a normal processing workflow, shouldn't the relevant files all match the expected naming format? If you expect to have, e.g., OpenCV and SAM2 raw data at the same path, maybe clarify that? How did you regenerate the original results produced for the paper?

Is there not a risk that we silently ignore a problem and end up with incorrect final aggregated data?
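A strict variant along the lines of this suggestion might look as follows (the regex and function name are hypothetical illustrations, not the PR's actual `_RE` pattern):

```python
import re

# Hypothetical pattern for illustration only; the PR's real _RE differs.
_RE = re.compile(
    r"(?P<label>\w+)_offset(?P<offset>\d+)_pos(?P<position>\d+)",
    re.IGNORECASE,
)

def parse_filename_strict(fname: str) -> dict:
    """Raise instead of silently skipping, so a bad input cannot be
    dropped unnoticed and corrupt the final aggregated data."""
    match = _RE.match(fname)
    if not match:
        raise ValueError(
            f"Filename {fname!r} does not match the expected naming convention"
        )
    return match.groupdict()
```

Failing fast like this trades robustness to stray files for an explicit guarantee that every file contributing to the aggregates was parsed.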

Loads a parquet file and converts it to the standard blob schema.

BubbleSAM parquet files store 'area' and 'bbox'. This function computes the
'center' and 'radius' to make it compatible with downstream analysis.

It would be good to clarify here if the OpenCV input data is already formatted appropriately, so that this function is not useful for OpenCV?
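The conversion the docstring describes might be sketched as below, assuming bbox = (x, y, width, height) and an equivalent-circle radius derived from the mask area (the PR's actual schema, column layout, and function name may differ):

```python
import math
import pandas as pd

def to_blob_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'center' and 'radius' columns from BubbleSAM's 'area'/'bbox'.

    Assumes bbox = (x, y, width, height); 'radius' is the radius of a
    circle with the same area as the mask. OpenCV outputs that already
    carry these columns would not need this conversion.
    """
    out = df.copy()
    # Center of the bounding box as a proxy for the bubble centroid.
    out["center"] = out["bbox"].apply(lambda b: (b[0] + b[2] / 2, b[1] + b[3] / 2))
    # Equivalent-circle radius: area = pi * r^2  =>  r = sqrt(area / pi).
    out["radius"] = (out["area"] / math.pi) ** 0.5
    return out
```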

Parameters
----------
parquet_path : Path
The path to the `parquet.gzip` file.

Isn't method missing from your parameter declarations in the docstring here? If this function only does something useful for SAM2 and just passes the dataframe through for OpenCV, that seems useful to clarify up front.

wf.stage_analyze_features(ds, paths={})
assert "Analysis input_dir" in caplog.text

def test_stage_analyze_features_raises_when_composition_csv_missing(

Nothing is being "raised" here (no errors); correct the nomenclature.

},
}
wf.stage_analyze_features(ds, paths={})
assert "Composition CSV" in caplog.text

Many of these tests are copy/paste-like and could probably be condensed; it is a ton of reading for the reviewer, and some look like mistaken duplicates, as I note elsewhere.

assert "Composition CSV" in caplog.text


def test_stage_analyze_features_raises_when_no_detection_outputs_opencv(

Nothing is being "raised" here; correct the nomenclature.

"graph_param": 1
},
}
wf.stage_analyze_features(ds, paths={})

I suspect you could condense the tests more than you have; very copy-pastey.

caplog.set_level(logging.WARNING)
ds = {"id": "AN1", "method": "OpenCV", "time_label": "T01", "analysis": {}}
wf.stage_analyze_features(ds, paths={})
assert "No analysis input_dir provided" in caplog.text

What's going on here? This test looks identical to test_stage_analyze_features_raises_when_input_dir_unavailable(), and pretty much identical to test_stage_analyze_features_errors_when_input_dir_unavailable().

This drains enormous amounts of review time trying to sort out why these tests exist, even after reading the code a few times.


Noting here that I need to do a better job of double checking for unnecessary duplication before presenting PR's for review to avoid adding excessive burden to the reviewer.

@tylerjereddy

@adamwitmer I'll ask that you respond to each individual review comment in this PR and other remaining PRs, given the delays, to discourage rushed work and encourage attention to detail, which seems to hold the reviews up a lot.
