ENH: Add scripts to analyze BubbleSAM and OpenCV detection results #4

nidhinthomas-ai wants to merge 20 commits into lanl:main
Conversation
Initial TODO items for reviewing this branch:
* update/fix README
* add progress bar for parquet processing
* privatize helper functions in `data_analysis`
* parametrize tests
* remove private function from `as_steps_set`
* fix/remove typing
* fix docs
In terms of the leftover comment from the …, I verified this by checking the outputs of real data on `test.yaml`:

```yaml
roots:
  work: test
  results: test
  model: test
inference_model: neat_ml/data/model/PEO10K_DEX10K_BubblesAM_Ph_2nd_model.joblib
datasets:
  - id: PEO8K_Sodium_Citrate_Ph_2nd
    method: BubbleSAM
    role: infer
    class: Ph
    time_label: 2nd
    composition_cols:
      - "Sodium Citrate (wt%)"
      - "PEO 8 kg/mol (wt%)"
    analysis:
      input_dir: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/BubbleSAM/Ph/2nd
      composition_csv: /scratch2/LDRD_Chicoma_Backup/Final_Results_06092025/LDRD_DR_NEAT_Dataset/PEO8K_Sodium_Citrate/PEO8K_Sodium_Citrate_Image_Composition_Phase.csv
      per_image_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_old.csv
      aggregate_csv: PEO8K_Sodium_Citrate_2nd_BubbleSAM_Analysis_Summary_Aggregate_old.csv
      group_cols:
        - Group
        - Label
        - Time
        - Class
      graph_method: knn
      graph_param: 1
```

By looking at the aggregate …
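As a side note, a minimal sketch of the kind of up-front validation the per-dataset `analysis` block could get, using the key names from the example config above (the helper name and the required-key set are hypothetical, not the module's actual API):

```python
# Hypothetical validation of a per-dataset "analysis" block. Key names are
# taken from the example config above; the real workflow's checks (e.g. in
# stage_analyze_features) may differ.
REQUIRED_ANALYSIS_KEYS = {"input_dir", "composition_csv", "graph_method", "graph_param"}

def missing_analysis_keys(dataset: dict) -> set:
    """Return the required analysis keys absent from a dataset entry."""
    analysis = dataset.get("analysis", {})
    return REQUIRED_ANALYSIS_KEYS - analysis.keys()

ds = {
    "id": "PEO8K_Sodium_Citrate_Ph_2nd",
    "method": "BubbleSAM",
    "analysis": {"graph_method": "knn", "graph_param": 1},
}
print(missing_analysis_keys(ds))  # {'input_dir', 'composition_csv'} (set order varies)
```

Failing fast on a missing key keeps a misconfigured dataset from silently producing empty outputs later in the pipeline.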
* update README with more information pertaining to analysis step
* remove redundant test
* fix docs
@tylerjereddy I have completed my initial review checklist and addressed all self-review comments, including those from the …
@adamwitmer I'll make a note to review this on Friday, March 20th. The delay is because I'm not seeing enough activity on the ASC polymer project. Once you've caught up with ~2 days of work from last week plus 2 days from next week (~32 hours of very solid effort on that project), I'll see if it looks like the projects are being balanced. I can't pick up the slack over on that project just because you want to get this project caught up on its delay; otherwise I'm assuming the burden of the delay on my own.
* remove monkeypatching for test
* add example .yaml file for analysis
* add checks/tests for parameter inputs
* fix bug in image file name string replacement logic
* parameterize test for error outputs
* add networkx to requirements.txt
* vectorize graph metric procedures
* improve test coverage
* update README.md
* fix docs/inline comments
* remove unnecessary type coercions/typing
tylerjereddy
left a comment
- The test line coverage looks "ok" for the Python modules touched in the PR; I did have some concern about robust numerical testing beyond trivial cases.
- I don't see any indication of how long it takes to run the full analysis for both OpenCV and SAM2 for a given condition. Can we iterate/repeat if we need to? Have you?
- I also don't see an indication of whether or not running such a full analysis allows for reproduction of the results in the paper. I think that's one of the main priorities re: others being able to reproduce what we do, so that would be good to clearly state here.
- I don't see a clear indication of whether we can extract the same features for both OpenCV and SAM2. Do both design matrices (the rows and columns of parsed data) match between the two in terms of the numbers and identities of the features (columns)? That should be clearly communicated, possibly with a brief confirmation in, e.g., the README.
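The design-matrix comparison asked for in the last point could be confirmed with a check along these lines (the column names below are illustrative placeholders, not the module's actual schema):

```python
# Sketch of the suggested consistency check: confirm that the per-image
# feature tables produced for OpenCV and SAM2 expose the same columns.
# Column names here are hypothetical stand-ins for the real schema.
opencv_cols = ["image", "n_bubbles", "mean_radius", "nnd_mean", "avg_degree"]
sam_cols = ["image", "n_bubbles", "nnd_mean", "mean_radius", "avg_degree"]

missing_in_sam = set(opencv_cols) - set(sam_cols)
missing_in_opencv = set(sam_cols) - set(opencv_cols)
same_design_matrix_columns = not missing_in_sam and not missing_in_opencv
print(same_design_matrix_columns)  # True: identical feature sets, ordering aside
```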
```python
from tqdm.auto import tqdm
from pyarrow.lib import ArrowInvalid


__all__: Sequence[str] = [
```
I don't think we need this type hint; they're mostly useful for function signatures.
```
composition_df : pd.DataFrame
    The external table with composition data.
cols_to_add : Sequence[str]
    A list of columns from composition_df to add to summary_df.
```
Sequence rather than list?
```python
    re.IGNORECASE | re.VERBOSE,
)
match = _RE.match(fname)
if not match:
```
What's the real-world scenario where we have an invalid filename and we don't want to error out? In a normal processing workflow, shouldn't the relevant files all match the expected naming format? If you expect to have, e.g., OpenCV and SAM2 raw data at the same path, maybe clarify that? How did you regenerate the original results produced for the paper?
Isn't there a risk that we silently ignore a problem and end up with incorrect final aggregated data?
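For illustration, a minimal sketch of the "fail loudly" alternative being suggested: raise on filenames that don't match the expected pattern rather than skipping them. The regex and helper name here are hypothetical simplifications, not the module's actual `_RE` pattern.

```python
# Fail loudly on unexpected filenames instead of silently skipping them,
# so a bad name can't drop an image from the final aggregated metrics.
# The pattern below is a simplified, illustrative stand-in.
import re

_RE = re.compile(r"(?P<group>\w+)_(?P<label>\w+)_bubble_data\.parquet\.gzip$")

def parse_fname(fname: str) -> dict:
    match = _RE.match(fname)
    if not match:
        raise ValueError(f"Unexpected detection filename: {fname!r}")
    return match.groupdict()

print(parse_fname("A1_T01_bubble_data.parquet.gzip"))  # {'group': 'A1', 'label': 'T01'}
```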
```
Loads a parquet file and converts it to the standard blob schema.

BubbleSAM parquet files store 'area' and 'bbox'. This function computes
the 'center' and 'radius' to make it compatible with downstream analysis.
```
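For context, the conversion the docstring describes could look roughly like this; the bbox layout `(x, y, w, h)` and the helper name are assumptions for illustration, not the module's actual implementation:

```python
# Sketch: derive 'center' from the bounding box midpoint and an
# equivalent-circle 'radius' from the mask area. Assumes bbox is
# (x, y, w, h) in pixels; the real schema may differ.
import math

def blob_center_and_radius(bbox, area):
    x, y, w, h = bbox
    center = (x + w / 2.0, y + h / 2.0)  # midpoint of the bounding box
    radius = math.sqrt(area / math.pi)   # radius of a circle of equal area
    return center, radius

center, radius = blob_center_and_radius((10, 20, 4, 6), area=math.pi * 9)
print(center, radius)  # (12.0, 23.0) 3.0
```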
It would be good to clarify here whether the OpenCV input data is already formatted appropriately, i.e., that this function is not needed for OpenCV.
```
Parameters
----------
parquet_path : Path
    The path to the `parquet.gzip` file.
```
Isn't `method` missing from your parameter declarations in the docstring here? If this function only does something useful for SAM2 and just passes the dataframe through for OpenCV, that seems useful to clarify up front.
```python
    wf.stage_analyze_features(ds, paths={})
    assert "Analysis input_dir" in caplog.text


def test_stage_analyze_features_raises_when_composition_csv_missing(
```
Nothing is being "raised" here (there are no errors); please correct the nomenclature.
```python
        },
    }
    wf.stage_analyze_features(ds, paths={})
    assert "Composition CSV" in caplog.text
```
Many of these tests are copy/paste-like and could likely be condensed. It is a ton of reading for the reviewer, and some look like copy-paste mistakes/duplicates, as I note elsewhere.
```python
    assert "Composition CSV" in caplog.text


def test_stage_analyze_features_raises_when_no_detection_outputs_opencv(
```
Nothing is being "raised" here, correct nomenclature.
```python
            "graph_param": 1
        },
    }
    wf.stage_analyze_features(ds, paths={})
```
I suspect you could condense the tests more than you have; they're very copy-paste heavy.
```python
    caplog.set_level(logging.WARNING)
    ds = {"id": "AN1", "method": "OpenCV", "time_label": "T01", "analysis": {}}
    wf.stage_analyze_features(ds, paths={})
    assert "No analysis input_dir provided" in caplog.text
```
What's going on here? This test looks identical to test_stage_analyze_features_raises_when_input_dir_unavailable(), and pretty much identical to test_stage_analyze_features_errors_when_input_dir_unavailable().
Trying to sort out why these exist, even after reading the code a few times, drains an enormous amount of review time.
Noting here that I need to do a better job of double checking for unnecessary duplication before presenting PR's for review to avoid adding excessive burden to the reviewer.
@adamwitmer Given the delays, I'll ask that you respond to each individual review comment in this PR and the other remaining PRs, to discourage rushed work and encourage attention to detail; the lack of it seems to hold the reviews up a lot.
This merge request introduces a new library module, `neat_ml/analysis`, which analyzes the bubble detection results from either BubbleSAM or OpenCV. It also adds an accompanying test suite.

**What's inside**

The `data_analysis.py` script is the core of this feature. It processes raw bubble data (centroids, areas, etc.) stored in parquet files and produces comprehensive quantitative metrics. The workflow can be summarized as:

* **Input**: detection outputs from OpenCV (`*_bubble_data.parquet.gzip`) or BubbleSAM (`*_masks_filtered.parquet.gzip`).
* **Nearest-neighbor distance (NND)**: mean and median distance between the closest bubbles.
* **Spatial graph construction**, with three methods:
  * `radius`: connects all bubbles within a given search radius.
  * `knn`: connects each bubble to its k-nearest neighbors.
  * `delaunay`: creates a graph based on Delaunay triangulation, connecting "natural" neighbors.

  It then calculates network metrics like average node degree, clustering coefficient, and properties of the largest connected component (LCC).
* **Outputs**:
  * `per_image_metrics.csv`: a detailed report with all calculated metrics for every single image.
  * `aggregated_metrics.csv`: a summary report where metrics are aggregated (min, max, median, std) across different experimental groups.
* Supports both `OpenCV` and `BubbleSAM`.

The analysis typically uses the `radius` graph method with the parameter 30. Users can also change the parameter for their custom dataset. This parameter defines the maximum distance (in pixels) for two bubbles to be considered connected in the spatial graph.
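A minimal sketch of the `radius` graph method described above, using networkx (which this PR adds to requirements.txt). The bubble centers, the radius of 30 pixels, and the function name are illustrative; the module's actual (vectorized) implementation differs:

```python
# Radius graph sketch: connect bubbles whose centers lie within
# `radius` pixels of each other, then compute simple network metrics.
# Toy centers and the helper name are illustrative assumptions.
import math
from itertools import combinations

import networkx as nx

def radius_graph(centers, radius=30.0):
    g = nx.Graph()
    g.add_nodes_from(range(len(centers)))
    for i, j in combinations(range(len(centers)), 2):
        if math.dist(centers[i], centers[j]) <= radius:
            g.add_edge(i, j)
    return g

centers = [(0, 0), (10, 0), (25, 0), (100, 100)]  # three clustered, one isolated
g = radius_graph(centers, radius=30.0)

avg_degree = sum(dict(g.degree()).values()) / g.number_of_nodes()
clustering = nx.average_clustering(g)
lcc_size = max(len(c) for c in nx.connected_components(g))
print(avg_degree, clustering, lcc_size)  # 1.5 0.75 3
```

The same graph object then feeds the other reported metrics (average node degree, clustering coefficient, LCC properties), so swapping in `knn` or `delaunay` construction only changes how the edges are built.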