Merge MZMinetoMSstatsFormat converter into development by swaraj-neu · Pull Request #132 · Vitek-Lab/MSstatsConvert

swaraj-neu · 2026-05-19T00:24:09Z

Summary

Adds MZMinetoMSstatsFormat, a new converter for MZMine metabolomics output, mirroring the structure and conventions of DIANNtoMSstatsFormat. This is Phase 1 of a cross-package task to add metabolomics support to the MSstats family; Phase 2 (the BIO = "Metabolomics" value in MSstatsShiny) will be a separate PR in Vitek-Lab/MSstatsShiny that depends on this one.

What's added

R/clean_MZMine.R: internal .cleanRawMZMine() that handles the wide-to-long pivot, column normalization, and the optional compound-name join.
R/converters_MZMinetoMSstatsFormat.R — public converter. Signature mirrors DIANNtoMSstatsFormat with metabolomics-appropriate defaults (remove_shared_peptides = FALSE, no decoy/oxidation filters, IsotopeLabelType = "Light").
R/MSstatsConvert_core_functions.R — MSstatsMZMineFiles class registration + MSstatsClean method dispatch. Dispatch is name-based in this package, so no other registration site needs editing.
Fixtures under inst/tinytest/raw_data/MZMine/: a wide-format quant CSV (6 features × 4 samples), spectral-library annotations CSV, and run-annotation CSV.
Tests at inst/tinytest/test_converters_MZMinetoMSstatsFormat.R: 38 assertions covering happy path, IsotopeLabelType, NA charge/fragment columns, highest-score-wins annotation policy, mz_rt fallback for unannotated features, and removeProtein_with1Feature behavior.
Vignette: new "Metabolomics with MZMine" section in vignettes/msstats_data_format.Rmd.

Column mapping

MSstats column	Metabolomics meaning	Source
`ProteinName`	Compound name	From `mzmine_annotations` (highest-scoring hit per feature), or mz_rt fallback
`PeptideSequence`	Feature ID	`row ID` from quant file
`PrecursorCharge`	—	`NA`
`FragmentIon`	—	`NA`
`ProductCharge`	—	`NA`
`IsotopeLabelType`	—	`"Light"` (matches DIANN's non-SILAC convention)
`Run`	Sample run	Per-sample column name, with `Peak area` stripped
`Intensity`	Peak area	Melted from per-sample columns

Design choices worth noting

Optional mzmine_annotations parameter. Most features in real MZMine datasets have no spectral-library annotation (in the example dataset I tested with, 69 of 2,569 features were annotated — about 3%). To make output biologically usable, MZMinetoMSstatsFormat accepts an optional mzmine_annotations data frame that joins compound names by highest score per feature. When NULL, every feature falls back to a mz_rt string (paste0(round(mz, 4), "_", round(rt, 2))). This is the simple spectral-library join only - not the full multi-source SIRIUS/MS2Query/CANOPUS consensus workflow some labs use.

Verification

tinytest::test_all("."): 732/732 pass, including the 38 new MZMine assertions.
R CMD check on MSstatsConvert: clean. One new NOTE (MZMinetoMSstatsFormat: no visible binding for global variable 'Intensity'); same pattern as DIANN/Skyline/Spectronaut.
Cross-package check: R CMD check on ../MSstats shows no MZMine-related findings (pre-existing WARNINGs/NOTEs only). I wasn't able to run cross-checks on MSstatsTMT, MSstatsPTM, or MSstatsLiP because they're not cloned in my immediate workspace - flagging in case you want to verify those before merge.

Heads-up for review

Column-name standardization affects user-facing Run values. .standardizeColnames strips spaces and dots from column names, so the per-sample column "sampleA.mzML Peak area" becomes sampleAmzMLPeakarea internally and the resulting Run value (after the Peakarea suffix strip) is sampleAmzML. This means a user's annotation file must use Run = sampleAmzML, not sampleA.mzML or sampleA. This is consistent with how every MSstatsConvert converter
behaves — but MZMine's column names are uglier than the others, so the transformation is more visible. Worth deciding whether docs should call this out more loudly, the converter should normalize the user's Run values internally, or it's fine as-is.

Closes Phase 1 of the metabolomics integration. Phase 2 (MSstatsShiny BIO = "Metabolomics") will follow in a separate PR.

Motivation and context — short summary

This PR adds first-class support for untargeted metabolomics MZMine exports to MSstatsConvert by introducing a converter and cleaning pipeline that mirrors DIANN conventions but uses metabolomics-appropriate defaults (e.g., IsotopeLabelType = "Light", NA charge/fragment fields). It pivots MZMine wide-format " Peak area" tables to MSstats long format, optionally joins spectral-library compound annotations (highest-score wins), and falls back to an mz_rt identifier for unannotated features.

Detailed changes

New public converter
- Added MZMinetoMSstatsFormat(input, annotation = NULL, mzmine_annotations = NULL, removeProtein_with1Feature = FALSE, ...) — R/converters_MZMinetoMSstatsFormat.R (exported via NAMESPACE).
Cleaning internals
- Added .cleanRawMZMine(msstats_object, mzmine_annotations = NULL) — R/clean_MZMine.R
  - Validates required columns (rowID, rowmz, rowretentiontime)
  - Detects and melts peak-area columns matching " Peak area"
  - Builds mz_rt fallback identifier (rounded mz / rt) for unannotated features
  - Optionally joins mzmine_annotations; enforces numeric score and selects highest-scoring annotation per id
  - Standardizes run names by stripping " Peak area" (internal column-name normalization)
  - Sets PeptideSequence = feature row id, ProteinName = compound_name (or mz_rt fallback), PrecursorCharge/FragmentIon/ProductCharge = NA, IsotopeLabelType = "Light", Intensity = peak area
  - Logs via .logSuccess("MZMine","clean")
S4 integration
- Added MSstatsMZMineFiles S4 class and setMethod("MSstatsClean", signature = "MSstatsMZMineFiles", .cleanRawMZMine) — R/MSstatsConvert_core_functions.R
Tests and fixtures
- Added fixtures: inst/tinytest/raw_data/MZMine/{mzmine_input.csv, mzmine_annotations.csv, annotation.csv}
- Added inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (38 assertions covering core behaviors)
Documentation and vignette
- Added man/MZMinetoMSstatsFormat.Rd; man/dot-cleanRawMZMine.Rd; updated man/MSstatsClean.Rd and man/MSstatsInputFiles.Rd
- Extended vignette vignettes/msstats_data_format.Rmd with "Metabolomics with MZMine" section and runnable examples
Package metadata and ignores
- Updated DESCRIPTION Collate to include clean_MZMine.R and converters_MZMinetoMSstatsFormat.R
- Exported MZMinetoMSstatsFormat in NAMESPACE
- Updated .Rbuildignore and .gitignore to ignore .positai, doc, Meta (and similar) entries

Unit tests added/modified

inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (new)
- 38 assertions exercising:
  - Output schema (11 expected MSstats columns) and row counts (24 rows for test fixture: 6 features × 4 runs)
  - Defaults: IsotopeLabelType == "Light"; PrecursorCharge/FragmentIon/ProductCharge == NA; Fraction == 1
  - Annotation behavior: highest-score-wins selection for duplicate annotations; unmatched features use mz_rt fallback
  - mz_rt fallback construction and usage when mzmine_annotations absent
  - Zero-intensity values converted to NA
  - Propagation of sample metadata (Condition, BioReplicate) from annotation/run CSV
  - removeProtein_with1Feature = TRUE behavior (filters to proteins with ≥2 features; expected 8 rows in fixture)
- Tests run in tinytest suite (author reports 732/732 tinytests passing including these)

Coding guidelines / notable issues

Documentation / roxygen2 compatibility: A pre-existing roxygen2 docstring problem surfaced during this work — an @inheritParams reference targets a function whose roxygen block does not produce a discoverable topic, causing roxygen2 v7.3.3 to error (roxygen2 v8 tolerates it). The author avoided re-running document() to keep scope limited and recommends a separate housekeeping PR to fix the ProteinProspector docstring (adjust @inheritParams or ensure the referenced topic is documentable).
Column-name normalization caveat: internal standardization removes spaces/dots from column names (affects Run values). Users must ensure run/sample names in external annotation files match the standardized Run names (the vignette documents this caveat).
No other coding-guideline violations detected; new S4 class/method and naming conventions follow existing package patterns.

Brings metabolomics into the MSstats family by adding an MZMine converter that mirrors the structure of DIANNtoMSstatsFormat. Phase 1 of a two-phase task; Phase 2 (MSstatsShiny BIO=Metabolomics) will be a separate PR.

coderabbitai · 2026-05-19T00:24:22Z

Warning

Rate limit exceeded

@swaraj-neu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 26 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c717d945-fea0-4112-9249-9fdda2c7685e

📥 Commits

Reviewing files that changed from the base of the PR and between 7ec49d1 and 17a9c57.

📒 Files selected for processing (1)

R/clean_MZMine.R

📝 Walkthrough

Walkthrough

This PR adds MZMine metabolomics data conversion to MSstatsConvert. It introduces an S4 class (MSstatsMZMineFiles), an internal cleaning function (.cleanRawMZMine), a public converter (MZMinetoMSstatsFormat), comprehensive tests, and documentation. The converter transforms wide-format MZMine feature tables into MSstats-compatible long format with optional annotation-based protein identification and mz/rt fallback naming.

Changes

MZMine Metabolomics Converter

Layer / File(s)	Summary
Package build configuration `.Rbuildignore`, `.gitignore`, `DESCRIPTION`, `NAMESPACE`	Build ignore patterns updated for `.positai`, `.claude`, `doc`, and `Meta` directories. `DESCRIPTION` Collate order extended to include new `clean_MZMine.R` and `converters_MZMinetoMSstatsFormat.R` files. `NAMESPACE` exports `MZMinetoMSstatsFormat`.
S4 class and method infrastructure `R/MSstatsConvert_core_functions.R`	New `MSstatsMZMineFiles` S4 class added as subtype of `MSstatsInputFiles`. S4 method for `MSstatsClean` generic registered for `MSstatsMZMineFiles`, binding it to `.cleanRawMZMine` implementation.
MZMine raw data cleaning `R/clean_MZMine.R`	`.cleanRawMZMine()` transforms MZMine wide format to MSstats long format: loads data, validates metadata columns (`rowID`, `rowmz`, `rowretentiontime`), detects sample Peakarea columns, builds mz/rt fallback identifiers, optionally merges spectral-library annotations (highest score wins), derives protein/peptide/run/charge fields, melts to long rows, and logs success.
Public converter API `R/converters_MZMinetoMSstatsFormat.R`	`MZMinetoMSstatsFormat()` wires logging, `MSstatsImport`, `MSstatsClean` (with `mzmine_annotations`), `MSstatsMakeAnnotation`, isotope-label handling, `MSstatsPreprocess` with configurable summarization and optional single-feature protein removal, converts zeros to `NA`, runs `MSstatsBalancedDesign(fill_incomplete=TRUE)`, and returns MSstats-ready table.
Test suite `inst/tinytest/test_converters_MZMinetoMSstatsFormat.R`	Validates output schema (11 columns), row counts (24 rows = 6 features × 4 runs), metabolomics defaults (`IsotopeLabelType="Light"`, `NA` charge/ion fields, `Fraction=1`), annotation merge rules (highest score per feature), mz_rt fallback for unannotated features, zero→NA conversion, condition/BioReplicate propagation, and `removeProtein_with1Feature=TRUE` filtering behavior.
Documentation `man/MSstatsClean.Rd`, `man/MSstatsInputFiles.Rd`, `man/MZMinetoMSstatsFormat.Rd`, `man/dot-cleanRawMZMine.Rd`, `vignettes/msstats_data_format.Rmd`	Roxygen-generated documentation for `MSstatsClean` MZMine method, `MSstatsMZMineFiles` class, public `MZMinetoMSstatsFormat()` function, internal `.cleanRawMZMine()` function, and vignette section with runnable conversion examples demonstrating annotation-based and fallback protein naming.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

mstaniak

Poem

🐰 I hopped through peaks and rt to find,
Annotations sorted, best-scored assigned,
When libraries miss, mz_rt will guide,
From wide Peakarea to long-format stride,
Now MSstats greets each metabolite with pride.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the main change: adding the MZMinetoMSstatsFormat converter into the development branch.
Description check	✅ Passed	The description comprehensively covers motivation, detailed changes, testing, and includes verification results; it deviates from the template structure but provides all essential information and more.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch MSstatsConvert/work/20260516_mzmine_converter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

swaraj-neu · 2026-05-19T00:26:29Z

Pre-existing roxygen2 issue surfaced during this work.

When re-documenting the package with roxygen2 7.3.3, the build fails with "In topic 'MSstatsClean': @inherits failed to find topic '.cleanRawProteinProspector'."

This is unrelated to MZMine. Somewhere in the package an @inheritParams .cleanRawProteinProspector tag points at a function whose roxygen block doesn't produce a discoverable topic. roxygen2 v8 tolerates this silently; v7.3.3 treats it as an error. To keep this PR scoped to MZMine, I worked around it by hand-editing DESCRIPTION rather than re-running document(). Worth a separate housekeeping PR to fix the ProteinProspector docstring.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (1)

98-100: ⚡ Quick win

Strengthen fallback coverage assertion to detect missing mz_rt IDs.

On Lines 98-100, all(ProteinName %in% expected_mz_rt) only checks “no unexpected values”; it does not guarantee every expected fallback appears. Use set equality on unique values to avoid false positives.

Proposed test assertion hardening

 expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67",
                    "489.334_7.89", "555.447_9.1", "123.056_1.45")
-expect_true(all(as.character(output_nolib_dt$ProteinName) %in% expected_mz_rt))
+expect_equal(
+  sort(unique(as.character(output_nolib_dt$ProteinName))),
+  sort(expected_mz_rt)
+)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R` around lines 98 - 100,
The test currently only asserts that there are no unexpected ProteinName values
using expected_mz_rt and output_nolib_dt$ProteinName, which can miss missing
expected mz_rt IDs; change the assertion to compare the sets (or sorted uniques)
so it verifies every expected fallback is present and no extras — e.g., replace
the all(... %in% ...) check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-56: The code sorts annotations using setorder(ann, id, -score)
but assumes score is numeric; coerce ann$score to numeric before ordering to
avoid mis-ranking when score is character/factor. Update the logic around the
ann data.table (before setorder) to convert score (e.g., ann[, score :=
as.numeric(score)]) and handle potential NAs (optionally warn if coercion
produces NA) so setorder and the subsequent unique(ann, by = "id") produce
correct ann_top results.

In `@R/converters_MZMinetoMSstatsFormat.R`:
- Around line 8-10: Update the parameter docs for annotation to state that the
`Run` column must match the standardized sample/run names produced by the
converter (i.e., after the function's column-name cleaning which strips the
trailing " Peak area" and normalizes spaces/dots), because the code (look for
the converter function that accepts `annotation` and the internal name-cleaning
logic that normalizes sample column names) uses those cleaned names when
joining; tell users to either provide `Run` values in that standardized form or
to run the same name-cleaning routine on their annotation before passing it in.
- Around line 54-55: The wrapper call to MSstatsImport in
converters_MZMinetoMSstatsFormat.R ignores the variadic arguments (...)
advertised for passing to data.table::fread; update the call where input is
assigned (the MSstatsImport invocation) to forward ... into MSstatsImport so
caller-supplied fread options are preserved, i.e., include ... in the
MSstatsImport argument list when calling MSstatsImport(list(input = input),
"MSstats", "MZMine") so the fread passthrough works as documented.

---

Nitpick comments:
In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R`:
- Around line 98-100: The test currently only asserts that there are no
unexpected ProteinName values using expected_mz_rt and
output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change
the assertion to compare the sets (or sorted uniques) so it verifies every
expected fallback is present and no extras — e.g., replace the all(... %in% ...)
check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60c2c010-0294-4a60-b8ea-1991a39f0d27

📥 Commits

Reviewing files that changed from the base of the PR and between b9564f2 and 589ebb5.

⛔ Files ignored due to path filters (3)

inst/tinytest/raw_data/MZMine/annotation.csv is excluded by !**/*.csv
inst/tinytest/raw_data/MZMine/mzmine_annotations.csv is excluded by !**/*.csv
inst/tinytest/raw_data/MZMine/mzmine_input.csv is excluded by !**/*.csv

📒 Files selected for processing (13)

.Rbuildignore
.gitignore
DESCRIPTION
NAMESPACE
R/MSstatsConvert_core_functions.R
R/clean_MZMine.R
R/converters_MZMinetoMSstatsFormat.R
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
man/MSstatsClean.Rd
man/MSstatsInputFiles.Rd
man/MZMinetoMSstatsFormat.Rd
man/dot-cleanRawMZMine.Rd
vignettes/msstats_data_format.Rmd

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-57: The score column is being coerced with as.numeric directly
(ann[, score := suppressWarnings(as.numeric(score))]) which will return factor
internal codes if score is a factor; change the coercion to go through character
first (i.e., suppressWarnings(as.numeric(as.character(score)))) so factor levels
are converted correctly, keep the anyNA(ann$score) check and the existing stop
message unchanged to validate numeric coercion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95a37b9b-01a2-4648-b3c5-3aa94dcfecbf

📥 Commits

Reviewing files that changed from the base of the PR and between 589ebb5 and 7ec49d1.

📒 Files selected for processing (4)

R/clean_MZMine.R
R/converters_MZMinetoMSstatsFormat.R
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
man/MZMinetoMSstatsFormat.Rd

✅ Files skipped from review due to trivial changes (1)

man/MZMinetoMSstatsFormat.Rd

swaraj-neu · 2026-05-21T16:13:10Z

@tonywu1999

End-to-end test:

MZMinetoMSstatsFormat against the biologist's master_annotation_table.csv

Same input file mzmine-summer25-microbial-results_iimn_gnps_quant.csv.
Same library annotations mzmine-summer25-microbial-results_annotations.csv.
Compared our converter's output to the master_annotation_table the notebook produces.

PART 1: ROW KEY COMPARISON

Their (feature, run) keys: 59087
Our (feature, run) keys: 59087
In both: 59087
Only in theirs: 0
Only in ours: 0

-> Every (feature, run) pair matches. No rows dropped, no extra rows.

PART 2: INTENSITY COMPARISON

status N
match 53491
both_NA 5596

Zero value mismatches.

-> All peak-area values agree. The 5,596 both-NA rows are where the
biologist's file has 0 and ours has NA (we convert zero -> NA on
purpose, per MSstats convention).

PART 3: COMPOUND NAME COMPARISON

Their annotated features (any source): 1796
sirius: 1679
library: 69
ms2query: 48

Their library-annotated features: 69
Our library-annotated features: 69

In both: 69
Only in theirs (we missed naming): 0
Only in ours (their score < 0.7 cutoff): 0

Name mismatches on the 69 shared library-annotated features: 0
-> Perfect agreement on every library-named feature.

Note on the 1,679 features named by SIRIUS in their pipeline:
those show up as mz_rt fallback strings in our output (e.g.
"455.282_0.65"), because Phase 1 only implements the library
annotation source, not SIRIUS / MS2Query / CANOPUS.

tonywu1999 · 2026-05-23T16:45:57Z

+2,245.1290,3.45,5000,4800,5200,4900
+3,367.2010,5.67,800,0,750,820
+4,489.3340,7.89,2000,2100,1900,2050
+5,555.4470,9.10,100,0,0,0


In the msstats-ai repo, it would be helpful to start compiling questions to ask a developer when they're creating converters (and integrate it as part of an existing skill / a new skill for msstatsConvert development). One such question to ask:

Can we assume zeros mean a value is missing due to low abundance? If so, we can safely convert them to NA.

tonywu1999 · 2026-05-23T16:47:44Z

+feature4_proteins = unique(output_dt[PeptideSequence == "4", ProteinName])
+expect_equal(as.character(feature4_proteins), "489.334_7.89")
+feature5_proteins = unique(output_dt[PeptideSequence == "5", ProteinName])
+expect_equal(as.character(feature5_proteins), "555.447_9.1")


Another question for msstats-ai repo:

Should we filter out features that don't have a clear corresponding protein/metabolite? Generally the preference is yes, we should filter them out (because it messes with multiple hypothesis testing)

tonywu1999 · 2026-05-23T16:50:55Z

@@ -0,0 +1,7 @@
+row ID,row m/z,row retention time,sampleA.mzML Peak area,sampleB.mzML Peak area,sampleC.mzML Peak area,sampleD.mzML Peak area


I think it was great you created a really small version of the user input that is simple. Makes the unit tests really straightforward and easy to follow (noticed other unit tests are harder to follow since there's so many rows in the input)

Maybe in msstats-ai, mention to generate a subset of the test dataset we're given for testing simplicity. And the test dataset should have things like missing values, etc.

tonywu1999 · 2026-05-23T16:55:36Z

+output_nolib = MZMinetoMSstatsFormat(input, annotation = annot,
+                                     mzmine_annotations = NULL,
+                                     use_log_file = FALSE)
+output_nolib_dt = data.table::as.data.table(output_nolib)
+
+# Every ProteinName is the mz_rt fallback string
+expect_equal(ncol(output_nolib), 11)
+expect_equal(nrow(output_nolib), 24)
+expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67",
+                   "489.334_7.89", "555.447_9.1", "123.056_1.45")
+expect_equal(
+    sort(unique(as.character(output_nolib_dt$ProteinName))),
+    sort(expected_mz_rt)
+)
+# Compound names from the library must not leak in
+expect_false(any(as.character(output_nolib_dt$ProteinName) %in%
+                 c("Caffeine", "GlucoseHigh", "GlucoseLow", "Lactate")))


I think you can safely get rid of these unit tests for now and make the mzmine annotation file be mandatory. I don't see this scenario happening / would rather not support it initially.

Reading your E2E testing

Their annotated features (any source): 1796
sirius: 1679
library: 69
ms2query: 48

Their library-annotated features: 69
Our library-annotated features: 69

In both: 69
Only in theirs (we missed naming): 0
Only in ours (their score < 0.7 cutoff): 0

Name mismatches on the 69 shared library-annotated features: 0
-> Perfect agreement on every library-named feature.

Note on the 1,679 features named by SIRIUS in their pipeline:
those show up as mz_rt fallback strings in our output (e.g.
"455.282_0.65"), because Phase 1 only implements the library
annotation source, not SIRIUS / MS2Query / CANOPUS.

Could you elaborate what you mean here? Are you saying that for features identified by SIRIUS, there's no compound_name? And then that's why you fell back to mz_rt? Because this may affect any of my comments related to the mz_rt fallback.

@tonywu1999 Okay, let me clarify what's happening with those features.
The library annotations file (mzmine-summer25-microbial-results_annotations.csv) only contains entries for features the spectral library could match i.e. 69 unique feature IDs out of 2,569 total. SIRIUS-named features aren't in that file at all; SIRIUS writes its results to a separate structure_identifications.tsv.

Our converter only reads the library annotations file (the mzmine_annotations argument). So when SIRIUS named a feature in their pipeline, we have no annotation row for that feature at all. Not a low-scoring one, not a missing compound_name, just no row. That's why those features fall back to mz_rt in our output.

To pick up SIRIUS / MS2Query / CANOPUS names, the user would have to either (a) run the full notebook to produce the master_annotation_table.csv and feed that in as mzmine_annotations (after renaming consensus_name → compound_name and consensus_score → score), or (b) we extend the converter to accept multiple annotation sources, which is what Phase 1 deliberately scoped out.

tonywu1999 · 2026-05-23T16:59:30Z

+# Without a library: ProteinName is an mz_rt fallback string
+mzmine_no_lib = MZMinetoMSstatsFormat(
+  mzmine_input,
+  annotation = mzmine_annotation,
+  mzmine_annotations = NULL,
+  use_log_file = FALSE
+)
+head(mzmine_no_lib)
+```


You can remove this

tonywu1999 · 2026-05-23T17:17:38Z

+            stop("mzmine_annotations is missing required column(s): ",
+                 paste(missing_ann, collapse = ", "), ".")
+        }
+        ann[, score := suppressWarnings(as.numeric(as.character(score)))]


What kind of warnings show up? Why as.character and then as.numeric when you can just do as.numeric?

tonywu1999 · 2026-05-23T17:18:16Z

+        }
+        ann[, score := suppressWarnings(as.numeric(as.character(score)))]
+        if (anyNA(ann$score)) {
+            stop("mzmine_annotations$score must be numeric (or coercible to numeric).")


This error message may be confusing to a user. I'd say the score column must be numeric values in the mzmine annotation file

tonywu1999 · 2026-05-23T17:19:52Z

+                             round(mz_input[[rt_col]], 2))
+
+    if (!is.null(mzmine_annotations)) {
+        ann <- data.table::as.data.table(mzmine_annotations)


Can you use a variable name that is more descriptive, e.g. feature to compound mapping?

tonywu1999 · 2026-05-23T17:24:59Z

+        mz_input,
+        id.vars = c("ProteinName", "PeptideSequence"),
+        measure.vars = peak_area_cols,
+        variable.name = "sample_col",


Could this simply be named to "Run"?

tonywu1999 · 2026-05-23T17:36:13Z

+        compound <- mz_rt_fallback
+    }
+
+    mz_input[, ProteinName := compound]


Are we assuming mz_input is sorted here by ID? Would it be better to do a join?

e.g.

mz_input[ ann, ProteinName := i.compound_name, on = c(id_col = "id"), mult = "first" ]

Add MZMinetoMSstatsFormat converter

589ebb5

Brings metabolomics into the MSstats family by adding an MZMine converter that mirrors the structure of DIANNtoMSstatsFormat. Phase 1 of a two-phase task; Phase 2 (MSstatsShiny BIO=Metabolomics) will be a separate PR.

swaraj-neu requested a review from tonywu1999 May 19, 2026 00:24

swaraj-neu self-assigned this May 19, 2026

swaraj-neu added the enhancement New feature or request label May 19, 2026

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread R/clean_MZMine.R

Comment thread R/converters_MZMinetoMSstatsFormat.R Outdated

Comment thread R/converters_MZMinetoMSstatsFormat.R Outdated

Address CodeRabbit review feedback

7ec49d1

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread R/clean_MZMine.R Outdated

Coerce score via character to handle factor inputs

17a9c57

tonywu1999 reviewed May 23, 2026

View reviewed changes

		@@ -0,0 +1,7 @@
		row ID,row m/z,row retention time,sampleA.mzML Peak area,sampleB.mzML Peak area,sampleC.mzML Peak area,sampleD.mzML Peak area

Conversation

swaraj-neu commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's added

Column mapping

Design choices worth noting

Verification

Heads-up for review

Motivation and context — short summary

Detailed changes

Unit tests added/modified

Coding guidelines / notable issues

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

swaraj-neu commented May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

swaraj-neu commented May 21, 2026

End-to-end test:

PART 1: ROW KEY COMPARISON

PART 2: INTENSITY COMPARISON

PART 3: COMPOUND NAME COMPARISON

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tonywu1999 May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swaraj-neu May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

swaraj-neu commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

tonywu1999 May 23, 2026 •

edited

Loading

swaraj-neu May 25, 2026 •

edited

Loading