Skip to content

Merge MZMinetoMSstatsFormat converter into development#132

Open
swaraj-neu wants to merge 3 commits into
develfrom
MSstatsConvert/work/20260516_mzmine_converter
Open

Merge MZMinetoMSstatsFormat converter into development#132
swaraj-neu wants to merge 3 commits into
develfrom
MSstatsConvert/work/20260516_mzmine_converter

Conversation

@swaraj-neu
Copy link
Copy Markdown

@swaraj-neu swaraj-neu commented May 19, 2026

Summary

Adds MZMinetoMSstatsFormat, a new converter for MZMine metabolomics output, mirroring the structure and conventions of DIANNtoMSstatsFormat. This is Phase 1 of a cross-package task to add metabolomics support to the MSstats family; Phase 2 (the BIO = "Metabolomics" value in MSstatsShiny) will be a separate PR in Vitek-Lab/MSstatsShiny that depends on this one.

What's added

  • R/clean_MZMine.R: internal .cleanRawMZMine() that handles the wide-to-long pivot, column normalization, and the optional compound-name join.
  • R/converters_MZMinetoMSstatsFormat.R — public converter. Signature mirrors DIANNtoMSstatsFormat with metabolomics-appropriate defaults (remove_shared_peptides = FALSE, no decoy/oxidation filters, IsotopeLabelType = "Light").
  • R/MSstatsConvert_core_functions.RMSstatsMZMineFiles class registration + MSstatsClean method dispatch. Dispatch is name-based in this package, so no other registration site needs editing.
  • Fixtures under inst/tinytest/raw_data/MZMine/: a wide-format quant CSV (6 features × 4 samples), spectral-library annotations CSV, and run-annotation CSV.
  • Tests at inst/tinytest/test_converters_MZMinetoMSstatsFormat.R: 38 assertions covering happy path, IsotopeLabelType, NA charge/fragment columns, highest-score-wins annotation policy, mz_rt fallback for unannotated features, and removeProtein_with1Feature behavior.
  • Vignette: new "Metabolomics with MZMine" section in vignettes/msstats_data_format.Rmd.

Column mapping

MSstats column Metabolomics meaning Source
ProteinName Compound name From mzmine_annotations (highest-scoring hit per feature), or mz_rt fallback
PeptideSequence Feature ID row ID from quant file
PrecursorCharge NA
FragmentIon NA
ProductCharge NA
IsotopeLabelType "Light" (matches DIANN's non-SILAC convention)
Run Sample run Per-sample column name, with Peak area stripped
Intensity Peak area Melted from per-sample columns

Design choices worth noting

Optional mzmine_annotations parameter. Most features in real MZMine datasets have no spectral-library annotation (in the example dataset I tested with, 69 of 2,569 features were annotated — about 3%). To make output biologically usable, MZMinetoMSstatsFormat accepts an optional mzmine_annotations data frame that joins compound names by highest score per feature. When NULL, every feature falls back to a mz_rt string (paste0(round(mz, 4), "_", round(rt, 2))). This is the simple spectral-library join only - not the full multi-source SIRIUS/MS2Query/CANOPUS consensus workflow some labs use.

Verification

  • tinytest::test_all("."): 732/732 pass, including the 38 new MZMine assertions.
  • R CMD check on MSstatsConvert: clean. One new NOTE (MZMinetoMSstatsFormat: no visible binding for global variable 'Intensity'); same pattern as DIANN/Skyline/Spectronaut.
  • Cross-package check: R CMD check on ../MSstats shows no MZMine-related findings (pre-existing WARNINGs/NOTEs only). I wasn't able to run cross-checks on MSstatsTMT, MSstatsPTM, or MSstatsLiP because they're not cloned in my immediate workspace - flagging in case you want to verify those before merge.

Heads-up for review

Column-name standardization affects user-facing Run values. .standardizeColnames strips spaces and dots from column names, so the per-sample column "sampleA.mzML Peak area" becomes sampleAmzMLPeakarea internally and the resulting Run value (after the Peakarea suffix strip) is sampleAmzML. This means a user's annotation file must use Run = sampleAmzML, not sampleA.mzML or sampleA. This is consistent with how every MSstatsConvert converter
behaves — but MZMine's column names are uglier than the others, so the transformation is more visible. Worth deciding whether docs should call this out more loudly, the converter should normalize the user's Run values internally, or it's fine as-is.

Closes Phase 1 of the metabolomics integration. Phase 2 (MSstatsShiny BIO = "Metabolomics") will follow in a separate PR.

Motivation and context — short summary

This PR adds first-class support for untargeted metabolomics MZMine exports to MSstatsConvert by introducing a converter and cleaning pipeline that mirrors DIANN conventions but uses metabolomics-appropriate defaults (e.g., IsotopeLabelType = "Light", NA charge/fragment fields). It pivots MZMine wide-format " Peak area" tables to MSstats long format, optionally joins spectral-library compound annotations (highest-score wins), and falls back to an mz_rt identifier for unannotated features.

Detailed changes

  • New public converter
    • Added MZMinetoMSstatsFormat(input, annotation = NULL, mzmine_annotations = NULL, removeProtein_with1Feature = FALSE, ...) — R/converters_MZMinetoMSstatsFormat.R (exported via NAMESPACE).
  • Cleaning internals
    • Added .cleanRawMZMine(msstats_object, mzmine_annotations = NULL) — R/clean_MZMine.R
      • Validates required columns (rowID, rowmz, rowretentiontime)
      • Detects and melts peak-area columns matching " Peak area"
      • Builds mz_rt fallback identifier (rounded mz / rt) for unannotated features
      • Optionally joins mzmine_annotations; enforces numeric score and selects highest-scoring annotation per id
      • Standardizes run names by stripping " Peak area" (internal column-name normalization)
      • Sets PeptideSequence = feature row id, ProteinName = compound_name (or mz_rt fallback), PrecursorCharge/FragmentIon/ProductCharge = NA, IsotopeLabelType = "Light", Intensity = peak area
      • Logs via .logSuccess("MZMine","clean")
  • S4 integration
    • Added MSstatsMZMineFiles S4 class and setMethod("MSstatsClean", signature = "MSstatsMZMineFiles", .cleanRawMZMine) — R/MSstatsConvert_core_functions.R
  • Tests and fixtures
    • Added fixtures: inst/tinytest/raw_data/MZMine/{mzmine_input.csv, mzmine_annotations.csv, annotation.csv}
    • Added inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (38 assertions covering core behaviors)
  • Documentation and vignette
    • Added man/MZMinetoMSstatsFormat.Rd; man/dot-cleanRawMZMine.Rd; updated man/MSstatsClean.Rd and man/MSstatsInputFiles.Rd
    • Extended vignette vignettes/msstats_data_format.Rmd with "Metabolomics with MZMine" section and runnable examples
  • Package metadata and ignores
    • Updated DESCRIPTION Collate to include clean_MZMine.R and converters_MZMinetoMSstatsFormat.R
    • Exported MZMinetoMSstatsFormat in NAMESPACE
    • Updated .Rbuildignore and .gitignore to ignore .positai, doc, Meta (and similar) entries

Unit tests added/modified

  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (new)
    • 38 assertions exercising:
      • Output schema (11 expected MSstats columns) and row counts (24 rows for test fixture: 6 features × 4 runs)
      • Defaults: IsotopeLabelType == "Light"; PrecursorCharge/FragmentIon/ProductCharge == NA; Fraction == 1
      • Annotation behavior: highest-score-wins selection for duplicate annotations; unmatched features use mz_rt fallback
      • mz_rt fallback construction and usage when mzmine_annotations absent
      • Zero-intensity values converted to NA
      • Propagation of sample metadata (Condition, BioReplicate) from annotation/run CSV
      • removeProtein_with1Feature = TRUE behavior (filters to proteins with ≥2 features; expected 8 rows in fixture)
    • Tests run in tinytest suite (author reports 732/732 tinytests passing including these)

Coding guidelines / notable issues

  • Documentation / roxygen2 compatibility: A pre-existing roxygen2 docstring problem surfaced during this work — an @inheritParams reference targets a function whose roxygen block does not produce a discoverable topic, causing roxygen2 v7.3.3 to error (roxygen2 v8 tolerates it). The author avoided re-running document() to keep scope limited and recommends a separate housekeeping PR to fix the ProteinProspector docstring (adjust @inheritParams or ensure the referenced topic is documentable).
  • Column-name normalization caveat: internal standardization removes spaces/dots from column names (affects Run values). Users must ensure run/sample names in external annotation files match the standardized Run names (the vignette documents this caveat).
  • No other coding-guideline violations detected; new S4 class/method and naming conventions follow existing package patterns.

Review Change Stack

Brings metabolomics into the MSstats family by adding an MZMine converter that mirrors the structure of DIANNtoMSstatsFormat. Phase 1 of a two-phase task; Phase 2 (MSstatsShiny BIO=Metabolomics) will be a separate PR.
@swaraj-neu swaraj-neu requested a review from tonywu1999 May 19, 2026 00:24
@swaraj-neu swaraj-neu self-assigned this May 19, 2026
@swaraj-neu swaraj-neu added the enhancement New feature or request label May 19, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Warning

Rate limit exceeded

@swaraj-neu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 26 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c717d945-fea0-4112-9249-9fdda2c7685e

📥 Commits

Reviewing files that changed from the base of the PR and between 7ec49d1 and 17a9c57.

📒 Files selected for processing (1)
  • R/clean_MZMine.R
📝 Walkthrough

Walkthrough

This PR adds MZMine metabolomics data conversion to MSstatsConvert. It introduces an S4 class (MSstatsMZMineFiles), an internal cleaning function (.cleanRawMZMine), a public converter (MZMinetoMSstatsFormat), comprehensive tests, and documentation. The converter transforms wide-format MZMine feature tables into MSstats-compatible long format with optional annotation-based protein identification and mz/rt fallback naming.

Changes

MZMine Metabolomics Converter

Layer / File(s) Summary
Package build configuration
.Rbuildignore, .gitignore, DESCRIPTION, NAMESPACE
Build ignore patterns updated for .positai, .claude, doc, and Meta directories. DESCRIPTION Collate order extended to include new clean_MZMine.R and converters_MZMinetoMSstatsFormat.R files. NAMESPACE exports MZMinetoMSstatsFormat.
S4 class and method infrastructure
R/MSstatsConvert_core_functions.R
New MSstatsMZMineFiles S4 class added as subtype of MSstatsInputFiles. S4 method for MSstatsClean generic registered for MSstatsMZMineFiles, binding it to .cleanRawMZMine implementation.
MZMine raw data cleaning
R/clean_MZMine.R
.cleanRawMZMine() transforms MZMine wide format to MSstats long format: loads data, validates metadata columns (rowID, rowmz, rowretentiontime), detects sample Peakarea columns, builds mz/rt fallback identifiers, optionally merges spectral-library annotations (highest score wins), derives protein/peptide/run/charge fields, melts to long rows, and logs success.
Public converter API
R/converters_MZMinetoMSstatsFormat.R
MZMinetoMSstatsFormat() wires logging, MSstatsImport, MSstatsClean (with mzmine_annotations), MSstatsMakeAnnotation, isotope-label handling, MSstatsPreprocess with configurable summarization and optional single-feature protein removal, converts zeros to NA, runs MSstatsBalancedDesign(fill_incomplete=TRUE), and returns MSstats-ready table.
Test suite
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
Validates output schema (11 columns), row counts (24 rows = 6 features × 4 runs), metabolomics defaults (IsotopeLabelType="Light", NA charge/ion fields, Fraction=1), annotation merge rules (highest score per feature), mz_rt fallback for unannotated features, zero→NA conversion, condition/BioReplicate propagation, and removeProtein_with1Feature=TRUE filtering behavior.
Documentation
man/MSstatsClean.Rd, man/MSstatsInputFiles.Rd, man/MZMinetoMSstatsFormat.Rd, man/dot-cleanRawMZMine.Rd, vignettes/msstats_data_format.Rmd
Roxygen-generated documentation for MSstatsClean MZMine method, MSstatsMZMineFiles class, public MZMinetoMSstatsFormat() function, internal .cleanRawMZMine() function, and vignette section with runnable conversion examples demonstrating annotation-based and fallback protein naming.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mstaniak

Poem

🐰 I hopped through peaks and rt to find,
Annotations sorted, best-scored assigned,
When libraries miss, mz_rt will guide,
From wide Peakarea to long-format stride,
Now MSstats greets each metabolite with pride.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: adding the MZMinetoMSstatsFormat converter into the development branch.
Description check ✅ Passed The description comprehensively covers motivation, detailed changes, testing, and includes verification results; it deviates from the template structure but provides all essential information and more.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch MSstatsConvert/work/20260516_mzmine_converter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@swaraj-neu
Copy link
Copy Markdown
Author

Pre-existing roxygen2 issue surfaced during this work.

When re-documenting the package with roxygen2 7.3.3, the build fails with "In topic 'MSstatsClean': @inherits failed to find topic '.cleanRawProteinProspector'."

This is unrelated to MZMine. Somewhere in the package an @inheritParams .cleanRawProteinProspector tag points at a function whose roxygen block doesn't produce a discoverable topic. roxygen2 v8 tolerates this silently; v7.3.3 treats it as an error. To keep this PR scoped to MZMine, I worked around it by hand-editing DESCRIPTION rather than re-running document(). Worth a separate housekeeping PR to fix the ProteinProspector docstring.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (1)

98-100: ⚡ Quick win

Strengthen fallback coverage assertion to detect missing mz_rt IDs.

On Lines 98-100, all(ProteinName %in% expected_mz_rt) only checks “no unexpected values”; it does not guarantee every expected fallback appears. Use set equality on unique values to avoid false positives.

Proposed test assertion hardening
 expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67",
                    "489.334_7.89", "555.447_9.1", "123.056_1.45")
-expect_true(all(as.character(output_nolib_dt$ProteinName) %in% expected_mz_rt))
+expect_equal(
+  sort(unique(as.character(output_nolib_dt$ProteinName))),
+  sort(expected_mz_rt)
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R` around lines 98 - 100,
The test currently only asserts that there are no unexpected ProteinName values
using expected_mz_rt and output_nolib_dt$ProteinName, which can miss missing
expected mz_rt IDs; change the assertion to compare the sets (or sorted uniques)
so it verifies every expected fallback is present and no extras — e.g., replace
the all(... %in% ...) check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-56: The code sorts annotations using setorder(ann, id, -score)
but assumes score is numeric; coerce ann$score to numeric before ordering to
avoid mis-ranking when score is character/factor. Update the logic around the
ann data.table (before setorder) to convert score (e.g., ann[, score :=
as.numeric(score)]) and handle potential NAs (optionally warn if coercion
produces NA) so setorder and the subsequent unique(ann, by = "id") produce
correct ann_top results.

In `@R/converters_MZMinetoMSstatsFormat.R`:
- Around line 8-10: Update the parameter docs for annotation to state that the
`Run` column must match the standardized sample/run names produced by the
converter (i.e., after the function's column-name cleaning which strips the
trailing " Peak area" and normalizes spaces/dots), because the code (look for
the converter function that accepts `annotation` and the internal name-cleaning
logic that normalizes sample column names) uses those cleaned names when
joining; tell users to either provide `Run` values in that standardized form or
to run the same name-cleaning routine on their annotation before passing it in.
- Around line 54-55: The wrapper call to MSstatsImport in
converters_MZMinetoMSstatsFormat.R ignores the variadic arguments (...)
advertised for passing to data.table::fread; update the call where input is
assigned (the MSstatsImport invocation) to forward ... into MSstatsImport so
caller-supplied fread options are preserved, i.e., include ... in the
MSstatsImport argument list when calling MSstatsImport(list(input = input),
"MSstats", "MZMine") so the fread passthrough works as documented.

---

Nitpick comments:
In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R`:
- Around line 98-100: The test currently only asserts that there are no
unexpected ProteinName values using expected_mz_rt and
output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change
the assertion to compare the sets (or sorted uniques) so it verifies every
expected fallback is present and no extras — e.g., replace the all(... %in% ...)
check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60c2c010-0294-4a60-b8ea-1991a39f0d27

📥 Commits

Reviewing files that changed from the base of the PR and between b9564f2 and 589ebb5.

⛔ Files ignored due to path filters (3)
  • inst/tinytest/raw_data/MZMine/annotation.csv is excluded by !**/*.csv
  • inst/tinytest/raw_data/MZMine/mzmine_annotations.csv is excluded by !**/*.csv
  • inst/tinytest/raw_data/MZMine/mzmine_input.csv is excluded by !**/*.csv
📒 Files selected for processing (13)
  • .Rbuildignore
  • .gitignore
  • DESCRIPTION
  • NAMESPACE
  • R/MSstatsConvert_core_functions.R
  • R/clean_MZMine.R
  • R/converters_MZMinetoMSstatsFormat.R
  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
  • man/MSstatsClean.Rd
  • man/MSstatsInputFiles.Rd
  • man/MZMinetoMSstatsFormat.Rd
  • man/dot-cleanRawMZMine.Rd
  • vignettes/msstats_data_format.Rmd

Comment thread R/clean_MZMine.R
Comment thread R/converters_MZMinetoMSstatsFormat.R Outdated
Comment thread R/converters_MZMinetoMSstatsFormat.R Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-57: The score column is being coerced with as.numeric directly
(ann[, score := suppressWarnings(as.numeric(score))]) which will return factor
internal codes if score is a factor; change the coercion to go through character
first (i.e., suppressWarnings(as.numeric(as.character(score)))) so factor levels
are converted correctly, keep the anyNA(ann$score) check and the existing stop
message unchanged to validate numeric coercion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95a37b9b-01a2-4648-b3c5-3aa94dcfecbf

📥 Commits

Reviewing files that changed from the base of the PR and between 589ebb5 and 7ec49d1.

📒 Files selected for processing (4)
  • R/clean_MZMine.R
  • R/converters_MZMinetoMSstatsFormat.R
  • inst/tinytest/test_converters_MZMinetoMSstatsFormat.R
  • man/MZMinetoMSstatsFormat.Rd
✅ Files skipped from review due to trivial changes (1)
  • man/MZMinetoMSstatsFormat.Rd

Comment thread R/clean_MZMine.R Outdated
@swaraj-neu
Copy link
Copy Markdown
Author

@tonywu1999

End-to-end test:

MZMinetoMSstatsFormat against the biologist's master_annotation_table.csv

Same input file mzmine-summer25-microbial-results_iimn_gnps_quant.csv.
Same library annotations mzmine-summer25-microbial-results_annotations.csv.
Compared our converter's output to the master_annotation_table the notebook produces.

PART 1: ROW KEY COMPARISON

Their (feature, run) keys: 59087
Our (feature, run) keys: 59087
In both: 59087
Only in theirs: 0
Only in ours: 0

-> Every (feature, run) pair matches. No rows dropped, no extra rows.

PART 2: INTENSITY COMPARISON

status N
match 53491
both_NA 5596

Zero value mismatches.

-> All peak-area values agree. The 5,596 both-NA rows are where the
biologist's file has 0 and ours has NA (we convert zero -> NA on
purpose, per MSstats convention).

PART 3: COMPOUND NAME COMPARISON

Their annotated features (any source): 1796
sirius: 1679
library: 69
ms2query: 48

Their library-annotated features: 69
Our library-annotated features: 69

In both: 69
Only in theirs (we missed naming): 0
Only in ours (their score < 0.7 cutoff): 0

Name mismatches on the 69 shared library-annotated features: 0
-> Perfect agreement on every library-named feature.

Note on the 1,679 features named by SIRIUS in their pipeline:
those show up as mz_rt fallback strings in our output (e.g.
"455.282_0.65"), because Phase 1 only implements the library
annotation source, not SIRIUS / MS2Query / CANOPUS.

2,245.1290,3.45,5000,4800,5200,4900
3,367.2010,5.67,800,0,750,820
4,489.3340,7.89,2000,2100,1900,2050
5,555.4470,9.10,100,0,0,0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the msstats-ai repo, it would be helpful to start compiling questions to ask a developer when they're creating converters (and integrate it as part of an existing skill / a new skill for msstatsConvert development). One such question to ask:

  • Can we assume zeros mean a value is missing due to low abundance? If so, we can safely convert them to NA.

Comment on lines +57 to +60
feature4_proteins = unique(output_dt[PeptideSequence == "4", ProteinName])
expect_equal(as.character(feature4_proteins), "489.334_7.89")
feature5_proteins = unique(output_dt[PeptideSequence == "5", ProteinName])
expect_equal(as.character(feature5_proteins), "555.447_9.1")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question for msstats-ai repo:

  • Should we filter out features that don't have a clear corresponding protein/metabolite? Generally the preference is yes, we should filter them out (because it messes with multiple hypothesis testing)

@@ -0,0 +1,7 @@
row ID,row m/z,row retention time,sampleA.mzML Peak area,sampleB.mzML Peak area,sampleC.mzML Peak area,sampleD.mzML Peak area
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was great you created a really small version of the user input that is simple. Makes the unit tests really straightforward and easy to follow (noticed other unit tests are harder to follow since there's so many rows in the input)

Maybe in msstats-ai, mention to generate a subset of the test dataset we're given for testing simplicity. And the test dataset should have things like missing values, etc.

Comment on lines +90 to +106
output_nolib = MZMinetoMSstatsFormat(input, annotation = annot,
mzmine_annotations = NULL,
use_log_file = FALSE)
output_nolib_dt = data.table::as.data.table(output_nolib)

# Every ProteinName is the mz_rt fallback string
expect_equal(ncol(output_nolib), 11)
expect_equal(nrow(output_nolib), 24)
expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67",
"489.334_7.89", "555.447_9.1", "123.056_1.45")
expect_equal(
sort(unique(as.character(output_nolib_dt$ProteinName))),
sort(expected_mz_rt)
)
# Compound names from the library must not leak in
expect_false(any(as.character(output_nolib_dt$ProteinName) %in%
c("Caffeine", "GlucoseHigh", "GlucoseLow", "Lactate")))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can safely get rid of these unit tests for now and make the mzmine annotation file be mandatory. I don't see this scenario happening / would rather not support it initially.

Copy link
Copy Markdown
Contributor

@tonywu1999 tonywu1999 May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading your E2E testing

Their annotated features (any source): 1796
sirius: 1679
library: 69
ms2query: 48

Their library-annotated features: 69
Our library-annotated features: 69

In both: 69
Only in theirs (we missed naming): 0
Only in ours (their score < 0.7 cutoff): 0

Name mismatches on the 69 shared library-annotated features: 0
-> Perfect agreement on every library-named feature.

Note on the 1,679 features named by SIRIUS in their pipeline:
those show up as mz_rt fallback strings in our output (e.g.
"455.282_0.65"), because Phase 1 only implements the library
annotation source, not SIRIUS / MS2Query / CANOPUS.

Could you elaborate what you mean here? Are you saying that for features identified by SIRIUS, there's no compound_name? And then that's why you fell back to mz_rt? Because this may affect any of my comments related to the mz_rt fallback.

Copy link
Copy Markdown
Author

@swaraj-neu swaraj-neu May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tonywu1999 Okay, let me clarify what's happening with those features.
The library annotations file (mzmine-summer25-microbial-results_annotations.csv) only contains entries for features the spectral library could match i.e. 69 unique feature IDs out of 2,569 total. SIRIUS-named features aren't in that file at all; SIRIUS writes its results to a separate structure_identifications.tsv.

Our converter only reads the library annotations file (the mzmine_annotations argument). So when SIRIUS named a feature in their pipeline, we have no annotation row for that feature at all. Not a low-scoring one, not a missing compound_name, just no row. That's why those features fall back to mz_rt in our output.

To pick up SIRIUS / MS2Query / CANOPUS names, the user would have to either (a) run the full notebook to produce the master_annotation_table.csv and feed that in as mzmine_annotations (after renaming consensus_name → compound_name and consensus_score → score), or (b) we extend the converter to accept multiple annotation sources, which is what Phase 1 deliberately scoped out.

Comment on lines +368 to +376
# Without a library: ProteinName is an mz_rt fallback string
mzmine_no_lib = MZMinetoMSstatsFormat(
mzmine_input,
annotation = mzmine_annotation,
mzmine_annotations = NULL,
use_log_file = FALSE
)
head(mzmine_no_lib)
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this

Comment thread R/clean_MZMine.R
stop("mzmine_annotations is missing required column(s): ",
paste(missing_ann, collapse = ", "), ".")
}
ann[, score := suppressWarnings(as.numeric(as.character(score)))]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of warnings show up? Why as.character and then as.numeric when you can just do as.numeric?

Comment thread R/clean_MZMine.R
}
ann[, score := suppressWarnings(as.numeric(as.character(score)))]
if (anyNA(ann$score)) {
stop("mzmine_annotations$score must be numeric (or coercible to numeric).")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message may be confusing to a user. I'd say the score column must be numeric values in the mzmine annotation file

Comment thread R/clean_MZMine.R
round(mz_input[[rt_col]], 2))

if (!is.null(mzmine_annotations)) {
ann <- data.table::as.data.table(mzmine_annotations)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use a variable name that is more descriptive, e.g. feature to compound mapping?

Comment thread R/clean_MZMine.R
mz_input,
id.vars = c("ProteinName", "PeptideSequence"),
measure.vars = peak_area_cols,
variable.name = "sample_col",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this simply be named to "Run"?

Comment thread R/clean_MZMine.R
compound <- mz_rt_fallback
}

mz_input[, ProteinName := compound]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we assuming mz_input is sorted here by ID? Would it be better to do a join?

e.g.

mz_input[
  ann,
  ProteinName := i.compound_name,
  on = c(id_col = "id"),
  mult = "first"
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants