M sstats big/work/20260513 arrow chunking followups by Rudhik1904 · Pull Request #17 · Vitek-Lab/MSstatsBig

Rudhik1904 · 2026-05-18T21:53:34Z

Motivation and Context

reduceBigSpectronaut() previously streamed Spectronaut CSV exports through readr::read_delim_chunked, which holds a string-interning pool that grows across chunks and pushed peak memory well above one batch's working set. This branch ports the reader to Arrow (which releases per-batch state) and follows up with the three remaining items from TODO-arrow_chunking_followups.md:

The Arrow block_size was Arrow's 256 KiB default — too small for wide Spectronaut rows, causing Invalid: straddling object straddles two block boundaries errors and excessive parser overhead.
End users had no documented escape hatch when the straddling error did fire on extra-wide exports.
cleanSpectronautChunk() ran every batch through ~13 sequential dplyr verbs on a data.frame, producing repeated transient column allocations and fragmenting R's allocator. The rest of the MSstatsConvert family is data.table-native.

Solution: replace readr with Arrow's CsvReadOptions / Scanner / ToRecordBatchReader, raise the default block_size to 16 MiB and expose it as a user parameter, document the straddling-object workaround at the parameter, and rewrite cleanSpectronautChunk() in pure data.table.

Changes

Arrow reader replaces readr::read_delim_chunked in reduceBigSpectronaut(). Uses arrow::open_dataset() + Scanner + ToRecordBatchReader to stream record batches; preserves the existing comma/tab/semicolon delimiter switch via CsvParseOptions$delimiter. Per-batch progress logging every 1,000 batches.
New block_size parameter on both reduceBigSpectronaut() and the user-facing bigSpectronauttoMSstatsFormat(). Default 16L * 1024L * 1024L (16 MiB) replaces Arrow's 256 KiB default. Coerced to integer and validated (length 1, non-NA, positive).
Roxygen @param block_size documents the exact error string (Invalid: straddling object straddles two block boundaries) and the recommended override (64L * 1024L * 1024L) so users hitting the straddling error on pathological rows have a self-service fix.
cleanSpectronautChunk() rewritten in data.table. setDT(input) at entry; column selection via dt[, cols, with = FALSE]; two-step rename via setnames(..., skip_absent = TRUE) matching the MSstatsConvert family convention; in-place column updates via :=; conditional NA assignment via mask form dt[cond, Intensity := NA_real_]. Function shrank from ~88 lines to ~64.
NA-q-value semantics preserved. Q-value filters use is.na(EGQvalue) | EGQvalue >= cutoff so rows with missing q-values still get Intensity = NA, matching the previous dplyr::if_else behavior (a naive data.table translation would silently change this).
Dead code removed. The dplyr::collect(head(dplyr::select(...))) pattern at the old lines 140/144 was a no-op residue from an earlier Arrow-Table refactor and is gone.
data.table added to Imports in DESCRIPTION and imported via @importFrom data.table := .SD setDT setnames so the package is cedta()-aware. Regenerated NAMESPACE and man/bigSpectronauttoMSstatsFormat.Rd via devtools::document().

Testing

All tests run via devtools::test(): 51 PASS, 0 FAIL, 0 WARN, 0 SKIP.

New tests added in tests/testthat/test-converters.R:

reduceBigSpectronaut validates block_size: rejects negative, zero, NA, length-2 vector, and unparseable string inputs.
bigSpectronauttoMSstatsFormat plumbs block_size through: spies on reduceBigSpectronaut via mockery::stub, asserts the default forwards 16L * 1024L * 1024L and an explicit override forwards the user's value.
cleanSpectronautChunk schema smoke test: synthetic minimal Spectronaut-shaped input, asserts output column set and basic values.
cleanSpectronautChunk filter_by_excluded: rows with Excluded == "True" get Intensity = NA.
cleanSpectronautChunk filter_by_identified: rows with Identified == "False" get Intensity = NA.
cleanSpectronautChunk filter_by_qvalue (incl. NA case): rows below cutoff are kept, above cutoff become NA, and NA q-values become NA — the explicit semantic guarantee from the rewrite.
cleanSpectronautChunk drops rows where F.FrgLossType != "noloss".

Before this branch cleanSpectronautChunk had no direct test coverage (existing tests stubbed reduceBigSpectronaut out entirely).

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules

Motivation & Context

The Spectronaut CSV parsing pipeline in reduceBigSpectronaut() previously used readr::read_delim_chunked(), which accumulates string-interning state across chunks, leading to unbounded memory growth on large files. Additionally, the readr pipeline encountered "straddling object" errors where rows span block boundaries, preventing successful processing of certain Spectronaut exports. The cleanSpectronautChunk() function was implemented with dplyr pipelines, which is less memory-efficient for large-scale data processing.

Solution

Replace the readr-based chunked reader with Apache Arrow's streaming CSV reader (arrow::open_dataset() + Scanner + ToRecordBatchReader), which bounds memory to a single record batch and properly handles block boundaries. Rewrite cleanSpectronautChunk() in data.table for better performance and memory efficiency. Expose a configurable block_size parameter (default 16 MiB) to allow tuning for pathological cases (up to 64 MiB). Add progress logging every 1,000 batches during streaming.

Detailed Changes

Core Implementation (`R/clean_spectronaut.R`)

reduceBigSpectronaut() rewrite:
- Replaced readr::read_delim_chunked() with Arrow CSV streaming pipeline
- Delimiter selection via regex pattern matching on filename (csv → comma, tsv/xls → tab, else semicolon)
- Arrow pipeline: CsvParseOptions$create(delimiter = delim) + CsvConvertOptions$create() + CsvReadOptions$create(block_size = block_size)
- Column pre-filtering via needed_cols to prevent materializing unused Spectronaut columns (~35 unused columns dropped at parse time)
- Batch-by-batch iteration with reader$read_next_batch() → convert to data.frame → call cleanSpectronautChunk()
- Progress logging every 1,000 batches (and final summary if non-aligned)
- Tracks row position (pos), batch index (batch_idx), elapsed time, and processing rate (rows/sec)
- Added block_size parameter: coerced to integer, validated (length 1, non-NA, positive)
- New signature: reduceBigSpectronaut(..., anomalyModelFeatures=c(), block_size = 16L * 1024L * 1024L)
cleanSpectronautChunk() rewrite from dplyr to data.table:
- Input coerced with setDT() for in-place conversion
- Column selection via dt[, cols, with = FALSE]
- Two-step renaming: standardize all column names first, then map standardized → MSstats final names (using setnames(..., skip_absent = TRUE))
- In-place type conversions: Intensity := as.numeric(Intensity), logical conversions for Excluded and Identified (character "True" → logical TRUE)
- Filtering applied with data.table conditional updates (dt[cond, col := NA_real_])
- Filters preserve NA-q-value semantics: is.na(EGQvalue) | EGQvalue >= cutoff
- IsotopeLabelType derived from LabeledSequence or defaults to "L"
- Output columns selected before writing chunk to file
- Function reduced from ~88 to ~64 lines by eliminating intermediate pipelines and dead code
- Behavior unchanged; internal implementation optimized

Public API (`R/converters.R`)

bigSpectronauttoMSstatsFormat() updated:
- New block_size parameter added after connection = NULL
- Default: 16L * 1024L * 1024L (16 MiB)
- Parameter forwarded to reduceBigSpectronaut() call
- Roxygen-documented

Documentation Updates

man/bigSpectronauttoMSstatsFormat.Rd:
- Added block_size parameter to usage signature
- Added documentation in \arguments{} section describing block size control
- Includes recommendation to increase to 64 MiB for pathological cases (extra-wide Spectronaut exports triggering "Invalid: straddling object straddles two block boundaries")
man/dot-prefixedPath.Rd (new file):
- Documentation for internal function .prefixedPath(prefix, path)
- Describes prefix-only naming for intermediate output paths

Package Metadata

DESCRIPTION:
- Added data.table to Imports list
NAMESPACE:
- Added four new importFrom directives: importFrom(data.table, ":="), importFrom(data.table, ".SD"), importFrom(data.table, setDT), importFrom(data.table, setnames)

Unit Tests

Two new test cases added to tests/testthat/test-converters.R:

Block size validation test (test_that("reduceBigSpectronaut block_size argument validation", ...)):
- Validates that reduceBigSpectronaut() rejects invalid block_size inputs: negative values, zero, NA, vector lengths > 1, non-numeric strings
- Each invalid input triggers stopifnot() failure as expected
Block size forwarding test (test_that("bigSpectronauttoMSstatsFormat forwards block_size to reduceBigSpectronaut", ...)):
- Stubs reduceBigSpectronaut to capture the forwarded block_size argument
- Verifies default value (16L * 1024L * 1024L) when block_size not explicitly provided
- Verifies explicit override (8L * 1024L * 1024L) is correctly forwarded
- Uses temporary files and proper cleanup via on.exit() handlers

Coding Guidelines

No violations identified. The implementation follows R best practices:

Roxygen2 documentation maintained and regenerated
data.table imports explicitly declared in NAMESPACE and DESCRIPTION
Parameter validation via stopifnot() with clear assertions
Consistent naming conventions (camelCase for parameters and functions)
Proper cleanup of Arrow reader objects and data frames in loops
Temporary files in tests properly cleaned up with on.exit()
Documented Arrow error message and workaround guidance

R/clean_spectronaut.R:9-12: added block_size parameter (default 16L * 1024L * 1024L) with coerce + validation. R/clean_spectronaut.R:44: CsvReadOptions$create now uses the parameter. R/converters.R:120-125: new @param block_size roxygen with the straddling-object workaround note. R/converters.R:148-156: bigSpectronauttoMSstatsFormat gains block_size, plumbed to reduceBigSpectronaut. tests/testthat/test-converters.R:97-163: validation tests (rejects negative/zero/NA/vector/string) + plumbing tests (default forwards 16 MiB, override forwards user's value). man/bigSpectronauttoMSstatsFormat.Rd: regenerated from roxygen.

…setnames so the package is data.table-aware (cedta()). R/clean_spectronaut.R:103-187: rewrote cleanSpectronautChunk in data.table: setDT(input) at entry; subsequent operations modify in place via :=. Two-step rename (setnames for standardize, then setnames with skip_absent = TRUE to map standardized→MSstats) matches the MSstatsConvert family pattern. Conditional NA assignment uses mask form dt[cond, Intensity := NA_real_]. Q-value filters preserve dplyr::if_else NA semantics via explicit is.na(EGQvalue) | EGQvalue >= cutoff. Dropped the leftover dplyr::collect(head(dplyr::select(...))) pattern — was a no-op residue from a prior refactor. Function shrank from ~88 lines to ~64. DESCRIPTION:20: added data.table to Imports. NAMESPACE: regenerated, now imports :=, .SD, setDT, setnames from data.table. tests/testthat/test-converters.R:97-211: 5 new tests — schema smoke test, filter_by_excluded, filter_by_identified, filter_by_qvalue (covering the NA-q-value case), and FFrgLossType row drop.

coderabbitai · 2026-05-18T21:53:46Z

📝 Walkthrough

Walkthrough

This PR refactors Spectronaut CSV processing to use Apache Arrow record-batch streaming instead of readr chunking, converts chunk processing from dplyr to data.table operations, and introduces a configurable block_size parameter to control Arrow CSV reader memory allocation—exposed in the public API, documented, and validated with tests.

Changes

Spectronaut Arrow Streaming & Data.Table Refactor

Layer / File(s)	Summary
data.table dependency setup `DESCRIPTION`, `NAMESPACE`	Added `data.table` to package imports and declared required data.table symbols (`:=`, `.SD`, `setDT`, `setnames`) for chunk processing.
Arrow-based CSV streaming in reduceBigSpectronaut `R/clean_spectronaut.R`	Replaced `readr::read_delim_chunked()` with Apache Arrow dataset scanning using `ToRecordBatchReader()`, configurable `block_size`, explicit CSV parse options, and batch-by-batch iteration with progress tracking.
Data.table refactor in cleanSpectronautChunk `R/clean_spectronaut.R`	Converted dplyr pipelines to data.table: coerces to data.table, filters to present columns, applies standardized/final column renaming, converts string logicals, applies exclusion/identified/qvalue/loss-type filtering, derives `IsotopeLabelType`, and selects output columns including optional anomaly features.
Public API parameter threading for block_size `R/converters.R`	Extended `bigSpectronauttoMSstatsFormat()` with `block_size` parameter (16 MiB default), added roxygen documentation, and wired forwarding to `reduceBigSpectronaut()`.
User and internal function documentation `man/bigSpectronauttoMSstatsFormat.Rd`, `man/dot-prefixedPath.Rd`	Updated function signature and arguments documentation with `block_size` guidance for handling wide exports; added new man page for `.prefixedPath()` helper.
block_size validation and forwarding tests `tests/testthat/test-converters.R`	Added test cases validating rejection of invalid `block_size` inputs and asserting correct parameter forwarding through the API stack.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Vitek-Lab/MSstatsBig#6: Both PRs modify R/clean_spectronaut.R's reduceBigSpectronaut()/cleanSpectronautChunk() to thread and conditionally include calculateAnomalyScores/anomalyModelFeatures-driven anomaly feature columns, with this PR adding Arrow chunking (block_size) and data.table refactors on top.
Vitek-Lab/MSstatsBig#10: Both PRs refactor chunk processing to rely on the shared internal chunk-writing helper .writeChunkToFile(...) (DIANN cleaning switched to it, and Spectronaut chunk cleaning now delegates its CSV output to it).

Suggested reviewers

tonywu1999

Poem

🐰 Arrow flies swift through data streams,
data.table dreams, block_size schemes,
From dplyr's chains we break free,
Spectronaut chunks flow efficiently!
🏹✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is vague and does not clearly convey the primary changes; it uses unclear notation and lacks specific detail about what the PR accomplishes.	Clarify the title to describe the main changes, e.g., 'Replace readr with Arrow for Spectronaut chunking and rewrite cleanSpectronautChunk in data.table' or similar.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is comprehensive and well-structured, covering motivation, detailed changes, testing, and completing the checklist as required by the template.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch MSstatsBig/work/20260513_arrow_chunking_followups

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_spectronaut.R`:
- Around line 155-167: The current filter block assumes columns Excluded,
Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if
any are missing; update the logic in clean_spectronaut.R to guard each filter by
checking column presence (e.g., use "if ('Excluded' %in% names(input))" before
the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before
applying qvalue-based NA assignment, and check for FFrgLossType before
subsetting), so each conditional only runs when its target column exists and the
behavior remains unchanged otherwise.
- Around line 25-58: The computed needed_cols is never passed to Arrow, so the
CSV reader still parses all columns; update the CsvConvertOptions usage to
include the projection by calling
arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the
include_columns field on convert_opts after creation) so that convert_opts
includes needed_cols before calling arrow::open_dataset/Scanner$create;
reference the symbols needed_cols and convert_opts (and the call
arrow::CsvConvertOptions$create) and ensure this happens prior to creating
ds/reader.

In `@tests/testthat/test-converters.R`:
- Around line 106-113: The test's expect_error calls are too broad—change them
to assert the error message mentions "block_size" so they only pass when
block_size validation fails; update each expect_error(reduceBigSpectronaut(...),
...) in tests/testthat/test-converters.R to include a second argument (string or
regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid")
for the negative values and vector cases, and similarly for the
suppressed-warning call so all invalid-block_size cases are checked by message
content rather than any error.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ebfd981-78cf-43a3-b93a-fd55ccfaeccc

📥 Commits

Reviewing files that changed from the base of the PR and between a43b90b and c8c835e.

📒 Files selected for processing (7)

DESCRIPTION
NAMESPACE
R/clean_spectronaut.R
R/converters.R
man/bigSpectronauttoMSstatsFormat.Rd
man/dot-prefixedPath.Rd
tests/testthat/test-converters.R

coderabbitai · 2026-05-18T21:57:25Z

+  # Columns cleanSpectronautChunk actually consumes; Arrow's
+  # convert_options$include_columns drops everything else at parse time so
+  # we never materialize the ~35 unused columns Spectronaut exports.
+  needed_cols <- c("R.FileName", "R.Condition", "R.Replicate",
+                   "PG.ProteinAccessions", "EG.ModifiedSequence",
+                   "FG.LabeledSequence", "FG.Charge",
+                   "F.FrgIon", "F.Charge",
+                   "EG.Identified", "F.ExcludedFromQuantification",
+                   "F.FrgLossType", "PG.Qvalue", "EG.Qvalue",
+                   intensity)
+  if (calculateAnomalyScores) {
+    needed_cols <- c(needed_cols, anomalyModelFeatures)
+  }
+
+  # Arrow's CSV reader replaces readr::read_delim_chunked.  Arrow releases
+  # per-batch state as soon as a batch is consumed, so peak memory is
+  # bounded by one record batch instead of growing with the dataset (readr
+  # keeps a string-interning pool that accumulates across chunks).  The
+  # `delim` switch above already covers comma / tab / semicolon variants;
+  # Arrow's CSV reader handles all three the same way through
+  # CsvParseOptions$delimiter.
+  parse_opts   <- arrow::CsvParseOptions$create(delimiter = delim)
+  convert_opts <- arrow::CsvConvertOptions$create()
+  read_opts    <- arrow::CsvReadOptions$create(block_size = block_size)
+
+  ds <- arrow::open_dataset(
+    input_file,
+    format          = "csv",
+    parse_options   = parse_opts,
+    convert_options = convert_opts,
+    read_options    = read_opts
+  )
+
+  reader <- arrow::Scanner$create(ds)$ToRecordBatchReader()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify whether column projection is actually wired in the Arrow path. rg -n -C2 'needed_cols|CsvConvertOptions\\$create|include_columns|Scanner\\$create|open_dataset\\(' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 289

🏁 Script executed:

#!/bin/bash # First, check for needed_cols usage in the file echo "=== Searching for needed_cols usage ===" rg -n 'needed_cols' R/clean_spectronaut.R echo "" echo "=== Searching for CsvConvertOptions and include_columns ===" rg -n 'CsvConvertOptions\|include_columns' R/clean_spectronaut.R echo "" echo "=== Checking context around CsvConvertOptions create call ===" rg -n -B3 -A3 'CsvConvertOptions\$create' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 688

🏁 Script executed:

#!/bin/bash # Check what happens after reader creation echo "=== Code from line 50 onwards ===" sed -n '50,80p' R/clean_spectronaut.R echo "" echo "=== Complete function to see full context ===" wc -l R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1118

🏁 Script executed:

#!/bin/bash # Search for include_columns usage in the entire codebase echo "=== Searching for include_columns in codebase ===" rg -n 'include_columns' . echo "" echo "=== Searching for CsvConvertOptions usage patterns ===" rg -n 'CsvConvertOptions' . -A2 echo "" echo "=== Check the comment context more carefully ===" sed -n '25,50p' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1892

Wire needed_cols into Arrow's column projection.

needed_cols is computed but never applied to CsvConvertOptions, so the reader still parses all CSV columns. This defeats the intended memory/performance reduction on wide Spectronaut exports.

💡 Proposed fix

- convert_opts <- arrow::CsvConvertOptions$create() + convert_opts <- arrow::CsvConvertOptions$create( + include_columns = needed_cols + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_spectronaut.R` around lines 25 - 58, The computed needed_cols is never passed to Arrow, so the CSV reader still parses all columns; update the CsvConvertOptions usage to include the projection by calling arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the include_columns field on convert_opts after creation) so that convert_opts includes needed_cols before calling arrow::open_dataset/Scanner$create; reference the symbols needed_cols and convert_opts (and the call arrow::CsvConvertOptions$create) and ensure this happens prior to creating ds/reader.

coderabbitai · 2026-05-18T21:57:26Z

  if (filter_by_excluded) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))
-
+    input[Excluded == TRUE, Intensity := NA_real_]
  }
-
  if (filter_by_identified) {
-    input <- dplyr::mutate(
-      input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))
+    input[Identified == FALSE, Intensity := NA_real_]
  }
-
  if (filter_by_qvalue) {
-    input <- dplyr::mutate(
-      input,
-      Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
-    input <- dplyr::mutate(
-      input, 
-      Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
+    # Preserve dplyr::if_else semantics: rows with NA q-values become NA.
+    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
  }
-
-  input <- dplyr::filter(input, FFrgLossType == "noloss")
-  if (is.element("LabeledSequence", colnames(input))) {
-    input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))
-    input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))
+
+  input <- input[FFrgLossType == "noloss"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard filter steps when selected columns are absent.

After present_orig subsetting, downstream filters still assume Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType exist. Missing columns here will raise runtime errors.

🛡️ Proposed fix

- if (filter_by_excluded) { + if (filter_by_excluded && "Excluded" %in% colnames(input)) { input[Excluded == TRUE, Intensity := NA_real_] } - if (filter_by_identified) { + if (filter_by_identified && "Identified" %in% colnames(input)) { input[Identified == FALSE, Intensity := NA_real_] } if (filter_by_qvalue) { - input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_] - input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_] + if ("EGQvalue" %in% colnames(input)) { + input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_] + } + if ("PGQvalue" %in% colnames(input)) { + input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_] + } } - - input <- input[FFrgLossType == "noloss"] + if ("FFrgLossType" %in% colnames(input)) { + input <- input[FFrgLossType == "noloss"] + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (filter_by_excluded) {

input <- dplyr::mutate(

input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))

input[Excluded == TRUE, Intensity := NA_real_]

}

if (filter_by_identified) {

input <- dplyr::mutate(

input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))

input[Identified == FALSE, Intensity := NA_real_]

}

if (filter_by_qvalue) {

input <- dplyr::mutate(

input,

Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))

input <- dplyr::mutate(

input,

Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))

# Preserve dplyr::if_else semantics: rows with NA q-values become NA.

input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]

input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]

}

input <- dplyr::filter(input, FFrgLossType == "noloss")

if (is.element("LabeledSequence", colnames(input))) {

input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))

input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))

input <- input[FFrgLossType == "noloss"]

if (filter_by_excluded && "Excluded" %in% colnames(input)) {

input[Excluded == TRUE, Intensity := NA_real_]

}

if (filter_by_identified && "Identified" %in% colnames(input)) {

input[Identified == FALSE, Intensity := NA_real_]

}

if (filter_by_qvalue) {

# Preserve dplyr::if_else semantics: rows with NA q-values become NA.

if ("EGQvalue" %in% colnames(input)) {

input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]

}

if ("PGQvalue" %in% colnames(input)) {

input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]

}

}

if ("FFrgLossType" %in% colnames(input)) {

input <- input[FFrgLossType == "noloss"]

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_spectronaut.R` around lines 155 - 167, The current filter block assumes columns Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if any are missing; update the logic in clean_spectronaut.R to guard each filter by checking column presence (e.g., use "if ('Excluded' %in% names(input))" before the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before applying qvalue-based NA assignment, and check for FFrgLossType before subsetting), so each conditional only runs when its target column exists and the behavior remains unchanged otherwise.

coderabbitai · 2026-05-18T21:57:26Z

+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
+  expect_error(suppressWarnings(
+    reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
+  ))
+})


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen invalid block_size expectations to avoid false positives.

These assertions currently pass on any error. If reducer internals fail for another reason, this test can still pass even when block_size validation regresses. Constrain the expected error to include block_size in the message.

Suggested tightening

- expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L)) - expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L)) - expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_)) - expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L))) + expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L), regexp = "block_size") + expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L), regexp = "block_size") + expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_), regexp = "block_size") + expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)), regexp = "block_size") expect_error(suppressWarnings( reduceBigSpectronaut(input_file, output_file, block_size = "16MB") - )) + ), regexp = "block_size")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/testthat/test-converters.R` around lines 106 - 113, The test's expect_error calls are too broad—change them to assert the error message mentions "block_size" so they only pass when block_size validation fails; update each expect_error(reduceBigSpectronaut(...), ...) in tests/testthat/test-converters.R to include a second argument (string or regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid") for the negative values and vector cases, and similarly for the suppressed-warning call so all invalid-block_size cases are checked by message content rather than any error.

tonywu1999 · 2026-05-19T23:08:56Z

-                                 anomalyModelFeatures=c()) {
+                                 calculateAnomalyScores=FALSE,
+                                 anomalyModelFeatures=c(),
+                                 block_size = 16L * 1024L * 1024L) {


For people who may want to increase their block size to increase the speed of processing, add a recommendation on how to estimate the adequate block size to maximize speed while reducing the risk of the system crashing

Make it more clear in the MSstatsBig / MSstatsConvert documentation on which columns we actually need from users for Spectronaut

tonywu1999 and others added 9 commits May 7, 2026 14:46

filter columns for readr initially

20dc07e

use col_names parameter

5129c77

fix col_names input

08b0db2

reduce chunk size

5a03986

try arrow csv reader delimted reader

f844758

fix column selection

53f7a78

add progress tracking

a08a65b

Rudhik1904 requested a review from tonywu1999 May 18, 2026 21:53

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

tonywu1999 reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

M sstats big/work/20260513 arrow chunking followups#17

M sstats big/work/20260513 arrow chunking followups#17
Rudhik1904 wants to merge 9 commits into
develfrom
MSstatsBig/work/20260513_arrow_chunking_followups

Rudhik1904 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

tonywu1999 May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Rudhik1904 commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Changes

Testing

Checklist Before Requesting a Review

Motivation & Context

Solution

Detailed Changes

Core Implementation (R/clean_spectronaut.R)

Public API (R/converters.R)

Documentation Updates

Package Metadata

Unit Tests

Coding Guidelines

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

tonywu1999 May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Rudhik1904 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Core Implementation (`R/clean_spectronaut.R`)

Public API (`R/converters.R`)

coderabbitai Bot commented May 18, 2026 •

edited

Loading