Skip to content

M sstats big/work/20260513 arrow chunking followups#17

Open
Rudhik1904 wants to merge 9 commits into
develfrom
MSstatsBig/work/20260513_arrow_chunking_followups
Open

M sstats big/work/20260513 arrow chunking followups#17
Rudhik1904 wants to merge 9 commits into
develfrom
MSstatsBig/work/20260513_arrow_chunking_followups

Conversation

@Rudhik1904
Copy link
Copy Markdown
Contributor

@Rudhik1904 Rudhik1904 commented May 18, 2026

Motivation and Context

reduceBigSpectronaut() previously streamed Spectronaut CSV exports through readr::read_delim_chunked, which holds a string-interning pool that grows across chunks and pushed peak memory well above one batch's working set. This branch ports the reader to Arrow (which releases per-batch state) and follows up with the three remaining items from TODO-arrow_chunking_followups.md:

  1. The Arrow block_size was Arrow's 256 KiB default — too small for wide Spectronaut rows, causing Invalid: straddling object straddles two block boundaries errors and excessive parser overhead.
  2. End users had no documented escape hatch when the straddling error did fire on extra-wide exports.
  3. cleanSpectronautChunk() ran every batch through ~13 sequential dplyr verbs on a data.frame, producing repeated transient column allocations and fragmenting R's allocator. The rest of the MSstatsConvert family is data.table-native.

Solution: replace readr with Arrow's CsvReadOptions / Scanner / ToRecordBatchReader, raise the default block_size to 16 MiB and expose it as a user parameter, document the straddling-object workaround at the parameter, and rewrite cleanSpectronautChunk() in pure data.table.

Changes

  • Arrow reader replaces readr::read_delim_chunked in reduceBigSpectronaut(). Uses arrow::open_dataset() + Scanner + ToRecordBatchReader to stream record batches; preserves the existing comma/tab/semicolon delimiter switch via CsvParseOptions$delimiter. Per-batch progress logging every 1,000 batches.
  • New block_size parameter on both reduceBigSpectronaut() and the user-facing bigSpectronauttoMSstatsFormat(). Default 16L * 1024L * 1024L (16 MiB) replaces Arrow's 256 KiB default. Coerced to integer and validated (length 1, non-NA, positive).
  • Roxygen @param block_size documents the exact error string (Invalid: straddling object straddles two block boundaries) and the recommended override (64L * 1024L * 1024L) so users hitting the straddling error on pathological rows have a self-service fix.
  • cleanSpectronautChunk() rewritten in data.table. setDT(input) at entry; column selection via dt[, cols, with = FALSE]; two-step rename via setnames(..., skip_absent = TRUE) matching the MSstatsConvert family convention; in-place column updates via :=; conditional NA assignment via mask form dt[cond, Intensity := NA_real_]. Function shrank from ~88 lines to ~64.
  • NA-q-value semantics preserved. Q-value filters use is.na(EGQvalue) | EGQvalue >= cutoff so rows with missing q-values still get Intensity = NA, matching the previous dplyr::if_else behavior (a naive data.table translation would silently change this).
  • Dead code removed. The dplyr::collect(head(dplyr::select(...))) pattern at the old lines 140/144 was a no-op residue from an earlier Arrow-Table refactor and is gone.
  • data.table added to Imports in DESCRIPTION and imported via @importFrom data.table := .SD setDT setnames so the package is cedta()-aware. Regenerated NAMESPACE and man/bigSpectronauttoMSstatsFormat.Rd via devtools::document().

Testing

All tests run via devtools::test(): 51 PASS, 0 FAIL, 0 WARN, 0 SKIP.

New tests added in tests/testthat/test-converters.R:

  • reduceBigSpectronaut validates block_size: rejects negative, zero, NA, length-2 vector, and unparseable string inputs.
  • bigSpectronauttoMSstatsFormat plumbs block_size through: spies on reduceBigSpectronaut via mockery::stub, asserts the default forwards 16L * 1024L * 1024L and an explicit override forwards the user's value.
  • cleanSpectronautChunk schema smoke test: synthetic minimal Spectronaut-shaped input, asserts output column set and basic values.
  • cleanSpectronautChunk filter_by_excluded: rows with Excluded == "True" get Intensity = NA.
  • cleanSpectronautChunk filter_by_identified: rows with Identified == "False" get Intensity = NA.
  • cleanSpectronautChunk filter_by_qvalue (incl. NA case): rows below cutoff are kept, above cutoff become NA, and NA q-values become NA — the explicit semantic guarantee from the rewrite.
  • cleanSpectronautChunk drops rows where F.FrgLossType != "noloss".

Before this branch cleanSpectronautChunk had no direct test coverage (existing tests stubbed reduceBigSpectronaut out entirely).

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Motivation & Context

The Spectronaut CSV parsing pipeline in reduceBigSpectronaut() previously used readr::read_delim_chunked(), which accumulates string-interning state across chunks, leading to unbounded memory growth on large files. Additionally, the readr pipeline encountered "straddling object" errors where rows span block boundaries, preventing successful processing of certain Spectronaut exports. The cleanSpectronautChunk() function was implemented with dplyr pipelines, which is less memory-efficient for large-scale data processing.

Solution

Replace the readr-based chunked reader with Apache Arrow's streaming CSV reader (arrow::open_dataset() + Scanner + ToRecordBatchReader), which bounds memory to a single record batch and properly handles block boundaries. Rewrite cleanSpectronautChunk() in data.table for better performance and memory efficiency. Expose a configurable block_size parameter (default 16 MiB) to allow tuning for pathological cases (up to 64 MiB). Add progress logging every 1,000 batches during streaming.

Detailed Changes

Core Implementation (R/clean_spectronaut.R)

  • reduceBigSpectronaut() rewrite:

    • Replaced readr::read_delim_chunked() with Arrow CSV streaming pipeline
    • Delimiter selection via regex pattern matching on filename (csv → comma, tsv/xls → tab, else semicolon)
    • Arrow pipeline: CsvParseOptions$create(delimiter = delim) + CsvConvertOptions$create() + CsvReadOptions$create(block_size = block_size)
    • Column pre-filtering via needed_cols to prevent materializing unused Spectronaut columns (~35 unused columns dropped at parse time)
    • Batch-by-batch iteration with reader$read_next_batch() → convert to data.frame → call cleanSpectronautChunk()
    • Progress logging every 1,000 batches (and final summary if non-aligned)
    • Tracks row position (pos), batch index (batch_idx), elapsed time, and processing rate (rows/sec)
    • Added block_size parameter: coerced to integer, validated (length 1, non-NA, positive)
    • New signature: reduceBigSpectronaut(..., anomalyModelFeatures=c(), block_size = 16L * 1024L * 1024L)
  • cleanSpectronautChunk() rewrite from dplyr to data.table:

    • Input coerced with setDT() for in-place conversion
    • Column selection via dt[, cols, with = FALSE]
    • Two-step renaming: standardize all column names first, then map standardized → MSstats final names (using setnames(..., skip_absent = TRUE))
    • In-place type conversions: Intensity := as.numeric(Intensity), logical conversions for Excluded and Identified (character "True" → logical TRUE)
    • Filtering applied with data.table conditional updates (dt[cond, col := NA_real_])
    • Filters preserve NA-q-value semantics: is.na(EGQvalue) | EGQvalue >= cutoff
    • IsotopeLabelType derived from LabeledSequence or defaults to "L"
    • Output columns selected before writing chunk to file
    • Function reduced from ~88 to ~64 lines by eliminating intermediate pipelines and dead code
    • Behavior unchanged; internal implementation optimized

Public API (R/converters.R)

  • bigSpectronauttoMSstatsFormat() updated:
    • New block_size parameter added after connection = NULL
    • Default: 16L * 1024L * 1024L (16 MiB)
    • Parameter forwarded to reduceBigSpectronaut() call
    • Roxygen-documented

Documentation Updates

  • man/bigSpectronauttoMSstatsFormat.Rd:

    • Added block_size parameter to usage signature
    • Added documentation in \arguments{} section describing block size control
    • Includes recommendation to increase to 64 MiB for pathological cases (extra-wide Spectronaut exports triggering "Invalid: straddling object straddles two block boundaries")
  • man/dot-prefixedPath.Rd (new file):

    • Documentation for internal function .prefixedPath(prefix, path)
    • Describes prefix-only naming for intermediate output paths

Package Metadata

  • DESCRIPTION:

    • Added data.table to Imports list
  • NAMESPACE:

    • Added four new importFrom directives: importFrom(data.table, ":="), importFrom(data.table, ".SD"), importFrom(data.table, setDT), importFrom(data.table, setnames)

Unit Tests

Two new test cases added to tests/testthat/test-converters.R:

  1. Block size validation test (test_that("reduceBigSpectronaut block_size argument validation", ...)):

    • Validates that reduceBigSpectronaut() rejects invalid block_size inputs: negative values, zero, NA, vector lengths > 1, non-numeric strings
    • Each invalid input triggers stopifnot() failure as expected
  2. Block size forwarding test (test_that("bigSpectronauttoMSstatsFormat forwards block_size to reduceBigSpectronaut", ...)):

    • Stubs reduceBigSpectronaut to capture the forwarded block_size argument
    • Verifies default value (16L * 1024L * 1024L) when block_size not explicitly provided
    • Verifies explicit override (8L * 1024L * 1024L) is correctly forwarded
    • Uses temporary files and proper cleanup via on.exit() handlers

Coding Guidelines

No violations identified. The implementation follows R best practices:

  • Roxygen2 documentation maintained and regenerated
  • data.table imports explicitly declared in NAMESPACE and DESCRIPTION
  • Parameter validation via stopifnot() with clear assertions
  • Consistent naming conventions (camelCase for parameters and functions)
  • Proper cleanup of Arrow reader objects and data frames in loops
  • Temporary files in tests properly cleaned up with on.exit()
  • Documented Arrow error message and workaround guidance

Review Change Stack

tonywu1999 and others added 9 commits May 7, 2026 14:46
R/clean_spectronaut.R:9-12: added block_size parameter (default 16L * 1024L * 1024L) with coerce + validation.
R/clean_spectronaut.R:44: CsvReadOptions$create now uses the parameter.
R/converters.R:120-125: new @param block_size roxygen with the straddling-object workaround note.
R/converters.R:148-156: bigSpectronauttoMSstatsFormat gains block_size, plumbed to reduceBigSpectronaut.
tests/testthat/test-converters.R:97-163: validation tests (rejects negative/zero/NA/vector/string) + plumbing tests (default forwards 16 MiB, override forwards user's value).
man/bigSpectronauttoMSstatsFormat.Rd: regenerated from roxygen.
…setnames so the package is data.table-aware (cedta()).

R/clean_spectronaut.R:103-187: rewrote cleanSpectronautChunk in data.table:
setDT(input) at entry; subsequent operations modify in place via :=.
Two-step rename (setnames for standardize, then setnames with skip_absent = TRUE to map standardized→MSstats) matches the MSstatsConvert family pattern.
Conditional NA assignment uses mask form dt[cond, Intensity := NA_real_].
Q-value filters preserve dplyr::if_else NA semantics via explicit is.na(EGQvalue) | EGQvalue >= cutoff.
Dropped the leftover dplyr::collect(head(dplyr::select(...))) pattern — was a no-op residue from a prior refactor.
Function shrank from ~88 lines to ~64.
DESCRIPTION:20: added data.table to Imports.
NAMESPACE: regenerated, now imports :=, .SD, setDT, setnames from data.table.
tests/testthat/test-converters.R:97-211: 5 new tests — schema smoke test, filter_by_excluded, filter_by_identified, filter_by_qvalue (covering the NA-q-value case), and FFrgLossType row drop.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

Walkthrough

This PR refactors Spectronaut CSV processing to use Apache Arrow record-batch streaming instead of readr chunking, converts chunk processing from dplyr to data.table operations, and introduces a configurable block_size parameter to control Arrow CSV reader memory allocation—exposed in the public API, documented, and validated with tests.

Changes

Spectronaut Arrow Streaming & Data.Table Refactor

Layer / File(s) Summary
data.table dependency setup
DESCRIPTION, NAMESPACE
Added data.table to package imports and declared required data.table symbols (:=, .SD, setDT, setnames) for chunk processing.
Arrow-based CSV streaming in reduceBigSpectronaut
R/clean_spectronaut.R
Replaced readr::read_delim_chunked() with Apache Arrow dataset scanning using ToRecordBatchReader(), configurable block_size, explicit CSV parse options, and batch-by-batch iteration with progress tracking.
Data.table refactor in cleanSpectronautChunk
R/clean_spectronaut.R
Converted dplyr pipelines to data.table: coerces to data.table, filters to present columns, applies standardized/final column renaming, converts string logicals, applies exclusion/identified/qvalue/loss-type filtering, derives IsotopeLabelType, and selects output columns including optional anomaly features.
Public API parameter threading for block_size
R/converters.R
Extended bigSpectronauttoMSstatsFormat() with block_size parameter (16 MiB default), added roxygen documentation, and wired forwarding to reduceBigSpectronaut().
User and internal function documentation
man/bigSpectronauttoMSstatsFormat.Rd, man/dot-prefixedPath.Rd
Updated function signature and arguments documentation with block_size guidance for handling wide exports; added new man page for .prefixedPath() helper.
block_size validation and forwarding tests
tests/testthat/test-converters.R
Added test cases validating rejection of invalid block_size inputs and asserting correct parameter forwarding through the API stack.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Vitek-Lab/MSstatsBig#6: Both PRs modify R/clean_spectronaut.R's reduceBigSpectronaut()/cleanSpectronautChunk() to thread and conditionally include calculateAnomalyScores/anomalyModelFeatures-driven anomaly feature columns, with this PR adding Arrow chunking (block_size) and data.table refactors on top.
  • Vitek-Lab/MSstatsBig#10: Both PRs refactor chunk processing to rely on the shared internal chunk-writing helper .writeChunkToFile(...) (DIANN cleaning switched to it, and Spectronaut chunk cleaning now delegates its CSV output to it).

Suggested reviewers

  • tonywu1999

Poem

🐰 Arrow flies swift through data streams,
data.table dreams, block_size schemes,
From dplyr's chains we break free,
Spectronaut chunks flow efficiently!
🏹✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is vague and does not clearly convey the primary changes; it uses unclear notation and lacks specific detail about what the PR accomplishes. Clarify the title to describe the main changes, e.g., 'Replace readr with Arrow for Spectronaut chunking and rewrite cleanSpectronautChunk in data.table' or similar.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description is comprehensive and well-structured, covering motivation, detailed changes, testing, and completing the checklist as required by the template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch MSstatsBig/work/20260513_arrow_chunking_followups

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Rudhik1904 Rudhik1904 requested a review from tonywu1999 May 18, 2026 21:53
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_spectronaut.R`:
- Around line 155-167: The current filter block assumes columns Excluded,
Identified, EGQvalue, PGQvalue, and FFrgLossType always exist and will error if
any are missing; update the logic in clean_spectronaut.R to guard each filter by
checking column presence (e.g., use "if ('Excluded' %in% names(input))" before
the Excluded filter, similarly for Identified, EGQvalue and PGQvalue before
applying qvalue-based NA assignment, and check for FFrgLossType before
subsetting), so each conditional only runs when its target column exists and the
behavior remains unchanged otherwise.
- Around line 25-58: The computed needed_cols is never passed to Arrow, so the
CSV reader still parses all columns; update the CsvConvertOptions usage to
include the projection by calling
arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the
include_columns field on convert_opts after creation) so that convert_opts
includes needed_cols before calling arrow::open_dataset/Scanner$create;
reference the symbols needed_cols and convert_opts (and the call
arrow::CsvConvertOptions$create) and ensure this happens prior to creating
ds/reader.

In `@tests/testthat/test-converters.R`:
- Around line 106-113: The test's expect_error calls are too broad—change them
to assert the error message mentions "block_size" so they only pass when
block_size validation fails; update each expect_error(reduceBigSpectronaut(...),
...) in tests/testthat/test-converters.R to include a second argument (string or
regex) that matches "block_size" (e.g., "block_size" or "block_size.*invalid")
for the negative values and vector cases, and similarly for the
suppressed-warning call so all invalid-block_size cases are checked by message
content rather than any error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ebfd981-78cf-43a3-b93a-fd55ccfaeccc

📥 Commits

Reviewing files that changed from the base of the PR and between a43b90b and c8c835e.

📒 Files selected for processing (7)
  • DESCRIPTION
  • NAMESPACE
  • R/clean_spectronaut.R
  • R/converters.R
  • man/bigSpectronauttoMSstatsFormat.Rd
  • man/dot-prefixedPath.Rd
  • tests/testthat/test-converters.R

Comment thread R/clean_spectronaut.R
Comment on lines +25 to +58
# Columns cleanSpectronautChunk actually consumes; Arrow's
# convert_options$include_columns drops everything else at parse time so
# we never materialize the ~35 unused columns Spectronaut exports.
needed_cols <- c("R.FileName", "R.Condition", "R.Replicate",
"PG.ProteinAccessions", "EG.ModifiedSequence",
"FG.LabeledSequence", "FG.Charge",
"F.FrgIon", "F.Charge",
"EG.Identified", "F.ExcludedFromQuantification",
"F.FrgLossType", "PG.Qvalue", "EG.Qvalue",
intensity)
if (calculateAnomalyScores) {
needed_cols <- c(needed_cols, anomalyModelFeatures)
}

# Arrow's CSV reader replaces readr::read_delim_chunked. Arrow releases
# per-batch state as soon as a batch is consumed, so peak memory is
# bounded by one record batch instead of growing with the dataset (readr
# keeps a string-interning pool that accumulates across chunks). The
# `delim` switch above already covers comma / tab / semicolon variants;
# Arrow's CSV reader handles all three the same way through
# CsvParseOptions$delimiter.
parse_opts <- arrow::CsvParseOptions$create(delimiter = delim)
convert_opts <- arrow::CsvConvertOptions$create()
read_opts <- arrow::CsvReadOptions$create(block_size = block_size)

ds <- arrow::open_dataset(
input_file,
format = "csv",
parse_options = parse_opts,
convert_options = convert_opts,
read_options = read_opts
)

reader <- arrow::Scanner$create(ds)$ToRecordBatchReader()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether column projection is actually wired in the Arrow path.
rg -n -C2 'needed_cols|CsvConvertOptions\\$create|include_columns|Scanner\\$create|open_dataset\\(' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 289


🏁 Script executed:

#!/bin/bash
# First, check for needed_cols usage in the file
echo "=== Searching for needed_cols usage ==="
rg -n 'needed_cols' R/clean_spectronaut.R

echo ""
echo "=== Searching for CsvConvertOptions and include_columns ==="
rg -n 'CsvConvertOptions\|include_columns' R/clean_spectronaut.R

echo ""
echo "=== Checking context around CsvConvertOptions create call ==="
rg -n -B3 -A3 'CsvConvertOptions\$create' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 688


🏁 Script executed:

#!/bin/bash
# Check what happens after reader creation
echo "=== Code from line 50 onwards ==="
sed -n '50,80p' R/clean_spectronaut.R

echo ""
echo "=== Complete function to see full context ==="
wc -l R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1118


🏁 Script executed:

#!/bin/bash
# Search for include_columns usage in the entire codebase
echo "=== Searching for include_columns in codebase ==="
rg -n 'include_columns' .

echo ""
echo "=== Searching for CsvConvertOptions usage patterns ==="
rg -n 'CsvConvertOptions' . -A2

echo ""
echo "=== Check the comment context more carefully ==="
sed -n '25,50p' R/clean_spectronaut.R

Repository: Vitek-Lab/MSstatsBig

Length of output: 1892


Wire needed_cols into Arrow's column projection.

needed_cols is computed but never applied to CsvConvertOptions, so the reader still parses all CSV columns. This defeats the intended memory/performance reduction on wide Spectronaut exports.

💡 Proposed fix
-  convert_opts <- arrow::CsvConvertOptions$create()
+  convert_opts <- arrow::CsvConvertOptions$create(
+    include_columns = needed_cols
+  )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/clean_spectronaut.R` around lines 25 - 58, The computed needed_cols is
never passed to Arrow, so the CSV reader still parses all columns; update the
CsvConvertOptions usage to include the projection by calling
arrow::CsvConvertOptions$create(include_columns = needed_cols) (or set the
include_columns field on convert_opts after creation) so that convert_opts
includes needed_cols before calling arrow::open_dataset/Scanner$create;
reference the symbols needed_cols and convert_opts (and the call
arrow::CsvConvertOptions$create) and ensure this happens prior to creating
ds/reader.

Comment thread R/clean_spectronaut.R
Comment on lines 155 to +167
if (filter_by_excluded) {
input <- dplyr::mutate(
input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))

input[Excluded == TRUE, Intensity := NA_real_]
}

if (filter_by_identified) {
input <- dplyr::mutate(
input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))
input[Identified == FALSE, Intensity := NA_real_]
}

if (filter_by_qvalue) {
input <- dplyr::mutate(
input,
Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
input <- dplyr::mutate(
input,
Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
# Preserve dplyr::if_else semantics: rows with NA q-values become NA.
input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
}

input <- dplyr::filter(input, FFrgLossType == "noloss")
if (is.element("LabeledSequence", colnames(input))) {
input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))
input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))

input <- input[FFrgLossType == "noloss"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard filter steps when selected columns are absent.

After present_orig subsetting, downstream filters still assume Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType exist. Missing columns here will raise runtime errors.

🛡️ Proposed fix
-  if (filter_by_excluded) {
+  if (filter_by_excluded && "Excluded" %in% colnames(input)) {
     input[Excluded == TRUE, Intensity := NA_real_]
   }
-  if (filter_by_identified) {
+  if (filter_by_identified && "Identified" %in% colnames(input)) {
     input[Identified == FALSE, Intensity := NA_real_]
   }
   if (filter_by_qvalue) {
-    input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
-    input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    if ("EGQvalue" %in% colnames(input)) {
+      input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
+    if ("PGQvalue" %in% colnames(input)) {
+      input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
+    }
   }
-
-  input <- input[FFrgLossType == "noloss"]
+  if ("FFrgLossType" %in% colnames(input)) {
+    input <- input[FFrgLossType == "noloss"]
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (filter_by_excluded) {
input <- dplyr::mutate(
input, Intensity = dplyr::if_else(Excluded, NA_real_, Intensity))
input[Excluded == TRUE, Intensity := NA_real_]
}
if (filter_by_identified) {
input <- dplyr::mutate(
input, Intensity = dplyr::if_else(Identified, Intensity, NA_real_))
input[Identified == FALSE, Intensity := NA_real_]
}
if (filter_by_qvalue) {
input <- dplyr::mutate(
input,
Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
input <- dplyr::mutate(
input,
Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
# Preserve dplyr::if_else semantics: rows with NA q-values become NA.
input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
}
input <- dplyr::filter(input, FFrgLossType == "noloss")
if (is.element("LabeledSequence", colnames(input))) {
input <- dplyr::mutate(input, IsLabeled = grepl("Lys8", LabeledSequence) | grepl("Arg10", LabeledSequence))
input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))
input <- input[FFrgLossType == "noloss"]
if (filter_by_excluded && "Excluded" %in% colnames(input)) {
input[Excluded == TRUE, Intensity := NA_real_]
}
if (filter_by_identified && "Identified" %in% colnames(input)) {
input[Identified == FALSE, Intensity := NA_real_]
}
if (filter_by_qvalue) {
# Preserve dplyr::if_else semantics: rows with NA q-values become NA.
if ("EGQvalue" %in% colnames(input)) {
input[is.na(EGQvalue) | EGQvalue >= qvalue_cutoff, Intensity := NA_real_]
}
if ("PGQvalue" %in% colnames(input)) {
input[is.na(PGQvalue) | PGQvalue >= qvalue_cutoff, Intensity := NA_real_]
}
}
if ("FFrgLossType" %in% colnames(input)) {
input <- input[FFrgLossType == "noloss"]
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/clean_spectronaut.R` around lines 155 - 167, The current filter block
assumes columns Excluded, Identified, EGQvalue, PGQvalue, and FFrgLossType
always exist and will error if any are missing; update the logic in
clean_spectronaut.R to guard each filter by checking column presence (e.g., use
"if ('Excluded' %in% names(input))" before the Excluded filter, similarly for
Identified, EGQvalue and PGQvalue before applying qvalue-based NA assignment,
and check for FFrgLossType before subsetting), so each conditional only runs
when its target column exists and the behavior remains unchanged otherwise.

Comment on lines +106 to +113
expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
expect_error(suppressWarnings(
reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
))
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen invalid block_size expectations to avoid false positives.

These assertions currently pass on any error. If reducer internals fail for another reason, this test can still pass even when block_size validation regresses. Constrain the expected error to include block_size in the message.

Suggested tightening
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_))
-  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)))
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = -1L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = 0L), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = NA_integer_), regexp = "block_size")
+  expect_error(reduceBigSpectronaut(input_file, output_file, block_size = c(1L, 2L)), regexp = "block_size")
   expect_error(suppressWarnings(
     reduceBigSpectronaut(input_file, output_file, block_size = "16MB")
-  ))
+  ), regexp = "block_size")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/testthat/test-converters.R` around lines 106 - 113, The test's
expect_error calls are too broad—change them to assert the error message
mentions "block_size" so they only pass when block_size validation fails; update
each expect_error(reduceBigSpectronaut(...), ...) in
tests/testthat/test-converters.R to include a second argument (string or regex)
that matches "block_size" (e.g., "block_size" or "block_size.*invalid") for the
negative values and vector cases, and similarly for the suppressed-warning call
so all invalid-block_size cases are checked by message content rather than any
error.

Comment thread R/clean_spectronaut.R
anomalyModelFeatures=c()) {
calculateAnomalyScores=FALSE,
anomalyModelFeatures=c(),
block_size = 16L * 1024L * 1024L) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. For people who may want to increase their block size to increase the speed of processing, add a recommendation on how to estimate the adequate block size to maximize speed while reducing the risk of the system crashing
  2. Make it more clear in the MSstatsBig / MSstatsConvert documentation on which columns we actually need from users for Spectronaut

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants