Add SMBCollection: unified fluent API for Sort-Merge Bucket operations #5848

spkrka · 2025-12-18T16:58:46Z

This introduces SMBCollection, a new fluent API that unifies and improves all SMB operations in Scio.

Key Improvements

1. Unified API

Traditional SMB operations are fragmented across disjoint methods solving specific sub-problems:

sortMergeJoin - read and join to SCollection
sortMergeTransform - read, transform, and write back to SMB
sortMergeGroupByKey - read single source to SCollection
sortMergeCoGroup - read multiple sources to SCollection

SMBCollection provides a single, composable API for all SMB workflows.

2. Familiar SCollection-like Ergonomics

Uses familiar functional operations (map, filter, flatMap) instead of imperative callbacks:

Before (Traditional API):

sc.sortMergeTransform(classOf[Integer], usersRead)
  .to(output)
  .via { case (key, users, outputCollector) =>
    users.foreach { user =>
      val transformed = transformUser(user)
      outputCollector.accept(transformed)  // ❌ Imperative callback
    }
  }

After (SMBCollection):

SMBCollection.read(classOf[Integer], usersRead)
  .flatMap(users => users.map(transformUser))  // ✅ Functional style
  .saveAsSortedBucket(output)

3. Better Interoperability

Seamlessly convert between SMB and SCollection:

val base = SMBCollection.cogroup2(classOf[Integer], usersRead, accountsRead)
  .map { case (_, (users, accounts)) => expensiveJoin(users, accounts) }

// SMB outputs (stay bucketed)
base.map(_.summary).saveAsSortedBucket(summaryOutput)
base.map(_.details).saveAsSortedBucket(detailsOutput)

// SCollection output (for non-SMB operations)
// Bind to a new name so the ScioContext `sc` is not shadowed
val joined = base.toDeferredSCollection().get
joined.filter(_.needsProcessing).saveAsTextFile(textOutput)

sc.run()  // All outputs execute in one pass!

4. Zero-Shuffle Multi-Output (Massive Performance Gains)

Create multiple SMB outputs from the same computation with zero shuffles.

Before (Traditional - SCollection fanout):

// Reads once, joins once, BUT shuffles 3 times
val joined = sc.sortMergeJoin(classOf[Integer], usersRead, accountsRead)
  .map { case (userId, (user, account)) =>
    expensiveJoin(user, account)  // Runs once ✓
  }

// ❌ Each saveAsSortedBucket does a GroupByKey shuffle!
joined.map(_.summary).saveAsSortedBucket(summaryOutput)    // Shuffle 1
joined.map(_.details).saveAsSortedBucket(detailsOutput)    // Shuffle 2
joined.filter(_.isHighValue).saveAsSortedBucket(highValueOutput)  // Shuffle 3

After (SMBCollection - zero shuffles):

// Reads once, joins once, zero shuffles!
val base = SMBCollection.cogroup2(classOf[Integer], usersRead, accountsRead)
  .map { case (_, (users, accounts)) =>
    expensiveJoin(users, accounts)  // Runs ONCE
  }

// ✅ Fan out to multiple SMB outputs - data already bucketed!
base.map(_.summary).saveAsSortedBucket(summaryOutput)
base.map(_.details).saveAsSortedBucket(detailsOutput)
base.filter(_.isHighValue).saveAsSortedBucket(highValueOutput)

sc.run()  // Single pass execution
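
To see why the fan-out above can skip the shuffle, it helps to model the read side in plain Scala. The sketch below is not Scio's implementation: `SortMergeSketch` and `cogroup` are hypothetical names, and the inputs are ordinary in-memory sequences. It assumes both inputs are already sorted by key, as records within an SMB bucket are, and cogroups them in one linear pass.

```scala
// Hypothetical plain-Scala model of a sort-merge cogroup; not part of Scio.
object SortMergeSketch {
  // Cogroups two key-sorted sequences in a single pass over both inputs.
  // Precondition (assumed, not checked): `left` and `right` are sorted by key.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])(
      implicit ord: Ordering[K]): Vector[(K, (Vector[A], Vector[B]))] = {
    val l = left.iterator.buffered
    val r = right.iterator.buffered
    val out = Vector.newBuilder[(K, (Vector[A], Vector[B]))]
    while (l.hasNext || r.hasNext) {
      // The next key to emit is the smallest key at the head of either input.
      val key =
        if (!r.hasNext) l.head._1
        else if (!l.hasNext) r.head._1
        else ord.min(l.head._1, r.head._1)
      // Drain every record carrying that key from each side.
      val lefts = Vector.newBuilder[A]
      while (l.hasNext && ord.equiv(l.head._1, key)) lefts += l.next()._2
      val rights = Vector.newBuilder[B]
      while (r.hasNext && ord.equiv(r.head._1, key)) rights += r.next()._2
      out += ((key, (lefts.result(), rights.result())))
    }
    out.result()
  }
}
```

The result comes out in key order, and a downstream `map` or `filter` that leaves the key alone cannot reorder it. Under that assumption each derived output is still sorted by key and can be written back bucket by bucket without regrouping.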

Performance Impact:

Scenario	Traditional (SCollection fanout)	SMBCollection Multi-Output	Cost Reduction
1TB → 3 SMB outputs	1TB read + ~3TB shuffle	1TB read, 0 shuffle	~4× savings
2TB join → 5 outputs	2TB read + ~10TB shuffle	2TB read, 0 shuffle	~6× savings
500GB → 10 outputs	500GB read + ~5TB shuffle	500GB read, 0 shuffle	~11× savings
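
The ratios in the table follow from a simple byte-count model, sketched here in plain Scala. It assumes, as the table does, that each `saveAsSortedBucket` on an SCollection shuffles roughly the full input once; actual shuffle volume depends on how much each `map` or `filter` shrinks the data. `ShuffleCostSketch` and its methods are hypothetical names used only for this illustration.

```scala
// Hypothetical cost model behind the table; sizes are in terabytes.
object ShuffleCostSketch {
  // Traditional fan-out: one read plus one full-size shuffle per output (assumption).
  def traditionalTb(inputTb: Double, outputs: Int): Double =
    inputTb + inputTb * outputs

  // SMBCollection multi-output: one read, no shuffle.
  def smbCollectionTb(inputTb: Double): Double = inputTb

  // Ratio of data processed; under this model it simplifies to 1 + outputs.
  def reduction(inputTb: Double, outputs: Int): Double =
    traditionalTb(inputTb, outputs) / smbCollectionTb(inputTb)
}
```

Under this model the reduction depends only on the number of outputs, not on the input size: 3 outputs give 4×, 5 give 6×, and 10 give 11×.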

Complete Example

See SortMergeBucketMultiOutputExample in scio-examples for a full working example showing how to create multiple derived datasets (summary, details, high-value users) from a single expensive user-account join with zero shuffles.

API Design

Type signature: SMBCollection[K1, K2, V] - tracks keys for type safety, methods work with V directly
read() returns Iterable[V] without key wrapper
cogroup2() returns (K, (Iterable[L], Iterable[R]))
Standard transformations: map, filter, flatMap (not mapValues/flatMapValues)
Side inputs: clean (SideInputContext, V) signature
Auto-execution: outputs execute via sc.onClose() hook
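
The auto-execution point can be pictured with a toy model in plain Scala. This is only a sketch of the idea of deferring outputs to a close hook: `MiniContext`, `MiniCollection`, `save`, and `passes` are hypothetical stand-ins, not Scio classes, and the real `SMBCollection` internals may differ.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a pipeline context that supports close hooks.
final class MiniContext {
  private val hooks = mutable.ArrayBuffer.empty[() => Unit]
  def onClose(hook: () => Unit): Unit = hooks += hook
  // Running the context fires every registered hook once.
  def run(): Unit = hooks.foreach(_.apply())
}

// Hypothetical stand-in for a collection whose outputs are deferred.
final class MiniCollection[V](ctx: MiniContext, source: () => Iterator[V]) {
  private val sinks = mutable.ArrayBuffer.empty[V => Unit]
  var passes = 0 // counts how many times the source is read

  // Register a single hook: read the source once and feed every sink.
  ctx.onClose { () =>
    passes += 1
    source().foreach(v => sinks.foreach(_.apply(v)))
  }

  // Declaring an output only records it; nothing is written until run().
  def save[O](transform: V => Option[O], sink: mutable.Buffer[O]): Unit =
    sinks += (v => transform(v).foreach(sink += _))
}
```

In this model, calling `save` several times and then `run()` once reads the source a single time and fills every sink, which mirrors the "all outputs execute in one pass" behavior described above.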

Limitations

Currently supports up to 4-way cogroups (cogroup2, cogroup3, cogroup4)
For 5-22 way cogroups, use traditional sortMergeCoGroup
Note: This is not a fundamental limitation; support can be extended by adding cogroup5 through cogroup22 methods

Documentation

Updated documentation includes:

Complete fluent API guide with multi-output examples
API comparison table (fluent vs traditional)
Performance impact analysis
Migration examples
When to use which API

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 [email protected]


codecov · 2025-12-18T17:23:39Z

Codecov Report

❌ Patch coverage is 87.69772% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.71%. Comparing base (c234ba6) to head (2fd00b9).

Files with missing lines	Patch %	Lines
...ain/scala/com/spotify/scio/smb/SMBCollection.scala	43.58%	44 Missing ⚠️
...scala/com/spotify/scio/smb/SMBCollectionImpl.scala	95.83%	7 Missing ⚠️
...mb/src/main/scala/com/spotify/scio/smb/SmbIO.scala	83.33%	6 Missing ⚠️
...a/com/spotify/scio/smb/SMBCollectionInternal.scala	97.25%	5 Missing ⚠️
.../spotify/scio/smb/syntax/SMBCollectionSyntax.scala	84.37%	5 Missing ⚠️
...main/scala/com/spotify/scio/smb/SmbTestPaths.scala	66.66%	2 Missing ⚠️
...la/com/spotify/scio/smb/SimpleKeyGroupReader.scala	97.29%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5848      +/-   ##
==========================================
+ Coverage   61.49%   62.71%   +1.21%     
==========================================
  Files         317      324       +7     
  Lines       11650    12218     +568     
  Branches      845      885      +40     
==========================================
+ Hits         7164     7662     +498     
- Misses       4486     4556      +70


spkrka force-pushed the krka/smb-fluent branch from 1983b61 to 5aad414 Compare December 18, 2025 17:05

Add chill dependency to scio-smb

2fd00b9

spkrka force-pushed the krka/smb-fluent branch from 4851f57 to 2fd00b9 Compare December 18, 2025 18:28
