@spkrka spkrka commented Dec 18, 2025

Add SMBCollection: unified fluent API for Sort-Merge Bucket operations

This introduces SMBCollection, a new fluent API that unifies and improves all SMB operations in Scio.

Key Improvements

1. Unified API

Traditional SMB operations are fragmented across disjoint methods solving specific sub-problems:

  • sortMergeJoin - read and join to SCollection
  • sortMergeTransform - read, transform, and write back to SMB
  • sortMergeGroupByKey - read single source to SCollection
  • sortMergeCoGroup - read multiple sources to SCollection

SMBCollection provides a single, composable API for all SMB workflows.

2. Familiar SCollection-like Ergonomics

Uses familiar functional operations (map, filter, flatMap) instead of imperative callbacks:

Before (Traditional API):

sc.sortMergeTransform(classOf[Integer], usersRead)
  .to(output)
  .via { case (key, users, outputCollector) =>
    users.foreach { user =>
      val transformed = transformUser(user)
      outputCollector.accept(transformed)  // ❌ Imperative callback
    }
  }

After (SMBCollection):

SMBCollection.read(classOf[Integer], usersRead)
  .flatMap(users => users.map(transformUser))  // ✅ Functional style
  .saveAsSortedBucket(output)
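
To make the difference between the two styles concrete, here is a minimal plain-Scala sketch that uses no Scio types at all. `User`, `transformUser`, `viaCallback`, and `viaFlatMap` are hypothetical stand-ins, not part of this PR. It only shows that pushing results into a collector and returning them from `flatMap` yield the same records.

```scala
// Plain-Scala sketch; every name here is hypothetical and no Scio API is used.
final case class User(id: Int, name: String)

def transformUser(u: User): User = u.copy(name = u.name.toUpperCase)

// Callback style: the caller hands in a collector and the body pushes results into it.
def viaCallback(users: Iterable[User], collect: User => Unit): Unit =
  users.foreach(u => collect(transformUser(u)))

// Functional style: the body returns its results and the caller decides where they go.
def viaFlatMap(groups: Iterable[Iterable[User]]): Iterable[User] =
  groups.flatMap(users => users.map(transformUser))

// One key group, as a per-key read would hand it over.
val group = List(User(1, "ann"), User(1, "bob"))

val collected = scala.collection.mutable.ListBuffer.empty[User]
viaCallback(group, u => { collected += u; () })

val mapped = viaFlatMap(List(group)).toList
```

Both `collected` and `mapped` end up holding the same transformed users; the functional form simply leaves the choice of sink to whatever is chained next.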

3. Better Interoperability

Seamlessly convert between SMB and SCollection:

val base = SMBCollection.cogroup2(classOf[Integer], usersRead, accountsRead)
  .map { case (_, (users, accounts)) => expensiveJoin(users, accounts) }

// SMB outputs (stay bucketed)
base.map(_.summary).saveAsSortedBucket(summaryOutput)
base.map(_.details).saveAsSortedBucket(detailsOutput)

// SCollection output (for non-SMB operations)
val deferred = base.toDeferredSCollection().get  // named so it does not shadow the ScioContext `sc`
deferred.filter(_.needsProcessing).saveAsTextFile(textOutput)

sc.run()  // All outputs execute in one pass!
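
The "register outputs first, execute them together" idea can be sketched in plain Scala. This is an illustration under assumptions, not how SMBCollection is implemented: `DeferredPipeline`, `addOutput`, and `sourceReads` are hypothetical names.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical mini-pipeline: outputs are only registered up front;
// nothing is read until `run()` is called.
final class DeferredPipeline[A](source: () => Iterator[A]) {
  private val sinks = ListBuffer.empty[A => Unit]
  var sourceReads = 0 // counts how many times the source was scanned

  def addOutput(sink: A => Unit): Unit = sinks += sink

  def run(): Unit = {
    sourceReads += 1
    // A single scan of the source feeds every registered sink.
    source().foreach(a => sinks.foreach(sink => sink(a)))
  }
}

val summaries = ListBuffer.empty[String]
val flagged = ListBuffer.empty[Int]

val pipeline = new DeferredPipeline[Int](() => Iterator(1, 2, 3, 4))
pipeline.addOutput(n => summaries += s"item-$n")
pipeline.addOutput(n => if (n % 2 == 0) flagged += n)
pipeline.run()
```

After `run()`, both sinks have been filled while `sourceReads` stays at 1, which is the property the comment "all outputs execute in one pass" describes.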

4. Zero-Shuffle Multi-Output (Massive Performance Gains)

Create multiple SMB outputs from the same computation with zero shuffles.

Before (Traditional - SCollection fanout):

// Reads once, joins once, BUT shuffles 3 times
val joined = sc.sortMergeJoin(classOf[Integer], usersRead, accountsRead)
  .map { case (userId, (user, account)) =>
    expensiveJoin(user, account)  // Runs once ✓
  }

// ❌ Each saveAsSortedBucket does a GroupByKey shuffle!
joined.map(_.summary).saveAsSortedBucket(summaryOutput)    // Shuffle 1
joined.map(_.details).saveAsSortedBucket(detailsOutput)    // Shuffle 2
joined.filter(_.isHighValue).saveAsSortedBucket(highValueOutput)  // Shuffle 3

After (SMBCollection - zero shuffles):

// Reads once, joins once, zero shuffles!
val base = SMBCollection.cogroup2(classOf[Integer], usersRead, accountsRead)
  .map { case (_, (users, accounts)) =>
    expensiveJoin(users, accounts)  // Runs ONCE
  }

// ✅ Fan out to multiple SMB outputs - data already bucketed!
base.map(_.summary).saveAsSortedBucket(summaryOutput)
base.map(_.details).saveAsSortedBucket(detailsOutput)
base.filter(_.isHighValue).saveAsSortedBucket(highValueOutput)

sc.run()  // Single pass execution
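
Why no shuffle is needed can be seen in a small plain-Scala sketch: transforming or filtering values leaves the keys untouched, so every record stays in its bucket and in key order. The bucketing rule `bucketOf` below is a hypothetical stand-in (real SMB uses its own hash function), and `Joined` is an illustrative type.

```scala
// Hypothetical bucketing rule for illustration only.
def bucketOf(key: Int, numBuckets: Int): Int = Math.floorMod(key.hashCode, numBuckets)

final case class Joined(userId: Int, summary: String, isHighValue: Boolean)

val numBuckets = 4

// Contents of one bucket, already sorted by key (keys 1, 5, 9 all map to bucket 1).
val bucket1: Vector[(Int, Joined)] = Vector(
  1 -> Joined(1, "s1", isHighValue = false),
  5 -> Joined(5, "s5", isHighValue = true),
  9 -> Joined(9, "s9", isHighValue = true)
)

// Value-only transformations: keys are carried through unchanged.
val summaries = bucket1.map { case (k, v) => k -> v.summary }
val highValue = bucket1.filter { case (_, v) => v.isHighValue }

// Checks that every key still belongs to `bucket` and that key order is preserved.
def stillBucketedAndSorted[V](rows: Vector[(Int, V)], bucket: Int): Boolean = {
  val keys = rows.map(_._1)
  keys.forall(k => bucketOf(k, numBuckets) == bucket) && keys == keys.sorted
}
```

Both derived collections pass `stillBucketedAndSorted(_, 1)`, so each could be written straight back to bucket 1 without regrouping.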

Performance Impact:

| Scenario | Traditional (SCollection fanout) | SMBCollection Multi-Output | Cost Reduction |
|----------|----------------------------------|----------------------------|----------------|
| 1TB → 3 SMB outputs | 1TB read + ~3TB shuffle | 1TB read, 0 shuffle | ~4× savings |
| 2TB join → 5 outputs | 2TB read + ~10TB shuffle | 2TB read, 0 shuffle | ~6× savings |
| 500GB → 10 outputs | 500GB read + ~5TB shuffle | 500GB read, 0 shuffle | ~11× savings |
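
The ratios in the table follow from simple arithmetic. The sketch below assumes, as the table appears to, that each fanned-out output reshuffles roughly the full input once and that cost is proportional to bytes moved. The function names are hypothetical.

```scala
// Assumption: every output in the traditional fanout shuffles about the full input size.
def traditionalTb(inputTb: Double, outputs: Int): Double = inputTb + outputs * inputTb

// With zero shuffles, only the read remains.
def smbCollectionTb(inputTb: Double): Double = inputTb

// Ratio of bytes moved; under the assumption above it equals outputs + 1.
def costRatio(inputTb: Double, outputs: Int): Double =
  traditionalTb(inputTb, outputs) / smbCollectionTb(inputTb)

val threeOutputs = costRatio(1.0, 3)  // 1TB  -> 3 outputs
val fiveOutputs  = costRatio(2.0, 5)  // 2TB  -> 5 outputs
val tenOutputs   = costRatio(0.5, 10) // 500GB -> 10 outputs
```

Under this model the saving depends only on the number of outputs, not on the input size, which is why the 500GB case shows the largest ratio.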

Complete Example

See SortMergeBucketMultiOutputExample in scio-examples for a full working example showing how to create multiple derived datasets (summary, details, high-value users) from a single expensive user-account join with zero shuffles.

API Design

  • Type signature: SMBCollection[K1, K2, V] - tracks keys for type safety, methods work with V directly
  • read() returns Iterable[V] without key wrapper
  • cogroup2() returns (K, (Iterable[L], Iterable[R]))
  • Standard transformations: map, filter, flatMap (not mapValues/flatMapValues)
  • Side inputs: clean (SideInputContext, V) signature
  • Auto-execution: outputs execute via sc.onClose() hook
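
To illustrate the `(K, (Iterable[L], Iterable[R]))` shape listed above, here is a minimal plain-Scala sketch of a merge-style cogroup over two key-sorted inputs. `mergeCogroup` is a hypothetical helper written for this example; it is not the `cogroup2` implementation from this PR and ignores bucketing entirely.

```scala
// Groups two key-sorted inputs by key in a single merge pass.
// Hypothetical helper for illustration; both inputs must already be sorted by key.
def mergeCogroup[K, L, R](left: Seq[(K, L)], right: Seq[(K, R)])(implicit
    ord: Ordering[K]
): List[(K, (Iterable[L], Iterable[R]))] = {
  val l = left.iterator.buffered
  val r = right.iterator.buffered
  val out = List.newBuilder[(K, (Iterable[L], Iterable[R]))]

  // Drains every consecutive record carrying `key` from a buffered iterator.
  def drain[V](it: BufferedIterator[(K, V)], key: K): List[V] = {
    val vs = List.newBuilder[V]
    while (it.hasNext && ord.equiv(it.head._1, key)) vs += it.next()._2
    vs.result()
  }

  while (l.hasNext || r.hasNext) {
    // The next key to emit is the smaller of the two heads.
    val key =
      if (!r.hasNext) l.head._1
      else if (!l.hasNext) r.head._1
      else ord.min(l.head._1, r.head._1)
    val group = (key, (drain(l, key), drain(r, key)))
    out += group
  }
  out.result()
}

val users = Seq(1 -> "ann", 1 -> "al", 3 -> "cy")
val accounts = Seq(1 -> 100, 2 -> 200)
val grouped = mergeCogroup(users, accounts)
```

Each key appears exactly once in `grouped`, with an empty iterable on the side that has no records for it, so a downstream `map { case (_, (users, accounts)) => ... }` sees both sides together.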

Limitations

  • Currently supports up to 4-way cogroups (cogroup2, cogroup3, cogroup4)
  • For 5-22 way cogroups, use traditional sortMergeCoGroup
  • Note: This is not a fundamental limitation; cogroup5 through cogroup22 can be added following the same pattern as the existing methods

Documentation

Updated documentation includes:

  • Complete fluent API guide with multi-output examples
  • API comparison table (fluent vs traditional)
  • Performance impact analysis
  • Migration examples
  • When to use which API

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 [email protected]


codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 87.69772% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.71%. Comparing base (c234ba6) to head (2fd00b9).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...ain/scala/com/spotify/scio/smb/SMBCollection.scala | 43.58% | 44 Missing ⚠️ |
| ...scala/com/spotify/scio/smb/SMBCollectionImpl.scala | 95.83% | 7 Missing ⚠️ |
| ...mb/src/main/scala/com/spotify/scio/smb/SmbIO.scala | 83.33% | 6 Missing ⚠️ |
| ...a/com/spotify/scio/smb/SMBCollectionInternal.scala | 97.25% | 5 Missing ⚠️ |
| .../spotify/scio/smb/syntax/SMBCollectionSyntax.scala | 84.37% | 5 Missing ⚠️ |
| ...main/scala/com/spotify/scio/smb/SmbTestPaths.scala | 66.66% | 2 Missing ⚠️ |
| ...la/com/spotify/scio/smb/SimpleKeyGroupReader.scala | 97.29% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5848      +/-   ##
==========================================
+ Coverage   61.49%   62.71%   +1.21%     
==========================================
  Files         317      324       +7     
  Lines       11650    12218     +568     
  Branches      845      885      +40     
==========================================
+ Hits         7164     7662     +498     
- Misses       4486     4556      +70     
