
feat: add S3 SAIL with Parquet columnar storage engine#5696

Draft
odysa wants to merge 9 commits into eclipse-rdf4j:develop from odysa:feature/s3-sail

Conversation

@odysa (Contributor) commented Feb 23, 2026

Summary

Adds an experimental SAIL implementation (rdf4j-sail-s3) that stores RDF data on S3-compatible object storage using an LSM-tree architecture with Apache Parquet as the storage format.

Phase 1 — Foundation

  • Phase 1a: Module skeleton with SPI wiring (S3Store, S3StoreConfig, S3StoreFactory)
  • Phase 1b: In-memory storage engine with MemTable (sorted skip list), QuadIndex (configurable index permutations), and Varint encoding
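The Varint encoding used by the key layer can be illustrated with a standard LEB128-style codec — a minimal sketch, assuming a plain unsigned variable-length integer; the class and method names here are illustrative, not the PR's actual API:

```java
import java.nio.ByteBuffer;

// Sketch of an unsigned LEB128-style varint codec, as commonly used for
// compact ID encoding in LSM keys. Illustrative, not the PR's exact API.
public class Varint {
    // Write 7 bits per byte, high bit set on every byte except the last.
    static void encode(long value, ByteBuffer out) {
        while ((value & ~0x7FL) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static long decode(ByteBuffer in) {
        long result = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(10);
        encode(300, buf);
        buf.flip();
        System.out.println(decode(buf)); // prints 300
    }
}
```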

Phase 2 — Parquet + Tiered Cache

  • Storage format: Apache Parquet with ZSTD compression and dictionary encoding, replacing custom SSTables
  • Three sort orders per epoch (SPOC, OPSC, CSPO): always picks optimal sort order for any query pattern
  • Single MemTable in SPOC order, writes 3 files per flush
  • JSON catalog with per-file min/max statistics (s, p, o, c) for stats-based pruning before opening any Parquet file
  • Three-tier cache: L1 Caffeine heap (256 MB default) → L2 local disk LRU (10 GB default) → L3 S3
  • Write-through caching: flushed files populate L1+L2 immediately, avoiding cold reads
  • Compaction: L0→L1 merge (8 epoch threshold), L1→L2 merge (4 epoch threshold), tombstone suppression at highest level
  • Zero Hadoop JARs: PlainParquetConfiguration + custom SimpleCodecFactory (zstd-jni) bypass all Hadoop runtime paths. 14 minimal stub classes satisfy parquet-java's JVM class loading requirements
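The three-tier lookup order and the write-through behavior described above can be sketched as follows. The PR uses Caffeine for L1 and a disk LRU for L2; this dependency-free sketch stands in plain maps for both tiers, and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Read-through sketch of the L1 -> L2 -> L3 lookup order. The real store
// uses Caffeine (byte-weighted) for L1 and a disk LRU for L2; plain maps
// stand in here so the sketch stays self-contained.
public class TieredCacheSketch {
    private final Map<String, byte[]> l1 = new LinkedHashMap<>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
            return size() > 1024; // stand-in for a byte-weighted heap limit
        }
    };
    private final Map<String, byte[]> l2;           // stand-in for the disk cache
    private final Function<String, byte[]> l3Fetch; // stand-in for an S3 GET

    TieredCacheSketch(Map<String, byte[]> l2, Function<String, byte[]> l3Fetch) {
        this.l2 = l2;
        this.l3Fetch = l3Fetch;
    }

    byte[] get(String key) {
        byte[] bytes = l1.get(key);
        if (bytes != null) return bytes;            // L1 hit
        bytes = l2.get(key);
        if (bytes == null) {
            bytes = l3Fetch.apply(key);             // L3: source of truth
            l2.put(key, bytes);                     // backfill the disk tier
        }
        l1.put(key, bytes);                         // promote to the heap tier
        return bytes;
    }

    // Write-through on flush: populate both cache tiers immediately so a
    // freshly flushed file never causes a cold S3 read.
    void writeThrough(String key, byte[] bytes) {
        l2.put(key, bytes);
        l1.put(key, bytes);
    }
}
```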

Phase 3 — Workbench UI + Instance-level S3 Config + Local Filesystem Backend

  • Workbench integration: S3 Store option in repository creation dropdown, create-s3.xsl form, s3.ttl config template
  • Instance-level S3 connection: Bucket, endpoint, region, credentials, and force-path-style are configured once per RDF4J instance via environment variables (RDF4J_S3_BUCKET, RDF4J_S3_ENDPOINT, RDF4J_S3_REGION, RDF4J_S3_ACCESS_KEY, RDF4J_S3_SECRET_KEY, RDF4J_S3_FORCE_PATH_STYLE) or system properties (rdf4j.s3.*)
  • Per-repo isolation via prefix: Each repository specifies only s3Prefix — all data is namespaced under that prefix within the shared bucket
  • Three storage modes: S3 persistence (bucket configured), local filesystem persistence (dataDir configured), or pure in-memory (neither configured)
  • FileSystemObjectStore: Production-ready local filesystem backend with atomic writes
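The instance-level connection lookup can be sketched as below. The property and environment-variable names follow the PR description (rdf4j.s3.* / RDF4J_S3_*); the helper itself, and the system-property-first precedence, are assumptions for illustration:

```java
// Sketch of resolving one S3 connection setting from a system property
// with an environment-variable fallback. The precedence shown (property
// first, then env var, then default) is an assumption, not confirmed by
// the PR.
public class S3ConnectionSettings {
    static String resolve(String property, String envVar, String defaultValue) {
        String value = System.getProperty(property);
        if (value == null || value.isEmpty()) {
            value = System.getenv(envVar);
        }
        return (value == null || value.isEmpty()) ? defaultValue : value;
    }

    public static void main(String[] args) {
        // Each repository then only supplies s3Prefix in its own config.
        String bucket = resolve("rdf4j.s3.bucket", "RDF4J_S3_BUCKET", null);
        String region = resolve("rdf4j.s3.region", "RDF4J_S3_REGION", "us-east-1");
        System.out.println("bucket=" + bucket + " region=" + region);
    }
}
```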

Phase 4 — Safety + Code Quality

  • Thread safety: Catalog uses volatile copy-on-write for concurrent read access
  • Crash safety: Catalog saved before values (ID gaps are safe, ID reuse is not); compaction deletes old files only after catalog is persisted; namespaces/values always persisted on flush even when memTable is empty
  • Context ID correctness: Fixed QuadIndex range scans treating context=0 (default graph) as wildcard
  • Stats accuracy: QuadStats now filters tombstone entries for more precise pruning
  • Atomic file writes: Both FileSystemObjectStore and L2DiskCache use temp-file-then-rename pattern
  • Removed dead config: quadIndexes, blockSize, valueCacheSize, valueIdCacheSize removed from schema
  • Unified sort order source of truth: ALL_INDEXES derives from SortOrder.values()
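The temp-file-then-rename pattern mentioned above can be sketched with java.nio — a minimal version, with illustrative names (the real FileSystemObjectStore and L2DiskCache implementations are in the PR):

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of an atomic file write: write to a sibling temp file, then move
// it into place. Readers never observe a partially written file because
// the rename is atomic on the same filesystem.
public class AtomicWrite {
    static void write(Path target, byte[] bytes) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.createDirectories(target.getParent());
        Files.write(tmp, bytes);
        try {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (AtomicMoveNotSupportedException e) {
            // Fall back to a plain replace where atomic moves are unavailable.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```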

Phase 5 — Performance Optimizations

  • Parallel Parquet writes: 3 files per flush written concurrently via CompletableFuture with a shared thread pool
  • Background compaction: Compaction runs asynchronously in a single-thread executor, never blocking the flush path
  • Streaming Parquet reads: ParquetQuadSource streams rows lazily one row group at a time instead of loading entire files into memory
  • Parquet row group filtering: Column statistics (min/max) checked before decoding each row group — skips row groups that cannot match the query
  • Bloom filters: Per-file bloom filter on the leading sort component (subject for SPOC, object for OPSC, context for CSPO). Checked in mayContain() before loading any file data
  • Catalog-based cardinality estimates: S3EvaluationStatistics resolves bound values to IDs, sums row counts from matching files (via min/max + bloom filter pruning), enabling better query planning
  • getContextIDs optimization: Uses CSPO index with last-seen dedup instead of full scan + HashSet
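The per-file bloom filter idea can be sketched as a bit-array filter over the leading sort component's IDs. This sketch uses two mixes of a 64-bit hash combined Kirsch-Mitzenmacher style; the PR uses murmur hashing, and all names here are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Minimal bit-array Bloom filter sketch. add() sets k bits derived from
// the value ID; mayContain() returns false only when the ID is definitely
// absent, letting the read path skip a whole file without loading it.
public class BloomSketch {
    private final long[] bits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new long[(numBits + 63) / 64];
        this.numHashes = numHashes;
    }

    private static long mix(long x) { // murmur-finalizer-style 64-bit mix
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL;
        x ^= x >>> 33; x *= 0xC4CEB9FE1A85EC53L;
        return x ^ (x >>> 33);
    }

    void add(long id) {
        long h1 = mix(id), h2 = mix(h1);
        for (int i = 0; i < numHashes; i++) {
            long bit = Long.remainderUnsigned(h1 + i * h2, (long) bits.length * 64);
            bits[(int) (bit >>> 6)] |= 1L << (bit & 63);
        }
    }

    boolean mayContain(long id) {
        long h1 = mix(id), h2 = mix(h1);
        for (int i = 0; i < numHashes; i++) {
            long bit = Long.remainderUnsigned(h1 + i * h2, (long) bits.length * 64);
            if ((bits[(int) (bit >>> 6)] & (1L << (bit & 63))) == 0) return false;
        }
        return true;
    }

    // Base64 of the packed bit array is enough to round-trip through the
    // JSON catalog, as the test plan below mentions.
    String serialize() {
        ByteBuffer buf = ByteBuffer.allocate(bits.length * 8);
        buf.asLongBuffer().put(bits);
        return Base64.getEncoder().encodeToString(buf.array());
    }
}
```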

Read Path

flowchart TD
    A[getStatements\nsubj, pred, obj, ctx] --> B[Resolve Value → ID\nvia ValueStore]
    B --> C{ID found?}
    C -- No --> D[Return EmptyIteration]
    C -- Yes --> E[Select best QuadIndex\nfor query pattern]
    E --> F[Build source list]

    F --> G[MemTable.asRawSource\nre-encode keys in best index order]
    F --> H[Catalog: get files\nfor selected sort order]

    H --> I{Per-file pruning\nmin/max + bloom filter}
    I -- Outside range --> J[Skip file]
    I -- May contain --> K[Load file via\nTieredCache]

    K --> L{L1 Caffeine\nheap cache}
    L -- Hit --> M[Return bytes]
    L -- Miss --> N{L2 disk\nLRU cache}
    N -- Hit --> M
    N -- Miss --> O[L3 S3 fetch]
    O --> M

    M --> P[ParquetQuadSource\nstream rows + row group filtering]

    G --> Q[MergeIterator\nK-way merge, dedup, tombstone suppression]
    P --> Q
    Q --> R[QuadToStatementIteration\nID → Value resolution]
    R --> S[Statement stream]
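The K-way merge, dedup, and tombstone-suppression step at the bottom of the diagram can be sketched with a priority queue. Types and names are illustrative, and the actual MergeIterator streams lazily rather than materializing a list; the key idea shown is that on duplicate keys the newest source wins, and a winning tombstone suppresses the entry entirely:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a K-way merge over sorted sources. Lower source index = newer
// data (e.g. MemTable before L0 before L1), so on equal keys the newest
// entry is emitted first and older duplicates are dropped.
public class MergeSketch {
    record Entry(byte[] key, boolean tombstone) {}

    static List<Entry> merge(List<Iterator<Entry>> sources, Comparator<byte[]> cmp) {
        record Head(Entry entry, int source, Iterator<Entry> it) {}
        PriorityQueue<Head> heap = new PriorityQueue<>(
                Comparator.<Head, byte[]>comparing(h -> h.entry().key(), cmp)
                        .thenComparingInt(Head::source));
        for (int i = 0; i < sources.size(); i++) {
            Iterator<Entry> it = sources.get(i);
            if (it.hasNext()) heap.add(new Head(it.next(), i, it));
        }
        List<Entry> out = new ArrayList<>();
        byte[] lastKey = null;
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            if (lastKey == null || cmp.compare(h.entry().key(), lastKey) != 0) {
                lastKey = h.entry().key();
                if (!h.entry().tombstone()) out.add(h.entry()); // suppress deletes
            } // else: older duplicate of an already-decided key — drop it
            if (h.it().hasNext()) heap.add(new Head(h.it().next(), h.source(), h.it()));
        }
        return out;
    }
}
```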

Insert + Commit Path

flowchart TD
    A[approve / approveAll\nsubj, pred, obj, ctx] --> B[Store values → IDs\nvia ValueStore]
    B --> C[MemTable.put\ns, p, o, c, flag]
    C --> D{MemTable size\n≥ flush threshold?}
    D -- No --> E[Return — buffered in memory]
    D -- Yes --> F[Trigger flush]

    F --> G[Freeze MemTable\nswap in fresh one]
    G --> H[Collect all quads\ncompute min/max stats]

    H --> I[Parallel: for each sort order\nSPOC, OPSC, CSPO]
    I --> J[Sort entries by index]
    J --> K[ParquetFileBuilder.build\nZSTD + dictionary encoding]
    J --> K2[Build bloom filter\nfor leading component]
    K --> L[objectStore.put → S3]
    K --> M[cache.writeThrough\nL1 + L2 populated]
    L --> N[Catalog.addFile\nwith stats + bloom filter]

    N --> O[Persist catalog first\nthen ValueStore + NamespaceStore]
    O --> P[Catalog.save\natomic version bump]

    P --> Q{Compaction\ntriggers?}
    Q -- L0 count ≥ 8 --> R[Background: L0 → L1 merge]
    Q -- L1 count ≥ 4 --> S[Background: L1 → L2 merge]
    Q -- No --> T[Done]
    R --> S
    S --> T

    subgraph Transaction Commit
        U[SailSink.flush] --> F
    end
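The parallel step in the flush path above — one Parquet build per sort order, run concurrently, with the catalog updated only after all three complete — can be sketched with CompletableFuture. buildFile is a stand-in for the PR's ParquetFileBuilder; all names are illustrative:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Sketch of the parallel flush: three Parquet files (one per sort order)
// are built concurrently on a shared pool, and the caller waits for all
// of them before persisting the catalog.
public class ParallelFlushSketch {
    static byte[] buildFile(String sortOrder) {
        return sortOrder.getBytes(); // stand-in for real Parquet bytes
    }

    static List<byte[]> flush(ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = List.of("SPOC", "OPSC", "CSPO")
                .stream()
                .map(order -> CompletableFuture.supplyAsync(() -> buildFile(order), pool))
                .toList();
        // Join all three files before any catalog update happens.
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
        return futures.stream().map(CompletableFuture::join).toList();
    }
}
```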

S3 Layout

{s3Prefix}/catalog/current               → pointer to latest catalog JSON
{s3Prefix}/catalog/v{epoch}.json         → catalog metadata (includes bloom filters)

{s3Prefix}/data/
    L0-{epoch}-SPOC.parquet              → Level-0 flush (SPOC sorted)
    L0-{epoch}-OPSC.parquet              → Level-0 flush (OPSC sorted)
    L0-{epoch}-CSPO.parquet              → Level-0 flush (CSPO sorted)
    L1-{epoch}-{sort}.parquet            → Level-1 compacted
    L2-{epoch}-{sort}.parquet            → Level-2 fully compacted

{s3Prefix}/values/current                → value store
{s3Prefix}/namespaces/current            → namespace store

Key Files

File                                  Purpose
S3SailStore.java                      Main store — flush, read, compaction orchestration
storage/ParquetFileBuilder.java       Writes byte[] Parquet from quad entries
storage/ParquetQuadSource.java        Streaming RawEntrySource over Parquet files with row group filtering
storage/MergeIterator.java            K-way merge with dedup, tombstone suppression, and source cleanup
storage/Catalog.java                  JSON catalog with per-file column statistics and bloom filters
storage/Compactor.java                LSM-tree merge compaction with bloom filter generation
storage/BloomFilter.java              Bit-array bloom filter with murmur hashing and Base64 serialization
storage/QuadIndex.java                Index permutations, pattern scoring, key encode/decode
storage/QuadEntry.java                Top-level record for Parquet row data
storage/FileSystemObjectStore.java    Local filesystem backend with atomic writes
storage/SimpleCodecFactory.java       ZSTD codec without Hadoop
cache/TieredCache.java                Unified L1+L2+L3 cache facade
config/S3StoreConfig.java             Sail config with env var fallback for S3 connection
S3EvaluationStatistics.java           Catalog-based cardinality estimation for query planning

Three storage modes: S3 (bucket configured), local filesystem (dataDir configured), or in-memory (neither configured).

Test plan

  • 566 tests pass (SAIL compliance + unit + integration)
  • S3PersistenceTest — write/flush/restart, multiple flushes, delete/restart, context graphs, sort orders, namespace persistence
  • CatalogTest, MemTableReorderTest, ParquetRoundTripTest, QuadIndexSelectionTest, MergeIteratorTest — unit tests
  • S3PersistenceMinioIT — integration test against real MinIO via Testcontainers
  • Zero Hadoop JARs in dependency tree verified
  • Workbench form renders and creates S3 Store repositories
  • Thread safety: Catalog volatile copy-on-write pattern
  • Crash safety: Catalog-first persistence ordering, atomic file writes
  • Parallel Parquet writes produce correct output
  • Background compaction runs without blocking flush
  • Streaming reads — large files don't spike heap
  • Bloom filter serialization round-trips through catalog JSON

🤖 Generated with Claude Code

Future Work

  • Multi-node support: Single writer + multiple read replicas with S3-based leader election. Reader nodes poll the catalog for updates; the writer acquires an S3 lease before flushing. All coordination via S3 (no external database dependency).

odysa and others added 2 commits on February 20, 2026 at 19:43
Introduce rdf4j-sail-s3, an S3-backed SAIL using LSM-tree architecture
adapted from RisingWave's Hummock engine. This commit implements the
module skeleton and in-memory storage layer:

- Config: S3StoreConfig, S3StoreFactory, S3StoreSchema
- Storage: Varint encoding, QuadIndex permutations, MemTable (ConcurrentSkipListMap)
- Value/NS: S3ValueStore (ConcurrentHashMap ID mapping), S3NamespaceStore
- Core SAIL: S3Store, S3StoreConnection, S3SailStore with SailSource/Sink/Dataset
- SPI registration via META-INF/services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persistence layer so MemTables flush to immutable SSTable files on
S3-compatible storage. When s3Bucket is not configured the store stays
in pure in-memory mode and all existing tests remain unaffected.

Key additions:
- ObjectStore interface with S3ObjectStore (MinIO) and
  FileSystemObjectStore (test double)
- SSTableWriter/SSTable: binary format with block index for range scans
- MergeIterator: K-way merge across MemTable + SSTables with
  deduplication and tombstone suppression
- Manifest: versioned JSON manifest tracking SSTables on S3
- S3ValueStore and S3NamespaceStore serialization/deserialization
- S3StoreConfig: S3 connectivity properties (bucket, endpoint, etc.)
- S3SailStore: flush path, merged read path, startup loading
- 29 new tests (unit + persistence); 541 total tests pass
@kenwenzel (Contributor):

@odysa Thank you - this is really cool.
I would propose using Parquet instead of a custom SST format. We've had good experiences using Parquet to store time-series data indexed by subject, property, and time:
https://github.com/linkedfactory/linkedfactory-pod/tree/main/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet

By using Parquet you get bloom filters, filter pushdown and advanced compression (dictionary and others) for free.
The only problem is that you need to implement a custom block cache that is currently not included in Parquet for Java:
apache/parquet-java#3006

Best regards,
Ken

…iered cache (Phase 2)

Replace custom SSTable binary format with Apache Parquet columnar storage,
introduce vertical partitioning by predicate, and add a three-tier cache
(Caffeine heap -> local disk LRU -> S3).

Storage redesign:
- Parquet files on S3 with ZSTD compression and dictionary encoding
- Predicate-based partitioning (data/predicates/{id}/) eliminates
  predicate column from files, tightening column statistics
- Three sort orders per partition (SOC, OSC, CSO) for optimal query
  performance regardless of access pattern
- Single MemTable in SPOC order, partitioned on flush
- JSON catalog with per-file column statistics for catalog-level pruning

Cache system:
- L1: Caffeine heap cache (configurable, default 256 MB)
- L2: Local disk LRU cache (configurable, default 10 GB)
- L3: S3 source of truth
- Write-through on flush avoids cold reads

Compaction:
- L0->L1 merge when epoch count >= 8 per predicate
- L1->L2 merge when epoch count >= 4 per predicate
- Tombstone suppression at highest level

Hadoop dependency elimination:
- Zero Hadoop JARs in dependency tree
- PlainParquetConfiguration + custom SimpleCodecFactory bypass all
  Hadoop runtime paths
- 14 minimal stub classes in org.apache.hadoop.* satisfy parquet-hadoop
  JVM class loading requirements

Deleted: SSTable, SSTableWriter, Manifest (replaced by Parquet + Catalog)
All 529 tests pass.
@odysa odysa changed the title from "feat: add S3 SAIL with LSM-tree storage engine" to "feat: add S3 SAIL with Parquet columnar storage engine" on Feb 24, 2026
@kenwenzel (Contributor):

@odysa That looks promising. Do you already have any benchmarks?
I'm also a bit concerned about the predicate-based partitioning. I think this only works well if you have (very) few individual predicates.
I've also made the observation that caching Parquet footers and statistics can make a significant difference in performance.

@odysa (Contributor, Author) commented Feb 24, 2026

@kenwenzel Thank you for the comments. It's not complete yet, just experimental.
My intuition was that the number of predicates is relatively small compared to subjects/objects. Let me research this further.

@kenwenzel (Contributor):

@odysa Maybe you also like to take a look at the QLever paper for some inspiration:
https://ad-publications.cs.uni-freiburg.de/CIKM_qlever_BB_2017.pdf

Does the store also work with local files?

@JervenBolleman (Contributor):

Great idea, I really like it. Could access to the data via S3 be hidden behind an interface, allowing either a local filesystem or an S3 store?

… (Phase 3)

Replace predicate partitioning with flat files and per-file min/max stats
for pruning. Add S3 Store to RDF4J Workbench with creation form and TTL
config template. S3 connection settings (bucket, endpoint, credentials)
resolve from environment variables (RDF4J_S3_*) or system properties so
multiple repositories share a single bucket, each isolated by s3Prefix.
@odysa (Contributor, Author) commented Feb 27, 2026

@kenwenzel @JervenBolleman

I was testing locally with MinIO, but using the local filesystem is actually a big plus too. Let me add that

- Promote FileSystemObjectStore from test to production, enabling
  3-mode backend selection (S3 / filesystem / in-memory) via config
- Extract QuadStats value type to deduplicate stats computation
  in S3SailStore and Compactor
- Add QuadIndex.matches() helper, eliminating 4-place quad-filter
  duplication across MergeIterator, MemTable, and ParquetQuadSource
- Extract hasPersistence(), queryQuads(), resolveValueId() helpers
  in S3SailStore to remove repeated guard logic
- Split flushToObjectStore() into focused methods
- Merge CompactionPolicy.shouldCompactL0/L1 into shouldCompact()
- Remove unused explicit param from MemTable.remove()
- Delete dead ParquetFilterBuilder (zero usages)
- Fix QuadIndex wildcard sentinel inconsistency in getMaxKey
- Narrow Throwable catch to Exception in S3Store
- Add dataDir config field for filesystem persistence mode
- Catalog: volatile copy-on-write for thread-safe concurrent reads
- S3SailStore: save catalog before values for crash-safe ordering;
  persist namespaces/values even when memTable is empty;
  delete old compaction files only after catalog is saved
- QuadIndex: fix context=0 treated as wildcard in range scans
- QuadStats: filter tombstones from stats computation
- FileSystemObjectStore: atomic writes via temp-file-then-rename
- L2DiskCache: volatile lastAccessNanos, synchronized eviction
- Remove dead config fields (quadIndexes, blockSize, valueCacheSize,
  valueIdCacheSize) from S3StoreConfig and S3StoreSchema
- Unify ALL_INDEXES with SortOrder.values() single source of truth
- Fix inline FQNs and wildcard imports across main and test sources
- Fix stale javadocs in CompactionPolicy, MemTable, S3SailDataset
- Eliminate double serialization of values/namespaces on flush
- Make MemTable.approximateSizeInBytes() O(1) via AtomicLong counter
- Precompute field indices in sort comparator to avoid hot-loop switch
- Remove duplicate rowGroupSize/pageSize fields from S3SailStore/Compactor
- Centralize storage key literals into named constants
- Add named type discriminator constants in S3ValueStore
- Unify QuadStats accumulation with shared Accumulator inner class
- Centralize data key generation in Catalog.dataKey()
- Restrict ParquetFileInfo 14-param constructor to package-private
@JervenBolleman (Contributor):

FYI: https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/ might be easier to have as a dependency than parquet-java.

@odysa (Contributor, Author) commented Mar 3, 2026

@JervenBolleman Looks promising, but it does not support writing yet. From the Hardwood announcement:

"For the time after the 1.0 final release, we are planning to add support for writing Parquet files, and we may provide a CLI tool for inspecting and analyzing Parquet files"

odysa added 2 commits on March 4, 2026 at 17:14
…y estimates

- Parallelize Parquet writes with CompletableFuture (3 files written concurrently)
- Optimize getContextIDs() to use CSPO index with last-seen dedup
- Move compaction to background single-thread executor
- Implement catalog-based cardinality estimation in S3EvaluationStatistics
- Refactor ParquetQuadSource to stream rows lazily instead of loading all into memory
- Add row group filtering using Parquet column statistics (min/max)
- Add BloomFilter class for leading-component filtering per Parquet file
- Add close() to RawEntrySource; MergeIterator closes sources when exhausted
Consolidate duplicated buildBloomFilter and leading-component switch
logic from S3SailStore and Compactor into BloomFilter. Replace
hand-rolled byte serialization with ByteBuffer. Merge queryQuads
overloads and eliminate double shouldCompact evaluation in compaction.
@odysa odysa force-pushed the feature/s3-sail branch from cc3be73 to 89a73e1 on March 8, 2026 at 16:47