
feat: add S3 SAIL with Parquet columnar storage engine#5696

Draft
odysa wants to merge 9 commits into eclipse-rdf4j:develop from odysa:feature/s3-sail

Conversation

@odysa (Contributor) commented Feb 23, 2026

Summary

Adds an experimental SAIL implementation (rdf4j-sail-s3) that stores RDF data on S3-compatible object storage using an LSM-tree architecture with Apache Parquet as the storage format.

Phase 1 — Foundation

  • Phase 1a: Module skeleton with SPI wiring (S3Store, S3StoreConfig, S3StoreFactory)
  • Phase 1b: In-memory storage engine with MemTable (sorted skip list), QuadIndex (configurable index permutations), and Varint encoding
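The Varint encoding used by the key layer can be illustrated with a standard LEB128-style codec — a minimal sketch, assuming a plain unsigned variable-length integer; the class and method names here are illustrative, not the PR's actual API:

```java
import java.nio.ByteBuffer;

// Sketch of an unsigned LEB128-style varint codec, as commonly used for
// compact ID encoding in LSM keys. Illustrative, not the PR's exact API.
public class Varint {
    // Write 7 bits per byte, high bit set on every byte except the last.
    static void encode(long value, ByteBuffer out) {
        while ((value & ~0x7FL) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static long decode(ByteBuffer in) {
        long result = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(10);
        encode(300, buf);
        buf.flip();
        System.out.println(decode(buf)); // prints 300
    }
}
```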

Phase 2 — Parquet + Tiered Cache

  • Storage format: Apache Parquet with ZSTD compression and dictionary encoding, replacing custom SSTables
  • Three sort orders per epoch (SPOC, OPSC, CSPO): always picks optimal sort order for any query pattern
  • Single MemTable in SPOC order, writes 3 files per flush
  • JSON catalog with per-file min/max statistics (s, p, o, c) for stats-based pruning before opening any Parquet file
  • Three-tier cache: L1 Caffeine heap (256 MB default) → L2 local disk LRU (10 GB default) → L3 S3
  • Write-through caching: flushed files populate L1+L2 immediately, avoiding cold reads
  • Compaction: L0→L1 merge (8 epoch threshold), L1→L2 merge (4 epoch threshold), tombstone suppression at highest level
  • Zero Hadoop JARs: PlainParquetConfiguration + custom SimpleCodecFactory (zstd-jni) bypass all Hadoop runtime paths. 14 minimal stub classes satisfy parquet-java's JVM class loading requirements
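The three-tier lookup order and the write-through behavior described above can be sketched as follows. The PR uses Caffeine for L1 and a disk LRU for L2; this dependency-free sketch stands in plain maps for both tiers, and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Read-through sketch of the L1 -> L2 -> L3 lookup order. The real store
// uses Caffeine (byte-weighted) for L1 and a disk LRU for L2; plain maps
// stand in here so the sketch stays self-contained.
public class TieredCacheSketch {
    private final Map<String, byte[]> l1 = new LinkedHashMap<>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
            return size() > 1024; // stand-in for a byte-weighted heap limit
        }
    };
    private final Map<String, byte[]> l2;           // stand-in for the disk cache
    private final Function<String, byte[]> l3Fetch; // stand-in for an S3 GET

    TieredCacheSketch(Map<String, byte[]> l2, Function<String, byte[]> l3Fetch) {
        this.l2 = l2;
        this.l3Fetch = l3Fetch;
    }

    byte[] get(String key) {
        byte[] bytes = l1.get(key);
        if (bytes != null) return bytes;            // L1 hit
        bytes = l2.get(key);
        if (bytes == null) {
            bytes = l3Fetch.apply(key);             // L3: source of truth
            l2.put(key, bytes);                     // backfill the disk tier
        }
        l1.put(key, bytes);                         // promote to the heap tier
        return bytes;
    }

    // Write-through on flush: populate both cache tiers immediately so a
    // freshly flushed file never causes a cold S3 read.
    void writeThrough(String key, byte[] bytes) {
        l2.put(key, bytes);
        l1.put(key, bytes);
    }
}
```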

Phase 3 — Workbench UI + Instance-level S3 Config + Local Filesystem Backend

  • Workbench integration: S3 Store option in repository creation dropdown, create-s3.xsl form, s3.ttl config template
  • Instance-level S3 connection: Bucket, endpoint, region, credentials, and force-path-style are configured once per RDF4J instance via environment variables (RDF4J_S3_BUCKET, RDF4J_S3_ENDPOINT, RDF4J_S3_REGION, RDF4J_S3_ACCESS_KEY, RDF4J_S3_SECRET_KEY, RDF4J_S3_FORCE_PATH_STYLE) or system properties (rdf4j.s3.*)
  • Per-repo isolation via prefix: Each repository specifies only s3Prefix — all data is namespaced under that prefix within the shared bucket
  • Three storage modes: S3 persistence (bucket configured), local filesystem persistence (dataDir configured), or pure in-memory (neither configured)
  • FileSystemObjectStore: Production-ready local filesystem backend with atomic writes
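The instance-level connection lookup can be sketched as below. The property and environment-variable names follow the PR description (rdf4j.s3.* / RDF4J_S3_*); the helper itself, and the system-property-first precedence, are assumptions for illustration:

```java
// Sketch of resolving one S3 connection setting from a system property
// with an environment-variable fallback. The precedence shown (property
// first, then env var, then default) is an assumption, not confirmed by
// the PR.
public class S3ConnectionSettings {
    static String resolve(String property, String envVar, String defaultValue) {
        String value = System.getProperty(property);
        if (value == null || value.isEmpty()) {
            value = System.getenv(envVar);
        }
        return (value == null || value.isEmpty()) ? defaultValue : value;
    }

    public static void main(String[] args) {
        // Each repository then only supplies s3Prefix in its own config.
        String bucket = resolve("rdf4j.s3.bucket", "RDF4J_S3_BUCKET", null);
        String region = resolve("rdf4j.s3.region", "RDF4J_S3_REGION", "us-east-1");
        System.out.println("bucket=" + bucket + " region=" + region);
    }
}
```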

Phase 4 — Safety + Code Quality

  • Thread safety: Catalog uses volatile copy-on-write for concurrent read access
  • Crash safety: Catalog saved before values (ID gaps are safe, ID reuse is not); compaction deletes old files only after catalog is persisted; namespaces/values always persisted on flush even when memTable is empty
  • Context ID correctness: Fixed QuadIndex range scans treating context=0 (default graph) as wildcard
  • Stats accuracy: QuadStats now filters tombstone entries for more precise pruning
  • Atomic file writes: Both FileSystemObjectStore and L2DiskCache use temp-file-then-rename pattern
  • Removed dead config: quadIndexes, blockSize, valueCacheSize, valueIdCacheSize removed from schema
  • Unified sort order source of truth: ALL_INDEXES derives from SortOrder.values()
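The temp-file-then-rename pattern mentioned above can be sketched with java.nio — a minimal version, with illustrative names (the real FileSystemObjectStore and L2DiskCache implementations are in the PR):

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of an atomic file write: write to a sibling temp file, then move
// it into place. Readers never observe a partially written file because
// the rename is atomic on the same filesystem.
public class AtomicWrite {
    static void write(Path target, byte[] bytes) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.createDirectories(target.getParent());
        Files.write(tmp, bytes);
        try {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (AtomicMoveNotSupportedException e) {
            // Fall back to a plain replace where atomic moves are unavailable.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```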

Phase 5 — Performance Optimizations

  • Parallel Parquet writes: 3 files per flush written concurrently via CompletableFuture with a shared thread pool
  • Background compaction: Compaction runs asynchronously in a single-thread executor, never blocking the flush path
  • Streaming Parquet reads: ParquetQuadSource streams rows lazily one row group at a time instead of loading entire files into memory
  • Parquet row group filtering: Column statistics (min/max) checked before decoding each row group — skips row groups that cannot match the query
  • Bloom filters: Per-file bloom filter on the leading sort component (subject for SPOC, object for OPSC, context for CSPO). Checked in mayContain() before loading any file data
  • Catalog-based cardinality estimates: S3EvaluationStatistics resolves bound values to IDs, sums row counts from matching files (via min/max + bloom filter pruning), enabling better query planning
  • getContextIDs optimization: Uses CSPO index with last-seen dedup instead of full scan + HashSet
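The per-file bloom filter idea can be sketched as a bit-array filter over the leading sort component's IDs. This sketch uses two mixes of a 64-bit hash combined Kirsch-Mitzenmacher style; the PR uses murmur hashing, and all names here are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Minimal bit-array Bloom filter sketch. add() sets k bits derived from
// the value ID; mayContain() returns false only when the ID is definitely
// absent, letting the read path skip a whole file without loading it.
public class BloomSketch {
    private final long[] bits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new long[(numBits + 63) / 64];
        this.numHashes = numHashes;
    }

    private static long mix(long x) { // murmur-finalizer-style 64-bit mix
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL;
        x ^= x >>> 33; x *= 0xC4CEB9FE1A85EC53L;
        return x ^ (x >>> 33);
    }

    void add(long id) {
        long h1 = mix(id), h2 = mix(h1);
        for (int i = 0; i < numHashes; i++) {
            long bit = Long.remainderUnsigned(h1 + i * h2, (long) bits.length * 64);
            bits[(int) (bit >>> 6)] |= 1L << (bit & 63);
        }
    }

    boolean mayContain(long id) {
        long h1 = mix(id), h2 = mix(h1);
        for (int i = 0; i < numHashes; i++) {
            long bit = Long.remainderUnsigned(h1 + i * h2, (long) bits.length * 64);
            if ((bits[(int) (bit >>> 6)] & (1L << (bit & 63))) == 0) return false;
        }
        return true;
    }

    // Base64 of the packed bit array is enough to round-trip through the
    // JSON catalog, as the test plan below mentions.
    String serialize() {
        ByteBuffer buf = ByteBuffer.allocate(bits.length * 8);
        buf.asLongBuffer().put(bits);
        return Base64.getEncoder().encodeToString(buf.array());
    }
}
```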

Read Path

flowchart TD
    A[getStatements\nsubj, pred, obj, ctx] --> B[Resolve Value → ID\nvia ValueStore]
    B --> C{ID found?}
    C -- No --> D[Return EmptyIteration]
    C -- Yes --> E[Select best QuadIndex\nfor query pattern]
    E --> F[Build source list]

    F --> G[MemTable.asRawSource\nre-encode keys in best index order]
    F --> H[Catalog: get files\nfor selected sort order]

    H --> I{Per-file pruning\nmin/max + bloom filter}
    I -- Outside range --> J[Skip file]
    I -- May contain --> K[Load file via\nTieredCache]

    K --> L{L1 Caffeine\nheap cache}
    L -- Hit --> M[Return bytes]
    L -- Miss --> N{L2 disk\nLRU cache}
    N -- Hit --> M
    N -- Miss --> O[L3 S3 fetch]
    O --> M

    M --> P[ParquetQuadSource\nstream rows + row group filtering]

    G --> Q[MergeIterator\nK-way merge, dedup, tombstone suppression]
    P --> Q
    Q --> R[QuadToStatementIteration\nID → Value resolution]
    R --> S[Statement stream]
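The K-way merge, dedup, and tombstone-suppression step at the bottom of the diagram can be sketched with a priority queue. Types and names are illustrative, and the actual MergeIterator streams lazily rather than materializing a list; the key idea shown is that on duplicate keys the newest source wins, and a winning tombstone suppresses the entry entirely:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a K-way merge over sorted sources. Lower source index = newer
// data (e.g. MemTable before L0 before L1), so on equal keys the newest
// entry is emitted first and older duplicates are dropped.
public class MergeSketch {
    record Entry(byte[] key, boolean tombstone) {}

    static List<Entry> merge(List<Iterator<Entry>> sources, Comparator<byte[]> cmp) {
        record Head(Entry entry, int source, Iterator<Entry> it) {}
        PriorityQueue<Head> heap = new PriorityQueue<>(
                Comparator.<Head, byte[]>comparing(h -> h.entry().key(), cmp)
                        .thenComparingInt(Head::source));
        for (int i = 0; i < sources.size(); i++) {
            Iterator<Entry> it = sources.get(i);
            if (it.hasNext()) heap.add(new Head(it.next(), i, it));
        }
        List<Entry> out = new ArrayList<>();
        byte[] lastKey = null;
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            if (lastKey == null || cmp.compare(h.entry().key(), lastKey) != 0) {
                lastKey = h.entry().key();
                if (!h.entry().tombstone()) out.add(h.entry()); // suppress deletes
            } // else: older duplicate of an already-decided key — drop it
            if (h.it().hasNext()) heap.add(new Head(h.it().next(), h.source(), h.it()));
        }
        return out;
    }
}
```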

Insert + Commit Path

flowchart TD
    A[approve / approveAll\nsubj, pred, obj, ctx] --> B[Store values → IDs\nvia ValueStore]
    B --> C[MemTable.put\ns, p, o, c, flag]
    C --> D{MemTable size\n≥ flush threshold?}
    D -- No --> E[Return — buffered in memory]
    D -- Yes --> F[Trigger flush]

    F --> G[Freeze MemTable\nswap in fresh one]
    G --> H[Collect all quads\ncompute min/max stats]

    H --> I[Parallel: for each sort order\nSPOC, OPSC, CSPO]
    I --> J[Sort entries by index]
    J --> K[ParquetFileBuilder.build\nZSTD + dictionary encoding]
    J --> K2[Build bloom filter\nfor leading component]
    K --> L[objectStore.put → S3]
    K --> M[cache.writeThrough\nL1 + L2 populated]
    L --> N[Catalog.addFile\nwith stats + bloom filter]

    N --> O[Persist catalog first\nthen ValueStore + NamespaceStore]
    O --> P[Catalog.save\natomic version bump]

    P --> Q{Compaction\ntriggers?}
    Q -- L0 count ≥ 8 --> R[Background: L0 → L1 merge]
    Q -- L1 count ≥ 4 --> S[Background: L1 → L2 merge]
    Q -- No --> T[Done]
    R --> S
    S --> T

    subgraph Transaction Commit
        U[SailSink.flush] --> F
    end
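The parallel step in the flush path above — one Parquet build per sort order, run concurrently, with the catalog updated only after all three complete — can be sketched with CompletableFuture. buildFile is a stand-in for the PR's ParquetFileBuilder; all names are illustrative:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Sketch of the parallel flush: three Parquet files (one per sort order)
// are built concurrently on a shared pool, and the caller waits for all
// of them before persisting the catalog.
public class ParallelFlushSketch {
    static byte[] buildFile(String sortOrder) {
        return sortOrder.getBytes(); // stand-in for real Parquet bytes
    }

    static List<byte[]> flush(ExecutorService pool) {
        List<CompletableFuture<byte[]>> futures = List.of("SPOC", "OPSC", "CSPO")
                .stream()
                .map(order -> CompletableFuture.supplyAsync(() -> buildFile(order), pool))
                .toList();
        // Join all three files before any catalog update happens.
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
        return futures.stream().map(CompletableFuture::join).toList();
    }
}
```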

S3 Layout

{s3Prefix}/catalog/current               → pointer to latest catalog JSON
{s3Prefix}/catalog/v{epoch}.json         → catalog metadata (includes bloom filters)

{s3Prefix}/data/
    L0-{epoch}-SPOC.parquet              → Level-0 flush (SPOC sorted)
    L0-{epoch}-OPSC.parquet              → Level-0 flush (OPSC sorted)
    L0-{epoch}-CSPO.parquet              → Level-0 flush (CSPO sorted)
    L1-{epoch}-{sort}.parquet            → Level-1 compacted
    L2-{epoch}-{sort}.parquet            → Level-2 fully compacted

{s3Prefix}/values/current                → value store
{s3Prefix}/namespaces/current            → namespace store

Key Files

File                                  Purpose
S3SailStore.java                      Main store — flush, read, compaction orchestration
storage/ParquetFileBuilder.java       Writes byte[] Parquet from quad entries
storage/ParquetQuadSource.java        Streaming RawEntrySource over Parquet files with row group filtering
storage/MergeIterator.java            K-way merge with dedup, tombstone suppression, and source cleanup
storage/Catalog.java                  JSON catalog with per-file column statistics and bloom filters
storage/Compactor.java                LSM-tree merge compaction with bloom filter generation
storage/BloomFilter.java              Bit-array bloom filter with murmur hashing and Base64 serialization
storage/QuadIndex.java                Index permutations, pattern scoring, key encode/decode
storage/QuadEntry.java                Top-level record for Parquet row data
storage/FileSystemObjectStore.java    Local filesystem backend with atomic writes
storage/SimpleCodecFactory.java       ZSTD codec without Hadoop
cache/TieredCache.java                Unified L1+L2+L3 cache facade
config/S3StoreConfig.java             Sail config with env var fallback for S3 connection
S3EvaluationStatistics.java           Catalog-based cardinality estimation for query planning

Three storage modes: S3 (bucket configured), local filesystem (dataDir configured), or in-memory (neither configured).

Test plan

  • 566 tests pass (SAIL compliance + unit + integration)
  • S3PersistenceTest — write/flush/restart, multiple flushes, delete/restart, context graphs, sort orders, namespace persistence
  • CatalogTest, MemTableReorderTest, ParquetRoundTripTest, QuadIndexSelectionTest, MergeIteratorTest — unit tests
  • S3PersistenceMinioIT — integration test against real MinIO via Testcontainers
  • Zero Hadoop JARs in dependency tree verified
  • Workbench form renders and creates S3 Store repositories
  • Thread safety: Catalog volatile copy-on-write pattern
  • Crash safety: Catalog-first persistence ordering, atomic file writes
  • Parallel Parquet writes produce correct output
  • Background compaction runs without blocking flush
  • Streaming reads — large files don't spike heap
  • Bloom filter serialization round-trips through catalog JSON

🤖 Generated with Claude Code

Future Work

  • Multi-node support: Single writer + multiple read replicas with S3-based leader election. Reader nodes poll the catalog for updates; the writer acquires an S3 lease before flushing. All coordination via S3 (no external database dependency).

odysa and others added 2 commits on February 20, 2026 at 19:43
Introduce rdf4j-sail-s3, an S3-backed SAIL using LSM-tree architecture
adapted from RisingWave's Hummock engine. This commit implements the
module skeleton and in-memory storage layer:

- Config: S3StoreConfig, S3StoreFactory, S3StoreSchema
- Storage: Varint encoding, QuadIndex permutations, MemTable (ConcurrentSkipListMap)
- Value/NS: S3ValueStore (ConcurrentHashMap ID mapping), S3NamespaceStore
- Core SAIL: S3Store, S3StoreConnection, S3SailStore with SailSource/Sink/Dataset
- SPI registration via META-INF/services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persistence layer so MemTables flush to immutable SSTable files on
S3-compatible storage. When s3Bucket is not configured the store stays
in pure in-memory mode and all existing tests remain unaffected.

Key additions:
- ObjectStore interface with S3ObjectStore (MinIO) and
  FileSystemObjectStore (test double)
- SSTableWriter/SSTable: binary format with block index for range scans
- MergeIterator: K-way merge across MemTable + SSTables with
  deduplication and tombstone suppression
- Manifest: versioned JSON manifest tracking SSTables on S3
- S3ValueStore and S3NamespaceStore serialization/deserialization
- S3StoreConfig: S3 connectivity properties (bucket, endpoint, etc.)
- S3SailStore: flush path, merged read path, startup loading
- 29 new tests (unit + persistence); 541 total tests pass
@kenwenzel (Contributor):

@odysa Thank you - this is really cool.
I would propose using Parquet instead of a custom SST format. We've had good experiences using Parquet to store time-series data indexed by subject, property, and time:
https://github.com/linkedfactory/linkedfactory-pod/tree/main/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet

By using Parquet you get bloom filters, filter pushdown and advanced compression (dictionary and others) for free.
The only problem is that you need to implement a custom block cache that is currently not included in Parquet for Java:
apache/parquet-java#3006

Best regards,
Ken

…iered cache (Phase 2)

Replace custom SSTable binary format with Apache Parquet columnar storage,
introduce vertical partitioning by predicate, and add a three-tier cache
(Caffeine heap -> local disk LRU -> S3).

Storage redesign:
- Parquet files on S3 with ZSTD compression and dictionary encoding
- Predicate-based partitioning (data/predicates/{id}/) eliminates
  predicate column from files, tightening column statistics
- Three sort orders per partition (SOC, OSC, CSO) for optimal query
  performance regardless of access pattern
- Single MemTable in SPOC order, partitioned on flush
- JSON catalog with per-file column statistics for catalog-level pruning

Cache system:
- L1: Caffeine heap cache (configurable, default 256 MB)
- L2: Local disk LRU cache (configurable, default 10 GB)
- L3: S3 source of truth
- Write-through on flush avoids cold reads

Compaction:
- L0->L1 merge when epoch count >= 8 per predicate
- L1->L2 merge when epoch count >= 4 per predicate
- Tombstone suppression at highest level

Hadoop dependency elimination:
- Zero Hadoop JARs in dependency tree
- PlainParquetConfiguration + custom SimpleCodecFactory bypass all
  Hadoop runtime paths
- 14 minimal stub classes in org.apache.hadoop.* satisfy parquet-hadoop
  JVM class loading requirements

Deleted: SSTable, SSTableWriter, Manifest (replaced by Parquet + Catalog)
All 529 tests pass.
@odysa odysa changed the title from "feat: add S3 SAIL with LSM-tree storage engine" to "feat: add S3 SAIL with Parquet columnar storage engine" on Feb 24, 2026
@kenwenzel (Contributor):

@odysa That looks promising. Do you already have any benchmarks?
I'm also a bit concerned about the predicate-based partitioning. I think this only works well if you have (very) few individual predicates.
I've also made the observation that caching Parquet footers and statistics can make a significant difference in performance.

@odysa (Contributor, Author) commented Feb 24, 2026

@kenwenzel Thank you for the comments. It's not complete yet, just experimental.
My intuition was that the number of predicates is relatively small compared to subjects/objects. Let me research this further.

@kenwenzel (Contributor):

@odysa Maybe you also like to take a look at the QLever paper for some inspiration:
https://ad-publications.cs.uni-freiburg.de/CIKM_qlever_BB_2017.pdf

Does the store also work with local files?

@JervenBolleman (Contributor):

Great idea, I really like it. Could access to the data via S3 be hidden behind an interface, allowing either a local filesystem or an S3 store?

… (Phase 3)

Replace predicate partitioning with flat files and per-file min/max stats
for pruning. Add S3 Store to RDF4J Workbench with creation form and TTL
config template. S3 connection settings (bucket, endpoint, credentials)
resolve from environment variables (RDF4J_S3_*) or system properties so
multiple repositories share a single bucket, each isolated by s3Prefix.
@odysa (Contributor, Author) commented Feb 27, 2026

@kenwenzel @JervenBolleman

I was testing locally with MinIO, but using the local filesystem is actually a big plus too. Let me add that

- Promote FileSystemObjectStore from test to production, enabling
  3-mode backend selection (S3 / filesystem / in-memory) via config
- Extract QuadStats value type to deduplicate stats computation
  in S3SailStore and Compactor
- Add QuadIndex.matches() helper, eliminating 4-place quad-filter
  duplication across MergeIterator, MemTable, and ParquetQuadSource
- Extract hasPersistence(), queryQuads(), resolveValueId() helpers
  in S3SailStore to remove repeated guard logic
- Split flushToObjectStore() into focused methods
- Merge CompactionPolicy.shouldCompactL0/L1 into shouldCompact()
- Remove unused explicit param from MemTable.remove()
- Delete dead ParquetFilterBuilder (zero usages)
- Fix QuadIndex wildcard sentinel inconsistency in getMaxKey
- Narrow Throwable catch to Exception in S3Store
- Add dataDir config field for filesystem persistence mode
- Catalog: volatile copy-on-write for thread-safe concurrent reads
- S3SailStore: save catalog before values for crash-safe ordering;
  persist namespaces/values even when memTable is empty;
  delete old compaction files only after catalog is saved
- QuadIndex: fix context=0 treated as wildcard in range scans
- QuadStats: filter tombstones from stats computation
- FileSystemObjectStore: atomic writes via temp-file-then-rename
- L2DiskCache: volatile lastAccessNanos, synchronized eviction
- Remove dead config fields (quadIndexes, blockSize, valueCacheSize,
  valueIdCacheSize) from S3StoreConfig and S3StoreSchema
- Unify ALL_INDEXES with SortOrder.values() single source of truth
- Fix inline FQNs and wildcard imports across main and test sources
- Fix stale javadocs in CompactionPolicy, MemTable, S3SailDataset
- Eliminate double serialization of values/namespaces on flush
- Make MemTable.approximateSizeInBytes() O(1) via AtomicLong counter
- Precompute field indices in sort comparator to avoid hot-loop switch
- Remove duplicate rowGroupSize/pageSize fields from S3SailStore/Compactor
- Centralize storage key literals into named constants
- Add named type discriminator constants in S3ValueStore
- Unify QuadStats accumulation with shared Accumulator inner class
- Centralize data key generation in Catalog.dataKey()
- Restrict ParquetFileInfo 14-param constructor to package-private
@JervenBolleman (Contributor):

FYI: https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/ might be easier to have as a dependency than parquet-java.

@odysa (Contributor, Author) commented Mar 3, 2026

@JervenBolleman Looks promising, but it does not support writing yet. From the Hardwood announcement:

"For the time after the 1.0 final release, we are planning to add support for writing Parquet files, and we may provide a CLI tool for inspecting and analyzing Parquet files"

odysa added 2 commits on March 4, 2026 at 17:14
…y estimates

- Parallelize Parquet writes with CompletableFuture (3 files written concurrently)
- Optimize getContextIDs() to use CSPO index with last-seen dedup
- Move compaction to background single-thread executor
- Implement catalog-based cardinality estimation in S3EvaluationStatistics
- Refactor ParquetQuadSource to stream rows lazily instead of loading all into memory
- Add row group filtering using Parquet column statistics (min/max)
- Add BloomFilter class for leading-component filtering per Parquet file
- Add close() to RawEntrySource; MergeIterator closes sources when exhausted
Consolidate duplicated buildBloomFilter and leading-component switch
logic from S3SailStore and Compactor into BloomFilter. Replace
hand-rolled byte serialization with ByteBuffer. Merge queryQuads
overloads and eliminate double shouldCompact evaluation in compaction.
@odysa odysa force-pushed the feature/s3-sail branch from cc3be73 to 89a73e1 on March 8, 2026 at 16:47