Replies: 5 comments 1 reply
-
More and more I wonder what the difference is between a stable row id and just having a column named "id" with a scalar index. This feels like it is closing the gap even more.
-
I think there are two separate questions here:

1. Is `stable_row_id` worth using as a globally unique identifier, compared to a custom `id` column with a scalar index?
2. If so, should Lance support reserving (pre-allocating) those IDs before the data is written?

For the first question, I think the answer is yes for workloads like mine. My records may be written concurrently, sometimes even across multiple nodes, so I need a centralized way to assign globally unique IDs. In that setup, `stable_row_id` is a natural fit. This is similar to how many systems use database-generated auto-increment IDs as their primary key instead of pushing global ID coordination into application logic. In my use case, if I chose a custom `id` column plus a scalar index, I would need some kind of extra infrastructure. I think it would work, but it leaves me an optimize module to maintain that I'm not familiar with. Do you have some good practices for this? @westonpace

The second question is built on top of the first. Once `stable_row_id` is accepted as a usable global identifier, pre-allocation becomes valuable because it lets related writes reference those IDs before the final append. This is also a common database pattern: PostgreSQL and Oracle both allow fetching sequence values before insert, and SQL Server even provides `sp_sequence_get_range` to reserve a block of IDs in advance. References:
Pre-allocation is not strictly required for a database to support system-generated IDs. For example, SQLite supports auto-generated integer IDs, but it does not provide a built-in way to reserve IDs before insert. I think the important requirement is that Lance should support at least one of these two capabilities:

1. Return the system-generated IDs to the caller after insert.
2. Allow reserving IDs before insert (pre-allocation).

All databases I know support the first model. Only some support the second. In my case, pre-allocation is the better fit because I need to use those IDs to establish relationships before inserting the records. I could also work with post-insert generated IDs, but that would require additional write/read steps and more table operations. For Lance specifically, I think pre-allocation might be the more reasonable direction. It appears simpler to implement and easier to reason about, especially since we already have the reserved fragment ID operation.
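The first model (IDs handed back after insert) is easy to see in SQLite, which is discussed elsewhere in this thread. A minimal sketch using Python's built-in `sqlite3` module:

```python
import sqlite3

# In-memory SQLite database: IDs are system-generated on insert,
# and only become known to the caller *after* the insert (model 1).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (id INTEGER PRIMARY KEY, payload BLOB)")

cur = conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"first",))
first_id = cur.lastrowid   # the ID is only available post-insert

cur = conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"second",))
second_id = cur.lastrowid

print(first_id, second_id)  # 1 2
```

The second model has no counterpart here: SQLite exposes no way to ask for `first_id` before the `INSERT` runs, which is exactly the gap pre-allocation fills.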
-
Yes, sorry, I feel somewhat guilty now, as my comment was glib and unhelpful. Thank you for providing the extra information. I agree that reserving a block of IDs is useful, and I'll add a review to the PR soon (still catching up on thinking it through). It would be good to start documenting what users need and expect from primary keys and sequences. Can we have multiple sequences in a table? Are they auto-incrementing counters, or can we do UUIDs? Etc. I think stable row id is a complicated middle ground between a sequence, some kind of "range index", an "update index on write", and a "unique constraint". It might be good someday to expand these into their own concepts.
-
No worries, your comment helped clarify the tradeoff, and I really appreciate it. I did consider the UUID direction first, but for a workload dominated by updates and random reads, it makes index maintenance part of the correctness path and adds nontrivial overhead, especially on updates. So from my perspective, this is fundamentally a systems tradeoff. SQLite is a useful comparison here: I could use the internal rowid as the primary key, like this:

```sql
CREATE TABLE observation (
  id INTEGER PRIMARY KEY,
  payload BLOB
);
```

The `id` column here is an alias for SQLite's internal rowid. My view is:
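It is easy to verify that `INTEGER PRIMARY KEY` really is the internal rowid rather than a separately indexed column, which is what makes it a fair analogy for a stable row id. A quick check via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"x",))

# `id` is an alias for SQLite's internal rowid: selecting both
# returns the same value for every row.
row = conn.execute("SELECT rowid, id FROM observation").fetchone()
print(row)  # (1, 1)
```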
-
The general approach makes sense to me for table row IDs. Just curious, have we researched the Delta Lake identity column feature? I know it exists, but I don't know if it is part of the open-source spec or only in the commercial offering. @wjones127 might know more: https://docs.databricks.com/aws/en/delta/generated-columns#use-identity-columns-in-delta-lake

```sql
CREATE TABLE table_name (
  id_col1 BIGINT GENERATED ALWAYS AS IDENTITY,
  id_col2 BIGINT GENERATED ALWAYS AS IDENTITY (START WITH -1 INCREMENT BY 1),
  id_col3 BIGINT GENERATED BY DEFAULT AS IDENTITY,
  id_col4 BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH -1 INCREMENT BY 1)
)
```

If implementing it is not hard, we might want to consider doing it in a more generic way instead of doing it just for stable row IDs.
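The two flavors in the DDL above differ in whether the caller may supply a value. A toy Python model of that semantic (the `IdentityColumn` class is purely illustrative, not any Delta Lake or Lance API):

```python
# Toy model of the two identity-column flavors:
# "ALWAYS" rejects user-supplied values; "BY DEFAULT" allows overriding.
class IdentityColumn:
    def __init__(self, mode, start=1, increment=1):
        assert mode in ("ALWAYS", "BY DEFAULT")
        self.mode = mode
        self.next = start
        self.increment = increment

    def value_for_insert(self, user_value=None):
        if user_value is not None:
            if self.mode == "ALWAYS":
                raise ValueError("cannot insert into GENERATED ALWAYS column")
            return user_value  # BY DEFAULT: the caller's value wins
        v = self.next
        self.next += self.increment
        return v

always = IdentityColumn("ALWAYS", start=-1, increment=1)
by_default = IdentityColumn("BY DEFAULT")

print(always.value_for_insert())        # -1
print(always.value_for_insert())        # 0
print(by_default.value_for_insert(42))  # 42
print(by_default.value_for_insert())    # 1
```

A reserved stable row ID range behaves closest to the "BY DEFAULT" flavor: the system generates IDs, but a write is allowed to carry pre-assigned ones.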
-
Motivation
My use case is to use `stable_row_id` as a business-facing primary key, not just an internal storage identifier. For that to work, the ID needs to be known before data is written, so downstream records can reference it during the same write pipeline.

Today, Lance does not provide a suitable interface to obtain the row IDs for data being written ahead of time. That becomes a problem when one logical entity needs to be linked across different schemas or tables.
A concrete example is a setup with three tables:
- `observation`
- `session`
- `index`

The `index` table points to both `observation` and `session`, so it needs to store the primary keys of rows in those two tables. In this design, I want to use `stable_row_id` as the primary key because it allows a direct `take` by row ID, instead of maintaining an additional secondary index just to map from an application key back to rows. The current blocker is that these row IDs are not available at the time the related rows are being prepared and written.

So the core requirement is: reserve stable row IDs before append, then use those IDs as cross-table references when writing related data.
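The desired flow can be sketched end to end with in-memory stand-ins: plain dicts for the three tables and a counter for the reservation. Note that `reserve_row_ids` here is a hypothetical helper, not an existing Lance API:

```python
# Sketch of the reserve-then-reference flow, with in-memory stand-ins.
# `reserve_row_ids` is a hypothetical helper, not an existing Lance API.
_next_row_id = 0

def reserve_row_ids(num_rows):
    """Reserve a contiguous block of row IDs before any data is written."""
    global _next_row_id
    start = _next_row_id
    _next_row_id += num_rows
    return range(start, start + num_rows)

observation, session, index = {}, {}, {}

# Step 1: reserve IDs for the rows we are about to write.
obs_ids = reserve_row_ids(2)
ses_ids = reserve_row_ids(1)

# Step 2: build the cross-referencing rows *before* the final append,
# then write all three tables in the same pipeline.
for i in obs_ids:
    observation[i] = {"payload": f"obs-{i}"}
for i in ses_ids:
    session[i] = {"payload": f"ses-{i}"}
index["key-1"] = {"observation_id": obs_ids[0], "session_id": ses_ids[0]}

print(index["key-1"])  # {'observation_id': 0, 'session_id': 2}
```

The point is that `index` can be written in the same batch as `observation` and `session`, because the IDs it references were fixed before any append happened.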
Proposal
This change adds a two-step flow for stable row IDs:

1. Reserve a contiguous range of row IDs as its own transaction.
2. Append data that explicitly carries those reserved IDs.
To support this, we introduce a new transaction type, `ReserveRowIds { num_rows }`, plus a `ReservedRowIds { start_row_id, num_rows }` struct to represent the reserved range explicitly.

On the write path, `Operation::Append` is extended to optionally carry a reserved row ID range, and `InsertBuilder::with_row_ids(...)` is added so callers can append data with pre-reserved IDs. `Dataset::reserved_row_ids()` exposes the reservation attached to the current dataset version when that version was produced by `ReserveRowIds`.

The append validation path ensures that:

- `read_version` points to the reserve snapshot

The transaction and protobuf formats are updated accordingly, so both reservation transactions and append-time row ID assignments are persisted.

Conflict handling is also updated so concurrent appends using reserved row IDs are checked for overlap.