Replies: 5 comments 1 reply
-
More and more I wonder what the difference is between a stable row id and just having a column named "id" with a scalar index. This feels like it is closing the gap even more.
-
I think there are two separate questions here:

1. Is `stable_row_id` worth using as a globally unique identifier, compared to a custom `id` column with a scalar index?
2. If so, should Lance support reserving (pre-allocating) those IDs before the data is written?

For the first question, I think the answer is yes for workloads like mine. My records may be written concurrently, sometimes even across multiple nodes, so I need a centralized way to assign globally unique IDs. In that setup, `stable_row_id` is a natural fit. This is similar to how many systems use database-generated auto-increment IDs as their primary key instead of pushing global ID coordination into application logic. In my use case, if I chose a custom `id` column plus a scalar index, I would need some kind of extra infrastructure. I think it would work, but it leaves me an optimize module to maintain that I'm not familiar with. Do you have some good practices for this? @westonpace

The second question is built on top of the first. Once `stable_row_id` is accepted as a usable global identifier, pre-allocation becomes valuable because it lets related writes reference those IDs before the final append. This is also a common database pattern: PostgreSQL and Oracle both allow fetching sequence values before insert, and SQL Server even provides `sp_sequence_get_range` to reserve a block of IDs in advance. References:
Pre-allocation is not strictly required for a database to support system-generated IDs. For example, SQLite supports auto-generated integer IDs, but it does not provide a built-in way to reserve IDs before insert. I think the important requirement is that Lance should support at least one of these two capabilities:

1. Return the system-generated IDs to the caller after insert.
2. Allow reserving IDs before insert (pre-allocation).

All databases I know support the first model. Only some support the second. In my case, pre-allocation is the better fit because I need to use those IDs to establish relationships before inserting the records. I could also work with post-insert generated IDs, but that would require additional write/read steps and more table operations. For Lance specifically, I think pre-allocation might be the more reasonable direction. It appears simpler to implement and easier to reason about, especially since we already have the reserved fragment ID operation.
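The first model (IDs handed back after insert) is easy to see in SQLite, which is discussed elsewhere in this thread. A minimal sketch using Python's built-in `sqlite3` module:

```python
import sqlite3

# In-memory SQLite database: IDs are system-generated on insert,
# and only become known to the caller *after* the insert (model 1).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (id INTEGER PRIMARY KEY, payload BLOB)")

cur = conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"first",))
first_id = cur.lastrowid   # the ID is only available post-insert

cur = conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"second",))
second_id = cur.lastrowid

print(first_id, second_id)  # 1 2
```

The second model has no counterpart here: SQLite exposes no way to ask for `first_id` before the `INSERT` runs, which is exactly the gap pre-allocation fills.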
-
Yes, sorry, I feel somewhat guilty now, as my comment was glib and unhelpful. Thank you for providing the extra information. I agree that reserving a block of IDs is useful, and I'll add a review to the PR soon (still catching up on thinking it through). It would be good to start documenting what users need and expect from primary keys and sequences. Can we have multiple sequences in a table? Are they auto-incrementing counters, or can we do UUIDs? Etc. I think stable row id is a complicated middle ground between a sequence, some kind of "range index", an "update index on write", and a "unique constraint". It might be good someday to expand these into their own concepts.
-
No worries, your comment helped clarify the tradeoff, and I really appreciate it. I did consider the UUID direction first, but for a workload dominated by updates and random reads, it makes index maintenance part of the correctness path and adds nontrivial overhead, especially on updates. So from my perspective, this is fundamentally a systems tradeoff. SQLite is a useful comparison here: I could use the internal rowid as the primary key, like this:

```sql
CREATE TABLE observation (
  id INTEGER PRIMARY KEY,
  payload BLOB
);
```

The `id` column here is an alias for SQLite's internal rowid. My view is:
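It is easy to verify that `INTEGER PRIMARY KEY` really is the internal rowid rather than a separately indexed column, which is what makes it a fair analogy for a stable row id. A quick check via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observation (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO observation (payload) VALUES (?)", (b"x",))

# `id` is an alias for SQLite's internal rowid: selecting both
# returns the same value for every row.
row = conn.execute("SELECT rowid, id FROM observation").fetchone()
print(row)  # (1, 1)
```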
-
The general approach makes sense to me for table row IDs. Just curious, have we researched the Delta Lake identity column feature? I know it exists, but I don't know if it is part of the open-source spec or only in the commercial offering. @wjones127 might know more: https://docs.databricks.com/aws/en/delta/generated-columns#use-identity-columns-in-delta-lake

```sql
CREATE TABLE table_name (
  id_col1 BIGINT GENERATED ALWAYS AS IDENTITY,
  id_col2 BIGINT GENERATED ALWAYS AS IDENTITY (START WITH -1 INCREMENT BY 1),
  id_col3 BIGINT GENERATED BY DEFAULT AS IDENTITY,
  id_col4 BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH -1 INCREMENT BY 1)
)
```

If implementing it is not hard, we might want to consider doing it in a more generic way instead of doing it just for stable row IDs.
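The two flavors in the DDL above differ in whether the caller may supply a value. A toy Python model of that semantic (the `IdentityColumn` class is purely illustrative, not any Delta Lake or Lance API):

```python
# Toy model of the two identity-column flavors:
# "ALWAYS" rejects user-supplied values; "BY DEFAULT" allows overriding.
class IdentityColumn:
    def __init__(self, mode, start=1, increment=1):
        assert mode in ("ALWAYS", "BY DEFAULT")
        self.mode = mode
        self.next = start
        self.increment = increment

    def value_for_insert(self, user_value=None):
        if user_value is not None:
            if self.mode == "ALWAYS":
                raise ValueError("cannot insert into GENERATED ALWAYS column")
            return user_value  # BY DEFAULT: the caller's value wins
        v = self.next
        self.next += self.increment
        return v

always = IdentityColumn("ALWAYS", start=-1, increment=1)
by_default = IdentityColumn("BY DEFAULT")

print(always.value_for_insert())        # -1
print(always.value_for_insert())        # 0
print(by_default.value_for_insert(42))  # 42
print(by_default.value_for_insert())    # 1
```

A reserved stable row ID range behaves closest to the "BY DEFAULT" flavor: the system generates IDs, but a write is allowed to carry pre-assigned ones.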
-
Motivation
My use case is to use `stable_row_id` as a business-facing primary key, not just an internal storage identifier. For that to work, the ID needs to be known before data is written, so downstream records can reference it during the same write pipeline.

Today, Lance does not provide a suitable interface to obtain the row IDs for data being written ahead of time. That becomes a problem when one logical entity needs to be linked across different schemas or tables.
A concrete example is a setup with three tables:
- `observation`
- `session`
- `index`

The `index` table points to both `observation` and `session`, so it needs to store the primary keys of rows in those two tables. In this design, I want to use `stable_row_id` as the primary key because it allows a direct `take` by row ID, instead of maintaining an additional secondary index just to map from an application key back to rows. The current blocker is that these row IDs are not available at the time the related rows are being prepared and written.

So the core requirement is: reserve stable row IDs before append, then use those IDs as cross-table references when writing related data.
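The desired flow can be sketched end to end with in-memory stand-ins: plain dicts for the three tables and a counter for the reservation. Note that `reserve_row_ids` here is a hypothetical helper, not an existing Lance API:

```python
# Sketch of the reserve-then-reference flow, with in-memory stand-ins.
# `reserve_row_ids` is a hypothetical helper, not an existing Lance API.
_next_row_id = 0

def reserve_row_ids(num_rows):
    """Reserve a contiguous block of row IDs before any data is written."""
    global _next_row_id
    start = _next_row_id
    _next_row_id += num_rows
    return range(start, start + num_rows)

observation, session, index = {}, {}, {}

# Step 1: reserve IDs for the rows we are about to write.
obs_ids = reserve_row_ids(2)
ses_ids = reserve_row_ids(1)

# Step 2: build the cross-referencing rows *before* the final append,
# then write all three tables in the same pipeline.
for i in obs_ids:
    observation[i] = {"payload": f"obs-{i}"}
for i in ses_ids:
    session[i] = {"payload": f"ses-{i}"}
index["key-1"] = {"observation_id": obs_ids[0], "session_id": ses_ids[0]}

print(index["key-1"])  # {'observation_id': 0, 'session_id': 2}
```

The point is that `index` can be written in the same batch as `observation` and `session`, because the IDs it references were fixed before any append happened.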
Proposal
This change adds a two-step flow for stable row IDs:

1. Reserve a contiguous range of row IDs as its own transaction.
2. Append data that explicitly carries those reserved IDs.
To support this, we introduce a new transaction type, `ReserveRowIds { num_rows }`, plus a `ReservedRowIds { start_row_id, num_rows }` struct to represent the reserved range explicitly.

On the write path, `Operation::Append` is extended to optionally carry a reserved row ID range, and `InsertBuilder::with_row_ids(...)` is added so callers can append data with pre-reserved IDs. `Dataset::reserved_row_ids()` exposes the reservation attached to the current dataset version when that version was produced by `ReserveRowIds`.

The append validation path ensures that:

- `read_version` points to the reserve snapshot

The transaction and protobuf formats are updated accordingly, so both reservation transactions and append-time row ID assignments are persisted.

Conflict handling is also updated so concurrent appends using reserved row IDs are checked for overlap.