-
Notifications
You must be signed in to change notification settings - Fork 12
Accept table function as destination for part export #1320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: antalya-25.8
Are you sure you want to change the base?
Accept table function as destination for part export #1320
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c7f7c7c90
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const auto & destination_storage = manifest.destination_storage_ptr; | ||
| const auto destination_storage_id = destination_storage->getStorageID(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revalidate destination table before export task runs
The export task now uses the captured destination_storage_ptr directly without rechecking whether the destination table still exists or is dropped/detached. If the destination table is dropped after the ALTER EXPORT is queued but before this background task executes, destination_storage->import(...) will still run and can write data for a table that no longer exists (or has been recreated with a different definition). Previously the task re-resolved the destination from the catalog and aborted on UNKNOWN_TABLE; consider re-looking it up or at least checking is_dropped/locking before proceeding.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting one. This is kind of an existing problem, not exclusive to this PR.
Plain object storage tables don't have state, and there is no trigger / event for when these tables are dropped. At the same time, it is safe (in terms of thread and memory safety) to write to S3 even after the table has been dropped.
Being that said, I think we have a few options:
- Document it and leave it as is, meaning that if a table gets deleted after the export process has started, it will continue writing to the S3 bucket.
- Somehow trigger an event when the table gets deleted, and cancel exports. Might leak abstractions
- Constantly check in the pipeline if the table remains alive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, I don't think this is a major issue. We probably have bigger fish to fry
| bool operator<(const MergeTreePartExportManifest & rhs) const | ||
| { | ||
| const auto lhs_storage_id = destination_storage_ptr->getStorageID(); | ||
| const auto rhs_storage_id = rhs.destination_storage_ptr->getStorageID(); | ||
| // Lexicographic comparison: first compare destination storage, then part name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid set ordering on mutable StorageID
The manifest comparison now derives ordering from destination_storage_ptr->getStorageID(), but StorageID for a table can change on RENAME/DETACH while an export is in progress. Because export_manifests is a std::set, mutating the key of an element already in the set violates ordering invariants and can lead to missed lookups/erasures or undefined behavior. Consider storing a stable StorageID (or a hash of table-function arguments) inside the manifest for ordering, even if you keep the StoragePtr for execution.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly interesting one, and kind of related to #1320 (comment).
|
There is one interesting thing to validate on this one: exports are considered duplicate based on source table + part name + destination table name. I need to check what is the name used for table functions. Edit1: Edit2: Table function names, by default, are all the same as observed above. As stated earlier, the list of manifests is unique based on This isn't a big problem I suppose, but it sounds like a cool protection to have. At the same time, it is not a very good protection as it is based on the table name, not the destination storage. For instance: if you have two s3 tables with different names pointing to the same location, it will export it. The options we have from the top of my head:
|
… be exported simutaneously
Discussed with Misha and Dima, two options:
For now, I implemented n1. |
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d613c808e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| bool operator<(const MergeTreePartExportManifest & rhs) const | ||
| { | ||
| // Lexicographic comparison: first compare destination storage, then part name | ||
| auto lhs_storage = destination_storage_id.getQualifiedName(); | ||
| auto rhs_storage = rhs.destination_storage_id.getQualifiedName(); | ||
|
|
||
| if (lhs_storage != rhs_storage) | ||
| return lhs_storage < rhs_storage; | ||
|
|
||
| return data_part->name < rhs.data_part->name; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include destination in export manifest ordering
Because MergeTreeData::exportPartToTable stores manifests in a std::set and relies on the comparator for uniqueness, comparing only data_part->name makes all exports of the same part equivalent even if the destination differs. This means concurrent exports of the same part to two different tables or table functions will now collide and the second will fail with “already being exported,” which is a regression from the previous per-destination behavior. Consider including the destination (e.g., storage ID or pointer) in operator</operator== so the set only deduplicates identical part+destination pairs.
Useful? React with 👍 / 👎.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Accept table function as destination for part export, inherit schema if not explicitly provided. Example:
ALTER TABLE src_merge_tree EXPORT PART '2020_1_1_0' TO TABLE FUNCTION s3(s3_conn, filename='...', partition_strategy...) PARTITION BY ..Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: