Don't drop tables in incremental append runs #365
stephen-up wants to merge 2 commits into ClickHouse:main
Conversation
I signed the CLA, by the way. I don't know how long it takes to update.
Hi @stephen-up, Thank you for your contribution! I appreciate the effort you've put into this. I'm hesitant to approve this as is, since it could leave a hanging table behind. While I understand that it would be dropped in the next run, it's still something to consider. Additionally, this solution seems to address a specific edge case rather than providing a more comprehensive approach; a broader solution for handling parallel runs would be preferable. I'll review the suggested alternative, though it currently applies only to temporary tables and not to relations created via make_intermediate_relation. Perhaps we could override make_intermediate_relation to include the invocation_id, similar to what was done in #150 and #353? We could also introduce a feature flag for this macro, defaulting to false (disabled). What do you think?
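For context, the suggested override might look roughly like the sketch below. This is a hypothetical illustration only, assuming dbt's adapter-dispatch naming convention, the `default__make_intermediate_relation` fallback macro, and the `invocation_id` Jinja context variable; the exact suffix scheme is an assumption, not the actual PR code:

```jinja
{# Hypothetical sketch: suffix the intermediate relation with the run's
   invocation_id so concurrent runs use distinct shadow tables. #}
{% macro clickhouse__make_intermediate_relation(base_relation, suffix) %}
  {# invocation_id is a UUID; replace dashes to keep the identifier valid #}
  {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '_')) %}
  {{ return(default__make_intermediate_relation(base_relation, unique_suffix)) }}
{% endmacro %}
```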
Hi Bentsi. Leaving hanging tables does sound like something we want to avoid. Putting random IDs in the names, as #150 and #353 do, would ensure better concurrency and support multiple concurrent runs, but it increases the risk and cost of hanging tables. Simply not dropping the shadow tables that aren't even used during an incremental+append run seems simpler and more desirable to me. I'm happy to add a feature flag for it.
I've added the feature flag. Is there a way to set the feature flag for every model in a dbt project? The way I've implemented it, it looks like it needs to be set in every model, like:

```jinja
{{ config(
    order_by='project_id, igid',
    engine='ReplacingMergeTree(version)',
    materialized='incremental',
    incremental_strategy='append',
    incremental_append_skip_drop_table='true'
) }}
```
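For what it's worth, dbt generally allows model configs to be set project-wide (or per folder) in `dbt_project.yml` instead of in every model file. A sketch, assuming the flag name above and a hypothetical project named `my_project`:

```yaml
# dbt_project.yml (project name is an assumption)
models:
  my_project:
    # the "+" prefix marks a config applied to all models under this key
    +incremental_append_skip_drop_table: true
```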
The feature flag I referred to in my previous reply was for the suggested
Hi @BentsiLeviav, I've added some comments to #420. When thinking about race conditions, it gets tricky: do you want to support concurrent full refreshes, or a pair of a full refresh and an incremental run? If you just want the latter, like me, I think this PR is perfect. The incremental + refresh clash seems like a common pattern: we've set up a pipeline with an incremental run every half hour, and we occasionally run full refreshes that take hours, which easily leads to a clash.
Hi, when dbt runs an incremental+append model, it always drops the temporary tables that are used for full refreshes, even if it's not going to use them.
In my system, I have frequently scheduled incremental dbt runs that update incremental+append tables. Whenever I do a full refresh concurrently, the scheduled run drops the full refresh's tables, causing it to crash.
This PR stops that by making incremental+append runs not drop tables.
I see there have been a few other issues similar to this, but this seems like an easy, minimal way to fix the issue.
Thanks
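The change described here amounts to guarding the drop inside the incremental materialization. A rough sketch, assuming the flag name from this PR, dbt's `config.get` API, and a simplified surrounding materialization (the `backup_relation` name is an assumption):

```jinja
{# Hypothetical sketch: skip dropping shadow/backup tables when the
   append strategy is in use and the opt-in flag is set. #}
{% set skip_drop = config.get('incremental_append_skip_drop_table', default=false) %}
{% if not (incremental_strategy == 'append' and skip_drop) %}
  {{ adapter.drop_relation(backup_relation) }}
{% endif %}
```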