ct: introduce producer queue #29084

Merged
rockwotj merged 10 commits into redpanda-data:dev from rockwotj:ct-producer-queue on Jan 4, 2026

@rockwotj
Contributor

  • ct/l0: add a producer queue
  • ct/l0/stm: wire in producer queue
  • ct/frontend: allow concurrent requests per producer

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@rockwotj rockwotj force-pushed the ct-producer-queue branch 5 times, most recently from c315223 to f4623c8, December 22, 2025 15:00
@rockwotj rockwotj marked this pull request as ready for review December 22, 2025 15:28
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a producer queue that enforces the ordering of concurrent produce requests within cloud topics. It allows multiple concurrent requests per producer while keeping the order in which they are committed to Raft, which is needed because the uploads to object storage for those requests can run concurrently.

Key Changes:

  • Introduced producer_queue class to enforce per-producer ordering via tickets
  • Wired producer queue into the cloud topics STM layer
  • Refactored frontend to use ticket-based ordering instead of background task promises
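
The ticket idea in the list above can be sketched without Seastar. What follows is a single-threaded toy in standard C++, not the code from producer_queue.cc. Every name in it (`toy::producer_queue`, `toy::ticket`, `reserve`, the callback-taking `redeem`) is hypothetical, and it assumes for brevity that "wait for my turn, commit, hand over" can be folded into one `redeem` call.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Toy model only: every name here is hypothetical, not the PR's API.
namespace toy {

// One link per request; links of the same producer form a chain.
struct link {
    bool done = false;              // this request committed or gave up
    std::function<void()> on_done;  // work parked by the next request
};

// Mark a link finished and wake whatever was parked behind it.
inline void finish(const std::shared_ptr<link>& l) {
    l->done = true;
    if (auto next = std::exchange(l->on_done, nullptr)) {
        next();
    }
}

// Run fn now if prev is finished (or absent), otherwise park it on prev.
inline void run_after(
  const std::shared_ptr<link>& prev, std::function<void()> fn) {
    if (!prev || prev->done) {
        fn();
    } else {
        assert(!prev->on_done); // each link has exactly one successor
        prev->on_done = std::move(fn);
    }
}

class ticket {
public:
    ticket(std::shared_ptr<link> prev, std::shared_ptr<link> mine)
      : prev_(std::move(prev)), mine_(std::move(mine)) {}
    ticket(ticket&&) noexcept = default; // moved-from tickets hold null links
    ~ticket() { release(); }

    // Commit once every earlier ticket of this producer has finished.
    void redeem(std::function<void()> commit) {
        auto prev = std::move(prev_);
        auto mine = std::move(mine_);
        assert(mine); // redeem at most once
        run_after(prev, [commit = std::move(commit), mine] {
            commit();
            finish(mine);
        });
    }

    // Give up the slot without committing. The hand-off is parked behind
    // prev, so later tickets still wait for earlier ones.
    void release() {
        auto prev = std::move(prev_);
        auto mine = std::move(mine_);
        if (mine) {
            run_after(prev, [mine] { finish(mine); });
        }
    }

private:
    std::shared_ptr<link> prev_;
    std::shared_ptr<link> mine_;
};

class producer_queue {
public:
    // Tickets are ordered by the order of reserve() calls per producer.
    ticket reserve(std::int64_t producer_id) {
        auto& tail = tails_[producer_id];
        auto mine = std::make_shared<link>();
        ticket t(tail, mine); // tail is null for the producer's first request
        tail = mine;
        return t;
    }

private:
    // Never pruned in this toy; idle producers simply stay in the map.
    std::unordered_map<std::int64_t, std::shared_ptr<link>> tails_;
};

} // namespace toy
```

In this model a caller would reserve a ticket when the request arrives, start the upload, and call `redeem` with the commit step once the upload finishes. Commits for one producer then run in reservation order even when uploads finish out of order, and a ticket that is dropped without being redeemed passes its turn along only after its predecessors are done.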

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/v/cloud_topics/level_zero/common/producer_queue.h | Defines the producer queue API and ticket interface |
| src/v/cloud_topics/level_zero/common/producer_queue.cc | Implements producer queue with future chaining and state management |
| src/v/cloud_topics/level_zero/common/tests/producer_queue_test.cc | Comprehensive test suite validating queue ordering and concurrency |
| src/v/cloud_topics/level_zero/stm/ctp_stm.h | Adds producer queue member to the STM |
| src/v/cloud_topics/level_zero/stm/ctp_stm.cc | Integrates producer queue lifecycle (stop) and accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Exposes producer queue through STM API |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc | Implements producer queue accessor wrapper |
| src/v/cloud_topics/frontend/frontend.cc | Refactors replicate flow to use producer tickets and restructures upload/replicate logic |
| BUILD files | Adds necessary dependencies for producer queue integration |

@rockwotj rockwotj force-pushed the ct-producer-queue branch 2 times, most recently from e3e2d50 to 367b95d, December 22, 2025 15:36
@vbotbuildovich
Collaborator

Retry command for Build#78277

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 22, 2025

CI test results

test results on build#78277
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78277#019b46e2-fe71-4e33-a919-d95eace4625d FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0012, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78336
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee85-4ce1-be40-e11980e4f5d0 FLAKY 7/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1690, p0=0.2314, reject_threshold=0.0100. adj_baseline=0.4262, p1=0.3189, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
FollowerFetchingTest test_with_leadership_transfers {"fetch_from": "fetch-from-cloud-topic"} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb3-4d9c-a982-4f7c6c531bb7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_with_leadership_transfers
FollowerFetchingTest test_with_leadership_transfers {"fetch_from": "fetch-from-cloud-topic"} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee8c-4c83-8f22-6b7af855600e FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_with_leadership_transfers
NodesDecommissioningTest test_decommissioning_crashed_node {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb0-4b21-918f-a1f37bae23a8 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_crashed_node
NodesDecommissioningTest test_decommissioning_finishes_after_manual_cancellation {"cloud_topic": true, "delete_topic": false} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb2-407e-b793-0bd8890437ac FLAKY 9/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_finishes_after_manual_cancellation
NodesDecommissioningTest test_decommissioning_finishes_after_manual_cancellation {"cloud_topic": true, "delete_topic": false} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee8c-4761-aebb-fcba9211424c FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_finishes_after_manual_cancellation
NodesDecommissioningTest test_decommissioning_working_node {"cloud_topic": true, "delete_topic": false, "tick_interval": 5000} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb1-4637-ae51-80223d98b857 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_working_node
NodesDecommissioningTest test_flipping_decommission_recommission {"cloud_topic": true, "node_is_alive": false} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee86-491b-ad68-1fc8ce312717 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_flipping_decommission_recommission
NodesDecommissioningTest test_multiple_decommissions {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee8c-4761-aebb-fcba9211424c FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_multiple_decommissions
NodesDecommissioningTest test_recommissioning_node_finishes {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb2-407e-b793-0bd8890437ac FLAKY 9/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_recommissioning_node_finishes
NodesDecommissioningTest test_recommissioning_node_finishes {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee8c-4761-aebb-fcba9211424c FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_recommissioning_node_finishes
NodesDecommissioningTest test_recommissioning_one_of_decommissioned_nodes {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb4-46db-aa67-d93a489f47bc FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_recommissioning_one_of_decommissioned_nodes
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dcb3-4d9c-a982-4f7c6c531bb7 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b7b-ee8c-4c83-8f22-6b7af855600e FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
TopicRecreateTest test_cloud_topic_recreation_while_producing null integration https://buildkite.com/redpanda/redpanda/builds/78336#019b4b70-dca9-4262-a130-6c9cc3ffb1db FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TopicRecreateTest&test_method=test_cloud_topic_recreation_while_producing
test results on build#78377
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
DataMigrationsApiTest test_higher_level_migration_api null integration https://buildkite.com/redpanda/redpanda/builds/78377#019b4e5c-f099-407e-8667-e4131277f288 FLAKY 19/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0220, p0=0.3597, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_higher_level_migration_api
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78377#019b4e64-44b0-4b2d-bd8a-3e73f516c21e FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0006, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78377#019b4e5c-f09b-402c-9b39-f91de9519075 FLAKY 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0013, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78406
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
NodesDecommissioningTest test_decommissioning_cancel_ongoing_movements {"cloud_topic": true} integration https://buildkite.com/redpanda/redpanda/builds/78406#019b51e4-076c-4bc9-9ec6-2f0137205549 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_cancel_ongoing_movements
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78406#019b51ec-ff18-488e-a40a-fc2aea86ea7d FLAKY 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0030, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78406#019b51ec-ff15-4623-8e8d-1913e8889303 FLAKY 6/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0024, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
test results on build#78456
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
LeadersRedirectTest test_subdomain_redirect {"subdomain": "broker."} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b802f-ce02-4145-9d3e-5d029d040c6f FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0040, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LeadersRedirectTest&test_method=test_subdomain_redirect
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b8073-3825-4142-b489-50b3c26a537c FLAKY 2/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0255, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b8073-3828-4b23-8d63-d5ade0997f5d FLAKY 1/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0198, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b8073-3830-47f3-9a4d-afaa23808d75 FLAKY 1/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0065, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b8073-3825-4142-b489-50b3c26a537c FLAKY 1/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0065, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b802f-ce04-4fe2-946f-987112eb5490 FLAKY 4/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0079, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
ScalingUpTest test_moves_with_local_retention {"use_topic_property": false} integration https://buildkite.com/redpanda/redpanda/builds/78456#019b8073-3823-420e-b1f2-2a84641f869b FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0169, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_moves_with_local_retention
src/v/wasm/tests/wasm_transform_test src/v/wasm/tests/wasm_transform_test unit https://buildkite.com/redpanda/redpanda/builds/78456#019b8014-35f2-4cfa-adc0-55d3dc9c13d2 FAIL 0/1
test results on build#78475
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78475#019b8125-0da1-497f-9ced-d1fe1b424329 FAIL 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78475#019b8125-0da1-497f-9ced-d1fe1b424329 FLAKY 1/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0318, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78475#019b8125-0da1-497f-9ced-d1fe1b424329 FLAKY 2/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0104, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78475#019b8125-0da1-497f-9ced-d1fe1b424329 FLAKY 2/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0083, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78475#019b8125-0d9a-427a-b008-25747c95e4fb FLAKY 4/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0100, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78480
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
DatalakeClusterRestoreTest test_restore_partition_spec {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/78480#019b819d-dd7a-4e36-b887-b72f6386c672 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec
SIPartitionMovementTest test_cross_shard {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78480#019b819e-9a0b-41bc-a3be-32914449ae88 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
test results on build#78488
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/78488#019b8213-9183-466a-ba12-09d1a1def2bb FLAKY 11/21 Test FAILS after retries.Significant increase in flaky rate(baseline=0.1940, p0=0.0081, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
NodesDecommissioningTest test_decommission_status null integration https://buildkite.com/redpanda/redpanda/builds/78488#019b8210-1cb0-402e-9d44-24be81e8833b FLAKY 18/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0556, p0=0.3061, reject_threshold=0.0100. adj_baseline=0.1576, p1=0.3690, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommission_status
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 1, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78488#019b8213-9183-466a-ba12-09d1a1def2bb FAIL 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0020, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
test results on build#78497
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78497#019b83f7-0872-447c-93ca-e8da7e154753 FLAKY 7/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0289, p0=0.0025, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78501
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78501#019b848b-5ad4-4e15-817a-a605604e72d5 FLAKY 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0249, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78506
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/78506#019b8505-897f-40e8-a7f3-1df94291e66f FLAKY 15/21 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0245, p0=0.0001, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
test results on build#78512
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78512#019b858a-4bb8-47a7-86ce-fbd78ddd6f45 FAIL 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing
ShadowLinkingRandomOpsTest test_node_operations {"failures": true} integration https://buildkite.com/redpanda/redpanda/builds/78512#019b8589-60de-49f4-a853-88eeb0c5734b FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0499, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1422, p1=0.2156, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
test results on build#78513
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
SIPartitionMovementTest test_shadow_indexing {"cloud_storage_type": 2, "num_to_upgrade": 0, "with_cloud_topics": true} integration https://buildkite.com/redpanda/redpanda/builds/78513#019b85d8-b454-424f-a8e3-c9e259d499ea FLAKY 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0093, p0=0.0037, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing

wdberkeley previously approved these changes Dec 22, 2025
Contributor

@wdberkeley wdberkeley left a comment

I like it, seems elegant to me. Looking forward to seeing how it works in a benchmark!

op->replicate_finished.set_value(raft::errc::timeout);
});

// The default errc that will cause the client to retry to operation
Contributor

to retry the operation

Contributor

@Lazin Lazin left a comment

Looks good. Needs back-pressure propagation.

wait_on = it->second->promise.get_future();
}

auto new_state = ss::make_lw_shared<chain_state>();
Contributor

It feels like this could be done using a map of semaphores with count set to 1. Is it done this way to simplify the implementation of the 'release' method?

Contributor

I guess it will also require storing a future inside a ticket. The future will be awaited in the redeem method and just discarded in the release method. I have a small gripe with storing futures in data structures but in this case I think it's not an issue.
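
The semaphore variant discussed in the two comments above can be modeled in plain C++ as well. This is a single-threaded sketch, not Seastar code and not what the PR ended up with: `toy_sem::fifo_semaphore` is a hypothetical stand-in that assumes waiters are served in arrival order, and the "future stored in the ticket" is replaced by a small shared state that records whether the unit has arrived yet. As in any such toy, `redeem` takes the commit step as a callback, which is an assumption made to avoid needing an async runtime.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Toy model only: every name here is hypothetical.
namespace toy_sem {

// Single-threaded stand-in for a FIFO semaphore that holds one unit.
class fifo_semaphore {
public:
    // Calls on_acquired as soon as the unit is free, in wait() order.
    void wait(std::function<void()> on_acquired) {
        if (!busy_) {
            busy_ = true;
            on_acquired();
        } else {
            waiters_.push_back(std::move(on_acquired));
        }
    }

    // Returns the unit and hands it straight to the oldest waiter, if any.
    void signal() {
        assert(busy_);
        if (waiters_.empty()) {
            busy_ = false;
            return;
        }
        auto next = std::move(waiters_.front());
        waiters_.pop_front();
        next();
    }

private:
    bool busy_ = false;
    std::deque<std::function<void()>> waiters_;
};

class ticket {
    struct state {
        bool granted = false;         // the unit has reached this ticket
        bool abandoned = false;       // released before the unit arrived
        std::function<void()> commit; // pending redeem, if any
    };

public:
    // The semaphore must outlive the ticket.
    explicit ticket(fifo_semaphore& sem)
      : sem_(&sem), st_(std::make_shared<state>()) {
        // Queue at reservation time so FIFO position equals arrival order.
        sem_->wait([sem = sem_, st = st_] {
            st->granted = true;
            if (st->commit) {
                auto commit = std::move(st->commit);
                commit();
                sem->signal();
            } else if (st->abandoned) {
                sem->signal();
            }
        });
    }
    ticket(ticket&&) noexcept = default; // moved-from tickets hold no state
    ~ticket() { release(); }

    // Run commit while holding the unit, then pass the unit on.
    void redeem(std::function<void()> commit) {
        auto st = std::move(st_);
        assert(st); // redeem at most once
        if (st->granted) {
            commit();
            sem_->signal();
        } else {
            st->commit = std::move(commit);
        }
    }

    // Discard the slot; the unit is passed on now or when it arrives.
    void release() {
        auto st = std::move(st_);
        if (!st) {
            return;
        }
        if (st->granted) {
            sem_->signal();
        } else {
            st->abandoned = true;
        }
    }

private:
    fifo_semaphore* sem_;
    std::shared_ptr<state> st_;
};

class producer_queue {
public:
    ticket reserve(std::int64_t producer_id) {
        return ticket(sems_[producer_id]);
    }

private:
    // One semaphore per producer; never pruned in this toy.
    std::unordered_map<std::int64_t, fifo_semaphore> sems_;
};

} // namespace toy_sem
```

The design point this illustrates is the one raised in the thread: the wait has to be registered when the ticket is created, not when it is redeemed, otherwise the semaphore would order requests by upload completion instead of by arrival.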

Contributor Author

Yeah I agree futures stored in data structures are awkward and not good, but I think this is the simplest way to accomplish this?

Contributor Author

@rockwotj rockwotj Dec 22, 2025

I had to rework this a bit based on @bharathv's feedback, if you want to take another look. I ended up going down the semaphore route.

auto it = _producer_states.find(pid);
ss::future<> wait_on = ss::now();
if (it != _producer_states.end()) {
wait_on = it->second->promise.get_future();
Contributor

get_future() asserts on a double get(), right? Is that a problem?

ticket0
ticket1 <chained to ticket0>

ticket1 is destroyed without redeem/release (say some exception)

ticket2 -- tries to chain to ticket1 (since it is the last member) -- calls ticket0.promise.get_future()?
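
For reference, the standard library has the same single-retrieval rule, which makes the hazard easy to reproduce outside Seastar: `std::promise` allows `get_future()` once, and a second call throws `std::future_error` with `future_already_retrieved`. The sketch below uses `std::promise`, not `ss::promise`, so it only illustrates the concern; how Seastar reports the second call is not shown here, and the function name is hypothetical.

```cpp
#include <cassert>
#include <future>
#include <system_error>

// Returns true when a second get_future() on the same std::promise is
// rejected with future_already_retrieved.
bool second_get_future_is_rejected() {
    std::promise<void> p;
    std::future<void> first = p.get_future();
    assert(first.valid()); // the first retrieval succeeds
    try {
        std::future<void> second = p.get_future();
        (void)second;
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::future_already_retrieved;
    }
    return false;
}
```

Any chaining scheme that lets two later tickets attach to the same earlier promise would hit this rule, which is why the scenario in this comment is worth ruling out.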

Contributor Author

> ticket1 is destroyed without redeem/release (say some exception)

The dtor for tickets always calls release, so I don't see how a double get_future is possible: you always replace the state when you call get_future. There could be issues if redeem is called twice, but I don't think that is what you're getting at?

Contributor

@bharathv bharathv Dec 22, 2025

Ah I had a typo in my comment.. (sorry for the confusion)..

I meant (replace ticket1 -> ticket0).

ticket2 -- tries to chain to ticket0 (since it is the last member) -- calls ticket0.promise.get_future()?

In that case, ticket0.state.get_future() is called twice IIUC?

Contributor Author

Hrm, I guess that's an ordering violation, not a double get_future, because we always replace the current future right after we grab it.

Contributor Author

Okay good catch, I've fixed this case :)

@rockwotj
Contributor Author

Force push: address review feedback and add backpressure from write pipeline units into the stages

dotnwat previously approved these changes Dec 23, 2025
Member

@dotnwat dotnwat left a comment

yeh no red flags from me

Comment on lines 777 to 781
out.request_enqueued = _data_plane->reserve_write(std::move(batch_vec))
.then_wrapped([this,
p = std::move(result),
cloned = std::move(to_cache),
batch_id,
Member

🤕

Contributor Author

Do you think I should add back the struct to hold all the stuff? It feels very messy I agree

Member

doesn't seem worth it. we're just spoiled with coroutines most of the time

Comment on lines 80 to 85
ss::semaphore_units<> _units;
ss::future<ss::semaphore_units<>> _wait_on;
ss::lw_shared_ptr<chain_state> _state;
model::producer_id _pid;
producer_state_map* _map;
ss::gate::holder _gate_holder;
Member

is this the state we maintain per request?

Contributor Author

Yes, we might be able to consolidate a little by moving the gate into the shared state, but this is the simplest

Contributor Author

Ended up simplifying it
#29084 (comment)

@rockwotj
Contributor Author

Force push: fixed the build by just doing the subclass trick instead of trying to forward declare only

Also pushed a commit to get better error messages when there are stale epochs

Previously we got unhelpful log messages about empty epochs, but now at
least we get a better error message about what the latest value in the
STM is.
@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":0,"with_cloud_topics":true}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":0,"with_cloud_topics":true}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":1,"num_to_upgrade":0,"with_cloud_topics":true}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":2,"num_to_upgrade":0,"with_cloud_topics":true}

@vbotbuildovich
Collaborator

Retry command for Build#78497

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@rockwotj rockwotj requested a review from dotnwat January 3, 2026 15:45
@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

Okay the debug failures are all shutdown hangs:

1/RedpandaService-0-281472953189712/docker-rp-2/redpanda.log:INFO  2026-01-03 13:21:19,735 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
1/RedpandaService-0-281472953189712/docker-rp-4/redpanda.log:INFO  2026-01-03 13:21:19,679 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
2/RedpandaService-0-281473298644944/docker-rp-11/redpanda.log:INFO  2026-01-03 13:28:09,935 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
2/RedpandaService-0-281473298644944/docker-rp-12/redpanda.log:INFO  2026-01-03 13:28:40,695 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
3/RedpandaService-0-281473275675984/docker-rp-22/redpanda.log:INFO  2026-01-03 13:29:27,929 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
3/RedpandaService-0-281473275675984/docker-rp-23/redpanda.log:INFO  2026-01-03 13:28:33,775 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown
5/RedpandaService-0-281473275662384/docker-rp-37/redpanda.log:INFO  2026-01-03 13:28:34,665 [shard 0:main] main - sharded_service_container.h:132 - Service cloud_topics::app is taking more than 15 seconds to shutdown
5/RedpandaService-0-281473275662384/docker-rp-39/redpanda.log:INFO  2026-01-03 13:29:20,418 [shard 0:main] main - sharded_service_container.h:132 - Service seastar::sharded<cluster::partition_manager> is taking more than 15 seconds to shutdown

The one cloud topics one is the reconciler, which is unrelated to this PR

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
debug
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@vbotbuildovich
Collaborator

Retry command for Build#78501

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

dotnwat
dotnwat previously approved these changes Jan 3, 2026
@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
debug
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@vbotbuildovich
Collaborator

Retry command for Build#78506

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

ERROR 2026-01-03 16:18:31,741 [shard 0:main] cluster - partition_manager.cc:502 - partition {kafka/tp-workload-ct/0} shutdown takes longer than expected, current shutdown stage: removing_raft time since last update: 20 seconds
ERROR 2026-01-03 16:18:30,777 [shard 0:main] raft - state_machine_manager.cc:245 - [{kafka/tp-workload-ct/0}] Timedout waiting for ctp_stm state machine to stop

Hrm, I wonder if increased traffic to the ctp_stm is causing the shutdown delay... Anyway, I suspect it's due to the lock in the epoch fencing, as everything else looks like it shuts down OK.

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

Okay, so it seems reasonable that the issue here is the epoch locking timing out on shutdown, as there are lots of partition moves and the cluster epoch is probably out of sync across these nodes.
This means there are write locks waiting to resolve on shutdown. That shouldn't take 20 seconds, so it might be something else... I added a commit to be able to abort the current waiters in the epoch fencing, and noticed there is a lifetime issue if the ctp_stm is destroyed before the units are returned, so I added code in stop to wait for all units to be returned (and a watchdog, because I'm sure there will still be shutdown issues with this).
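
As a rough illustration of the two shutdown steps described here (fail the queued waiters, then wait until every outstanding unit has come back), a callback-based model might look like the following. It is only a sketch under stated assumptions: `abortable_fence`, `acquire`, `release`, `stop`, and `drained` are hypothetical names, callbacks stand in for futures, and the watchdog is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of a fence whose queued waiters can be aborted at stop.
// Callbacks stand in for futures; this is not the ctp_stm implementation.
class abortable_fence {
public:
    using callback = std::function<void(bool granted)>;

    explicit abortable_fence(std::size_t units)
      : _available(units) {}

    // Grant a unit immediately if one is free, otherwise queue the waiter.
    void acquire(callback cb) {
        if (_stopped) {
            cb(false);
            return;
        }
        if (_available > 0 && _waiters.empty()) {
            --_available;
            ++_outstanding;
            cb(true);
            return;
        }
        _waiters.push_back(std::move(cb));
    }

    // Return a unit; hand it to the oldest waiter unless we are stopping.
    void release() {
        assert(_outstanding > 0);
        if (!_stopped && !_waiters.empty()) {
            auto cb = std::move(_waiters.front());
            _waiters.pop_front();
            cb(true); // the unit changes owner, so _outstanding is unchanged
            return;
        }
        --_outstanding;
        ++_available;
    }

    // Fail every queued waiter. Current holders still have to call release().
    void stop() {
        _stopped = true;
        auto pending = std::exchange(_waiters, {});
        for (auto& cb : pending) {
            cb(false);
        }
    }

    // True once no units are checked out, i.e. the fence is safe to destroy.
    bool drained() const { return _outstanding == 0; }
    std::size_t outstanding() const { return _outstanding; }

private:
    std::size_t _available;
    std::size_t _outstanding = 0;
    bool _stopped = false;
    std::deque<callback> _waiters;
};
```

In this model, destroying the fence while `drained()` is still false corresponds to the lifetime issue mentioned above, and `outstanding()` is the kind of number a shutdown watchdog could log.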

We see shutdown hangs in the ctp stm, likely because of outstanding units in
the epoch fencing.

Also add a watchdog for shutdown to better identify what is taking so
long in stop.

We switch from ss::rwlock to ss::semaphore so as to be able to log the
number of read locks held on shutdown hangs.
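
To make the rwlock-to-semaphore idea concrete: one common way to get reader/writer semantics from a counting semaphore is to have readers take one unit and writers take all of them, which makes the number of outstanding units observable. The sketch below models that with a plain counter and non-blocking `try_` calls. `unit_lock` and its methods are hypothetical names, and this is not the Seastar API or the code in this commit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: a counter of available units plays the role of the
// semaphore. Readers take one unit, a writer takes all of them.
class unit_lock {
public:
    explicit unit_lock(std::size_t max_units)
      : _max(max_units)
      , _available(max_units) {}

    bool try_read_lock() { return take(1); }
    void read_unlock() { give(1); }

    // Taking every unit excludes both readers and other writers.
    bool try_write_lock() { return take(_max); }
    void write_unlock() { give(_max); }

    // Units currently checked out. While no writer holds the lock, this is
    // exactly the number of read locks held, which is what a shutdown log
    // line would want to report.
    std::size_t outstanding() const { return _max - _available; }

private:
    bool take(std::size_t n) {
        if (_available < n) {
            return false;
        }
        _available -= n;
        return true;
    }

    void give(std::size_t n) {
        assert(_available + n <= _max);
        _available += n;
    }

    std::size_t _max;
    std::size_t _available;
};
```

A plain reader/writer lock interface typically does not expose how many readers are inside, whereas a unit count like `outstanding()` falls out of the semaphore model for free.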
@vbotbuildovich
Collaborator

Retry command for Build#78512

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":0,"with_cloud_topics":true}

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

The only test failure is now a crash that Evgeny has a fix for:

rptest.services.utils.NodeCrash: <NodeCrash docker-rp-46: ERROR 2026-01-03 20:50:19,726 [shard 0:main] assert - Assert failure: (src/v/cloud_topics/level_zero/pipeline/pipeline_stage.cc:39) 'static_cast<size_t>(next_ix) < _stages.size()' Pipeline stage

@rockwotj rockwotj requested a review from dotnwat January 3, 2026 21:50
@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
release
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":0,"with_cloud_topics":true}

@vbotbuildovich
Collaborator

Retry command for Build#78513

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":0,"with_cloud_topics":true}

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

This time I got the same assertion again and

INFO  2026-01-03 22:01:21,524 [shard 0:main] main - sharded_service_container.h:132 - Service cloud_topics::app is taking more than 15 seconds to shutdown
INFO  2026-01-03 22:01:21,525 [shard 0:main] main/cloud_topics - sharded_service_container.h:76 - Service cloud_topics::reconciler::reconciler is taking more than 15 seconds to shut down.

@rockwotj
Contributor Author

rockwotj commented Jan 3, 2026

I think I'm ready to force merge at this point @dotnwat, this PR is less flaky than CI on mainline wrt cloud topics.

@dotnwat
Member

dotnwat commented Jan 4, 2026

> I think I'm ready to force merge at this point @dotnwat, this PR is less flaky than CI on mainline wrt cloud topics.

That sounds good to me.

@rockwotj rockwotj disabled auto-merge January 4, 2026 04:07
@rockwotj rockwotj merged commit f83d5c6 into redpanda-data:dev Jan 4, 2026
16 of 19 checks passed
@rockwotj rockwotj deleted the ct-producer-queue branch January 4, 2026 04:07
6 participants