ct: introduce producer queue #29084
Conversation
Force-pushed from c315223 to f4623c8
Pull request overview
This PR introduces a producer queue mechanism to enforce ordering of concurrent produce requests within cloud topics. The change allows multiple concurrent requests per producer while maintaining the correct order when committing to Raft, preserving ordering even when uploads to object storage happen concurrently.
Key Changes:
- Introduced the producer_queue class to enforce per-producer ordering via tickets (see the usage sketch below)
- Wired the producer queue into the cloud topics STM layer
- Refactored the frontend to use ticket-based ordering instead of background task promises
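For orientation, here is a minimal usage sketch of the ticket idea. The API shape (`acquire`, `redeem`, the destructor releasing the slot) and the helpers `upload_to_object_storage`/`replicate_to_raft` are assumptions for illustration; the real declarations live in producer_queue.h.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical helpers, not real frontend functions.
ss::future<> upload_to_object_storage();
ss::future<> replicate_to_raft();

// Sketch only: assumed producer_queue API, not the actual class.
ss::future<> replicate_in_order(producer_queue& queue, model::producer_id pid) {
    // Take a ticket up front: this pins the request's position in the
    // per-producer order even though the upload below runs concurrently
    // with other requests from the same producer.
    auto ticket = queue.acquire(pid);

    // Concurrent stage: upload to object storage.
    co_await upload_to_object_storage();

    // Ordered stage: wait for every earlier ticket for this producer before
    // committing to Raft, so the commit order matches the produce order.
    co_await ticket.redeem();
    co_await replicate_to_raft();
    // The ticket's destructor releases the slot for the next request.
}
```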
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/v/cloud_topics/level_zero/common/producer_queue.h | Defines the producer queue API and ticket interface |
| src/v/cloud_topics/level_zero/common/producer_queue.cc | Implements producer queue with future chaining and state management |
| src/v/cloud_topics/level_zero/common/tests/producer_queue_test.cc | Comprehensive test suite validating queue ordering and concurrency |
| src/v/cloud_topics/level_zero/stm/ctp_stm.h | Adds producer queue member to the STM |
| src/v/cloud_topics/level_zero/stm/ctp_stm.cc | Integrates producer queue lifecycle (stop) and accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Exposes producer queue through STM API |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc | Implements producer queue accessor wrapper |
| src/v/cloud_topics/frontend/frontend.cc | Refactors replicate flow to use producer tickets and restructures upload/replicate logic |
| BUILD files | Adds necessary dependencies for producer queue integration |
Force-pushed from e3e2d50 to 367b95d
Retry command for Build #78277: please wait until all jobs are finished before running the slash command
wdberkeley left a comment:
I like it, seems elegant to me. Looking forward to seeing how it works in a benchmark!
| op->replicate_finished.set_value(raft::errc::timeout);
| });
|
| // The default errc that will cause the client to retry to operation
Lazin left a comment:
Looks good. Needs back-pressure propagation.
| wait_on = it->second->promise.get_future();
| }
|
| auto new_state = ss::make_lw_shared<chain_state>();
It feels like this could be done using a map of semaphores with count set to 1. Is it done this way to simplify the implementation of the 'release' method?
I guess it will also require storing a future inside a ticket. The future will be awaited in the redeem method and just discarded in the release method. I have a small gripe with storing futures in data structures but in this case I think it's not an issue.
Yeah I agree futures stored in data structures are awkward and not good, but I think this is the simplest way to accomplish this?
I had to rework this a bit based on @bharathv's feedback if you want to take another look; I ended up going down the semaphore route.
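For reference, a minimal sketch of the map-of-semaphores idea discussed here, assuming Seastar's `ss::semaphore`/`ss::get_units` and a plain hash map keyed by producer id; the real producer_queue also handles gates, cleanup, and shutdown, which this ignores.

```cpp
#include <seastar/core/semaphore.hh>
#include <seastar/core/shared_ptr.hh>

#include <absl/container/flat_hash_map.h>

// Sketch only: one single-unit semaphore per producer. Waiters on a Seastar
// semaphore are served in FIFO order, so each producer's requests are
// redeemed in arrival order, and dropping the units unblocks the next one.
class naive_producer_queue {
public:
    ss::future<ss::semaphore_units<>> acquire(model::producer_id pid) {
        auto it = _producers.find(pid);
        if (it == _producers.end()) {
            it = _producers.emplace(pid, ss::make_lw_shared<ss::semaphore>(1))
                   .first;
        }
        return ss::get_units(*it->second, 1);
    }

private:
    // Note: never erases entries; a real implementation needs cleanup.
    absl::flat_hash_map<model::producer_id, ss::lw_shared_ptr<ss::semaphore>>
      _producers;
};
```

Because `ss::semaphore_units` release themselves on destruction, a ticket that is dropped without an explicit redeem or release still unblocks the next request, which is roughly why the semaphore route simplifies the release path.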
| auto it = _producer_states.find(pid);
| ss::future<> wait_on = ss::now();
| if (it != _producer_states.end()) {
|     wait_on = it->second->promise.get_future();
get_future() asserts on a double get(), right? Is that a problem?
ticket0
ticket1 <chained to ticket0>
ticket1 is destroyed without redeem/release (say some exception)
ticket2 -- tries to chain to ticket1 (since it is the last member) -- calls ticket0.promise.get_future()?
ticket1 is destroyed without redeem/release (say some exception)
The dtor for tickets always calls release; I'm not seeing how a double get_future is possible, as we always replace the state whenever get_future is called. It is possible to have issues if redeem is called twice, but I don't think that is what you're getting at?
Ah, I had a typo in my comment (sorry for the confusion).
I meant (replace ticket1 -> ticket0).
ticket2 -- tries to chain to ticket0 (since it is the last member) -- calls ticket0.promise.get_future()?
In that case, ticket0.state.get_future() is called twice IIUC?
Hrm, I guess that's an ordering violation rather than a double get_future, because we always replace the current state right after we grab its future.
Okay good catch, I've fixed this case :)
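To make that consume-once invariant concrete, here is a simplified sketch of the promise-chaining scheme being discussed; `chain_state` and `chain_tail` are stand-ins for illustration, not the real types.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>

#include <utility>

// Stand-in for the real chain_state: completed when its request finishes.
struct chain_state {
    ss::promise<> done;
};

class chain_tail {
public:
    // Returns {future to wait on, state this request must complete later}.
    // get_future() is called on the old tail exactly once, right here,
    // because the tail is swapped out before returning; every later
    // enqueue() chains to the freshly installed state instead.
    std::pair<ss::future<>, ss::lw_shared_ptr<chain_state>> enqueue() {
        auto prev = std::exchange(_tail, ss::make_lw_shared<chain_state>());
        auto wait_on = prev ? prev->done.get_future()
                            : ss::make_ready_future<>();
        return {std::move(wait_on), _tail};
    }

private:
    ss::lw_shared_ptr<chain_state> _tail;
};
```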
Force-pushed from 367b95d to 38fb918
Force push: address review feedback and add backpressure from write pipeline units into the stages.
Force-pushed from 38fb918 to b99a6a7
| out.request_enqueued = _data_plane->reserve_write(std::move(batch_vec))
|                          .then_wrapped([this,
|                                         p = std::move(result),
|                                         cloned = std::move(to_cache),
|                                         batch_id,
Do you think I should add back the struct to hold all the stuff? It feels very messy, I agree.
Doesn't seem worth it. We're just spoiled with coroutines most of the time.
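Purely for illustration, the struct route being debated would look roughly like this. The field names mirror the captures above, but the concrete types are placeholder assumptions, and `enqueue_write` is a hypothetical wrapper, not the frontend code.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

#include <cstdint>
#include <memory>
#include <vector>

// Placeholder types standing in for whatever `result`, `to_cache`, and
// `batch_id` really are in frontend.cc -- the shape, not the substance.
struct enqueued_write_ctx {
    ss::promise<bool> result;   // stands in for `p = std::move(result)`
    std::vector<char> to_cache; // stands in for `cloned = std::move(to_cache)`
    int64_t batch_id{0};
};

// The continuation then carries one movable object instead of five captures.
ss::future<> enqueue_write(
  ss::future<ss::semaphore_units<>> reserved,
  std::unique_ptr<enqueued_write_ctx> ctx) {
    return reserved.then_wrapped(
      [ctx = std::move(ctx)](ss::future<ss::semaphore_units<>> units) mutable {
          if (units.failed()) {
              ctx->result.set_exception(units.get_exception());
              return;
          }
          // A real implementation would forward these units into the later
          // stages as backpressure; here we simply let them go out of scope.
          auto held = units.get();
          ctx->result.set_value(true);
      });
}
```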
| ss::semaphore_units<> _units;
| ss::future<ss::semaphore_units<>> _wait_on;
| ss::lw_shared_ptr<chain_state> _state;
| model::producer_id _pid;
| producer_state_map* _map;
| ss::gate::holder _gate_holder;
is this the state we maintain per request?
Yes, we might be able to consolidate a little by moving the gate into the shared state, but this is the simplest
Ended up simplifying it
#29084 (comment)
Force-pushed from b99a6a7 to 2cab834
Force push: fixed the build by doing the subclass trick instead of trying to only forward declare. Also pushed a commit to get better error messages when there are stale epochs.
Right now we get unhelpful log messages about empty epochs; with this change we at least get a better error message about what the latest value in the STM is.
Force-pushed from 3ae3891 to 95b8aa0
/ci-repeat 1
Retry command for Build #78497: please wait until all jobs are finished before running the slash command
Okay, the debug failures are all shutdown hangs. The one cloud topics failure is the reconciler, which is unrelated to this PR.
/ci-repeat 1
Retry command for Build #78501: please wait until all jobs are finished before running the slash command
/ci-repeat 1
Retry command for Build #78506: please wait until all jobs are finished before running the slash command
Hrm, I wonder if increased traffic to the ctp_stm is causing the shutdown delay... Anyway, I suspect it's due to the lock in the epoch fencing, as everything else looks like it shuts down OK.
Okay so it seems reasonable that the issue here is the epoch locking timing out on shutdown, as there are lots of partition moves and the cluster epoch is probably out of sync on these different nodes.
We see shutdown hangs in the ctp_stm, likely because of outstanding units in the epoch fencing. Also add a watchdog for shutdown to better identify what is taking so long in stop. We switch from ss::rwlock to ss::semaphore so that we can log the number of read locks held when shutdown hangs.
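A minimal sketch of that rwlock-to-semaphore swap, assuming Seastar's `ss::semaphore`; the unit count, names, and watchdog wiring are illustrative, not the actual ctp_stm code.

```cpp
#include <seastar/core/semaphore.hh>

#include <cstddef>

// Sketch: emulate the read side of an rwlock with a large counting
// semaphore so that, when stop() hangs, we can report how many readers
// (epoch-fence holders) are still outstanding.
class observable_read_lock {
public:
    static constexpr size_t max_readers = 1024;

    // "Read lock": take one unit; it is returned when the units are dropped.
    ss::future<ss::semaphore_units<>> hold_read_lock() {
        return ss::get_units(_sem, 1);
    }

    // Number of read locks currently held -- the value a shutdown watchdog
    // would log periodically while stop() waits.
    size_t readers_held() const {
        return max_readers - static_cast<size_t>(_sem.available_units());
    }

private:
    ss::semaphore _sem{max_readers};
};
```

The watchdog itself could then be a periodic `ss::timer` that logs `readers_held()` while stop() is still in flight, making a hang attributable to outstanding epoch-fence units.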
Force-pushed from d30095b to b4ad1b4
Retry command for Build #78512: please wait until all jobs are finished before running the slash command
The only test failure is now a crash that Evgeny has a fix for:
/ci-repeat 1
Retry command for Build #78513: please wait until all jobs are finished before running the slash command
This time I got the same assertion again and
I think I'm ready to force merge at this point, @dotnwat; this PR is less flaky than CI on mainline wrt cloud topics.
That sounds good to me. |
Backports Required
Release Notes