
Joe/cluster module breakup #30151

Draft
joe-redpanda wants to merge 8 commits into redpanda-data:dev from joe-redpanda:joe/cluster-module-breakup


@joe-redpanda
Contributor

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

…gets

Break three sets of files out of the monolithic cluster library into
their own Bazel targets, as a first step toward reducing the module's
compilation blast radius.

- cluster:errc — errc.h + errors.cc, zero intra-cluster deps
- cluster:logger — logger.h + logger.cc, depends only on base + seastar
- cluster:leader_balancer_strategy — the pure-algorithm layer of the
  leader balancer (types, strategy interface, constraints, probe,
  random), with zero dependencies on cluster internals

Headers are re-exported from the main cluster target so downstream
consumers are unaffected. Fix bare includes (e.g. "errc.h" →
"cluster/errc.h") that broke once the headers moved to separate
targets.
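
To make "compilation blast radius" concrete: at Bazel-target granularity, changing a target forces a rebuild of everything that transitively depends on it. The sketch below computes that set over a small hypothetical graph. The blast_radius helper, the target names, and consumer_b (a consumer narrowed to depend on cluster:errc directly) are illustrative assumptions, and real rebuilds can be finer-grained than this because compile actions only rerun when a header they actually include changes.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical build graph: target -> targets it depends on directly.
using dep_graph = std::map<std::string, std::vector<std::string>>;

// Returns every target that transitively depends on `changed`, i.e. the
// targets that would need rebuilding at target granularity. `changed`
// itself is not included.
std::set<std::string> blast_radius(const dep_graph& g,
                                   const std::string& changed) {
    // Invert the edges: dependency -> targets that depend on it.
    std::map<std::string, std::vector<std::string>> dependents;
    for (const auto& [target, deps] : g) {
        for (const auto& d : deps) {
            dependents[d].push_back(target);
        }
    }
    // Iterative DFS from the changed target along the inverted edges.
    std::set<std::string> seen;
    std::vector<std::string> stack{changed};
    while (!stack.empty()) {
        const std::string cur = stack.back();
        stack.pop_back();
        auto it = dependents.find(cur);
        if (it == dependents.end()) {
            continue;  // nothing depends on this target
        }
        for (const auto& next : it->second) {
            if (seen.insert(next).second) {
                stack.push_back(next);
            }
        }
    }
    seen.erase(changed);  // only matters if the graph contains a cycle
    return seen;
}
```

With a graph where cluster depends on (and re-exports) cluster:errc and cluster:logger, a change to the rest of the monolith no longer reaches a consumer that depends only on cluster:errc, while a change to errc still reaches everything that picks it up through cluster.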

Extract three self-contained type groups from the 3,931-line types.h
into their own headers, reducing the monolithic header by ~500 lines:

- security_types.h: ACL cmd data, ACL request/reply, user_and_credential,
  role cmd data types
- plugin_rpc_types.h: upsert/remove plugin request/response types
- cluster_link_rpc_types.h: all cluster link and mirror topic
  request/response types

types.h re-exports these headers so existing consumers are unaffected.
This is preparation for giving each domain its own Bazel target, which
will let downstream code depend on only the types it actually uses.
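
One constraint follows from the re-export: because types.h includes the new headers, none of them can include types.h back once each domain gets its own Bazel target, since Bazel rejects dependency cycles between targets. A minimal DFS cycle check, assuming the include graph is given as a map from each header to the headers it includes (the map type and the has_cycle helper are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical include graph: header -> headers it includes directly.
using include_graph = std::map<std::string, std::vector<std::string>>;

// DFS coloring: reaching a node that is still in progress closes a cycle.
enum class mark { unvisited, in_progress, done };

bool has_cycle_from(const include_graph& g, const std::string& node,
                    std::map<std::string, mark>& state) {
    if (state[node] == mark::in_progress) {
        return true;  // back edge: node is already on the current DFS path
    }
    if (state[node] == mark::done) {
        return false;  // fully explored earlier, no cycle through it
    }
    state[node] = mark::in_progress;
    if (auto it = g.find(node); it != g.end()) {
        for (const auto& next : it->second) {
            if (has_cycle_from(g, next, state)) {
                return true;
            }
        }
    }
    state[node] = mark::done;
    return false;
}

bool has_cycle(const include_graph& g) {
    // operator[] value-initializes missing entries to mark::unvisited.
    std::map<std::string, mark> state;
    for (const auto& entry : g) {
        if (has_cycle_from(g, entry.first, state)) {
            return true;
        }
    }
    return false;
}
```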

Break 23 sets of files out of the monolithic cluster library into their
own Bazel targets, culminating in the extraction of the 3,400-line
types.h — the single biggest dependency bottleneck in the module.

New targets with zero cluster-internal deps (13 header-only, 6 with .cc):
  fwd, tx_errc, cluster_link_errc, simple_batch_builder, namespaced_cache,
  tx_hash_ranges, ntp_callbacks, shard_table, node_status_rpc_types,
  node_status_table, offsets_snapshot, security_types, plugin_rpc_types,
  partition_balancer_types, cluster_link_rpc_types, client_quota_serde,
  self_test_rpc_types, tm_stm_types, cloud_metadata_types,
  cloud_metadata_key_utils, cloud_metadata_cluster_manifest,
  data_migration_types

Keystone extraction — types target:
  types.h/cc depends only on the targets above plus already-extracted
  targets (errc, features, version, nt_revision, topic_configuration).

This removes 10 .cc files and 24 .h files from the monolith (148 → 138
.cc files). Fix bare includes (e.g. "types.h" → "cluster/types.h") that
broke once the headers moved to separate targets. All headers are
re-exported from the main cluster target so downstream consumers are
unaffected.
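
The bare-include fix is mechanical. A minimal sketch of the rewrite for a single line, assuming a hypothetical qualify_include helper and a known set of header names that now live under cluster/; angle-bracket includes, already-qualified paths, and unknown headers are left alone:

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: if `line` is a quoted include of a bare header name
// that belongs to the cluster package, add the "cluster/" prefix.
// Any other line is returned unchanged.
std::string qualify_include(const std::string& line,
                            const std::set<std::string>& cluster_headers) {
    const std::string prefix = "#include \"";
    if (line.rfind(prefix, 0) != 0) {
        return line;  // not a quoted include
    }
    const auto close = line.find('"', prefix.size());
    if (close == std::string::npos) {
        return line;  // malformed include, leave it alone
    }
    const std::string name = line.substr(prefix.size(), close - prefix.size());
    if (name.find('/') != std::string::npos) {
        return line;  // already has a directory prefix
    }
    if (cluster_headers.count(name) == 0) {
        return line;  // not one of the moved cluster headers
    }
    // Keep everything from the closing quote on (e.g. trailing comments).
    return prefix + "cluster/" + name + line.substr(close);
}
```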

Extract the next wave of targets from the monolithic cluster library,
following the dependency chain unlocked by the types.h extraction.

Header-only extractions (10 targets):
  commands, bootstrap_types, cluster_recovery_state,
  data_migration_group_proxy, cluster_uuid, partition_probe,
  drain_manager (header only; .cc stays in monolith),
  distributed_kv_stm (+types), rm_group_proxy, topic_table_probe

Full .h+.cc extractions (12 targets):
  node_types, rm_stm_types, tx_protocol_types, plugin_table,
  data_migrated_resources, notification_latch, cluster_link_types,
  controller_snapshot, controller_log_limiter, health_monitor_types,
  members_table, feature_backend

The commands.h extraction was the key unlock. It enabled the extraction
of topic_table_probe.h and members_table.h and, together with
cluster_recovery_state.h, enabled the controller_snapshot.h extraction,
which in turn unblocked members_table.cc and feature_backend.cc.

This removes 12 .cc files from the monolith (138 → 126) and 23 .h
files. Fix bare includes and re-export all headers from the main
cluster target for backward compatibility.
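
The unlock chain above is a level-by-level topological peel (a naive variant of Kahn's algorithm): each wave extracts every file whose remaining monolith dependencies were all extracted in earlier waves. A sketch, assuming a hypothetical extraction_waves helper and a dependency map that contains only the edges named in this commit message (the real include graph has many more):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical map: file -> monolith files it still depends on.
using dep_map = std::map<std::string, std::set<std::string>>;

// Groups files into extraction waves. Wave N holds every file whose
// dependencies were all extracted in waves 0..N-1. Files caught in a cycle,
// or depending on something absent from the map, are never emitted.
std::vector<std::vector<std::string>> extraction_waves(const dep_map& deps) {
    std::set<std::string> extracted;
    std::vector<std::vector<std::string>> waves;
    while (true) {
        std::vector<std::string> wave;
        for (const auto& [file, needs] : deps) {
            if (extracted.count(file) != 0) {
                continue;  // already extracted in an earlier wave
            }
            bool ready = true;
            for (const auto& n : needs) {
                if (extracted.count(n) == 0) {
                    ready = false;
                    break;
                }
            }
            if (ready) {
                wave.push_back(file);
            }
        }
        if (wave.empty()) {
            break;  // nothing new became extractable
        }
        // Mark the wave only after scanning, so files in the same wave
        // cannot unlock each other.
        extracted.insert(wave.begin(), wave.end());
        waves.push_back(std::move(wave));
    }
    return waves;
}
```

With those edges the peel yields three waves: commands.h and cluster_recovery_state.h; then controller_snapshot.h, members_table.h and topic_table_probe.h; then members_table.cc and feature_backend.cc.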

Extract the next wave of targets and clean up header duplication left
over from earlier incremental breakouts.

New targets with .cc files (9):
  cluster_recovery_table, client_quota_store, producer_state,
  partition_properties_stm, log_eviction_stm, ephemeral_credential_frontend,
  producer_state_manager, client_quota_backend, ephemeral_credential_service

New header-only targets (2):
  prefix_truncate_record, ephemeral_credential_serde

Deduplication fixes:
  Remove headers from monolith hdrs that already exist in extracted
  targets (leader_balancer_strategy, topic_configuration,
  tx_manager_migrator_rpc). Add ephemeral_credential_serde as explicit
  dep of the ephemeral_credential_rpc generated target since the
  generated header includes it.
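
Finding the duplicates is a sorted-set intersection between the monolith's hdrs and the headers already owned by extracted targets, and the cleaned-up list is the set difference. A sketch with hypothetical helper names, operating on plain header-name sets rather than on the BUILD file itself:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Headers listed both in the monolith's hdrs and in an extracted target.
// std::set keeps both inputs sorted, as set_intersection requires.
std::vector<std::string> duplicated_hdrs(
    const std::set<std::string>& monolith_hdrs,
    const std::set<std::string>& extracted_hdrs) {
    std::vector<std::string> dups;
    std::set_intersection(monolith_hdrs.begin(), monolith_hdrs.end(),
                          extracted_hdrs.begin(), extracted_hdrs.end(),
                          std::back_inserter(dups));
    return dups;
}

// The monolith's hdrs with every already-extracted header removed.
std::vector<std::string> deduplicated_hdrs(
    const std::set<std::string>& monolith_hdrs,
    const std::set<std::string>& extracted_hdrs) {
    std::vector<std::string> kept;
    std::set_difference(monolith_hdrs.begin(), monolith_hdrs.end(),
                        extracted_hdrs.begin(), extracted_hdrs.end(),
                        std::back_inserter(kept));
    return kept;
}
```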

Fix bare includes exposed by the deduplication (topic_configuration.h,
topic_properties.h, snapshot.h, client_quota_store.h, producer_state.h,
producer_state_manager.h).

This removes 9 .cc files from the monolith (126 → 117).

Break the monolithic cluster_utils.h/cc into three purpose-specific
files to decouple the "toxic" controller_stm.h dependency from the
clean utility functions:

- rpc_utils.h/cc: RPC client helpers (with_client, make_self_broker,
  etc.). Zero cluster deps — extracted as its own Bazel target.

- controller_utils.h/cc: Functions that depend on controller_stm,
  partition, or topic_table (replicate_and_wait, get_partition_state,
  log_revision_on_node, persistent state copy/remove). Stays in the
  monolith.

- cluster_utils.h/cc: Generic replica set utilities (subtract,
  contains_node, moving_from/to_node, etc.), make_error_topic_results,
  map_update_interruption_error_code, check_result_configuration.
  All deps are already extracted — extracted as its own Bazel target.

This unblocks extraction of the scheduling/allocation subsystem:
partition_allocator.cc used only subtract() from cluster_utils, and
cluster_utils.h no longer pulls in controller_stm.h.

Update 29 consumers to include the appropriate header(s). Remove 7
unused cluster_utils.h includes. Add explicit includes for types that
were previously available transitively through the old cluster_utils.h
→ controller_stm.h → everything chain.
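
The decoupling comes down to transitive reachability. A sketch over a deliberately simplified include graph, using only the edges named above (whether each edge sits in the .h or the .cc is assumed, and the reaches helper is hypothetical), shows partition_allocator.cc reaching controller_stm.h before the split and not after:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical include graph: file -> files it includes directly.
using include_graph = std::map<std::string, std::vector<std::string>>;

// True if `from` transitively includes `target` (iterative DFS).
// A file trivially reaches itself.
bool reaches(const include_graph& g, const std::string& from,
             const std::string& target) {
    std::set<std::string> seen{from};
    std::vector<std::string> stack{from};
    while (!stack.empty()) {
        const std::string cur = stack.back();
        stack.pop_back();
        if (cur == target) {
            return true;
        }
        auto it = g.find(cur);
        if (it == g.end()) {
            continue;  // leaf: includes nothing we model
        }
        for (const auto& next : it->second) {
            if (seen.insert(next).second) {
                stack.push_back(next);
            }
        }
    }
    return false;
}
```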

Extract 18 .cc files from the monolith across 9 new targets, enabled
by the cluster_utils split in the previous commit.

New targets:
  scheduling_allocation (6 .cc) — the full allocation subsystem: types,
    allocation_node, allocation_state, allocation_strategy, constraints,
    partition_allocator. Unblocked because partition_allocator only needed
    subtract() from the now-clean cluster_utils.

  node_local_monitor (1 .cc) — standalone, no monolith deps
  node_status_backend (1 .cc) — standalone
  node_status_rpc_handler (1 .cc) — depends on node_status_backend
  cluster_link_table (1 .cc) — standalone

  self_test (6 .cc) — the complete self-test subsystem: cloudcheck,
    diskcheck, netcheck, backend, frontend, rpc_handler. Unblocked by
    extracting node_local_monitor first.

  topic_validators (header-only) — trivial prerequisite for topic_table
  topic_table (1 .cc) — core topic metadata, unblocked by
    topic_validators extraction
  partition_leaders_table (1 .cc) — unblocked by topic_table extraction

This reduces the monolith from 117 to 99 .cc files (below 100).
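
The per-target counts above add up to the stated reduction. A compile-time check of the arithmetic (the array simply restates the counts listed in this commit message; the names are illustrative):

```cpp
#include <array>
#include <cassert>

// .cc files moved out of the monolith per new target, in the order listed.
constexpr std::array<int, 9> cc_moved{
    6,  // scheduling_allocation
    1,  // node_local_monitor
    1,  // node_status_backend
    1,  // node_status_rpc_handler
    1,  // cluster_link_table
    6,  // self_test
    0,  // topic_validators (header-only)
    1,  // topic_table
    1,  // partition_leaders_table
};

constexpr int sum(const std::array<int, 9>& a) {
    int total = 0;
    for (int v : a) {
        total += v;
    }
    return total;
}

constexpr int monolith_before = 117;
constexpr int monolith_after = monolith_before - sum(cc_moved);

static_assert(sum(cc_moved) == 18, "18 .cc files extracted");
static_assert(monolith_after == 99, "117 - 18 = 99");
```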

Extract another wave of .cc files newly unblocked by the topic_table
and partition_leaders_table extractions in the previous commit.

New targets (11):
  topic_table_partition_generator, topic_metrics_watcher,
  topic_recovery_validator, id_allocator_stm,
  cloud_metadata_manifest_downloads, inventory_service,
  tx_helpers, tm_stm, data_migration_table, plugin_backend,
  remote_topic_configuration_source

These files all had their cluster deps fully satisfied by
already-extracted targets. Fix bare includes in topic_metrics_watcher,
topic_recovery_validator, and plugin_backend.

This reduces the monolith from 98 to 87 .cc files.