Add Backfill Circuit Breaking for Offset Reset Feature #18314

@rseetham

Description

Currently, this feature always triggers whenever an ingestion delay is detected at segment creation. Backfilling effectively doubles ingestion, which we may not always have the resources for. There is no mechanism to stop it when there are issues or the cluster is overwhelmed, e.g. when a single topic constantly triggers the feature, or when all topics on a cluster trigger it at once. We need circuit-breaking mechanisms.

Proposal

  1. Introduce a backfill pause property, stream.kafka.consumer.prop.auto.offset.reset.pause. It can be set programmatically (through a controller API that will be added) or through an automatic process described below. This property will be checked before the backfill is triggered at segment creation.
  2. Add another property, stream.kafka.consumer.prop.auto.offset.reset.maxSegmentsBeforeBackfillSkip: if the table already has at least this many segments, we will not trigger a backfill. When ingestion is permanently high rather than spiky, the ingestion doubling creates segments faster and reaches the znode limit sooner.
  3. Add a dsc property, maxConcurrentBackfillsPerController: if a controller already has this many backfills ongoing, it will not trigger any more. When a cluster is restarted after being down for a while, every topic will have ingestion lag, so backfill would trigger for all tables. This is unnecessary in that case because the cluster can self-recover. Worse, backfilling every topic would create many backfill topics, and since topics cannot be cleanly removed, we would need to carry them in the stream config forever, which is not ideal.
  4. If a backfill is ongoing, we will not trigger another backfill until the current one completes. We will also pause backfills using the property from point 1 and emit a metric in this case. This feature was built for occasional spikes; if backfills are constantly triggering, there is more throughput than expected, and the number of partitions on the main topic should be increased instead.
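The four checks above could be combined into a single guard evaluated at segment creation. Below is a minimal, hypothetical sketch of that guard; the class and method names are illustrative only (not actual Pinot code), while the two property names come from the proposal:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed circuit-breaker checks.
// Only the property name strings are taken from the proposal;
// everything else is an illustrative assumption.
public class BackfillCircuitBreaker {
  static final String PAUSE_PROP =
      "stream.kafka.consumer.prop.auto.offset.reset.pause";
  static final String MAX_SEGMENTS_PROP =
      "stream.kafka.consumer.prop.auto.offset.reset.maxSegmentsBeforeBackfillSkip";

  private final int maxConcurrentBackfillsPerController;
  private final AtomicInteger ongoingBackfills = new AtomicInteger();

  BackfillCircuitBreaker(int maxConcurrentBackfillsPerController) {
    this.maxConcurrentBackfillsPerController = maxConcurrentBackfillsPerController;
  }

  /** Returns true only if no circuit-breaker condition trips. */
  boolean shouldTriggerBackfill(Map<String, String> streamProps,
                                int currentSegmentCount,
                                boolean backfillOngoingForTable) {
    // 1. Explicit pause flag, set via controller API or automatically.
    if (Boolean.parseBoolean(streamProps.getOrDefault(PAUSE_PROP, "false"))) {
      return false;
    }
    // 2. Table already has too many segments: skip to protect the znode limit.
    String maxSegments = streamProps.get(MAX_SEGMENTS_PROP);
    if (maxSegments != null && currentSegmentCount >= Integer.parseInt(maxSegments)) {
      return false;
    }
    // 3. Per-controller cap on concurrent backfills (e.g. after a full restart).
    if (ongoingBackfills.get() >= maxConcurrentBackfillsPerController) {
      return false;
    }
    // 4. Never stack a second backfill on top of an ongoing one. A real
    // implementation would also set PAUSE_PROP and emit a metric here.
    if (backfillOngoingForTable) {
      return false;
    }
    return true;
  }
}
```

The guard is purely advisory in this sketch; wiring it into segment creation, the controller API for the pause flag, and the metric emission are left to the actual implementation.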
