Skip to content

Allow dynamic toggling of adaptive routing metrics export#18286

Open
timothy-e wants to merge 3 commits intoapache:masterfrom
timothy-e:timothy-e-adaptive-routing-dynamic-toggle
Open

Allow dynamic toggling of adaptive routing metrics export#18286
timothy-e wants to merge 3 commits intoapache:masterfrom
timothy-e:timothy-e-adaptive-routing-dynamic-toggle

Conversation

@timothy-e
Copy link
Copy Markdown
Contributor

@timothy-e timothy-e commented Apr 22, 2026

#18134 adds metrics for adaptive routing stats.

We may want to toggle whether or not stats are exported as metrics without needing to rolling restart the brokers.

ServerRoutingStatsManager now implements PinotClusterConfigChangeListener. The periodic export task is always scheduled when stats collection is enabled; the enable.stats.metric.export flag is checked inside the task and can be updated at runtime via the Helix cluster config (PUT /cluster/configs) without a broker restart.

We also allow updating the metrics export frequency at runtime.

If the metric interval set to a bad value at startup, it'll block the cluster from starting. But if a bad value is set dynamically, if we don't validate it, it could stop us from reaching later listeners.

Testing

Deployed to an internal Stripe cluster.

sshed onto one of the controllers and ran

curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
# wait a bit
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "true"}'
# wait a bit more
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'

Saw the metrics start/stop as expected.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 88.63636% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.61%. Comparing base (6b79b57) to head (6dd3a89).
⚠️ Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
...erver/routing/stats/ServerRoutingStatsManager.java 90.69% 2 Missing and 2 partials ⚠️
...e/pinot/broker/broker/helix/BaseBrokerStarter.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18286      +/-   ##
============================================
+ Coverage     63.60%   63.61%   +0.01%     
  Complexity     1659     1659              
============================================
  Files          3246     3246              
  Lines        197510   197554      +44     
  Branches      30578    30585       +7     
============================================
+ Hits         125620   125671      +51     
+ Misses        61845    61836       -9     
- Partials      10045    10047       +2     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.59% <88.63%> (+0.02%) ⬆️
java-21 63.58% <88.63%> (+<0.01%) ⬆️
temurin 63.61% <88.63%> (+0.01%) ⬆️
unittests 63.61% <88.63%> (+0.01%) ⬆️
unittests1 55.58% <90.69%> (+0.01%) ⬆️
unittests2 35.04% <13.63%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@timothy-e
Copy link
Copy Markdown
Contributor Author

@xiangfu0 @Jackie-Jiang @yashmayya can one of you take a look at this?

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-confidence concurrency issue in the dynamic metrics toggle.

@timothy-e
Copy link
Copy Markdown
Contributor Author

@xiangfu0 thanks for catching that - can you take another look?

Committed-By-Agent: claude

cc stripe-private-oss-forks/pinot-reviewers
r?

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/581 added metrics for adaptive routing stats.

We may want to toggle whether or not stats are exported as metrics without needing to launch a rolling restart.

ServerRoutingStatsManager now implements PinotClusterConfigChangeListener. The periodic export task is always scheduled when stats collection is enabled; the enable.stats.metric.export flag is checked inside the task and can be updated at runtime via the Helix cluster config (PUT /cluster/configs) without a broker restart.

We also allow updating the metrics export frequency at runtime.

[STREAMANALYTICS-4390](https://jira.corp.stripe.com/browse/STREAMANALYTICS-4390)

Deployed to [rad-canary QA](https://amp.qa.corp.stripe.com/deploy/qa-deploy1.pdx.deploy.stripe.net%2Fdeploy_r2WUckQcQ467TfeJvyp4zw).

`ssh`ed onto a rad-canary controller and ran
```
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "true"}'
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
```
I see that the metrics stopped / resumed as expected in [grafana](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/explore?schemaVersion=1&panes=%7B%223yy%22:%7B%22datasource%22:%22zb219lV4k%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22count%20by%20%28host%29%20%28%7B__name__%3D~%5C%22pinot_broker_adaptive_server_latency_ema%5C%22,%20pinot_cluster%3D%5C%22rad-canary%5C%22%7D%29%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22zb219lV4k%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%221775159834995%22,%22to%22:%221775160734995%22%7D%7D%7D&orgId=1)

<img width="1477" alt="Screenshot 2026-04-02 at 4 17 58 pm" src="https://git.corp.stripe.com/user-attachments/assets/dca06b6d-8b6f-4797-a079-95a1f21d7de7" />

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-04-07T21:12:57Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/582
@timothy-e timothy-e force-pushed the timothy-e-adaptive-routing-dynamic-toggle branch from a9122af to 42877ef Compare April 23, 2026 19:17
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal config-safety issue; see inline comment.

@xiangfu0
Copy link
Copy Markdown
Contributor

@jasperjiaguo @siddharthteotia ^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants