Commit 78eecc0

feat: metricsV2 + oTel + prometheus sample and Grafana dashboard (#3154)
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
1 parent e9ec0d8 commit 78eecc0

33 files changed

+3761
-163
lines changed

.github/workflows/e2e-test.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ jobs:
2525
- "sample-operators/tomcat-operator"
2626
- "sample-operators/webpage"
2727
- "sample-operators/leader-election"
28+
- "sample-operators/metrics-processing"
2829
runs-on: ubuntu-latest
2930
steps:
3031
- name: Checkout

.github/workflows/pr.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ on:
1111
paths-ignore:
1212
- 'docs/**'
1313
- 'adr/**'
14+
- 'observability/**'
1415
workflow_dispatch:
1516
jobs:
1617
check_format_and_unit_tests:

docs/content/en/docs/documentation/observability.md

Lines changed: 101 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
7777
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
7878
```
7979

80-
### Micrometer implementation
80+
### MicrometerMetricsV2
8181

82-
The micrometer implementation is typically created using one of the provided factory methods which, depending on which
83-
is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
84-
behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
85-
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
86-
this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
87-
could lead to performance issues.
82+
[`MicrometerMetricsV2`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java)
83+
is the recommended micrometer-based implementation. It is designed with low cardinality in mind:
84+
all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
85+
resources come and go.
8886

89-
To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
90-
instance via:
87+
The simplest way to create an instance:
9188

9289
```java
9390
MeterRegistry registry; // initialize your registry implementation
94-
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
91+
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry).build();
92+
```
93+
94+
Optionally, include a `namespace` tag on per-reconciliation counters (disabled by default to avoid unexpected
95+
cardinality increases in existing deployments):
96+
97+
```java
98+
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
99+
.withNamespaceAsTag()
100+
.build();
101+
```
102+
103+
You can also supply a custom timer configuration for `reconciliations.execution.duration`:
104+
105+
```java
106+
Metrics metrics = MicrometerMetricsV2.newMicrometerMetricsV2Builder(registry)
107+
.withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
108+
.build();
95109
```
96110

97-
The class provides factory methods which either return a fully pre-configured instance or a builder object that will
98-
allow you to configure more easily how the instance will behave. You can, for example, configure whether the
99-
implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
100-
resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
111+
#### MicrometerMetricsV2 metrics
112+
113+
All meters use `controller.name` as their primary tag. Counters optionally carry a `namespace` tag when
114+
`withNamespaceAsTag()` is enabled.
115+
116+
| Meter name (Micrometer) | Type | Tags | Description |
117+
|--------------------------------------|---------|---------------------------------------------------|------------------------------------------------------------------|
118+
| `reconciliations.active` | gauge | `controller.name` | Number of reconciler executions currently executing |
119+
| `reconciliations.queue` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
120+
| `custom_resources` | gauge | `controller.name` | Number of custom resources tracked by the controller |
121+
| `reconciliations.execution.duration` | timer | `controller.name` | Reconciliation execution duration with explicit bucket histogram |
122+
| `reconciliations.started.total` | counter | `controller.name`, `namespace`* | Number of reconciliations started (including retries) |
123+
| `reconciliations.success.total` | counter | `controller.name`, `namespace`* | Number of successfully finished reconciliations |
124+
| `reconciliations.failure.total` | counter | `controller.name`, `namespace`* | Number of failed reconciliations |
125+
| `reconciliations.retries.total` | counter | `controller.name`, `namespace`* | Number of reconciliation retries |
126+
| `events.received` | counter | `controller.name`, `event`, `action`, `namespace` | Number of Kubernetes events received by the controller |
127+
128+
\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.
129+
130+
The execution timer uses explicit boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
131+
compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OpenTelemetry Protocol (OTLP) registry, where
132+
`publishPercentileHistogram()` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
133+
`_bucket` queries.
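
To illustrate why fixed bucket boundaries matter, here is a minimal, simplified sketch of the linear interpolation that `histogram_quantile()` applies to classic cumulative buckets. It is not Prometheus's implementation and ignores its edge-case handling; the class name `BucketQuantile` and its methods are hypothetical and exist only for this illustration. The boundaries are the ones listed above, expressed in seconds.

```java
public final class BucketQuantile {

  // Upper bucket bounds in seconds, matching the explicit boundaries listed above.
  static final double[] BOUNDS = {0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30};

  /**
   * Estimates the q-quantile from cumulative bucket counts.
   *
   * @param q quantile in [0, 1]
   * @param cumulative cumulative[i] is the number of observations <= BOUNDS[i]; the final
   *     element (index BOUNDS.length) is the +Inf bucket, i.e. the total count
   */
  static double quantile(double q, long[] cumulative) {
    if (cumulative.length != BOUNDS.length + 1) {
      throw new IllegalArgumentException("expected one count per bound plus the +Inf bucket");
    }
    long total = cumulative[cumulative.length - 1];
    if (total == 0) {
      return Double.NaN; // no observations, nothing to estimate
    }
    double rank = q * total;
    for (int i = 0; i < BOUNDS.length; i++) {
      if (cumulative[i] >= rank) {
        double lower = i == 0 ? 0.0 : BOUNDS[i - 1];
        long below = i == 0 ? 0 : cumulative[i - 1];
        long inBucket = cumulative[i] - below;
        if (inBucket == 0) {
          return lower;
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (BOUNDS[i] - lower) * (rank - below) / inBucket;
      }
    }
    // The rank falls into the +Inf bucket: report the highest finite bound.
    return BOUNDS[BOUNDS.length - 1];
  }
}
```

With 100 observations distributed as `{0, 10, 50, 90, 100, ...}`, the median lands at the top of the 50ms to 100ms bucket. The estimate is only as precise as the bucket the rank falls into, which is why the boundaries are chosen to cover typical reconciliation durations.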
134+
135+
> **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
136+
> For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
137+
> `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
138+
> `reconciliations_execution_duration_milliseconds_*`.
139+
140+
#### Grafana Dashboard
141+
142+
A ready-to-use Grafana dashboard is available at
143+
[`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
144+
It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
145+
executions, resource counts, and execution duration histograms and heatmaps.
146+
147+
The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
148+
observability sample (see below).
149+
150+
#### Exploring metrics end-to-end
151+
152+
The
153+
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
154+
includes a full end-to-end test,
155+
[`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
156+
that:
157+
158+
1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
159+
`observability/install-observability.sh`. The script also imports the Grafana dashboards.
160+
2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period
161+
3. Verifies that the expected metrics appear in Prometheus
162+
163+
This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
164+
having to deploy your own operator.
165+
166+
### MicrometerMetrics (Deprecated)
167+
168+
> **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
169+
> V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
170+
> cardinality growth and can lead to performance issues in your metrics backend.
171+
172+
The legacy `MicrometerMetrics` implementation is still available. To create an instance that behaves as it historically
173+
has:
174+
175+
```java
176+
MeterRegistry registry; // initialize your registry implementation
177+
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
178+
```
101179

102-
For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
103-
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
180+
To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
181+
using up to 2 threads:
104182

105183
```java
106184
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
109187
.build();
110188
```
111189

112-
### Operator SDK metrics
190+
#### Operator SDK metrics (V1)
113191

114-
The micrometer implementation records the following metrics:
192+
The V1 micrometer implementation records the following metrics:
115193

116194
| Meter name | Type | Tag names | Description |
117195
|-------------------------------------------------------------|----------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
130208
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
131209
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
132210

133-
As you can see all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
134-
refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
135-
could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
136-
brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
137-
omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
138-
names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
211+
All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
212+
depends on the considered metric and how the implementation is configured: `group?, version, kind, [name, namespace?],
213+
scope` where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
214+
by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
215+
are prefixed with `resource.`.
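
The tag rules described above can be sketched as follows. This is an illustrative, self-contained model of the rule `group?, version, kind, [name, namespace?], scope`, not the code used by `MicrometerMetrics`; the class `ResourceTagsSketch`, its method, and its parameters are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class ResourceTagsSketch {

  /**
   * Builds the resource metadata tags in the order described above.
   *
   * @param perResource whether per-resource collection is enabled (adds name and namespace)
   * @param prefixed whether tag names get the "resource." prefix (controller execution metrics)
   */
  static Map<String, String> tags(
      String group, String version, String kind, String name, String namespace, String scope,
      boolean perResource, boolean prefixed) {
    String prefix = prefixed ? "resource." : "";
    Map<String, String> tags = new LinkedHashMap<>();
    if (group != null && !group.isEmpty()) {
      tags.put(prefix + "group", group); // "group?": omitted when empty
    }
    tags.put(prefix + "version", version);
    tags.put(prefix + "kind", kind);
    if (perResource) { // "[name, namespace?]": only with per-resource collection
      tags.put(prefix + "name", name);
      if (namespace != null && !namespace.isEmpty()) {
        tags.put(prefix + "namespace", namespace);
      }
    }
    tags.put(prefix + "scope", scope);
    return tags;
  }
}
```

The sketch makes the cardinality difference visible: with per-resource collection disabled, the tag set is identical for every resource of a given kind, while enabling it adds one distinct tag combination per resource.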
139216

140217
### Aggregated Metrics
141218

docs/content/en/docs/migration/v5-3-migration.md

Lines changed: 24 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ description: Migrating from v5.2 to v5.3
44
---
55

66

7-
## Renamed JUnit Module
7+
## Rename of JUnit module
88

99
If you use the JUnit extension in your tests, rename it from:
1010

@@ -26,4 +26,26 @@ to
2626
<version>5.3.0</version>
2727
<scope>test</scope>
2828
</dependency>
29-
```
29+
```
30+
31+
## Metrics interface changes
32+
33+
The [Metrics](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/monitoring/Metrics.java)
34+
interface changed in a non-backwards-compatible way to make the API cleaner.
35+
36+
The following table shows the relevant method renames:
37+
38+
| v5.2 method | v5.3 method |
39+
|------------------------------------|------------------------------|
40+
| `reconcileCustomResource` | `reconciliationSubmitted` |
41+
| `reconciliationExecutionStarted` | `reconciliationStarted` |
42+
| `reconciliationExecutionFinished`  | `reconciliationFinished`     |
43+
| `failedReconciliation`             | `reconciliationFailed`       |
44+
| `finishedReconciliation`           | `reconciliationSucceeded`    |
45+
| `cleanupDoneFor` | `cleanupDone` |
46+
| `receivedEvent` | `eventReceived` |
47+
48+
49+
Other changes:
50+
- `reconciliationFinished(..)` and `reconciliationFailed(..)` are extended with a `RetryInfo` parameter.
51+
- `monitorSizeOf(..)` method is removed.
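
For code bases with many custom `Metrics` implementations, the renames can be applied mechanically. The following is a minimal sketch of such a helper, assuming the mapping follows the renames visible in the `MicrometerMetrics` changes of this commit; the class `MetricsRenames` and its `migrate` method are hypothetical and not part of JOSDK.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class MetricsRenames {

  // v5.2 method name -> v5.3 method name
  static final Map<String, String> RENAMES =
      Map.of(
          "reconcileCustomResource", "reconciliationSubmitted",
          "reconciliationExecutionStarted", "reconciliationStarted",
          "reconciliationExecutionFinished", "reconciliationFinished",
          "failedReconciliation", "reconciliationFailed",
          "finishedReconciliation", "reconciliationSucceeded",
          "cleanupDoneFor", "cleanupDone",
          "receivedEvent", "eventReceived");

  // Matches any old name as a whole identifier, so longer identifiers are left alone.
  private static final Pattern OLD_NAMES =
      Pattern.compile(
          "\\b("
              + RENAMES.keySet().stream().map(Pattern::quote).collect(Collectors.joining("|"))
              + ")\\b");

  /** Replaces every old method name in the given source text with its v5.3 name. */
  static String migrate(String source) {
    return OLD_NAMES.matcher(source).replaceAll(match -> RENAMES.get(match.group(1)));
  }
}
```

A textual rename only covers the method names. Methods whose signatures changed, such as those that now take a `RetryInfo`, and usages of the removed `monitorSizeOf(..)` still need to be adjusted by hand.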

micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

Lines changed: 13 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,10 @@
3939

4040
import static io.javaoperatorsdk.operator.api.reconciler.Constants.CONTROLLER_NAME;
4141

42+
/**
43+
* @deprecated Use {@link MicrometerMetricsV2} instead
44+
*/
45+
@Deprecated(forRemoval = true)
4246
public class MicrometerMetrics implements Metrics {
4347

4448
private static final String PREFIX = "operator.sdk.";
@@ -68,7 +72,6 @@ public class MicrometerMetrics implements Metrics {
6872
private static final String EVENTS_RECEIVED = "events.received";
6973
private static final String EVENTS_DELETE = "events.delete";
7074
private static final String CLUSTER = "cluster";
71-
private static final String SIZE_SUFFIX = ".size";
7275
private static final String UNKNOWN_ACTION = "UNKNOWN";
7376
private final boolean collectPerResourceMetrics;
7477
private final MeterRegistry registry;
@@ -182,7 +185,7 @@ public <T> T timeControllerExecution(ControllerExecution<T> execution) {
182185
}
183186

184187
@Override
185-
public void receivedEvent(Event event, Map<String, Object> metadata) {
188+
public void eventReceived(Event event, Map<String, Object> metadata) {
186189
if (event instanceof ResourceEvent) {
187190
incrementCounter(
188191
event.getRelatedCustomResourceID(),
@@ -201,14 +204,14 @@ public void receivedEvent(Event event, Map<String, Object> metadata) {
201204
}
202205

203206
@Override
204-
public void cleanupDoneFor(ResourceID resourceID, Map<String, Object> metadata) {
207+
public void cleanupDone(ResourceID resourceID, Map<String, Object> metadata) {
205208
incrementCounter(resourceID, EVENTS_DELETE, metadata);
206209

207210
cleaner.removeMetersFor(resourceID);
208211
}
209212

210213
@Override
211-
public void reconcileCustomResource(
214+
public void reconciliationSubmitted(
212215
HasMetadata resource, RetryInfo retryInfoNullable, Map<String, Object> metadata) {
213216
Optional<RetryInfo> retryInfo = Optional.ofNullable(retryInfoNullable);
214217
incrementCounter(
@@ -228,19 +231,20 @@ public void reconcileCustomResource(
228231
}
229232

230233
@Override
231-
public void finishedReconciliation(HasMetadata resource, Map<String, Object> metadata) {
234+
public void reconciliationSucceeded(HasMetadata resource, Map<String, Object> metadata) {
232235
incrementCounter(ResourceID.fromResource(resource), RECONCILIATIONS_SUCCESS, metadata);
233236
}
234237

235238
@Override
236-
public void reconciliationExecutionStarted(HasMetadata resource, Map<String, Object> metadata) {
239+
public void reconciliationStarted(HasMetadata resource, Map<String, Object> metadata) {
237240
var reconcilerExecutions =
238241
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
239242
reconcilerExecutions.incrementAndGet();
240243
}
241244

242245
@Override
243-
public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Object> metadata) {
246+
public void reconciliationFinished(
247+
HasMetadata resource, RetryInfo retryInfo, Map<String, Object> metadata) {
244248
var reconcilerExecutions =
245249
gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
246250
reconcilerExecutions.decrementAndGet();
@@ -251,8 +255,8 @@ public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Ob
251255
}
252256

253257
@Override
254-
public void failedReconciliation(
255-
HasMetadata resource, Exception exception, Map<String, Object> metadata) {
258+
public void reconciliationFailed(
259+
HasMetadata resource, RetryInfo retry, Exception exception, Map<String, Object> metadata) {
256260
var cause = exception.getCause();
257261
if (cause == null) {
258262
cause = exception;
@@ -266,11 +270,6 @@ public void failedReconciliation(
266270
Tag.of(EXCEPTION, cause.getClass().getSimpleName()));
267271
}
268272

269-
@Override
270-
public <T extends Map<?, ?>> T monitorSizeOf(T map, String name) {
271-
return registry.gaugeMapSize(PREFIX + name + SIZE_SUFFIX, Collections.emptyList(), map);
272-
}
273-
274273
private void addMetadataTags(
275274
ResourceID resourceID, Map<String, Object> metadata, List<Tag> tags, boolean prefixed) {
276275
if (collectPerResourceMetrics) {
