S3 Plugin Metrics - Complete Grafana Guide

This guide provides ready-to-use PromQL queries for S3 Storage Plugin metrics, including panel configuration details (legend, min step, units).

Overview

The S3 plugin exposes two primary metrics to track file operations and errors:

rr_s3_operations_total - Counter tracking all S3 operations by type, bucket, and status
rr_s3_errors_total - Counter tracking errors by bucket and error type

1. Operation Metrics

1.1 Total Operations Per Second

Query:

sum(rate(rr_s3_operations_total[5m]))

Configuration:

Legend: Total OPS
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Overall S3 operation rate across all buckets

1.2 Operations Per Second by Bucket

Query:

sum by (bucket) (rate(rr_s3_operations_total[5m]))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph or Bar gauge
Description: Operation rate grouped by bucket

1.3 Operations Per Second by Type

Query:

sum by (operation) (rate(rr_s3_operations_total[5m]))

Configuration:

Legend: {{operation}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph (stacked area) or Pie chart
Description: Operation distribution by type (write, read, delete, copy, move, list, exists, get_metadata, set_visibility, get_url)

1.4 Operations Per Second by Status

Query:

sum by (status) (rate(rr_s3_operations_total[5m]))

Configuration:

Legend: {{status}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph (stacked area)
Description: Operation rate grouped by status (success, error)

1.5 Success Rate Percentage

Query:

sum(rate(rr_s3_operations_total{status="success"}[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100

Configuration:

Legend: Success Rate
Min Step: 15s
Unit: percent (0-100)
Panel Type: Gauge or Graph
Thresholds: Red < 95%, Yellow 95-99%, Green > 99%
Description: Percentage of successful S3 operations
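
The arithmetic behind this query can be sketched in Python. This is only an illustration of the ratio, not how Prometheus evaluates it internally; the function name and the counter increases are hypothetical. Because both sides use the same window, the window length cancels out, and a window with no traffic yields NaN (PromQL evaluates 0 / 0 as NaN), which is worth remembering when a gauge shows no value.

```python
import math


def success_rate_percent(success_delta: float, total_delta: float) -> float:
    """Mirror success / total * 100 for one window.

    Both arguments are counter increases over the same window
    (hypothetical inputs, not values read from a real server).
    """
    if total_delta == 0:
        # No operations in the window: PromQL would produce NaN for 0 / 0.
        return math.nan
    return 100 * success_delta / total_delta


# Hypothetical increases over a 5-minute window.
print(success_rate_percent(1485, 1500))  # 99.0
print(success_rate_percent(0, 0))        # nan
```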

1.6 Total Operations Count

Query:

sum(rr_s3_operations_total)

Configuration:

Legend: Total Operations
Min Step: 1m
Unit: short
Panel Type: Stat
Description: Cumulative count of all S3 operations since the process started (counters reset to zero on restart)

2. Bucket Analysis

2.1 Most Active Buckets (by Total Operations)

Query:

topk(10, sum by (bucket) (rate(rr_s3_operations_total[5m])))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Bar gauge (horizontal) or Table
Description: Top 10 buckets by operation rate
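
As a rough model of what sum by (bucket) followed by topk does at a single evaluation step, the Python sketch below aggregates per-series rates by bucket and keeps the largest k. The bucket names and rates are invented for illustration, and the helper name top_buckets is hypothetical.

```python
from collections import defaultdict
import heapq


def top_buckets(samples, k=10):
    """Return the k (bucket, summed rate) pairs with the highest rate.

    samples is an iterable of (labels, rate) pairs, where labels is a dict
    holding at least a "bucket" key.
    """
    totals = defaultdict(float)
    for labels, rate in samples:
        totals[labels["bucket"]] += rate  # sum by (bucket)
    # topk(k, ...): keep the k largest aggregated values.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical per-series rates (operations/sec).
samples = [
    ({"bucket": "uploads", "operation": "write", "status": "success"}, 4.0),
    ({"bucket": "uploads", "operation": "read", "status": "success"}, 6.0),
    ({"bucket": "archive", "operation": "read", "status": "success"}, 1.5),
    ({"bucket": "logs", "operation": "write", "status": "success"}, 3.0),
]
print(top_buckets(samples, k=2))  # [('uploads', 10.0), ('logs', 3.0)]
```

In a time-range panel, topk is evaluated at every step, so the set of buckets can change over time and a graph may show more than k series in total.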

2.2 Most Active Buckets (by Write Operations)

Query:

topk(10, sum by (bucket) (rate(rr_s3_operations_total{operation="write"}[5m])))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Bar gauge or Table
Description: Top 10 buckets by write operation rate

2.3 Most Active Buckets (by Read Operations)

Query:

topk(10, sum by (bucket) (rate(rr_s3_operations_total{operation="read"}[5m])))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Bar gauge or Table
Description: Top 10 buckets by read operation rate

2.4 Bucket Performance Table

Query 1 (Total OPS):

sum by (bucket) (rate(rr_s3_operations_total[5m]))

Query 2 (Write OPS):

sum by (bucket) (rate(rr_s3_operations_total{operation="write"}[5m]))

Query 3 (Read OPS):

sum by (bucket) (rate(rr_s3_operations_total{operation="read"}[5m]))

Query 4 (Error %):

sum by (bucket) (rate(rr_s3_operations_total{status="error"}[5m])) / sum by (bucket) (rate(rr_s3_operations_total[5m])) * 100

Configuration:

Legend: N/A (Table columns)
Min Step: 15s
Unit:
- Query 1-3: ops (operations/sec)
- Query 4: percent (0-100)
Panel Type: Table
Column Names: Bucket, Total OPS, Write OPS, Read OPS, Error Rate %
Description: Comprehensive bucket performance overview

2.5 Read/Write Ratio by Bucket

Query:

sum by (bucket) (rate(rr_s3_operations_total{operation="read"}[5m])) / sum by (bucket) (rate(rr_s3_operations_total{operation="write"}[5m]))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: short (ratio)
Panel Type: Graph or Table
Description: Read-to-write ratio per bucket (higher = more reads than writes)
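
Two edge cases in this ratio are easy to miss: a bucket with no write series at all is dropped from the result, and a bucket whose write rate is zero produces +Inf (or NaN when reads are also zero). The Python sketch below imitates that matching behaviour for one evaluation step; the function name and the sample rates are hypothetical.

```python
import math


def read_write_ratio(read_rate: dict[str, float],
                     write_rate: dict[str, float]) -> dict[str, float]:
    """Divide per-bucket read rates by write rates, matching on bucket."""
    ratios = {}
    for bucket, reads in read_rate.items():
        if bucket not in write_rate:
            # No matching series on the right-hand side: the bucket is dropped.
            continue
        writes = write_rate[bucket]
        if writes == 0:
            # Division by zero: +Inf for reads > 0, NaN for 0 / 0.
            ratios[bucket] = math.inf if reads > 0 else math.nan
        else:
            ratios[bucket] = reads / writes
    return ratios


# Hypothetical rates: "logs" has never been written to, "archive" is idle.
reads = {"uploads": 8.0, "archive": 2.0, "logs": 1.0}
writes = {"uploads": 2.0, "archive": 0.0}
print(read_write_ratio(reads, writes))  # {'uploads': 4.0, 'archive': inf}
```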

3. Operation Type Analysis

3.1 Write Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="write"}[5m]))

Configuration:

Legend: Writes/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file upload rate

3.2 Read Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="read"}[5m]))

Configuration:

Legend: Reads/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file download rate

3.3 Delete Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="delete"}[5m]))

Configuration:

Legend: Deletes/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file deletion rate

3.4 List Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="list"}[5m]))

Configuration:

Legend: Lists/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total object listing rate

3.5 Copy Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="copy"}[5m]))

Configuration:

Legend: Copies/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file copy rate

3.6 Move Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="move"}[5m]))

Configuration:

Legend: Moves/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file move rate

3.7 Exists Check Rate

Query:

sum(rate(rr_s3_operations_total{operation="exists"}[5m]))

Configuration:

Legend: Exists checks/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total file existence check rate

3.8 Metadata Operations Rate

Query:

sum(rate(rr_s3_operations_total{operation="get_metadata"}[5m]))

Configuration:

Legend: Metadata ops/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total metadata retrieval rate

3.9 Visibility Change Rate

Query:

sum(rate(rr_s3_operations_total{operation="set_visibility"}[5m]))

Configuration:

Legend: Visibility changes/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total ACL change rate

3.10 URL Generation Rate

Query:

sum(rate(rr_s3_operations_total{operation="get_url"}[5m]))

Configuration:

Legend: URL gens/sec
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph
Description: Total URL generation rate (public/presigned)

3.11 Operation Distribution (Pie Chart)

Query:

sum by (operation) (rate(rr_s3_operations_total[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100

Configuration:

Legend: {{operation}}
Min Step: 15s
Unit: percent (0-100)
Panel Type: Pie chart
Description: Percentage breakdown of operations by type

3.12 Write vs Read Operations (Stacked)

Query 1 (Writes):

sum(rate(rr_s3_operations_total{operation="write"}[5m]))

Query 2 (Reads):

sum(rate(rr_s3_operations_total{operation="read"}[5m]))

Configuration:

Legend:
- Query 1: Writes
- Query 2: Reads
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph (stacked area)
Description: Visual comparison of write vs read operations

4. Error Tracking

4.1 Total Error Rate

Query:

sum(rate(rr_s3_errors_total[5m]))

Configuration:

Legend: Errors/sec
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Total S3 errors per second (all types)

4.2 Error Rate Percentage

Query:

sum(rate(rr_s3_operations_total{status="error"}[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100

Configuration:

Legend: Error Rate
Min Step: 15s
Unit: percent (0-100)
Panel Type: Gauge or Graph
Thresholds: Green < 1%, Yellow 1-5%, Red > 5%
Description: Percentage of operations that result in errors

4.3 Error Rate by Type

Query:

sum by (error_type) (rate(rr_s3_errors_total[5m]))

Configuration:

Legend: {{error_type}}
Min Step: 15s
Unit: errors/sec
Panel Type: Graph (stacked) or Pie chart
Description: Errors grouped by classification (BUCKET_NOT_FOUND, FILE_NOT_FOUND, S3_OPERATION_FAILED, etc.)

4.4 Error Rate by Bucket

Query:

sum by (bucket) (rate(rr_s3_errors_total[5m]))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: errors/sec
Panel Type: Graph or Table
Description: Errors grouped by bucket

4.5 Most Error-Prone Buckets (by Count)

Query:

topk(10, sum by (bucket) (rate(rr_s3_errors_total[5m])))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: errors/sec
Panel Type: Bar gauge or Table
Description: Buckets with highest error rate

4.6 Most Error-Prone Buckets (by Percentage)

Query:

topk(10, sum by (bucket) (rate(rr_s3_operations_total{status="error"}[5m])) / sum by (bucket) (rate(rr_s3_operations_total[5m])) * 100)

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: percent (0-100)
Panel Type: Bar gauge or Table
Description: Buckets with highest error percentage

4.7 Bucket Not Found Errors

Query:

sum(rate(rr_s3_errors_total{error_type="BUCKET_NOT_FOUND"}[5m]))

Configuration:

Legend: Bucket Not Found
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Rate of bucket not found errors

4.8 File Not Found Errors

Query:

sum(rate(rr_s3_errors_total{error_type="FILE_NOT_FOUND"}[5m]))

Configuration:

Legend: File Not Found
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Rate of file not found errors

4.9 S3 Operation Failed Errors

Query:

sum(rate(rr_s3_errors_total{error_type="S3_OPERATION_FAILED"}[5m]))

Configuration:

Legend: S3 Operation Failed
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Rate of S3 SDK operation failures

4.10 Permission Denied Errors

Query:

sum(rate(rr_s3_errors_total{error_type="PERMISSION_DENIED"}[5m]))

Configuration:

Legend: Permission Denied
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Rate of permission/access denied errors

4.11 Invalid Pathname Errors

Query:

sum(rate(rr_s3_errors_total{error_type="INVALID_PATHNAME"}[5m]))

Configuration:

Legend: Invalid Pathname
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Description: Rate of invalid pathname errors

4.12 Operation Timeout Errors

Query:

sum(rate(rr_s3_errors_total{error_type="OPERATION_TIMEOUT"}[5m]))

Configuration:

Legend: Timeouts
Min Step: 15s
Unit: errors/sec
Panel Type: Graph
Thresholds: Any value > 0 requires investigation
Description: Rate of operation timeout errors

4.13 Error Distribution (Pie Chart)

Query:

sum by (error_type) (rate(rr_s3_errors_total[5m])) / sum(rate(rr_s3_errors_total[5m])) * 100

Configuration:

Legend: {{error_type}}
Min Step: 15s
Unit: percent (0-100)
Panel Type: Pie chart
Description: Percentage breakdown of errors by type

4.14 Error Heatmap (Bucket vs Error Type)

Query:

sum by (bucket, error_type) (rate(rr_s3_errors_total[5m]))

Configuration:

Legend: N/A (heatmap)
Min Step: 30s
Unit: errors/sec
Panel Type: Heatmap
Description: Visual correlation between buckets and error types

5. Combined Operation & Error Analysis

5.1 Operations and Errors (Dual Axis)

Query 1 (Operations - Left Axis):

sum(rate(rr_s3_operations_total[5m]))

Query 2 (Errors - Right Axis):

sum(rate(rr_s3_errors_total[5m]))

Configuration:

Legend:
- Query 1: Operations/sec
- Query 2: Errors/sec
Min Step: 15s
Unit:
- Left Axis: ops (operations/sec)
- Right Axis: errors/sec
Panel Type: Graph (dual Y-axis)
Description: Correlation between operation rate and error rate

5.2 Success vs Error Rate (Stacked)

Query 1 (Success):

sum(rate(rr_s3_operations_total{status="success"}[5m]))

Query 2 (Error):

sum(rate(rr_s3_operations_total{status="error"}[5m]))

Configuration:

Legend:
- Query 1: Success
- Query 2: Error
Min Step: 15s
Unit: ops (operations/sec)
Panel Type: Graph (stacked area)
Description: Visual comparison of successful vs failed operations

5.3 Write Operation Success Rate

Query:

sum(rate(rr_s3_operations_total{operation="write",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="write"}[5m])) * 100

Configuration:

Legend: Write Success Rate
Min Step: 15s
Unit: percent (0-100)
Panel Type: Graph or Gauge
Thresholds: Red < 95%, Yellow 95-99%, Green > 99%
Description: Success rate for write operations only

5.4 Read Operation Success Rate

Query:

sum(rate(rr_s3_operations_total{operation="read",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="read"}[5m])) * 100

Configuration:

Legend: Read Success Rate
Min Step: 15s
Unit: percent (0-100)
Panel Type: Graph or Gauge
Thresholds: Red < 95%, Yellow 95-99%, Green > 99%
Description: Success rate for read operations only

5.5 Operation Success Rate by Type (Table)

Query 1 (Write):

sum(rate(rr_s3_operations_total{operation="write",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="write"}[5m])) * 100

Query 2 (Read):

sum(rate(rr_s3_operations_total{operation="read",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="read"}[5m])) * 100

Query 3 (Delete):

sum(rate(rr_s3_operations_total{operation="delete",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="delete"}[5m])) * 100

Query 4 (List):

sum(rate(rr_s3_operations_total{operation="list",status="success"}[5m])) / sum(rate(rr_s3_operations_total{operation="list"}[5m])) * 100

Configuration:

Legend: N/A (Table rows)
Min Step: 15s
Unit: percent (0-100)
Panel Type: Table
Row Names: Write, Read, Delete, List
Description: Success rate breakdown by operation type

6. Advanced Analytics

6.1 Operation Rate Trend (Day over Day)

Query:

sum(rate(rr_s3_operations_total[1h])) / sum(rate(rr_s3_operations_total[1h] offset 24h))

Configuration:

Legend: DoD Change
Min Step: 5m
Unit: short (ratio)
Panel Type: Graph or Stat
Description: Traffic over the last hour relative to the same hour yesterday (1.0 = same, 2.0 = double)

6.2 Error Burst Detection

Query:

sum(rate(rr_s3_errors_total[1m])) > bool 2 * avg_over_time(sum(rate(rr_s3_errors_total[1m]))[10m:1m])

Configuration:

Legend: Error Burst
Min Step: 15s
Unit: bool (0 or 1)
Panel Type: Graph (binary)
Thresholds: Red when value = 1
Description: Detects sudden spikes in errors (>2x baseline)
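
The detection logic compares the current 1-minute error rate with twice the average of the 1-minute rates sampled over the last 10 minutes. A minimal Python sketch of that comparison follows; the function name, the factor parameter, and the sample rates are assumptions made for illustration.

```python
from statistics import mean


def error_burst(current_rate: float, recent_rates: list[float],
                factor: float = 2.0) -> int:
    """Return 1 when current_rate exceeds factor times the recent average."""
    if not recent_rates:
        # No baseline yet, so nothing to compare against.
        return 0
    baseline = mean(recent_rates)
    return int(current_rate > factor * baseline)


# Hypothetical 1-minute error rates; the window includes the current sample.
print(error_burst(11.0, [1.0] * 9 + [11.0]))  # 1 (11.0 > 2 * 2.0)
print(error_burst(1.5, [1.0] * 10))           # 0 (1.5 <= 2 * 1.0)
```

Because the baseline window includes the current sample, a very large spike raises its own baseline; with a baseline of exactly zero, any non-zero error rate counts as a burst.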

6.3 Operations Per Bucket (Distribution)

Query:

sum by (bucket) (rr_s3_operations_total)

Configuration:

Legend: {{bucket}}
Min Step: 1m
Unit: short
Panel Type: Pie chart or Bar gauge
Description: Total cumulative operations per bucket

6.4 Bucket Activity Timeline (Heatmap)

Query:

sum by (bucket) (rate(rr_s3_operations_total[5m]))

Configuration:

Legend: N/A (heatmap)
Min Step: 30s
Unit: ops (operations/sec)
Panel Type: Heatmap
Description: Visual activity pattern across buckets over time

6.5 Write-Heavy vs Read-Heavy Buckets

Query:

(sum by (bucket) (rate(rr_s3_operations_total{operation="write"}[5m])) > bool sum by (bucket) (rate(rr_s3_operations_total{operation="read"}[5m])))

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: bool (0 or 1)
Panel Type: Graph or Table
Description: Identifies write-heavy buckets (1 = more writes than reads)
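
PromQL comparison operators behave in two ways: by default they filter, keeping only the left-hand series for which the comparison holds, and with the bool modifier they return 1 or 0 for every matched series. The Python sketch below models both modes for a single evaluation step; the function name and the rates are hypothetical.

```python
def compare_gt(left: dict[str, float], right: dict[str, float],
               as_bool: bool) -> dict[str, float]:
    """Model `left > right` (filter) and `left > bool right` per bucket."""
    result = {}
    for bucket, lhs in left.items():
        if bucket not in right:
            continue  # series without a match on the other side are dropped
        if as_bool:
            result[bucket] = 1.0 if lhs > right[bucket] else 0.0
        elif lhs > right[bucket]:
            result[bucket] = lhs  # filter mode keeps the left-hand value
    return result


# Hypothetical per-bucket rates (operations/sec).
writes = {"uploads": 5.0, "archive": 0.5}
reads = {"uploads": 2.0, "archive": 3.0}
print(compare_gt(writes, reads, as_bool=False))  # {'uploads': 5.0}
print(compare_gt(writes, reads, as_bool=True))   # {'uploads': 1.0, 'archive': 0.0}
```

A panel with a 0/1 unit needs the second form; the first form shows the write rate of write-heavy buckets and nothing for the others.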

6.6 Most Reliable Bucket

Query:

bottomk(1, sum by (bucket) (rate(rr_s3_operations_total{status="error"}[5m])) / sum by (bucket) (rate(rr_s3_operations_total[5m])) * 100)

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: percent (0-100)
Panel Type: Stat
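Description: Bucket with the lowest error rate among buckets that have a status="error" series (a bucket that has never recorded an error has no such series and is omitted from the result)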

6.7 Least Reliable Bucket

Query:

topk(1, sum by (bucket) (rate(rr_s3_operations_total{status="error"}[5m])) / sum by (bucket) (rate(rr_s3_operations_total[5m])) * 100)

Configuration:

Legend: {{bucket}}
Min Step: 15s
Unit: percent (0-100)
Panel Type: Stat
Thresholds: Red > 5%, Yellow 1-5%, Green < 1%
Description: Bucket with highest error rate

7. Dashboard Layout Recommendations

Row 1: Key Metrics Overview (4 panels)

Total Operations/sec - Stat panel
Success Rate % - Gauge with thresholds
Total Errors/sec - Stat panel with threshold colors
Active Buckets - Stat (count of buckets with ops > 0, for example count(sum by (bucket) (rate(rr_s3_operations_total[5m])) > 0))

Row 2: Operation Analysis (2 panels)

Operations by Type - Stacked area graph
Operations by Bucket - Graph (time series)

Row 3: Success vs Errors (2 panels)

Success vs Error Rate - Stacked area graph
Operation Success Rate by Type - Table

Row 4: Error Analysis (2 panels)

Errors by Type - Pie chart or Stacked area
Most Error-Prone Buckets - Bar gauge

Row 5: Bucket Performance (1 panel)

Bucket Performance Table - Table with multiple queries (Total OPS, Write OPS, Read OPS, Error %)

Row 6: Advanced (2 panels)

Error Heatmap (Bucket vs Type) - Heatmap
Read/Write Ratio by Bucket - Bar gauge

8. Unit Reference Guide

Standard Grafana Units

Rate Units:

ops (operations/sec) - for operation rates
errors/sec - for error rates (entered as a custom unit; the closest built-in unit is ops)

Percentage:

percent (0-100) - displays 95 as 95%
percentunit (0.0-1.0) - displays 0.95 as 95%

Count:

short - auto-formats large numbers (1K, 1M)
none - raw number

Boolean:

bool - displays 0 or 1 as False/True
bool_yes_no - displays as Yes/No

9. Common Threshold Configurations

Error Rate Thresholds

Green: < 1%
Yellow: 1-5%
Red: > 5%

Success Rate Thresholds

Red: < 95%
Yellow: 95-99%
Green: > 99%

Error Burst Detection

Red: value = 1 (burst detected)
Green: value = 0 (normal)
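
For reference, the error-rate and success-rate bands above can be written as two small Python functions. This is a sketch of the ranges as listed, assuming the boundary values (1, 5, 95, 99) belong to the yellow band; Grafana's own threshold steps apply from their value upward, so behaviour exactly on a boundary may differ in a panel. The function names are hypothetical.

```python
def error_rate_color(percent: float) -> str:
    """Green < 1%, Yellow 1-5%, Red > 5%."""
    if percent < 1:
        return "green"
    if percent <= 5:
        return "yellow"
    return "red"


def success_rate_color(percent: float) -> str:
    """Red < 95%, Yellow 95-99%, Green > 99%."""
    if percent < 95:
        return "red"
    if percent <= 99:
        return "yellow"
    return "green"


print(error_rate_color(0.4), error_rate_color(3), error_rate_color(7.5))
print(success_rate_color(99.6), success_rate_color(97), success_rate_color(92.5))
```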

10. Alert Rules (Prometheus)

Critical Alerts

High Error Rate:

- alert: S3HighErrorRate
  expr: sum(rate(rr_s3_operations_total{status="error"}[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "S3 plugin error rate above 5%"
    description: "Error rate is {{ $value }}% (threshold: 5%)"

Bucket Not Found:

- alert: S3BucketNotFoundErrors
  expr: sum(rate(rr_s3_errors_total{error_type="BUCKET_NOT_FOUND"}[5m])) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "S3 bucket not found errors detected"
    description: "Configuration issue: bucket references don't exist"

Permission Denied:

- alert: S3PermissionDenied
  expr: sum(rate(rr_s3_errors_total{error_type="PERMISSION_DENIED"}[5m])) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "S3 permission denied errors"
    description: "Credentials or IAM policy issue detected"

Warning Alerts

Elevated Error Rate:

- alert: S3ElevatedErrorRate
  expr: sum(rate(rr_s3_operations_total{status="error"}[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100 > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "S3 plugin error rate elevated"
    description: "Error rate is {{ $value }}% (threshold: 1%)"

Timeout Errors:

- alert: S3OperationTimeouts
  expr: sum(rate(rr_s3_errors_total{error_type="OPERATION_TIMEOUT"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "S3 operation timeouts detected"
    description: "Network or S3 service performance issue"

11. Example Grafana Dashboard JSON

See grafana_dashboard_example.json for a complete pre-configured dashboard including all key metrics and recommended visualizations.

12. Troubleshooting Guide

No Metrics Appearing

Verify metrics plugin is enabled in .rr.yaml:
```
metrics:
  address: 127.0.0.1:2112
```

Check metrics endpoint:

curl http://localhost:2112/metrics | grep rr_s3

Ensure S3 plugin is performing operations (metrics only appear after first operation)
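
The same endpoint check can be scripted. The Python sketch below fetches the metrics page and keeps only the rr_s3_* sample lines, skipping # HELP and # TYPE comments; the URL matches the address in the configuration above, and the function names are hypothetical. An empty result means either the endpoint is reachable but the S3 plugin has not performed an operation yet, or the plugin is not registering its metrics.

```python
from urllib.request import urlopen


def s3_metric_lines(exposition_text: str) -> list[str]:
    """Keep only rr_s3_* sample lines from Prometheus text-format output."""
    return [
        line for line in exposition_text.splitlines()
        if line.startswith("rr_s3_")
    ]


def fetch_s3_metrics(url: str = "http://localhost:2112/metrics") -> list[str]:
    """Download the metrics page and return the S3 sample lines."""
    with urlopen(url, timeout=5) as response:
        return s3_metric_lines(response.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        for line in fetch_s3_metrics():
            print(line)
    except OSError as error:
        # Covers connection refused, timeouts, and HTTP errors.
        print(f"metrics endpoint not reachable: {error}")
```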

High Error Rates

Check error types with:

sum by (error_type) (rate(rr_s3_errors_total[5m]))

Identify problematic buckets:

topk(5, sum by (bucket) (rate(rr_s3_errors_total[5m])))

Review RoadRunner logs for detailed error messages

Permission Issues

Check AWS credentials configuration
Verify IAM policy has required permissions
Review PERMISSION_DENIED error rate by bucket

Performance Degradation

Monitor operation rate trends (section 6.1)
Check for error bursts
Review specific operation types (write/read/list) for bottlenecks

13. Integration with Other RoadRunner Metrics

Combined RoadRunner + S3 Dashboard

HTTP Traffic vs S3 Operations:

# HTTP RPS
sum(rate(rr_http_requests_total[5m]))

# S3 OPS
sum(rate(rr_s3_operations_total[5m]))

Worker Pool vs S3 Activity:

# Worker utilization
rr_http_worker_utilization_percent

# S3 write operations
sum(rate(rr_s3_operations_total{operation="write"}[5m]))

This correlation helps identify if S3 operations are causing worker pool pressure.

14. Best Practices

Query Performance

Use rate() over irate() for smoother graphs
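Choose the range window ([5m] in these examples) to suit the scrape interval and traffic volume; a window of at least four scrape intervals is a common rule of thumb
Use topk() to limit the number of series a panel displays on busy systems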
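
To see why rate() gives smoother graphs than irate(), the simplified Python sketch below compares the two on one window of counter samples. It deliberately ignores the extrapolation and counter-reset handling that Prometheus performs, and the function names and sample values are invented for illustration.

```python
def simple_rate(samples):
    """Average per-second increase across the whole window (rate-like)."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples):
    """Per-second increase between the last two samples only (irate-like)."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical (timestamp_seconds, counter_value) pairs scraped every 15s.
# Traffic is steady at 1 op/sec, then jumps in the last interval.
samples = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 220)]
print(simple_rate(samples))   # 2.0
print(simple_irate(samples))  # 5.0
```

The window average dampens the short spike, while the last-two-samples value follows it closely, which suits zoomed-in troubleshooting better than overview dashboards or alerts.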

Alerting

Set up critical alerts for BUCKET_NOT_FOUND (config issues)
Monitor PERMISSION_DENIED (security issues)
Alert on error rate >5% sustained for 5 minutes

Dashboard Organization

Group metrics by concern (operations, errors, buckets)
Use consistent color schemes across panels
Include both rate and percentage views

Retention

Default Prometheus retention: 15 days
Increase for long-term S3 usage analysis
Consider recording rules for long-term aggregations

15. Recording Rules (Optional Optimization)

For high-traffic systems, pre-compute common queries:

groups:
  - name: s3_recording_rules
    interval: 30s
    rules:
      - record: s3:operations:rate5m
        expr: sum(rate(rr_s3_operations_total[5m]))
      
      - record: s3:operations:rate5m:by_bucket
        expr: sum by (bucket) (rate(rr_s3_operations_total[5m]))
      
      - record: s3:errors:rate5m
        expr: sum(rate(rr_s3_errors_total[5m]))
      
      - record: s3:success_rate:percent
        expr: sum(rate(rr_s3_operations_total{status="success"}[5m])) / sum(rate(rr_s3_operations_total[5m])) * 100

Then query s3:operations:rate5m instead of the full expression.

For additional support or questions about S3 plugin metrics, refer to the main RoadRunner documentation or open an issue on GitHub.
