Cacti 1.3 polling engine roadmap: stability, performance, multi-protocol, and storage #7034

@somethingwithproof

Overview

Four clusters of open issues point to related architectural improvements for the polling engine. This issue tracks them as a coordinated roadmap rather than isolated fixes.

1. Resilient Poller Engine

The poller is fragile under load. Overruns cascade into parallel cycles, advisory locks stall without a timeout, stale poller_output rows cause doubled RRD writes, and DDL operations thrash the database during routine polling.

Goal: a poller engine that fails gracefully under contention.

Key changes:

  • Abort on overrun instead of deleting active process locks and spawning parallel cycles
  • Advisory locks with hard retry limits (fail-closed, retry next cycle)
  • Transaction-safe poller_output processing to prevent stale row doubling
  • Deterministic child process lifecycle (no orphaned or zombie processes left behind on timeout)
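A minimal sketch of the first two changes. It is written in Python for illustration rather than Cacti's PHP, and it assumes a non-blocking `try_lock` callable. Against MySQL, that callable could wrap `SELECT GET_LOCK('name', 0)`, which returns 1 when the lock is obtained. The function names, retry count, and delays are hypothetical.

```python
import time
from typing import Callable


def acquire_with_retry_limit(
    try_lock: Callable[[], bool],
    max_attempts: int = 3,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a non-blocking advisory lock at most `max_attempts` times.

    Fail-closed: if every attempt fails, return False so the caller skips
    this cycle instead of stalling indefinitely or breaking the lock.
    """
    for attempt in range(1, max_attempts + 1):
        if try_lock():
            return True
        if attempt < max_attempts:
            sleep(delay_s * attempt)  # linear backoff between attempts
    return False


def start_cycle(
    previous_cycle_running: bool,
    try_lock: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Decide whether this poller cycle may run (hypothetical helper)."""
    # Abort on overrun: never delete the running cycle's locks or start a parallel cycle.
    if previous_cycle_running:
        return "skip: previous cycle still running"
    # Bounded lock wait: give up and retry on the next scheduled cycle.
    if not acquire_with_retry_limit(try_lock, sleep=sleep):
        return "skip: lock unavailable, retry next cycle"
    return "run"
```

Because the caller can only skip, a stuck lock costs one missed cycle. It cannot produce a pile of stalled or parallel pollers.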

Related issues: #6569, #6738, #6777, #6781, #7024, #7030
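For the transaction-safe poller_output change, one possible shape is to read rows and delete those exact primary keys inside a single transaction, so each row can be claimed only once. This is an assumption, not current Cacti behavior. The sketch uses Python's sqlite3 with a simplified poller_output schema. On InnoDB the equivalent would more likely use `SELECT ... FOR UPDATE` inside a transaction. `claim_poller_output` and `demo_db` are hypothetical names.

```python
import sqlite3


def claim_poller_output(conn: sqlite3.Connection, cutoff: int) -> list:
    """Atomically read and remove poller_output rows with time <= cutoff.

    The SELECT and the DELETE of exactly those keys share one transaction,
    so a later pass (or a parallel process) cannot pick up the same rows
    and write them to the RRD a second time. Expects isolation_level=None
    so the explicit BEGIN/COMMIT below control the transaction.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        rows = conn.execute(
            "SELECT local_data_id, rrd_name, time, output FROM poller_output"
            " WHERE time <= ? ORDER BY local_data_id, time",
            (cutoff,),
        ).fetchall()
        conn.executemany(
            "DELETE FROM poller_output"
            " WHERE local_data_id = ? AND rrd_name = ? AND time = ?",
            [row[:3] for row in rows],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return rows


def demo_db() -> sqlite3.Connection:
    """In-memory stand-in for poller_output (simplified schema)."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE poller_output (local_data_id INTEGER, rrd_name TEXT,"
        " time INTEGER, output TEXT, PRIMARY KEY (local_data_id, rrd_name, time))"
    )
    conn.executemany(
        "INSERT INTO poller_output VALUES (?, ?, ?, ?)",
        [(1, "traffic_in", 100, "5"), (1, "traffic_in", 400, "7"), (2, "cpu", 100, "42")],
    )
    return conn
```

The trade-off is at-most-once delivery. If the process dies after COMMIT but before the RRD update, that batch is lost rather than written twice.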

2. Efficient Poll Cycle

Each poll cycle wastes resources through unclosed SNMP sessions, per-data-source script server spawning, and unbatched SNMP GETs for realtime graphs.

Goal: 50% reduction in per-cycle process spawns and connection overhead.

Key changes:

  • Reuse SNMP sessions across OIDs per host, close deterministically at end of cycle
  • Batch script server queries per host (spawn once, feed all queries, tear down once)
  • Runtime detection and bypass of php-snmp when net-snmp binaries are faster
  • Batched SNMP GET for realtime graph polling
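A sketch of per-host session reuse with batched GETs. `SnmpSession` is a hypothetical interface, not the php-snmp or net-snmp API, and the limit of 10 OIDs per GET is an illustrative default. The context manager guarantees that each host's session is closed exactly once, at the end of that host's work, even if a request raises.

```python
from __future__ import annotations

from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, Protocol


class SnmpSession(Protocol):
    """Hypothetical session interface used only for this sketch."""

    def get(self, oids: list[str]) -> dict[str, str]: ...
    def close(self) -> None: ...


@contextmanager
def host_session(open_session: Callable[[str], SnmpSession], host: str) -> Iterator[SnmpSession]:
    session = open_session(host)
    try:
        yield session
    finally:
        session.close()  # deterministic close, even if a GET raises


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def poll_cycle(
    open_session: Callable[[str], SnmpSession],
    items: Iterable[tuple[str, str]],
    max_oids_per_get: int = 10,
) -> dict[tuple[str, str], str]:
    """Open one session per host and fetch its OIDs in multi-OID GET batches."""
    oids_by_host: dict[str, list[str]] = defaultdict(list)
    for host, oid in items:
        oids_by_host[host].append(oid)

    results: dict[tuple[str, str], str] = {}
    for host, oids in oids_by_host.items():
        with host_session(open_session, host) as session:
            for batch in chunked(oids, max_oids_per_get):
                for oid, value in session.get(batch).items():
                    results[(host, oid)] = value
    return results
```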

Related issues: #6669, #6722, #6735, #7025
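The script server batching can be illustrated with a real subprocess. One worker per host receives every query for that host on stdin and is torn down when the input ends. The worker is a hypothetical Python stand-in, not the protocol of Cacti's PHP script server, and the function names are illustrative.

```python
from __future__ import annotations

import subprocess
import sys
from collections import defaultdict

# Hypothetical stand-in for a script server: answers one line per query line.
WORKER_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write('result:' + line.strip() + '\\n')\n"
)


def run_host_batch(queries: list[str], timeout_s: float = 30.0) -> list[str]:
    """Spawn one worker, feed it every query for a host, then tear it down.

    subprocess.run writes all input, closes stdin, waits for exit, and kills
    the child if timeout_s expires, so no worker outlives the batch.
    """
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input="".join(query + "\n" for query in queries),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return completed.stdout.splitlines()


def poll_script_items(items: list[tuple[str, str]]) -> tuple[dict[tuple[str, str], str], int]:
    """Group queries by host so each host costs one spawn, not one per data source."""
    queries_by_host: dict[str, list[str]] = defaultdict(list)
    for host, query in items:
        queries_by_host[host].append(query)

    results: dict[tuple[str, str], str] = {}
    spawns = 0
    for host, queries in queries_by_host.items():
        answers = run_host_batch(queries)
        spawns += 1
        results.update({(host, q): a for q, a in zip(queries, answers)})
    return results, spawns
```

The `timeout` argument also makes teardown deterministic. If it expires, `subprocess.run` kills and reaps the child before raising `TimeoutExpired`.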

3. Multi-Protocol Data Input

SNMP alone is no longer sufficient for many modern network devices. Vendors are investing in gRPC/OpenConfig telemetry and REST APIs, while device-specific SNMP quirks (e.g., Juniper trailing-zero OID padding) consume disproportionate maintenance effort.

Goal: pluggable data input abstraction that feeds the existing poller_output/RRD pipeline.

Key changes:

  • Define a data input interface contract (collect, format, write to poller_output)
  • gRPC/gNMI collector (sidecar or native) for Juniper JTI, Cisco MDT, Arista OpenConfig
  • REST/API-based availability checks as an alternative to ICMP/SNMP ping
  • Adaptive polling backoff for consistently failing devices
  • Auto-detection of SNMP OID index quirks (trailing-zero padding, etc.)
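One hypothetical shape for the interface contract: each protocol plugin implements only `collect`, while `format` and `write` are shared so that every collector produces the same poller_output rows. The class names, the simplified row layout, and the in-memory list that stands in for `INSERT INTO poller_output` are assumptions for illustration.

```python
from __future__ import annotations

import time
from abc import ABC, abstractmethod

# Simplified poller_output row: (local_data_id, rrd_name, time, output).
PollerRow = tuple[int, str, int, str]


class DataInput(ABC):
    """Hypothetical plugin contract: collect -> format -> write to poller_output."""

    @abstractmethod
    def collect(self, host: str) -> dict[str, float]:
        """Protocol-specific fetch (SNMP, gNMI, REST, ...)."""

    def format(self, local_data_id: int, metrics: dict[str, float], ts: int) -> list[PollerRow]:
        """Shared step: one row per item, sorted by name for stable output."""
        return [(local_data_id, name, ts, str(value)) for name, value in sorted(metrics.items())]

    def write(self, rows: list[PollerRow], sink: list[PollerRow]) -> int:
        """Shared step; the list stands in for INSERT INTO poller_output."""
        sink.extend(rows)
        return len(rows)

    def run(self, host: str, local_data_id: int, sink: list[PollerRow], ts: int | None = None) -> int:
        now = int(time.time()) if ts is None else ts
        return self.write(self.format(local_data_id, self.collect(host), now), sink)


class StaticTelemetryInput(DataInput):
    """Stand-in for a gNMI or REST collector that returns canned values."""

    def __init__(self, values: dict[str, float]) -> None:
        self.values = values

    def collect(self, host: str) -> dict[str, float]:
        return dict(self.values)
```

Keeping `format` and `write` out of the plugins means a new protocol only has to answer "what are the values for this host", and the RRD pipeline downstream stays unchanged.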

Related issues: #5919, #6108, #6787, #7032, #7033
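Adaptive backoff could be as simple as doubling a failing host's poll interval up to a cap, then resetting it on the first success. The sketch assumes a 300-second base interval and a one-hour cap. Both numbers and the `HostBackoff` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostBackoff:
    """Per-host poll scheduling with exponential backoff (hypothetical)."""

    base_interval: int = 300   # seconds; assumed normal poll interval
    max_interval: int = 3600   # seconds; assumed ceiling for failing hosts
    failures: int = 0
    next_poll: int = 0         # epoch seconds of the next allowed poll

    def should_poll(self, now: int) -> bool:
        return now >= self.next_poll

    def record(self, now: int, ok: bool) -> None:
        """Update the schedule after a poll attempt at time `now`."""
        if ok:
            self.failures = 0
            interval = self.base_interval
        else:
            self.failures += 1
            interval = min(self.base_interval * 2 ** self.failures, self.max_interval)
        self.next_poll = now + interval
```

With these defaults, a host that keeps failing is retried after 600, 1200, and 2400 seconds, and then every 3600 seconds.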

4. Zero-DDL Polling Pipeline

The boost subsystem and poller_output table use CREATE/DROP/RENAME/ALTER TABLE as runtime operations every poll cycle. This causes metadata lock contention, query cache invalidation, and InnoDB dictionary mutex stalls at scale.

Goal: no schema mutations during normal polling.

Key changes:

  • Replace MEMORY table DDL-per-cycle with persistent InnoDB tables and TRUNCATE
  • Row-level expiry for boost cache instead of table-level DROP/CREATE
  • Bounded boost_max_records enforcement to prevent unbounded growth

Related issues: #6775, #6786, #7030
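A sketch of row-level expiry and a bounded record cap against a persistent table, using Python's sqlite3 so it runs anywhere. The `boost_cache` table and the function names are hypothetical simplifications of the boost storage. The table and index are created once at install time, and each cycle only issues DELETEs in bounded batches. SQLite needs the `id IN (SELECT ... LIMIT ?)` form used here, whereas MySQL accepts `DELETE ... ORDER BY ... LIMIT n` directly on a single table.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Run once at install or upgrade time, never per poll cycle."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS boost_cache ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " local_data_id INTEGER NOT NULL,"
        " time INTEGER NOT NULL,"
        " output TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS boost_cache_time ON boost_cache (time)")


def expire_rows(conn: sqlite3.Connection, cutoff: int, batch: int = 1000) -> int:
    """Delete rows older than `cutoff` in bounded batches instead of DROP/CREATE."""
    total = 0
    while True:
        with conn:  # each batch is its own short transaction
            cur = conn.execute(
                "DELETE FROM boost_cache WHERE id IN"
                " (SELECT id FROM boost_cache WHERE time < ? LIMIT ?)",
                (cutoff, batch),
            )
        total += cur.rowcount
        if cur.rowcount < batch:
            return total


def enforce_max_records(conn: sqlite3.Connection, max_records: int) -> int:
    """Trim the oldest rows so the table never holds more than `max_records`."""
    (count,) = conn.execute("SELECT COUNT(*) FROM boost_cache").fetchone()
    excess = count - max_records
    if excess <= 0:
        return 0
    with conn:
        cur = conn.execute(
            "DELETE FROM boost_cache WHERE id IN"
            " (SELECT id FROM boost_cache ORDER BY time, id LIMIT ?)",
            (excess,),
        )
    return cur.rowcount
```

Small batches keep each transaction short, so expiry competes with polling writes for row locks rather than for the metadata lock that DROP/CREATE/RENAME require.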

Priority

Cluster 1 (stability) and Cluster 4 (storage) are prerequisites for large-scale deployments. Cluster 2 (performance) provides measurable improvement for existing users. Cluster 3 (multi-protocol) is forward-looking and can develop in parallel as a plugin before core integration.
