Skip to content

ClusterViewSet static queryset annotations cause extreme API slowness #21732

@jcresp21

Description

@jcresp21

NetBox Edition

NetBox Community

NetBox Version

v4.5.4 (Docker image: netboxcommunity/netbox:v4.5-4.0.1)

Python Version

3.12

Steps to Reproduce

  1. Have a moderate number of clusters (~84) and a significant number of virtual machines (~15,500).
  2. Query the clusters API endpoint:
GET /api/virtualization/clusters/
  1. Even ?brief=True exhibits the same behavior:
GET /api/virtualization/clusters/?brief=True

Expected Behavior

The API request should complete in under 1–2 seconds for 84 clusters.

Observed Behavior

The request takes 85–120 seconds to complete. The bottleneck is the DRF pagination COUNT(*) query, which inherits the Sum() annotations and LEFT OUTER JOIN defined statically on the ClusterViewSet.queryset.

Root Cause

ClusterViewSet in virtualization/api/views.py defines:

class ClusterViewSet(NetBoxModelViewSet):
    queryset = Cluster.objects.prefetch_related('virtual_machines').annotate(
        allocated_vcpus=Sum('virtual_machines__vcpus'),
        allocated_memory=Sum('virtual_machines__memory'),
        allocated_disk=Sum('virtual_machines__disk'),
    )

These annotations and prefetch_related are applied statically to the queryset. This means they are present on every request, including:

  • ?brief=True requests (where these fields are never serialized)
  • The pagination COUNT(*) query (where annotations are completely irrelevant)

However, BaseViewSet.get_queryset() already has a dynamic mechanism for resolving annotations and prefetches based on the serializer fields:

class BaseViewSet(GenericViewSet):
    def get_queryset(self):
        qs = super().get_queryset()
        serializer_class = self.get_serializer_class()

        if prefetch := get_prefetches_for_serializer(serializer_class, **self.field_kwargs):
            qs = qs.prefetch_related(*prefetch)

        if annotations := get_annotations_for_serializer(serializer_class, **self.field_kwargs):
            qs = qs.annotate(**annotations)

        return qs

The RelatedObjectCountField fields (virtualmachine_count, device_count) are already handled dynamically by this mechanism. The static Sum() annotations on ClusterViewSet bypass this entirely, causing them to be applied unconditionally — including to the pagination COUNT(*) query.

Pathological SQL generated by pagination

When DRF's paginator calls queryset.count(), Django cannot strip the annotations, producing:

SELECT COUNT(*) FROM (
  SELECT COALESCE(
    (SELECT COUNT(*) AS "c"
     FROM "virtualization_virtualmachine" U0
     WHERE U0."cluster_id" = ("virtualization_cluster"."id")
     GROUP BY U0."cluster_id"), 0
  ) AS "virtualmachine_count"
  FROM "virtualization_cluster"
  LEFT OUTER JOIN "virtualization_virtualmachine"
    ON ("virtualization_cluster"."id" = "virtualization_virtualmachine"."cluster_id")
  GROUP BY "virtualization_cluster"."id", 1
) subquery

EXPLAIN ANALYZE output:

Aggregate  (cost=1983993.25..1983993.26 rows=1 width=8)
           (actual time=61184.289..61184.299 rows=1 loops=1)
  ->  Group  (cost=23488.67..1983992.20 rows=84 width=16)
             (actual time=261.027..61184.098 rows=84 loops=1)
        Group Key: virtualization_cluster.id, (COALESCE((SubPlan 1), '0'::bigint))
        ->  Incremental Sort  (cost=23488.67..1973207.37 rows=15464 width=16)
                               (actual time=260.998..61182.784 rows=15468 loops=1)
              SubPlan 1
                ->  GroupAggregate  (cost=0.29..127.47 rows=1 width=16)
                                    (actual time=3.921..3.921 rows=1 loops=15468)
                      ->  Index Only Scan [...]
                            (actual time=0.022..3.266 rows=11934 loops=15468)
                            Heap Fetches: 88071177
Execution Time: 61199.586 ms

The correlated subquery executes 15,468 times (once per joined VM row), each scanning ~11,934 rows. Total: ~88 million heap fetches for what should be a trivial count of 84 rows.

Diagnostic benchmarks

Test Time
Cluster.objects.count() — direct ORM 0.036s
Cluster.objects.values_list('id', flat=True) — direct ORM 0.021s
ClusterSerializer(data, many=True).data — serializer only 11.3s
GET /api/virtualization/clusters/?brief=True — full HTTP 85.7s

Possible Fix

Remove the static annotations and prefetch_related from ClusterViewSet.queryset. The BaseViewSet.get_queryset() dynamic mechanism already handles RelatedObjectCountField annotations (virtualmachine_count, device_count) correctly via get_annotations_for_serializer().

class ClusterViewSet(NetBoxModelViewSet):
    queryset = Cluster.objects.all()
    serializer_class = serializers.ClusterSerializer
    filterset_class = filtersets.ClusterFilterSet

For the allocated_vcpus, allocated_memory, and allocated_disk fields, a proper solution would be to implement a dynamic field type (similar to RelatedObjectCountField) that integrates with get_annotations_for_serializer(), so these annotations are only applied when the fields are included in the response and never leak into the pagination COUNT(*) query.

Environment

  • 84 clusters, ~15,500 virtual machines
  • PostgreSQL 17 (Patroni cluster)
  • Upgraded from NetBox 3.7 → 4.5 (issue did not exist in 3.7, as allocated_* annotations were not present on ClusterViewSet)

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions