For high-volume deployments, the gateway tends to pick multiple indexers (often 3) to maximize performance. The main problem with this behavior is a higher risk of unnecessarily overloading indexers. Above some volume threshold, indexer selection should instead load-balance requests between the indexers that would otherwise all be included in the selected set.
The first challenge is detecting high volume on a deployment. Here's a rough design:
The gateway should track query volume per subgraph deployment. This would likely be a parking_lot::RwLock<HashMap<DeploymentId, AtomicUsize>>. Inserting into the map should be relatively infrequent, and updating an existing entry only requires a read lock. The atomic counter is incremented by the number of indexers selected, and decremented by one as each indexer request completes. A deployment is in a "high volume" state when this counter is above some threshold n, meaning there are more than approximately n indexer requests outstanding concurrently.
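A minimal sketch of that counter, under stated assumptions: `VolumeTracker` and its method names are hypothetical, `DeploymentId` is stood in for by a `String`, and `std::sync::RwLock` replaces `parking_lot::RwLock` so the snippet needs only the standard library. The locking pattern would be the same with `parking_lot`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for the gateway's deployment identifier type.
pub type DeploymentId = String;

/// Tracks the number of outstanding indexer requests per deployment.
#[derive(Default)]
pub struct VolumeTracker {
    outstanding: RwLock<HashMap<DeploymentId, AtomicUsize>>,
}

impl VolumeTracker {
    /// Records `selected` new indexer requests and returns the new total.
    pub fn start(&self, deployment: &DeploymentId, selected: usize) -> usize {
        {
            // Fast path: the entry already exists, so a read lock suffices.
            let map = self.outstanding.read().unwrap();
            if let Some(counter) = map.get(deployment) {
                return counter.fetch_add(selected, Ordering::Relaxed) + selected;
            }
        } // Read lock is released here, before the write lock is taken.

        // Slow path (expected to be rare): insert the entry under a write lock.
        let mut map = self.outstanding.write().unwrap();
        let counter = map
            .entry(deployment.clone())
            .or_insert_with(|| AtomicUsize::new(0));
        counter.fetch_add(selected, Ordering::Relaxed) + selected
    }

    /// Records the completion of one indexer request (saturating at zero).
    pub fn finish(&self, deployment: &DeploymentId) {
        let map = self.outstanding.read().unwrap();
        if let Some(counter) = map.get(deployment) {
            // checked_sub returns None at zero, which leaves the counter unchanged.
            let _ = counter.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
                n.checked_sub(1)
            });
        }
    }

    /// True when more than `threshold` indexer requests are outstanding.
    pub fn is_high_volume(&self, deployment: &DeploymentId, threshold: usize) -> bool {
        let map = self.outstanding.read().unwrap();
        let high = map
            .get(deployment)
            .map_or(false, |c| c.load(Ordering::Relaxed) > threshold);
        high
    }
}
```

The caller would invoke `start` with the size of the selected set right after selection, and `finish` once per indexer response. Relaxed ordering is assumed to be enough here because the counter is only an approximate load signal, not a synchronization point.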
There are multiple potential approaches for what to do when we detect high volume on a deployment. Here's a list in increasing order of difficulty, which might be a reasonable order of iterations to work through until we hit "good enough for now":
- When the deployment is "high volume", call indexer_selection::select with a limit of 1 instead of 3.
- Add a parameter to indexer_selection::select that acts as a cost to including additional indexers in the selected set. This value should increase at higher volume.
- Use a proper load-balancing algorithm between the selected indexers, see this for inspiration: https://samwho.dev/load-balancing/.
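The three iterations above could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the linear cost shape, and the use of least-outstanding-requests as the balancing strategy are not taken from the gateway, and none of this reflects the actual signature of indexer_selection::select.

```rust
/// Default number of indexers selected (the "often 3" mentioned above).
const DEFAULT_LIMIT: usize = 3;

/// Iteration 1: drop the selection limit to 1 once the deployment's
/// outstanding request count exceeds the threshold.
pub fn selection_limit(outstanding: usize, threshold: usize) -> usize {
    if outstanding > threshold {
        1
    } else {
        DEFAULT_LIMIT
    }
}

/// Iteration 2 (hypothetical shape): a cost applied per additional indexer.
/// It is zero at or below the threshold and grows linearly with the excess.
pub fn extra_indexer_cost(outstanding: usize, threshold: usize, cost_per_request: f64) -> f64 {
    outstanding.saturating_sub(threshold) as f64 * cost_per_request
}

/// Iteration 3 (one possible strategy): among the candidate indexers, pick
/// the one with the fewest in-flight requests. Each candidate is a pair of
/// (indexer id, in-flight request count). Ties go to the earliest candidate.
pub fn least_loaded<'a>(candidates: &'a [(&'a str, usize)]) -> Option<&'a str> {
    candidates
        .iter()
        .min_by_key(|(_, in_flight)| *in_flight)
        .map(|(id, _)| *id)
}
```

The first function is a hard switch, the second lets selection degrade gradually as volume rises, and the third only makes sense if the gateway also tracks in-flight requests per indexer rather than only per deployment.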