Bug Description
When using BigQuery with a GCS export bucket (CUBEJS_DB_EXPORT_BUCKET), the BigQueryDriver.unload() method generates signed URLs for every exported CSV file and passes them all to CubeStore via CREATE TABLE ... LOCATION. CubeStore then attempts to download all files concurrently using FuturesUnordered in estimate_location_row_count, which overwhelms the HTTP connection pool and fails.
For large tables (e.g. 48M rows), BigQuery's EXPORT DATA produces ~640 CSV gzip files. CubeStore fires 640 concurrent HTTPS requests to GCS, and the reqwest client fails with connection errors after ~20-100 concurrent connections.
Error
```
Internal: error sending request for url (https://storage.googleapis.com/bucket/file.csv.gz?GoogleAccessId=...&Expires=...&Signature=...)
```
Originating from:
```
<cubestore::CubeError as core::convert::From<reqwest::error::Error>>::from
```
Environment
- Cube.js: v1.6.36
- CubeStore: v1.6
- Database: BigQuery
- Export bucket: GCS
- Docker Compose deployment (single-node CubeStore)
- Table size: 48M rows → 640 CSV gzip files (~6MB each, ~4GB total)
Steps to Reproduce
- Configure BigQuery with GCS export bucket
- Define an originalSql pre-aggregation with external: true on a large table (>10M rows); a minimal model is sketched after this list
- Trigger pre-aggregation build
- CubeStore fails during estimate_location_row_count — 640 concurrent signed-URL HEAD requests overwhelm the connection pool
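For reference, a minimal model that exercises this path; the cube, dataset, and table names are placeholders, not from the original report:

```js
// Hypothetical schema — `Events` / `my_dataset.events` are placeholder names.
cube('Events', {
  sql: `SELECT * FROM my_dataset.events`, // >10M-row BigQuery table

  preAggregations: {
    main: {
      type: 'originalSql',
      external: true, // result set is copied into CubeStore via the export bucket
    },
  },
});
```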
Root Cause Analysis
BigQuery driver (BigQueryDriver.js, unload() method):
- Exports the table to GCS via createExtractJob — produces N files (BigQuery controls the shard count, ~1GB uncompressed per file)
- Lists all exported files and generates a signed URL for each (sketched below)
- Returns all signed URLs in the csvFile array
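The listing-and-signing step has roughly this shape; this is a simplified sketch using the @google-cloud/storage API, not the driver's actual code, and the bucket name and prefix are illustrative:

```js
// Simplified sketch of the signed-URL fan-out; not BigQueryDriver's real code.
const { Storage } = require('@google-cloud/storage');

async function signedUrlsForExport(bucketName, prefix) {
  const storage = new Storage();
  const [files] = await storage.bucket(bucketName).getFiles({ prefix });
  // One signed URL per exported shard: ~640 of them for a 48M-row table.
  return Promise.all(
    files.map(async (file) => {
      const [url] = await file.getSignedUrl({
        action: 'read',
        expires: Date.now() + 60 * 60 * 1000, // valid for one hour
      });
      return url;
    }),
  );
}
```

Every one of these URLs ends up in a single CREATE TABLE ... LOCATION statement, so the file count translates directly into CubeStore's request concurrency.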
CubeStore (import/mod.rs, estimate_location_row_count):
- Receives all URLs in CREATE TABLE ... LOCATION url1, url2, ..., urlN
- Fires all HEAD requests concurrently via FuturesUnordered
- The reqwest connection pool fails somewhere between 20 and 100 concurrent connections under Docker networking
Individual connections work fine (verified with curl and openssl from inside the container). The issue is purely concurrent connection exhaustion.
Suggested Fixes
Option A: CubeStore — limit concurrency in estimate_location_row_count
Use a semaphore or buffer_unordered(N) instead of an unbounded FuturesUnordered when issuing the HEAD requests. A concurrency limit of 20-50 would avoid connection exhaustion while remaining fast.
Option B: BigQuery driver — use temp:// upload path
Instead of returning signed URLs, download the GCS files in the Cube API container and upload them to CubeStore's /upload-temp-file HTTP endpoint with controlled concurrency (e.g. 10 at a time), then return temp://filename URIs. This bypasses the signed URL path entirely.
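A sketch of that flow, assuming Node 18+ (global fetch), service-account credentials picked up by @google-cloud/storage, and that /upload-temp-file accepts the raw file contents as the POST body (the exact request contract should be verified against the CubeStore driver):

```js
// Sketch of Option B: bounded-concurrency download-and-reupload.
const { Storage } = require('@google-cloud/storage');

async function uploadViaTempFiles(bucketName, objectNames, cubestoreUrl, concurrency = 10) {
  const bucket = new Storage().bucket(bucketName);
  const queue = [...objectNames];
  const tempNames = [];

  // Worker pool: `concurrency` workers drain a shared queue, so at most
  // `concurrency` download/upload pairs are in flight at once.
  const worker = async () => {
    for (let name = queue.shift(); name !== undefined; name = queue.shift()) {
      // Download with service-account credentials (no signed URL involved).
      const [contents] = await bucket.file(name).download();
      const tempName = name.replace(/\//g, '_'); // flatten the object path to a temp name
      const res = await fetch(
        `${cubestoreUrl}/upload-temp-file?name=${encodeURIComponent(tempName)}`,
        { method: 'POST', body: contents },
      );
      if (!res.ok) throw new Error(`upload of ${tempName} failed: HTTP ${res.status}`);
      tempNames.push(`temp://${tempName}`);
    }
  };

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return tempNames; // becomes the csvFile array that unload() returns
}
```

With ~640 files and a concurrency of 10, at most 10 connections are open to GCS and 10 to CubeStore at any moment, well below the 20-100 range where the signed-URL path falls over.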
Current Workaround
We override unload() via driverFactory in cube.js to download GCS files locally and upload them to CubeStore's temp file API with a concurrency of 10:
```js
driverFactory: ({ dataSource } = {}) => {
  const { BigQueryDriver } = require('@cubejs-backend/bigquery-driver');
  const driver = new BigQueryDriver({});
  const origUnload = driver.unload.bind(driver);
  driver.unload = async (table, options) => {
    // ... run the GCS export exactly as origUnload does ...
    // Download each file from GCS using SA credentials (no signed URLs)
    // Upload to CubeStore via POST /upload-temp-file?name=<name>
    // Return { csvFile: tempNames.map((n) => `temp://${n}`) }
  };
  return driver;
},
```
This works but shouldn't be necessary — the driver or CubeStore should handle large file counts gracefully.