[TOC]
The overall process involves four steps:
- Create BigQuery Dataset: Ensure that BigQuery has a dataset to hold tables.
- Data Preparation: Convert source data (e.g., Shapefiles) to CSV or Parquet with geometries represented as WKT strings. Though Parquet also works as an intermediate format, this guide uses CSV for simplicity.
- Upload to Staging: Load the CSV data into a BigQuery staging (temporary) table
where geometry is kept as a
STRING. This avoids load failures caused by invalid geometries. - Finalize Table: Create the final table by selecting data from the
staging table into a final table, converting the WKT string column into a
GEOGRAPHYtype.
Before uploading, ensure your project has a BigQuery dataset that will contain the tables exists in BigQuery. If it does not, create it using the Cloud Console or by running the following SQL:
-- Replace 'your_dataset_name' with the desired dataset name
CREATE SCHEMA IF NOT EXISTS `your-project.your_dataset_name`
OPTIONS(location="US");If your data is already in CSV format with geometry provided in one column as WKT strings, you can skip this step and proceed to Step 3.
Source data requires conversion to CSV before ingestion if it is currently in another format. The Python script below uses GeoPandas to read a shapefile, ensure the projection is EPSG:4326 (WGS 84), convert geometries to WKT strings, and save the result as a CSV file.
Note: If you are working with WDPA data that comes in a complex nested-zip format, you can use this specialized conversion script instead.
Prerequisites:
Install necessary Python libraries:
pip install geopandas pyarrow shapely google-cloud-bigqueryScript:
import geopandas as gpd
import pandas as pd
import os
# --- CONFIGURATION ---
# The path to your shapefile
SHP_PATH = "/usr/local/google/home/kumaraashutosh/Documents/WDPCA/WDPA_WDOECM_Jan2026_Public_all_shp/WDPA_WDOECM_Jan2026_Public_all_shp_0/WDPA_WDOECM_Jan2026_Public_all_shp-points.shp"
# Where you want the final CSV file to be saved
OUTPUT_DIR = "./csv_output"
OUTPUT_FILENAME = "output.csv"
def convert_shp_to_csv(shp_path, output_dir, output_filename):
"""
Reads a shapefile, reprojects it to EPSG:4326,
converts geometry to WKT, and saves it as a CSV file.
"""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
print(f"--- Processing {shp_path} ---")
try:
gdf = gpd.read_file(shp_path)
except Exception as e:
print(f"Error reading shapefile '{shp_path}': {e}")
return
if gdf.empty:
print(f"Shapefile is empty. Skipping.")
return
# 1. Force CRS to WGS84 (BigQuery Requirement for GEOGRAPHY)
if gdf.crs is None or gdf.crs.to_epsg() != 4326:
print(f"Reprojecting from {gdf.crs} to EPSG:4326...")
gdf = gdf.to_crs(epsg=4326)
# 2. Fix potential column name issues
# BigQuery columns must be alphanumeric/underscores.
gdf.columns = [c.replace(' ', '_').replace('-', '_').replace('.', '_') for c in gdf.columns]
# 3. Convert geometry to WKT string to ensure compatibility
# with ST_GEOGFROMTEXT in BigQuery.
print("Converting geometry to WKT...")
geom_col = gdf.geometry.name
df = pd.DataFrame(gdf.copy())
df["geometry"] = gdf[geom_col].apply(
lambda x: x.wkt if x is not None and not x.is_empty else None
)
if geom_col != "geometry" and geom_col in df.columns:
df = df.drop(columns=[geom_col])
# 4. Export to CSV
out_path = os.path.join(output_dir, output_filename)
print(f"Writing {out_path}...")
df.to_csv(out_path, index=False)
print(f"\nSuccess! CSV file is located at: {os.path.abspath(out_path)}")
if __name__ == "__main__":
convert_shp_to_csv(SHP_PATH, OUTPUT_DIR, OUTPUT_FILENAME)Result: After running the script, your OUTPUT_DIR will contain output.csv. Proceed to Step 3.
We first load data into a staging table where the geometry column is treated as a STRING. This allows us to handle very large geometry strings and bypass errors during the initial load if some geometries are invalid.
If you generated CSV files in Step 2, or have existing CSV files, upload them to Google Cloud Storage (GCS) and then load them into BigQuery using the bq command-line tool.
3.1. Upload CSV files to GCS:
Replace your-bucket-name with your GCS bucket and csv_output with the directory containing your CSV files.
gcloud storage cp ./csv_output/*.csv gs://your-bucket-name/wdpa_staging_csv/3.2. Load data from GCS into BigQuery:
This command loads all CSV files from the GCS path into your_dataset_name.staging_table, replacing the table if it exists. BigQuery will attempt to infer the schema from the CSV files.
bq --location=US load --replace \
--source_format=CSV \
--autodetect=true \
--allow_quoted_newlines=true \
your-project:your_dataset_name.staging_table \
"gs://your-bucket-name/wdpa_staging_csv/*.csv"Once the data resides in the staging table with geometry as a WKT STRING, use the following SQL query to create the final table. This step converts the WKT string to a GEOGRAPHY type using ST_GEOGFROMTEXT.
It uses make_valid => TRUE, which directs BigQuery to automatically fix common spherical geometry issues like self-intersections or incorrect winding order, which would otherwise cause conversion to fail.
For optimal query performance on geospatial data, the final table must be clustered by the geography column, as shown in the query below.
CREATE OR REPLACE TABLE `your-project.your_dataset_name.final_table`
CLUSTER BY the_geom
AS
SELECT
* EXCEPT(geometry), -- Select all columns except the original WKT string
ST_GEOGFROMTEXT(geometry, make_valid => TRUE) AS the_geom -- Convert WKT to GEOGRAPHY, fixing errors
FROM
`your-project.your_dataset_name.staging_table`
WHERE
geometry IS NOT NULL AND geometry != '';Your data, including a valid the_geom column of type GEOGRAPHY, is now ready in final_table.
-
WKT String is too long in source file
- If you have geometries with millions of vertices, the WKT string itself can be very large. While BigQuery supports large cells, simplification might be necessary if they exceed limits or if processing is too slow.
- Fix: Use the Python
shapelylibrary to simplify the geometry before creating the CSV file or uploading. This reduces vertex count while preserving the general shape.
from shapely.wkt import loads # Assuming 'poly' is a shapely geometry object simplified_poly = poly.simplify(tolerance=0.0001) # Adjust tolerance as needed wkt_string = simplified_poly.wkt
-
Memory Errors during Shapefile conversion
- If the Shapefile-to-CSV conversion script in Step 1 runs out of RAM, your geometries or rows are too large for system memory.
- Fix: If a single shapefile is too large to read into memory, it may need to be split before conversion, or you may need to run conversion on a machine with more RAM.
-
Alternative Upload: Python Chunks for Large Local CSVs
- For very large local CSV files that are difficult to upload to Cloud Storage, or for cases where CSV rows contain extremely large WKT strings that cause memory issues with
bq load, you can use this Python script to upload directly to BigQuery in chunks. This method can also help if you encounter memory errors during upload. - If you use this script and still encounter memory errors, try reducing
CHUNK_SIZE(e.g., to 1000 or 500).
- For very large local CSV files that are difficult to upload to Cloud Storage, or for cases where CSV rows contain extremely large WKT strings that cause memory issues with
Script:
import pandas as pd
from google.cloud import bigquery
# Configuration
PROJECT_ID = 'your-project'
DATASET_ID = 'your_dataset_name'
TABLE_ID = 'staging_table'
FILE_PATH = 'large_geodata.csv'
CHUNK_SIZE = 10000 # Adjust based on memory
client = bigquery.Client(project=PROJECT_ID)
table_ref = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"
# Job configuration to ensure WKT is treated as a STRING
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV,
autodetect=True,
allow_quoted_newlines=True,
)
# Process in chunks
reader = pd.read_csv(FILE_PATH, chunksize=CHUNK_SIZE, engine='python') # engine='python' may be slower but more flexible
for i, chunk in enumerate(reader):
print(f"Uploading chunk {i+1}...")
# Ensure geometry column is string type before upload
if 'geometry' in chunk.columns:
chunk['geometry'] = chunk['geometry'].astype(str)
# If it's the first chunk, truncate the table. Otherwise, append.
job_config.write_disposition = "WRITE_TRUNCATE" if i == 0 else "WRITE_APPEND"
load_job = client.load_table_from_dataframe(
chunk, table_ref, job_config=job_config
)
load_job.result() # Wait for job to finish
print("Staging upload complete.")