Aircan is a collection of Apache Airflow DAGs for building and operating data pipelines. It provides reusable components for common data engineering tasks, including cloud storage integration, data warehouse loading, schema management, validation, and notifications, and can be extended to support any data pipeline use case.
```
aircan/
├── aircan/
│   ├── dags/
│   │   ├── pipeline_ckan_to_bigquery.py          # Main DAG: CKAN CSV → GCS → BigQuery
│   │   ├── pipeline_ckan_to_bigquery_legacy.py   # Main DAG: CKAN CSV → GCS → BigQuery (legacy version)
│   │   ├── pipeline_ckan_to_postgres_legacy.py   # Main DAG: CKAN CSV → GCS → Postgres Datastore (legacy version)
│   │   └── other_dag.py                          # Other DAGs can be added here
│   ├── dependencies_legacy/                      # Legacy dependencies for older DAG versions (to be deprecated)
│   └── dependencies/
│       ├── cloud/
│       │   ├── clients.py                        # BigQuery / GCS Airflow hook wrappers
│       │   ├── storage.py                        # GCS upload / download / signed URL helpers
│       │   └── warehouse.py                      # BigQuery load, upsert, schema helpers
│       └── utils/
│           ├── ckan.py                           # CKAN status update (async, non-blocking)
│           ├── email.py                          # SMTP alert email helpers
│           ├── schema.py                         # Frictionless → BigQuery schema conversion
│           └── validation.py                     # CSV validation via frictionless
├── docker.compose.yaml                           # Local development Airflow cluster
└── config/                                       # Airflow config overrides (airflow.cfg)
```
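To illustrate the kind of helper `dependencies/utils/schema.py` provides, here is a minimal sketch of a Frictionless-to-BigQuery type conversion. The function name, mapping table, and field layout are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of converting a Frictionless (Table Schema) field list
# into a BigQuery schema. The mapping below covers common Table Schema types.
FRICTIONLESS_TO_BIGQUERY = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "FLOAT",
    "boolean": "BOOLEAN",
    "date": "DATE",
    "datetime": "TIMESTAMP",
}

def frictionless_to_bigquery_schema(fields):
    """Map Frictionless field descriptors to BigQuery column definitions."""
    return [
        {
            "name": field["name"],
            # Fall back to STRING for unknown or missing types.
            "type": FRICTIONLESS_TO_BIGQUERY.get(field.get("type", "string"), "STRING"),
            "mode": "NULLABLE",
        }
        for field in fields
    ]

# Example: a two-column CSV described by a Frictionless schema.
schema = frictionless_to_bigquery_schema(
    [{"name": "id", "type": "integer"}, {"name": "created", "type": "datetime"}]
)
```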
Aircan DAGs can be triggered in several ways depending on your use case:
- CKAN integration – Install the ckanext-aircan extension, an alternative to DataPusher and XLoader, which automatically triggers the DAG whenever a new CSV resource is added or updated in CKAN.
- Scheduled runs – Configure a cron-based schedule directly on the DAG to ingest data at regular intervals (e.g. nightly, hourly).
- Manual triggers – Run a DAG on demand via the Airflow UI, passing any required parameters at runtime.
- REST API – Trigger DAGs programmatically using the Airflow REST API, suitable for integration with external systems or CI/CD pipelines.
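As a sketch of the REST API option, the request below targets the Airflow stable REST API's `dagRuns` endpoint. The base URL, credentials, and the `site_id` value in `conf` are placeholders for your own deployment:

```python
import json
from urllib import request

# Placeholder values for a local deployment.
AIRFLOW_BASE = "http://localhost:8080"
DAG_ID = "pipeline_ckan_to_bigquery"

# Run parameters are passed in "conf"; "demo" is a hypothetical site_id.
payload = {"conf": {"site_id": "demo"}}

req = request.Request(
    f"{AIRFLOW_BASE}/api/v1/dags/{DAG_ID}/dagRuns",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Against a running Airflow, add an Authorization header (or basic auth)
# and send the request with request.urlopen(req).
```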
Copy the template below and fill in your values. The Docker Compose file reads this file automatically.
```
# ── Airflow infrastructure ──────────────────────────────────────────────────
AIRFLOW_IMAGE_NAME=apache/airflow:3.1.7   # Docker image to use
AIRFLOW_UID=50000                         # UID inside containers (use $(id -u) on Linux)
AIRFLOW_PROJ_DIR=.                        # Host directory mounted into containers

# ── Admin credentials ───────────────────────────────────────────────────────
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow

# ── Optional: custom env file path ──────────────────────────────────────────
# ENV_FILE_PATH=.env                      # Default: .env in project root
```

Note: `.env` is git-ignored. Never commit credentials.
```bash
docker compose -f docker.compose.yaml up -d
```

The Airflow UI will be available at http://localhost:8080 (default credentials: airflow / airflow).
Each pipeline run is namespaced by a `site_id` (set in the DAG trigger params). Airflow connection IDs follow the pattern `{site_id}_{type}`. Register the three connections below before triggering any DAG.
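The naming convention can be sketched as a one-line helper. The connection types shown ("bigquery", "gcs", "ckan") are illustrative guesses, not necessarily the types this project registers:

```python
def conn_id(site_id: str, conn_type: str) -> str:
    """Build an Airflow connection ID following the {site_id}_{type} pattern."""
    return f"{site_id}_{conn_type}"

# For a run triggered with site_id="demo", the DAG would look up IDs such as:
ids = [conn_id("demo", t) for t in ("bigquery", "gcs", "ckan")]
# e.g. ["demo_bigquery", "demo_gcs", "demo_ckan"]
```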
| DAG ID | Description | Documentation |
|---|---|---|
| `pipeline_ckan_to_bigquery` | Load a CSV from CKAN into BigQuery via GCS | pipeline_ckan_to_bigquery.md |
| `pipeline_ckan_to_bigquery_legacy` | Legacy version of the above DAG | |
| `pipeline_ckan_to_postgres_legacy` | Legacy DAG for loading CKAN CSV into Postgres Datastore | |