This directory contains all SQL scripts for creating, populating, and managing the data warehouse, including the star schema, dimensions, facts, and datamarts.
The SQL scripts implement a complete ETL (Extract, Transform, Load) pipeline that transforms raw OSM notes data into a star schema data warehouse with pre-computed datamarts for analytics.
```
sql/
└── dwh/                                    # Data warehouse SQL scripts
    ├── ETL_*.sql                           # Main ETL scripts
    ├── Staging_*.sql                       # Staging area scripts
    ├── datamarts_lastYearActivities.sql    # Last year activity aggregation
    ├── datamartCountries/                  # Country datamart SQL
    │   ├── datamartCountries_00_dropDatamartObjects.sql
    │   ├── datamartCountries_10_checkDatamartCountriesTables.sql
    │   ├── datamartCountries_11_createDatamarCountriesTable.sql
    │   ├── datamartCountries_12_createProcedure.sql
    │   ├── datamartCountries_21_alterTableAddYears.sql
    │   └── datamartCountries_30_populateDatamartCountriesTable.sql
    └── datamartUsers/                      # User datamart SQL
        ├── datamartUsers_00_dropDatamartObjects.sql
        ├── datamartUsers_10_checkDatamartUsersTables.sql
        ├── datamartUsers_11_createDatamartUsersTable.sql
        ├── datamartUsers_12_createProcedure.sql
        ├── datamartUsers_21_alterTableAddYears.sql
        └── datamartUsers_31_populateDatamartUsersTable.sql
```
SQL scripts follow a structured naming pattern:

```
<Component>_<Phase><Step>_<Description>.sql
```

- Component: ETL, Staging, datamartCountries, datamartUsers
- Phase: Numeric prefix indicating execution order
  - 1x: Setup and validation
  - 2x: Object creation
  - 3x: Data population
  - 4x: Constraints and indexes
  - 5x: Finalization
  - 6x: Incremental updates
- Step: Sequential number within a phase
- Description: What the script does

Examples:

- `ETL_11_checkDWHTables.sql` - Check if DWH tables exist
- `ETL_20_createDWHTables.sql` - Create DWH dimension and fact tables
- `Staging_31_createBaseStagingObjects.sql` - Create staging objects
- `datamartCountries_30_populateDatamartCountriesTable.sql` - Populate the country datamart
Purpose: Validates that required base tables exist in the public schema.
Checks:
- `public.notes` - Note records
- `public.note_comments` - Comment records
- `public.users` - User information
- `public.countries` - Country boundaries
Exit Behavior: Returns error if any required table is missing.
Usage: Automatically called by ETL.sh
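A check of this kind can be sketched against `information_schema` (the actual query and error handling in the script may differ):

```sql
-- Sketch of a base-table existence check; the real script may test all
-- four tables and report differently.
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT 1 FROM information_schema.tables
    WHERE table_schema = 'public' AND table_name = 'notes'
  ) THEN
    RAISE EXCEPTION 'Required base table public.notes is missing';
  END IF;
END
$$;
```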
Purpose: Drops all datamart objects (tables, procedures, functions).
Objects Dropped:
- `dwh.datamartCountries`
- `dwh.datamartUsers`
- Related procedures and functions
Warning: This is destructive! Only used during full rebuild.
Usage: Called by ETL.sh on first execution (auto-detected)
Purpose: Drops all DWH objects (dimensions, facts, staging).
Objects Dropped:
- All dimension tables
- `dwh.facts` table
- Staging schema objects
- Functions and procedures
Warning: This removes all data warehouse data!
Usage: Called by ETL.sh on first execution (auto-detected) for clean rebuild
Purpose: Creates the complete star schema structure.
Creates:

- Schema: `dwh`
- Dimension Tables:
  - `dimension_users` - User information with SCD2 support
  - `dimension_countries` - Country information
  - `dimension_regions` - Geographic regions
  - `dimension_continents` - Continental groupings
  - `dimension_days` - Date dimension
  - `dimension_time_of_week` - Hour/day-of-week dimension
  - `dimension_applications` - Applications used to create notes
  - `dimension_application_versions` - Version tracking
  - `dimension_hashtags` - Hashtags found in notes
  - `dimension_timezones` - Timezone information
  - `dimension_seasons` - Seasonal information
- Fact Table:
  - `dwh.facts` - Central fact table with all note actions (see the Data Dictionary for complete column definitions)
- Control Tables:
  - `dwh.properties` - ETL metadata
  - `dwh.contributor_types` - User classification types
Key Features:
- Surrogate keys for all dimensions
- Support for slowly changing dimensions (SCD2)
- Optimized data types
- Basic constraints (enhanced in Phase 4)
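The pattern can be sketched roughly as follows (column names are illustrative, not the actual DDL; see the Data Dictionary for the real definitions):

```sql
-- Illustrative star-schema pattern: a surrogate key plus SCD2 validity
-- columns on a dimension, and a fact row referencing the dimension key.
CREATE TABLE dwh.dimension_users (
  dimension_user_id SERIAL PRIMARY KEY,           -- surrogate key
  user_id           INTEGER NOT NULL,             -- natural key from OSM
  username          TEXT,
  valid_from        DATE NOT NULL DEFAULT CURRENT_DATE,  -- SCD2 window start
  valid_to          DATE,                                -- NULL = current row
  is_current        BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE dwh.facts (
  fact_id                  BIGSERIAL PRIMARY KEY,
  action_dimension_id_user INTEGER REFERENCES dwh.dimension_users,
  action_at                TIMESTAMP NOT NULL
);
```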
Purpose: Populates world regions for countries.
Creates:
- Regional groupings (South America, Europe, Asia, etc.)
- Continental associations
- Geographic hierarchies
Data Source: Predefined regional classifications
Purpose: Creates utility functions for ETL processing.
Functions:
- Date/time conversion functions
- Dimension lookup functions
- Data validation functions
- Helper utilities
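A dimension lookup helper in this style might look like the following sketch (the function name, signature, and column names are assumptions, not the project's actual API):

```sql
-- Illustrative lookup: map a timestamp to its dimension_days surrogate key.
-- Names here are hypothetical; the real functions live in the ETL scripts.
CREATE OR REPLACE FUNCTION dwh.get_date_id(ts TIMESTAMP)
RETURNS INTEGER AS $$
  SELECT dimension_day_id
  FROM dwh.dimension_days
  WHERE date_id = ts::DATE;
$$ LANGUAGE sql STABLE;
```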
Purpose: Creates and populates ISO country codes reference table.
Creates:
- `dwh.iso_country_codes` table with ~60 major countries
- Mapping from OSM country relation IDs to ISO 3166-1 codes
- Alpha-2 codes (e.g., CO, US, DE)
- Alpha-3 codes (e.g., COL, USA, DEU)
Maintenance:
- To add new countries, edit the INSERT statement in this file
- Countries not in the table will have NULL ISO codes (acceptable)
- See `sql/dwh/ISO_CODES_README.md` for detailed instructions
Note: This is a reference table, not populated from base tables. ISO codes are optional enrichment data.
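Adding a country amounts to extending the INSERT with one more row, along these lines (the column names and the relation ID shown are assumptions for illustration):

```sql
-- Hypothetical row: OSM country relation ID, alpha-2 and alpha-3 codes.
-- Check the actual column names in the script before editing.
INSERT INTO dwh.iso_country_codes (country_id, alpha2, alpha3)
VALUES (120027, 'CO', 'COL');
```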
Purpose: Initial population of dimension tables.
Populates:
- `dimension_days` - Dates from 2013 to current year + 5
- `dimension_time_of_week` - All 168 hours of the week
- `dimension_applications` - Known applications
- `dimension_continents` - World continents
- `dimension_regions` - Geographic regions
- `dimension_countries` - From base tables
- `dimension_timezones` - World timezones
- `dimension_seasons` - Seasonal classifications
Note: `dimension_users` is populated incrementally during fact loading
Purpose: Updates dimension tables with new data from base tables.
Updates:
- New users added
- Username changes (SCD2 handling)
- New countries
- Application versions
Usage: Called during both initial and incremental loads
Note: Renamed countries are reported with a SELECT (visible in psql output). A previous COPY ... TO '/tmp/...' was removed because server-side COPY TO file requires the PostgreSQL role to have pg_write_server_files (or equivalent superuser-like rights). To save the same report as CSV from your machine, run the same query with psql's client-side \copy (writes to a path on the client, not on the server).
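For example, a client-side export of such a report might look like this (the SELECT is a simplified stand-in, not the script's exact query):

```sql
-- Run from psql on your machine; \copy writes to a client-side path,
-- so no server-side file privileges are needed.
\copy (SELECT dimension_user_id, username FROM dwh.dimension_users) TO 'renamed_users.csv' WITH (FORMAT csv, HEADER)
```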
Purpose: Creates the staging schema and base staging objects.
Creates:
- `staging` schema
- Base staging tables
- Temporary processing tables
Purpose: Creates staging procedures and functions.
Functions/Procedures:
- `process_notes_at_date()` - Process notes for a specific date range
- Per-year variants for parallel processing
- Dimension lookup functions
- Data transformation utilities
Purpose: Creates year-specific staging tables for parallel processing.
Creates: staging.facts_YYYY for each year (2013-present)
Benefit: Allows parallel loading by year to improve performance
Purpose: Creates procedures for loading facts for a specific year.
Dynamic: Year substituted via environment variable $YEAR
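The substitution mechanism can be sketched as follows (the template text and the `sed`-based rendering are assumptions; the real scripts may use `envsubst` or psql variables instead):

```shell
# Render a ${YEAR} placeholder into year-specific SQL before execution.
# Template text and variable names here are illustrative.
YEAR=2019
TEMPLATE='CREATE TABLE IF NOT EXISTS staging.facts_${YEAR} (LIKE dwh.facts);'
RENDERED=$(printf '%s\n' "$TEMPLATE" | sed "s/\${YEAR}/${YEAR}/g")
printf '%s\n' "$RENDERED"
# The rendered SQL would then be fed to: psql -d notes_dwh
```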
Purpose: Executes the fact loading procedure for a specific year.
Process:
- Reads notes/comments from base tables for the year
- Resolves dimension keys
- Computes metrics (days to resolution, etc.)
- Loads into `staging.facts_YYYY`
Performance: Runs in parallel (one per year)
Purpose: Drops staging objects after facts are loaded into main table.
Cleanup: Removes staging.facts_YYYY tables
Purpose: Adds referential integrity, indexes, and triggers to fact table.
Adds:

- Foreign Key Constraints:
  - Links to all dimension tables
  - Ensures referential integrity
- Indexes:
  - Timestamp indexes for time-based queries
  - Dimension key indexes for joins
  - Composite indexes for common query patterns
  - Covering indexes for reporting queries
- Triggers:
  - Resolution metrics calculation on insert
  - Data validation triggers
  - Audit logging triggers
Performance Impact: Significantly improves query performance
Purpose: Unifies facts across years and computes cross-year metrics.
Process:

- Copies facts from all `staging.facts_YYYY` to `dwh.facts`
- Computes metrics that span years
- Fills in `recent_opened_dimension_id_date` for all facts
- Validates data consistency
Critical: Must run after all year-specific loads complete
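The unification step boils down to an INSERT ... SELECT per staging table, roughly like this sketch (simplified; the real script also computes the cross-year metrics afterwards):

```sql
-- Simplified sketch: copy one year's staging facts into the main table.
INSERT INTO dwh.facts
SELECT * FROM staging.facts_2013;
-- ...repeated (or generated dynamically) for each staging.facts_YYYY
```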
Purpose: Loads new notes incrementally (for scheduled updates).
Process:
- Identifies notes added since last ETL run
- Processes only new note actions
- Updates dimensions if needed
- Appends to fact table
Performance: Much faster than full reload (minutes vs hours)
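Conceptually, the incremental selection looks like the following sketch (the property key and column names are assumptions for illustration):

```sql
-- Sketch: pick up only note comments newer than the last recorded ETL run.
-- 'last_etl_run', 'key', 'value', and 'created_at' are hypothetical names.
SELECT c.*
FROM public.note_comments c
WHERE c.created_at > (
  SELECT value::TIMESTAMP
  FROM dwh.properties
  WHERE key = 'last_etl_run'
);
```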
Location: sql/dwh/datamartCountries/
Validates that country datamart prerequisites exist.
Creates dwh.datamartCountries table with:
- Country-level aggregates
- Yearly history (2013-present)
- Current/monthly/daily statistics
- Rankings and leaderboards
- Activity patterns
Creates stored procedure to populate the country datamart.
Procedure: dwh.populate_datamart_countries()
Dynamically adds columns for new years as time passes.
Example: Adds history_2026_open column in 2026
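The dynamic ALTER can be sketched like this (column naming follows the `history_YYYY_open` example above; the procedure body itself is an assumption):

```sql
-- Sketch: add the current year's history column if it does not exist yet.
DO $$
DECLARE
  col TEXT := format('history_%s_open', EXTRACT(YEAR FROM CURRENT_DATE)::INT);
BEGIN
  EXECUTE format(
    'ALTER TABLE dwh.datamartCountries ADD COLUMN IF NOT EXISTS %I INTEGER',
    col
  );
END
$$;
```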
Main population script that:
- Aggregates facts by country
- Computes all metrics
- Generates rankings
- Calculates working hours patterns
- Updates all rows
Execution Time: ~20 minutes
Cleanup script to remove all country datamart objects.
Location: sql/dwh/datamartUsers/
Validates user datamart prerequisites.
Creates dwh.datamartUsers table with:
- User-level aggregates
- Yearly history per user
- Contribution patterns
- Country rankings
- Activity classifications
Creates procedures for incremental user datamart updates.
Procedures:
- `dwh.populate_datamart_users(user_id)` - Single user
- `dwh.populate_datamart_users_batch()` - Batch of 500 users
Adds year columns for user datamart as needed.
Main incremental population script:
- Processes 500 users per run
- Marks users as processed
- Computes all user metrics
- Generates country rankings
Design: Incremental to avoid overwhelming database
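The batching can be pictured with a query along these lines (the `processed` bookkeeping column is a hypothetical name, not confirmed from the schema):

```sql
-- Sketch: select the next batch of 500 users that still need processing.
-- 'processed' is an assumed flag column.
SELECT dimension_user_id
FROM dwh.datamartUsers
WHERE processed IS NOT TRUE
ORDER BY dimension_user_id
LIMIT 500;
```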
Computes last year activity patterns (GitHub-style contribution graph).
Used By: Both country and user datamarts
For detailed ETL process documentation, see ETL Enhanced Features.
```mermaid
graph TD
    A[Start] --> B[Check Base Tables - 11]
    B --> C[Drop Existing - 12,13]
    C --> D[Create DWH Tables - 22]
    D --> E[Populate Regions - 23]
    E --> F[Add Functions - 24]
    F --> G[Populate Dimensions - 25,26]
    G --> H[Create Staging - 31,32,33]
    H --> I[Load Facts Parallel - 34,35]
    I --> J[Unify Facts - 51]
    J --> K[Add Constraints - 41]
    K --> L[Populate Datamarts]
    L --> M[End]
```
```mermaid
graph TD
    A[Start] --> B[Check Tables - 11]
    B --> C[Update Dimensions - 26]
    C --> D[Load New Notes - 61]
    D --> E[Update Datamarts]
    E --> F[End]
```
For complete ETL flow documentation, see:
- ETL Enhanced Features - ETL capabilities and features
- bin/dwh/README.md - High-level ETL overview
The data warehouse uses a star schema design with:
- Fact Table: `dwh.facts` - One row per note action (partitioned by year)
- Dimension Tables: Users, countries, dates, times, applications, hashtags, and more
- Datamart Tables: Pre-computed analytics for users and countries
For complete schema documentation:
- DWH Star Schema ERD - Complete entity-relationship diagram with all relationships and cardinalities
- Data Dictionary - Detailed column definitions for all tables
- `dwh.facts` - ~20M+ rows (depends on notes volume)
- `dimension_users` - ~500K rows
- `dimension_countries` - ~200 rows
- `dimension_days` - ~5K rows (2013-2030)
- `dimension_time_of_week` - 168 rows
- `dwh.datamartCountries` - ~200 rows
- `dwh.datamartUsers` - ~500K rows
- Facts loaded in parallel by year (2013-present)
- Each year runs as separate job
- Typically 12-13 parallel jobs
- Requires sufficient CPU/memory
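The parallel-by-year pattern can be sketched in shell; here `load_year` is a stand-in for the real per-year psql invocation, and the year range is truncated for illustration:

```shell
# Parallel-by-year load pattern; echo stands in for the real psql call.
load_year() {
  # a real wrapper would run something like:
  #   psql -d notes_dwh -v year="$1" -f <year-specific script>
  echo "loaded facts for $1"
}

OUT=$(
  for YEAR in 2013 2014 2015; do   # real range is 2013-present
    load_year "$YEAR" &            # one background job per year
  done
  wait                             # block until every per-year job finishes
)
printf '%s\n' "$OUT"
```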
- B-tree indexes on foreign keys
- Covering indexes for common queries
- Partial indexes for recent data
- GiST indexes for geographic queries
- Increase `work_mem` for loading:

  ```sql
  SET work_mem = '256MB';
  ```

- Disable autovacuum during the initial load and re-enable it afterwards:

  ```sql
  ALTER TABLE dwh.facts SET (autovacuum_enabled = false);
  -- Re-enable after load
  ALTER TABLE dwh.facts SET (autovacuum_enabled = true);
  ```

- Run `VACUUM ANALYZE` after load:

  ```sql
  VACUUM ANALYZE dwh.facts;
  ```
```bash
# Validate SQL syntax
# Note: psql has no --dry-run option. Run each script against a scratch
# database with ON_ERROR_STOP and grep the output for errors instead.
for file in sql/dwh/*.sql; do
  psql -d postgres -v ON_ERROR_STOP=1 -f "$file" 2>&1 | grep -i error
done

# Test single script
psql -d dwh -f sql/dwh/ETL_20_createDWHTables.sql

# Check for errors
echo $?  # Should be 0
```

```sql
-- Test dimension population
SELECT COUNT(*) FROM dwh.dimension_users;

-- Test fact table
SELECT COUNT(*) FROM dwh.facts;

-- Test referential integrity (should return 0)
SELECT COUNT(*)
FROM dwh.facts f
LEFT JOIN dwh.dimension_users u
  ON f.action_dimension_id_user = u.dimension_user_id
WHERE u.dimension_user_id IS NULL;
```

Check schema search path:
```sql
SHOW search_path;
SET search_path TO dwh, public;
```

Reduce batch size or parallel jobs:

```bash
# Edit etc/etl.properties
ETL_BATCH_SIZE=500
ETL_MAX_PARALLEL_JOBS=2
```

Analyze tables:

```sql
ANALYZE dwh.facts;
ANALYZE dwh.dimension_users;
```

Check missing indexes:

```sql
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'dwh' AND tablename = 'facts';
```

Check for locks:

```sql
SELECT pid, usename, query, state
FROM pg_stat_activity
WHERE datname = 'notes_dwh' AND state != 'idle';
```

Kill blocking queries:

```sql
SELECT pg_terminate_backend(pid);
```
- Always backup before major changes:

  ```bash
  pg_dump -d notes_dwh -n dwh > dwh_backup.sql
  ```

- Test scripts in a development database first

- Use transactions for data modifications:

  ```sql
  BEGIN;
  -- Your changes
  ROLLBACK; -- or COMMIT;
  ```

- Monitor long-running queries:

  ```sql
  SELECT pid,
         now() - pg_stat_activity.query_start AS duration,
         query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - pg_stat_activity.query_start > interval '5 minutes';
  ```

- Document complex queries with comments
- DWH Star Schema ERD - Visual schema diagrams
- Data Dictionary - Complete column definitions
- ETL Enhanced Features - ETL capabilities and features
- bin/dwh/ETL.sh - Main ETL orchestration script
- bin/dwh/ENTRY_POINTS.md - Script entry points
- bin/README.md - Script usage guide
- etc/README.md - Configuration files and ETL properties
- bin/dwh/ENVIRONMENT_VARIABLES.md - Environment variables
- DWH Maintenance Guide - Maintenance procedures
- Troubleshooting Guide - Common SQL and database issues
- Partitioning_Strategy.md - Facts table partitioning
For SQL-related issues:
- Check PostgreSQL logs
- Review query execution plans with `EXPLAIN ANALYZE`
- Check for table bloat via `pg_stat_user_tables`
- Create an issue with the query and error message