This directory contains the executable scripts for the OSM-Notes-Analytics project, primarily focused on ETL (Extract, Transform, Load) processes and datamart generation.
The bin/ directory houses the main operational scripts that transform raw OSM notes data into a
comprehensive data warehouse with pre-computed analytics datamarts.
New to the project? Start here:
- Entry Points Documentation - Which scripts can be called directly
- Environment Variables - Configuration via environment variables
- DWH README - Detailed DWH documentation
Key Entry Points:
- `bin/dwh/ETL.sh` - Main ETL process (creates/updates data warehouse)
- `bin/dwh/datamartCountries/datamartCountries.sh` - Country datamart
- `bin/dwh/datamartUsers/datamartUsers.sh` - User datamart
- `bin/dwh/profile.sh` - Profile generator
- `bin/dwh/exportDatamartsToJSON.sh` - Export to JSON
- `bin/dwh/cleanupDWH.sh` - Cleanup script
See Entry Points Documentation for complete details.
```
bin/
└── dwh/                              # Data warehouse scripts
    ├── ETL.sh                        # Main ETL orchestration script
    ├── profile.sh                    # Profile generator for users and countries
    ├── cleanupDWH.sh                 # Cleanup DWH objects and temp files
    ├── exportDatamartsToJSON.sh      # Export datamarts to JSON
    ├── exportAndPushJSONToGitHub.sh  # Export JSON and deploy to GitHub Pages
    ├── README.md                     # DWH documentation
    ├── datamartCountries/            # Country datamart scripts
    │   └── datamartCountries.sh
    ├── datamartGlobal/               # Global datamart scripts
    │   └── datamartGlobal.sh
    └── datamartUsers/                # User datamart scripts
        └── datamartUsers.sh
```
Location: bin/dwh/ETL.sh
Purpose: Orchestrates the complete ETL process to populate the data warehouse from base tables.
Features:
- Creates star schema dimensions and fact tables
- Supports initial load and incremental updates
- Parallel processing by year (2013-present)
- Recovery and resume capabilities
- Resource monitoring and validation
- Comprehensive logging
Usage:

```bash
# ETL execution (auto-detects first run vs incremental)
./bin/dwh/ETL.sh

# Show help
./bin/dwh/ETL.sh --help
```

Auto-detection:
- First execution: Automatically detects that the DWH doesn't exist or is empty, creates all DWH objects, and performs the initial load
- Subsequent runs: Automatically detects existing data and processes only incremental updates
- Perfect for cron: The same command works for both scenarios
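The detection described above can be approximated from the command line — a minimal sketch, assuming (as this README suggests elsewhere) that an empty or missing `dwh.facts` table indicates a first run:

```bash
# Sketch of the auto-detection idea (assumed logic, not ETL.sh internals):
# an empty or missing dwh.facts table means "first run".
DBNAME="${DBNAME:-notes_dwh}"

# Count facts; a failed query (no DWH yet, or no psql) counts as zero.
FACTS=$(psql -d "${DBNAME}" -Atc "SELECT COUNT(*) FROM dwh.facts;" 2> /dev/null || echo 0)

MODE="initial"
[[ "${FACTS}" -gt 0 ]] && MODE="incremental"
echo "Detected mode: ${MODE}"
```

This is only a diagnostic aid; `ETL.sh` performs its own detection internally.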
Configuration:
- Database connection: `etc/properties.sh`
- ETL settings: `etc/etl.properties`
Logging:

```bash
# Follow ETL progress in real-time
tail -40f "$(ls -1rtd /tmp/ETL_* | tail -1)/ETL.log"
```

Performance:
- Initial load: ~30 hours (with parallel processing)
- Incremental update: 5-15 minutes (depends on new data volume)
- Resource requirements: 4 GB+ RAM; multi-core CPU recommended
Location: bin/dwh/profile.sh
Purpose: Generates detailed profiles for users and countries based on datamart data.
Usage:

```bash
# User profile
./bin/dwh/profile.sh --user AngocA

# Country profile (English name)
./bin/dwh/profile.sh --country Colombia
./bin/dwh/profile.sh --country "United States of America"

# Country profile (Spanish name)
./bin/dwh/profile.sh --pais Colombia
./bin/dwh/profile.sh --pais "Estados Unidos"

# General notes statistics
./bin/dwh/profile.sh
```

Output includes:
- Historical activity timeline
- Geographic distribution
- Working hours heatmap
- Rankings and leaderboards
- Activity patterns
- First and most recent actions
Prerequisites:
- Datamarts must be populated first
- Run `datamartCountries.sh` and `datamartUsers.sh` before generating profiles
Location: bin/dwh/datamartCountries/datamartCountries.sh
Purpose: Populates the country-level datamart with pre-computed analytics.
Usage:

```bash
./bin/dwh/datamartCountries/datamartCountries.sh
```

Features:
- Aggregates note statistics by country
- Computes yearly historical data (2013-present)
- Generates user rankings per country
- Calculates working hours patterns
- Tracks first and latest activities
Execution time: ~20 minutes
Prerequisites:
- ETL must be completed
- DWH fact and dimension tables must exist
Output:
- Populates `dwh.datamartCountries` table
- One row per country with comprehensive metrics
Location: bin/dwh/cleanupDWH.sh
Purpose: Removes data warehouse objects from the database and cleans up temporary files. Uses database configuration from `etc/properties.sh`.

⚠️ Destructive operations cause permanent data loss; run with `--dry-run` first to see what will be removed.
Usage:

```bash
# Safe operations (no confirmation required):
./bin/dwh/cleanupDWH.sh --remove-temp-files # Remove only temporary files
./bin/dwh/cleanupDWH.sh --dry-run # Show what would be done (safe)

# Destructive operations (require confirmation):
./bin/dwh/cleanupDWH.sh # Full cleanup - REMOVES ALL DATA!
./bin/dwh/cleanupDWH.sh --remove-all-data # Remove DWH schema and data only

# Help:
./bin/dwh/cleanupDWH.sh --help # Show detailed help
```

What it removes:
DWH Objects (`--remove-all-data` or default behavior):
- Staging schema and all objects
- Datamart tables (countries and users)
- DWH schema with all dimensions and facts
- All functions, procedures, and triggers
⚠️ PERMANENT DATA LOSS - requires confirmation
Temporary Files (`--remove-temp-files` or default behavior):
- `/tmp/ETL_*` directories
- `/tmp/datamartCountries_*` directories
- `/tmp/datamartUsers_*` directories
- `/tmp/profile_*` directories
- `/tmp/cleanupDWH_*` directories

✅ Safe operation - no confirmation required
When to use:
- Development/Testing: Use `--remove-temp-files` to clean temporary files
- Complete Reset: Use default behavior to remove everything (with confirmation)
- DWH Only: Use `--remove-all-data` to remove only database objects
- Safety First: Always use `--dry-run` to preview operations
Prerequisites:
- Database configured in `etc/properties.sh`
- User must have DROP privileges on the target database
- PostgreSQL client tools installed (`psql`)
- Script must be run from the project root directory
Use cases:
- Development: Clean temporary files with `--remove-temp-files`
- Testing: Reset the environment with `--dry-run` first, then full cleanup
- Troubleshooting: Remove corrupted DWH objects with `--remove-all-data`
- Clean restart: Remove all objects before running `ETL.sh`
- Maintenance: Regular cleanup of temporary files
Location: bin/dwh/datamartUsers/datamartUsers.sh
Purpose: Populates the user-level datamart with pre-computed analytics.
Usage:

```bash
./bin/dwh/datamartUsers/datamartUsers.sh
```

Features:
- Aggregates note statistics by user
- Processes incrementally (`MAX_USERS_PER_CYCLE` users per run, default 4000; fits in a 15-minute window on typical production hardware)
- Computes yearly historical data
- Generates country rankings per user
- Tracks contribution patterns
- Classifies contributor types
Execution time:
- Per run: ~3–5 minutes per 1,000 users on typical production hardware; ~4,000 users fit in a 15-minute ETL window
- Full initial load: incremental over multiple ETL cycles (use catch-up mode for large backlogs)
Prerequisites:
- ETL must be completed
- DWH fact and dimension tables must exist
Output:
- Populates `dwh.datamartUsers` table
- One row per active user with comprehensive metrics
Note: This script is designed to run incrementally to avoid overwhelming the database. Schedule it to run regularly until all users are processed.
Location: bin/dwh/datamartGlobal/datamartGlobal.sh
Purpose: Populates the global-level datamart with aggregated statistics.
Usage:

```bash
./bin/dwh/datamartGlobal/datamartGlobal.sh
```

Features:
- Aggregates global note statistics
- Computes worldwide metrics
- Provides system-wide analytics
Prerequisites:
- ETL must be completed
- DWH fact and dimension tables must exist
Output:
- Populates `dwh.datamartGlobal` table
- Global statistics and aggregated metrics
Note: This script is automatically called by ETL.sh after processing. Manual execution is
usually not needed.
Location: bin/dwh/exportDatamartsToJSON.sh
Purpose: Exports datamart data to JSON files for web viewer consumption.
Usage:

```bash
./bin/dwh/exportDatamartsToJSON.sh
```

Features:
- Exports user datamarts to individual JSON files
- Exports country datamarts to individual JSON files
- Creates index files for efficient lookup
- Generates metadata file
- Atomic writes: Files generated in temporary directory, validated, then moved atomically
- Schema validation: Each JSON file validated against schemas before export
- Fail-safe: On validation failure, keeps existing files and exits with error
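The atomic-write pattern described above can be sketched as follows. The paths, sample payload, and file names are illustrative (not the script's actual internals), and the `jq` validation step is skipped when `jq` is absent:

```bash
#!/bin/bash
# Sketch: write JSON into a temp dir, validate, then mv into place so
# consumers never observe partially written files. Paths illustrative;
# the real script targets ./output/json.
set -euo pipefail

OUT_DIR=$(mktemp -d)/json
TMP_DIR=$(mktemp -d)

# 1. Generate into the temporary directory.
echo '{"user_id": 123, "username": "example"}' > "${TMP_DIR}/123.json"

# 2. Validate before publishing (jq exits non-zero on invalid JSON).
if command -v jq > /dev/null && ! jq empty "${TMP_DIR}/123.json"; then
 echo "Validation failed: keeping existing files" >&2
 rm -rf "${TMP_DIR}"
 exit 1
fi

# 3. Publish: mv within one filesystem is atomic per file.
mkdir -p "${OUT_DIR}/users"
mv "${TMP_DIR}/123.json" "${OUT_DIR}/users/123.json"
rm -rf "${TMP_DIR}"
echo "Published ${OUT_DIR}/users/123.json"
```

Because the validated file is moved (not copied) into the output directory, readers either see the old file or the complete new one, never a half-written file.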
Output:
Creates JSON files in `./output/json/`:
- Individual files per user: `users/{user_id}.json`
- Individual files per country: `countries/{country_id}.json`
- Index files: `indexes/users.json`, `indexes/countries.json`
- Metadata: `metadata.json`
Prerequisites:
- Datamarts must be populated
- `jq` and `ajv-cli` recommended for validation
Example:

```bash
# Export all datamarts to JSON
./bin/dwh/exportDatamartsToJSON.sh

# Verify export
ls -lh ./output/json/users/ | head -10
ls -lh ./output/json/countries/ | head -10
```

See also: JSON Export Documentation
Location: bin/dwh/exportAndPushJSONToGitHub.sh
Purpose: Exports JSON files and automatically deploys them to GitHub Pages using intelligent incremental mode.
Usage:

```bash
./bin/dwh/exportAndPushJSONToGitHub.sh
```

Features:
- Intelligent incremental export: Exports countries one by one and pushes immediately
- Automatic detection: Identifies missing, outdated (default: 30 days), or not exported countries
- Cleanup: Removes countries from GitHub that no longer exist in local database
- Documentation: Auto-generates README.md with alphabetical list of countries
- Resilient: Continues processing even if one country fails
- Progress tracking: Shows which countries are being processed
- Schema validation: Validates each JSON file before pushing
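The "outdated" detection above can be approximated with `find -mtime` — a sketch under assumptions (the script's actual mechanism may differ; the directory and files here are simulated):

```bash
#!/bin/bash
# Sketch: flag exported country files older than MAX_AGE_DAYS as
# candidates for regeneration. Uses GNU touch/find.
set -euo pipefail

MAX_AGE_DAYS="${MAX_AGE_DAYS:-30}"
EXPORT_DIR=$(mktemp -d) # illustrative; the real script scans the OSM-Notes-Data clone

# Simulate one fresh and one outdated export.
touch "${EXPORT_DIR}/fresh.json"
touch -d "40 days ago" "${EXPORT_DIR}/stale.json"

# -mtime +N matches files last modified more than N days ago.
STALE=$(find "${EXPORT_DIR}" -name "*.json" -mtime +"${MAX_AGE_DAYS}")
echo "Outdated exports: ${STALE}"
```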
Prerequisites:
- Datamarts must be populated
- Git repository configured (`OSM-Notes-Data` cloned to `~/OSM-Notes-Data` or `~/github/OSM-Notes-Data`)
- GitHub Pages enabled
- Git credentials configured
Environment variables:
- `MAX_AGE_DAYS`: Maximum age in days before regeneration (default: 30, matches monthly cron)
- `COUNTRIES_PER_BATCH`: Number of countries to process before a break (default: 10)
- `DBNAME_DWH`: Database name (default: from `etc/properties.sh`)
Example:

```bash
# Default: monthly refresh (30 days)
./bin/dwh/exportAndPushJSONToGitHub.sh

# Custom age threshold for testing
MAX_AGE_DAYS=7 ./bin/dwh/exportAndPushJSONToGitHub.sh
```

Note: This script is typically scheduled to run monthly via cron after datamart updates.
```bash
# 1. Configure database connection
cp etc/properties.sh.example etc/properties.sh
nano etc/properties.sh

# 2. Configure ETL settings (optional, defaults work for most cases)
cp etc/etl.properties.example etc/etl.properties
nano etc/etl.properties

# 3. Verify base tables exist (from OSM-Notes-Ingestion)
psql -d notes_dwh -c "SELECT COUNT(*) FROM notes;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM note_comments;"

# 4. Run initial ETL (creates DWH, populates facts/dimensions, updates datamarts)
./bin/dwh/ETL.sh
# Wait ~30 hours for completion
# Note: ETL.sh automatically updates datamarts, so steps 5-6 are optional

# 5. Verify DWH creation
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.facts;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartcountries;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartusers;"

# 6. (Optional) Manually update datamarts if needed
./bin/dwh/datamartCountries/datamartCountries.sh
./bin/dwh/datamartUsers/datamartUsers.sh

# 7. Generate a test profile
./bin/dwh/profile.sh --user AngocA

# 8. Export to JSON (optional, for web viewer)
./bin/dwh/exportDatamartsToJSON.sh
```

Crontab example (add with: `crontab -e`):
```bash
# Incremental ETL every 15 minutes (automatically updates datamarts)
*/15 * * * * cd ~/OSM-Notes-Analytics && ./bin/dwh/ETL.sh >> /tmp/osm-analytics-etl.log 2>&1

# Export to JSON and push to GitHub Pages (after datamarts update)
45 * * * * cd ~/OSM-Notes-Analytics && ./bin/dwh/exportAndPushJSONToGitHub.sh >> /tmp/osm-analytics-export.log 2>&1

# Or: ETL + export together, log to /var/log (same layout as osm-notes-ingestion, osm-notes-monitoring).
# One-time setup: see etc/cron.example and etc/logrotate.osm-analytics.conf.
# */15 * * * * cd ~/OSM-Notes-Analytics && ./bin/dwh/ETL.sh && ./bin/dwh/exportAndPushJSONToGitHub.sh >> /var/log/osm-notes-analytics/analytics.log 2>&1

# Optional: Manual datamart updates (usually not needed, ETL does this automatically)
# 0 2 * * * cd ~/OSM-Notes-Analytics && ./bin/dwh/datamartCountries/datamartCountries.sh >> /tmp/osm-analytics-datamart.log 2>&1
# 30 2 * * * cd ~/OSM-Notes-Analytics && ./bin/dwh/datamartUsers/datamartUsers.sh >> /tmp/osm-analytics-datamart.log 2>&1
```

Note: `ETL.sh` automatically updates all datamarts, so separate datamart cron jobs are usually not needed.
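If you want extra protection against overlapping cron cycles, a `flock(1)` wrapper is a common pattern. This is optional and not something the scripts require (`ETL.sh` already keeps its own lock in `/tmp`); the lock-file path here is illustrative:

```bash
#!/bin/bash
# Sketch: skip a cron cycle if the previous one is still running,
# using flock(1) from util-linux on a dedicated lock file.
LOCK_FILE="/tmp/osm-analytics-cron.lock"

(
 # -n: fail immediately instead of waiting if the lock is held.
 if ! flock -n 9; then
  echo "Previous run still active; skipping this cycle"
  exit 0
 fi
 echo "Lock acquired"
 # The real cron body would go here, e.g.:
 # cd ~/OSM-Notes-Analytics && ./bin/dwh/ETL.sh
) 9> "${LOCK_FILE}"
```

With this wrapper, a slow ETL cycle causes the next cron invocation to be skipped rather than stacked on top of the running one.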
```bash
# After datamarts are populated, generate profiles:

# User profile
./bin/dwh/profile.sh --user AngocA

# Country profile
./bin/dwh/profile.sh --country Colombia

# General statistics
./bin/dwh/profile.sh
```

Create `etc/properties.sh` from the example:

```bash
cp etc/properties.sh.example etc/properties.sh
nano etc/properties.sh
```

Key settings:
```bash
DBNAME="notes_dwh" # Database name
DB_USER="notes" # Database user
MAX_THREADS="4" # Parallel processing threads (auto-calculated from CPU cores)
CLEAN="true" # Clean temporary files after processing
```

Override via environment:

```bash
export DBNAME=osm_notes_analytics_test
export DB_USER=postgres
./bin/dwh/ETL.sh
```

Create `etc/etl.properties` from the example (optional):
```bash
cp etc/etl.properties.example etc/etl.properties
nano etc/etl.properties
```

Key settings:

```bash
ETL_BATCH_SIZE=1000 # Records per batch
ETL_PARALLEL_ENABLED=true # Enable parallel processing
ETL_MAX_PARALLEL_JOBS=4 # Max parallel jobs
ETL_RECOVERY_ENABLED=true # Enable recovery
ETL_VALIDATE_INTEGRITY=true # Validate data integrity
MAX_MEMORY_USAGE=80 # Memory usage threshold (%)
MAX_DISK_USAGE=90 # Disk usage threshold (%)
ETL_TIMEOUT=7200 # Execution timeout (seconds)
```

Override via environment:

```bash
export ETL_BATCH_SIZE=5000
export ETL_MAX_PARALLEL_JOBS=8
./bin/dwh/ETL.sh
```

See also: Environment Variables Documentation for a complete variable reference.
All scripts create detailed logs in `/tmp/`:

```bash
# ETL logs (follow latest)
tail -40f "$(ls -1rtd /tmp/ETL_* | tail -1)/ETL.log"

# Country datamart logs
tail -f "$(ls -1rtd /tmp/datamartCountries_* | tail -1)/datamartCountries.log"

# User datamart logs
tail -f "$(ls -1rtd /tmp/datamartUsers_* | tail -1)/datamartUsers.log"

# Profile logs
tail -f "$(ls -1rtd /tmp/profile_* | tail -1)/profile.log"

# Export logs
tail -f "$(ls -1rtd /tmp/exportDatamartsToJSON_* | tail -1)/exportDatamartsToJSON.log"
```

Set log level:

```bash
# Debug mode (verbose)
export LOG_LEVEL=DEBUG
./bin/dwh/ETL.sh

# Info mode (moderate)
export LOG_LEVEL=INFO
./bin/dwh/ETL.sh

# Error mode (minimal, default)
export LOG_LEVEL=ERROR
./bin/dwh/ETL.sh
```

Keep temporary files for inspection:

```bash
export CLEAN=false
export LOG_LEVEL=DEBUG
./bin/dwh/ETL.sh
# Files will remain in /tmp/ETL_*/
# Inspect logs, CSV files, etc.
```

If ETL fails:
- Check logs in `/tmp/ETL_*/ETL.log`
- Review error messages
- Fix the underlying issue
- Restart: `./bin/dwh/ETL.sh`
Scripts monitor system resources:
- Memory usage (default: alert at 80%)
- Disk usage (default: alert at 90%)
- Execution timeout (default: 2 hours)
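A sketch of how such threshold checks can be done on Linux. This is an assumed approach for illustration, not the scripts' actual internals; the `MAX_MEMORY_USAGE`/`MAX_DISK_USAGE` names come from `etc/etl.properties`:

```bash
#!/bin/bash
# Sketch: compare current memory/disk usage against configured
# thresholds (Linux-only: free(1) and GNU df).
MAX_MEMORY_USAGE="${MAX_MEMORY_USAGE:-80}"
MAX_DISK_USAGE="${MAX_DISK_USAGE:-90}"

# Used memory as an integer percentage of total.
MEM_PCT=$(free | awk '/^Mem:/ { printf "%d", $3 / $2 * 100 }')

# Disk usage of the filesystem holding /tmp, digits only.
DISK_PCT=$(df --output=pcent /tmp | tail -1 | tr -dc '0-9')

if [ "${MEM_PCT}" -ge "${MAX_MEMORY_USAGE}" ]; then
 echo "ALERT: memory at ${MEM_PCT}% (threshold ${MAX_MEMORY_USAGE}%)"
fi
if [ "${DISK_PCT}" -ge "${MAX_DISK_USAGE}" ]; then
 echo "ALERT: disk at ${DISK_PCT}% (threshold ${MAX_DISK_USAGE}%)"
fi
echo "memory=${MEM_PCT}% disk=${DISK_PCT}%"
```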
- PostgreSQL 12+
- Bash 4.0+
- Base tables populated by the OSM-Notes-Ingestion system
- `jq` for JSON parsing (for recovery features)
- `parallel` for enhanced parallel processing
These scripts integrate with:
- OSM-Notes-Ingestion (upstream)
  - Reads base tables: `notes`, `note_comments`, `users`, `countries`
  - Requires the ingestion system to run first
- OSM-Notes-Viewer (sister project, downstream)
  - Web application that consumes JSON exports
  - Interactive dashboards and visualizations
  - User and country profiles
  - Reads JSON files exported by this analytics system
```bash
# Edit etc/etl.properties
ETL_MAX_PARALLEL_JOBS=8 # Increase for more cores
```

```bash
# Edit etc/etl.properties
ETL_BATCH_SIZE=5000 # Increase for better throughput
```

```bash
# After initial load
psql -d notes_dwh -c "VACUUM ANALYZE dwh.facts;"
psql -d notes_dwh -c "REINDEX TABLE dwh.facts;"
```

Problem: ETL cannot find base tables populated by OSM-Notes-Ingestion.
Solution:

```bash
# Verify base tables exist
psql -d notes_dwh -c "SELECT COUNT(*) FROM notes;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM note_comments;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM users;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM countries;"

# If tables are empty or don't exist, run OSM-Notes-Ingestion first
# See: https://github.com/OSM-Notes/OSM-Notes-Ingestion
```

Problem: DWH schema not created yet.
Solution:

```bash
# Run initial ETL to create the schema
./bin/dwh/ETL.sh
```

Problem: Another instance is running or a previous execution crashed.
Solution:

```bash
# Check if the process is actually running
ps aux | grep ETL.sh

# If no process is found, remove the lock file
rm /tmp/ETL_*.lock

# Or remove all lock files (use with caution)
find /tmp -name "*ETL*.lock" -delete
```

Problem: System running out of memory during processing.
Solution:

```bash
# Reduce parallel jobs
export ETL_MAX_PARALLEL_JOBS=2

# Reduce batch size
export ETL_BATCH_SIZE=500

# Disable parallel processing
export ETL_PARALLEL_ENABLED=false

# Or edit etc/etl.properties
nano etc/etl.properties
# Set: ETL_MAX_PARALLEL_JOBS=2
# Set: ETL_BATCH_SIZE=500
```

Problem: ETL process is slow.
Solution:

```bash
# Increase parallel jobs (if you have more CPU cores)
export ETL_MAX_PARALLEL_JOBS=8

# Increase batch size (if you have more memory)
export ETL_BATCH_SIZE=5000

# Check if base tables have indexes
psql -d notes_dwh -c "\d notes"
psql -d notes_dwh -c "\d note_comments"

# Run VACUUM ANALYZE on base tables
psql -d notes_dwh -c "VACUUM ANALYZE notes;"
psql -d notes_dwh -c "VACUUM ANALYZE note_comments;"
```

Problem: Datamart tables are empty or incomplete.
Solution:

```bash
# Check datamart counts
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartcountries;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartusers;"

# Re-run datamart scripts
./bin/dwh/datamartCountries/datamartCountries.sh
./bin/dwh/datamartUsers/datamartUsers.sh

# For the users datamart, run repeatedly (it processes MAX_USERS_PER_CYCLE
# users per run, default 4000); stop with Ctrl-C once a run reports
# "0 users processed"
while true; do
 ./bin/dwh/datamartUsers/datamartUsers.sh
 sleep 5
done
```

Problem: JSON export produces no files or empty files.
Solution:

```bash
# Verify datamarts have data
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartusers;"
psql -d notes_dwh -c "SELECT COUNT(*) FROM dwh.datamartcountries;"

# If counts are 0, re-run datamart population
./bin/dwh/datamartCountries/datamartCountries.sh
./bin/dwh/datamartUsers/datamartUsers.sh

# Check the export output directory
ls -lh ./output/json/

# Run the export with debug logging
export LOG_LEVEL=DEBUG
./bin/dwh/exportDatamartsToJSON.sh
```

Problem: Cannot connect to the database.
Solution:

```bash
# Test connection
psql -d notes_dwh -c "SELECT version();"

# Verify database name in properties
grep DBNAME etc/properties.sh

# Check PostgreSQL is running
sudo systemctl status postgresql

# Verify user permissions
psql -d notes_dwh -c "SELECT current_user;"
```

Problem: Profile script cannot find user or country.
Solution:

```bash
# Check if the user exists in the datamart
psql -d notes_dwh -c "SELECT username FROM dwh.datamartusers WHERE username = 'AngocA';"

# Check if the country exists
psql -d notes_dwh -c "SELECT country_name_en FROM dwh.datamartcountries WHERE country_name_en = 'Colombia';"

# Use the exact name as stored in the database
# For countries, try both English and Spanish names
./bin/dwh/profile.sh --country Colombia
./bin/dwh/profile.sh --pais Colombia
```

- Follow naming convention: `descriptiveName.sh`
- Include a header with purpose, author, version
- Source common libraries from `lib/osm-common/` (OSM-Notes-Common submodule)
- Add error handling and logging
- Include help text (`--help` flag)
- Test with shellcheck: `shellcheck -x -o all script.sh`
- Format with shfmt: `shfmt -w -i 1 -sr -bn script.sh`
- Use descriptive function names with a `__` prefix
- Add comments for complex logic
- Include error codes in help text
- Use strict error handling (`set -euo pipefail`)
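A minimal skeleton following these conventions might look like this. It is illustrative only: the project's actual header format and common-library entry points may differ, and the sourced file name is an assumption:

```bash
#!/bin/bash
# Purpose: Illustrative skeleton following the conventions above.
# Author: Your Name
# Version: 2025-01-01
set -euo pipefail

# Source shared helpers from the OSM-Notes-Common submodule, e.g.:
# source lib/osm-common/commonFunctions.sh # (file name assumed)

# Functions use the double-underscore prefix convention.
function __show_help() {
 cat << EOF
Usage: $(basename "${0}") [--help]
Exit codes: 0 = success, 1 = error
EOF
}

function __main() {
 if [[ "${1:-}" == "--help" ]]; then
  __show_help
  return 0
 fi
 echo "doing work"
}

__main "$@"
```

The one-space indentation matches the `shfmt -i 1` setting listed above.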
- Entry Points - Which scripts can be called directly
- Environment Variables - Complete environment variable reference
- DWH README - Detailed DWH documentation
- Main README - Project overview and quick start
- ETL Enhanced Features - Advanced ETL capabilities
- DWH Star Schema ERD - Entity-relationship diagram
- Data Dictionary - Complete schema documentation
- DWH Maintenance Guide - Maintenance procedures
- etc/README.md - Configuration files and setup
- sql/README.md - SQL scripts documentation
- Troubleshooting Guide - Common issues and solutions
- Testing Guide - Testing documentation
- Contributing Guide - Development standards
- CI/CD Guide - CI/CD workflows
- OSM-Notes-Ingestion - Data ingestion system (upstream)
- OSM-Notes-Viewer - Web application (sister project)
- OSM-Notes-Common - Shared libraries (Git submodule)
For issues with scripts:
- Check log files in
/tmp/ - Review error messages
- Validate configuration files
- Create an issue with logs and error details