| Field | Value |
|---|---|
| title | Cron Setup Guide - OSM Notes Analytics |
| description | Complete guide for setting up automated ETL execution using cron. |
| version | 1.0.0 |
| last_updated | 2026-01-25 |
| author | AngocA |
| tags | |
| audience | |
| project | OSM-Notes-Analytics |
| status | active |
Complete guide for setting up automated ETL execution using cron.
Cron automation allows the ETL process to run automatically at regular intervals without manual intervention. This is essential for keeping the data warehouse up-to-date with new OSM notes.
- Automated Updates: Process new notes every 15 minutes
- Consistent Data: Always have fresh data available
- Minimal Overhead: Only processes new/changed data
- Production Ready: Suitable for production environments
- ETL Script: `bin/dwh/ETL.sh` (already exists)
- Cron Configuration: `etc/cron.example` (template provided)
- Monitoring: Handled by the Monitoring project (sister project)
- Linux/Unix system with cron installed
- Sufficient disk space for logs (/tmp)
- Database connectivity configured
- Write permissions to log directories
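A quick preflight script can check these prerequisites before installing the crontab. This is an illustrative sketch, not part of the project: the 1 GB free-space threshold is an assumption, and the `psql` connectivity check is left commented out until `DBNAME`/`DB_USER` are configured:

```shell
#!/bin/bash
# Preflight check for the prerequisites above (illustrative sketch).
status=0

# cron installed?
if command -v crontab >/dev/null 2>&1; then echo "cron: OK"; else echo "cron: MISSING"; status=1; fi

# write permission on the log directory (/tmp in the default setup)
if [ -w /tmp ]; then echo "log dir writable: OK"; else echo "log dir writable: NO"; status=1; fi

# free disk space in /tmp (in KB; 1048576 KB = 1 GB, an assumed threshold)
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
if [ "${avail_kb}" -gt 1048576 ]; then echo "disk space: OK"; else echo "disk space: LOW"; fi

# database connectivity -- uncomment once DBNAME/DB_USER are configured:
# psql -U notes -d notes -c 'SELECT 1;' >/dev/null && echo "db: OK"

echo "preflight finished with status ${status}"
```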
Copy the example configuration:
```bash
cp etc/cron.example /tmp/osm-notes-cron
```

Edit `/tmp/osm-notes-cron` and update the paths:
```bash
# Example configuration (production path):
*/15 * * * * export CLEAN=false ; export LOG_LEVEL=INFO ; export DBNAME=notes ; export DB_USER=notes ; /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh
```

Production configuration:

- `CLEAN=false`: Keeps temporary files for debugging (use `CLEAN=true` to save disk space)
- `LOG_LEVEL=INFO`: Balanced logging (use `ERROR` for less verbose output, `DEBUG` for debugging)
- `DBNAME=notes`: Your database name
- `DB_USER=notes`: Your database user
Also update:
- `SHELL=/bin/bash` (if using a different shell)
- `HOME=/home/your-username` (your actual home directory)
- Any other paths as needed
Install the cron configuration:
```bash
crontab /tmp/osm-notes-cron
```

Check that the crontab was installed correctly:

```bash
crontab -l
```

You should see output like:

```bash
*/15 * * * * export CLEAN=false ; export LOG_LEVEL=INFO ; export DBNAME=notes ; export DB_USER=notes ; /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh
0 2 15 * * /home/notes/OSM-Notes-Analytics/bin/dwh/exportAndPushCSVToGitHub.sh
0 4 1 * * /home/notes/OSM-Notes-Analytics/bin/dwh/ml_retrain.sh >> /tmp/ml-retrain.log 2>&1
```
If you want ETL + export output in `/var/log` (same layout as the sibling projects `osm-notes-ingestion` and `osm-notes-monitoring`), do this once as root:

1. Create the log directory (same style as the sibling projects):

   ```bash
   sudo mkdir -p /var/log/osm-notes-analytics
   sudo chown notes:maptimebogota /var/log/osm-notes-analytics
   sudo chmod 2775 /var/log/osm-notes-analytics
   sudo touch /var/log/osm-notes-analytics/analytics.log
   sudo chown notes:maptimebogota /var/log/osm-notes-analytics/analytics.log
   sudo chmod 640 /var/log/osm-notes-analytics/analytics.log
   ```

   (If your server has no group `maptimebogota`, use `notes:notes` and adjust the cron user's primary group.)

2. Install logrotate so the log is rotated daily (e.g. `analytics.log-YYYYMMDD.gz`):

   ```bash
   sudo cp /path/to/OSM-Notes-Analytics/etc/logrotate.osm-analytics.conf /etc/logrotate.d/osm-analytics
   sudo chmod 644 /etc/logrotate.d/osm-analytics
   ```

   See `etc/logrotate.osm-analytics.conf` for details. Logrotate is run daily by `cron.daily`.

3. Point cron at the log: in your crontab, use

   ```
   >> /var/log/osm-notes-analytics/analytics.log 2>&1
   ```

   for the ETL + export line. Full example in `etc/cron.example`.
Choose the appropriate frequency based on your needs:
| Frequency | Schedule | Use Case |
|---|---|---|
| Every 15 minutes | `*/15 * * * *` | Production, frequent updates |
| Every hour | `0 * * * *` | Standard updates |
| Every 3 hours | `0 */3 * * *` | Moderate updates |
| Daily | `0 2 * * *` | Low-frequency updates |
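As a sanity check on the table above, the run counts per day follow directly from the five cron fields (minute, hour, day-of-month, month, day-of-week):

```shell
# Runs per day implied by each schedule in the table above
echo "*/15 * * * * -> $((24 * 60 / 15)) runs/day"  # every 15 minutes -> 96
echo "0 * * * *    -> 24 runs/day"                 # hourly
echo "0 */3 * * *  -> $((24 / 3)) runs/day"        # every 3 hours -> 8
echo "0 2 * * *    -> 1 run/day"                   # daily at 2 AM
```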
The ETL automatically creates lock files to prevent concurrent execution:
- Lock exists: New job skips execution
- Lock older than 4 hours: Considered stale, job proceeds
- Lock created: Current execution timestamp
This ensures that:
- If an ETL run takes longer than 15 minutes, the next scheduled job skips rather than running concurrently
- No duplicate executions
- System remains stable
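The lock behavior described above can be sketched in a few lines of shell. This is illustrative only; the lock path `/tmp/ETL.lock` and the messages are assumptions, not the actual `ETL.sh` implementation:

```shell
#!/bin/bash
# Sketch of the stale-lock policy (not the real ETL.sh code).
LOCK=/tmp/ETL.lock      # hypothetical lock path
MAX_AGE_MIN=240         # 4 hours = stale threshold

if [ -f "${LOCK}" ]; then
  # `find -mmin +240` matches files modified more than 240 minutes ago
  if [ -n "$(find "${LOCK}" -mmin +${MAX_AGE_MIN} 2>/dev/null)" ]; then
    echo "Stale lock; removing and proceeding"
    rm -f "${LOCK}"
  else
    echo "Fresh lock exists; skipping this run"
    exit 0
  fi
fi

date +%s > "${LOCK}"    # record the current execution timestamp
echo "ETL work would run here"
# (the real script removes the lock once the run finishes)
```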
ETL logs are stored in `/tmp/ETL_XXXXXX/` directories.
Automatic cleanup is configured in cron:

```bash
30 3 * * 0 find /tmp/ETL_* -type d -mtime +7 -exec rm -rf {} \; 2>/dev/null || true
```

This removes logs older than 7 days every Sunday at 3:30 AM.
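The `-mtime +7` test used in that cleanup job matches entries whose modification time is more than 7 full days in the past (i.e. 8 days or older). A quick demonstration:

```shell
# Show how `find -mtime +7` selects old directories, as in the cleanup job.
mkdir -p /tmp/mtime-demo/old /tmp/mtime-demo/new
touch -d "10 days ago" /tmp/mtime-demo/old   # backdate one directory (GNU touch)
find /tmp/mtime-demo -mindepth 1 -type d -mtime +7
# → /tmp/mtime-demo/old  (the freshly created "new" directory is not matched)
```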
Monitoring is handled by the Monitoring project (sister project located at the same filesystem level as this project). Configure monitoring tasks in that project instead.
The Monitoring project provides:
- Process status monitoring (running/not running)
- Last execution log tracking
- Database connection status checks
- Data warehouse statistics
- Disk space usage monitoring
- Email alerts for failures
Check cron service:
```bash
# Check if cron is running
systemctl status cron

# Start cron if needed
sudo systemctl start cron
```

Check cron logs:
```bash
# View cron execution logs
sudo tail -f /var/log/syslog | grep CRON

# On some systems:
sudo tail -f /var/log/cron
```

Check permissions:
```bash
# Script must be executable
chmod +x bin/dwh/ETL.sh

# All scripts should be readable
ls -la bin/dwh/*.sh
```

Check PATH: Cron jobs have a limited PATH. Use full paths:
```bash
# Good
/home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh

# Bad
bin/dwh/ETL.sh
```

Check environment: Cron doesn't inherit your shell environment. Add to cron:
```bash
# Load environment
. /home/your-username/.bashrc
```

Or explicitly set variables in cron:
```bash
DBHOST=localhost
DBPORT=5432
DBUSER=postgres

# Database configuration (recommended: use DBNAME_INGESTION and DBNAME_DWH)
DBNAME_INGESTION=notes_dwh
DBNAME_DWH=notes_dwh

# Legacy/compatibility (use when both databases are the same):
# DBNAME=notes_dwh
```

Stale lock file: If the ETL crashes, the lock file may remain:
```bash
# Find lock files
find /tmp/ETL_* -name "ETL.lock"

# Check age
ls -lh /tmp/ETL_*/ETL.lock

# Remove if stale (>4 hours old)
find /tmp/ETL_* -name "ETL.lock" -mmin +240 -delete
```

Test cron jobs manually first:
```bash
# Run ETL manually
/home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh

# Check output
tail -f /tmp/ETL_*/ETL.log
```

Begin with less frequent execution:
```bash
# Start with hourly updates
0 * * * * export CLEAN=false ; export LOG_LEVEL=INFO ; export DBNAME_INGESTION=notes ; export DBNAME_DWH=notes_dwh ; export DB_USER=notes ; /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh
```

Then increase frequency based on performance.
Watch log directory size:
```bash
# Check disk usage
du -sh /tmp/ETL_*

# Set up alert in cron
0 4 * * * du -sh /tmp/ETL_* | mail -s "ETL Log Size" notes@osm.lat
```

Add these tasks to cron:
```bash
# Weekly VACUUM ANALYZE
0 3 * * 0 psql -U notes -d notes_dwh -c "VACUUM ANALYZE dwh.facts"

# Weekly log cleanup
30 3 * * 0 find /tmp/ETL_* -type d -mtime +7 -exec rm -rf {} \;
```

Implement backups:
```bash
# Daily DWH backup
0 1 * * * pg_dump -U notes -d notes_dwh -n dwh > /backups/dwh_$(date +\%Y\%m\%d).sql

# Keep last 30 days
0 2 * * * find /backups/dwh_*.sql -mtime +30 -delete
```

Monitor for failures:
```bash
# Check for errors in logs
find /tmp/ETL_* -name "ETL.log" -exec grep -i error {} \; | tail -20

# Set up daily error report (or configure in Monitoring project)
0 7 * * * find /tmp/ETL_* -name "ETL.log" -exec grep -i error {} \; | tail -50 | mail -s "ETL Errors" notes@osm.lat
```

You can run different ETL operations at different times:
```bash
# Incremental updates every 15 minutes (production)
*/15 * * * * export CLEAN=false ; export LOG_LEVEL=INFO ; export DBNAME=notes ; export DB_USER=notes ; /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh

# Export CSV monthly (15th day at 2 AM)
0 2 15 * * /home/notes/OSM-Notes-Analytics/bin/dwh/exportAndPushCSVToGitHub.sh

# ML training/retraining monthly (1st day at 4 AM)
0 4 1 * * /home/notes/OSM-Notes-Analytics/bin/dwh/ml_retrain.sh >> /tmp/ml-retrain.log 2>&1

# Full reload every Sunday at 2 AM (if needed)
0 2 * * 0 /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh
```

ETL logs are stored in `/tmp/ETL_XXXXXX/` directories. Automatic cleanup is configured in cron (see `etc/cron.example`).
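One crontab gotcha worth highlighting from the backup line earlier: cron treats an unescaped `%` as a newline (everything after the first `%` becomes the command's stdin), which is why the `date` format is written as `\%Y\%m\%d` in the crontab. In an interactive shell the same filename is built without backslashes:

```shell
# Outside a crontab, % needs no escaping; this builds the same backup filename.
backup_name="dwh_$(date +%Y%m%d).sql"
echo "${backup_name}"
```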
Run ETL only during business hours:
```bash
# Only run 8 AM - 8 PM
*/15 8-20 * * * export CLEAN=false ; export LOG_LEVEL=INFO ; export DBNAME_INGESTION=notes ; export DBNAME_DWH=notes_dwh ; export DB_USER=notes ; /home/notes/OSM-Notes-Analytics/bin/dwh/ETL.sh
```

- Deployment Diagram: Complete deployment architecture and operational workflows
- Troubleshooting Guide: Common cron and deployment issues
- DWH Maintenance Guide: Database maintenance procedures
With cron automation configured, your OSM Notes Analytics data warehouse will stay current with minimal manual intervention. Regular monitoring and maintenance ensure optimal performance.
For complete deployment documentation including infrastructure, scheduling, and disaster recovery, see Deployment Diagram.