A semantic layer built on DuckDB that provides a clean, business-friendly interface to the HRMS SQL Server database.
This semantic layer:
- Connects to the SQL Server database `hrmsdb` at `192.168.20.203:1433` using pymssql
- Imports data into DuckDB for fast local queries
- Provides business-friendly views and metrics
- Works on ARM Mac (Apple Silicon) without ODBC dependencies
- Activity logs filtered to the last 30 days to keep the database size manageable
```
SQL Server (hrmsdb)
        ↓  (data import via pymssql)
DuckDB Semantic Layer
├── raw.*      - Imported tables from SQL Server
├── staging.*  - Cleaned, typed data views
├── business.* - Denormalized, user-friendly views
└── metrics.*  - Aggregated KPIs and metrics
```
The semantic layer uses a hybrid approach with 3-tier storage:
- Live Queries (MSSQL Extension): Small, frequently-changing tables queried directly from SQL Server
- DuckDB Tables: Medium-sized tables imported into DuckDB for fast analytics
- Parquet Files: Large tables exported to Parquet with DuckDB views for optimal storage
Tier thresholds by table size:
| Size | Storage | Benefits |
|---|---|---|
| < 10K rows | Live queries (MSSQL extension) | Real-time data, no storage |
| 10K - 100K rows | DuckDB tables | Fast queries, good compression |
| > 100K rows | Parquet files | Excellent compression (~60% reduction), query via views |
Parquet Features:
- Automatic partitioning for tables > 1M rows
- Configurable compression (zstd, snappy, gzip)
- Transparent access via DuckDB views
- Automatic migration on re-initialization
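The export path can be pictured with plain DuckDB SQL. This is a sketch, not the layer's actual code; the file path and table name are illustrative:

```sql
-- Export a large raw table to Parquet with zstd compression
COPY raw.historical_data
TO 'data/historical_data.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

-- Replace the table with a view so existing queries are unchanged
DROP TABLE raw.historical_data;
CREATE VIEW raw.historical_data AS
SELECT * FROM read_parquet('data/historical_data.parquet');
```

For tables over 1M rows, DuckDB's `PARTITION_BY` option on `COPY` can write a partitioned Parquet directory instead of a single file.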
Tables are automatically classified based on size:
- Tables < 10K rows → Live queries
- Tables 10K-100K rows → DuckDB import
- Tables ≥ 100K rows → Parquet storage
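In Python, this routing reduces to a simple threshold check. A minimal sketch (the function name is hypothetical; the threshold values mirror the `live_threshold` and `parquet_threshold` keys in `config.yaml`):

```python
LIVE_THRESHOLD = 10_000      # mirrors sync.live_threshold
PARQUET_THRESHOLD = 100_000  # mirrors sync.parquet_threshold

def classify_table(row_count: int) -> str:
    """Route a table to a storage tier by its row count."""
    if row_count < LIVE_THRESHOLD:
        return "live"     # queried directly via the MSSQL extension
    if row_count < PARQUET_THRESHOLD:
        return "duckdb"   # imported as a DuckDB table
    return "parquet"      # exported to Parquet, exposed via a view
```

For example, the 224,433-row payroll table classifies as `"parquet"`, while a 1,129-row employee table stays `"live"`.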
Manual overrides are available in `config.yaml`:

```yaml
sync:
  auto_classify: true
  live_threshold: 10000
  parquet_threshold: 100000
  parquet_enabled: true
  parquet_compression: "zstd"
  force_live:
    - "Activity_Log"
  force_import:
    - "MediumTable"
  force_parquet:
    - "HistoricalData"
```

If SQL Server becomes unavailable:
- Imported tables continue working normally
- Live tables show appropriate warnings
- Optional cache snapshots available as fallback
Use `python scripts/cache_view.py raw.activity_log` to create cache snapshots.
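A cache snapshot amounts to materializing the live view as a local table. One plausible shape for the SQL such a script runs (the `cache` schema name is a guess, not confirmed by the script):

```sql
-- Materialize the live view into a local DuckDB table
CREATE SCHEMA IF NOT EXISTS cache;
CREATE OR REPLACE TABLE cache.activity_log AS
SELECT * FROM raw.activity_log;
```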
- ARM Mac Compatible - Uses pymssql instead of ODBC
- Fast Local Queries - Data stored in DuckDB columnar format
- Business-Friendly - Clean column names and structures
- Configurable Import - Choose which tables to sync
- Activity Log Filtering - Only imports last 30 days
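The 30-day activity-log window corresponds to a date filter applied at import time on the SQL Server side. A T-SQL sketch, where the table and column names (`Activity_Log`, `activity_date`) are assumptions:

```sql
-- Applied when importing from SQL Server via pymssql
SELECT *
FROM Activity_Log
WHERE activity_date >= DATEADD(day, -30, GETDATE());
```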
- Python 3.8+
- Conda (recommended)
- VPN access to SQL Server at 192.168.20.203
```bash
# Create conda environment
conda create -n hrms python=3.11 -y
conda activate hrms

# Install dependencies
pip install -r requirements.txt
```

Then:

- Review `config.yaml` for database settings
- Run the initialization script:

```bash
python init_semantic_layer.py
```
```
hrms-semantic-layer/
├── config.yaml              # Database connection config
├── init_semantic_layer.py   # Initialize DuckDB semantic layer
├── requirements.txt         # Python dependencies
├── models/                  # SQL models for semantic layer
│   ├── staging/             # Cleaned raw data views
│   ├── business/            # Business-friendly views
│   └── metrics/             # Aggregated metrics
└── hrmsdb.duckdb            # DuckDB database file (created on init)
```
```bash
conda activate hrms
python init_semantic_layer.py
```

```python
import duckdb

conn = duckdb.connect('hrmsdb.duckdb')

# Query employee summary
result = conn.execute("""
    SELECT * FROM business.employee_summary
    LIMIT 10
""").fetchdf()
print(result)

# Query benefits metrics
result = conn.execute("""
    SELECT * FROM metrics.headcount_metrics
""").fetchdf()
print(result)
```

- `staging.stg_activity_log` - User activity logs (last 30 days)
- `staging.stg_attendance` - Attendance records
- `staging.stg_employees` - Employee information
- `staging.stg_payroll` - Payroll/benefits data
- `business.employee_summary` - Employee with benefit plan summary
- `business.payroll_detail` - Detailed benefits enrollment
- `business.attendance_detail` - Attendance with employee info
- `business.staffing_by_shift` - Hours by shift and employee
- `business.staff_summary` - Staff benefits overview
- `metrics.headcount_metrics` - Employee counts by benefit type
- `metrics.monthly_payroll_metrics` - Benefits enrollment metrics
- `metrics.attendance_metrics` - Attendance by week/shift
- `metrics.clinical_workforce_metrics` - Workforce summary
- `metrics.department_staffing_ratios` - Staffing by department
- `metrics.shift_coverage_metrics` - Shift coverage analysis
Current import includes:
- 224,433 payroll/benefits records
- 1,129 unique employees
- Attendance records across 4 shifts
- 1099 and 401k data
- Activity logs (last 30 days only)
To refresh data from SQL Server, simply re-run:

```bash
python init_semantic_layer.py
```

This will:

- Connect to SQL Server via pymssql
- Import the configured tables (see `config.yaml`)
- Recreate all staging, business, and metrics views
- `duckdb` - Local analytical database
- `pymssql` - SQL Server connection (ARM Mac compatible)
- `pandas` - Data manipulation
- `pyyaml` - Configuration
- `python-dotenv` - Environment variables
- `sqlalchemy` - SQL toolkit
- Activity logs are filtered to the last 30 days (configurable in `config.yaml`)
- Tables starting with numbers get a `t_` prefix (e.g., `401kdata` → `t_401kdata`)
- Workforce data tables were not available in SQL Server
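The `t_` prefix rule is a one-line sanitization step. A sketch, with a hypothetical function name:

```python
def sanitize_table_name(name: str) -> str:
    """Prefix digit-leading names so they form valid unquoted SQL
    identifiers, e.g. '401kdata' -> 't_401kdata'."""
    return f"t_{name}" if name[:1].isdigit() else name
```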