A semantic layer built on DuckDB that provides a clean, business-friendly interface to the HRMS SQL Server database.
This semantic layer:
- Connects to the SQL Server database `hrmsdb` at `192.168.20.203:1433` using pymssql
- Imports data into DuckDB for fast local queries
- Provides business-friendly views and metrics
- Works on ARM Mac (Apple Silicon) without ODBC dependencies
- Activity logs filtered to the last 30 days to keep the database size manageable
```
SQL Server (hrmsdb)
        ↓  (data import via pymssql)
DuckDB Semantic Layer
├── raw.*      - Imported tables from SQL Server
├── staging.*  - Cleaned, typed data views
├── business.* - Denormalized, user-friendly views
└── metrics.*  - Aggregated KPIs and metrics
```
The semantic layer uses a hybrid approach with 3-tier storage:
- Live Queries (MSSQL Extension): Small, frequently-changing tables queried directly from SQL Server
- DuckDB Tables: Medium-sized tables imported into DuckDB for fast analytics
- Parquet Files: Large tables exported to Parquet with DuckDB views for optimal storage
Tier thresholds by table size:
| Size | Storage | Benefits |
|---|---|---|
| < 10K rows | Live queries (MSSQL extension) | Real-time data, no storage |
| 10K - 100K rows | DuckDB tables | Fast queries, good compression |
| > 100K rows | Parquet files | Excellent compression (~60% reduction), query via views |
Parquet Features:
- Automatic partitioning for tables > 1M rows
- Configurable compression (zstd, snappy, gzip)
- Transparent access via DuckDB views
- Automatic migration on re-initialization
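The export path can be pictured with plain DuckDB SQL. This is a sketch, not the layer's actual code; the file path and table name are illustrative:

```sql
-- Export a large raw table to Parquet with zstd compression
COPY raw.historical_data
TO 'data/historical_data.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

-- Replace the table with a view so existing queries are unchanged
DROP TABLE raw.historical_data;
CREATE VIEW raw.historical_data AS
SELECT * FROM read_parquet('data/historical_data.parquet');
```

For tables over 1M rows, DuckDB's `PARTITION_BY` option on `COPY` can write a partitioned Parquet directory instead of a single file.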
Tables are automatically classified based on size:
- Tables < 10K rows → Live queries
- Tables 10K-100K rows → DuckDB import
- Tables ≥ 100K rows → Parquet storage
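In Python, this routing reduces to a simple threshold check. A minimal sketch (the function name is hypothetical; the threshold values mirror the `live_threshold` and `parquet_threshold` keys in `config.yaml`):

```python
LIVE_THRESHOLD = 10_000      # mirrors sync.live_threshold
PARQUET_THRESHOLD = 100_000  # mirrors sync.parquet_threshold

def classify_table(row_count: int) -> str:
    """Route a table to a storage tier by its row count."""
    if row_count < LIVE_THRESHOLD:
        return "live"     # queried directly via the MSSQL extension
    if row_count < PARQUET_THRESHOLD:
        return "duckdb"   # imported as a DuckDB table
    return "parquet"      # exported to Parquet, exposed via a view
```

For example, the 224,433-row payroll table classifies as `"parquet"`, while a 1,129-row employee table stays `"live"`.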
Manual overrides are available in `config.yaml`:

```yaml
sync:
  auto_classify: true
  live_threshold: 10000
  parquet_threshold: 100000
  parquet_enabled: true
  parquet_compression: "zstd"
  force_live:
    - "Activity_Log"
  force_import:
    - "MediumTable"
  force_parquet:
    - "HistoricalData"
```

If SQL Server becomes unavailable:
- Imported tables continue working normally
- Live tables show appropriate warnings
- Optional cache snapshots available as fallback
Use `python scripts/cache_view.py raw.activity_log` to create cache snapshots.
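A cache snapshot amounts to materializing the live view as a local table. One plausible shape for the SQL such a script runs (the `cache` schema name is a guess, not confirmed by the script):

```sql
-- Materialize the live view into a local DuckDB table
CREATE SCHEMA IF NOT EXISTS cache;
CREATE OR REPLACE TABLE cache.activity_log AS
SELECT * FROM raw.activity_log;
```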
- ARM Mac Compatible - Uses pymssql instead of ODBC
- Fast Local Queries - Data stored in DuckDB columnar format
- Business-Friendly - Clean column names and structures
- Configurable Import - Choose which tables to sync
- Activity Log Filtering - Only imports last 30 days
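The 30-day activity-log window corresponds to a date filter applied at import time on the SQL Server side. A T-SQL sketch, where the table and column names (`Activity_Log`, `activity_date`) are assumptions:

```sql
-- Applied when importing from SQL Server via pymssql
SELECT *
FROM Activity_Log
WHERE activity_date >= DATEADD(day, -30, GETDATE());
```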
- Python 3.8+
- Conda (recommended)
- VPN access to SQL Server at 192.168.20.203
```bash
# Create conda environment
conda create -n hrms python=3.11 -y
conda activate hrms

# Install dependencies
pip install -r requirements.txt
```

Then:

- Review `config.yaml` for database settings
- Run the initialization script:

```bash
python init_semantic_layer.py
```
```
hrms-semantic-layer/
├── config.yaml              # Database connection config
├── init_semantic_layer.py   # Initialize DuckDB semantic layer
├── requirements.txt         # Python dependencies
├── models/                  # SQL models for semantic layer
│   ├── staging/             # Cleaned raw data views
│   ├── business/            # Business-friendly views
│   └── metrics/             # Aggregated metrics
└── hrmsdb.duckdb            # DuckDB database file (created on init)
```
```bash
conda activate hrms
python init_semantic_layer.py
```

```python
import duckdb

conn = duckdb.connect('hrmsdb.duckdb')

# Query employee summary
result = conn.execute("""
    SELECT * FROM business.employee_summary
    LIMIT 10
""").fetchdf()
print(result)

# Query benefits metrics
result = conn.execute("""
    SELECT * FROM metrics.headcount_metrics
""").fetchdf()
print(result)
```

- `staging.stg_activity_log` - User activity logs (last 30 days)
- `staging.stg_attendance` - Attendance records
- `staging.stg_employees` - Employee information
- `staging.stg_payroll` - Payroll/benefits data
- `business.employee_summary` - Employee with benefit plan summary
- `business.payroll_detail` - Detailed benefits enrollment
- `business.attendance_detail` - Attendance with employee info
- `business.staffing_by_shift` - Hours by shift and employee
- `business.staff_summary` - Staff benefits overview
- `metrics.headcount_metrics` - Employee counts by benefit type
- `metrics.monthly_payroll_metrics` - Benefits enrollment metrics
- `metrics.attendance_metrics` - Attendance by week/shift
- `metrics.clinical_workforce_metrics` - Workforce summary
- `metrics.department_staffing_ratios` - Staffing by department
- `metrics.shift_coverage_metrics` - Shift coverage analysis
Current import includes:
- 224,433 payroll/benefits records
- 1,129 unique employees
- Attendance records across 4 shifts
- 1099 and 401k data
- Activity logs (last 30 days only)
To refresh data from SQL Server, simply re-run:

```bash
python init_semantic_layer.py
```

This will:

- Connect to SQL Server via pymssql
- Import the configured tables (see `config.yaml`)
- Recreate all staging, business, and metrics views
- `duckdb` - Local analytical database
- `pymssql` - SQL Server connection (ARM Mac compatible)
- `pandas` - Data manipulation
- `pyyaml` - Configuration
- `python-dotenv` - Environment variables
- `sqlalchemy` - SQL toolkit
- Activity logs are filtered to the last 30 days (configurable in `config.yaml`)
- Tables starting with numbers get a `t_` prefix (e.g., `401kdata` → `t_401kdata`)
- Workforce data tables were not available in SQL Server
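The `t_` prefix rule is a one-line sanitization step. A sketch, with a hypothetical function name:

```python
def sanitize_table_name(name: str) -> str:
    """Prefix digit-leading names so they form valid unquoted SQL
    identifiers, e.g. '401kdata' -> 't_401kdata'."""
    return f"t_{name}" if name[:1].isdigit() else name
```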