ClaimLens NPDB is a healthcare analytics dashboard for exploring U.S. medical malpractice payment patterns using the National Practitioner Data Bank Public Use Data File.
The project converts a raw, coded government dataset into an interactive analytics product focused on malpractice allegation types, injury severity, payment concentration, geography, practitioner fields, reporting lag, and data reliability.
The objective of ClaimLens NPDB is to make public malpractice payment data easier to understand, analyze, and communicate in a responsible healthcare analytics context.
The dashboard helps answer questions such as:
- Which medical malpractice allegation categories account for the highest payment totals?
- How do payment patterns differ across injury severity levels?
- How have report volume and median payment amounts changed over time?
- Which practitioner license fields are associated with the largest payment concentration?
- Which states show higher reported malpractice payment totals?
- How complete and reliable are the fields used in the analysis?
This is not a clinical diagnosis tool, provider ranking system, or negligence detector. It is a healthcare data analytics project built around de-identified public-use malpractice payment records.
Medical malpractice data is high-stakes, sensitive, and easy to misinterpret. Raw NPDB public-use records are coded, de-identified, and difficult to use directly without understanding the codebook and data limitations.
This project demonstrates practices for handling this kind of healthcare data responsibly:
- Decode official public-use codes before analysis.
- Separate payment reports from proof of clinical negligence.
- Show data quality instead of hiding missingness.
- Avoid unsupported clinical claims.
- Present findings in a clear, decision-friendly interface.
The dashboard includes:
- Executive KPI dashboard: reports, total paid, median payment, P90 payment, and severe injury share.
- Trend analysis: yearly report volume and median malpractice payment trends.
- Payment distribution: payment-band analysis across filtered records.
- Injury severity analysis: NPDB outcome severity mix with median payment context.
- Allegation intelligence: ranked allegation categories by total payment, frequency, severity share, and median payment.
- Geographic analysis: state-level malpractice payment concentration using available NPDB location fields.
- Practitioner field analysis: decoded license/practitioner fields grouped into broader categories.
- Reliability tab: field completeness checks for payment, severity, demographics, reporting lag, and location availability.
- Dark-mode UI: polished Streamlit dashboard designed for portfolio presentation.
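The executive KPIs listed above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's actual code (which uses pandas): the function name `executive_kpis` and the field names `payment` and `severe_injury` are assumptions, and the sample values are made up for the example.

```python
from statistics import median, quantiles


def executive_kpis(records):
    """Compute headline KPIs from a list of dicts.

    `payment` and `severe_injury` are hypothetical field names.
    Records with a missing payment still count as reports but are
    left out of the dollar statistics.
    """
    payments = [r["payment"] for r in records if r["payment"] is not None]
    severe = sum(1 for r in records if r["severe_injury"])
    return {
        "reports": len(records),
        "total_paid": sum(payments),
        "median_payment": median(payments),
        # With n=10 there are 9 cut points; the last one is the 90th percentile.
        "p90_payment": quantiles(payments, n=10, method="inclusive")[8],
        "severe_injury_share": severe / len(records),
    }


# Made-up sample records, not NPDB data.
sample = [
    {"payment": 10_000, "severe_injury": False},
    {"payment": 50_000, "severe_injury": False},
    {"payment": 250_000, "severe_injury": True},
    {"payment": 1_000_000, "severe_injury": True},
    {"payment": None, "severe_injury": False},
]
kpis = executive_kpis(sample)
print(kpis)
```

The sketch assumes at least two non-missing payments and a non-empty record list; a real implementation would need to handle empty filter results.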
Source: National Practitioner Data Bank Public Use Data File.
Local source file:
data/npdb_public.csv
Official code mappings are stored locally in:
data/npdb_codebook.json
The codebook JSON was generated from the official NPDB Public Use Data File Format Specifications.
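As a rough sketch of how a codebook lookup like this might be applied, the example below decodes one record. The JSON layout (field name mapped to a code-to-label dictionary), the field `EXAMPLE_FIELD`, its labels, and the helper `decode_record` are all hypothetical; the real `data/npdb_codebook.json` may be organized differently.

```python
import json


def decode_record(record, codebook):
    """Return a copy of `record` with coded values replaced by labels.

    Codes missing from the codebook are flagged rather than dropped,
    so unmapped values stay visible in downstream analysis.
    """
    decoded = {}
    for field, value in record.items():
        mapping = codebook.get(field)
        if mapping is None:
            decoded[field] = value  # field has no code mapping
        else:
            decoded[field] = mapping.get(str(value), f"Unmapped code: {value}")
    return decoded


# Hypothetical codebook layout with placeholder labels.
codebook = json.loads('{"EXAMPLE_FIELD": {"1": "Label A", "2": "Label B"}}')
row = {"EXAMPLE_FIELD": 2, "AMOUNT": "$1,000"}
decoded = decode_record(row, codebook)
print(decoded)
```

Keys are compared as strings because JSON object keys are always strings, while codes read from a CSV may arrive as integers.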
Generated analytics outputs:
data/npdb_analytics.csv
data/data_quality.json
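A data-quality summary of this kind could be as simple as a per-field completeness ratio. The sketch below is an assumption about what such a summary might contain, not the actual schema of `data/data_quality.json`; the function name `completeness`, the field names, and the sample rows are illustrative.

```python
import json


def completeness(records, fields):
    """Share of records with a non-empty value for each field."""
    total = len(records)
    summary = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        summary[field] = round(present / total, 4) if total else 0.0
    return summary


# Made-up rows with placeholder labels, not NPDB data.
rows = [
    {"payment": 25000, "severity": "Minor", "state": "TX"},
    {"payment": None, "severity": "Major", "state": ""},
    {"payment": 90000, "severity": "", "state": "CA"},
    {"payment": 40000, "severity": "Minor", "state": None},
]
quality = completeness(rows, ["payment", "severity", "state"])
print(json.dumps(quality, indent=2))
```

Reporting the ratio per field, instead of dropping incomplete rows, keeps missingness visible to the reader of the dashboard.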
The project follows a reproducible analytics workflow:
- Load the raw NPDB public-use CSV.
- Decode coded NPDB fields using the official public-use format specification.
- Clean payment fields and convert dollar strings into numeric amounts.
- Derive healthcare analytics features, including:
  - report year
  - event year
  - event-to-report lag
  - payment band
  - injury severity score
  - practitioner group
  - state proxy
- Generate an analytics-ready dataset.
- Generate a data-quality summary.
- Render the dashboard with Streamlit and Plotly.
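The cleaning and feature-derivation steps in the workflow above might look roughly like the following. This is a sketch under stated assumptions: the helper names (`parse_dollars`, `payment_band`, `report_lag`) are hypothetical, and the band edges are illustrative only, since the cut points actually used in `pipeline.py` are not documented here.

```python
import math
import re


def parse_dollars(raw):
    """Convert a string such as '$12,500' to a float; None if unparseable."""
    if raw is None:
        return None
    cleaned = re.sub(r"[$,\s]", "", str(raw))
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # Reject 'nan' and 'inf', which float() would otherwise accept.
    return value if math.isfinite(value) else None


# Illustrative band edges (exclusive upper bounds), not the project's real ones.
BANDS = [(50_000, "Under $50K"), (250_000, "$50K-$250K"), (1_000_000, "$250K-$1M")]


def payment_band(amount):
    """Map a numeric payment to a labeled band."""
    if amount is None:
        return "Unknown"
    for upper, label in BANDS:
        if amount < upper:
            return label
    return "$1M+"


def report_lag(event_year, report_year):
    """Years from event to report; None if a year is missing or the lag is negative."""
    if event_year is None or report_year is None:
        return None
    lag = report_year - event_year
    return lag if lag >= 0 else None


amount = parse_dollars("$12,500")
print(amount, payment_band(amount), report_lag(2015, 2019))
```

Returning `None` for unparseable or implausible values, rather than zero, keeps those records countable in the data-quality summary without distorting payment statistics.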
The project is built with:
- Python
- pandas
- Streamlit
- Plotly
- NPDB Public Use Data File
- Official NPDB format/codebook specifications
Project structure:
.
├── analyzer.py # Aggregation, KPI, trend, and reliability helpers
├── dashboard.py # Streamlit dashboard
├── pipeline.py # Cleaning, decoding, and feature engineering pipeline
├── requirements.txt # Python dependencies
├── data/
│ ├── npdb_public.csv # Raw NPDB public-use data
│ ├── npdb_codebook.json # Official code mappings
│ ├── npdb_analytics.csv # Generated analytics dataset
│ └── data_quality.json # Generated data-quality summary
Install dependencies:
pip install -r requirements.txt
Build the analytics dataset:
python3 pipeline.py
Launch the dashboard:
python3 -m streamlit run dashboard.py
Then open the local Streamlit URL shown in the terminal.