A local-only tool for removing personally identifiable information (PII) from research data. Built for human subjects researchers who need HIPAA-compliant de-identification with full audit trails.
All processing happens on your computer. Your data never leaves your machine.
- HIPAA Safe Harbor Compliant - Detects all 18 identifier types
- Multiple Anonymization Strategies - Redact, mask, hash, or replace with fake data
- Audit Logs for IRB - Document every detection and action taken
- Local-Only Access - App is bound to `127.0.0.1:8501` by default
- User-Friendly UI - Streamlit interface designed for non-technical researchers
- Supports CSV, Excel, and Text - Common research data formats
- Docker Desktop (Mac, Windows, or Linux)
Mac/Linux:
git clone https://github.com/jinghanlib/deidentify-app.git
cd deidentify-app
./run.sh

Windows:
git clone https://github.com/jinghanlib/deidentify-app.git
cd deidentify-app
run.bat

The launcher will:
- Check that Docker is installed and running
- Build the image (first time only, ~5-10 minutes)
- Start the container with localhost-only port binding
- Open your browser to http://localhost:8501
Upload CSV, Excel, or text files. The app previews the first 5 rows.
Tell the app how to process each column:
| Type | Description | Example |
|---|---|---|
| Skip | Don't process | record_id, date_collected |
| Structured | Regex patterns only | Email or phone columns |
| Free Text | Full NER + regex | Notes, comments, descriptions |
| Direct Identifier | Always fully redact | SSN, MRN columns |
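
The column types above can be read as a dispatch rule applied to each cell. The sketch below is a minimal illustration, not the app's actual code: the function name `process_cell`, the two regexes, and the `[EMAIL]` / `[PHONE]` placeholders are assumptions, and the Free Text branch is left out because it needs an NER model.

```python
import re

# Hypothetical patterns for illustration; the app's real regexes may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def process_cell(value: str, column_type: str) -> str:
    """Apply the handling implied by a column type to one cell value."""
    if column_type == "skip":
        # Skip: the value passes through untouched.
        return value
    if column_type == "direct_identifier":
        # Direct Identifier: the whole cell is redacted, no detection needed.
        return "[REDACTED]"
    if column_type == "structured":
        # Structured: regex patterns only, no NER.
        value = EMAIL_RE.sub("[EMAIL]", value)
        return PHONE_RE.sub("[PHONE]", value)
    # Free Text would add NER on top of the regex pass; omitted in this sketch.
    raise NotImplementedError(f"column type not covered here: {column_type}")
```

Treating Direct Identifier columns separately avoids relying on detection at all for columns such as SSN or MRN, where a missed match would leak an identifier.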
Choose which PII types to detect:
- Names, Locations, Dates
- Phone numbers, Email addresses
- SSN, Medical record numbers
- And more (18 HIPAA identifier types)
| Strategy | Example | Use Case |
|---|---|---|
| Redact | [REDACTED] | Maximum privacy |
| Type Tag | [PERSON_1] | Preserve entity counts for analysis |
| Mask | J*** S**** | Keep partial structure |
| Hash | a1b2c3d4 | Link records across files |
| Fake | Jane Doe | Preserve readability |
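
To make the differences between strategies concrete, here is a minimal standard-library sketch of four of them. It is not the app's implementation: the function and class names, the optional `salt` parameter, and the choice of SHA-256 truncated to 8 hex characters are all assumptions made for illustration. The Fake strategy is omitted because it needs a fake-data generator.

```python
import hashlib


def redact(text: str) -> str:
    """Redact: replace the value entirely."""
    return "[REDACTED]"


def mask(text: str) -> str:
    """Mask: keep the first character of each word and star out the rest."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in text.split())


def hash_value(text: str, salt: str = "", length: int = 8) -> str:
    """Hash: the same input always maps to the same short token."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:length]


class TypeTagger:
    """Type Tag: number distinct values per entity type, e.g. [PERSON_1]."""

    def __init__(self) -> None:
        self._tags: dict[tuple[str, str], str] = {}
        self._counts: dict[str, int] = {}

    def tag(self, text: str, entity_type: str) -> str:
        key = (entity_type, text)
        if key not in self._tags:
            # First time this value is seen for this type: assign next number.
            self._counts[entity_type] = self._counts.get(entity_type, 0) + 1
            self._tags[key] = f"[{entity_type}_{self._counts[entity_type]}]"
        return self._tags[key]
```

Because hashing is deterministic, two files processed with the same salt produce the same token for the same name, which is what allows records to be linked. A short truncated digest can collide on large datasets, so a longer `length` may be preferable there.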
Download:
- De-identified file (same format as input)
- Audit log (JSON, required for IRB documentation)
| Purpose | Location |
|---|---|
| Input data | data/ |
| De-identified output | output/ |
| Audit logs | audit/ |
| Sample data | data/sample/ |
This tool is designed with research data security in mind:
- Localhost Binding: App is exposed only on `127.0.0.1:8501` by default
- No Cloud APIs: All NLP processing uses local SpaCy models
- No Telemetry: Streamlit telemetry is disabled
- Non-Root User: Container runs as unprivileged user
- Read-Only Input: Input data directory is mounted read-only
- Audit Trail: Every detection and action is logged (without PII)
The audit log provides everything needed for IRB compliance:
{
"run_id": "uuid",
"timestamp": "2024-01-15T10:30:00Z",
"settings": {
"entities_selected": ["PERSON", "EMAIL_ADDRESS"],
"confidence_threshold": 0.7,
"strategy_per_entity": {"PERSON": "hash"}
},
"summary": {
"total_records": 100,
"total_detections": 342
}
}

The audit log never contains the original PII - only entity type, position, confidence, and action taken.
See docs/irb_methodology.md for a template you can include in your IRB application.
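
One way to guarantee that no PII reaches the log is to copy only an allow-list of fields from each detection. The sketch below assumes a hypothetical `build_audit_log` helper and a hypothetical per-detection `detections` list with `entity_type`, `start`, `end`, `score`, and `action` keys; the top-level keys follow the example above, but the app's real log layout may differ.

```python
import uuid
from datetime import datetime, timezone

# Only these keys are ever copied into the log (hypothetical field names).
SAFE_FIELDS = ("entity_type", "start", "end", "score", "action")


def build_audit_log(settings: dict, detections: list[dict], total_records: int) -> dict:
    """Build an audit record that omits the matched text itself."""
    return {
        "run_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "settings": settings,
        "summary": {
            "total_records": total_records,
            "total_detections": len(detections),
        },
        # Any other key on a detection, such as the original text, is dropped.
        "detections": [{k: d[k] for k in SAFE_FIELDS} for d in detections],
    }
```

An allow-list fails safe: if a new field carrying raw text is later added to the detection objects, it stays out of the log unless someone explicitly adds it to `SAFE_FIELDS`.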
Adjust detection sensitivity:
| Level | Threshold | Description |
|---|---|---|
| Aggressive | 0.3-0.5 | More detections, some false positives |
| Moderate | 0.7 | Balanced (default) |
| Conservative | 0.8-1.0 | Fewer false positives, may miss some |
For IRB: We recommend 0.5 (the Aggressive range, lower than the 0.7 default) with manual review of flagged items.
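
The threshold acts as a cutoff on each detection's confidence score. A minimal sketch, assuming detections are dicts with a `score` key (a hypothetical shape) and that scores at or above the threshold are kept:

```python
def filter_by_confidence(detections: list[dict], threshold: float = 0.7) -> list[dict]:
    """Keep only detections whose score meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Made-up scores for illustration only.
sample = [
    {"entity_type": "PERSON", "score": 0.55},
    {"entity_type": "EMAIL_ADDRESS", "score": 0.75},
    {"entity_type": "US_SSN", "score": 0.95},
]

kept_aggressive = filter_by_confidence(sample, threshold=0.5)    # keeps all three
kept_moderate = filter_by_confidence(sample, threshold=0.7)      # drops the 0.55 match
kept_conservative = filter_by_confidence(sample, threshold=0.9)  # keeps only 0.95
```

Lowering the threshold trades more false positives for fewer missed identifiers, which is why a low setting pairs naturally with manual review.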
- Microsoft Presidio - PII detection and anonymization
- SpaCy (`en_core_web_lg`) - Named entity recognition
- Streamlit - Web interface
- Docker - Containerization and isolation
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v

deidentify-app/
├── app/
│ ├── main.py # Streamlit entry point
│ ├── pipeline.py # Core Presidio pipeline
│ ├── recognizers/ # Custom PII recognizers
│ ├── operators/ # Anonymization strategies
│ ├── utils/ # File I/O, audit logging
│ └── ui/ # Streamlit UI components
├── data/sample/ # Synthetic test data
├── tests/ # Unit tests
└── docs/ # Documentation
See app/recognizers/custom.py for examples of custom PII patterns (MRNs, study IDs, etc.).
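
As a rough idea of what a custom pattern involves, the standard-library sketch below pairs a regex with an entity type and a fixed confidence score. It does not reproduce `app/recognizers/custom.py` and does not use the Presidio API: the `PatternSpec` class, the `find_matches` function, the MRN format (the letters `MRN` followed by 6 to 8 digits), and the score of 0.85 are all assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternSpec:
    """A regex plus the entity type and confidence assigned to its matches."""
    entity_type: str
    regex: str
    score: float


# Hypothetical MRN format; real formats vary by institution.
MRN = PatternSpec("MEDICAL_RECORD_NUMBER", r"\bMRN[:# ]*\d{6,8}\b", 0.85)


def find_matches(text: str, spec: PatternSpec) -> list[dict]:
    """Return one detection per regex match, with character offsets."""
    return [
        {
            "entity_type": spec.entity_type,
            "start": m.start(),
            "end": m.end(),
            "score": spec.score,
        }
        for m in re.finditer(spec.regex, text)
    ]
```

For example, `find_matches("Patient MRN: 1234567 seen today", MRN)` yields a single detection covering the span `MRN: 1234567`.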
See docs/QUICKSTART.md for common issues and solutions.
MIT License - see LICENSE for details.
- Microsoft Presidio for the PII detection engine
- SpaCy for NER models
- Streamlit for the UI framework