De-Identification App

A local-only tool for removing personally identifiable information (PII) from research data. Built for human subjects researchers who need HIPAA-compliant de-identification with full audit trails.

All processing happens on your computer. Your data never leaves your machine.

Features

HIPAA Safe Harbor Compliant - Detects all 18 identifier types
Multiple Anonymization Strategies - Redact, mask, hash, or replace with fake data
Audit Logs for IRB - Document every detection and action taken
Local-Only Access - App is bound to 127.0.0.1:8501 by default
User-Friendly UI - Streamlit interface designed for non-technical researchers
Supports CSV, Excel, and Text - Common research data formats

Quick Start

Prerequisites

Docker Desktop (Mac, Windows, or Linux)

Run the App

Mac/Linux:

git clone https://github.com/jinghanlib/deidentify-app.git
cd deidentify-app
./run.sh

Windows:

git clone https://github.com/jinghanlib/deidentify-app.git
cd deidentify-app
run.bat

The launcher will:

Check that Docker is installed and running
Build the image (first time only, ~5-10 minutes)
Start the container with localhost-only port binding
Open your browser to http://localhost:8501

How It Works

1. Upload Your Data

Upload CSV, Excel, or text files. The app previews the first 5 rows.
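As a rough illustration of the preview step, the sketch below reads only the first five data rows of a CSV using the standard library. The function name `preview_csv` and the sample data are hypothetical; the app itself may load files differently.

```python
import csv
import io
from itertools import islice


def preview_csv(text: str, n_rows: int = 5) -> list[dict[str, str]]:
    """Return the first n_rows data rows of CSV text as dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text))
    # islice stops after n_rows, so the rest of the file is never parsed.
    return list(islice(reader, n_rows))


# Synthetic sample: a header line plus eight data rows.
sample = "record_id,notes\n" + "".join(f"{i},note {i}\n" for i in range(1, 9))
preview = preview_csv(sample)
print(len(preview))
print(preview[0])
```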

2. Configure Columns

Tell the app how to process each column:

Type	Description	Example
Skip	Don't process	`record_id`, `date_collected`
Structured	Regex patterns only	Email or phone columns
Free Text	Full NER + regex	Notes, comments, descriptions
Direct Identifier	Always fully redact	SSN, MRN columns
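The column types above amount to a per-column dispatch. The following is a minimal sketch of that idea, not the app's actual code: the type keys, the `process_cell` function, and both regex patterns are assumptions made for illustration, and the free-text branch applies only the regex pass because NER requires a language model.

```python
import re

# Hypothetical patterns for illustration; the app's real recognizers may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def process_cell(value: str, column_type: str) -> str:
    """Process one cell according to its configured column type."""
    if column_type == "skip":
        return value  # left untouched
    if column_type == "direct_identifier":
        return "[REDACTED]"  # the whole cell is always replaced
    if column_type == "structured":
        value = EMAIL_RE.sub("[REDACTED]", value)
        return SSN_RE.sub("[REDACTED]", value)
    if column_type == "free_text":
        # A full pipeline would also run NER here; this sketch reuses the regex pass.
        return process_cell(value, "structured")
    raise ValueError(f"unknown column type: {column_type}")


print(process_cell("R-001", "skip"))
print(process_cell("Contact: a.b@example.org", "structured"))
```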

3. Select Entity Types

Choose which PII types to detect:

Names, Locations, Dates
Phone numbers, Email addresses
SSN, Medical record numbers
And more (18 HIPAA identifier types)

4. Choose Anonymization Strategy

Strategy	Example	Use Case
Redact	`[REDACTED]`	Maximum privacy
Type Tag	`[PERSON_1]`	Preserve entity counts for analysis
Mask	`J* S**`	Keep partial structure
Hash	`a1b2c3d4`	Link records across files
Fake	`Jane Doe`	Preserve readability
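To make the trade-offs between strategies concrete, here is a hedged sketch of four of them using only the standard library. All function and class names are hypothetical, and the details (keyed HMAC-SHA256 truncated to 8 characters, masking all but the first letter of each word, reusing a tag for a repeated value) are assumptions rather than the app's confirmed behavior. The Fake strategy is omitted because realistic replacement data usually needs a generator library.

```python
import hashlib
import hmac
from collections import defaultdict


def redact(text: str) -> str:
    """Replace the value entirely."""
    return "[REDACTED]"


def mask(text: str) -> str:
    """Keep the first character of each word and star out the rest."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in text.split())


def hash_value(text: str, key: str, length: int = 8) -> str:
    """Keyed hash: the same value and key always give the same token."""
    digest = hmac.new(key.encode("utf-8"), text.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


class TypeTagger:
    """Assign numbered tags per entity type, reusing the tag for repeated values."""

    def __init__(self) -> None:
        self._seen: dict[str, dict[str, int]] = defaultdict(dict)

    def tag(self, entity_type: str, text: str) -> str:
        seen = self._seen[entity_type]
        if text not in seen:
            seen[text] = len(seen) + 1
        return f"[{entity_type}_{seen[text]}]"


tagger = TypeTagger()
print(mask("John Smith"))
print(tagger.tag("PERSON", "John Smith"), tagger.tag("PERSON", "Ann Lee"))
```

Because the hash is keyed, records can be linked across files only when every file is processed with the same key, so the key should be stored as carefully as the original data.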

5. Export

Download:

De-identified file (same format as input)
Audit log (JSON, required for IRB documentation)

File Locations

Purpose	Location
Input data	`data/`
De-identified output	`output/`
Audit logs	`audit/`
Sample data	`data/sample/`

Security

This tool is designed with research data security in mind:

Localhost Binding: App is exposed only on 127.0.0.1:8501 by default
No Cloud APIs: All NLP processing uses local spaCy models
No Telemetry: Streamlit telemetry is disabled
Non-Root User: Container runs as unprivileged user
Read-Only Input: Input data directory is mounted read-only
Audit Trail: Every detection and action is logged (without PII)

For IRB Documentation

The audit log records each run's settings and a detection summary for your IRB documentation, for example:

{
  "run_id": "uuid",
  "timestamp": "2024-01-15T10:30:00Z",
  "settings": {
    "entities_selected": ["PERSON", "EMAIL_ADDRESS"],
    "confidence_threshold": 0.7,
    "strategy_per_entity": {"PERSON": "hash"}
  },
  "summary": {
    "total_records": 100,
    "total_detections": 342
  }
}

The audit log never contains the original PII - only entity type, position, confidence, and action taken.
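One way to guarantee that no PII reaches the log is to copy only an allow-list of fields from each detection. The sketch below assumes that approach; the `build_audit_record` function, the `detections` key, and the per-detection field names are hypothetical and may not match the app's actual log schema.

```python
import json
import uuid
from datetime import datetime, timezone

# Only these fields are copied into the log; the matched text is never included.
ALLOWED_FIELDS = ("entity_type", "start", "end", "score", "action")


def build_audit_record(detections: list[dict], settings: dict, total_records: int) -> dict:
    """Build a JSON-serializable audit record that omits the original PII."""
    return {
        "run_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "settings": settings,
        "summary": {
            "total_records": total_records,
            "total_detections": len(detections),
        },
        "detections": [{k: d[k] for k in ALLOWED_FIELDS} for d in detections],
    }


detections = [
    {"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.85,
     "action": "hash", "text": "John Smith"},
]
settings = {"entities_selected": ["PERSON"], "confidence_threshold": 0.7}
record = build_audit_record(detections, settings, total_records=1)
print(json.dumps(record, indent=2))
```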

See docs/irb_methodology.md for a template you can include in your IRB application.

Configuration

Confidence Threshold

Adjust detection sensitivity:

Level	Threshold	Description
Aggressive	0.3-0.5	More detections, some false positives
Moderate	0.7	Balanced (default)
Conservative	0.8-1.0	Fewer false positives, may miss some PII

For IRB: We recommend 0.5, which is more sensitive than the 0.7 default, combined with manual review of flagged items.
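The effect of the threshold can be sketched as a simple filter over detection scores. This assumes a detection is kept when its score is greater than or equal to the threshold; the function name and the sample scores are made up for illustration.

```python
def split_by_threshold(detections: list[dict], threshold: float = 0.7):
    """Separate detections at or above the threshold from those below it."""
    accepted = [d for d in detections if d["score"] >= threshold]
    below = [d for d in detections if d["score"] < threshold]
    return accepted, below


# Synthetic detections with varying confidence scores.
sample_detections = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "LOCATION", "score": 0.6},
    {"entity_type": "DATE_TIME", "score": 0.4},
]

for threshold in (0.7, 0.5):
    accepted, below = split_by_threshold(sample_detections, threshold)
    print(threshold, len(accepted), len(below))
```

Lowering the threshold from 0.7 to 0.5 moves the 0.6-score detection into the accepted set, which is why a lower setting pairs well with manual review.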

Tech Stack

Microsoft Presidio - PII detection and anonymization
spaCy (en_core_web_lg) - Named entity recognition
Streamlit - Web interface
Docker - Containerization and isolation

Development

Running Tests

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pytest tests/ -v

Project Structure

deidentify-app/
├── app/
│   ├── main.py              # Streamlit entry point
│   ├── pipeline.py          # Core Presidio pipeline
│   ├── recognizers/         # Custom PII recognizers
│   ├── operators/           # Anonymization strategies
│   ├── utils/               # File I/O, audit logging
│   └── ui/                  # Streamlit UI components
├── data/sample/             # Synthetic test data
├── tests/                   # Unit tests
└── docs/                    # Documentation

Adding Custom Recognizers

See app/recognizers/custom.py for examples of custom PII patterns (MRNs, study IDs, etc.).
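For readers who have not written a pattern-based recognizer before, the sketch below shows the general shape using only the standard library: each pattern pairs an entity type with a regex and a confidence score. The identifier formats, scores, and names here are invented for illustration and are not taken from `app/recognizers/custom.py`. In Presidio itself, the `PatternRecognizer` class plays this role.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternSpec:
    entity_type: str
    regex: str
    score: float  # confidence assigned to every match of this pattern


# Hypothetical identifier formats; adapt them to your institution's conventions.
CUSTOM_PATTERNS = [
    PatternSpec("MRN", r"\bMRN[:\s-]*\d{6,10}\b", 0.85),
    PatternSpec("STUDY_ID", r"\bSTUDY-\d{4}\b", 0.6),
]


def find_custom_entities(text: str) -> list[dict]:
    """Return one detection dict per pattern match, ordered by start position."""
    results = []
    for spec in CUSTOM_PATTERNS:
        for match in re.finditer(spec.regex, text):
            results.append({
                "entity_type": spec.entity_type,
                "start": match.start(),
                "end": match.end(),
                "score": spec.score,
            })
    return sorted(results, key=lambda r: r["start"])


text = "Patient MRN: 12345678 enrolled as STUDY-0042."
found = find_custom_entities(text)
for r in found:
    print(r["entity_type"], text[r["start"]:r["end"]], r["score"])
```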

Troubleshooting

See docs/QUICKSTART.md for common issues and solutions.

License

MIT License - see LICENSE for details.

Acknowledgments

Microsoft Presidio for the PII detection engine
spaCy for NER models
Streamlit for the UI framework
