Real-Time PII Detection & Necessity Analysis Web Platform

Project Overview

A secure web platform that accepts uploaded documents (PDF, image, scanned copy, or plain text), detects PII (Personally Identifiable Information) using OCR, regex, and NLP, checks whether each detected item is necessary for the selected service, and lets the user mask the PII and download a masked version of the document.

Features

Real-time PII detection using OCR (Tesseract), regex, and spaCy NLP (upgradeable to transformer-based models)
Context-aware DOB detection to avoid false positives (e.g., ignores print dates)
Business rules engine to determine if detected PII is required for the selected service
Option to mask PII and download a masked version of the document
Supports PDFs, plain text, images, and scanned documents
Chrome extension support for quick PII analysis and masked file download

Architecture

Backend: Python Flask (handles file upload, OCR, PII detection, necessity analysis, masking, and download)
Frontend: HTML/CSS/JS (UI for uploading, analyzing, and downloading masked files)
Config: YAML file for service rules
OCR / text extraction: Tesseract (via pytesseract) for images; pdfplumber for PDFs
PII Detection: Regex for patterns (Aadhaar, PAN, email, phone, DOB), spaCy for names (upgradeable to transformer-based NER for better accuracy)
Masking: Detected PII is replaced with masked values in the downloadable file

File/Module Descriptions

api/app.py: Main Flask app, API endpoints /upload, /redact, /download-masked
api/ocr.py: OCR utilities using Tesseract and pdfplumber
api/pii_detection.py: PII detection logic (regex + spaCy), context-aware DOB detection, masking
api/rules_engine.py: Business rules engine (loads YAML rules)
api/redaction.py: Redaction utilities (placeholder)
config/service_rules.yaml: Service-specific required PII fields
requirements.txt: Python dependencies
api/sample.txt: Example file for testing
ui/index.html: Main frontend UI
chrome-extension/: Chrome extension for PII detection and masked file download

How It Works (Step by Step)

User uploads a document and selects a service type.
- Handled by the frontend (ui/index.html or Chrome extension) which sends the file and service type to the backend.
Backend extracts text using OCR (if the file is an image or PDF).
- Implemented in api/ocr.py using Tesseract (for images) and pdfplumber (for PDFs).
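The extraction step can be sketched as a dispatch on file extension. This is a minimal sketch under stated assumptions, not the contents of api/ocr.py: the function name `extract_text`, the list of image extensions, and the plain-text fallback are all hypothetical. The third-party imports are deferred so that plain-text input works without them. Note that pdfplumber reads a PDF's embedded text layer, so a page with no text layer contributes an empty string here.

```python
from pathlib import Path

# Assumed set of image extensions routed to Tesseract
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def extract_text(path):
    """Return the text content of a PDF, image, or plain-text file."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        import pdfplumber  # deferred: only needed for PDFs

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for a page with no text layer
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    if suffix in IMAGE_SUFFIXES:
        import pytesseract  # deferred: requires the Tesseract binary on PATH
        from PIL import Image

        return pytesseract.image_to_string(Image.open(path))

    # Anything else is treated as plain text (an assumption of this sketch)
    return path.read_text(encoding="utf-8", errors="replace")
```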
PII detection is performed in api/pii_detection.py:
- Regex is used for structured patterns (Aadhaar, PAN, email, phone, DOB).
- spaCy NLP is used for names (PERSON) and can be upgraded to transformer-based NER for better accuracy.
- Context-aware logic (e.g., for DOB) is applied to avoid false positives.
- Masking is applied to detected PII values using the mask_value function.
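The detection bullets above can be illustrated with a small, dependency-free sketch. Only the name `mask_value` comes from this README; its behavior here (keep the last four letters or digits), the regular expressions, the cue words, and the helpers `detect_pii` and `detect_names` are assumptions chosen for illustration, and the real api/pii_detection.py may differ. The context rule used here is deliberately simple: a date counts as a DOB only if a cue word appears earlier on the same line.

```python
import re

# Illustrative patterns only; the project's actual expressions may differ.
PATTERNS = {
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "pan": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)"),
}
DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b")
DOB_CUES = ("dob", "date of birth", "born")  # assumed cue words


def mask_value(value, visible=4):
    """Star out every letter/digit except the last `visible` ones."""
    to_hide = max(sum(ch.isalnum() for ch in value) - visible, 0)
    masked = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            masked.append("*")
            to_hide -= 1
        else:
            masked.append(ch)  # keep separators and the visible tail
    return "".join(masked)


def _finding(kind, value):
    return {"type": kind, "value": value, "masked": mask_value(value)}


def detect_pii(text):
    """Return regex findings plus dates whose own line mentions a DOB cue."""
    findings = [
        _finding(kind, m.group())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    for m in DATE.finditer(text):
        line_start = text.rfind("\n", 0, m.start()) + 1
        before = text[line_start:m.start()].lower()
        if any(cue in before for cue in DOB_CUES):
            findings.append(_finding("dob", m.group()))
    return findings


def detect_names(text, nlp):
    """PERSON entities from a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")."""
    return [_finding("name", ent.text) for ent in nlp(text).ents if ent.label_ == "PERSON"]
```

Given the text `"DOB: 12/05/1990\nPrinted on 03/01/2024"`, `detect_pii` reports only the first date as a DOB, because the print date has no cue word on its line.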
Necessity analysis is performed in api/rules_engine.py:
- Checks which PII is required for the selected service using business rules from the YAML config (config/service_rules.yaml).
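As a rough illustration of the necessity check, the sketch below compares detected PII types against a per-service list of required fields. The schema and the service names are hypothetical; the dictionary stands in for whatever a YAML loader (for example PyYAML's `yaml.safe_load`) would return for config/service_rules.yaml, and `analyze_necessity` is an assumed function name.

```python
# Hypothetical rules, shaped like the parsed contents of a YAML file such as:
#   bank_kyc:
#     required: [name, pan, aadhaar, dob]
SERVICE_RULES = {
    "bank_kyc": {"required": ["name", "pan", "aadhaar", "dob"]},
    "newsletter_signup": {"required": ["email"]},
}


def analyze_necessity(detected_types, service, rules=SERVICE_RULES):
    """Split detected PII types into required vs. unnecessary for a service."""
    # An unknown service has no required fields, so everything is unnecessary
    required = set(rules.get(service, {}).get("required", []))
    detected = set(detected_types)
    return {
        "required": sorted(detected & required),
        "unnecessary": sorted(detected - required),
        "missing": sorted(required - detected),
    }
```

For example, `analyze_necessity(["email", "pan", "phone"], "newsletter_signup")` marks `email` as required and `pan` and `phone` as unnecessary.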
Response includes detected PII (with masked values) and necessity info:
- Returned by the Flask API in api/app.py to the frontend.
User can download a masked version of the document as plain text:
- Handled by the /download-masked endpoint in api/app.py, available in both the web UI and Chrome extension.
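The download step might look roughly like the following. Only the route `/download-masked` and the plain-text output come from this README; the request body shape, the helper `apply_masks`, the factory `create_app`, and the download filename are assumptions made for this sketch.

```python
def apply_masks(text, findings):
    """Replace each detected value in `text` with its masked form."""
    # Longest values first, so a longer value is masked before any shorter
    # value that happens to be contained in it
    for f in sorted(findings, key=lambda f: len(f["value"]), reverse=True):
        text = text.replace(f["value"], f["masked"])
    return text


def create_app():
    # Deferred import so apply_masks can be used without Flask installed
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/download-masked", methods=["POST"])
    def download_masked():
        # Assumed request body:
        # {"text": "...", "findings": [{"value": "...", "masked": "..."}]}
        payload = request.get_json(force=True)
        masked = apply_masks(payload["text"], payload["findings"])
        return Response(
            masked,
            mimetype="text/plain",
            headers={"Content-Disposition": "attachment; filename=masked.txt"},
        )

    return app
```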

How to Run

Create and activate a Python virtual environment:
```
python -m venv venv
venv\Scripts\activate   # On Windows
# or
source venv/bin/activate  # On macOS/Linux
```
Install Python dependencies:
```
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Install Tesseract OCR and add it to your system PATH.
Start the Flask server:
```
cd api
python app.py
```
Open ui/index.html in your browser for the web UI.
(Optional) Load the Chrome extension from chrome-extension/ for browser-based PII detection and masked file download.
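Before starting the server, it can help to confirm that the prerequisites named in the steps above are actually visible to Python. The script below is a hypothetical helper, not part of the repository; it only checks that the packages can be found and that a `tesseract` executable is on PATH, not that they work correctly.

```python
import importlib.util
import shutil


def check_environment():
    """Report which runtime prerequisites from the setup steps are present."""
    modules = ["flask", "pytesseract", "pdfplumber", "spacy", "en_core_web_sm"]
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract shells out to the `tesseract` executable, so it must be on PATH
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```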

Extending the Project

Fine-tune or replace the spaCy NLP model with transformer-based models (e.g., DeBERTa-v3, BERT) specialized for PII detection.
Use synthetic PII data generation and domain adaptation to boost real-world detection accuracy.
Combine pattern matching with machine learning classifiers for hybrid detection.
Extend redaction to mask or remove PII directly in PDFs and images (the masked download is currently plain text).
Add support for more file types (DOCX, etc.).
Add user authentication and audit logging.
Integrate with cloud storage or enterprise document management systems.
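For the hybrid detection idea above, one simple starting point is to merge the spans from a pattern matcher with those from a learned model and let the pattern matcher win where they overlap. This is only one possible design, sketched under assumptions: the finding shape (`type`, `start`, `end`) and the function name `merge_findings` are hypothetical.

```python
def merge_findings(regex_findings, model_findings):
    """Combine two detectors' spans, letting regex win where spans overlap."""
    merged = list(regex_findings)
    for cand in model_findings:
        # Two half-open spans overlap when each starts before the other ends
        overlaps = any(
            cand["start"] < kept["end"] and kept["start"] < cand["end"]
            for kept in regex_findings
        )
        if not overlaps:
            merged.append(cand)
    return sorted(merged, key=lambda f: f["start"])
```

Preferring regex on overlap reflects the assumption that structured identifiers such as PAN or Aadhaar numbers are matched more reliably by patterns than by a general-purpose NER model.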

