Reliable, resumable archival downloader and validator for DOJ Epstein disclosure datasets.
HANDS-FREE DOWNLOAD VERSION AVAILABLE!! [ 03/02/2026 ] NOW WITH FULLY AUTOMATED AUTH-VERIFICATION! auto_ep_rip.py is the new version that automates all the authorization checks that happen during the process (the "robot" and "age" buttons). Select your dataset, start the download, and it will take care of the rest!
Hands-free upgrades:
- Auto-click abuse-deterrent "I am not a robot" button (reauth gate)
- Auto-click age gate YES (#age-button-yes)
- No more "Press ENTER..." pauses for session refresh
- Waits until dataset list is visible, then resumes automatically
- Adds configurable sleeps between auth stages (stability)
- Hardens safe_json_save to avoid .tmp -> .json FileNotFound crash
Patch:
- Prevent infinite loops on bad PDFs / poison: per-file poison cap + immediate skip for clearly bad payloads
- bad_files.log audit trail for skipped/bad-source files
- Retryable network error handling (ETIMEDOUT/ECONNRESET/socket hang up/etc) with backoff
Patch:
- Bad-file messaging: "BAD_SERVER_FILE (PDF endpoint returned non-PDF bytes)"
- Ctrl+C / shutdown session summary stats (downloaded, bad/skips, net errors, etc.)
- Warmup REMOVED: replaced with clean, confidence-forward initialization + settle delay
Patch (poison retry fix):
- After session refresh/re-auth, ACTUALLY retry the same file before moving on (inner per-file loop; refresh triggers a retry of the current filename)
Just in case anyone has issues with the automated version, the last working epstein_ripper.py is still here.
epstein-ripper is a resilient browser-driven crawler and downloader
designed to archive publicly released Epstein document datasets hosted
by the U.S. Department of Justice.
The DOJ interface presents multiple challenges:
- Pagination that repeats or remixes pages
- No reliable "last page" indicator
- Short-lived authorization cookies
- Anti-automation challenges
- Occasional HTML responses served as .pdf files
This tool prioritizes reliability, integrity, and safe resume behavior, while striving to be user friendly. In the pursuit of establishing consistent and accurate indexes of the DOJ's file lists I've found many obstacles. I've done my best to defeat them to accomplish this goal, and to share the result with you.
Please leave a star, watch, or fork to help spread this software to those who may use it. Thank you for reading, cloning, using, etc !
The pursuit of truth, justice, and .pdf punishment is imperative. We're all a tool for change. - Prizm
git clone https://github.com/prizmatik666/epstein-ripper
cd epstein-ripper
pip install -r requirements.txt
playwright install chromium
python epstein_ripper.py

You will be prompted for:
- Dataset selection
- Operating mode (sync / scan / download)
Choose individual datasets or ranges:
1,3,5
1-11
9-11
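Selection strings of that shape can be parsed in a few lines. This is an illustrative sketch only; `parse_selection` is a hypothetical name, not necessarily what the script's prompt handler is called:

```python
def parse_selection(text: str) -> list[int]:
    """Parse a dataset selection like '1,3,5' or '1-11' into sorted numbers."""
    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            chosen.update(range(int(lo), int(hi) + 1))  # inclusive range
        elif part:
            chosen.add(int(part))
    return sorted(chosen)
```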
Pages are scanned until no new PDFs appear for a defined threshold.
Pagination behavior from DOJ is unpredictable --- this system adapts.
Each dataset maintains its own index:
dataX/index_dataX.json
The index tracks:
- Discovered PDFs
- Source page numbers
- Download status
- Retry counts
- Timestamps
This enables:
- Crash-safe resume
- Missing file repair
- Safe re-walk scanning
- Update detection
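The exact index schema isn't spelled out above, but a filename-keyed JSON map is one plausible shape. This hypothetical sketch (field names are assumptions, not the real index_dataX.json layout) shows how a repair pass could read such an index to find files that still need downloading:

```python
# Hypothetical index shape; the real index_dataX.json fields may differ.
example_index = {
    "EFTA00012345.pdf": {
        "page": 17,                       # source page where the PDF was found
        "downloaded": True,
        "retries": 0,
        "first_seen": "2026-02-25T21:58:57",
    },
}

def missing_files(index: dict) -> list[str]:
    """Return filenames the index knows about but hasn't downloaded yet."""
    return [name for name, meta in index.items() if not meta.get("downloaded")]
```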
Files download to:
filename.pdf.part
They are renamed only after validation completes.
This prevents partial or corrupted files from being marked complete.
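The .part pattern amounts to writing bytes to a temporary name and renaming only after validation passes. A minimal stdlib sketch, with an illustrative `save_validated_pdf` name (the ripper's real validation is more involved):

```python
import os

def save_validated_pdf(dest_path: str, payload: bytes) -> bool:
    """Write payload to dest_path only if it looks like a real PDF."""
    if not payload.startswith(b"%PDF"):
        return False                       # nothing is left on disk
    part_path = dest_path + ".part"
    with open(part_path, "wb") as f:
        f.write(payload)
    os.replace(part_path, dest_path)       # atomic on the same filesystem
    return True
```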
If DOJ returns HTML instead of a real PDF:
- File is NOT written
- A visible alert is triggered
- Download pauses
- User re-authenticates
- Fresh context is created
- File is safely retried
Normal HTTP errors do not trigger re-authentication.
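The inner per-file loop described above can be sketched like this, where `fetch` and `reauth` are stand-in callables (the actual flow drives a Playwright browser and a fresh context):

```python
def download_with_reauth(name, fetch, reauth, max_reauths=3):
    """Retry the SAME file after each re-auth instead of skipping it.

    Hypothetical shape of the inner per-file loop; the real ripper also
    applies backoff, poison caps, and logging.
    """
    for attempt in range(max_reauths + 1):
        payload = fetch(name)
        if payload.startswith(b"%PDF"):
            return payload
        if attempt < max_reauths:
            reauth()                      # fresh browser context / cookies
    raise RuntimeError(f"BAD_SERVER_FILE: {name} still non-PDF after re-auth")
```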
Modes:
- sync --- Scan + download missing files (recommended)
- scan --- Scan only, update index
- download --- Download missing files using existing index
Example:
data9/
EFTA00012345.pdf
index_data9.json
resume_data9.txt
download.log
Files:
- PDFs --- Downloaded documents
- index_dataX.json --- Scan index
- resume_dataX.txt --- Last scanned page
- download.log --- Activity log
Do not rename or delete files while the script is running.
The upgraded repair utility:
- Correctly flips downloaded=False → True when files exist
- Correctly flips downloaded=True → False when missing
- Provides structured integrity reporting
Use the updated version.
DOJ pagination can repeat page results far beyond actual dataset depth.
Short "no new page" thresholds are unsafe.
The default stop threshold was increased significantly after real-world testing revealed new PDFs appearing thousands of pages later.
If performing deep archival scans, use a high no-new threshold.
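The stop rule amounts to counting consecutive pages that yield zero new filenames. A simplified sketch (the real scanner also persists resume state and logs progress):

```python
def scan_pages(get_page_filenames, known: set, max_no_new: int = 300) -> set:
    """Walk pages until max_no_new consecutive pages produce nothing new."""
    streak, page = 0, 1
    while streak < max_no_new:
        new = set(get_page_filenames(page)) - known
        if new:
            known |= new
            streak = 0                    # reset the no-new streak
        else:
            streak += 1
        page += 1
    return known
```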
Older versions may have saved HTML as PDFs due to upstream behavior.
If you downloaded datasets before the validation upgrade:
- Run integrity utilities
- Validate file signatures
- Perform a repair pass
Optional but recommended tools are included for dataset validation and analysis.
Real-time corruption detection while downloading.
- Monitors dataset directory
- Validates PDF headers
- Quarantines corrupted files
- Logs quarantine events
- Pauses with visible alert until acknowledged

Note: this was included as a temporary fix utility while a fix was being implemented for the 'html-served-as-.pdf' bug. It no longer needs to run during downloads, since the check now happens inside the ripper before saving to disk.
One-time sweep utility.
- Scans a directory for corrupted files
- Validates %PDF- signature
- Detects HTML markers
- Moves corrupted files to quarantine/
- Prints summary report
- If files are removed from a dataset, run index_repair to flip the downloaded= value back to false in the index
Safe to run multiple times.
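A sweep like this can be approximated in a few lines of stdlib Python. `corruption_sweep` is an illustrative name; the shipped utility reports more detail:

```python
import os
import shutil

def corruption_sweep(dataset_dir: str) -> list[str]:
    """Move .pdf files lacking the %PDF signature into quarantine/."""
    qdir = os.path.join(dataset_dir, "quarantine")
    moved = []
    for name in sorted(os.listdir(dataset_dir)):
        path = os.path.join(dataset_dir, name)
        if not (name.lower().endswith(".pdf") and os.path.isfile(path)):
            continue
        with open(path, "rb") as f:
            head = f.read(8)               # signature lives in the first bytes
        if not head.startswith(b"%PDF"):
            os.makedirs(qdir, exist_ok=True)
            shutil.move(path, os.path.join(qdir, name))
            moved.append(name)
    return moved
```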
Index reconciliation tool.
- Creates .bak backup of index
- Validates disk vs index state
- Repairs mismatches (downloaded=True/False)
- Reports correctness buckets
- Safe to rerun
- After running corruptions_scan on a dataset: if it removes files, run index_repair on that dataset's index to flip the downloaded value back to false
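The reconciliation logic can be sketched as follows, assuming a filename-keyed index (an assumption; the real index_repair.py may track more fields and report more buckets):

```python
import json
import os
import shutil

def repair_index(index_path: str, dataset_dir: str) -> dict:
    """Reconcile downloaded= flags against files actually on disk."""
    shutil.copy2(index_path, index_path + ".bak")   # .bak safety copy first
    with open(index_path) as f:
        index = json.load(f)
    flipped = {"to_true": 0, "to_false": 0}
    for name, meta in index.items():
        on_disk = os.path.isfile(os.path.join(dataset_dir, name))
        if on_disk and not meta.get("downloaded"):
            meta["downloaded"] = True
            flipped["to_true"] += 1
        elif not on_disk and meta.get("downloaded"):
            meta["downloaded"] = False
            flipped["to_false"] += 1
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return flipped
```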
Bulk embedded image extractor (GUI).
Extracts embedded images from large PDF collections.
Features:
- Recursive folder scanning
- Incremental re-run support
- Process tracking via processed_pdfs.txt
- Image mapping log (image_map.txt)
Requirements:
pip install pymupdf pillow
Designed for:
- Large disclosure datasets
- Forensic review
- Visual content isolation
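The incremental re-run support presumably rests on an append-only tracking file; this is a hedged sketch of that processed_pdfs.txt pattern (function names are illustrative, not image_ripper's actual API):

```python
import os

def load_processed(track_file: str = "processed_pdfs.txt") -> set:
    """Read the set of PDFs already handled on a previous run."""
    if not os.path.exists(track_file):
        return set()
    with open(track_file) as f:
        return {line.strip() for line in f if line.strip()}

def mark_processed(pdf_path: str, track_file: str = "processed_pdfs.txt") -> None:
    """Append one finished PDF so a later run can skip it."""
    with open(track_file, "a") as f:
        f.write(pdf_path + "\n")
```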
Sorts a folder of images (like those pulled from PDFs using image_ripper). It moves all-black (redacted) images, images that appear to be all text/documents, and other images with traits that don't look like a real picture, into separate categories/buckets for manual review. I suggest using (M)ove instead of (C)opy, to avoid massive disk ballooning from copying an enormous number of files.
Example: I ran image_ripper on the datasets I have (far from complete, but several hundred thousand PDFs) and ended up with over 400k image files ripped. After running this filter program on my ripped_images, I reviewed what it pulled out and it was very accurate: very few images I needed to save were among the ones it pulled. I haven't gone through everything it left behind, but I did go through the pulls, and out of the almost half million images, over 300k were moved out of the scanned dir.
That makes this, in my opinion, a very valuable tool for cleaning up the images extracted with image_ripper.py.
- Python 3.9+
- Playwright
- aiohttp
- Chromium browser (installed via Playwright)
pip install -r requirements.txt
playwright install chromium
- This is where I include index_data#.json files that I've made by scanning the datasets.
- If you wish to use mine instead of scanning and building your own index, move the .json for the dataset you're working with into its data#/ directory, named exactly index_data#.json (where # = the dataset number).
- If you already have downloads in your data#/ when deciding to try one of my index files, run index_repair.py on it before downloading again. It will set the files you have on disk to downloaded=True in the index, so they're not downloaded again.
- Scans to build full indexes on these massive datasets take a LONG time. I will be uploading them as I get them ready.
I'm working on a scanning utility that uses a SQLite database for the index file instead of .json.
Using sql/db files for the download index in the ripper is not currently supported, but I will probably add that option later.
Mostly I'm experimenting with:
- speed and reliability of the SQL scan vs. the ripper's built-in .json scanner for making the index file
- how the DOJ site behaves as far as serving duplicate file-list pages at higher page numbers in the various datasets
- data9 started having a lot of trouble after page 1000
- the db scanner couldn't break out of the 'same file list' loop that was happening
- the built-in ripper scan had the same problem but would eventually break out of a no-new streak. It had high-value streaks: more than 100-200 no-new-PDF pages in a row before breaking out and returning new filenames. I ran my data9 scan with max no-new at 300.
[2026-02-25 21:58:57] [DS 9] No NEW PDFs on page 7990 (streak=300/300)
[2026-02-25 21:59:00] [DS 9] Stopping scan: no new PDFs for 300 consecutive pages.
[2026-02-25 21:59:14] === DATASET 9 COMPLETE ===
[2026-02-25 21:59:58] ALL DATASETS COMPLETE
- DOJ's pagination makes it hard to know whether your dataset file list is complete, but with 300 as the end count for no-new pages, you can have much higher confidence that you scanned everything.
I will keep trying to find the fastest, most reliable, and above all ACCURATE way of indexing the file names for download. I thought it would be good to include these tools here now to make updates easier, and for others to play around with.
- index_tools contains:
- db_index.py -> the page scanner that builds the file index with a database
- db_to_json.py -> converts a db_index scan file into a ripper-usable .json for downloading
- dupe_check.py -> checks a .json index for duplicate entries
- dupe_index.py -> duplicates a .json index and flips all downloaded= values to false, making it a fresh runnable copy to be shared or run for download
- will be made usable on db files when db functionality is adopted in the main program too
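For anyone experimenting, a db-to-json conversion can be sketched with stdlib sqlite3. Note the files(filename, page, downloaded) schema below is an assumption, not necessarily what db_index.py actually writes:

```python
import json
import sqlite3

def db_to_json(db_path: str, json_path: str) -> int:
    """Convert a SQLite scan index into a ripper-style JSON index.

    Assumes a files(filename, page, downloaded) table, which is a guess
    at the schema, not the documented one.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT filename, page, downloaded FROM files").fetchall()
    con.close()
    index = {
        name: {"page": page, "downloaded": bool(downloaded)}
        for name, page, downloaded in rows
    }
    with open(json_path, "w") as f:
        json.dump(index, f, indent=2)
    return len(index)                      # number of entries converted
```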
This tool accesses publicly available DOJ materials. It does not bypass authentication or security controls. All verification steps require explicit human interaction. Provided for archival, research, and transparency purposes. Use responsibly and in accordance with applicable laws and site terms.