Reliable, resumable archival downloader and validator for DOJ Epstein disclosure datasets.
HANDS-FREE DOWNLOAD VERSION AVAILABLE!! [ 03/02/2026 ] NOW WITH FULLY AUTOMATED AUTH-VERIFICATION! auto_ep_rip.py is the new version that automates all the authorization checks that happen during the process (the "robot" and "age" buttons). Select your dataset, start the download, and it will take care of the rest!
Hands-free upgrades:
- Auto-click abuse-deterrent "I am not a robot" button (reauth gate)
- Auto-click age gate YES (#age-button-yes)
- No more "Press ENTER..." pauses for session refresh
- Waits until dataset list is visible, then resumes automatically
- Adds configurable sleeps between auth stages (stability)
- Hardens safe_json_save to avoid .tmp -> .json FileNotFound crash
Patch:
- Prevent infinite loops on bad PDFs / poison: per-file poison cap + immediate skip for clearly bad payloads
- bad_files.log audit trail for skipped/bad-source files
- Retryable network error handling (ETIMEDOUT/ECONNRESET/socket hang up/etc) with backoff
Patch:
- Bad-file messaging: "BAD_SERVER_FILE (PDF endpoint returned non-PDF bytes)"
- Ctrl+C / shutdown session summary stats (downloaded, bad/skips, net errors, etc.)
- Warmup REMOVED: replaced with clean, confidence-forward initialization + settle delay
Patch (poison retry fix):
- After session refresh/re-auth, ACTUALLY retry the same file before moving on (inner per-file loop; refresh triggers a retry of the current filename)
Just in case anyone has issues with the automated version, the last working epstein_ripper.py is still here.
epstein-ripper is a resilient browser-driven crawler and downloader
designed to archive publicly released Epstein document datasets hosted
by the U.S. Department of Justice.
The DOJ interface presents multiple challenges:
- Pagination that repeats or remixes pages
- No reliable "last page" indicator
- Short-lived authorization cookies
- Anti-automation challenges
- Occasional HTML responses served as .pdf files
This tool prioritizes reliability, integrity, and safe resume behavior, while striving to be user friendly. In the pursuit of establishing consistent and accurate indexes of the DOJ's file lists I've found many obstacles. I've done my best to defeat them to accomplish this goal, and to share the result with you.
Please leave a star, watch, or fork to help spread this software to those who may use it. Thank you for reading, cloning, using, etc !
The pursuit of truth, justice, and .pdf punishment is imperative. We're all a tool for change. - Prizm
git clone https://github.com/prizmatik666/epstein-ripper
cd epstein-ripper
pip install -r requirements.txt
playwright install chromium
python epstein_ripper.py

You will be prompted for:
- Dataset selection
- Operating mode (sync / scan / download)
Choose individual datasets or ranges:
1,3,5
1-11
9-11
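Selection strings of that shape can be parsed in a few lines. This is an illustrative sketch only; `parse_selection` is a hypothetical name, not necessarily what the script's prompt handler is called:

```python
def parse_selection(text: str) -> list[int]:
    """Parse a dataset selection like '1,3,5' or '1-11' into sorted numbers."""
    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            chosen.update(range(int(lo), int(hi) + 1))  # inclusive range
        elif part:
            chosen.add(int(part))
    return sorted(chosen)
```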
Pages are scanned until no new PDFs appear for a defined threshold.
Pagination behavior from DOJ is unpredictable --- this system adapts.
Each dataset maintains its own index:
dataX/index_dataX.json
The index tracks:
- Discovered PDFs
- Source page numbers
- Download status
- Retry counts
- Timestamps
This enables:
- Crash-safe resume
- Missing file repair
- Safe re-walk scanning
- Update detection
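The exact index schema isn't spelled out above, but a filename-keyed JSON map is one plausible shape. This hypothetical sketch (field names are assumptions, not the real index_dataX.json layout) shows how a repair pass could read such an index to find files that still need downloading:

```python
# Hypothetical index shape; the real index_dataX.json fields may differ.
example_index = {
    "EFTA00012345.pdf": {
        "page": 17,                       # source page where the PDF was found
        "downloaded": True,
        "retries": 0,
        "first_seen": "2026-02-25T21:58:57",
    },
}

def missing_files(index: dict) -> list[str]:
    """Return filenames the index knows about but hasn't downloaded yet."""
    return [name for name, meta in index.items() if not meta.get("downloaded")]
```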
Files download to:
filename.pdf.part
They are renamed only after validation completes.
This prevents partial or corrupted files from being marked complete.
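The .part pattern amounts to writing bytes to a temporary name and renaming only after validation passes. A minimal stdlib sketch, with an illustrative `save_validated_pdf` name (the ripper's real validation is more involved):

```python
import os

def save_validated_pdf(dest_path: str, payload: bytes) -> bool:
    """Write payload to dest_path only if it looks like a real PDF."""
    if not payload.startswith(b"%PDF"):
        return False                       # nothing is left on disk
    part_path = dest_path + ".part"
    with open(part_path, "wb") as f:
        f.write(payload)
    os.replace(part_path, dest_path)       # atomic on the same filesystem
    return True
```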
If DOJ returns HTML instead of a real PDF:
- File is NOT written
- A visible alert is triggered
- Download pauses
- User re-authenticates
- Fresh context is created
- File is safely retried
Normal HTTP errors do not trigger re-authentication.
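The inner per-file loop described above can be sketched like this, where `fetch` and `reauth` are stand-in callables (the actual flow drives a Playwright browser and a fresh context):

```python
def download_with_reauth(name, fetch, reauth, max_reauths=3):
    """Retry the SAME file after each re-auth instead of skipping it.

    Hypothetical shape of the inner per-file loop; the real ripper also
    applies backoff, poison caps, and logging.
    """
    for attempt in range(max_reauths + 1):
        payload = fetch(name)
        if payload.startswith(b"%PDF"):
            return payload
        if attempt < max_reauths:
            reauth()                      # fresh browser context / cookies
    raise RuntimeError(f"BAD_SERVER_FILE: {name} still non-PDF after re-auth")
```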
Modes:
- sync --- Scan + download missing files (recommended)
- scan --- Scan only, update index
- download --- Download missing files using existing index
Example:
data9/
EFTA00012345.pdf
index_data9.json
resume_data9.txt
download.log
Files:
- PDFs --- Downloaded documents
- index_dataX.json --- Scan index
- resume_dataX.txt --- Last scanned page
- download.log --- Activity log
Do not rename or delete files while the script is running.
The upgraded repair utility:
- Correctly flips downloaded=False → True when files exist
- Correctly flips downloaded=True → False when missing
- Provides structured integrity reporting
Use the updated version.
DOJ pagination can repeat page results far beyond actual dataset depth.
Short "no new page" thresholds are unsafe.
The default stop threshold was increased significantly after real-world testing revealed new PDFs appearing thousands of pages later.
If performing deep archival scans, use a high no-new threshold.
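The stop rule amounts to counting consecutive pages that yield zero new filenames. A simplified sketch (the real scanner also persists resume state and logs progress):

```python
def scan_pages(get_page_filenames, known: set, max_no_new: int = 300) -> set:
    """Walk pages until max_no_new consecutive pages produce nothing new."""
    streak, page = 0, 1
    while streak < max_no_new:
        new = set(get_page_filenames(page)) - known
        if new:
            known |= new
            streak = 0                    # reset the no-new streak
        else:
            streak += 1
        page += 1
    return known
```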
Older versions may have saved HTML as PDFs due to upstream behavior.
If you downloaded datasets before the validation upgrade:
- Run integrity utilities
- Validate file signatures
- Perform a repair pass
Optional but recommended tools are included for dataset validation and analysis.
Real-time corruption detection while downloading.
- Monitors dataset directory
- Validates PDF headers
- Quarantines corrupted files
- Logs quarantine events
- Pauses with visible alert until acknowledged

Note: this was included as a temporary fix utility while a fix was being implemented for the 'html-served-as-.pdf' bug. It no longer needs to run during downloads, since the check now happens inside the ripper before saving to disk.
One-time sweep utility.
- Scans a directory for corrupted files
- Validates %PDF- signature
- Detects HTML markers
- Moves corrupted files to quarantine/
- Prints summary report
- If files are removed from a dataset, run index_repair to flip the downloaded= value back to false in the index
Safe to run multiple times.
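A sweep like this can be approximated in a few lines of stdlib Python. `corruption_sweep` is an illustrative name; the shipped utility reports more detail:

```python
import os
import shutil

def corruption_sweep(dataset_dir: str) -> list[str]:
    """Move .pdf files lacking the %PDF signature into quarantine/."""
    qdir = os.path.join(dataset_dir, "quarantine")
    moved = []
    for name in sorted(os.listdir(dataset_dir)):
        path = os.path.join(dataset_dir, name)
        if not (name.lower().endswith(".pdf") and os.path.isfile(path)):
            continue
        with open(path, "rb") as f:
            head = f.read(8)               # signature lives in the first bytes
        if not head.startswith(b"%PDF"):
            os.makedirs(qdir, exist_ok=True)
            shutil.move(path, os.path.join(qdir, name))
            moved.append(name)
    return moved
```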
Index reconciliation tool.
- Creates .bak backup of index
- Validates disk vs index state
- Repairs mismatches (downloaded=True/False)
- Reports correctness buckets
- Safe to rerun
- After running corruptions_scan on a dataset: if it removes files, run index_repair on that dataset's index to flip the downloaded value back to false
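The reconciliation logic can be sketched as follows, assuming a filename-keyed index (an assumption; the real index_repair.py may track more fields and report more buckets):

```python
import json
import os
import shutil

def repair_index(index_path: str, dataset_dir: str) -> dict:
    """Reconcile downloaded= flags against files actually on disk."""
    shutil.copy2(index_path, index_path + ".bak")   # .bak safety copy first
    with open(index_path) as f:
        index = json.load(f)
    flipped = {"to_true": 0, "to_false": 0}
    for name, meta in index.items():
        on_disk = os.path.isfile(os.path.join(dataset_dir, name))
        if on_disk and not meta.get("downloaded"):
            meta["downloaded"] = True
            flipped["to_true"] += 1
        elif not on_disk and meta.get("downloaded"):
            meta["downloaded"] = False
            flipped["to_false"] += 1
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return flipped
```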
Bulk embedded image extractor (GUI).
Extracts embedded images from large PDF collections.
Features:
- Recursive folder scanning
- Incremental re-run support
- Process tracking via processed_pdfs.txt
- Image mapping log (image_map.txt)
Requirements:
pip install pymupdf pillow
Designed for:
- Large disclosure datasets
- Forensic review
- Visual content isolation
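The incremental re-run support presumably rests on an append-only tracking file; this is a hedged sketch of that processed_pdfs.txt pattern (function names are illustrative, not image_ripper's actual API):

```python
import os

def load_processed(track_file: str = "processed_pdfs.txt") -> set:
    """Read the set of PDFs already handled on a previous run."""
    if not os.path.exists(track_file):
        return set()
    with open(track_file) as f:
        return {line.strip() for line in f if line.strip()}

def mark_processed(pdf_path: str, track_file: str = "processed_pdfs.txt") -> None:
    """Append one finished PDF so a later run can skip it."""
    with open(track_file, "a") as f:
        f.write(pdf_path + "\n")
```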
Sorts a folder of images (like those pulled from PDFs using image_ripper). It moves all-black (redacted) images, images that appear to be all text/documents, and other images with traits that don't look like a real picture, into separate categories/buckets for manual review. I suggest using (M)ove instead of (C)opy, to avoid massive disk ballooning from copying an enormous number of files.
Example: I ran image_ripper on the datasets I have (far from complete, but several hundred thousand PDFs) and ended up with over 400k image files ripped. After running this filter program on my ripped_images, I reviewed what it pulled out and it was very accurate: very few images I needed to save were among the ones it pulled. I haven't gone through everything it left behind, but I did go through the pulls, and out of the almost half million images, over 300k were moved out of the scanned dir.
That makes this, in my opinion, a very valuable tool for cleaning up the images extracted with image_ripper.py.
- Python 3.9+
- Playwright
- aiohttp
- Chromium browser (installed via Playwright)
pip install -r requirements.txt
playwright install chromium
- This is where I include index_data#.json files that I've made by scanning the datasets.
- If you wish to use mine instead of scanning and building your own index, move the .json for the dataset you're working with into its data#/ directory, named exactly index_data#.json (where # = the dataset number).
- If you already have downloads in your data#/ when deciding to try one of my index files, run index_repair.py on it before downloading again. It will set the files you have on disk to downloaded=True in the index, so they're not downloaded again.
- Scans to build full indexes on these massive datasets take a LONG time. I will be uploading them as I get them ready.
I'm working on a scanning utility that uses a SQLite database for the index file instead of .json.
Using sql/db files for the download index in the ripper is not currently supported, but I will probably add that option later.
Mostly I'm experimenting with:
- speed and reliability of the SQL scan vs. the ripper's built-in .json scanner for making the index file
- how the DOJ site behaves as far as serving duplicate file-list pages at higher page numbers in the various datasets
- data9 started having a lot of trouble after page 1000
- the db scanner couldn't break out of the 'same file list' loop that was happening
- the built-in ripper scan had the same problem but would eventually break out of a no-new streak. It had high-value streaks: more than 100-200 no-new-PDF pages in a row before breaking out and returning new filenames. I ran my data9 scan with max no-new at 300.
[2026-02-25 21:58:57] [DS 9] No NEW PDFs on page 7990 (streak=300/300)
[2026-02-25 21:59:00] [DS 9] Stopping scan: no new PDFs for 300 consecutive pages.
[2026-02-25 21:59:14] === DATASET 9 COMPLETE ===
[2026-02-25 21:59:58] ALL DATASETS COMPLETE
- DOJ's pagination makes it hard to know whether your dataset file list is complete, but with 300 as the end count for no-new pages, you can have much higher confidence that you scanned everything.
I will keep trying to find the fastest, most reliable, and above all ACCURATE way of indexing the file names for download. I thought it would be good to include these tools here now to make updates easier, and for others to play around with.
- index_tools contains:
- db_index.py -> the page scanner that builds the file index with a database
- db_to_json.py -> converts a db_index scan file into a ripper-usable .json for downloading
- dupe_check.py -> checks a .json index for duplicate entries
- dupe_index.py -> duplicates a .json index and flips all downloaded= values to false, making it a fresh runnable copy to be shared or run for download
- will be made usable on db files when db functionality is adopted in the main program too
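For anyone experimenting, a db-to-json conversion can be sketched with stdlib sqlite3. Note the files(filename, page, downloaded) schema below is an assumption, not necessarily what db_index.py actually writes:

```python
import json
import sqlite3

def db_to_json(db_path: str, json_path: str) -> int:
    """Convert a SQLite scan index into a ripper-style JSON index.

    Assumes a files(filename, page, downloaded) table, which is a guess
    at the schema, not the documented one.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT filename, page, downloaded FROM files").fetchall()
    con.close()
    index = {
        name: {"page": page, "downloaded": bool(downloaded)}
        for name, page, downloaded in rows
    }
    with open(json_path, "w") as f:
        json.dump(index, f, indent=2)
    return len(index)                      # number of entries converted
```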
This tool accesses publicly available DOJ materials. It does not bypass authentication or security controls. All verification steps require explicit human interaction. Provided for archival, research, and transparency purposes. Use responsibly and in accordance with applicable laws and site terms.