
EPSTEIN-RIPPER

Reliable, resumable archival downloader and validator for DOJ Epstein disclosure datasets.


HANDS-FREE DOWNLOAD VERSION AVAILABLE!! [ 03/02/2026 ] NOW WITH FULLY AUTOMATED AUTH-VERIFICATION! auto_ep_rip.py is the new version that automates all of the authorization checks that happen during the process ("robot" and "age" buttons). Select your dataset, start the download, and it will take care of the rest!

Hands-free upgrades:

  • Auto-click abuse-deterrent "I am not a robot" button (reauth gate)
  • Auto-click age gate YES (#age-button-yes)
  • No more "Press ENTER..." pauses for session refresh
  • Waits until dataset list is visible, then resumes automatically
  • Adds configurable sleeps between auth stages (stability)
  • Hardens safe_json_save to avoid .tmp -> .json FileNotFound crash
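
The safe_json_save hardening above can be sketched like this (a minimal illustration; the function name comes from the patch notes, but the body is an assumption about the approach):

```python
import json
import os
import tempfile

def safe_json_save(path, data):
    """Write JSON atomically: write to a temp file in the same
    directory as the target, then rename over it. Creating the temp
    file next to the destination avoids the .tmp -> .json rename
    failing (the FileNotFound crash mentioned above)."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)  # atomic on the same filesystem
    finally:
        if os.path.exists(tmp_path):  # only on failure before rename
            os.remove(tmp_path)
```

The key design point is that `os.replace` is atomic when source and destination live on the same filesystem, so a crash mid-write can never leave a half-written index.json behind.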

Patch:

  • Prevent infinite loops on bad PDFs / poison: per-file poison cap + immediate skip for clearly bad payloads
  • bad_files.log audit trail for skipped/bad-source files
  • Retryable network error handling (ETIMEDOUT/ECONNRESET/socket hang up/etc) with backoff
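
The retry-with-backoff behavior above can be sketched as follows (illustrative only; the error-substring list is taken from the patch note, the helper name is mine):

```python
import time

# Substrings that mark transient network failures worth retrying
# (from the patch notes above; extend as needed).
RETRYABLE = ("ETIMEDOUT", "ECONNRESET", "socket hang up")

def with_backoff(func, max_tries=4, base_delay=1.0):
    """Call func(); on a retryable network error, sleep with
    exponential backoff and try again. Non-retryable errors and
    the final failure propagate to the caller."""
    for attempt in range(max_tries):
        try:
            return func()
        except Exception as exc:
            transient = any(tok in str(exc) for tok in RETRYABLE)
            if not transient or attempt == max_tries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```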

Patch:

  • Bad-file messaging: "BAD_SERVER_FILE (PDF endpoint returned non-PDF bytes)"
  • Ctrl+C / shutdown session summary stats (downloaded, bad/skips, net errors, etc.)
  • Warmup REMOVED: replaced with clean, confidence-forward initialization + settle delay

Patch (poison retry fix):

  • After session refresh/re-auth, ACTUALLY retry the same file before moving on (inner per-file loop; refresh triggers a retry of the current filename)

Just in case anyone has issues with the automated version, the last working epstein_ripper.py is still here.

Overview

epstein-ripper is a resilient browser-driven crawler and downloader designed to archive publicly released Epstein document datasets hosted by the U.S. Department of Justice.

The DOJ interface presents multiple challenges:

  • Pagination that repeats or remixes pages
  • No reliable "last page" indicator
  • Short-lived authorization cookies
  • Anti-automation challenges
  • Occasional HTML responses served as .pdf files

This tool prioritizes reliability, integrity, and safe resume behavior, while striving to be user friendly. In the pursuit of establishing consistent and accurate indexes of the DOJ's file lists I've run into many obstacles. I've done my best to defeat them, accomplish this goal, and share the result with you.

Please leave a star, watch, or fork to help spread this software to those who may use it. Thank you for reading, cloning, using, etc.!

The pursuit of truth, justice, and .pdf punishment is imperative. We're all a tool for change. - Prizm


Quick Start

git clone https://github.com/prizmatik666/epstein-ripper
cd epstein-ripper
pip install -r requirements.txt
playwright install chromium
python epstein_ripper.py

You will be prompted for:

  • Dataset selection
  • Operating mode (sync / scan / download)

Core Features

Dataset Selection

Choose individual datasets or ranges:

1,3,5
1-11
9-11
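
The selection syntax above can be expanded with a small parser like this (an illustrative sketch; the actual prompt handling in the script may differ):

```python
def parse_selection(spec):
    """Expand a dataset selection string like '1,3,5' or '1-11'
    into a sorted list of dataset numbers."""
    chosen = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            chosen.update(range(int(lo), int(hi) + 1))
        elif part:
            chosen.add(int(part))
    return sorted(chosen)
```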

Dynamic Page Detection

Pages are scanned until no new PDFs appear for a defined threshold.
Pagination behavior from DOJ is unpredictable --- this system adapts.
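
The stop condition can be sketched as a streak counter (a minimal illustration; `fetch_page_pdfs` is a hypothetical callable standing in for the real page scraper, and the script's actual loop also persists a resume file):

```python
def scan_pages(fetch_page_pdfs, max_no_new=300):
    """Walk pages until no page has produced a new PDF for
    max_no_new consecutive pages. fetch_page_pdfs(page) returns
    the filenames listed on that page."""
    seen = set()
    streak = 0
    page = 1
    while streak < max_no_new:
        new = set(fetch_page_pdfs(page)) - seen
        if new:
            seen.update(new)
            streak = 0  # any new filename resets the streak
        else:
            streak += 1
        page += 1
    return seen
```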

Persistent Scan Index

Each dataset maintains its own index:

dataX/index_dataX.json

The index tracks:

  • Discovered PDFs
  • Source page numbers
  • Download status
  • Retry counts
  • Timestamps
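
A single entry in the index might look like this (field names are illustrative, inferred from the list above; the real schema may differ):

```json
{
  "EFTA00012345.pdf": {
    "page": 42,
    "downloaded": false,
    "retries": 0,
    "first_seen": "2026-02-25T21:58:57"
  }
}
```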

This enables:

  • Crash-safe resume
  • Missing file repair
  • Safe re-walk scanning
  • Update detection

Crash-Safe Downloads

Files download to:

filename.pdf.part

They are renamed only after validation completes.
This prevents partial or corrupted files from being marked complete.
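
The promote-after-validation step can be sketched like this (an assumption about the mechanics; the function name is mine):

```python
import os

def finalize_download(part_path, final_path):
    """Promote a completed .part file to its final name only after
    the payload validates as a real PDF. Bad payloads are removed
    so they are never mistaken for finished downloads."""
    with open(part_path, "rb") as f:
        head = f.read(5)
    if head != b"%PDF-":  # every real PDF starts with this signature
        os.remove(part_path)
        return False
    os.replace(part_path, final_path)
    return True
```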

Session Protection

If DOJ returns HTML instead of a real PDF:

  • File is NOT written
  • A visible alert is triggered
  • Download pauses
  • User re-authenticates
  • Fresh context is created
  • File is safely retried

Normal HTTP errors do not trigger re-authentication.
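
The decision logic above might look roughly like this (a sketch; the function and return-value names are mine, not from the script):

```python
def classify_response(status, body):
    """Decide how to treat a response from a PDF endpoint.
    Returns 'ok', 'reauth' (HTML served as .pdf means the session
    expired), 'http_error' (plain HTTP failure, no re-auth), or
    'bad_file' (non-PDF, non-HTML bytes: log and skip)."""
    if status != 200:
        return "http_error"
    if body.startswith(b"%PDF-"):
        return "ok"
    lowered = body[:512].lower()
    if b"<html" in lowered or b"<!doctype" in lowered:
        return "reauth"
    return "bad_file"
```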


Operating Modes

  • sync --- Scan + download missing files (recommended)
  • scan --- Scan only, update index
  • download --- Download missing files using existing index


Output Structure

Example:

data9/
    EFTA00012345.pdf
    index_data9.json

resume_data9.txt
download.log

Files:

  • PDFs --- Downloaded documents
  • index_dataX.json --- Scan index
  • resume_dataX.txt --- Last scanned page
  • download.log --- Activity log

Do not rename or delete files while the script is running.


Data Integrity Notes

Updated index_repair.py (2/26/2026)

The upgraded repair utility:

  • Correctly flips downloaded=False -> True when files exist
  • Correctly flips downloaded=True -> False when missing
  • Provides structured integrity reporting

Use the updated version.

Pagination Warning (2/25/2026)

DOJ pagination can repeat page results far beyond actual dataset depth.
Short "no new page" thresholds are unsafe.

The default stop threshold was increased significantly after real-world testing revealed new PDFs appearing thousands of pages later.

If performing deep archival scans, use a high no-new threshold.

Validation Required for Older Downloads

Older versions may have saved HTML as PDFs due to upstream behavior.

If you downloaded datasets before the validation upgrade:

  • Run integrity utilities
  • Validate file signatures
  • Perform a repair pass

Main Utilities

Optional but recommended tools are included for dataset validation and analysis.

active_watcher.py

Real-time corruption detection while downloading.

  • Monitors dataset directory

  • Validates PDF headers

  • Quarantines corrupted files

  • Logs quarantine events

  • Pauses with visible alert until acknowledged

  • Note: this was included as a temporary fix utility while a fix was being implemented for the 'html-served-as-.pdf' bug. It no longer needs to run during downloads, since the check now happens inside the ripper before saving to disk.


corruption_scan.py

One-time sweep utility.

  • Scans a directory for corrupted files
  • Validates %PDF- signature
  • Detects HTML markers
  • Moves corrupted files to quarantine/
  • Prints summary report
  • If files are removed from a dataset, run index_repair to flip the downloaded= value back to False in the index

Safe to run multiple times.
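
The sweep described above can be sketched as follows (an approximation of corruption_scan.py's behavior based on the feature list; the real utility also prints a summary report):

```python
import os
import shutil

def corruption_sweep(dataset_dir, quarantine_dir="quarantine"):
    """One-time sweep: move any .pdf whose bytes are not a real PDF
    (missing %PDF- signature, or HTML markers near the top) into
    quarantine_dir. Returns the list of quarantined filenames."""
    qdir = os.path.join(dataset_dir, quarantine_dir)
    moved = []
    for name in sorted(os.listdir(dataset_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        path = os.path.join(dataset_dir, name)
        with open(path, "rb") as f:
            head = f.read(512)
        if head.startswith(b"%PDF-") and b"<html" not in head.lower():
            continue  # looks like a genuine PDF
        os.makedirs(qdir, exist_ok=True)
        shutil.move(path, os.path.join(qdir, name))
        moved.append(name)
    return moved
```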


index_repair.py

Index reconciliation tool.

  • Creates .bak backup of index
  • Validates disk vs index state
  • Repairs mismatches (downloaded=True/False)
  • Reports correctness buckets
  • Safe to rerun
  • After running corruption_scan on a dataset, if it removes files, run index_repair on that dataset's index to flip the downloaded values back to False
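
The reconciliation step can be sketched like this (a minimal illustration assuming a `{filename: {"downloaded": bool}}` schema; the real utility also reports correctness buckets):

```python
import json
import os
import shutil

def repair_index(index_path, dataset_dir):
    """Reconcile an index against the files actually on disk:
    flip downloaded=True for entries whose PDF exists, False for
    entries whose PDF is missing. A .bak copy is written first.
    Returns the number of flipped entries."""
    shutil.copy2(index_path, index_path + ".bak")
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    flipped = 0
    for name, entry in index.items():
        on_disk = os.path.exists(os.path.join(dataset_dir, name))
        if entry.get("downloaded") != on_disk:
            entry["downloaded"] = on_disk
            flipped += 1
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)
    return flipped
```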

image_ripper.py

Bulk embedded image extractor (GUI).

Extracts embedded images from large PDF collections.

Features:

  • Recursive folder scanning
  • Incremental re-run support
  • Process tracking via processed_pdfs.txt
  • Image mapping log (image_map.txt)

Requirements:

pip install pymupdf pillow

Designed for:

  • Large disclosure datasets
  • Forensic review
  • Visual content isolation
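
The core extraction loop looks roughly like this (a sketch of the approach, using the standard PyMuPDF calls; the real script adds recursive folder scanning and progress tracking via processed_pdfs.txt). With real files you would pass a document opened via `fitz.open(path)`:

```python
def extract_pdf_images(doc, save_bytes):
    """Pull every embedded image out of an open PDF document and
    hand the raw bytes to save_bytes(name, ext, data). `doc` is a
    PyMuPDF (fitz) Document. Returns the number of images saved."""
    count = 0
    for page in doc:
        # get_images(full=True) lists (xref, ...) tuples per page
        for img in page.get_images(full=True):
            xref = img[0]
            # extract_image returns {"image": bytes, "ext": "png", ...}
            info = doc.extract_image(xref)
            save_bytes(f"img{count:05d}", info["ext"], info["image"])
            count += 1
    return count
```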

filter_ripped_images.py


Sorts a folder of images (such as those pulled from PDFs with image_ripper.py). It moves all-black (redacted) images, images that appear to be all text/documents, and other images with traits that don't look like a real picture into separate category buckets for manual review. I suggest using (M)ove instead of (C)opy, to avoid the massive disk-usage ballooning that copying an enormous number of files would cause.

Example: I ran image_ripper on the datasets I have (far from complete, but several hundred thousand PDFs) and ended up with over 400k ripped image files. After running this filter on my ripped_images, I reviewed what it pulled out and it was very accurate: there were not many images I needed to save from the ones it flagged. I haven't gone through all of what was left behind, but I did go through the pulls, and out of almost half a million images, over 300k were moved out of the scanned directory.

In my opinion, that makes this a very valuable tool for cleaning up the images extracted with image_ripper.py.

Requirements for ripper

  • Python 3.9+
  • Playwright
  • aiohttp
  • Chromium browser (installed via Playwright)
pip install -r requirements.txt
playwright install chromium

index_files/


  • This is where I include index_data#.json files that I've built by scanning the datasets.
  • If you wish to use mine instead of scanning and building your own index, move the .json for the dataset you're working with into its data#/ directory, named exactly index_data#.json (where # = dataset number).
  • If you already have downloads in your data#/ when trying one of my index files, run index_repair.py on it before downloading again. It will set the files you have on disk to downloaded=True in the index, so they're not downloaded again.
  • Scans to build full indexes of these massive datasets take a LONG time. I will be uploading them as I get them ready.

index_tools/


I'm experimenting with a scanning utility that uses a SQLite database for the index file instead of .json.

Using SQL/db files for the download index in the ripper is not currently supported, but I will probably add that option later.

Mostly I'm experimenting with:

  • Speed and reliability of the SQL scan vs. the built-in ripper -> .json scanner for building the index file
  • How the DOJ site behaves as far as serving duplicate file-list pages at higher page numbers in the various datasets
  • data9 started having a lot of trouble after page 1000
  • The db scanner couldn't break out of the 'same file list' loop that was happening
  • The built-in ripper scan had the same problem but would eventually break out of a no-new streak. It hit high-value streaks: more than 100-200 no-new-PDFs in a row before breaking out and returning new filenames. I ran my data9 scan with max no-new @ 300.
[2026-02-25 21:58:57] [DS 9] No NEW PDFs on page 7990 (streak=300/300)
[2026-02-25 21:59:00] [DS 9] Stopping scan: no new PDFs for 300 consecutive pages.
[2026-02-25 21:59:14] === DATASET 9 COMPLETE ===
[2026-02-25 21:59:58] ALL DATASETS COMPLETE 
  • DOJ's pagination makes it hard to know whether your dataset file list is complete, but with 300 as the no-new end count you can have much higher confidence that you scanned everything.

I will keep trying to find the fastest, most reliable, and above all ACCURATE way of indexing the filenames for download. I thought it would be good to include these tools here now, to make updates easier and to let others play around with them.

index_tools contains:

  • db_index.py -> the page scanner that builds the file index with a database
  • db_to_json.py -> converts a db_index scan file into a ripper-usable .json for downloading
  • dupe_check.py -> checks a .json index for duplicate entries
  • dupe_index.py -> duplicates a .json index and flips all downloaded= values to False, making it a fresh runnable copy to be shared or re-run for download (will be made usable on db files when db functionality is adopted in the main program too)
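
The flip that dupe_index.py performs can be sketched like this (illustrative only, assuming the same `{filename: {"downloaded": bool}}` schema as above):

```python
import json

def make_fresh_copy(index_path, out_path):
    """Duplicate a .json index and reset every downloaded flag to
    False, producing a clean copy that can be shared or re-run as
    a fresh download list. Returns the number of entries."""
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    for entry in index.values():
        entry["downloaded"] = False
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)
    return len(index)
```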

Disclaimer

This tool accesses publicly available DOJ materials. It does not bypass authentication or security controls. All verification steps require explicit human interaction. Provided for archival, research, and transparency purposes. Use responsibly and in accordance with applicable laws and site terms.
