Welcome to epstein-ripper Discussions! #1
Replies: 4 comments 3 replies
Thanks for checking out my repo! I developed these programs because of how user-unfriendly the DOJ made the Epstein dataset dumps for everyday people to download. They technically followed the law by releasing the data, but they did it in such a way that it's hard to download or find anything; the waters are muddy, and people will be combing through the PDFs for years trying to find information.

When I first started downloading the datasets, the DOJ offered .zip files for the early sets. Those downloaded fine for the smaller ones, but the bigger sets' .zip files would stall mid-download with no way to resume: just a giant pain in the ass. That led me to find a way to scrape the PDFs directly, which meant reproducing the conditions for the DOJ site's security checks, cookies, sessions, etc. In the beginning, the way the site worked, I had to click "not a robot" every 1-2 minutes. As you can guess, that was an exhausting, terrible experience.

Somewhere around the release of datasets 9-11, something changed. I only had to verify at the start of a new context, and the scraper would run almost indefinitely. "WOW," I thought, "this is great!" Unfortunately, I came to find that the session silently expired partway through a download, while files were still being marked as downloaded. In reality, it was saving the age-verification main page as the PDF document! That left me with over 80k corrupted files. As soon as I realized this, I updated GitHub with a warning and put up some quick-fix utilities (active_watcher and corruption_scan) so users could keep downloading and have a way to catch the error. These new utilities aren't perfect, but they are necessary and a step in the right direction. corruption_scan in particular is great for checking files you've already downloaded: it detects bad docs among the downloaded PDFs and moves them to quarantine.

Again, the way the DOJ has released these files and granted access to them is atrocious.
Although it's a technically legal way of releasing them, the way they did it just sucks. There should be a better way to download these things without having to know how to code, run Python scripts, or use the DOJ's search tool to dig through millions of documents online. But here we are. I hope these tools are a step in the right direction and helpful to people who are also interested in storing and analyzing these files locally. Use this section to ask questions, leave comments, notes, etc. Thank you :)

Support This Project
This project is developed and maintained independently by Prizm. If this tool has been useful to you, consider supporting future development.
PayPal Donation Link:
PayPal Email:
Original Repository:
Thank you for supporting independent open-source tools.
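The corruption_scan idea described above, detecting age-verification HTML pages that were saved with a .pdf extension, can be sketched in a few lines. This is a minimal illustration, not the actual corruption_scan code; the function name and directory layout are my own assumptions. The core trick is real, though: a genuine PDF file begins with the bytes `%PDF`, while a saved HTML page begins with something like `<!DOCTYPE` or `<html`.

```python
import shutil
from pathlib import Path

# Hypothetical sketch: quarantine any .pdf whose header is not the
# PDF magic bytes b"%PDF" (e.g. a saved age-verification HTML page).
def quarantine_fake_pdfs(download_dir: str, quarantine_dir: str) -> list[str]:
    src = Path(download_dir)
    dst = Path(quarantine_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(src.glob("*.pdf")):  # snapshot + stable order
        with open(pdf, "rb") as f:
            header = f.read(5)
        if not header.startswith(b"%PDF"):
            shutil.move(str(pdf), dst / pdf.name)  # not a real PDF: quarantine it
            moved.append(pdf.name)
    return moved
```

A header check like this is cheap enough to run over tens of thousands of files, which matters at the 80k-corrupted-file scale described above.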
Yesterday I realized the 6-page streak threshold for finding the end of new PDFs in the indexer was inadequate. The way the DOJ serves the pages and their contents is chaotic and ridiculous: while scanning, I got many streaks of over 100 or 200 pages with no new PDFs, then BOOM, 50 new PDFs. Here's the readout from the ripper's indexer/scan; it took me at least 12 hours to get through:

[2026-02-25 21:58:57] [DS 9] No NEW PDFs on page 7990 (streak=300/300)

And here's a readout of index_repair.py on my dataset 9 index .json. It checks the index against the actual downloaded files and makes sure the .json doesn't have any files marked as downloaded that aren't. It also has the side effect of reporting how many files are downloaded and how many are left, which will give you an idea of the total files in dataset 9:

=============================================================
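The index-vs-disk reconciliation that index_repair.py performs can be sketched roughly as below. This is an illustration of the idea only, not the actual script; I'm assuming a simple index shape of filename -> {"downloaded": bool}, and the function name and return fields are made up for the example.

```python
import json
from pathlib import Path

# Hypothetical sketch: unmark index entries that claim "downloaded"
# but have no matching file on disk, then report the counts.
def repair_index(index_path: str, pdf_dir: str) -> dict:
    index = json.loads(Path(index_path).read_text())
    on_disk = {p.name for p in Path(pdf_dir).glob("*.pdf")}
    repaired = 0
    for name, entry in index.items():
        if entry.get("downloaded") and name not in on_disk:
            entry["downloaded"] = False  # marked done, but missing on disk
            repaired += 1
    Path(index_path).write_text(json.dumps(index, indent=2))
    done = sum(1 for e in index.values() if e.get("downloaded"))
    return {"total": len(index), "downloaded": done,
            "remaining": len(index) - done, "repaired": repaired}
```

The useful side effect mentioned above falls out naturally: `total` minus `downloaded` is exactly the remaining-to-download count for the dataset.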
I'm now working on a SQLite DB index scanner, with multiple modes for rewalking the DB, repairing suspected errors, etc.
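One way a SQLite-backed index with a "repair suspected errors" mode could look is sketched below. The schema, column names, and the small-file heuristic are all my assumptions, since the scanner isn't published yet; the point is just that a status column makes rewalk/repair modes simple UPDATE queries.

```python
import sqlite3

# Hypothetical sketch: one row per PDF with a status column
# (pending / downloaded / suspect). Schema is an assumption.
def init_db(path: str) -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS pdfs (
        name   TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending',
        size   INTEGER)""")
    return con

def mark_suspect_small_files(con: sqlite3.Connection, min_size: int) -> int:
    # Files far smaller than a plausible PDF are likely saved error pages;
    # flag them as 'suspect' so a repair pass can re-download them.
    cur = con.execute(
        "UPDATE pdfs SET status='suspect' "
        "WHERE status='downloaded' AND size < ?", (min_size,))
    con.commit()
    return cur.rowcount
```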
Added a new tool in the main directory. It scans a directory (like ripped_images, if you're using image_ripper to extract the images from the Epstein PDFs), analyzes the image files for traits, then scores and categorizes them.
I suggest using Move instead of Copy for the mode. If you run Copy on a sizable directory (by data size), you can balloon your disk usage, which can be a problem depending on the free storage you have available. This came about from the experience and trials of learning to maneuver and interact with the large amount of data I've amassed: I was running out of storage space and needed a quick way to filter out the useless images from the ~60 GB I had. While using this to deal with my own problem, I realized it was a must to include for others as well. :) -prizm->
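The Move-vs-Copy trade-off above can be made concrete with a toy version of the filter. This is not the actual tool; the function name, the image-trait heuristic (file size as a stand-in for the real scoring), and the threshold are all illustrative assumptions. What it does show is why Move keeps disk usage flat while Copy doubles the footprint of everything it touches.

```python
import shutil
from pathlib import Path

# Hypothetical sketch: score images by a cheap trait (file size here)
# and MOVE low scorers to a reject directory, so total disk usage
# stays constant instead of ballooning as it would with copy.
def filter_images(src_dir: str, reject_dir: str, min_bytes: int = 10_000) -> int:
    src, dst = Path(src_dir), Path(reject_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    for img in sorted(src.iterdir()):  # snapshot before moving anything
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"} and \
                img.stat().st_size < min_bytes:
            shutil.move(str(img), dst / img.name)  # Move, not Copy
            moved += 1
    return moved
```

On a ~60 GB directory like the one described above, running the same pass in Copy mode would need up to another 60 GB free, which is exactly the ballooning the post warns about.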
👋 Welcome to Epstein Ripper Discussions
This space is for users, researchers, developers, and contributors
who are working with or interested in the Epstein DOJ Dataset tools.
We use Discussions for:
This project focuses on:
Please keep discussions:
If you're new, feel free to introduce yourself:
This repository is for tooling and archival access purposes.
Discussions should remain centered on technical usage,
improvements, and research workflows.
Thanks for contributing to independent open-source research tools.