Welcome to epstein-ripper Discussions! #1
Replies: 4 comments 3 replies
Thanks for checking out my repo! I developed these programs because of how user-unfriendly the DOJ made the Epstein dataset dumps for everyday people to download. They technically followed the law by releasing the data, but they did it in such a way that it's hard to download or find anything; the waters are muddy, and people will be combing through the PDFs for years trying to find information.

When I first started downloading the datasets, the DOJ offered .zip files for the early sets. Those downloaded fine for the smaller ones, but the bigger sets' .zip files would stall mid-download with no way to resume: just a giant pain in the ass. That led me to find a way to scrape the PDFs directly, which meant reproducing the conditions for the DOJ site's security checks, cookies, sessions, etc. In the beginning, the way the site worked, I had to click "not a robot" every 1-2 minutes. As you can guess, that was an exhausting, terrible experience.

Somewhere around the release of datasets 9-11, something changed. I only had to verify at the start of a new context, and the scraper would run almost indefinitely. "WOW," I thought, "this is great!" Unfortunately, I came to find that the session silently expired partway through a download, while files were still being marked as downloaded. In reality, it was saving the age-verification main page as the PDF document! That left me with over 80k corrupted files. As soon as I realized this, I updated GitHub with a warning and put up some quick-fix utilities (active_watcher and corruption_scan) so users could keep downloading and have a way to catch the error. These new utilities aren't perfect, but they are necessary and a step in the right direction. corruption_scan in particular is great for checking files you've already downloaded: it detects bad docs among the downloaded PDFs and moves them to quarantine.

Again, the way the DOJ has released these files and granted access to them is atrocious.
Although it's a technically legal way of releasing them, the way they did it just sucks. There should be a better way to download these things without having to know how to code, run Python scripts, or use the DOJ's search tool to dig through millions of documents online. But here we are. I hope these tools are a step in the right direction and helpful to people who are also interested in storing and analyzing these files locally. Use this section to ask questions, leave comments, notes, etc. Thank you :)

Support This Project
This project is developed and maintained independently by Prizm. If this tool has been useful to you, consider supporting future development.
PayPal Donation Link:
PayPal Email:
Original Repository:
Thank you for supporting independent open-source tools.
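The corruption_scan idea described above, detecting age-verification HTML pages that were saved with a .pdf extension, can be sketched in a few lines. This is a minimal illustration, not the actual corruption_scan code; the function name and directory layout are my own assumptions. The core trick is real, though: a genuine PDF file begins with the bytes `%PDF`, while a saved HTML page begins with something like `<!DOCTYPE` or `<html`.

```python
import shutil
from pathlib import Path

# Hypothetical sketch: quarantine any .pdf whose header is not the
# PDF magic bytes b"%PDF" (e.g. a saved age-verification HTML page).
def quarantine_fake_pdfs(download_dir: str, quarantine_dir: str) -> list[str]:
    src = Path(download_dir)
    dst = Path(quarantine_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(src.glob("*.pdf")):  # snapshot + stable order
        with open(pdf, "rb") as f:
            header = f.read(5)
        if not header.startswith(b"%PDF"):
            shutil.move(str(pdf), dst / pdf.name)  # not a real PDF: quarantine it
            moved.append(pdf.name)
    return moved
```

A header check like this is cheap enough to run over tens of thousands of files, which matters at the 80k-corrupted-file scale described above.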
Yesterday I realized the 6-page streak threshold for finding the end of new PDFs in the indexer was inadequate. The way the DOJ serves the pages and their contents is chaotic and ridiculous: while scanning, I got many streaks of over 100 or 200 pages with no new PDFs, then BOOM, 50 new PDFs. Here's the readout from the ripper's indexer/scan; it took me at least 12 hours to get through:

[2026-02-25 21:58:57] [DS 9] No NEW PDFs on page 7990 (streak=300/300)

And here's a readout of index_repair.py on my dataset 9 index .json. It checks the index against the actual downloaded files and makes sure the .json doesn't have any files marked as downloaded that aren't. It also has the side effect of reporting how many files are downloaded and how many are left, which will give you an idea of the total files in dataset 9:

=============================================================
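The index-vs-disk reconciliation that index_repair.py performs can be sketched roughly as below. This is an illustration of the idea only, not the actual script; I'm assuming a simple index shape of filename -> {"downloaded": bool}, and the function name and return fields are made up for the example.

```python
import json
from pathlib import Path

# Hypothetical sketch: unmark index entries that claim "downloaded"
# but have no matching file on disk, then report the counts.
def repair_index(index_path: str, pdf_dir: str) -> dict:
    index = json.loads(Path(index_path).read_text())
    on_disk = {p.name for p in Path(pdf_dir).glob("*.pdf")}
    repaired = 0
    for name, entry in index.items():
        if entry.get("downloaded") and name not in on_disk:
            entry["downloaded"] = False  # marked done, but missing on disk
            repaired += 1
    Path(index_path).write_text(json.dumps(index, indent=2))
    done = sum(1 for e in index.values() if e.get("downloaded"))
    return {"total": len(index), "downloaded": done,
            "remaining": len(index) - done, "repaired": repaired}
```

The useful side effect mentioned above falls out naturally: `total` minus `downloaded` is exactly the remaining-to-download count for the dataset.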
I'm now working on a SQLite DB index scanner, with multiple modes for rewalking the DB, repairing suspected errors, etc.
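One way a SQLite-backed index with a "repair suspected errors" mode could look is sketched below. The schema, column names, and the small-file heuristic are all my assumptions, since the scanner isn't published yet; the point is just that a status column makes rewalk/repair modes simple UPDATE queries.

```python
import sqlite3

# Hypothetical sketch: one row per PDF with a status column
# (pending / downloaded / suspect). Schema is an assumption.
def init_db(path: str) -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.execute("""CREATE TABLE IF NOT EXISTS pdfs (
        name   TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending',
        size   INTEGER)""")
    return con

def mark_suspect_small_files(con: sqlite3.Connection, min_size: int) -> int:
    # Files far smaller than a plausible PDF are likely saved error pages;
    # flag them as 'suspect' so a repair pass can re-download them.
    cur = con.execute(
        "UPDATE pdfs SET status='suspect' "
        "WHERE status='downloaded' AND size < ?", (min_size,))
    con.commit()
    return cur.rowcount
```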
Added a new tool in the main directory. It scans a directory (like ripped_images, if you're using image_ripper to extract the images from the Epstein PDFs), analyzes the image files for traits, then scores and categorizes them.
I suggest using Move instead of Copy for the mode. If you run Copy on a sizable directory (by data size), you can balloon your disk usage, which can be a problem depending on the free storage you have available. This came about from the experience and trials of learning to maneuver and interact with the large amount of data I've amassed: I was running out of storage space and needed a quick way to filter out the useless images from the ~60 GB I had. While using this to deal with my own problem, I realized it was a must to include for others as well. :) -prizm->
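The Move-vs-Copy trade-off above can be made concrete with a toy version of the filter. This is not the actual tool; the function name, the image-trait heuristic (file size as a stand-in for the real scoring), and the threshold are all illustrative assumptions. What it does show is why Move keeps disk usage flat while Copy doubles the footprint of everything it touches.

```python
import shutil
from pathlib import Path

# Hypothetical sketch: score images by a cheap trait (file size here)
# and MOVE low scorers to a reject directory, so total disk usage
# stays constant instead of ballooning as it would with copy.
def filter_images(src_dir: str, reject_dir: str, min_bytes: int = 10_000) -> int:
    src, dst = Path(src_dir), Path(reject_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    for img in sorted(src.iterdir()):  # snapshot before moving anything
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"} and \
                img.stat().st_size < min_bytes:
            shutil.move(str(img), dst / img.name)  # Move, not Copy
            moved += 1
    return moved
```

On a ~60 GB directory like the one described above, running the same pass in Copy mode would need up to another 60 GB free, which is exactly the ballooning the post warns about.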
👋 Welcome to Epstein Ripper Discussions
This space is for users, researchers, developers, and contributors
who are working with or interested in the Epstein DOJ Dataset tools.
We use Discussions for:
This project focuses on:
Please keep discussions:
If you're new, feel free to introduce yourself:
This repository is for tooling and archival access purposes.
Discussions should remain centered on technical usage,
improvements, and research workflows.
Thanks for contributing to independent open-source research tools.