Scripts that download and then extract transcripts from the Podcasts app on macOS.
Commands used in the above demo:
python3 fetchTranscripts.py 1828967504 -o "ttmls/The BugBash Podcast"
python3 extractTranscript.py -i "ttmls/The BugBash Podcast" -o "transcripts/The BugBash Podcast"
Note: 1828967504 is the show's collection_id, stored in the ZMTPODCAST table (column ZSTORECOLLECTIONID). See the schema.
- Clone the repository
- (Optional) Ensure you have Python 3 installed.
Note: You need to download the desired podcast episode(s) before you can extract the transcript. Usually this means opening the Podcasts app, clicking the episode you want to download, clicking the More button, and selecting View Transcript from the popup menu.
To process all TTML files in your Apple Podcasts cache:
python3 extractTranscript.py [--timestamps]
This will:
- Find all TTML files in ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML
- Create a ./transcripts directory
- Save each transcript as ./transcripts/<podcast_name> - <episode_title>.txt
Note: the podcast name and episode title are read from the SQLite database at ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite.
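The discovery step can be sketched in a few lines of Python. This is only an illustration of walking the cache directory, not the actual code in extractTranscript.py; the function name `find_ttml_files` and the `transcript_*.ttml*` glob pattern are my assumptions, chosen because cached file names sometimes carry an extra suffix after `.ttml`.

```python
from pathlib import Path
from typing import List

# Cache location quoted in this README (macOS only).
TTML_CACHE = (
    Path.home()
    / "Library/Group Containers/243LU875E5.groups.com.apple.podcasts"
    / "Library/Cache/Assets/TTML"
)


def find_ttml_files(root: Path = TTML_CACHE) -> List[Path]:
    """Return every cached transcript file below root, sorted by path.

    Hypothetical helper: the glob pattern is an assumption, not taken
    from extractTranscript.py.
    """
    if not root.is_dir():
        # Nothing downloaded yet, or not running on macOS.
        return []
    return sorted(p for p in root.rglob("transcript_*.ttml*") if p.is_file())


if __name__ == "__main__":
    for path in find_ttml_files():
        print(path)
```

If the list comes back empty, the most likely cause is that no episode transcript has been opened in the Podcasts app yet.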
Sample output is:
$ ls ./transcripts
Secret Leaders - Hermann Hauser- The Man Who Saved Apple from Bankruptcy.txt
Secret Leaders - How to Earn Millions by charging $0.txt
Secret Leaders - I built a $2bn Company By Paying Everybody The Same As Me - Nicola Kilner.txt
Secret Leaders - Spencer Matthews- How pushing my body to the limit changed my life.txt
The BugBash Podcast - Ergonomics, reliability, durability.txt
The BugBash Podcast - Every map is wrong, but we made one anyway.txt
The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt
The Intelligence from The Economist - Against the clock- Gaza peace talks.txt
The Intelligence from The Economist - All the president’s money men- the Trumponomics team.txt
The Intelligence from The Economist - Billions of voices heard- a year of elections.txt
Add --timestamps to include timestamps for each paragraph in the format [HH:MM:SS].
For example:
[00:01:23] This is what the speaker said
[00:01:25] And then they said this
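For reference, producing the [HH:MM:SS] prefix is simple arithmetic. The sketch below assumes the paragraph start time is already available as a number of seconds; how the scripts actually parse the TTML timing attributes is not shown here, and `format_timestamp` is a name I made up for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as [HH:MM:SS], dropping fractions."""
    total = max(0, int(seconds))  # clamp negatives, truncate fractions
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


if __name__ == "__main__":
    # 83 seconds is 1 minute 23 seconds.
    print(format_timestamp(83), "This is what the speaker said")
```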
node extractTranscript.js <input_file> <output_file> [--timestamps]
For fetchTranscripts.py to work, you need to create a config.json file; you can copy config.json.example and fill in your information. The steps are taken from this post. Once you obtain the "x-request-timestamp" and "X-Apple-ActionSignature" values, run the get_bearer_token() function in fetchTranscripts.py to get the bearer token, which can be reused for 30 days. After that, fetchTranscripts.py should work properly (see the demo above).
If you need help downloading raw TTML files or the converted transcripts for any show, I can help for a small fee (contact me at akin-00.dowered AT icloud.com).
The input file is the transcript_<long_episode_id>.ttml file found under the ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent<short_episode_id> directory.
I don't know how these IDs are generated by the Podcasts app.
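Even without knowing how the IDs are generated, they can be pulled out of a cache path with a regular expression. This is a rough sketch based only on the path layout described above; `parse_episode_ids` is a hypothetical helper and is not part of the scripts in this repository.

```python
import re
from typing import Optional, Tuple

# PodcastContent<short_episode_id>/.../transcript_<long_episode_id>.ttml
_ID_PATTERN = re.compile(
    r"PodcastContent(?P<short_id>\d+)/.*?transcript_(?P<long_id>\d+)\.ttml"
)


def parse_episode_ids(path: str) -> Optional[Tuple[str, str]]:
    """Return (short_episode_id, long_episode_id), or None if no match."""
    match = _ID_PATTERN.search(path)
    if match is None:
        return None
    return match.group("short_id"), match.group("long_id")


if __name__ == "__main__":
    sample = (
        "PodcastContent112/v4/aa/35/30/"
        "aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml"
    )
    print(parse_episode_ids(sample))
```

The IDs are kept as strings, since nothing in the cache layout suggests they need to be treated as numbers.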
Generating SQLite schemas for the Apple Podcasts cache:
sqlite3 /Users/<username>/Library/Group\ Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite .schema > schema.sql
sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE limit 5;
PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml
PodcastContent122/v4/f4/1a/aa/f41aaa81-24f0-b259-9870-9b1e48e676f6/transcript_1000427522064.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION,
...> p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY
...> FROM ZMTEPISODE e
...> JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
...> WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml";
Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History
The schema.sql file contains the SQLite schemas for the Apple Podcasts cache.
Key Tables for Metadata:
ZMTEPISODE (episodes):
- ZTITLE - Episode title
- ZITEMDESCRIPTION - Episode description
- ZPUBDATE - Publication date
- ZDURATION - Episode duration
- ZPODCASTUUID - Links to podcast
- ZTRANSCRIPTIDENTIFIER - Likely matches your TTML file ID
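A note on ZPUBDATE: the raw values in the query output (for example 553291535) are not Unix timestamps. My assumption, which this README does not confirm, is that they are Core Data timestamps counted in seconds from 2001-01-01 00:00:00 UTC, which is the usual convention for Z-prefixed Core Data tables. Under that assumption the conversion looks like this (`core_data_to_datetime` is an illustrative name):

```python
from datetime import datetime, timedelta, timezone

# Assumed reference date for Core Data timestamps.
CORE_DATA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def core_data_to_datetime(value: float) -> datetime:
    """Convert seconds since 2001-01-01 UTC to an aware datetime."""
    return CORE_DATA_EPOCH + timedelta(seconds=value)


if __name__ == "__main__":
    # ZPUBDATE value taken from the sample query output in this README.
    print(core_data_to_datetime(553291535).isoformat())
```

With this conversion the sample value lands in July 2018, which is plausible for that episode, but treat the interpretation as unverified.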
ZMTPODCAST (podcasts/shows):
- ZTITLE - Podcast title
- ZAUTHOR - Podcast author
- ZCATEGORY - Podcast category
- ZUUID - Podcast UUID (matches ZPODCASTUUID in episodes)
ZMTCHANNEL (channels):
- ZNAME - Channel name
- Linked via ZCHANNEL field in ZMTPODCAST
For the file "~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml", the ZMTEPISODE.ZTRANSCRIPTIDENTIFIER is PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml, that is, the path below the TTML directory without the trailing -1000740205141.ttml suffix. We can extract the metadata as follows:
sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE where ZTRANSCRIPTIDENTIFIER like "%02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml%";
PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml
sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION, p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml";
Transitional injustice: Syria one year after Assad|786884509|1486.0|The Intelligence from The Economist|The Economist|Daily News
If you want podcast show information instead:
sqlite> SELECT
ZSTORECOLLECTIONID as collection_id,
ZTITLE as podcast_title,
ZUUID as podcast_uuid
FROM ZMTPODCAST
WHERE ZSTORECOLLECTIONID IS NOT NULL;
80867514|Entrepreneurial Thought Leaders (ETL)|3F7F4164-D8FF-41AC-BB36-BD6CB1A6E007
1184022695|The Jordan B. Peterson Podcast|6DEFC4FD-6A86-48DD-8E4B-9DFA048EF8B3
...
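The same lookups can be done from Python with the standard sqlite3 module. The sketch below mirrors the sqlite3 sessions above: it derives the ZTRANSCRIPTIDENTIFIER from a cache file path and then runs the episode/podcast join. The helper names (`transcript_identifier`, `open_library`, `lookup_metadata`) are made up for illustration and this is not necessarily how extractTranscript.py does it.

```python
import sqlite3
from pathlib import Path
from typing import Optional

# Database location quoted in this README (macOS only).
DB_PATH = (
    Path.home()
    / "Library/Group Containers/243LU875E5.groups.com.apple.podcasts"
    / "Documents/MTLibrary.sqlite"
)

QUERY = """
SELECT e.ZTITLE, e.ZPUBDATE, e.ZDURATION, p.ZTITLE, p.ZAUTHOR, p.ZCATEGORY
FROM ZMTEPISODE e
JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
WHERE e.ZTRANSCRIPTIDENTIFIER = ?
"""

KEYS = ("episode_title", "pub_date", "duration",
        "podcast_title", "author", "category")


def transcript_identifier(ttml_path: str) -> str:
    """Keep the part from 'PodcastContent' through the first '.ttml'.

    Raises ValueError if the path does not have that shape.
    """
    start = ttml_path.index("PodcastContent")
    end = ttml_path.index(".ttml", start) + len(".ttml")
    return ttml_path[start:end]


def open_library(db_path: Path = DB_PATH) -> sqlite3.Connection:
    """Open the Podcasts database read-only so the app's data is untouched."""
    return sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)


def lookup_metadata(conn: sqlite3.Connection, identifier: str) -> Optional[dict]:
    """Return episode and podcast metadata for one transcript, or None."""
    row = conn.execute(QUERY, (identifier,)).fetchone()
    return dict(zip(KEYS, row)) if row is not None else None
```

Opening the database read-only is a deliberate choice here, since the Podcasts app may have the file open at the same time.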