Apple Podcasts Transcript Extractor

Scripts that download and then extract transcripts from the Podcasts app on macOS.

Commands used in the above demo:

python3 fetchTranscripts.py 1828967504 -o "ttmls/The BugBash Podcast"

python3 extractTranscript.py -i "ttmls/The BugBash Podcast" -o "transcripts/The BugBash Podcast"

Note: 1828967504 is the show's collection_id, i.e. the ZSTORECOLLECTIONID column of the ZMTPODCAST table (see "Find a show's ID" and schema.sql).

Installation

Clone the repository
(Optional) Ensure you have Python 3 installed.

Usage

Note: You need to download the desired podcast episode(s) before you can extract their transcripts. Usually this means opening the Podcasts app, clicking the episode you want, clicking the More button, and selecting View Transcript from the popup menu.

Batch Mode

To process all TTML files in your Apple Podcasts cache:

python3 extractTranscript.py [--timestamps]

This will:

Find all TTML files in ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML
Create a ./transcripts directory
Save each transcript as ./transcripts/<podcast_name> - <episode_title>.txt

Note: the podcast name and episode title are read from the SQLite database at ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite.
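The discovery step can be sketched in a few lines of Python. This is a minimal illustration rather than the code in extractTranscript.py: the function name find_ttml_files is hypothetical, and matching on "*.ttml*" is an assumption based on the cached file names shown later in this README (some end in ".ttml-<id>.ttml").

```python
from pathlib import Path
from typing import List

# Default cache location quoted in this README.
TTML_ROOT = (
    Path.home()
    / "Library/Group Containers/243LU875E5.groups.com.apple.podcasts"
    / "Library/Cache/Assets/TTML"
)


def find_ttml_files(root: Path = TTML_ROOT) -> List[Path]:
    """Return every file under root whose name contains '.ttml', sorted.

    Hypothetical helper: the glob pattern is an assumption, not the
    pattern used by extractTranscript.py.
    """
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.ttml*") if p.is_file())
```

Calling find_ttml_files() with no argument scans the default cache directory; passing another directory (for example "ttmls/The BugBash Podcast") scans that instead.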

Sample output is:

$ ls ./transcripts
Secret Leaders - Hermann Hauser- The Man Who Saved Apple from Bankruptcy.txt
Secret Leaders - How to Earn Millions by charging $0.txt
Secret Leaders - I built a $2bn Company By Paying Everybody The Same As Me - Nicola Kilner.txt
Secret Leaders - Spencer Matthews- How pushing my body to the limit changed my life.txt
The BugBash Podcast - Ergonomics, reliability, durability.txt
The BugBash Podcast - Every map is wrong, but we made one anyway.txt
The BugBash Podcast - FoundationDB- From Idea to Apple Acquisition.txt
The Intelligence from The Economist - Against the clock- Gaza peace talks.txt
The Intelligence from The Economist - All the president’s money men- the Trumponomics team.txt
The Intelligence from The Economist - Billions of voices heard- a year of elections.txt
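The listing above suggests that output names combine the podcast title and episode title with " - ", and that characters such as ":" are replaced with "-". A rough sketch of such a naming rule follows; the function name and the exact set of replaced characters are guesses from the sample output, not taken from the scripts.

```python
def transcript_filename(podcast_title: str, episode_title: str) -> str:
    """Build '<podcast> - <episode>.txt', replacing ':' and '/' with '-'.

    Assumption: the replaced characters are inferred from the sample
    listing; the real scripts may sanitize names differently.
    """
    name = f"{podcast_title} - {episode_title}"
    for ch in (":", "/"):
        name = name.replace(ch, "-")
    return name + ".txt"
```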

Timestamps Option

Add --timestamps to include timestamps for each paragraph in the format [HH:MM:SS].

For example:

[00:01:23] This is what the speaker said
[00:01:25] And then they said this
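For reference, a paragraph's start time in seconds can be turned into this [HH:MM:SS] form as sketched below. The helper name format_timestamp is hypothetical, and it assumes the start time is already available as a number of seconds.

```python
def format_timestamp(seconds: float) -> str:
    """Format a non-negative offset in seconds as [HH:MM:SS], dropping fractions."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"
```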

Single File Mode

node extractTranscript.js <input_file> <output_file> [--timestamps]
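To illustrate what a single-file conversion involves, here is a minimal Python sketch that pulls paragraph text out of a TTML document. It is not the logic of extractTranscript.js or extractTranscript.py: it assumes each paragraph is a <p> element in the standard TTML namespace with a "begin" attribute, and that joining the nested text pieces with single spaces gives readable output (punctuation stored in its own span would end up separated by a space).

```python
import xml.etree.ElementTree as ET

# Standard TTML namespace; assumed to be the one used by the cached files.
TTML_NS = "{http://www.w3.org/ns/ttml}"


def extract_paragraphs(ttml_text: str) -> list:
    """Return a list of (begin, text) tuples, one per non-empty <p> element."""
    root = ET.fromstring(ttml_text)
    paragraphs = []
    for p in root.iter(f"{TTML_NS}p"):
        # Collect text from the paragraph and all nested spans.
        pieces = [piece.strip() for piece in p.itertext() if piece.strip()]
        if pieces:
            paragraphs.append((p.get("begin"), " ".join(pieces)))
    return paragraphs
```

Writing the second element of each tuple on its own line yields a plain transcript; the first element is the raw "begin" value, which would still need parsing before it can be shown as a timestamp.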

Set up for fetchTranscripts.py script

For this script to work, you need to create a config.json file; copy config.json.example and fill in your information. The steps are taken from this post. Once you have obtained the "x-request-timestamp" and "X-Apple-ActionSignature" values, run the get_bearer_token() function in fetchTranscripts.py to get the bearer token, which can be reused for 30 days. After that, fetchTranscripts.py should work properly (see the demo above).
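Because the token is reusable for 30 days, it can be worth recording when it was obtained and checking its age before requesting a new one. The sketch below is purely illustrative: token_is_fresh is a hypothetical helper, and where the timestamp is stored is left open since the layout of config.json is not described here.

```python
from datetime import datetime, timedelta, timezone

# Reuse window stated in this README.
TOKEN_LIFETIME = timedelta(days=30)


def token_is_fresh(obtained_at: datetime, now: datetime = None) -> bool:
    """Return True while a token obtained at obtained_at is under 30 days old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - obtained_at < TOKEN_LIFETIME
```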

If you need help downloading raw TTML files or the converted transcripts for any show, I can help for a small fee (contact me at akin-00.dowered AT icloud.com).

Where does the input file come from?

The input file comes from the transcript_<long_episode_id>.ttml file in the ~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent<short_episode_id> directory.

How do I get the episode IDs?

I don't know how these IDs are generated by the Podcasts app.

SQLite Schemas

Generating SQLite schemas for the Apple Podcasts cache:

sqlite3 /Users/<username>/Library/Group\ Containers/243LU875E5.groups.com.apple.podcasts/Documents/MTLibrary.sqlite .schema > schema.sql

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE limit 5;
PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml
PodcastContent122/v4/f4/1a/aa/f41aaa81-24f0-b259-9870-9b1e48e676f6/transcript_1000427522064.ttml

sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION,
   ...>          p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY
   ...>   FROM ZMTEPISODE e
   ...>   JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
   ...>   WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent112/v4/aa/35/30/aa35304c-cf37-97f4-fe9f-08cc3aee2c86/transcript_1000415837465.ttml";
Supernova in the East I|553291535|16097.0|Dan Carlin's Hardcore History|Dan Carlin|History
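The same lookup can be done from Python with the standard sqlite3 module. This is a sketch using the query above with a bound parameter; the function name lookup_episode and the dictionary keys are illustrative choices, not names from podcast_db.py.

```python
import sqlite3

EPISODE_QUERY = """
SELECT e.ZTITLE, e.ZPUBDATE, e.ZDURATION, p.ZTITLE, p.ZAUTHOR, p.ZCATEGORY
FROM ZMTEPISODE e
JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID
WHERE e.ZTRANSCRIPTIDENTIFIER = ?
"""


def lookup_episode(conn: sqlite3.Connection, transcript_identifier: str):
    """Return episode/podcast metadata as a dict, or None if nothing matches."""
    row = conn.execute(EPISODE_QUERY, (transcript_identifier,)).fetchone()
    if row is None:
        return None
    keys = ("episode_title", "pubdate", "duration",
            "podcast_title", "author", "category")
    return dict(zip(keys, row))
```

When pointing this at the real MTLibrary.sqlite, opening the database read-only is a sensible precaution, for example with sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True) where db_path is a pathlib.Path.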

The schema.sql file contains the SQLite schemas for the Apple Podcasts cache.

Key Tables for Metadata:

ZMTEPISODE (episodes):

ZTITLE - Episode title
ZITEMDESCRIPTION - Episode description
ZPUBDATE - Publication date
ZDURATION - Episode duration
ZPODCASTUUID - Links to podcast
ZTRANSCRIPTIDENTIFIER - Likely matches your TTML file ID

ZMTPODCAST (podcasts/shows):

ZTITLE - Podcast title
ZAUTHOR - Podcast author
ZCATEGORY - Podcast category
ZUUID - Podcast UUID (matches ZPODCASTUUID in episodes)

ZMTCHANNEL (channels):

ZNAME - Channel name
Linked via ZCHANNEL field in ZMTPODCAST
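The ZPUBDATE values in the sample rows (e.g. 553291535) are too small to be Unix timestamps. They appear to be seconds counted from 2001-01-01 UTC, the reference date commonly used by Core Data stores. That interpretation is an assumption, not something stated in the schema; under it, a value can be converted as sketched below (pubdate_to_datetime is a hypothetical helper).

```python
from datetime import datetime, timedelta, timezone

# Assumed reference date: 2001-01-01 00:00:00 UTC.
CORE_DATA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def pubdate_to_datetime(zpubdate: float) -> datetime:
    """Convert a ZPUBDATE value to an aware UTC datetime (assumed 2001 epoch)."""
    return CORE_DATA_EPOCH + timedelta(seconds=zpubdate)
```

With this assumption, 553291535 maps to 2018-07-14 and 786884509 maps to 2025-12-08.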

For the file "~/Library/Group Containers/243LU875E5.groups.com.apple.podcasts/Library/Cache/Assets/TTML/PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml-1000740205141.ttml", the ZMTEPISODE.ZTRANSCRIPTIDENTIFIER is the path relative to the TTML directory without the trailing "-1000740205141.ttml" suffix, i.e. PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml. We can extract the metadata as follows:

sqlite> select ZTRANSCRIPTIDENTIFIER from ZMTEPISODE where ZTRANSCRIPTIDENTIFIER like "%02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml%";
PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml

sqlite> SELECT e.ZTITLE as episode_title, e.ZPUBDATE, e.ZDURATION, p.ZTITLE as podcast_title, p.ZAUTHOR, p.ZCATEGORY FROM ZMTEPISODE e JOIN ZMTPODCAST p ON e.ZPODCASTUUID = p.ZUUID WHERE e.ZTRANSCRIPTIDENTIFIER = "PodcastContent221/v4/02/af/94/02af94ae-1dd8-3aea-44c4-24c467bdddfd/transcript_1000740205141.ttml";

Transitional injustice: Syria one year after Assad|786884509|1486.0|The Intelligence from The Economist|The Economist|Daily News
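Deriving the identifier from a cached file path can be sketched with a regular expression. The pattern is inferred from the two examples in this README (a "PodcastContent<digits>/" prefix and a "transcript_<digits>.ttml" file name), so treat it as an assumption; transcript_identifier is a hypothetical helper name.

```python
import re
from typing import Optional

# Inferred from the example paths in this README; not an official format.
_IDENTIFIER = re.compile(r"PodcastContent\d+/.*?transcript_\d+\.ttml")


def transcript_identifier(path: str) -> Optional[str]:
    """Return the ZTRANSCRIPTIDENTIFIER-style substring of path, or None."""
    match = _IDENTIFIER.search(path)
    return match.group(0) if match else None
```

The returned string can then be used as the value in the WHERE e.ZTRANSCRIPTIDENTIFIER = ... clause of the query above.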

Find a show's ID

To list podcast show information (including each show's collection_id) instead:

sqlite> SELECT
      ZSTORECOLLECTIONID as collection_id,
      ZTITLE as podcast_title,
      ZUUID as podcast_uuid
  FROM ZMTPODCAST
  WHERE ZSTORECOLLECTIONID IS NOT NULL;

80867514|Entrepreneurial Thought Leaders (ETL)|3F7F4164-D8FF-41AC-BB36-BD6CB1A6E007
1184022695|The Jordan B. Peterson Podcast|6DEFC4FD-6A86-48DD-8E4B-9DFA048EF8B3
...
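To look up a single show rather than listing all of them, the query can be narrowed by title. A small sketch with Python's sqlite3 module follows; find_shows is a hypothetical helper, and it relies on SQLite's LIKE operator, which is case-insensitive for ASCII characters by default.

```python
import sqlite3


def find_shows(conn: sqlite3.Connection, title_fragment: str) -> list:
    """Return (collection_id, title, uuid) rows whose title contains the fragment."""
    cursor = conn.execute(
        "SELECT ZSTORECOLLECTIONID, ZTITLE, ZUUID FROM ZMTPODCAST "
        "WHERE ZSTORECOLLECTIONID IS NOT NULL AND ZTITLE LIKE ? "
        "ORDER BY ZTITLE",
        (f"%{title_fragment}%",),
    )
    return cursor.fetchall()
```

The first element of each row is the value to pass to fetchTranscripts.py as the show ID.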

jzhou77/apple-podcast-transcript-extractor
