The final segments are just the result of heuristically splitting the transcription (a very long string) into segments (short strings):

```python
result.clamp_max().merge_all_segments().split_by_punctuation([('.', ' '), '。', '?', '？'])
```

Sometimes the model will produce no punctuation with the default settings. If that occurs, you can try this: openai/whisper#194 (comment)
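As a rough illustration of that heuristic (a plain-Python sketch, not the stable-ts internals), splitting a long transcript string into short segments on sentence-ending punctuation could look like this:

```python
import re

def split_by_punctuation(text, punctuation=(".", "。", "?", "？")):
    """Heuristically split a long transcript into short sentence-like
    segments, keeping each punctuation mark attached to its segment."""
    # Split after any of the given marks, consuming trailing whitespace.
    pattern = "(?<=[" + re.escape("".join(punctuation)) + "])\\s*"
    return [seg.strip() for seg in re.split(pattern, text) if seg.strip()]

segments = split_by_punctuation("Hello there. How are you? Fine.")
# → ["Hello there.", "How are you?", "Fine."]
```

stable-ts applies this kind of splitting to word-level timestamps rather than a raw string, so the segment boundaries carry start/end times as well.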
Hi,
I tried to create a Python script to do the following:
The file format is described here: https://huggingface.co/datasets/flexthink/ljspeech
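For context, a minimal sketch of writing such a metadata file, assuming the pipe-delimited `id|transcription` convention used by LJSpeech-style datasets (the linked dataset card is the authority; the clip IDs and text here are illustrative):

```python
import csv

# Illustrative rows: (clip id, transcription) pairs for each extracted segment.
rows = [
    ("clip_0001", "Hello there."),
    ("clip_0002", "How are you?"),
]

# LJSpeech-style metadata uses "|" as the field separator.
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for clip_id, text in rows:
        writer.writerow([clip_id, text])
```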
I tested different approaches, but none was a perfect solution.
Using pydub to detect silences fails to capture complete sentences, while using the stable-ts start and end values does not cut the audio at perfectly silent points.
Is it possible to add a transcribe parameter that lets stable-ts detect a full sentence as a segment and store the start and end values as clean audio timestamps (i.e., detect silence and cut before the sentence starts and after it ends)?
Optional: Detect different speakers and add them to the metadata.csv
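The requested behavior could be approximated outside the library as well. Below is a hedged, pure-Python sketch: `snap_to_silence` is a hypothetical helper (not part of stable-ts or pydub) that widens a segment's timestamps outward until each boundary lands on a frame whose amplitude is below a silence threshold:

```python
def snap_to_silence(start, end, envelope, frame_s=0.02, threshold=0.01):
    """Widen [start, end] outward to the nearest silent frame on each side,
    so the cut lands in silence rather than mid-word.

    envelope: per-frame amplitude values, frame_s seconds per frame.
    Hypothetical helper sketching the requested behavior.
    """
    n = len(envelope)
    i = max(0, min(int(start / frame_s), n - 1))
    j = max(0, min(int(end / frame_s), n - 1))
    while i > 0 and envelope[i] >= threshold:      # walk left into silence
        i -= 1
    while j < n - 1 and envelope[j] >= threshold:  # walk right into silence
        j += 1
    return i * frame_s, j * frame_s

# Synthetic envelope: 0.2 s silence, 0.4 s speech, 0.2 s silence.
env = [0.0] * 10 + [0.5] * 20 + [0.0] * 10
new_start, new_end = snap_to_silence(0.25, 0.55, env)
```

In practice the envelope could come from pydub or raw PCM samples; the threshold and frame size would need tuning per recording.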
All the best,
Flo