Replies: 2 comments 5 replies
All good! It took me a while to get there, but I finally got a successful data extract of the corpus. I needed to make a .py file in a new /openwebtext folder inside the /fcc-gpt-course folder and then type out the code as you displayed (customising folder_path to wherever you downloaded and extracted the corpus). Pasting the code in case anyone else doesn't want to type it out — I've filled in the loop bodies from the video, and the output file names are just what I used, so adjust as needed:

```python
import os
import lzma
from tqdm import tqdm

def xz_files_in_dir(directory):
    # List the .xz corpus shards in the given directory
    return [f for f in os.listdir(directory)
            if f.endswith(".xz") and os.path.isfile(os.path.join(directory, f))]

folder_path = "input your openwebtext extracted corpus file directory here"
output_file_train = "output_train.txt"  # adjust output file names as needed
output_file_val = "output_val.txt"
vocab_file = "vocab.txt"

files = xz_files_in_dir(folder_path)
total_files = len(files)

# Calculate the split indices
split_index = int(total_files * 0.9)  # 90% for training
files_train = files[:split_index]
files_val = files[split_index:]

# Process the files for training and validation separately
vocab = set()

# Process the training files
with open(output_file_train, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_train, total=len(files_train)):
        with lzma.open(os.path.join(folder_path, filename), "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            vocab.update(set(text))

# Process the validation files
with open(output_file_val, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_val, total=len(files_val)):
        with lzma.open(os.path.join(folder_path, filename), "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            vocab.update(set(text))

# Write the vocabulary to vocab.txt
with open(vocab_file, "w", encoding="utf-8") as vfile:
    for char in sorted(vocab):
        vfile.write(char + "\n")
```
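If you want to sanity-check the `lzma` read path before pointing the script at the full corpus, here is a minimal self-contained sketch (the file name `sample.xz` is just a placeholder): it writes a tiny .xz file and reads it back the same way the extraction script reads each corpus shard.

```python
import lzma
import os
import tempfile

# Write a tiny .xz file, then read it back with lzma.open(..., "rt"),
# mirroring how the extraction script reads each corpus shard
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.xz")
    with lzma.open(path, "wt", encoding="utf-8") as f:
        f.write("hello openwebtext")
    with lzma.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    print(text)  # hello openwebtext
```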
Thank you so much! You are my idol!!!!
Shuguang Wen
---- Replied Message ----
From: Elliot
Date: 01/13/2024 23:55
Subject: Re: [Infatoshi/fcc-intro-to-llms] Data Extract file 4hr 42mins 20sec (Discussion #2)
You should not need to use conda for tqdm or any other packages. Did you make sure tqdm is installed? It just acts as a progress bar so you can be certain the transfer is working. If you cannot find a solution to the tqdm error (possibly just `pip3 install tqdm`), you should be able to remove it and run the script anyway :)
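To see that tqdm is purely cosmetic, here is a minimal sketch: wrapping an iterable in `tqdm(...)` only draws a progress bar and does not change the result of the loop.

```python
from tqdm import tqdm

# tqdm wraps any iterable; it only displays a progress bar on stderr,
# so the loop computes the same total with or without the wrapper
total = 0
for i in tqdm(range(100)):
    total += i
print(total)  # 4950
```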

Hey,
Thanks for the course, I've been slowly working my way through it (15 hrs so far and only 4 and a half hours in). Just a little stuck (but I'm using ChatGPT-4 for questions like you suggested, and it's been a huge help as a learning support tool).
Any chance you could elaborate more on the file location or the Microsoft Visual Studio Code setup you used when you transitioned from extracting the corpus to loading in the modules (os, lzma, tqdm, plus 7-Zip for extraction)?
The aspect ratio of the YouTube recording cuts off a border/margin, so it ends up missing a bit of info at times.
Cheers,
Mal