Replies: 2 comments 5 replies
All good! It took me a while to get there, but I finally got a successful data extract of the corpus. I needed to make a .py file in a new /openwebtext folder inside the /fcc-gpt-course folder and then type out the code as you displayed (customising folder_path to wherever you downloaded and extracted the corpus). Pasting the code in case anyone else doesn't want to type it out — I've filled in the loop bodies from the video, and the output file names are just what I used, so adjust as needed:

```python
import os
import lzma
from tqdm import tqdm

def xz_files_in_dir(directory):
    # List the .xz corpus shards in the given directory
    return [f for f in os.listdir(directory)
            if f.endswith(".xz") and os.path.isfile(os.path.join(directory, f))]

folder_path = "input your openwebtext extracted corpus file directory here"
output_file_train = "output_train.txt"  # adjust output file names as needed
output_file_val = "output_val.txt"
vocab_file = "vocab.txt"

files = xz_files_in_dir(folder_path)
total_files = len(files)

# Calculate the split indices
split_index = int(total_files * 0.9)  # 90% for training
files_train = files[:split_index]
files_val = files[split_index:]

# Process the files for training and validation separately
vocab = set()

# Process the training files
with open(output_file_train, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_train, total=len(files_train)):
        with lzma.open(os.path.join(folder_path, filename), "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            vocab.update(set(text))

# Process the validation files
with open(output_file_val, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_val, total=len(files_val)):
        with lzma.open(os.path.join(folder_path, filename), "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            vocab.update(set(text))

# Write the vocabulary to vocab.txt
with open(vocab_file, "w", encoding="utf-8") as vfile:
    for char in sorted(vocab):
        vfile.write(char + "\n")
```
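If you want to sanity-check the `lzma` read path before pointing the script at the full corpus, here is a minimal self-contained sketch (the file name `sample.xz` is just a placeholder): it writes a tiny .xz file and reads it back the same way the extraction script reads each corpus shard.

```python
import lzma
import os
import tempfile

# Write a tiny .xz file, then read it back with lzma.open(..., "rt"),
# mirroring how the extraction script reads each corpus shard
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.xz")
    with lzma.open(path, "wt", encoding="utf-8") as f:
        f.write("hello openwebtext")
    with lzma.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    print(text)  # hello openwebtext
```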
Thank you so much! You are my idol!!!!
Shuguang Wen
---- Replied Message ----
From: Elliot
Date: 01/13/2024 23:55
Subject: Re: [Infatoshi/fcc-intro-to-llms] Data Extract file 4hr 42mins 20sec (Discussion #2)
You should not need to use conda for tqdm or any other packages. Did you make sure tqdm is installed? It just acts as a progress bar so you can be certain the transfer is working. If you cannot find a solution to the tqdm error (possibly just `pip3 install tqdm`), you should be able to remove it and run the script anyway :)
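To see that tqdm is purely cosmetic, here is a minimal sketch: wrapping an iterable in `tqdm(...)` only draws a progress bar and does not change the result of the loop.

```python
from tqdm import tqdm

# tqdm wraps any iterable; it only displays a progress bar on stderr,
# so the loop computes the same total with or without the wrapper
total = 0
for i in tqdm(range(100)):
    total += i
print(total)  # 4950
```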

Hey,
Thanks for the course, I've been slowly working my way through it (15 hrs so far and only 4 and a half hours in). Just a little stuck (but I'm using ChatGPT-4 for questions like you suggested, and it's been a huge help as a learning support tool).
Any chance you could elaborate more on the file location or the Microsoft Visual Studio Code setup you used when you transitioned from extracting the corpus to loading in the modules (os, lzma, tqdm, plus 7-Zip for extraction)?
The aspect ratio of the YouTube recording cuts off a border/margin, so it ends up missing a bit of info at times.
Cheers,
Mal