Guide to Datasets
Nina Gial edited this page Mar 1, 2024 · 1 revision

| File | Explanation |
|---|---|
| dimodis.rda | Scraped data from Δημώδης Ελληνική Γραμματεία |
| gutenberg.rda | Scraped data from the Greek-language section of Project Gutenberg |
| glc.rda | Greek Legal Code from Hugging Face |
| alpaca.rda | Alpaca instruction fine-tuning dataset from Hugging Face |
| result_sentences.pkl | Random sample of about 350K sentences from the Bible, Europarl, HNC, and GlobalVoices |
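
RDA files usually contain R environments. Load one and inspect it in R:

```r
load("data/dimodis.rda")    # load the saved objects into the workspace
ls(dimodis)                 # list what the dimodis environment contains
str(dimodis$works$ergoes)   # reach the actual text data
```

We will fix this interface soon. Suggest your preferred formats in the issues.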

PKL files can be loaded in Python with `pickle.load()`:

```python
import pickle

with open("result_sentences.pkl", "rb") as f:
    sentences = pickle.load(f)
```

See the scripts section on how to further use the files.
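
This page does not document the structure of the object stored in `result_sentences.pkl`, so it may help to inspect it before use. The following is a minimal sketch, not part of the repository: `load_pickle` and `summarize` are hypothetical helper names, and the code makes no assumption about whether the object is a list, a dict, or something else. If the file holds a pandas object, pandas must be installed for unpickling to succeed. Note that unpickling can execute arbitrary code, so only load files from a source you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file. Only use this on files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def summarize(obj, n=3):
    """Return a small dict describing a loaded object without assuming its type."""
    summary = {"type": type(obj).__name__}
    # len() is available on lists, tuples, dicts, and pandas objects alike
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    # Show the first few items of a plain sequence, or the first few keys of a dict
    if isinstance(obj, (list, tuple)):
        summary["head"] = list(obj[:n])
    elif isinstance(obj, dict):
        summary["keys"] = list(obj)[:n]
    return summary


if __name__ == "__main__":
    # The file name comes from the table above; its location is an assumption.
    path = Path("result_sentences.pkl")
    if path.exists():
        print(summarize(load_pickle(path)))
```

Reporting the type and length first makes it easier to decide how to iterate over the sentences in later scripts.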