This repository aims to be the most comprehensive collection of Nepali datasets on GitHub. We aggregate and curate machine learning datasets for the Nepali language from multiple open sources.
Note: This repository aggregates publicly available datasets from open sources. If any content belongs to you, please submit a PR or contact us to request removal.
- Overview
- Text & NLP Datasets
- Audio & Speech Datasets
- Image Datasets
- Geospatial & Location Datasets
- Time Series & Real-Time Data
- Financial & Economic Datasets
- Specialized Datasets
- Embedding & Representation Learning
- Public Data Sources
- Related NLP Research & Tools
- Nepali Literature Dataset
- Nepali MNIST
- Contributing
- Nepali News Dataset - https://www.kaggle.com/lotusacharya/nepalinewsdataset
- iNLTK Nepali News Dataset - https://www.kaggle.com/disisbig/nepali-news-dataset
- 16NepaliNews Corpus - https://github.com/sndsabin/Nepali-News-Classifier (14,364 documents)
- Nepali News Dataset (Large) - https://www.kaggle.com/ashokpant/nepali-news-dataset-large
- Nepali News Dataset (Small) - https://www.kaggle.com/tejshahi/20nepalinews
- Nepali News Classification Dataset - https://drive.google.com/drive/folders/1Vm0UJ3FfWP-3guSan3FZsOV4q7rYuJIG
- Nagarik News Corpus - https://github.com/ashmitbhattarai/Nepali-Language-Modeling-Using-LSTM/tree/master/Nepali_Corpus/Nagarik
- Setopati News Corpus - https://github.com/ashmitbhattarai/Nepali-Language-Modeling-Using-LSTM/tree/master/Nepali_Corpus/SetoPati
- Nepali Wikipedia Articles (39K) - https://www.kaggle.com/disisbig/nepali-wikipedia-articles
- OSCAR Corpus Nepali - https://www.kaggle.com/hsebarp/oscar-corpus-nepali
- Nepali Brihat Sabdakosh JSON - https://github.com/bikashpadhikari/nepali-brihat-sabdakosh-json (122,000 words)
- Large Scale Nepali Text Corpus - https://ieee-dataport.org/open-access/large-scale-nepali-text-corpus
- 65K Nepali Sentences - https://github.com/sanjaalcorps/NepaliDataSets
- 350K Nepali Sentences - https://github.com/Team-Naya/nlp-doko
- Nepali-English Language Pair - https://github.com/sharad461/nepali-translator
- FLORES 101 Dataset - https://github.com/facebookresearch/flores/tree/main/floresv1/data
- WMT19 Parallel Corpus - https://www.statmt.org/wmt19/parallel-corpus-filtering.html
- English-Nepali Parallel Corpus (ELRA) - https://catalog.elra.info/en-us/repository/browse/ELRA-W0077/
- English-Nepali Translated Strings (TDIL) - https://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1069&lang=en
- English to Nepali Translation - https://github.com/arunism/English-to-Nepali-Language-Translation/tree/master/data
- Nepali-English Translation Dataset - https://github.com/BISHALTWR/Nepali-English-Translation-Dataset
- Nepali English Machine Translation Corpus - https://github.com/facebookresearch/flores
- Nepali Translation Parallel Corpus - https://drive.google.com/file/d/1UThfJKJFvDgTu263DNbz-WPNLqoARZ_0/view
- Nepali Data Set for Sentiment Analysis - https://mahesha.com.np/nepali-data-set-for-sentiment-analysis/
- NepaliSentiment - https://github.com/rockerritesh/NepaliSentiment
- Sentiment Analysis in Nepali - https://github.com/sarozz/Sentiment_analysis_in_Nepali/blob/master/data.csv
- SentimentAnalysis - https://github.com/sagarl123/NepaliNLP-SentimentAnalysis/blob/main/collected_labeled_data.csv
- Nepali Sentiment Analysis - https://www.kaggle.com/smaheshacharya/nepali-sentiment-analysis
- Nepali Movie Reviews Sentiment Analysis - https://www.kaggle.com/shikharghimire/nepali-language-sentiment-analysis-movie-reviews
- Multi-channel CNN COVID-19 Tweets - Nepali COVID-19 related tweets for classification
- Nepali NER Dataset - https://github.com/oya163/nepali-ner/tree/master/data/ebiquity_v2
- Nepali Text Summarization - https://www.kaggle.com/imageinfo/nepali-text-summarization
- Nepali Abstractive Summarization Corpus - https://drive.google.com/file/d/1L56k0zonMk6XpelKAXPm45wCmt-9pS3x/view (286k article-title pairs)
- Laxmi Prasad Devkota Poems - https://github.com/devkotasawal1/Poem-Generator/blob/master/lspd.txt (119,161 characters)
- Nepali Ukhaan Tukka (Proverbs) - https://github.com/theseekersway/Nepali-Ukhaan-Tukka
- Nepali Names - https://github.com/datafiction/oya-nepali-nlp/blob/master/data/names/Nepali.txt
- Dummy Nepali People Information - https://github.com/bibhuticoder/dummydata/blob/master/data.csv
- Nepali Stopwords - https://github.com/sanjaalcorps/NepaliStopWords
- Nepali Ngram - https://github.com/virtualanup/nepalingram
- Nepali Chat Corpus - https://github.com/itsmeashutosh43/create-a-Open-Source-Nepali-Chat-corpus
- English News Corpus (Nepal) - https://github.com/sharad461/english-corpus-nepal
- Nepal Earthquake Tweets - https://crisisnlp.qcri.org/lrec2016/content/2015_nepal_eq.html
- High Quality TTS Data for Nepali - https://www.openslr.org/43/ (2,000 sentences, 48kHz)
- Nepali Text to Speech Dataset 1 - https://github.com/meamit/nepali-text-to-speech/tree/master/speechdb
- Nepali Text to Speech Dataset 2 - https://github.com/anuragregmi/speak_nepali/tree/master/sounds
- Nepali Text to Speech Dataset 3 - https://github.com/hcoebct069/nepali-asr/tree/master/recordings
- Large Nepali ASR Training Dataset - http://www.openslr.org/54 (157K utterances, 16kHz, FLAC)
- Devanagari Numbers Spoken Audio - https://drive.google.com/drive/folders/15g57Qa1TQa4Ix6-MiC6v1wieouqp0XAl
- Devanagari Characters Speech - https://github.com/tsumansapkota/Devanagari_Characters_Speech
- 300-D Word Embeddings (Word2Vec) - https://github.com/rabindralamsal/Word2Vec-Embeddings-for-Nepali-Language
- DHCD Dataset - https://github.com/Prasanna1991/DHCD_Dataset (Devanagari handwritten characters)
- Nepali Characters Dataset - https://github.com/InspiringLab/NCD
- Nepali Handwritten Digits - https://github.com/kcnishan/Nepali_handwritten_digits_recognition/tree/master/dataset
- Nepali Fonts OCR Dataset - https://github.com/BasantaChaulagain/Nepscan/tree/master/resources
- License Plate Recognition (LPR) Dataset - https://github.com/Prasanna1991/LPR (Nepali motorbike plates)
- Nepali Portraits Dataset - https://www.kaggle.com/sumansid/nepali-portraits-dataset
- Vehicles Dataset - https://github.com/sdevkota007/vehicles-nepal-dataset (4,800 images)
- Corn Leaf Infection Dataset - https://www.kaggle.com/qramkrishna/corn-leaf-infection-dataset
- Voting Ballot Paper Dataset - https://github.com/rajshreeee/image_classification_for_voting_system_using_cnn
- Nepali Currency Notes - https://drive.google.com/file/d/1pDF0hx6pvgx4DJTCHL4EeDdCT4wlfnGW/view
- Nepali Cash Dataset - https://drive.google.com/drive/folders/1GxITXrk13ehKMEMEbpi8mRsFSr4LUR55
- 10, 50 & 100 Rupee Notes - https://github.com/mmanishh/nrscurrencyrecognizer/tree/master/data/train
- Faces of Famous People from Nepal - https://www.thefamouspeople.com/nepal.php
- Open Street Maps Metadata - https://github.com/sharad461/nepal-openstreetmap-extract
- Nepal Travel Distance (km) - https://data.world/hdx/d1d0c217-8c6b-4747-ab1e-1069e2ff3e6b
- Local Government of Nepal - https://anmol2059.github.io/federal-nepal/
- EPA Air Pollution Data - https://github.com/hbvj99/EPAAirPollution
- Nepal Government Air Pollution Data - https://github.com/hbvj99/NPGovAirPollution
- Dristhi Air Pollution Data - https://github.com/hbvj99/DristhiAirPollution
- Pokhara Weather Data (2009-2023) - https://www.kaggle.com/datasets/gauravneupane/pokhara-weather-data-from-2009-to-2023
- Nepal Multi-District Weather Dataset (2020-2025) - https://www.kaggle.com/datasets/dipeshthapa1/nepal-multi-district-weather-dataset-2020-2025
- River Level Data - http://www.hydrology.gov.np
- Daily Vegetable/Fruit Price Information - http://kalimatimarket.gov.np/daily-price-information
- Mahanagar Yatayat Real-time Location - https://github.com/theonlyNischal/Track-Mahanagar-Yatayat
- Tribhuvan International Airport (Arrivals) - http://tiairport.com.np/flight_details
- Tribhuvan International Airport (Departures) - http://tiairport.com.np/flight_details_2
- Nepali Stock Market Dataset (2012-2020) - https://www.kaggle.com/sagyamthapa/nepali-stock-market-form-2012-to-2020-till-march
- Nepal Stock Exchange Data till 2019 - https://www.kaggle.com/qramkrishna/nepal-stock-exchange-data
- Nepal Rastra Bank Forex Rate API - https://www.nrb.org.np/exportForexJSON.php?YY=2019&MM=08&DD=01&YY1=2019&MM1=08&DD1=02
- Earthquake Building Damage Levels - https://www.drivendata.org/competitions/57/nepal-earthquake/page/136/
- Health Diseases in Nepali - https://github.com/sanjaalcorps/NepaliDataClassifiers/blob/master/HealthClassifiers.txt
- Nepali Word2Vec from Scratch - https://github.com/R4j4n/Nepali-Word2Vec-from-scratch
- 300-D Word2Vec Embeddings - https://github.com/rabindralamsal/Word2Vec-Embeddings-for-Nepali-Language
- Open Data Nepal - https://opendatanepal.com/
- Census Nepal - https://censusnepal.cbs.gov.np/results
- LDC-IL (Indian Language Resources) - Language resource repository
- Nepali Lemmatizer - https://github.com/dpakpdl/NepaliLemmatizer
- Nepali NLP Progress - https://github.com/divyamani1/Nepali-NLP-Progress (tracking SOTA)
- NLP Progress Nepali - https://github.com/sebastianruder/NLP-progress/blob/master/nepali/nepali.md
- Pre-trained Models - https://huggingface.co/Suyogyart/nepali-16-newsgroups-classification (DistilBERT)
- Devkota Poem Collection - https://www.kaggle.com/datasets/abhyudayapokhrel/nepali-devkota-poem-dataset
- Nepali Handwritten Digits - https://www.kaggle.com/datasets/ujjwalpaudel/nepali-handwritten-digits
Found a Nepali dataset? Help us grow this collection!
How to Add:
- Star the repository
- Fork the repository
- Add the dataset to `README.md` in the right category
- Format: `- **Name** - [Link](url)`
- Submit a PR
Requirements:
- Publicly available dataset
- Working direct link
- Brief description (1-2 sentences)
- Proper attribution
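
The entry format above can be checked locally before opening a PR. The following is a minimal sketch, not part of this repository: the `check_entry` helper and its regular expression are hypothetical, and the optional trailing ` - description` part is an assumption about how a brief description would be appended. It only checks the shape of a line (bold name, Markdown link with an http(s) URL); it does not test whether the link works, and URLs containing parentheses would be rejected.

```python
import re

# Hypothetical pattern for the entry format "- **Name** - [Link](url)",
# optionally followed by " - short description" (an assumed convention).
ENTRY_PATTERN = re.compile(
    r"^- \*\*(?P<name>[^*]+)\*\* - \[(?P<label>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)"
    r"(?: - (?P<description>.+))?$"
)


def check_entry(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = "- **Nepali Stopwords** - [Link](https://github.com/sanjaalcorps/NepaliStopWords)"
    print(check_entry(sample))
```

A line that returns `None` most likely lacks the bold name or the Markdown link syntax; a line that parses still needs a manual check that the link opens and the dataset is publicly available.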
We review PRs within 48 hours. Thanks! 🎉