This project aims to build an Automatic Speech Recognition (ASR) system for the Nepali language. Using OpenAI's Whisper Small model as the base, we fine-tuned it on a custom dataset to accurately transcribe Nepali speech into text.
- Data Preparation: Scripts for cleaning, preprocessing, and augmenting Nepali speech data.
- Model Training: Configuration and scripts for fine-tuning the Whisper model.
- Inference and Evaluation: Tools and demo interfaces to run the model on new audio samples.
- Frontend and Deployment: A Streamlit application for interactive user testing.
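Whisper models expect 16 kHz mono input, so data preparation typically includes downmixing and resampling. The sketch below illustrates that step only; the function name and the simple linear-interpolation resampling are illustrative assumptions, not the project's actual preprocessing scripts:

```python
import numpy as np

WHISPER_SR = 16_000  # Whisper models expect 16 kHz mono input


def to_mono_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix multi-channel audio to mono and resample to 16 kHz.

    Illustrative sketch: uses linear interpolation for resampling,
    which is simpler (and lower quality) than a proper polyphase
    resampler such as librosa's or torchaudio's.
    """
    if audio.ndim == 2:  # (samples, channels) -> average channels to mono
        audio = audio.mean(axis=1)
    if sr != WHISPER_SR:
        duration = audio.shape[0] / sr
        n_out = int(round(duration * WHISPER_SR))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)
```

For example, one second of 44.1 kHz stereo becomes a 16 000-sample mono array ready to feed to the Whisper feature extractor.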
- Some audio files contain background noise that affects transcription quality.
- Limited data; more data would help the model generalize to different scenarios.
- Audio with multiple channels or overlapping speakers is not transcribed well.
- Collect more diverse and high-quality Nepali speech data.
- Train larger models if GPU resources become available.
Clone the repository with `git clone`, then install the requirements:

```bash
pip install -r requirements.in
```
- Clone the repository:

  ```bash
  git clone https://github.com/fuseai-fellowship/Nepali-Speech-to-Text-Translation.git
  ```

- Change to the inference directory:

  ```bash
  cd src/inference
  ```

- Run the Streamlit demo:

  ```bash
  streamlit run app.py
  ```
- Alternatively, run inference from the command line on an audio file:

  ```bash
  python src/inference.py test.mp3
  ```
Refer to the dataset README for details on the dataset, its sources, usability, and the link to the data.
## Updated Code Structure

```
├── assets
├── dataset
│   ├── male-female-data (SLR143)
│   ├── ne_np_female (SLR43)
│   ├── preperation_scripts
│   ├── scraping
│   ├── synthetic_data_using_TTS
│   └── README.md
├── docs
├── notebook
│   ├── finetuning-whispher-on-Nepali-base_old_data.ipynb
│   ├── finetuning-whispher-on-Nepali-small_old_data.ipynb
│   ├── notebook_inference_and_push_hub.ipynb
│   ├── whisper_fine_tune_5_epoch.ipynb
│   └── whispher-finetune-on-small_NP_ASR_data.ipynb
├── src
│   ├── inference
│   ├── inference.py
│   ├── test.mp3
│   ├── train.py
│   └── utils.py
├── tests
│   └── test_template.py
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.md
├── requirements.in
└── requirements.txt
```

- `dataset`: Data preparation scripts.
- `src`: Model training and architecture.
- `src/utils`: Utility functions for processing audio and model output.
- `src/inference`: Inference scripts and the Streamlit demo.
- `requirements.in`: List of Python dependencies.
- `Makefile`: Commands to set up and manage the project.
- HuggingFace Demo: https://huggingface.co/spaces/kshitizzzzzzz/NEPALI_ASR_Whisper_Small
- Model source code: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py
We used the Word Error Rate (WER) to evaluate the accuracy of the ASR system. WER is calculated as follows:

WER = (S + D + I) / N

where S is the number of substituted words, D the number of deletions, I the number of insertions, and N the number of words in the reference transcript.
A lower WER indicates a better-performing model.
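To make the metric concrete, WER can be computed as a word-level Levenshtein edit distance normalized by the reference length. This is a minimal sketch; in practice a library such as `jiwer` or HuggingFace `evaluate` would typically be used (an assumption, not a statement about this project's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For instance, if one word out of a three-word reference is substituted, the WER is 1/3.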
The current model has a WER of 32 on the Common Voice and other collected validation sets.
