Mon Language Corpus Collection

Mon is spoken by roughly one million people across Myanmar and Thailand. UNESCO classifies it as vulnerable — and before this project, no NLP-grade corpus existed for it.

This is a production-grade corpus of the Mon language, curated for NLP research, LLM pre-training, and OCR model development. It is the training data source for MonOCR.

Dataset

Source	Shards	Lines	Characters	Mon/Myanmar characters	Other characters
Mon Wikipedia	4	891,665	24,676,307	21,037,957	3,638,350
Mon News Agency	2	107,882	11,310,762	10,260,088	1,050,674
Custom Collections	1	119,739	6,831,401	3,681,874	3,149,527
Telegram / Facebook	2	4,479	95,098	81,479	13,619
OCR Extracted	1	733	37,624	36,824	800
Total	10	1,124,998	42,951,192	35,098,222 (81.7%)	7,852,970 (18.3%)

Raw file size: ~113 MB (uncompressed UTF-8)
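
The last two columns of the table split each source's character count by script. A minimal sketch of how such a split could be computed, assuming "Mon/Myanmar" means code points in the Myanmar block and its Extended-A/B blocks, and that everything else (including spaces and punctuation) counts as "Other". The function names are hypothetical and this is not necessarily how the published figures were produced.

```python
# Unicode code point ranges for the Myanmar script blocks.
MYANMAR_RANGES = (
    (0x1000, 0x109F),  # Myanmar
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
)


def is_myanmar(ch: str) -> bool:
    """Return True if the single character lies in a Myanmar script block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MYANMAR_RANGES)


def script_share(text: str) -> tuple[int, int]:
    """Return (myanmar_count, other_count) for the given text."""
    myanmar = sum(1 for ch in text if is_myanmar(ch))
    return myanmar, len(text) - myanmar
```

Dividing the first count by the sum of both gives a percentage comparable to the one in the Total row.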

Data Quality

Unicode NFC normalization — All text is strictly normalized to NFC, ensuring consistent grapheme cluster representation regardless of input method or source platform.

Preservation pipeline — The pipeline preserves all Myanmar script blocks (U+1000–U+109F, Extended-A/B) and intentional spacing essential to Mon script readability. Only non-linguistic noise is stripped (BOM, ZWJ, ZWNJ, control codes).
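
The normalization and noise-stripping steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the function name is hypothetical, the input is assumed to be already split into lines, and "control codes" is interpreted as Unicode category Cc with tab kept.

```python
import unicodedata

# Noise characters named in the README: BOM (U+FEFF), ZWNJ (U+200C), ZWJ (U+200D).
NOISE = {"\ufeff", "\u200c", "\u200d"}


def clean_line(line: str) -> str:
    """Strip non-linguistic noise from one line, then normalize to NFC."""
    kept = []
    for ch in line:
        if ch in NOISE:
            continue
        # Assumption: drop control codes (category Cc) but keep tabs.
        if unicodedata.category(ch) == "Cc" and ch != "\t":
            continue
        kept.append(ch)
    # Ordinary spaces and all Myanmar-block characters pass through untouched.
    return unicodedata.normalize("NFC", "".join(kept))
```

Stripping before normalizing means the returned string is in NFC even when a removed character sat between a base character and a combining mark.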

Global deduplication — Content across all shards is globally deduplicated. A document in one shard will not appear in another, preventing data leakage between training and evaluation splits.
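
One common way to deduplicate across shards is to keep a single set of content hashes while streaming every shard through it. The sketch below assumes the unit of deduplication is a line and that matching is exact after trimming surrounding whitespace. The README does not state either detail, and the function name is hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order."""
    seen: set[bytes] = set()
    for line in lines:
        # Hash the trimmed text so the set stores fixed-size digests, not full lines.
        key = hashlib.sha256(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line
```

Feeding the lines of all shards through one `dedupe` call, rather than one call per shard, is what makes the deduplication global.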

Structure

MonCorpusCollection/
├── shards/                   # Training shards (~20MB each)
│   ├── monnews_shard_*.txt   # Mon News Agency articles
│   ├── wikipedia_shard_*.txt # Mon Wikipedia articles
│   ├── telegram_shard_*.txt  # Curated Telegram messages
│   └── custom_shard_*.txt    # Specialized and legacy collections
├── results/latest/           # Character frequency and bigram/trigram stats
├── scripts/                  # Corpus analysis utilities
└── README.md

Usage

Iterate through shards/ for model training. Each file is standard UTF-8 text.
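
A minimal sketch of that iteration, assuming one record per line and the `*_shard_*.txt` naming shown under Structure. The function name and the choice to skip blank lines are assumptions.

```python
from pathlib import Path
from typing import Iterator


def iter_lines(shard_dir: str = "shards") -> Iterator[str]:
    """Yield non-empty lines from every shard file, in filename order."""
    for path in sorted(Path(shard_dir).glob("*_shard_*.txt")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield line
```

Because the function is a generator, shards are read one line at a time and never loaded into memory as a whole.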

# Generate character frequency report
python scripts/mon_cluster_counter.py
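
For readers who want similar statistics without the script, the sketch below counts character n-grams with the standard library. It is not the implementation of scripts/mon_cluster_counter.py, whose internals are not described here. The function name is hypothetical, and it counts code points rather than grapheme clusters.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(lines: Iterable[str], n: int = 1) -> Counter:
    """Count code point n-grams; n-grams never span a line boundary."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    return counts
```

Calling it with `n=1`, `n=2`, and `n=3` gives character, bigram, and trigram frequencies respectively.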

License

MIT. If you use this data, please attribute Mon Corpus Collection and the original sources: Mon News Agency (IMNA) and Mon Wikipedia.

Contributing

Normalize all text to NFC before submission.
Provide clear source attribution for new data.
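
Before submitting, contributors could verify the NFC requirement with a check along these lines. This is a sketch using Python's standard `unicodedata` module (`is_normalized` requires Python 3.8 or later). The function name is hypothetical and the repository may provide no such helper.

```python
import unicodedata
from typing import Iterable


def non_nfc_lines(lines: Iterable[str]) -> list[int]:
    """Return 1-based numbers of the lines that are not in NFC."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]
```

An empty result means every line is already normalized.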

Janakh Pon · Htaw Mon
