Commit 3ea17b5

Merge pull request #326 from ARBML/add-qari_markdown_mixed_dataset
Adding QARI Markdown Mixed Dataset to the catalogue
2 parents ce88cf1 + 8beeb24 commit 3ea17b5

1 file changed: +55 -0 lines changed

@@ -0,0 +1,55 @@
+{
+    "Name": "QARI Markdown Mixed Dataset",
+    "Subsets": [],
+    "HF Link": "https://huggingface.co/datasets/NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset",
+    "Link": "https://huggingface.co/datasets/NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset",
+    "License": "Apache-2.0",
+    "Year": 2025,
+    "Language": "ar",
+    "Dialect": "Modern Standard Arabic",
+    "Domain": [
+        "news articles"
+    ],
+    "Form": "images",
+    "Collection Style": [
+        "machine annotation"
+    ],
+    "Description": "A vision-language OCR dataset for Arabic text recognition, generated synthetically and used to fine-tune the Qari-OCR model.",
+    "Volume": 37000.0,
+    "Unit": "images",
+    "Ethical Risks": "Low",
+    "Provider": [
+        "NAMAA"
+    ],
+    "Derived From": [],
+    "Paper Title": "QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation",
+    "Paper Link": "https://arxiv.org/pdf/2506.02295",
+    "Script": "Arab",
+    "Tokenized": false,
+    "Host": "HuggingFace",
+    "Access": "Free",
+    "Cost": "",
+    "Test Split": false,
+    "Tasks": [
+        "optical character recognition"
+    ],
+    "Venue Title": "arXiv",
+    "Venue Type": "preprint",
+    "Venue Name": "arXiv",
+    "Authors": [
+        "Ahmed Wasfy",
+        "Omer Nacar",
+        "Abdelakreem Elkhateb",
+        "Mahmoud Reda",
+        "Omar Elshehy",
+        "Adel Ammar",
+        "Wadii Boulila"
+    ],
+    "Affiliations": [
+        "NAMAA",
+        "KANDCA Corp.",
+        "Prince Sultan University"
+    ],
+    "Abstract": "The inherent complexities of Arabic script; its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.\n",
+    "Added By": "Zaid Alyafeai"
+}
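
A reviewer could sanity-check an entry like the one above before merging. The following is a minimal sketch, not the catalogue's own validation: the `EXPECTED_TYPES` table and the `check_entry` helper are hypothetical, the expected types are assumptions inferred from this one entry rather than an official schema, and only an excerpt of the fields is included.

```python
import json

# Excerpt of the entry added in this commit (field names and values copied from the diff).
ENTRY_JSON = """
{
    "Name": "QARI Markdown Mixed Dataset",
    "HF Link": "https://huggingface.co/datasets/NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset",
    "License": "Apache-2.0",
    "Year": 2025,
    "Language": "ar",
    "Form": "images",
    "Volume": 37000.0,
    "Unit": "images",
    "Tasks": ["optical character recognition"]
}
"""

# Assumed expectations inferred from this entry, NOT the catalogue's official schema.
EXPECTED_TYPES = {
    "Name": str,
    "HF Link": str,
    "License": str,
    "Year": int,
    "Language": str,
    "Form": str,
    "Volume": float,
    "Unit": str,
    "Tasks": list,
}


def check_entry(entry):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # Every expected field must be present and have the assumed type.
    for field, expected in EXPECTED_TYPES.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    # The dataset link should be an https URL.
    link = entry.get("HF Link")
    if isinstance(link, str) and not link.startswith("https://"):
        problems.append("HF Link: expected an https URL")
    # A dataset with zero or negative volume is almost certainly a typo.
    volume = entry.get("Volume")
    if isinstance(volume, (int, float)) and volume <= 0:
        problems.append("Volume: expected a positive number")
    return problems


entry = json.loads(ENTRY_JSON)
problems = check_entry(entry)
print(problems or "entry passed the basic checks")
```

Returning a list of problems instead of raising on the first one lets a reviewer see every issue with an entry in a single pass.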
