+ "Abstract": "Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes \u224845K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We have released the dialectal translation models and benchmarks developed in this study.",