    "Description": "A culturally inclusive and linguistically diverse Arabic instruction dataset covering all 22 Arab countries in both MSA and dialectal Arabic.",
    "Volume": 17411.0,
    "Unit": "sentences",
    "Ethical Risks": "Low",
    "Provider": [
        "UBC-NLP"
    ],
    "Derived From": [],
    "Paper Title": "Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs",
    "Paper Link": "https://arxiv.org/pdf/2503.00151",
    "Script": "Arab",
    "Tokenized": false,
    "Host": "GitHub",
    "Access": "Free",
    "Cost": "",
    "Test Split": true,
    "Tasks": [
        "instruction tuning"
    ],
    "Venue Title": "arXiv",
    "Venue Type": "preprint",
    "Venue Name": "",
    "Authors": [
        "Fakhraddin Alwajih",
        "Abdellah El Mekki",
        "Samar Mohamed Magdy",
        "Abdelrahim A. Elmadany",
        "Omer Nacar",
        "ElMoatez Billah Nagoudi",
        "Reem Abdel-Salam",
        "Hanin Atwany",
        "Youssef Nafea",
        "Abdulfattah Mohammed Yahya",
        "Rahaf Alhamouri",
        "Hamzah A. Alsayadi",
        "Hiba Zayed",
        "Sara Shatnawi",
        "Serry Sibaee",
        "Yasir Ech-Chammakhy",
        "Walid Al-Dhabyani",
        "Marwa Mohamed Ali",
        "Imen Jarraya",
        "Ahmed Oumar El-Shangiti",
        "Aisha Alraeesi",
        "Mohammed Anwar Al-Ghrawi",
        "Abdulrahman S. Al-Batati",
        "Elgizouli Mohamed",
        "Noha Taha Elgindi",
        "Muhammed Saeed",
        "Houdaifa Atou",
        "Issam Ait Yahia",
        "Abdelhak Bouayad",
        "Mohammed Machrouh",
        "Amal Makouar",
        "Dania Alkawi",
        "Mukhtar Mohamed",
        "Safaa Taher Abdelfadil",
        "Amine Ziad Ounnoughene",
        "Rouabhia Anfel",
        "Rwaa Assi",
        "Ahmed Sorkatti",
        "Mohamedou Cheikh Tourad",
        "Anis Koubaa",
        "Ismail Berrada",
        "Mustafa Jarrar",
        "Shady Shehata",
        "Muhammad Abdul-Mageed"
    ],
    "Affiliations": [
        "The University of British Columbia",
        "MBZUAI",
        "InvertibleAI",
        "Birzeit University",
        "Prince Sultan University",
        "UM6P",
        "Cairo University",
        "Ain Shams University",
        "Damascus University",
        "University of Khartoum",
        "University of Nouakchott",
        "National Polytechnic School of Algiers",
        "Full Sail University",
        "Alfaisal University",
        "Hamad Bin Khalifa University"
    ],
    "Abstract": "As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.",