Description
I have a dataset containing some country names that the library fails to match. Specifically:
- Palestine, State of
- Côte d’Ivoire
The en.yml data file contains these relevant entries:
PS:
  aliases:
  - Palestinian Territories
  - Palestinian Territory
  alpha2: PS
  alpha3: PSE
  fifa: PLE
  ioc: PLE
  iso_name: Palestinian Territory, Occupied
  numeric: "275"
  official: State of Palestine
  short: Palestine
  emoji: "\U0001F1F5\U0001F1F8"
  shortcode: ":flag-ps:"
CI:
  alpha2: CI
  alpha3: CIV
  fifa: CIV
  ioc: CIV
  iso_name: Côte D'Ivoire
  numeric: "384"
  official: Republic of Côte D'Ivoire
  short: Ivory Coast
  emoji: "\U0001F1E8\U0001F1EE"
  shortcode: ":flag-ci:"
So it's a "close but no cigar" situation in both cases. I'm not sure how to solve this.
I'm wondering whether the library should strip punctuation and flatten to ASCII when comparing. This would handle the different choice of apostrophe and any missing or altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with "D'Ivoire" (French).
There are other names with an apostrophe. These are going to be problematic, given how inconsistently people use punctuation. The same goes for hyphens, as in Bosnia-Herzegovina and Guinea-Bissau, accents, as in Åland Islands, and plain alternative spellings such as Faeroes.
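To make the punctuation-and-accent idea concrete, here is a minimal sketch in Python of what such a comparison key might look like. The function name `normalize_name` is hypothetical and is not part of the library. It strips accents via Unicode decomposition rather than forcing strict ASCII, so letters with no decomposition (such as "ø") are left as they are.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Build a comparison key: no accents, no punctuation, case-folded.

    Hypothetical helper for illustration only; not the library's API.
    """
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, which removes the accents.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Turn every non-alphanumeric character (apostrophes, hyphens, commas)
    # into a space so that both apostrophe styles compare equal.
    spaced = "".join(ch if ch.isalnum() else " " for ch in without_accents)
    # Case-fold and collapse runs of whitespace.
    return " ".join(spaced.casefold().split())


print(normalize_name("Côte d’Ivoire"))   # cote d ivoire
print(normalize_name("Côte D'Ivoire"))   # cote d ivoire
print(normalize_name("Åland Islands"))   # aland islands
```

Both spellings of Côte d'Ivoire reduce to the same key. Alternative spellings such as Faeroes are not helped by this at all, and neither is a name whose words are in a different order.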
"Palestine, State of" follows the pattern some other names use, putting the main name first and any qualifier such as "State of" after a comma. But it doesn't match in this case. I think this is harder: removing punctuation is one thing, rearranging word order is another.
I see that some entries in en.yml have aliases. Perhaps adding a lot of aliases is the better solution?
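One possible way to handle the comma form, sketched in Python under the assumption that the lookup can try more than one candidate string. The helper `comma_variants` is hypothetical. It keeps the original name and adds a reordered variant, so the reordering can only add a match and never replaces the input.

```python
def comma_variants(name: str) -> list[str]:
    """Return the name and, for 'Main, Qualifier' forms, 'Qualifier Main' too.

    Hypothetical helper for illustration only; not the library's API.
    """
    # Split at the first comma only; anything after it is the qualifier.
    head, sep, tail = name.partition(",")
    if not sep or not tail.strip():
        # No comma, or nothing after it: nothing to reorder.
        return [name]
    return [name, f"{tail.strip()} {head.strip()}"]


print(comma_variants("Palestine, State of"))
# ['Palestine, State of', 'State of Palestine']
print(comma_variants("Palestinian Territory, Occupied"))
# ['Palestinian Territory, Occupied', 'Occupied Palestinian Territory']
```

The first variant matches the `official` value quoted above. The caveat is that a comma does not always mark a trailing qualifier: a name that is really a list of places would be reordered into nonsense, which is why the sketch offers the variant as an extra candidate instead of rewriting the name.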
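For comparison, a rough Python sketch of the alias route. The `ENTRIES` dict is a hypothetical, trimmed mirror of the two en.yml entries quoted above. The aliases "Palestine, State of" and "Côte d’Ivoire" are additions made for this illustration and are not in the data file as quoted. The functions `build_index` and `lookup` are likewise assumptions, not the library's API.

```python
# Hypothetical mirror of two en.yml entries, trimmed to the name fields.
ENTRIES = {
    "PS": {
        # The last alias is an assumed addition, not present in en.yml.
        "aliases": [
            "Palestinian Territories",
            "Palestinian Territory",
            "Palestine, State of",
        ],
        "iso_name": "Palestinian Territory, Occupied",
        "official": "State of Palestine",
        "short": "Palestine",
    },
    "CI": {
        # Assumed addition: the typographic-apostrophe spelling.
        "aliases": ["Côte d’Ivoire"],
        "iso_name": "Côte D'Ivoire",
        "official": "Republic of Côte D'Ivoire",
        "short": "Ivory Coast",
    },
}

NAME_FIELDS = ("iso_name", "official", "short")


def build_index(entries: dict) -> dict:
    """Map every known name and alias (case-folded) to its country code."""
    index = {}
    for code, entry in entries.items():
        names = [entry[field] for field in NAME_FIELDS if field in entry]
        names.extend(entry.get("aliases", []))
        for name in names:
            index[name.casefold()] = code
    return index


INDEX = build_index(ENTRIES)


def lookup(name: str):
    """Return the country code for an exact (case-insensitive) name, or None."""
    return INDEX.get(name.casefold())


print(lookup("Palestine, State of"))  # PS
print(lookup("Côte d’Ivoire"))        # CI
print(lookup("Faeroes"))              # None
```

Aliases are exact and predictable, and they are the only one of these approaches that can cover alternative spellings, but every variant has to be listed by hand.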