Problems with punctuation and order #8

@wu-lee

I have a dataset containing some country names that fail to match. Specifically:

  • Palestine, State of
  • Côte d’Ivoire

The en.yml data file contains these relevant entries:

PS:
  aliases:
  - Palestinian Territories
  - Palestinian Territory
  alpha2: PS
  alpha3: PSE
  fifa: PLE
  ioc: PLE
  iso_name: Palestinian Territory, Occupied
  numeric: "275"
  official: State of Palestine
  short: Palestine
  emoji: "\U0001F1F5\U0001F1F8"
  shortcode: ":flag-ps:"
CI:
  alpha2: CI
  alpha3: CIV
  fifa: CIV
  ioc: CIV
  iso_name: Côte D'Ivoire
  numeric: "384"
  official: Republic of Côte D'Ivoire
  short: Ivory Coast
  emoji: "\U0001F1E8\U0001F1EE"
  shortcode: ":flag-ci:"

So it's a "close but no cigar" situation in both cases: neither name exactly matches any of the values listed for its country. I'm not sure how to solve this.

I'm wondering if the library should strip punctuation and flatten to ASCII when comparing? This would handle the different choice of apostrophe (’ versus ') and any missing or altered accents in Côte D'Ivoire, but perhaps that goes too far. I can't currently think of any country names it would break, but that's not to say there aren't any. And come to think of it, the official name is also a bit weird, mixing "Republic of" (English) with "D'Ivoire" (French).
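To make the idea concrete, here is a rough Python sketch of that kind of comparison key. It is only an illustration, not this library's code: the helper name `normalize_name` is made up, and replacing punctuation with a space (rather than deleting it) is my own assumption.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical helper: build a comparison key for a country name."""
    # Decompose accented letters, e.g. "ô" becomes "o" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Treat anything that is not a letter or digit as a word separator.
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks)
    # Ignore case and collapse runs of whitespace.
    return " ".join(cleaned.casefold().split())


print(normalize_name("Côte d’Ivoire"))  # cote d ivoire
print(normalize_name("Côte D'Ivoire"))  # cote d ivoire
print(normalize_name("Guinea-Bissau"))  # guinea bissau
```

With a key like this, both apostrophe variants and an unaccented "Cote" compare equal. It does nothing for genuinely different spellings such as Faeroes, so those would still need aliases.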

There are other names with an apostrophe. These are going to be problematic, given how inconsistently people use punctuation. The same goes for hyphens, as in Bosnia-Herzegovina and Guinea-Bissau, accents, as in Åland Islands, and plain alternative spellings like Faeroes.

"Palestine, State of" does what some of the other names do: it puts the main name first and any qualifier like "State of" after a comma. But no value listed for PS has exactly that wording, so it doesn't match. I think this is harder; removing punctuation is one thing, rearranging word order is another.
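For the word-order problem, one possible approach is to try a second candidate with the part after the comma moved to the front. A minimal Python sketch follows; the function name `reorder_qualifier` is hypothetical, and nothing like it is implied to exist in the library.

```python
def reorder_qualifier(name: str) -> str:
    """Hypothetical helper: turn "Main, Qualifier" into "Qualifier Main"."""
    # Split on the first comma only.
    head, sep, tail = name.partition(",")
    if not sep or not tail.strip():
        # No comma, or nothing after it: leave the name alone.
        return name.strip()
    return f"{tail.strip()} {head.strip()}"


print(reorder_qualifier("Palestine, State of"))  # State of Palestine
print(reorder_qualifier("Ivory Coast"))          # Ivory Coast
```

This yields "State of Palestine", which is the `official` value above. It would have to be a fallback tried only after the unmodified name fails, because some names use a comma as part of a list rather than as a qualifier: "Bonaire, Sint Eustatius and Saba" would be mangled by it.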

I see elsewhere in en.yml there are aliases. Perhaps adding a lot more aliases is the better solution?
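As a sketch of the alias route, the Python below builds a lookup from every name field plus the aliases. The dictionary mirrors only the fields quoted above; the two extra aliases are the additions I'm proposing, not entries that exist in en.yml today, and `build_index` / `lookup` are made-up names. Case-insensitive matching is also my assumption.

```python
COUNTRIES = {
    "PS": {
        "aliases": [
            "Palestinian Territories",
            "Palestinian Territory",
            "Palestine, State of",  # proposed new alias
        ],
        "iso_name": "Palestinian Territory, Occupied",
        "official": "State of Palestine",
        "short": "Palestine",
    },
    "CI": {
        "aliases": ["Côte d’Ivoire"],  # proposed new alias (typographic apostrophe)
        "iso_name": "Côte D'Ivoire",
        "official": "Republic of Côte D'Ivoire",
        "short": "Ivory Coast",
    },
}


def build_index(countries):
    """Map every known name, compared case-insensitively, to its country code."""
    index = {}
    for code, entry in countries.items():
        names = [entry["iso_name"], entry["official"], entry["short"]]
        names += entry.get("aliases", [])
        for name in names:
            index[name.casefold()] = code
    return index


INDEX = build_index(COUNTRIES)


def lookup(name):
    """Return the country code for an exact (case-insensitive) name, else None."""
    return INDEX.get(name.strip().casefold())


print(lookup("Palestine, State of"))  # PS
print(lookup("Côte d’Ivoire"))        # CI
```

The downside is visible even here: every spelling variant has to be listed by hand, so the alias lists would grow quickly.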
