
Fix UTF-8 encoding bug for 4-byte Unicode characters #34

Open
lewisbobrow wants to merge 1 commit into levyfan:master from lewisbobrow:fix-utf8-encoding


lewisbobrow commented Dec 5, 2025

#33

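The code used GetStringUTFChars/NewStringUTF, which convert between Java Strings and "Modified UTF-8" rather than standard UTF-8. In Modified UTF-8, characters that take 4 bytes in standard UTF-8 (Mathematical Bold, emoji, etc.) are instead encoded as a surrogate pair of two 3-byte sequences. SentencePiece expects standard UTF-8, does not recognize these sequences, and returns UNK tokens.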

  • Replaced GetStringUTFChars with Java's String.getBytes("UTF-8") for proper UTF-8 encoding
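The difference between the two encodings can be reproduced in plain Java. This is a minimal sketch, not the code from this PR: it assumes DataOutputStream.writeUTF as a stand-in for the JNI call, since both are documented to produce Modified UTF-8, and the class and method names (Utf8Comparison, standardUtf8, modifiedUtf8, hex) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the PR.
final class Utf8Comparison {

    // Standard UTF-8, the encoding the PR switches to via String.getBytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, produced here with DataOutputStream.writeUTF as a
    // stand-in for JNI's GetStringUTFChars (an assumption of this sketch).
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = buffer.toByteArray();
        // writeUTF prepends a 2-byte length; drop it to keep only the payload.
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A, outside the Basic Multilingual Plane.
        String boldA = new String(Character.toChars(0x1D400));
        System.out.println(hex(standardUtf8(boldA))); // F0 9D 90 80
        System.out.println(hex(modifiedUtf8(boldA))); // ED A0 B5 ED B0 80
    }
}
```

The second line of output is the surrogate pair D835 DC00 with each half encoded as its own 3-byte sequence, which is not valid standard UTF-8.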

lewisbobrow reopened this Jan 6, 2026