Summary
In o200k_harmony, two special token names share the same token id 200018:
- <|endofprompt|> → 200018
- <|reserved_200018|> → 200018 (conflict)
Token ids must be unique within an encoding.
Reproduction
import tiktoken
from collections import defaultdict
print(tiktoken.__version__) # expect 0.12.0
enc = tiktoken.get_encoding("o200k_harmony")
sp = enc._special_tokens
print(sp) # shows '<|endofprompt|>': 200018 and '<|reserved_200018|>': 200018
# Optional: explicit duplicate-id check
id2names = defaultdict(list)
for name, tid in sp.items():
    id2names[tid].append(name)
dups = {tid: names for tid, names in id2names.items() if len(names) > 1}
print(dups)
# -> {200018: ['<|endofprompt|>', '<|reserved_200018|>']}
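As a quick cross-check (assuming both names are permitted as specials via allowed_special="all"), encoding either name yields the same id:

# Cross-check: both special-token names encode to the same id.
print(enc.encode("<|endofprompt|>", allowed_special="all"))      # [200018]
print(enc.encode("<|reserved_200018|>", allowed_special="all"))  # [200018]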
Actual
<|reserved_200018|> duplicates <|endofprompt|> (both id=200018).
Expected
No two special token names share the same token id.
Root Cause
tiktoken_ext/openai_public.py bulk-generates reserved_* specials for [200013, 201088) without excluding ids already used by explicit specials, introducing <|reserved_200018|> as a duplicate of <|endofprompt|>.
This is the full code segment where the bug is introduced (tiktoken_ext/openai_public.py, lines 95–145 at commit 97e49cb):

def o200k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
        expected_hash="446a9538cb6c348e3516120d7c08b09f57c36495e2acfffe59a5bf8b0cfb1a2d",
    )
    special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
    # This regex could be made more efficient. If I was the one working on this encoding, I would
    # have done a few other things differently too, e.g. I think you can allocate tokens more
    # efficiently across languages.
    pat_str = "|".join(
        [
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
            r"""\p{N}{1,3}""",
            r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
            r"""\s*[\r\n]+""",
            r"""\s+(?!\S)""",
            r"""\s+""",
        ]
    )
    return {
        "name": "o200k_base",
        "pat_str": pat_str,
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }


def o200k_harmony():
    base_enc = o200k_base()
    name = "o200k_harmony"
    pat_str = base_enc["pat_str"]
    mergeable_ranks = base_enc["mergeable_ranks"]
    special_tokens = {
        **base_enc["special_tokens"],
        "<|startoftext|>": 199998,
        "<|endoftext|>": 199999,
        "<|reserved_200000|>": 200000,
        "<|reserved_200001|>": 200001,
        "<|return|>": 200002,
        "<|constrain|>": 200003,
        "<|reserved_200004|>": 200004,
        "<|channel|>": 200005,
        "<|start|>": 200006,
        "<|end|>": 200007,
        "<|message|>": 200008,
        "<|reserved_200009|>": 200009,
        "<|reserved_200010|>": 200010,
        "<|reserved_200011|>": 200011,
        "<|call|>": 200012,
    } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
At line 100, <|endofprompt|> is already registered with the id 200018:

special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
However, at line 145, the code bulk-adds reserved tokens for every id in range(200013, 201088), which mistakenly includes the already-used id 200018:

} | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
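The collision slips through because dict union (|) deduplicates keys, not values, so two distinct names can silently map to the same id. A toy illustration (the literals below are chosen for this example, not taken from the PR):

# Toy example: dict union merges by key, so a duplicate *value* goes unnoticed.
explicit = {"<|endofprompt|>": 200018}
generated = {f"<|reserved_{i}|>": i for i in (200017, 200018)}
merged = explicit | generated
print(merged)
# {'<|endofprompt|>': 200018, '<|reserved_200017|>': 200017, '<|reserved_200018|>': 200018}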
Fix
- Remove <|reserved_200018|>, which duplicates <|endofprompt|> (both id=200018).
- When generating reserved_* tokens, skip ids already defined in special_tokens to prevent future collisions.
- Implemented in PR #458.
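A minimal, self-contained sketch of the skip-already-used approach (abbreviated, with illustrative variable names; see PR #458 for the actual change):

ENDOFTEXT = "<|endoftext|>"
ENDOFPROMPT = "<|endofprompt|>"

# Explicit specials, abbreviated for the sketch.
explicit = {
    ENDOFTEXT: 199999,
    ENDOFPROMPT: 200018,
    "<|call|>": 200012,
}
used_ids = set(explicit.values())
# Only generate reserved_* fillers for ids that are not already taken.
special_tokens = explicit | {
    f"<|reserved_{i}|>": i for i in range(200013, 201088) if i not in used_ids
}
assert len(set(special_tokens.values())) == len(special_tokens)  # ids are unique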
Tests
Add tests/test_token_ids_unique.py to enforce token-id uniqueness across all encodings:
- special token ids are unique (no two names share the same id);
- mergeable vocab ids are unique when _mergeable_ranks is exposed.
The test fails before this change and passes after.
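A sketch of the two checks (my reading of the description above, not necessarily the file's exact contents):

# tests/test_token_ids_unique.py (sketch)
import tiktoken

def test_special_token_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ids = list(enc._special_tokens.values())
        assert len(ids) == len(set(ids)), f"duplicate special token ids in {name}"

def test_mergeable_vocab_ids_unique():
    for name in tiktoken.list_encoding_names():
        enc = tiktoken.get_encoding(name)
        ranks = getattr(enc, "_mergeable_ranks", None)
        if ranks is None:  # only check encodings that expose the raw ranks
            continue
        ids = list(ranks.values())
        assert len(ids) == len(set(ids)), f"duplicate mergeable ids in {name}"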
Compatibility
No behavior change to encoding/decoding. Only removes the duplicate special token entry.
Environment
- tiktoken: 0.12.0
- Python: 3.13.7
- OS: macOS/Linux/Windows (reproducible)