
bug: o200k_harmony duplicates special token id 200018 #457

@jinzhuer

Summary

In o200k_harmony, two special token names share the same token id 200018:

  • <|endofprompt|> → 200018
  • <|reserved_200018|> → 200018 (conflict)

Token ids must be unique within an encoding.
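One way to see why uniqueness matters: anything that needs an id-to-name lookup has to invert the name-to-id mapping, and that inversion silently drops a name when two names share an id. A minimal sketch with a small hand-written table (hypothetical, and not tiktoken's actual decoder):

```python
# Hypothetical three-entry table mirroring the collision; not loaded from tiktoken.
special_tokens = {
    "<|endoftext|>": 199999,
    "<|endofprompt|>": 200018,
    "<|reserved_200018|>": 200018,
}

# Inverting name -> id into id -> name keeps only one name per id
# (the later entry overwrites the earlier one).
id_to_name = {tid: name for name, tid in special_tokens.items()}

print(len(special_tokens), len(id_to_name))  # 3 2
print(id_to_name[200018])  # <|reserved_200018|>
```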


Reproduction

import tiktoken
from collections import defaultdict

print(tiktoken.__version__)  # expect 0.12.0

enc = tiktoken.get_encoding("o200k_harmony")
sp = enc._special_tokens
print(sp)  # shows '<|endofprompt|>': 200018 and '<|reserved_200018|>': 200018

# Optional: explicit duplicate-id check
id2names = defaultdict(list)
for name, tid in sp.items():
    id2names[tid].append(name)
dups = {tid: names for tid, names in id2names.items() if len(names) > 1}
print(dups)
# -> {200018: ['<|endofprompt|>', '<|reserved_200018|>']}

Actual

<|reserved_200018|> duplicates <|endofprompt|> (both id=200018).

Expected

No two special token names share the same token id.

Root Cause

tiktoken_ext/openai_public.py bulk-generates reserved_* specials for [200013, 201088) without excluding ids already used by explicit specials, introducing <|reserved_200018|> as a duplicate of <|endofprompt|>.

  • This is the full code segment where the bug is introduced:

    def o200k_base():
        mergeable_ranks = load_tiktoken_bpe(
            "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
            expected_hash="446a9538cb6c348e3516120d7c08b09f57c36495e2acfffe59a5bf8b0cfb1a2d",
        )
        special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}
        # This regex could be made more efficient. If I was the one working on this encoding, I would
        # have done a few other things differently too, e.g. I think you can allocate tokens more
        # efficiently across languages.
        pat_str = "|".join(
            [
                r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
                r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
                r"""\p{N}{1,3}""",
                r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
                r"""\s*[\r\n]+""",
                r"""\s+(?!\S)""",
                r"""\s+""",
            ]
        )
        return {
            "name": "o200k_base",
            "pat_str": pat_str,
            "mergeable_ranks": mergeable_ranks,
            "special_tokens": special_tokens,
        }

    def o200k_harmony():
        base_enc = o200k_base()
        name = "o200k_harmony"
        pat_str = base_enc["pat_str"]
        mergeable_ranks = base_enc["mergeable_ranks"]
        special_tokens = {
            **base_enc["special_tokens"],
            "<|startoftext|>": 199998,
            "<|endoftext|>": 199999,
            "<|reserved_200000|>": 200000,
            "<|reserved_200001|>": 200001,
            "<|return|>": 200002,
            "<|constrain|>": 200003,
            "<|reserved_200004|>": 200004,
            "<|channel|>": 200005,
            "<|start|>": 200006,
            "<|end|>": 200007,
            "<|message|>": 200008,
            "<|reserved_200009|>": 200009,
            "<|reserved_200010|>": 200010,
            "<|reserved_200011|>": 200011,
            "<|call|>": 200012,
        } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}

  • At line 100, <|endofprompt|> is already registered with the id 200018:

    special_tokens = {ENDOFTEXT: 199999, ENDOFPROMPT: 200018}

  • However, at line 145, the code adds every id in the half-open range [200013, 201088) (that is, 200013 through 201087) as a reserved token, which mistakenly includes the already-used id 200018:

    } | {f"<|reserved_{i}|>": i for i in range(200013, 201088)}
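The same pattern can be reproduced at a smaller scale without loading the real encoding. The sketch below uses shrunken, made-up ids and token tables (assumptions for illustration only). It shows why the merge does not catch the problem: dict union deduplicates keys (names), not values (ids).

```python
# Scaled-down stand-in for the o200k_harmony construction; ids are made up.
base_special = {"<|endoftext|>": 9, "<|endofprompt|>": 18}
explicit = {**base_special, "<|call|>": 12}

# Same pattern as the original: reserve a whole half-open range with no exclusion.
reserved = {f"<|reserved_{i}|>": i for i in range(13, 21)}

# Merging only deduplicates names, so the shared id goes unnoticed.
# ({**a, **b} behaves like a | b for plain dicts.)
merged = {**explicit, **reserved}

names_for_18 = [name for name, tid in merged.items() if tid == 18]
print(names_for_18)  # ['<|endofprompt|>', '<|reserved_18|>']
```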

Fix

  • Remove <|reserved_200018|>, which duplicates <|endofprompt|> (both id=200018).
  • When generating reserved_*, skip ids already defined in special_tokens to prevent future collisions.
  • Implemented in PR #458.
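A minimal sketch of the "skip ids already defined" approach, assuming a hypothetical helper named `with_reserved`. This is an illustration of the idea, not necessarily how PR #458 implements it:

```python
def with_reserved(explicit, start, stop):
    """Return `explicit` plus <|reserved_i|> entries for unused ids in [start, stop).

    Hypothetical helper for illustration; not part of tiktoken.
    """
    used = set(explicit.values())
    reserved = {
        f"<|reserved_{i}|>": i
        for i in range(start, stop)
        if i not in used  # skip ids already claimed by an explicit special token
    }
    return {**explicit, **reserved}


explicit = {"<|endofprompt|>": 200018, "<|call|>": 200012}
tokens = with_reserved(explicit, 200013, 200021)

print("<|reserved_200018|>" in tokens)  # False
print(len(tokens))  # 9
```

Filtering on the set of used ids, rather than hard-coding 200018, also covers any explicit special token added inside the reserved range later.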

Tests

Add tests/test_token_ids_unique.py to enforce token-id uniqueness across all encodings:

  • special token ids are unique (no two names share the same id);
  • mergeable vocab ids are unique when _mergeable_ranks is exposed.

The test fails before this change and passes after.
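The core check behind such a test could look roughly like the following. The helper name `duplicate_ids` and the inline tables are assumptions for illustration; a real test would presumably run the same check against each encoding's special tokens instead of hand-written dicts.

```python
from collections import Counter


def duplicate_ids(special_tokens):
    """Return the sorted ids that are assigned to more than one token name."""
    counts = Counter(special_tokens.values())
    return sorted(tid for tid, n in counts.items() if n > 1)


def test_special_token_ids_unique():
    # Stand-in table; not loaded from tiktoken.
    fixed = {"<|endoftext|>": 199999, "<|endofprompt|>": 200018}
    assert duplicate_ids(fixed) == []


# The pre-fix shape of the table is reported as a collision:
broken = {
    "<|endoftext|>": 199999,
    "<|endofprompt|>": 200018,
    "<|reserved_200018|>": 200018,
}
print(duplicate_ids(broken))  # [200018]
```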

Compatibility

No behavior change to encoding/decoding. Only removes the duplicate special token entry.

Environment

  • tiktoken version: 0.12.0
  • Python: 3.13.7
  • OS: macOS/Linux/Windows (reproducible)
