mlx_vlm.server returns 500 on GLM-OCR region crops due to UTF-8 decode in BPEStreamingDetokenizer.add_token #837

@anthonyarguedas


Summary

mlx_vlm.server still crashes with HTTP 500 on some GLM-OCR region crops due to strict UTF-8 decoding in BPEStreamingDetokenizer.add_token().

Error:

{"detail":"Generation failed: 'utf-8' codec can't decode byte 0xd9 in position 1: invalid continuation byte"}

Environment

  • mlx-vlm: 0.4.0
  • Python: 3.14.2
  • Platform: macOS arm64
  • Server command:
mlx_vlm.server --trust-remote-code --port 8099 --model mlx-community/GLM-OCR-bf16

Minimal repro

This reproduces a failing crop from page 1 of the GLM-OCR technical report:

import base64, requests
from io import BytesIO
import pypdfium2 as pdfium

pdf = "./2603.10910.pdf"  # https://arxiv.org/pdf/2603.10910
page_idx = 0
bbox = [371, 194, 626, 227]  # normalized 0-1000

# render + crop
p = pdfium.PdfDocument(pdf)
img = p[page_idx].render(scale=220/72).to_pil()
p.close()
w, h = img.size
x1, y1, x2, y2 = [int(bbox[0]*w/1000), int(bbox[1]*h/1000), int(bbox[2]*w/1000), int(bbox[3]*h/1000)]
crop = img.crop((x1, y1, x2, y2))

buf = BytesIO()
crop.save(buf, format="PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

payload = {
  "model": "mlx-community/GLM-OCR-bf16",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": data_url}},
      {"type": "text", "text": "Recognize the text in the image and output in Markdown format."}
    ]
  }],
  "max_tokens": 4096,
  "temperature": 0.01,
  "top_p": 0.00001,
  "top_k": 1,
  "repetition_penalty": 1.1,
}

r = requests.post("http://127.0.0.1:8099/v1/chat/completions", json=payload, timeout=180)
print(r.status_code)
print(r.text)

Observed:

  • status: 500
  • body includes UTF-8 decode exception

Suspected source

In mlx_vlm/tokenizer_utils.py, BPEStreamingDetokenizer.add_token() decodes without an error handler:

.decode("utf-8")

while finalize() already uses tolerant decoding (errors="ignore").

Request

Could this be fixed by making add_token() tolerant as well (e.g. errors="replace" or errors="ignore"), so that generation does not crash on these byte sequences?

This is currently a blocker for local GLM-OCR layout-region workflows.
