mlx_vlm.server returns 500 on GLM-OCR region crops due to UTF-8 decode in BPEStreamingDetokenizer.add_token #837

@anthonyarguedas


Summary

mlx_vlm.server still crashes with HTTP 500 on some GLM-OCR region crops due to strict UTF-8 decoding in BPEStreamingDetokenizer.add_token().

Error:

{"detail":"Generation failed: 'utf-8' codec can't decode byte 0xd9 in position 1: invalid continuation byte"}

Environment

  • mlx-vlm: 0.4.0
  • Python: 3.14.2
  • Platform: macOS arm64
  • Server command:
mlx_vlm.server --trust-remote-code --port 8099 --model mlx-community/GLM-OCR-bf16

Minimal repro

This reproduces a failing crop from page 1 of the GLM-OCR technical report:

import base64, requests
from io import BytesIO
import pypdfium2 as pdfium

pdf = "./2603.10910.pdf"  # https://arxiv.org/pdf/2603.10910
page_idx = 0
bbox = [371, 194, 626, 227]  # normalized 0-1000

# render + crop
p = pdfium.PdfDocument(pdf)
img = p[page_idx].render(scale=220/72).to_pil()
p.close()
w, h = img.size
x1, y1, x2, y2 = [int(bbox[0]*w/1000), int(bbox[1]*h/1000), int(bbox[2]*w/1000), int(bbox[3]*h/1000)]
crop = img.crop((x1, y1, x2, y2))

buf = BytesIO()
crop.save(buf, format="PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

payload = {
  "model": "mlx-community/GLM-OCR-bf16",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": data_url}},
      {"type": "text", "text": "Recognize the text in the image and output in Markdown format."}
    ]
  }],
  "max_tokens": 4096,
  "temperature": 0.01,
  "top_p": 0.00001,
  "top_k": 1,
  "repetition_penalty": 1.1,
}

r = requests.post("http://127.0.0.1:8099/v1/chat/completions", json=payload, timeout=180)
print(r.status_code)
print(r.text)

Observed:

  • status: 500
  • body includes UTF-8 decode exception

Suspected source

In mlx_vlm/tokenizer_utils.py, BPEStreamingDetokenizer.add_token() decodes without an error handler:

.decode("utf-8")

while finalize() already uses tolerant decoding (errors="ignore").

Request

Could this be fixed by making add_token() tolerant as well (e.g. errors="replace" or errors="ignore"), so that generation does not crash on these byte sequences?

This is currently a blocker for local GLM-OCR layout-region workflows.
