Diagnostics Report — 8 failure(s), 8 harness issue(s) (mlx-vlm 0.4.1)
(This issue is truncated, for the full report, see https://github.com/jrp2014/check_models/blob/main/src/output/diagnostics.md)
Summary
Automated benchmarking of 51 locally cached VLM models found 8 hard failures and 8 harness/integration issues, plus 1 preflight compatibility warning in otherwise-successful models. 43 of 51 models succeeded.
Test image: 20260321-182222_DSC09486_DxO.jpg (33.8 MB).
Action Summary
Quick triage list with likely owner and next action for each issue class.
- [Medium] [transformers] Failed to process inputs with error: can only concatenate str (not "NoneType") to str (1 model(s)). Next: verify API compatibility and pinned version floor.
- [Medium] [mlx-vlm] 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte (1 model(s)). Next: check processor/chat-template wiring and generation kwargs.
- [Medium] [mlx-vlm] 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte (1 model(s)). Next: check processor/chat-template wiring and generation kwargs.
- [Medium] [transformers] Failed to process inputs with error: Only returning PyTorch tensors is currently supp... (4 model(s)). Next: verify API compatibility and pinned version floor.
- [Medium] [model configuration/repository] Loaded processor has no image_processor; expected multimodal processor. (1 model(s)). Next: verify model config, tokenizer files, and revision alignment.
- [Medium] [mlx-vlm] Harness/integration warnings on 4 model(s). Next: check processor/chat-template wiring and generation kwargs.
- [Medium] [mlx-vlm / mlx] Harness/integration warnings on 2 model(s). Next: validate long-context handling and stop-token behavior across mlx-vlm + mlx runtime.
- [Medium] [model-config / mlx-vlm] Harness/integration warnings on 2 model(s). Next: validate chat-template/config expectations and mlx-vlm prompt formatting for this model.
- [Medium] [transformers / mlx-vlm] Stack-signal anomalies on 1 successful model(s). Next: verify API compatibility and pinned version floor.
- [Medium] [transformers] Preflight compatibility warnings (1 issue(s)). Next: verify API compatibility and pinned version floor.
Priority Summary
| Priority | Issue | Models Affected | Owner | Next Action |
|---|---|---|---|---|
| Medium | Failed to process inputs with error: can only concatenate str (not "N... | 1 (Florence-2-large-ft) | transformers | verify API compatibility and pinned version floor. |
| Medium | 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte | 1 (InternVL3-8B-bf16) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte | 1 (Molmo-7B-D-0924-bf16) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-27B-4bit) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-27B-mxfp8) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-35B-A3B-6bit) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-35B-A3B-bf16) | transformers | verify API compatibility and pinned version floor. |
| Medium | Loaded processor has no image_processor; expected multimodal processor. | 1 (deepseek-vl2-8bit) | model configuration/repository | verify model config, tokenizer files, and revision alignment. |
| Medium | Harness/integration | 4 (Phi-3.5-vision-instruct, Devstral-Small-2-24B-Instruct-2512-5bit, ERNIE-4.5-VL-28B-A3B-Thinking-bf16, Florence-2-large-ft) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | Harness/integration | 2 (Qwen3-VL-2B-Thinking-bf16, X-Reasoner-7B-8bit) | mlx-vlm / mlx | validate long-context handling and stop-token behavior across mlx-vlm + mlx runtime. |
| Medium | Harness/integration | 2 (Qwen2-VL-2B-Instruct-4bit, paligemma2-10b-ft-docci-448-bf16) | model-config / mlx-vlm | validate chat-template/config expectations and mlx-vlm prompt formatting for this model. |
| Medium | Stack-signal anomaly | 1 (Qwen3-VL-2B-Instruct) | transformers / mlx-vlm | verify API compatibility and pinned version floor. |
| Medium | Preflight compatibility warning | 1 issue(s) | transformers | verify API compatibility and pinned version floor. |
1. Failure affecting 1 model (Priority: Medium)
Observed behavior: Failed to process inputs with error: can only concatenate str (not "NoneType") to str
Owner (likely component): transformers
Suggested next action: verify API compatibility and pinned version floor.
Affected model: microsoft/Florence-2-large-ft
| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| microsoft/Florence-2-large-ft | Failed to process inputs with error: can only concatenate str (not "NoneType") to str | 2026-02-07 20:59:01 GMT | 3/3 recent runs failed |
To reproduce
- Repro command (exact run):
```shell
python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models microsoft/Florence-2-large-ft
```
Detailed trace logs (affected model)
microsoft/Florence-2-large-ft
Traceback:

```
Traceback (most recent call last):
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1019, in process_inputs_with_fallback
return process_inputs(
processor,
...<5 lines>...
**kwargs,
)
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1005, in process_inputs
return process_method(**args)
File "/Users/jrp/miniconda3/envs/mlx-vlm/lib/python3.13/site-packages/transformers/models/florence2/processing_florence2.py", line 185, in __call__
self.image_token * self.num_image_tokens
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ self.tokenizer.bos_token
^~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: can only concatenate str (not "NoneType") to str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 537, in stream_generate
inputs = prepare_inputs(
processor,
...<6 lines>...
**kwargs,
)
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1237, in prepare_inputs
inputs = process_inputs_with_fallback(
processor,
...<4 lines>...
**kwargs,
)
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1029, in process_inputs_with_fallback
raise ValueError(f"Failed to process inputs with error: {e}")
ValueError: Failed to process inputs with error: can only concatenate str (not "NoneType") to str

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for microsoft/Florence-2-large-ft: Failed to process inputs with error: can only concatenate str (not "NoneType") to str
```
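The `TypeError` comes from Florence-2's processor concatenating `self.tokenizer.bos_token`, which this tokenizer reports as `None`. A minimal sketch of the failing expression with a defensive fallback (a hypothetical helper, not the actual transformers code):

```python
def build_image_prompt(image_token: str, num_image_tokens: int, bos_token):
    # processing_florence2.py computes image_token * num_image_tokens + bos_token;
    # with bos_token=None that raises TypeError. Falling back to "" avoids the
    # crash, though the proper fix may be to ship a bos_token in the tokenizer config.
    return image_token * num_image_tokens + (bos_token or "")

prompt = build_image_prompt("<image>", 3, None)
assert prompt == "<image><image><image>"
```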
Captured stdout/stderr:
=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g']
Prompt: Analyze this image for cataloguing metadata, using British English.
Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.
Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.
Return exactly these three sections, and nothing else:
Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.
Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.
Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.
Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.
Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.
=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Fetching 10 files: 0%| | 0/10 [00:00<?, ?it/s]
Fetching 10 files: 100%|##########| 10/10 [00:00<00:00, 38444.58it/s]
Download complete: : 0.00B [00:00, ?B/s]
Download complete: : 0.00B [00:00, ?B/s]
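A separate harness oddity is visible in the captured stdout above: the `Files:` line prints the image path exploded into single characters, the classic symptom of passing a `str` where a list of paths is expected. A small illustration (variable names are ours, not the harness's):

```python
path = "/Users/jrp/Pictures/example.jpg"  # hypothetical path

# Iterating the string itself yields characters, matching the log output:
assert list(path)[:3] == ["/", "U", "s"]

# Wrapping the path in a list keeps it intact:
assert list([path]) == ["/Users/jrp/Pictures/example.jpg"]
```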
2. Failure affecting 1 model (Priority: Medium)
Observed behavior: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte
Owner (likely component): mlx-vlm
Suggested next action: check processor/chat-template wiring and generation kwargs.
Affected model: mlx-community/InternVL3-8B-bf16
| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/InternVL3-8B-bf16 | 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte | 2026-02-23 12:54:48 GMT | 2/3 recent runs failed |
To reproduce
- Repro command (exact run):
```shell
python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/InternVL3-8B-bf16
```
Detailed trace logs (affected model)
mlx-community/InternVL3-8B-bf16
Traceback:

```
Traceback (most recent call last):
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 596, in stream_generate
detokenizer.add_token(token, skip_special_token_ids=skip_special_token_ids)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/tokenizer_utils.py", line 232, in add_token
).decode("utf-8")
~~~~~~^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for mlx-community/InternVL3-8B-bf16: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte
```
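This `UnicodeDecodeError` is characteristic of decoding a multi-byte UTF-8 sequence from the middle: 0xab is a continuation byte, so it becomes an "invalid start byte" when a token boundary splits a character. A sketch of the usual remedy, buffering bytes through an incremental decoder (illustrative only; the actual fix in `tokenizer_utils.py` may differ):

```python
import codecs

data = "é".encode("utf-8")  # b"\xc3\xa9": one character, two bytes

# Decoding a partial sequence fails, as in the detokenizer:
try:
    data[:1].decode("utf-8")
except UnicodeDecodeError:
    pass

# An incremental decoder holds incomplete sequences until more bytes arrive:
dec = codecs.getincrementaldecoder("utf-8")()
text = dec.decode(data[:1]) + dec.decode(data[1:], final=True)
assert text == "é"
```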
Captured stdout/stderr:
=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g']
Prompt: User: <image>
Analyze this image for cataloguing metadata, using British English.
Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.
Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.
Return exactly these three sections, and nothing else:
Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.
Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.
Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.
Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.
Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.
Assistant:
感 Rencontre pestic Rencontre.ERR Rencontre enthus.ERR Rencontre醍racial pestic
=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Fetching 17 files: 0%| | 0/17 [00:00<?, ?it/s]
Fetching 17 files: 100%|##########| 17/17 [00:00<00:00, 9459.16it/s]
Download complete: : 0.00B [00:00, ?B/s]
Download complete: : 0.00B [00:00, ?B/s]
3. Failure affecting 1 model (Priority: Medium)
Observed behavior: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
Owner (likely component): mlx-vlm
Suggested next action: check processor/chat-template wiring and generation kwargs.
Affected model: mlx-community/Molmo-7B-D-0924-bf16
| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/Molmo-7B-D-0924-bf16 | 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte | 2026-03-22 01:27:09 GMT | 2/3 recent runs failed |
To reproduce
- Repro command (exact run):
```shell
python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/Molmo-7B-D-0924-bf16
```
Detailed trace logs (affected model)
mlx-community/Molmo-7B-D-0924-bf16
Traceback:

```
Traceback (most recent call last):
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 596, in stream_generate
detokenizer.add_token(token, skip_special_token_ids=skip_special_token_ids)
~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/tokenizer_utils.py", line 232, in add_token
).decode("utf-8")
~~~~~~^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for mlx-community/Molmo-7B-D-0924-bf16: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
```
Captured stdout/stderr:
=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g']
Prompt: Analyze this image for cataloguing metadata, using British English.
Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.
Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.
Return exactly these three sections, and nothing else:
Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.
Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.
Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.
Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.
Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.
=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Fetching 18 files: 0%| | 0/18 [00:00<?, ?it/s]
Fetching 18 files: 100%|##########| 18/18 [00:00<00:00, 11023.14it/s]
Download complete: : 0.00B [00:00, ?B/s]
Download complete: : 0.00B [00:00, ?B/s]
4. Failure affecting 1 model (Priority: Medium)
Observed behavior: Failed to process inputs with error: Only returning PyTorch tensors is currently supported.
Owner (likely component): transformers
Suggested next action: verify API compatibility and pinned version floor.
Affected model: mlx-community/Qwen3.5-27B-4bit
| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/Qwen3.5-27B-4bit | Failed to process inputs with error: Only returning PyTorch tensors is currently supported. | 2026-03-22 01:27:09 GMT | 3/3 recent runs failed |
To reproduce
- Repro command (exact run):
```shell
python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/Qwen3.5-27B-4bit
```
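The four Qwen3.5 failures suggest the model's processor accepts only `return_tensors="pt"` while the harness requests another tensor format. A hedged fallback sketch, converting PyTorch output to NumPy for the MLX pipeline (function and stub names are ours, not mlx-vlm's API; the stub stands in for the real processor):

```python
import numpy as np

def to_numpy_inputs(process, **kwargs):
    """Request framework-neutral arrays; fall back to converting PyTorch output."""
    try:
        return process(return_tensors="np", **kwargs)
    except ValueError:
        # Processor only supports return_tensors="pt"; convert after the fact.
        pt_inputs = process(return_tensors="pt", **kwargs)
        return {k: np.asarray(v) for k, v in pt_inputs.items()}

# Stub mimicking a processor that insists on PyTorch tensors:
def strict_processor(return_tensors=None, **kwargs):
    if return_tensors != "pt":
        raise ValueError("Only returning PyTorch tensors is currently supported.")
    return {"input_ids": [[1, 2, 3]]}  # stands in for a torch.Tensor

inputs = to_numpy_inputs(strict_processor)
assert isinstance(inputs["input_ids"], np.ndarray)
```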
Detailed trace logs (affected model)
mlx-community/Qwen3.5-27B-4bit
Traceback:

```
Traceback (most recent call last):
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1019, in process_inputs_with_fallback
return process_inputs(
processor,
...<5 lines>...
**kwargs,
)
File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1005, in process_inputs
```
... Etc. Further details at https://github.com/jrp2014/check_models/blob/main/src/output/diagnostics.md