20260322 -- Diagnostic report #852

@jrp2014

Diagnostics Report — 8 failure(s), 8 harness issue(s) (mlx-vlm 0.4.1)

(This issue is truncated; for the full report, see https://github.com/jrp2014/check_models/blob/main/src/output/diagnostics.md)

Summary

Automated benchmarking of 51 locally cached VLM models found 8 hard failures and 8 harness/integration issues, plus 1 preflight compatibility warning in successful models. 43 of 51 models succeeded.

Test image: 20260321-182222_DSC09486_DxO.jpg (33.8 MB).


Action Summary

Quick triage list with likely owner and next action for each issue class.

  • [Medium] [transformers] Failed to process inputs with error: can only concatenate str (not "NoneType") to str (1 model(s)). Next: verify API compatibility and pinned version floor.
  • [Medium] [mlx-vlm] 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte (1 model(s)). Next: check processor/chat-template wiring and generation kwargs.
  • [Medium] [mlx-vlm] 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte (1 model(s)). Next: check processor/chat-template wiring and generation kwargs.
  • [Medium] [transformers] Failed to process inputs with error: Only returning PyTorch tensors is currently supp... (1 model(s)). Next: verify API compatibility and pinned version floor.
  • [Medium] [transformers] Failed to process inputs with error: Only returning PyTorch tensors is currently supp... (1 model(s)). Next: verify API compatibility and pinned version floor.
  • [Medium] [transformers] Failed to process inputs with error: Only returning PyTorch tensors is currently supp... (1 model(s)). Next: verify API compatibility and pinned version floor.
  • [Medium] [transformers] Failed to process inputs with error: Only returning PyTorch tensors is currently supp... (1 model(s)). Next: verify API compatibility and pinned version floor.
  • [Medium] [model configuration/repository] Loaded processor has no image_processor; expected multimodal processor. (1 model(s)). Next: verify model config, tokenizer files, and revision alignment.
  • [Medium] [mlx-vlm] Harness/integration warnings on 4 model(s). Next: check processor/chat-template wiring and generation kwargs.
  • [Medium] [mlx-vlm / mlx] Harness/integration warnings on 2 model(s). Next: validate long-context handling and stop-token behavior across mlx-vlm + mlx runtime.
  • [Medium] [model-config / mlx-vlm] Harness/integration warnings on 2 model(s). Next: validate chat-template/config expectations and mlx-vlm prompt formatting for this model.
  • [Medium] [transformers / mlx-vlm] Stack-signal anomalies on 1 successful model(s). Next: verify API compatibility and pinned version floor.
  • [Medium] [transformers] Preflight compatibility warnings (1 issue(s)). Next: verify API compatibility and pinned version floor.
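Several entries in the triage list above share an owner and an error message (the four Qwen3.5 entries, for example). A minimal sketch of how such entries could be grouped for triage is shown here. The record layout and the `group_failures` helper are assumptions for illustration and are not part of check_models; the messages and model names are copied from this report.

```python
from collections import defaultdict

# Hypothetical records: (owner, error message, model), taken from this report.
PYTORCH_ONLY = "Only returning PyTorch tensors is currently supported."
failures = [
    ("transformers", 'can only concatenate str (not "NoneType") to str', "Florence-2-large-ft"),
    ("transformers", PYTORCH_ONLY, "Qwen3.5-27B-4bit"),
    ("transformers", PYTORCH_ONLY, "Qwen3.5-27B-mxfp8"),
    ("transformers", PYTORCH_ONLY, "Qwen3.5-35B-A3B-6bit"),
    ("transformers", PYTORCH_ONLY, "Qwen3.5-35B-A3B-bf16"),
]


def group_failures(records):
    """Group (owner, message, model) records by (owner, message)."""
    grouped = defaultdict(list)
    for owner, message, model in records:
        grouped[(owner, message)].append(model)
    return dict(grouped)


grouped = group_failures(failures)
for (owner, message), models in grouped.items():
    print(f"[{owner}] {message} ({len(models)} model(s)): {', '.join(models)}")
```

Grouping on the full message rather than the truncated one avoids merging distinct errors that happen to share a prefix.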

Priority Summary

| Priority | Issue | Models Affected | Owner | Next Action |
|---|---|---|---|---|
| Medium | Failed to process inputs with error: can only concatenate str (not "N... | 1 (Florence-2-large-ft) | transformers | verify API compatibility and pinned version floor. |
| Medium | 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte | 1 (InternVL3-8B-bf16) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte | 1 (Molmo-7B-D-0924-bf16) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-27B-4bit) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-27B-mxfp8) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-35B-A3B-6bit) | transformers | verify API compatibility and pinned version floor. |
| Medium | Failed to process inputs with error: Only returning PyTorch tensors i... | 1 (Qwen3.5-35B-A3B-bf16) | transformers | verify API compatibility and pinned version floor. |
| Medium | Loaded processor has no image_processor; expected multimodal processor. | 1 (deepseek-vl2-8bit) | model configuration/repository | verify model config, tokenizer files, and revision alignment. |
| Medium | Harness/integration | 4 (Phi-3.5-vision-instruct, Devstral-Small-2-24B-Instruct-2512-5bit, ERNIE-4.5-VL-28B-A3B-Thinking-bf16, Florence-2-large-ft) | mlx-vlm | check processor/chat-template wiring and generation kwargs. |
| Medium | Harness/integration | 2 (Qwen3-VL-2B-Thinking-bf16, X-Reasoner-7B-8bit) | mlx-vlm / mlx | validate long-context handling and stop-token behavior across mlx-vlm + mlx runtime. |
| Medium | Harness/integration | 2 (Qwen2-VL-2B-Instruct-4bit, paligemma2-10b-ft-docci-448-bf16) | model-config / mlx-vlm | validate chat-template/config expectations and mlx-vlm prompt formatting for this model. |
| Medium | Stack-signal anomaly | 1 (Qwen3-VL-2B-Instruct) | transformers / mlx-vlm | verify API compatibility and pinned version floor. |
| Medium | Preflight compatibility warning | 1 issue(s) | transformers | verify API compatibility and pinned version floor. |

1. Failure affecting 1 model (Priority: Medium)

Observed behavior: Failed to process inputs with error: can only concatenate str (not "NoneType") to str
Owner (likely component): transformers
Suggested next action: verify API compatibility and pinned version floor.
Affected model: microsoft/Florence-2-large-ft

| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| microsoft/Florence-2-large-ft | Failed to process inputs with error: can only concatenate str (not "NoneType") to str | 2026-02-07 20:59:01 GMT | 3/3 recent runs failed |

To reproduce

  • Repro command (exact run): python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models microsoft/Florence-2-large-ft

Detailed trace logs (affected model)

microsoft/Florence-2-large-ft

Traceback:

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1019, in process_inputs_with_fallback
    return process_inputs(
        processor,
    ...<5 lines>...
        **kwargs,
    )
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1005, in process_inputs
    return process_method(**args)
  File "/Users/jrp/miniconda3/envs/mlx-vlm/lib/python3.13/site-packages/transformers/models/florence2/processing_florence2.py", line 185, in __call__
    self.image_token * self.num_image_tokens
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + self.tokenizer.bos_token
    ^~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: can only concatenate str (not "NoneType") to str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 537, in stream_generate
    inputs = prepare_inputs(
        processor,
    ...<6 lines>...
        **kwargs,
    )
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1237, in prepare_inputs
    inputs = process_inputs_with_fallback(
        processor,
    ...<4 lines>...
        **kwargs,
    )
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1029, in process_inputs_with_fallback
    raise ValueError(f"Failed to process inputs with error: {e}")
ValueError: Failed to process inputs with error: can only concatenate str (not "NoneType") to str

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for microsoft/Florence-2-large-ft: Failed to process inputs with error: can only concatenate str (not "NoneType") to str
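The root `TypeError` in the traceback above comes from concatenating a string with `self.tokenizer.bos_token` when that attribute is `None`. The following is a minimal sketch that reproduces the same message without loading any model. `StubTokenizer`, `build_prefix`, and the `"<image>"` placeholder token are hypothetical stand-ins, not the transformers implementation, and the explicit check is only one possible way to surface the problem earlier.

```python
class StubTokenizer:
    """Hypothetical stand-in for a tokenizer whose bos_token is unset."""

    def __init__(self, bos_token=None):
        self.bos_token = bos_token


def build_prefix(image_token, num_image_tokens, tokenizer):
    """Mirror the expression in the traceback, with an explicit None check."""
    if tokenizer.bos_token is None:
        raise ValueError(
            "tokenizer.bos_token is None; check the tokenizer's special-token configuration"
        )
    return image_token * num_image_tokens + tokenizer.bos_token


# Reproduce the raw error: str + None raises TypeError.
try:
    "<image>" * 2 + StubTokenizer().bos_token
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# With a BOS token set, the same expression succeeds.
print(build_prefix("<image>", 2, StubTokenizer(bos_token="<s>")))
```

Whether the missing `bos_token` originates in the model repository's tokenizer files or in how the tokenizer is loaded is not determined by this trace.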

Captured stdout/stderr:

=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g'] 

Prompt: Analyze this image for cataloguing metadata, using British English.

Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.

Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.

Return exactly these three sections, and nothing else:

Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.

Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.

Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.

Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.

Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.

=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]
Fetching 10 files: 100%|##########| 10/10 [00:00<00:00, 38444.58it/s]

Download complete: : 0.00B [00:00, ?B/s]              
Download complete: : 0.00B [00:00, ?B/s]
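The `Files:` line in the captured stdout above shows the image path split into single characters, which is what Python produces when a bare string is iterated as if it were a list of paths. The log does not show where this happens, so the sketch below only illustrates the symptom and one defensive normalisation; `normalise_image_paths` is a hypothetical helper, not a function from mlx-vlm or check_models.

```python
import os


def normalise_image_paths(image):
    """Return a list of path strings from None, a single path, or an iterable of paths."""
    if image is None:
        return []
    if isinstance(image, (str, os.PathLike)):
        # A bare string is one path, not a sequence of characters.
        return [os.fspath(image)]
    return [os.fspath(item) for item in image]


path = "/tmp/example.jpg"  # placeholder path for illustration
print(list(path)[:5])               # iterating the string yields characters
print(normalise_image_paths(path))  # a single-element list with the whole path
```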

2. Failure affecting 1 model (Priority: Medium)

Observed behavior: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte
Owner (likely component): mlx-vlm
Suggested next action: check processor/chat-template wiring and generation kwargs.
Affected model: mlx-community/InternVL3-8B-bf16

| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/InternVL3-8B-bf16 | 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte | 2026-02-23 12:54:48 GMT | 2/3 recent runs failed |

To reproduce

  • Repro command (exact run): python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/InternVL3-8B-bf16

Detailed trace logs (affected model)

mlx-community/InternVL3-8B-bf16

Traceback:

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 596, in stream_generate
    detokenizer.add_token(token, skip_special_token_ids=skip_special_token_ids)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/tokenizer_utils.py", line 232, in add_token
    ).decode("utf-8")
      ~~~~~~^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for mlx-community/InternVL3-8B-bf16: 'utf-8' codec can't decode byte 0xab in position 10: invalid start byte
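The `UnicodeDecodeError` above is raised by a strict `.decode("utf-8")` call on bytes that are not valid UTF-8. Byte `0xab` is a UTF-8 continuation byte, so it cannot start a character. A minimal sketch that reproduces the same message with a made-up byte string, and shows how the standard `errors="replace"` handler behaves, follows. The byte string is an assumption chosen to match the reported position; it is not the model's actual output.

```python
# Ten ASCII bytes followed by a stray continuation byte (hypothetical input).
raw = b"0123456789\xab"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)
    print(exc)

# A lenient decode substitutes U+FFFD for the invalid byte instead of raising.
lenient = raw.decode("utf-8", errors="replace")
print(lenient)
```

A lenient decode would hide the crash but not explain it. The garbled generation captured in this section's stdout suggests the token stream itself may be wrong, which the suggested next action would need to confirm.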

Captured stdout/stderr:

=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g'] 

Prompt: User: <image>
Analyze this image for cataloguing metadata, using British English.

Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.

Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.

Return exactly these three sections, and nothing else:

Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.

Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.

Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.

Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.

Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.
Assistant:

感 Rencontre pestic Rencontre.ERR Rencontre enthus.ERR Rencontre醍racial pestic

=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]
Fetching 17 files: 100%|##########| 17/17 [00:00<00:00, 9459.16it/s]

Download complete: : 0.00B [00:00, ?B/s]              
Download complete: : 0.00B [00:00, ?B/s]

3. Failure affecting 1 model (Priority: Medium)

Observed behavior: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
Owner (likely component): mlx-vlm
Suggested next action: check processor/chat-template wiring and generation kwargs.
Affected model: mlx-community/Molmo-7B-D-0924-bf16

| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/Molmo-7B-D-0924-bf16 | 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte | 2026-03-22 01:27:09 GMT | 2/3 recent runs failed |

To reproduce

  • Repro command (exact run): python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/Molmo-7B-D-0924-bf16

Detailed trace logs (affected model)

mlx-community/Molmo-7B-D-0924-bf16

Traceback:

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 694, in generate
    for response in stream_generate(model, processor, prompt, image, audio, **kwargs):
                    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/generate.py", line 596, in stream_generate
    detokenizer.add_token(token, skip_special_token_ids=skip_special_token_ids)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/tokenizer_utils.py", line 232, in add_token
    ).decode("utf-8")
      ~~~~~~^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
ValueError: Model generation failed for mlx-community/Molmo-7B-D-0924-bf16: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
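One way an "invalid start byte" at position 0 can arise is when a multi-byte character is split across two byte-level tokens and each piece is decoded on its own. Whether that is what happens here is an untested hypothesis. The sketch below uses Python's standard incremental decoder to show the difference. The chunk boundaries and the `decode_stream` helper are made up for illustration and do not reflect the mlx-vlm detokenizer.

```python
import codecs


def decode_stream(chunks):
    """Decode byte chunks incrementally, buffering incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any buffered bytes
    return "".join(pieces)


encoded = "¡".encode("utf-8")  # two bytes: 0xc2 0xa1
chunks = [encoded[:1], encoded[1:], b"Hola"]

# Decoding the second chunk alone fails at position 0, as in the report.
try:
    chunks[1].decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)
    print(exc)

# Incremental decoding keeps the first byte until the character is complete.
print(decode_stream(chunks))
```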

Captured stdout/stderr:

=== STDOUT ===
==========
Files: ['/', 'U', 's', 'e', 'r', 's', '/', 'j', 'r', 'p', '/', 'P', 'i', 'c', 't', 'u', 'r', 'e', 's', '/', 'P', 'r', 'o', 'c', 'e', 's', 's', 'e', 'd', '/', '2', '0', '2', '6', '0', '3', '2', '1', '-', '1', '8', '2', '2', '2', '2', '_', 'D', 'S', 'C', '0', '9', '4', '8', '6', '_', 'D', 'x', 'O', '.', 'j', 'p', 'g'] 

Prompt: Analyze this image for cataloguing metadata, using British English.

Use only details that are clearly and definitely visible in the image. If a detail is uncertain, ambiguous, partially obscured, too small to verify, or not directly visible, leave it out. Do not guess.

Treat the metadata hints below as a draft catalog record. Keep only details that are clearly confirmed by the image, correct anything contradicted by the image, and add important visible details that are definitely present.

Return exactly these three sections, and nothing else:

Title:
- 5-10 words, concrete and factual, limited to clearly visible content.
- Output only the title text after the label.
- Do not repeat or paraphrase these instructions in the title.

Description:
- 1-2 factual sentences describing the main visible subject, setting, lighting, action, and other distinctive visible details. Omit anything uncertain or inferred.
- Output only the description text after the label.

Keywords:
- 10-18 unique comma-separated terms based only on clearly visible subjects, setting, colors, composition, and style. Omit uncertain tags rather than guessing.
- Output only the keyword list after the label.

Rules:
- Include only details that are definitely visible in the image.
- Reuse metadata terms only when they are clearly supported by the image.
- If metadata and image disagree, follow the image.
- Prefer omission to speculation.
- Do not copy prompt instructions into the Title, Description, or Keywords fields.
- Do not infer identity, location, event, brand, species, time period, or intent unless visually obvious.
- Do not output reasoning, notes, hedging, or extra sections.

Context: Existing metadata hints (high confidence; use only when visually confirmed):
- Description hint: Pedestrians cross a footbridge over a canal at dusk in a vibrant urban waterside area. A modern glass building reflects the golden light of the setting sun against a purple twilight sky, while people walk along the towpath, relax on the bank, and socialize at a nearby restaurant. Moored boats line the canal, completing the lively evening scene as people go about their daily lives, commuting or enjoying leisure time.
- Capture metadata: Taken on 2026-03-21 18:22:22 GMT (at 18:22:22 local time). GPS: 51.536500°N, 0.126500°W.

=== STDERR ===
Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 18 files:   0%|          | 0/18 [00:00<?, ?it/s]
Fetching 18 files: 100%|##########| 18/18 [00:00<00:00, 11023.14it/s]

Download complete: : 0.00B [00:00, ?B/s]              
Download complete: : 0.00B [00:00, ?B/s]

4. Failure affecting 1 model (Priority: Medium)

Observed behavior: Failed to process inputs with error: Only returning PyTorch tensors is currently supported.
Owner (likely component): transformers
Suggested next action: verify API compatibility and pinned version floor.
Affected model: mlx-community/Qwen3.5-27B-4bit

| Model | Observed Behavior | First Seen Failing | Recent Repro |
|---|---|---|---|
| mlx-community/Qwen3.5-27B-4bit | Failed to process inputs with error: Only returning PyTorch tensors is currently supported. | 2026-03-22 01:27:09 GMT | 3/3 recent runs failed |

To reproduce

  • Repro command (exact run): python -m check_models --image /Users/jrp/Pictures/Processed/20260321-182222_DSC09486_DxO.jpg --trust-remote-code --max-tokens 500 --temperature 0.0 --top-p 1.0 --repetition-context-size 20 --prefill-step-size 4096 --timeout 300.0 --verbose --models mlx-community/Qwen3.5-27B-4bit

Detailed trace logs (affected model)

mlx-community/Qwen3.5-27B-4bit

Traceback:

Traceback (most recent call last):
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1019, in process_inputs_with_fallback
    return process_inputs(
        processor,
    ...<5 lines>...
        **kwargs,
    )
  File "/Users/jrp/Documents/AI/mlx/mlx-vlm/mlx_vlm/utils.py", line 1005, in process_inputs

... Etc. Further details at https://github.com/jrp2014/check_models/blob/main/src/output/diagnostics.md
