
Qwen3-VL-8B: Significantly degraded visual understanding compared to LM Studio with same weights #856

@Hassan-A-K

Description


Using the same model weights (lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit), same image, same prompt, and same hardware (M5 Max 40C, 64GB), mlx-vlm produces dramatically worse visual understanding than LM Studio's mlx-engine.

Reproduction

Same engineering drawing with a visible As-Built stamp at 4 MP:

LM Studio (mlx-engine v1.4.0):

  • Detects stamp, reads OCR text ("REDLINE AS-BUILT ☐ CHANGE ☑ NO CHANGE"), finds redlines
  • features_detected: ["As-Built Stamp", "Change/No Change Checkbox", "Redline Annotations", "Material Substitution Notes"]
  • as_built_stamp_detected: true

mlx-vlm (v0.4.0) via vllm-mlx and Bodega:

  • Returns features_detected: []
  • as_built_stamp_detected: false
  • "No As-Built stamp or revision table found"

Accuracy on full dataset (53 files at 4 MP)

| Server | Underlying Engine | Accuracy |
|---|---|---|
| LM Studio | mlx-lm + VisionAddOn | 72.0% |
| vllm-mlx | mlx-vlm `generate()` | 52-55% |
| Bodega | mlx-vlm (assumed) | 41.5% |

Investigation

We confirmed the image preprocessing is identical: `pixel_values` shape (15600, 1536) and `image_grid_thw` match exactly between the HuggingFace processor and mlx-vlm's `prepare_inputs()`. The image reaches the model at full resolution in all cases.
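For reproducibility, this is the kind of check we ran to rule out preprocessing as the cause. The helper below is a sketch (the function name and dict keys follow the HuggingFace processor output convention; it is not part of either library's API) that compares the tensors produced by two pipelines:

```python
import numpy as np

def assert_preprocessing_match(hf_out, mlx_out, atol=1e-5):
    """Check that two preprocessing pipelines produced identical tensors.

    Expects dicts with "pixel_values" and "image_grid_thw" entries, as
    returned by the HuggingFace processor. Raises AssertionError on any
    mismatch in shape, values, or grid layout.
    """
    pv_a = np.asarray(hf_out["pixel_values"])
    pv_b = np.asarray(mlx_out["pixel_values"])
    assert pv_a.shape == pv_b.shape, f"pixel_values shape: {pv_a.shape} vs {pv_b.shape}"
    assert np.allclose(pv_a, pv_b, atol=atol), "pixel_values values diverge"

    grid_a = np.asarray(hf_out["image_grid_thw"])
    grid_b = np.asarray(mlx_out["image_grid_thw"])
    assert np.array_equal(grid_a, grid_b), f"image_grid_thw: {grid_a} vs {grid_b}"
```

Both pipelines pass this check, which is why we believe the divergence happens after the vision encoder.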

The difference appears to be in how the text generation pipeline handles visual tokens after the vision encoder. LM Studio uses a "VisionAddOn" architecture where mlx-vlm is only used for vision embedding extraction, while text generation always goes through mlx-lm. This separation apparently produces significantly better results.
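Conceptually, the VisionAddOn split means the vision embeddings are spliced into the text embedding sequence at the `<|image_pad|>` positions before generation. A minimal toy sketch of that splice step (function name, arguments, and the pad-id value in the test are illustrative, not LM Studio's or mlx-lm's actual API):

```python
import numpy as np

def splice_image_embeds(token_ids, text_embeds, image_embeds, image_pad_id):
    """Return text embeddings with <|image_pad|> rows replaced by vision features.

    token_ids:    1-D sequence of prompt token ids
    text_embeds:  (seq_len, dim) array of text embeddings
    image_embeds: (n_image_tokens, dim) array from the vision encoder
    """
    out = np.array(text_embeds, copy=True)
    pad_pos = np.flatnonzero(np.asarray(token_ids) == image_pad_id)
    assert len(pad_pos) == len(image_embeds), (
        f"{len(pad_pos)} pad tokens but {len(image_embeds)} vision embeddings"
    )
    out[pad_pos] = image_embeds  # inject vision features at the pad slots
    return out
```

If the two engines disagree on where (or how many) pad slots exist, the vision features land in the wrong positions, which would plausibly produce exactly this kind of silent accuracy loss.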

Specific areas that may differ:

  1. Chat template application — how vision tokens (<|vision_start|>, <|image_pad|>, <|vision_end|>) are placed in the prompt
  2. Deepstack visual embedding handling — Qwen3-VL injects intermediate vision features into early language model layers
  3. Sampling pipeline — mlx-lm has more mature sampling (logits processors, etc.)
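To make point 1 concrete, here is a sketch of how the vision tokens are expected to be placed in a Qwen-style chat prompt. The pad count formula assumes Qwen2-VL's convention of one `<|image_pad|>` per merged patch with `merge_size=2`; we have not confirmed Qwen3-VL uses the same value, so treat this as an assumption to verify:

```python
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"
IMAGE_PAD = "<|image_pad|>"

def build_vision_prompt(user_text, image_grid_thw, merge_size=2):
    """Place vision tokens in a Qwen-style chat prompt (sketch, not the real template).

    t*h*w grid positions collapse by merge_size**2, giving one <|image_pad|>
    per merged patch between <|vision_start|> and <|vision_end|>.
    """
    t, h, w = image_grid_thw
    n_pads = (t * h * w) // (merge_size ** 2)
    return (
        "<|im_start|>user\n"
        + VISION_START + IMAGE_PAD * n_pads + VISION_END
        + user_text
        + "<|im_end|>\n<|im_start|>assistant\n"
    )
```

Under this rule, the 15600-patch `pixel_values` above would correspond to 3900 pad tokens; a mismatch in this count between engines would shift every vision embedding out of position.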

Environment

  • mlx-vlm: 0.4.0
  • mlx: 0.31.1
  • transformers: 5.0.0rc3
  • Hardware: Apple M5 Max 40C, 64GB
  • Model: lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit
  • macOS 26.3

Expected Behavior

Same model weights should produce comparable visual understanding regardless of inference server.
