Description
Using the same model weights (lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit), same image, same prompt, and same hardware (M5 Max 40C, 64GB), mlx-vlm produces dramatically worse visual understanding than LM Studio's mlx-engine.
Reproduction
Same engineering drawing with a visible As-Built stamp at 4 MP:
LM Studio (mlx-engine v1.4.0):
- Detects the stamp, reads the OCR text ("REDLINE AS-BUILT ☐ CHANGE ☑ NO CHANGE"), and finds redlines
- `features_detected: ["As-Built Stamp", "Change/No Change Checkbox", "Redline Annotations", "Material Substitution Notes"]`
- `as_built_stamp_detected: true`
mlx-vlm (v0.4.0) via vllm-mlx and Bodega:
- Returns `features_detected: []` and `as_built_stamp_detected: false`
- Reports "No As-Built stamp or revision table found"
Accuracy on full dataset (53 files at 4 MP)
| Server | Underlying Engine | Accuracy |
|---|---|---|
| LM Studio | mlx-lm + VisionAddOn | 72.0% |
| vllm-mlx | mlx-vlm generate() | 52-55% |
| Bodega | mlx-vlm (assumed) | 41.5% |
Investigation
We confirmed that image preprocessing is identical: the `pixel_values` shape (15600, 1536) and `image_grid_thw` match exactly between the HuggingFace processor and mlx-vlm's `prepare_inputs()`. The image reaches the model at full resolution in all cases.
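A quick cross-check of this parity is to verify that the patch count implied by `image_grid_thw` matches the number of `pixel_values` rows, and to derive the expected number of `<|image_pad|>` placeholder tokens from it. A minimal sketch, assuming the Qwen-family convention of a 2×2 spatial merge (`merge_size=2`); the example grid `(1, 120, 130)` is hypothetical, chosen only because it yields the observed 15600 patches:

```python
def expected_image_pad_tokens(pixel_rows, grid_thw, merge_size=2):
    """Return the placeholder-token count implied by a Qwen-style vision grid.

    pixel_rows: number of rows in pixel_values (one row per vision patch).
    grid_thw:   (temporal, height, width) patch grid from the processor.
    merge_size: spatial merge factor (2 is the Qwen default; an assumption here).
    """
    t, h, w = grid_thw
    n_patches = t * h * w
    if n_patches != pixel_rows:
        raise ValueError(
            f"grid {grid_thw} implies {n_patches} patches, "
            f"but pixel_values has {pixel_rows} rows"
        )
    return n_patches // (merge_size ** 2)

# With the shapes observed above (grid layout hypothetical):
print(expected_image_pad_tokens(15600, (1, 120, 130)))  # 3900
```

If both pipelines agree on this number but still diverge in output quality, the difference must lie downstream of preprocessing, which is what the rest of this report argues.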
The difference appears to be in how the text generation pipeline handles visual tokens after the vision encoder. LM Studio uses a "VisionAddOn" architecture where mlx-vlm is only used for vision embedding extraction, while text generation always goes through mlx-lm. This separation apparently produces significantly better results.
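The VisionAddOn split described above can be sketched as a separation of concerns; all names here are illustrative stubs, not LM Studio's, mlx-vlm's, or mlx-lm's actual APIs:

```python
from typing import List

def extract_vision_embeddings(image_patches: List[List[float]]) -> List[List[float]]:
    # Stand-in for the mlx-vlm vision tower: patches in, per-token embeddings out.
    return [[sum(p) / len(p)] for p in image_patches]  # toy mean-pooling

def text_generate(prompt_tokens: List[int],
                  vision_embeddings: List[List[float]]) -> str:
    # Stand-in for the mlx-lm decode loop, which would splice the vision
    # embeddings in at the <|image_pad|> positions and sample from there.
    return f"decoded({len(prompt_tokens)} text tokens, {len(vision_embeddings)} vision tokens)"

def vision_addon_generate(image_patches: List[List[float]],
                          prompt_tokens: List[int]) -> str:
    # The key property: the VLM package only supplies embeddings; everything
    # after that (template, sampling, logits processing) is one text pipeline.
    return text_generate(prompt_tokens, extract_vision_embeddings(image_patches))

print(vision_addon_generate([[0.1, 0.2], [0.3, 0.4]], [1, 2, 3]))
# decoded(3 text tokens, 2 vision tokens)
```

The design point is that any maturity gap in the text side (sampling, template handling) cannot leak into results, because the text path is identical to the pure-LLM one.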
Specific areas that may differ:
- Chat template application: how vision tokens (`<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`) are placed in the prompt
- Deepstack visual embedding handling: Qwen3-VL injects intermediate vision features into early language-model layers
- Sampling pipeline: mlx-lm has a more mature sampling stack (logits processors, etc.)
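The first of these is the easiest to diff concretely: reconstruct the expected token layout and compare it against the prompt string each server actually feeds the model. A minimal sketch; only the three vision tokens are confirmed from this report, while the `<|im_start|>`/`<|im_end|>` framing is the common Qwen chat convention and is assumed here:

```python
def build_vision_prompt(user_text: str, num_image_pads: int) -> str:
    """Assemble a Qwen-style chat prompt with an inline vision block.

    num_image_pads should equal the placeholder count derived from
    image_grid_thw; a mismatch between servers here would directly
    explain degraded visual understanding.
    """
    vision_block = (
        "<|vision_start|>" + "<|image_pad|>" * num_image_pads + "<|vision_end|>"
    )
    return (
        "<|im_start|>user\n"
        + vision_block + user_text
        + "<|im_end|>\n<|im_start|>assistant\n"
    )

prompt = build_vision_prompt("Is there an As-Built stamp?", 4)
print(prompt.count("<|image_pad|>"))  # 4
```

Dumping the fully rendered prompt from LM Studio and from mlx-vlm and diffing them against a reference like this would confirm or rule out the chat-template hypothesis before digging into Deepstack handling or sampling.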
Environment
- mlx-vlm: 0.4.0
- mlx: 0.31.1
- transformers: 5.0.0rc3
- Hardware: Apple M5 Max 40C, 64GB
- Model: lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit
- macOS 26.3
Expected Behavior
Same model weights should produce comparable visual understanding regardless of inference server.