Description
Using the same model weights (lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit), same image, same prompt, and same hardware (M5 Max 40C, 64GB), mlx-vlm produces dramatically worse visual understanding than LM Studio's mlx-engine.
Reproduction
Same engineering drawing with a visible As-Built stamp at 4 MP:
LM Studio (mlx-engine v1.4.0):
- Detects the stamp, reads the OCR text ("REDLINE AS-BUILT ☐ CHANGE ☑ NO CHANGE"), and finds redlines
- `features_detected: ["As-Built Stamp", "Change/No Change Checkbox", "Redline Annotations", "Material Substitution Notes"]`
- `as_built_stamp_detected: true`
mlx-vlm (v0.4.0) via vllm-mlx and Bodega:
- Returns `features_detected: []` and `as_built_stamp_detected: false`
- Reports "No As-Built stamp or revision table found"
Accuracy on full dataset (53 files at 4 MP)
| Server | Underlying Engine | Accuracy |
|---|---|---|
| LM Studio | mlx-lm + VisionAddOn | 72.0% |
| vllm-mlx | mlx-vlm generate() | 52-55% |
| Bodega | mlx-vlm (assumed) | 41.5% |
Investigation
We confirmed that image preprocessing is identical: the `pixel_values` shape (15600, 1536) and `image_grid_thw` match exactly between the HuggingFace processor and mlx-vlm's `prepare_inputs()`. The image reaches the model at full resolution in all cases.
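A quick cross-check of this parity is to verify that the patch count implied by `image_grid_thw` matches the number of `pixel_values` rows, and to derive the expected number of `<|image_pad|>` placeholder tokens from it. A minimal sketch, assuming the Qwen-family convention of a 2×2 spatial merge (`merge_size=2`); the example grid `(1, 120, 130)` is hypothetical, chosen only because it yields the observed 15600 patches:

```python
def expected_image_pad_tokens(pixel_rows, grid_thw, merge_size=2):
    """Return the placeholder-token count implied by a Qwen-style vision grid.

    pixel_rows: number of rows in pixel_values (one row per vision patch).
    grid_thw:   (temporal, height, width) patch grid from the processor.
    merge_size: spatial merge factor (2 is the Qwen default; an assumption here).
    """
    t, h, w = grid_thw
    n_patches = t * h * w
    if n_patches != pixel_rows:
        raise ValueError(
            f"grid {grid_thw} implies {n_patches} patches, "
            f"but pixel_values has {pixel_rows} rows"
        )
    return n_patches // (merge_size ** 2)

# With the shapes observed above (grid layout hypothetical):
print(expected_image_pad_tokens(15600, (1, 120, 130)))  # 3900
```

If both pipelines agree on this number but still diverge in output quality, the difference must lie downstream of preprocessing, which is what the rest of this report argues.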
The difference appears to be in how the text generation pipeline handles visual tokens after the vision encoder. LM Studio uses a "VisionAddOn" architecture where mlx-vlm is only used for vision embedding extraction, while text generation always goes through mlx-lm. This separation apparently produces significantly better results.
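The VisionAddOn split described above can be sketched as a separation of concerns; all names here are illustrative stubs, not LM Studio's, mlx-vlm's, or mlx-lm's actual APIs:

```python
from typing import List

def extract_vision_embeddings(image_patches: List[List[float]]) -> List[List[float]]:
    # Stand-in for the mlx-vlm vision tower: patches in, per-token embeddings out.
    return [[sum(p) / len(p)] for p in image_patches]  # toy mean-pooling

def text_generate(prompt_tokens: List[int],
                  vision_embeddings: List[List[float]]) -> str:
    # Stand-in for the mlx-lm decode loop, which would splice the vision
    # embeddings in at the <|image_pad|> positions and sample from there.
    return f"decoded({len(prompt_tokens)} text tokens, {len(vision_embeddings)} vision tokens)"

def vision_addon_generate(image_patches: List[List[float]],
                          prompt_tokens: List[int]) -> str:
    # The key property: the VLM package only supplies embeddings; everything
    # after that (template, sampling, logits processing) is one text pipeline.
    return text_generate(prompt_tokens, extract_vision_embeddings(image_patches))

print(vision_addon_generate([[0.1, 0.2], [0.3, 0.4]], [1, 2, 3]))
# decoded(3 text tokens, 2 vision tokens)
```

The design point is that any maturity gap in the text side (sampling, template handling) cannot leak into results, because the text path is identical to the pure-LLM one.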
Specific areas that may differ:
- Chat template application: how vision tokens (`<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`) are placed in the prompt
- Deepstack visual embedding handling: Qwen3-VL injects intermediate vision features into early language-model layers
- Sampling pipeline: mlx-lm has a more mature sampling stack (logits processors, etc.)
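The first of these is the easiest to diff concretely: reconstruct the expected token layout and compare it against the prompt string each server actually feeds the model. A minimal sketch; only the three vision tokens are confirmed from this report, while the `<|im_start|>`/`<|im_end|>` framing is the common Qwen chat convention and is assumed here:

```python
def build_vision_prompt(user_text: str, num_image_pads: int) -> str:
    """Assemble a Qwen-style chat prompt with an inline vision block.

    num_image_pads should equal the placeholder count derived from
    image_grid_thw; a mismatch between servers here would directly
    explain degraded visual understanding.
    """
    vision_block = (
        "<|vision_start|>" + "<|image_pad|>" * num_image_pads + "<|vision_end|>"
    )
    return (
        "<|im_start|>user\n"
        + vision_block + user_text
        + "<|im_end|>\n<|im_start|>assistant\n"
    )

prompt = build_vision_prompt("Is there an As-Built stamp?", 4)
print(prompt.count("<|image_pad|>"))  # 4
```

Dumping the fully rendered prompt from LM Studio and from mlx-vlm and diffing them against a reference like this would confirm or rule out the chat-template hypothesis before digging into Deepstack handling or sampling.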
Environment
- mlx-vlm: 0.4.0
- mlx: 0.31.1
- transformers: 5.0.0rc3
- Hardware: Apple M5 Max 40C, 64GB
- Model: lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit
- macOS 26.3
Expected Behavior
Same model weights should produce comparable visual understanding regardless of inference server.