Aggregated totals by architecture and backend:
```
   architecture                   backend  frame_count  duration_s    avg_fps
0  ViT-B-16-SigLIP2-256           MPS             1637       28.21  58.029068
1  ViT-L-16-SigLIP-256            MPS             1637       91.39  17.912244
2  ViT-L-16-SigLIP-384            MPS             1637      226.56   7.225459
3  ViT-SO400M-14-SigLIP-384       MPS             1637      483.48   3.385869
4  siglip-large-patch16-384       MLX             1637      180.88   9.050199
5  siglip-large-patch16-384-4bit  MLX             1637      208.80   7.840038
6  siglip-so400m-patch14-224      MLX             1637      106.37  15.389678
7  siglip-so400m-patch14-384      MLX             1637      341.62   4.791874
```
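For context, avg_fps is just frame_count / duration_s (e.g. 1637 / 28.21 ≈ 58.03 for the first row), and the timing harness is roughly the following sketch. `encode_fn` and `batches` are placeholders, not my exact code:

```python
import time

def benchmark(encode_fn, batches):
    # Hypothetical harness: time encode_fn over all batches and
    # report total frames, wall-clock duration, and average fps.
    n_frames = 0
    start = time.perf_counter()
    for batch in batches:
        encode_fn(batch)
        n_frames += len(batch)
    duration_s = time.perf_counter() - start
    return n_frames, duration_s, n_frames / duration_s
```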

I was expecting a more significant speedup; perhaps I'm missing something? I'm taking frames extracted from videos and encoding them with batch size 32 on an MBP M1 Pro 16GB. Here are some snippets of my code:
```python
def process_batch(self, batch: torch.Tensor) -> torch.Tensor:
    # batch has shape [B, 3, H, W] on CPU (by default).
    if self.is_mlx_model:
        mx_in = self.mx.array(batch)
        # Cast to the vision tower's parameter dtype.
        dtype = (
            self.mlx_model.vision_model.vision_model
            .embeddings.patch_embedding.weight.dtype
        )
        # MLX vision models expect NHWC, so permute from NCHW.
        mx_in = mx_in.transpose(0, 2, 3, 1).astype(dtype)
        features = self.mlx_model.get_image_features(
            pixel_values=mx_in, return_dict=False, output_attentions=False
        )
        return torch.from_numpy(np.array(features, copy=False))
```
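The `transpose(0, 2, 3, 1)` is just the NCHW → NHWC permutation MLX's vision models expect; a minimal NumPy illustration of the same axis swap:

```python
import numpy as np

# PyTorch batches are NCHW: [B, C, H, W]. MLX vision models take NHWC.
batch = np.zeros((32, 3, 256, 256), dtype=np.float32)
nhwc = batch.transpose(0, 2, 3, 1)
print(nhwc.shape)  # (32, 256, 256, 3)
```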
In the MLX pipeline I also avoid shuttling the pre-allocated tensors between CPU and MPS; in the MPS pipeline, by contrast, I allocate them on CPU for frame extraction and move the whole batch over to MPS once extraction is done, before computing embeddings. Maybe (hopefully) I'm missing something? It will be interesting to see how SigLIP2 performs on MLX regardless; as you can see, I included ViT-B-16-SigLIP2-256 and it's by far the fastest.