A Python tool for calculating the number of vision tokens produced when Vision Language Models (VLMs) process images or videos.

```bash
pip install vt-calc
vt-calc --size 1920 1080                  # Calculate tokens for a 1920x1080 dummy image
vt-calc --image photo.jpg -m qwen2.5-vl   # Calculate tokens for an image
vt-calc --compare all --size 1920 1080    # Compare all models
```

- Calculate image/video tokens for VLMs
- Multi-model comparison - Compare token counts across multiple models
- Support both existing images and dummy images
- Support remote images via URL (http/https)
- Simple command line interface (CLI)

```bash
# Install from PyPI
pip install vt-calc

# Or install from a local checkout of the source (editable mode)
pip install -e .
```

```bash
# Single image
vt-calc --image path/to/your/image.jpg

# Image from URL
vt-calc --image https://example.com/image.jpg

# Directory (batch processing)
vt-calc --image path/to/your/images_dir

# Dummy image with specific dimensions (Height x Width)
vt-calc --size 1920 1080

# Choose a model (default: qwen2.5-vl)
vt-calc --image photo.jpg -m internvl3
```

```bash
# Calculate tokens for a video file
vt-calc --video path/to/video.mp4 -m qwen2.5-vl

# Specify frame sampling rate (FPS)
vt-calc --video video.mp4 --fps 2.0

# Limit maximum number of frames
vt-calc --video video.mp4 --max-frames 100
```

```bash
# Compare specific models (comma-separated)
vt-calc --image photo.jpg --compare qwen2.5-vl,internvl3,llava

# Compare all supported models
vt-calc --size 1920 1080 --compare all

# Compare models for video
vt-calc --video video.mp4 --compare qwen2.5-vl,llava-next --fps 2.0
```

| Option | Short | Description | Default |
|---|---|---|---|
| `--image` | `-i` | Path to image file, directory, or URL | - |
| `--video` | `-v` | Path to video file | - |
| `--size` | `-s` | Create dummy image (HEIGHT WIDTH) | - |
| `--model-name` | `-m` | Model name to use | `qwen2.5-vl` |
| `--compare` | `-c` | Compare models (comma-separated or `all`) | - |
| `--fps` | - | Frames per second for video sampling | - |
| `--max-frames` | - | Maximum frames to extract from video | - |
| `--duration` | - | Duration in seconds (dummy video) | - |

Supported input formats: .jpg, .jpeg, .png, .webp (case-insensitive)
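
For scripting, the documented options can be assembled into an argument list and handed to `subprocess.run`. The sketch below shows one way to do that; `build_vt_calc_command` is a hypothetical helper name, not part of vt-calc, and it only uses flags listed in the options table.

```python
def build_vt_calc_command(image=None, video=None, size=None, model=None,
                          compare=None, fps=None, max_frames=None):
    """Build an argv list for the vt-calc CLI (hypothetical helper)."""
    cmd = ["vt-calc"]
    if image is not None:
        cmd += ["--image", str(image)]
    if video is not None:
        cmd += ["--video", str(video)]
    if size is not None:
        height, width = size  # documented order is HEIGHT WIDTH
        cmd += ["--size", str(height), str(width)]
    if model is not None:
        cmd += ["--model-name", model]
    if compare is not None:
        # Accept either a string such as "all" or an iterable of model names.
        value = compare if isinstance(compare, str) else ",".join(compare)
        cmd += ["--compare", value]
    if fps is not None:
        cmd += ["--fps", str(fps)]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    return cmd


cmd = build_vt_calc_command(size=(1920, 1080), compare=["qwen2.5-vl", "internvl3"])
print(" ".join(cmd))
# To actually run it (requires vt-calc to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems with paths and URLs.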
Single Image Analysis

```
Using dummy image: 1920 x 1080
╔══════════════════════════════╗
║ VISION TOKEN ANALYSIS REPORT ║
╚══════════════════════════════╝
╭───────────────────────────────── MODEL INFO ─────────────────────────────────╮
│ Model Name                deepseek-ocr-tiny                                  │
│ Processing Method         Native Resolution                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── IMAGE INFO ─────────────────────────────────╮
│ Source                    Dummy image (H×W): 1920×1080                       │
│ Original Size (H×W)       1920×1080                                          │
│ Resized Size (H×W)        512×512                                            │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── PATCH INFO ─────────────────────────────────╮
│ Patch Size (ViT)          16                                                 │
│ Patch Grid (H×W)          32×32                                              │
│ Total Patches             1024                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── TOKEN INFO ─────────────────────────────────╮
│ Image Token (<image>)     64                                                 │
│ Image Newline Token       8                                                  │
│ (<image_newline>)                                                            │
│ Image Separator Token     1                                                  │
│ (<image_separator>)                                                          │
│ Total Vision Tokens       73                                                 │
│ Pixels per Token          3591.0 px/token                                    │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── TOKEN FORMAT ────────────────────────────────╮
│ (<image>*8 + <image_newline>) * 8 + <image_seperator> = 73                   │
╰──────────────────────────────────────────────────────────────────────────────╯
```
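
The figures in this report are linked by simple arithmetic, which the sketch below re-derives. It is not taken from vt-calc's source: the function name is hypothetical, and the 16x reduction from patches to `<image>` tokens is an assumption inferred from the report (1024 patches yielding 64 tokens).

```python
import math


def report_numbers(resized_h=512, resized_w=512, patch_size=16, compression=16):
    """Re-derive the sample report's figures (illustrative, not vt-calc's code)."""
    grid_h = resized_h // patch_size          # 512 / 16 = 32
    grid_w = resized_w // patch_size          # 512 / 16 = 32
    total_patches = grid_h * grid_w           # 32 * 32 = 1024
    # Assumption: patches are merged 16:1 into <image> tokens.
    image_tokens = total_patches // compression
    # The token format shows a square layout with one newline token per row.
    rows = math.isqrt(image_tokens)
    cols = image_tokens // rows
    newline_tokens = rows
    separator_tokens = 1
    total = rows * (cols + 1) + separator_tokens
    return {
        "patch_grid": (grid_h, grid_w),
        "total_patches": total_patches,
        "image_tokens": image_tokens,
        "newline_tokens": newline_tokens,
        "separator_tokens": separator_tokens,
        "total_vision_tokens": total,
        # Pixels per token is computed on the resized image, not the original.
        "pixels_per_token": round(resized_h * resized_w / total, 1),
    }


print(report_numbers())
```

Note that the pixels-per-token figure only matches the report when the resized 512×512 size is used, which suggests the metric is based on the resized image.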
Model Comparison

```
Comparing models for dummy image (H×W): 1920×1080
╔════════════════════════╗
║ IMAGE MODEL COMPARISON ║
╚════════════════════════╝
Dummy image (H×W): 1920×1080
Resolution (H×W): 1920×1080
Token Comparison
╭────────┬─────────────────────┬────────────┬────────────┬────────────┬────────╮
│ Rank   │ Model               │     Tokens │   px/token │ Efficiency │ Status │
├────────┼─────────────────────┼────────────┼────────────┼────────────┼────────┤
│ 🥇 1   │ deepseek-ocr-tiny   │         73 │     3591.0 │ Best       │   ✓    │
│ 🥈 2   │ deepseek-ocr-small  │        111 │     3690.1 │            │   ✓    │
│ 🥉 3   │ deepseek-ocr-base   │        273 │     3840.9 │            │   ✓    │
│    4   │ deepseek-ocr-large  │        421 │     3891.7 │            │   ✓    │
│    5   │ llava               │        576 │      196.0 │            │   ✓    │
│    6   │ deepseek-ocr-gundam │      1,113 │      942.1 │            │   ✓    │
│    7   │ llava-next          │      1,968 │      129.1 │            │   ✓    │
│    8   │ internvl3           │      2,306 │      696.3 │            │   ✓    │
│    9   │ qwen2-vl            │      2,693 │      783.4 │            │   ✓    │
│    10  │ qwen2.5-vl          │      2,693 │      783.4 │            │   ✓    │
│    11  │ llava-onevision     │      7,317 │      283.4 │            │   ✓    │
│    12  │ phi4-multimodal     │      7,553 │      744.0 │            │   ✓    │
╰────────┴─────────────────────┴────────────┴────────────┴────────────┴────────╯
╭────────────────────────────────── Summary ───────────────────────────────────╮
│ Best: deepseek-ocr-tiny (73 tokens)                                          │
│ Worst: phi4-multimodal (7,553 tokens)                                        │
│ Potential Savings: 7,480 tokens (99.0%)                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
```
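
The summary figures follow directly from the token counts in the table. The sketch below reproduces the ranking and the "Potential Savings" line from those counts; it is an illustration using the numbers shown above, and how vt-calc computes them internally is an assumption.

```python
# Token counts copied from the comparison output for the 1920×1080 dummy image.
token_counts = {
    "deepseek-ocr-tiny": 73,
    "deepseek-ocr-small": 111,
    "deepseek-ocr-base": 273,
    "deepseek-ocr-large": 421,
    "llava": 576,
    "deepseek-ocr-gundam": 1113,
    "llava-next": 1968,
    "internvl3": 2306,
    "qwen2-vl": 2693,
    "qwen2.5-vl": 2693,
    "llava-onevision": 7317,
    "phi4-multimodal": 7553,
}


def summarize(counts):
    """Rank models by token count and compute the best-vs-worst savings."""
    ranked = sorted(counts.items(), key=lambda item: item[1])  # fewest tokens first
    best, worst = ranked[0], ranked[-1]
    saved = worst[1] - best[1]
    percent = 100 * saved / worst[1]
    return ranked, best, worst, saved, percent


ranked, best, worst, saved, percent = summarize(token_counts)
print(f"Best: {best[0]} ({best[1]:,} tokens)")
print(f"Worst: {worst[0]} ({worst[1]:,} tokens)")
print(f"Potential Savings: {saved:,} tokens ({percent:.1f}%)")
```

The savings percentage is relative to the worst model's count, so it describes the gap between the two extremes rather than an average.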
| Model | Option | Image | Video |
|---|---|---|---|
| Qwen2-VL | `qwen2-vl` | ✅ | ✅ |
| Qwen2.5-VL | `qwen2.5-vl` | ✅ | ✅ |
| Qwen3-VL | `qwen3-vl` | ✅ | ✅ |
| LLaVA | `llava` | ✅ | ✅ |
| LLaVA-NeXT | `llava-next` | ✅ | |
| LLaVA-OneVision | `llava-onevision` | ✅ | ✅ |
| InternVL3 | `internvl3` | ✅ | ✅ |
| DeepSeek-OCR (tiny) | `deepseek-ocr-tiny` | ✅ | |
| DeepSeek-OCR (small) | `deepseek-ocr-small` | ✅ | |
| DeepSeek-OCR (base) | `deepseek-ocr-base` | ✅ | |
| DeepSeek-OCR (large) | `deepseek-ocr-large` | ✅ | |
| DeepSeek-OCR (gundam) | `deepseek-ocr-gundam` | ✅ | |
| Phi-4-Multimodal | `phi4-multimodal` | ✅ | |

This project is licensed under the MIT License; see the LICENSE file for details.