thisisiron/vision-token-calculator

Vision Token Calculator

PyPI version License: MIT Python 3.8+

A Python tool for calculating the number of tokens generated when processing images with Vision Language Models (VLMs).

Quick Start

pip install vt-calc
vt-calc --size 1920 1080                    # Calculate tokens for a dummy image (height 1920, width 1080)
vt-calc --image photo.jpg -m qwen2.5-vl     # Calculate tokens for an image
vt-calc --compare all --size 1920 1080      # Compare all models

Features

  • Calculate image/video tokens for VLMs
  • Multi-model comparison: compare token counts across several models in one run
  • Support both existing image files and dummy images of a given size
  • Support remote images via URL (http/https)
  • Simple command line interface (CLI)

Installation

Option 1: PyPI (recommended)

pip install vt-calc

Option 2: From source (editable for development)

pip install -e .

Usage

Basic Commands

# Single image
vt-calc --image path/to/your/image.jpg

# Image from URL
vt-calc --image https://example.com/image.jpg

# Directory (batch processing)
vt-calc --image path/to/your/images_dir

# Dummy image with specific dimensions (Height x Width)
vt-calc --size 1920 1080

# Choose a model (default: qwen2.5-vl)
vt-calc --image photo.jpg -m internvl3

Video Processing

# Calculate tokens for a video file
vt-calc --video path/to/video.mp4 -m qwen2.5-vl

# Specify frame sampling rate (FPS)
vt-calc --video video.mp4 --fps 2.0

# Limit maximum number of frames
vt-calc --video video.mp4 --max-frames 100
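
As a rough illustration of how --fps and --max-frames interact, the sketch below estimates a frame count. It assumes uniform sampling over the whole clip followed by a cap. That is a common scheme, but it is not confirmed to be what vt-calc does internally, and the function name `estimate_sampled_frames` is hypothetical.

```python
import math
from typing import Optional


def estimate_sampled_frames(duration_s: float, fps: float,
                            max_frames: Optional[int] = None) -> int:
    """Estimate how many frames uniform sampling would yield.

    Assumption: frames are taken at a constant rate of `fps` over the whole
    clip and then capped at `max_frames`. vt-calc may sample differently.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    # Always keep at least one frame, even for very short clips.
    frames = max(1, math.floor(duration_s * fps))
    if max_frames is not None:
        frames = min(frames, max_frames)
    return frames


# A 90-second clip sampled at 2 FPS, capped at 100 frames.
print(estimate_sampled_frames(90, 2.0, max_frames=100))  # 100
# A 30-second clip at 2 FPS stays under the cap.
print(estimate_sampled_frames(30, 2.0, max_frames=100))  # 60
```

The per-frame token count still depends on the model, so treat this only as a way to reason about how many frames a given setting would send through the calculator.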

Model Comparison

# Compare specific models (comma-separated)
vt-calc --image photo.jpg --compare qwen2.5-vl,internvl3,llava

# Compare all supported models
vt-calc --size 1920 1080 --compare all

# Compare models for video
vt-calc --video video.mp4 --compare qwen2.5-vl,llava-onevision --fps 2.0

CLI Options

| Option       | Short | Description                              | Default    |
|--------------|-------|------------------------------------------|------------|
| --image      | -i    | Path to image file, directory, or URL    | -          |
| --video      | -v    | Path to video file                       | -          |
| --size       | -s    | Create dummy image (HEIGHT WIDTH)        | -          |
| --model-name | -m    | Model name to use                        | qwen2.5-vl |
| --compare    | -c    | Compare models (comma-separated or all)  | -          |
| --fps        | -     | Frames per second for video sampling     | -          |
| --max-frames | -     | Maximum frames to extract from video     | -          |
| --duration   | -     | Duration in seconds (dummy video)        | -          |
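
When calling the CLI from a script, the options above can be assembled into an argument list. The helper below is a hypothetical sketch, not part of vt-calc: it only builds the list from flags shown in this README and never runs the tool. Treating --compare and --model-name as mutually exclusive is an assumption made for the example.

```python
import shlex
from typing import List, Optional, Sequence, Tuple


def build_vt_calc_command(image: Optional[str] = None,
                          size: Optional[Tuple[int, int]] = None,
                          model: Optional[str] = None,
                          compare: Optional[Sequence[str]] = None) -> List[str]:
    """Build an argv list for vt-calc (hypothetical helper, does not run it)."""
    if (image is None) == (size is None):
        raise ValueError("pass exactly one of image or size")
    cmd = ["vt-calc"]
    if image is not None:
        cmd += ["--image", image]
    else:
        height, width = size  # --size takes HEIGHT first, then WIDTH
        cmd += ["--size", str(height), str(width)]
    if compare:
        # Assumption: when comparing, a single --model-name is not needed.
        cmd += ["--compare", ",".join(compare)]
    elif model:
        cmd += ["--model-name", model]
    return cmd


cmd = build_vt_calc_command(size=(1920, 1080), compare=["qwen2.5-vl", "internvl3"])
print(shlex.join(cmd))
# vt-calc --size 1920 1080 --compare qwen2.5-vl,internvl3
```

With vt-calc installed, the list can be passed to `subprocess.run(cmd, check=True)`. A list avoids shell quoting problems with paths that contain spaces.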

Supported input formats: .jpg, .jpeg, .png, .webp (case-insensitive)
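
To preview which files in a directory match the extensions listed above before a batch run, a filter along these lines can help. It is a sketch only: the function name is hypothetical, and looking only at the top level of the directory is an assumption. vt-calc's own directory handling may differ.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions listed in this README; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_supported_images(directory: str) -> List[Path]:
    """Return files directly inside `directory` with a supported extension."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.JPG", "b.png", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in find_supported_images(tmp)])  # ['a.JPG', 'b.png']
```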

Example Output

Single Image Analysis
Using dummy image: 1920 x 1080
                        ╔══════════════════════════════╗
                        ║ VISION TOKEN ANALYSIS REPORT ║
                        ╚══════════════════════════════╝
╭───────────────────────────────── MODEL INFO ─────────────────────────────────╮
│   Model Name                deepseek-ocr-tiny                                │
│   Processing Method         Native Resolution                                │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── IMAGE INFO ─────────────────────────────────╮
│   Source                    Dummy image (H×W): 1920×1080                     │
│   Original Size (H×W)       1920×1080                                        │
│   Resized Size (H×W)        512×512                                          │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── PATCH INFO ─────────────────────────────────╮
│   Patch Size (ViT)          16                                               │
│   Patch Grid (H×W)          32×32                                            │
│   Total Patches             1024                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── TOKEN INFO ─────────────────────────────────╮
│   Image Token (<image>)     64                                               │
│   Image Newline Token       8                                                │
│   (<image_newline>)                                                          │
│   Image Separator Token     1                                                │
│   (<image_separator>)                                                        │
│   Total Vision Tokens       73                                               │
│   Pixels per Token          3591.0 px/token                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── TOKEN FORMAT ────────────────────────────────╮
│          (<image>*8 + <image_newline>) * 8 + <image_seperator> = 73          │
╰──────────────────────────────────────────────────────────────────────────────╯

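
The numbers in this report can be checked by hand. The sketch below re-derives them from the values printed above. The 8×8 token layout is read off the TOKEN FORMAT line, and the ratio of 16 patches per image token is simply what 1024 / 64 gives in this report, not a documented model constant.

```python
# Values copied from the report above (deepseek-ocr-tiny, 512x512 resized image).
resized_h, resized_w = 512, 512
patch_size = 16

grid_h, grid_w = resized_h // patch_size, resized_w // patch_size
total_patches = grid_h * grid_w

# TOKEN FORMAT shows 8 rows of 8 <image> tokens, each row followed by one
# <image_newline>, plus a single trailing separator token.
rows, tokens_per_row = 8, 8
image_tokens = rows * tokens_per_row
newline_tokens = rows
separator_tokens = 1
total_tokens = image_tokens + newline_tokens + separator_tokens

# Pixels per token uses the resized image, not the original 1920x1080 input.
pixels_per_token = resized_h * resized_w / total_tokens

print(grid_h, grid_w, total_patches)   # 32 32 1024
print(total_tokens)                    # 73
print(round(pixels_per_token, 1))      # 3591.0
print(total_patches // image_tokens)   # 16
```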
Model Comparison
Comparing models for dummy image (H×W): 1920×1080

                           ╔════════════════════════╗
                           ║ IMAGE MODEL COMPARISON ║
                           ╚════════════════════════╝
                          Dummy image (H×W): 1920×1080
                          Resolution (H×W): 1920×1080

                                  Token Comparison
╭────────┬─────────────────────┬────────────┬────────────┬──────────────────┬────────╮
│  Rank  │ Model               │     Tokens │   px/token │ Efficiency       │ Status │
├────────┼─────────────────────┼────────────┼────────────┼──────────────────┼────────┤
│  🥇 1  │ deepseek-ocr-tiny   │         73 │     3591.0 │ █████████░ Best  │   ✓    │
│  🥈 2  │ deepseek-ocr-small  │        111 │     3690.1 │ █████████░       │   ✓    │
│  🥉 3  │ deepseek-ocr-base   │        273 │     3840.9 │ █████████░       │   ✓    │
│   4    │ deepseek-ocr-large  │        421 │     3891.7 │ █████████░       │   ✓    │
│   5    │ llava               │        576 │      196.0 │ █████████░       │   ✓    │
│   6    │ deepseek-ocr-gundam │      1,113 │      942.1 │ ████████░░       │   ✓    │
│   7    │ llava-next          │      1,968 │      129.1 │ ███████░░░       │   ✓    │
│   8    │ internvl3           │      2,306 │      696.3 │ ██████░░░░       │   ✓    │
│   9    │ qwen2-vl            │      2,693 │      783.4 │ ██████░░░░       │   ✓    │
│   10   │ qwen2.5-vl          │      2,693 │      783.4 │ ██████░░░░       │   ✓    │
│   11   │ llava-onevision     │      7,317 │      283.4 │ ░░░░░░░░░░       │   ✓    │
│   12   │ phi4-multimodal     │      7,553 │      744.0 │ ░░░░░░░░░░       │   ✓    │
╰────────┴─────────────────────┴────────────┴────────────┴──────────────────┴────────╯

╭────────────────────────────────── Summary ───────────────────────────────────╮
│ Best: deepseek-ocr-tiny (73 tokens)                                          │
│ Worst: phi4-multimodal (7,553 tokens)                                        │
│ Potential Savings: 7,480 tokens (99.0%)                                      │
╰──────────────────────────────────────────────────────────────────────────────╯
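
The Summary panel appears to follow directly from the token column: the best and worst models are the extremes of the ranking, and the savings figure is their difference relative to the worst. The sketch below reproduces those three lines from a subset of the counts shown above. It illustrates the arithmetic and is not vt-calc's own code.

```python
# Token counts copied from the comparison table above (subset, 1920x1080 dummy image).
token_counts = {
    "deepseek-ocr-tiny": 73,
    "llava": 576,
    "qwen2.5-vl": 2693,
    "phi4-multimodal": 7553,
}

# Rank models from fewest to most vision tokens.
ranked = sorted(token_counts.items(), key=lambda item: item[1])
best_model, best_tokens = ranked[0]
worst_model, worst_tokens = ranked[-1]

savings = worst_tokens - best_tokens
savings_pct = 100 * savings / worst_tokens

print(f"Best: {best_model} ({best_tokens:,} tokens)")
print(f"Worst: {worst_model} ({worst_tokens:,} tokens)")
print(f"Potential Savings: {savings:,} tokens ({savings_pct:.1f}%)")
# Best: deepseek-ocr-tiny (73 tokens)
# Worst: phi4-multimodal (7,553 tokens)
# Potential Savings: 7,480 tokens (99.0%)
```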

Supported Models

| Model                 | Option              | Image | Video |
|-----------------------|---------------------|-------|-------|
| Qwen2-VL              | qwen2-vl            | ✓     | ✓     |
| Qwen2.5-VL            | qwen2.5-vl          | ✓     | ✓     |
| Qwen3-VL              | qwen3-vl            | ✓     | ✓     |
| LLaVA                 | llava               | ✓     | ✓     |
| LLaVA-NeXT            | llava-next          | ✓     |       |
| LLaVA-OneVision       | llava-onevision     | ✓     | ✓     |
| InternVL3             | internvl3           | ✓     | ✓     |
| DeepSeek-OCR (tiny)   | deepseek-ocr-tiny   | ✓     |       |
| DeepSeek-OCR (small)  | deepseek-ocr-small  | ✓     |       |
| DeepSeek-OCR (base)   | deepseek-ocr-base   | ✓     |       |
| DeepSeek-OCR (large)  | deepseek-ocr-large  | ✓     |       |
| DeepSeek-OCR (gundam) | deepseek-ocr-gundam | ✓     |       |
| Phi-4-Multimodal      | phi4-multimodal     | ✓     |       |

License

This project is licensed under the MIT License; see the LICENSE file for details.
