iree-tokenizer

Rust bindings for the IREE tokenizer — a high-performance C tokenizer with full HuggingFace tokenizer.json and OpenAI tiktoken compatibility.

Fast. 3-15x faster encode, 25-49x faster decode vs HuggingFace tokenizers. Pure C hot path with zero allocations per token.
Safe. Idiomatic Rust API with RAII, Result error handling, and Send + Sync thread safety. No unsafe in user-facing code.
Streaming encode/decode. First-class support for incremental tokenization — feed chunks in, get tokens out. Ideal for LLM inference.
Drop-in compatible. Loads any HuggingFace tokenizer.json or OpenAI .tiktoken vocabulary. Supports BPE, WordPiece, and Unigram models.
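
The Send + Sync claim above means a single tokenizer instance can be shared by reference across threads. The sketch below shows that pattern with scoped threads. `StandInTokenizer` and its byte-based `encode` are hypothetical placeholders so the snippet compiles without the crate; they are not the library's types or behavior.

```rust
use std::thread;

/// Hypothetical stand-in for a `Send + Sync` tokenizer; NOT the crate's type.
/// It "encodes" each byte as one id so the example stays self-contained.
struct StandInTokenizer;

impl StandInTokenizer {
    fn encode(&self, text: &str) -> Vec<i32> {
        text.bytes().map(i32::from).collect()
    }
}

/// Encode each input on its own scoped thread, sharing `tok` by reference.
/// Borrowing `tok` from several threads only compiles because the type is `Sync`.
fn encode_in_parallel(tok: &StandInTokenizer, inputs: &[&str]) -> Vec<Vec<i32>> {
    thread::scope(|s| {
        // Spawn every thread first, then join them in input order.
        let handles: Vec<_> = inputs
            .iter()
            .map(|&text| s.spawn(move || tok.encode(text)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

With a real tokenizer the same shape should apply: load once, then hand out shared references (or an `Arc`) to worker threads instead of reloading per thread.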

Based on the IREE high-speed tokenizer library:

Optimized for cache utilization. Uses the cache efficiently on both large and small CPUs. No dependencies and a small footprint make it well suited to embedded and client targets and to inclusion in other projects.
Unique algorithmic optimizations. Pull-based streaming processor with small, bounded, deterministic memory usage, plus several optimizations not seen elsewhere.
GPU-ready. Designed so that it can run tiled on the GPU, not just on the host.

Performance

Criterion benchmarks comparing IREE against HuggingFace tokenizers on the same 112-token BPE vocabulary (Apple M-series, single-threaded):

Encode

Corpus	IREE	HuggingFace	Speedup
short (44 chars)	3.1 us	10.4 us	3.3x
medium (880 chars)	13.5 us	155.9 us	11.6x
long (22K chars)	253.9 us	3807.8 us	15.0x

Decode

Corpus	IREE	HuggingFace	Speedup
short (38 tokens)	106 ns	2.6 us	24.5x
medium (456 tokens)	987 ns	46.3 us	46.9x
long (11K tokens)	23.7 us	1152.2 us	48.6x
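
The Speedup column is the HuggingFace time divided by the IREE time, after converting both to the same unit (the decode table mixes ns and us). A small sketch of that arithmetic follows; the helper names are illustrative only. Recomputing from the rounded timings printed above can differ from the listed speedup in the last digit, presumably because the table was derived from unrounded measurements.

```rust
/// Convert microseconds to nanoseconds so mixed-unit rows can be compared.
fn us_to_ns(us: f64) -> f64 {
    us * 1_000.0
}

/// Speedup = baseline time / candidate time, both in nanoseconds.
fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

/// Round to one decimal place, matching the precision used in the tables.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

For example, the long encode row gives `round1(speedup(us_to_ns(3807.8), us_to_ns(253.9)))`, which evaluates to 15.0, and the short decode row gives `round1(speedup(us_to_ns(2.6), 106.0))`, which evaluates to 24.5.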

Run the benchmarks yourself:

cargo bench

Prerequisites

CMake and Ninja
A C/C++ compiler (clang or gcc)
libclang (for bindgen)

Building

The IREE C library source is included as a git submodule and compiled automatically via CMake during cargo build.

git clone https://github.com/lightseekorg/iree-tokenizer-rust.git
cd iree-tokenizer-rust

# Initialize submodules (shallow clone to minimize download)
git submodule update --init --depth 1
cd third_party/iree && git submodule update --init --depth 1 third_party/flatcc && cd ../..

cargo build

To build against a local IREE source tree instead of the submodule:

IREE_SOURCE_DIR=/path/to/iree cargo build
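
For readers curious how such an override is typically wired up, here is a minimal sketch of the lookup a build script could perform. It is an assumption about the general technique, not a copy of this project's build.rs; the function name is hypothetical, and treating an empty value as unset is a choice made for this example.

```rust
use std::path::PathBuf;

/// Default location of the vendored sources, per the project layout.
const SUBMODULE_DIR: &str = "third_party/iree";

/// Pick the IREE source tree: a non-empty override wins, otherwise the submodule.
/// In a build script, `override_dir` would typically come from
/// `std::env::var("IREE_SOURCE_DIR").ok()`.
fn resolve_iree_source(override_dir: Option<String>) -> PathBuf {
    match override_dir {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(SUBMODULE_DIR),
    }
}
```

A build script that reads an environment variable usually also prints `cargo:rerun-if-env-changed=IREE_SOURCE_DIR` so that Cargo rebuilds when the variable changes.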

Quick Start

use iree_tokenizer::Tokenizer;

// The `?` operators below assume an enclosing function that returns a Result.
// Load from a HuggingFace tokenizer.json
let tok = Tokenizer::from_file("tokenizer.json")?;

// Encode / decode
let ids = tok.encode("Hello world", false)?;     // vec![15496, 995]
let text = tok.decode(&ids, false)?;              // "Hello world"

// Batch encode
let batch = tok.encode_batch(&["Hello", "World"], false)?;

// Rich encoding with byte offsets
let enc = tok.encode_rich("Hello world", false, true)?;
// enc.ids, enc.offsets, enc.type_ids

// Streaming encode (feed chunks, get tokens incrementally)
let mut stream = tok.encode_stream(false)?;
stream.feed("Hello ")?;
stream.feed("world")?;
let ids = stream.finalize()?;

// Vocabulary introspection
tok.vocab_size();                // 50257
tok.model_type();                // "BPE"
tok.token_to_id("hello");       // Some(31373)
tok.id_to_token(31373);         // Some("hello")
tok.bos_token_id();             // Some(1)
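
Streaming encoders take the feed/finalize shape shown above because a chunk boundary can fall in the middle of a token, so some input has to be held back until more text arrives or the stream ends. The toy sketch below illustrates that idea with whitespace-separated words instead of BPE. `ToyWordStream` is hypothetical; it says nothing about how IREE implements streaming or what the crate's `feed` returns.

```rust
/// Toy streaming splitter: emits whitespace-separated words as soon as they are
/// known to be complete. Hypothetical illustration only, not the crate's API.
#[derive(Default)]
struct ToyWordStream {
    /// Trailing partial word carried across chunk boundaries.
    pending: String,
}

impl ToyWordStream {
    /// Accept a chunk and return every word completed by it.
    fn feed(&mut self, chunk: &str) -> Vec<String> {
        let mut done = Vec::new();
        for ch in chunk.chars() {
            if ch.is_whitespace() {
                // A separator closes the current word, if there is one.
                if !self.pending.is_empty() {
                    done.push(std::mem::take(&mut self.pending));
                }
            } else {
                self.pending.push(ch);
            }
        }
        done
    }

    /// End of input: flush whatever partial word is still buffered.
    fn finalize(mut self) -> Vec<String> {
        if self.pending.is_empty() {
            Vec::new()
        } else {
            vec![std::mem::take(&mut self.pending)]
        }
    }
}
```

Feeding "Hel", "lo wor", and "ld" yields "Hello" from the second call and "world" only from `finalize`, since nothing before the end of input proves the last word is complete. The buffer only ever holds the current unfinished word, which is the toy analogue of keeping streaming memory small.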

Tiktoken

use iree_tokenizer::Tokenizer;

let tok = Tokenizer::from_tiktoken_file("cl100k_base.tiktoken", "cl100k_base")?;
let ids = tok.encode("Hello world", false)?;

API

Method	Returns	Description
`Tokenizer::from_file(path)`	`Result<Tokenizer>`	Load from `tokenizer.json`
`Tokenizer::from_str(json)`	`Result<Tokenizer>`	Load from JSON string
`Tokenizer::from_tiktoken_file(path, enc)`	`Result<Tokenizer>`	Load from `.tiktoken` file
`tok.encode(text, add_special)`	`Result<Vec<i32>>`	Encode text to token IDs
`tok.encode_rich(text, special, offsets)`	`Result<Encoding>`	IDs + byte offsets + type IDs
`tok.decode(ids, skip_special)`	`Result<String>`	Decode token IDs to text
`tok.encode_batch(texts, add_special)`	`Result<Vec<Vec<i32>>>`	Batch encode
`tok.decode_batch(id_lists, skip_special)`	`Result<Vec<String>>`	Batch decode
`tok.encode_stream(add_special)`	`Result<EncodeStream>`	Streaming encoder
`tok.decode_stream(skip_special)`	`Result<DecodeStream>`	Streaming decoder
`tok.vocab_size()`	`usize`	Vocabulary size
`tok.model_type()`	`String`	`"BPE"`, `"WordPiece"`, or `"Unigram"`
`tok.token_to_id(token)`	`Option<i32>`	Look up token ID
`tok.id_to_token(id)`	`Option<String>`	Look up token text
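
Byte offsets from `encode_rich` are useful for mapping tokens back to spans of the input. The sketch below assumes the offsets are half-open `(start, end)` byte ranges into the original text; the table above does not show the concrete type of `enc.offsets`, so the `&[(usize, usize)]` parameter and the `spans` helper are assumptions for illustration.

```rust
/// Slice `text` by byte-offset pairs. Returns `None` if any pair is out of
/// range, reversed, or does not fall on a UTF-8 character boundary.
fn spans<'a>(text: &'a str, offsets: &[(usize, usize)]) -> Option<Vec<&'a str>> {
    offsets
        .iter()
        .map(|&(start, end)| text.get(start..end))
        .collect()
}
```

Using `str::get` rather than indexing avoids a panic when an offset lands inside a multi-byte character, which matters because the offsets count bytes rather than characters.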

CLI

A streaming iree-tokenizer command is included. It reads from stdin and writes JSONL (one JSON object per line) to stdout.

# Encode text to token IDs
echo "Hello world" | iree-tokenizer encode -t tokenizer.json
# {"seq":0,"text":"Hello world","ids":[15496,995],"n_tokens":2,...}

# Encode with a tiktoken vocabulary
echo "Hello world" | iree-tokenizer encode -t cl100k_base.tiktoken --encoding cl100k_base

# Decode token IDs back to text
echo '[15496, 995]' | iree-tokenizer decode -t tokenizer.json
# {"seq":0,"ids":[15496,995],"text":"Hello world","n_tokens":2,...}

# Chain encode -> decode (round-trip)
cat corpus.txt | iree-tokenizer encode -t tokenizer.json | iree-tokenizer decode -t tokenizer.json

# Tokenizer info
iree-tokenizer info -t tokenizer.json

Use --compact to omit timing fields and --rich to include byte offsets.

Testing

cargo test

44 integration tests across 7 test files covering loading, encoding, decoding, batching, streaming, vocabulary introspection, and tiktoken support.

Project Structure

src/
  lib.rs            Public API re-exports
  tokenizer.rs      Safe Tokenizer wrapper over FFI
  stream.rs         EncodeStream / DecodeStream (RAII)
  encoding.rs       Encoding result type (ids, offsets, type_ids)
  error.rs          Error types mapping IREE status codes
  ffi.rs            Bindgen output + manual inline reimplementations
  main.rs           CLI binary
build.rs            CMake + bindgen build script
third_party/iree/   IREE C library (git submodule)
benches/
  bench_comparison.rs   Criterion benchmarks (IREE vs HuggingFace)
tests/
  test_load.rs      Tokenizer construction
  test_encode.rs    Encoding correctness
  test_decode.rs    Decoding correctness
  test_batch.rs     Batch encode/decode
  test_streaming.rs Streaming encode/decode
  test_vocab.rs     Vocabulary introspection
  test_tiktoken.rs  Tiktoken format support

Acknowledgments

This project wraps the IREE tokenizer C library, developed as part of the IREE project. The API design follows the iree-tokenizer-py Python bindings, aiming to bring the same feature set to the Rust ecosystem.

License

Apache-2.0 WITH LLVM-exception -- see LICENSE.
