readable-hash-rs

Hashes like a7b9c3d4e5f6... are hard to read, compare, and remember. This crate transforms hash bytes into pronounceable text, making them easier on human eyes.

You might use this when verifying file integrity visually, or when you need consistent pseudonyms for names and addresses without caring much about cryptographic strength. It's also handy during debugging when you want to quickly tell hashes apart.

This crate is not trying to be the most secure, fastest, or most entropy-efficient solution. The goal is simply readability.

Usage

Add the crate to your Cargo.toml:

[dependencies]
readable-hash = "0.1"

Generate a readable hash:

use readable_hash::{english_word_hash, StdHasher};

fn main() {
    let word = english_word_hash::<StdHasher, _>("hello");
    println!("{word}");
    // thatised
}

More examples:

"I" -> "waged"
"different" -> "imaumates"
"pneumonoultramicroscopicsilicovolcanoconiosis" -> "dummaricardemastria"

Tokenizer

The models/ directory bundles a small sample corpus and Python utilities for training a BPE tokenizer with explicit word boundary markers using the tokenizers library.

Install dependencies and train the tokenizer:

pip install -r models/requirements.txt
python models/train_tokenizer.py models/sample_corpus.txt

This writes the tokenizer files into models/tokenizer/. The training script normalizes input with Unicode NFKC and lowercases it before tokenization, so differently cased forms produce the same tokens. Tokens that continue a word are prefixed with ##, while tokens that finish a word carry a </w> suffix. You can verify that the tokenizer works by encoding text with the helper script:

python models/tokenizer_check.py models/tokenizer/tokenizer.json "Hello WORLD"
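
The trained tokenizer.json can also be loaded from Rust with the HuggingFace tokenizers crate, which reads the same file format. A minimal sketch, assuming that crate is added as a dependency and the file sits at the path below:

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Load the tokenizer produced by models/train_tokenizer.py.
    let tokenizer = Tokenizer::from_file("models/tokenizer/tokenizer.json")?;

    // NFKC normalization and lowercasing happen inside the tokenizer,
    // so "Hello WORLD" and "hello world" encode to the same tokens.
    let encoding = tokenizer.encode("Hello WORLD", false)?;

    // Continuation tokens carry a "##" prefix, word-final tokens a "</w>" suffix.
    println!("{:?}", encoding.get_tokens());
    Ok(())
}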

N-gram model

With a tokenizer trained, an n-gram language model can be built from the corpus:

python models/train_ngram.py models/sample_corpus.txt

Pass -n to control the order (default 2 for bigrams).

The tokenizer must define start (<s>) and end (</s>) tokens, which the script uses to mark sentence boundaries. It also verifies that tokens respect word-boundary markers (## prefixes and </w> suffixes) before accumulating statistics. The result is written to ngram.json and stores transition probabilities between token ids.
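
At its core, building the bigram model amounts to counting how often one token id follows another and normalizing those counts per predecessor. A minimal sketch of that step, using made-up token-id sequences (the actual script also handles the <s>/</s> boundary tokens and performs the word-boundary checks described above):

use std::collections::HashMap;

/// Bigram transition probabilities P(next | previous) from token-id sequences.
fn bigram_probabilities(sequences: &[Vec<u32>]) -> HashMap<u32, HashMap<u32, f64>> {
    // Count how often `next` follows `prev`.
    let mut counts: HashMap<u32, HashMap<u32, u64>> = HashMap::new();
    for seq in sequences {
        for pair in seq.windows(2) {
            *counts.entry(pair[0]).or_default().entry(pair[1]).or_default() += 1;
        }
    }
    // Normalize each predecessor's counts into probabilities.
    counts
        .into_iter()
        .map(|(prev, nexts)| {
            let total: u64 = nexts.values().sum();
            let probs = nexts
                .into_iter()
                .map(|(next, count)| (next, count as f64 / total as f64))
                .collect();
            (prev, probs)
        })
        .collect()
}

fn main() {
    // Toy sequences standing in for tokenized, boundary-marked sentences.
    let sequences = vec![vec![0, 5, 7, 1], vec![0, 5, 9, 1]];
    println!("{:#?}", bigram_probabilities(&sequences));
}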

Generating the 8-bit model

Use the built-in tokenizer/trainer to produce a JSON model with 8-bit cumulative thresholds (0..=255):

python models/tokenize.py models/sample_corpus.txt -o training-data

This writes training-data/<input>-model.json with 8-bit cumulative transition tables and includes probability_resolution_bits: 8 in the metadata.
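
An 8-bit cumulative table replaces each transition's probability with a running total scaled into 0..=255, so that a single sampled byte can pick the successor. A minimal sketch of that quantization idea (illustrative only; the script's exact rounding may differ):

/// Turn (token_id, probability) pairs into 8-bit cumulative thresholds.
/// The probabilities are assumed to sum to roughly 1.0.
fn to_cumulative_u8(probs: &[(u32, f64)]) -> Vec<(u32, u8)> {
    let mut thresholds = Vec::with_capacity(probs.len());
    let mut running_total = 0.0;
    for &(token, p) in probs {
        running_total += p;
        // Scale the running total into 0..=255.
        thresholds.push((token, (running_total * 255.0).round().clamp(0.0, 255.0) as u8));
    }
    // Force the final threshold to 255 so every sampled byte maps to some token.
    if let Some(last) = thresholds.last_mut() {
        last.1 = 255;
    }
    thresholds
}

fn main() {
    // Two equally likely successors end up near the midpoint and at 255.
    println!("{:?}", to_cumulative_u8(&[(3, 0.5), (8, 0.5)]));
}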

Entropy consumption and weighted transitions (8-bit)

The Rust generator turns a fixed byte slice of entropy into a sequence of tokens by repeatedly sampling transitions from a weighted distribution encoded as cumulative probabilities with 8-bit resolution.

Entropy consumption

  • A BitReader wraps the entropy bytes and exposes a bit cursor (a sketch of such a reader follows this list).
  • Each token choice reads 8 bits, yielding a value in 0..=255. If fewer than 8 bits remain, generation stops (or falls back to a default for the end token).
  • Using 8-bit chunks reduces entropy usage per step and matches the transition tables’ resolution.
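
A minimal sketch of such a bit reader, reduced to the fixed 8-bit reads described above (the crate's actual BitReader may differ in its details):

/// Reads fixed-size chunks of bits from a byte slice, most significant bit first.
struct BitReader<'a> {
    bytes: &'a [u8],
    cursor: usize, // position in bits
}

impl<'a> BitReader<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Self { bytes, cursor: 0 }
    }

    /// Next 8 bits as a value in 0..=255, or None when fewer than 8 bits remain.
    fn read_u8(&mut self) -> Option<u8> {
        if self.cursor + 8 > self.bytes.len() * 8 {
            return None;
        }
        let mut value = 0u8;
        for _ in 0..8 {
            let byte = self.bytes[self.cursor / 8];
            let bit = (byte >> (7 - (self.cursor % 8))) & 1;
            value = (value << 1) | bit;
            self.cursor += 1;
        }
        Some(value)
    }
}

fn main() {
    // Each 8-bit read drives one weighted token choice in the generator.
    let mut reader = BitReader::new(&[0xAB, 0xCD]);
    while let Some(chunk) = reader.read_u8() {
        println!("{chunk}"); // prints 171, then 205
    }
}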

Weighted transition selection

  • Transitions are stored as (token_id, cumulative_probability) pairs with cumulative values in 0..=255 (u8).
  • Given an 8-bit value, the code selects the first entry whose cumulative probability is greater than or equal to the value. If no entry matches (a safety fallback), it returns the last token (see the sketch after this list).
  • For example, with two options (A, 127) and (B, 255), values 0..=127 yield A, and 128..=255 yield B.
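
A minimal sketch of that selection rule, reusing the (A, 127) / (B, 255) example from the last bullet (the real generator works with token ids from the model rather than characters):

/// Pick a token from 8-bit cumulative thresholds: the first entry whose
/// threshold is >= the sampled value wins; fall back to the last entry.
fn select_token(value: u8, transitions: &[(char, u8)]) -> char {
    transitions
        .iter()
        .find(|&&(_, cumulative)| cumulative >= value)
        .or(transitions.last())
        .map(|&(token, _)| token)
        .expect("transition table must not be empty")
}

fn main() {
    let transitions = [('A', 127), ('B', 255)];
    assert_eq!(select_token(0, &transitions), 'A');
    assert_eq!(select_token(127, &transitions), 'A');
    assert_eq!(select_token(128, &transitions), 'B');
    assert_eq!(select_token(255, &transitions), 'B');
}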

In summary: entropy is consumed in fixed 8-bit chunks, and those chunks are interpreted through cumulative probability tables to pick the next token.
