Hashes like `a7b9c3d4e5f6...` are hard to read, compare, and remember. This crate transforms hash bytes into pronounceable text, making them easier on human eyes.
You might use this when verifying file integrity visually, or when you need consistent pseudonyms for names and addresses without caring much about cryptographic strength. It's also handy during debugging when you want to quickly tell hashes apart.
This crate is not trying to be the most secure, fastest, or most entropy-efficient solution. The goal is simply readability.
Add the crate to your Cargo.toml:

```toml
readable-hash = "0.1"
```

Generate a readable hash:
```rust
use readable_hash::{english_word_hash, StdHasher};

fn main() {
    let word = english_word_hash::<StdHasher, _>("hello");
    println!("{word}");
    // thatised
}
```

More examples:
"I" -> "waged"
"different" -> "imaumates"
"pneumonoultramicroscopicsilicovolcanoconiosis" -> "dummaricardemastria"
The `models/` directory bundles a small sample corpus and Python utilities for training a BPE tokenizer with explicit word boundary markers using the `tokenizers` library.
Install dependencies and train the tokenizer:
```bash
pip install -r models/requirements.txt
python models/train_tokenizer.py models/sample_corpus.txt
```

This writes the tokenizer files into `models/tokenizer/`. You can verify that the tokenizer works by encoding text with the helper script (shown below). The training script normalizes input using Unicode NFKC and lowercases it before tokenization, so differently cased forms produce the same tokens. Tokens that continue a word are prefixed with `##`, while tokens that finish a word carry a `</w>` suffix.
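For instance, a word like "hashing" could come out as two tokens; the exact split depends on the trained vocabulary, so this segmentation is purely illustrative:

```
hashing -> hash ##ing</w>
```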
```bash
python models/tokenizer_check.py models/tokenizer/tokenizer.json "Hello WORLD"
```

With a tokenizer trained, an n-gram language model can be built from the corpus:
```bash
python models/train_ngram.py models/sample_corpus.txt
```

Pass `-n` to control the model order (default 2, i.e. bigrams).
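For example, to build a trigram model from the same corpus:

```bash
python models/train_ngram.py models/sample_corpus.txt -n 3
```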
The tokenizer must define start (`<s>`) and end (`</s>`) tokens, which the script uses to mark sentence boundaries. It also verifies that tokens respect the word-boundary markers (`##` prefixes and `</w>` suffixes) before accumulating statistics. The result is written to `ngram.json` and stores transition probabilities between token ids.
Use the built-in tokenizer/trainer to produce a JSON model with 8-bit cumulative thresholds (`0..=255`):

```bash
python models/tokenize.py models/sample_corpus.txt -o training-data
```

This writes `training-data/<input>-model.json` with 8-bit cumulative transition tables and includes `probability_resolution_bits: 8` in the metadata.
The Rust generator turns a fixed byte slice of entropy into a sequence of tokens by repeatedly sampling transitions from a weighted distribution encoded as cumulative probabilities with 8-bit resolution.
Entropy consumption
- A `BitReader` wraps the entropy bytes and exposes a bit cursor (see the sketch after this list).
- Each token choice reads 8 bits, yielding a value in `0..=255`. If fewer than 8 bits remain, generation stops (or falls back to a default for the end token).
- Using 8-bit chunks reduces entropy usage per step and matches the transition tables’ resolution.
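A minimal sketch of such a bit reader, assuming MSB-first bit order; the name and method signatures here are illustrative rather than the crate's exact API:

```rust
/// Bit cursor over a fixed byte slice (illustrative sketch).
struct BitReader<'a> {
    bytes: &'a [u8],
    bit_pos: usize, // absolute bit index into `bytes`
}

impl<'a> BitReader<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Self { bytes, bit_pos: 0 }
    }

    /// Read the next 8 bits as a `u8`, MSB-first. Returns `None` when
    /// fewer than 8 bits remain, which is where generation stops.
    fn read_u8(&mut self) -> Option<u8> {
        if self.bit_pos + 8 > self.bytes.len() * 8 {
            return None;
        }
        let mut value = 0u8;
        for _ in 0..8 {
            let byte = self.bytes[self.bit_pos / 8];
            let bit = (byte >> (7 - (self.bit_pos % 8))) & 1;
            value = (value << 1) | bit;
            self.bit_pos += 1;
        }
        Some(value)
    }
}
```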
Weighted transition selection
- Transitions are stored as `(token_id, cumulative_probability)` pairs with cumulative values in `0..=255` (`u8`).
- Given an 8-bit value, the code selects the first entry whose cumulative probability is greater than or equal to the value. If no entry matches (a safety fallback), it returns the last token (see the sketch after this list).
- For example, with two options `(A, 127)` and `(B, 255)`, values `0..=127` yield `A`, and `128..=255` yield `B`.
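That selection rule fits in a few lines of Rust. A sketch, assuming transitions are held as a slice of `(token_id, cumulative)` pairs; the function name is hypothetical:

```rust
/// Pick a token id from a cumulative-probability table using one
/// 8-bit entropy value (illustrative sketch, not the crate's code).
fn select_token(transitions: &[(u32, u8)], value: u8) -> u32 {
    for &(token_id, cumulative) in transitions {
        if cumulative >= value {
            return token_id;
        }
    }
    // Safety fallback: return the last token if nothing matched.
    transitions.last().map(|&(id, _)| id).expect("non-empty table")
}
```

With the two-option table from the example, `select_token(&[(a, 127), (b, 255)], 127)` returns `a`, while a value of `128` returns `b`.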
In summary: entropy is consumed in fixed 8-bit chunks, and those chunks are interpreted through cumulative probability tables to pick the next token.
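Combining the two sketches above, the whole generation loop reduces to something like the following; the `Model` shape and field names are assumptions for illustration:

```rust
/// Hypothetical model shape: one cumulative transition table per token id.
struct Model {
    transitions: Vec<Vec<(u32, u8)>>, // indexed by the current token id
    start_token: u32,
    end_token: u32,
}

/// Walk the model, spending 8 bits of entropy per step, until the
/// entropy runs out or the end token is sampled.
fn generate(model: &Model, entropy: &[u8]) -> Vec<u32> {
    let mut reader = BitReader::new(entropy);
    let mut current = model.start_token;
    let mut tokens = Vec::new();
    loop {
        // Stop when fewer than 8 bits remain (or default to the end token).
        let Some(value) = reader.read_u8() else { break };
        let next = select_token(&model.transitions[current as usize], value);
        if next == model.end_token {
            break;
        }
        tokens.push(next);
        current = next;
    }
    tokens
}
```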