A fast command-line tool written in Rust for detecting the language of subtitle files.
It reads subtitle text, strips timing and formatting markup, tokenizes the spoken dialogue, compares it against built-in language word lists, and reports the most likely match.
The project supports 80 languages and can output results as a human-readable table, JSON, or CSV.
- Detects the most likely language from subtitle files
- Supports 80 built-in languages, including regional variants
- Works with common subtitle and plain-text formats: .srt, .ass, .ssa, .sub, .txt
- Cleans subtitle markup before detection
- Handles:
  - SRT sequence numbers and timestamps
  - SSA/ASS dialogue lines and metadata
  - SSA/ASS override tags like {…}
  - HTML-like tags such as <i> and <font>
- Two-pass detection:
  - Pass 1: language identification using shared common words
  - Pass 2: variant disambiguation using weighted dialect-specific markers
- Multithreaded scanning across all supported languages
- Multiple output formats: table, json, csv
- Debug mode to inspect exactly which words contributed to a language's score
# Download the binary
wget https://github.com/Rabbit-Company/language-detector/releases/latest/download/language-detector-$(uname -m)-gnu
# Make the binary executable
chmod +x language-detector-$(uname -m)-gnu
# Move the binary to `/usr/local/bin`
sudo mv language-detector-$(uname -m)-gnu /usr/local/bin/language-detector
# Start language detector
language-detector

The detector follows a simple pipeline:
- Read the subtitle file
- Strip non-dialogue content such as timestamps, metadata, and markup
- Tokenize the remaining text
  - whitespace-based tokenization for space-delimited languages
  - character and bigram tokenization for scripts that usually do not separate words with spaces
- Two-pass scoring
  - Pass 1 compares tokens against the shared common_words of every language. This identifies the base language (e.g., Spanish, Portuguese, Chinese).
  - Pass 2 re-ranks language variants (e.g., es-419 vs es-ES) using weighted_words: dialect-specific spelling, grammar, and vocabulary that carry a higher score.
- Return ranked results with the top match shown as the detected language
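The cleaning and tokenization steps above can be sketched roughly as follows. This is a minimal illustration for SRT-style input only, not the code in cleaner.rs: the function names (`strip_tags`, `is_srt_noise`, `clean_srt`, `tokenize_words`, `tokenize_bigrams`) are hypothetical, and SSA/ASS dialogue extraction is left out.

```rust
/// Strip `{...}` override tags and `<...>` HTML-like tags from one line.
fn strip_tags(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    // `Some(c)` while we are inside a tag that ends with the character `c`.
    let mut closer: Option<char> = None;
    for ch in line.chars() {
        match (closer, ch) {
            (Some(c), _) if ch == c => closer = None,
            (Some(_), _) => {} // still inside a tag: drop the character
            (None, '{') => closer = Some('}'),
            (None, '<') => closer = Some('>'),
            (None, _) => out.push(ch),
        }
    }
    out
}

/// True for SRT lines that carry no dialogue: blank lines, sequence
/// numbers, and `00:00:01,000 --> 00:00:02,000` timing lines.
/// Simplification: a dialogue line made only of digits is dropped too.
fn is_srt_noise(line: &str) -> bool {
    let t = line.trim();
    t.is_empty() || t.contains("-->") || t.chars().all(|c| c.is_ascii_digit())
}

/// Keep only the dialogue text of SRT-style input.
fn clean_srt(input: &str) -> String {
    input
        .lines()
        .filter(|l| !is_srt_noise(l))
        .map(strip_tags)
        .collect::<Vec<_>>()
        .join("\n")
}

/// Whitespace tokenization for space-delimited languages: lowercase each
/// word and trim surrounding punctuation.
fn tokenize_words(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Character-bigram tokenization for scripts written without spaces.
fn tokenize_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().filter(|c| c.is_alphanumeric()).collect();
    chars
        .windows(2)
        .map(|pair| pair.iter().collect::<String>())
        .collect()
}
```

In this sketch, `clean_srt` would be called first and its output passed to `tokenize_words` or `tokenize_bigrams`, depending on the script being tested.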
main.rs      CLI entry point and orchestration
cleaner.rs   Subtitle cleanup and tokenization
scanner.rs   Language scanning and scoring
output.rs    Table, JSON, and CSV renderers
languages/   Built-in language catalogue and word lists
language-detector [OPTIONS] <FILE>

- <FILE>: path to a file
- -f, --format <FORMAT>: output format: table, json, or csv
- -d, --debug <LANG>: debug mode: show detailed match info for a language (accepts name, ISO code, or BCP 47 tag)
- --dump-text: print the cleaned text used for word matching and exit (debug SRT/SSA/ASS dialogue extraction)
- -V, --version: print version information
- -h, --help: show help
Detect the language of a file:
language-detector movie.srt

Output JSON:
language-detector -f json movie.srt

Output CSV:
language-detector --format csv movie.srt

Save JSON to a file:
language-detector -f json movie.srt > result.json

Example table output:

┌─────────────┬────────────────────┐
│ File        │ Total words parsed │
├─────────────┼────────────────────┤
│ sub_spa.ass │ 2124               │
└─────────────┴────────────────────┘
┌─────────────────────────┬───────────┬───────────┬────────┬─────────────────────┬────────────────┐
│ Detected language       │ ISO 639-1 │ ISO 639-2 │ BCP 47 │ Confidence          │ Weighted score │
├─────────────────────────┼───────────┼───────────┼────────┼─────────────────────┼────────────────┤
│ Spanish (Latin America) │ es        │ spa       │ es-419 │ 43.69% (928 / 2124) │ 31.00          │
└─────────────────────────┴───────────┴───────────┴────────┴─────────────────────┴────────────────┘
┌────┬─────────────────────────┬───────┬───────┬────────┬─────────┬────────────┬──────────┐
│ #  │ Language                │ 639-1 │ 639-2 │ BCP 47 │ Matches │ Confidence │ Weighted │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 1  │ Spanish (Latin America) │ es    │ spa   │ es-419 │ 928     │ 43.69%     │ 31.00    │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 2  │ Spanish (Spain)         │ es    │ spa   │ es-ES  │ 928     │ 43.69%     │ 5.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 3  │ Catalan                 │ ca    │ cat   │ -      │ 507     │ 23.87%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 4  │ Galician                │ gl    │ glg   │ -      │ 469     │ 22.08%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 5  │ Portuguese (Portugal)   │ pt    │ por   │ pt-PT  │ 414     │ 19.49%     │ 140.00   │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 6  │ Portuguese (Brazil)     │ pt    │ por   │ pt-BR  │ 414     │ 19.49%     │ 45.00    │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 7  │ French                  │ fr    │ fra   │ -      │ 383     │ 18.03%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 8  │ Italian                 │ it    │ ita   │ -      │ 323     │ 15.21%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 9  │ Romanian                │ ro    │ ron   │ -      │ 259     │ 12.19%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 10 │ Hungarian               │ hu    │ hun   │ -      │ 176     │ 8.29%      │ 0.00     │
└────┴─────────────────────────┴───────┴───────┴────────┴─────────┴────────────┴──────────┘
Example JSON output:

{
  "file": "sub_spa.ass",
  "total_words": 2124,
  "detected": {
    "language": "Spanish (Latin America)",
    "iso_639_1": "es",
    "iso_639_2": "spa",
    "bcp47": "es-419",
    "matched_words": 928,
    "confidence": 0.4369,
    "weighted_score": 31.0
  },
  "scores": [
    {
      "rank": 1,
      "language": "Spanish (Latin America)",
      "iso_639_1": "es",
      "iso_639_2": "spa",
      "bcp47": "es-419",
      "matched_words": 928,
      "total_words": 2124,
      "confidence": 0.4369,
      "weighted_score": 31.0
    },
    {
      "rank": 2,
      "language": "Spanish (Spain)",
      "iso_639_1": "es",
      "iso_639_2": "spa",
      "bcp47": "es-ES",
      "matched_words": 928,
      "total_words": 2124,
      "confidence": 0.4369,
      "weighted_score": 5.0
    },
    {
      "rank": 3,
      "language": "Catalan",
      "iso_639_1": "ca",
      "iso_639_2": "cat",
      "bcp47": null,
      "matched_words": 507,
      "total_words": 2124,
      "confidence": 0.2387,
      "weighted_score": 0.0
    }
  ]
}

Example CSV output:

rank,language,iso_639_1,iso_639_2,bcp47,matched_words,total_words,confidence,weighted_score
1,Spanish (Latin America),es,spa,es-419,928,2124,0.4369,31.0000
2,Spanish (Spain),es,spa,es-ES,928,2124,0.4369,5.0000
3,Catalan,ca,cat,-,507,2124,0.2387,0.0000

This project uses lightweight lexicon-based language detection rather than a large statistical or neural model.
That gives it a few advantages:
- fast
- no external dependencies required at runtime
- easy to inspect and extend
- predictable output
The detector performs two passes over the tokenized text:
- Language identification: every language is scored using its common_words (function words and high-frequency neutral vocabulary). This groups related languages together.
- Variant disambiguation: for languages that share the same ISO 639-2 code (e.g., spa, por, zho), a second pass uses weighted_words. These are dialect-specific spelling patterns, conjugations, and vocabulary that are strong signals for one variant over another.
The Weighted score column in the output shows the total from Pass 2. When two variants have identical Pass 1 match counts, the one with the higher weighted score is ranked higher.
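The ranking rule described here can be illustrated with a small sketch. It assumes a hypothetical `Lexicon` type and `rank` function (neither is taken from scanner.rs) and simplifies the real behaviour: it sorts by Pass 1 match count and uses the Pass 2 weighted score only to break ties, without grouping variants by ISO 639-2 code first.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical lexicon for one language or variant (illustrative names).
struct Lexicon {
    name: &'static str,
    common_words: HashSet<&'static str>,
    weighted_words: HashMap<&'static str, f64>,
}

/// One row of the ranked result.
#[derive(Debug)]
struct Score {
    name: &'static str,
    matched: usize,
    confidence: f64,
    weighted: f64,
}

fn rank(tokens: &[&str], lexicons: &[Lexicon]) -> Vec<Score> {
    let mut scores: Vec<Score> = lexicons
        .iter()
        .map(|lex| {
            let mut matched = 0usize;
            let mut weighted = 0.0f64;
            for &tok in tokens {
                // Pass 1: count tokens found in the shared common-word list.
                if lex.common_words.contains(tok) {
                    matched += 1;
                }
                // Pass 2: add the weight of each dialect-specific marker.
                if let Some(w) = lex.weighted_words.get(tok) {
                    weighted += *w;
                }
            }
            Score {
                name: lex.name,
                matched,
                // Confidence is the share of tokens matched in Pass 1.
                confidence: matched as f64 / tokens.len().max(1) as f64,
                weighted,
            }
        })
        .collect();
    // Higher match count first; equal counts fall back to the weighted score.
    scores.sort_by(|a, b| {
        b.matched
            .cmp(&a.matched)
            .then(b.weighted.total_cmp(&a.weighted))
    });
    scores
}
```

Sorting on the match count first is consistent with the example table, where Portuguese (Portugal) has the largest weighted score but still ranks below both Spanish variants because it matched fewer words.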
Use --debug <LANG> to see exactly which tokens matched common_words and weighted_words, along with their hit counts and contributions. This is invaluable for tuning word lists and understanding why a language scored the way it did.
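To give a feel for the kind of breakdown a debug view reports, here is a minimal sketch that counts how often each lexicon word appears in the token stream. The `hit_counts` function is hypothetical and is not the tool's actual debug code; the real output format may differ.

```rust
use std::collections::{BTreeMap, HashSet};

/// Count how often each lexicon word occurs in the token stream.
/// A BTreeMap keeps the report in a stable alphabetical order.
fn hit_counts<'a>(tokens: &[&'a str], lexicon: &HashSet<&str>) -> BTreeMap<&'a str, usize> {
    let mut hits = BTreeMap::new();
    for &tok in tokens {
        if lexicon.contains(tok) {
            // First sighting inserts 0, then every sighting adds 1.
            *hits.entry(tok).or_insert(0) += 1;
        }
    }
    hits
}
```

Running the same function once against a common_words set and once against the keys of a weighted_words list would separate the two kinds of contribution.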
This is a practical detector, not a full linguistic analyzer.
You may see weaker results when:
- the subtitle file is very short
- the text is mostly names, numbers, or sound effects
- two languages are very closely related
- subtitles are heavily mixed between multiple languages
- the lexicon for a language is too small or not representative
To add or improve a language:
- Add a language module in languages/
- Provide:
  - English name
  - ISO 639-1 code
  - ISO 639-2 code
  - a common_words list (shared, neutral vocabulary)
  - a weighted_words list (dialect-specific markers, if a variant)
- Register the language in languages/mod.rs
The better the word lists, the better the detector performs.
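As a rough idea of what a language entry might contain, the sketch below models the fields listed above. The `Language` struct, the `all_languages` registry function, and the tiny word lists are all assumptions made for illustration; the actual types and registration mechanism in languages/mod.rs may look quite different.

```rust
/// Hypothetical shape of a language entry (not the project's real type).
pub struct Language {
    pub name: &'static str,
    pub iso_639_1: &'static str,
    pub iso_639_2: &'static str,
    /// Set only for regional variants, e.g. "es-419".
    pub bcp47: Option<&'static str>,
    /// Shared, neutral vocabulary used in Pass 1.
    pub common_words: &'static [&'static str],
    /// Dialect-specific markers with their weights, used in Pass 2.
    pub weighted_words: &'static [(&'static str, f64)],
}

// Placeholder word lists: far too short for real detection.
pub static CATALAN: Language = Language {
    name: "Catalan",
    iso_639_1: "ca",
    iso_639_2: "cat",
    bcp47: None,
    common_words: &["que", "de", "és", "amb", "per"],
    weighted_words: &[],
};

pub static SPANISH_LATIN_AMERICA: Language = Language {
    name: "Spanish (Latin America)",
    iso_639_1: "es",
    iso_639_2: "spa",
    bcp47: Some("es-419"),
    common_words: &["que", "de", "la", "el", "no"],
    weighted_words: &[("ustedes", 3.0), ("computadora", 4.0)],
};

/// Hypothetical registry: every entry must be listed here to be scanned.
pub fn all_languages() -> Vec<&'static Language> {
    vec![&CATALAN, &SPANISH_LATIN_AMERICA]
}
```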
The program exits with an error when:
- no file path is provided
- an unknown option is used
- an unknown output format is used
- the file cannot be read
- no usable words are found after cleaning
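The first three error conditions can be sketched as a small argument parser. This is an illustration only: `Format`, `CliError`, and `parse_args` are hypothetical names, the sketch handles just -f/--format and the file path (so real options such as --debug would be rejected here), and treating table as the default format is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Format {
    Table,
    Json,
    Csv,
}

#[derive(Debug, PartialEq)]
enum CliError {
    MissingFile,
    UnknownOption(String),
    UnknownFormat(String),
}

/// Parse `[-f FORMAT] FILE` from the arguments after the program name.
fn parse_args(args: &[&str]) -> Result<(Format, String), CliError> {
    let mut format = Format::Table; // assumed default
    let mut file: Option<String> = None;
    let mut it = args.iter();
    while let Some(&arg) = it.next() {
        match arg {
            "-f" | "--format" => {
                // A missing value is reported as an unknown (empty) format.
                let value = it.next().copied().unwrap_or("");
                format = match value {
                    "table" => Format::Table,
                    "json" => Format::Json,
                    "csv" => Format::Csv,
                    other => return Err(CliError::UnknownFormat(other.to_string())),
                };
            }
            opt if opt.starts_with('-') => {
                return Err(CliError::UnknownOption(opt.to_string()));
            }
            path => file = Some(path.to_string()),
        }
    }
    file.map(|f| (format, f)).ok_or(CliError::MissingFile)
}
```

A caller would typically print the error and exit with a non-zero status; unreadable files and empty token lists would be reported later, once the file has been opened and cleaned.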