Rabbit-Company/language-detector

Language Detector

A fast command-line tool written in Rust for detecting the language of subtitle files.

It reads subtitle text, strips timing and formatting markup, tokenizes the spoken dialogue, compares it against built-in language word lists, and reports the most likely match.

The project supports 80 languages and can output results as a human-readable table, JSON, or CSV.

Features

  • Detects the most likely language from subtitle files
  • Supports 80 built-in languages, including regional variants
  • Works with common subtitle and plain-text formats:
    • .srt
    • .ass
    • .ssa
    • .sub
    • .txt
  • Cleans subtitle markup before detection
  • Handles:
    • SRT sequence numbers and timestamps
    • SSA/ASS dialogue lines and metadata
    • SSA/ASS override tags like {\…}
    • HTML-like tags such as <i> and <font>
  • Two‑pass detection:
    • Pass 1 – language identification using shared common words
    • Pass 2 – variant disambiguation using weighted dialect‑specific markers
  • Multithreaded scanning across all supported languages
  • Multiple output formats:
    • table
    • json
    • csv
  • Debug mode to inspect exactly which words contributed to a language’s score

Installation

# Download the binary for your CPU architecture
wget https://github.com/Rabbit-Company/language-detector/releases/latest/download/language-detector-$(uname -m)-gnu
# Make the binary executable (755 avoids the world-writable permissions of 777)
chmod 755 language-detector-$(uname -m)-gnu
# Move the binary to /usr/local/bin
sudo mv language-detector-$(uname -m)-gnu /usr/local/bin/language-detector
# Verify the installation
language-detector --help

Upgrade

# Download the latest binary
wget https://github.com/Rabbit-Company/language-detector/releases/latest/download/language-detector-$(uname -m)-gnu
# Make it executable
chmod 755 language-detector-$(uname -m)-gnu
# Replace the installed binary
sudo mv language-detector-$(uname -m)-gnu /usr/local/bin/language-detector

How it works

The detector follows a simple pipeline:

  1. Read the subtitle file
  2. Strip non-dialogue content such as timestamps, metadata, and markup
  3. Tokenize the remaining text
    • whitespace-based tokenization for space-delimited languages
    • character and bigram tokenization for scripts that usually do not separate words with spaces
  4. Two‑pass scoring
    • Pass 1 compares tokens against the common_words list of every language. This identifies the base language (e.g., Spanish, Portuguese, Chinese).
    • Pass 2 re-ranks language variants (e.g., es-419 vs es-ES) using weighted_words: dialect-specific spelling, grammar, and vocabulary markers that carry a higher score.
  5. Return ranked results with the top match shown as the detected language
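
Steps 2 and 3 of the pipeline can be sketched roughly as follows. This is a minimal illustration for SRT input only, not the code in cleaner.rs: the function names (strip_markup, clean_srt, tokenize_words, tokenize_bigrams) are hypothetical, and the real cleaner may apply different rules.

```rust
/// Drop `<...>` and `{...}` spans (HTML-like tags and ASS-style override blocks).
/// Hypothetical helper; the real cleaner may handle more cases.
fn strip_markup(line: &str) -> String {
    let mut out = String::new();
    // Some(c) while we are inside a tag that ends with the character c.
    let mut closer: Option<char> = None;
    for ch in line.chars() {
        match closer {
            Some(c) => {
                if ch == c {
                    closer = None;
                }
            }
            None => match ch {
                '<' => closer = Some('>'),
                '{' => closer = Some('}'),
                _ => out.push(ch),
            },
        }
    }
    out
}

/// Keep only dialogue lines from SRT-style text.
/// Simplification: a dialogue line made only of digits is treated as a sequence number.
fn clean_srt(input: &str) -> String {
    let mut kept = Vec::new();
    for line in input.lines() {
        let line = line.trim();
        let is_index = !line.is_empty() && line.chars().all(|c| c.is_ascii_digit());
        let is_timing = line.contains("-->");
        if line.is_empty() || is_index || is_timing {
            continue;
        }
        kept.push(strip_markup(line));
    }
    kept.join("\n")
}

/// Whitespace tokenization: trim surrounding punctuation and lowercase each word.
fn tokenize_words(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Character bigrams for scripts that do not separate words with spaces.
fn tokenize_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().filter(|c| c.is_alphanumeric()).collect();
    chars
        .windows(2)
        .map(|pair| pair.iter().collect::<String>())
        .collect()
}
```

Calling clean_srt on a cue such as `<i>Hola, ¿cómo estás?</i>` leaves only the dialogue text, which tokenize_words then turns into lowercase words ready to be compared against the word lists.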

Project structure

main.rs      CLI entry point and orchestration
cleaner.rs   Subtitle cleanup and tokenization
scanner.rs   Language scanning and scoring
output.rs    Table, JSON, and CSV renderers
languages/   Built-in language catalogue and word lists

Usage

language-detector [OPTIONS] <FILE>

Arguments

  • <FILE>: path to a file

Options

  • -f, --format <FORMAT>: output format: table, json, or csv
  • -d, --debug <LANG>: debug mode; show detailed match info for a language (accepts name, ISO code, or BCP 47 tag)
  • --dump-text: print the cleaned text used for word matching and exit (useful for debugging SRT/SSA/ASS dialogue extraction)
  • -V, --version: print version information
  • -h, --help: show help

Examples

Detect the language of a file:

language-detector movie.srt

Output JSON:

language-detector -f json movie.srt

Output CSV:

language-detector --format csv movie.srt

Save JSON to a file:

language-detector -f json movie.srt > result.json

Example output

Table

┌────────────────────────────────────────────────┬────────────────────┐
│                      File                      │ Total words parsed │
├────────────────────────────────────────────────┼────────────────────┤
│ sub_spa.ass                                    │ 2124               │
└────────────────────────────────────────────────┴────────────────────┘

┌─────────────────────────┬───────────┬───────────┬────────┬─────────────────────┬────────────────┐
│    Detected language    │ ISO 639-1 │ ISO 639-2 │ BCP 47 │     Confidence      │ Weighted score │
├─────────────────────────┼───────────┼───────────┼────────┼─────────────────────┼────────────────┤
│ Spanish (Latin America) │ es        │ spa       │ es-419 │ 43.69% (928 / 2124) │ 31.00          │
└─────────────────────────┴───────────┴───────────┴────────┴─────────────────────┴────────────────┘

┌────┬─────────────────────────┬───────┬───────┬────────┬─────────┬────────────┬──────────┐
│ #  │        Language         │ 639-1 │ 639-2 │ BCP 47 │ Matches │ Confidence │ Weighted │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 1  │ Spanish (Latin America) │ es    │ spa   │ es-419 │ 928     │ 43.69%     │ 31.00    │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 2  │ Spanish (Spain)         │ es    │ spa   │ es-ES  │ 928     │ 43.69%     │ 5.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 3  │ Catalan                 │ ca    │ cat   │ -      │ 507     │ 23.87%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 4  │ Galician                │ gl    │ glg   │ -      │ 469     │ 22.08%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 5  │ Portuguese (Portugal)   │ pt    │ por   │ pt-PT  │ 414     │ 19.49%     │ 140.00   │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 6  │ Portuguese (Brazil)     │ pt    │ por   │ pt-BR  │ 414     │ 19.49%     │ 45.00    │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 7  │ French                  │ fr    │ fra   │ -      │ 383     │ 18.03%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 8  │ Italian                 │ it    │ ita   │ -      │ 323     │ 15.21%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 9  │ Romanian                │ ro    │ ron   │ -      │ 259     │ 12.19%     │ 0.00     │
├────┼─────────────────────────┼───────┼───────┼────────┼─────────┼────────────┼──────────┤
│ 10 │ Hungarian               │ hu    │ hun   │ -      │ 176     │ 8.29%      │ 0.00     │
└────┴─────────────────────────┴───────┴───────┴────────┴─────────┴────────────┴──────────┘

JSON

{
	"file": "sub_spa.ass",
	"total_words": 2124,
	"detected": {
		"language": "Spanish (Latin America)",
		"iso_639_1": "es",
		"iso_639_2": "spa",
		"bcp47": "es-419",
		"matched_words": 928,
		"confidence": 0.4369,
		"weighted_score": 31.0
	},
	"scores": [
		{
			"rank": 1,
			"language": "Spanish (Latin America)",
			"iso_639_1": "es",
			"iso_639_2": "spa",
			"bcp47": "es-419",
			"matched_words": 928,
			"total_words": 2124,
			"confidence": 0.4369,
			"weighted_score": 31.0
		},
		{
			"rank": 2,
			"language": "Spanish (Spain)",
			"iso_639_1": "es",
			"iso_639_2": "spa",
			"bcp47": "es-ES",
			"matched_words": 928,
			"total_words": 2124,
			"confidence": 0.4369,
			"weighted_score": 5.0
		},
		{
			"rank": 3,
			"language": "Catalan",
			"iso_639_1": "ca",
			"iso_639_2": "cat",
			"bcp47": null,
			"matched_words": 507,
			"total_words": 2124,
			"confidence": 0.2387,
			"weighted_score": 0.0
		}
	]
}

CSV

rank,language,iso_639_1,iso_639_2,bcp47,matched_words,total_words,confidence,weighted_score
1,Spanish (Latin America),es,spa,es-419,928,2124,0.4369,31.0000
2,Spanish (Spain),es,spa,es-ES,928,2124,0.4369,5.0000
3,Catalan,ca,cat,-,507,2124,0.2387,0.0000
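
If you consume the CSV output from another Rust program, a small parser along these lines may be enough. It is a sketch under two assumptions that the sample above suggests but does not guarantee: fields never contain commas or quotes, and a missing BCP 47 tag is written as "-". The ScoreRow type and the function names are hypothetical.

```rust
/// One data row of the CSV output (hypothetical type, mirrors the header columns).
#[derive(Debug, PartialEq)]
struct ScoreRow {
    rank: u32,
    language: String,
    iso_639_1: String,
    iso_639_2: String,
    bcp47: Option<String>,
    matched_words: u64,
    total_words: u64,
    confidence: f64,
    weighted_score: f64,
}

/// Parse a single data row; returns None when the row is malformed.
/// Assumes plain comma-separated fields without quoting.
fn parse_row(line: &str) -> Option<ScoreRow> {
    let f: Vec<&str> = line.trim().split(',').collect();
    if f.len() != 9 {
        return None;
    }
    Some(ScoreRow {
        rank: f[0].parse().ok()?,
        language: f[1].to_string(),
        iso_639_1: f[2].to_string(),
        iso_639_2: f[3].to_string(),
        // "-" marks a language without a BCP 47 tag in the sample output.
        bcp47: if f[4] == "-" { None } else { Some(f[4].to_string()) },
        matched_words: f[5].parse().ok()?,
        total_words: f[6].parse().ok()?,
        confidence: f[7].parse().ok()?,
        weighted_score: f[8].parse().ok()?,
    })
}

/// Parse the whole CSV output, skipping the header line and malformed rows.
fn parse_csv(output: &str) -> Vec<ScoreRow> {
    output.lines().skip(1).filter_map(parse_row).collect()
}
```

For anything beyond simple scripts, the JSON output paired with a proper JSON library is likely the more robust choice.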

Detection strategy

This project uses lightweight lexicon-based language detection rather than a large statistical or neural model.

That gives it a few advantages:

  • fast
  • no external dependencies required at runtime
  • easy to inspect and extend
  • predictable output

Two‑pass scoring

The detector performs two passes over the tokenized text:

  1. Language identification – every language is scored using its common_words (function words and high‑frequency neutral vocabulary). This groups related languages together.
  2. Variant disambiguation – for languages that share the same ISO 639‑2 code (e.g., spa, por, zho), a second pass uses weighted_words. These are dialect‑specific spelling patterns, conjugations, and vocabulary that are strong signals for one variant over another.

The Weighted score column in the output shows the total from Pass 2. When two variants have identical Pass 1 match counts, the one with the higher weighted score is ranked higher.
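
The ranking rule can be sketched as below. This is an illustration of the idea, not the code in scanner.rs: the Lang and Score types, the word lists, and the weights are all made up for the example, and the sketch scores weighted_words for every language instead of only for variants that share an ISO 639-2 code.

```rust
use std::collections::HashSet;

/// Hypothetical language entry with the two word lists described above.
struct Lang<'a> {
    name: &'a str,
    common_words: &'a [&'a str],
    weighted_words: &'a [(&'a str, f64)],
}

/// Hypothetical per-language result.
#[derive(Debug)]
struct Score {
    name: String,
    matches: usize,
    confidence: f64,
    weighted: f64,
}

fn score(tokens: &[&str], langs: &[Lang<'_>]) -> Vec<Score> {
    // Avoid dividing by zero on empty input.
    let total = tokens.len().max(1) as f64;
    let mut scores: Vec<Score> = langs
        .iter()
        .map(|lang| {
            let common: HashSet<&str> = lang.common_words.iter().copied().collect();
            let mut matches = 0usize;
            let mut weighted = 0.0f64;
            for &tok in tokens {
                // Pass 1: count tokens found in common_words.
                if common.contains(tok) {
                    matches += 1;
                }
                // Pass 2: add the weight of every dialect marker that matches.
                for &(word, weight) in lang.weighted_words {
                    if word == tok {
                        weighted += weight;
                    }
                }
            }
            Score {
                name: lang.name.to_string(),
                matches,
                confidence: matches as f64 / total,
                weighted,
            }
        })
        .collect();
    // Rank by Pass 1 matches first; the weighted score only breaks ties.
    scores.sort_by(|a, b| {
        b.matches
            .cmp(&a.matches)
            .then_with(|| b.weighted.total_cmp(&a.weighted))
    });
    scores
}
```

In the sample output above, confidence follows the same ratio: 928 matched words out of 2124 gives roughly 0.4369, and es-419 is ranked ahead of es-ES only because its weighted score (31.00) is higher at an identical match count.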

Debug mode

Use --debug <LANG> to see exactly which tokens matched common_words and weighted_words, along with their hit counts and contributions. This helps when tuning word lists and diagnosing why a language scored the way it did.
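
The per-word hit counts that debug mode reports could be computed along these lines. This is a sketch only; hit_counts is a hypothetical helper, and the actual debug output format is whatever the tool prints.

```rust
use std::collections::BTreeMap;

/// Count how often each token from `word_list` occurs in `tokens`.
/// Returns (word, hits) pairs, most frequent first, ties in alphabetical order.
fn hit_counts<'a>(tokens: &[&'a str], word_list: &[&str]) -> Vec<(&'a str, usize)> {
    let mut counts: BTreeMap<&'a str, usize> = BTreeMap::new();
    for &tok in tokens {
        if word_list.iter().any(|w| *w == tok) {
            *counts.entry(tok).or_insert(0) += 1;
        }
    }
    let mut hits: Vec<(&'a str, usize)> = counts.into_iter().collect();
    hits.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    hits
}
```

A breakdown like this makes it easy to spot a word list dominated by one or two very common tokens, which is a typical reason for an unexpectedly high score.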

Limitations

This is a practical detector, not a full linguistic analyzer.

You may see weaker results when:

  • the subtitle file is very short
  • the text is mostly names, numbers, or sound effects
  • two languages are very closely related
  • subtitles are heavily mixed between multiple languages
  • the lexicon for a language is too small or not representative

Extending the project

To add or improve a language:

  1. Add a language module in languages/
  2. Provide:
    • English name
    • ISO 639-1 code
    • ISO 639-2 code
    • a common_words list (shared, neutral vocabulary)
    • a weighted_words list (dialect‑specific markers, if a variant)
  3. Register the language in languages/mod.rs

The better the word lists, the better the detector performs.
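
As a rough picture of what a language entry needs to carry, the sketch below models the fields listed above. The Language struct, its field names, the all_languages function, and the example words and weights are assumptions made for illustration; check the existing modules in languages/ for the actual types and registration mechanism.

```rust
/// Hypothetical shape of a language entry; the real definition may differ.
pub struct Language {
    pub name: &'static str,
    pub iso_639_1: &'static str,
    pub iso_639_2: &'static str,
    /// Only set for regional variants in this sketch.
    pub bcp47: Option<&'static str>,
    /// Shared, neutral vocabulary used in Pass 1.
    pub common_words: &'static [&'static str],
    /// Dialect-specific markers with their weights, used in Pass 2.
    pub weighted_words: &'static [(&'static str, f64)],
}

/// Example entry with tiny, illustrative word lists and made-up weights.
pub static SPANISH_LATAM: Language = Language {
    name: "Spanish (Latin America)",
    iso_639_1: "es",
    iso_639_2: "spa",
    bcp47: Some("es-419"),
    common_words: &["que", "de", "el", "la", "no", "es"],
    weighted_words: &[("carro", 3.0), ("computadora", 3.0)],
};

/// Hypothetical registry, standing in for the registration step in languages/mod.rs.
pub fn all_languages() -> Vec<&'static Language> {
    vec![&SPANISH_LATAM]
}
```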

Exit behavior

The program exits with an error when:

  • no file path is provided
  • an unknown option is used
  • an unknown output format is used
  • the file cannot be read
  • no usable words are found after cleaning
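
Some of these checks can be modeled as a small error type, as in the sketch below. The names (Format, CliError, parse_format, validate) and the error messages are hypothetical; the actual messages and exit codes are defined by the tool itself.

```rust
use std::fmt;

/// The three output formats accepted by --format.
#[derive(Debug, PartialEq)]
enum Format {
    Table,
    Json,
    Csv,
}

/// Hypothetical error type covering a subset of the cases listed above.
#[derive(Debug, PartialEq)]
enum CliError {
    MissingFile,
    UnknownFormat(String),
    NoUsableWords,
}

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CliError::MissingFile => write!(f, "no file path provided"),
            CliError::UnknownFormat(name) => write!(f, "unknown output format: {}", name),
            CliError::NoUsableWords => write!(f, "no usable words found after cleaning"),
        }
    }
}

/// Map a --format value to a Format, rejecting anything else.
fn parse_format(name: &str) -> Result<Format, CliError> {
    match name {
        "table" => Ok(Format::Table),
        "json" => Ok(Format::Json),
        "csv" => Ok(Format::Csv),
        other => Err(CliError::UnknownFormat(other.to_string())),
    }
}

/// Check that a file was given and that cleaning left at least one token.
fn validate(file: Option<&str>, token_count: usize) -> Result<(), CliError> {
    if file.is_none() {
        return Err(CliError::MissingFile);
    }
    if token_count == 0 {
        return Err(CliError::NoUsableWords);
    }
    Ok(())
}
```

When scripting around the tool, checking the process exit status before parsing its output avoids treating an error message as a result.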
