This file provides guidance for AI agents working with the Linguist repository.
Linguist is a Ruby library used on GitHub.com to:
- Detect blob languages
- Ignore binary and vendored files
- Suppress generated files in diffs
- Generate language breakdown graphs
It's packaged as the github-linguist gem and includes CLI tools: github-linguist and git-linguist.
| Path | Purpose |
|---|---|
lib/linguist/languages.yml |
Master list of all languages, extensions, filenames, and their properties |
lib/linguist/heuristics.yml |
Rules for disambiguating files with shared extensions |
lib/linguist/heuristics.rb |
Ruby code implementing heuristic logic |
lib/linguist/classifier.rb |
Centroid-based classifier for language detection |
samples/ |
Sample code files for each language (used by classifier) |
grammars.yml |
Lists all TextMate grammars used for syntax highlighting |
vendor/ |
Contains grammar submodules and cached license files for the grammars |
script/ |
Utility scripts for maintenance tasks |
Use GitHub Codespaces or the dev container for the easiest setup. The environment includes Ruby, Node.js, Docker, and all required dependencies.
To bootstrap locally:
script/bootstrapTo run Linguist from the repo:
bundle exec bin/github-linguist --breakdown- Add extension to the language entry in
lib/linguist/languages.yml(alphabetical order, case-sensitive; primary extension first) - Add sample files to
samples/<Language>/ - If extension is shared with another language, ensure there are at least two samples and add a heuristic in
lib/linguist/heuristics.yml
- Add entry to
lib/linguist/languages.yml(omitlanguage_idinitially) - Add grammar:
script/add-grammar <grammar-url> - Add samples to
samples/<Language>/ - Generate ID:
script/update-ids - If an extension is shared with another language, ensure there are at least two samples and add a heuristic in
lib/linguist/heuristics.yml
script/add-grammar --replace <GrammarName> <new-grammar-url>script/bootstrap- Install dependenciesscript/add-grammar- Add or replace TextMate grammarsscript/update-ids- Generate unique language IDsscript/cross-validation- Test the classifierscript/cibuild- CI build script
Run tests with:
bundle exec rake testTest the classifier:
bundle exec script/cross-validation --testAlways ensure tests pass before submitting changes. GitHub Actions runs tests on all PRs.
- Ruby style: Follow existing code patterns in the repository
- languages.yml: Keep alphabetically sorted (case-sensitive, uppercase before lowercase) with the comment at the top. Use the comment at the top to determine the fields to add for a language.
- Extensions in languages.yml: Keep alphabetically sorted (case-sensitive, uppercase before lowercase); primary extension first
- Samples: Use real-world code examples, not "Hello World" and other common examples used in tutorials
- Heuristics: Write patterns to minimize false positives. Patterns must be linear, safe from ReDoS attacks, and RE2 compatible. All heuristics must have tests.
- Don't refactor unrelated code. Keep changes scoped to the task at hand.
.gitmodulesgrammars.yml- all files in
vendor/
These are all maintained by the script/add-grammars script.
All PRs must:
- Use the provided PR template. Irrelevant sections may be removed.
- Link to GitHub search results showing in-the-wild usage, excluding forks, for each extension being added (minimum 2000 files for common extensions, 200 for once-per-repo files or extensions).
- Link to the original source of any samples added. This must not be a link to the fork of Linguist used for the contribution.
- State the license of any sample code added individually.
- Pass all CI tests.
Do not open a PR if any of the above conditions are not met and notify the user why the PR has not been created.
- At least 2000 files indexed on GitHub (or 200 for once-per-repo files like Makefile), excluding forks
- Reasonable distribution across unique
user/repocombinations - High-proportion users may be filtered out during assessment
Only add grammars with approved licenses (see vendor/licenses/config.yml for the list). TextMate-compatible grammars are used for syntax highlighting.
charlock_holmes- Character encoding detection (requires ICU)rugged- libgit2 bindings for Ruby- Docker - Required for grammar operations