fix(i18n): clean _protected wrong-forms that corrupt prose across all 12 dicts #218
Merged
… 12 dicts

PR 2 of the multi-agent re-review fix campaign (PR 1 was #217, the engine). The re-review (225 findings) plus a 12-language harmonization pass under a single "can this wrong-form appear in correct prose?" rule cleaned every _protected block.

restoreProtectedTerms rewrites each wrong-form -> correct term. PR 1 anchored Latin/Cyrillic matching so a wrong-form no longer corrupts a LONGER word, but a wrong-form that is itself a real word / given name / the dict's own native translation is still destructive (and CJK forms still use literal replaceAll). This PR removes those data-level offenders, keeping only safe forms:

- Real-word / given-name collisions neutralized to self-referential keep-English brand keys: es/it "Claudio", pt "Cláudio" (man's name); fr "Anthropique", it/es/pt "Antropico"/"Antrópico", de "Anthropisch" (real adjective); fr "cotravail", es "cotrabajo", pt "cotrabalho" (real/coined words); vi "Mã Claude".
- Generic concepts the dictionaries render natively are dropped entirely so the native rendering stands and CJK prose stops being rewritten: slash command, subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch, Enterprise. These are the wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル / プラグイン / フック / 후크 / субагент / etc., every one a common standalone word.
- "Skills" (the product) becomes a self-ref keep-English key in every CJK locale, dropping the real-word wrong-forms 技能 / スキル / 스킬.
- SAFE phonetic brand transliterations are KEPT as restores (they never occur as real prose and fix GT's transliteration back to the English brand): ja クロード, zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de "Koarbeit", es "Código Claude", etc.

Net: 50 generic entries dropped, 23 brand wrong-form arrays neutralized, across the 12 source dictionaries (+ regenerated plugin data). All brand/product KEYS are preserved for the Gemini keep-English path.

9 gates green (incl. the PR 1 glossary build-guard) · 553 unit · e2e 20/20.
Defers to PR 3: Managed Agents canonicalization ("Claude Managed Agents") and residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs others).
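To make the failure mode above concrete, here is a minimal sketch of a restore pass with a letter/digit boundary for Latin/Cyrillic wrong-forms and a literal replaceAll for CJK ones. It is not the project's implementation: the map shape ({ correctTerm: [wrongForms] }), the CJK range test, and the helper names are assumptions, and the real engine has further guards not modeled here.

```javascript
// Hypothetical sketch; map shape and helpers are assumptions, not project code.
// Rough test for kana, CJK ideographs, and Hangul syllables.
const CJK = /[\u3040-\u30ff\u3400-\u9fff\uac00-\ud7af]/;

function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// protectedMap: { correctTerm: [wrongForm, ...] }
function restoreProtectedTerms(text, protectedMap) {
  let out = text;
  for (const [correct, wrongForms] of Object.entries(protectedMap)) {
    for (const wrong of wrongForms) {
      // A self-referential entry (wrong === correct) rewrites nothing;
      // it only keeps the key available for a keep-English list.
      if (wrong === correct) continue;
      if (CJK.test(wrong)) {
        // No word boundaries in CJK text, so the match is literal.
        out = out.replaceAll(wrong, correct);
      } else {
        // Anchor on letter/digit boundaries so a longer word is left alone.
        const re = new RegExp(
          `(?<![\\p{L}\\p{N}])${escapeRegExp(wrong)}(?![\\p{L}\\p{N}])`,
          "gu"
        );
        out = out.replace(re, correct);
      }
    }
  }
  return out;
}

// Safe transliteration: restored, and the longer word is untouched.
console.log(restoreProtectedTerms("Клод и Клодия", { Claude: ["Клод"] }));
// Real given name as a wrong-form: anchoring cannot help, the prose is corrupted.
console.log(restoreProtectedTerms("Claudio es mi hermano", { Claude: ["Claudio"] }));
// Common CJK word as a wrong-form: ordinary prose is rewritten.
console.log(restoreProtectedTerms("学习新技能", { Skills: ["技能"] }));
```

Under these assumptions the second and third calls show why the fix has to happen in the data: the matcher behaves as designed, but the wrong-form itself is legitimate prose.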
heznpc added a commit that referenced this pull request on Jun 16, 2026
…s (12 dicts) (#219)

PR 3 of the multi-agent re-review fix campaign (PR 1 #217 engine, PR 2 #218 data). Scope per owner decision: objective content DEFECTS + brand/product English-retention only. Stylistic word-choice (how to render "AI Fluency", subagent synonyms, generic rendering-consistency) is deferred to native review (#202).

108 value-only edits across the 12 source dictionaries, produced and then adversarially verified by an independent per-language pass (46 out-of-scope or incorrect proposals were rejected; every applied edit's prior value matched the file exactly; no keys added/removed, so key-parity holds):

- Mistranslations (meaning was wrong): zh-CN Delegation 授权 "authorize" -> 委托 "delegate" and Diligence 审核 "review" -> 勤奋 "diligence" (the 4D competency names); zh-TW 審核 -> 勤勉; ja "steerable" 操舵可能 (nautical) -> 制御しやすい; ru "Headless mode" 自律 -> без интерфейса; it "Prompts" (verb-read) -> Prompt, trigger "grilletto" (gun trigger) -> attivazione, Sign In/Up swap.
- Garbled strings (residue of the old unanchored protected-terms replaceAll): it "affidskill" -> affidabilità, "scalskill" -> scalabilità, "Aghook eventi" -> "Eventi degli hook".
- Untranslated fragments: Bedrock/Vertex catalog connectives ("Claude with ..." -> con/avec/com), "Powered by", English lead-ins left mid-sentence (ko/ja).
- Brand-policy: product names restored to English in prose: Skills / Agent Skills (the largest group: 技能/skill/Fähigkeiten/Habilidades/agent skills -> Skills/Agent Skills across agentSkills + claude101 + catalog), Claude Code (it "Codice Claude" -> Claude Code), Model Context Protocol (it). This pairs with PR 2, which removed the now-unsafe auto-restore of 技能/スキル/Skills, so the curated values now carry the English brand directly.
- German grammar: separable verb "Diese Lektion hervorhebt" -> "hebt ... hervor", genitive "Claude's" -> "Claudes", malformed compounds "KI-Fluencysplan" -> "KI-Fluency-Plan", "Lektion Rückblick" -> "Lektionsrückblick".
9 gates green · 553 unit · e2e 20/20 (one tight-timeout PDF-popup test flaked twice under local load; passed in isolation and on a determinism re-run — it is fixture-based and cannot be affected by dictionary content).
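The two invariants named above (each applied edit's prior value matches the file exactly, and no keys are added or removed) can be sketched as a small checker. This is illustrative only: the flat key -> string dictionary shape, the { key, from, to } edit format, and both function names are assumptions, not the tooling actually used for this pass.

```javascript
// Hypothetical sketch; the dictionary and edit shapes are assumptions.
// edits: [{ key, from, to }] where `from` must equal the current value.
function applyValueEdits(dict, edits) {
  const next = { ...dict };
  const rejected = [];
  for (const { key, from, to } of edits) {
    const has = Object.prototype.hasOwnProperty.call(dict, key);
    if (!has || dict[key] !== from) {
      // Unknown key or stale prior value: refuse rather than guess.
      rejected.push(key);
      continue;
    }
    next[key] = to;
  }
  return { next, rejected };
}

// Key parity: both objects expose exactly the same set of keys.
function sameKeys(a, b) {
  const ka = Object.keys(a).sort();
  const kb = Object.keys(b).sort();
  return ka.length === kb.length && ka.every((k, i) => k === kb[i]);
}

const zhCN = { Delegation: "授权", Diligence: "审核" };
const { next, rejected } = applyValueEdits(zhCN, [
  { key: "Delegation", from: "授权", to: "委托" },
  { key: "Diligence", from: "审查", to: "勤奋" }, // stale prior value: rejected
]);
console.log(next.Delegation, rejected, sameKeys(zhCN, next));
```

Rejecting on a prior-value mismatch means a proposal written against an outdated copy of the file cannot silently overwrite a newer value.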
heznpc added a commit that referenced this pull request on Jun 17, 2026
…llback (#224)

DEFAULT_PROTECTED_TERMS is the Gemini "keep-English" fallback used by getKeepEnglishTerms() only when a locale has no _protected keys. It still listed skill, Subagent, Enterprise, Personal, Plugin, and Dispatch: generic concept words that PR #218 deliberately REMOVED from the per-locale _protected blocks because they are translated natively per locale (concept-vs-product-name policy, docs/TRANSLATION_RULES.md §1). Keeping them in the fallback told Gemini to keep ordinary words in English, the opposite of the shipped policy.

Reduced to brand/product/file-format proper nouns only: API, SDK, Claude, Anthropic, Claude Code, Cowork, Computer Use, SKILL.md, frontmatter.

Low-impact (the fallback only fires for a locale with an empty _protected), but it removes a policy inconsistency a future contributor/LLM pass could be misled by. Surfaced by the doc fact-check in #223.

555 unit (constants assertions unchanged) · gates green · e2e 20/20.
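A rough sketch of the fallback behaviour described here, assuming getKeepEnglishTerms() receives a locale dictionary object and returns an array of terms. The real signature and return type are not shown in this thread, so treat both as hypothetical; only the term list is taken from the commit message.

```javascript
// Term list as stated in the commit message; everything else is assumed.
const DEFAULT_PROTECTED_TERMS = [
  "API", "SDK", "Claude", "Anthropic", "Claude Code",
  "Cowork", "Computer Use", "SKILL.md", "frontmatter",
];

// Hypothetical signature: takes a locale dictionary, returns keep-English terms.
function getKeepEnglishTerms(dict) {
  const keys = Object.keys((dict && dict._protected) || {});
  // The fallback fires only when the locale defines no _protected keys.
  return keys.length > 0 ? keys : DEFAULT_PROTECTED_TERMS;
}

// Locale with its own _protected block: its keys win.
console.log(getKeepEnglishTerms({ _protected: { Claude: ["클로드"], Skills: ["Skills"] } }));
// Locale with an empty block: the reduced default list is used.
console.log(getKeepEnglishTerms({ _protected: {} }));
```

Because the per-locale keys take precedence whenever they exist, a generic word left in the default list would only surface for a locale with an empty block, which matches the "low-impact" assessment above.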
heznpc added a commit that referenced this pull request on Jun 17, 2026
… write paths (#231)

The GT queue applies restoreProtectedTerms() deterministically, but two other translation write paths bypassed it and relied only on the prompt's "keep English" instruction (probabilistic): the Gemini inline-HTML block translator and the code-comment translator. When the model ignored the instruction, a brand/API term (e.g. "Claude" → "클로드") was written into the lesson DOM untouched.

Apply the same deterministic safety net to both paths:

- gemini-block.js: restoreProtectedTerms() on the model reply before the DOM write.
- code-comments.js: restore the translated comment before escaping/splicing, and build the protected-terms map up front so the standalone code-comment path has it.

This is now safe to apply broadly because the engine was hardened in #217/#218 (substring guard + Latin/Cyrillic boundary + CJK interpunct guard + real-word data cleanup): restoreProtectedTerms no longer corrupts prose, so extending it to more write paths carries no regression risk. restoreProtectedTerms is also a no-op when the map is unbuilt, and protected-terms.js loads before both modules.

Tests genuinely catch the regression (verified by stash-rebuild-rerun on both):

- gemini-block.test.js: a unit test feeding "클로드" asserts "<strong>Claude</strong>" is written and the restored text is cached. Fails without the fix.
- code-comments e2e: the GT stub now returns "클로드 프롬프트 예시"; the existing assertion expects "# Claude 프롬프트 예시", so it only passes if restoration fires. Fails ("# 클로드 …") without the fix.

lint · format · 556 unit · gemini-block 26 · e2e 21/21.
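The shape of the gemini-block.js change can be sketched as "restore, then cache, then write". The sketch below is not the actual module: writeTranslatedBlock, its parameters, and the simplified stand-in for restoreProtectedTerms are all hypothetical, and the DOM write is injected as a callback so the example runs outside a browser.

```javascript
// Simplified stand-in for the real engine (assumption): literal replacement
// only, and a no-op when the protected-terms map has not been built.
function restoreProtectedTerms(text, map) {
  if (!map) return text;
  let out = text;
  for (const [correct, wrongForms] of Object.entries(map)) {
    for (const wrong of wrongForms) {
      if (wrong !== correct) out = out.replaceAll(wrong, correct);
    }
  }
  return out;
}

// Hypothetical write path: restore the model reply BEFORE caching and writing,
// so neither the cache nor the DOM ever sees the unrestored text.
function writeTranslatedBlock(reply, map, cache, cacheKey, write) {
  const restored = restoreProtectedTerms(reply, map);
  cache.set(cacheKey, restored);
  write(restored);
  return restored;
}

const cache = new Map();
const written = [];
writeTranslatedBlock(
  "<strong>클로드</strong>",
  { Claude: ["클로드"] },
  cache,
  "block-1",
  (html) => written.push(html) // stands in for the DOM write
);
console.log(written[0], cache.get("block-1"));
```

Caching the restored text rather than the raw reply matters for the same reason the unit test checks it: a cached unrestored string would reintroduce the wrong-form on the next cache hit.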
PR 2 of the multi-agent re-review fix campaign (#217 was PR 1, the restore engine).
A 12-language re-review (225 findings) + a harmonization pass under one rule, "can this wrong-form ever appear in CORRECT target-language prose?", cleaned every _protected block.
Why
restoreProtectedTerms rewrites each wrong-form → correct term. PR 1 anchored Latin/Cyrillic matching so a wrong-form no longer corrupts a longer word, but a wrong-form that is itself a real word, a given name, or the dict's own native translation is still destructive, and CJK forms still use literal replaceAll. This removes those data-level offenders.
What changed (50 entries dropped · 23 arrays neutralized · 12 dicts)
- Real-word / given-name collisions neutralized to self-referential keep-English brand keys: es/it Claudio, pt Cláudio (man's name); fr Anthropique, it/es/pt Antropico/Antrópico, de Anthropisch (real adjective); fr cotravail, es cotrabajo, pt cotrabalho; vi Mã Claude.
- Skills (product) → self-ref keep-English key in every CJK locale, dropping real-word wrong-forms 技能 / スキル / 스킬.
- Safe phonetic brand transliterations kept as restores, e.g. de Koarbeit, es Código Claude.
All brand/product KEYS preserved for the Gemini keep-English path.
Verify
9 gates green (incl. PR 1's glossary substring build-guard) · 553 unit · e2e 20/20.
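The review rule ("can this wrong-form ever appear in correct target-language prose?") needs human judgment, but one mechanical proxy is to flag any wrong-form that already occurs inside the dictionary's own translated values. The sketch below illustrates that idea only; it is not the PR 1 build-guard, and the dictionary shape (string values plus a _protected map of correct term -> wrong-forms) and the function name are assumptions.

```javascript
// Hypothetical audit; dictionary shape and function name are assumptions.
function findUnsafeWrongForms(dict) {
  const protectedMap = dict._protected || {};
  // Every curated translation string in the dictionary.
  const values = Object.entries(dict)
    .filter(([key, value]) => key !== "_protected" && typeof value === "string")
    .map(([, value]) => value);

  const hits = [];
  for (const [correct, wrongForms] of Object.entries(protectedMap)) {
    for (const wrong of wrongForms) {
      if (wrong === correct) continue; // self-referential keys never rewrite
      // If the dict itself uses the wrong-form, restoring it would
      // rewrite the dict's own native rendering.
      const example = values.find((value) => value.includes(wrong));
      if (example !== undefined) hits.push({ correct, wrong, example });
    }
  }
  return hits;
}

const zh = {
  _protected: { Skills: ["技能"], Claude: ["克劳德", "Claude"] },
  "Learn new skills": "学习新技能",
  "Ask Claude": "询问 Claude",
};
console.log(findUnsafeWrongForms(zh));
```

A check like this catches only collisions the dictionary already exhibits; given names such as Claudio would still need the manual per-language review described above.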
Defers to PR 3: Managed Agents canonicalization (Claude Managed Agents) + residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs ja/zh dropped).
🤖 Generated with Claude Code