fix(i18n): anchor protected-term restoration to stop substring corruption #217
Merged
…tion

A multi-agent re-review of all 12 premium dictionaries found that restoreProtectedTerms did an UNANCHORED replaceAll, so any _protected wrong-form that is a substring/prefix of a real word silently corrupts correct on-page prose (the #172/#197 class, confirmed live across every dictionary). Worst case was a regression I had just shipped: id's "subagent"->["subagen"] turned "subagent" into "subagentst".

Engine fix (src/lib/protected-terms.js):
- buildProtectedTermsMap drops any wrong-form that is a SUBSTRING of its own correct term (e.g. "subagen" inside "subagent"); the longest-first sort can never save a true prefix.
- restoreProtectedTerms now matches Latin/Cyrillic/etc. wrong-forms with a Unicode letter boundary `(?<!\p{L})form(?!\p{L})`, so a form never corrupts a longer word that merely contains it. CJK/Kana/Hangul keep literal replaceAll (those scripts have no word separators, so a term is routinely adjacent to a particle even when it legitimately should be restored; anchoring would BREAK restoration there); their compound corruption stays a per-dictionary data concern.

Guardrail (scripts/check-glossary.js): hard-error on any wrong-form that is a substring of its correct term, so the regression class can't come back.

Data: removed id's "subagent"/"subagents" -> "subagen" entries (the regression).

553 unit (incl. 4 new engine tests: substring guard, Latin boundary, CJK-restoration-preserved) · e2e 20/20 · all gates green.

This is PR 1 of the re-review fix campaign: the engine + guard. PR 2 cleans the per-dictionary _protected DATA (standalone real-word/name collisions + brand transliterations that anchoring cannot resolve).
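
The engine change described above can be illustrated with a minimal sketch. It is not the code in src/lib/protected-terms.js: `restoreSketch`, `escapeRegExp`, `NO_SEPARATOR_SCRIPTS`, and the `[wrongForm, correctTerm]` pair format are hypothetical names chosen for illustration. The first line reproduces the reported "subagentst" output on the assumption that the old map resolved "subagen" to "subagents".

```javascript
// Illustrative sketch only; names and data shape are assumptions, not the project's API.

// Assumed old behaviour: an unanchored replaceAll also fires inside a longer word.
const corrupted = "subagent".replaceAll("subagen", "subagents");

// Scripts written without word separators keep literal replacement.
const NO_SEPARATOR_SCRIPTS =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Escape regex syntax characters so a wrong-form is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// pairs: [[wrongForm, correctTerm], ...], assumed already sorted longest-first.
function restoreSketch(text, pairs) {
  let out = text;
  for (const [wrong, correct] of pairs) {
    if (NO_SEPARATOR_SCRIPTS.test(wrong)) {
      // CJK/Kana/Hangul: a particle may sit directly next to the term.
      out = out.replaceAll(wrong, () => correct);
    } else {
      // Latin/Cyrillic/etc.: require a non-letter (or text edge) on both sides.
      const re = new RegExp(`(?<!\\p{L})${escapeRegExp(wrong)}(?!\\p{L})`, "gu");
      out = out.replace(re, () => correct);
    }
  }
  return out;
}
```

With the boundary in place, `restoreSketch("subagent", [["subagen", "subagents"]])` leaves the word untouched because the match would be followed by the letter "t", while a standalone "Клод" or a particle-adjacent "클로드는" is still restored.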
heznpc added a commit that referenced this pull request on Jun 16, 2026
… 12 dicts (#218)

PR 2 of the multi-agent re-review fix campaign (PR 1 was #217, the engine). The re-review (225 findings) plus a 12-language harmonization pass under a single "can this wrong-form appear in correct prose?" rule cleaned every _protected block.

restoreProtectedTerms rewrites each wrong-form -> correct term. PR 1 anchored Latin/Cyrillic matching so a wrong-form no longer corrupts a LONGER word, but a wrong-form that is itself a real word / given name / the dict's own native translation is still destructive (and CJK forms still use literal replaceAll). This PR removes those data-level offenders, keeping only safe forms:
- Real-word / given-name collisions neutralized to self-referential keep-English brand keys: es/it "Claudio", pt "Cláudio" (man's name); fr "Anthropique", it/es/pt "Antropico"/"Antrópico", de "Anthropisch" (real adjective); fr "cotravail", es "cotrabajo", pt "cotrabalho" (real/coined words); vi "Mã Claude".
- Generic concepts the dictionaries render natively are dropped entirely so the native rendering stands and CJK prose stops being rewritten: slash command, subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch, Enterprise, i.e. the wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル / プラグイン / フック / 후크 / субагент / etc., every one a common standalone word.
- "Skills" (the product) becomes a self-ref keep-English key in every CJK locale, dropping the real-word wrong-forms 技能 / スキル / 스킬.
- SAFE phonetic brand transliterations are KEPT as restores (they never occur as real prose and fix GT's transliteration back to the English brand): ja クロード, zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de "Koarbeit", es "Código Claude", etc.

Net: 50 generic entries dropped, 23 brand wrong-form arrays neutralized, across the 12 source dictionaries (+ regenerated plugin data). All brand/product KEYS are preserved for the Gemini keep-English path.

9 gates green (incl. the PR 1 glossary build-guard) · 553 unit · e2e 20/20.

Defers to PR 3: Managed Agents canonicalization ("Claude Managed Agents") and residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs others).
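
Why anchoring alone cannot fix these data-level offenders can be shown in a few lines. This is a hedged illustration: the `anchored` helper is hypothetical, and the assumption that "Claudio" and 技能 previously restored to "Claude" and "Skills" is inferred from the description above, not taken from the dictionaries.

```javascript
// Illustrative only; the mappings below are assumed, not copied from the real dictionaries.
// The wrong-forms used here contain no regex syntax characters, so no escaping is needed.
const anchored = (wrong) => new RegExp(`(?<!\\p{L})${wrong}(?!\\p{L})`, "gu");

// The letter boundary protects a longer word that merely contains the form...
const insideLongerWord = "Claudiomar".replace(anchored("Claudio"), "Claude");

// ...but a standalone real given name still matches and is rewritten.
const realName = "Claudio è arrivato".replace(anchored("Claudio"), "Claude");

// CJK forms use literal replacement, so a common word inside correct prose is rewritten too.
const cjkProse = "学习新技能".replaceAll("技能", "Skills");
```

Here `insideLongerWord` stays "Claudiomar", while `realName` becomes "Claude è arrivato" and `cjkProse` becomes "学习新Skills", which is the kind of rewrite that removing the entry from the data prevents.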
heznpc added a commit that referenced this pull request on Jun 16, 2026
…s (12 dicts) (#219)

PR 3 of the multi-agent re-review fix campaign (PR 1 #217 engine, PR 2 #218 data). Scope per owner decision: objective content DEFECTS + brand/product English-retention only. Stylistic word-choice (how to render "AI Fluency", subagent synonyms, generic rendering-consistency) is deferred to native review (#202).

108 value-only edits across the 12 source dictionaries, produced and then adversarially verified by an independent per-language pass (46 out-of-scope or incorrect proposals were rejected; every applied edit's prior value matched the file exactly; no keys added/removed, so key-parity holds):
- Mistranslations (meaning was wrong): zh-CN Delegation 授权 "authorize" -> 委托 "delegate" and Diligence 审核 "review" -> 勤奋 "diligence" (the 4D competency names); zh-TW 審核 -> 勤勉; ja "steerable" 操舵可能 (nautical) -> 制御しやすい; ru "Headless mode" 自律 -> без интерфейса; it "Prompts" (verb-read) -> Prompt, trigger "grilletto" (gun trigger) -> attivazione, Sign In/Up swap.
- Garbled strings (residue of the old unanchored protected-terms replaceAll): it "affidskill" -> affidabilità, "scalskill" -> scalabilità, "Aghook eventi" -> "Eventi degli hook".
- Untranslated fragments: Bedrock/Vertex catalog connectives ("Claude with ..." -> con/avec/com), "Powered by", English lead-ins left mid-sentence (ko/ja).
- Brand-policy: product names restored to English in prose: Skills / Agent Skills (the largest group: 技能/skill/Fähigkeiten/Habilidades/agent skills -> Skills/Agent Skills across agentSkills + claude101 + catalog), Claude Code (it "Codice Claude" -> Claude Code), Model Context Protocol (it). This pairs with PR 2, which removed the now-unsafe auto-restore of 技能/スキル/Skills; the curated values now carry the English brand directly.
- German grammar: separable verb "Diese Lektion hervorhebt" -> "hebt ... hervor", genitive "Claude's" -> "Claudes", malformed compounds "KI-Fluencysplan" -> "KI-Fluency-Plan", "Lektion Rückblick" -> "Lektionsrückblick".

9 gates green · 553 unit · e2e 20/20 (one tight-timeout PDF-popup test flaked twice under local load; passed in isolation and on a determinism re-run; it is fixture-based and cannot be affected by dictionary content).
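
The two verification properties mentioned above (each edit's prior value matches the file exactly, and no keys are added or removed) could be checked with something like the following sketch. It assumes a flat `{ key: value }` dictionary and an `[{ key, from, to }]` edit list; both shapes, the key name in the usage note, and the function names are hypothetical, since the real dictionary format is not shown here.

```javascript
// Illustrative sketch; dictionary and edit shapes are assumptions.
function applyValueEdits(dict, edits) {
  const next = { ...dict };
  for (const { key, from, to } of edits) {
    // Value-only: the key must already exist, so no key is ever added.
    if (!Object.hasOwn(dict, key)) throw new Error(`unknown key: ${key}`);
    // The recorded prior value must match the file exactly.
    if (dict[key] !== from) throw new Error(`prior value mismatch for ${key}`);
    next[key] = to;
  }
  return next;
}

// Key parity: both dictionaries expose exactly the same set of keys.
function sameKeys(a, b) {
  const ka = Object.keys(a).sort();
  const kb = Object.keys(b).sort();
  return ka.length === kb.length && ka.every((k, i) => k === kb[i]);
}
```

For example, applying `{ key: "reliability", from: "affidskill", to: "affidabilità" }` to a dictionary whose current value is "affidskill" succeeds and keeps key parity, while the same edit against any other current value throws instead of silently overwriting.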
heznpc added a commit that referenced this pull request on Jun 17, 2026
… write paths (#231)

The GT queue applies restoreProtectedTerms() deterministically, but two other translation write paths bypassed it and relied only on the prompt's "keep English" instruction (probabilistic): the Gemini inline-HTML block translator and the code-comment translator. When the model ignored the instruction, a brand/API term (e.g. "Claude" → "클로드") was written into the lesson DOM untouched.

Apply the same deterministic safety net to both paths:
- gemini-block.js: restoreProtectedTerms() on the model reply before the DOM write.
- code-comments.js: restore the translated comment before escaping/splicing, and build the protected-terms map up front so the standalone code-comment path has it.

This is now safe to apply broadly because the engine was hardened in #217/#218 (substring guard + Latin/Cyrillic boundary + CJK interpunct guard + real-word data cleanup): restoreProtectedTerms no longer corrupts prose, so extending it to more write paths carries no regression risk. restoreProtectedTerms is also a no-op when the map is unbuilt, and protected-terms.js loads before both modules.

Tests genuinely catch the regression (verified by stash-rebuild-rerun on both):
- gemini-block.test.js: a unit test feeding "클로드" asserts "<strong>Claude</strong>" is written and the restored text is cached. Fails without the fix.
- code-comments e2e: the GT stub now returns "클로드 프롬프트 예시"; the existing assertion expects "# Claude 프롬프트 예시", so it only passes if restoration fires. Fails ("# 클로드 …") without the fix.

lint · format · 556 unit · gemini-block 26 · e2e 21/21.
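
The "restore before the DOM write" pattern described above might look roughly like this. It is a sketch under stated assumptions: `makeRestore` is a simplified stand-in for restoreProtectedTerms (literal replacement only, no-op while the map is unbuilt), and `writeBlock`, the cache, and the `target` object are hypothetical names rather than the contents of gemini-block.js.

```javascript
// Illustrative sketch; not the actual gemini-block.js code.

// Stand-in restore: returns the text unchanged while the map is still unbuilt.
function makeRestore(getMap) {
  return (text) => {
    const map = getMap();
    if (!map) return text;
    let out = text;
    for (const [wrong, correct] of map) {
      out = out.replaceAll(wrong, () => correct);
    }
    return out;
  };
}

// Restore first, then cache and write the restored text, never the raw model reply.
function writeBlock(target, cache, key, modelReply, restore) {
  const restored = restore(modelReply);
  cache.set(key, restored);
  target.innerHTML = restored;
  return restored;
}
```

Caching the restored string rather than the raw reply means a later cache hit cannot reintroduce the transliterated brand name.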
A multi-agent re-review of all 12 premium dictionaries found that `restoreProtectedTerms` did an unanchored `replaceAll`, so any `_protected` wrong-form that is a substring/prefix of a real word silently corrupts correct on-page prose (the #172/#197 class, confirmed live across every dictionary). Worst case was a regression I had just shipped: id's `subagent` → `["subagen"]` turned "subagent" into "subagentst".

Engine fix (`src/lib/protected-terms.js`)
- `buildProtectedTermsMap` drops any wrong-form that is a SUBSTRING of its own correct term (e.g. "subagen" ⊂ "subagent"); longest-first sort can never save a true prefix.
- `restoreProtectedTerms` now matches Latin/Cyrillic wrong-forms with a Unicode letter boundary `(?<!\p{L})form(?!\p{L})`, so a form never corrupts a longer word containing it. CJK/Kana/Hangul keep literal `replaceAll`: those scripts have no word separators, so a term is routinely adjacent to a particle even when it legitimately should be restored; anchoring would BREAK restoration there. (CJK compound corruption stays a per-dict data concern → PR 2.)

Guardrail (`scripts/check-glossary.js`)
Hard-errors on any wrong-form that is a substring of its correct term, so the regression class can't return.
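
A guard of this kind can be sketched in a few lines. This is not the contents of scripts/check-glossary.js: the `{ correctTerm: [wrongForm, ...] }` shape of a `_protected` block, the function names, and the choice to skip empty or self-referential forms are all assumptions made for illustration.

```javascript
// Illustrative sketch; data shape and names are assumptions.
function findSubstringOffenders(protectedBlock) {
  const offenders = [];
  for (const [correct, wrongForms] of Object.entries(protectedBlock)) {
    for (const wrong of wrongForms) {
      // Assumption: empty strings and self-referential entries are not restore rules.
      if (wrong !== "" && wrong !== correct && correct.includes(wrong)) {
        offenders.push({ correct, wrong });
      }
    }
  }
  return offenders;
}

// Hard error: throw so the build gate fails instead of only warning.
function checkGlossary(protectedBlock) {
  const offenders = findSubstringOffenders(protectedBlock);
  if (offenders.length > 0) {
    const list = offenders.map((o) => `"${o.wrong}" in "${o.correct}"`).join(", ");
    throw new Error(`wrong-form is a substring of its correct term: ${list}`);
  }
}
```

Running `checkGlossary({ subagent: ["subagen"] })` throws, whereas a block such as `{ Claude: ["Клод"] }` passes because the wrong-form does not occur inside the correct term.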
Data
Removed id's `subagent`/`subagents` → `subagen` (the regression).

Verify
553 unit (4 new engine tests: substring guard, Latin boundary, CJK-restoration-preserved) · e2e 20/20 · all gates green.

PR 1 of the re-review fix campaign (engine + guard). PR 2 cleans per-dictionary `_protected` DATA (standalone real-word/name collisions + brand transliterations anchoring can't resolve); PR 3 the consistency/mistranslation findings.

🤖 Generated with Claude Code