fix(i18n): anchor protected-term restoration to stop substring corruption by heznpc · Pull Request #217 · heznpc/skillBridge

heznpc · 2026-06-16T22:39:08Z

A multi-agent re-review of all 12 premium dictionaries found that restoreProtectedTerms did an unanchored replaceAll, so any _protected wrong-form that is a substring/prefix of a real word silently corrupts correct on-page prose (the #172/#197 class — confirmed live across every dictionary). Worst case was a regression I had just shipped: id's subagent→["subagen"] turned "subagent" into "subagentst".
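
The corruption can be reproduced in a few lines. This is a minimal sketch, not the code from `src/lib/protected-terms.js`: the dictionary shape (correct term mapped to a list of wrong-forms), the helper names, and the assumption that a later entry overwrites an earlier one in the inverted map are all illustrative guesses.

```javascript
// Hypothetical dictionary shape: correct term -> list of wrong-forms.
const protectedTerms = {
  subagent: ["subagen"],
  subagents: ["subagen"],
};

// Invert to wrong-form -> correct term. Assumption: a later entry overwrites
// an earlier one, so "subagen" ends up pointing at "subagents".
function buildNaiveMap(dict) {
  const map = new Map();
  for (const [correct, wrongForms] of Object.entries(dict)) {
    for (const wrong of wrongForms) map.set(wrong, correct);
  }
  return map;
}

// Unanchored restore: replaces the wrong-form wherever it occurs,
// including inside a longer, already-correct word.
function restoreUnanchored(text, map) {
  let out = text;
  for (const [wrong, correct] of map) out = out.replaceAll(wrong, () => correct);
  return out;
}

const corrupted = restoreUnanchored(
  "Each subagent runs alone.",
  buildNaiveMap(protectedTerms)
);
console.log(corrupted); // "Each subagentst runs alone."
```

Under these assumptions the correct word "subagent" contains the wrong-form "subagen", which is rewritten to "subagents" and leaves the trailing "t" behind.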

Engine fix (`src/lib/protected-terms.js`)

- buildProtectedTermsMap drops any wrong-form that is a SUBSTRING of its own correct term (e.g. "subagen" ⊂ "subagent"); longest-first sort can never save a true prefix.
- restoreProtectedTerms now matches Latin/Cyrillic wrong-forms with a Unicode letter boundary `(?<!\p{L})form(?!\p{L})`, so a form never corrupts a longer word containing it. CJK/Kana/Hangul keep literal replaceAll: those scripts have no word separators, so a term is routinely adjacent to a particle even when it legitimately should be restored, and anchoring would BREAK restoration there. (CJK compound corruption stays a per-dict data concern → PR 2.)
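
The two engine changes can be sketched as follows. This is an illustration under stated assumptions, not the actual contents of `src/lib/protected-terms.js`: the function names `buildMap` and `restore`, the dictionary shape, and the script test used to decide between anchored and literal replacement are hypothetical. Only the boundary pattern `(?<!\p{L})form(?!\p{L})` and the longest-first ordering come from the description above.

```javascript
// Assumption: a wrong-form containing Han, Kana, or Hangul keeps literal replacement.
const NO_SEPARATOR_SCRIPT =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Escape regex metacharacters so a wrong-form is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Invert correct -> [wrong-forms] into wrong -> correct, dropping any
// wrong-form that is a substring of its own correct term.
function buildMap(dict) {
  const map = new Map();
  for (const [correct, wrongForms] of Object.entries(dict)) {
    for (const wrong of wrongForms) {
      if (correct.includes(wrong)) continue; // substring guard
      map.set(wrong, correct);
    }
  }
  return map;
}

function restore(text, map) {
  // Longest wrong-forms first, so a longer form is handled before a shorter one.
  const entries = [...map].sort((a, b) => b[0].length - a[0].length);
  let out = text;
  for (const [wrong, correct] of entries) {
    if (NO_SEPARATOR_SCRIPT.test(wrong)) {
      // No word separators in these scripts: literal replacement.
      out = out.replaceAll(wrong, () => correct);
    } else {
      // Unicode letter boundary: never match inside a longer word.
      const re = new RegExp(`(?<!\\p{L})${escapeRegExp(wrong)}(?!\\p{L})`, "gu");
      out = out.replace(re, () => correct);
    }
  }
  return out;
}

const map = buildMap({ Claude: ["Клод", "クロード"], subagent: ["subagen"] });
console.log(map.has("subagen"));             // false (dropped by the guard)
console.log(restore("Клод помогает", map));   // "Claude помогает"
console.log(restore("クロードは便利です", map)); // "Claudeは便利です"

// Even without the guard, the boundary alone protects the longer word:
const unguarded = new Map([["subagen", "subagent"]]);
console.log(restore("a subagent and a subagen", unguarded)); // "a subagent and a subagent"
```

The guard and the boundary overlap on purpose in this sketch: the guard removes entries that can never be safe, and the boundary protects longer words that merely contain the remaining Latin/Cyrillic forms.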

Guardrail (`scripts/check-glossary.js`)

Hard-errors on any wrong-form that is a substring of its correct term, so the regression class can't return.
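
A check of this kind might look like the sketch below. It is not the real `scripts/check-glossary.js`: the function name, the in-memory `dictionaries` object, the `_protected` shape (correct term mapped to a list of wrong-forms), and the exemption for a form identical to its correct term are all assumptions made for illustration.

```javascript
// Hypothetical input: locale -> dictionary object with a `_protected` block.
function findSubstringOffenders(dictionaries) {
  const errors = [];
  for (const [locale, dict] of Object.entries(dictionaries)) {
    for (const [correct, wrongForms] of Object.entries(dict._protected ?? {})) {
      for (const wrong of wrongForms) {
        // Assumption: an identical form is a harmless self-reference and is skipped.
        if (wrong !== correct && correct.includes(wrong)) {
          errors.push(`${locale}: "${wrong}" is a substring of "${correct}"`);
        }
      }
    }
  }
  return errors;
}

const errors = findSubstringOffenders({
  id: { _protected: { subagent: ["subagen"] } },
  ru: { _protected: { Claude: ["Клод"] } },
});

for (const message of errors) console.error(message);

// A build script would typically exit non-zero when any offender is found.
const exitCode = errors.length > 0 ? 1 : 0;
console.log(exitCode); // 1
```

Failing the build rather than warning keeps an offending entry from reaching the engine at all, which matches the "hard-error" wording above.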

Data

Removed id's subagent/subagents→subagen (the regression).

Verify

553 unit (4 new engine tests: substring guard, Latin boundary, CJK-restoration-preserved) · e2e 20/20 · all gates green.

PR 1 of the re-review fix campaign (engine + guard). PR 2 cleans per-dictionary _protected DATA (standalone real-word/name collisions + brand transliterations anchoring can't resolve); PR 3 the consistency/mistranslation findings.

🤖 Generated with Claude Code

…tion

A multi-agent re-review of all 12 premium dictionaries found that restoreProtectedTerms did an UNANCHORED replaceAll, so any _protected wrong-form that is a substring/prefix of a real word silently corrupts correct on-page prose (the #172/#197 class, confirmed live across every dictionary). Worst case was a regression I had just shipped: id's "subagent"->["subagen"] turned "subagent" into "subagentst".

Engine fix (src/lib/protected-terms.js):
- buildProtectedTermsMap drops any wrong-form that is a SUBSTRING of its own correct term (e.g. "subagen" inside "subagent"); the longest-first sort can never save a true prefix.
- restoreProtectedTerms now matches Latin/Cyrillic/etc. wrong-forms with a Unicode letter boundary `(?<!\p{L})form(?!\p{L})`, so a form never corrupts a longer word that merely contains it. CJK/Kana/Hangul keep literal replaceAll (those scripts have no word separators, so a term is routinely adjacent to a particle even when it legitimately should be restored; anchoring would BREAK restoration there); their compound corruption stays a per-dictionary data concern.

Guardrail (scripts/check-glossary.js): hard-error on any wrong-form that is a substring of its correct term, so the regression class can't come back.

Data: removed id's "subagent"/"subagents" -> "subagen" entries (the regression).

553 unit (incl. 4 new engine tests: substring guard, Latin boundary, CJK-restoration-preserved) · e2e 20/20 · all gates green.

This is PR 1 of the re-review fix campaign: the engine + guard. PR 2 cleans the per-dictionary _protected DATA (standalone real-word/name collisions + brand transliterations that anchoring cannot resolve).

… 12 dicts (#218)

PR 2 of the multi-agent re-review fix campaign (PR 1 was #217, the engine). The re-review (225 findings) plus a 12-language harmonization pass under a single "can this wrong-form appear in correct prose?" rule cleaned every _protected block.

restoreProtectedTerms rewrites each wrong-form -> correct term. PR 1 anchored Latin/Cyrillic matching so a wrong-form no longer corrupts a LONGER word, but a wrong-form that is itself a real word / given name / the dict's own native translation is still destructive (and CJK forms still use literal replaceAll). This PR removes those data-level offenders, keeping only safe forms:
- Real-word / given-name collisions neutralized to self-referential keep-English brand keys: es/it "Claudio", pt "Cláudio" (man's name); fr "Anthropique", it/es/pt "Antropico"/"Antrópico", de "Anthropisch" (real adjective); fr "cotravail", es "cotrabajo", pt "cotrabalho" (real/coined words); vi "Mã Claude".
- Generic concepts the dictionaries render natively are dropped entirely so the native rendering stands and CJK prose stops being rewritten: slash command, subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch, Enterprise; i.e. the wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル / プラグイン / フック / 후크 / субагент / etc., every one a common standalone word.
- "Skills" (the product) becomes a self-ref keep-English key in every CJK locale, dropping the real-word wrong-forms 技能 / スキル / 스킬.
- SAFE phonetic brand transliterations are KEPT as restores (they never occur as real prose and fix GT's transliteration back to the English brand): ja クロード, zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de "Koarbeit", es "Código Claude", etc.

Net: 50 generic entries dropped, 23 brand wrong-form arrays neutralized, across the 12 source dictionaries (+ regenerated plugin data). All brand/product KEYS are preserved for the Gemini keep-English path.

9 gates green (incl. the PR 1 glossary build-guard) · 553 unit · e2e 20/20.

Defers to PR 3: Managed Agents canonicalization ("Claude Managed Agents") and residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs others).

…s (12 dicts) (#219)

PR 3 of the multi-agent re-review fix campaign (PR 1 #217 engine, PR 2 #218 data). Scope per owner decision: objective content DEFECTS + brand/product English-retention only. Stylistic word-choice (how to render "AI Fluency", subagent synonyms, generic rendering-consistency) is deferred to native review (#202).

108 value-only edits across the 12 source dictionaries, produced and then adversarially verified by an independent per-language pass (46 out-of-scope or incorrect proposals were rejected; every applied edit's prior value matched the file exactly; no keys added/removed, so key-parity holds):
- Mistranslations (meaning was wrong): zh-CN Delegation 授权 "authorize" -> 委托 "delegate" and Diligence 审核 "review" -> 勤奋 "diligence" (the 4D competency names); zh-TW 審核 -> 勤勉; ja "steerable" 操舵可能 (nautical) -> 制御しやすい; ru "Headless mode" 自律 -> без интерфейса; it "Prompts" (verb-read) -> Prompt, trigger "grilletto" (gun trigger) -> attivazione, Sign In/Up swap.
- Garbled strings (residue of the old unanchored protected-terms replaceAll): it "affidskill" -> affidabilità, "scalskill" -> scalabilità, "Aghook eventi" -> "Eventi degli hook".
- Untranslated fragments: Bedrock/Vertex catalog connectives ("Claude with ..." -> con/avec/com), "Powered by", English lead-ins left mid-sentence (ko/ja).
- Brand-policy: product names restored to English in prose: Skills / Agent Skills (the largest group: 技能/skill/Fähigkeiten/Habilidades/agent skills -> Skills/Agent Skills across agentSkills + claude101 + catalog), Claude Code (it "Codice Claude" -> Claude Code), Model Context Protocol (it). This pairs with PR 2, which removed the now-unsafe auto-restore of 技能/スキル/Skills; the curated values now carry the English brand directly.
- German grammar: separable verb "Diese Lektion hervorhebt" -> "hebt ... hervor", genitive "Claude's" -> "Claudes", malformed compounds "KI-Fluencysplan" -> "KI-Fluency-Plan", "Lektion Rückblick" -> "Lektionsrückblick".

9 gates green · 553 unit · e2e 20/20 (one tight-timeout PDF-popup test flaked twice under local load; passed in isolation and on a determinism re-run. It is fixture-based and cannot be affected by dictionary content).

… write paths (#231)

The GT queue applies restoreProtectedTerms() deterministically, but two other translation write paths bypassed it and relied only on the prompt's "keep English" instruction (probabilistic): the Gemini inline-HTML block translator and the code-comment translator. When the model ignored the instruction, a brand/API term (e.g. "Claude" → "클로드") was written into the lesson DOM untouched.

Apply the same deterministic safety net to both paths:
- gemini-block.js: restoreProtectedTerms() on the model reply before the DOM write.
- code-comments.js: restore the translated comment before escaping/splicing, and build the protected-terms map up front so the standalone code-comment path has it.

This is now safe to apply broadly because the engine was hardened in #217/#218 (substring guard + Latin/Cyrillic boundary + CJK interpunct guard + real-word data cleanup): restoreProtectedTerms no longer corrupts prose, so extending it to more write paths carries no regression risk. restoreProtectedTerms is also a no-op when the map is unbuilt, and protected-terms.js loads before both modules.

Tests genuinely catch the regression (verified by stash-rebuild-rerun on both):
- gemini-block.test.js: a unit test feeding "클로드" asserts "<strong>Claude</strong>" is written and the restored text is cached. Fails without the fix.
- code-comments e2e: the GT stub now returns "클로드 프롬프트 예시"; the existing assertion expects "# Claude 프롬프트 예시", so it only passes if restoration fires. Fails ("# 클로드 …") without the fix.

lint · format · 556 unit · gemini-block 26 · e2e 21/21.
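
The restore-before-write pattern described for #231 can be sketched without a browser. Everything here is a stand-in: `restoreSketch`, `writeBlock`, the plain object used in place of a DOM element, and the cache are hypothetical, and the restore is simplified to literal replacement (the Latin/Cyrillic boundary handling is omitted). It only illustrates the two properties stated above: restoration is a no-op while the map is unbuilt, and the restored text is what gets written and cached.

```javascript
// Stand-in for the shared map; null models the "unbuilt" state.
let wrongToCorrect = null;

// Simplified restore: returns the text unchanged until the map exists.
function restoreSketch(text) {
  if (!wrongToCorrect) return text;
  let out = text;
  for (const [wrong, correct] of wrongToCorrect) {
    out = out.replaceAll(wrong, () => correct);
  }
  return out;
}

// Hypothetical write path: restore first, then write and cache the restored text.
function writeBlock(target, cache, key, modelReply) {
  const restored = restoreSketch(modelReply);
  target.innerHTML = restored;
  cache.set(key, restored);
  return restored;
}

const target = { innerHTML: "" }; // plain object standing in for a DOM element
const cache = new Map();

// Map unbuilt: the model reply passes through unchanged.
const before = writeBlock(target, cache, "b1", "<strong>클로드</strong>");

// Map built: the transliterated brand is restored before the write.
wrongToCorrect = new Map([["클로드", "Claude"]]);
const after = writeBlock(target, cache, "b1", "<strong>클로드</strong>");

console.log(before); // "<strong>클로드</strong>"
console.log(after);  // "<strong>Claude</strong>"
```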

heznpc enabled auto-merge (squash) June 16, 2026 22:39

heznpc merged commit b85712d into main Jun 16, 2026
9 checks passed

heznpc deleted the fix/protected-terms-engine branch June 16, 2026 22:40

heznpc mentioned this pull request Jun 16, 2026

fix(i18n): clean _protected wrong-forms that corrupt prose across all 12 dicts #218

Merged

heznpc mentioned this pull request Jun 16, 2026

fix(i18n): fix content mistranslations, garble, and brand-policy leaks (12 dicts) #219

Merged

heznpc mentioned this pull request Jun 17, 2026

fix(i18n): restore protected terms on the Gemini-block + code-comment write paths #231

Merged

heznpc merged 1 commit into main from fix/protected-terms-engine
