fix(i18n): anchor protected-term restoration to stop substring corruption #217

Merged: heznpc merged 1 commit into main from fix/protected-terms-engine on Jun 16, 2026

heznpc (Owner) commented Jun 16, 2026

A multi-agent re-review of all 12 premium dictionaries found that restoreProtectedTerms did an unanchored replaceAll, so any _protected wrong-form that is a substring/prefix of a real word silently corrupts correct on-page prose (the #172/#197 class, confirmed live across every dictionary). Worst case was a regression I had just shipped: id's "subagent" -> ["subagen"] turned "subagent" into "subagentst".

Engine fix (src/lib/protected-terms.js)

  • buildProtectedTermsMap drops any wrong-form that is a SUBSTRING of its own correct term (e.g. "subagen" ⊂ "subagent") — longest-first sort can never save a true prefix.
  • restoreProtectedTerms now matches Latin/Cyrillic wrong-forms with a Unicode letter boundary (?<!\p{L})form(?!\p{L}), so a form never corrupts a longer word containing it. CJK/Kana/Hangul keep literal replaceAll — those scripts have no word separators, so a term is routinely adjacent to a particle even when it legitimately should be restored; anchoring would BREAK restoration there. (CJK compound corruption stays a per-dict data concern → PR 2.)
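
The two engine rules above can be sketched roughly as follows. This is a minimal illustration, not the code in src/lib/protected-terms.js: the input shape (`{ correctTerm: [wrongForm, ...] }`), the pair-list return value, and the helper names `NO_SEPARATOR_SCRIPT` and `escapeRegExp` are assumptions made for the example.

```javascript
// Hypothetical sketch; the real src/lib/protected-terms.js may differ.
// Assumed input shape: { correctTerm: [wrongForm, ...] }.

// Scripts written without word separators: forms containing them keep literal replaceAll.
const NO_SEPARATOR_SCRIPT =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Escape regex syntax characters so a wrong-form is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

function buildProtectedTermsMap(protectedTerms) {
  const pairs = [];
  for (const [correct, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      // Substring guard: a form contained in its own correct term would
      // match inside already-correct text, so it is dropped.
      if (wrong === "" || correct.includes(wrong)) continue;
      pairs.push([wrong, correct]);
    }
  }
  // Longest wrong-form first, so a longer form wins over a shorter overlapping one.
  return pairs.sort((a, b) => b[0].length - a[0].length);
}

function restoreProtectedTerms(text, pairs) {
  let out = text;
  for (const [wrong, correct] of pairs) {
    if (NO_SEPARATOR_SCRIPT.test(wrong)) {
      // CJK/Kana/Hangul: literal replacement, since particles attach directly.
      out = out.replaceAll(wrong, () => correct);
    } else {
      // Latin/Cyrillic/etc.: match only when no letter touches either side.
      const re = new RegExp(`(?<!\\p{L})${escapeRegExp(wrong)}(?!\\p{L})`, "gu");
      out = out.replace(re, () => correct);
    }
  }
  return out;
}

const demoPairs = buildProtectedTermsMap({
  subagent: ["subagen"], // dropped by the substring guard
  Claude: ["Клод", "クロード"],
});
console.log(restoreProtectedTerms("Это Клод, а не Клодетта.", demoPairs));
// → Это Claude, а не Клодетта.
console.log(restoreProtectedTerms("クロードは便利", demoPairs));
// → Claudeは便利
```

Replacer functions (`() => correct`) are used instead of replacement strings so that a `$` in a correct term is never interpreted as a replacement pattern.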

Guardrail (scripts/check-glossary.js)

Hard-errors on any wrong-form that is a substring of its correct term, so the regression class can't return.
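
A check of this kind might look like the sketch below. It is illustrative only: the function name `findSubstringWrongForms` and the dictionary shape (`{ _protected: { correctTerm: [wrongForm, ...] } }`) are assumptions, and treating a self-referential entry (wrong-form equal to the term) as harmless is a choice made for the example, not something confirmed for scripts/check-glossary.js.

```javascript
// Hypothetical sketch; the real scripts/check-glossary.js may differ.
// Assumed shape: { locale: { _protected: { correctTerm: [wrongForm, ...] } } }.
function findSubstringWrongForms(dictionaries) {
  const problems = [];
  for (const [locale, dict] of Object.entries(dictionaries)) {
    for (const [correct, wrongForms] of Object.entries(dict._protected ?? {})) {
      for (const wrong of wrongForms) {
        // Assumption: an entry identical to its term is a no-op, so it is not flagged.
        if (wrong !== correct && correct.includes(wrong)) {
          problems.push(`${locale}: "${wrong}" is a substring of "${correct}"`);
        }
      }
    }
  }
  return problems;
}

const demoProblems = findSubstringWrongForms({
  id: { _protected: { subagent: ["subagen"], subagents: ["subagen"] } },
  ru: { _protected: { Claude: ["Клод"] } },
});
console.log(demoProblems);
// → [ 'id: "subagen" is a substring of "subagent"',
//     'id: "subagen" is a substring of "subagents"' ]
```

A build gate would treat a non-empty result as a hard error, for example by printing the list and exiting non-zero.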

Data

Removed id's "subagent"/"subagents" -> "subagen" entries (the regression).

Verify

553 unit (4 new engine tests: substring guard, Latin boundary, CJK-restoration-preserved) · e2e 20/20 · all gates green.

PR 1 of the re-review fix campaign (engine + guard). PR 2 cleans per-dictionary _protected DATA (standalone real-word/name collisions + brand transliterations anchoring can't resolve); PR 3 the consistency/mistranslation findings.

🤖 Generated with Claude Code

…tion

A multi-agent re-review of all 12 premium dictionaries found that
restoreProtectedTerms did an UNANCHORED replaceAll, so any _protected wrong-form
that is a substring/prefix of a real word silently corrupts correct on-page
prose (the #172/#197 class — confirmed live across every dictionary). Worst
case was a regression I had just shipped: id's "subagent"->["subagen"] turned
"subagent" into "subagentst".

Engine fix (src/lib/protected-terms.js):
- buildProtectedTermsMap drops any wrong-form that is a SUBSTRING of its own
  correct term (e.g. "subagen" inside "subagent") — the longest-first sort can
  never save a true prefix.
- restoreProtectedTerms now matches Latin/Cyrillic/etc. wrong-forms with a
  Unicode letter boundary `(?<!\p{L})form(?!\p{L})`, so a form never corrupts a
  longer word that merely contains it. CJK/Kana/Hangul keep literal replaceAll
  (those scripts have no word separators, so a term is routinely adjacent to a
  particle even when it legitimately should be restored — anchoring would BREAK
  restoration there); their compound corruption stays a per-dictionary data
  concern.

Guardrail (scripts/check-glossary.js): hard-error on any wrong-form that is a
substring of its correct term, so the regression class can't come back.

Data: removed id's "subagent"/"subagents" -> "subagen" entries (the regression).

553 unit (incl. 4 new engine tests: substring guard, Latin boundary,
CJK-restoration-preserved) · e2e 20/20 · all gates green. This is PR 1 of the
re-review fix campaign — the engine + guard. PR 2 cleans the per-dictionary
_protected DATA (standalone real-word/name collisions + brand transliterations
that anchoring cannot resolve).
@heznpc heznpc enabled auto-merge (squash) June 16, 2026 22:39
@heznpc heznpc merged commit b85712d into main Jun 16, 2026
9 checks passed
@heznpc heznpc deleted the fix/protected-terms-engine branch June 16, 2026 22:40
heznpc added a commit that referenced this pull request Jun 16, 2026
… 12 dicts (#218)

PR 2 of the multi-agent re-review fix campaign (PR 1 was #217, the engine).
The re-review (225 findings) plus a 12-language harmonization pass under a single
"can this wrong-form appear in correct prose?" rule cleaned every _protected block.

restoreProtectedTerms rewrites each wrong-form -> correct term. PR 1 anchored
Latin/Cyrillic matching so a wrong-form no longer corrupts a LONGER word, but a
wrong-form that is itself a real word / given name / the dict's own native
translation is still destructive (and CJK forms still use literal replaceAll).
This PR removes those data-level offenders, keeping only safe forms:

- Real-word / given-name collisions neutralized to self-referential keep-English
  brand keys: es/it "Claudio", pt "Cláudio" (man's name); fr "Anthropique",
  it/es/pt "Antropico"/"Antrópico", de "Anthropisch" (real adjective); fr
  "cotravail", es "cotrabajo", pt "cotrabalho" (real/coined words); vi "Mã Claude".
- Generic concepts the dictionaries render natively are dropped entirely so the
  native rendering stands and CJK prose stops being rewritten: slash command,
  subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch,
  Enterprise — i.e. the wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル /
  プラグイン / フック / 후크 / субагент / etc., every one a common standalone word.
- "Skills" (the product) becomes a self-ref keep-English key in every CJK locale,
  dropping the real-word wrong-forms 技能 / スキル / 스킬.
- SAFE phonetic brand transliterations are KEPT as restores (they never occur as
  real prose and fix GT's transliteration back to the English brand): ja クロード,
  zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de "Koarbeit", es "Código Claude", etc.

Net: 50 generic entries dropped, 23 brand wrong-form arrays neutralized, across
the 12 source dictionaries (+ regenerated plugin data). All brand/product KEYS
are preserved for the Gemini keep-English path.

9 gates green (incl. the PR 1 glossary build-guard) · 553 unit · e2e 20/20.
Defers to PR 3: Managed Agents canonicalization ("Claude Managed Agents") and
residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs others).
heznpc added a commit that referenced this pull request Jun 16, 2026
…s (12 dicts) (#219)

PR 3 of the multi-agent re-review fix campaign (PR 1 #217 engine, PR 2 #218 data).
Scope per owner decision: objective content DEFECTS + brand/product English-
retention only. Stylistic word-choice (how to render "AI Fluency", subagent
synonyms, generic rendering-consistency) is deferred to native review (#202).

108 value-only edits across the 12 source dictionaries, produced and then
adversarially verified by an independent per-language pass (46 out-of-scope or
incorrect proposals were rejected; every applied edit's prior value matched the
file exactly; no keys added/removed, so key-parity holds):

- Mistranslations (meaning was wrong): zh-CN Delegation 授权 "authorize" -> 委托
  "delegate" and Diligence 审核 "review" -> 勤奋 "diligence" (the 4D competency
  names); zh-TW 審核 -> 勤勉; ja "steerable" 操舵可能 (nautical) -> 制御しやすい;
  ru "Headless mode" 自律 -> без интерфейса; it "Prompts" (verb-read) -> Prompt,
  trigger "grilletto" (gun trigger) -> attivazione, Sign In/Up swap.
- Garbled strings (residue of the old unanchored protected-terms replaceAll):
  it "affidskill" -> affidabilità, "scalskill" -> scalabilità, "Aghook eventi"
  -> "Eventi degli hook".
- Untranslated fragments: Bedrock/Vertex catalog connectives ("Claude with ..."
  -> con/avec/com), "Powered by", English lead-ins left mid-sentence (ko/ja).
- Brand-policy: product names restored to English in prose — Skills / Agent
  Skills (the largest group: 技能/skill/Fähigkeiten/Habilidades/agent skills ->
  Skills/Agent Skills across agentSkills + claude101 + catalog), Claude Code
  (it "Codice Claude" -> Claude Code), Model Context Protocol (it). This pairs
  with PR 2, which removed the now-unsafe auto-restore of 技能/スキル/Skills — the
  curated values now carry the English brand directly.
- German grammar: separable verb "Diese Lektion hervorhebt" -> "hebt ... hervor",
  genitive "Claude's" -> "Claudes", malformed compounds "KI-Fluencysplan" ->
  "KI-Fluency-Plan", "Lektion Rückblick" -> "Lektionsrückblick".

9 gates green · 553 unit · e2e 20/20 (one tight-timeout PDF-popup test flaked
twice under local load; passed in isolation and on a determinism re-run — it is
fixture-based and cannot be affected by dictionary content).
heznpc added a commit that referenced this pull request Jun 17, 2026
… write paths (#231)

The GT queue applies restoreProtectedTerms() deterministically, but two other
translation write paths bypassed it and relied only on the prompt's "keep English"
instruction (probabilistic): the Gemini inline-HTML block translator and the
code-comment translator. When the model ignored the instruction, a brand/API term
(e.g. "Claude" → "클로드") was written into the lesson DOM untouched.

Apply the same deterministic safety net to both paths:
- gemini-block.js: restoreProtectedTerms() on the model reply before the DOM write.
- code-comments.js: restore the translated comment before escaping/splicing, and
  build the protected-terms map up front so the standalone code-comment path has it.
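
The pattern on both paths is "restore first, then write". A minimal sketch, assuming the restore function is passed in; `writeTranslatedBlock`, its parameters, and the cache shape are hypothetical names for illustration and are not taken from gemini-block.js:

```javascript
// Hypothetical sketch; only the restore-before-write ordering comes from the PR text.
function writeTranslatedBlock(element, modelReply, cache, cacheKey, restore) {
  // Deterministic safety net: do not rely on the prompt's "keep English" instruction alone.
  const restored = restore(modelReply);
  // Cache the restored text, not the raw reply, so later reads stay consistent.
  cache.set(cacheKey, restored);
  element.innerHTML = restored;
  return restored;
}

// Stand-ins for the DOM node and the real restoreProtectedTerms (assumptions).
const demoElement = { innerHTML: "" };
const demoCache = new Map();
const demoRestore = (s) => s.replaceAll("클로드", () => "Claude");

writeTranslatedBlock(demoElement, "<strong>클로드</strong>", demoCache, "block-1", demoRestore);
console.log(demoElement.innerHTML); // → <strong>Claude</strong>
```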

This is now safe to apply broadly because the engine was hardened in #217/#218
(substring guard + Latin/Cyrillic boundary + CJK interpunct guard + real-word data
cleanup) — restoreProtectedTerms no longer corrupts prose, so extending it to more
write paths carries no regression risk. restoreProtectedTerms is also a no-op when
the map is unbuilt, and protected-terms.js loads before both modules.

Tests genuinely catch the regression (verified by stash-rebuild-rerun on both):
- gemini-block.test.js: a unit test feeding "클로드" asserts "<strong>Claude</strong>"
  is written and the restored text is cached. Fails without the fix.
- code-comments e2e: the GT stub now returns "클로드 프롬프트 예시"; the existing
  assertion expects "# Claude 프롬프트 예시", so it only passes if restoration fires.
  Fails ("# 클로드 …") without the fix.

lint · format · 556 unit · gemini-block 26 · e2e 21/21.