fix(i18n): clean _protected wrong-forms that corrupt prose across all 12 dicts #218

Merged
heznpc merged 1 commit into main from fix/protected-data-cleanup
Jun 16, 2026
@heznpc heznpc commented Jun 16, 2026

Owner

PR 2 of the multi-agent re-review fix campaign (#217 was PR 1, the restore engine).

A 12-language re-review (225 findings) + a harmonization pass under one rule — "can this wrong-form ever appear in CORRECT target-language prose?" — cleaned every _protected block.

Why

restoreProtectedTerms rewrites each wrong-form → correct term. PR 1 anchored Latin/Cyrillic matching so a wrong-form no longer corrupts a longer word — but a wrong-form that is itself a real word, a given name, or the dict's own native translation is still destructive, and CJK forms still use literal replaceAll. This removes those data-level offenders.
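The failure modes described above can be illustrated with a minimal sketch. It assumes a map of correct term → wrong-form array; `restoreSketch`, `escapeRegExp`, and the script test are hypothetical names for illustration and are not the repository's `restoreProtectedTerms`.

```javascript
// Hypothetical sketch, NOT the repository's engine: it only mirrors the
// behaviour the description attributes to it (anchored Latin/Cyrillic
// matching, literal replaceAll for everything else).
const LATIN_OR_CYRILLIC = /^[\p{Script=Latin}\p{Script=Cyrillic}\s'.-]+$/u;

function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function restoreSketch(text, protectedMap) {
  let out = text;
  for (const [correct, wrongForms] of Object.entries(protectedMap)) {
    for (const wrong of wrongForms) {
      if (wrong === correct) continue; // self-referential entry: nothing to rewrite
      if (LATIN_OR_CYRILLIC.test(wrong)) {
        // Anchored: only match when not embedded in a longer word.
        const re = new RegExp(
          `(?<![\\p{L}\\p{N}])${escapeRegExp(wrong)}(?![\\p{L}\\p{N}])`,
          "gu"
        );
        out = out.replace(re, () => correct);
      } else {
        // CJK and other scripts: literal replacement, no boundary check.
        out = out.replaceAll(wrong, () => correct);
      }
    }
  }
  return out;
}

// Anchoring protects a longer word that merely contains the wrong-form.
console.log(restoreSketch("Клод помогает, а Клодин нет", { Claude: ["Клод"] }));
// → "Claude помогает, а Клодин нет"

// Anchoring cannot help when the wrong-form is itself a real word or name.
console.log(restoreSketch("Claudio Monteverdi", { Claude: ["Claudio"] }));
// → "Claude Monteverdi"

// Literal CJK replacement rewrites ordinary prose.
console.log(restoreSketch("提高你的技能", { Skills: ["技能"] }));
// → "提高你的Skills"
```

The last two calls are the data-level cases: no matching strategy can tell a correct "Claudio" or "技能" from a mistranslated brand, so the only fix is to remove the wrong-form from the data.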

What changed (50 entries dropped · 23 arrays neutralized · 12 dicts)

  • Real-word / given-name collisions → self-ref keep-English brand key: es/it Claudio, pt Cláudio (man's name); fr Anthropique, it/es/pt Antropico/Antrópico, de Anthropisch (real adjective); fr cotravail, es cotrabajo, pt cotrabalho; vi Mã Claude.
  • Generic concepts the dicts render natively → entry dropped (native rendering stands, CJK prose no longer rewritten): slash command, subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch, Enterprise — i.e. wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル / プラグイン / フック / 후크 / субагент, each a common standalone word.
  • Skills (product) → self-ref keep-English key in every CJK locale, dropping real-word wrong-forms 技能 / スキル / 스킬.
  • Safe phonetic brand transliterations KEPT as restores (never real prose; fix GT→English brand): ja クロード, zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de Koarbeit, es Código Claude.

All brand/product KEYS preserved for the Gemini keep-English path.
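As a rough illustration of the rule applied above, the sketch below assumes a `_protected` block shaped as `{ correctTerm: [wrongForm, ...] }` and assumes that "self-ref" means the array is reduced to the key itself. The helper `cleanProtectedSketch` and its `brandKeys` / `realWords` inputs are hypothetical; the actual cleanup was a per-language review, not an automated pass.

```javascript
// Hypothetical sketch of the cleanup rule; data shape and helper are assumptions.
function cleanProtectedSketch(block, { brandKeys, realWords }) {
  const cleaned = {};
  for (const [key, wrongForms] of Object.entries(block)) {
    // Keep only wrong-forms that can never appear in correct prose.
    const safe = wrongForms.filter((w) => !realWords.has(w));
    if (brandKeys.has(key)) {
      // Brand/product key is always preserved; with no safe wrong-forms left
      // it becomes self-referential (keep-English only, no restore).
      cleaned[key] = safe.length > 0 ? safe : [key];
    } else if (safe.length > 0) {
      cleaned[key] = safe;
    }
    // Otherwise: generic concept with no safe forms, entry dropped.
  }
  return cleaned;
}

const zhBefore = { Claude: ["克劳德"], Skills: ["技能"], plugin: ["插件"] };
const zhAfter = cleanProtectedSketch(zhBefore, {
  brandKeys: new Set(["Claude", "Skills"]),
  realWords: new Set(["技能", "插件"]),
});
console.log(JSON.stringify(zhAfter));
// → {"Claude":["克劳德"],"Skills":["Skills"]}
```

The key set for brands is unchanged, so anything that derives a keep-English list from the keys still sees `Claude` and `Skills`.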

Verify

9 gates green (incl. PR 1's glossary substring build-guard) · 553 unit · e2e 20/20.

Defers to PR 3: Managed Agents canonicalization (Claude Managed Agents) + residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs ja/zh dropped).

🤖 Generated with Claude Code

… 12 dicts

PR 2 of the multi-agent re-review fix campaign (PR 1 was #217, the engine).
The re-review (225 findings) plus a 12-language harmonization pass under a single
"can this wrong-form appear in correct prose?" rule cleaned every _protected block.

restoreProtectedTerms rewrites each wrong-form -> correct term. PR 1 anchored
Latin/Cyrillic matching so a wrong-form no longer corrupts a LONGER word, but a
wrong-form that is itself a real word / given name / the dict's own native
translation is still destructive (and CJK forms still use literal replaceAll).
This PR removes those data-level offenders, keeping only safe forms:

- Real-word / given-name collisions neutralized to self-referential keep-English
  brand keys: es/it "Claudio", pt "Cláudio" (man's name); fr "Anthropique",
  it/es/pt "Antropico"/"Antrópico", de "Anthropisch" (real adjective); fr
  "cotravail", es "cotrabajo", pt "cotrabalho" (real/coined words); vi "Mã Claude".
- Generic concepts the dictionaries render natively are dropped entirely so the
  native rendering stands and CJK prose stops being rewritten: slash command,
  subagent, hook/hooks, lowercase skill/skills, native plugin/plugins, Dispatch,
  Enterprise — i.e. the wrong-forms 技能 / 插件 / 外掛程式 / 钩子 / 挂钩 / スキル /
  プラグイン / フック / 후크 / субагент / etc., every one a common standalone word.
- "Skills" (the product) becomes a self-ref keep-English key in every CJK locale,
  dropping the real-word wrong-forms 技能 / スキル / 스킬.
- SAFE phonetic brand transliterations are KEPT as restores (they never occur as
  real prose and fix GT's transliteration back to the English brand): ja クロード,
  zh 克洛德/克劳德/克勞德, ru Клод, ko 클로드, de "Koarbeit", es "Código Claude", etc.

Net: 50 generic entries dropped, 23 brand wrong-form arrays neutralized, across
the 12 source dictionaries (+ regenerated plugin data). All brand/product KEYS
are preserved for the Gemini keep-English path.

9 gates green (incl. the PR 1 glossary build-guard) · 553 unit · e2e 20/20.
Defers to PR 3: Managed Agents canonicalization ("Claude Managed Agents") and
residual keep-English-coverage harmonization (e.g. ko Plugin self-ref vs others).
heznpc enabled auto-merge (squash) June 16, 2026 23:05
heznpc merged commit f37b326 into main Jun 16, 2026
9 checks passed
heznpc deleted the fix/protected-data-cleanup branch June 16, 2026 23:06
heznpc added a commit that referenced this pull request Jun 16, 2026
…s (12 dicts) (#219)

PR 3 of the multi-agent re-review fix campaign (PR 1 #217 engine, PR 2 #218 data).
Scope per owner decision: objective content DEFECTS + brand/product English-
retention only. Stylistic word-choice (how to render "AI Fluency", subagent
synonyms, generic rendering-consistency) is deferred to native review (#202).

108 value-only edits across the 12 source dictionaries, produced and then
adversarially verified by an independent per-language pass (46 out-of-scope or
incorrect proposals were rejected; every applied edit's prior value matched the
file exactly; no keys added/removed, so key-parity holds):

- Mistranslations (meaning was wrong): zh-CN Delegation 授权 "authorize" -> 委托
  "delegate" and Diligence 审核 "review" -> 勤奋 "diligence" (the 4D competency
  names); zh-TW 審核 -> 勤勉; ja "steerable" 操舵可能 (nautical) -> 制御しやすい;
  ru "Headless mode" 自律 -> без интерфейса; it "Prompts" (verb-read) -> Prompt,
  trigger "grilletto" (gun trigger) -> attivazione, Sign In/Up swap.
- Garbled strings (residue of the old unanchored protected-terms replaceAll):
  it "affidskill" -> affidabilità, "scalskill" -> scalabilità, "Aghook eventi"
  -> "Eventi degli hook".
- Untranslated fragments: Bedrock/Vertex catalog connectives ("Claude with ..."
  -> con/avec/com), "Powered by", English lead-ins left mid-sentence (ko/ja).
- Brand-policy: product names restored to English in prose — Skills / Agent
  Skills (the largest group: 技能/skill/Fähigkeiten/Habilidades/agent skills ->
  Skills/Agent Skills across agentSkills + claude101 + catalog), Claude Code
  (it "Codice Claude" -> Claude Code), Model Context Protocol (it). This pairs
  with PR 2, which removed the now-unsafe auto-restore of 技能/スキル/Skills — the
  curated values now carry the English brand directly.
- German grammar: separable verb "Diese Lektion hervorhebt" -> "hebt ... hervor",
  genitive "Claude's" -> "Claudes", malformed compounds "KI-Fluencysplan" ->
  "KI-Fluency-Plan", "Lektion Rückblick" -> "Lektionsrückblick".

9 gates green · 553 unit · e2e 20/20 (one tight-timeout PDF-popup test flaked
twice under local load; passed in isolation and on a determinism re-run — it is
fixture-based and cannot be affected by dictionary content).
heznpc added a commit that referenced this pull request Jun 17, 2026
…llback (#224)

DEFAULT_PROTECTED_TERMS is the Gemini "keep-English" fallback used by
getKeepEnglishTerms() only when a locale has no _protected keys. It still listed
skill, Subagent, Enterprise, Personal, Plugin, and Dispatch — generic concept
words that PR #218 deliberately REMOVED from the per-locale _protected blocks
because they are translated natively per locale (concept-vs-product-name policy,
docs/TRANSLATION_RULES.md §1). Keeping them in the fallback told Gemini to keep
ordinary words in English — the opposite of the shipped policy.

Reduced to brand/product/file-format proper nouns only:
API, SDK, Claude, Anthropic, Claude Code, Cowork, Computer Use, SKILL.md, frontmatter.

Low-impact (the fallback only fires for a locale with an empty _protected), but it
removes a policy inconsistency a future contributor/LLM pass could be misled by.
Surfaced by the doc fact-check in #223. 555 unit (constants assertions unchanged) ·
gates green · e2e 20/20.
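
The fallback behaviour described in this commit can be sketched as follows. The term list is copied from the message above; the function body and the dictionary shape are assumptions, and `getKeepEnglishTermsSketch` is a hypothetical stand-in for the real `getKeepEnglishTerms()`.

```javascript
// Reduced fallback list as stated in the commit message.
const DEFAULT_PROTECTED_TERMS = [
  "API", "SDK", "Claude", "Anthropic", "Claude Code",
  "Cowork", "Computer Use", "SKILL.md", "frontmatter",
];

// Hypothetical sketch: use the locale's own _protected keys when present,
// and fall back to the defaults only when there are none.
function getKeepEnglishTermsSketch(dict) {
  const keys = Object.keys(dict?._protected ?? {});
  return keys.length > 0 ? keys : DEFAULT_PROTECTED_TERMS;
}

console.log(getKeepEnglishTermsSketch({ _protected: { Claude: ["クロード"] } }));
// → [ 'Claude' ]
console.log(getKeepEnglishTermsSketch({}).includes("Plugin"));
// → false
```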
heznpc added a commit that referenced this pull request Jun 17, 2026
… write paths (#231)

The GT queue applies restoreProtectedTerms() deterministically, but two other
translation write paths bypassed it and relied only on the prompt's "keep English"
instruction (probabilistic): the Gemini inline-HTML block translator and the
code-comment translator. When the model ignored the instruction, a brand/API term
(e.g. "Claude" → "클로드") was written into the lesson DOM untouched.

Apply the same deterministic safety net to both paths:
- gemini-block.js: restoreProtectedTerms() on the model reply before the DOM write.
- code-comments.js: restore the translated comment before escaping/splicing, and
  build the protected-terms map up front so the standalone code-comment path has it.

This is now safe to apply broadly because the engine was hardened in #217/#218
(substring guard + Latin/Cyrillic boundary + CJK interpunct guard + real-word data
cleanup) — restoreProtectedTerms no longer corrupts prose, so extending it to more
write paths carries no regression risk. restoreProtectedTerms is also a no-op when
the map is unbuilt, and protected-terms.js loads before both modules.
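
A minimal sketch of the "restore before write" ordering, under stated assumptions: `buildProtectedMapSketch`, `restoreProtectedTermsSketch`, and `writeTranslatedBlockSketch` are hypothetical names, the restore here is a plain literal replacement, and `target` is any object with an `innerHTML` property standing in for the lesson DOM node.

```javascript
// Hypothetical sketch, not the code in gemini-block.js.
let protectedMap = null; // stays null until the map is built

function buildProtectedMapSketch(map) {
  protectedMap = map;
}

function restoreProtectedTermsSketch(text) {
  if (!protectedMap) return text; // no-op when the map is unbuilt
  let out = text;
  for (const [correct, wrongForms] of Object.entries(protectedMap)) {
    for (const wrong of wrongForms) {
      if (wrong !== correct) out = out.replaceAll(wrong, () => correct);
    }
  }
  return out;
}

// Restore the model reply first, then cache and write the restored text,
// so neither the cache nor the DOM ever holds the unrestored form.
function writeTranslatedBlockSketch(target, modelReply, cache, cacheKey) {
  const restored = restoreProtectedTermsSketch(modelReply);
  cache.set(cacheKey, restored);
  target.innerHTML = restored;
  return restored;
}

const cache = new Map();
const target = { innerHTML: "" };
buildProtectedMapSketch({ Claude: ["클로드"] });
writeTranslatedBlockSketch(target, "<strong>클로드</strong>", cache, "block-1");
console.log(target.innerHTML); // → <strong>Claude</strong>
```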

Tests genuinely catch the regression (verified by stash-rebuild-rerun on both):
- gemini-block.test.js: a unit test feeding "클로드" asserts "<strong>Claude</strong>"
  is written and the restored text is cached. Fails without the fix.
- code-comments e2e: the GT stub now returns "클로드 프롬프트 예시"; the existing
  assertion expects "# Claude 프롬프트 예시", so it only passes if restoration fires.
  Fails ("# 클로드 …") without the fix.

lint · format · 556 unit · gemini-block 26 · e2e 21/21.