feat(knowledge): scope-bounded reconcile prune, atomic skill install, 16MiB envd unary cap #67

Merged
ysyneu merged 3 commits into main from feat/knowledge-pr3 on Jun 11, 2026

ysyneu (Collaborator) commented Jun 11, 2026

PR-3 of the knowledge/skills loading-stability series (safari side: flashcatcloud/fc-safari#170 merged, #172 open).

What

  1. Scope-bounded reconcile prune: ReconcileKnowledgeManifestArgs gains staged_scopes + valid_scopes. Orphans are pruned only inside scopes this manifest staged; on-disk scopes outside valid_scopes (deleted packs) are pruned whole; scopes that are valid account-wide but unstaged by this session are left untouched. Fixes the shared-BYOC cross-session prune thrash observed live during #172's over-cap rigor pass (two sessions with different mounted teams evicted each other's files on every reconcile). Empty valid_scopes (older Safari, or a zero-pack account) keeps the legacy global prune, so both directions of a mixed-version fleet are safe and no deploy ordering is required.

  2. Atomic skill install: SyncSkill extracts archive + .checksum into a sibling .installing-* staging dir and swaps via RemoveAll+Rename, serialized by a per-environment mutex; orphaned staging dirs from hard crashes are swept before each install. A corrupt zip no longer destroys the previously installed version (old behavior: RemoveAll first, then extract into the live dir).

  3. envd unary cap 8 MiB → 16 MiB — sync_skill zip_data at Safari's MaxSkillZipBytes (10 MiB) is ~13.3 MiB after base64 and could not fit the old frame, so >6 MiB-raw skills were uninstallable on cloud sandboxes. Over-cap bodies now fail with an explicit cap error instead of silent LimitReader truncation surfacing as a cryptic JSON decode failure. Companion safari commit pins the relationship with TestSkillZipFitsEnvdUnaryFrame and adds a cloud-dispatch pre-check.
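
The prune rule in item 1 can be sketched as a per-file decision. This is a minimal illustration, not the runner's code: `decide`, `toSet`, and the `team_1`..`team_4` scope names are hypothetical, and treating a staged-but-not-valid scope as staged is an assumption, since the description names that degenerate case without stating its outcome.

```go
package main

import "fmt"

// Action is what a reconcile pass does with one on-disk file.
type Action string

const (
	Keep  Action = "keep"
	Prune Action = "prune"
)

// toSet builds a lookup set from a list of scope names.
func toSet(items []string) map[string]bool {
	s := make(map[string]bool, len(items))
	for _, it := range items {
		s[it] = true
	}
	return s
}

// decide is a hypothetical helper. It classifies one on-disk file from its
// scope, whether the manifest lists it, and the two scope sets.
func decide(scope string, inManifest bool, staged, valid map[string]bool) Action {
	switch {
	case len(valid) == 0:
		// Empty valid_scopes: legacy global prune, the manifest decides everywhere.
		if inManifest {
			return Keep
		}
		return Prune
	case staged[scope]:
		// The manifest is authoritative inside staged scopes.
		// Assumption: this also covers a staged scope missing from valid.
		if inManifest {
			return Keep
		}
		return Prune
	case !valid[scope]:
		// Residue of a deleted pack: the whole scope goes.
		return Prune
	default:
		// Valid account-wide but staged by another session: leave untouched.
		return Keep
	}
}

func main() {
	// Shaped like the over-cap test: only "account" is staged, while all
	// five scopes are valid. The team scope names are made up.
	staged := toSet([]string{"account"})
	valid := toSet([]string{"account", "team_1", "team_2", "team_3", "team_4"})

	fmt.Println(decide("account", false, staged, valid))     // orphan in a staged scope -> prune
	fmt.Println(decide("team_1", false, staged, valid))      // orphan in a valid, unstaged scope -> keep
	fmt.Println(decide("team_999999", false, staged, valid)) // scope outside valid_scopes -> prune
	fmt.Println(decide("team_1", false, staged, nil))        // empty valid_scopes, legacy -> prune
}
```

The second and fourth calls differ only in whether valid_scopes is populated, which is the difference between the thrash and the fixed behavior.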

Verification

  • Unit: scoped prune matrix (staged orphan pruned / valid-unstaged kept / invalid scope pruned whole / legacy global on empty scopes / degenerate staged-not-valid), atomic install (corrupt zip → v1 intact + probe still hits + no staging litter), readUnary 14 MiB accept + over-cap explicit reject. Full go test ./... green, go vet clean, gofumpt clean.
  • Live (local BYOC runner + safari #172 branch binary, dev account 2451002751131):
    • Fully-staged regime: planted knowledge/team_999999/junk.md (invalid scope) + knowledge/account/orphan-live.md → reconcile kept=7 total_pruned=2 eager_staged=12, both planted files gone, all real files intact.
    • Discriminating over-cap test (cap=1 throwaway safari build, unbound session → staged=account only, valid=all 5 scopes): planted orphan in a valid-but-unstaged team scope survived alongside all 13 team files (total_pruned=1 = the account orphan only). Old global prune would have deleted all 14 — that was the live thrash.

Rollout

  • BYOC fleet: picks this up via self-update after the next release tag.
  • Cloud sandboxes: the 16 MiB cap lands only after the sandbox image is rebuilt with the new runner. Until then, >6 MiB-raw skills keep failing on cloud, because pre-16 MiB images still truncate at 8 MiB (noted in review); the explicit cap error appears only once the image is rebuilt.

ysyneu added 3 commits June 11, 2026 15:42
… 16MiB envd unary cap

Three stability fixes for the knowledge/skill sync layer (PR-3 of the
knowledge-loading-stability series):

- ReconcileKnowledgeManifest orphan prune is now scope-bounded. Args gain
  staged_scopes (manifest authoritative inside; orphans pruned) and
  valid_scopes (all scopes account-wide; on-disk scopes outside are
  deleted-pack residue, pruned whole; valid-but-unstaged scopes belong to
  other sessions sharing the runner and are kept). Fixes the over-cap
  shared-BYOC cross-session prune thrash where two sessions with different
  mounted teams evicted each other's files. Empty valid_scopes (older
  Safari, or zero-pack account) keeps the legacy global prune.

- SyncSkill install is atomic: archive + .checksum extract into a sibling
  .installing-* staging dir, then RemoveAll+Rename swap, serialized by a
  per-environment mutex; orphaned staging dirs from hard crashes are swept
  before each install. A corrupt zip no longer destroys the previously
  installed version mid-RemoveAll.

- envd unary body cap raised 8MiB to 16MiB (sync_skill zip_data at Safari's
  MaxSkillZipBytes 10MiB is ~13.3MiB after base64; it could not fit the old
  frame, so >6MiB-raw skills were uninstallable on cloud sandboxes). Over-cap
  requests now fail with an explicit cap error instead of a silent
  LimitReader truncation surfacing as a cryptic JSON decode failure.
…irTemp+Chmod

Installs are mutex-serialized and leftovers are swept pre-install, so a
random suffix bought nothing; CI's gosec also rejected the Chmod (G302)
that MkdirTemp's 0700 forced.
ysyneu merged commit e84efa0 into main on Jun 11, 2026
8 checks passed
ysyneu deleted the feat/knowledge-pr3 branch on June 11, 2026 08:05