Skip to content

fix(node): close under-withholding via full ref scope and full-history classification (#42)#84

Open
beardthelion wants to merge 8 commits into
mainfrom
fix/under-withholding-ref-scope
Open

fix(node): close under-withholding via full ref scope and full-history classification (#42)#84
beardthelion wants to merge 8 commits into
mainfrom
fix/under-withholding-ref-scope

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Closes #42.

Re-homes the under-withholding fix onto current main as a focused same-repo PR. It previously lived in #46 (a fork PR from when this work predated a maintainer); #46's other commits are already covered elsewhere (blind recipients via #40, partial-recipient and mirror fail-closed via #67/#69), so #46 will be closed in favor of this.

Rebased onto main after #90 (per-push delta pin) landed. #90 reworked object enumeration for the pin path, so the "reachable-only pin set" half this work originally carried is no longer needed and has been dropped. The PR is now scoped to the withheld-blob classifier.

What it does:

  • Classifies the withheld-blob set across the full ref scope and full history, not just the refs/heads + refs/tags tips. A blob reachable only through a non-standard ref namespace, only in older history (a since-deleted file), or only via a detached HEAD is still withheld. upload-pack (serve) and feat(node): pin the per-push object delta instead of re-enumerating the whole repo #90's whole-repo pin fallback both expose the full reachable object graph, so classifying only ref-tip trees would under-withhold.
  • Fails closed on any ref that does not resolve to a commit. Each ref is peeled fully with <ref>^{} through git cat-file --batch-check, so a ref pointing at a blob or tree, or an annotated tag (including a nested one) of such an object, is refused rather than silently skipped. git rev-list --all skips those refs while the serve and pin paths still expose their target object.
  • Walks history once per call: withheld_blob_recipients now shares the blob_paths result with the withheld-set computation instead of walking it twice.

Implementation note: the ref peel writes its queries to cat-file on a separate thread while the main thread drains stdout, so a repo with many (or long-named) refs cannot deadlock the classifier.

Tests: withholding via a non-standard ref, a detached HEAD, and a blob deleted at the tip but reachable in history; fail-closed on an untraversable ref, an annotated tag of a blob, and a ref pointing at a missing object; an annotated tag and a nested tag of a commit that must not fail closed; a blob shared across an allowed and a denied path; and a many-long-named-ref regression guard for the peel.

Known adjacent gap (not this PR): #90's pin candidate set (git cat-file --batch-all-objects) can include unreachable/dangling objects the reachable-only classifier never sees, and replicable_objects replicates anything not explicitly withheld, so an unreachable private blob could be pinned in cleartext. Tracked separately in #99.

Summary by CodeRabbit

  • Bug Fixes
    • Improved “fail-closed” blob visibility checks: all refs must resolve to commits, and invalid/missing targets now block withholding decisions.
    • Reworked history and path handling to reliably walk reachable commits and compute withheld blobs using raw path bytes (safer for non-ASCII and control characters).
    • Ensured recipients and paths are derived consistently, and prevented incorrect withholding when a blob exists in both allowed and denied locations.
  • Tests
    • Added coverage for detached/unborn states, custom ref namespaces, non-UTF-8 paths, annotated-tag peel errors, unreachable refs, and large batches of dangling refs.

…nly pin set

blob_paths walked only refs/heads/* and refs/tags/* and skipped silently on a
failed git ls-tree, so a blob reachable only through another namespace, or a ref
that failed to traverse, could fall out of the withheld set and ship in cleartext.
Walk every ref and fail closed on traversal error.

The pin enumerators (ipfs_pin, pinata) used git cat-file --batch-all-objects,
which includes unreachable/dangling objects that have no path and cannot be
classified for withholding. Switch them to git rev-list --objects --all so the
pin set matches the reachable graph blob_paths evaluates.
…asymmetry

blob_paths classified objects from for-each-ref only, but the pin set
(git rev-list --objects --all) and git upload-pack both also expose
objects reachable from HEAD. A detached HEAD (or any HEAD pointing
outside the ref-covered set) therefore put a blob into the pin/serve
path that the withholding walk never classified — a latent cleartext
leak of a withheld blob. Walk HEAD too (guarded for unborn HEAD on an
empty repo) so the withheld set is a superset of every exposure path.

Adds a regression test: a blob reachable only via a detached HEAD must
still be withheld (fails without this change).
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b88ea02f-aab7-49b3-8a72-7f20464fc719

📥 Commits

Reviewing files that changed from the base of the PR and between 3266eaf and 836bf7b.

📒 Files selected for processing (1)
  • crates/gitlawb-node/src/git/visibility_pack.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/gitlawb-node/src/git/visibility_pack.rs

📝 Walkthrough

Walkthrough

visibility_pack.rs now validates every ref target, traverses reachable commits plus HEAD into unique blob/path pairs, and reuses that same pair list to compute withheld OIDs and recipients. Tests add coverage for detached HEAD, custom refs, missing targets, and overlapping path cases.

Changes

Visibility pack withholding traversal

Layer / File(s) Summary
Ref validation and blob walk
crates/gitlawb-node/src/git/visibility_pack.rs
assert_all_refs_are_commits peels every ref target before traversal, and blob_paths collects unique (blob_oid, path) pairs from reachable commits plus resolved HEAD.
Shared withheld pair reuse
crates/gitlawb-node/src/git/visibility_pack.rs
withheld_blob_oids derives withheld blobs from blob_paths via withheld_from_pairs, and withheld_blob_recipients reuses the same pairs for withheld and recipient mapping.
Withholding regression coverage
crates/gitlawb-node/src/git/visibility_pack.rs
Tests cover custom refs, detached HEAD, history-only blobs, missing or non-commit refs, long dangling refs, annotated tag peeling, and overlapping allowed/denied blob paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • Gitlawb/node#28: Both changes center on visibility_pack withholding behavior, so this PR’s stricter blob-path and withheld-OID computation directly affects that flow.
  • Gitlawb/node#34: That PR consumes visibility_pack::withheld_blob_oids, and this PR rewrites the same withheld-blob computation and recipient mapping paths it relies on.
  • Gitlawb/node#90: Both changes touch the central withheld-set flow used by replication and recipient handling.

Suggested labels

kind:bug

Suggested reviewers

  • kevincodex1

Poem

I hopped through refs by moonlit light,
and sniffed each commit tree just right.
No dangling tag could fool my nose,
The withheld set blooms like carrot rows,
🐇✨ hop hop—history stays bright.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main fix: closing under-withholding by expanding ref scope and history classification.
Description check ✅ Passed The description covers the issue, scope, implementation, tests, and reviewer context; only the template's formal section headings/checklist are missing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/under-withholding-ref-scope

Comment @coderabbitai help to get the list of available commands.

@beardthelion beardthelion added sev:high Major break or real security/trust risk, no easy workaround crate:node gitlawb-node — the serving node and REST API kind:security Vulnerability fix or hardening subsystem:visibility Path-scoped visibility and content withholding subsystem:storage Blob/object store, Arweave, IPFS, archives labels Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 40-43: The current implementation in the ref-walking loop only
uses git ls-tree to examine blobs at each ref tip, but this misses blobs in
historical commits that remain reachable via commit history. This creates a
security gap where the visibility pack fails to classify all reachable objects
that other flows like rev-list traversals would include. Replace the git ls-tree
command with git rev-list --objects (or an equivalent command that traverses
full commit history) to ensure all blobs reachable from each ref, including
those in older commits, are properly classified. This will align the visibility
classification with what the rev-list based flows use elsewhere in the codebase.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b7f77d48-b23e-4b59-a8ac-d96a6cd83b78

📥 Commits

Reviewing files that changed from the base of the PR and between 2153b0b and 53cd319.

📒 Files selected for processing (3)
  • crates/gitlawb-node/src/git/visibility_pack.rs
  • crates/gitlawb-node/src/ipfs_pin.rs
  • crates/gitlawb-node/src/pinata.rs

Comment thread crates/gitlawb-node/src/git/visibility_pack.rs Outdated
blob_paths walked only ref-tip trees (git ls-tree -r <ref>), so a blob
that existed in an older commit and was deleted or rotated at the tip
was never classified, while the pin set (git rev-list --objects --all)
and upload-pack still expose it from history. That under-withheld the
blob into the IPFS/Pinata replication paths in cleartext.

Enumerate all reachable commits (git rev-list --all + HEAD for the
detached case) and walk each commit tree, deduping (oid, path) pairs.
ls-tree per commit is kept over rev-list --objects so every path a blob
lives at is still represented for the per-path visibility check. A ref
that resolves to a non-commit now fails closed explicitly, since
rev-list --all silently skips it where the old per-ref ls-tree errored.

Regression test: a secret blob deleted at the tip but reachable in
history is withheld.
@beardthelion

Copy link
Copy Markdown
Collaborator Author

@kevincodex1 flagging this one as higher priority. It closes the under-withholding hole (#42): full ref scope plus a reachable-only pin set so history blobs can't leak past a path-scoped withhold. It's also a gate for follow-on work, the reconciliation sweep can't land until this is in main. All green (tests, clippy, cargo audit, MSRV, CodeQL), mergeable, no open threads. Ready when you are.

kevincodex1 pushed a commit that referenced this pull request Jun 24, 2026
…he whole repo (#90)

* feat(node): add push-delta object enumeration module

Compute the per-push object delta (git rev-list --objects <new> --not <old>)
for the pin path instead of enumerating the whole repo, with a fully-peeled
ref-type guard (cat-file -t '<sha>^{}') that fails closed to a full scan on a
non-commit tip, any git error, or the GITLAWB_PIN_FORCE_FULLSCAN kill-switch.
The whole-repo lister moves here for the fallback and the reconciliation sweep.

Pin-enumeration scope only (N16); the withheld filter is unchanged.

* feat(node): pin the per-push object delta instead of the whole repo

Wire the push-delta enumeration into the IPFS and Pinata pin paths so pin work
scales with push size, not repo size (N16).

- U2: both pin_new_objects twins take a candidate OID set instead of scanning
  the whole repo internally; repo_path is retained for reading object bytes.
- U4: extract the anonymous replication withheld gate (caller=None, no caller
  param) into a shared helper returning (announce, withheld), so the push path
  and the future reconciliation sweep cannot drift. The per-caller read-serve
  withheld call in git_upload_pack is deliberately left separate.
- U3: the post-receive handler resolves the push delta once (off-thread),
  falls back to a full scan on a non-commit tip / git error / kill-switch, and
  passes the candidate set to both pin paths. Logs delta size and fallbacks.

The reconciliation sweep (U5) is deferred: it depends on the history-deep
blob_paths fix (#84) landing on main first.

* fix(review): log degraded pin-candidate paths, extract wiring to push_delta

Code review found the push candidate computation collapsed both a failed
full-scan fallback and a panicked spawn_blocking task to a silent empty set,
diverging from the sibling withheld path that logs and fails closed.

Extract resolve_candidates_for_push into git::push_delta (it owns the
PinCandidates dispatch and full-scan fallback), logging every degraded path
(fallback taken, fallback failed, task panicked) instead of failing silently.
Pinning nothing on a hard failure is still safe (the withheld filter always
runs; the reconciliation sweep backstops durability) but is now observable.
Moving the dispatch out of the HTTP handler also makes it unit-testable.

Add tests: tag-of-tag-of-commit peels to a Delta, and resolve_candidates_for_push
delta + non-commit-tip full-scan-fallback paths.

* fix(node): fail announce closed when withheld walk fails (#90)

A failed or panicked withheld_blob_oids walk left announce=true while
withheld=None, so a push we could not vet would still gossip ref updates,
notify peers, and write a permanent Arweave anchor. Force announce false
in that arm so the gate fails closed on both axes.

@jatmn jatmn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

1. Merge conflicts with main — BLOCKING

  • Severity: High
  • Evidence: gh pr view 84 reports "mergeable":"CONFLICTING". A local git diff main against the checked-out branch shows large unrelated deletions (.env.example, auth/mod.rs, config.rs, git/push_delta.rs, docs/RUN-A-NODE.md, api/repos.rs, etc.) that are not part of the intended PR diff (gh pr diff 84 only touches visibility_pack.rs, ipfs_pin.rs, and pinata.rs).
  • Impact: Cannot be merged as-is; must be rebased/merged onto current main and conflicts resolved before review approval.

2. Detached HEAD not included in pin enumerators

  • Severity: Medium
  • Files: crates/gitlawb-node/src/ipfs_pin.rs (lines 154–180), crates/gitlawb-node/src/pinata.rs (lines 137–159)
  • Details:
    • blob_paths in visibility_pack.rs explicitly appends HEAD to the commit enumeration when store::head_commit resolves (visibility_pack.rs:81–85), ensuring a blob reachable only via a detached HEAD is still classified.
    • The two pin listers, ipfs_pin::list_all_objects and pinata::list_all_objects, run git rev-list --objects --all without HEAD. In a detached-HEAD repository, git rev-list --objects --all does not traverse the commit recorded in HEAD because --all only walks refs under refs/.
    • This creates a mismatch: the withholding set is a superset of what the pin set actually replicates. Public objects reachable only through a detached HEAD would be classified as replicable but then not pinned, producing an silent under-pinning/availability gap.
  • Recommendation: Make the pin enumerators mirror blob_paths: append HEAD to the rev-list arguments whenever store::head_commit(repo_path).is_some(), or, better, share a single "reachable objects" helper between visibility_pack, ipfs_pin, and pinata so the three sites cannot drift again.

3. blob_paths is now expensive for large/long histories

  • Severity: Medium
  • File: crates/gitlawb-node/src/git/visibility_pack.rs (lines 34–132)
  • Details:
    • The old implementation ran git ls-tree -r <ref> once per ref tip.
    • The new implementation runs git rev-list --all to enumerate every reachable commit, then runs git ls-tree -r <commit> for every commit in history.
    • For repositories with deep history, this is O(history × tree-size) and may materially slow down every push/serve/pin path that calls withheld_blob_oids.
  • Recommendation: This may be a necessary cost for per-path classification across history, but it should be benchmarked on a representative repo. If the cost is too high, consider caching the (oid, path) set per repo or documenting the expected latency. Avoid further duplicating this work (see next finding).

4. withheld_blob_recipients recomputes the full blob_paths set

  • Severity: Low
  • File: crates/gitlawb-node/src/git/visibility_pack.rs (lines 176–205)
  • Details:
    • withheld_blob_recipients calls withheld_blob_oids (which computes blob_paths) and then calls blob_paths again to build the recipient map.
    • With the new O(history) implementation, this doubles the expensive enumeration.
  • Recommendation: Refactor so the caller computes blob_paths once and reuses it, or have withheld_blob_oids return both the withheld set and the path map used to compute it.

5. Minor: for-each-ref parsing could be more robust

  • Severity: Low
  • File: crates/gitlawb-node/src/git/visibility_pack.rs (lines 41–75)
  • Details:
    • The code formats each ref as %(objecttype) %(*objecttype) %(refname) and splits on whitespace.
    • Refnames cannot contain whitespace, so this is safe today, but a NUL-delimited format (--format=...%00...) plus split('\0') would be more robust and avoid the ad-hoc token arithmetic for annotated tags.
  • Recommendation: Optional cleanup; not a blocker.

Resolves conflicts in ipfs_pin.rs and pinata.rs against #90 (per-push delta
pin). #90 centralized object enumeration in push_delta::list_all_objects
(cat-file --batch-all-objects) and reworked the pin path to take a candidate
OID set, so this branch's local reachable-only list_all_objects helpers and
their tests are now dead duplicates — dropped both. The withholding fix
(full ref scope + full-history classification + explicit detached-HEAD walk)
in visibility_pack.rs is unaffected; updated its stale comments that still
described the pin set as 'git rev-list --objects --all'.
…d classifier

Addresses two review findings on the under-withholding classifier:

- withheld_blob_recipients walked blob_paths twice (once via
  withheld_blob_oids, once directly). Extract withheld_from_pairs so a
  single history walk feeds both the withheld set and the recipient map.

- The ref-type fail-closed guard used for-each-ref %(*objecttype), which
  peels only one tag level and so misclassified a nested annotated tag of
  a commit as a non-commit, refusing the whole walk (over-withhold). List
  refnames and peel each fully with <ref>^{} through one cat-file
  --batch-check; reap the child even if the stdin write fails. Fail closed
  on a short read or any non-commit/missing target.

Adds tests: annotated + nested-tag of a commit does not fail closed,
annotated tag of a blob still fails closed, and a blob at both an allowed
and a denied path is not withheld.
The ref-type guard peeled every <ref>^{} through git cat-file --batch-check,
writing all queries to stdin before draining stdout. On a repo with enough
output the child's stdout pipe fills, it stops reading stdin, and write_all
blocks forever — hanging every serve/pin that classifies the repo. Batching
the queries does NOT bound this: cat-file echoes the full query on a
'<query> missing' line, so output scales with refname length, and a few
hundred long-named dangling refs already exceed the pipe buffer.

Run the stdin write on a separate thread so the main thread drains stdout
concurrently via wait_with_output, which reaps the child unconditionally
before any error is surfaced; the writer is joined after the drain so it
cannot deadlock, and a vanished stdin pipe becomes a broken-pipe write error.
One pass, no batch bound, deadlock-free regardless of ref count or refname
length.

Extracted as assert_all_refs_are_commits with corrected doc attribution.
Detect the missing-object line by its trailing status word. Adds a
timeout-bounded regression test with 500 long-named dangling refs (>64 KiB of
cat-file output) that fails rather than hangs if the deadlock returns.
@beardthelion

Copy link
Copy Markdown
Collaborator Author

Pushed 3266eaf. Rebased onto main and worked your findings.

  1. Resolved. Merged main; the branch is mergeable and 0 behind. The collision was with feat(node): pin the per-push object delta instead of re-enumerating the whole repo #90, which is what moved the pin-set work below.

  2. Mooted by feat(node): pin the per-push object delta instead of re-enumerating the whole repo #90. The pin enumerators you flagged (ipfs_pin/pinata list_all_objects over rev-list --all) no longer exist: feat(node): pin the per-push object delta instead of re-enumerating the whole repo #90 centralized enumeration in push_delta and the pin path now takes a candidate OID set. The classifier still appends HEAD, so the serve path's detached-HEAD case stays covered.

  3. The per-commit ls-tree is load-bearing for correctness, not just thoroughness. rev-list --objects reports one path per blob, but a blob that appears at both an allowed and a denied path has to be evaluated at each, so the walk can't collapse to a single --objects pass. Left as-is and documented; caching is the right lever if the cost shows up, not a narrower walk.

  4. Fixed. Extracted withheld_from_pairs so withheld_blob_recipients shares one blob_paths walk with the withheld-set computation instead of walking history twice.

  5. Done. Replaced the %(*objecttype) token parsing with a full <ref>^{} peel through git cat-file --batch-check. That also fixes a real bug in the old one-level peel: a nested annotated tag of a commit was misread as a non-commit and failed the whole walk closed. The peel feeds cat-file stdin on a separate thread while draining stdout, so it can't deadlock on a repo with many or long-named refs. Added tests for the nested-tag, annotated-tag-of-blob, missing-object, shared-allowed/denied-path, and many-ref cases.

One adjacent thing I hit while tracing this: #90's pin candidate set uses cat-file --batch-all-objects (which includes unreachable objects) while this classifier is reachable-only, and replicable_objects is fail-open, so an unreachable private blob could be pinned in cleartext. It is independent of this PR and pre-dates it on main; filed as #99.

@beardthelion beardthelion requested a review from jatmn June 24, 2026 19:55
@beardthelion beardthelion changed the title fix(node): close under-withholding via full ref scope + reachable-only pin set (#42) fix(node): close under-withholding via full ref scope and full-history classification (#42) Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/git/visibility_pack.rs (1)

180-203: 🔒 Security & Privacy | 🔴 Critical

Use git ls-tree -rz and parse raw entries here.
git ls-tree -r C-quotes paths by default, so a path like secret/café.txt is stored as /"secret/caf\303\251.txt" and visibility rules such as /secret/** can miss it. Parse NUL-delimited output and reject malformed entries instead of lossy-decoding them.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/git/visibility_pack.rs` around lines 180 - 203, The
path parsing in the git tree listing logic is using the quoted text output from
Command::new("git") with ls-tree -r, which can mangle non-ASCII or
special-character paths and break visibility matching. Update the listing flow
in the visibility_pack.rs code that builds out from git ls-tree to use git
ls-tree -rz, parse NUL-delimited raw entries, and reject malformed entries
explicitly instead of relying on lossy UTF-8 decoding or split_once on quoted
lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 180-203: The path parsing in the git tree listing logic is using
the quoted text output from Command::new("git") with ls-tree -r, which can
mangle non-ASCII or special-character paths and break visibility matching.
Update the listing flow in the visibility_pack.rs code that builds out from git
ls-tree to use git ls-tree -rz, parse NUL-delimited raw entries, and reject
malformed entries explicitly instead of relying on lossy UTF-8 decoding or
split_once on quoted lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e0b64e30-fd64-49dd-a5ca-efef347fdba6

📥 Commits

Reviewing files that changed from the base of the PR and between 687b068 and 3266eaf.

📒 Files selected for processing (1)
  • crates/gitlawb-node/src/git/visibility_pack.rs

@jatmn jatmn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. git ls-tree -r C-quotes non-ASCII paths, causing under-withholding (CodeRabbit finding, unaddressed)

Severity: High / blocking — privacy/security.

Location: crates/gitlawb-node/src/git/visibility_pack.rs, blob_paths (lines 180–205).

Details:
The code parses tree entries from plain git ls-tree -r <commit> output and extracts the path by splitting on \t:

let listing = std::process::Command::new("git")
    .args(["ls-tree", "-r", commit])
    ...
let Some((meta, path)) = line.split_once('\t') else {
    continue;
};

Git's default text output C-quotes path names that contain non-ASCII bytes. I verified this directly: a blob at secret/café.txt is emitted by git ls-tree -r HEAD as:

100644 blob 33ea03ba1db1ac1fdfdc0c0a7cc57d3673ae3c1b	"secret/caf\303\251.txt"

The path string the code stores is therefore "/\"secret/caf\\303\\251.txt\"" (quotes and escapes included). That literal string does not match a visibility rule such as /secret/**, so the blob is not withheld. A secret blob stored under a non-ASCII path inside a denied subtree would be served/pinned in cleartext.

The fix is to use git ls-tree -rz and parse NUL-delimited raw records. With -z, the same entry is emitted as:

100644 blob 33ea03ba1db1ac1fdfdc0c0a7cc57d3673ae3c1b secret/café.txt<NUL>

The parser would need to split records on \0 and split each record on the first ASCII space sequence, rather than on \t and newlines.

Recommendation:

  • Switch blob_paths to git ls-tree -rz <commit> and parse NUL-delimited raw output.
  • Add a regression test with a non-ASCII (or otherwise C-quoted) path under a denied subtree to ensure it is withheld.

… stop under-withholding

Plain `git ls-tree -r` C-quotes any path with non-ASCII or special bytes
(café.txt becomes "secret/caf\303\251.txt"), so the stored path no longer
matched a visibility rule like /secret/** and the blob was served/pinned in
cleartext. Switch blob_paths to `git ls-tree -rz` and split the NUL-delimited
records, which emit paths raw; the TAB field separator survives -z so the
per-record parse is otherwise unchanged.

Parse strictly with str::from_utf8 and fail closed on invalid UTF-8 rather than
from_utf8_lossy: a lossy decode would rewrite an invalid byte in a denied
directory name to U+FFFD, and the mangled path would again miss its deny rule
(the same under-withholding class one layer down).

Tests: a non-ASCII path stays withheld (and the public blob still replicates),
and a non-UTF-8 path fails closed.
@beardthelion

Copy link
Copy Markdown
Collaborator Author

Fixed in 836bf7b.

blob_paths now runs git ls-tree -rz and parses the NUL-delimited records, so paths come back raw instead of C-quoted. secret/café.txt keeps its raw bytes and matches /secret/**, so the blob is withheld. Confirmed on git 2.43: -r emits "secret/caf\303\251.txt" (quoted, no match, leaked); -rz emits the raw path (withheld). -rz also ignores core.quotePath, so a repo can't re-enable quoting to dodge the filter.

While there I closed the adjacent byte-vs-string hole: the parse used from_utf8_lossy, so a genuinely non-UTF-8 byte in a denied directory name would be rewritten to U+FFFD and stop matching its rule. It now parses strictly with from_utf8 and fails closed (the whole walk aborts) on invalid UTF-8 rather than serving a partial set.

Regression tests cover all three: a non-ASCII path stays withheld (and the public blob still replicates), a TAB+newline path stays withheld (pins the -rz raw-path and NUL-record parse against a revert), and a non-UTF-8 path fails closed.

Separate from this PR: glob_matches in visibility.rs does a byte-level prefix compare, so an NFC-vs-NFD skew between a deny rule and the stored path won't match. Pre-existing, tracking it on its own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

crate:node gitlawb-node — the serving node and REST API kind:security Vulnerability fix or hardening sev:high Major break or real security/trust risk, no easy workaround subsystem:storage Blob/object store, Arweave, IPFS, archives subsystem:visibility Path-scoped visibility and content withholding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 3 withholding: incomplete withheld set can leak blobs in cleartext (ref namespace scope + fail-open ref walk)

2 participants