fix(node): dedupe mirror and canonical repo rows on list surfaces (#6)#73
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRepo listing and counting now deduplicate mirror and canonical rows into one logical repository. The database layer adds shared deduped list/count queries, and the API, GraphQL, stats, and tests use the deduped results. ChangesCanonical repo deduplication
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai full review |
✅ Action performedFull review finished. |
jatmn
left a comment
There was a problem hiding this comment.
Findings
1. GraphQL repos query still returns duplicate mirror/canonical rows
The PR dedups GET /api/v1/repos and /api/v1/repos/federated, but the GraphQL repos query at crates/gitlawb-node/src/graphql/query.rs:12-28 still calls db.list_all_repos() without any dedup. Since that method returns every raw row, a client using /graphql will still see the same logical repo twice when both a mirror row and a canonical row exist.
Evidence from the checkout:
graphql/query.rs:15-17callsdb.list_all_repos()and maps the result directly toRepoType.list_all_repos()indb/mod.rs:826-836selects all rows fromreposwith no dedup.- The REST list path now calls
dedupe_canonical_reposon the output oflist_all_repos_with_stars()(repos.rs:173and:1005).
Because the linked issue (#6) asks for "profile and repo-list surfaces" to show one logical repo, and the GraphQL endpoint is a repo-list surface, the PR does not fully close the issue as claimed.
Recommended action: Apply the same dedup to the GraphQL repos resolver, or add a dedicated Db::list_all_repos_deduped() / list_all_repos_with_stars_deduped() method and use it consistently across REST, GraphQL, and stats.
2. stats endpoint inflates repo count when mirror rows exist
/api/v1/stats (server.rs:435-450) uses db.list_all_repos().await and returns r.len() as i64. Because this path is not deduped, a node with both a canonical and a mirror row for the same repo counts them as two repos. This value is displayed by gl sync (crates/gl/src/sync.rs:49-55), so it is user-visible.
This is the same underlying issue as the list-surface bug, but applied to a count. The PR changed the legacy X-Total-Count to count logical repos, which makes the lack of consistency with /api/v1/stats more noticeable.
Recommended action: Either reuse the dedup helper for the stats count, or move the canonical-dedup logic into the DB layer so all callers get consistent counts automatically.
3. Mirror detection relies on a user-settable description string
dedupe_canonical_repos treats any row whose description == "mirrored from peer" as a mirror. The same string is used in the SQL list_all_repos_paged query. Because description is user-provided at repo creation, a canonical repo created with that exact description would be deprioritized against another canonical row, and a mirror row with a different description would be treated as canonical. This is a pre-existing fragility, but the PR duplicates it in Rust code rather than using a dedicated marker (e.g., a boolean column, the id format, or the machine_id/disk_path pattern used by upsert_mirror_repo).
Recommended action: Consider adding an explicit is_mirror column or other non-user-visible marker and update both the SQL and Rust dedup paths to use it.
The non-paged GET /api/v1/repos legacy path and list_federated_repos returned both the short-owner peer mirror row and the canonical did:key row for the same logical repo, so profiles rendered the repo twice. Only the paged path collapsed them, in SQL. Add a dedupe_canonical_repos helper that groups by (normalized owner, name), keeps the canonical non-mirror row (tie broken by earliest created_at), and carries the group's latest updated_at onto the survivor, matching the paged SQL dedup. Apply it at both non-paged surfaces and cover it with unit tests.
…ion on the repo id Addresses jatmn's review on #73 (dedup not applied on every reader path; mirror detection keyed on a user-settable string). - GraphQL repos and /api/v1/stats now collapse mirror+canonical rows, via new Db::list_all_repos_deduped and count_repos_deduped that share the DISTINCT ON CTE with list_all_repos_paged so the dedup rule cannot drift. - Mirror detection keys on the structural slash-form id (written only by upsert_mirror_repo) instead of the description == 'mirrored from peer' string, in both the SQL paths and dedupe_canonical_repos. - Deterministic survivor on a full created_at tie (id ASC) in both implementations. - Legacy REST list and federated keep their method-scoped did_matches owner filter in Rust; it does not compose with the method-blind SQL group key, so those paths intentionally stay on the Rust helper. - Adds sqlx and unit tests for the new surfaces, the structural marker, and the tiebreak.
…p cases Follow-ups from code review on the dedup change: - list_all_repos_deduped/count_repos_deduped: mirror-only group survives, empty table returns empty/0, and count_repos_deduped equals the deduped list length (guards the two independent SQL queries against grouping-key drift). - Document list_all_repos as the raw, non-deduped enumeration path (object lookup only), so it is not mistaken for a listing-surface method.
dcbad62 to
8e8b74b
Compare
|
Thanks, all three are addressed. Rebased onto main first (the conflict is gone; 1 & 2 — GraphQL 3 — mirror detection no longer keys on the description. Both the SQL paths and While there I made the dedup survivor deterministic on a full One thing I left out of scope: GraphQL The legacy non-paged list and federated paths keep their existing |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/db/mod.rs (1)
906-917: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winMake the SQL owner key
did:key-aware instead of suffix-based.
split_part(owner_did, ':', -1)collapses any DID method with the same last segment, and the owner filter is asymmetric:owner=did:key:z6Xdoes not include the bare mirror rowz6X, so paged results can disagree with the legacydid_matchespath and lose the mirror’s maxupdated_at. Normalize onlydid:key:<id>to<id>; keep other DID methods exact.Suggested shape
- SELECT DISTINCT ON (split_part(owner_did, ':', -1), name) + SELECT DISTINCT ON ( + CASE WHEN owner_did LIKE 'did:key:%' THEN substring(owner_did from 9) ELSE owner_did END, + name + ) ... - PARTITION BY split_part(owner_did, ':', -1), name + PARTITION BY + CASE WHEN owner_did LIKE 'did:key:%' THEN substring(owner_did from 9) ELSE owner_did END, + name ... - WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1) + WHERE ( + $1::text IS NULL + OR owner_did = $1 + OR ($1 LIKE 'did:key:%' AND owner_did = substring($1 from 9)) + OR (owner_did LIKE 'did:key:%' AND $1 = substring(owner_did from 9)) + )Apply the same normalized key to
count_repos_dedupedand the empty-page count fallback.Also applies to: 1011-1014
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/db/mod.rs` around lines 906 - 917, The deduplication key in the repos query is still suffix-based, so it can merge unrelated DID methods and miss the bare mirror row for did:key owners. Update the normalization in the repos selection logic to treat only did:key:<id> as <id> while leaving other owner_did values exact, and apply the same normalized key consistently in count_repos_deduped and the empty-page count fallback so paging and counts stay aligned with did_matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 1424-1431: The dedupe key normalization in the rows loop is too
broad because it strips everything after the last colon, which can collide
different DID methods; update the key-building logic around rec.owner_did to
match did_matches behavior by only normalizing did:key:<id> to <id> and leaving
other DID methods unchanged. Use the existing owner comparison semantics as the
reference point, and add a regression test covering did:key:z6Same versus
did:gitlawb:z6Same to verify they remain distinct.
---
Outside diff comments:
In `@crates/gitlawb-node/src/db/mod.rs`:
- Around line 906-917: The deduplication key in the repos query is still
suffix-based, so it can merge unrelated DID methods and miss the bare mirror row
for did:key owners. Update the normalization in the repos selection logic to
treat only did:key:<id> as <id> while leaving other owner_did values exact, and
apply the same normalized key consistently in count_repos_deduped and the
empty-page count fallback so paging and counts stay aligned with did_matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e0df3ff1-9b66-4958-8e78-0a70f4c09c04
📒 Files selected for processing (5)
crates/gitlawb-node/src/api/repos.rscrates/gitlawb-node/src/db/mod.rscrates/gitlawb-node/src/graphql/query.rscrates/gitlawb-node/src/server.rscrates/gitlawb-node/src/test_support.rs
jatmn
left a comment
There was a problem hiding this comment.
Findings
1. Required: cargo fmt is not clean
Severity: Required
Location: crates/gitlawb-node/src/api/repos.rs, crates/gitlawb-node/src/db/mod.rs, crates/gitlawb-node/src/test_support.rs
cargo fmt --all -- --check reports diffs in the PR’s changed code (long single-line record(...) calls, over-length rec(...) signature, assert_eq! macro invocations that should be multi-line, and a few comment alignments). The PR checklist claims cargo fmt --all is clean, but it is not. This must be fixed before merge.
2. Positive: Logic is consistent across SQL and Rust dedup paths
Severity: No issue
I verified the deduplication rules line up between the new Rust helper and the shared SQL CTE:
- Grouping key:
split_part(owner_did, ':', -1)/owner_did.rsplit(':').next(). - Mirror marker:
position('/' in id) > 0/r.id.contains('/'). - Survivor preference: canonical (no slash) over mirror.
- Tie-break:
created_at ASC, id ASCin both SQL and Rust. - Activity timestamp: the survivor inherits the group’s max
updated_at(SQL window function inDEDUP_CTE,latestmap in Rust).
The GraphQL and stats surfaces now call the deduped methods, and the only remaining raw consumer is the IPFS object scan, which is intentionally documented as the non-listing path.
3. Positive: Tests cover the important edge cases
Severity: No issue
The new tests exercise:
- Canonical wins over mirror.
- Distinct repos are preserved.
- Same owner, different repo does not collapse.
- Tie-breaking by earliest
created_atand thenid ASC. - Mirror description on a canonical row does not misclassify it.
- Mirror-only group survives.
- Empty table returns empty / zero.
count_repos_dedupedmatcheslist_all_repos_dedupedlength.- Real
upsert_mirror_reporow shape is classified correctly. - GraphQL and REST stats integration.
Unit tests in api::repos::tests pass.
4. Optional / pre-existing: paged owner filter still only handles short-form owner
Severity: Optional observation (not introduced by this PR)
In list_all_repos_paged, the owner filter is:
WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1)If a caller passes the full did:key:z6Mk… form, the LIKE pattern becomes %:did:key:z6Mk…, which will not match the bare z6Mk… mirror row. The non-paged legacy path filters in Rust using did_matches, which handles both forms correctly. Fixing the paged SQL filter to handle full DID is out of scope here, but worth a follow-up to keep the two list paths consistent.
|
Fixed in 5721791. Ran |
jatmn
left a comment
There was a problem hiding this comment.
Findings
1. Deduplication key is method-blind and can collapse distinct DID methods (Major)
Severity: Major / correctness
Location:
crates/gitlawb-node/src/db/mod.rs—DEDUP_CTE(lines 905–925) andcount_repos_deduped(lines 1011–1018)crates/gitlawb-node/src/api/repos.rs—dedupe_canonical_repos(lines 1384–1466)
Issue: The dedup grouping key is the last : segment of owner_did:
split_part(owner_did, ':', -1)rec.owner_did.rsplit(':').next().unwrap_or(&rec.owner_did)This is method-blind. A repo owned by did:key:z6MkExample and a repo owned by did:gitlawb:z6MkExample (or any other DID method with the same trailing segment) and having the same name will be collapsed into one logical repo. The codebase already recognizes this risk and explicitly guards against it in crates/gitlawb-node/src/api/mod.rs:
/// Match a presented DID against a stored DID ... never let a bare id match across methods —
/// `did:web` / `did:gitlawb` share the base58 space with `did:key`, so a
/// trailing-segment compare would treat `did:key:X` and `did:gitlawb:X` as equal.
pub(crate) fn did_matches(a: &str, b: &str) -> bool { ... }The new dedup logic does exactly the trailing-segment comparison that did_matches warns against. It is true that repo creation currently only accepts did:key owners, but the database schema does not enforce that, and the project already treats cross-method collision as a real concern. The dedup key should match the project's own DID-matching semantics.
Recommended action: Make the dedup key did:key-aware (and bare-id-as-did:key). For example:
- Treat
did:key:<id>as<id>. - Treat a bare
<id>(no colon) as<id>. - Leave any other
did:<method>:<id>as the full string.
Apply the same normalization in DEDUP_CTE, count_repos_deduped, and dedupe_canonical_repos so the list, count, and legacy paths stay consistent.
2. Paged owner filter is inconsistent with the legacy path for full-DID owner queries (Optional / pre-existing)
Severity: Optional / observation
Location:
crates/gitlawb-node/src/db/mod.rs—list_all_repos_pagedWHERE clause (line 917 and empty-page fallback line 972)crates/gitlawb-node/src/api/repos.rs— legacy filter (lines 233–239)crates/gitlawb-node/src/api/mod.rs—did_matches(lines 63–74)
Issue: The paged path filters in SQL:
WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1)If a caller passes the full owner did:key:z6MkExample, the LIKE pattern becomes %:did:key:z6MkExample, which will not match the bare mirror row z6MkExample. The legacy path uses did_matches in Rust, which correctly matches both forms. This means a full-DID owner filter returns different results on the paged and legacy surfaces.
This is not introduced by the PR, but the PR touches the paged query and leaves the inconsistency in place. It is worth a follow-up to keep the two list paths aligned.
Recommended action: In the paged SQL, normalize the owner filter the same way the grouping key is normalized (ideally after fixing finding #1), or add an explicit OR branch for the bare mirror form when the filter is a did:key: value.
…ods don't collapse The dedup grouping key took the last ':' segment of owner_did (split_part / rsplit), so two repos owned by did:key:X and did:gitlawb:X with the same name collapsed into one logical repo on the list, paged, count, stats, and GraphQL surfaces. That is the exact cross-method collision the codebase already guards against in did_matches. Replace it with a did:key-aware key that strips a did:key: prefix only when the remainder is a bare id (no ':'), otherwise keeps the full DID, reproducing did_matches/key_id as an equivalence relation: did:key:X and a bare mirror X still collapse, while distinct methods never merge. Applied byte-identically across DEDUP_CTE (DISTINCT ON / PARTITION BY / ORDER BY), count_repos_deduped, the empty-page count fallback, and the in-memory dedupe_canonical_repos, so the SQL and Rust paths agree. The backing index lived in the already-released migration v1, so v1 is left untouched and a new migration v7 drops idx_repos_owner_short_name and builds idx_repos_owner_key_name on the matching expression; the CASE must stay byte-identical to the queries or Postgres stops using it. Tests cover both the in-memory and SQL paths: distinct methods stay separate, bare-id and did:key forms collapse, and the residual-colon guard keeps a malformed did:key:did:gitlawb:X distinct from the bare method DID. 216 pass.
|
Both addressed in 1 (Major) - method-blind dedup key. Fixed. Replaced the
That reproduces Applied in every spot the key is read: Tests cover cross-method distinctness and bare/ 2 (Optional / pre-existing) - paged owner filter. Left as-is for now. The |
jatmn
left a comment
There was a problem hiding this comment.
Thanks for the contribution. I do not see any actionable issues from my review.
@kevincodex1 LGTM
Summary
Profile and repo-list surfaces rendered the same logical repo twice when a short-owner peer mirror row and the canonical
did:key:row both existed. This collapses them to one card on the surfaces that were missing the dedup.Motivation & context
Closes #6
node.gitlawb.comshowed twonipmodcards on a profile: one from the peer mirror row (owner_did = "z6Mk…", description "mirrored from peer") and one from the canonical row (owner_did = "did:key:z6Mk…"). The paged repo list already deduped these in SQL, but the non-pagedGET /api/v1/reposlegacy path andlist_federated_reposreturned every matching row, so both showed up.Kind of change
What changed
Crate touched:
gitlawb-node.dedupe_canonical_reposinapi/repos.rs: groups rows by(normalized owner, name)(the key segment after the last:, sodid:key:z6Mk…and the barez6Mk…mirror row collapse together), keeps the canonical row (non-mirror beats "mirrored from peer", ties broken by earliestcreated_at), and carries the group's most recentupdated_atonto the survivor so a gossip push that only touched the mirror row still floats the repo to the top. This matches the existing SQL dedup inDb::list_all_repos_paged.list_reposfallback andlist_federated_repos. As a side effect the legacy path'sX-Total-Countnow counts logical repos rather than raw rows, consistent with the paged path.repos::testsmodule covering canonical-wins, distinct-repos-preserved, same-owner-different-repo, and the mirror tie-break.How a reviewer can verify
Before you request review
cargo test --workspacepasses locallycargo fmt --allandcargo clippy --workspace --all-targets -- -D warningsare cleanfeat(...),fix(...),docs(...)).env.exampleupdated if behavior or config changed (or N/A)Notes for reviewers
The dedup logic now lives in two places: the SQL
DISTINCT ONinDb::list_all_repos_pagedand this Rust helper for the non-paged surfaces. They use the same preference rules and the helper's doc comment flags that they must stay in sync. Consolidating both behind one path is possible later but would change the legacy "return all rows" contract that peer/CLI callers rely on, so I kept it out of scope.Summary by CodeRabbit
/api/v1/statsreposmetric now counts logical repositories consistently with deduplication.