fix(node): dedupe mirror and canonical repo rows on list surfaces (#6)#73

Merged
kevincodex1 merged 5 commits into main from fix/dedup-mirror-rows-canonical-owner
Jun 25, 2026

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 20, 2026

Collaborator

Summary

Profile and repo-list surfaces rendered the same logical repo twice when a short-owner peer mirror row and the canonical did:key: row both existed. This collapses them to one card on the surfaces that were missing the dedup.

Motivation & context

Closes #6

node.gitlawb.com showed two nipmod cards on a profile: one from the peer mirror row (owner_did = "z6Mk…", description "mirrored from peer") and one from the canonical row (owner_did = "did:key:z6Mk…"). The paged repo list already deduped these in SQL, but the non-paged GET /api/v1/repos legacy path and list_federated_repos returned every matching row, so both showed up.

Kind of change

  • Bug fix
  • Feature
  • Security fix
  • Docs
  • Tests / CI
  • Refactor (no behavior change)
  • Breaking or protocol change (issue required first)

What changed

Crate touched: gitlawb-node.

  • Added dedupe_canonical_repos in api/repos.rs: groups rows by (normalized owner, name) (the key segment after the last :, so did:key:z6Mk… and the bare z6Mk… mirror row collapse together), keeps the canonical row (non-mirror beats "mirrored from peer", ties broken by earliest created_at), and carries the group's most recent updated_at onto the survivor so a gossip push that only touched the mirror row still floats the repo to the top. This matches the existing SQL dedup in Db::list_all_repos_paged.
  • Applied it at the two non-paged surfaces: the legacy list_repos fallback and list_federated_repos. As a side effect the legacy path's X-Total-Count now counts logical repos rather than raw rows, consistent with the paged path.
  • Added a repos::tests module covering canonical-wins, distinct-repos-preserved, same-owner-different-repo, and the mirror tie-break.
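
The grouping and survivor rules in the first bullet can be sketched in Rust as follows. This is a minimal illustration, not the PR's code: the `RepoRow` struct, its integer timestamps, and the function names are assumptions made for the example. Only the rules themselves (last-segment owner key, "mirrored from peer" marking a mirror, earliest created_at breaking ties, latest updated_at carried onto the survivor) come from the description above, and later review rounds in this thread revise both the owner key and the mirror marker.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical, simplified row; the real record type has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct RepoRow {
    pub owner_did: String,
    pub name: String,
    pub description: String,
    pub created_at: i64, // assumed: Unix seconds
    pub updated_at: i64, // assumed: Unix seconds
}

/// A row counts as a mirror when it carries the peer-mirror description.
fn is_mirror(row: &RepoRow) -> bool {
    row.description == "mirrored from peer"
}

/// Owner key: the segment after the last ':' (the whole string if none).
fn owner_key(owner_did: &str) -> &str {
    owner_did.rsplit(':').next().unwrap_or(owner_did)
}

/// Collapse rows sharing (owner key, name) into one survivor per group.
pub fn dedupe_sketch(rows: Vec<RepoRow>) -> Vec<RepoRow> {
    let mut groups: HashMap<(String, String), RepoRow> = HashMap::new();
    for row in rows {
        let key = (owner_key(&row.owner_did).to_string(), row.name.clone());
        match groups.entry(key) {
            Entry::Vacant(slot) => {
                slot.insert(row);
            }
            Entry::Occupied(mut slot) => {
                let kept = slot.get_mut();
                let latest = kept.updated_at.max(row.updated_at);
                // Non-mirror (false) sorts before mirror (true);
                // an earlier created_at breaks the tie.
                if (is_mirror(&row), row.created_at) < (is_mirror(kept), kept.created_at) {
                    *kept = row;
                }
                // The survivor inherits the group's most recent activity.
                kept.updated_at = latest;
            }
        }
    }
    let mut out: Vec<RepoRow> = groups.into_values().collect();
    // Most recently active repos first.
    out.sort_by(|a, b| b.updated_at.cmp(&a.updated_at));
    out
}
```

Feeding it a mirror row (owner `z6MkA`) and a canonical row (owner `did:key:z6MkA`) with the same name yields a single row that keeps the canonical owner and description but takes whichever updated_at is newer.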

How a reviewer can verify

cargo test --bin gitlawb-node repos::tests
# Against a node that has both a mirror and a canonical row for one repo:
curl -fsSL 'http://<node>/api/v1/repos?owner=z6Mkwbud...'  # one record, owner_did = did:key:..., real description

Before you request review

  • Scope is one logical change; no unrelated churn
  • cargo test --workspace passes locally
  • New behavior is covered by tests (required for fixes)
  • cargo fmt --all and cargo clippy --workspace --all-targets -- -D warnings are clean
  • Commit titles use Conventional Commits (feat(...), fix(...), docs(...))
  • Docs / .env.example updated if behavior or config changed (or N/A)
  • Checked existing PRs so this isn't a duplicate

Notes for reviewers

The dedup logic now lives in two places: the SQL DISTINCT ON in Db::list_all_repos_paged and this Rust helper for the non-paged surfaces. They use the same preference rules and the helper's doc comment flags that they must stay in sync. Consolidating both behind one path is possible later but would change the legacy "return all rows" contract that peer/CLI callers rely on, so I kept it out of scope.

Summary by CodeRabbit

  • Bug Fixes
    • Repository listings now properly deduplicate canonical and mirrored entries, so each logical repository shows once.
    • Total counts and “most recent activity” ordering now reflect the deduplicated view (with deterministic tie-breaking).
  • API Updates
    • GraphQL repository results use the deduplicated dataset.
    • The /api/v1/stats repos metric now counts logical repositories consistently with deduplication.
  • Tests
    • Added coverage for canonical-vs-mirror selection, deterministic ordering, did-key-aware grouping, and empty-table behavior.

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a466227d-c4bd-406d-b497-e9146fc1a63e

📥 Commits

Reviewing files that changed from the base of the PR and between 5721791 and 9b9b120.

📒 Files selected for processing (2)
  • crates/gitlawb-node/src/api/repos.rs
  • crates/gitlawb-node/src/db/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/gitlawb-node/src/api/repos.rs

📝 Walkthrough

Repo listing and counting now deduplicate mirror and canonical rows into one logical repository. The database layer adds shared deduped list/count queries, and the API, GraphQL, stats, and tests use the deduped results.

Changes

Canonical repo deduplication

| Layer / File(s) | Summary |
| --- | --- |
| DB dedup query and methods: crates/gitlawb-node/src/db/mod.rs | Defines shared SQL for collapsing canonical and mirror rows, switches paged listing to that SQL, adds unpaged deduped list/count methods, and adds database tests for deduplication and count behavior. |
| API repo list deduplication: crates/gitlawb-node/src/api/repos.rs | Updates list_repos to deduplicate owned repo rows before building the response and total count, deduplicates the local repo set in list_federated_repos, and adds unit tests for selection and tie-breaking rules. |
| GraphQL, stats, and integration tests: crates/gitlawb-node/src/graphql/query.rs, crates/gitlawb-node/src/server.rs, crates/gitlawb-node/src/test_support.rs | Routes GraphQL repos and stats repo counts through the deduped DB methods and adds integration tests that assert the deduped logical repo count on both surfaces. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kevincodex1

Poem

🐇 I hopped through rows both twin and true,
Found one clear repo shining through.
Canonical first, the mirror tucked away,
Now one neat list greets the day.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: deduplicating mirror and canonical repo rows on list surfaces. |
| Description check | ✅ Passed | The description follows the template and covers summary, motivation, change details, verification, and reviewer notes. |
| Linked Issues check | ✅ Passed | The PR addresses #6 by deduplicating mirrored and canonical repo rows, preferring canonical metadata, and covering the affected list/profile surfaces. |
| Out of Scope Changes check | ✅ Passed | The DB, GraphQL, stats, and migration changes are supporting pieces of the same deduplication fix, not unrelated churn. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@beardthelion

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@beardthelion

Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

✅ Action performed

Full review finished.

@beardthelion beardthelion added the kind:bug, crate:node, subsystem:replication, subsystem:api, and sev:low labels Jun 22, 2026

@jatmn jatmn left a comment

Collaborator

Findings

1. GraphQL repos query still returns duplicate mirror/canonical rows

The PR dedups GET /api/v1/repos and /api/v1/repos/federated, but the GraphQL repos query at crates/gitlawb-node/src/graphql/query.rs:12-28 still calls db.list_all_repos() without any dedup. Since that method returns every raw row, a client using /graphql will still see the same logical repo twice when both a mirror row and a canonical row exist.

Evidence from the checkout:

  • graphql/query.rs:15-17 calls db.list_all_repos() and maps the result directly to RepoType.
  • list_all_repos() in db/mod.rs:826-836 selects all rows from repos with no dedup.
  • The REST list path now calls dedupe_canonical_repos on the output of list_all_repos_with_stars() (repos.rs:173 and :1005).

Because the linked issue (#6) asks for "profile and repo-list surfaces" to show one logical repo, and the GraphQL endpoint is a repo-list surface, the PR does not fully close the issue as claimed.

Recommended action: Apply the same dedup to the GraphQL repos resolver, or add a dedicated Db::list_all_repos_deduped() / list_all_repos_with_stars_deduped() method and use it consistently across REST, GraphQL, and stats.

2. stats endpoint inflates repo count when mirror rows exist

/api/v1/stats (server.rs:435-450) uses db.list_all_repos().await and returns r.len() as i64. Because this path is not deduped, a node with both a canonical and a mirror row for the same repo counts them as two repos. This value is displayed by gl sync (crates/gl/src/sync.rs:49-55), so it is user-visible.

This is the same underlying issue as the list-surface bug, but applied to a count. The PR changed the legacy X-Total-Count to count logical repos, which makes the lack of consistency with /api/v1/stats more noticeable.

Recommended action: Either reuse the dedup helper for the stats count, or move the canonical-dedup logic into the DB layer so all callers get consistent counts automatically.

3. Mirror detection relies on a user-settable description string

dedupe_canonical_repos treats any row whose description == "mirrored from peer" as a mirror. The same string is used in the SQL list_all_repos_paged query. Because description is user-provided at repo creation, a canonical repo created with that exact description would be deprioritized against another canonical row, and a mirror row with a different description would be treated as canonical. This is a pre-existing fragility, but the PR duplicates it in Rust code rather than using a dedicated marker (e.g., a boolean column, the id format, or the machine_id/disk_path pattern used by upsert_mirror_repo).

Recommended action: Consider adding an explicit is_mirror column or other non-user-visible marker and update both the SQL and Rust dedup paths to use it.

The non-paged GET /api/v1/repos legacy path and list_federated_repos
returned both the short-owner peer mirror row and the canonical did:key
row for the same logical repo, so profiles rendered the repo twice. Only
the paged path collapsed them, in SQL.

Add a dedupe_canonical_repos helper that groups by (normalized owner,
name), keeps the canonical non-mirror row (tie broken by earliest
created_at), and carries the group's latest updated_at onto the
survivor, matching the paged SQL dedup. Apply it at both non-paged
surfaces and cover it with unit tests.
…ion on the repo id

Addresses jatmn's review on #73 (dedup not applied on every reader path; mirror
detection keyed on a user-settable string).

- GraphQL repos and /api/v1/stats now collapse mirror+canonical rows, via new
  Db::list_all_repos_deduped and count_repos_deduped that share the DISTINCT ON
  CTE with list_all_repos_paged so the dedup rule cannot drift.
- Mirror detection keys on the structural slash-form id (written only by
  upsert_mirror_repo) instead of the description == 'mirrored from peer' string,
  in both the SQL paths and dedupe_canonical_repos.
- Deterministic survivor on a full created_at tie (id ASC) in both
  implementations.
- Legacy REST list and federated keep their method-scoped did_matches owner
  filter in Rust; it does not compose with the method-blind SQL group key, so
  those paths intentionally stay on the Rust helper.
- Adds sqlx and unit tests for the new surfaces, the structural marker, and the
  tiebreak.
…p cases

Follow-ups from code review on the dedup change:
- list_all_repos_deduped/count_repos_deduped: mirror-only group survives, empty
  table returns empty/0, and count_repos_deduped equals the deduped list length
  (guards the two independent SQL queries against grouping-key drift).
- Document list_all_repos as the raw, non-deduped enumeration path (object
  lookup only), so it is not mistaken for a listing-surface method.
@beardthelion beardthelion force-pushed the fix/dedup-mirror-rows-canonical-owner branch from dcbad62 to 8e8b74b June 24, 2026 14:37
@beardthelion beardthelion requested a review from jatmn June 24, 2026 14:38
@beardthelion

Collaborator Author

Thanks, all three are addressed. Rebased onto main first (the conflict is gone; mergeable now).

1 & 2 — GraphQL repos and /api/v1/stats no longer show raw rows. Both now go through new DB methods, list_all_repos_deduped and count_repos_deduped, that reuse the same DISTINCT ON (split_part(owner_did,':',-1), name) selection as list_all_repos_paged (factored into a shared DEDUP_CTE const so the three can't drift). count_repos_deduped uses the COUNT(DISTINCT (split_part(...), name)) idiom already in the paged empty-page path. #[sqlx::test]s cover both surfaces, plus mirror-only, empty-table, and a count-equals-list-length guard.

3 — mirror detection no longer keys on the description. Both the SQL paths and dedupe_canonical_repos now classify a mirror by its slash-form id ({owner_short}/{name}), which upsert_mirror_repo is the only writer of; canonical rows use UUID ids and repo names are sanitized, so no other row can carry a slash. A test seeds a canonical row whose description is literally "mirrored from peer" and confirms it still wins. I went with the structural id marker rather than an is_mirror column to keep this migration-free; happy to add the column as a follow-up if you'd prefer the explicit schema signal.

While there I made the dedup survivor deterministic on a full (mirror-status, created_at) tie via an id ASC backstop, in both the SQL and Rust paths.
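
The survivor rule described in this comment (canonical before mirror, then earliest created_at, then id ASC) amounts to a lexicographic comparison. A rough sketch follows; the `Row` struct, the integer timestamp, and both function names are hypothetical, and only the slash-in-id marker and the ordering itself are taken from the comment.

```rust
/// Hypothetical minimal row for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: String,
    pub created_at: i64, // assumed: Unix seconds
}

/// Mirror rows carry a slash-form id such as "{owner_short}/{name}";
/// canonical rows use UUID ids, which contain no slash.
pub fn is_mirror_id(id: &str) -> bool {
    id.contains('/')
}

/// Pick the survivor of one dedup group: canonical first, then earliest
/// created_at, then smallest id as a deterministic backstop.
pub fn pick_survivor(group: &[Row]) -> Option<&Row> {
    group.iter().min_by(|a, b| {
        (is_mirror_id(&a.id), a.created_at, &a.id)
            .cmp(&(is_mirror_id(&b.id), b.created_at, &b.id))
    })
}
```

Because `false` orders before `true`, a canonical row wins even when the mirror row is older, and a group made up only of mirror rows still yields a survivor.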

One thing I left out of scope: GraphQL repos and /api/v1/stats don't filter on is_public (the previous list_all_repos they called didn't either, so this PR doesn't change that). If those should be public-only surfaces it's a separate visibility fix; say the word and I'll open an issue.

The legacy non-paged list and federated paths keep their existing did_matches owner filter in Rust rather than moving to SQL: did_matches is DID-method-scoped (it won't match did:key:z6X against did:gitlawb:z6X), and the SQL group key is method-blind, so deduping in SQL before that filter could drop a repo from its own owner's listing. Leaving those paths as-is avoids that.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/db/mod.rs (1)

906-917: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Make the SQL owner key did:key-aware instead of suffix-based.

split_part(owner_did, ':', -1) collapses any DID method with the same last segment, and the owner filter is asymmetric: owner=did:key:z6X does not include the bare mirror row z6X, so paged results can disagree with the legacy did_matches path and lose the mirror’s max updated_at. Normalize only did:key:<id> to <id>; keep other DID methods exact.

Suggested shape
- SELECT DISTINCT ON (split_part(owner_did, ':', -1), name)
+ SELECT DISTINCT ON (
+     CASE WHEN owner_did LIKE 'did:key:%' THEN substring(owner_did from 9) ELSE owner_did END,
+     name
+ )
...
- PARTITION BY split_part(owner_did, ':', -1), name
+ PARTITION BY
+     CASE WHEN owner_did LIKE 'did:key:%' THEN substring(owner_did from 9) ELSE owner_did END,
+     name
...
- WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1)
+ WHERE (
+     $1::text IS NULL
+     OR owner_did = $1
+     OR ($1 LIKE 'did:key:%' AND owner_did = substring($1 from 9))
+     OR (owner_did LIKE 'did:key:%' AND $1 = substring(owner_did from 9))
+ )

Apply the same normalized key to count_repos_deduped and the empty-page count fallback.

Also applies to: 1011-1014

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/db/mod.rs` around lines 906 - 917, The deduplication
key in the repos query is still suffix-based, so it can merge unrelated DID
methods and miss the bare mirror row for did:key owners. Update the
normalization in the repos selection logic to treat only did:key:<id> as <id>
while leaving other owner_did values exact, and apply the same normalized key
consistently in count_repos_deduped and the empty-page count fallback so paging
and counts stay aligned with did_matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 1424-1431: The dedupe key normalization in the rows loop is too
broad because it strips everything after the last colon, which can collide
different DID methods; update the key-building logic around rec.owner_did to
match did_matches behavior by only normalizing did:key:<id> to <id> and leaving
other DID methods unchanged. Use the existing owner comparison semantics as the
reference point, and add a regression test covering did:key:z6Same versus
did:gitlawb:z6Same to verify they remain distinct.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e0df3ff1-9b66-4958-8e78-0a70f4c09c04

📥 Commits

Reviewing files that changed from the base of the PR and between dcbad62 and 8e8b74b.

📒 Files selected for processing (5)
  • crates/gitlawb-node/src/api/repos.rs
  • crates/gitlawb-node/src/db/mod.rs
  • crates/gitlawb-node/src/graphql/query.rs
  • crates/gitlawb-node/src/server.rs
  • crates/gitlawb-node/src/test_support.rs

Comment thread crates/gitlawb-node/src/api/repos.rs Outdated

@jatmn jatmn left a comment

Collaborator

Findings

1. Required: cargo fmt is not clean

Severity: Required
Location: crates/gitlawb-node/src/api/repos.rs, crates/gitlawb-node/src/db/mod.rs, crates/gitlawb-node/src/test_support.rs

cargo fmt --all -- --check reports diffs in the PR’s changed code (long single-line record(...) calls, over-length rec(...) signature, assert_eq! macro invocations that should be multi-line, and a few comment alignments). The PR checklist claims cargo fmt --all is clean, but it is not. This must be fixed before merge.


2. Positive: Logic is consistent across SQL and Rust dedup paths

Severity: No issue

I verified the deduplication rules line up between the new Rust helper and the shared SQL CTE:

  • Grouping key: split_part(owner_did, ':', -1) / owner_did.rsplit(':').next().
  • Mirror marker: position('/' in id) > 0 / r.id.contains('/').
  • Survivor preference: canonical (no slash) over mirror.
  • Tie-break: created_at ASC, id ASC in both SQL and Rust.
  • Activity timestamp: the survivor inherits the group’s max updated_at (SQL window function in DEDUP_CTE, latest map in Rust).

The GraphQL and stats surfaces now call the deduped methods, and the only remaining raw consumer is the IPFS object scan, which is intentionally documented as the non-listing path.


3. Positive: Tests cover the important edge cases

Severity: No issue

The new tests exercise:

  • Canonical wins over mirror.
  • Distinct repos are preserved.
  • Same owner, different repo does not collapse.
  • Tie-breaking by earliest created_at and then id ASC.
  • Mirror description on a canonical row does not misclassify it.
  • Mirror-only group survives.
  • Empty table returns empty / zero.
  • count_repos_deduped matches list_all_repos_deduped length.
  • Real upsert_mirror_repo row shape is classified correctly.
  • GraphQL and REST stats integration.

Unit tests in api::repos::tests pass.


4. Optional / pre-existing: paged owner filter still only handles short-form owner

Severity: Optional observation (not introduced by this PR)

In list_all_repos_paged, the owner filter is:

WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1)

If a caller passes the full did:key:z6Mk… form, the LIKE pattern becomes %:did:key:z6Mk…, which will not match the bare z6Mk… mirror row. The non-paged legacy path filters in Rust using did_matches, which handles both forms correctly. Fixing the paged SQL filter to handle full DID is out of scope here, but worth a follow-up to keep the two list paths consistent.


@beardthelion

Collaborator Author

Fixed in 5721791. Ran cargo fmt --all over the three files (repos.rs, db/mod.rs, test_support.rs), which reflowed the long record/rec calls, the single-line assert_eq!s, and the create_repo(seed_repo(...)) calls. cargo fmt --all -- --check is clean now and clippy is warning-free. Finding 4 (paged owner filter on full DID) is pre-existing and out of scope here; I'll leave it for a follow-up.

@beardthelion beardthelion requested a review from jatmn June 24, 2026 18:31

@jatmn jatmn left a comment

Collaborator

Findings

1. Deduplication key is method-blind and can collapse distinct DID methods (Major)

Severity: Major / correctness

Location:

  • crates/gitlawb-node/src/db/mod.rs: DEDUP_CTE (lines 905–925) and count_repos_deduped (lines 1011–1018)
  • crates/gitlawb-node/src/api/repos.rs: dedupe_canonical_repos (lines 1384–1466)

Issue: The dedup grouping key is the last : segment of owner_did:

split_part(owner_did, ':', -1)
rec.owner_did.rsplit(':').next().unwrap_or(&rec.owner_did)

This is method-blind. A repo owned by did:key:z6MkExample and a repo owned by did:gitlawb:z6MkExample (or any other DID method with the same trailing segment) and having the same name will be collapsed into one logical repo. The codebase already recognizes this risk and explicitly guards against it in crates/gitlawb-node/src/api/mod.rs:

/// Match a presented DID against a stored DID ... never let a bare id match across methods —
/// `did:web` / `did:gitlawb` share the base58 space with `did:key`, so a
/// trailing-segment compare would treat `did:key:X` and `did:gitlawb:X` as equal.
pub(crate) fn did_matches(a: &str, b: &str) -> bool { ... }

The new dedup logic does exactly the trailing-segment comparison that did_matches warns against. It is true that repo creation currently only accepts did:key owners, but the database schema does not enforce that, and the project already treats cross-method collision as a real concern. The dedup key should match the project's own DID-matching semantics.
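
To make the collision concrete, the snippet below wraps the Rust key expression quoted above in a small function and can be applied to two owners that differ only in DID method. The function name `last_segment_key` and the sample ids are illustrative, not taken from the codebase.

```rust
/// The method-blind key quoted above: everything after the last ':'
/// (or the whole string when it contains no ':').
pub fn last_segment_key(owner_did: &str) -> &str {
    owner_did.rsplit(':').next().unwrap_or(owner_did)
}
```

Both `last_segment_key("did:key:z6MkExample")` and `last_segment_key("did:gitlawb:z6MkExample")` return `"z6MkExample"`, so two same-named repos under those owners would land in one dedup group.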

Recommended action: Make the dedup key did:key-aware (and bare-id-as-did:key). For example:

  • Treat did:key:<id> as <id>.
  • Treat a bare <id> (no colon) as <id>.
  • Leave any other did:<method>:<id> as the full string.

Apply the same normalization in DEDUP_CTE, count_repos_deduped, and dedupe_canonical_repos so the list, count, and legacy paths stay consistent.


2. Paged owner filter is inconsistent with the legacy path for full-DID owner queries (Optional / pre-existing)

Severity: Optional / observation

Location:

  • crates/gitlawb-node/src/db/mod.rs: list_all_repos_paged WHERE clause (line 917 and empty-page fallback line 972)
  • crates/gitlawb-node/src/api/repos.rs: legacy filter (lines 233–239)
  • crates/gitlawb-node/src/api/mod.rs: did_matches (lines 63–74)

Issue: The paged path filters in SQL:

WHERE ($1::text IS NULL OR owner_did = $1 OR owner_did LIKE '%:' || $1)

If a caller passes the full owner did:key:z6MkExample, the LIKE pattern becomes %:did:key:z6MkExample, which will not match the bare mirror row z6MkExample. The legacy path uses did_matches in Rust, which correctly matches both forms. This means a full-DID owner filter returns different results on the paged and legacy surfaces.
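
The mismatch can be reproduced without a database by modelling the SQL predicate in Rust. This is only an approximation of `owner_did = $1 OR owner_did LIKE '%:' || $1`, valid for filter values that contain no LIKE wildcards, and the function name is made up for the example.

```rust
/// Models `owner_did = $1 OR owner_did LIKE '%:' || $1` for filter values
/// that contain no LIKE wildcards ('%' or '_').
pub fn paged_owner_filter_matches(owner_did: &str, filter: &str) -> bool {
    owner_did == filter || owner_did.ends_with(&format!(":{filter}"))
}
```

With the short filter `z6MkExample`, both the bare mirror row and the `did:key:z6MkExample` row match. With the full filter `did:key:z6MkExample`, only the canonical row matches, because the bare row does not end in `:did:key:z6MkExample`.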

This is not introduced by the PR, but the PR touches the paged query and leaves the inconsistency in place. It is worth a follow-up to keep the two list paths aligned.

Recommended action: In the paged SQL, normalize the owner filter the same way the grouping key is normalized (ideally after fixing finding #1), or add an explicit OR branch for the bare mirror form when the filter is a did:key: value.

…ods don't collapse

The dedup grouping key took the last ':' segment of owner_did
(split_part / rsplit), so two repos owned by did:key:X and
did:gitlawb:X with the same name collapsed into one logical repo on
the list, paged, count, stats, and GraphQL surfaces. That is the exact
cross-method collision the codebase already guards against in
did_matches.

Replace it with a did:key-aware key that strips a did:key: prefix only
when the remainder is a bare id (no ':'), otherwise keeps the full DID,
reproducing did_matches/key_id as an equivalence relation: did:key:X
and a bare mirror X still collapse, while distinct methods never merge.
Applied byte-identically across DEDUP_CTE (DISTINCT ON / PARTITION BY /
ORDER BY), count_repos_deduped, the empty-page count fallback, and the
in-memory dedupe_canonical_repos, so the SQL and Rust paths agree.

The backing index lived in the already-released migration v1, so v1 is
left untouched and a new migration v7 drops idx_repos_owner_short_name
and builds idx_repos_owner_key_name on the matching expression; the
CASE must stay byte-identical to the queries or Postgres stops using it.

Tests cover both the in-memory and SQL paths: distinct methods stay
separate, bare-id and did:key forms collapse, and the residual-colon
guard keeps a malformed did:key:did:gitlawb:X distinct from the bare
method DID. 216 pass.
@beardthelion

Collaborator Author

Both addressed in 9b9b120.

1 (Major) - method-blind dedup key. Fixed. Replaced the split_part(owner_did, ':', -1) / rsplit(':') last-segment key with a did:key-aware key that strips a did:key: prefix only when the remainder has no :, and keeps the full DID otherwise:

  • SQL: CASE WHEN owner_did LIKE 'did:key:%' AND position(':' in substr(owner_did, 9)) = 0 THEN substr(owner_did, 9) ELSE owner_did END
  • Rust: owner_did.strip_prefix("did:key:").filter(|rest| !rest.contains(':'))

That reproduces did_matches/key_id as an equivalence relation, so did:key:z6Mk… and the bare mirror z6Mk… still collapse while did:key:z6Mk… and did:gitlawb:z6Mk… stay distinct. The residual-colon guard matches key_id's !ka.contains(':') check, so even a malformed did:key:did:gitlawb:X keeps its full form rather than collapsing onto the bare method DID.

Applied in every spot the key is read: DEDUP_CTE (DISTINCT ON / PARTITION BY / ORDER BY), count_repos_deduped, the empty-page COUNT(DISTINCT …) fallback in list_all_repos_paged, and dedupe_canonical_repos. The backing index lived in the already-released migration v1, so I left v1 untouched and added migration v7 to drop idx_repos_owner_short_name and build idx_repos_owner_key_name on the matching expression; the CASE is byte-identical across all of them so the planner still uses the index.

Tests cover cross-method distinctness and bare/did:key collapse on both the in-memory and SQL paths, plus the did:key:-wrapped-full-DID and empty-residual boundaries.
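
For readers following along, the Rust expression quoted in this comment can be wrapped into a standalone function as below. The function name `owner_key` and the `unwrap_or` fallback are assumptions added to make the snippet self-contained; the fallback is inferred from "keeps the full DID otherwise".

```rust
/// did:key-aware owner key: strip a leading "did:key:" only when what
/// remains is a bare id (contains no ':'); otherwise keep the full string.
pub fn owner_key(owner_did: &str) -> &str {
    owner_did
        .strip_prefix("did:key:")
        .filter(|rest| !rest.contains(':'))
        .unwrap_or(owner_did)
}
```

Under this sketch, `did:key:z6MkExample` and the bare `z6MkExample` share the key `z6MkExample`, `did:gitlawb:z6MkExample` keeps its full form, and the malformed `did:key:did:gitlawb:X` also stays whole because the remainder still contains a colon.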

2 (Optional / pre-existing) - paged owner filter. Left as-is for now. The owner_did LIKE '%:' || $1 branch still trailing-matches, so a full did:key: owner filter won't hit the bare mirror row, and it now diverges from the grouping key on cross-method ids the same way you noted. It predates this PR and is out of scope for the under-withholding intent, so I'd rather not widen the diff here. Happy to file a follow-up to align the filter with the new key (and with did_matches on the legacy path) if you'd prefer that tracked.

@beardthelion beardthelion requested a review from jatmn June 24, 2026 21:49

@jatmn jatmn left a comment

Collaborator

Thanks for the contribution. I do not see any actionable issues from my review.

@kevincodex1 LGTM

@kevincodex1 kevincodex1 left a comment

Contributor

LGTM

@kevincodex1 kevincodex1 merged commit 3e8e333 into main Jun 25, 2026
14 checks passed

Labels

  • crate:node (gitlawb-node: the serving node and REST API)
  • kind:bug (Defect fix: wrong or unsafe behavior)
  • sev:low (Cosmetic, cleanup, or nice-to-have)
  • subsystem:api (Node REST API request/response surface)
  • subsystem:replication (Mirror, replica, and cross-node sync)

Development

Successfully merging this pull request may close these issues.

Deduplicate mirrored repo rows with canonical did:key owner on profile/list surfaces

3 participants