Problem
Duplicate detection results are currently cached only in Redis with a 10-minute TTL. When the page is reloaded, results are gone and the user must click "Find Duplicates" again (~56 seconds for a 15K-class ontology).
Proposal
- New duplicate_detection_results table: stores the latest detection results per project/branch, including clusters, threshold, and timestamp
- Auto-update on index rebuild: when run_ontology_index_task completes (or the index is updated via entity edits), automatically re-run duplicate detection and persist the results
- Frontend loads persisted results on mount: the Duplicates tab should check for stored results first, showing them immediately without requiring a manual "Find Duplicates" click
- "Find Duplicates" button re-runs detection: still available for on-demand refresh, updating the persisted results
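The storage side of the proposal could look roughly like the sketch below. It is not the project's schema: the column names, the composite primary key, and the helper names save_results and load_results are assumptions, and sqlite3 stands in for PostgreSQL only so the example runs on its own. The idea it illustrates is one row per project/branch that each detection run overwrites, so the Duplicates tab can read it back without re-running detection.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per (project_id, branch) holding the latest run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS duplicate_detection_results (
    project_id TEXT NOT NULL,
    branch     TEXT NOT NULL,
    threshold  REAL NOT NULL,
    clusters   TEXT NOT NULL,  -- JSON text here; JSONB would be the natural PostgreSQL type
    created_at TEXT NOT NULL,
    PRIMARY KEY (project_id, branch)
)
"""


def save_results(conn, project_id, branch, threshold, clusters):
    """Insert or overwrite the stored results for one project/branch."""
    conn.execute(
        """
        INSERT INTO duplicate_detection_results
            (project_id, branch, threshold, clusters, created_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (project_id, branch) DO UPDATE SET
            threshold = excluded.threshold,
            clusters = excluded.clusters,
            created_at = excluded.created_at
        """,
        (
            project_id,
            branch,
            threshold,
            json.dumps(clusters),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()


def load_results(conn, project_id, branch):
    """Return the stored results, or None if nothing has been persisted yet."""
    row = conn.execute(
        "SELECT threshold, clusters, created_at"
        " FROM duplicate_detection_results"
        " WHERE project_id = ? AND branch = ?",
        (project_id, branch),
    ).fetchone()
    if row is None:
        return None
    threshold, clusters, created_at = row
    return {
        "threshold": threshold,
        "clusters": json.loads(clusters),
        "created_at": created_at,
    }


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_results(conn, "proj-1", "main", 0.8, [["Heart", "heart"]])
save_results(conn, "proj-1", "main", 0.9, [["Valve", "valve"]])  # replaces the first run
latest = load_results(conn, "proj-1", "main")
```

A None return from load_results is the case where the frontend would fall back to the manual "Find Duplicates" flow. The same ON CONFLICT ... DO UPDATE form is available in PostgreSQL.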
Context
PR #80 moved duplicate detection to the ARQ worker queue and rewrote it to use PostgreSQL's pg_trgm GIN index instead of in-memory rdflib parsing. The detection itself now completes in ~56 seconds for large ontologies. Persisting results would let the Duplicates tab show them immediately after a reload instead of repeating that wait.
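For readers unfamiliar with pg_trgm, the sketch below approximates in plain Python how trigram similarity scores a pair of labels. It follows the padding rule described in the pg_trgm documentation (each word gets two spaces prefixed and one suffixed) but is an ASCII-only simplification, not the extension's implementation. The function names and the 0.5 threshold are illustrative assumptions; the issue does not state the threshold the detection actually uses.

```python
import re


def trigrams(text):
    """Trigram set for a string, using pg_trgm's documented padding rule."""
    grams = set()
    # Lowercase and split on non-alphanumeric characters (ASCII-only simplification).
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces prefixed, one space suffixed
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def likely_duplicates(labels, threshold):
    """Brute-force pairwise comparison; fine for a demo, quadratic in len(labels)."""
    pairs = []
    for i, left in enumerate(labels):
        for right in labels[i + 1:]:
            if similarity(left, right) >= threshold:
                pairs.append((left, right))
    return pairs


pairs = likely_duplicates(["Heart valve", "Heart valves", "Kidney"], threshold=0.5)
```

The pairwise loop is what a GIN trigram index lets PostgreSQL avoid: the index narrows the candidates for each label instead of comparing every pair.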
Tasks
- Add duplicate_detection_results table and Alembic migration
- Persist results when run_duplicate_detection_task completes
- Trigger duplicate detection from run_ontology_index_task
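One possible shape for the index-rebuild hook is sketched below. The task signature (ctx, project_id, branch), the rebuild_index placeholder, and the job ID format are assumptions, not the project's code. In an ARQ worker, ctx["redis"] is the pool whose enqueue_job accepts a _job_id and returns None when a job with that ID already exists; FakeQueue imitates only that behaviour so the sketch runs without Redis.

```python
import asyncio


async def rebuild_index(project_id, branch):
    """Hypothetical placeholder for the real index rebuild."""
    await asyncio.sleep(0)


async def run_ontology_index_task(ctx, project_id, branch):
    await rebuild_index(project_id, branch)
    # A fixed job ID per project/branch means a burst of rebuilds queues
    # at most one detection run instead of one per rebuild.
    await ctx["redis"].enqueue_job(
        "run_duplicate_detection_task",
        project_id,
        branch,
        _job_id=f"duplicates:{project_id}:{branch}",
    )


class FakeQueue:
    """Stand-in for ARQ's Redis pool; tracks queued jobs by ID."""

    def __init__(self):
        self.pending = {}

    async def enqueue_job(self, function, *args, _job_id=None):
        if _job_id in self.pending:
            return None  # duplicate ID: nothing new is queued
        self.pending[_job_id] = (function, args)
        return _job_id


async def main():
    ctx = {"redis": FakeQueue()}
    await run_ontology_index_task(ctx, "proj-1", "main")
    await run_ontology_index_task(ctx, "proj-1", "main")  # second call is a no-op
    return ctx["redis"].pending


pending = asyncio.run(main())
```

Deduplication matters here because each detection run takes close to a minute and entity edits can update the index repeatedly. One caveat to verify against the ARQ version in use: ARQ may also reject a job ID while a stored result for it is still retained, in which case the detection task's keep_result setting would need to be short for this pattern to re-trigger promptly.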