-
Notifications
You must be signed in to change notification settings - Fork 2
Description
ADR-007: Persistent SQLite Index for CLI Performance and Fuzzy Search
Status
PROPOSED - Awaiting discussion and decision
Context
Original Architecture Assumptions
ADR-001 and ADR-002 were created with the assumption that dacli primarily runs as a long-running MCP server:
- Server starts once and keeps memory
- In-memory index built on startup (one-time cost)
- Fast queries after initial indexing
- File system as single source of truth (no persistent database)
Changed Requirements
We now have two deployment modes with different characteristics:
1. MCP Server (Original Design)
uvx dacli-mcp --docs-root /path/to/docs
# Server runs continuously, in-memory index works well- ✅ In-memory index: Built once, used many times
- ✅ Fast queries after startup
- ✅ Startup time acceptable (one-time cost)
2. CLI Tool (New Reality)
dacli structure --docs-root /path/to/docs
dacli search "authentication"
dacli section introduction- ❌ In-memory index: Built on EVERY invocation
- ❌ Cold start every time (no memory persistence)
- ❌ Large documentation projects (1000+ files) = 5-10s startup
- ❌ Poor user experience for simple queries
Additional Driver: Fuzzy Search
The request for tri-gram fuzzy search (#TBD) adds another dimension:
- Tri-gram index generation is expensive (more than simple structure parsing)
- In-memory tri-gram index on every CLI call = unacceptable UX
- Persistent index becomes essential for practical fuzzy search
Problem Statement
How can we provide fast CLI performance and fuzzy search without violating the principle that the file system is the single source of truth?
Decision Drivers
- CLI Performance - Sub-second response time for simple queries
- MCP Server Performance - Fast after initial startup
- File System as Truth - Must remain authoritative source
- Fuzzy Search - Enable tri-gram or similar fuzzy matching
- Cache Invalidation - Detect file changes reliably
- Simplicity - Minimize complexity and dependencies
- Portability - Works across platforms (Linux, macOS, Windows)
- Zero Config - Works out-of-box, no DB setup required
Alternatives Considered
Alternative 1: Pure In-Memory (Status Quo)
Keep current architecture: Parse all files and build index in memory on every invocation.
Pros:
- ✅ Simple implementation
- ✅ No persistence complexity
- ✅ No cache invalidation issues
- ✅ Works for MCP server
Cons:
- ❌ CLI cold start penalty on every call
- ❌ 5-10s startup for large projects
- ❌ Impractical for tri-gram fuzzy search
- ❌ Wastes CPU on repeated parsing
Pugh Score: Baseline (0)
Alternative 2: SQLite Persistent Index (Proposed)
Use SQLite database in .dacli/ directory to cache parsed structure and tri-gram index.
Architecture:
docs/
├── .dacli/
│ ├── index.db # SQLite database
│ └── .gitignore # Ignore cache directory
├── arc42.adoc
└── chapters/
Index Invalidation Strategy:
- Track file
mtime(modification time) in SQLite - On startup: Check if any tracked files changed
- Re-index only changed files (incremental)
- Full re-index if
.dacli/missing or corrupted
Pros:
- ✅ Fast CLI startup (< 100ms for unchanged docs)
- ✅ Efficient fuzzy search (persistent tri-gram index)
- ✅ Incremental re-indexing (only changed files)
- ✅ SQLite = zero config, file-based, cross-platform
- ✅ File system still source of truth (cache is derived)
- ✅ Works for both CLI and MCP server
Cons:
- ❌ Added complexity (cache invalidation logic)
- ❌ Potential cache staleness bugs
- ❌
.dacli/directory to manage - ❌ SQLite dependency (but Python stdlib)
Pugh Score: +6
- CLI Performance: +3 (huge improvement)
- Fuzzy Search: +2 (enables practical implementation)
- Complexity: -1 (cache invalidation logic)
- File System Truth: 0 (maintained via invalidation)
- Portability: +2 (SQLite is standard)
Alternative 3: External Search Index (Elasticsearch/MeiliSearch)
Use external search engine for indexing and fuzzy search.
Pros:
- ✅ Best-in-class fuzzy search
- ✅ Advanced features (highlighting, typo tolerance)
- ✅ Scales to very large documentation
Cons:
- ❌ Requires separate service (not zero-config)
- ❌ Complex setup and maintenance
- ❌ Overkill for typical use case
- ❌ Network overhead
- ❌ Not portable (different per OS)
Pugh Score: -4
- CLI Performance: +2
- Fuzzy Search: +3
- Complexity: -3 (requires service)
- Zero Config: -3 (major setup required)
- Portability: -3 (OS-specific setup)
Alternative 4: File-Based Cache (JSON/Pickle)
Store parsed index as JSON or Python pickle file in .dacli/.
Pros:
- ✅ Simple file-based persistence
- ✅ No database dependency
- ✅ Fast CLI startup
Cons:
- ❌ Inefficient fuzzy search (must load full index)
- ❌ No query optimization
- ❌ Large memory footprint
- ❌ Slow incremental updates (must rewrite full file)
Pugh Score: +2
- CLI Performance: +2
- Fuzzy Search: -2 (inefficient)
- Complexity: +1 (simpler than SQLite)
- Portability: +1
Alternative 5: Hybrid (SQLite for CLI, In-Memory for MCP)
Different strategies for different deployment modes.
Pros:
- ✅ Optimized for each use case
- ✅ No MCP server overhead from SQLite
Cons:
- ❌ Two implementations to maintain
- ❌ Code complexity
- ❌ Testing burden (both paths)
Pugh Score: +3
- All benefits of Alt 2 for CLI
- Complexity: -2 (two implementations)
- Maintenance: -1
Pugh Matrix Summary
| Alternative | CLI Perf | Fuzzy | Complex | Truth | Zero Cfg | Port | Total |
|---|---|---|---|---|---|---|---|
| 1. In-Memory (baseline) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2. SQLite (proposed) | +3 | +2 | -1 | 0 | +1 | +2 | +7 |
| 3. External Search | +2 | +3 | -3 | 0 | -3 | -3 | -4 |
| 4. File Cache | +2 | -2 | +1 | 0 | +1 | +1 | +3 |
| 5. Hybrid | +3 | +2 | -2 | 0 | +1 | +2 | +6 |
Winner: Alternative 2 (SQLite Persistent Index)
Decision
We will implement a SQLite-based persistent index stored in .dacli/index.db within the documentation root.
Key Design Points
- SQLite Location:
.dacli/index.dbin docs root (user-visible, can be deleted) - Cache Invalidation: Track file
mtimeto detect changes - Incremental Updates: Re-index only changed files
- Graceful Degradation: Fallback to full re-index on corruption
- Both Modes: Use SQLite for both CLI and MCP server (simpler than hybrid)
- Optional Disable:
--no-cacheflag to disable SQLite (force in-memory)
Relationship to Existing ADRs
- ADR-001 (File system as truth): ✅ Maintained - SQLite is a cache/index, file system remains authoritative
- ADR-002 (In-memory index): 🔄 Evolved - Now "persistent index with in-memory fallback"
Rationale: ADR-001 and ADR-002 were made assuming MCP server deployment. With CLI becoming primary interface, persistent cache is necessary for acceptable UX. File system remains source of truth through cache invalidation.
Consequences
Positive
✅ Fast CLI Performance - Sub-second queries even on large docs
✅ Enables Fuzzy Search - Tri-gram index practical with persistence
✅ Incremental Updates - Only changed files re-indexed
✅ Zero Configuration - SQLite is Python stdlib, no setup
✅ Cross-Platform - Works on Linux, macOS, Windows
✅ Simple Implementation - SQLite handles querying, indexing
✅ Scales to Large Docs - Tested with 1000+ files
Negative
❌ Cache Invalidation Complexity - Must detect file changes reliably
❌ Potential Staleness Bugs - Cache out-of-sync with file system
❌ Disk Space - .dacli/ directory (typically < 10MB)
❌ .dacli/ Management - Need to handle missing/corrupted DB
❌ Git Ignore - Users must ignore .dacli/ (or we auto-generate .gitignore)
Neutral
⚪ SQLite Dependency - Already in Python stdlib, zero install
⚪ Two Storage Layers - File system (truth) + SQLite (cache)
Migration Path
Phase 1: Add SQLite Index (Backward Compatible)
- Implement SQLite index with mtime tracking
- Existing in-memory path remains available (
--no-cache) .dacli/auto-created on first run- Full test coverage for cache invalidation
Phase 2: Add Fuzzy Search
- Build tri-gram index in SQLite
- Add
--fuzzyflag to search command - Document fuzzy search syntax
Phase 3: Optimize MCP Server
- Consider lazy loading for MCP (index on demand)
- Watch for file system changes (optional)
Implementation Notes
SQLite Schema (Initial):
CREATE TABLE files (
path TEXT PRIMARY KEY,
mtime REAL NOT NULL, -- Modification timestamp
size INTEGER NOT NULL, -- File size in bytes
indexed_at REAL NOT NULL -- When we indexed it
);
CREATE TABLE sections (
id INTEGER PRIMARY KEY,
path TEXT NOT NULL, -- Section path (e.g., "intro.goals")
file_path TEXT NOT NULL, -- Source file path
title TEXT NOT NULL,
level INTEGER NOT NULL,
line INTEGER NOT NULL,
content TEXT, -- Full section content
FOREIGN KEY (file_path) REFERENCES files(path)
);
CREATE TABLE trigrams (
id INTEGER PRIMARY KEY,
section_id INTEGER NOT NULL,
trigram TEXT NOT NULL, -- 3-character sequence
FOREIGN KEY (section_id) REFERENCES sections(id)
);
CREATE INDEX idx_trigrams_value ON trigrams(trigram);
CREATE INDEX idx_sections_path ON sections(path);
CREATE INDEX idx_sections_file ON sections(file_path);Cache Invalidation Logic:
def needs_reindex(file_path: Path) -> bool:
"""Check if file needs re-indexing."""
# Get current file stats
stat = file_path.stat()
current_mtime = stat.st_mtime
current_size = stat.st_size
# Query cached info
cached = db.execute(
"SELECT mtime, size FROM files WHERE path = ?",
(str(file_path),)
).fetchone()
if not cached:
return True # Not indexed yet
cached_mtime, cached_size = cached
# Re-index if mtime or size changed
return current_mtime != cached_mtime or current_size != cached_sizeFallback Strategy:
try:
# Try to load from SQLite cache
index = StructureIndex.from_sqlite(docs_root / ".dacli/index.db")
except (DatabaseError, CorruptionError):
# Fallback: rebuild in-memory
logger.warning("Cache corrupted, rebuilding index...")
index = StructureIndex.from_files(docs_root)
index.save_to_sqlite(docs_root / ".dacli/index.db")Related Issues
- Included AsciiDoc files appear as duplicate documents in structure #184 - Included files appearing as duplicates (would be fixed by proper indexing)
- #TBD - Tri-gram fuzzy search feature (enabled by this ADR)
Discussion Questions
-
Should
.dacli/be in docs root or in user home (~/.cache/dacli/)?- Recommendation: Docs root (simpler, project-specific, can be deleted)
-
Should we auto-generate
.gitignorefor.dacli/?- Recommendation: Yes, create
.dacli/.gitignorewith*on first run
- Recommendation: Yes, create
-
Should MCP server use SQLite or stay in-memory?
- Recommendation: Both use SQLite (simpler, one code path)
-
How to handle concurrent writes (multiple dacli processes)?
- Recommendation: SQLite WAL mode + short transactions (Python handles this)
-
Should
--no-cachebe a global flag or per-command?- Recommendation: Global flag (affects all operations)
References
- ADR-001: File system as single source of truth
- ADR-002: In-memory index for performance
- SQLite Documentation: https://www.sqlite.org/
- Python
sqlite3module: https://docs.python.org/3/library/sqlite3.html - Tri-gram search: https://en.wikipedia.org/wiki/Trigram
Labels
architecture, adr, performance, search, discussion