ADR-007: Persistent SQLite Index for CLI Performance and Fuzzy Search

# ADR-007: Persistent SQLite Index for CLI Performance and Fuzzy Search

## Status

**PROPOSED** - Awaiting discussion and decision

## Context

### Original Architecture Assumptions

**ADR-001** and **ADR-002** were created with the assumption that dacli primarily runs as a **long-running MCP server**:
- Server starts once and keeps memory
- In-memory index built on startup (one-time cost)
- Fast queries after initial indexing
- File system as single source of truth (no persistent database)

### Changed Requirements

We now have **two deployment modes** with different characteristics:

#### 1. MCP Server (Original Design)
```bash
uvx dacli-mcp --docs-root /path/to/docs
# Server runs continuously, in-memory index works well
```
- ✅ In-memory index: Built once, used many times
- ✅ Fast queries after startup
- ✅ Startup time acceptable (one-time cost)

#### 2. CLI Tool (New Reality)
```bash
dacli structure --docs-root /path/to/docs
dacli search "authentication"
dacli section introduction
```
- ❌ In-memory index: Built on EVERY invocation
- ❌ Cold start every time (no memory persistence)
- ❌ Large documentation projects (1000+ files) = 5-10s startup
- ❌ Poor user experience for simple queries

### Additional Driver: Fuzzy Search

The request for **tri-gram fuzzy search** (#TBD) adds another dimension:
- Tri-gram index generation is expensive (more than simple structure parsing)
- In-memory tri-gram index on every CLI call = unacceptable UX
- Persistent index becomes essential for practical fuzzy search

### Problem Statement

**How can we provide fast CLI performance and fuzzy search without violating the principle that the file system is the single source of truth?**

## Decision Drivers

1. **CLI Performance** - Sub-second response time for simple queries
2. **MCP Server Performance** - Fast after initial startup
3. **File System as Truth** - Must remain authoritative source
4. **Fuzzy Search** - Enable tri-gram or similar fuzzy matching
5. **Cache Invalidation** - Detect file changes reliably
6. **Simplicity** - Minimize complexity and dependencies
7. **Portability** - Works across platforms (Linux, macOS, Windows)
8. **Zero Config** - Works out-of-box, no DB setup required

## Alternatives Considered

### Alternative 1: Pure In-Memory (Status Quo)

Keep current architecture: Parse all files and build index in memory on every invocation.

**Pros:**
- ✅ Simple implementation
- ✅ No persistence complexity
- ✅ No cache invalidation issues
- ✅ Works for MCP server

**Cons:**
- ❌ CLI cold start penalty on every call
- ❌ 5-10s startup for large projects
- ❌ Impractical for tri-gram fuzzy search
- ❌ Wastes CPU on repeated parsing

**Pugh Score:** Baseline (0)

### Alternative 2: SQLite Persistent Index (Proposed)

Use SQLite database in `.dacli/` directory to cache parsed structure and tri-gram index.

**Architecture:**
```
docs/
├── .dacli/
│   ├── index.db          # SQLite database
│   └── .gitignore        # Ignore cache directory
├── arc42.adoc
└── chapters/
```

**Index Invalidation Strategy:**
- Track file `mtime` (modification time) in SQLite
- On startup: Check if any tracked files changed
- Re-index only changed files (incremental)
- Full re-index if `.dacli/` missing or corrupted

**Pros:**
- ✅ Fast CLI startup (< 100ms for unchanged docs)
- ✅ Efficient fuzzy search (persistent tri-gram index)
- ✅ Incremental re-indexing (only changed files)
- ✅ SQLite = zero config, file-based, cross-platform
- ✅ File system still source of truth (cache is derived)
- ✅ Works for both CLI and MCP server

**Cons:**
- ❌ Added complexity (cache invalidation logic)
- ❌ Potential cache staleness bugs
- ❌ `.dacli/` directory to manage
- ❌ SQLite dependency (but Python stdlib)

**Pugh Score:** +6
- CLI Performance: +3 (huge improvement)
- Fuzzy Search: +2 (enables practical implementation)
- Complexity: -1 (cache invalidation logic)
- File System Truth: 0 (maintained via invalidation)
- Portability: +2 (SQLite is standard)

### Alternative 3: External Search Index (Elasticsearch/MeiliSearch)

Use external search engine for indexing and fuzzy search.

**Pros:**
- ✅ Best-in-class fuzzy search
- ✅ Advanced features (highlighting, typo tolerance)
- ✅ Scales to very large documentation

**Cons:**
- ❌ Requires separate service (not zero-config)
- ❌ Complex setup and maintenance
- ❌ Overkill for typical use case
- ❌ Network overhead
- ❌ Not portable (different per OS)

**Pugh Score:** -4
- CLI Performance: +2
- Fuzzy Search: +3
- Complexity: -3 (requires service)
- Zero Config: -3 (major setup required)
- Portability: -3 (OS-specific setup)

### Alternative 4: File-Based Cache (JSON/Pickle)

Store parsed index as JSON or Python pickle file in `.dacli/`.

**Pros:**
- ✅ Simple file-based persistence
- ✅ No database dependency
- ✅ Fast CLI startup

**Cons:**
- ❌ Inefficient fuzzy search (must load full index)
- ❌ No query optimization
- ❌ Large memory footprint
- ❌ Slow incremental updates (must rewrite full file)

**Pugh Score:** +2
- CLI Performance: +2
- Fuzzy Search: -2 (inefficient)
- Complexity: +1 (simpler than SQLite)
- Portability: +1

### Alternative 5: Hybrid (SQLite for CLI, In-Memory for MCP)

Different strategies for different deployment modes.

**Pros:**
- ✅ Optimized for each use case
- ✅ No MCP server overhead from SQLite

**Cons:**
- ❌ Two implementations to maintain
- ❌ Code complexity
- ❌ Testing burden (both paths)

**Pugh Score:** +3
- All benefits of Alt 2 for CLI
- Complexity: -2 (two implementations)
- Maintenance: -1

## Pugh Matrix Summary

| Alternative | CLI Perf | Fuzzy | Complex | Truth | Zero Cfg | Port | **Total** |
|-------------|----------|-------|---------|-------|----------|------|-----------|
| 1. In-Memory (baseline) | 0 | 0 | 0 | 0 | 0 | 0 | **0** |
| 2. SQLite (proposed) | +3 | +2 | -1 | 0 | +1 | +2 | **+7** |
| 3. External Search | +2 | +3 | -3 | 0 | -3 | -3 | **-4** |
| 4. File Cache | +2 | -2 | +1 | 0 | +1 | +1 | **+3** |
| 5. Hybrid | +3 | +2 | -2 | 0 | +1 | +2 | **+6** |

**Winner:** Alternative 2 (SQLite Persistent Index)

## Decision

We will implement a **SQLite-based persistent index** stored in `.dacli/index.db` within the documentation root.

### Key Design Points

1. **SQLite Location:** `.dacli/index.db` in docs root (user-visible, can be deleted)
2. **Cache Invalidation:** Track file `mtime` to detect changes
3. **Incremental Updates:** Re-index only changed files
4. **Graceful Degradation:** Fallback to full re-index on corruption
5. **Both Modes:** Use SQLite for both CLI and MCP server (simpler than hybrid)
6. **Optional Disable:** `--no-cache` flag to disable SQLite (force in-memory)

### Relationship to Existing ADRs

- **ADR-001 (File system as truth):** ✅ **Maintained** - SQLite is a cache/index, file system remains authoritative
- **ADR-002 (In-memory index):** 🔄 **Evolved** - Now "persistent index with in-memory fallback"

**Rationale:** ADR-001 and ADR-002 were made assuming MCP server deployment. With CLI becoming primary interface, persistent cache is necessary for acceptable UX. File system remains source of truth through cache invalidation.

## Consequences

### Positive

✅ **Fast CLI Performance** - Sub-second queries even on large docs
✅ **Enables Fuzzy Search** - Tri-gram index practical with persistence
✅ **Incremental Updates** - Only changed files re-indexed
✅ **Zero Configuration** - SQLite is Python stdlib, no setup
✅ **Cross-Platform** - Works on Linux, macOS, Windows
✅ **Simple Implementation** - SQLite handles querying, indexing
✅ **Scales to Large Docs** - Tested with 1000+ files

### Negative

❌ **Cache Invalidation Complexity** - Must detect file changes reliably
❌ **Potential Staleness Bugs** - Cache out-of-sync with file system
❌ **Disk Space** - `.dacli/` directory (typically < 10MB)
❌ **`.dacli/` Management** - Need to handle missing/corrupted DB
❌ **Git Ignore** - Users must ignore `.dacli/` (or we auto-generate `.gitignore`)

### Neutral

⚪ **SQLite Dependency** - Already in Python stdlib, zero install
⚪ **Two Storage Layers** - File system (truth) + SQLite (cache)

### Migration Path

**Phase 1: Add SQLite Index (Backward Compatible)**
1. Implement SQLite index with mtime tracking
2. Existing in-memory path remains available (`--no-cache`)
3. `.dacli/` auto-created on first run
4. Full test coverage for cache invalidation

**Phase 2: Add Fuzzy Search**
1. Build tri-gram index in SQLite
2. Add `--fuzzy` flag to search command
3. Document fuzzy search syntax

**Phase 3: Optimize MCP Server**
1. Consider lazy loading for MCP (index on demand)
2. Watch for file system changes (optional)

### Implementation Notes

**SQLite Schema (Initial):**
```sql
CREATE TABLE files (
    path TEXT PRIMARY KEY,
    mtime REAL NOT NULL,          -- Modification timestamp
    size INTEGER NOT NULL,         -- File size in bytes
    indexed_at REAL NOT NULL       -- When we indexed it
);

CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,            -- Section path (e.g., "intro.goals")
    file_path TEXT NOT NULL,       -- Source file path
    title TEXT NOT NULL,
    level INTEGER NOT NULL,
    line INTEGER NOT NULL,
    content TEXT,                  -- Full section content
    FOREIGN KEY (file_path) REFERENCES files(path)
);

CREATE TABLE trigrams (
    id INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL,
    trigram TEXT NOT NULL,         -- 3-character sequence
    FOREIGN KEY (section_id) REFERENCES sections(id)
);

CREATE INDEX idx_trigrams_value ON trigrams(trigram);
CREATE INDEX idx_sections_path ON sections(path);
CREATE INDEX idx_sections_file ON sections(file_path);
```

**Cache Invalidation Logic:**
```python
def needs_reindex(file_path: Path) -> bool:
    """Check if file needs re-indexing."""
    # Get current file stats
    stat = file_path.stat()
    current_mtime = stat.st_mtime
    current_size = stat.st_size
    
    # Query cached info
    cached = db.execute(
        "SELECT mtime, size FROM files WHERE path = ?",
        (str(file_path),)
    ).fetchone()
    
    if not cached:
        return True  # Not indexed yet
    
    cached_mtime, cached_size = cached
    
    # Re-index if mtime or size changed
    return current_mtime != cached_mtime or current_size != cached_size
```

**Fallback Strategy:**
```python
try:
    # Try to load from SQLite cache
    index = StructureIndex.from_sqlite(docs_root / ".dacli/index.db")
except (DatabaseError, CorruptionError):
    # Fallback: rebuild in-memory
    logger.warning("Cache corrupted, rebuilding index...")
    index = StructureIndex.from_files(docs_root)
    index.save_to_sqlite(docs_root / ".dacli/index.db")
```

## Related Issues

- #184 - Included files appearing as duplicates (would be fixed by proper indexing)
- #TBD - Tri-gram fuzzy search feature (enabled by this ADR)

## Discussion Questions

1. Should `.dacli/` be in docs root or in user home (`~/.cache/dacli/`)?
   - **Recommendation:** Docs root (simpler, project-specific, can be deleted)

2. Should we auto-generate `.gitignore` for `.dacli/`?
   - **Recommendation:** Yes, create `.dacli/.gitignore` with `*` on first run

3. Should MCP server use SQLite or stay in-memory?
   - **Recommendation:** Both use SQLite (simpler, one code path)

4. How to handle concurrent writes (multiple dacli processes)?
   - **Recommendation:** SQLite WAL mode + short transactions (Python handles this)

5. Should `--no-cache` be a global flag or per-command?
   - **Recommendation:** Global flag (affects all operations)

## References

- ADR-001: File system as single source of truth
- ADR-002: In-memory index for performance
- SQLite Documentation: https://www.sqlite.org/
- Python `sqlite3` module: https://docs.python.org/3/library/sqlite3.html
- Tri-gram search: https://en.wikipedia.org/wiki/Trigram

## Labels

`architecture`, `adr`, `performance`, `search`, `discussion`

Alternative	CLI Perf	Fuzzy	Complex	Zero Cfg	Port	Total
1. In-Memory (baseline)	0	0	0	0	0	0
2. SQLite (proposed)	+3	+2	-1	+1	+2	+7
3. External Search	+2	+3	-3	-3	-3	-4
4. File Cache	+2	-2	+1	+1	+1	+3
5. Hybrid	+3	+2	-2	+1	+2	+6

ADR-007: Persistent SQLite Index for CLI Performance and Fuzzy Search #185

Description

ADR-007: Persistent SQLite Index for CLI Performance and Fuzzy Search

Status

Context

Original Architecture Assumptions

Changed Requirements

1. MCP Server (Original Design)

2. CLI Tool (New Reality)

Additional Driver: Fuzzy Search

Problem Statement

Decision Drivers

Alternatives Considered

Alternative 1: Pure In-Memory (Status Quo)

Alternative 2: SQLite Persistent Index (Proposed)

Alternative 3: External Search Index (Elasticsearch/MeiliSearch)

Alternative 4: File-Based Cache (JSON/Pickle)

Alternative 5: Hybrid (SQLite for CLI, In-Memory for MCP)

Pugh Matrix Summary

Decision

Key Design Points

Relationship to Existing ADRs

Consequences

Positive

Negative

Neutral

Migration Path

Implementation Notes

Related Issues

Discussion Questions

References

Labels

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions