Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
6522eca
add recon and decisions docs for local mode
ShreyPatel4 Jun 12, 2026
96cb090
add chat provider protocol and route responses through it
ShreyPatel4 Jun 12, 2026
4505b9b
add mlx local inference engine and on-device tts client
ShreyPatel4 Jun 12, 2026
f7d72f3
add local inference benchmark harness
ShreyPatel4 Jun 12, 2026
edb4e68
wire local mode end to end
ShreyPatel4 Jun 12, 2026
338db33
show provider and first-token latency badge during responses
ShreyPatel4 Jun 12, 2026
467b61e
add demo script, remix readme, and update agent docs
ShreyPatel4 Jun 12, 2026
b689861
pin resolved package versions for local mode deps
ShreyPatel4 Jun 12, 2026
ed2ddef
let the bench take a one-off question for demos
ShreyPatel4 Jun 12, 2026
cce853d
design offline autonomous takeover mode (spike)
ShreyPatel4 Jun 12, 2026
d19c16b
add offline takeover engine: vision agent + CGEvent executor + guarde…
ShreyPatel4 Jun 12, 2026
b796e21
add takeover controls to the panel
ShreyPatel4 Jun 12, 2026
0c013ae
harden action parser with regex salvage; add grounding smoke test
ShreyPatel4 Jun 12, 2026
1f24686
tolerate unquoted keys in action salvage; record measured grounding
ShreyPatel4 Jun 12, 2026
d562e7e
mark executor + loop + kill switch as unproven until run in the app
ShreyPatel4 Jun 13, 2026
54e9a69
add cloud guided-tour mode (e.g. Logic Pro beats)
ShreyPatel4 Jun 14, 2026
c15f580
add permission-free local-mode demo window + ignore wrangler cache
ShreyPatel4 Jun 15, 2026
5485844
polish: address pre-PR review
ShreyPatel4 Jun 15, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,8 +1,15 @@
worker/node_modules/
worker/.dev.vars
worker/.wrangler/
.DS_Store
*.xcuserstate
build/
releases/
.claude/
coding-plans/

# local model weights must never land in the repo
models/
*.safetensors
*.gguf
DerivedData/
29 changes: 24 additions & 5 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,9 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **App Type**: Menu bar-only (`LSUIElement=true`), no dock icon or main window
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **AI Chat**: Pluggable chat-provider layer (`BuddyChatProvider`) — Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming, or **Local Mode**: Llama-3.2-3B-Instruct-4bit on-device via MLX (`mlx-swift-lm`)
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **Text-to-Speech**: Pluggable TTS layer (`BuddyTextToSpeechClient`) — ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy, or `AVSpeechSynthesizer` on-device in Local Mode
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
- **Element Pointing**: Claude embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a bezier arc to the target.
Expand Down Expand Up @@ -48,15 +48,21 @@ Worker vars: `ELEVENLABS_VOICE_ID`

**Transient Cursor Mode**: When "Show Clicky" is off, pressing the hotkey fades in the cursor overlay for the duration of the interaction (recording → response → TTS → optional pointing), then fades it out automatically after 1 second of inactivity.

**Guided tour (experimental, cloud)**: A panel button ("Make a Logic beat") or the phrase "teach me a beat" starts an autonomous cloud loop — Clicky screenshots the screen, asks Sonnet for the next control + narration, flies the *overlay* cursor to it (no real clicks, no Accessibility), narrates via Apple Speech, and advances. `CompanionManager.startGuidedLogicTour` + `guidedTourSection` in the panel. Reliable because Sonnet grounding is accurate; pure pointing so it can't damage anything.

**Takeover (experimental, offline)**: The phrase "take over — <task>" (offline only) hands the cursor + keyboard to a guarded local-VLM loop. See `TAKEOVER.md` — the brain grounds in isolation but small VLMs misclick on dense UIs; dry-run is the default and the executor/loop are unproven in a live GUI.

**Local Mode**: A third model-picker option runs the whole answer loop on-device: Apple Speech for transcription, Llama-3.2-3B-Instruct-4bit via MLX for the response, `AVSpeechSynthesizer` for the voice. The behavioral contract is strict — the screenshot is *never captured* (not captured-and-dropped), `[POINT:...]` pointing is disabled via a trimmed local system prompt, and the transcript/response analytics events do not fire (a content-free `local_mode_selected` event is the only signal). The model downloads once (~1.8 GB) to `~/Library/Application Support/Clicky/models/huggingface` with progress in the panel, then loads from disk on later launches so Local Mode works fully offline. Cloud requests fail fast with a spoken nudge toward Local when the network is down (the URLSession otherwise waits 120s for connectivity).

## Key Files

| File | Lines | Purpose |
|------|-------|---------|
| `leanring_buddyApp.swift` | ~89 | Menu bar app entry point. Uses `@NSApplicationDelegateAdaptor` with `CompanionAppDelegate` which creates `MenuBarPanelManager` and starts `CompanionManager`. No main window — the app lives entirely in the status bar. |
| `CompanionManager.swift` | ~1026 | Central state machine. Owns dictation, shortcut monitoring, screen capture, Claude API, ElevenLabs TTS, and overlay management. Tracks voice state (idle/listening/processing/responding), conversation history, model selection, and cursor visibility. Coordinates the full push-to-talk → screenshot → Claude → TTS → pointing pipeline. |
| `CompanionManager.swift` | ~1233 | Central state machine. Owns dictation, shortcut monitoring, screen capture, the chat providers (cloud + local), both TTS clients, and overlay management. Tracks voice state (idle/listening/processing/responding), conversation history, model selection, network reachability, and cursor visibility. Coordinates the full push-to-talk → (screenshot)AI → TTS → pointing pipeline, with the Local Mode branches (no capture, trimmed prompt, gated analytics). |
| `MenuBarPanelManager.swift` | ~243 | NSStatusItem + custom NSPanel lifecycle. Creates the menu bar icon, manages the floating companion panel (show/hide/position), installs click-outside-to-dismiss monitor. |
| `CompanionPanelView.swift` | ~761 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk instructions, model picker (Sonnet/Opus), permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. |
| `OverlayWindow.swift` | ~881 | Full-screen transparent overlay hosting the blue cursor, response text, waveform, and spinner. Handles cursor animation, element pointing with bezier arcs, multi-monitor coordinate mapping, and fade-out transitions. |
| `CompanionPanelView.swift` | ~817 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk instructions, model picker (Sonnet/Opus/Local) with the local model download-progress row and privacy notice, permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. |
| `OverlayWindow.swift` | ~905 | Full-screen transparent overlay hosting the blue cursor, waveform, spinner, and the latency badge (provider + first-token time, shown while a response plays). Handles cursor animation, element pointing with bezier arcs, multi-monitor coordinate mapping, and fade-out transitions. |
| `CompanionResponseOverlay.swift` | ~217 | SwiftUI view for the response text bubble and waveform displayed next to the cursor in the overlay. |
| `CompanionScreenCaptureUtility.swift` | ~132 | Multi-monitor screenshot capture using ScreenCaptureKit. Returns labeled image data for each connected display. |
| `BuddyDictationManager.swift` | ~866 | Push-to-talk voice pipeline. Handles microphone capture via `AVAudioEngine`, provider-aware permission checks, keyboard/button dictation sessions, transcript finalization, shortcut parsing, contextual keyterms, and live audio-level reporting for waveform feedback. |
Expand All @@ -67,6 +73,15 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| `BuddyAudioConversionSupport.swift` | ~108 | Audio conversion helpers. Converts live mic buffers to PCM16 mono audio and builds WAV payloads for upload-based providers. |
| `GlobalPushToTalkShortcutMonitor.swift` | ~132 | System-wide push-to-talk monitor. Owns the listen-only `CGEvent` tap and publishes press/release transitions. |
| `ClaudeAPI.swift` | ~291 | Claude vision API client with streaming (SSE) and non-streaming modes. TLS warmup optimization, image MIME detection, conversation history support. |
| `BuddyChatProvider.swift` | ~112 | Protocol surface for chat backends, mirroring the transcription-provider pattern. `CloudChatProvider` wraps `ClaudeAPI` unchanged and measures first-token latency for the badge. |
| `LocalChatProvider.swift` | ~253 | On-device chat provider for Local Mode. Runs Llama-3.2-3B-Instruct-4bit via MLX (`mlx-swift-lm`); downloads once to Application Support with progress, then loads from disk (offline) on later launches. Publishes `modelReadiness` for the panel UI. |
| `BuddyTextToSpeechClient.swift` | ~28 | Protocol surface for TTS backends (`speakText` returns at playback start, `isPlaying`, synchronous `stopPlayback`). ElevenLabs and Apple Speech both conform. |
| `AppleSpeechSynthesisClient.swift` | ~73 | On-device TTS for Local Mode via `AVSpeechSynthesizer`. Picks the best installed en-US voice (premium > enhanced > compact) and honors the ElevenLabs client's timing contract. |
| `ClickyModelCache.swift` | ~30 | Single source of truth for the on-device model directory (Application Support, not Caches). Shared by the local chat provider and takeover vision agent. |
| `TakeoverController.swift` | ~300 | **Experimental.** Offline autonomous "takeover" loop (e.g. drive an app). Screenshot → local VLM → one guarded action → execute → repeat. Rails: dry-run default, ESC kill switch, offline+Accessibility gate, danger-action opt-in, step/time caps. See `TAKEOVER.md` for the honest verification state. |
| `TakeoverVisionAgent.swift` | ~165 | **Experimental.** The takeover brain — Qwen2.5-VL-3B (MLXVLM) emitting one JSON action per screenshot. |
| `TakeoverAction.swift` | ~205 | **Experimental.** The 6-verb takeover action vocabulary + a tolerant JSON/regex parser for shaky small-VLM output. |
| `ComputerUseActionExecutor.swift` | ~245 | **Experimental.** CoreGraphics input synthesis (move/click/type/scroll/key) + the ESC kill-switch tap. Needs only the existing Accessibility grant. |
| `OpenAIAPI.swift` | ~142 | OpenAI GPT vision API client. |
| `ElevenLabsTTSClient.swift` | ~81 | ElevenLabs TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. |
| `ElementLocationDetector.swift` | ~335 | Detects UI element locations in screenshots for cursor pointing. |
Expand All @@ -88,6 +103,10 @@ open leanring-buddy.xcodeproj
# deprecated onChange warning in OverlayWindow.swift. Do NOT attempt to fix these.
```

Local Mode needs Xcode 16.3+ (mlx-swift-lm is a Swift 6.1-tools package and uses Swift
macros — Xcode will ask once to "Trust & Enable" the `MLXHuggingFaceMacros` plugin; say yes).
MLX's Metal shaders can't be compiled by the SwiftPM CLI, so builds must go through Xcode.

**Do NOT run `xcodebuild` from the terminal** — it invalidates TCC (Transparency, Consent, and Control) permissions and the app will need to re-request screen recording, accessibility, etc.

## Cloudflare Worker
Expand Down
133 changes: 133 additions & 0 deletions DECISIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
# DECISIONS — Local Mode

Each entry: the call, and why. Deviations from the original plan are marked **[deviation]**.

## Runtime & model

- **MLX via `ml-explore/mlx-swift-lm` pinned `.upToNextMajor(from: "3.31.3")`** plus
`huggingface/swift-huggingface` (≥0.9.0) and `huggingface/swift-transformers` (≥1.3.0).
The LLM libraries moved out of `mlx-swift-examples` in April 2026; this is the maintained
home. Let mlx-swift-lm own the mlx-swift version (it pins `.upToNextMinor(from: 0.31.4)`).
- **Model: `mlx-community/Llama-3.2-3B-Instruct-4bit` (~1.8 GB)** via
`LLMRegistry.llama3_2_3B_4bit`. It's a registry constant (path of least resistance), no
thinking-token surprises (Qwen3-family templates emit them), fits the ≤2.5 GB budget, and
the name reads well in a demo. One model, not a menu.
- **Load path: the `#huggingFaceLoadModelContainer` macro** — the README-canonical route. It
pulls swift-syntax (slower clean builds); the alternative is ~30 lines of hand-written
Downloader/TokenizerLoader adapters. Sugar wins: less custom plumbing in the diff, and clean
builds are a one-time tax.
- **Model storage: `~/Library/Application Support/Clicky/models/huggingface`**, pinned with
`HubCache(location: .fixed(directory:))`. Application Support, not Caches — macOS purges
Caches under disk pressure and silently re-downloading 1.8 GB is a bad surprise. Never in
the repo; `.gitignore` extended.

## Architecture

- **`BuddyChatProvider` protocol + `CloudChatProvider` + `LocalChatProvider`**, mirroring the
existing `BuddyTranscriptionProvider` pattern (Farza's own seam, completed for chat).
`CloudChatProvider` wraps the untouched `ClaudeAPI` — cloud behavior stays byte-for-byte.
Routing branches in `CompanionManager` at the request site, because `ClaudeAPI`'s proxy URL
is `private let` on a single lazy instance and the picker pipeline is plain Strings.
- **"Local" is a third `modelOptionButton` with sentinel modelID `local`**, stored in the same
`selectedClaudeModel` UserDefaults key. Hardening: the stored value is validated on load and
unknown values fall back to Sonnet, so a stale `local` can never reach the Anthropic API
(which would 404).
- **The local engine lives in `LocalChatProvider`** (owns ModelContainer + download state),
exposed to SwiftUI through `@Published` state on CompanionManager like everything else. MVVM
graph unchanged.

## Behavior contract (Tier 1 honesty)

- **No screenshot in Local Mode** — capture is skipped entirely, not captured-and-dropped.
`ClaudeAPI`-style empty-images requests already degrade cleanly; pointing needs a capture to
map pixels→points, so it no-ops by construction as well as by prompt.
- **`[POINT:...]` disabled in Local Mode.** A 3B model guessing screen coordinates it has never
seen is a clown show. The local system prompt bans the tag; the end-anchored parse regex
would choke on malformed emissions anyway and TTS would read the tag aloud.
- **Short local system prompt** (~10 lines): keeps the clicky voice rules (lowercase, 1-2
sentences, TTS-friendly), drops all screen/pointing/multi-monitor material — 4.3k chars of
vision instructions wasted on a text-only 3B model otherwise.
- **[deviation] Analytics content events are gated in Local Mode** — promoted from Tier 2 to
Tier 1 after recon found PostHog uploads the *full transcript and full response text* with no
opt-out. "Your screen stays on your Mac" while words go to a US analytics cloud is a
self-own. Local Mode sends a single content-free `local_mode_selected` event; transcript and
response events don't fire.
- **[deviation] Local Mode gets its own spoken error line** instead of the global "I'm all out
of credits, please DM Farza" fallback — that line is a lie when the actual failure is local
inference. Kept Farza-casual.
- **[deviation] The "screen stays on your Mac" notice lives in the menu-bar panel**, under the
picker, shown whenever Local is selected — not a once-per-session overlay toast. Recon: the
overlay fades out aggressively (transient mode) and has no stable text surface; the panel is
where the user makes the choice, so the disclosure sits at the decision point.

## Voice loop (offline)

- **TTS: `AVSpeechSynthesizer` client implementing the same surface as ElevenLabsTTSClient**
(`speakText` returns at playback start, `isPlaying` true until didFinish, synchronous
`stopPlayback`) — the transient-cursor hide loop polls `isPlaying` every 200ms and hangs
forever if it never flips. Robotic next to ElevenLabs; it's the *offline* voice and the
tradeoff is the point.
- **STT: Apple Speech provider, switched at runtime.** The factory resolves once at init today;
the refactor re-resolves the provider at session start (the one seam where it's used) based
on the selected mode. On-device recognition is verified via `supportsOnDeviceRecognition`,
not assumed — if unsupported, the UI says so rather than silently sending audio to Apple.
First switch to Local will trigger the speech-recognition permission prompt (usage string
already shipped in Info.plist).

## Streaming & UX

- **[deviation] No streamed-text rendering in Tier 1.** The brief assumed the overlay renders
tokens; recon shows the app discards them (no-op `onTextChunk`, dead
`CompanionResponseOverlay`). Clicky's medium is voice. Local Mode matches the existing
pipeline (spinner → spoken reply) and the latency badge (Tier 2) makes the speed difference
visible. Reviving 217 lines of untested dead overlay code that resizes an NSPanel per token
at 40 tok/s is a drive-by refactor with demo risk — declined.
- **First-token latency is measured in the provider layer** (wrap the first `onTextChunk`
invocation) — no changes inside `ClaudeAPI`.
- **max_tokens parity: local generation capped at 1024** to match the cloud path
(`ClaudeAPI.swift:144`); KV growth bounded via `GenerateParameters`.

## Out of scope (explicit)

- Local vision (Tier 3 gate: only after demo video is recordable; cut if TTFT > ~6s).
- Auto-routing local/cloud (Tier 3, manual picker must be solid first).
- Fixing the worker-URL placeholders, the unauthenticated worker, the dead
`ElementLocationDetector`, the camera entitlement, or any known warning — not our diff.
- Streaming TTS, sentence-chunked TTS — escape hatch only if full-response TTS feels bad.

## Decisions made during implementation

- **[deviation] Offline fast-fail for cloud requests.** ClaudeAPI's URLSession sets
`waitsForConnectivity`, so with wifi off a cloud request spins for up to 120 seconds —
unusable and undemoable. CompanionManager now checks an `NWPathMonitor` before cloud
requests and speaks the "flip me to local" nudge immediately. This changes cloud behavior
only in the no-network case (120s hang → instant honest answer); `ClaudeAPI` itself is
untouched.
- **[deviation] Benchmark harness committed at `bench/LocalModeBench`.** "Real measurements
only" needs a reproducible source: same model, cache directory, and generation parameters
as the app. The README table comes from its output on this machine.
- **Model-not-ready voice line.** Push-to-talk in Local Mode before the download/load
finishes speaks "still warming up my local brain…" instead of holding the spinner hostage
to a 1.8 GB download.
- **Latency badge lives in the cursor overlay**, above the triangle, only during
`.responding` — the menu-bar panel is auto-dismissed when push-to-talk starts, so it can't
host a badge the demo can see. Muted chip (surface fill, 9pt), not a second blue bubble.
- **Offline relaunch path is load-from-directory, not hub-cache-hit.** After the first
download the resolved model directory is persisted; later launches call the
`loadContainer(from: directory)` overload with zero network involvement, rather than
trusting the hub client's cache lookup to behave offline.
- **Stray `[POINT:...]` tags from the local model are stripped, not rendered.** The existing
end-anchored parse runs in both modes; in Local Mode the coordinate is unusable by
construction (no screen capture to map pixels with), so the tag is removed from the spoken
text and nothing flies.
- **CLI builds use `-skipMacroValidation`** (mlx-swift-lm ships Swift macros); in the Xcode
GUI this is the one-time "Trust & Enable" prompt.

## Process

- **Terminal `xcodebuild` is used for compile verification only while this machine has zero
TCC grants for clicky** (fresh clone, never run). Farza's rule exists because terminal
builds invalidate *granted* permissions; there are none yet. The moment the app runs from
Xcode and permissions are granted, terminal builds stop.
- Cloud side is blocked on a deployed worker URL (placeholders in repo). Local-first per the
escape hatch; side-by-side demo shots happen once the worker exists.
Loading