Add Local Mode — on-device inference next to Sonnet/Opus#115
Draft
ShreyPatel4 wants to merge 18 commits into
Mirrors the BuddyTranscriptionProvider pattern so chat backends are interchangeable. CloudChatProvider wraps the untouched ClaudeAPI — no behavior change for Sonnet/Opus.
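The seam described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the protocol members (`displayName`, `requiresNetwork`, `reply(to:)`), the injected closures, and `pickProvider` are all assumptions made for the example; only the type names come from the commit message.

```swift
/// Hypothetical shape of the chat seam; the real protocol in the PR may differ.
protocol BuddyChatProvider {
    /// Name shown in the model picker (assumed member).
    var displayName: String { get }
    /// Whether this backend needs a network connection (assumed member).
    var requiresNetwork: Bool { get }
    /// Produce a reply for what the user said.
    func reply(to transcript: String) async throws -> String
}

/// Stand-in for the cloud backend: forwards to an injected closure,
/// which in the app would be the existing, untouched ClaudeAPI call.
struct CloudChatProvider: BuddyChatProvider {
    let displayName = "Sonnet"
    let requiresNetwork = true
    let send: (String) async throws -> String

    func reply(to transcript: String) async throws -> String {
        try await send(transcript)
    }
}

/// Stand-in for the on-device backend: forwards to an injected generator.
struct LocalChatProvider: BuddyChatProvider {
    let displayName = "Local"
    let requiresNetwork = false
    let generate: (String) async throws -> String

    func reply(to transcript: String) async throws -> String {
        try await generate(transcript)
    }
}

/// Callers only see the protocol, so backends are interchangeable.
func pickProvider(named name: String,
                  from providers: [any BuddyChatProvider]) -> (any BuddyChatProvider)? {
    providers.first { $0.displayName == name }
}
```

Because the voice loop would depend only on the protocol, swapping backends is a picker change rather than a code change.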
LocalChatProvider runs Llama-3.2-3B-Instruct-4bit via mlx-swift-lm, downloads once to Application Support, then loads from disk offline. AppleSpeechSynthesisClient mirrors the ElevenLabs client contract so the voice loop can run with networking off.
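A rough sketch of the "download once, then load from disk" behavior, under stated assumptions: the `Models` subfolder, the `.complete` marker file, and both function names are hypothetical, and the real download goes through mlx-swift-lm rather than an injected closure. In the app, `base` would presumably come from `FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)`.

```swift
import Foundation

/// Hypothetical helper: where a model's files live under a base directory.
func modelDirectory(base: URL, modelID: String) -> URL {
    base.appendingPathComponent("Models", isDirectory: true)
        .appendingPathComponent(modelID, isDirectory: true)
}

/// Returns the model directory, calling `download` only when the files are missing.
func ensureModel(base: URL,
                 modelID: String,
                 download: (URL) throws -> Void) throws -> URL {
    let dir = modelDirectory(base: base, modelID: modelID)
    let marker = dir.appendingPathComponent(".complete")

    // Already downloaded: load from disk, no network needed.
    if FileManager.default.fileExists(atPath: marker.path) {
        return dir
    }

    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    try download(dir)

    // Write the marker last so an interrupted download is retried on the next launch.
    try Data().write(to: marker)
    return dir
}
```

Sharing one such helper between the app and the benchmark is what keeps both pointed at the same cache directory.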
Same model, cache dir, and generation parameters as the app, so the README numbers are measured on the actual pipeline, not estimated.
Local in the picker downloads the model with visible progress, then routes voice responses through on-device inference. The contract: no screenshot captured, no pointing, no conversation content in analytics, Apple Speech for STT and TTS so the loop works offline. Cloud requests fail fast with a nudge toward Local when offline instead of hanging two minutes on waitsForConnectivity.
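The fail-fast rule can be illustrated as a pure routing decision. The enum names, the `route` function, and the message text are assumptions for the example; how the app actually detects connectivity is not shown in this commit message.

```swift
enum ChatBackend { case cloud, local }

enum RoutingDecision: Equatable {
    case run(ChatBackend)
    case failFast(message: String)
}

/// Decide up front instead of letting a cloud request wait for connectivity.
func route(_ backend: ChatBackend, isOnline: Bool) -> RoutingDecision {
    switch (backend, isOnline) {
    case (.local, _):
        return .run(.local)          // Local never needs the network
    case (.cloud, true):
        return .run(.cloud)
    case (.cloud, false):
        // Hypothetical wording for the nudge toward Local.
        return .failFast(message: "You're offline. Switch to Local to keep going.")
    }
}
```

Deciding before the request is issued is what avoids the long `waitsForConnectivity` wait: the user gets an answer or a nudge immediately.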
Turns the local-vs-cloud speed difference into something visible: a small chip near the cursor while the answer plays, with decode speed included when the local engine reports it.
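As a sketch of how such a chip label might be assembled: the function name and parameters are hypothetical, and the `backend · latency · speed` layout follows the `local · 0.6s · 58 tok/s` example in the PR description. Decode speed is optional because only the local engine reports it.

```swift
import Foundation

/// Builds the chip text; the speed segment is omitted when the engine reports none.
func speedChipLabel(backend: String,
                    firstTokenSeconds: Double,
                    tokensPerSecond: Double?) -> String {
    var parts = [backend, String(format: "%.1fs", firstTokenSeconds)]
    if let tps = tokensPerSecond {
        parts.append(String(format: "%.0f tok/s", tps))
    }
    return parts.joined(separator: " · ")
}
```

For example, `speedChipLabel(backend: "local", firstTokenSeconds: 0.6, tokensPerSecond: 58)` yields `local · 0.6s · 58 tok/s`.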
Benchmark table holds measured numbers from bench/LocalModeBench on the dev machine (M1 Pro 16GB): 0.6-0.7s first token, 54-60 tok/s, 3.0s model load from disk. Cloud column deliberately absent until a worker exists to measure against.
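To make the measured numbers concrete, here is a back-of-the-envelope estimate of how long a full answer takes. The formula (first-token latency plus tokens divided by decode speed) is a simplification, and the 100-token answer length is an assumed example, not a figure from the benchmark.

```swift
/// Rough wall-clock estimate for a streamed answer.
func estimatedAnswerSeconds(tokens: Int,
                            firstTokenSeconds: Double,
                            tokensPerSecond: Double) -> Double {
    precondition(tokensPerSecond > 0, "decode speed must be positive")
    return firstTokenSeconds + Double(tokens) / tokensPerSecond
}

// With 0.6 s to first token and 58 tok/s, a 100-token answer takes about 2.3 s.
let exampleSeconds = estimatedAnswerSeconds(tokens: 100,
                                            firstTokenSeconds: 0.6,
                                            tokensPerSecond: 58)
```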
A custom question prints its full answer; the default sweep keeps short previews.
Local VLM (Qwen2.5-VL-3B via MLXVLM) drives a guarded computer-use loop. Dry-run default, kill switch, offline-only enforcement, confirm gate before irreversible actions. Input monitoring stays on-device.
…d loop Qwen2.5-VL-3B (MLXVLM) emits one JSON action per screenshot; the controller guards and executes via CoreGraphics input synthesis. Rails: offline+Accessibility precondition, ESC kill switch, step + wall-clock caps, dry-run by default, danger gate behind a second opt-in. 'take over — <task>' in a transcript routes here.
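The rails listed above amount to a small decision function that runs before every action. The sketch below is illustrative only: all type names, the cap values (20 steps, 120 seconds), and the set of dangerous keys are assumptions, and the real controller executes through CoreGraphics rather than returning a verdict.

```swift
enum ProposedAction: Equatable {
    case click(x: Int, y: Int)
    case typeText(String)
    case key(String)
    case scroll(dy: Int)
}

struct TakeoverRails {
    var armed = false                 // false means dry run
    var riskyActionsAllowed = false   // the second opt-in
    var maxSteps = 20                 // assumed cap
    var maxSeconds: Double = 120      // assumed cap
    var dangerousKeys: Set<String> = ["delete", "backspace"]
}

enum Verdict: Equatable {
    case execute
    case logOnly            // dry run: record the action, do nothing
    case blocked(String)    // refused, with a reason
    case stop(String)       // end the whole loop
}

/// Checks run in order: kill switch, caps, danger gate, then dry run vs. armed.
func judge(_ action: ProposedAction,
           step: Int,
           elapsed: Double,
           killSwitchPressed: Bool,
           rails: TakeoverRails) -> Verdict {
    if killSwitchPressed { return .stop("kill switch") }
    if step >= rails.maxSteps { return .stop("step cap") }
    if elapsed >= rails.maxSeconds { return .stop("time cap") }
    if case .key(let name) = action,
       rails.dangerousKeys.contains(name.lowercased()),
       !rails.riskyActionsAllowed {
        return .blocked("risky key needs the second opt-in")
    }
    return rails.armed ? .execute : .logOnly
}
```

Keeping the checks in one pure function makes the ordering explicit: stop conditions win over everything, and the danger gate applies even in dry run.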
Arm/dry-run toggle, the risky-actions opt-in (only shown when armed), live step status, and a stop button. Says offline-only and that esc stops it.
Smoke test verified Qwen2.5-VL-3B grounds a button at (768,309) vs
true center (800,308) on a synthetic UI — but emits malformed JSON
('x:768,309'). Strict JSON rejects it; the regex fallback recovers
the right coordinate so a correct grounding in shaky syntax still
executes.
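A minimal sketch of the strict-then-fallback parse described above. The `ClickTarget` type, the function name, and the exact regex are assumptions; the PR's own fallback pattern is not shown here. The fallback simply takes the first `int, int` pair it can find.

```swift
import Foundation

struct ClickTarget: Equatable, Decodable {
    let x: Int
    let y: Int
}

func parseClickTarget(_ raw: String) -> ClickTarget? {
    // 1. Strict path: well-formed JSON with integer "x" and "y" keys.
    if let data = raw.data(using: .utf8),
       let strict = try? JSONDecoder().decode(ClickTarget.self, from: data) {
        return strict
    }

    // 2. Fallback: recover the first "int, int" pair anywhere in the text.
    let pattern = #"(\d+)\s*,\s*(\d+)"#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: raw, range: NSRange(raw.startIndex..., in: raw)),
          let xRange = Range(match.range(at: 1), in: raw),
          let yRange = Range(match.range(at: 2), in: raw),
          let x = Int(raw[xRange]),
          let y = Int(raw[yRange]) else {
        return nil
    }
    return ClickTarget(x: x, y: y)
}
```

With this sketch, the malformed `x:768,309` output still resolves to (768, 309), while text with no coordinate pair returns `nil` and the step can be skipped.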
2/3 synthetic-UI targets grounded correctly; the top-bar edge target missed, matching the ScreenSpot-Pro failure mode. Numbers + Holo1 upgrade path in TAKEOVER.md.
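For a sense of scale, the grounding offsets quoted in this PR can be expressed as pixel distances. The helper below is a plain Euclidean distance; the function name is hypothetical, and the pass/fail criterion used for the 2/3 figure is not stated here, so none is assumed.

```swift
/// Euclidean distance in pixels between a predicted point and the true center.
func pixelError(predicted: (x: Int, y: Int), truth: (x: Int, y: Int)) -> Double {
    let dx = Double(predicted.x - truth.x)
    let dy = Double(predicted.y - truth.y)
    return (dx * dx + dy * dy).squareRoot()
}

// Local VLM smoke test: (768, 309) against a true center of (800, 308), about 32 px off.
let localVLMError = pixelError(predicted: (768, 309), truth: (800, 308))

// Sonnet grounding from the PR description: (799, 307) against (800, 308), about 1.4 px off.
let sonnetError = pixelError(predicted: (799, 307), truth: (800, 308))
```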
Clicky screenshots the screen, asks Sonnet for the next control + narration, flies the overlay cursor to it, speaks, and advances — hands-off, pure pointing (no clicks, no Accessibility). Triggered by the panel button or 'teach me a beat'. Worker base URL is now overridable via UserDefaults for local wrangler dev; ATS allows localhost.
ClickyLocalDemo runs the real LocalChatProvider path (same model, params, offline TTS) in a window with typed input — a way to show Local Mode without mic/screen permissions.
Blockers: restore the user's cloud model after a guided tour (was silently downgrading Opus to Sonnet for the session); refuse takeover if the ESC kill switch can't be installed instead of running blind. Fixes: route fallback voices through the retained Apple Speech client (local NSSpeechSynthesizer was deallocating mid-sentence); gate bare destructive keys (delete/backspace) behind the risky opt-in; tighten guided-tour/takeover triggers so they don't hijack ordinary cloud queries; guard empty TTS utterances; don't treat repeated scrolls as stuck. Cleanup: share the model-cache dir helper, use DS color tokens in the new panel UI, document the takeover + guided-tour files.
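The first blocker (restoring the user's cloud model after a guided tour) is the kind of bug a `defer` guards against. A small sketch, with a hypothetical `ModelSelection` class and `runGuidedTour` function standing in for the app's real state:

```swift
/// Hypothetical holder for the currently selected model.
final class ModelSelection {
    var current: String
    init(current: String) { self.current = current }
}

/// Temporarily switches models for the tour and always restores the user's choice.
func runGuidedTour(selection: ModelSelection,
                   tourModel: String,
                   body: () throws -> Void) rethrows {
    let previous = selection.current
    selection.current = tourModel
    defer { selection.current = previous } // runs on success and on throw
    try body()
}
```

Because the restore sits in `defer`, an early return or a thrown error inside the tour can no longer leave the session on the tour's model.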
Clicky Local Mode — on-device inference on Apple Silicon
Flip the model picker to Local and clicky answers on your Mac. First token in ~0.6s, $0
per question, works on a plane, and your screen never leaves the machine — in Local Mode the
screenshot isn't captured at all.
This is a weekend remix (you said remix the repo instead of sending a resume — so here's the
feature I couldn't not build). I work on local inference on Apple Silicon and fairness
scheduling for shared inference servers, so this lane is home turf.
Why
trust in the architecture. Local Mode: no screenshot, no upload, nothing to leak. While
mapping the code I also found the analytics layer ships the full transcript + full response
to PostHog with no opt-out — so Local Mode gates those events too. A privacy mode that
phones home is a self-own.
the 14-year-old in Lagos doesn't need an API budget.
Apple Intelligence.
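One way the analytics gate mentioned above could work is to strip conversation content from events whenever Local Mode is on. This is an assumption about the mechanism: the PR may drop such events entirely instead, and the `AnalyticsEvent` type, the `sanitized` function, and the `transcript` / `response` property keys are hypothetical.

```swift
struct AnalyticsEvent: Equatable {
    let name: String
    var properties: [String: String]
}

/// In Local Mode, remove conversation content before the event leaves the device.
func sanitized(_ event: AnalyticsEvent,
               localMode: Bool,
               contentKeys: Set<String> = ["transcript", "response"]) -> AnalyticsEvent {
    guard localMode else { return event }   // cloud path is left unchanged
    var copy = event
    for key in contentKeys {
        copy.properties.removeValue(forKey: key)
    }
    return copy
}
```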
What's in this PR
Local Mode (the shippable feature):
BuddyChatProvider protocol: completes the seam BuddyTranscriptionProvider already established, for chat.
CloudChatProvider wraps the existing ClaudeAPI untouched, so Sonnet/Opus behave byte-for-byte as before.
LocalChatProvider: Llama-3.2-3B-Instruct-4bit via mlx-swift-lm. Downloads once (~1.8 GB) to Application Support with a progress bar, then loads from disk (offline) on later launches.
AVSpeechSynthesizer for TTS, so push-to-talk → answer → spoken reply works with wifi off.
[POINT] pointing disabled via a trimmed local prompt, transcript/response analytics gated.
local · 0.6s · 58 tok/s), and cloud requests now fail fast with a nudge to Local when offline (instead of a 120s waitsForConnectivity hang).
Measured on an M1 Pro / 16 GB (via bench/LocalModeBench: same model, params, cache dir as the app; real runs, no invented numbers):
Experimental (clearly fenced — bold ideas, honestly labeled):
Make a Logic beat): clicky screenshots → Sonnet grounds the next control → the blue overlay cursor flies to it and narrates, hands-off. Pure pointing (no clicks, no Accessibility). Sonnet grounding is pixel-accurate (verified: 799,307 vs true 800,308). Reliable + recordable.
take over — <task>): a local VLM (Qwen2.5-VL-3B) drives a guarded CGEvent loop: dry-run default, ESC kill switch, offline+Accessibility gate, step/time caps.
Honest status in TAKEOVER.md: the brain grounds in isolation (2/3 on a toy UI) but small VLMs misclick on dense pro UIs, and the executor/loop are unproven in a live GUI. Shipped as a prototype to show where this goes, not as a finished feature.
What I'd build next
→ cloud. The picker disappears; clicky just feels instant and cheap.
on-device, the half of clicky Local Mode can't do yet.
inference, one chatty user starves the rest; scheduling/fairness on the server is the moat
nobody sees coming. (My research lane.)
Cloud path is unchanged; CLAUDE.md updated in your format. Docs: RECON.md / DECISIONS.md / README-REMIX.md / DEMO_SCRIPT.md / TAKEOVER.md. MIT, like the original.