
Add Local Mode: on-device inference next to Sonnet/Opus #115

Draft
ShreyPatel4 wants to merge 18 commits into farzaa:main from ShreyPatel4:local-mode


@ShreyPatel4


Clicky Local Mode — on-device inference on Apple Silicon

Flip the model picker to Local and clicky answers on your Mac. First token in ~0.6s, $0
per question, works on a plane, and your screen never leaves the machine — in Local Mode the
screenshot isn't captured at all.

This is a weekend remix (you said remix the repo instead of sending a resume — so here's the
feature I couldn't not build). I work on local inference on Apple Silicon and fairness
scheduling for shared inference servers, so this lane is home turf.

Why

  • Privacy is structural, not a promise. An assistant that sees your screen has to earn
    trust in the architecture. Local Mode: no screenshot, no upload, nothing to leak. While
    mapping the code I also found the analytics layer ships the full transcript + full response
    to PostHog with no opt-out — so Local Mode gates those events too. A privacy mode that
    phones home is a self-own.
  • Cost. A 3B model on the GPU the user already owns makes the buddy free at the margin —
    the 14-year-old in Lagos doesn't need an API budget.
  • Latency / offline. No network hop. Sub-second first token is felt in a cursor buddy.
  • Platform defense. A clicky that's natively great on Apple Silicon is the counter to
    Apple Intelligence.

What's in this PR

Local Mode (the shippable feature):

  • BuddyChatProvider protocol — completes the seam BuddyTranscriptionProvider already
    established, for chat. CloudChatProvider wraps the existing ClaudeAPI untouched, so
    Sonnet/Opus behave byte-for-byte as before.
  • LocalChatProvider — Llama-3.2-3B-Instruct-4bit via mlx-swift-lm. Downloads once
    (~1.8 GB) to Application Support with a progress bar, then loads from disk (offline) on
    later launches.
  • Offline voice loop — Apple Speech for STT, AVSpeechSynthesizer for TTS, so push-to-talk →
    answer → spoken reply works with wifi off.
  • Strict behavioral contract — no screenshot captured, [POINT] pointing disabled via a
    trimmed local prompt, transcript/response analytics gated.
  • Latency badge near the cursor (local · 0.6s · 58 tok/s), and cloud requests now fail fast
    with a nudge to Local when offline (instead of a 120s waitsForConnectivity hang).
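
A rough sketch of what the chat seam could look like. Only the name `BuddyChatProvider` comes from this PR; the method signature, the `wantsScreenshot` flag, and the `EchoChatProvider` stub are assumptions made for illustration, not the actual code.

```swift
import Foundation

/// Hypothetical shape for the chat seam; the real protocol in the PR may differ.
protocol BuddyChatProvider {
    /// Name shown in the model picker (assumed).
    var displayName: String { get }
    /// Whether the provider wants a screenshot; a local provider would say no.
    var wantsScreenshot: Bool { get }
    /// Streams the reply as text chunks.
    func streamReply(to transcript: String) -> AsyncThrowingStream<String, Error>
}

/// Stand-in provider for illustration: echoes the transcript word by word.
struct EchoChatProvider: BuddyChatProvider {
    let displayName = "Echo (stub)"
    let wantsScreenshot = false

    func streamReply(to transcript: String) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            for word in transcript.split(separator: " ") {
                continuation.yield(String(word))
            }
            continuation.finish()
        }
    }
}

/// Call sites depend only on the protocol, so backends stay interchangeable.
func collectReply(from provider: any BuddyChatProvider,
                  transcript: String) async throws -> String {
    var chunks: [String] = []
    for try await chunk in provider.streamReply(to: transcript) {
        chunks.append(chunk)
    }
    return chunks.joined(separator: " ")
}
```

With a shape like this, a cloud wrapper and a local engine would both be passed to `collectReply` (or a streaming UI) without the caller knowing which one it got.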

Measured on an M1 Pro / 16 GB (via bench/LocalModeBench — same model, params, cache dir
as the app; real runs, no invented numbers):

| metric               | local       |
|----------------------|-------------|
| first token          | 0.6–0.7 s   |
| decode               | 54–60 tok/s |
| model load from disk | 3.0 s       |
| cost / question      | $0          |
| network              | none        |

Experimental (clearly fenced — bold ideas, honestly labeled):

  • Cloud guided tour (Make a Logic beat): clicky screenshots → Sonnet grounds the next
    control → the blue overlay cursor flies to it and narrates, hands-off. Pure pointing (no
    clicks, no Accessibility). Sonnet grounding is pixel-accurate (verified: 799,307 vs true
    800,308). Reliable + recordable.
  • Offline takeover (take over — <task>): a local VLM (Qwen2.5-VL-3B) drives a guarded
    CGEvent loop — dry-run default, ESC kill switch, offline+Accessibility gate, step/time caps.
    Honest status in TAKEOVER.md: the brain grounds in isolation (2/3 on a toy UI) but small
    VLMs misclick on dense pro UIs, and the executor/loop are unproven in a live GUI. Shipped as
    a prototype to show where this goes, not as a finished feature.

What I'd build next

  1. Auto-routing brain — short query, no screen → local; screen question or hard reasoning
    → cloud. The picker disappears; clicky just feels instant and cheap.
  2. Fully-local screen understanding — a small VLM answering screenshot questions
    on-device, the half of clicky Local Mode can't do yet.
  3. Fleet-scale inference fairness — when thousands of clicky buddies share hosted
    inference, one chatty user starves the rest; scheduling/fairness on the server is the moat
    nobody sees coming. (My research lane.)
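
To make item 1 concrete, here is a minimal routing sketch. Nothing in it exists in the PR: `QueryContext`, `ChatRoute`, and the word-count threshold are hypothetical, and word count is only a crude stand-in for "hard reasoning".

```swift
import Foundation

enum ChatRoute: Equatable { case local, cloud }

/// Hypothetical inputs to the router; none of these names come from the PR.
struct QueryContext {
    var transcript: String
    var referencesScreen: Bool   // e.g. "what does this button do?"
    var isOnline: Bool
}

/// Illustrative heuristic: offline always stays local, screen questions and
/// long prompts go to cloud, short text-only questions stay local.
func route(_ query: QueryContext, maxLocalWords: Int = 40) -> ChatRoute {
    guard query.isOnline else { return .local }
    if query.referencesScreen { return .cloud }
    let wordCount = query.transcript.split(whereSeparator: { $0.isWhitespace }).count
    return wordCount <= maxLocalWords ? .local : .cloud
}
```

A real router would probably want a better difficulty signal than length, but the shape (one pure function in front of the provider seam) is the point.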

Cloud path is unchanged; CLAUDE.md updated in your format. Docs: RECON.md / DECISIONS.md
/ README-REMIX.md / DEMO_SCRIPT.md / TAKEOVER.md. MIT, like the original.

Mirrors the BuddyTranscriptionProvider pattern so chat backends are
interchangeable. CloudChatProvider wraps the untouched ClaudeAPI, so
there is no behavior change for Sonnet/Opus.

LocalChatProvider runs Llama-3.2-3B-Instruct-4bit via mlx-swift-lm,
downloads once to Application Support, then loads from disk offline.
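
The download-once behavior boils down to a cache check before loading. A sketch under assumptions: the `Clicky/Models` folder layout and the use of `config.json` as the "model is present" marker are guesses for illustration, and the actual download and load go through mlx-swift-lm, which is not shown here.

```swift
import Foundation

enum ModelSource: Equatable {
    case disk(URL)          // already cached, load offline
    case download(to: URL)  // first launch, fetch into this directory
}

/// Cache root under Application Support; the folder names are hypothetical.
func modelCacheDirectory(appFolder: String = "Clicky",
                         fileManager: FileManager = .default) throws -> URL {
    let base = try fileManager.url(for: .applicationSupportDirectory,
                                   in: .userDomainMask,
                                   appropriateFor: nil,
                                   create: true)
    return base
        .appendingPathComponent(appFolder, isDirectory: true)
        .appendingPathComponent("Models", isDirectory: true)
}

/// Decides whether a model can be loaded from disk or must be downloaded.
/// Treating config.json as the completeness marker is an assumption.
func resolveModel(named name: String,
                  in cacheDirectory: URL,
                  fileManager: FileManager = .default) -> ModelSource {
    let modelDirectory = cacheDirectory.appendingPathComponent(name, isDirectory: true)
    let marker = modelDirectory.appendingPathComponent("config.json", isDirectory: false)
    return fileManager.fileExists(atPath: marker.path)
        ? .disk(modelDirectory)
        : .download(to: modelDirectory)
}
```

A more careful version would also verify that the weights finished downloading, since a marker file alone does not prove that.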

AppleSpeechSynthesisClient mirrors the ElevenLabs client contract so
the voice loop can run with networking off.

Same model, cache dir, and generation parameters as the app, so the
README numbers are measured on the actual pipeline, not estimated.

Local in the picker downloads the model with visible progress, then
routes voice responses through on-device inference. The contract:
no screenshot captured, no pointing, no conversation content in
analytics, Apple Speech for STT and TTS so the loop works offline.
Cloud requests fail fast with a nudge toward Local when offline
instead of hanging two minutes on waitsForConnectivity.
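
The analytics part of that contract can be as small as a filter in front of the tracker. A sketch, assuming events carry string properties and that conversation content is keyed `transcript` and `response`; both key names are guesses, not taken from this PR or from the PostHog SDK.

```swift
import Foundation

/// Hypothetical property keys that carry conversation content.
let conversationKeys: Set<String> = ["transcript", "response"]

/// Drops conversation content from an event's properties while Local Mode is on.
/// Other properties (for example a latency number) pass through untouched.
func scrubbedProperties(_ properties: [String: String],
                        isLocalMode: Bool) -> [String: String] {
    guard isLocalMode else { return properties }
    return properties.filter { !conversationKeys.contains($0.key) }
}
```

Filtering at a single choke point means a new event added later cannot bypass the gate by accident, as long as it goes through the same function.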

Turns the local-vs-cloud speed difference into something visible:
a small chip near the cursor while the answer plays, with decode
speed included when the local engine reports it.
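
The chip text is simple to build. A sketch that reproduces the `local · 0.6s · 58 tok/s` format from the PR description; the function name and parameters are made up for illustration, and decode speed is optional because only the local engine reports it.

```swift
import Foundation

/// Builds the chip label. `tokensPerSecond` is nil when the engine does not report it.
func latencyBadge(source: String,
                  firstTokenLatency: TimeInterval,
                  tokensPerSecond: Double? = nil) -> String {
    var parts = [source, String(format: "%.1fs", firstTokenLatency)]
    if let rate = tokensPerSecond, rate > 0 {
        parts.append(String(format: "%.0f tok/s", rate))
    }
    return parts.joined(separator: " · ")
}

// Decode speed itself is tokens produced divided by seconds spent decoding.
func decodeRate(tokens: Int, seconds: TimeInterval) -> Double? {
    guard tokens > 0, seconds > 0 else { return nil }
    return Double(tokens) / seconds
}
```

For example, 116 tokens decoded in 2.0 seconds gives a rate of 58, which renders as `local · 0.6s · 58 tok/s` for a 0.6 second first token.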

Benchmark table holds measured numbers from bench/LocalModeBench on
the dev machine (M1 Pro 16GB): 0.6-0.7s first token, 54-60 tok/s,
3.0s model load from disk. Cloud column deliberately absent until a
worker exists to measure against.

A custom question prints its full answer; the default sweep keeps
short previews.

Local VLM (Qwen2.5-VL-3B via MLXVLM) drives a guarded computer-use
loop. Dry-run default, kill switch, offline-only enforcement, confirm
gate before irreversible actions. Input monitoring stays on-device.
…d loop

Qwen2.5-VL-3B (MLXVLM) emits one JSON action per screenshot; the
controller guards and executes via CoreGraphics input synthesis.
Rails: offline+Accessibility precondition, ESC kill switch, step +
wall-clock caps, dry-run by default, danger gate behind a second
opt-in. 'take over — <task>' in a transcript routes here.
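
The rails read naturally as one pure check that runs before every step. A sketch only: the type names, the cap values (20 steps, 120 seconds), and the list of dangerous keys are placeholders, and the real controller in this PR synthesizes input through CoreGraphics, which is left out here.

```swift
import Foundation

enum ProposedAction: Equatable {
    case click(x: Int, y: Int)
    case typeText(String)
    case key(String)            // e.g. "return", "delete"
}

enum Verdict: Equatable {
    case execute                // armed and allowed
    case logOnly                // dry run: record the action, touch nothing
    case refuse(String)         // a rail tripped
}

struct TakeoverRails {
    var armed = false                    // dry run is the default
    var riskyActionsAllowed = false      // the second opt-in
    var maxSteps = 20                    // placeholder cap
    var maxSeconds: TimeInterval = 120   // placeholder cap
    var dangerousKeys: Set<String> = ["delete", "backspace"]

    func check(_ action: ProposedAction, step: Int, elapsed: TimeInterval,
               isOffline: Bool, hasAccessibility: Bool,
               killSwitchInstalled: Bool) -> Verdict {
        guard isOffline, hasAccessibility else {
            return .refuse("offline + Accessibility required")
        }
        guard killSwitchInstalled else { return .refuse("ESC kill switch not installed") }
        guard step < maxSteps else { return .refuse("step cap reached") }
        guard elapsed < maxSeconds else { return .refuse("time cap reached") }
        if case .key(let name) = action,
           dangerousKeys.contains(name), !riskyActionsAllowed {
            return .refuse("risky key needs the second opt-in")
        }
        return armed ? .execute : .logOnly
    }
}
```

Keeping the check free of side effects makes each rail testable without a live GUI, which matters given that the executor itself is still unproven.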

Arm/dry-run toggle, the risky-actions opt-in (only shown when armed),
live step status, and a stop button. The panel states that takeover
is offline-only and that ESC stops it.

Smoke test verified Qwen2.5-VL-3B grounds a button at (768,309) vs
true center (800,308) on a synthetic UI, but it emits malformed JSON
('x:768,309'). Strict JSON rejects it; the regex fallback recovers
the right coordinate so a correct grounding in shaky syntax still
executes.
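
A sketch of the strict-then-fallback parse described here. The strict shape `{"x": ..., "y": ...}` and the regex are assumptions for illustration; the PR's actual action schema and pattern may differ.

```swift
import Foundation

struct GroundedPoint: Equatable {
    let x: Int
    let y: Int
}

/// Strict shape assumed for illustration: {"x": 768, "y": 309}.
struct StrictAction: Decodable {
    let x: Int
    let y: Int
}

func parsePoint(from modelOutput: String) -> GroundedPoint? {
    // 1. Strict path: well-formed JSON with integer x and y.
    if let data = modelOutput.data(using: .utf8),
       let action = try? JSONDecoder().decode(StrictAction.self, from: data) {
        return GroundedPoint(x: action.x, y: action.y)
    }
    // 2. Fallback: the first "<int>,<int>" pair anywhere in the text.
    guard let regex = try? NSRegularExpression(pattern: #"(\d+)\s*,\s*(\d+)"#) else {
        return nil
    }
    let fullRange = NSRange(modelOutput.startIndex..., in: modelOutput)
    guard let match = regex.firstMatch(in: modelOutput, range: fullRange),
          let xRange = Range(match.range(at: 1), in: modelOutput),
          let yRange = Range(match.range(at: 2), in: modelOutput),
          let x = Int(modelOutput[xRange]),
          let y = Int(modelOutput[yRange]) else {
        return nil
    }
    return GroundedPoint(x: x, y: y)
}
```

The fallback is deliberately narrow: it recovers a coordinate pair and nothing else, so the guard rails still decide whether the resulting action may run.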

2/3 synthetic-UI targets grounded correctly; the top-bar edge target
missed, matching the ScreenSpot-Pro failure mode. Numbers + Holo1
upgrade path in TAKEOVER.md.

Clicky screenshots the screen, asks Sonnet for the next control +
narration, flies the overlay cursor to it, speaks, and advances:
hands-off, pure pointing (no clicks, no Accessibility). Triggered by
the panel button or 'teach me a beat'. Worker base URL is now
overridable via UserDefaults for local wrangler dev; ATS allows
localhost.
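
The base URL override could look roughly like this. The defaults key `workerBaseURLOverride` and the function name are hypothetical; the PR does not show the key it uses.

```swift
import Foundation

/// Returns the override stored in UserDefaults when it parses as a URL with a
/// scheme, otherwise the production URL. The key name is a placeholder.
func workerBaseURL(defaults: UserDefaults = .standard,
                   key: String = "workerBaseURLOverride",
                   fallback: URL) -> URL {
    if let raw = defaults.string(forKey: key),
       let url = URL(string: raw),
       url.scheme != nil {
        return url
    }
    return fallback
}
```

Pointing the app at a local `wrangler dev` instance is then a matter of writing something like `http://localhost:8787` under that key, and removing the key restores the production worker.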

ClickyLocalDemo runs the real LocalChatProvider path (same model,
params, offline TTS) in a window with typed input, a way to show
Local Mode without mic/screen permissions.

Blockers: restore the user's cloud model after a guided tour (was
silently downgrading Opus to Sonnet for the session); refuse takeover
if the ESC kill switch can't be installed instead of running blind.

Fixes: route fallback voices through the retained Apple Speech client
(local NSSpeechSynthesizer was deallocating mid-sentence); gate bare
destructive keys (delete/backspace) behind the risky opt-in; tighten
guided-tour/takeover triggers so they don't hijack ordinary cloud
queries; guard empty TTS utterances; don't treat repeated scrolls as
stuck.

Cleanup: share the model-cache dir helper, use DS color tokens
in the new panel UI, document the takeover + guided-tour files.
