Add Local Mode — on-device inference next to Sonnet/Opus by ShreyPatel4 · Pull Request #115 · farzaa/clicky

ShreyPatel4 · 2026-06-15T17:51:46Z

Clicky Local Mode — on-device inference on Apple Silicon

Flip the model picker to Local and clicky answers on your Mac. First token in ~0.6s, $0
per question, works on a plane, and your screen never leaves the machine — in Local Mode the
screenshot isn't captured at all.

This is a weekend remix (you said remix the repo instead of sending a resume — so here's the
feature I couldn't not build). I work on local inference on Apple Silicon and fairness
scheduling for shared inference servers, so this lane is home turf.

Why

Privacy is structural, not a promise. An assistant that sees your screen has to earn
trust in the architecture. Local Mode: no screenshot, no upload, nothing to leak. While
mapping the code I also found the analytics layer ships the full transcript + full response
to PostHog with no opt-out — so Local Mode gates those events too. A privacy mode that
phones home is a self-own.
Cost. A 3B model on the GPU the user already owns makes the buddy free at the margin —
the 14-year-old in Lagos doesn't need an API budget.
Latency / offline. No network hop. Sub-second first token is felt in a cursor buddy.
Platform defense. A clicky that's natively great on Apple Silicon is the counter to
Apple Intelligence.

What's in this PR

Local Mode (the shippable feature):

BuddyChatProvider protocol — completes the seam BuddyTranscriptionProvider already
established, for chat. CloudChatProvider wraps the existing ClaudeAPI untouched, so
Sonnet/Opus behave byte-for-byte as before.
LocalChatProvider — Llama-3.2-3B-Instruct-4bit via mlx-swift-lm. Downloads once
(~1.8 GB) to Application Support with a progress bar, then loads from disk (offline) on
later launches.
Offline voice loop — Apple Speech for STT, AVSpeechSynthesizer for TTS, so push-to-talk →
answer → spoken reply works with wifi off.
Strict behavioral contract — no screenshot captured, [POINT] pointing disabled via a
trimmed local prompt, transcript/response analytics gated.
Latency badge near the cursor (local · 0.6s · 58 tok/s), and cloud requests now fail fast
with a nudge to Local when offline (instead of a 120s waitsForConnectivity hang).
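The provider seam described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `ChatProviding`, `StubCloudProvider`, `StubLocalProvider`, `ModelChoice`, and `makeProvider(for:)` are hypothetical names, and the real `BuddyChatProvider` protocol in this PR may expose a different (likely async, streaming) interface.

```swift
import Foundation

// Hypothetical stand-in for the chat seam; not the PR's actual protocol.
protocol ChatProviding {
    var displayName: String { get }
    var capturesScreenshot: Bool { get }   // Local Mode's contract: never capture
    func respond(to transcript: String) throws -> String
}

// Stub for a provider that would wrap the existing cloud client.
struct StubCloudProvider: ChatProviding {
    let displayName = "cloud"
    let capturesScreenshot = true
    func respond(to transcript: String) throws -> String {
        "cloud answer to: \(transcript)"
    }
}

// Stub for a provider that would run the on-device model.
struct StubLocalProvider: ChatProviding {
    let displayName = "local"
    let capturesScreenshot = false
    func respond(to transcript: String) throws -> String {
        "local answer to: \(transcript)"
    }
}

// Hypothetical model-picker values.
enum ModelChoice { case sonnet, opus, local }

// The picker maps to a provider; callers depend only on the protocol.
func makeProvider(for choice: ModelChoice) -> any ChatProviding {
    switch choice {
    case .sonnet, .opus: return StubCloudProvider()
    case .local: return StubLocalProvider()
    }
}
```

The point of the shape is that the voice loop asks the provider whether to capture a screenshot rather than branching on the selected model, so adding a backend does not touch the call sites.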

Measured on an M1 Pro / 16 GB (via bench/LocalModeBench — same model, params, cache dir
as the app; real runs, no invented numbers):

metric	local
first token	0.6–0.7 s
decode	54–60 tok/s
model load from disk	3.0 s
cost / question	$0
network	none
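For readers who want to see how figures like these are usually derived, here is a small sketch that computes first-token latency and decode rate from per-token timestamps. It is not the code in bench/LocalModeBench; `GenerationTiming` and `badgeText` are hypothetical names, and the choice to exclude time-to-first-token from the decode rate is an assumption.

```swift
import Foundation

// Hypothetical timing record: seconds since an arbitrary clock origin.
struct GenerationTiming {
    let requestStart: TimeInterval   // when the prompt was submitted
    let tokenTimes: [TimeInterval]   // one entry per emitted token, ascending

    // Time from submission to the first emitted token.
    var firstTokenLatency: TimeInterval? {
        guard let first = tokenTimes.first else { return nil }
        return first - requestStart
    }

    // Decode rate over the span between first and last token,
    // so prompt processing time is not counted against decoding.
    var decodeTokensPerSecond: Double? {
        guard tokenTimes.count >= 2,
              let first = tokenTimes.first,
              let last = tokenTimes.last,
              last > first else { return nil }
        return Double(tokenTimes.count - 1) / (last - first)
    }
}

// Formats a badge in the style "local · 0.6s · 58 tok/s".
// The decode rate is omitted when it cannot be computed.
func badgeText(provider: String, timing: GenerationTiming) -> String {
    var parts = [provider]
    if let latency = timing.firstTokenLatency {
        parts.append(String(format: "%.1fs", latency))
    }
    if let rate = timing.decodeTokensPerSecond {
        parts.append(String(format: "%.0f tok/s", rate))
    }
    return parts.joined(separator: " · ")
}
```

With a first token at 0.6 s and 58 further tokens spread over the following second, this formats as `local · 0.6s · 58 tok/s`.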

Experimental (clearly fenced — bold ideas, honestly labeled):

Cloud guided tour (Make a Logic beat): clicky screenshots → Sonnet grounds the next
control → the blue overlay cursor flies to it and narrates, hands-off. Pure pointing (no
clicks, no Accessibility). Sonnet grounding landed within one pixel of the target in the
verified case: (799, 307) predicted vs (800, 308) true. Reliable + recordable.
Offline takeover (take over — <task>): a local VLM (Qwen2.5-VL-3B) drives a guarded
CGEvent loop — dry-run default, ESC kill switch, offline+Accessibility gate, step/time caps.
Honest status in TAKEOVER.md: the brain grounds in isolation (2/3 on a toy UI) but small
VLMs misclick on dense pro UIs, and the executor/loop are unproven in a live GUI. Shipped as
a prototype to show where this goes, not as a finished feature.

What I'd build next

Auto-routing brain — short query, no screen → local; screen question or hard reasoning
→ cloud. The picker disappears; clicky just feels instant and cheap.
Fully-local screen understanding — a small VLM answering screenshot questions
on-device, the half of clicky Local Mode can't do yet.
Fleet-scale inference fairness — when thousands of clicky buddies share hosted
inference, one chatty user starves the rest; scheduling/fairness on the server is the moat
nobody sees coming. (My research lane.)

Cloud path is unchanged; CLAUDE.md updated in your format. Docs: RECON.md / DECISIONS.md
/ README-REMIX.md / DEMO_SCRIPT.md / TAKEOVER.md. MIT, like the original.

LocalChatProvider runs Llama-3.2-3B-Instruct-4bit via mlx-swift-lm, downloads once to Application Support, then loads from disk offline. AppleSpeechSynthesisClient mirrors the ElevenLabs client contract so the voice loop can run with networking off.

Local in the picker downloads the model with visible progress, then routes voice responses through on-device inference. The contract: no screenshot captured, no pointing, no conversation content in analytics, Apple Speech for STT and TTS so the loop works offline. Cloud requests fail fast with a nudge toward Local when offline instead of hanging two minutes on waitsForConnectivity.
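The "no conversation content in analytics" part of the contract can be illustrated with a short sketch. This is an assumption-laden example, not the PR's implementation: `AnalyticsEvent`, `gateForLocalMode`, and the property keys `transcript` and `response` are hypothetical, and the real PostHog event schema is not shown in this description.

```swift
import Foundation

// Hypothetical analytics event shape.
struct AnalyticsEvent: Equatable {
    let name: String
    var properties: [String: String]
}

// Strips conversation content from an event when Local Mode is active.
// Cloud-mode events pass through unchanged.
func gateForLocalMode(_ event: AnalyticsEvent, isLocalMode: Bool) -> AnalyticsEvent {
    guard isLocalMode else { return event }
    // Assumed keys that would carry user speech or model output.
    let conversationContentKeys: Set<String> = ["transcript", "response"]
    var gated = event
    for key in conversationContentKeys {
        gated.properties.removeValue(forKey: key)
    }
    return gated
}
```

Applying the gate at the single point where events are sent, rather than at each call site, keeps a new event type from leaking content by default.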

Benchmark table holds measured numbers from bench/LocalModeBench on the dev machine (M1 Pro 16GB): 0.6-0.7s first token, 54-60 tok/s, 3.0s model load from disk. Cloud column deliberately absent until a worker exists to measure against.

…d loop

Qwen2.5-VL-3B (MLXVLM) emits one JSON action per screenshot; the controller guards and executes via CoreGraphics input synthesis. Rails: offline+Accessibility precondition, ESC kill switch, step + wall-clock caps, dry-run by default, danger gate behind a second opt-in. 'take over — <task>' in a transcript routes here.
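The rails listed above (kill switch, step and wall-clock caps, dry-run default, second opt-in for risky keys) compose into a loop roughly like the following sketch. It is illustrative only: every type and function name is hypothetical, the VLM call, ESC monitor, and input synthesis are replaced by closures, and the offline and Accessibility preconditions are left out.

```swift
import Foundation

// Hypothetical action vocabulary; the PR's JSON action schema may differ.
enum ProposedAction: Equatable {
    case click(x: Int, y: Int)
    case key(String)
    case done
}

struct Rails {
    var dryRun = true                     // default: nothing is executed
    var allowRisky = false                // second opt-in for destructive keys
    var maxSteps = 10
    var maxSeconds: TimeInterval = 60
    static let riskyKeys: Set<String> = ["delete", "backspace"]
}

enum StopReason { case finished, killSwitch, stepCap, timeCap, blockedRisky }

func runGuardedLoop(
    rails: Rails,
    nextAction: (Int) -> ProposedAction,  // stand-in for the VLM call
    killRequested: () -> Bool,            // stand-in for the ESC monitor
    execute: (ProposedAction) -> Void,    // stand-in for input synthesis
    now: () -> Date = { Date() }
) -> (reason: StopReason, executed: Int) {
    let start = now()
    var executed = 0
    for step in 0..<rails.maxSteps {
        // Check the kill switch and the clock before every model call.
        if killRequested() { return (.killSwitch, executed) }
        if now().timeIntervalSince(start) > rails.maxSeconds { return (.timeCap, executed) }
        let action = nextAction(step)
        if action == .done { return (.finished, executed) }
        // Destructive keys stop the run unless the risky opt-in is set.
        if case .key(let name) = action, Rails.riskyKeys.contains(name), !rails.allowRisky {
            return (.blockedRisky, executed)
        }
        // In dry-run the action is proposed but never executed.
        if !rails.dryRun {
            execute(action)
            executed += 1
        }
    }
    return (.stepCap, executed)
}
```

Injecting `now` and `killRequested` as closures is a design choice of this sketch so that the caps can be exercised without a live GUI or a real clock.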

Smoke test verified Qwen2.5-VL-3B grounds a button at (768,309) vs true center (800,308) on a synthetic UI — but emits malformed JSON ('x:768,309'). Strict JSON rejects it; the regex fallback recovers the right coordinate so a correct grounding in shaky syntax still executes.
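The strict-then-salvage parsing described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's parser: `ClickAction` and `parseClick` are hypothetical names, the action schema is reduced to a bare `x`/`y` pair, and the fallback simply takes the first "integer, integer" pair it finds.

```swift
import Foundation

// Hypothetical reduced action: only the click coordinate.
struct ClickAction: Decodable, Equatable {
    let x: Int
    let y: Int
}

func parseClick(_ raw: String) -> ClickAction? {
    // 1. Strict path: well-formed JSON with integer "x" and "y" fields.
    if let data = raw.data(using: .utf8),
       let action = try? JSONDecoder().decode(ClickAction.self, from: data) {
        return action
    }
    // 2. Salvage path: recover the first "int , int" pair from malformed output.
    guard let regex = try? NSRegularExpression(pattern: #"(\d+)\s*,\s*(\d+)"#) else {
        return nil
    }
    let fullRange = NSRange(raw.startIndex..., in: raw)
    guard let match = regex.firstMatch(in: raw, options: [], range: fullRange),
          let xRange = Range(match.range(at: 1), in: raw),
          let yRange = Range(match.range(at: 2), in: raw),
          let x = Int(raw[xRange]),
          let y = Int(raw[yRange]) else {
        return nil
    }
    return ClickAction(x: x, y: y)
}
```

A salvage path like this trades strictness for recall, so in a real loop it would presumably sit behind the same dry-run and bounds checks as any other action.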

Clicky screenshots the screen, asks Sonnet for the next control + narration, flies the overlay cursor to it, speaks, and advances — hands-off, pure pointing (no clicks, no Accessibility). Triggered by the panel button or 'teach me a beat'. Worker base URL is now overridable via UserDefaults for local wrangler dev; ATS allows localhost.

Blockers: restore the user's cloud model after a guided tour (was silently downgrading Opus to Sonnet for the session); refuse takeover if the ESC kill switch can't be installed instead of running blind.

Fixes: route fallback voices through the retained Apple Speech client (local NSSpeechSynthesizer was deallocating mid-sentence); gate bare destructive keys (delete/backspace) behind the risky opt-in; tighten guided-tour/takeover triggers so they don't hijack ordinary cloud queries; guard empty TTS utterances; don't treat repeated scrolls as stuck.

Cleanup: share the model-cache dir helper, use DS color tokens in the new panel UI, document the takeover + guided-tour files.

ShreyPatel4 added 18 commits June 12, 2026 17:45

add recon and decisions docs for local mode

6522eca

add chat provider protocol and route responses through it

96cb090

Mirrors the BuddyTranscriptionProvider pattern so chat backends are interchangeable. CloudChatProvider wraps the untouched ClaudeAPI — no behavior change for Sonnet/Opus.

add local inference benchmark harness

f7d72f3

Same model, cache dir, and generation parameters as the app, so the README numbers are measured on the actual pipeline, not estimated.

show provider and first-token latency badge during responses

338db33

Turns the local-vs-cloud speed difference into something visible: a small chip near the cursor while the answer plays, with decode speed included when the local engine reports it.

pin resolved package versions for local mode deps

b689861

let the bench take a one-off question for demos

ed2ddef

A custom question prints its full answer; the default sweep keeps short previews.

design offline autonomous takeover mode (spike)

cce853d

Local VLM (Qwen2.5-VL-3B via MLXVLM) drives a guarded computer-use loop. Dry-run default, kill switch, offline-only enforcement, confirm gate before irreversible actions. Input monitoring stays on-device.

add takeover controls to the panel

b796e21

Arm/dry-run toggle, the risky-actions opt-in (only shown when armed), live step status, and a stop button. The panel states that takeover is offline-only and that Esc stops it.

tolerate unquoted keys in action salvage; record measured grounding

1f24686

2/3 synthetic-UI targets grounded correctly; the top-bar edge target missed, matching the ScreenSpot-Pro failure mode. Numbers + Holo1 upgrade path in TAKEOVER.md.

mark executor + loop + kill switch as unproven until run in the app

d562e7e

add permission-free local-mode demo window + ignore wrangler cache

c15f580

ClickyLocalDemo runs the real LocalChatProvider path (same model, params, offline TTS) in a window with typed input — a way to show Local Mode without mic/screen permissions.

ShreyPatel4 wants to merge 18 commits into farzaa:main from ShreyPatel4:local-mode
