On-device vision¶
This page mirrors implementation notes maintained in CLAUDE.md.
Update both when changing this subsystem.
The local digest backend runs a Gemma 4 multimodal model via a bundled
llama-server subprocess. When enabled, screenshots stay on the device:
they go over localhost-only HTTP, not to Anthropic.
Components¶
| Module | Role |
|---|---|
| `cue.llm.LocalVisionBackend` | Posts to the local server with the same prompt shape the cloud backend uses. Raises typed errors on failure (`LocalUnavailable`, `LocalTimeout`). |
| `cue.llama_server` | Subprocess lifecycle manager. Lazy start on first call, atexit kill, healthcheck loop, port collision retry. |
| `cue.local_models` | Pinned model + mmproj manifest. Resumable `huggingface_hub` downloads with sha256 verification, atomic move, disk-space check. |
| Bundled llama-server binary | Frozen builds ship the binary under `Cue.app/Contents/Resources/bin/llama-b<tag>/llama-server` (mac) or `Cue\bin\llama-b<tag>\llama-server.exe` (win). |
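A minimal call-site sketch of the typed-error contract: `summarize_digest` is the method named in the lifecycle diagram below, but its argument list and the import location of `LocalUnavailable` / `LocalTimeout` are assumptions here, not the real `cue.llm` API.

```python
from cue.llm import LocalVisionBackend, LocalUnavailable, LocalTimeout

backend = LocalVisionBackend()
screenshots = ["/tmp/cue/shot-001.png"]  # placeholder input for the sketch

try:
    summary = backend.summarize_digest(screenshots)
except LocalTimeout:
    # Server is up, but the completion exceeded the client deadline.
    summary = None
except LocalUnavailable:
    # Server never became healthy (model not downloaded, port exhausted, crash).
    summary = None
```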
Model manifest¶
Two slugs are supported:
- `gemma4_e2b_vision` — `ggml-org/gemma-4-E2B-it-GGUF`, `gemma-4-E2B-it-Q8_0.gguf` (~5.0 GB) + `mmproj-gemma-4-E2B-it-Q8_0.gguf` (~557 MB).
- `gemma4_e4b_vision` — `ggml-org/gemma-4-E4B-it-GGUF`, `gemma-4-E4B-it-Q4_K_M.gguf` (~5.3 GB) + `mmproj-gemma-4-E4B-it-Q8_0.gguf` (~560 MB).
Both pin to a specific HuggingFace revision SHA and verify model +
mmproj sha256 after download. `local_models.preflight(slug)` pulls
both artifacts and reports ready only when both land.
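As a rough sketch of the pinned-manifest idea (field names, the revision placeholder, and the digest values are illustrative, and the atomic move and disk-space check are omitted), not the real `cue.local_models` code:

```python
import hashlib
from dataclasses import dataclass

from huggingface_hub import hf_hub_download


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str
    revision: str         # pinned HuggingFace revision SHA
    model_file: str
    model_sha256: str
    mmproj_file: str
    mmproj_sha256: str


MANIFEST = {
    "gemma4_e2b_vision": ModelSpec(
        repo_id="ggml-org/gemma-4-E2B-it-GGUF",
        revision="<pinned-revision-sha>",
        model_file="gemma-4-E2B-it-Q8_0.gguf",
        model_sha256="<expected-sha256>",
        mmproj_file="mmproj-gemma-4-E2B-it-Q8_0.gguf",
        mmproj_sha256="<expected-sha256>",
    ),
}


def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def preflight(slug: str) -> bool:
    """Download both artifacts (resumable) and verify their digests."""
    spec = MANIFEST[slug]
    ok = True
    for filename, expected in (
        (spec.model_file, spec.model_sha256),
        (spec.mmproj_file, spec.mmproj_sha256),
    ):
        path = hf_hub_download(spec.repo_id, filename, revision=spec.revision)
        ok = ok and _sha256(path) == expected
    return ok  # "ready" only when both artifacts landed and verified
```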
Why E2B at Q8 instead of Q4: ggml-org doesn't ship a Q4_K_M for E2B today. Sticking with the canonical artifacts keeps provenance auditable; the cost is ~3 GB of extra disk.
Subprocess lockdown¶
`llama_server.start()` runs (E2B example):

```
llama-server \
  --model <path>/gemma-4-E2B-it-Q8_0.gguf \
  --mmproj <path>/mmproj-gemma-4-E2B-it-Q8_0.gguf \
  --host 127.0.0.1 \              # explicit — never bind external
  --port <random_high_port> \     # 49152-65535 range
  --ctx-size 8192 \
  --n-gpu-layers <auto: 0 on Apple Silicon ≤ M4> \
  --threads <cpu_count // 2> \
  --no-context-shift \
  --log-disable                   # no request / prompt logging on stdout
```
Lockdown rules:
- Bind to 127.0.0.1 only. External binds are a non-feature.
- Random high port (49152-65535). Stored in process state, never logged outside diagnostic mode.
- `--log-disable` so subprocess stdout/stderr don't carry raw prompt text or image paths. The b8987 binary doesn't accept the older `--no-display-prompt` flag.
- No HTTP request logging. llama-server's optional access log isn't enabled.
Health: `GET /health` is polled until the server reports ready (max 30 s, then
`LocalUnavailable`). The subprocess is stopped on Cue shutdown via atexit, with
a 3 s SIGTERM grace period before SIGKILL.
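A condensed sketch of that start-and-health behavior, assuming helper names that are not the real `cue.llama_server` API; port collision retry and the `--threads` flag are omitted for brevity:

```python
import atexit
import random
import subprocess
import time

import httpx


def start(binary: str, model: str, mmproj: str) -> tuple[subprocess.Popen, int]:
    port = random.randint(49152, 65535)           # random high port, never logged
    proc = subprocess.Popen(
        [
            binary,
            "--model", model,
            "--mmproj", mmproj,
            "--host", "127.0.0.1",                # localhost only, never external
            "--port", str(port),
            "--ctx-size", "8192",
            "--n-gpu-layers", "0",                # CPU default on Apple Silicon <= M4
            "--no-context-shift",
            "--log-disable",                      # keep prompts / image paths off stdout
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    atexit.register(_stop, proc)

    deadline = time.monotonic() + 30              # max 30 s, then LocalUnavailable
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"http://127.0.0.1:{port}/health", timeout=1.0).status_code == 200:
                return proc, port
        except httpx.HTTPError:
            pass                                  # not up yet; keep polling
        time.sleep(0.5)
    _stop(proc)
    raise RuntimeError("local llama-server never became healthy")  # LocalUnavailable in Cue


def _stop(proc: subprocess.Popen) -> None:
    if proc.poll() is None:
        proc.terminate()                          # SIGTERM with 3 s grace...
        try:
            proc.wait(timeout=3)
        except subprocess.TimeoutExpired:
            proc.kill()                           # ...then SIGKILL
```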
Apple Silicon Metal bf16 caveat¶
The b8987 prebuilt's Metal kernels reference bf16 mat-mul ops
that pre-M5 / pre-A19 Apple Silicon GPU drivers don't ship. With
`--n-gpu-layers -1` the server aborts during warmup with:

```
ggml_metal_library_compile_pipeline: failed to compile pipeline:
base = 'kernel_mul_mv_ext_bf16_f32_r1_2'
ggml_metal_library_compile_pipeline: Function kernel_mul_mv_ext_bf16_f32_r1_2
was not found in the library
```
Cue defaults to `--n-gpu-layers 0` (CPU) on Apple Silicon ≤ M4. The
Settings UI surfaces this as a status banner. Once M5 / A19+ becomes
the maintainer's baseline, the manifest can flip to `-1` (full Metal
offload) and the default-flip eval can be rerun under realistic latency.
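How the chip generation is detected is not described on this page; one hypothetical way to pick the default on macOS is the `sysctl` CPU brand string:

```python
# Hypothetical detection sketch only; Cue's real check may differ.
import platform
import re
import subprocess


def default_n_gpu_layers() -> int:
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return 0  # non-Apple-Silicon hosts are out of scope for this sketch
    brand = subprocess.run(
        ["sysctl", "-n", "machdep.cpu.brand_string"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()                      # e.g. "Apple M3 Pro"
    match = re.search(r"\bM(\d+)\b", brand)
    generation = int(match.group(1)) if match else 0
    # The b8987 prebuilt's bf16 Metal kernels need newer GPU drivers,
    # so stay on CPU for M4 and earlier; -1 means full Metal offload.
    return -1 if generation >= 5 else 0
```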
See docs/spikes/llama-server.md
for the measured numbers behind these defaults.
Why an external binary, not llama-cpp-python¶
- `llama-cpp-python` requires a C++ build at install time; pre-built wheels exist but lag behind upstream and don't always match the `llama-server` HTTP API surface.
- The HTTP boundary keeps the GIL-bound Python process clean and lets us use the same `httpx` client we use for Anthropic.
- Smaller bundled artifact (~26 MB for the prebuilt vs ~80 MB+ for a wheels-bundled native build).
- Matches the official upstream multimodal path (`llama-server` mmproj).
Privacy posture summary¶
- Screenshots stay on this device unless the user explicitly enables Cloud backend OR Allow cloud fallback. Both are off by default.
- Memory and hotkey suggestions still use cloud LLMs (Opus) with scrubbed digest text only.
- The settings copy is explicit about this in Preferences.
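The routing rule those bullets imply, sketched with placeholder names rather than Cue's real settings schema:

```python
from dataclasses import dataclass


@dataclass
class VisionSettings:
    cloud_backend: bool = False         # "Cloud backend" toggle, off by default
    allow_cloud_fallback: bool = False  # "Allow cloud fallback" toggle, off by default


def pick_backend(settings: VisionSettings, local_failed: bool) -> str | None:
    if settings.cloud_backend:
        return "cloud"                  # explicit opt-in to the cloud backend
    if local_failed:
        # Cloud is used only if the user opted into fallback; otherwise skip the digest.
        return "cloud" if settings.allow_cloud_fallback else None
    return "local"                      # default: screenshots never leave the device
```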
Lifecycle¶
```mermaid
sequenceDiagram
  participant App as Cue main
  participant Models as cue.local_models
  participant Srv as llama-server subprocess
  participant Backend as LocalVisionBackend
  App->>Models: preflight(slug) on daemon thread
  Models->>Models: hf_hub_download (resumable, sha256 verified)
  Models-->>App: ready
  App->>Backend: summarize_digest(...)
  Backend->>Srv: start() if not running (~10-30 s cold)
  Backend->>Srv: GET /health until 200
  Backend->>Srv: POST /v1/chat/completions (10 image_url blocks)
  Srv-->>Backend: {choices[0].message.content}
  Backend-->>App: summary str
  App->>Srv: atexit -> SIGTERM, then SIGKILL after 3 s
```
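The completion call in the diagram, sketched against the OpenAI-compatible endpoint llama-server exposes; prompt text, file names, and the timeout below are placeholders, not the backend's real values:

```python
import base64

import httpx


def summarize(port: int, prompt: str, image_paths: list[str]) -> str:
    # One text block followed by image_url blocks (up to 10 screenshots per digest).
    content: list[dict] = [{"type": "text", "text": prompt}]
    for path in image_paths:
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    resp = httpx.post(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        json={"messages": [{"role": "user", "content": content}]},
        timeout=120.0,                   # cold CPU runs can be slow
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```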
See also¶
- `cue.llm` — `LocalVisionBackend` API.
- `cue.llama_server` — subprocess manager API.
- `cue.local_models` — manifest + downloader API.
- llama-server multimodal spike — the design note that pinned the contract.
- Digest pipeline — where this backend slots in.