On-device vision¶
This page mirrors implementation notes maintained in CLAUDE.md.
Update both when changing this subsystem.
The local digest backend runs a Gemma 4 multimodal model via a bundled
llama-server subprocess. When enabled, screenshots stay on the device:
they go over localhost-only HTTP, not to Anthropic.
Components¶
| Module | Role |
|---|---|
| `cue.llm.LocalVisionBackend` | Posts to the local server with the same prompt shape the cloud backend uses. Raises typed errors on failure (`LocalUnavailable`, `LocalTimeout`). |
| `cue.llama_server` | Subprocess lifecycle manager. Lazy start on first call, atexit kill, healthcheck loop, port collision retry. |
| `cue.local_models` | Pinned model + mmproj manifest. Resumable `huggingface_hub` downloads with sha256 verification, atomic move, disk-space check. |
| Bundled llama-server binary | Frozen builds ship the binary under `Cue.app/Contents/Resources/bin/llama-b<tag>/llama-server` (mac) or `Cue\bin\llama-b<tag>\llama-server.exe` (win). |
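A minimal call-site sketch of the typed-error contract: `summarize_digest` is the method named in the lifecycle diagram below, but its argument list and the import location of `LocalUnavailable` / `LocalTimeout` are assumptions here, not the real `cue.llm` API.

```python
from cue.llm import LocalVisionBackend, LocalUnavailable, LocalTimeout

backend = LocalVisionBackend()
screenshots = ["/tmp/cue/shot-001.png"]  # placeholder input for the sketch

try:
    summary = backend.summarize_digest(screenshots)
except LocalTimeout:
    # Server is up, but the completion exceeded the client deadline.
    summary = None
except LocalUnavailable:
    # Server never became healthy (model not downloaded, port exhausted, crash).
    summary = None
```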
Model manifest¶
Two slugs are supported:
- `gemma4_e2b_vision` — `ggml-org/gemma-4-E2B-it-GGUF`, `gemma-4-E2B-it-Q8_0.gguf` (~5.0 GB) + `mmproj-gemma-4-E2B-it-Q8_0.gguf` (~557 MB).
- `gemma4_e4b_vision` — `ggml-org/gemma-4-E4B-it-GGUF`, `gemma-4-E4B-it-Q4_K_M.gguf` (~5.3 GB) + `mmproj-gemma-4-E4B-it-Q8_0.gguf` (~560 MB).
Both pin to a specific HuggingFace revision SHA and verify model +
mmproj sha256 after download. `local_models.preflight(slug)` pulls
both artifacts and reports ready only when both land.
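As a rough sketch of the pinned-manifest idea (field names, the revision placeholder, and the digest values are illustrative, and the atomic move and disk-space check are omitted), not the real `cue.local_models` code:

```python
import hashlib
from dataclasses import dataclass

from huggingface_hub import hf_hub_download


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str
    revision: str         # pinned HuggingFace revision SHA
    model_file: str
    model_sha256: str
    mmproj_file: str
    mmproj_sha256: str


MANIFEST = {
    "gemma4_e2b_vision": ModelSpec(
        repo_id="ggml-org/gemma-4-E2B-it-GGUF",
        revision="<pinned-revision-sha>",
        model_file="gemma-4-E2B-it-Q8_0.gguf",
        model_sha256="<expected-sha256>",
        mmproj_file="mmproj-gemma-4-E2B-it-Q8_0.gguf",
        mmproj_sha256="<expected-sha256>",
    ),
}


def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def preflight(slug: str) -> bool:
    """Download both artifacts (resumable) and verify their digests."""
    spec = MANIFEST[slug]
    ok = True
    for filename, expected in (
        (spec.model_file, spec.model_sha256),
        (spec.mmproj_file, spec.mmproj_sha256),
    ):
        path = hf_hub_download(spec.repo_id, filename, revision=spec.revision)
        ok = ok and _sha256(path) == expected
    return ok  # "ready" only when both artifacts landed and verified
```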
Why E2B at Q8 instead of Q4: ggml-org doesn't ship a Q4_K_M for E2B today. Sticking with the canonical artifacts keeps provenance auditable; the cost is ~3 GB of extra disk.
Subprocess lockdown¶
`llama_server.start()` runs (E2B example):

```
llama-server \
  --model <path>/gemma-4-E2B-it-Q8_0.gguf \
  --mmproj <path>/mmproj-gemma-4-E2B-it-Q8_0.gguf \
  --host 127.0.0.1 \              # explicit — never bind external
  --port <random_high_port> \     # 49152-65535 range
  --ctx-size 8192 \
  --n-gpu-layers <auto: 0 on Apple Silicon ≤ M4> \
  --threads <cpu_count // 2> \
  --no-context-shift \
  --log-disable                   # no request / prompt logging on stdout
```
Lockdown rules:
- Bind to 127.0.0.1 only. External binds are a non-feature.
- Random high port (49152-65535). Stored in process state, never logged outside diagnostic mode.
- `--log-disable` so subprocess stdout/stderr don't carry raw prompt text or image paths. The b8987 binary doesn't accept the older `--no-display-prompt` flag.
- No HTTP request logging. llama-server's optional access log isn't enabled.
Health: `GET /health` is polled until the server reports ready (max 30 s, then
`LocalUnavailable`). The subprocess is stopped on Cue shutdown via atexit, with
a 3 s SIGTERM grace period before SIGKILL.
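A condensed sketch of that start-and-health behavior, assuming helper names that are not the real `cue.llama_server` API; port collision retry and the `--threads` flag are omitted for brevity:

```python
import atexit
import random
import subprocess
import time

import httpx


def start(binary: str, model: str, mmproj: str) -> tuple[subprocess.Popen, int]:
    port = random.randint(49152, 65535)           # random high port, never logged
    proc = subprocess.Popen(
        [
            binary,
            "--model", model,
            "--mmproj", mmproj,
            "--host", "127.0.0.1",                # localhost only, never external
            "--port", str(port),
            "--ctx-size", "8192",
            "--n-gpu-layers", "0",                # CPU default on Apple Silicon <= M4
            "--no-context-shift",
            "--log-disable",                      # keep prompts / image paths off stdout
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    atexit.register(_stop, proc)

    deadline = time.monotonic() + 30              # max 30 s, then LocalUnavailable
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"http://127.0.0.1:{port}/health", timeout=1.0).status_code == 200:
                return proc, port
        except httpx.HTTPError:
            pass                                  # not up yet; keep polling
        time.sleep(0.5)
    _stop(proc)
    raise RuntimeError("local llama-server never became healthy")  # LocalUnavailable in Cue


def _stop(proc: subprocess.Popen) -> None:
    if proc.poll() is None:
        proc.terminate()                          # SIGTERM with 3 s grace...
        try:
            proc.wait(timeout=3)
        except subprocess.TimeoutExpired:
            proc.kill()                           # ...then SIGKILL
```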
Apple Silicon Metal bf16 caveat¶
The b8987 prebuilt's Metal kernels reference bf16 mat-mul ops
that pre-M5 / pre-A19 Apple Silicon GPU drivers don't ship. With
`--n-gpu-layers -1` the server aborts during warmup with:

```
ggml_metal_library_compile_pipeline: failed to compile pipeline:
base = 'kernel_mul_mv_ext_bf16_f32_r1_2'
ggml_metal_library_compile_pipeline: Function kernel_mul_mv_ext_bf16_f32_r1_2
was not found in the library
```
Cue defaults to `--n-gpu-layers 0` (CPU) on Apple Silicon ≤ M4. The
Settings UI surfaces this as a status banner. Once M5 / A19+ becomes
the maintainer's baseline, the manifest can flip to `-1` (full Metal
offload) and the default-flip eval can be rerun under realistic latency.
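How the chip generation is detected is not described on this page; one hypothetical way to pick the default on macOS is the `sysctl` CPU brand string:

```python
# Hypothetical detection sketch only; Cue's real check may differ.
import platform
import re
import subprocess


def default_n_gpu_layers() -> int:
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return 0  # non-Apple-Silicon hosts are out of scope for this sketch
    brand = subprocess.run(
        ["sysctl", "-n", "machdep.cpu.brand_string"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()                      # e.g. "Apple M3 Pro"
    match = re.search(r"\bM(\d+)\b", brand)
    generation = int(match.group(1)) if match else 0
    # The b8987 prebuilt's bf16 Metal kernels need newer GPU drivers,
    # so stay on CPU for M4 and earlier; -1 means full Metal offload.
    return -1 if generation >= 5 else 0
```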
See docs/spikes/llama-server.md
for the measured numbers behind these defaults.
Why an external binary, not llama-cpp-python¶
- `llama-cpp-python` requires a C++ build at install time; pre-built wheels exist but lag behind upstream and don't always match the `llama-server` HTTP API surface.
- The HTTP boundary keeps the GIL-bound Python process clean and lets us use the same `httpx` client we use for Anthropic.
- Smaller bundled artifact (~26 MB for the prebuilt vs ~80 MB+ for a wheels-bundled native build).
- Matches the official upstream multimodal path (`llama-server` mmproj).
Privacy posture summary¶
- Screenshots stay on this device unless the user explicitly enables Cloud backend OR Allow cloud fallback. Both are off by default.
- Memory and hotkey suggestions still use cloud LLMs (Opus) with scrubbed digest text only.
- The settings copy is explicit about this in Preferences.
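The routing rule those bullets imply, sketched with placeholder names rather than Cue's real settings schema:

```python
from dataclasses import dataclass


@dataclass
class VisionSettings:
    cloud_backend: bool = False         # "Cloud backend" toggle, off by default
    allow_cloud_fallback: bool = False  # "Allow cloud fallback" toggle, off by default


def pick_backend(settings: VisionSettings, local_failed: bool) -> str | None:
    if settings.cloud_backend:
        return "cloud"                  # explicit opt-in to the cloud backend
    if local_failed:
        # Cloud is used only if the user opted into fallback; otherwise skip the digest.
        return "cloud" if settings.allow_cloud_fallback else None
    return "local"                      # default: screenshots never leave the device
```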
Lifecycle¶
```mermaid
sequenceDiagram
  participant App as Cue main
  participant Models as cue.local_models
  participant Srv as llama-server subprocess
  participant Backend as LocalVisionBackend
  App->>Models: preflight(slug) on daemon thread
  Models->>Models: hf_hub_download (resumable, sha256 verified)
  Models-->>App: ready
  App->>Backend: summarize_digest(...)
  Backend->>Srv: start() if not running (~10-30 s cold)
  Backend->>Srv: GET /health until 200
  Backend->>Srv: POST /v1/chat/completions (10 image_url blocks)
  Srv-->>Backend: {choices[0].message.content}
  Backend-->>App: summary str
  App->>Srv: atexit -> SIGTERM, then SIGKILL after 3 s
```
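The completion call in the diagram, sketched against the OpenAI-compatible endpoint llama-server exposes; prompt text, file names, and the timeout below are placeholders, not the backend's real values:

```python
import base64

import httpx


def summarize(port: int, prompt: str, image_paths: list[str]) -> str:
    # One text block followed by image_url blocks (up to 10 screenshots per digest).
    content: list[dict] = [{"type": "text", "text": prompt}]
    for path in image_paths:
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    resp = httpx.post(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        json={"messages": [{"role": "user", "content": content}]},
        timeout=120.0,                   # cold CPU runs can be slow
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```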
See also¶
- `cue.llm` — `LocalVisionBackend` API.
- `cue.llama_server` — subprocess manager API.
- `cue.local_models` — manifest + downloader API.
- llama-server multimodal spike — the design note that pinned the contract.
- Digest pipeline — where this backend slots in.