온디바이스 비전¶

이 페이지는 CLAUDE.md의 구현 노트를 미러합니다. 서브시스템 변경 시 양쪽 다 업데이트하세요.

로컬 디지스트 백엔드는 번들된 llama-server 서브프로세스를 통해 Gemma 4 멀티모달을 로컬 실행합니다. 활성화되면 스크린샷이 디바이스에 머무름 — Anthropic이 아닌 localhost-only HTTP로 감.

컴포넌트¶

모듈	역할
`cue.llm.LocalVisionBackend`	cloud 백엔드와 같은 프롬프트 모양으로 로컬 서버에 POST. 실패 시 typed error (`LocalUnavailable`, `LocalTimeout`) 발생.
`cue.llama_server`	서브프로세스 라이프사이클 매니저. 첫 호출 시 lazy 시작, atexit kill, 헬스체크 루프, 포트 충돌 retry.
`cue.local_models`	핀된 model + mmproj 매니페스트. sha256 검증 + atomic move + 디스크 공간 체크 포함 재개 가능 `huggingface_hub` 다운로드.
번들된 `llama-server` 바이너리	Frozen 빌드는 `Cue.app/Contents/Resources/bin/llama-b<tag>/llama-server` (mac) 또는 `Cue\bin\llama-b<tag>\llama-server.exe` (win) 아래에 binary ship.

모델 매니페스트¶

두 슬러그 지원:

gemma4_e2b_vision — ggml-org/gemma-4-E2B-it-GGUF, gemma-4-E2B-it-Q8_0.gguf (~5.0 GB) + mmproj-gemma-4-E2B-it-Q8_0.gguf (~557 MB).
gemma4_e4b_vision — ggml-org/gemma-4-E4B-it-GGUF, gemma-4-E4B-it-Q4_K_M.gguf (~5.3 GB) + mmproj-gemma-4-E4B-it-Q8_0.gguf (~560 MB).

둘 다 특정 HuggingFace revision SHA에 핀하고 다운로드 후 model + mmproj sha256 검증. local_models.preflight(slug)이 두 아티팩트 다 가져오고 둘 다 land해야 ready 보고.

E2B를 Q4 대신 Q8로 쓰는 이유: ggml-org가 오늘 E2B의 Q4_K_M을 ship 안 함. canonical 아티팩트를 유지하면 provenance가 감사 가능; ~3 GB 추가 디스크가 비용.

서브프로세스 lockdown¶

llama_server.start() 실행 (E2B 예시):

llama-server \
    --model <path>/gemma-4-E2B-it-Q8_0.gguf \
    --mmproj <path>/mmproj-gemma-4-E2B-it-Q8_0.gguf \
    --host 127.0.0.1 \                  # 명시 — 외부 bind 절대 금지
    --port <random_high_port> \         # 49152-65535 범위
    --ctx-size 8192 \
    --n-gpu-layers <auto: Apple Silicon ≤ M4에서 0> \
    --threads <cpu_count // 2> \
    --no-context-shift \
    --log-disable                       # stdout에 request / prompt 로깅 없음

Lockdown 규칙:

127.0.0.1만 bind. 외부 bind는 non-feature.
랜덤 high port (49152-65535). 프로세스 상태에 저장, diagnostic 모드 외 로깅 금지.
--log-disable으로 서브프로세스 stdout/stderr가 raw 프롬프트 텍스트나 이미지 경로를 운반하지 않음. b8987 바이너리는 옛 --no-display-prompt 플래그 미지원.
HTTP request 로깅 없음. llama-server의 옵션 access log 미활성.

Health: GET /health ready까지 폴링 (max 30초, 그 후 LocalUnavailable). 서브프로세스는 Cue 종료 시 atexit + 3초 grace SIGTERM 후 SIGKILL.

Apple Silicon Metal bf16 caveat¶

b8987 prebuilt의 Metal 커널은 pre-M5 / pre-A19 Apple Silicon GPU 드라이버가 ship 안 하는 bf16 mat-mul ops를 참조. --n-gpu-layers -1이면 서버가 warmup 중에 abort:

ggml_metal_library_compile_pipeline: failed to compile pipeline:
  base = 'kernel_mul_mv_ext_bf16_f32_r1_2'
ggml_metal_library_compile_pipeline: Function kernel_mul_mv_ext_bf16_f32_r1_2
  was not found in the library

Cue는 Apple Silicon ≤ M4에서 --n-gpu-layers 0 (CPU)을 기본값. 설정 UI가 status banner로 노출. M5 / A19+가 maintainer baseline이 되면 매니페스트가 -1 (full Metal offload)로 flip 가능, default-flip eval이 현실적 라텐시로 재실행.

이 기본값들 뒤의 측정 숫자는 docs/spikes/llama-server.md 참고.

`llama-cpp-python` 대신 외부 바이너리인 이유¶

llama-cpp-python은 install 시 C++ 빌드 필요; pre-built wheel은 upstream에 뒤처지고 항상 llama-server HTTP API 표면과 일치하는 것은 아님.
HTTP 경계가 GIL-bound Python 프로세스를 깔끔하게 유지, Anthropic용 같은 httpx 클라이언트 재사용 가능.
더 작은 번들 아티팩트 (~26 MB prebuilt vs wheels-bundled native 빌드 ~80 MB+).
공식 upstream 멀티모달 경로 (llama-server + mmproj)와 일치.

프라이버시 자세 요약¶

사용자가 명시적으로 Cloud backend OR Allow cloud fallback을 활성화하지 않는 한 스크린샷이 디바이스에 머무름. 둘 다 기본 off.
메모리와 핫키 제안은 여전히 cloud LLM (Opus)을 scrub된 디지스트 텍스트로만 사용.
설정의 설정 카피가 이를 명시.

라이프사이클¶

sequenceDiagram
    participant App as Cue main
    participant Models as cue.local_models
    participant Srv as llama-server 서브프로세스
    participant Backend as LocalVisionBackend

    App->>Models: 데몬 스레드에서 preflight(slug)
    Models->>Models: hf_hub_download (재개 가능, sha256 검증)
    Models-->>App: ready

    App->>Backend: summarize_digest(...)
    Backend->>Srv: 미실행 시 start() (~10-30초 cold)
    Backend->>Srv: 200까지 GET /health
    Backend->>Srv: POST /v1/chat/completions (10 image_url block)
    Srv-->>Backend: {choices[0].message.content}
    Backend-->>App: summary str

    App->>Srv: atexit -> SIGTERM, 3초 후 SIGKILL