
cue.llama_server

Bundled llama-server subprocess manager.

LocalVisionBackend posts to a localhost-bound llama-server over the OpenAI-compatible /v1/chat/completions endpoint. The server stays running across digest cycles because model load (~10–30 s on CPU) is too expensive to repeat. The singleton is started lazily on the first digest call that picks the local backend, kept alive between calls, and torn down via an atexit hook.
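A minimal sketch of what a call to that endpoint looks like, assuming the server is listening on a port chosen at startup; the helper name, the requests dependency, and the exact payload shape are illustrative, not the backend's actual client code:

```python
# Sketch of an OpenAI-compatible chat completion against the local llama-server.
# describe_image and the payload layout are assumptions for illustration only.
import requests


def describe_image(port: int, image_b64: str, prompt: str) -> str:
    resp = requests.post(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        json={
            "model": "local",  # llama-server serves a single loaded model
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                        },
                    ],
                }
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```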

Production lockdown:

- --host 127.0.0.1 only; never bind an external interface.
- Random high port (49152–65535).
- --log-disable so subprocess stdout / stderr don't carry request / response logs. The b8987 binary doesn't accept --no-display-prompt (only --log-disable); confirmed against llama-server --help and the spike. Any future server-log tail in the UI gets re-scrubbed before display.
- --n-gpu-layers 0 (CPU) by default. Metal offload (-1) is opt-in via the digest_local_n_gpu_layers config key: the b8987 prebuilt's Metal kernels reference bf16 mat-mul ops that drivers on pre-M5 / pre-A19 Apple Silicon don't ship, so default-off is the safe path. See docs/llama-server-spike.md.
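For illustration, the lockdown corresponds roughly to a launch command like the one below; the binary path, helper name, and argument ordering are assumptions, while the flags themselves are standard llama-server options:

```python
# Sketch of a lockdown-compliant launch; launch() is hypothetical, not the module's API.
import random
import subprocess
from pathlib import Path


def launch(binary: Path, model_path: Path, mmproj_path: Path,
           ctx_size: int = 8192, n_gpu_layers: int = 0) -> tuple[subprocess.Popen, int]:
    port = random.randint(49152, 65535)  # random high port
    cmd = [
        str(binary),
        "--host", "127.0.0.1",                # localhost only, never an external bind
        "--port", str(port),
        "-m", str(model_path),
        "--mmproj", str(mmproj_path),
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),  # 0 = CPU; -1 = full Metal offload (opt-in)
        "--log-disable",                      # keep request/response bodies out of the logs
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return proc, port
```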

LocalServerError

Bases: Exception

Base class for llama-server lifecycle / call failures.

LocalUnavailable

Bases: LocalServerError

Server isn't running and couldn't be started, or the bundled binary is missing, or the model GGUFs aren't on disk yet.

LocalTimeout

Bases: LocalServerError

A request to the running server exceeded its timeout budget.

get_or_start

get_or_start(*, model_path: Path, mmproj_path: Path, ctx_size: int = 8192, n_gpu_layers: int = 0) -> _Server

Return a healthy _Server configured for the given GGUFs. If the existing singleton's config differs (different model swapped in via Settings), the old server is torn down and a new one is started.
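A hedged usage sketch; the GGUF paths are placeholders, and the fallback behavior is illustrative rather than what callers in cue actually do:

```python
# Usage sketch for get_or_start; paths below stand in for wherever the bundled GGUFs live.
from pathlib import Path

from cue.llama_server import LocalUnavailable, get_or_start

try:
    server = get_or_start(
        model_path=Path("models/vision-model.gguf"),
        mmproj_path=Path("models/vision-mmproj.gguf"),
        ctx_size=8192,
        n_gpu_layers=0,  # CPU default; pass -1 to opt into Metal offload
    )
except LocalUnavailable:
    # Binary or GGUFs missing, or the server refused to start:
    # fall back to a remote backend or skip the local digest.
    ...
```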

stop

stop() -> None

Atexit-safe singleton teardown.
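The module registers this teardown itself via atexit; the snippet below only illustrates the intended shape, assuming stop() is safe to call even when no server was ever started:

```python
# Equivalent of the module's own atexit registration (illustrative only).
import atexit

from cue.llama_server import stop

atexit.register(stop)  # atexit-safe: harmless if the singleton never started
```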