cue.llama_server¶
Bundled llama-server subprocess manager.
LocalVisionBackend posts to a localhost-bound llama-server over the
OpenAI-compatible /v1/chat/completions endpoint. The server stays
running across digest cycles, since a model load (~10–30 s on CPU) is
too expensive to repeat per cycle. The singleton is started lazily on
the first digest call that picks the local backend, kept alive, and
torn down by an atexit handler.
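The call itself is plain OpenAI chat-completions JSON over loopback. A
minimal sketch of that request shape, assuming a PNG image; the port,
prompt, and the describe_image helper are illustrative, not this
module's API:

```python
import base64
import json
import urllib.request

def describe_image(port: int, image_bytes: bytes, prompt: str) -> str:
    """Sketch: one vision chat-completion against a loopback llama-server."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",  # loopback-bound server
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```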
Production lockdown (see the argv sketch after this list):
- --host 127.0.0.1 only; never bind to an external interface.
- Random high port (49152–65535).
- --log-disable so subprocess stdout/stderr don't carry request/response
logs. The b8987 binary doesn't accept --no-display-prompt (only
--log-disable); confirmed against llama-server --help and the spike.
Any future server-log tail in the UI gets re-scrubbed before display.
- --n-gpu-layers 0 (CPU) by default. Metal offload (-1) is opt-in via
the digest_local_n_gpu_layers config key: the b8987 prebuilt's Metal
kernels reference bf16 mat-mul ops that the drivers on pre-M5 / pre-A19
Apple Silicon don't ship, so default-off is the safe path.
See docs/llama-server-spike.md.
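Taken together, the flags above amount to an argv like the following
sketch. The spawn_locked_down helper and the binary path are assumptions
for illustration; --model, --mmproj, and --ctx-size are standard
llama-server flags, and the lockdown flags restate this module's policy:

```python
import random
import subprocess
from pathlib import Path

def spawn_locked_down(binary: Path, model: Path, mmproj: Path,
                      n_gpu_layers: int = 0) -> tuple[subprocess.Popen, int]:
    port = random.randint(49152, 65535)       # random high port
    argv = [
        str(binary),
        "--host", "127.0.0.1",                # loopback only, never external
        "--port", str(port),
        "--log-disable",                      # no request/response logs (b8987)
        "--n-gpu-layers", str(n_gpu_layers),  # 0 = CPU; -1 opts in to Metal
        "--model", str(model),
        "--mmproj", str(mmproj),
        "--ctx-size", "8192",
    ]
    proc = subprocess.Popen(argv,
                            stdout=subprocess.DEVNULL,  # belt and braces on
                            stderr=subprocess.DEVNULL)  # top of --log-disable
    return proc, port
```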
LocalServerError ¶
Bases: Exception
Base class for llama-server lifecycle and call failures.
LocalUnavailable ¶
Bases: LocalServerError
The server isn't running and couldn't be started, the bundled binary is missing, or the model GGUFs aren't on disk yet.
LocalTimeout ¶
Bases: LocalServerError
A request to the running server exceeded its timeout budget.
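The split exists so a digest caller can fall back instead of failing the
whole cycle. A sketch of that branching, where local_describe and
cloud_describe are hypothetical stand-ins for the real digest path:

```python
from pathlib import Path

from cue.llama_server import LocalTimeout, LocalUnavailable, get_or_start

def caption_with_fallback(model: Path, mmproj: Path, image: bytes) -> str:
    try:
        server = get_or_start(model_path=model, mmproj_path=mmproj)
        return local_describe(server, image)  # hypothetical digest call
    except LocalUnavailable:
        return cloud_describe(image)          # never came up: use the cloud
    except LocalTimeout:
        return cloud_describe(image)          # came up, but blew its budget
```

Catching LocalServerError alone covers both cases when the fallback
doesn't need to distinguish them.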
get_or_start ¶
get_or_start(*, model_path: Path, mmproj_path: Path, ctx_size: int = 8192, n_gpu_layers: int = 0) -> _Server
Return a healthy _Server configured for the given GGUFs. If the
existing singleton's config differs (e.g. a different model swapped in
via Settings), the old server is torn down and a new one is started.
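A usage sketch of that swap behaviour; the GGUF file names are
hypothetical:

```python
from pathlib import Path

from cue.llama_server import get_or_start

srv = get_or_start(model_path=Path("models/vision-q4.gguf"),
                   mmproj_path=Path("models/mmproj-f16.gguf"))
# Same config again: the live singleton is reused, no model reload.
srv = get_or_start(model_path=Path("models/vision-q4.gguf"),
                   mmproj_path=Path("models/mmproj-f16.gguf"))
# Different GGUFs (e.g. swapped in via Settings): the old server is torn
# down and the load cost is paid once for the replacement.
srv = get_or_start(model_path=Path("models/other-vision-q4.gguf"),
                   mmproj_path=Path("models/mmproj-other-f16.gguf"))
```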