# llama-server multimodal spike — measured values
Generated by scripts/spike_llama_mtmd.py against the b8987 prebuilt
binary on a maintainer's M2 Pro / macOS 13.5 / 16 GB RAM. Commit 6
(LocalVisionBackend) cites this contract; commit 9 (eval) measures
real latency on the 30-fixture digest set.
## Pinned contract
| Question | Answer |
|---|---|
| Endpoint path | /v1/chat/completions (OpenAI-compatible) |
| Image payload | {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}} |
| Image order | Preserved across frames as listed in messages[0].content |
| Healthcheck | GET /health returns 200 once model + mmproj loaded |
| Streaming | stream: false returns choices[0].message.content (text) |
| Max images per call | At least 10 (production budget); higher untested |
| Server boot time | ~25 s on CPU (M2 Pro). Faster on M5+ Metal. |
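
A minimal client sketch of the contract above, assuming the server is already healthy on a known local port (8080 here for illustration; production picks a random high port) and that the frame paths and prompt are placeholders. Standard library only.

```python
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:8080"  # illustration only; production uses a random high port

def data_url(jpeg_path: str) -> str:
    """Base64-encode a JPEG into the data: URL form the contract pins."""
    with open(jpeg_path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def digest(frame_paths: list[str], prompt: str) -> str:
    """One non-streaming /v1/chat/completions call; image order follows frame_paths."""
    content = [
        {"type": "image_url", "image_url": {"url": data_url(p)}} for p in frame_paths
    ]
    content.append({"type": "text", "text": prompt})
    body = json.dumps({
        "messages": [{"role": "user", "content": content}],
        "stream": False,
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{BASE}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Hypothetical usage: list order is the order the model sees the frames.
print(digest(["frame_001.jpg", "frame_002.jpg"], "Summarise what changed across these frames."))
```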
## Measured latencies
The spike uses 256 × 256 synthetic JPEGs and max_tokens=5. Production
input is 896-px keyframes with a multi-hundred-token digest prompt, so
only the contract above is pinned here; the latency budget gets
re-measured once the eval harness lands in commit 9.
```json
{
  "binary": "vendor/llama-bin/llama-b8987/llama-server",
  "n_gpu_layers": 0,
  "n_gpu_layers_note": "CPU on M2 Pro — pre-M5 Metal kernels miss bf16 ops in this prebuilt",
  "health_ready": true,
  "endpoint": "/v1/chat/completions",
  "cold_no_images_s": 0.23,
  "warm_1_image_s": 0.28,
  "warm_5_images_s": 10.44,
  "warm_10_images_s": 13.76
}
```
## Hardware-compat finding (pre-M5 Metal)
The b8987 prebuilt's Metal kernels include bf16 mat-mul variants that
the GPU drivers on pre-M5 / pre-A19 Apple Silicon don't ship. With
--n-gpu-layers -1 the server aborts during warmup with:
```text
ggml_metal_library_compile_pipeline: failed to compile pipeline:
    base = 'kernel_mul_mv_ext_bf16_f32_r1_2',
    name = 'kernel_mul_mv_ext_bf16_f32_r1_2_nsg=2_nxpsg=16'
ggml_metal_library_compile_pipeline: Function kernel_mul_mv_ext_bf16_f32_r1_2 was not found in the library
```
The fix is --n-gpu-layers 0 (CPU only) or building llama.cpp from source
without bf16 ops. Commit 6 picks the GPU layer count based on
platform.mac_ver() + platform.machine(): M5+ → -1, otherwise 0, as
sketched below. The Settings UI in commit 7 surfaces this as a status
banner so users on older hardware don't blame Cue when local digests
run slowly.
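
A sketch of that selection logic. The function name is hypothetical, and the sysctl brand-string read is an assumed way to recover the chip generation on arm64 Macs; commit 6's actual check may differ.

```python
import platform
import re
import subprocess

def pick_n_gpu_layers() -> int:
    """Return -1 (all layers on Metal) on M5-or-newer Apple Silicon, else 0 (CPU only)."""
    # Non-macOS or Intel Macs: CPU only, matching the pinned run command.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return 0
    try:
        brand = subprocess.check_output(
            ["sysctl", "-n", "machdep.cpu.brand_string"], text=True
        ).strip()  # e.g. "Apple M2 Pro"
    except (OSError, subprocess.CalledProcessError):
        return 0
    match = re.search(r"Apple M(\d+)", brand)
    # Pre-M5 Metal in the b8987 prebuilt trips the bf16 kernel failure above.
    return -1 if match and int(match.group(1)) >= 5 else 0
```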
## Run command (pinned)
```sh
llama-server \
  --model "$MODEL" \
  --mmproj "$MMPROJ" \
  --host 127.0.0.1 \
  --port "$PORT" \
  --ctx-size 8192 \
  --n-gpu-layers 0 \
  --no-context-shift
```
Production additions tracked in commit 6:
- --log-disable — suppress request / response access logs. The b8987
  prebuilt does NOT accept --no-display-prompt (verified against
  llama-server --help and the production-path smoke test); the only
  log-suppression flag the binary surfaces is --log-disable.
- Random high port chosen at runtime, not 8080 (see the launch sketch below).
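
A launch sketch combining the pinned command with both production additions. The binary path reuses the one from the measured-values block; the port-selection approach and the 60-second health-poll timeout are assumptions, not commit 6's exact code.

```python
import socket
import subprocess
import time
import urllib.request

def free_high_port() -> int:
    """Ask the OS for an ephemeral port instead of hard-coding 8080."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def launch(model: str, mmproj: str, n_gpu_layers: int = 0, timeout_s: float = 60.0):
    """Start llama-server with the pinned flags plus --log-disable, then wait for /health."""
    port = free_high_port()
    proc = subprocess.Popen([
        "vendor/llama-bin/llama-b8987/llama-server",
        "--model", model,
        "--mmproj", mmproj,
        "--host", "127.0.0.1",
        "--port", str(port),
        "--ctx-size", "8192",
        "--n-gpu-layers", str(n_gpu_layers),
        "--no-context-shift",
        "--log-disable",
    ])
    # GET /health returns 200 once model + mmproj are loaded (~25 s cold on CPU).
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
                if resp.status == 200:
                    return proc, port
        except OSError:
            pass
        time.sleep(0.5)
    proc.terminate()
    raise TimeoutError("llama-server did not report healthy in time")
```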