
llama-server multimodal spike — measured values

Generated by scripts/spike_llama_mtmd.py against the b8987 prebuilt binary on a maintainer's M2 Pro / macOS 13.5 / 16 GB RAM. Commit 6 (LocalVisionBackend) cites this contract; commit 9 (eval) measures real latency on the 30-fixture digest set.

Pinned contract

| Question | Answer |
| --- | --- |
| Endpoint path | /v1/chat/completions (OpenAI-compatible) |
| Image payload | {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}} |
| Image order | Preserved across frames as listed in messages[0].content |
| Healthcheck | GET /health returns 200 once model + mmproj loaded |
| Streaming | stream: false returns choices[0].message.content (text) |
| Max images per call | At least 10 (production budget); higher untested |
| Server boot time | ~25 s on CPU (M2 Pro). Faster on M5+ Metal. |
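
For concreteness, a minimal sketch of a request that matches the pinned contract: data-URI JPEG images, order preserved in messages[0].content, stream: false, result read from choices[0].message.content. The helper names (data_uri, build_request, digest) and the prompt/frame bytes are illustrative, not taken from the spike script.

import base64
import json
import urllib.request

def data_uri(jpeg_bytes: bytes) -> str:
    # Pinned payload shape: base64 JPEG wrapped in a data: URL.
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")

def build_request(frames: list[bytes], prompt: str, max_tokens: int = 256) -> dict:
    # Frame order in this list is exactly the order the model sees them in.
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": data_uri(f)}} for f in frames]
    return {
        "messages": [{"role": "user", "content": content}],
        "stream": False,          # contract: non-streaming, read choices[0].message.content
        "max_tokens": max_tokens,
    }

def digest(port: int, frames: list[bytes], prompt: str) -> str:
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(build_request(frames, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]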

Measured latencies

The spike uses 256 × 256 synthetic JPEGs and max_tokens=5. Production input is 896-px keyframes with a multi-hundred-token digest prompt, so only the contract above is pinned here; the latency budget gets re-measured once the eval harness (commit 9) lands.

{
  "binary": "vendor/llama-bin/llama-b8987/llama-server",
  "n_gpu_layers": 0,
  "n_gpu_layers_note": "CPU on M2 Pro — pre-M5 Metal kernels miss bf16 ops in this prebuilt",
  "health_ready": true,
  "endpoint": "/v1/chat/completions",
  "cold_no_images_s": 0.23,
  "warm_1_image_s": 0.28,
  "warm_5_images_s": 10.44,
  "warm_10_images_s": 13.76
}
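
A sketch of how the warm rows above could be reproduced: a flat 256 × 256 synthetic JPEG, max_tokens=5, and wall-clock time around one non-streaming call. It reuses build_request from the sketch under "Pinned contract", assumes Pillow is available for the synthetic frame, and mirrors the spike's setup without being the spike script itself. The cold row is the first call after boot with no images; the warm rows are subsequent calls with 1, 5, and 10 frames.

import io
import json
import time
import urllib.request
from PIL import Image  # assumption: Pillow is available for synthetic frames

def synthetic_jpeg(size: int = 256) -> bytes:
    # Flat grey frame, same dimensions the spike used.
    buf = io.BytesIO()
    Image.new("RGB", (size, size), (90, 90, 90)).save(buf, format="JPEG")
    return buf.getvalue()

def timed_call(port: int, n_images: int) -> float:
    # build_request() is the helper from the "Pinned contract" sketch above.
    body = build_request([synthetic_jpeg()] * n_images, "Describe the frames.", max_tokens=5)
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - t0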

Hardware-compat finding (pre-M5 Metal)

The b8987 prebuilt's Metal kernels include bf16 mat-mul variants that the GPU drivers on pre-M5 / pre-A19 Apple Silicon don't ship. With --n-gpu-layers -1 the server aborts during warmup with:

ggml_metal_library_compile_pipeline: failed to compile pipeline:
  base = 'kernel_mul_mv_ext_bf16_f32_r1_2',
  name = 'kernel_mul_mv_ext_bf16_f32_r1_2_nsg=2_nxpsg=16'
ggml_metal_library_compile_pipeline: Function kernel_mul_mv_ext_bf16_f32_r1_2
  was not found in the library

The fix is --n-gpu-layers 0 (CPU only) or building llama.cpp from source without bf16 ops. Commit 6 picks the GPU layer count from platform.mac_ver() + platform.machine(): M5+ → -1, otherwise 0 (sketched below). The Settings UI in commit 7 surfaces this as a status banner so users on older hardware don't blame Cue when local digests run slowly.
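
A hedged sketch of that platform gate. Commit 6 is described as keying off platform.mac_ver() and platform.machine(); since platform alone doesn't expose the chip generation, the sketch below approximates it by reading sysctl machdep.cpu.brand_string. Treat the function name and the detection path as illustrative assumptions, not the actual commit 6 code.

import platform
import re
import subprocess

def pick_n_gpu_layers() -> int:
    # Non-macOS or non-Apple-Silicon: stay on CPU (0), the conservative default.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return 0
    try:
        brand = subprocess.check_output(
            ["sysctl", "-n", "machdep.cpu.brand_string"], text=True
        ).strip()  # e.g. "Apple M2 Pro"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return 0
    m = re.search(r"\bM(\d+)\b", brand)
    # M5 or newer: the prebuilt's bf16 Metal kernels are expected to compile,
    # so offload everything (-1). Anything older: CPU only (0).
    return -1 if m and int(m.group(1)) >= 5 else 0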

Run command (pinned)

llama-server \
    --model "$MODEL" \
    --mmproj "$MMPROJ" \
    --host 127.0.0.1 \
    --port "$PORT" \
    --ctx-size 8192 \
    --n-gpu-layers 0 \
    --no-context-shift

Production additions tracked in commit 6 (see the launch sketch below):

- --log-disable — suppresses request / response access logs. The b8987 prebuilt does NOT accept --no-display-prompt (verified against llama-server --help and the production-path smoke test); the only log-suppression flag the binary surfaces is --log-disable.
- A random high port chosen at runtime, not 8080.
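
Putting the pinned command and the production additions together, a sketch of a launcher that picks a random high port, adds --log-disable, and blocks on /health until the model + mmproj are loaded. It reuses pick_n_gpu_layers from the sketch above; the 60-second deadline and the function names are placeholders, not commit 6 code.

import socket
import subprocess
import time
import urllib.request

LLAMA_BIN = "vendor/llama-bin/llama-b8987/llama-server"

def free_port() -> int:
    # Let the OS hand out an ephemeral port instead of hard-coding 8080.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def launch(model_path: str, mmproj_path: str) -> tuple[subprocess.Popen, int]:
    port = free_port()
    proc = subprocess.Popen([
        LLAMA_BIN,
        "--model", model_path,
        "--mmproj", mmproj_path,
        "--host", "127.0.0.1",
        "--port", str(port),
        "--ctx-size", "8192",
        "--n-gpu-layers", str(pick_n_gpu_layers()),  # sketch from the hardware-compat section
        "--no-context-shift",
        "--log-disable",  # the only log-suppression flag the b8987 prebuilt accepts
    ])
    deadline = time.monotonic() + 60  # boot is ~25 s on CPU; 60 s is an arbitrary margin
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2) as resp:
                if resp.status == 200:
                    return proc, port
        except OSError:
            pass  # not listening yet, or model + mmproj still loading
        time.sleep(0.5)
    proc.terminate()
    raise RuntimeError("llama-server did not report healthy before the deadline")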