Eval & smoke

cue.eval.digest_eval is a dev-only harness that compares the local Gemma 4 backend against the cloud Haiku backend over a set of fixture sessions. Its outputs gate the on-device default flip.

What it does

For each fixture (a short captured session with up to 10 keyframes + a timeline JSON):

  1. Runs LocalVisionBackend and CloudVisionBackend independently.
  2. Captures latency, output length, error / timeout flags, the scrubbed summary, and residual PII entities (a second-pass scrub_strict that catches leakage the input scrub missed); see the record sketch after this list.
  3. Aggregates per-backend stats: p95 / median latency, success count, timeout count, PII-leakage count, average output chars.
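
In rough Python terms, each (fixture, backend) run yields one record. The sketch below is illustrative only: apart from fixture_id, backend, and latency_s (which appear in the JSON output further down), the field names are assumptions, not the harness's actual schema.

from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    # Illustrative record shape; only fixture_id, backend, and
    # latency_s are confirmed by the JSON output below.
    fixture_id: str
    backend: str          # "local" or "cloud"
    latency_s: float      # wall-clock time for the backend call
    output_chars: int     # length of the scrubbed summary
    timed_out: bool
    error: str | None
    summary: str          # scrubbed digest text
    residual_pii: list[str] = field(default_factory=list)  # second-pass scrub_strict hits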

The default-flip check verifies these thresholds:

  • PII leakage on local — must be zero.
  • Local p95 latency — ≤ 14 s.
  • Local timeout rate — ≤ 5 %.
  • Local output length — average ≥ 20 chars (the model produced something).

If all pass, the gate reports ready_for_default_flip: true and the next release can flip digest_backend from cloud to local.
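
The gate itself is plain threshold arithmetic. A minimal sketch, assuming stats keys n_pii_leaks, n_timeouts, n_total, and avg_output_chars (only p95_latency_s and the local_p95_latency_ok check name are confirmed by the JSON output below):

def check_default_flip(local_stats: dict) -> dict:
    # Thresholds from the list above. Key names other than
    # p95_latency_s are assumptions about the stats dict.
    checks = {
        "local_pii_leakage_ok": local_stats["n_pii_leaks"] == 0,
        "local_p95_latency_ok": local_stats["p95_latency_s"] <= 14.0,
        # <= 5 % timeout rate, written multiplicatively to avoid divide-by-zero
        "local_timeout_rate_ok": local_stats["n_timeouts"] <= 0.05 * local_stats["n_total"],
        "local_output_length_ok": local_stats["avg_output_chars"] >= 20,
    }
    return {"ready_for_default_flip": all(checks.values()), "checks": checks}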

How to run it

From the repo root:

uv run python -m cue.eval.digest_eval --fixtures tests/fixtures/digest_eval

Smoke fixtures (3 sessions) ship in-tree under tests/fixtures/digest_eval/. The full 30-fixture set is captured separately by the maintainer (real screen recordings are too sensitive to ship in the repo).

Output is written to eval/digest_eval_<date>.json for audit.

Fixture format

Each fixture is a JSON file:

{
  "id": "session_001",
  "label": "code review in PR diff",
  "keyframes": ["./keyframes/0.jpg", "./keyframes/1.jpg"],
  "timeline": [
    {"topic": "window", "ts_ns": 1000000000,
     "payload": {"application": "Chrome", "title": "..."}},
    {"topic": "keyboard", "ts_ns": 2000000000,
     "payload": {"event_type": "press", "vk": 0x41}}
  ],
  "expected_topics": ["code review", "PR"]
}

Keyframe paths are resolved relative to the fixture file, as sketched below. expected_topics is optional; it is reserved for future missed-activity scoring.
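
A minimal loader sketch for the relative-path handling (load_fixture is our name for illustration, not the harness API):

import json
from pathlib import Path

def load_fixture(path: str | Path) -> dict:
    # Hypothetical loader: resolves each keyframe path against
    # the fixture file's own directory.
    path = Path(path)
    fixture = json.loads(path.read_text())
    fixture["keyframes"] = [(path.parent / kf).resolve() for kf in fixture["keyframes"]]
    return fixture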

Reading the JSON output

{
  "generated_at": "2026-04-30T18:19:19Z",
  "fixture_count": 3,
  "records": [
    {"fixture_id": "smoke_blue", "backend": "cloud", "latency_s": 1.76, ...},
    {"fixture_id": "smoke_blue", "backend": "local", "latency_s": 34.70, ...}
  ],
  "stats": {
    "cloud": {"p95_latency_s": 1.92, "n_succeeded": 3, ...},
    "local": {"p95_latency_s": 34.70, "n_succeeded": 3, ...}
  },
  "ready_for_default_flip": false,
  "checks": {
    "local_p95_latency_ok": false,
    ...
  }
}
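
To see which gate checks blocked a given run, something like this works against the report file (the path is an example; substitute the actual date):

import json
from pathlib import Path

# Example path; the harness writes eval/digest_eval_<date>.json.
report = json.loads(Path("eval/digest_eval_2026-04-30.json").read_text())
if not report["ready_for_default_flip"]:
    failed = [name for name, ok in report["checks"].items() if not ok]
    print("blocking checks:", ", ".join(failed))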

Current state

On Apple Silicon, the smoke set shows local p95 in the 22–35 s range, well over the 14 s threshold. That's expected: the b8987 prebuilt's Metal kernels lack bf16 ops on pre-M5 / pre-A19 Apple Silicon, so on-device inference falls back to CPU. See On-device vision.

When the maintainer's hardware (or the model) clears that bar, the eval will flip ready_for_default_flip: true and the config default can be changed.

Smoke vs full run

  • Smoke — 3 fixtures; runs in CI on every push to main as a regression sentinel.
  • Full — 30 fixtures; run manually by the maintainer when considering the default flip. Outputs are committed under eval/ for audit.

See also