Eval & smoke¶
cue.eval.digest_eval is a dev-only harness that compares the
local Gemma 4 backend against the cloud Haiku backend over a set
of fixture sessions. Its outputs gate the on-device default flip.
What it does¶
For each fixture (a short captured session with up to 10 keyframes + a timeline JSON):
- Runs LocalVisionBackend and CloudVisionBackend independently.
- Captures: latency, output length, error / timeout flag, the scrubbed summary, and residual PII entities (a second-pass scrub_strict detects leakage the input scrub didn't catch).
- Aggregates per-backend stats: p95 / median latency, success count, timeout count, PII-leakage count, average output chars.
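The aggregation step can be sketched roughly as below — a minimal version that assumes each run record is a dict with latency_s, timed_out, error, summary, and pii_entities fields (illustrative names, not necessarily the harness's actual schema):

```python
import math
import statistics

def aggregate(records):
    """Aggregate per-backend stats from a list of run records.

    Each record is assumed to look like:
    {"latency_s": float, "timed_out": bool, "error": bool,
     "summary": str, "pii_entities": list}
    """
    latencies = sorted(r["latency_s"] for r in records if not r["error"])

    def p95(xs):
        # nearest-rank 95th percentile over the successful runs
        return xs[math.ceil(0.95 * len(xs)) - 1] if xs else None

    return {
        "p95_latency_s": p95(latencies),
        "median_latency_s": statistics.median(latencies) if latencies else None,
        "n_succeeded": sum(1 for r in records if not r["error"]),
        "n_timed_out": sum(1 for r in records if r["timed_out"]),
        "n_pii_leaks": sum(1 for r in records if r["pii_entities"]),
        "avg_output_chars": (
            sum(len(r["summary"]) for r in records) / len(records)
            if records else 0
        ),
    }
```

The nearest-rank p95 is one common choice; the harness may interpolate instead.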
The default-flip check verifies these thresholds:
- PII leakage on local — must be zero.
- Local p95 latency — ≤ 14 s.
- Local timeout rate — ≤ 5 %.
- Local output length — average ≥ 20 chars (the model produced something).
If all pass, the gate reports ready_for_default_flip: true and
the next release can flip digest_backend from cloud to local.
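The four thresholds above can be expressed as a small check function. This is a hedged sketch — the stat field names are assumptions carried over from the stats section, not the harness's actual schema:

```python
def default_flip_checks(local_stats, n_local_runs):
    """Evaluate the default-flip gate against local-backend stats.

    local_stats is assumed to be a dict with n_pii_leaks,
    p95_latency_s, n_timed_out, and avg_output_chars keys.
    Returns (per-check results, overall readiness).
    """
    checks = {
        # PII leakage on local must be zero
        "local_pii_leakage_ok": local_stats["n_pii_leaks"] == 0,
        # local p95 latency must be <= 14 s
        "local_p95_latency_ok": local_stats["p95_latency_s"] <= 14.0,
        # local timeout rate must be <= 5 %
        "local_timeout_rate_ok": (
            local_stats["n_timed_out"] / n_local_runs <= 0.05
            if n_local_runs else False
        ),
        # average output must be >= 20 chars
        "local_output_length_ok": local_stats["avg_output_chars"] >= 20,
    }
    return checks, all(checks.values())
```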
How to run it¶
From the repo root:
uv run python -m cue.eval.digest_eval --fixtures tests/fixtures/digest_eval
Smoke fixtures (3 sessions) ship in-tree under
tests/fixtures/digest_eval/. The full 30-fixture set is captured
separately by the maintainer (real screen recordings are too
sensitive to ship in the repo).
Output is written to eval/digest_eval_<date>.json for audit.
Fixture format¶
Each fixture is a JSON file:
{
"id": "session_001",
"label": "code review in PR diff",
"keyframes": ["./keyframes/0.jpg", "./keyframes/1.jpg"],
"timeline": [
{"topic": "window", "ts_ns": 1000000000,
"payload": {"application": "Chrome", "title": "..."}},
{"topic": "keyboard", "ts_ns": 2000000000,
"payload": {"event_type": "press", "vk": 0x41}}
],
"expected_topics": ["code review", "PR"]
}
Keyframe paths are relative to the fixture file. expected_topics
is optional and reserved for future missed-activity scoring.
Reading the JSON output¶
{
"generated_at": "2026-04-30T18:19:19Z",
"fixture_count": 3,
"records": [
{"fixture_id": "smoke_blue", "backend": "cloud", "latency_s": 1.76, ...},
{"fixture_id": "smoke_blue", "backend": "local", "latency_s": 34.70, ...}
],
"stats": {
"cloud": {"p95_latency_s": 1.92, "n_succeeded": 3, ...},
"local": {"p95_latency_s": 34.70, "n_succeeded": 3, ...}
},
"ready_for_default_flip": false,
"checks": {
"local_p95_latency_ok": false,
...
}
}
Current state¶
The smoke set on Apple Silicon CPU shows local p95 in the 22-35 s range — well over the 14 s threshold. That's expected: the b8987 prebuilt's Metal kernels miss bf16 ops on pre-M5 / pre-A19 Apple Silicon, so on-device inference falls back to CPU. See On-device vision.
When the maintainer's hardware (or the model) clears that bar,
the eval will flip ready_for_default_flip: true and the
config default can be changed.
Smoke vs full run¶
- Smoke — 3 fixtures; runs in CI on every push to main as a regression sentinel.
- Full — 30 fixtures; run by hand by the maintainer when considering the default flip. Outputs are committed under eval/ for audit.
See also¶
- cue.eval.digest_eval — module reference (deferred; the source is in src/cue/eval/digest_eval.py).
- Digest pipeline — what the harness compares.
- On-device vision — why the local side is currently held back.