Eval dashboard
Deterministic checks run on every memo: schema validity, citation grounding (each quote must be a verbatim substring of the cited clause), and escalation consistency. Run from the CLI: `uv run python -m eval.run_eval`
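The citation grounding check can be sketched as a plain substring test. This is a minimal illustration, not the code in `eval.run_eval`: the names `is_grounded` and `citation_faithfulness`, the `(clause_id, quote)` pair shape, and the clause lookup dict are all assumptions made for the example.

```python
from typing import Mapping, Sequence, Tuple


def is_grounded(quote: str, clause_text: str) -> bool:
    """Return True when a non-empty quote appears verbatim in the clause text."""
    return bool(quote) and quote in clause_text


def citation_faithfulness(
    citations: Sequence[Tuple[str, str]],  # hypothetical shape: (clause_id, quote)
    clauses: Mapping[str, str],            # hypothetical shape: clause_id -> clause text
) -> float:
    """Fraction of citations whose quote is a verbatim substring of the cited clause."""
    if not citations:
        raise ValueError("no citations to score")
    # A citation pointing at an unknown clause id is treated as ungrounded.
    grounded = sum(is_grounded(quote, clauses.get(cid, "")) for cid, quote in citations)
    return grounded / len(citations)
```

Because the test is exact, any paraphrase, whitespace change, or case change in a quote counts as ungrounded. That strictness is what keeps the check deterministic.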
Latest baseline · gpt-5 · policy 2025-02-24
Citation faithfulness
100.0%
target ≥95%
Effective accuracy
90.0%
contested+escalate counts as ESCALATE
Category accuracy (strict)
83.3%
exact category match
Escalation recall
62.5%
target 100% on contested
Mean cost / draft
$0.02598
target <$0.05
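The three accuracy-style metrics above can be sketched as follows, under one plausible reading of the card notes: strict accuracy requires an exact category match, effective accuracy also accepts an escalated draft on a contested case, and escalation recall is the share of contested cases that were escalated. The `Case` record and its field names are hypothetical, not taken from the eval code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Case:
    # Hypothetical record for one evaluated memo.
    gold_category: str
    predicted_category: str
    contested: bool   # gold label marks the case as contested
    escalated: bool   # the draft chose to escalate


def strict_accuracy(cases: Sequence[Case]) -> float:
    """Exact category match only. Assumes a non-empty case list."""
    return sum(c.predicted_category == c.gold_category for c in cases) / len(cases)


def effective_accuracy(cases: Sequence[Case]) -> float:
    """Exact match, or an escalation on a contested case (assumed reading)."""
    hits = sum(
        (c.contested and c.escalated) or c.predicted_category == c.gold_category
        for c in cases
    )
    return hits / len(cases)


def escalation_recall(cases: Sequence[Case]) -> float:
    """Share of contested cases that were escalated. Assumes at least one contested case."""
    contested = [c for c in cases if c.contested]
    return sum(c.escalated for c in contested) / len(contested)
```

Under this reading effective accuracy can never be lower than strict accuracy, which matches the ordering of the two figures shown on the cards.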
Committed baselines
| Name | Timestamp | Citation faithfulness | Effective accuracy | Escalation recall | Mean cost / draft |
|---|---|---|---|---|---|
| phase1 | 5/21/2026, 8:57:14 PM | 100.0% | 86.7% | 50.0% | $0.02657 |
| phase1_with_guardrails | 5/21/2026, 9:03:33 PM | 100.0% | 90.0% | 62.5% | $0.02598 |