Eval dashboard

Every memo goes through three deterministic checks: schema validity, citation grounding (the quoted text must be a verbatim substring of the cited clause), and escalation consistency. Run from the CLI: uv run python -m eval.run_eval
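As an illustration of the citation grounding check, here is a minimal sketch. It assumes each citation carries a clause identifier and a quoted string. The field names clause_id and quote, the clauses mapping, and both function names are hypothetical and are not taken from eval.run_eval.

```python
def is_grounded(quote: str, clause_text: str) -> bool:
    """Return True when a non-empty quote appears verbatim inside the clause."""
    return bool(quote) and quote in clause_text


def citation_faithfulness(citations: list[dict], clauses: dict[str, str]) -> float:
    """Share of citations whose quote is a verbatim substring of its cited clause."""
    if not citations:
        # Assumption: a memo with no citations has nothing that can fail.
        return 1.0
    grounded = sum(
        is_grounded(c["quote"], clauses.get(c["clause_id"], ""))
        for c in citations
    )
    return grounded / len(citations)


# Hypothetical policy clause and two citations, one of them misquoted.
clauses = {"4.2": "Refunds are issued within 30 days of purchase."}
citations = [
    {"clause_id": "4.2", "quote": "within 30 days"},
    {"clause_id": "4.2", "quote": "within 60 days"},
]
score = citation_faithfulness(citations, clauses)
```

A plain substring test is deterministic and needs no model call. The trade-off is that differences in whitespace or punctuation count as failures unless both sides are normalized first.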

Latest baseline · gpt-5 · policy 2025-02-24

Citation faithfulness

100.0%

target ≥95%

Effective accuracy

90.0%

contested+escalate counts as ESCALATE

Category accuracy (strict)

83.3%

exact category match

Escalation recall

62.5%

target 100% on contested

Mean cost / draft

$0.02598

target <$0.05
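The targets listed above can be read as a simple pass/fail gate. The sketch below copies the latest baseline values and the three stated targets. The dictionary keys and the meets_target helper are hypothetical, and the dashboard may well compute this differently.

```python
# Values copied from the latest baseline; percentages are written as fractions.
latest = {
    "citation_faithfulness": 1.000,
    "escalation_recall": 0.625,
    "mean_cost_per_draft": 0.02598,
}

# (comparison, threshold) per metric, mirroring the stated targets.
targets = {
    "citation_faithfulness": (">=", 0.95),
    "escalation_recall": (">=", 1.00),
    "mean_cost_per_draft": ("<", 0.05),
}


def meets_target(value: float, op: str, threshold: float) -> bool:
    """Compare a metric value against its target using the given operator."""
    return value >= threshold if op == ">=" else value < threshold


results = {name: meets_target(latest[name], *targets[name]) for name in targets}
failing = [name for name, ok in results.items() if not ok]
print(failing)  # prints ['escalation_recall']
```

With these numbers, citation faithfulness and cost meet their targets, while escalation recall at 62.5% falls short of the 100% target.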

Committed baselines

Name	Timestamp	Citation faithfulness	Effective accuracy	Escalation recall	Mean cost / draft
phase1	5/21/2026, 8:57:14 PM	100.0%	86.7%	50.0%	$0.02657
phase1_with_guardrails	5/21/2026, 9:03:33 PM	100.0%	90.0%	62.5%	$0.02598
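To make the difference between the two committed baselines explicit, the following sketch subtracts one row from the other. The figures are copied from the table. The dictionary layout, the short metric keys, and the delta function are assumptions made for illustration only.

```python
# Rows copied from the committed baselines (percentages, and USD per draft).
baselines = {
    "phase1": {
        "faithful": 100.0, "effective": 86.7, "esc_recall": 50.0, "cost": 0.02657,
    },
    "phase1_with_guardrails": {
        "faithful": 100.0, "effective": 90.0, "esc_recall": 62.5, "cost": 0.02598,
    },
}


def delta(before: str, after: str) -> dict[str, float]:
    """Per-metric change between two baselines (after minus before)."""
    old, new = baselines[before], baselines[after]
    # Round to 5 places to hide float noise while keeping the cost precision.
    return {metric: round(new[metric] - old[metric], 5) for metric in old}


change = delta("phase1", "phase1_with_guardrails")
print(change)
```

On these figures, the guardrails run raises effective accuracy by 3.3 points and escalation recall by 12.5 points, leaves citation faithfulness unchanged, and lowers mean cost per draft by $0.00059.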