Offline MVES benchmark • 560 questions
Mini-FRED Agent v1
Baseline deterministic agent scored against DuckDB-grounded truth.
Benchmark Index (proxy): 657.4
Success Rate: 65.9%
Critical flags (assertion): 441
Pass / Fail: 369 passed, 191 failed
Top failure modes
Baseline parser + deterministic DuckDB truth checks.
BenchmarkIndex = round((passRate * 10) - (criticalFailures/totalCases) * 2, 1)
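As a sanity check on the formula, here is a minimal Python sketch. It assumes that passRate is expressed in percent (65.9, not 0.659) and that criticalFailures is the 441 critical flags reported above. Both are inferences from the published numbers, which only reproduce 657.4 under that reading. The function and parameter names are illustrative, not taken from the benchmark code.

```python
def benchmark_index(passed: int, total: int, critical_flags: int) -> float:
    """Proxy index: pass rate (in percent) times 10, minus a small critical-flag penalty."""
    pass_rate = passed / total * 100       # assumption: percent scale, e.g. 65.89...
    penalty = critical_flags / total * 2   # assumption: critical flags per case, doubled
    return round(pass_rate * 10 - penalty, 1)


# Figures reported on this page: 369 passed of 560, 441 critical flags.
index = benchmark_index(passed=369, total=560, critical_flags=441)
print(index)  # 657.4
```

Under this reading the penalty term is at most a few points against a base of up to 1000, so the index tracks the pass rate almost entirely.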
Executive takeaway: Baseline reliability with clear transform errors.
What changed
- Baseline parser + deterministic DuckDB truth checks
- Single-pass transform selection
- No intent normalization yet
Primary remaining issue
Wrong computation path: the agent misidentifies the requested transform, so it retrieves the correct data but applies the wrong calculation.
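To illustrate this failure mode, the sketch below applies two different transforms to the same series. The values and the two transform functions are hypothetical and are not drawn from the benchmark. The point is only that choosing the wrong transform yields a plausible but incorrect answer even when the underlying data is right.

```python
# Hypothetical series values for illustration only (not benchmark data).
values = [100.0, 102.0, 101.0, 104.0]


def pct_change(series: list[float]) -> list[float]:
    """Period-over-period percent change."""
    return [round((b - a) / a * 100, 2) for a, b in zip(series, series[1:])]


def diff(series: list[float]) -> list[float]:
    """Period-over-period absolute difference."""
    return [round(b - a, 2) for a, b in zip(series, series[1:])]


# Suppose the question asks for the latest percent change.
requested = pct_change(values)[-1]  # correct transform
selected = diff(values)[-1]         # transform picked by mistake

print(requested, selected)  # 2.97 3.0
```

The two results are close enough to look reasonable, which is why a check against ground truth is needed to catch the mismatch.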