From 4ed268502cfa581c2b9b7bd5e2ae6ee334e28ffd Mon Sep 17 00:00:00 2001
From: himmi-01
Date: Wed, 3 Jun 2026 00:27:36 -0700
Subject: [PATCH] docs: add EvalMonkey Sonnet 4.5 benchmark badge

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 78d7dcaa..9017827a 100644
--- a/README.md
+++ b/README.md
@@ -74,6 +74,12 @@ flowchart TB
 - **Comprehensive Reports**: Produces detailed markdown reports with findings and sources
 - **Concurrent Processing**: Handles multiple searches and result processing in parallel for efficiency
 
+## 📊 EvalMonkey Benchmark Results (Claude Sonnet 4.5)
+
+[![EvalMonkey Reliability](https://img.shields.io/badge/Production%20Reliability-Score%3A46.2-orange)](https://github.com/Corbell-AI/evalmonkey)
+
+*This agent scored a Production Reliability of **46.2/100** when benchmarked on Claude Sonnet 4.5 across HotpotQA, TruthfulQA, and MMLU with adversarial chaos profiles (prompt injection & schema mutation) by [EvalMonkey](https://github.com/Corbell-AI/evalmonkey).*
+
 ## Requirements
 
 - Node.js environment