feat: Add llm-eval-harness kit #179
Open: Tharun2511 wants to merge 7 commits into Lamatic:main from Tharun2511:feat/llm-eval-harness
+7,277
−0
Changes from 4 commits
5ccb848 feat: add llm-eval-harness kit (flows + Next.js app) (Tharun2511)
6538d9f feat: finalise llm-eval-harness kit (flows + metadata) (Tharun2511)
a8e6d27 feat: premium UI redesign + live validation + robust judge parsing (Tharun2511)
0648275 chore: enforce type-checking and harden golden-set validation (Tharun2511)
22fac39 fix: address review feedback (prune unused UI, harden validation, con… (Tharun2511)
b9a6355 refactor: use react-hook-form + zod for the form (repo standard) (Tharun2511)
4177814 fix: forwardRef UI components and trim-validate golden-set strings (Tharun2511)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| .lamatic/ | ||
| node_modules/ | ||
| .next/ | ||
| .env | ||
| .env.local |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| # LLM Eval Harness | ||
|
|
||
| A ready-to-deploy kit that scores an LLM prompt against a **golden set** using an **LLM-as-judge**, then applies a **CI-style pass/fail gate** — so you can catch quality regressions *before* they ship. | ||
|
|
||
| > Point it at any system prompt, give it a handful of test cases with expected criteria, and it tells you whether the prompt's outputs are faithful, relevant, and correct — with a single GATE PASSED / GATE FAILED verdict. | ||
|
|
||
| --- | ||
|
|
||
| ## The problem | ||
|
|
||
| When you ship an LLM feature and then tweak a prompt or swap a model, output quality can silently regress — a small wording change makes the model hallucinate, over-promise, or drift off-task, and you don't find out until a user does. Eyeballing a few outputs doesn't scale and isn't repeatable. | ||
|
|
||
| Teams solve this with an **evaluation harness**: a fixed set of representative inputs (a *golden set*), an automated grader, and a quality bar that must be met to ship. This kit packages that pattern as a hosted, reusable tool on Lamatic. | ||
|
|
||
| ## The approach | ||
|
|
||
| For each case in the golden set, the kit runs two flows: | ||
|
|
||
| 1. **`run-target`** — sends your system-prompt-under-test + the case input to an LLM and captures the output (the *system under test*). | ||
| 2. **`judge`** — an LLM-as-judge scores that output against the case's `criteria` (and optional `reference`) on three dimensions, **0–5** each: | ||
| - **Faithfulness** — is every claim grounded? (hallucination is penalised hard — it's a veto) | ||
| - **Relevancy** — does it actually address the input? | ||
| - **Correctness** — does it satisfy the case criteria? | ||
|
|
||
| The app aggregates the per-case verdicts into a **pass rate** and compares it to a threshold you set (default **90%**) to produce the gate. A case **passes** only if `overall ≥ 3.5` **and** `faithfulness ≥ 3`. | ||
|
|
||
| ``` | ||
| golden case ──▶ run-target (LLM) ──▶ output ──▶ judge (LLM-as-judge) ──▶ {scores, pass, reasoning} | ||
| │ | ||
| all cases ──▶ pass rate vs threshold ──▶ GATE PASS / FAIL | ||
| ``` | ||
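The pass rule and gate described above can be sketched as follows. This is a minimal illustration, not the kit's `lib/eval.ts`: the names `Scores`, `overallScore`, `casePasses`, and `gate` are hypothetical, `overall` is assumed to be the unweighted mean of the three dimensions (the README does not define it), and the gate is assumed to pass when the pass rate is greater than or equal to the threshold.

```typescript
// Hypothetical sketch of the gate described above; not the kit's actual code.
interface Scores {
  faithfulness: number // 0-5
  relevancy: number // 0-5
  correctness: number // 0-5
}

// ASSUMPTION: overall is the plain mean of the three dimensions.
function overallScore(s: Scores): number {
  return (s.faithfulness + s.relevancy + s.correctness) / 3
}

// A case passes only if overall >= 3.5 AND faithfulness >= 3,
// so a low faithfulness score vetoes an otherwise high-scoring case.
function casePasses(s: Scores): boolean {
  return overallScore(s) >= 3.5 && s.faithfulness >= 3
}

// Compare the pass rate (in percent) with the threshold (default 90).
// ASSUMPTION: ">=" comparison, and an empty run never passes the gate.
function gate(cases: Scores[], thresholdPct = 90): { passRate: number; gatePassed: boolean } {
  if (cases.length === 0) return { passRate: 0, gatePassed: false }
  const passed = cases.filter(casePasses).length
  const passRate = (passed / cases.length) * 100
  return { passRate, gatePassed: passRate >= thresholdPct }
}
```

Under these assumptions, a reply scored `{ faithfulness: 2, relevancy: 5, correctness: 5 }` has an overall of 4 but still fails, because the faithfulness floor is checked independently of the mean.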
|
|
||
| ## Results | ||
|
|
||
| - Runs entirely on **Lamatic flows** (Groq `llama-3.3-70b-versatile`, temperature 0 for repeatable scoring). | ||
| - The judge reliably **distinguishes good from bad output** — e.g. it fails a support reply that invents a refund against a "final-sale is non-refundable" policy (faithfulness 0), and passes a correct, grounded reply. | ||
| - Per-case results are expandable to show the generated output and the judge's reasoning, so a failure tells you *why*. | ||
|
|
||
| ## Tradeoffs & assumptions | ||
|
|
||
| - **Single provider (v1):** the flows use Groq. Lamatic stores model credentials at the project level, so multi-provider / bring-your-own-key was deliberately scoped out of v1 — runtime credential injection is a security tradeoff worth doing properly rather than quickly. | ||
| - **App-side loop:** the golden set is iterated in the Next.js server action (3 cases concurrently) rather than inside one flow, which keeps the flows simple and lets the UI surface per-case progress and errors. | ||
| - **Gate recomputed in code:** `overall` and `pass` are recomputed from the judge's dimension scores in the app, so the gate is deterministic and not dependent on the model's own arithmetic. | ||
| - **Defensive parsing:** judge output is tolerant of code fences and minor formatting; run-target output is HTML-entity-decoded before scoring. | ||
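As an illustration of the defensive-parsing tradeoff above, the sketch below extracts a JSON object from judge output that may be wrapped in a Markdown code fence or surrounded by stray prose. It is not the kit's `parseJudgeResult`; the names `JudgeScores`, `clampScore`, and `parseJudgeJson` are hypothetical. In line with the "gate recomputed in code" bullet, it reads only the three dimension scores and the reasoning, and ignores any `overall` or `pass` the model reports.

```typescript
// Hypothetical sketch; the kit's real parser is parseJudgeResult in lib/eval.ts.
interface JudgeScores {
  faithfulness: number
  relevancy: number
  correctness: number
  reasoning: string
}

// Accept numbers or numeric strings, reject everything else, clamp to 0-5.
function clampScore(value: unknown): number {
  const n =
    typeof value === "number"
      ? value
      : typeof value === "string" && value.trim() !== ""
        ? Number(value)
        : NaN
  if (!Number.isFinite(n)) throw new Error("Judge returned a non-numeric score")
  return Math.min(5, Math.max(0, n))
}

function parseJudgeJson(raw: unknown): JudgeScores {
  let parsed: unknown = raw
  if (typeof raw === "string") {
    // Keep only the outermost {...} span, which skips code fences and extra prose.
    const start = raw.indexOf("{")
    const end = raw.lastIndexOf("}")
    if (start === -1 || end <= start) throw new Error("No JSON object found in judge output")
    parsed = JSON.parse(raw.slice(start, end + 1))
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge output is not an object")
  }
  const fields = parsed as Record<string, unknown>
  return {
    faithfulness: clampScore(fields.faithfulness),
    relevancy: clampScore(fields.relevancy),
    correctness: clampScore(fields.correctness),
    reasoning: typeof fields.reasoning === "string" ? fields.reasoning : "",
  }
}
```

Clamping to the 0 to 5 range is a design choice of this sketch: an out-of-range score is treated as a formatting slip, while a missing or non-numeric score is treated as an error for that case.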
|
|
||
| --- | ||
|
|
||
| ## Flows | ||
|
|
||
| | Flow | Input | Output | | ||
| |------|-------|--------| | ||
| | `judge` | `{ input, output, criteria, reference? }` | `{ faithfulness, relevancy, correctness, overall, pass, reasoning }` | | ||
| | `run-target` | `{ systemPrompt, input }` | `{ answer }` (the generated output under test) | | ||
|
|
||
| ## Setup | ||
|
|
||
| ```bash | ||
| cd kits/llm-eval-harness/apps | ||
| cp .env.example .env.local # then fill in the values below | ||
| npm install | ||
| npm run dev # http://localhost:3000 | ||
| ``` | ||
|
|
||
| ### Environment variables | ||
|
|
||
| | Variable | Where to find it | | ||
| |----------|------------------| | ||
| | `JUDGE_FLOW` | Deploy the `judge` flow in Lamatic Studio → copy its Flow ID | | ||
| | `RUN_TARGET_FLOW` | Deploy the `run-target` flow → copy its Flow ID | | ||
| | `LAMATIC_API_URL` | Studio → Settings / API | | ||
| | `LAMATIC_PROJECT_ID` | Studio → Project settings | | ||
| | `LAMATIC_API_KEY` | Studio → API Keys | | ||
|
|
||
| ## Usage | ||
|
|
||
| 1. Paste the **system prompt** you want to evaluate. | ||
| 2. Provide a **golden set** as JSON — an array of `{ input, criteria, reference? }`. | ||
| 3. Set a **gate threshold** (default 90%). | ||
| 4. Click **Run evaluation** — or **Load example** to try a support-agent scenario. | ||
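The golden-set shape in step 2 can be checked before a run. The sketch below is a plain-TypeScript illustration; the kit itself reportedly validates the form with react-hook-form and zod (per the commit list), so the names `GoldenCaseSketch` and `parseGoldenSet` and the exact error messages here are hypothetical.

```typescript
// Hypothetical validator for the golden-set JSON described in step 2.
interface GoldenCaseSketch {
  input: string
  criteria: string
  reference?: string
}

function parseGoldenSet(json: string): GoldenCaseSketch[] {
  const data: unknown = JSON.parse(json)
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error("Golden set must be a non-empty JSON array")
  }
  return data.map((item: unknown, i: number): GoldenCaseSketch => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`Case ${i + 1} must be an object`)
    }
    const fields = item as Record<string, unknown>
    // Trim before validating so whitespace-only strings are rejected.
    const input = typeof fields.input === "string" ? fields.input.trim() : ""
    const criteria = typeof fields.criteria === "string" ? fields.criteria.trim() : ""
    if (!input) throw new Error(`Case ${i + 1}: "input" must be a non-empty string`)
    if (!criteria) throw new Error(`Case ${i + 1}: "criteria" must be a non-empty string`)
    if (fields.reference !== undefined && typeof fields.reference !== "string") {
      throw new Error(`Case ${i + 1}: "reference" must be a string when present`)
    }
    const reference = typeof fields.reference === "string" ? fields.reference.trim() : ""
    return reference ? { input, criteria, reference } : { input, criteria }
  })
}

// Illustrative golden set (invented for this sketch, loosely based on the
// final-sale support scenario mentioned in the README).
const exampleSet = parseGoldenSet(
  JSON.stringify([
    {
      input: "Can I get a refund on a final-sale item?",
      criteria: "States that final-sale items are non-refundable and does not promise a refund",
    },
  ]),
)
```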
|
|
||
| Built on [Lamatic](https://lamatic.ai). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| # LLM Eval Harness | ||
|
|
||
| ## Overview | ||
| The LLM Eval Harness is a quality-gate agent for other LLM features. Given a system prompt and a golden set of test cases, it runs each case through the prompt-under-test and then grades the output with an LLM-as-judge across faithfulness, relevancy, and correctness, returning per-case scores and a single pass/fail gate. It is invoked by a Next.js web UI that calls two Lamatic flows and aggregates the verdicts. It depends on Lamatic's hosted runtime, project credentials, and a connected text-generation provider (Groq). | ||
|
|
||
| ## Purpose | ||
| Prompt and model changes can silently regress output quality — a reworded instruction starts hallucinating, over-promising, or drifting off-task. This agent makes that measurable and repeatable: a fixed golden set plus an automated judge plus a quality threshold, so a regression is caught as a failed gate rather than by a user. It generalises the eval-harness pattern (golden sets + LLM-as-judge + CI gate) into a hosted, reusable tool. | ||
|
|
||
| ## Flows | ||
|
|
||
| ### `judge` | ||
| - **Trigger:** API request with `{ input, output, criteria, reference? }`. | ||
| - **Processing:** a single LLM node (Groq `llama-3.3-70b-versatile`, temperature 0) acts as a strict evaluation judge using the system prompt in `prompts/`. It scores the candidate `output` against the `criteria` and optional `reference`. | ||
| - **Response:** JSON `{ faithfulness, relevancy, correctness, overall, pass, reasoning }`, each dimension 0–5. | ||
| - **When to use:** to score one already-generated output against case criteria. | ||
| - **Dependencies:** Groq text model credential. | ||
|
|
||
| ### `run-target` | ||
| - **Trigger:** API request with `{ systemPrompt, input }`. | ||
| - **Processing:** a single LLM node runs `systemPrompt` (system) + `input` (user) — this is the *system under test*. | ||
| - **Response:** `{ answer }`, the generated output. | ||
| - **When to use:** to produce the output that `judge` then scores. | ||
| - **Dependencies:** Groq text model credential. | ||
|
|
||
| ## Guardrails | ||
| - The `judge` only scores; it never completes the user's task or rewrites the output. | ||
| - It does not reward length, confidence, formatting, or politeness — an eloquent but unsupported answer scores low on faithfulness. | ||
| - Faithfulness is a veto: a hallucinated or contradicting answer fails regardless of other scores. | ||
| - Scoring is configured for repeatability (temperature 0); identical inputs should yield the same or very similar scores, although hosted models do not guarantee bit-identical output. | ||
|
|
||
| ## Integration Reference | ||
| - **Lamatic API runtime** — hosts and executes both flows. Requires `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` in the calling app. | ||
| - **Groq (text generation)** — backs both LLM nodes; configured as a model credential in Lamatic Studio. | ||
|
|
||
| ## Environment Setup | ||
| - `JUDGE_FLOW` — deployed `judge` flow ID, called by the app. | ||
| - `RUN_TARGET_FLOW` — deployed `run-target` flow ID, called by the app. | ||
| - `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` — Lamatic project credentials used by the app to invoke the flows. | ||
|
|
||
| ## Quickstart | ||
| 1. Build and deploy the `judge` and `run-target` flows in Lamatic Studio; copy their Flow IDs. | ||
| 2. In `apps/`, copy `.env.example` to `.env.local` and fill in the flow IDs + Lamatic credentials. | ||
| 3. `npm install && npm run dev`, open `http://localhost:3000`. | ||
| 4. Paste a system prompt + a golden set (or click **Load example**) and run. | ||
|
|
||
| ## Common Failure Modes | ||
| | Symptom | Likely cause | Fix | | ||
| |---|---|---| | ||
| | Judge scores look random | Model too small or temperature not 0 | Use `llama-3.3-70b-versatile`, set temperature 0 | | ||
| | "No answer returned from flow" | Wrong flow ID or response mapping | Verify `JUDGE_FLOW`/`RUN_TARGET_FLOW` and that the response maps `answer` | | ||
| | Auth error on run | Missing/invalid Lamatic credentials | Check `LAMATIC_API_*` in `.env.local` | | ||
| | A case shows "error" | run-target or judge failed for that input | Expand the row; the run continues for other cases | |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # Deployed Lamatic flow IDs (Studio → deploy the flow → copy its Flow ID) | ||
| JUDGE_FLOW="your-judge-flow-id" | ||
| RUN_TARGET_FLOW="your-run-target-flow-id" | ||
|
|
||
| # Lamatic project credentials (Studio → Settings / API) | ||
| LAMATIC_API_URL="https://your-project.lamatic.dev" | ||
| LAMATIC_PROJECT_ID="your-project-id" | ||
| LAMATIC_API_KEY="your-lamatic-api-key" |
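Since the app needs all five variables above, one option is to fail fast with a single message that lists whatever is missing. The sketch below assumes nothing about how the kit's `lib/lamatic-client.ts` actually reads its configuration; `REQUIRED_VARS`, `HarnessConfig`, and `readHarnessConfig` are hypothetical names.

```typescript
// Hypothetical fail-fast reader for the variables in .env.example.
const REQUIRED_VARS = [
  "JUDGE_FLOW",
  "RUN_TARGET_FLOW",
  "LAMATIC_API_URL",
  "LAMATIC_PROJECT_ID",
  "LAMATIC_API_KEY",
] as const

type HarnessConfig = Record<(typeof REQUIRED_VARS)[number], string>

// Takes the environment as a parameter so it can be tested without touching process.env.
function readHarnessConfig(env: Record<string, string | undefined>): HarnessConfig {
  const missing = REQUIRED_VARS.filter((name) => !(env[name] ?? "").trim())
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`)
  }
  const config = {} as HarnessConfig
  for (const name of REQUIRED_VARS) {
    config[name] = (env[name] ?? "").trim()
  }
  return config
}
```

In a Next.js server context this would typically be called as `readHarnessConfig(process.env)`, so a missing flow ID surfaces as a clear configuration error rather than as a failed flow call.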
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # See https://help.github.com/articles/ignoring-files/ for more about ignoring files. | ||
|
|
||
| # dependencies | ||
| /node_modules | ||
|
|
||
| # next.js | ||
| /.next/ | ||
| /out/ | ||
|
|
||
| # production | ||
| /build | ||
|
|
||
| # debug | ||
| npm-debug.log* | ||
| yarn-debug.log* | ||
| yarn-error.log* | ||
| .pnpm-debug.log* | ||
|
|
||
| # env files | ||
| .env | ||
| .env.local | ||
| .env*.local | ||
|
|
||
| # vercel | ||
| .vercel | ||
|
|
||
| # typescript | ||
| *.tsbuildinfo | ||
| next-env.d.ts |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| # LLM Eval Harness — App | ||
|
|
||
| Next.js front end for the **LLM Eval Harness** kit. It calls two Lamatic flows | ||
| (`run-target` and `judge`) to score a system prompt against a golden set and | ||
| render a CI-style pass/fail gate. | ||
|
|
||
| See the [kit README](../README.md) for the full overview. | ||
|
|
||
| ## Run locally | ||
|
|
||
| ```bash | ||
| cp .env.example .env.local # fill in flow IDs + Lamatic credentials | ||
| npm install | ||
| npm run dev # http://localhost:3000 | ||
| ``` | ||
|
|
||
| ## Environment variables | ||
|
|
||
| | Variable | Source | | ||
| |----------|--------| | ||
| | `JUDGE_FLOW` | Deployed `judge` flow ID (Lamatic Studio) | | ||
| | `RUN_TARGET_FLOW` | Deployed `run-target` flow ID | | ||
| | `LAMATIC_API_URL` | Studio → Settings / API | | ||
| | `LAMATIC_PROJECT_ID` | Studio → Project settings | | ||
| | `LAMATIC_API_KEY` | Studio → API Keys | | ||
|
|
||
| ## Structure | ||
|
|
||
| - `actions/orchestrate.ts` — server action: per-case `run-target` → `judge` loop, aggregation, gate | ||
| - `lib/lamatic-client.ts` — Lamatic SDK client + flow IDs from env | ||
| - `lib/eval.ts` — judge-output parsing, HTML decode, gate computation, bounded concurrency | ||
| - `lib/types.ts` — shared data contracts | ||
| - `components/gate-banner.tsx`, `components/results-table.tsx` — results UI | ||
| - `app/page.tsx` — the harness UI |
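The "bounded concurrency" helper listed under `lib/eval.ts` is not shown in this view. One common way to write such a helper is a small worker pool, sketched below under that assumption; `mapWithLimit` is a hypothetical name and this is not necessarily how the kit's `mapWithConcurrency` is implemented.

```typescript
// Hypothetical worker-pool sketch: at most `limit` calls to `fn` are in flight,
// and results come back in the same order as `items`.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0 // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++ // claim an index synchronously, before awaiting
      results[i] = await fn(items[i], i)
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

Because each worker claims its index before awaiting, no item is processed twice. If `fn` rejects, the returned promise rejects with that error; workers already in flight finish their current item but their results are discarded.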
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| "use server" | ||
|
|
||
| import { getFlowIds, getLamaticClient } from "@/lib/lamatic-client" | ||
| import { computeAggregate, decodeHtmlEntities, mapWithConcurrency, parseJudgeResult } from "@/lib/eval" | ||
| import type { CaseResult, GoldenCase, RunAggregate } from "@/lib/types" | ||
|
|
||
| // Bounded concurrency keeps large golden sets from tripping Groq rate limits. | ||
| const CONCURRENCY = 3 | ||
|
|
||
| /** Execute a flow and pull the `answer` field out of the Lamatic response. */ | ||
| async function getAnswer(flowId: string, inputs: Record<string, unknown>): Promise<unknown> { | ||
| const resData = await getLamaticClient().executeFlow(flowId, inputs) | ||
| const envelope = resData as { result?: { answer?: unknown }; answer?: unknown } | ||
| const answer = envelope?.result?.answer ?? envelope?.answer | ||
| if (answer === undefined || answer === null) { | ||
| throw new Error("No answer returned from flow") | ||
| } | ||
| return answer | ||
| } | ||
|
|
||
| /** Run one golden case through run-target, then score it with the judge. */ | ||
| async function evaluateCase(systemPrompt: string, testCase: GoldenCase): Promise<CaseResult> { | ||
| const { judge, runTarget } = getFlowIds() | ||
| try { | ||
| const rawOutput = await getAnswer(runTarget, { systemPrompt, input: testCase.input }) | ||
| const output = decodeHtmlEntities(typeof rawOutput === "string" ? rawOutput : JSON.stringify(rawOutput)) | ||
|
|
||
| const rawJudge = await getAnswer(judge, { | ||
| input: testCase.input, | ||
| output, | ||
| criteria: testCase.criteria, | ||
| reference: testCase.reference ?? "", | ||
| }) | ||
|
|
||
| return { case: testCase, output, judge: parseJudgeResult(rawJudge) } | ||
| } catch (error) { | ||
| return { | ||
| case: testCase, | ||
| output: "", | ||
| judge: null, | ||
| error: error instanceof Error ? error.message : "Evaluation failed", | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** Evaluate a system prompt against a golden set and return the gate verdict. */ | ||
| export async function runEvaluation( | ||
| systemPrompt: string, | ||
| cases: GoldenCase[], | ||
| threshold: number, | ||
| ): Promise<{ success: boolean; data?: RunAggregate; error?: string }> { | ||
| try { | ||
| if (!systemPrompt.trim()) throw new Error("A system prompt is required") | ||
| if (!Array.isArray(cases) || cases.length === 0) throw new Error("Provide at least one test case") | ||
|
|
||
| const results = await mapWithConcurrency(cases, CONCURRENCY, (testCase) => | ||
| evaluateCase(systemPrompt, testCase), | ||
| ) | ||
| return { success: true, data: computeAggregate(results, threshold) } | ||
| } catch (error) { | ||
| return { success: false, error: error instanceof Error ? error.message : "Evaluation failed" } | ||
| } | ||
| } | ||
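`evaluateCase` above passes the run-target output through `decodeHtmlEntities` before judging, but that helper's body is not part of this view. A minimal sketch of what such a decoder could look like is given below; `decodeEntities` and the `NAMED` table are hypothetical, and only a handful of named entities are covered.

```typescript
// Hypothetical decoder sketch; the kit's real helper is decodeHtmlEntities in lib/eval.ts.
const NAMED: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: "\u00a0",
}

function decodeEntities(text: string): string {
  // A single pass over the text, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match: string, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1].toLowerCase() === "x"
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10)
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match
    }
    // Unknown named entities are left as-is.
    return NAMED[body] ?? match
  })
}
```

Decoding in one pass matters for scoring: a reply that literally discusses the text `&lt;` should reach the judge as `&lt;`, not as `<`.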