Lamatic · Tharun2511 · Jun 22, 2026
diff --git a/kits/llm-eval-harness/.gitignore b/kits/llm-eval-harness/.gitignore
@@ -0,0 +1,5 @@
+.lamatic/
+node_modules/
+.next/
+.env
+.env.local
diff --git a/kits/llm-eval-harness/README.md b/kits/llm-eval-harness/README.md
@@ -0,0 +1,81 @@
+# LLM Eval Harness
+
+A ready-to-deploy kit that scores an LLM prompt against a **golden set** using an **LLM-as-judge**, then applies a **CI-style pass/fail gate** — so you can catch quality regressions *before* they ship.
+
+> Point it at any system prompt, give it a handful of test cases with expected criteria, and it tells you whether the prompt's outputs are faithful, relevant, and correct — with a single GATE PASSED / GATE FAILED verdict.
+
+---
+
+## The problem
+
+When you ship an LLM feature and then tweak a prompt or swap a model, output quality can silently regress — a small wording change makes the model hallucinate, over-promise, or drift off-task, and you don't find out until a user does. Eyeballing a few outputs doesn't scale and isn't repeatable.
+
+Teams solve this with an **evaluation harness**: a fixed set of representative inputs (a *golden set*), an automated grader, and a quality bar that must be met to ship. This kit packages that pattern as a hosted, reusable tool on Lamatic.
+
+## The approach
+
+For each case in the golden set, the kit runs two flows:
+
+1. **`run-target`** — sends your system-prompt-under-test + the case input to an LLM and captures the output (the *system under test*).
+2. **`judge`** — an LLM-as-judge scores that output against the case's `criteria` (and optional `reference`) on three dimensions, **0–5** each:
+   - **Faithfulness** — is every claim grounded? (hallucination is penalised hard — it's a veto)
+   - **Relevancy** — does it actually address the input?
+   - **Correctness** — does it satisfy the case criteria?
+
+The app aggregates the per-case verdicts into a **pass rate** and compares it to a threshold you set (default **90%**) to produce the gate. A case **passes** only if `overall ≥ 3.5` **and** `faithfulness ≥ 3`.
+
+```
+golden case ──▶ run-target (LLM) ──▶ output ──▶ judge (LLM-as-judge) ──▶ {scores, pass, reasoning}
+                                                                              │
+                              all cases ──▶ pass rate vs threshold ──▶ GATE PASS / FAIL
+```
+
+## Results
+
+- Runs entirely on **Lamatic flows** (Groq `llama-3.3-70b-versatile`, temperature 0 to keep scoring as repeatable as possible).
+- The judge reliably **distinguishes good from bad output** — e.g. it fails a support reply that invents a refund against a "final-sale is non-refundable" policy (faithfulness 0), and passes a correct, grounded reply.
+- Per-case results are expandable to show the generated output and the judge's reasoning, so a failure tells you *why*.
+
+## Tradeoffs & assumptions
+
+- **Single provider (v1):** the flows use Groq. Lamatic stores model credentials at the project level, so multi-provider / bring-your-own-key was deliberately scoped out of v1 — runtime credential injection is a security tradeoff worth doing properly rather than quickly.
+- **App-side loop:** the golden set is iterated in the Next.js server action (up to 3 cases in flight at once) rather than inside one flow, which keeps the flows simple and lets the UI surface per-case results and errors.
+- **Gate recomputed in code:** `overall` and `pass` are recomputed from the judge's dimension scores in the app, so the gate is deterministic and not dependent on the model's own arithmetic.
+- **Defensive parsing:** the judge-output parser tolerates code fences and minor formatting deviations; run-target output is HTML-entity-decoded before scoring.
+
+---
+
+## Flows
+
+| Flow | Input | Output |
+|------|-------|--------|
+| `judge` | `{ input, output, criteria, reference? }` | `{ faithfulness, relevancy, correctness, overall, pass, reasoning }` |
+| `run-target` | `{ systemPrompt, input }` | `{ answer }` (the generated output under test) |
+
+## Setup
+
+```bash
+cd kits/llm-eval-harness/apps
+cp .env.example .env.local   # then fill in the values below
+npm install
+npm run dev                  # http://localhost:3000
+```
+
+### Environment variables
+
+| Variable | Where to find it |
+|----------|------------------|
+| `JUDGE_FLOW` | Deploy the `judge` flow in Lamatic Studio → copy its Flow ID |
+| `RUN_TARGET_FLOW` | Deploy the `run-target` flow → copy its Flow ID |
+| `LAMATIC_API_URL` | Studio → Settings / API |
+| `LAMATIC_PROJECT_ID` | Studio → Project settings |
+| `LAMATIC_API_KEY` | Studio → API Keys |
+
+## Usage
+
+1. Paste the **system prompt** you want to evaluate.
+2. Provide a **golden set** as JSON — an array of `{ input, criteria, reference? }`.
+3. Set a **gate threshold** (default 90%).
+4. Click **Run evaluation** — or **Load example** to try a support-agent scenario.
+
+Built on [Lamatic](https://lamatic.ai).
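Review note: the pass rule and gate described in the README above can be sketched as follows. This is a minimal illustration, not the kit's `computeAggregate`. The `Scores` type and the function names are hypothetical, and two details the README does not state are assumed: `overall` is taken to be the unweighted mean of the three dimensions, and a case without scores (for example, one that errored) is counted as a failure.

```typescript
// Hypothetical shape; the kit's real contracts live in lib/types.ts and may differ.
interface Scores {
  faithfulness: number
  relevancy: number
  correctness: number
}

// Assumption: `overall` is the unweighted mean of the three dimensions.
function overallScore(s: Scores): number {
  return (s.faithfulness + s.relevancy + s.correctness) / 3
}

// Pass rule as stated in the README: overall >= 3.5 AND faithfulness >= 3.
function casePasses(s: Scores): boolean {
  return overallScore(s) >= 3.5 && s.faithfulness >= 3
}

// Assumptions: a null entry (no scores) counts as a failed case, and the gate
// passes when the pass rate is greater than or equal to the threshold.
function gate(
  results: Array<Scores | null>,
  thresholdPct: number,
): { passRate: number; passed: boolean } {
  if (results.length === 0) return { passRate: 0, passed: false }
  const passes = results.filter((s) => s !== null && casePasses(s)).length
  const passRate = (passes / results.length) * 100
  return { passRate, passed: passRate >= thresholdPct }
}
```

Note how the faithfulness veto works in this sketch: scores of 2 / 5 / 5 average to 4.0, which clears the overall bar, yet the case still fails because faithfulness is below 3.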
diff --git a/kits/llm-eval-harness/agent.md b/kits/llm-eval-harness/agent.md
@@ -0,0 +1,52 @@
+# LLM Eval Harness
+
+## Overview
+The LLM Eval Harness is a quality-gate agent for other LLM features. Given a system prompt and a golden set of test cases, it runs each case through the prompt-under-test and then grades the output with an LLM-as-judge across faithfulness, relevancy, and correctness, returning per-case scores and a single pass/fail gate. It is invoked by a Next.js web UI that calls two Lamatic flows and aggregates the verdicts. It depends on Lamatic's hosted runtime, project credentials, and a connected text-generation provider (Groq).
+
+## Purpose
+Prompt and model changes can silently regress output quality — a reworded instruction starts hallucinating, over-promising, or drifting off-task. This agent makes that measurable and repeatable: a fixed golden set plus an automated judge plus a quality threshold, so a regression is caught as a failed gate rather than by a user. It generalises the eval-harness pattern (golden sets + LLM-as-judge + CI gate) into a hosted, reusable tool.
+
+## Flows
+
+### `judge`
+- **Trigger:** API request with `{ input, output, criteria, reference? }`.
+- **Processing:** a single LLM node (Groq `llama-3.3-70b-versatile`, temperature 0) acts as a strict evaluation judge using the system prompt in `prompts/`. It scores the candidate `output` against the `criteria` and optional `reference`.
+- **Response:** JSON `{ faithfulness, relevancy, correctness, overall, pass, reasoning }`, each dimension 0–5.
+- **When to use:** to score one already-generated output against case criteria.
+- **Dependencies:** Groq text model credential.
+
+### `run-target`
+- **Trigger:** API request with `{ systemPrompt, input }`.
+- **Processing:** a single LLM node runs `systemPrompt` (system) + `input` (user) — this is the *system under test*.
+- **Response:** `{ answer }`, the generated output.
+- **When to use:** to produce the output that `judge` then scores.
+- **Dependencies:** Groq text model credential.
+
+## Guardrails
+- The `judge` only scores; it never completes the user's task or rewrites the output.
+- It does not reward length, confidence, formatting, or politeness — an eloquent but unsupported answer scores low on faithfulness.
+- Faithfulness is a veto: a hallucinated or contradicting answer fails regardless of other scores.
+- Scoring runs at temperature 0, so identical inputs should yield near-identical scores; hosted inference does not guarantee bit-for-bit determinism.
+
+## Integration Reference
+- **Lamatic API runtime** — hosts and executes both flows. Requires `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` in the calling app.
+- **Groq (text generation)** — backs both LLM nodes; configured as a model credential in Lamatic Studio.
+
+## Environment Setup
+- `JUDGE_FLOW` — deployed `judge` flow ID, called by the app.
+- `RUN_TARGET_FLOW` — deployed `run-target` flow ID, called by the app.
+- `LAMATIC_API_URL`, `LAMATIC_PROJECT_ID`, `LAMATIC_API_KEY` — Lamatic project credentials used by the app to invoke the flows.
+
+## Quickstart
+1. Build and deploy the `judge` and `run-target` flows in Lamatic Studio; copy their Flow IDs.
+2. In `apps/`, copy `.env.example` to `.env.local` and fill in the flow IDs + Lamatic credentials.
+3. `npm install && npm run dev`, open `http://localhost:3000`.
+4. Paste a system prompt + a golden set (or click **Load example**) and run.
+
+## Common Failure Modes
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| Judge scores look random | Model too small or temperature not 0 | Use `llama-3.3-70b-versatile`, set temperature 0 |
+| "No answer returned from flow" | Wrong flow ID or response mapping | Verify `JUDGE_FLOW`/`RUN_TARGET_FLOW` and that each flow's response exposes an `answer` field |
+| Auth error on run | Missing/invalid Lamatic credentials | Check `LAMATIC_API_*` in `.env.local` |
+| A case shows "error" | run-target or judge failed for that input | Expand the row; the run continues for other cases |
diff --git a/kits/llm-eval-harness/apps/.env.example b/kits/llm-eval-harness/apps/.env.example
@@ -0,0 +1,8 @@
+# Deployed Lamatic flow IDs (Studio → deploy the flow → copy its Flow ID)
+JUDGE_FLOW="your-judge-flow-id"
+RUN_TARGET_FLOW="your-run-target-flow-id"
+
+# Lamatic project credentials (Studio → Settings / API)
+LAMATIC_API_URL="https://your-project.lamatic.dev"
+LAMATIC_PROJECT_ID="your-project-id"
+LAMATIC_API_KEY="your-lamatic-api-key"
diff --git a/kits/llm-eval-harness/apps/.gitignore b/kits/llm-eval-harness/apps/.gitignore
@@ -0,0 +1,29 @@
+# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
+
+# dependencies
+/node_modules
+
+# next.js
+/.next/
+/out/
+
+# production
+/build
+
+# debug
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+.pnpm-debug.log*
+
+# env files
+.env
+.env.local
+.env*.local
+
+# vercel
+.vercel
+
+# typescript
+*.tsbuildinfo
+next-env.d.ts
diff --git a/kits/llm-eval-harness/apps/README.md b/kits/llm-eval-harness/apps/README.md
@@ -0,0 +1,34 @@
+# LLM Eval Harness — App
+
+Next.js front end for the **LLM Eval Harness** kit. It calls two Lamatic flows
+(`run-target` and `judge`) to score a system prompt against a golden set and
+render a CI-style pass/fail gate.
+
+See the [kit README](../README.md) for the full overview.
+
+## Run locally
+
+```bash
+cp .env.example .env.local   # fill in flow IDs + Lamatic credentials
+npm install
+npm run dev                  # http://localhost:3000
+```
+
+## Environment variables
+
+| Variable | Source |
+|----------|--------|
+| `JUDGE_FLOW` | Deployed `judge` flow ID (Lamatic Studio) |
+| `RUN_TARGET_FLOW` | Deployed `run-target` flow ID |
+| `LAMATIC_API_URL` | Studio → Settings / API |
+| `LAMATIC_PROJECT_ID` | Studio → Project settings |
+| `LAMATIC_API_KEY` | Studio → API Keys |
+
+## Structure
+
+- `actions/orchestrate.ts` — server action: per-case `run-target` → `judge` loop, aggregation, gate
+- `lib/lamatic-client.ts` — Lamatic SDK client + flow IDs from env
+- `lib/eval.ts` — judge-output parsing, HTML decode, gate computation, bounded concurrency
+- `lib/types.ts` — shared data contracts
+- `components/gate-banner.tsx`, `components/results-table.tsx` — results UI
+- `app/page.tsx` — the harness UI
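Review note: `lib/eval.ts` is not shown in this part of the diff, so the following is only a sketch of what fence-tolerant judge parsing could look like. The names `stripCodeFence`, `clampScore`, `parseJudgeSketch`, and `JudgeScores` are hypothetical and are not the kit's `parseJudgeResult`. Treating a missing or non-numeric score as 0 is an assumption, chosen here because it fails closed.

```typescript
// Hypothetical result shape for illustration only.
interface JudgeScores {
  faithfulness: number
  relevancy: number
  correctness: number
  reasoning: string
}

// Remove a surrounding Markdown code fence (with or without a "json" tag), if present.
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim()
  const match = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/i)
  return match ? match[1] : trimmed
}

// Coerce to a number and clamp into the 0-5 range; non-numeric values become 0 (assumption).
function clampScore(value: unknown): number {
  const n = Number(value)
  if (!Number.isFinite(n)) return 0
  return Math.min(5, Math.max(0, n))
}

// Accept either a JSON string (possibly fenced) or an already-parsed object.
function parseJudgeSketch(raw: unknown): JudgeScores {
  const parsed: unknown = typeof raw === "string" ? JSON.parse(stripCodeFence(raw)) : raw
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge output is not a JSON object")
  }
  const obj = parsed as Record<string, unknown>
  return {
    faithfulness: clampScore(obj.faithfulness),
    relevancy: clampScore(obj.relevancy),
    correctness: clampScore(obj.correctness),
    reasoning: typeof obj.reasoning === "string" ? obj.reasoning : "",
  }
}
```

The sketch deliberately ignores any `overall` or `pass` fields from the model, in line with the README's statement that both are recomputed in the app.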
diff --git a/kits/llm-eval-harness/apps/actions/orchestrate.ts b/kits/llm-eval-harness/apps/actions/orchestrate.ts
@@ -0,0 +1,63 @@
+"use server"
+
+import { getFlowIds, getLamaticClient } from "@/lib/lamatic-client"
+import { computeAggregate, decodeHtmlEntities, mapWithConcurrency, parseJudgeResult } from "@/lib/eval"
+import type { CaseResult, GoldenCase, RunAggregate } from "@/lib/types"
+
+// Bounded concurrency keeps large golden sets from tripping Groq rate limits.
+const CONCURRENCY = 3
+
+/** Execute a flow and pull the `answer` field out of the Lamatic response. */
+async function getAnswer(flowId: string, inputs: Record<string, unknown>): Promise<unknown> {
+  const resData = await getLamaticClient().executeFlow(flowId, inputs)
+  const envelope = resData as { result?: { answer?: unknown }; answer?: unknown }
+  const answer = envelope?.result?.answer ?? envelope?.answer
+  if (answer === undefined || answer === null) {
+    throw new Error("No answer returned from flow")
+  }
+  return answer
+}
+
+/** Run one golden case through run-target, then score it with the judge. */
+async function evaluateCase(systemPrompt: string, testCase: GoldenCase): Promise<CaseResult> {
+  const { judge, runTarget } = getFlowIds()
+  try {
+    const rawOutput = await getAnswer(runTarget, { systemPrompt, input: testCase.input })
+    const output = decodeHtmlEntities(typeof rawOutput === "string" ? rawOutput : JSON.stringify(rawOutput))
+
+    const rawJudge = await getAnswer(judge, {
+      input: testCase.input,
+      output,
+      criteria: testCase.criteria,
+      reference: testCase.reference ?? "",
+    })
+
+    return { case: testCase, output, judge: parseJudgeResult(rawJudge) }
+  } catch (error) {
+    return {
+      case: testCase,
+      output: "",
+      judge: null,
+      error: error instanceof Error ? error.message : "Evaluation failed",
+    }
+  }
+}
+
+/** Evaluate a system prompt against a golden set and return the gate verdict. */
+export async function runEvaluation(
+  systemPrompt: string,
+  cases: GoldenCase[],
+  threshold: number,
+): Promise<{ success: boolean; data?: RunAggregate; error?: string }> {
+  try {
+    if (!systemPrompt.trim()) throw new Error("A system prompt is required")
+    if (!Array.isArray(cases) || cases.length === 0) throw new Error("Provide at least one test case")
+
+    const results = await mapWithConcurrency(cases, CONCURRENCY, (testCase) =>
+      evaluateCase(systemPrompt, testCase),
+    )
+    return { success: true, data: computeAggregate(results, threshold) }
+  } catch (error) {
+    return { success: false, error: error instanceof Error ? error.message : "Evaluation failed" }
+  }
+}
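Review note: `mapWithConcurrency` is imported from `lib/eval` and its body is not part of this file, so the sketch below only illustrates one common way to get bounded concurrency with order-preserving results. The signature is inferred from the call site in `runEvaluation` (items, limit, async mapper) and the implementation is an assumption, not the kit's code.

```typescript
// A worker-pool sketch: at most `limit` mapper calls are in flight at once,
// and results are written back by index so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0 // index of the next unclaimed item, shared by all workers

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++ // claim an index synchronously, before awaiting
      results[i] = await fn(items[i], i)
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

In this sketch a rejected mapper call rejects the whole batch. That fits the server action above, where `evaluateCase` catches its own errors and returns a `CaseResult` with an `error` field, so one failing case does not abort the rest.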