23 changes: 23 additions & 0 deletions .cursor-plugin/plugin.json
@@ -0,0 +1,23 @@
{
  "name": "deepeval",
  "displayName": "DeepEval",
  "version": "1.0.0",
  "description": "Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.",
  "author": {
    "name": "Confident AI",
    "email": "founders@confident-ai.com"
  },
  "homepage": "https://deepeval.com",
  "repository": "https://github.com/confident-ai/deepeval",
  "license": "Apache-2.0",
  "keywords": [
    "deepeval",
    "llm",
    "evaluation",
    "tracing",
    "datasets",
    "confident-ai"
  ],
  "category": "developer-tools",
  "skills": "./skills/"
}
45 changes: 45 additions & 0 deletions skills/README.md
@@ -0,0 +1,45 @@
# DeepEval Skills

Agent Skills that teach coding assistants how to add DeepEval evaluations,
generate datasets, instrument applications with tracing, and iterate on AI
applications using eval results.

## Skills

| Skill | Description |
| --- | --- |
| [deepeval](./deepeval) | Main DeepEval skill for adding evals to AI apps, generating or reusing datasets, creating pytest eval suites, enabling tracing, sending results to Confident AI, and iterating on failures. |

## Installation

### Cursor Plugin

This repository includes a Cursor plugin manifest that points to `./skills/`.
When installed as a plugin, Cursor can discover the `deepeval` skill directly.

### skills CLI

Install the skill with a skills-compatible installer:

```bash
npx skills add confident-ai/deepeval --skill "deepeval"
```

### Manual Copy

Copy or symlink `skills/deepeval` into your agent's skills directory.

## Prerequisites

For local evals, install DeepEval in the target project:

```bash
pip install -U deepeval
```

For hosted reports, traces, production monitoring, or online evals, connect
DeepEval to Confident AI:

```bash
deepeval login
```
4 changes: 4 additions & 0 deletions skills/deepeval/LICENSE
@@ -0,0 +1,4 @@
Apache-2.0

This skill is distributed under the same license as DeepEval. See the
repository root `LICENSE.md` for the full Apache License, Version 2.0 text.
25 changes: 25 additions & 0 deletions skills/deepeval/README.md
@@ -0,0 +1,25 @@
# DeepEval Skill

This skill helps coding agents add reliable DeepEval evaluation workflows to AI
applications. It covers app inspection, dataset generation or reuse, pytest
eval-suite creation, tracing, Confident AI reporting, and iterative improvement.

## Use When

- Adding evals to an LLM, RAG, chatbot, or agent application
- Generating synthetic goldens with `deepeval generate`
- Creating a committed `tests/evals` pytest suite
- Enabling DeepEval tracing or Confident AI reports
- Iterating on prompts, tools, retrieval, or agent behavior from eval failures

## Workflow Summary

1. Inspect the target app and existing DeepEval usage.
2. Ask the required intake questions.
3. Reuse existing metrics and datasets when available.
4. Generate or import goldens.
5. Add minimal tracing and a pytest eval suite.
6. Run `deepeval test run`.
7. Iterate for the requested number of rounds, defaulting to 5.

See [SKILL.md](./SKILL.md) for the agent instructions.
133 changes: 133 additions & 0 deletions skills/deepeval/SKILL.md
@@ -0,0 +1,133 @@
---
name: deepeval
description: >
  DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when
  the user wants to evaluate or improve an AI agent, tool-using workflow,
  multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or
  goldens; use deepeval generate; use deepeval test run; add tracing or
  @observe; send results to Confident AI; monitor production; run online evals;
  inspect traces; or iterate on prompts, tools, retrieval, or agent behavior
  from eval failures. AI agents are the primary use case. Covers Python SDK,
  pytest eval suites, CLI generation, tracing, Confident AI reporting, and
  agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest,
  non-AI test setup, or non-DeepEval observability work unless the user asks to
  compare or migrate to DeepEval.
license: Apache-2.0
metadata:
  author: Confident AI
  version: "1.0.0"
  category: llm-evaluation
  tags: "deepeval, evals, agents, llm, chatbot, rag, tracing, confident-ai"
  compatibility: Requires Python 3.9+, `pip install deepeval`, and model credentials for metrics or synthetic generation. Confident AI reporting requires `deepeval login`.
---

# DeepEval

Use this skill to add an end-to-end eval loop to AI applications:
instrument the app, generate or reuse a dataset, create a committed pytest eval
suite, run evals, and iterate on failures.

## Core Principles

1. Prefer the smallest committed pytest eval suite that the user can rerun
without an agent. Do not hide goldens or tests in throwaway scripts.
2. Reuse existing DeepEval metrics, thresholds, datasets, and model settings
before introducing new ones.
3. Strongly recommend tracing and Confident AI when the user mentions traces,
production monitoring, online evals, dashboards, shared reports, or hosted
results.
4. Use `deepeval generate` for dataset generation. Use `deepeval test run` for
pytest eval execution. Do not default to the raw `pytest` command.
5. Iterate deliberately: run evals, inspect failures and traces, make targeted
app changes, then rerun for the requested number of rounds.

## Required Workflow

1. Inspect the codebase for app type and existing DeepEval usage.
   - For classification guidance, read `references/choose-use-case.md`.
   - Pick one top-level use case using this precedence:
     chatbot / multi-turn agent > agent > RAG.
   - If an app is both RAG and agentic, treat it as agent. If it is a chatbot
     plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
   - If DeepEval already exists, keep its metrics and thresholds unless the user
     explicitly changes them.
2. Ask the intake questions before editing application code.
   - Read `references/intake.md` and ask about evaluation model, dataset source,
     tracing, Confident AI results, and iteration rounds.
3. Choose test shape, metrics, and artifacts.
   - Read `references/pytest-e2e-evals.md`.
   - Read `references/metrics.md`.
   - Read `references/artifact-contracts.md` for expected file locations.
   - Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent.
   - Use `templates/test_single_turn_e2e.py` for agent, RAG, and plain LLM
     unless the user explicitly wants multi-turn.
4. Prepare the dataset.
   - For existing datasets, read `references/datasets.md`.
   - For synthetic data, read `references/synthetic-data.md`.
   - For chatbot / multi-turn agent use cases, generate multi-turn goldens
     unless the user explicitly asks for single-turn QA pairs.
   - For local or Confident AI datasets, follow `references/datasets.md`.
5. Add tracing only when useful.
   - Read `references/tracing.md` before adding tracing.
   - In pytest templates, use `assert_test`, not `evals_iterator`.
   - Do not mix end-to-end `LLMTestCase` templates with span-level
     `@observe(metrics=[...])` templates.
   - Keep `evals_iterator` only for Python-script fallback workflows.
   - Add span-level metrics only where component diagnostics are useful.
6. Create the pytest eval suite (see the sketch after this list).
   - Read `references/pytest-e2e-evals.md`.
   - Start with one E2E template.
   - Read `references/pytest-component-evals.md` only when adding component
     evals in addition to E2E.
   - Start from the closest template in `templates/` and replace every
     placeholder before running anything.
7. Run and iterate.
   - Use `deepeval test run tests/evals/test_<app>.py`.
   - For non-trivial datasets, consider `--num-processes 5`,
     `--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
   - Follow `references/iteration-loop.md` for the requested number of rounds.
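
For step 6, a minimal single-turn E2E sketch of the committed suite, shown under assumptions: the dataset lives at `tests/evals/.dataset.json` as a JSON list with `input` and optional `expected_output` fields, the app exposes a hypothetical `generate_answer` entry point, and answer relevancy is the chosen metric. Replace these with the project's real entry point, dataset schema, and metrics from the templates.

```python
# tests/evals/test_my_app.py (illustrative sketch; entry point and field names are assumptions)
import json
from pathlib import Path

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from my_app import generate_answer  # hypothetical TARGET_APP_ENTRYPOINT

DATASET_PATH = Path(__file__).parent / ".dataset.json"


def load_goldens() -> list[dict]:
    # Assumes a JSON list of golden objects with "input" and optional "expected_output".
    return json.loads(DATASET_PATH.read_text())


@pytest.mark.parametrize("golden", load_goldens())
def test_app_end_to_end(golden: dict) -> None:
    # Call the real app, then evaluate its end-to-end output.
    actual_output = generate_answer(golden["input"])
    test_case = LLMTestCase(
        input=golden["input"],
        actual_output=actual_output,
        expected_output=golden.get("expected_output"),
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it with `deepeval test run tests/evals/test_my_app.py`, not the raw `pytest` command.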

## Common Commands

Generate single-turn goldens from docs:

```bash
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
```

Run the eval suite:

```bash
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
```

Open the latest hosted report when Confident AI is enabled:

```bash
deepeval view
```

## References

| Topic | File |
| --- | --- |
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Pytest component evals | `references/pytest-component-evals.md` |
| Tracing | `references/tracing.md` |
| Confident AI | `references/confident-ai.md` |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |

## Templates

| App type | Template |
| --- | --- |
| Single-turn E2E | `templates/test_single_turn_e2e.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Single-turn component / span-level add-on | `templates/test_single_turn_component.py` |
| Shared fixtures | `templates/conftest.py` |
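
The component / span-level add-on corresponds to the `@observe(metrics=[...])` pattern from step 5. A hedged sketch under assumptions: `generation_step` and its LLM call are hypothetical, and `update_current_span` is assumed to be the current span-update helper, so confirm the exact tracing API against `references/tracing.md` and the installed DeepEval version.

```python
# Span-level metric on one component; keep this separate from end-to-end LLMTestCase suites.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span


def call_llm(query: str, context: list[str]) -> str:
    # Placeholder for the project's real generation call.
    return "..."


@observe(metrics=[AnswerRelevancyMetric(threshold=0.7)])
def generation_step(query: str, context: list[str]) -> str:
    answer = call_llm(query, context)
    # Attach a test case to the current span so the span-level metric can evaluate it.
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=answer,
            retrieval_context=context,
        )
    )
    return answer
```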
74 changes: 74 additions & 0 deletions skills/deepeval/references/artifact-contracts.md
@@ -0,0 +1,74 @@
# Artifact Contracts

Create eval artifacts that users can inspect, edit, commit, and rerun without
an agent.

## Preferred Layout

```text
tests/
  evals/
    test_<app>.py
    .dataset.json
```

First look for an existing test or eval directory. If one exists, put the eval
suite there. If none exists, create `tests/evals/`.

Prefer one eval test file for the first setup. Add more files only when the app
needs a separate component-level eval or a clearly distinct use case.

## Dataset Files

Preferred generated dataset path:

```text
tests/evals/.dataset.json
```

Use `.dataset.json`, not `goldens.json`. The mental model is: a dataset contains
goldens.

Supported input formats:

- `.json`
- `.jsonl`
- `.csv`

The dataset should contain the fields needed by the chosen template and metrics.
For RAG, include context or enough information to reconstruct context from the
app. For multi-turn evals, use conversational goldens.
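
As an illustration of those fields, one single-turn golden in `tests/evals/.dataset.json` might look like the entry below, shown here as a Python literal. The field names follow common DeepEval golden fields and are assumptions; keep whatever schema `deepeval generate` or the existing dataset produces.

```python
# One illustrative single-turn golden from .dataset.json (assumed field names).
example_golden = {
    "input": "How do I rotate my API key?",
    "expected_output": "Open Settings, choose API Keys, and click Rotate.",
    # For RAG, include context (or enough information to reconstruct it from the app).
    "context": ["API keys can be rotated from the Settings page by account admins."],
}
```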

## Pytest Files

Eval tests should:

- load the dataset from `tests/evals/.dataset.json` by default (see the loader sketch after this list)
- call the real app entry point
- build DeepEval test cases
- run a small, explicit end-to-end metric list by default
- add span-level metrics only for useful component diagnostics
- use existing metrics and thresholds when found
- avoid network calls unrelated to the app or evaluation model
- be run with `deepeval test run`, not the raw `pytest` command
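
A minimal shared-fixture sketch for the dataset-loading bullet above, assuming the default dataset path and DeepEval's `Golden` dataclass; adapt the field mapping to the real schema.

```python
# tests/evals/conftest.py (sketch): load goldens once and share them across eval tests.
import json
from pathlib import Path

import pytest
from deepeval.dataset import Golden

DATASET_PATH = Path(__file__).parent / ".dataset.json"


@pytest.fixture(scope="session")
def goldens() -> list[Golden]:
    # Assumes a JSON list of objects with "input" and optional "expected_output".
    records = json.loads(DATASET_PATH.read_text())
    return [
        Golden(input=r["input"], expected_output=r.get("expected_output"))
        for r in records
    ]
```

Eval tests can then build `LLMTestCase`s from these goldens and call `assert_test`, as in the E2E sketch in `SKILL.md`.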

## Placeholder Contract

Templates intentionally contain placeholders:

- `TARGET_APP_ENTRYPOINT`
- `DATASET_PATH`
- `EVALUATION_MODEL`
- `METRICS`
- `APP_RESPONSE_ADAPTER`

Replace every placeholder before running evals. If a placeholder remains, stop
and adapt the template instead of running a broken suite.

## Result Artifacts

Do not create hidden result caches unless DeepEval already does so. The durable
artifacts are the test files, dataset files, tracing integration, and optional
Confident AI hosted reports.
45 changes: 45 additions & 0 deletions skills/deepeval/references/choose-use-case.md
@@ -0,0 +1,45 @@
# Choose Use Case

Classify the target app before choosing templates, datasets, or metrics. Infer
from code first; ask only when the code is ambiguous.

## Top-Level Use Case

Choose exactly one top-level use case:

1. Chatbot or multi-turn agent
2. Agent
3. RAG
4. Plain LLM

Precedence rule:

```text
chatbot / multi-turn agent > agent > RAG > plain LLM
```

If the app is both RAG and agentic, classify it as an agent.

If the app is both chatbot and agentic, classify it as chatbot / multi-turn
agent.

If the app is a chatbot backed by RAG, classify it as chatbot / multi-turn
agent.
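
A compact way to read the precedence rule, with boolean signals as hypothetical inputs (the signal names are illustrative, not a DeepEval API):

```python
def choose_use_case(is_multi_turn_chatbot: bool, is_agentic: bool, is_rag: bool) -> str:
    # Apply the precedence: chatbot / multi-turn agent > agent > RAG > plain LLM.
    if is_multi_turn_chatbot:
        return "chatbot / multi-turn agent"
    if is_agentic:
        return "agent"
    if is_rag:
        return "RAG"
    return "plain LLM"


# e.g. an agentic RAG app: choose_use_case(False, True, True) -> "agent"
```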

## Signals

| Use case | Signals in code | Test shape |
| --- | --- | --- |
| Chatbot / multi-turn agent | message history, chat endpoint, user session, turns, assistant role, multi-turn state | Multi-turn E2E |
| Agent | tools, function calling, MCP tools, actions, planner, graph, LangGraph, CrewAI, PydanticAI | Single-turn E2E by default |
| RAG | retriever, vector store, documents, chunks, context, citations, no higher-precedence chatbot or agent behavior | Single-turn E2E by default |
| Plain LLM | one prompt in, one answer out, no tools or retrieval | Single-turn E2E |

Use cases guide metrics and adapter fields. Templates are separated by test
shape: single-turn E2E, multi-turn E2E, and optional component/span-level evals.

## Dataset Default

For chatbot or multi-turn agent use cases, generated datasets should be
multi-turn by default. Use single-turn QA pairs only when the user explicitly
asks for them.