Commit 0d0e532

Merge pull request #2634 from confident-ai/features/agent-skills

vibe koding

2 parents d593974 + e358626 commit 0d0e532

20 files changed: 1734 additions & 0 deletions

.cursor-plugin/plugin.json

Lines changed: 23 additions & 0 deletions

{
  "name": "deepeval",
  "displayName": "DeepEval",
  "version": "1.0.0",
  "description": "Skills for adding DeepEval evaluations, tracing, datasets, Confident AI reports, and iterative improvement loops to AI applications.",
  "author": {
    "name": "Confident AI",
    "email": "founders@confident-ai.com"
  },
  "homepage": "https://deepeval.com",
  "repository": "https://github.com/confident-ai/deepeval",
  "license": "Apache-2.0",
  "keywords": [
    "deepeval",
    "llm",
    "evaluation",
    "tracing",
    "datasets",
    "confident-ai"
  ],
  "category": "developer-tools",
  "skills": "./skills/"
}

skills/README.md

Lines changed: 45 additions & 0 deletions

# DeepEval Skills

Agent Skills that teach coding assistants how to add DeepEval evaluations,
generate datasets, instrument applications with tracing, and iterate on AI
applications using eval results.

## Skills

| Skill | Description |
| --- | --- |
| [deepeval](./deepeval) | Main DeepEval skill for adding evals to AI apps, generating or reusing datasets, creating pytest eval suites, enabling tracing, sending results to Confident AI, and iterating on failures. |

## Installation

### Cursor Plugin

This repository includes a Cursor plugin manifest that points to `./skills/`.
When installed as a plugin, Cursor can discover the `deepeval` skill directly.

### skills CLI

Install the skill with a skills-compatible installer:

```bash
npx skills add confident-ai/deepeval --skill "deepeval"
```

### Manual Copy

Copy or symlink `skills/deepeval` into your agent's skills directory.

## Prerequisites

For local evals, install DeepEval in the target project:

```bash
pip install -U deepeval
```

For hosted reports, traces, production monitoring, or online evals, connect
DeepEval to Confident AI:

```bash
deepeval login
```

skills/deepeval/LICENSE

Lines changed: 4 additions & 0 deletions

Apache-2.0

This skill is distributed under the same license as DeepEval. See the
repository root `LICENSE.md` for the full Apache License, Version 2.0 text.

skills/deepeval/README.md

Lines changed: 25 additions & 0 deletions

# DeepEval Skill

This skill helps coding agents add reliable DeepEval evaluation workflows to AI
applications. It covers app inspection, dataset generation or reuse, pytest
eval-suite creation, tracing, Confident AI reporting, and iterative improvement.

## Use When

- Adding evals to an LLM, RAG, chatbot, or agent application
- Generating synthetic goldens with `deepeval generate`
- Creating a committed `tests/evals` pytest suite
- Enabling DeepEval tracing or Confident AI reports
- Iterating on prompts, tools, retrieval, or agent behavior from eval failures

## Workflow Summary

1. Inspect the target app and existing DeepEval usage.
2. Ask the required intake questions.
3. Reuse existing metrics and datasets when available.
4. Generate or import goldens.
5. Add minimal tracing and a pytest eval suite.
6. Run `deepeval test run`.
7. Iterate for the requested number of rounds, defaulting to 5.
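
A minimal sketch of the kind of committed suite this workflow produces,
assuming a `run_app` entry point, an `input`-keyed dataset, and an answer
relevancy metric (none of which are the actual template contents):

```python
# tests/evals/test_app.py: illustrative sketch only; the real templates in
# templates/ use placeholders instead of these assumed names.
import pytest

from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def run_app(user_input: str) -> str:
    """Stand-in for the real app entry point; replace with your app call."""
    raise NotImplementedError


# Load the committed goldens generated by `deepeval generate`.
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="tests/evals/.dataset.json",
    input_key_name="input",
)


@pytest.mark.parametrize("golden", dataset.goldens)
def test_app(golden):
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=run_app(golden.input),
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it with `deepeval test run tests/evals/test_app.py` rather than plain
`pytest`.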

See [SKILL.md](./SKILL.md) for the agent instructions.

skills/deepeval/SKILL.md

Lines changed: 133 additions & 0 deletions

---
name: deepeval
description: >
  DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when
  the user wants to evaluate or improve an AI agent, tool-using workflow,
  multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or
  goldens; use deepeval generate; use deepeval test run; add tracing or
  @observe; send results to Confident AI; monitor production; run online evals;
  inspect traces; or iterate on prompts, tools, retrieval, or agent behavior
  from eval failures. AI agents are the primary use case. Covers Python SDK,
  pytest eval suites, CLI generation, tracing, Confident AI reporting, and
  agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest,
  non-AI test setup, or non-DeepEval observability work unless the user asks to
  compare or migrate to DeepEval.
license: Apache-2.0
metadata:
  author: Confident AI
  version: "1.0.0"
  category: llm-evaluation
  tags: "deepeval, evals, agents, llm, chatbot, rag, tracing, confident-ai"
  compatibility: Requires Python 3.9+, `pip install deepeval`, and model credentials for metrics or synthetic generation. Confident AI reporting requires `deepeval login`.
---

# DeepEval

Use this skill to add an end-to-end eval loop to AI applications:
instrument the app, generate or reuse a dataset, create a committed pytest eval
suite, run evals, and iterate on failures.

## Core Principles

1. Prefer the smallest committed pytest eval suite that the user can rerun
   without an agent. Do not hide goldens or tests in throwaway scripts.
2. Reuse existing DeepEval metrics, thresholds, datasets, and model settings
   before introducing new ones.
3. Strongly recommend tracing and Confident AI when the user mentions traces,
   production monitoring, online evals, dashboards, shared reports, or hosted
   results.
4. Use `deepeval generate` for dataset generation. Use `deepeval test run` for
   pytest eval execution. Do not default to the raw `pytest` command.
5. Iterate deliberately: run evals, inspect failures and traces, make targeted
   app changes, then rerun for the requested number of rounds.

## Required Workflow

1. Inspect the codebase for app type and existing DeepEval usage.
   - For classification guidance, read `references/choose-use-case.md`.
   - Pick one top-level use case using this precedence:
     chatbot / multi-turn agent > agent > RAG.
   - If an app is both RAG and agentic, treat it as agent. If it is a chatbot
     plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
   - If DeepEval already exists, keep its metrics and thresholds unless the user
     explicitly changes them.
2. Ask the intake questions before editing application code.
   - Read `references/intake.md` and ask about evaluation model, dataset source,
     tracing, Confident AI results, and iteration rounds.
3. Choose test shape, metrics, and artifacts.
   - Read `references/pytest-e2e-evals.md`.
   - Read `references/metrics.md`.
   - Read `references/artifact-contracts.md` for expected file locations.
   - Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent.
   - Use `templates/test_single_turn_e2e.py` for agent, RAG, and plain LLM
     unless the user explicitly wants multi-turn.
4. Prepare the dataset.
   - For existing datasets, read `references/datasets.md`.
   - For synthetic data, read `references/synthetic-data.md`.
   - For chatbot / multi-turn agent use cases, generate multi-turn goldens
     unless the user explicitly asks for single-turn QA pairs for now.
   - For local or Confident AI datasets, follow `references/datasets.md`.
5. Add tracing only when useful (a span-level sketch follows this list).
   - Read `references/tracing.md` before adding tracing.
   - In pytest templates, use `assert_test`, not `evals_iterator`.
   - Do not mix end-to-end `LLMTestCase` templates with span-level
     `@observe(metrics=[...])` templates.
   - Keep `evals_iterator` only for Python-script fallback workflows.
   - Add span-level metrics only where component diagnostics are useful.
6. Create the pytest eval suite.
   - Read `references/pytest-e2e-evals.md`.
   - Start with one E2E template.
   - Read `references/pytest-component-evals.md` only when adding component
     evals in addition to E2E.
   - Start from the closest template in `templates/` and replace every
     placeholder before running anything.
7. Run and iterate.
   - Use `deepeval test run tests/evals/test_<app>.py`.
   - For non-trivial datasets, consider `--num-processes 5`,
     `--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
   - Follow `references/iteration-loop.md` for the requested number of rounds.

## Common Commands

Generate single-turn goldens from docs:

```bash
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
```

Run the eval suite:

```bash
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
```

Open the latest hosted report when Confident AI is enabled:

```bash
deepeval view
```

## References

| Topic | File |
| --- | --- |
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Pytest component evals | `references/pytest-component-evals.md` |
| Tracing | `references/tracing.md` |
| Confident AI | `references/confident-ai.md` |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |

## Templates

| App type | Template |
| --- | --- |
| Single-turn E2E | `templates/test_single_turn_e2e.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Single-turn component / span-level add-on | `templates/test_single_turn_component.py` |
| Shared fixtures | `templates/conftest.py` |

skills/deepeval/references/artifact-contracts.md

Lines changed: 74 additions & 0 deletions

# Artifact Contracts

Create eval artifacts that users can inspect, edit, commit, and rerun without
an agent.

## Preferred Layout

```text
tests/
  evals/
    test_<app>.py
    .dataset.json
```

Use an existing eval directory if the project already has one. Otherwise, look
for an existing test folder and put the eval suite there; if none exists,
create `tests/evals/`.

Prefer one eval test file for the first setup. Add more files only when the app
needs a separate component-level eval or a clearly distinct use case.

## Dataset Files

Preferred generated dataset path:

```text
tests/evals/.dataset.json
```

Use `.dataset.json`, not `goldens.json`. The mental model is: a dataset contains
goldens.

Supported input formats:

- `.json`
- `.jsonl`
- `.csv`

The dataset should contain the fields needed by the chosen template and metrics.
For RAG, include context or enough information to reconstruct context from the
app. For multi-turn evals, use conversational goldens.
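
As a sketch of consuming this file, assuming the generated JSON is a flat list
of golden records keyed by `input` and `expected_output` (the exact keys depend
on generator settings):

```python
# Illustrative loader; the key names and flat-list layout are assumptions.
import json

from deepeval.dataset import Golden

with open("tests/evals/.dataset.json") as f:
    records = json.load(f)

goldens = [
    Golden(
        input=record["input"],
        expected_output=record.get("expected_output"),
        context=record.get("context"),  # RAG datasets can carry context here
    )
    for record in records
]
```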

## Pytest Files

Eval tests should:

- load the dataset from `tests/evals/.dataset.json` by default
- call the real app entry point
- build DeepEval test cases
- run a small, explicit end-to-end metric list by default
- add span-level metrics only for useful component diagnostics
- use existing metrics and thresholds when found
- avoid network calls unrelated to the app or evaluation model
- be run with `deepeval test run`, not the raw `pytest` command

## Placeholder Contract

Templates intentionally contain placeholders:

- `TARGET_APP_ENTRYPOINT`
- `DATASET_PATH`
- `EVALUATION_MODEL`
- `METRICS`
- `APP_RESPONSE_ADAPTER`

Replace every placeholder before running evals. If a placeholder remains, stop
and adapt the template instead of running a broken suite.
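
For instance, `APP_RESPONSE_ADAPTER` might become a small mapping function; the
response keys below are an assumption about one possible app return shape, not
part of the contract:

```python
# Hypothetical adapter for an app returning {"answer": ..., "sources": [...]}.
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase


def app_response_adapter(golden: Golden, response: dict) -> LLMTestCase:
    return LLMTestCase(
        input=golden.input,
        actual_output=response["answer"],
        expected_output=golden.expected_output,
        retrieval_context=response.get("sources"),
    )
```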

## Result Artifacts

Do not create hidden result caches unless DeepEval already does so. The durable
artifacts are the test files, dataset files, tracing integration, and optional
Confident AI hosted reports.

skills/deepeval/references/choose-use-case.md

Lines changed: 45 additions & 0 deletions

# Choose Use Case

Classify the target app before choosing templates, datasets, or metrics. Infer
from code first; ask only when the code is ambiguous.

## Top-Level Use Case

Choose exactly one top-level use case:

1. Chatbot or multi-turn agent
2. Agent
3. RAG
4. Plain LLM

Precedence rule:

```text
chatbot / multi-turn agent > agent > RAG > plain LLM
```

If the app is both RAG and agentic, classify it as an agent.

If the app is both chatbot and agentic, classify it as chatbot / multi-turn
agent.

If the app is a chatbot backed by RAG, classify it as chatbot / multi-turn
agent.

## Signals

| Use case | Signals in code | Test shape |
| --- | --- | --- |
| Chatbot / multi-turn agent | message history, chat endpoint, user session, turns, assistant role, multi-turn state | Multi-turn E2E |
| Agent | tools, function calling, MCP tools, actions, planner, graph, LangGraph, CrewAI, PydanticAI | Single-turn E2E by default |
| RAG | retriever, vector store, documents, chunks, context, citations, no higher-precedence chatbot or agent behavior | Single-turn E2E by default |
| Plain LLM | one prompt in, one answer out, no tools or retrieval | Single-turn E2E |

Use cases guide metrics and adapter fields. Templates are separated by test
shape: single-turn E2E, multi-turn E2E, and optional component/span-level evals.

## Dataset Default

For chatbot or multi-turn agent use cases, generated datasets should be
multi-turn by default. Use single-turn QA pairs only if the user explicitly says
they want QA pairs for now.
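
As a rough sketch of the multi-turn shape, assuming the `Turn`-based test case
API of recent DeepEval versions and an invented conversation:

```python
# Illustrative multi-turn test case built from recorded chat turns.
from deepeval.test_case import ConversationalTestCase, Turn

test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="Where is my order?"),
        Turn(role="assistant", content="Order #1042 shipped yesterday."),
        Turn(role="user", content="Can I still change the delivery address?"),
        Turn(role="assistant", content="Yes, until it reaches the local depot."),
    ]
)
```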
