Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,10 @@ config/local
# Build system
rundir/

# Python
__pycache__/
*.py[cod]

# F# / dotnet
backend/src/LibExecution/package-ref-hashes.txt
backend/packages/
Expand Down Expand Up @@ -54,5 +58,10 @@ clis/
# Scratch planning/execution notes, not part of any PR.
scratch/

# Benchmark run output, regenerated by bench/run.py.
bench/results.csv
bench/runs/
bench/report.md

# Claude Code session lock (per-process, regenerated each session).
.claude/scheduled_tasks.lock
.claude/scheduled_tasks.lock
34 changes: 34 additions & 0 deletions bench/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
From a terminal run bench/run.py

# bench

Run `claude -p` against tasks in multiple languages and record
cost/turns/wall/pass-fail metrics.

## Invocations

| invocation | runs |
|---|---|
| `bench/run.py` | every task in `bench/tasks/*.md` × python+dark × 1 trial |
| `bench/run.py --task url-shortener-cli` | one task × python+dark × 1 trial |
| `bench/run.py --langs python --trials 3` | every task × python × 3 trials |
| `bench/run.py --task X --langs dark --trials 5` | one task × dark × 5 trials |

## Adding a task

Drop a markdown file at `bench/tasks/<name>.md` describing what to build.
The runner picks it up automatically on the next run.

The model gets your task md + a short language note (from `LANG_RULES` in
`run.py`) + "print PASS on the last line if it works." Pass/fail is graded
by reading the model's final reply.

## Output

- `bench/runs/<id>/` — one dir per trial: `prompt.md`, `claude_result.json`,
`metrics.json`, and a `workspace/` with the seed `./run` shim plus whatever
the model wrote. Gitignored.
- `bench/results.csv` — one row per trial with the curated metrics. Gitignored.
- `bench/report.md` — overwritten each run when any dark trial surfaces
friction notes. The summary table itself is printed to stdout, not
written here. Gitignored.
42 changes: 42 additions & 0 deletions bench/lang/dark/run
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
# Seed shim: dispatches CLI args into the Darklang package tree on a
# bench-specific branch.
#
# Required env (the harness sets these):
# BENCH_DARK_BRANCH branch holding the implementation
# DARK_REPO absolute path to the Darklang monorepo
#
# Auto-derived:
# BENCH_STATE_DIR <workspace>/dark-state/ (one file per entry)
#
# {{namespace}} is substituted by the harness from the task name. Functions
# live at Bench.{{namespace}}.<subcommand> — one per command in the spec.
#
# Do not modify this file.
set -e

: "${BENCH_DARK_BRANCH:?BENCH_DARK_BRANCH must be set}"
: "${DARK_REPO:?DARK_REPO must be set}"

CMD="$1"
shift || true
FN="Bench.{{namespace}}.${CMD}"

WORKSPACE="$(pwd)"
export BENCH_STATE_DIR="${WORKSPACE}/dark-state"
mkdir -p "$BENCH_STATE_DIR"

# Build Dark-literal args. Strings here don't contain " or \, so simple
# double-quoting suffices. Zero-arg commands need explicit unit.
DARK_ARGS=()
if [ $# -eq 0 ]; then
DARK_ARGS+=("()")
else
for a in "$@"; do
DARK_ARGS+=("\"$a\"")
done
fi

# run-cli requires cwd = repo root.
cd "$DARK_REPO"
exec ./scripts/run-cli --branch "$BENCH_DARK_BRANCH" run "@$FN" "${DARK_ARGS[@]}"
6 changes: 6 additions & 0 deletions bench/lang/python/run
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
# Seed shim: dispatches CLI args to the Python implementation.
# The implementation should live in main.py (or a module imported from it).
# Do not modify this file.
set -e
exec python3 "$(dirname "$0")/main.py" "$@"
6 changes: 6 additions & 0 deletions bench/lang/ts/run
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#!/usr/bin/env bash
# Seed shim: dispatches CLI args to the TypeScript implementation.
# The implementation should live in main.ts.
# Do not modify this file.
set -e
exec bun "$(dirname "$0")/main.ts" "$@"
Loading