Commit e79df6e

Axect and claude committed
Add preflight check, HPO analysis, training diagnostics, and AI agent skills
Major features:

- `preflight` CLI command: 1-batch forward+backward check catches config errors (shape mismatch, NaN gradients, GPU OOM) before wasting GPU time
- `hpo-report` CLI command: parameter importance (fANOVA), boundary warnings, top-K trial comparison after HPO
- GradientMonitorCallback: exploding gradient detection, logged to W&B
- OverfitDetectionCallback: train/val divergence warning, logged to W&B
- Pluggable data loading via `data` field in RunConfig (importlib-based)
- 3-tier config validation: structural → runtime → semantic
- `--json` output for preflight and hpo-report (AI agent friendly)
- AI agent skills: pytorch-train (experiment pipeline), pytorch-migrate (version migration)
- Documentation rewritten as "Human Skill Guide": 5 chapters covering the full pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f3a7eac commit e79df6e

28 files changed

Lines changed: 3887 additions & 2603 deletions
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
name: pytorch-migrate
description: >
  Migrate an existing project that uses pytorch_template to the latest version.
  Use this skill when the user says: "update template", "migrate", "upgrade",
  "apply latest changes", "sync with template", "update pytorch_template",
  or when the user's project is missing new features (preflight, hpo-report,
  data field, diagnostic callbacks).
allowed-tools: Bash, Write, Read, Glob, Grep, Edit
---

# pytorch-migrate

Detect the current version of a pytorch_template-based project and apply all necessary migrations to bring it up to date.

## Usage

```
/pytorch-migrate [project_path]
```

If `project_path` is omitted, the skill uses the current working directory.

---

## Step 1: Detect Current Version

Read the project's files and determine which version it's based on by checking for feature markers.

Run these checks **in order**; the first missing feature determines the starting migration point:

| Check | How to detect | Version if MISSING |
|-------|---------------|-------------------|
| `config.py` has `RunConfig` dataclass | `class RunConfig` exists | Pre-template (not migratable) |
| `callbacks.py` exists | File exists | v0 (monolithic, pre-callback refactor) |
| `pruner.py` has `PFLPruner` | `class PFLPruner` in pruner.py | v1 (pre-PFL pruner, before 2024-12) |
| `callbacks.py` has `NaNDetectionCallback` | Class exists | v2 (pre-NaN detection, before 2024-09) |
| `callbacks.py` has `CheckpointCallback` | Class exists | v3 (pre-checkpoint, before 2025-04) |
| `config.py` has `data` field in RunConfig | `data: str` in RunConfig | v4 (pre-data-decoupling) |
| `callbacks.py` has `GradientMonitorCallback` | Class exists | v5 (pre-diagnostics) |
| `cli.py` has `preflight` command | `def preflight` exists | v5 (pre-preflight) |
| `cli.py` has `hpo_report` command | `def hpo_report` exists | v5 (pre-hpo-report) |
| All checks pass | n/a | Current (up to date) |

```bash
# Quick detection script: prints "1" next to each marker that is present,
# and "0" if the marker or its file is missing.
has() { if grep -q -- "$1" "$2" 2>/dev/null; then echo "1  $1"; else echo "0  $1"; fi; }
has "class PFLPruner" pruner.py
has "class NaNDetectionCallback" callbacks.py
has "class CheckpointCallback" callbacks.py
has "data: str" config.py
has "class GradientMonitorCallback" callbacks.py
has "def preflight" cli.py
has "def hpo_report" cli.py
```
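
The "first missing feature wins" rule from the table can also be sketched in Python. This is an illustrative sketch only, not code from the template: the `CHECKS` list and the `detect_version` name are hypothetical, and it assumes the template files sit directly in the project root, as the shell script above does.

```python
from pathlib import Path

# (file, marker, version label if missing), in the order of the detection table.
# An empty marker means only the file's existence is checked.
# Hypothetical helper data; not part of pytorch_template.
CHECKS = [
    ("config.py", "class RunConfig", "pre-template"),
    ("callbacks.py", "", "v0"),
    ("pruner.py", "class PFLPruner", "v1"),
    ("callbacks.py", "class NaNDetectionCallback", "v2"),
    ("callbacks.py", "class CheckpointCallback", "v3"),
    ("config.py", "data: str", "v4"),
    ("callbacks.py", "class GradientMonitorCallback", "v5"),
    ("cli.py", "def preflight", "v5"),
    ("cli.py", "def hpo_report", "v5"),
]


def detect_version(project: Path) -> str:
    """Return the label of the first failing check, or 'current' if all pass."""
    for filename, marker, version in CHECKS:
        path = project / filename
        if not path.is_file():
            return version
        if marker and marker not in path.read_text(encoding="utf-8"):
            return version
    return "current"
```

Calling `detect_version(Path("."))` from the project root returns the label of the earliest failing check, which is the starting point for Step 2.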

---

## Step 2: Apply Migrations

Apply only the migrations that are needed, in order. Each migration is independent and idempotent.

Read `references/migrations.md` for the detailed migration steps.

### Migration Summary

| Migration | From → To | Changes |
|-----------|-----------|---------|
| M1: PFL Pruner | v1→v2 | Add `pruner.py`, update `optimize_template.yaml` |
| M2: NaN Detection + Checkpoint | v2→v3 | Add NaN/Checkpoint callbacks, add `CheckpointConfig` to `config.py` |
| M3: Modular CLI | v3→v4 | Add `cli.py` with typer, refactor `main.py` |
| M4: Data Decoupling | v4→v5 | Add `data` field to `RunConfig`, add `load_data()` method, update CLI |
| M5: Diagnostics + Preflight + HPO Report | v5→current | Add callbacks, CLI commands, `validate_semantics()` |
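
Because migrations are applied in order from the detected version onward, picking the pending ones amounts to slicing the summary table. The sketch below is a hypothetical illustration (the `MIGRATIONS` list and `pending_migrations` are not part of the template). It raises an error for labels the summary table has no row for, such as `v0`.

```python
# Migration ids paired with the version each one starts from,
# in application order (taken from the summary table).
MIGRATIONS = [
    ("M1", "v1"),
    ("M2", "v2"),
    ("M3", "v3"),
    ("M4", "v4"),
    ("M5", "v5"),
]


def pending_migrations(detected: str) -> list[str]:
    """Return the migration ids to apply, in order, for a detected version."""
    if detected == "current":
        return []
    starts = [start for _, start in MIGRATIONS]
    if detected not in starts:
        # The summary table lists no migration starting from this version.
        raise ValueError(f"no migration listed for version {detected!r}")
    first = starts.index(detected)
    return [migration_id for migration_id, _ in MIGRATIONS[first:]]
```

For a project detected as `v3`, this yields `["M3", "M4", "M5"]`.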

---

## Step 3: Migrate User's Config Files

After updating the template code, scan for existing YAML configs and update them:

```bash
# Find all run configs
find configs/ -name "*.yaml" -type f 2>/dev/null
```

For each YAML config file:

### Add `data` field (M4+)

If the config lacks a `data:` field, add it after `criterion_config:`:

```yaml
data: util.load_data  # Default; change to your project's data loader
```
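
One way to find the configs that still lack the field is a small read-only scan. This is a sketch under assumptions: `configs_missing_data` is a hypothetical helper, it treats only an unindented `data:` line as the top-level key, and it does not distinguish run configs from other YAML files (for example HPO templates) in the same directory.

```python
import re
from pathlib import Path

# A top-level key starts at column 0; indented or commented-out
# `data:` lines are deliberately not counted.
_DATA_KEY = re.compile(r"^data\s*:", re.MULTILINE)


def configs_missing_data(config_dir: Path) -> list[Path]:
    """List YAML files under config_dir that have no top-level `data:` key."""
    return sorted(
        path
        for path in config_dir.rglob("*.yaml")
        if path.is_file()
        and not _DATA_KEY.search(path.read_text(encoding="utf-8"))
    )
```

The function only reports files; editing them is left to the migration step so that user configs are never rewritten blindly.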

**Important:** If the project has a custom `load_data()` in `util.py`, the user should either:
1. Keep `data: util.load_data` (works as-is)
2. Move their data loader to a dedicated module and use that path
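
The commit message describes the `data` field as importlib-based. A minimal sketch of how such a dotted path could be resolved follows. It is an assumption about the mechanism, not the template's actual implementation, and `resolve_loader` is a hypothetical name.

```python
import importlib
from typing import Any, Callable


def resolve_loader(dotted_path: str = "util.load_data") -> Callable[..., Any]:
    """Resolve 'package.module.function' to the callable it names."""
    module_name, sep, attr = dotted_path.rpartition(".")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module.function', got {dotted_path!r}")
    module = importlib.import_module(module_name)  # ImportError if missing
    loader = getattr(module, attr)  # AttributeError if the name is absent
    if not callable(loader):
        raise TypeError(f"{dotted_path!r} does not name a callable")
    return loader
```

Under this scheme, both options in the list work the same way: `util.load_data` resolves to the existing function, and a relocated loader only changes the string in the YAML file.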

### Verify config compatibility

After migration, run:
```bash
python -m cli preflight <config_path> --device cpu
```

---

## Step 4: Update Dependencies

Check whether new dependencies are needed:

```bash
# `pip show` succeeds if at least one of the named packages is installed,
# so check each package on its own.
need() { for pkg in "$@"; do pip show "$pkg" >/dev/null 2>&1 || echo "Need: uv pip install $pkg"; done; }

# Required since v3 (CLI refactor)
need typer rich beaupy

# Required since v1 (template inception)
need pytorch-optimizer pytorch-scheduler
```
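
The same check can be run from inside the project's Python environment with the standard library's `importlib.metadata`, which avoids depending on which `pip` is on the `PATH`. The helper name `missing_packages` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def missing_packages(names: list[str]) -> list[str]:
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError if not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, `missing_packages(["typer", "rich", "beaupy"])` returns only the names that still need to be installed.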

---

## Step 5: Verify

```bash
# Run tests
pytest tests/ -x -q

# Run preflight on existing configs
python -m cli preflight <config_path> --device cpu

# Check system health
python -m cli doctor
```

---

## Important Notes

- **Never overwrite the user's custom `load_data()`** in util.py. The data decoupling migration adds the `data` field to RunConfig but preserves the existing function.
- **Never overwrite the user's custom callbacks.** Add new callbacks alongside existing ones.
- **Preserve the user's custom model code.** Migrations only touch template infrastructure files.
- **Back up first.** Recommend `git stash` or `git commit` before migrating.
- **The `data` field is optional in config files.** Set it to use the new pluggable data loading; if it is omitted, the default `util.load_data` is used automatically (backward compatible).
