|
2 | 2 |
|
3 | 3 | [](https://github.com/OpenAdaptAI/openadapt-evals) |
4 | 4 |
|
5 | | -> *Auto-generated from [OpenAdaptAI/openadapt-evals](https://github.com/OpenAdaptAI/openadapt-evals). Last synced: 2026-03-29 15:50 UTC* |
| 5 | +> *Auto-generated from [OpenAdaptAI/openadapt-evals](https://github.com/OpenAdaptAI/openadapt-evals). Last synced: 2026-03-29 15:56 UTC* |
6 | 6 |
|
7 | 7 | --- |
8 | 8 |
|
@@ -204,6 +204,44 @@ python scripts/run_full_eval.py \ |
204 | 204 |
|
205 | 205 | The endpoint uses the UI-Venus native bounding-box prompt format (`[x1,y1,x2,y2]`) and is compatible with vLLM, Ollama, or any OpenAI-compatible server. Both `DemoExecutor` and `PlannerGrounderAgent` use the same prompt format for consistency. |
206 | 206 |
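As an illustration of the `[x1,y1,x2,y2]` format mentioned above, here is a minimal sketch of how a grounder response could be turned into a click point. It is not taken from the repository: the helper names `parse_bbox` and `bbox_center` are hypothetical, and it assumes the box appears as four comma-separated numbers in square brackets somewhere in the model's reply, in whatever coordinate space the caller expects.

```python
import re
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]

# Matches "[x1,y1,x2,y2]" with optional whitespace and optional decimals.
# Assumption: the model emits exactly one such box per reply.
_NUM = r"(-?\d+(?:\.\d+)?)"
_BBOX_RE = re.compile(
    r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]"
)


def parse_bbox(text: str) -> Optional[BBox]:
    """Return the first [x1,y1,x2,y2] box found in text, or None."""
    match = _BBOX_RE.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return (x1, y1, x2, y2)


def bbox_center(bbox: BBox) -> Tuple[float, float]:
    """Return the midpoint of the box, a natural place to click."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```

Returning `None` for unparseable replies lets the caller decide whether to retry or skip the step, rather than clicking at a guessed location.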
|
| 207 | +### GRPO training with TRL (recommended) |
| 208 | + |
| 209 | +The recommended path for RL training of VLM desktop agents uses TRL's `GRPOTrainer` with dense milestone rewards from WAA environments. This replaces the standalone GRPO trainer with a battle-tested implementation that supports Unsloth, vLLM, constrained decoding, and automatic telemetry. |
| 210 | + |
| 211 | +```bash |
| 212 | +# Basic training against a live WAA VM |
| 213 | +python scripts/train_trl_grpo.py \ |
| 214 | + --task-dir ./example_tasks \ |
| 215 | + --server-url http://localhost:5001 \ |
| 216 | + --model Qwen/Qwen2.5-VL-7B-Instruct \ |
| 217 | + --output ./grpo_output |
| 218 | + |
| 219 | +# With Unsloth (2x VRAM efficiency) + constrained decoding |
| 220 | +python scripts/train_trl_grpo.py \ |
| 221 | + --task-dir ./example_tasks \ |
| 222 | + --server-url http://localhost:5001 \ |
| 223 | + --model Qwen/Qwen2.5-VL-7B-Instruct \ |
| 224 | + --use-unsloth \ |
| 225 | + --constrained-decoding \ |
| 226 | + --output ./grpo_output |
| 227 | + |
| 228 | +# Mock mode (validates full pipeline without VM or GPU) |
| 229 | +python scripts/train_trl_grpo.py \ |
| 230 | + --task-dir ./example_tasks \ |
| 231 | + --mock \ |
| 232 | + --output ./grpo_output_mock |
| 233 | + |
| 234 | +# With Weave tracing for experiment tracking |
| 235 | +python scripts/train_trl_grpo.py \ |
| 236 | + --task-dir ./example_tasks \ |
| 237 | + --server-url http://localhost:5001 \ |
| 238 | + --model Qwen/Qwen2.5-VL-7B-Instruct \ |
| 239 | + --weave-project openadapt-grpo \ |
| 240 | + --output ./grpo_output |
| 241 | +``` |
| 242 | + |
| 243 | +Key flags: `--constrained-decoding` (Outlines regex, eliminates unparseable output), `--vision-loss-mode` (exclude/include/checkpoint), `--weave-project` (Weave tracing), `--use-vllm` (faster generation), `--loss-type` (grpo/dapo/dr_grpo). |
| 244 | + |
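To make the "dense milestone rewards" and `--loss-type` options more concrete, the sketch below shows the group-relative advantage at the heart of GRPO: each rollout's reward is compared against the other rollouts sampled for the same task. This is an illustrative sketch, not the code in `train_trl_grpo.py` or TRL. The function names are hypothetical, the reward is assumed to be the fraction of milestones passed, and the population standard deviation is used for simplicity. Setting `scale_by_std=False` mirrors the Dr. GRPO idea of dropping the standard-deviation normalization.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def milestone_reward(passed: int, total: int) -> float:
    """Dense reward: fraction of milestones the rollout completed.

    Assumption: each task defines `total` milestones of equal weight.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def group_relative_advantages(
    rewards: Sequence[float],
    scale_by_std: bool = True,
    eps: float = 1e-6,
) -> List[float]:
    """Center rewards on the group mean, optionally scaling by group std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale_by_std:
        # Mean-centering only, as in the Dr. GRPO variant.
        return centered
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout scored the same.
    return [c / (sigma + eps) for c in centered]


# Four rollouts of one task with 4 milestones each.
rewards = [milestone_reward(p, 4) for p in (0, 2, 4, 2)]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal. This is one reason dense partial-credit rewards can help on tasks that are rarely solved end to end.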
207 | 245 | ### Parallel evaluation |
208 | 246 |
|
209 | 247 | ```bash |
|