Enhancing Laboratory 3D Perception and Long-Horizon Planning via Protocol-Conditioned Action Prediction
Overview | News | Explorer | Datasets | Leaderboard | Training | Agent | Quick Start | Citation
LabHorizon is a data and evaluation suite for laboratory action prediction. It studies how models connect multi-view laboratory assets, real-world experimental context, and long-horizon action structure before they can support reliable AI scientist workflows.
Unlike general scientific QA or diagram-based multimodal benchmarks, LabHorizon frames laboratory reasoning as protocol-conditioned action prediction: a model must either select the next protocol-consistent action from visually grounded candidates or produce a structured long-horizon experimental action sequence.
- 2026-05-29: Added the first LabHorizon trained+agents result.
Qwen3.6-35B-A3B(trained+agents)reaches 0.665 Level 1 next-action accuracy and 0.4532 Level 2 Final Score. - 2026-05-28: Refreshed the public Website with a rocket favicon, direct GitHub / Hugging Face links, diversified demo assets, and updated real test examples. Level 1 now highlights thermal cycler and vortex mixer samples with upright checked asset views. Level 2 now shows plasmid DNA purification and mRNA cleanup samples with card-based constraints, available-input cards, expandable action-pool cards, and graph-like gold action sequences.
- 2026-05-28: Initialized the public LabHorizon repository and released the two Hugging Face datasets: Level 1 3D Asset Perception and Level 2 Protocol-Conditioned Planning, each with train and test splits.
| π¬ 3D Asset Perception Multi-view laboratory asset inputs |
π§ Protocol Action Prediction History and protocol context guide the next action |
π§ͺ Long-Horizon Planning Structured action sequences with dependencies |
π³ AST Scoring Action, parameter, and dependency parsing |
| π 3,000 + 3,000 Train Training samples across two levels |
π 200 + 200 Test Matched evaluation samples |
π OpenAI-Compatible Works with any compatible model endpoint |
β»οΈ Resume Friendly JSONL outputs can be reused across runs |
flowchart TD
P["Real-world protocol condition<br/>P"] --> L1
P --> L2
I["Multi-view laboratory asset<br/>I"] --> L1["Level 1<br/>protocol-conditioned multi-view asset action prediction"]
H["Historical actions<br/>h"] --> L1
C["Candidate next actions<br/>C"] --> L1
L1 --> O1["Reasoning + next action<br/>r, n"]
O1 --> M1["Next Action Accuracy"]
CTX["Context, goal, constraints<br/>context, g, R"] --> L2["Level 2<br/>protocol-conditioned action-pool long-horizon prediction"]
U["Available inputs<br/>U"] --> L2
AP["Action pool<br/>A"] --> L2
L2 --> O2["Structured action sequence<br/>s = (s1, ..., sT)"]
O2 --> AST["Python AST action parser<br/>calls, parameters, variables"]
AST --> M2["Action Sequence Similarity"]
AST --> M3["Parameter Accuracy"]
AST --> M4["Final Score"]
style P fill:#ecfccb,stroke:#65a30d,stroke-width:2px
style I fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style L1 fill:#fef3c7,stroke:#d97706,stroke-width:2px
style L2 fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px
style AST fill:#fee2e2,stroke:#dc2626,stroke-width:2px
The docs/ directory contains a static GitHub Pages explorer for LabHorizon. It keeps the original dark visual style and interactive Three.js laboratory asset viewer, but now focuses on the two released data levels. The current public demo samples are chosen for asset diversity and manually checked so the rendered assets are upright, readable, and not broken.
- Level 1: real public test samples with thermal cycler and vortex mixer assets, three rendered asset views, historical actions, candidate next actions, card-based reference reasoning, and gold next action.
- Level 2: real public test samples covering plasmid DNA purification and mRNA cleanup, with context, goal, card-based constraints, available-input cards, expandable action-pool cards, and a graph-like gold experimental action sequence.
The sidebar also links directly to the GitHub repository and both Hugging Face dataset cards.
The page is fully static and can be previewed locally:
python -m http.server 8765 --directory docs| Level | Hugging Face Dataset | Input | Target | Metric |
|---|---|---|---|---|
| Level 1 | LabHorizon-3D-Asset-Perception | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
| Level 2 | LabHorizon-Protocol-Conditioned-Planning | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | Action Sequence Similarity, Parameter Accuracy |
| Column | Meaning |
|---|---|
id |
Stable public sample identifier, such as LabHorizon-L1-test-000001. |
asset |
Three rendered views of the same laboratory asset. |
historical_actions |
Previous protocol actions and the current experimental state. |
candidate_next_actions |
Candidate next laboratory actions. |
reasoning |
Reference reasoning for the gold next action. |
next_action |
Gold protocol-consistent next action. |
asset_name |
Human-readable asset name for analysis. |
asset_family |
Asset family label for distribution analysis. |
| Column | Meaning |
|---|---|
id |
Stable public sample identifier, such as LabHorizon-L2-test-000001. |
context |
Experimental context for the local protocol window. |
goal |
Planning objective. |
constraints |
Protocol-derived constraints and parameter requirements. |
available_inputs |
Raw materials, samples, or measurements available before planning. |
action_pool_names |
Names of available action-pool functions. |
action_pool |
Python function definitions describing available laboratory actions. |
gold_action_sequence |
Gold long-horizon experimental action sequence. |
The tables below report direct-prompting model results on the current v20260510-repaired 200-sample test split. Level 1 is sorted by Next Action Accuracy; Level 2 is sorted by Final Score.
| Rank | Model | Next Action Accuracy |
|---|---|---|
| π₯ | Grok 4.3 | 0.555 |
| π₯ | Kimi K2.6 | 0.550 |
| π₯ | GPT-5.5 | 0.535 |
| 4 | GPT-5.4 | 0.520 |
| 5 | Qwen3.6 Plus | 0.505 |
| 6 | Claude Opus 4.7 | 0.500 |
| 7 | Qwen3.5 35B-A3B | 0.495 |
| 8 | MiMo V2.5 | 0.495 |
| 9 | Qwen3.5 9B | 0.485 |
| 10 | Gemini 3.5 Flash | 0.485 |
| 11 | Qwen3.6 35B-A3B | 0.475 |
| 12 | Gemini 3.1 Pro Preview | 0.465 |
| Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
|---|---|---|---|---|
| π₯ | Gemini 3.1 Pro Preview | 0.3263 | 0.3195 | 0.3331 |
| π₯ | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
| π₯ | Kimi K2.6 | 0.3150 | 0.2845 | 0.3456 |
| 4 | Gemini 3.5 Flash | 0.3039 | 0.2686 | 0.3391 |
| 5 | Qwen3.7 Max | 0.3003 | 0.2905 | 0.3102 |
| 6 | Claude Opus 4.7 | 0.2737 | 0.2619 | 0.2856 |
| 7 | GPT-5.4 | 0.2715 | 0.2191 | 0.3239 |
| 8 | Qwen3.6 35B-A3B | 0.2534 | 0.2585 | 0.2483 |
| 9 | Qwen3.6 Plus | 0.2526 | 0.2264 | 0.2787 |
| 10 | MiMo V2.5 | 0.2491 | 0.2269 | 0.2713 |
| 11 | GLM 5.1 | 0.2413 | 0.2307 | 0.2519 |
| 12 | Qwen3.5 35B-A3B | 0.2391 | 0.2385 | 0.2398 |
| 13 | GPT-5.5 | 0.2276 | 0.2092 | 0.2459 |
| 14 | DeepSeek V4 Pro | 0.2135 | 0.1927 | 0.2342 |
| 15 | Qwen3.5 9B | 0.1315 | 0.1359 | 0.1271 |
LabHorizon is released with matched train and test splits, so it can evaluate models and also train domain models for laboratory action prediction. As an initial system result, we train Qwen/Qwen3.6-35B-A3B on the 6,000 LabHorizon training samples and combine it with the Actor-Simulator-Selector framework.
The table compares our trained+agents system with strong direct-prompting LLM baselines evaluated on the same test splits. Our best result is placed in the final row.
| System | Level 1 Next Action Accuracy | Level 2 Action Sequence Similarity | Level 2 Parameter Accuracy | Level 2 Final Score |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 0.465 | 0.3195 | 0.3331 | 0.3263 |
| Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
| Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
| GPT-5.5 | 0.535 | 0.2092 | 0.2459 | 0.2276 |
| Qwen3.6-35B-A3B(trained+agents) | 0.665 | 0.4485 | 0.4580 | 0.4532 |
The result supports the Optimizable Learning Loop design. The trained+agents system provides more stable protocol-conditioned action prediction: it improves Level 1 asset-to-action alignment and better preserves action order, parameters, and intermediate dependencies. It does not solve the benchmark completely: Level 2 exact-match recovery remains hard, so continued agent refinement is still useful for checking global state consistency, action granularity, and parameter constraints during inference.
Training case insight. In a successful Level 2 organoid-preparation case, the trained+agents system preserves two parallel sample branches, two 100 x g, 5 min, 4 C centrifugation steps, branch-specific volume adjustment, and virus aliquot thawing on ice. This directly tests Long-Horizon Planning and Real-World Protocol Alignment. In a harder Golden Gate thermal-cycler case, the same system produces parseable actions but incorrectly expands a thermal-cycler program into local incubation calls and changes the required device-state order. This failure illustrates the remaining gap between local action familiarity and globally correct long-horizon experimental planning.
The evaluator keeps model interaction simple and model-agnostic. It sends natural-language prompts to an OpenAI-compatible chat completions endpoint, stores raw model outputs as JSONL, and computes metrics locally.
Level 1 prompts contain asset images, historical actions, and candidate next actions. The model is asked to reason first and end with:
Final Next Action: X
X may be a candidate letter or the exact candidate action. The evaluator maps the final response back to the candidate list and reports next_action_accuracy.
Level 2 prompts contain a real-world experimental context, constraints, available inputs, and an action pool. The model may answer in natural language, but the structured action sequence must appear as Python-style function calls, usually inside a fenced code block:
lysate = lyse_cells(sample=cell_pellet, buffer=lysis_buffer, duration_min=10)
clarified = centrifuge(sample=lysate, speed_x_g=12000, duration_min=15)The evaluator uses Python AST to extract action calls, keyword parameters, assigned intermediate variables, and variable dependencies. It reports:
| Metric | What It Measures |
|---|---|
Action Sequence Similarity |
Whether predicted actions appear at the correct positions relative to the gold sequence. |
Parameter Accuracy |
Whether aligned actions use correct parameter keys, values, raw inputs, and generated-variable dependencies. |
Final Score |
The mean of Action Sequence Similarity and Parameter Accuracy. |
LabHorizon includes a bounded Actor-Simulator-Selector agent for protocol-conditioned action prediction. The agent is not an open-ended ReAct loop and does not use a physical simulator. It wraps model sampling with a structured experimental state checker:
flowchart LR
T["LabHorizon task<br/>Level 1 or Level 2"] --> Actor["Actor<br/>sample candidate actions"]
T --> Sim0["Simulator<br/>construct current and target states"]
Actor --> C["Candidate next actions<br/>or action sequences"]
Sim0 --> Sim1["Simulator<br/>predict candidate state transitions"]
C --> Sim1
Sim1 --> Selector["Selector<br/>rank candidates by target-state fit"]
Selector --> Out["Final prediction<br/>next action or action sequence"]
style Actor fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style Sim0 fill:#fef3c7,stroke:#d97706,stroke-width:2px
style Sim1 fill:#fef3c7,stroke:#d97706,stroke-width:2px
style Selector fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px
The implementation in agents/ uses the same public dataset schema and evaluation contracts as evaluation/:
- Level 1 Actor outputs
Final Next Action: X, then the Selector returns one candidate next action. - Level 2 Actor outputs a structured action sequence, then AST metrics score the selected sequence.
- The Simulator and Selector can use the same model as the Actor or separate OpenAI-compatible models.
The recommended local layout keeps code and datasets as sibling repositories:
mkdir -p LabHorizon/code LabHorizon/data
git clone https://github.com/CongLab-Research/LabHorizon \
LabHorizon/code/LabHorizon
git clone https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception \
LabHorizon/data/LabHorizon-3D-Asset-Perception
git clone https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning \
LabHorizon/data/LabHorizon-Protocol-Conditioned-Planning
cd LabHorizon/code/LabHorizonpython -m pip install -r requirements.txtcp .env.example .envFill .env with an OpenAI-compatible endpoint:
BASE_URL=https://your-openai-compatible-endpoint/v1
API_KEY=your_api_key_here
EVAL_MODEL=openai/gpt-5.4
ACTOR_MODEL=qwen/qwen3.6-35b-a3b
SIMULATOR_MODEL=openai/gpt-5.4
SELECTOR_MODEL=openai/gpt-5.4
Do not commit .env. It is ignored by default.
python -m evaluation.level1.evaluate \
--split test \
--output results/level1_gpt54.jsonlpython -m evaluation.level2.evaluate \
--split test \
--output results/level2_gpt54.jsonlEach command writes one JSONL row per evaluated sample plus a .summary.json file. Use --resume to reuse already written rows after interruption.
python -m agents.run_agent \
--level 2 \
--split test \
--samples 4 \
--limit 5 \
--output results/agent_level2_demo.jsonlThe agent reads BASE_URL and API_KEY from .env by default. Advanced users may set ACTOR_BASE_URL / ACTOR_API_KEY, SIMULATOR_BASE_URL / SIMULATOR_API_KEY, and SELECTOR_BASE_URL / SELECTOR_API_KEY to route the three stages to different endpoints.
Offline tests validate dataset loading contracts, direct evaluation scoring, AST metrics, and the Actor-Simulator-Selector workflow with fake clients:
python -m unittest discover tests
python -m unittest discover agents/testsReal API smoke tests are opt-in because they call configured models through BASE_URL:
RUN_LABHORIZON_API_TESTS=1 python -m unittest tests.test_api_smokeThe smoke tests run one Level 1 direct-evaluation sample, one Level 2 direct-evaluation sample, and one Level 2 agent sample.
python -m evaluation.level1.evaluate --help
python -m evaluation.level2.evaluate --help
python -m agents.run_agent --help| Option | Default | Purpose |
|---|---|---|
--data-root |
../../data |
Directory containing the two Hugging Face dataset clones. |
--cache-dir |
.cache/huggingface/datasets |
Local Hugging Face dataset cache. |
--limit |
unset | Evaluate only the first N examples. |
--resume |
False |
Reuse existing JSONL rows in --output. |
--temperature |
unset | Optional model temperature. |
--timeout |
120 |
HTTP timeout in seconds. |
--retries |
2 |
API retry count. |
LabHorizon/
βββ README.md
βββ LICENSE
βββ requirements.txt
βββ .env.example
βββ evaluation/
β βββ utils.py # OpenAI-compatible client, dataset loading, JSONL cache
β βββ level1/
β β βββ prompts.py # Multi-image next-action prompts and answer parsing
β β βββ evaluate.py # Level 1 evaluation entry point
β βββ level2/
β βββ prompts.py # Protocol-conditioned planning prompts
β βββ metrics.py # AST parsing and ASS / PA metrics
β βββ evaluate.py # Level 2 evaluation entry point
βββ agents/
β βββ run_agent.py # Actor-Simulator-Selector CLI
β βββ workflow.py # Candidate sampling, simulation, selection, scoring
β βββ prompts.py # Actor / Simulator / Selector prompts
β βββ tests/ # Offline smoke tests
βββ tests/
βββ test_evaluation.py # Direct evaluator unit tests
βββ test_agent.py # Agent workflow unit tests
βββ test_api_smoke.py # Opt-in real API smoke tests
Generated outputs should go under results/, which is ignored by default.
- Release paper metadata and citation after the manuscript is public.
- Add official model results and analysis tables.
- Add official agent and fine-tuned model results when checkpoints are released.
Coming soon...
Please open a GitHub issue for reproducibility questions, dataset access problems, or evaluator bugs.
