Skip to content

CongLab-Research/LabHorizon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

30 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

LabHorizon

WebsiteΒ  arXivΒ  GitHubΒ  HF Level 1Β  HF Level 2

Enhancing Laboratory 3D Perception and Long-Horizon Planning via Protocol-Conditioned Action Prediction

Overview | News | Explorer | Datasets | Leaderboard | Training | Agent | Quick Start | Citation


LabHorizon laboratory asset teaser

πŸ”Ž Overview

LabHorizon is a data and evaluation suite for laboratory action prediction. It studies how models connect multi-view laboratory assets, real-world experimental context, and long-horizon action structure before they can support reliable AI scientist workflows.

Unlike general scientific QA or diagram-based multimodal benchmarks, LabHorizon frames laboratory reasoning as protocol-conditioned action prediction: a model must either select the next protocol-consistent action from visually grounded candidates or produce a structured long-horizon experimental action sequence.

πŸ“° News

  • 2026-05-29: Added the first LabHorizon trained+agents result. Qwen3.6-35B-A3B(trained+agents) reaches 0.665 Level 1 next-action accuracy and 0.4532 Level 2 Final Score.
  • 2026-05-28: Refreshed the public Website with a rocket favicon, direct GitHub / Hugging Face links, diversified demo assets, and updated real test examples. Level 1 now highlights thermal cycler and vortex mixer samples with upright checked asset views. Level 2 now shows plasmid DNA purification and mRNA cleanup samples with card-based constraints, available-input cards, expandable action-pool cards, and graph-like gold action sequences.
  • 2026-05-28: Initialized the public LabHorizon repository and released the two Hugging Face datasets: Level 1 3D Asset Perception and Level 2 Protocol-Conditioned Planning, each with train and test splits.

✨ Highlights

πŸ”¬
3D Asset Perception
Multi-view laboratory asset inputs
🧭
Protocol Action Prediction
History and protocol context guide the next action
πŸ§ͺ
Long-Horizon Planning
Structured action sequences with dependencies
🌳
AST Scoring
Action, parameter, and dependency parsing
πŸ“š
3,000 + 3,000 Train
Training samples across two levels
πŸ“Š
200 + 200 Test
Matched evaluation samples
πŸ”Œ
OpenAI-Compatible
Works with any compatible model endpoint
♻️
Resume Friendly
JSONL outputs can be reused across runs

🧭 Data and Evaluation Flow

flowchart TD
    P["Real-world protocol condition<br/>P"] --> L1
    P --> L2
    I["Multi-view laboratory asset<br/>I"] --> L1["Level 1<br/>protocol-conditioned multi-view asset action prediction"]
    H["Historical actions<br/>h"] --> L1
    C["Candidate next actions<br/>C"] --> L1
    L1 --> O1["Reasoning + next action<br/>r, n"]
    O1 --> M1["Next Action Accuracy"]

    CTX["Context, goal, constraints<br/>context, g, R"] --> L2["Level 2<br/>protocol-conditioned action-pool long-horizon prediction"]
    U["Available inputs<br/>U"] --> L2
    AP["Action pool<br/>A"] --> L2
    L2 --> O2["Structured action sequence<br/>s = (s1, ..., sT)"]
    O2 --> AST["Python AST action parser<br/>calls, parameters, variables"]
    AST --> M2["Action Sequence Similarity"]
    AST --> M3["Parameter Accuracy"]
    AST --> M4["Final Score"]

    style P fill:#ecfccb,stroke:#65a30d,stroke-width:2px
    style I fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style L1 fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style L2 fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px
    style AST fill:#fee2e2,stroke:#dc2626,stroke-width:2px
Loading

πŸ–₯️ GitHub Pages Explorer

The docs/ directory contains a static GitHub Pages explorer for LabHorizon. It keeps the original dark visual style and interactive Three.js laboratory asset viewer, but now focuses on the two released data levels. The current public demo samples are chosen for asset diversity and manually checked so the rendered assets are upright, readable, and not broken.

  • Level 1: real public test samples with thermal cycler and vortex mixer assets, three rendered asset views, historical actions, candidate next actions, card-based reference reasoning, and gold next action.
  • Level 2: real public test samples covering plasmid DNA purification and mRNA cleanup, with context, goal, card-based constraints, available-input cards, expandable action-pool cards, and a graph-like gold experimental action sequence.

The sidebar also links directly to the GitHub repository and both Hugging Face dataset cards.

The page is fully static and can be previewed locally:

python -m http.server 8765 --directory docs

πŸ“¦ Datasets

Level Hugging Face Dataset Input Target Metric
Level 1 LabHorizon-3D-Asset-Perception Three asset views, historical actions, candidate next actions Gold next action Next-action accuracy
Level 2 LabHorizon-Protocol-Conditioned-Planning Context, goal, constraints, available inputs, action pool Gold experimental action sequence Action Sequence Similarity, Parameter Accuracy

πŸ”¬ Level 1 Schema

Column Meaning
id Stable public sample identifier, such as LabHorizon-L1-test-000001.
asset Three rendered views of the same laboratory asset.
historical_actions Previous protocol actions and the current experimental state.
candidate_next_actions Candidate next laboratory actions.
reasoning Reference reasoning for the gold next action.
next_action Gold protocol-consistent next action.
asset_name Human-readable asset name for analysis.
asset_family Asset family label for distribution analysis.

πŸ§ͺ Level 2 Schema

Column Meaning
id Stable public sample identifier, such as LabHorizon-L2-test-000001.
context Experimental context for the local protocol window.
goal Planning objective.
constraints Protocol-derived constraints and parameter requirements.
available_inputs Raw materials, samples, or measurements available before planning.
action_pool_names Names of available action-pool functions.
action_pool Python function definitions describing available laboratory actions.
gold_action_sequence Gold long-horizon experimental action sequence.

πŸ† Leaderboard

The tables below report direct-prompting model results on the current v20260510-repaired 200-sample test split. Level 1 is sorted by Next Action Accuracy; Level 2 is sorted by Final Score.

πŸ”¬ Level 1: 3D Asset Perception

Rank Model Next Action Accuracy
πŸ₯‡ Grok 4.3 0.555
πŸ₯ˆ Kimi K2.6 0.550
πŸ₯‰ GPT-5.5 0.535
4 GPT-5.4 0.520
5 Qwen3.6 Plus 0.505
6 Claude Opus 4.7 0.500
7 Qwen3.5 35B-A3B 0.495
8 MiMo V2.5 0.495
9 Qwen3.5 9B 0.485
10 Gemini 3.5 Flash 0.485
11 Qwen3.6 35B-A3B 0.475
12 Gemini 3.1 Pro Preview 0.465

πŸ§ͺ Level 2: Protocol-Conditioned Planning

Rank Model Final Score Action Sequence Similarity Parameter Accuracy
πŸ₯‡ Gemini 3.1 Pro Preview 0.3263 0.3195 0.3331
πŸ₯ˆ Grok 4.3 0.3244 0.3339 0.3148
πŸ₯‰ Kimi K2.6 0.3150 0.2845 0.3456
4 Gemini 3.5 Flash 0.3039 0.2686 0.3391
5 Qwen3.7 Max 0.3003 0.2905 0.3102
6 Claude Opus 4.7 0.2737 0.2619 0.2856
7 GPT-5.4 0.2715 0.2191 0.3239
8 Qwen3.6 35B-A3B 0.2534 0.2585 0.2483
9 Qwen3.6 Plus 0.2526 0.2264 0.2787
10 MiMo V2.5 0.2491 0.2269 0.2713
11 GLM 5.1 0.2413 0.2307 0.2519
12 Qwen3.5 35B-A3B 0.2391 0.2385 0.2398
13 GPT-5.5 0.2276 0.2092 0.2459
14 DeepSeek V4 Pro 0.2135 0.1927 0.2342
15 Qwen3.5 9B 0.1315 0.1359 0.1271

🧠 Training Result

LabHorizon is released with matched train and test splits, so it can evaluate models and also train domain models for laboratory action prediction. As an initial system result, we train Qwen/Qwen3.6-35B-A3B on the 6,000 LabHorizon training samples and combine it with the Actor-Simulator-Selector framework.

The table compares our trained+agents system with strong direct-prompting LLM baselines evaluated on the same test splits. Our best result is placed in the final row.

System Level 1 Next Action Accuracy Level 2 Action Sequence Similarity Level 2 Parameter Accuracy Level 2 Final Score
Gemini 3.1 Pro Preview 0.465 0.3195 0.3331 0.3263
Grok 4.3 0.555 0.3339 0.3148 0.3244
Kimi K2.6 0.550 0.2845 0.3456 0.3150
GPT-5.5 0.535 0.2092 0.2459 0.2276
Qwen3.6-35B-A3B(trained+agents) 0.665 0.4485 0.4580 0.4532

The result supports the Optimizable Learning Loop design. The trained+agents system provides more stable protocol-conditioned action prediction: it improves Level 1 asset-to-action alignment and better preserves action order, parameters, and intermediate dependencies. It does not solve the benchmark completely: Level 2 exact-match recovery remains hard, so continued agent refinement is still useful for checking global state consistency, action granularity, and parameter constraints during inference.

Training case insight. In a successful Level 2 organoid-preparation case, the trained+agents system preserves two parallel sample branches, two 100 x g, 5 min, 4 C centrifugation steps, branch-specific volume adjustment, and virus aliquot thawing on ice. This directly tests Long-Horizon Planning and Real-World Protocol Alignment. In a harder Golden Gate thermal-cycler case, the same system produces parseable actions but incorrectly expands a thermal-cycler program into local incubation calls and changes the required device-state order. This failure illustrates the remaining gap between local action familiarity and globally correct long-horizon experimental planning.

πŸ“ Evaluation

The evaluator keeps model interaction simple and model-agnostic. It sends natural-language prompts to an OpenAI-compatible chat completions endpoint, stores raw model outputs as JSONL, and computes metrics locally.

πŸ”¬ Level 1: Next-Action Prediction

Level 1 prompts contain asset images, historical actions, and candidate next actions. The model is asked to reason first and end with:

Final Next Action: X

X may be a candidate letter or the exact candidate action. The evaluator maps the final response back to the candidate list and reports next_action_accuracy.

πŸ§ͺ Level 2: Protocol-Conditioned Planning

Level 2 prompts contain a real-world experimental context, constraints, available inputs, and an action pool. The model may answer in natural language, but the structured action sequence must appear as Python-style function calls, usually inside a fenced code block:

lysate = lyse_cells(sample=cell_pellet, buffer=lysis_buffer, duration_min=10)
clarified = centrifuge(sample=lysate, speed_x_g=12000, duration_min=15)

The evaluator uses Python AST to extract action calls, keyword parameters, assigned intermediate variables, and variable dependencies. It reports:

Metric What It Measures
Action Sequence Similarity Whether predicted actions appear at the correct positions relative to the gold sequence.
Parameter Accuracy Whether aligned actions use correct parameter keys, values, raw inputs, and generated-variable dependencies.
Final Score The mean of Action Sequence Similarity and Parameter Accuracy.

πŸ€– Actor-Simulator-Selector Agent

LabHorizon includes a bounded Actor-Simulator-Selector agent for protocol-conditioned action prediction. The agent is not an open-ended ReAct loop and does not use a physical simulator. It wraps model sampling with a structured experimental state checker:

flowchart LR
    T["LabHorizon task<br/>Level 1 or Level 2"] --> Actor["Actor<br/>sample candidate actions"]
    T --> Sim0["Simulator<br/>construct current and target states"]
    Actor --> C["Candidate next actions<br/>or action sequences"]
    Sim0 --> Sim1["Simulator<br/>predict candidate state transitions"]
    C --> Sim1
    Sim1 --> Selector["Selector<br/>rank candidates by target-state fit"]
    Selector --> Out["Final prediction<br/>next action or action sequence"]

    style Actor fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
    style Sim0 fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style Sim1 fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style Selector fill:#f5f3ff,stroke:#7c3aed,stroke-width:2px
Loading

The implementation in agents/ uses the same public dataset schema and evaluation contracts as evaluation/:

  • Level 1 Actor outputs Final Next Action: X, then the Selector returns one candidate next action.
  • Level 2 Actor outputs a structured action sequence, then AST metrics score the selected sequence.
  • The Simulator and Selector can use the same model as the Actor or separate OpenAI-compatible models.

πŸš€ Quick Start

1. Clone Code and Data

The recommended local layout keeps code and datasets as sibling repositories:

mkdir -p LabHorizon/code LabHorizon/data

git clone https://github.com/CongLab-Research/LabHorizon \
  LabHorizon/code/LabHorizon

git clone https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception \
  LabHorizon/data/LabHorizon-3D-Asset-Perception

git clone https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning \
  LabHorizon/data/LabHorizon-Protocol-Conditioned-Planning

cd LabHorizon/code/LabHorizon

2. Install

python -m pip install -r requirements.txt

3. Configure

cp .env.example .env

Fill .env with an OpenAI-compatible endpoint:

BASE_URL=https://your-openai-compatible-endpoint/v1
API_KEY=your_api_key_here
EVAL_MODEL=openai/gpt-5.4
ACTOR_MODEL=qwen/qwen3.6-35b-a3b
SIMULATOR_MODEL=openai/gpt-5.4
SELECTOR_MODEL=openai/gpt-5.4

Do not commit .env. It is ignored by default.

4. Run Level 1 Evaluation

python -m evaluation.level1.evaluate \
  --split test \
  --output results/level1_gpt54.jsonl

5. Run Level 2 Evaluation

python -m evaluation.level2.evaluate \
  --split test \
  --output results/level2_gpt54.jsonl

Each command writes one JSONL row per evaluated sample plus a .summary.json file. Use --resume to reuse already written rows after interruption.

6. Run the Agent

python -m agents.run_agent \
  --level 2 \
  --split test \
  --samples 4 \
  --limit 5 \
  --output results/agent_level2_demo.jsonl

The agent reads BASE_URL and API_KEY from .env by default. Advanced users may set ACTOR_BASE_URL / ACTOR_API_KEY, SIMULATOR_BASE_URL / SIMULATOR_API_KEY, and SELECTOR_BASE_URL / SELECTOR_API_KEY to route the three stages to different endpoints.

7. Run Tests

Offline tests validate dataset loading contracts, direct evaluation scoring, AST metrics, and the Actor-Simulator-Selector workflow with fake clients:

python -m unittest discover tests
python -m unittest discover agents/tests

Real API smoke tests are opt-in because they call configured models through BASE_URL:

RUN_LABHORIZON_API_TESTS=1 python -m unittest tests.test_api_smoke

The smoke tests run one Level 1 direct-evaluation sample, one Level 2 direct-evaluation sample, and one Level 2 agent sample.

βš™οΈ Useful Options

python -m evaluation.level1.evaluate --help
python -m evaluation.level2.evaluate --help
python -m agents.run_agent --help
Option Default Purpose
--data-root ../../data Directory containing the two Hugging Face dataset clones.
--cache-dir .cache/huggingface/datasets Local Hugging Face dataset cache.
--limit unset Evaluate only the first N examples.
--resume False Reuse existing JSONL rows in --output.
--temperature unset Optional model temperature.
--timeout 120 HTTP timeout in seconds.
--retries 2 API retry count.

πŸ“ Project Structure

LabHorizon/
β”œβ”€β”€ README.md
β”œβ”€β”€ LICENSE
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ .env.example
β”œβ”€β”€ evaluation/
β”‚   β”œβ”€β”€ utils.py                  # OpenAI-compatible client, dataset loading, JSONL cache
β”‚   β”œβ”€β”€ level1/
β”‚   β”‚   β”œβ”€β”€ prompts.py            # Multi-image next-action prompts and answer parsing
β”‚   β”‚   └── evaluate.py           # Level 1 evaluation entry point
β”‚   └── level2/
β”‚       β”œβ”€β”€ prompts.py            # Protocol-conditioned planning prompts
β”‚       β”œβ”€β”€ metrics.py            # AST parsing and ASS / PA metrics
β”‚       └── evaluate.py           # Level 2 evaluation entry point
β”œβ”€β”€ agents/
β”‚   β”œβ”€β”€ run_agent.py              # Actor-Simulator-Selector CLI
β”‚   β”œβ”€β”€ workflow.py               # Candidate sampling, simulation, selection, scoring
β”‚   β”œβ”€β”€ prompts.py                # Actor / Simulator / Selector prompts
β”‚   └── tests/                    # Offline smoke tests
└── tests/
    β”œβ”€β”€ test_evaluation.py        # Direct evaluator unit tests
    β”œβ”€β”€ test_agent.py             # Agent workflow unit tests
    └── test_api_smoke.py         # Opt-in real API smoke tests

Generated outputs should go under results/, which is ignored by default.

πŸ—ΊοΈ Roadmap

  • Release paper metadata and citation after the manuscript is public.
  • Add official model results and analysis tables.
  • Add official agent and fine-tuned model results when checkpoints are released.

πŸ“œ Citation

Coming soon...

πŸ’¬ Contact

Please open a GitHub issue for reproducibility questions, dataset access problems, or evaluator bugs.

⭐ Star History

Star History Chart

Back to top

About

Enhancing Laboratory 3D Perception and Long-Horizon Planning via Protocol-Conditioned Action Prediction

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages