
Some Issues and Questions About the Benchmark #65

@TinaZhang66

Description

@TinaZhang66

Summary

Hi, thanks for this excellent work! I have been trying to run metaclaw-bench recently and ran into a few issues/questions that I hope you can help clarify.

Issues

1. max_context_tokens setting appears to be ignored

Description
The configured max_context_tokens value does not appear to be passed into MetaClawConfig. As a result, the default value (20000) is used even when a larger value is specified, which can lead to unexpected prompt truncation; this is visible in the proxy_run_*.log outputs.

Proposed Fix

diff --git a/metaclaw/config_store.py b/metaclaw/config_store.py
index 12b8829f..24f703c5 100644
--- a/metaclaw/config_store.py
+++ b/metaclaw/config_store.py
@@ -250,6 +250,8 @@ class ConfigStore:
             proxy_port=proxy.get("port", 30000),
             proxy_host=proxy.get("host", "0.0.0.0"),
             served_model_name=llm.get("model_id") or "metaclaw-model",
+            # Context window
+            max_context_tokens=int(data.get("max_context_tokens", 20000)),
             # Skills

Related files

  • metaclaw/config_store.py
  • metaclaw/api_server.py
  • metaclaw/launcher.py
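To make the effect concrete, here is a minimal standalone sketch of the suspected pattern. All names here are simplified stand-ins for ConfigStore/MetaClawConfig, not the actual implementation: a value read from YAML but never forwarded into the config object silently falls back to its default.

```python
# Minimal sketch of the suspected bug (hypothetical, simplified names):
# a YAML value that is never forwarded into the dataclass constructor
# falls back to the dataclass default.
from dataclasses import dataclass

@dataclass
class Config:
    proxy_port: int = 30000
    max_context_tokens: int = 20000  # default used when the key is not forwarded

def build_config(data: dict, forward_context: bool) -> Config:
    kwargs = {"proxy_port": data.get("proxy", {}).get("port", 30000)}
    if forward_context:
        # the proposed fix: pass the YAML value through explicitly
        kwargs["max_context_tokens"] = int(data.get("max_context_tokens", 20000))
    return Config(**kwargs)

yaml_data = {"max_context_tokens": 50000, "proxy": {"port": 31000}}

print(build_config(yaml_data, forward_context=False).max_context_tokens)  # 20000
print(build_config(yaml_data, forward_context=True).max_context_tokens)   # 50000
```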

2. YAML configuration does not align between RL mode and skills mode

Description
I noticed there are several YAML config files under benchmark/scripts/config/. My understanding is that rl*.yaml and skills*.yaml correspond to the configurations of the different modes compared in the paper.

However, some settings do not appear to be aligned across these modes. For example, RL mode explicitly sets max_context_tokens: 50000, while skills-only mode seems to have no corresponding max_context_tokens entry, in which case I assume the default context limit (20000) would be used instead.

I was wondering whether this could affect the fairness of the comparison between the two modes.

Related files

  • benchmark/scripts/config/skills-only.yaml
  • benchmark/scripts/config/rl.yaml
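A quick way to surface this kind of drift is to diff the parsed configs key by key. The dicts below are hypothetical stand-ins for the parsed contents of rl.yaml and skills-only.yaml (in practice, load the real files with yaml.safe_load):

```python
# Spot keys that differ between two mode configs.
# The dicts are hypothetical stand-ins for the parsed YAML files.
rl_cfg = {"mode": "rl", "max_context_tokens": 50000, "proxy": {"port": 30000}}
skills_cfg = {"mode": "skills_only", "proxy": {"port": 30000}}

def diff_keys(a: dict, b: dict, prefix: str = ""):
    """Yield dotted key paths present in `a` but missing or different in `b`."""
    for key, val in a.items():
        path = f"{prefix}{key}"
        if key not in b:
            yield f"{path}: {val!r} (missing in other config)"
        elif isinstance(val, dict) and isinstance(b[key], dict):
            yield from diff_keys(val, b[key], prefix=path + ".")
        elif val != b[key]:
            yield f"{path}: {val!r} != {b[key]!r}"

for line in diff_keys(rl_cfg, skills_cfg):
    print(line)
# mode: 'rl' != 'skills_only'
# max_context_tokens: 50000 (missing in other config)
```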

Question

  1. Could you please explain why the prompt is compressed only when mode != "skills_only"?
    In metaclaw/api_server.py, the system prompt is compressed only when the mode is not skills_only:
        # NOTE: In skills_only mode we forward directly to the user's LLM provider.
        # Do not rewrite/collapse the system prompt here.
        if self.config.mode != "skills_only":
            cached_system = self._read_cached_system_prompt()
            if not cached_system:
                raw_system = ""
                for m in messages:
                    if isinstance(m, dict) and m.get("role") == "system":
                        raw_system = _flatten_message_content(m.get("content"))
                        break
                if raw_system:
                    cached_system = await asyncio.to_thread(
                        run_llm,
                        [{"role": "user", "content": raw_system}],
                    )
                    cached_system = (cached_system or raw_system).strip()
                    self._write_cached_system_prompt(cached_system)


            if cached_system:
                for m in messages:
                    if isinstance(m, dict) and m.get("role") == "system":
                        m["content"] = cached_system

Could you please help explain the rationale behind this design?

More specifically, since this behavior applies to RL mode but not skills_only, could it introduce an unfair advantage or otherwise affect the comparison?

  2. Could you please share the exact YAML configurations used for the three settings in the paper?
  • baseline
  • skills only
  • RL + skills

Thank you!
