Summary
Hi, thanks for this excellent work! I have been trying to run metaclaw-bench recently and ran into a few issues/questions that I hope you can help clarify.
Issues
1. max_context_tokens setting appears to be ignored
Description
The max_context_tokens value from the YAML config does not appear to be passed into MetaClawConfig. As a result, the default value (20000) is used even when a larger value is specified, which can lead to unexpected prompt truncation; this is visible in the proxy_run_*.log outputs.
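For context, this is roughly how I set the value in my run config. The exact layout is my assumption (the file name and surrounding keys are illustrative, based on the llm/proxy sections and the data.get("max_context_tokens", ...) lookup in the fix proposed below); only the max_context_tokens key matters for this issue:

```yaml
# Illustrative run config (file name and surrounding keys are assumptions)
llm:
  model_id: metaclaw-model
proxy:
  host: 0.0.0.0
  port: 30000
# Top-level key; this is the value that seems to be ignored
max_context_tokens: 50000
```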
Proposed Fix
```diff
diff --git a/metaclaw/config_store.py b/metaclaw/config_store.py
index 12b8829f..24f703c5 100644
--- a/metaclaw/config_store.py
+++ b/metaclaw/config_store.py
@@ -250,6 +250,8 @@ class ConfigStore:
             proxy_port=proxy.get("port", 30000),
             proxy_host=proxy.get("host", "0.0.0.0"),
             served_model_name=llm.get("model_id") or "metaclaw-model",
+            # Context window
+            max_context_tokens=int(data.get("max_context_tokens", 20000)),
             # Skills
```
Related files
- metaclaw/config_store.py
- metaclaw/api_server.py
- metaclaw/launcher.py
2. YAML configuration is not aligned between RL mode and skills-only mode
Description
I noticed there are several YAML config files under benchmark/scripts/config/. My understanding is that rl*.yaml and skills*.yaml correspond to the configurations of the different modes compared in the paper.
However, some settings do not seem to be aligned across these modes. For example, max_context_tokens: 50000 is explicitly set in RL mode, while skills-only mode does not appear to have a corresponding entry, in which case I assume the default context limit is used instead.
I was wondering whether this could affect the fairness of the comparison between the two modes.
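If the two modes are meant to run with the same context budget, I would have expected skills-only.yaml to carry the same entry. A minimal sketch of what I mean, assuming the key sits at the top level of the file as it does in rl.yaml (all other keys omitted):

```yaml
# benchmark/scripts/config/skills-only.yaml (sketch; all other keys omitted)
max_context_tokens: 50000  # same explicit value that rl.yaml sets
```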
Related files
- benchmark/scripts/config/skills-only.yaml
- benchmark/scripts/config/rl.yaml
Questions
- Could you please explain why the prompt is compressed only when mode != "skills_only"?
In metaclaw/api_server.py, the prompt is compressed when the mode is not skills_only:
```python
# NOTE: In skills_only mode we forward directly to the user's LLM provider.
# Do not rewrite/collapse the system prompt here.
if self.config.mode != "skills_only":
    cached_system = self._read_cached_system_prompt()
    if not cached_system:
        raw_system = ""
        for m in messages:
            if isinstance(m, dict) and m.get("role") == "system":
                raw_system = _flatten_message_content(m.get("content"))
                break
        if raw_system:
            cached_system = await asyncio.to_thread(
                run_llm,
                [{"role": "user", "content": raw_system}],
            )
            cached_system = (cached_system or raw_system).strip()
            self._write_cached_system_prompt(cached_system)
    if cached_system:
        for m in messages:
            if isinstance(m, dict) and m.get("role") == "system":
                m["content"] = cached_system
```
Could you please help explain the rationale behind this design?
More specifically, since this behavior applies to RL mode but not skills_only, could it introduce an unfair advantage or otherwise affect the comparison? (I sketch one possible way of making this symmetric after the questions below.)
- Could you please share the exact YAML configurations used for the three settings in the paper?
- baseline
- skills only
- RL + skills
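Regarding the symmetry point above, one way I could imagine making the comparison uniform would be to gate the system-prompt rewrite on an explicit setting rather than on the mode, so it can be switched on or off for every mode. This is only a sketch; compress_system_prompt is a name I made up, not an existing option:

```yaml
# Sketch only: compress_system_prompt is a hypothetical key, not an existing option
compress_system_prompt: true  # apply (or skip) the system-prompt rewrite in every mode
```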
Thank you!