
Some Issues and Questions About the Benchmark #65

@TinaZhang66

Description

@TinaZhang66

Summary

Hi, thanks for this excellent work! I have been trying to run metaclaw-bench recently and ran into a few issues/questions that I hope you can help clarify.

Issues

1. max_context_tokens setting appears to be ignored

Description
The configured max_context_tokens value does not appear to be passed into MetaClawConfig. As a result, the default value (20000) is used even when a larger value is specified, which can lead to unexpected prompt truncation; this is visible in the proxy_run_*.log outputs.

Proposed Fix

diff --git a/metaclaw/config_store.py b/metaclaw/config_store.py
index 12b8829f..24f703c5 100644
--- a/metaclaw/config_store.py
+++ b/metaclaw/config_store.py
@@ -250,6 +250,8 @@ class ConfigStore:
             proxy_port=proxy.get("port", 30000),
             proxy_host=proxy.get("host", "0.0.0.0"),
             served_model_name=llm.get("model_id") or "metaclaw-model",
+            # Context window
+            max_context_tokens=int(data.get("max_context_tokens", 20000)),
             # Skills

Related files

  • metaclaw/config_store.py
  • metaclaw/api_server.py
  • metaclaw/launcher.py
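To make the effect concrete, here is a minimal standalone sketch of the suspected pattern. All names here are simplified stand-ins for ConfigStore/MetaClawConfig, not the actual implementation: a value read from YAML but never forwarded into the config object silently falls back to its default.

```python
# Minimal sketch of the suspected bug (hypothetical, simplified names):
# a YAML value that is never forwarded into the dataclass constructor
# falls back to the dataclass default.
from dataclasses import dataclass

@dataclass
class Config:
    proxy_port: int = 30000
    max_context_tokens: int = 20000  # default used when the key is not forwarded

def build_config(data: dict, forward_context: bool) -> Config:
    kwargs = {"proxy_port": data.get("proxy", {}).get("port", 30000)}
    if forward_context:
        # the proposed fix: pass the YAML value through explicitly
        kwargs["max_context_tokens"] = int(data.get("max_context_tokens", 20000))
    return Config(**kwargs)

yaml_data = {"max_context_tokens": 50000, "proxy": {"port": 31000}}

print(build_config(yaml_data, forward_context=False).max_context_tokens)  # 20000
print(build_config(yaml_data, forward_context=True).max_context_tokens)   # 50000
```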

2. YAML configuration does not align between RL mode and skills mode

Description
I noticed there are several YAML config files under benchmark/scripts/config/. My understanding is that rl*.yaml and skills*.yaml correspond to the configurations of the different modes compared in the paper.

However, some settings do not appear to be aligned across these modes. For example, RL mode explicitly sets max_context_tokens: 50000, while skills-only mode seems to have no corresponding max_context_tokens entry, in which case I assume the default context limit (20000) would be used instead.

I was wondering whether this could affect the fairness of the comparison between the two modes.

Related files

  • benchmark/scripts/config/skills-only.yaml
  • benchmark/scripts/config/rl.yaml
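A quick way to surface this kind of drift is to diff the parsed configs key by key. The dicts below are hypothetical stand-ins for the parsed contents of rl.yaml and skills-only.yaml (in practice, load the real files with yaml.safe_load):

```python
# Spot keys that differ between two mode configs.
# The dicts are hypothetical stand-ins for the parsed YAML files.
rl_cfg = {"mode": "rl", "max_context_tokens": 50000, "proxy": {"port": 30000}}
skills_cfg = {"mode": "skills_only", "proxy": {"port": 30000}}

def diff_keys(a: dict, b: dict, prefix: str = ""):
    """Yield dotted key paths present in `a` but missing or different in `b`."""
    for key, val in a.items():
        path = f"{prefix}{key}"
        if key not in b:
            yield f"{path}: {val!r} (missing in other config)"
        elif isinstance(val, dict) and isinstance(b[key], dict):
            yield from diff_keys(val, b[key], prefix=path + ".")
        elif val != b[key]:
            yield f"{path}: {val!r} != {b[key]!r}"

for line in diff_keys(rl_cfg, skills_cfg):
    print(line)
# mode: 'rl' != 'skills_only'
# max_context_tokens: 50000 (missing in other config)
```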

Question

  1. Could you please explain why the prompt is compressed only when mode != "skills_only"?
    In metaclaw/api_server.py, the system prompt is compressed only when the mode is not skills_only:
        # NOTE: In skills_only mode we forward directly to the user's LLM provider.
        # Do not rewrite/collapse the system prompt here.
        if self.config.mode != "skills_only":
            cached_system = self._read_cached_system_prompt()
            if not cached_system:
                raw_system = ""
                for m in messages:
                    if isinstance(m, dict) and m.get("role") == "system":
                        raw_system = _flatten_message_content(m.get("content"))
                        break
                if raw_system:
                    cached_system = await asyncio.to_thread(
                        run_llm,
                        [{"role": "user", "content": raw_system}],
                    )
                    cached_system = (cached_system or raw_system).strip()
                    self._write_cached_system_prompt(cached_system)


            if cached_system:
                for m in messages:
                    if isinstance(m, dict) and m.get("role") == "system":
                        m["content"] = cached_system

Could you please help explain the rationale behind this design?

More specifically, since this behavior applies to RL mode but not skills_only, could it introduce an unfair advantage or otherwise affect the comparison?

  2. Could you please share the exact YAML configurations used for the three settings in the paper?
  • baseline
  • skills only
  • RL + skills

Thank you!
