Skip to content

[codex] Enforce complete eval validation and quiet ATOM logs#1878

Draft
Oseltamivir wants to merge 2 commits into
mainfrom
codex/strict-eval-atom-logging
Draft

[codex] Enforce complete eval validation and quiet ATOM logs#1878
Oseltamivir wants to merge 2 commits into
mainfrom
codex/strict-eval-atom-logging

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

This fixes two issues exposed by run 27896778812, job 82549778335:

  1. A multi-concurrency eval job could silently run only one concurrency and still pass verification.
  2. ATOM serving emitted tens of thousands of engine, Uvicorn access, and atomesh INFO lines.

The change makes the workflow-requested concurrency list an independent verification contract, hardens eval artifact staging/collection/reuse, audits every active launcher path that forwards eval state, and adds fail-closed ATOM logging controls without adding a filter file.

Incident analysis

The affected job requested:

1 2 4 8 16 32 64 128 256 512

It finished in about 14 minutes because only concurrency 512 was evaluated. The AMD multi-node Docker launch used bare -e EVAL_CONC behind sudo. With normal sudo environment filtering, the Docker client did not receive the host value, so the container did not receive the requested list and the eval path fell back to the matrix concurrency.

The existing verifier then validated only the files that happened to exist. It had no independent copy of the workflow request, so one valid result could look like a complete batch.

The source log had 81,074 lines. Of those, 72,378 were ATOM logger lines, including 57,941 producer/consumer INFO lines; 2,649 were Uvicorn access lines; and 2,676 were atomesh HTTP INFO lines. The controls in this PR directly target 77,703 lines, or about 95.8% of that log, while preserving warnings and errors.

Eval execution and verification

Preserve the requested list

  • Forward EVAL_CONC with an explicit value through the AMD Slurm/Docker boundary instead of relying on Docker's bare environment inheritance through sudo.
  • Audit the active NVIDIA and AMD launchers and normalize eval environment forwarding across B200, B300, GB200, H100, H200, and MI355X paths.
  • Thread the model path, model length, eval backend, task directory, and eval model-length overrides through the AMD multi-node submit, Slurm, and container layers.

Fail on incomplete batches

The workflow now invokes validate_scores.py --expected-concs ... with the concurrency request derived independently from the matrix inputs. Verification fails when any of the following is true:

  • requested concurrency metadata is missing, malformed, duplicated, incomplete, or contains unexpected values;
  • completed and failed concurrency sets overlap;
  • any requested concurrency is failed or absent;
  • result files are missing, duplicated, unexpectedly suffixed, or do not exactly cover the requested set;
  • the eval command or artifact finalization recorded a nonzero exit code;
  • a result file has no checked metric, has a nonnumeric/non-finite/out-of-range metric, or has a different task/metric set from the other concurrency results;
  • any checked score is below its configured model/task threshold.

Verification runs after artifact upload and uses success() || failure() semantics, so failed and partial eval outputs remain available for diagnosis while the Verify eval scores step still turns the job red.

Harden artifacts and reuse

  • Require each eval point to produce both valid metadata and at least one results*.json artifact.
  • Record eval_exit_code, expected concurrencies, completed concurrencies, and failed concurrencies in meta_env.json.
  • Make multi-node ATOM, SGLang, and vLLM finalization/copy failures explicit instead of silently continuing.
  • Reject partial or failed batches in collect_eval_results.py.
  • Apply the same completeness and threshold checks when authorizing reusable sweep artifacts, preventing an incomplete historical eval from being reused as successful evidence.
  • Validate positive, unique concurrency values and consistent eval flags at the Pydantic matrix boundary.
  • Keep throughput matrices from accidentally also launching eval work; evals remain in their dedicated eval-only matrices.

ATOM logging controls

I checked both current upstream ROCm/ATOM (ea08015c51aeaab40bd39b89eef009df9c148dc3) and the recipe/image-linked revision (5d42d49f9e4292e5b61475917e92e7ec1b1dacb7). The following are InferenceX compatibility controls, not native upstream ATOM variables:

  • ATOM_LOG_LEVEL=WARNING
  • ATOM_UVICORN_LOG_LEVEL=warning
  • ATOM_UVICORN_ACCESS_LOG=0
  • ATOMESH_LOG_LEVEL=warn

Only atomesh exposes a native --log-level option. Upstream ATOM hardcodes its Python console handler to INFO and calls uvicorn.run() without logging controls. Therefore this PR uses the existing benchmarks/multi_node/amd_utils/setup_deps.sh to patch the installed package in place before serving. No new file is introduced.

The patch is idempotent and applies to spawned worker processes because each process reads the inherited environment. It:

  • sets both the ATOM logger and its console handler to ATOM_LOG_LEVEL;
  • disables logger propagation so a root INFO handler cannot re-emit ATOM messages;
  • passes the configured Uvicorn log level and access-log setting to uvicorn.run();
  • passes ATOMESH_LOG_LEVEL to atomesh's native flag.

Dependency setup and environment sourcing are explicitly guarded, so the server exits before serving if either returns nonzero. The installed-package patch also fails closed if the package shape is unexpected or if behavior checks show that ATOM INFO remains visible, warnings are duplicated/lost, propagation remains enabled, or Uvicorn access logging remains on. Setting the levels back to INFO/info and ATOM_UVICORN_ACCESS_LOG=1 restores verbose troubleshooting output.

Why existing tests missed this

The previous tests mostly asserted that launch commands contained -e EVAL_CONC; that assertion encoded the unsafe bare form instead of testing the value after the sudo boundary. Shell mocks also ran in a shared test environment and did not model sudo environment filtering.

Separately, score validation treated the produced artifacts as the source of truth. No test supplied an independent expected concurrency list, so there was no way to distinguish “one requested and one produced” from “ten requested and one produced.” Collector and reuse tests likewise focused on parsable rows rather than exact batch completeness.

This PR adds regression coverage for explicit environment propagation, invalid/missing/duplicate/unexpected concurrencies, command and staging failures, malformed metadata/results, metric-set drift, non-finite and below-threshold scores, collector rejection, reuse rejection, and matrix schema constraints. The workflow test path filters now include the relevant benchmark templates, eval files, AMD utilities, single-node scripts, and launchers so these changes actually trigger the gate.

Validation performed

  • uv run --with pytest --with pydantic --with pyyaml --with tabulate pytest utils/ -q: 413 passed
  • Ruff over every changed Python file: clean
  • bash -n over all changed shell scripts: clean
  • shellcheck benchmarks/multi_node/amd_utils/setup_deps.sh: clean
  • git diff --check: clean
  • Applied and behavior-tested the ATOM package patch twice against both upstream revisions to verify compatibility and idempotence.
  • AST-audited the noisy ATOM producer/consumer sites: normal request-path messages are INFO while error/exception paths remain visible at warning/error levels.

Live hardware validation

The source workload is run 27896739211: two MiniMax-M3 MXFP4 MI355X ATOM-disaggregated eval jobs (1k1k and 8k1k), each requesting 1 2 4 8 16 32 64 128 256 512 against a 1P/1D TP4 topology.

That config and launcher are still on PR #1856 rather than main. To keep this PR scoped, the exact replay will run from a temporary integration branch with PR #1856 layered on top of this commit. The exact replay was dispatched from codex/strict-eval-atom-logging-replay-1856 at commit 83a10e21ff87f109de302f6739b4f66417d18aed: run 27902697778. It is queued; this section will be updated with the terminal result and log-volume evidence.

Scope

  • Based directly on current main.
  • No perf-changelog.yaml changes.
  • No new files.
  • No changes under deprecated/.
  • No DSV4 configuration changes.

Note

Cursor Bugbot is generating a summary for commit 8a923fb. Configure here.

@github-actions

Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。

如需更多帮助,PR 作者可通过 Slack 联系核心维护者。

@Oseltamivir Oseltamivir marked this pull request as ready for review June 21, 2026 11:21
@Oseltamivir Oseltamivir requested a review from a team June 21, 2026 11:21
@claude

claude Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @Oseltamivir's task in 5m 18s —— View job


Reviewing PR #1878

  • Read repo guidelines and gather context
  • Review workflow template changes
  • Review shell script changes (launchers, AMD utils)
  • Review Python validation/collection changes
  • Review test changes
  • Post summary

LGTM - no blocking issues found.

This is a thorough and well-motivated fix for two real problems: (1) eval jobs silently passing with incomplete concurrency coverage due to sudo environment filtering dropping bare -e EVAL_CONC, and (2) ATOM engine logs drowning out useful output (~96% of the 81K-line log).

Key observations from the review:

  • Core fix is correct: job.slurm now passes EVAL_CONC as -e "EVAL_CONC=$EVAL_CONC" (quoted, with explicit value) instead of bare -e EVAL_CONC, which survives Docker-through-sudo.
  • Independent verification contract works: the workflow derives --expected-concs from the matrix inputs (not from eval output metadata), so a silent single-concurrency fallback would now fail validation.
  • Fail-closed behavior throughout: missing metadata, malformed results, non-finite/out-of-range scores, metric-set drift across concurrencies, and eval exit code failures all correctly produce errors.
  • Bug fix in server_atom.sh: the old code used $PREFILL_TP_SIZE for decode parallel args and used numeric comparison (-gt 1) on boolean string values — both fixed.
  • ATOM logging patch is carefully guarded: the embedded Python patch verifies its own effect (logger suppression, propagation, uvicorn access-log behavior) before allowing serving to proceed.
  • run-sweep.yml change is intentional: throughput matrices now hardcode run-eval: false; evals live in their dedicated eval-only matrices. All sweep jobs consistently reflect this.
  • Comprehensive test coverage: 1100+ lines of new tests covering invalid/missing/duplicate concurrencies, staging failures, metric drift, non-finite scores, below-threshold scores, collector rejection, reuse rejection, and matrix schema constraints.

No master config or perf-changelog changes were made, consistent with the PR's scope statement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant