Skip to content

Add a tool for ProfilerStep GPU timeline sanity checks#1156

Merged
helloyongyang merged 2 commits into
mainfrom
yr/gap_prof_tool
Jun 16, 2026
Merged

Add a tool for ProfilerStep GPU timeline sanity checks#1156
helloyongyang merged 2 commits into
mainfrom
yr/gap_prof_tool

Conversation

@STwangyingrui

Copy link
Copy Markdown
Contributor

Adds a standalone trace linter that flags suspicious ProfilerStep idle gaps (ratio + per-launch budget); ALERT is a TensorBoard review hint, not a hard failure.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new script, analyze_torch_trace_gap.py, designed to check ProfilerStep idle gaps in PyTorch Profiler TensorBoard traces by comparing step wall times against device-timeline GPU activity. The review feedback highlights two key areas for improvement: first, resolving a potential AttributeError when parsing trace files that are formatted as a JSON list rather than a dictionary; second, adding manual glob expansion using Python's glob module to ensure glob patterns work correctly across all platforms (such as Windows) or when arguments are quoted.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +177 to +179
events = payload.get("traceEvents", payload)
if not isinstance(events, list):
raise ValueError(f"Unexpected trace format in {path}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the JSON trace file is formatted as a direct list of events (which is a standard Chrome trace format variant), payload will be parsed as a list. Calling payload.get(...) on a list will raise an AttributeError: 'list' object has no attribute 'get'.\n\nWe should check the type of payload first to handle both dictionary and list formats safely.

    if isinstance(payload, dict):\n        events = payload.get(\"traceEvents\")\n        if not isinstance(events, list):\n            raise ValueError(f\"Unexpected trace format in {path}: 'traceEvents' is missing or not a list\")\n    elif isinstance(payload, list):\n        events = payload\n    else:\n        raise ValueError(f\"Unexpected trace format in {path}\")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: _load_trace_events() now accepts both { "traceEvents": [...] } (PyTorch TB export) and a top-level event list (Chrome trace variant).

Comment on lines +263 to +265
for path in args.traces:
stats = analyze_trace_file(path)
print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The usage instructions suggest using glob patterns like prof_results/**/*.pt.trace.json. However, on some platforms (like Windows) or when arguments are quoted, the shell does not automatically expand glob patterns. This will cause argparse to pass the literal glob pattern as a Path, leading to a FileNotFoundError.\n\nWe can resolve this by manually expanding glob patterns using Python's glob module.

    import glob\n\n    resolved_paths: list[Path] = []\n    for path in args.traces:\n        path_str = str(path)\n        if \"*\" in path_str or \"?\" in path_str:\n            matches = glob.glob(path_str, recursive=True)\n            resolved_paths.extend(Path(m) for m in matches)\n        else:\n            resolved_paths.append(path)\n\n    for path in resolved_paths:\n        stats = analyze_trace_file(path)\n        print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end=\"\")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: expand *, ?, [ patterns via glob.glob(..., recursive=True) so quoted globs and Windows work as documented.

@helloyongyang helloyongyang merged commit 989b84f into main Jun 16, 2026
2 checks passed
@helloyongyang helloyongyang deleted the yr/gap_prof_tool branch June 16, 2026 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants