Add a tool for ProfilerStep GPU timeline sanity checks#1156
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new script, analyze_torch_trace_gap.py, designed to check ProfilerStep idle gaps in PyTorch Profiler TensorBoard traces by comparing step wall times against device-timeline GPU activity. The review feedback highlights two key areas for improvement: first, resolving a potential AttributeError when parsing trace files that are formatted as a JSON list rather than a dictionary; second, adding manual glob expansion using Python's glob module to ensure glob patterns work correctly across all platforms (such as Windows) or when arguments are quoted.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| events = payload.get("traceEvents", payload) | ||
| if not isinstance(events, list): | ||
| raise ValueError(f"Unexpected trace format in {path}") |
There was a problem hiding this comment.
If the JSON trace file is formatted as a direct list of events (which is a standard Chrome trace format variant), payload will be parsed as a list. Calling payload.get(...) on a list will raise an AttributeError: 'list' object has no attribute 'get'.\n\nWe should check the type of payload first to handle both dictionary and list formats safely.
if isinstance(payload, dict):\n events = payload.get(\"traceEvents\")\n if not isinstance(events, list):\n raise ValueError(f\"Unexpected trace format in {path}: 'traceEvents' is missing or not a list\")\n elif isinstance(payload, list):\n events = payload\n else:\n raise ValueError(f\"Unexpected trace format in {path}\")There was a problem hiding this comment.
Fixed: _load_trace_events() now accepts both { "traceEvents": [...] } (PyTorch TB export) and a top-level event list (Chrome trace variant).
| for path in args.traces: | ||
| stats = analyze_trace_file(path) | ||
| print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="") |
There was a problem hiding this comment.
The usage instructions suggest using glob patterns like prof_results/**/*.pt.trace.json. However, on some platforms (like Windows) or when arguments are quoted, the shell does not automatically expand glob patterns. This will cause argparse to pass the literal glob pattern as a Path, leading to a FileNotFoundError.\n\nWe can resolve this by manually expanding glob patterns using Python's glob module.
import glob\n\n resolved_paths: list[Path] = []\n for path in args.traces:\n path_str = str(path)\n if \"*\" in path_str or \"?\" in path_str:\n matches = glob.glob(path_str, recursive=True)\n resolved_paths.extend(Path(m) for m in matches)\n else:\n resolved_paths.append(path)\n\n for path in resolved_paths:\n stats = analyze_trace_file(path)\n print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end=\"\")There was a problem hiding this comment.
Fixed: expand *, ?, [ patterns via glob.glob(..., recursive=True) so quoted globs and Windows work as documented.
Adds a standalone trace linter that flags suspicious ProfilerStep idle gaps (ratio + per-launch budget); ALERT is a TensorBoard review hint, not a hard failure.