add a tool for ProfilerStep GPU timeline sanity checks#1155
add a tool for ProfilerStep GPU timeline sanity checks#1155STwangyingrui wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new utility script, analyze_torch_trace_gap.py, to check for idle gaps in PyTorch Profiler TensorBoard traces. The feedback focuses on improving the robustness and efficiency of the script. Key suggestions include handling cases where the trace JSON is parsed directly as a list to avoid an AttributeError, adding defensive checks for event types and required keys, simplifying the gpu_span calculation using the sorted order of merged intervals, and wrapping the file processing loop in a try-except block to gracefully handle errors when analyzing multiple traces.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| events = payload.get("traceEvents", payload) | ||
| if not isinstance(events, list): | ||
| raise ValueError(f"Unexpected trace format in {path}") |
There was a problem hiding this comment.
If the trace file is formatted as a JSON array directly (which is a common format for Chrome traces), payload will be parsed as a list. Calling payload.get(...) will then raise an AttributeError: 'list' object has no attribute 'get'. We should check the type of payload before attempting to access it as a dictionary.
if isinstance(payload, dict):
events = payload.get("traceEvents", [])
elif isinstance(payload, list):
events = payload
else:
events = []
if not isinstance(events, list):
raise ValueError(f"Unexpected trace format in {path}")| for ev in events: | ||
| if ev.get("ph") != "X": | ||
| continue | ||
| if ev.get("cat") != "user_annotation": | ||
| continue | ||
| name = ev.get("name", "") | ||
| step_id = _parse_step_id(name) | ||
| if step_id is None: | ||
| continue | ||
| ts = float(ev["ts"]) | ||
| dur = float(ev["dur"]) | ||
| steps.append((step_id, Interval(ts, ts + dur), name)) |
There was a problem hiding this comment.
To prevent potential KeyError or AttributeError if the trace contains malformed events or non-dictionary elements, we should add defensive checks to ensure each event is a dictionary and contains the required keys (ts and dur) before attempting to access or convert them.
| for ev in events: | |
| if ev.get("ph") != "X": | |
| continue | |
| if ev.get("cat") != "user_annotation": | |
| continue | |
| name = ev.get("name", "") | |
| step_id = _parse_step_id(name) | |
| if step_id is None: | |
| continue | |
| ts = float(ev["ts"]) | |
| dur = float(ev["dur"]) | |
| steps.append((step_id, Interval(ts, ts + dur), name)) | |
| for ev in events: | |
| if not isinstance(ev, dict): | |
| continue | |
| if ev.get("ph") != "X": | |
| continue | |
| if ev.get("cat") != "user_annotation": | |
| continue | |
| name = ev.get("name", "") | |
| step_id = _parse_step_id(name) | |
| if step_id is None: | |
| continue | |
| ts = ev.get("ts") | |
| dur = ev.get("dur") | |
| if ts is None or dur is None: | |
| continue | |
| steps.append((step_id, Interval(float(ts), float(ts) + float(dur)), name)) |
| for ev in events: | ||
| if ev.get("ph") != "X": | ||
| continue | ||
| if ev.get("cat") not in _GPU_ACTIVITY_CATS: | ||
| continue | ||
| ts = float(ev["ts"]) | ||
| dur = float(ev["dur"]) | ||
| if dur <= 0: | ||
| continue | ||
| intervals.append(Interval(ts, ts + dur)) |
There was a problem hiding this comment.
Similarly, we should add defensive checks here to ensure each event is a dictionary and contains the required keys (ts and dur) before attempting to access or convert them.
| for ev in events: | |
| if ev.get("ph") != "X": | |
| continue | |
| if ev.get("cat") not in _GPU_ACTIVITY_CATS: | |
| continue | |
| ts = float(ev["ts"]) | |
| dur = float(ev["dur"]) | |
| if dur <= 0: | |
| continue | |
| intervals.append(Interval(ts, ts + dur)) | |
| for ev in events: | |
| if not isinstance(ev, dict): | |
| continue | |
| if ev.get("ph") != "X": | |
| continue | |
| if ev.get("cat") not in _GPU_ACTIVITY_CATS: | |
| continue | |
| ts = ev.get("ts") | |
| dur = ev.get("dur") | |
| if ts is None or dur is None: | |
| continue | |
| ts = float(ts) | |
| dur = float(dur) | |
| if dur <= 0: | |
| continue | |
| intervals.append(Interval(ts, ts + dur)) |
| if merged: | ||
| gpu_span = max(x.end for x in merged) - min(x.start for x in merged) | ||
| else: | ||
| gpu_span = 0.0 |
There was a problem hiding this comment.
Since merged is already sorted by start and contains mutually disjoint (non-overlapping) intervals, the minimum start time is guaranteed to be merged[0].start and the maximum end time is guaranteed to be merged[-1].end. We can simplify the calculation to avoid scanning the entire list with min and max.
| if merged: | |
| gpu_span = max(x.end for x in merged) - min(x.start for x in merged) | |
| else: | |
| gpu_span = 0.0 | |
| if merged: | |
| gpu_span = merged[-1].end - merged[0].start | |
| else: | |
| gpu_span = 0.0 |
| for path in args.traces: | ||
| stats = analyze_trace_file(path) | ||
| print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="") | ||
|
|
||
| return 0 |
There was a problem hiding this comment.
If one of the trace files in a batch is corrupt or missing, the script will crash immediately and skip processing the remaining files. Wrapping the analysis loop in a try-except block allows the script to gracefully report errors for individual files and continue processing the rest, returning a non-zero exit code at the end if any errors occurred.
| for path in args.traces: | |
| stats = analyze_trace_file(path) | |
| print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="") | |
| return 0 | |
| exit_code = 0 | |
| for path in args.traces: | |
| try: | |
| stats = analyze_trace_file(path) | |
| print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="") | |
| except Exception as e: | |
| print(f"Error analyzing {path}: {e}", file=sys.stderr) | |
| exit_code = 1 | |
| return exit_code |
Adds a standalone trace linter that flags suspicious ProfilerStep idle gaps (ratio + per-launch budget); ALERT is a TensorBoard review hint, not a hard failure.