Skip to content

add a tool for ProfilerStep GPU timeline sanity checks#1155

Closed
STwangyingrui wants to merge 1 commit into
mainfrom
yr/gap_prof_tool
Closed

add a tool for ProfilerStep GPU timeline sanity checks#1155
STwangyingrui wants to merge 1 commit into
mainfrom
yr/gap_prof_tool

Conversation

@STwangyingrui

Copy link
Copy Markdown
Contributor

Adds a standalone trace linter that flags suspicious ProfilerStep idle gaps (ratio + per-launch budget); ALERT is a TensorBoard review hint, not a hard failure.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new utility script, analyze_torch_trace_gap.py, to check for idle gaps in PyTorch Profiler TensorBoard traces. The feedback focuses on improving the robustness and efficiency of the script. Key suggestions include handling cases where the trace JSON is parsed directly as a list to avoid an AttributeError, adding defensive checks for event types and required keys, simplifying the gpu_span calculation using the sorted order of merged intervals, and wrapping the file processing loop in a try-except block to gracefully handle errors when analyzing multiple traces.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +180 to +182
events = payload.get("traceEvents", payload)
if not isinstance(events, list):
raise ValueError(f"Unexpected trace format in {path}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the trace file is formatted as a JSON array directly (which is a common format for Chrome traces), payload will be parsed as a list. Calling payload.get(...) will then raise an AttributeError: 'list' object has no attribute 'get'. We should check the type of payload before attempting to access it as a dictionary.

    if isinstance(payload, dict):
        events = payload.get("traceEvents", [])
    elif isinstance(payload, list):
        events = payload
    else:
        events = []
    if not isinstance(events, list):
        raise ValueError(f"Unexpected trace format in {path}")

Comment on lines +114 to +125
for ev in events:
if ev.get("ph") != "X":
continue
if ev.get("cat") != "user_annotation":
continue
name = ev.get("name", "")
step_id = _parse_step_id(name)
if step_id is None:
continue
ts = float(ev["ts"])
dur = float(ev["dur"])
steps.append((step_id, Interval(ts, ts + dur), name))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential KeyError or AttributeError if the trace contains malformed events or non-dictionary elements, we should add defensive checks to ensure each event is a dictionary and contains the required keys (ts and dur) before attempting to access or convert them.

Suggested change
for ev in events:
if ev.get("ph") != "X":
continue
if ev.get("cat") != "user_annotation":
continue
name = ev.get("name", "")
step_id = _parse_step_id(name)
if step_id is None:
continue
ts = float(ev["ts"])
dur = float(ev["dur"])
steps.append((step_id, Interval(ts, ts + dur), name))
for ev in events:
if not isinstance(ev, dict):
continue
if ev.get("ph") != "X":
continue
if ev.get("cat") != "user_annotation":
continue
name = ev.get("name", "")
step_id = _parse_step_id(name)
if step_id is None:
continue
ts = ev.get("ts")
dur = ev.get("dur")
if ts is None or dur is None:
continue
steps.append((step_id, Interval(float(ts), float(ts) + float(dur)), name))

Comment on lines +131 to +140
for ev in events:
if ev.get("ph") != "X":
continue
if ev.get("cat") not in _GPU_ACTIVITY_CATS:
continue
ts = float(ev["ts"])
dur = float(ev["dur"])
if dur <= 0:
continue
intervals.append(Interval(ts, ts + dur))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similarly, we should add defensive checks here to ensure each event is a dictionary and contains the required keys (ts and dur) before attempting to access or convert them.

Suggested change
for ev in events:
if ev.get("ph") != "X":
continue
if ev.get("cat") not in _GPU_ACTIVITY_CATS:
continue
ts = float(ev["ts"])
dur = float(ev["dur"])
if dur <= 0:
continue
intervals.append(Interval(ts, ts + dur))
for ev in events:
if not isinstance(ev, dict):
continue
if ev.get("ph") != "X":
continue
if ev.get("cat") not in _GPU_ACTIVITY_CATS:
continue
ts = ev.get("ts")
dur = ev.get("dur")
if ts is None or dur is None:
continue
ts = float(ts)
dur = float(dur)
if dur <= 0:
continue
intervals.append(Interval(ts, ts + dur))

Comment on lines +153 to +156
if merged:
gpu_span = max(x.end for x in merged) - min(x.start for x in merged)
else:
gpu_span = 0.0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since merged is already sorted by start and contains mutually disjoint (non-overlapping) intervals, the minimum start time is guaranteed to be merged[0].start and the maximum end time is guaranteed to be merged[-1].end. We can simplify the calculation to avoid scanning the entire list with min and max.

Suggested change
if merged:
gpu_span = max(x.end for x in merged) - min(x.start for x in merged)
else:
gpu_span = 0.0
if merged:
gpu_span = merged[-1].end - merged[0].start
else:
gpu_span = 0.0

Comment on lines +275 to +279
for path in args.traces:
stats = analyze_trace_file(path)
print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="")

return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If one of the trace files in a batch is corrupt or missing, the script will crash immediately and skip processing the remaining files. Wrapping the analysis loop in a try-except block allows the script to gracefully report errors for individual files and continue processing the rest, returning a non-zero exit code at the end if any errors occurred.

Suggested change
for path in args.traces:
stats = analyze_trace_file(path)
print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="")
return 0
exit_code = 0
for path in args.traces:
try:
stats = analyze_trace_file(path)
print(render_report(path, stats, gap_threshold=args.gap_threshold, gap_per_event_us=args.gap_per_event_us), end="")
except Exception as e:
print(f"Error analyzing {path}: {e}", file=sys.stderr)
exit_code = 1
return exit_code

@STwangyingrui STwangyingrui deleted the yr/gap_prof_tool branch June 16, 2026 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant