fix(e2e): harden kube exec against apiserver SPDY hangs by r2k1 · Pull Request #8627 · Azure/AgentBaker

r2k1 · 2026-06-03T00:29:48Z

Hardens e2e against kube-exec hangs. PR #8580 introduced a regression: the 30s poll budget was passed straight into a single SPDY exec call, so an apiserver streaming-proxy stall consumed the whole budget with no retries (build 166502267).

Changes

e2e/validation.go — validateWireServerBlocked:

Poll budget 30s → 1m
Wrap each exec in 15s context.WithTimeout so retries actually happen
curl --max-time 8 so the in-pod command bounds its own total runtime

e2e/kube.go — REST config:

net.Dialer{Timeout: 10s, KeepAlive: 30s} — bounds TCP-layer hangs
http2.ConfigureTransports with ReadIdleTimeout=30s, PingTimeout=15s — detects silent dead apiserver conns on regular REST + the initial /exec POST (does NOT cover post-upgrade SPDY/WS stream — per-attempt ctx covers that)

Root cause (build 166502267)

Debug pod ready 6.5s before exec call
Kubelet log shows zero exec activity in the 30s window on the affected node
7 unrelated pods in the same run had >60s pod-ready latency → apiserver stress
Hang was at apiserver SPDY proxy, not kubelet

Which issue(s) this PR fixes:
Fixes #

The strict validator from #8580 cut the overall poll budget to 30s but left the SPDY exec call inheriting the poll's inner ctx. A single hung exec consumed the entire budget, with zero retries.

Build 166502267 hit exactly that: the kube-apiserver exec subresource hung 30s (kubelet log shows no exec entry for the debug pod during the window), the cluster control plane was under load (7 pods on the same build took >60s to become ready), and the validator failed with 'context deadline exceeded'.

Fix:
- restore 1m overall poll budget
- wrap each exec call in a 15s per-attempt context so a stuck SPDY setup gets cancelled and retried instead of starving the budget
- bound curl total runtime with --max-time 8 (--connect-timeout only covers TCP connect)
- distinguish per-attempt deadlines from other exec errors in logs

Strict semantics from #8580 preserved: unexpected curl exits still fail loudly; only transport/setup errors retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds two defenses against silent connection wedges that have caused flaky e2e failures (most recently the wireserver validator hang in build 166502267 where a kube SPDY exec sat for 30s before the caller's per-attempt context fired):

* net.Dialer{Timeout: 10s, KeepAlive: 30s} on the REST config so TCP-layer hangs (LB/NAT drops, half-open connections) surface as dial errors instead of indefinite blocking.
* HTTP/2 ReadIdleTimeout=30s + PingTimeout=15s so apiserver connections that go silent (control-plane stress, network blip) are actively probed and torn down, returning a connection error that the existing retry layer can act on.

These don't cover the active SPDY exec stream itself (SPDY hijacks the connection after the HTTP upgrade), so they complement, rather than replace, the per-attempt context timeout introduced for the wireserver validator. They do cover every regular CoreV1 / AppsV1 call plus the initial /exec POST.

Promotes golang.org/x/net from indirect to direct in e2e/go.mod.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR hardens the e2e harness against intermittent Kubernetes exec hangs (notably SPDY proxy stalls at the apiserver) by bounding both the exec attempt duration and underlying client transport behavior so flakes fail fast and retry safely within the scenario budget.

Changes:

Add a per-attempt timeout (15s) inside validateWireServerBlocked while restoring a 1-minute overall poll budget, and cap in-pod curl runtime via --max-time.
Configure the Kubernetes REST client with a TCP dial timeout and HTTP/2 keep-alive ping settings to surface dead/idle connections as errors (enabling existing retry paths).
Promote golang.org/x/net to a direct dependency to support http2 transport configuration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| e2e/validation.go | Adds per-exec attempt timeout + `curl --max-time` to prevent a single hung exec from consuming the entire wireserver validation budget. |
| e2e/kube.go | Adds REST transport hardening (dial timeout + HTTP/2 keepalive pings) to reduce silent connection wedges impacting exec and other API calls. |
| e2e/go.mod | Updates dependency metadata to include `golang.org/x/net/http2` as a direct requirement. |

+			defer cancel()
+			r, execErr := execOnUnprivilegedPod(attemptCtx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
 			if execErr != nil {
-				s.T.Logf("wireserver check %q: exec error (retrying): %v", check.desc, execErr)
+				if errors.Is(execErr, context.DeadlineExceeded) {
+					s.T.Logf("wireserver check %q: exec attempt timed out after 15s (retrying): %v", check.desc, execErr)


r2k1 and others added 2 commits June 3, 2026 12:05

Copilot AI review requested due to automatic review settings June 3, 2026 00:29

r2k1 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 3, 2026 00:29

r2k1 temporarily deployed to test June 3, 2026 00:29 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 June 3, 2026 00:29 View session

Copilot AI reviewed Jun 3, 2026


r2k1 mentioned this pull request Jun 3, 2026

chore(e2e): bump client-go to v0.36.1, Go to 1.26.4, switch pod exec to WebSocket #8628

Merged

awesomenix approved these changes Jun 3, 2026


r2k1 enabled auto-merge (squash) June 3, 2026 01:02

r2k1 merged commit 3c6601e into main Jun 3, 2026
31 of 34 checks passed

r2k1 deleted the fix/e2e-wireserver-per-attempt-timeout branch June 3, 2026 01:58
