fix(e2e): harden kube exec against apiserver SPDY hangs #8627
Merged
The strict validator from #8580 cut the overall poll budget to 30s but left the SPDY exec call inheriting the poll's inner ctx. A single hung exec consumed the entire budget, with zero retries.

Build 166502267 hit exactly that: the kube-apiserver exec subresource hung 30s (kubelet log shows no exec entry for the debug pod during the window), the cluster control plane was under load (7 pods on the same build took >60s to become ready), and the validator failed with 'context deadline exceeded'.

Fix:
- restore 1m overall poll budget
- wrap each exec call in a 15s per-attempt context so a stuck SPDY setup gets cancelled and retried instead of starving the budget
- bound curl total runtime with --max-time 8 (--connect-timeout only covers TCP connect)
- distinguish per-attempt deadlines from other exec errors in logs

Strict semantics from #8580 preserved: unexpected curl exits still fail loudly; only transport/setup errors retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds two defenses against silent connection wedges that have caused
flaky e2e failures (most recently the wireserver validator hang in
build 166502267 where a kube SPDY exec sat for 30s before the
caller's per-attempt context fired):
* net.Dialer{Timeout: 10s, KeepAlive: 30s} on the REST config so
  TCP-layer hangs (LB/NAT drops, half-open connections) surface as
  dial errors instead of indefinite blocking.
* HTTP/2 ReadIdleTimeout=30s + PingTimeout=15s so apiserver
  connections that go silent (control-plane stress, network blip)
  are actively probed and torn down, returning a connection error
  that the existing retry layer can act on.
These don't cover the active SPDY exec stream itself (SPDY hijacks
the connection after the HTTP upgrade), so they complement — not
replace — the per-attempt context timeout introduced for the
wireserver validator. They do cover every regular CoreV1 / AppsV1
call plus the initial /exec POST.
Promotes golang.org/x/net from indirect to direct in e2e/go.mod.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR hardens the e2e harness against intermittent Kubernetes exec hangs (notably SPDY proxy stalls at the apiserver) by bounding both the exec attempt duration and underlying client transport behavior so flakes fail fast and retry safely within the scenario budget.
Changes:
- Add a per-attempt timeout (15s) inside `validateWireServerBlocked` while restoring a 1-minute overall poll budget, and cap in-pod `curl` runtime via `--max-time`.
- Configure the Kubernetes REST client with a TCP dial timeout and HTTP/2 keep-alive ping settings to surface dead/idle connections as errors (enabling existing retry paths).
- Promote `golang.org/x/net` to a direct dependency to support `http2` transport configuration.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| e2e/validation.go | Adds per-exec attempt timeout + curl --max-time to prevent a single hung exec from consuming the entire wireserver validation budget. |
| e2e/kube.go | Adds REST transport hardening (dial timeout + HTTP/2 keepalive pings) to reduce silent connection wedges impacting exec and other API calls. |
| e2e/go.mod | Updates dependency metadata to list golang.org/x/net (which provides the http2 package) as a direct requirement. |
Comment on lines +322 to +326

    defer cancel()
    r, execErr := execOnUnprivilegedPod(attemptCtx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
    if execErr != nil {
        s.T.Logf("wireserver check %q: exec error (retrying): %v", check.desc, execErr)
        if errors.Is(execErr, context.DeadlineExceeded) {
            s.T.Logf("wireserver check %q: exec attempt timed out after 15s (retrying): %v", check.desc, execErr)
awesomenix approved these changes on Jun 3, 2026
Hardens e2e against kube-exec hangs (PR #8580 regression: 30s budget passed straight into a single SPDY exec call → no retries on apiserver streaming-proxy stalls; build 166502267).
Changes
- `e2e/validation.go`: `validateWireServerBlocked`:
  - `context.WithTimeout` so retries actually happen
  - `curl --max-time 8` for in-pod self-bound
- `e2e/kube.go`: REST config:
  - `net.Dialer{Timeout: 10s, KeepAlive: 30s}`: bounds TCP-layer hangs
  - `http2.ConfigureTransports` with `ReadIdleTimeout=30s`, `PingTimeout=15s`: detects silent dead apiserver conns on regular REST + the initial `/exec` POST (does NOT cover post-upgrade SPDY/WS stream; per-attempt ctx covers that)

Root cause (build 166502267)
Which issue(s) this PR fixes:
Fixes #