fix(e2e): harden kube exec against apiserver SPDY hangs#8627

Merged: r2k1 merged 2 commits into main from fix/e2e-wireserver-per-attempt-timeout on Jun 3, 2026

Conversation

@r2k1 r2k1 commented Jun 3, 2026

Hardens e2e against kube-exec hangs (PR #8580 regression: 30s budget passed straight into a single SPDY exec call → no retries on apiserver streaming-proxy stalls; build 166502267).

Changes

e2e/validation.go (validateWireServerBlocked):

  • Poll budget 30s → 1m
  • Wrap each exec in 15s context.WithTimeout so retries actually happen
  • curl --max-time 8 so the in-pod curl bounds its own total runtime

e2e/kube.go — REST config:

  • net.Dialer{Timeout: 10s, KeepAlive: 30s} — bounds TCP-layer hangs
  • http2.ConfigureTransports with ReadIdleTimeout=30s, PingTimeout=15s — detects silent dead apiserver conns on regular REST + the initial /exec POST (does NOT cover post-upgrade SPDY/WS stream — per-attempt ctx covers that)

Root cause (build 166502267)

  • Debug pod ready 6.5s before exec call
  • Kubelet log shows zero exec activity in the 30s window on the affected node
  • 7 unrelated pods in the same run had >60s pod-ready latency → apiserver stress
  • Hang was at apiserver SPDY proxy, not kubelet

Which issue(s) this PR fixes:
Fixes #

r2k1 and others added 2 commits June 3, 2026 12:05
The strict validator from #8580 cut the overall poll budget to 30s but
left the SPDY exec call inheriting the poll's inner ctx. A single hung
exec consumed the entire budget, with zero retries. Build 166502267
hit exactly that: the kube-apiserver exec subresource hung 30s (kubelet
log shows no exec entry for the debug pod during the window), the
cluster control plane was under load (7 pods on the same build took
>60s to become ready), and the validator failed with
'context deadline exceeded'.

Fix:
- restore 1m overall poll budget
- wrap each exec call in a 15s per-attempt context so a stuck SPDY
  setup gets cancelled and retried instead of starving the budget
- bound curl total runtime with --max-time 8 (--connect-timeout only
  covers TCP connect)
- distinguish per-attempt deadlines from other exec errors in logs

Strict semantics from #8580 preserved: unexpected curl exits still
fail loudly; only transport/setup errors retry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds two defenses against silent connection wedges that have caused
flaky e2e failures (most recently the wireserver validator hang in
build 166502267, where a kube SPDY exec sat for 30s until the
caller's poll context deadline fired):

  * net.Dialer{Timeout: 10s, KeepAlive: 30s} on the REST config so
    TCP-layer hangs (LB/NAT drops, half-open connections) surface as
    dial errors instead of indefinite blocking.

  * HTTP/2 ReadIdleTimeout=30s + PingTimeout=15s so apiserver
    connections that go silent (control-plane stress, network blip)
    are actively probed and torn down, returning a connection error
    that the existing retry layer can act on.

These don't cover the active SPDY exec stream itself (SPDY hijacks
the connection after the HTTP upgrade), so they complement — not
replace — the per-attempt context timeout introduced for the
wireserver validator. They do cover every regular CoreV1 / AppsV1
call plus the initial /exec POST.

Promotes golang.org/x/net from indirect to direct in e2e/go.mod.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This PR hardens the e2e harness against intermittent Kubernetes exec hangs (notably SPDY proxy stalls at the apiserver) by bounding both the exec attempt duration and underlying client transport behavior so flakes fail fast and retry safely within the scenario budget.

Changes:

  • Add a per-attempt timeout (15s) inside validateWireServerBlocked while restoring a 1-minute overall poll budget, and cap in-pod curl runtime via --max-time.
  • Configure the Kubernetes REST client with a TCP dial timeout and HTTP/2 keep-alive ping settings to surface dead/idle connections as errors (enabling existing retry paths).
  • Promote golang.org/x/net to a direct dependency to support http2 transport configuration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| e2e/validation.go | Adds a per-exec attempt timeout and curl --max-time to prevent a single hung exec from consuming the entire wireserver validation budget. |
| e2e/kube.go | Adds REST transport hardening (dial timeout and HTTP/2 keepalive pings) to reduce silent connection wedges affecting exec and other API calls. |
| e2e/go.mod | Promotes golang.org/x/net (needed for its http2 package) from an indirect to a direct requirement. |

Comment thread e2e/validation.go
Comment on lines +322 to +326
    defer cancel()
    r, execErr := execOnUnprivilegedPod(attemptCtx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
    if execErr != nil {
        s.T.Logf("wireserver check %q: exec error (retrying): %v", check.desc, execErr)
        if errors.Is(execErr, context.DeadlineExceeded) {
            s.T.Logf("wireserver check %q: exec attempt timed out after 15s (retrying): %v", check.desc, execErr)
@r2k1 r2k1 enabled auto-merge (squash) June 3, 2026 01:02
@r2k1 r2k1 merged commit 3c6601e into main Jun 3, 2026
31 of 34 checks passed
@r2k1 r2k1 deleted the fix/e2e-wireserver-per-attempt-timeout branch June 3, 2026 01:58

3 participants