
KafkaAgentClient.getBrokerState() has no HTTP timeout, blocks KafkaRoller single-threaded executor indefinitely #12513


Bug description

KafkaAgentClient.getBrokerState() uses java.net.http.HttpClient with no connect or request timeout configured. When called against a broker that is alive but stuck on IO (TCP connection succeeds, but the Kafka Agent handler never responds), the call blocks the KafkaRoller's single-threaded executor (Executors.newSingleThreadScheduledExecutor()) indefinitely.

Root cause in code

KafkaAgentClient.java:

  • Lines 95-97: HttpClient.newBuilder().sslContext(sslContext).build() — no .connectTimeout()
  • Lines 105-108: HttpRequest.newBuilder().uri(uri).GET().build() — no .timeout()

The getBrokerState() method is called inside the catch block of the readiness await timeout (KafkaRoller.java line 433). Every other blocking operation in KafkaRoller has a bounded timeout (readiness await, pod deletion, admin API retries), except this one.
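For reference, the pattern at those lines is roughly the following (a sketch reconstructed from the description above, not verbatim KafkaAgentClient source):

```java
// Sketch of the current pattern: neither builder bounds how long the call may block.
HttpClient client = HttpClient.newBuilder()
        .sslContext(sslContext)   // no .connectTimeout(...) -> connect phase unbounded by the client
        .build();

HttpRequest request = HttpRequest.newBuilder()
        .uri(uri)                 // no .timeout(...) -> response wait unbounded
        .GET()
        .build();

// client.send(...) can therefore block the calling thread indefinitely.
```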

Why it hangs

When a broker's underlying storage becomes unavailable (e.g., cloud storage outage causing zero IOPS), the broker process stays alive but is stuck on disk IO. The kernel's TCP stack accepts connections normally, but the Kafka Agent handler thread — which needs to query broker internal state — is blocked. The HTTP response never arrives.

The call eventually resolves when the storage IO recovers, not via TCP keepalive as one might expect.
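This failure mode is easy to reproduce standalone (a hypothetical sketch: plain HTTP on localhost for simplicity, and the request path is illustrative). The server accepts TCP connections but never writes a response, mimicking an alive-but-IO-stuck Kafka Agent; an untimed java.net.http.HttpClient call then blocks forever:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class StuckAgentDemo {
    public static void main(String[] args) throws Exception {
        List<Socket> held = new ArrayList<>();          // keep accepted sockets open, never respond
        ServerSocket server = new ServerSocket(8443);
        new Thread(() -> {
            try {
                while (true) {
                    held.add(server.accept());          // TCP handshake succeeds...
                }                                        // ...but no bytes are ever written back
            } catch (Exception ignored) { }
        }).start();

        HttpClient client = HttpClient.newHttpClient(); // no timeouts, as in KafkaAgentClient
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8443/v1/broker-state"))
                .GET()
                .build();

        System.out.println("sending...");
        // Blocks indefinitely: the connect succeeds, the response never arrives.
        client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("never reached");
    }
}
```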

Impact

The KafkaRoller uses a single-threaded executor. While getBrokerState hangs on one broker:

  1. No other broker can be processed — the thread is blocked
  2. Brokers that have already recovered wait in the queue
  3. Total reconciliation time is significantly inflated
  4. Risk of FatalProblem — if the hung broker doesn't become ready within an additional operationTimeoutMs after getBrokerState returns (line 480), the entire rolling restart is aborted
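This single-thread starvation is easy to demonstrate in isolation (a hypothetical toy, not Strimzi code): on a single-threaded executor, one blocked task prevents every queued task from running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class StarvationDemo {
    public static void main(String[] args) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.submit(() -> {
            // Simulates the hung getBrokerState(): the only worker thread is blocked.
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
        });
        exec.submit(() -> System.out.println("never runs while the first task is blocked"));
    }
}
```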

Steps to reproduce

  1. Deploy a Kafka cluster with Strimzi 0.47.0
  2. Cause storage failure on one or more brokers (e.g., simulate zero IOPS on the data volume)
    • Brokers must stay alive (TCP reachable on port 8443) but unresponsive to HTTP
  3. Trigger a Kafka reconciliation
  4. Observe the KafkaRoller thread blocks on getBrokerState() for the duration of the storage failure

Identifying the hang in logs: the gap between `Exceeded timeout of <X>ms while waiting for ...` and the subsequent `Pod <name> is not ready. We will check if KafkaRoller can do anything about it.` reveals the getBrokerState() hang duration.

Expected behavior

getBrokerState() should have a bounded timeout (e.g., 30 seconds). On timeout, the existing exception handling should catch it as a RuntimeException and return BrokerState(-1, null), allowing the roller thread to move on to other brokers and retry later.

Strimzi version

0.47.0 (also verified unfixed on main/pre-1.0.0 as of 2026-03-09)

Kubernetes version

Kubernetes 1.30

Installation method

Helm chart

Infrastructure

AWS (EKS), block storage volumes for Kafka data

Configuration files and logs

Two distinct failure modes observed

| Broker state | TCP | HTTP response | Duration |
|---|---|---|---|
| Pod evicted (DNS gone) | Fails (ConnectException / UnresolvedAddressException) | n/a | ~2 min |
| Alive but stuck on IO | Connects OK | Never arrives | Unbounded |

Log pattern showing the hang

```
HH:MM:SS  Exceeded timeout of <operationTimeoutMs>ms while waiting for ... <broker-pod>
          <<< No KafkaRoller log entries — thread is blocked on getBrokerState() >>>
HH:MM:SS  Pod <broker-pod> is not ready. We will check if KafkaRoller can do anything about it.
```

The time gap between these two messages equals the getBrokerState() hang duration. Zero KafkaRoller log entries appear during the gap, confirming the single thread was blocked.

Timeout inconsistency

Every other blocking operation in KafkaRoller has bounded timeouts:

| Operation | Timeout | Source |
|---|---|---|
| Readiness await | `operationTimeoutMs` (default 300s) | Configurable |
| Pod deletion await | `operationTimeoutMs` | Configurable |
| Admin API retries | Exponential backoff, 250ms → 64s, max 10 attempts | `BackOff(250, 2, 10)` |
| `getBrokerState` | None (infinite) | Missing |

Additional context

Proposed fix

The critical missing timeout is the request timeout (HttpRequest.timeout()), which bounds the entire HTTP request lifecycle including waiting for a response. This is what causes the unbounded hang: TCP connects successfully to the alive-but-stuck broker, but the response never arrives. A connectTimeout alone would not help — TCP connection establishment succeeds in these cases.

Adding a connectTimeout on the HttpClient is a secondary improvement for explicitness, but the pod-evicted/DNS-gone cases already fail within ~2 minutes via ConnectException/UnresolvedAddressException.

```java
// HttpClient builder (line 95) — secondary, for explicitness
return HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(30))
        .sslContext(sslContext)
        .build();

// HttpRequest builder (line 105) — critical fix: bounds the full request lifecycle
HttpRequest req = HttpRequest.newBuilder()
        .uri(uri)
        .timeout(Duration.ofSeconds(30))
        .GET()
        .build();
```

On timeout, HttpTimeoutException (subclass of IOException) is thrown, caught by the existing IOException handler in doGet(), and wrapped as RuntimeException — which getBrokerState() already handles gracefully by returning BrokerState(-1, null).
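A sketch of that flow under those assumptions (helper names like getBrokerStateUri and parseBrokerStateJson are illustrative, not the actual method names):

```java
// Sketch of the described exception flow (simplified; names approximate the real code).
BrokerState getBrokerState(String host) {
    try {
        String json = doGet(getBrokerStateUri(host)); // doGet wraps IOException as RuntimeException
        return parseBrokerStateJson(json);
    } catch (RuntimeException e) {
        // With the fix, HttpTimeoutException (an IOException) lands here after ~30s
        // instead of never, and the roller thread moves on to the next broker.
        return new BrokerState(-1, null);
    }
}
```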

Expected improvement

With a 30s request timeout, a single stuck-broker encounter blocks the thread for at most operationTimeoutMs + 30s (330s with the default 300s operationTimeoutMs) instead of operationTimeoutMs + unbounded.

Backport request

We are currently running Strimzi 0.47.0 with Kafka 3.9.1 in production. Would it be possible to include this fix in a patch release for the 0.47.x line (or whichever maintained branch is compatible with Kafka 3.x)? The fix is minimal (two lines adding timeouts to existing code, no behavioral change on healthy brokers) and would help clusters affected by storage failures avoid unnecessarily long reconciliation times.
