Bug description
KafkaAgentClient.getBrokerState() uses java.net.http.HttpClient with no connect or request timeout configured. When called against a broker that is alive but stuck on IO (TCP connection succeeds, but the Kafka Agent handler never responds), the call blocks the KafkaRoller's single-threaded executor (Executors.newSingleThreadScheduledExecutor()) indefinitely.
Root cause in code
KafkaAgentClient.java:
- Lines 95-97: `HttpClient.newBuilder().sslContext(sslContext).build()` — no `.connectTimeout()`
- Lines 105-108: `HttpRequest.newBuilder().uri(uri).GET().build()` — no `.timeout()`
The getBrokerState() method is called inside the catch block of the readiness await timeout (KafkaRoller.java line 433). Every other blocking operation in KafkaRoller has a bounded timeout (readiness await, pod deletion, admin API retries), except this one.
Why it hangs
When a broker's underlying storage becomes unavailable (e.g., cloud storage outage causing zero IOPS), the broker process stays alive but is stuck on disk IO. The kernel's TCP stack accepts connections normally, but the Kafka Agent handler thread — which needs to query broker internal state — is blocked. The HTTP response never arrives.
The call eventually resolves when the storage IO recovers, not via TCP keepalive as one might expect.
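This failure mode is easy to reproduce locally. The sketch below (not Strimzi code; the `/v1/broker-state` path and the 2-second timeout are illustrative) starts a server that accepts TCP connections but never writes a response, mimicking an alive-but-stuck agent. Without `HttpRequest.timeout()` the `send()` call would block until the handler replies; with it, the call fails fast with `HttpTimeoutException`:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class StuckServerDemo {

    public static boolean requestTimesOut() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        // Daemon executor so the sleeping handler does not keep the JVM alive.
        server.setExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            t.start();
        });
        server.createContext("/v1/broker-state", exchange -> {
            try {
                Thread.sleep(60_000); // simulate a handler stuck on disk IO
            } catch (InterruptedException ignored) {
            }
        });
        server.start();
        try {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:"
                            + server.getAddress().getPort() + "/v1/broker-state"))
                    .timeout(Duration.ofSeconds(2)) // bounds the whole request lifecycle
                    .GET()
                    .build();
            HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
            return false; // a response arrived, which this scenario never produces
        } catch (HttpTimeoutException e) {
            return true; // TCP connected fine, but no response within 2s
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        if (!requestTimesOut()) throw new AssertionError("expected a timeout");
        System.out.println("request timed out as expected");
    }
}
```

Note that the TCP connect succeeds immediately here, which is exactly why a connect timeout alone would not catch this case.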
Impact
The KafkaRoller uses a single-threaded executor. While getBrokerState hangs on one broker:
- No other broker can be processed — the thread is blocked
- Brokers that have already recovered wait in the queue
- Total reconciliation time is significantly inflated
- Risk of `FatalProblem` — if the hung broker doesn't become ready within an additional `operationTimeoutMs` after `getBrokerState` returns (line 480), the entire rolling restart is aborted
Steps to reproduce
- Deploy a Kafka cluster with Strimzi 0.47.0
- Cause storage failure on one or more brokers (e.g., simulate zero IOPS on the data volume)
- Brokers must stay alive (TCP reachable on port 8443) but unresponsive to HTTP
- Trigger a Kafka reconciliation
- Observe that the KafkaRoller thread blocks on `getBrokerState()` for the duration of the storage failure
Identifying the hang in logs: the gap between `Exceeded timeout of <X>ms while waiting for` and the subsequent `Pod <name> is not ready. We will check if KafkaRoller can do anything about it` reveals the `getBrokerState` hang duration.
Expected behavior
getBrokerState() should have a bounded timeout (e.g., 30 seconds). On timeout, the existing exception handling should catch it as a RuntimeException and return BrokerState(-1, null), allowing the roller thread to move on to other brokers and retry later.
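A minimal sketch of that fallback path, assuming the control flow described above (the `BrokerState` record, method bodies, and error message here are stand-ins, not Strimzi's actual types):

```java
import java.io.IOException;
import java.net.http.HttpTimeoutException;

// Sketch: a timeout surfaces as an IOException subclass, is wrapped as a
// RuntimeException, and the caller maps it to an "unknown" broker state
// instead of blocking forever.
public class BrokerStateFallback {

    // Hypothetical stand-in for Strimzi's BrokerState(-1, null).
    public record BrokerState(int code, String reason) { }

    public static String doGet() {
        try {
            // Simulate the agent request timing out.
            throw new HttpTimeoutException("request timed out");
        } catch (IOException e) {
            // Existing-style handler: wrap and rethrow.
            throw new RuntimeException("Failed to get broker state", e);
        }
    }

    public static BrokerState getBrokerState() {
        try {
            return new BrokerState(0, doGet());
        } catch (RuntimeException e) {
            return new BrokerState(-1, null); // unknown state; roller retries later
        }
    }

    public static void main(String[] args) {
        System.out.println(getBrokerState().code()); // prints -1 in this sketch
    }
}
```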
Strimzi version
0.47.0 (also verified unfixed on main/pre-1.0.0 as of 2026-03-09)
Kubernetes version
Kubernetes 1.30
Installation method
Helm chart
Infrastructure
AWS (EKS), block storage volumes for Kafka data
Configuration files and logs
Two distinct failure modes observed
| Broker state | TCP | HTTP response | Duration |
|---|---|---|---|
| Pod evicted (DNS gone) | `ConnectException` / `UnresolvedAddressException` | — | ~2 min |
| Alive but stuck on IO | Connects OK | Never arrives | Unbounded |
Log pattern showing the hang
    HH:MM:SS Exceeded timeout of <operationTimeoutMs>ms while waiting for ... <broker-pod>
    <<< No KafkaRoller log entries — thread is blocked on getBrokerState() >>>
    HH:MM:SS Pod <broker-pod> is not ready. We will check if KafkaRoller can do anything about it.
The time gap between these two messages equals the getBrokerState() hang duration. Zero KafkaRoller log entries appear during the gap, confirming the single thread was blocked.
Timeout inconsistency
Every other blocking operation in KafkaRoller has bounded timeouts:
| Operation | Timeout | Source |
|---|---|---|
| Readiness await | `operationTimeoutMs` (default 300s) | Configurable |
| Pod deletion await | `operationTimeoutMs` | Configurable |
| Admin API retries | Exponential backoff 250ms → 64s, max 10 attempts | `BackOff(250, 2, 10)` |
| `getBrokerState` | None (infinite) | Missing |
Additional context
Proposed fix
The critical missing timeout is the request timeout (HttpRequest.timeout()), which bounds the entire HTTP request lifecycle including waiting for a response. This is what causes the unbounded hang: TCP connects successfully to the alive-but-stuck broker, but the response never arrives. A connectTimeout alone would not help — TCP connection establishment succeeds in these cases.
Adding a connectTimeout on the HttpClient is a secondary improvement for explicitness, but the pod-evicted/DNS-gone cases already fail within ~2 minutes via ConnectException/UnresolvedAddressException.
```java
// HttpClient builder (line 95) — secondary, for explicitness
return HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(30))
        .sslContext(sslContext)
        .build();

// HttpRequest builder (line 105) — critical fix: bounds the full request lifecycle
HttpRequest req = HttpRequest.newBuilder()
        .uri(uri)
        .timeout(Duration.ofSeconds(30))
        .GET()
        .build();
```
On timeout, HttpTimeoutException (subclass of IOException) is thrown, caught by the existing IOException handler in doGet(), and wrapped as RuntimeException — which getBrokerState() already handles gracefully by returning BrokerState(-1, null).
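That hierarchy claim is straightforward to verify with a one-line check (not Strimzi code):

```java
import java.io.IOException;
import java.net.http.HttpTimeoutException;

// HttpTimeoutException extends IOException, so an existing
// catch (IOException e) block also covers request timeouts.
public class TimeoutHierarchyCheck {

    public static boolean timeoutIsIOException() {
        return IOException.class.isAssignableFrom(HttpTimeoutException.class);
    }

    public static void main(String[] args) {
        System.out.println(timeoutIsIOException()); // prints true
    }
}
```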
Expected improvement
With a 30s timeout, a single stuck-broker encounter blocks the thread for at most operationTimeoutMs + 30s instead of operationTimeoutMs + unbounded.
Backport request
We are currently running Strimzi 0.47.0 with Kafka 3.9.1 in production. Would it be possible to include this fix in a patch release for the 0.47.x line (or whichever maintained branch is compatible with Kafka 3.x)? The fix is minimal (two lines adding timeouts to existing code, no behavioral change on healthy brokers) and would help clusters affected by storage failures avoid unnecessarily long reconciliation times.