
KafkaAgentClient.getBrokerState() has no HTTP timeout, blocks KafkaRoller single-threaded executor indefinitely #12513


Bug description

KafkaAgentClient.getBrokerState() uses java.net.http.HttpClient with no connect or request timeout configured. When called against a broker that is alive but stuck on IO (TCP connection succeeds, but the Kafka Agent handler never responds), the call blocks the KafkaRoller's single-threaded executor (Executors.newSingleThreadScheduledExecutor()) indefinitely.

Root cause in code

KafkaAgentClient.java:

  • Lines 95-97: HttpClient.newBuilder().sslContext(sslContext).build() — no .connectTimeout()
  • Lines 105-108: HttpRequest.newBuilder().uri(uri).GET().build() — no .timeout()

The getBrokerState() method is called inside the catch block of the readiness await timeout (KafkaRoller.java line 433). Every other blocking operation in KafkaRoller has a bounded timeout (readiness await, pod deletion, admin API retries), except this one.
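For reference, the pattern at those lines is roughly the following (a sketch reconstructed from the description above, not verbatim KafkaAgentClient source):

```java
// Sketch of the current pattern: neither builder bounds how long the call may block.
HttpClient client = HttpClient.newBuilder()
        .sslContext(sslContext)   // no .connectTimeout(...) -> connect phase unbounded by the client
        .build();

HttpRequest request = HttpRequest.newBuilder()
        .uri(uri)                 // no .timeout(...) -> response wait unbounded
        .GET()
        .build();

// client.send(...) can therefore block the calling thread indefinitely.
```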

Why it hangs

When a broker's underlying storage becomes unavailable (e.g., cloud storage outage causing zero IOPS), the broker process stays alive but is stuck on disk IO. The kernel's TCP stack accepts connections normally, but the Kafka Agent handler thread — which needs to query broker internal state — is blocked. The HTTP response never arrives.

The call eventually resolves when the storage IO recovers, not via TCP keepalive as one might expect.
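This failure mode is easy to reproduce standalone (a hypothetical sketch: plain HTTP on localhost for simplicity, and the request path is illustrative). The server accepts TCP connections but never writes a response, mimicking an alive-but-IO-stuck Kafka Agent; an untimed java.net.http.HttpClient call then blocks forever:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class StuckAgentDemo {
    public static void main(String[] args) throws Exception {
        List<Socket> held = new ArrayList<>();          // keep accepted sockets open, never respond
        ServerSocket server = new ServerSocket(8443);
        new Thread(() -> {
            try {
                while (true) {
                    held.add(server.accept());          // TCP handshake succeeds...
                }                                        // ...but no bytes are ever written back
            } catch (Exception ignored) { }
        }).start();

        HttpClient client = HttpClient.newHttpClient(); // no timeouts, as in KafkaAgentClient
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8443/v1/broker-state"))
                .GET()
                .build();

        System.out.println("sending...");
        // Blocks indefinitely: the connect succeeds, the response never arrives.
        client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("never reached");
    }
}
```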

Impact

The KafkaRoller uses a single-threaded executor. While getBrokerState hangs on one broker:

  1. No other broker can be processed — the thread is blocked
  2. Brokers that have already recovered wait in the queue
  3. Total reconciliation time is significantly inflated
  4. Risk of FatalProblem — if the hung broker doesn't become ready within an additional operationTimeoutMs after getBrokerState returns (line 480), the entire rolling restart is aborted
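This single-thread starvation is easy to demonstrate in isolation (a hypothetical toy, not Strimzi code): on a single-threaded executor, one blocked task prevents every queued task from running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class StarvationDemo {
    public static void main(String[] args) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.submit(() -> {
            // Simulates the hung getBrokerState(): the only worker thread is blocked.
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
        });
        exec.submit(() -> System.out.println("never runs while the first task is blocked"));
    }
}
```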

Steps to reproduce

  1. Deploy a Kafka cluster with Strimzi 0.47.0
  2. Cause storage failure on one or more brokers (e.g., simulate zero IOPS on the data volume)
    • Brokers must stay alive (TCP reachable on port 8443) but unresponsive to HTTP
  3. Trigger a Kafka reconciliation
  4. Observe the KafkaRoller thread blocks on getBrokerState() for the duration of the storage failure

Identifying the hang in logs: the gap between `Exceeded timeout of <X>ms while waiting for ...` and the subsequent `Pod <name> is not ready. We will check if KafkaRoller can do anything about it.` reveals the getBrokerState() hang duration.

Expected behavior

getBrokerState() should have a bounded timeout (e.g., 30 seconds). On timeout, the existing exception handling should catch it as a RuntimeException and return BrokerState(-1, null), allowing the roller thread to move on to other brokers and retry later.

Strimzi version

0.47.0 (also verified unfixed on main/pre-1.0.0 as of 2026-03-09)

Kubernetes version

Kubernetes 1.30

Installation method

Helm chart

Infrastructure

AWS (EKS), block storage volumes for Kafka data

Configuration files and logs

Two distinct failure modes observed

| Broker state | TCP | HTTP response | Duration |
|---|---|---|---|
| Pod evicted (DNS gone) | Fails (ConnectException / UnresolvedAddressException) | n/a | ~2 min |
| Alive but stuck on IO | Connects OK | Never arrives | Unbounded |

Log pattern showing the hang

```
HH:MM:SS  Exceeded timeout of <operationTimeoutMs>ms while waiting for ... <broker-pod>
          <<< No KafkaRoller log entries — thread is blocked on getBrokerState() >>>
HH:MM:SS  Pod <broker-pod> is not ready. We will check if KafkaRoller can do anything about it.
```

The time gap between these two messages equals the getBrokerState() hang duration. Zero KafkaRoller log entries appear during the gap, confirming the single thread was blocked.

Timeout inconsistency

Every other blocking operation in KafkaRoller has bounded timeouts:

| Operation | Timeout | Source |
|---|---|---|
| Readiness await | `operationTimeoutMs` (default 300s) | Configurable |
| Pod deletion await | `operationTimeoutMs` | Configurable |
| Admin API retries | Exponential backoff, 250ms → 64s, max 10 attempts | `BackOff(250, 2, 10)` |
| `getBrokerState` | None (infinite) | Missing |

Additional context

Proposed fix

The critical missing timeout is the request timeout (HttpRequest.timeout()), which bounds the entire HTTP request lifecycle including waiting for a response. This is what causes the unbounded hang: TCP connects successfully to the alive-but-stuck broker, but the response never arrives. A connectTimeout alone would not help — TCP connection establishment succeeds in these cases.

Adding a connectTimeout on the HttpClient is a secondary improvement for explicitness, but the pod-evicted/DNS-gone cases already fail within ~2 minutes via ConnectException/UnresolvedAddressException.

```java
// HttpClient builder (line 95) — secondary, for explicitness
return HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(30))
        .sslContext(sslContext)
        .build();

// HttpRequest builder (line 105) — critical fix: bounds the full request lifecycle
HttpRequest req = HttpRequest.newBuilder()
        .uri(uri)
        .timeout(Duration.ofSeconds(30))
        .GET()
        .build();
```

On timeout, HttpTimeoutException (subclass of IOException) is thrown, caught by the existing IOException handler in doGet(), and wrapped as RuntimeException — which getBrokerState() already handles gracefully by returning BrokerState(-1, null).
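A sketch of that flow under those assumptions (helper names like getBrokerStateUri and parseBrokerStateJson are illustrative, not the actual method names):

```java
// Sketch of the described exception flow (simplified; names approximate the real code).
BrokerState getBrokerState(String host) {
    try {
        String json = doGet(getBrokerStateUri(host)); // doGet wraps IOException as RuntimeException
        return parseBrokerStateJson(json);
    } catch (RuntimeException e) {
        // With the fix, HttpTimeoutException (an IOException) lands here after ~30s
        // instead of never, and the roller thread moves on to the next broker.
        return new BrokerState(-1, null);
    }
}
```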

Expected improvement

With a 30s request timeout, a single stuck-broker encounter blocks the thread for at most operationTimeoutMs + 30s (330s with the default 300s operationTimeoutMs) instead of operationTimeoutMs + unbounded.

Backport request

We are currently running Strimzi 0.47.0 with Kafka 3.9.1 in production. Would it be possible to include this fix in a patch release for the 0.47.x line (or whichever maintained branch is compatible with Kafka 3.x)? The fix is minimal (two lines adding timeouts to existing code, no behavioral change on healthy brokers) and would help clusters affected by storage failures avoid unnecessarily long reconciliation times.
