Related problem
This is a follow-up to the discussion on Slack:
https://cloud-native.slack.com/archives/CMH3Q3SNP/p1764585476771369
We are experiencing failed graceful shutdowns of Kafka brokers during Strimzi-initiated rolling updates.
The root cause is the deletion propagation strategy used by Strimzi.
Kubernetes behaves differently depending on whether a pod is deleted with background or foreground propagation:
Background deletion:
The pod enters Terminating first; dependent resources such as the CiliumEndpoint (CEP) are cleaned up afterwards.
→ The broker still has working network during shutdown.
Foreground deletion (current Strimzi behavior):
The CiliumEndpoint is deleted first, before the pod’s containers have exited.
→ The broker instantly loses all network connectivity:
• Cannot communicate with the KRaft controller
• Cannot replicate
• Cannot send final heartbeats
• Cannot perform any graceful shutdown steps
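For reference, the difference is just the propagation policy passed with the delete call. Below is a minimal sketch using the fabric8 Kubernetes client (which Strimzi builds on); the namespace and pod name are placeholders:

```java
import io.fabric8.kubernetes.api.model.DeletionPropagation;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PodDeletionSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Foreground: dependents (e.g. the CiliumEndpoint) are garbage-collected
            // before the pod object is removed, so the broker loses its network
            // while its containers are still shutting down.
            client.pods()
                  .inNamespace("kafka")                // placeholder namespace
                  .withName("my-cluster-broker-2")     // placeholder pod name
                  .withPropagationPolicy(DeletionPropagation.FOREGROUND)
                  .delete();

            // Background: the pod is deleted first and dependents are cleaned up
            // afterwards, so the broker keeps its CiliumEndpoint during shutdown.
            client.pods()
                  .inNamespace("kafka")
                  .withName("my-cluster-broker-2")
                  .withPropagationPolicy(DeletionPropagation.BACKGROUND)
                  .delete();
        }
    }
}
```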
This loss of connectivity results in the broker entering recovery mode on every restart, drastically increasing restart times (10+ minutes per broker in production).
Additionally, because the broker cannot communicate with the controller during shutdown, it is not fenced properly.
Clients continue to send traffic to a broker that is already Terminating, which leads to:
• request timeouts
• incomplete shutdown sequences
• longer failover
• risk of degraded ingest performance during rollouts
This behavior currently makes safe rolling updates extremely difficult:
• Every Strimzi-triggered broker restart results in a slow, recovery-heavy startup.
• We have no way to control the rollout behavior, making production updates risky.
• End-to-end pipelines slow down significantly because brokers are sequentially stuck in recovery mode.
• The cluster spends extended periods in a degraded state due to improperly fenced brokers.
This is currently one of the biggest operational blockers for us in using Strimzi.
A related issue already exists here:
#9592
The problem is not limited to TLS encryption—it affects any Strimzi installation running with Cilium as the CNI.
There is also a discussion on the Cilium side:
cilium/cilium#30683
Suggested solution
Please add an option in Strimzi to configure the deletion propagation strategy used when rolling broker pods.
Ideally this would allow configuring:
• Foreground (current behavior)
• Background (safer for CNIs where endpoint teardown timing matters)
• Possibly Orphan (depending on requirements)
At minimum, the ability to configure StrimziPodSet deletion propagation would allow Kafka brokers to retain their network (CEP) long enough to complete a graceful shutdown.
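To illustrate the shape of the change we have in mind, here is a purely hypothetical sketch (neither the environment variable nor the helper below exist in Strimzi today) of how a configurable policy could be threaded into the pod deletion performed during a roll:

```java
import io.fabric8.kubernetes.api.model.DeletionPropagation;
import io.fabric8.kubernetes.client.KubernetesClient;

/**
 * Hypothetical sketch only: shows how a configurable propagation policy
 * could be passed through to the pod deletion during a rolling update.
 */
public class ConfigurablePodDeletion {

    // Hypothetical knob, e.g. STRIMZI_POD_DELETION_PROPAGATION=BACKGROUND
    static DeletionPropagation configuredPolicy() {
        String value = System.getenv()
                .getOrDefault("STRIMZI_POD_DELETION_PROPAGATION", "FOREGROUND");
        return DeletionPropagation.valueOf(value.toUpperCase());
    }

    static void deleteBrokerPod(KubernetesClient client, String namespace, String podName) {
        client.pods()
              .inNamespace(namespace)
              .withName(podName)
              .withPropagationPolicy(configuredPolicy())  // FOREGROUND, BACKGROUND or ORPHAN
              .delete();
    }
}
```

Exposing this through the Kafka CR, the StrimziPodSet, or an annotation would all work for us; the important part is being able to choose Background on clusters where Cilium owns the pod network.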
Alternatives
We explored multiple possible workarounds:
• Custom preStop hooks → Strimzi does not support defining lifecycle hooks for Kafka broker containers (a sketch of the hook we had in mind follows this list).
• Custom images → Does not solve network loss caused by CEP removal.
• Cilium NetworkPolicies → Cannot influence CEP deletion order or datapath teardown.
• Pausing reconciliation → Does not stop Strimzi from eventually deleting pods with foreground propagation.
None of these approaches can guarantee a safe, graceful broker shutdown as long as the CEP is removed before the broker has finished shutting down.
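For completeness, this is roughly the preStop hook we would have attached to the broker container if Strimzi exposed container lifecycle hooks; it is purely hypothetical, and even if it were possible it would only delay SIGTERM delivery and, as far as we can tell, would not change when foreground garbage collection removes the CEP:

```java
import io.fabric8.kubernetes.api.model.Lifecycle;
import io.fabric8.kubernetes.api.model.LifecycleBuilder;

/**
 * Hypothetical sketch: the preStop hook we would have tried if Strimzi
 * allowed container lifecycle hooks to be templated. It only delays SIGTERM;
 * it does not change the order in which the CiliumEndpoint is garbage-collected
 * under foreground deletion, so it is not a real fix.
 */
public class BrokerPreStopHook {

    static Lifecycle brokerLifecycle() {
        return new LifecycleBuilder()
                .withNewPreStop()
                    .withNewExec()
                        // Give the broker a short window before SIGTERM arrives.
                        .withCommand("sh", "-c", "sleep 30")
                    .endExec()
                .endPreStop()
                .build();
    }
}
```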
Additional context
Here is an excerpt from our broker logs during a Strimzi-triggered roll:
2025-12-08 11:45:54 INFO [SIGTERM handler] LoggingSignalHandler:93 - Terminating process due to signal SIGTERM
2025-12-08 11:45:54 INFO [kafka-shutdown-hook] BrokerServer:66 - [BrokerServer id=2] Transition from STARTED to SHUTTING_DOWN
2025-12-08 11:45:58 INFO [kafka-2-raft-outbound-request-thread] NetworkClient:921 - [RaftManager id=2] Disconnecting from node 5 due to request timeout.
2025-12-08 11:45:58 INFO [kafka-2-raft-outbound-request-thread] NetworkClient:411 - [RaftManager id=2] Cancelled in-flight FETCH request ... due to node 5 being disconnected
2025-12-08 11:45:58 INFO [broker-2-to-controller-heartbeat-channel-manager] NetworkClient:921 - [NodeToControllerChannelManager id=2 name=heartbeat] Disconnecting from node 5 due to request timeout.
Shortly after this sequence, the broker fails additional shutdown steps because it no longer has any network interface.
This directly corresponds to the CEP being deleted before the pod terminates.