
fix: prevent deadlock in websocket write during connection cleanup#414

Open
shiv3 wants to merge 1 commit into lorenzodonini:master from shiv3:fix/ws-write-deadlock

Conversation

Contributor

@shiv3 shiv3 commented Apr 7, 2026

Summary

  • Fix a deadlock between WriteManual/Close and cleanup that occurs under high connection concurrency (e.g. 2k+ simultaneous OCPP charge points)
  • Add a done channel to webSocket struct, closed at the start of cleanup, which unblocks pending writes/closes via select
  • Release s.connMutex.RLock in server.Write before calling w.Write to prevent server-wide cleanup starvation

Problem

This issue was introduced between v0.17.0 and v0.19.0 by commit 750a822 ("Fix races in websocket and ocppj"), which changed server.Write and server.stopConnections from using exclusive Lock to shared RLock. While this improved read concurrency, it introduced a deadlock path that manifests under high connection counts.

In v0.17.0, Write used exclusive Lock, which serialized all writes and made the race window negligibly small — the deadlock was practically impossible.

In v0.19.0 and later (including current master), Write uses RLock, allowing many concurrent writes. Under high concurrency, the probability of a write blocking on a full outQueue while holding RLock becomes significant, triggering the deadlock described below.

We confirmed this regression in production: v0.17.0 handles 5k+ concurrent OCPP charge point connections without issue, while v0.19.0 consistently fails at ~2k connections with cascading "already exists" errors that render the central system unable to write to any WebSocket connection.

Deadlock mechanism

When many WebSocket connections disconnect simultaneously (e.g. Kubernetes cluster autoscaler pod restart with 5k+ charge points):

  1. writePump detects a connection error, exits its loop (so outQueue is no longer drained), and calls cleanup
  2. A concurrent goroutine calls WriteManual, acquires w.mutex.RLock, and blocks on outQueue <- msg (buffer full, writePump no longer reading)
  3. cleanup waits for w.mutex.Lock, which requires all RLock holders to release
  4. Deadlock: WriteManual holds RLock waiting on the channel send, while cleanup waits for the RLock to be released (see the sketch below)
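
A minimal Go sketch of this interaction, using illustrative names (`webSocket`, `outQueue`, `mutex`) that mirror the description above rather than the actual `ws` package source:

```go
package ws

import "sync"

// Illustrative approximation of the relevant webSocket fields.
type webSocket struct {
	mutex    sync.RWMutex
	outQueue chan []byte
}

// WriteManual takes the read lock, then blocks on the channel send once the
// buffer is full and writePump has stopped draining it.
func (w *webSocket) WriteManual(msg []byte) error {
	w.mutex.RLock()
	defer w.mutex.RUnlock()
	w.outQueue <- msg // blocks forever: nothing reads from outQueue anymore
	return nil
}

// cleanup needs the write lock, which is never granted while WriteManual
// above still holds the read lock -> deadlock.
func (w *webSocket) cleanup() {
	w.mutex.Lock()
	defer w.mutex.Unlock()
	close(w.outQueue)
}
```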

Additionally, server.Write held s.connMutex.RLock for the entire duration of w.Write. If w.Write blocked (deadlock above), it prevented all handleDisconnect calls (which need s.connMutex.Lock) across all connections. Connections could not be removed from the map, so reconnecting charge points received "already exists" errors, and eventually no writes could succeed on any connection.

Production scenario

  1. Cluster autoscaler scales down → pod terminates → all charge point WebSocket connections drop
  2. 5k+ charge points reconnect simultaneously to the new pod
  3. Some old connections haven't been cleaned up yet (deadlocked in cleanup)
  4. New connections for the same charge point ID are rejected with "already exists"
  5. The deadlock cascades — blocked RLock holders prevent all connection cleanups
  6. Central system becomes completely unable to communicate with any charge point

Fix

  1. done channel: A new chan struct{} field on webSocket, created in newWebSocket and closed at the start of cleanup (before acquiring w.mutex.Lock). WriteManual and Close use select to abort immediately when done is closed, releasing their RLock so cleanup can proceed.

  2. server.Write lock scope: Release s.connMutex.RLock immediately after looking up the connection, before calling w.Write. The webSocket pointer remains valid after map deletion, so this is safe. This prevents a single blocked write from starving all handleDisconnect calls. Both changes are sketched below.
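
A sketch of both changes, again with illustrative names (`done`, `outQueue`, `connMutex`, `connections`) that approximate the description rather than quoting the diff; `server.Write` calls `WriteManual` here only because this sketch does not define a separate `Write` method:

```go
package ws

import (
	"fmt"
	"sync"
)

type webSocket struct {
	mutex    sync.RWMutex
	outQueue chan []byte
	done     chan struct{} // created in newWebSocket, closed once by cleanup
}

// WriteManual aborts instead of blocking once cleanup has started, so its
// read lock is released and cleanup can acquire the write lock.
func (w *webSocket) WriteManual(msg []byte) error {
	w.mutex.RLock()
	defer w.mutex.RUnlock()
	select {
	case w.outQueue <- msg:
		return nil
	case <-w.done:
		return fmt.Errorf("connection is closed")
	}
}

// cleanup signals done before acquiring the write lock, unblocking any
// WriteManual/Close call currently waiting on outQueue.
func (w *webSocket) cleanup() {
	close(w.done)
	w.mutex.Lock()
	defer w.mutex.Unlock()
	close(w.outQueue)
}

type server struct {
	connMutex   sync.RWMutex
	connections map[string]*webSocket
}

// Write releases connMutex right after the lookup, so a blocked write on one
// connection no longer starves handleDisconnect (which needs the exclusive
// lock) on every other connection.
func (s *server) Write(webSocketID string, data []byte) error {
	s.connMutex.RLock()
	w, ok := s.connections[webSocketID]
	s.connMutex.RUnlock()
	if !ok {
		return fmt.Errorf("couldn't write to websocket: no connection with id %s", webSocketID)
	}
	return w.WriteManual(data)
}
```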

Test plan

  • All existing ws package tests pass (the TestNetworkErrors failures are pre-existing, caused by a missing toxiproxy dependency)
  • go build ./ws/ ./ocppj/ ./ocpp1.6/... ./ocpp2.0.1/... succeeds
  • Verify with high-concurrency OCPP charge point simulation (2k+ connections with simultaneous reconnection)

When many connections disconnect simultaneously (e.g. cluster autoscaler
pod restart), a deadlock can occur between WriteManual/Close and cleanup:

1. WriteManual holds w.mutex.RLock and blocks on outQueue send (buffer full,
   writePump no longer reading because it exited to call cleanup)
2. cleanup waits for w.mutex.Lock, which requires all RLock holders to release
3. Neither can proceed -> deadlock

Additionally, server.Write held s.connMutex.RLock during the entire w.Write
call. If w.Write blocked, it prevented all handleDisconnect calls (which need
s.connMutex.Lock) across all connections, causing cascading "already exists"
errors when charge points attempted to reconnect.

Fix:
- Add a done channel to webSocket, closed at the start of cleanup
- Use select in WriteManual and Close to abort when done is signaled
- Release s.connMutex.RLock in server.Write before calling w.Write
@shiv3 shiv3 requested a review from lorenzodonini as a code owner April 7, 2026 02:29