
fix: prevent deadlock in websocket write during connection cleanup#414

Open
shiv3 wants to merge 1 commit into lorenzodonini:master from shiv3:fix/ws-write-deadlock

Conversation

Contributor

@shiv3 shiv3 commented Apr 7, 2026

Summary

  • Fix a deadlock between WriteManual/Close and cleanup that occurs under high connection concurrency (e.g. 2k+ simultaneous OCPP charge points)
  • Add a done channel to webSocket struct, closed at the start of cleanup, which unblocks pending writes/closes via select
  • Release s.connMutex.RLock in server.Write before calling w.Write to prevent server-wide cleanup starvation

Problem

This issue was introduced between v0.17.0 and v0.19.0 by commit 750a822 ("Fix races in websocket and ocppj"), which changed server.Write and server.stopConnections from using exclusive Lock to shared RLock. While this improved read concurrency, it introduced a deadlock path that manifests under high connection counts.

In v0.17.0, Write used exclusive Lock, which serialized all writes and made the race window negligibly small — the deadlock was practically impossible.

In v0.19.0 and later (including current master), Write uses RLock, allowing many concurrent writes. Under high concurrency, the probability of a write blocking on a full outQueue while holding RLock becomes significant, triggering the deadlock described below.

We confirmed this regression in production: v0.17.0 handles 5k+ concurrent OCPP charge point connections without issue, while v0.19.0 consistently fails at ~2k connections with cascading "already exists" errors that render the central system unable to write to any WebSocket connection.

Deadlock mechanism

When many WebSocket connections disconnect simultaneously (e.g. Kubernetes cluster autoscaler pod restart with 5k+ charge points):

  1. writePump detects a connection error, exits its loop (so outQueue is no longer drained), and calls cleanup
  2. A concurrent goroutine calls WriteManual, acquires w.mutex.RLock, and blocks on outQueue <- msg (buffer full, writePump no longer reading)
  3. cleanup waits for w.mutex.Lock, which requires all RLock holders to release
  4. Deadlock: WriteManual holds RLock waiting on the channel send, while cleanup waits for the RLock to be released (see the sketch below)
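
A minimal Go sketch of this interaction, using illustrative names (`webSocket`, `outQueue`, `mutex`) that mirror the description above rather than the actual `ws` package source:

```go
package ws

import "sync"

// Illustrative approximation of the relevant webSocket fields.
type webSocket struct {
	mutex    sync.RWMutex
	outQueue chan []byte
}

// WriteManual takes the read lock, then blocks on the channel send once the
// buffer is full and writePump has stopped draining it.
func (w *webSocket) WriteManual(msg []byte) error {
	w.mutex.RLock()
	defer w.mutex.RUnlock()
	w.outQueue <- msg // blocks forever: nothing reads from outQueue anymore
	return nil
}

// cleanup needs the write lock, which is never granted while WriteManual
// above still holds the read lock -> deadlock.
func (w *webSocket) cleanup() {
	w.mutex.Lock()
	defer w.mutex.Unlock()
	close(w.outQueue)
}
```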

Additionally, server.Write held s.connMutex.RLock for the entire duration of w.Write. If w.Write blocked (deadlock above), it prevented all handleDisconnect calls (which need s.connMutex.Lock) across all connections. Connections could not be removed from the map, so reconnecting charge points received "already exists" errors, and eventually no writes could succeed on any connection.

Production scenario

  1. Cluster autoscaler scales down → pod terminates → all charge point WebSocket connections drop
  2. 5k+ charge points reconnect simultaneously to the new pod
  3. Some old connections haven't been cleaned up yet (deadlocked in cleanup)
  4. New connections for the same charge point ID are rejected with "already exists"
  5. The deadlock cascades — blocked RLock holders prevent all connection cleanups
  6. Central system becomes completely unable to communicate with any charge point

Fix

  1. done channel: A new chan struct{} field on webSocket, created in newWebSocket and closed at the start of cleanup (before acquiring w.mutex.Lock). WriteManual and Close use select to abort immediately when done is closed, releasing their RLock so cleanup can proceed.

  2. server.Write lock scope: Release s.connMutex.RLock immediately after looking up the connection, before calling w.Write. The webSocket pointer remains valid after map deletion, so this is safe. This prevents a single blocked write from starving all handleDisconnect calls. Both changes are sketched below.
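
A sketch of both changes, again with illustrative names (`done`, `outQueue`, `connMutex`, `connections`) that approximate the description rather than quoting the diff; `server.Write` calls `WriteManual` here only because this sketch does not define a separate `Write` method:

```go
package ws

import (
	"fmt"
	"sync"
)

type webSocket struct {
	mutex    sync.RWMutex
	outQueue chan []byte
	done     chan struct{} // created in newWebSocket, closed once by cleanup
}

// WriteManual aborts instead of blocking once cleanup has started, so its
// read lock is released and cleanup can acquire the write lock.
func (w *webSocket) WriteManual(msg []byte) error {
	w.mutex.RLock()
	defer w.mutex.RUnlock()
	select {
	case w.outQueue <- msg:
		return nil
	case <-w.done:
		return fmt.Errorf("connection is closed")
	}
}

// cleanup signals done before acquiring the write lock, unblocking any
// WriteManual/Close call currently waiting on outQueue.
func (w *webSocket) cleanup() {
	close(w.done)
	w.mutex.Lock()
	defer w.mutex.Unlock()
	close(w.outQueue)
}

type server struct {
	connMutex   sync.RWMutex
	connections map[string]*webSocket
}

// Write releases connMutex right after the lookup, so a blocked write on one
// connection no longer starves handleDisconnect (which needs the exclusive
// lock) on every other connection.
func (s *server) Write(webSocketID string, data []byte) error {
	s.connMutex.RLock()
	w, ok := s.connections[webSocketID]
	s.connMutex.RUnlock()
	if !ok {
		return fmt.Errorf("couldn't write to websocket: no connection with id %s", webSocketID)
	}
	return w.WriteManual(data)
}
```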

Test plan

  • All existing ws package tests pass (the TestNetworkErrors failures are pre-existing, caused by a missing toxiproxy dependency)
  • go build ./ws/ ./ocppj/ ./ocpp1.6/... ./ocpp2.0.1/... succeeds
  • Verify with high-concurrency OCPP charge point simulation (2k+ connections with simultaneous reconnection)

When many connections disconnect simultaneously (e.g. cluster autoscaler
pod restart), a deadlock can occur between WriteManual/Close and cleanup:

1. WriteManual holds w.mutex.RLock and blocks on outQueue send (buffer full,
   writePump no longer reading because it exited to call cleanup)
2. cleanup waits for w.mutex.Lock, which requires all RLock holders to release
3. Neither can proceed -> deadlock

Additionally, server.Write held s.connMutex.RLock during the entire w.Write
call. If w.Write blocked, it prevented all handleDisconnect calls (which need
s.connMutex.Lock) across all connections, causing cascading "already exists"
errors when charge points attempted to reconnect.

Fix:
- Add a done channel to webSocket, closed at the start of cleanup
- Use select in WriteManual and Close to abort when done is signaled
- Release s.connMutex.RLock in server.Write before calling w.Write
@shiv3 shiv3 requested a review from lorenzodonini as a code owner April 7, 2026 02:29