fix: prevent deadlock in websocket write during connection cleanup#414
Open
shiv3 wants to merge 1 commit into lorenzodonini:master
When many connections disconnect simultaneously (e.g. cluster autoscaler pod restart), a deadlock can occur between `WriteManual`/`Close` and `cleanup`:

1. `WriteManual` holds `w.mutex.RLock` and blocks on the `outQueue` send (buffer full; `writePump` is no longer reading because it exited to call `cleanup`)
2. `cleanup` waits for `w.mutex.Lock`, which requires all `RLock` holders to release
3. Neither can proceed -> deadlock

Additionally, `server.Write` held `s.connMutex.RLock` during the entire `w.Write` call. If `w.Write` blocked, it prevented all `handleDisconnect` calls (which need `s.connMutex.Lock`) across all connections, causing cascading "already exists" errors when charge points attempted to reconnect.

Fix:
- Add a `done` channel to `webSocket`, closed at the start of `cleanup`
- Use `select` in `WriteManual` and `Close` to abort when `done` is signaled
- Release `s.connMutex.RLock` in `server.Write` before calling `w.Write`
## Summary
- Fixes a deadlock between `WriteManual`/`Close` and `cleanup` that occurs under high connection concurrency (e.g. 2k+ simultaneous OCPP charge points)
- Adds a `done` channel to the `webSocket` struct, closed at the start of `cleanup`, which unblocks pending writes/closes via `select`
- Releases `s.connMutex.RLock` in `server.Write` before calling `w.Write` to prevent server-wide cleanup starvation
## Problem

This issue was introduced between v0.17.0 and v0.19.0 by commit 750a822 ("Fix races in websocket and ocppj"), which changed `server.Write` and `server.stopConnections` from using exclusive `Lock` to shared `RLock`. While this improved read concurrency, it introduced a deadlock path that manifests under high connection counts.

In v0.17.0, `Write` used exclusive `Lock`, which serialized all writes and made the race window negligibly small; the deadlock was practically impossible. In v0.19.0 and later (including current master), `Write` uses `RLock`, allowing many concurrent writes. Under high concurrency, the probability of a write blocking on a full `outQueue` while holding `RLock` becomes significant, triggering the deadlock described below.

We confirmed this regression in production: v0.17.0 handles 5k+ concurrent OCPP charge point connections without issue, while v0.19.0 consistently fails at ~2k connections with cascading `"already exists"` errors that render the central system unable to write to any WebSocket connection.

### Deadlock mechanism
When many WebSocket connections disconnect simultaneously (e.g. Kubernetes cluster autoscaler pod restart with 5k+ charge points):
1. `writePump` detects a connection error, exits its read loop, and calls `cleanup`
2. `WriteManual` acquires `w.mutex.RLock` and blocks on `outQueue <- msg` (buffer full, `writePump` no longer reading)
3. `cleanup` waits for `w.mutex.Lock`, which requires all `RLock` holders to release
4. Deadlock: `WriteManual` holds `RLock` waiting on the channel send, while `cleanup` waits for the `RLock` release

Additionally, `server.Write` held `s.connMutex.RLock` for the entire duration of `w.Write`. If `w.Write` blocked (deadlock above), it prevented all `handleDisconnect` calls (which need `s.connMutex.Lock`) across all connections. Connections could not be removed from the map, so reconnecting charge points received `"already exists"` errors, and eventually no writes could succeed on any connection.
### Production scenario

Blocked `RLock` holders prevent all connection cleanups, so reconnecting charge points receive `"already exists"` errors.
## Fix

- **`done` channel**: A new `chan struct{}` field on `webSocket`, created in `newWebSocket` and closed at the start of `cleanup` (before acquiring `w.mutex.Lock`). `WriteManual` and `Close` use `select` to abort immediately when `done` is closed, releasing their `RLock` so `cleanup` can proceed.
- **`server.Write` lock scope**: Releases `s.connMutex.RLock` immediately after looking up the connection, before calling `w.Write`. The `webSocket` pointer remains valid after map deletion, so this is safe. This prevents a single blocked write from starving all `handleDisconnect` calls.
## Test plan

- `ws` package tests pass (`TestNetworkErrors` failures are pre-existing, caused by a missing toxiproxy dependency)
- `go build ./ws/ ./ocppj/ ./ocpp1.6/... ./ocpp2.0.1/...` succeeds