fix: deadlock in DefaultServerDispatcher #412
Open
smulloni wants to merge 2 commits into lorenzodonini:master from
Proposed changes
This fixes a deadlock in DefaultServerDispatcher when many requests time out concurrently, matching the symptoms reported in #358.
DefaultServerDispatcher shares its RWMutex with ServerState (via NewServerState(&d.mutex)). When more than 10 client request timeouts fire simultaneously, waitForTimeout goroutines acquire RLock and then block sending to timerC (buffer=10) while still holding it. Meanwhile messagePump, the sole timerC reader, calls HasPendingRequest, which needs Lock on the same mutex. Lock cannot be granted until every RLock is released, and the RLock holders cannot release until messagePump drains timerC. This circular wait freezes the entire server permanently.
Three fixes are applied:
1. Give ServerState its own RWMutex instead of sharing the dispatcher's. This breaks the cross-concern locking dependency.
2. In waitForTimeout, release RLock before sending to timerC. Previously the lock was held (via defer) across the blocking send.
3. In messagePump's timeout handler, inline the request completion instead of calling CompleteRequest, which sends to readyForDispatch (buffer=1). Since messagePump is the sole reader of that channel, this was a self-deadlock path when processing consecutive timeouts.
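The self-deadlock addressed by the third fix comes from a goroutine sending on a buffered channel that only it receives from. The toy below is hypothetical (handleTimeouts is an illustrative name, not the ocpp-go messagePump): it processes a burst of timeouts back to back, before returning to the loop that would drain readyForDispatch, and contrasts the CompleteRequest-style send with inlined completion.

```go
package main

import "time"

// handleTimeouts is a hypothetical toy, not the ocpp-go messagePump. It
// processes a burst of timeouts back to back on the goroutine that is the only
// receiver of readyForDispatch (capacity 1, as in the PR description).
func handleTimeouts(ids []string, inline bool) (completed int, stalled bool) {
	readyForDispatch := make(chan string, 1)
	pending := make(map[string]bool, len(ids))
	for _, id := range ids {
		pending[id] = true
	}
	for _, id := range ids {
		if inline {
			// Inlined completion: finish the request in place; nothing is
			// sent on a channel this goroutine would have to drain itself.
			delete(pending, id)
			completed++
			continue
		}
		// Pre-fix path: a CompleteRequest-style signal on readyForDispatch.
		// The first send fills the buffer; the second can never succeed
		// because the only receiver is this goroutine. The timeout exists
		// only so the demo can report the stall instead of hanging.
		select {
		case readyForDispatch <- id:
			delete(pending, id)
			completed++
		case <-time.After(100 * time.Millisecond):
			return completed, true
		}
	}
	return completed, false
}
```

With three consecutive timeouts, the send-based path completes one request and then stalls, while the inlined path completes all three.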
Additionally, read-only ServerState/ClientState methods (HasPendingRequest, GetPendingRequest, DeletePendingRequest, HasPendingRequests) now use RLock instead of Lock, and GetClientState uses a read-then-upgrade pattern, reducing contention under load.
Includes a regression test that reliably triggers the deadlock on unfixed code: 15 clients with a 50ms timeout, expecting all callbacks to fire within 2s. Before the fix, only 1 of the 15 callbacks fired before the server froze.
Types of changes
What types of changes does your code introduce?
Put an x in the boxes that apply
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
Further comments
Several other PRs partially address this issue, but as far as we can tell this one is comprehensive. It also explains why horizontal scaling mitigates the problem: with fewer clients per server instance, it is less likely that more than 10 timeouts fire at once. It is currently running in production for us at GreenJoulez.