fix: deadlock in DefaultServerDispatcher#412

Open
smulloni wants to merge 2 commits into lorenzodonini:master from GreenJoulez:master
Conversation

@smulloni

Proposed changes

This fixes a deadlock in DefaultServerDispatcher that occurs when many requests time out concurrently, matching the symptoms reported in #358.

DefaultServerDispatcher shares its RWMutex with ServerState (via NewServerState(&d.mutex)). When more than 10 client request timeouts fire simultaneously, waitForTimeout goroutines acquire RLock and block sending to timerC (buffer=10). Meanwhile messagePump, the sole timerC reader, calls HasPendingRequest which needs Lock on the same mutex — forming a circular wait that freezes the entire server permanently.

Three fixes applied:

Give ServerState its own RWMutex instead of sharing the dispatcher's. This breaks the cross-concern locking dependency.

In waitForTimeout, release RLock before sending to timerC. Previously the lock was held via defer across the blocking send.

In messagePump's timeout handler, inline the request completion instead of calling CompleteRequest, which sends to readyForDispatch (buffer=1). Since messagePump is the sole reader of that channel, this was a self-deadlock path when processing consecutive timeouts.

Additionally, read-only ServerState/ClientState methods (HasPendingRequest, GetPendingRequest, DeletePendingRequest, HasPendingRequests) now use RLock instead of Lock, and GetClientState uses a read-then-upgrade pattern, reducing contention under load.

Includes a regression test that reliably triggers the deadlock on unfixed code (15 clients, 50ms timeout, expects all callbacks to fire within 2s — previously only 1/15 would fire before freezing).

Types of changes

What types of changes does your code introduce?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of
them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before
merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (glad to, where do I do this?)
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

There are several PRs partially addressing this issue, but as far as we can tell this one is comprehensive (and it explains why the problem can be mitigated through horizontal scaling). It is currently running in production for us at GreenJoulez.

@smulloni smulloni requested a review from lorenzodonini as a code owner March 23, 2026 14:44
