Skip to content

fix: shard Dispatcher DO per session instead of global singleton#83

Merged
rpurdel merged 5 commits intomainfrom
fix/shard-do-per-session
Apr 17, 2026
Merged

fix: shard Dispatcher DO per session instead of global singleton#83
rpurdel merged 5 commits intomainfrom
fix/shard-do-per-session

Conversation

@rpurdel
Copy link
Copy Markdown

@rpurdel rpurdel commented Apr 17, 2026

Replace idFromName('global') with idFromName(sessionId) for the TranscriptionDispatcher DO. Each transcription session gets its own isolated DO instead of sharing one global instance.

Follow-ups in this branch

  • Random-UUID fallback for missing sessionId (5907dff) — avoids everyone-colliding-on-'unknown' when the client doesn't supply one.

Test plan

  • Deployed to staging (worker-only, container image pinned to existing CF tag 29fdac7f — no container rebuild)
  • Generate staging traffic and confirm per-session DOs appear in logs

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Code Review

Overview

This PR shards the TranscriptionDispatcher Durable Object from a single global instance to one instance per sessionId. The change is conceptually correct and simple — just two lines changed in handleWebSocketWithDispatcher.

Positive Aspects

  • Correct isolation: Previously all sessions shared one global DO, which meant transcription dispatches across every concurrent session were competing within a single DO. Per-session sharding eliminates that contention.
  • Consistent: Both call sites (idFromName) are updated together — there's no risk of the initial connection and the connectToDispatcher reconnect function routing to different DOs for the same session.
  • Session resumption compatibility: If a client reconnects with the same sessionId, it will naturally land on the same DO instance, which is the right behaviour.

Issues / Concerns

1. The 'unknown' fallback becomes a hidden singleton (medium)

// worker/index.ts line 515
const sessionId = url.searchParams.get('sessionId') || 'unknown';

With the old 'global' name every session shared one DO. Now sessions without a sessionId all share an 'unknown' DO — same problem, different name. If production deployments always supply sessionId this is fine, but if not:

  • Multiple anonymous sessions share state unexpectedly.
  • A busy production event could saturate the 'unknown' DO just as 'global' could be saturated.

Suggestion: Either throw/reject requests without a sessionId when DISPATCHER_DO is in use, or generate a random fallback ID:

const sessionId = url.searchParams.get('sessionId') ?? crypto.randomUUID();

2. DO instances are never cleaned up (low-medium)

Cloudflare DOs persist indefinitely. Each session now creates a new DO instance that is never explicitly deleted after the session ends. For a high-volume deployment this will accumulate unboundedly.

Suggestion: After flushBeforeClose() (or when the session closes cleanly), send a cleanup/delete signal to the DO so it can call this.storage.deleteAll() and mark itself for Cloudflare's automatic eviction. Alternatively, document this limitation so operators are aware.

3. Comment slightly overstates what idFromName gives you (minor)

// Each session gets its own DO instance for isolation and geographic locality.

idFromName does not guarantee geographic locality — the DO is placed in a single region determined by Cloudflare's routing, not necessarily co-located with the container. The isolation claim is correct; the locality claim is aspirational at best. Consider removing or softening the locality reference to avoid confusion.

4. No tests cover the sharding behaviour (minor)

There are no existing tests for handleWebSocketWithDispatcher visible in the diff, but given this is a one-line logic change the risk is low. Still worth a note: a regression test that asserts idFromName is called with the session ID (not 'global') would make this invariant explicit and prevent accidental reversion.

Summary

The core change is correct and addresses a real scalability problem. The main actionable item before merging is clarifying the 'unknown' fallback behaviour — either document it explicitly or harden it so anonymous sessions don't silently share a dispatcher DO.

@rpurdel
Copy link
Copy Markdown
Author

rpurdel commented Apr 17, 2026

Addressed #1 and #3 in the latest push.

Re #2 (DO cleanup): This DO writes no storage (TranscriptionDispatcherDO never calls ctx.storage), so storage.deleteAll() would be a no-op. There's also no API to delete a DO ID — idFromName() IDs are deterministic hashes that exist conceptually forever. The actual cost (isolate runtime) evicts automatically after inactivity. No cleanup needed for our stateless DO.

Re #4 (tests): Agreed this is low risk for a one-line change. The sharding behavior will be validated in staging with a live transcription meeting.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Code Review

Overview

This PR fixes a correctness bug: the TranscriptionDispatcher DO was a global singleton (idFromName('global')), meaning all active sessions shared the same DO instance. That would cause transcription events from different sessions to be routed through the same DO — almost certainly wrong. Sharding by sessionId is the right fix.

The 'unknown' fallback changed to crypto.randomUUID() is also correct — previously all sessions without a sessionId would pile up on a single DO named 'unknown'.


Issues to Address

1. DO resource accumulation (potential leak)

With per-session DOs, a new DO instance is created for every session. On session close, the code correctly closes the WebSocket connection to the DO:

dispatcherWs?.close(1000, 'Session ended');

But this only closes the WebSocket connection — it does not destroy the DO instance itself. Cloudflare DOs that use this.state.storage persist indefinitely unless explicitly deleted. If the DispatcherDO accumulates storage per session and has no self-cleanup mechanism (e.g., an alarm that fires after the WebSocket closes), you will accumulate one DO per historical session. Worth verifying that the DispatcherDO either:

  • Uses no persistent storage (it is evicted from memory after inactivity — fine), or
  • Sets a cleanup alarm when its last WebSocket closes so it can delete its own storage.

2. sessionId inconsistency between DO routing and container routing

selectContainerInstance reads sessionId independently from the URL query string (url.searchParams.get('sessionId') — returns null if absent). The fetch handler now resolves it as url.searchParams.get('sessionId') || crypto.randomUUID(). When the param is absent, the DO gets a fresh random UUID but container routing gets null (falling back to 'default' in session mode or random pool routing). The two identifiers for the "session" don't align. This is a pre-existing issue, but the crypto.randomUUID() change makes it more visible. Consider passing the resolved sessionId into selectContainerInstance as an argument rather than having it re-read from the URL.

3. Unbounded reconnect loop

scheduleReconnect caps backoff delay at 30 s but never stops retrying. If the Dispatcher DO is persistently unreachable for the full life of a session, the retry loop (and its log output) runs indefinitely. Consider capping reconnectAttempts at a maximum and giving up after e.g. 10 attempts.


Positive Notes

  • The fix is minimal and well-targeted — touches only the two idFromName call sites and the fallback string.
  • The added inline comment ("Each session gets its own DO instance for isolation.") clearly documents the intent.
  • The crypto.randomUUID() fallback ensures sessions without a sessionId are isolated from each other, not collapsed onto a hot singleton DO.
  • No unrelated changes; low regression risk.

Summary

The core fix (global singleton → per-session DO) is correct and necessary. The main follow-up question before merging is whether the DispatcherDO cleans itself up after a session ends, to avoid unbounded DO accumulation over time. The sessionId / container routing inconsistency is a pre-existing issue worth a separate ticket.

Review generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Code Review — Updated Analysis

What This PR Does

Shards the TranscriptionDispatcher Durable Object from a single global singleton (idFromName('global')) to one DO instance per sessionId. Also passes the sessionId as a query parameter on the WebSocket upgrade request, and replaces the 'unknown' fallback with crypto.randomUUID().


Positive Notes

  • Core fix is correct. A global singleton DO for all sessions was almost certainly a bug — all sessions would share the same dispatcher context. Per-session sharding is the right model.
  • Both call sites updated together. The initial connection in handleWebSocketWithDispatcher and the connectToDispatcher reconnect helper both use the same idFromName(sessionId), so a reconnect will reach the same DO instance as the original connection.
  • crypto.randomUUID() fallback is an improvement over the previous 'unknown' string, which would have collapsed all anonymous sessions onto one shared DO.
  • sessionId passed in the upgrade request URL — this is a nice touch; the DO handler can now read sessionId from the URL if it wants to log or key on it without needing to extract it from the DO name.
  • Minimal diff. Low regression risk.

Remaining Concerns

1. sessionId / container routing still misaligned (pre-existing, now more visible)

selectContainerInstance independently re-reads sessionId from the URL query string (returns null if absent). The fetch handler now resolves it as:

const sessionId = url.searchParams.get('sessionId') || crypto.randomUUID();

When the param is absent, the DO gets a fresh random UUID but selectContainerInstance sees null, which falls back to 'default' (session mode) or random pool routing. The two subsystems disagree on the session identity. This is pre-existing but now more visible. Worth a follow-up ticket to thread the resolved sessionId into selectContainerInstance.

2. Unbounded reconnect loop (pre-existing)

scheduleReconnect caps the backoff delay at 30 s but never stops retrying. If the dispatcher DO is persistently unreachable for the full lifetime of a session, the retry loop runs indefinitely. Consider capping reconnectAttempts (e.g., give up after 10 attempts) as a separate improvement.


DO Cleanup

As the author noted, the DispatcherDO is stateless (no ctx.storage writes), so there is nothing to clean up — Cloudflare will evict the isolate automatically after inactivity. This concern is resolved.


Overall

The core change is correct and necessary. The two remaining items are pre-existing issues unrelated to this PR. Good to merge with awareness that the sessionId / container-routing alignment and the reconnect cap are worth follow-up work.

Review generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Code Review — Final Pass

This review addresses the current diff state after the follow-up commits, given that three earlier review rounds have already covered the main concerns.

What's in the diff (current state)

Change Assessment
idFromName('global')idFromName(sessionId) at both call sites Correct — core fix
Fallback 'unknown'crypto.randomUUID() Correct — avoids anonymous-session collision
?meetingId=… query param added to upgrade request Minor inconsistency (see below)
Inline comment "Each session gets its own DO instance for isolation." Clear and accurate

One remaining nit: query param name vs. PR description

The PR description says:

Pass sessionId as a query param on the dispatcher DO upgrade URL (http://dispatcher/websocket?sessionId=…)

But the code passes meetingId:

`http://dispatcher/websocket?meetingId=${encodeURIComponent(sessionId)}`

This is a doc/code mismatch. meetingId might be the right name from the DispatcherDO's domain perspective (if the DO models it as a meeting rather than a proxy session), in which case the PR description should say ?meetingId=…. If it's just an oversight, consider renaming the query param to sessionId for consistency with the variable name and the rest of the codebase. Either way, update the PR description to match.

Pre-existing issues (acknowledged in earlier rounds, not blocking this PR)

  • sessionId / container routing misalignmentselectContainerInstance re-reads sessionId from the URL independently; when the param is absent, the DO gets a UUID but container routing sees null. Worth a follow-up ticket.
  • Unbounded reconnect loopscheduleReconnect never stops; consider a max-attempts cap in a follow-up.

Summary

The core change is correct, minimal, and addresses a real isolation bug. The DO-cleanup concern from earlier rounds is resolved (DispatcherDO is stateless). The only actionable item here is aligning the query param name with the PR description. Ready to merge after that minor clarification.

Review generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

Code Review — Final State (after all 5 commits)

The PR has iterated well through review. Here is an assessment of the final diff vs. the full commit history.

Final diff is clean

The net change across all 5 commits is exactly what the PR title says: replace idFromName('global') with idFromName(sessionId) at both call sites, and replace the 'unknown' fallback with crypto.randomUUID(). Everything else (passing ?sessionId= / ?meetingId= in the upgrade URL) was explored and reverted — the right call, since Cloudflare redacts query params in DO-to-DO requests and the DO can read the name via ctx.id.name anyway.

What's addressed

Concern (from earlier rounds) Status
'unknown' fallback creates a hidden singleton Fixed — crypto.randomUUID()
Comment overstated geographic locality Fixed — "isolation" only
DO cleanup needed Resolved — DispatcherDO is stateless; CF evicts on inactivity
?meetingId naming inconsistency Moot — query param dropped entirely in commit 5

One small note on ctx.id.name

Commit 5's message says "DO reads ctx.id.name instead". That works correctly here because all DO IDs in this file go through idFromName() — both real session IDs and the crypto.randomUUID() fallback. ctx.id.name is only populated for IDs created via idFromName(); if a future change ever switched to newUniqueId(), ctx.id.name would return undefined. Worth a brief comment in the DispatcherDO handler if it depends on this, so the invariant is visible there too.

Pre-existing issues (not blocking)

  • sessionId / container routing misalignmentselectContainerInstance independently re-reads sessionId from the URL; when the param is absent, the DO gets a UUID but container routing sees null. Worth a follow-up ticket.
  • Unbounded reconnect loopscheduleReconnect never stops retrying; worth capping reconnectAttempts in a follow-up.

Summary

Core change is correct and minimal. The full commit history shows good iterative refinement. Ready to merge.

Review generated with Claude Code

@rpurdel rpurdel merged commit 6215b13 into main Apr 17, 2026
2 checks passed
@rpurdel rpurdel deleted the fix/shard-do-per-session branch April 17, 2026 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants