fix: batched catchup in YRoomUpdateBuffer.resume() by xrl · Pull Request #218 · jupyter-ai-contrib/jupyter-server-documents

xrl · 2026-04-10T16:32:42Z

Summary

Ports the batched catchup approach onto the YRoomUpdateBuffer class introduced in #215 (Fix content duplication on reconnection)
Saves the server's state vector before the handshake in handle_sync()
After the handshake, computes a single batched diff via ydoc.get_update(pre_sync_sv) and passes it to resume()
resume() broadcasts the batched diff instead of replaying individual queued messages

This fixes two issues for divergent client handshakes:

Silent data loss: updates queued during the divergent handshake are now delivered as one clean diff covering the full gap
findIndexSS crash (#197, Bugs when adding Unicode chars through Python API): individual incremental Text updates from pycrdt after multi-byte characters use offset encoding that JS yjs cannot resolve. A single batched update keeps all CRDT struct references resolvable within the same message.
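
To make the state-vector idea concrete, here is a minimal sketch using a toy stand-in for a CRDT document. ToyDoc, its per-client op log, and the dict-based state vector are assumptions made for illustration only; they are not pycrdt's types or wire format. Only the method names get_state, get_update and apply_update mirror the calls named in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Toy stand-in for a CRDT doc: an append-only op log per client id."""
    ops: dict[int, list[str]] = field(default_factory=dict)

    def insert(self, client_id: int, value: str) -> None:
        self.ops.setdefault(client_id, []).append(value)

    def get_state(self) -> dict[int, int]:
        # State vector: for each client, how many of its ops this doc has seen.
        return {cid: len(log) for cid, log in self.ops.items()}

    def get_update(self, state: dict[int, int]) -> dict[int, tuple[int, list[str]]]:
        # One diff holding everything this doc has that `state` lacks.
        # Maps client id -> (starting clock, missing ops).
        diff = {}
        for cid, log in self.ops.items():
            seen = state.get(cid, 0)
            if seen < len(log):
                diff[cid] = (seen, log[seen:])
        return diff

    def apply_update(self, update: dict[int, tuple[int, list[str]]]) -> None:
        for cid, (start, new_ops) in update.items():
            log = self.ops.setdefault(cid, [])
            have = len(log)
            if start > have:
                raise ValueError("update references ops this doc has not seen")
            # Skip ops already present so re-applying a diff is harmless.
            log.extend(new_ops[have - start:])


server = ToyDoc()
server.insert(1, "a")

# A peer that is fully synced before the handshake starts.
peer = ToyDoc()
peer.apply_update(server.get_update({}))

# Save the state vector, then let mutations happen during the handshake gap.
pre_sync_sv = server.get_state()
server.insert(1, "b")
server.insert(2, "c")

# A single batched diff covers the whole gap.
catchup = server.get_update(pre_sync_sv)
peer.apply_update(catchup)
```

In this toy model the peer ends up with the same log as the server after one message, however many mutations landed during the gap.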

Changes

rooms/yroom.py — handle_sync():

Save pre_sync_sv = self._ydoc.get_state() before the handshake
After handshake, compute catchup = self._ydoc.get_update(pre_sync_sv)
Pass batched catchup to self.update_buffer.resume(catchup_message=...)

rooms/yroom_update_buffer.py — resume():

Accept optional catchup_message parameter
Discard the queue and broadcast the batched message instead of replaying individual updates
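
The resume() behaviour described above can be sketched as a small pausable broadcaster. ToyUpdateBuffer, pause() and add() are hypothetical names chosen for this illustration; only resume(catchup_message=...) and the _queue, _paused and _broadcast attributes follow the diff in this PR.

```python
from typing import Callable, Optional


class ToyUpdateBuffer:
    """Hypothetical sketch of a broadcaster that can be paused during a handshake."""

    def __init__(self, broadcast: Callable[[bytes], None]) -> None:
        self._broadcast = broadcast
        self._queue: list[bytes] = []
        self._paused = False

    def pause(self) -> None:
        self._paused = True

    def add(self, message: bytes) -> None:
        # While paused, hold messages back instead of sending them.
        if self._paused:
            self._queue.append(message)
        else:
            self._broadcast(message)

    def resume(self, catchup_message: Optional[bytes] = None) -> None:
        # Discard the queue: the batched message is assumed to cover it.
        self._queue = []
        self._paused = False
        if catchup_message is not None:
            self._broadcast(catchup_message)


sent: list[bytes] = []
buf = ToyUpdateBuffer(sent.append)
buf.add(b"u1")                 # not paused: broadcast immediately
buf.pause()
buf.add(b"u2")                 # held back during the handshake
buf.add(b"u3")
buf.resume(catchup_message=b"u2+u3")
```

Here clients receive b"u1" followed by one combined message, rather than b"u2" and b"u3" replayed individually.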

Test plan

All 22 tests pass (pytest jupyter_server_documents/tests/test_yroom_sync.py jupyter_server_documents/tests/test_yroom.py)
Stress tests are in a separate PR: #219 (test: stress tests for sync handshake under concurrent mutations)

Related

Upstream pycrdt fix for the offset encoding bug: y-crdt/pycrdt#379 (fix: use UTF-16 offsets for Text operations)
#197: Bugs when adding Unicode chars through Python API (findIndexSS crash with Unicode chars)
#215: Fix content duplication on reconnection (YRoomUpdateBuffer introduction)
#219: test: stress tests for sync handshake under concurrent mutations (stress tests, split out per @dlqqq's request)

dlqqq · 2026-04-10T21:35:55Z

@xrl Thank you for opening this PR! Really appreciate your interest in this package. 🤗

We just opened a PR that does something similar in #215. Although it was motivated by an issue different from the one you're reporting, I believe the implementation is similar. In #215 we are introducing a mechanism to make clients sync one at a time, queuing updates while a client is syncing, and then flushing the queue of updates after the client completed syncing.

However, #215 does not implement the "batched catchup diff" approach you discovered, where all of the updates are merged into a single message. Great find! Currently we plan to merge #215 ASAP to unblock the JSD v0.2.0 release, which will deploy the memory management feature that admins really need in prod deployments.

Could you help port over your changes to the new YRoomUpdateBuffer class after #215 is merged? Would love to see your stress test in a separate PR too.

Port the batched catchup approach onto the YRoomUpdateBuffer class introduced in jupyter-ai-contrib#215. Instead of replaying individual queued messages in resume(), compute a single batched diff from the pre-handshake state vector via ydoc.get_update(pre_sync_sv).

This fixes two issues:

1. Silent data loss (jupyter-ai-contrib#197): updates queued during a divergent handshake are now delivered as one clean diff that covers the full gap.
2. findIndexSS crash (jupyter-ai-contrib#197): individual incremental Text updates from pycrdt after multi-byte characters use offset encoding that JS yjs cannot resolve. A single batched update keeps all CRDT struct references resolvable within the same message.

Changes:

- yroom.py: save pre_sync_sv before handle_sync handshake, compute catchup diff after handshake, pass to resume()
- yroom_update_buffer.py: resume() accepts optional catchup_message and broadcasts it instead of replaying individual queued messages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

xrl · 2026-04-14T03:23:06Z

@dlqqq Thanks for the review and for merging #215!

I've rebased this PR onto main (absorbing #215) and reworked the implementation to use YRoomUpdateBuffer:

resume() now accepts an optional catchup_message parameter. Instead of replaying individual queued messages (which triggers the pycrdt offset encoding bug from #197), it broadcasts a single batched diff computed from the pre-handshake state vector.
handle_sync() saves pre_sync_sv = self._ydoc.get_state() before the handshake and passes ydoc.get_update(pre_sync_sv) to resume() after it completes.

Stress tests are in a separate PR: #219

The diff is now just 2 files changed, +28/-8 lines.

dlqqq · 2026-04-21T23:35:20Z

Thanks for working on this & for rebasing the PR! We will definitely give this a review and merge following JSD v0.2.0 release.

dlqqq

Thank you for opening this PR! The catchup now happens instantly all-at-once, instead of in rapid sequential updates. This is an innovative approach. Tested it locally and verified it works well, even better than our previous experience. 🎉

This change allows for broader simplifications in our existing code, which I documented as suggestions below. You can work on them in the same PR, or do it later in a new one. Totally up to you, let me know your thoughts before I merge (just to make sure you see them 😄).

dlqqq · 2026-04-24T21:52:34Z

+        # Save the server's state vector before the handshake. After the
+        # handshake completes, get_update(pre_sync_sv) produces a single
+        # batched diff covering any mutations that occurred during the
+        # handshake gap. This replaces replaying individual queued messages,
+        # which triggers a pycrdt offset encoding bug (#197) where
+        # incremental Text updates after multi-byte characters crash JS yjs.
+        pre_sync_sv = self._ydoc.get_state()
+
        # Check if client has divergent history
        ss1_payload = ss1_message[1:]
        divergent = self._has_divergent_history(ss1_payload)


We are calling self._ydoc.get_state() twice - once here and another in self._has_divergent_history().

To avoid this duplicate computation, could we have _has_divergent_history() accept the server state vector (in bytes) as an additional argument? e.g. client_ss1 and server_sv_bytes.

dlqqq · 2026-04-24T22:03:08Z

+    def resume(self, catchup_message: bytes | None = None) -> None:
+        """Discard queued updates and unpause. If catchup_message is provided,
+        broadcast it as a single batched update instead of replaying individual
+        queued messages.
+
+        Batching avoids a pycrdt offset encoding bug
+        (jupyter-ai-contrib/jupyter-server-documents#197) where individual
+        incremental Text updates after multi-byte characters crash JS yjs
+        clients with findIndexSS "Unexpected case".
+        """
        self._queue = []
        self._paused = False
-        for message in queued:
-            self._broadcast(message)
+        if catchup_message is not None:
+            self._broadcast(catchup_message)


It seems like we no longer need to store the update messages at all: we always apply incoming YDoc updates to the YDoc, so they will always be included in the catchup message. Also, this method does nothing if catchup_message is None, which is expected. Perhaps we can require the pre-sync server state vector as an argument here, and compute the catchup message in the implementation?

The new interface would look like:

def resume(self, pre_sync_sv: bytes) -> None:

Then, we should rename this class to YRoomUpdateChannel or something similar that doesn't imply it's a buffer, but instead a broadcast channel that can be turned on or off. Not in scope for this PR, can be done in the future.
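
A minimal sketch of what the proposed resume(pre_sync_sv) interface might look like. This is an illustration under stated assumptions, not the follow-up implementation: ToyUpdateChannel, FakeDoc, pause() and send() are hypothetical, and FakeDoc's byte encoding is invented for the example rather than being the Yjs format. The only detail taken from this PR is that an empty update is 2 bytes (b"\x00\x00") and is not broadcast. The real code also wraps the diff with pycrdt.create_update_message, which is omitted here.

```python
from typing import Callable

EMPTY_UPDATE = b"\x00\x00"  # per the PR: an empty yjs update is 2 bytes


class FakeDoc:
    """Invented stand-in for a YDoc: 'state' is just a count of payloads seen."""

    def __init__(self) -> None:
        self._log: list[bytes] = []

    def append(self, payload: bytes) -> None:
        self._log.append(payload)

    def get_state(self) -> bytes:
        return len(self._log).to_bytes(4, "big")

    def get_update(self, state: bytes) -> bytes:
        missing = self._log[int.from_bytes(state, "big"):]
        return b"".join(missing) if missing else EMPTY_UPDATE


class ToyUpdateChannel:
    """Hypothetical broadcast channel that can be switched off and on."""

    def __init__(self, doc: FakeDoc, broadcast: Callable[[bytes], None]) -> None:
        self._doc = doc
        self._broadcast = broadcast
        self._paused = False

    def pause(self) -> None:
        self._paused = True

    def send(self, message: bytes) -> None:
        # No queue: while paused, messages are dropped because the doc
        # already contains the change and the catchup diff will carry it.
        if not self._paused:
            self._broadcast(message)

    def resume(self, pre_sync_sv: bytes) -> None:
        self._paused = False
        catchup = self._doc.get_update(pre_sync_sv)
        if len(catchup) > 2:  # skip the 2-byte empty update
            self._broadcast(catchup)


doc = FakeDoc()
sent: list[bytes] = []
channel = ToyUpdateChannel(doc, sent.append)

# Handshake with no concurrent mutations: nothing is broadcast.
channel.pause()
channel.resume(doc.get_state())
sent_after_idle_handshake = list(sent)

# Handshake with two mutations in the gap: one batched broadcast.
channel.pause()
pre_sync_sv = doc.get_state()
for payload in (b"edit-1", b"edit-2"):
    doc.append(payload)
    channel.send(payload)
channel.resume(pre_sync_sv)
```

With this shape the caller only has to remember the pre-handshake state vector, and the channel owns both the empty-update check and the broadcast.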

dlqqq · 2026-04-24T22:03:47Z

+        catchup = self._ydoc.get_update(pre_sync_sv)
+        # An empty yjs update is 2 bytes (b"\x00\x00").
+        catchup_message = None
+        if catchup and len(catchup) > 2:
+            catchup_message = pycrdt.create_update_message(catchup)
+        self.update_buffer.resume(catchup_message=catchup_message)


With the above suggestion, we can delegate this to self.update_buffer (which we can rename to self.update_channel), e.g.

Suggested change

-        catchup = self._ydoc.get_update(pre_sync_sv)
-        # An empty yjs update is 2 bytes (b"\x00\x00").
-        catchup_message = None
-        if catchup and len(catchup) > 2:
-            catchup_message = pycrdt.create_update_message(catchup)
-        self.update_buffer.resume(catchup_message=catchup_message)
+        self.update_channel.resume(pre_sync_sv=pre_sync_sv)

dlqqq · 2026-05-04T22:55:23Z

Merging this now as I haven't gotten a reply from you in a bit. Will work on the follow up comments in a separate PR. Thanks for working on this @xrl!

xrl force-pushed the fix-sync-handshake-race branch from d4f25f9 to 8e0f98f Compare April 10, 2026 18:49

xrl mentioned this pull request Apr 14, 2026

test: stress tests for sync handshake under concurrent mutations #219

Merged

1 task

xrl force-pushed the fix-sync-handshake-race branch from a84356d to c9d9587 Compare April 14, 2026 03:22

xrl changed the title ~~fix: batched catchup during sync handshake (fixes silent data loss)~~ fix: batched catchup in YRoomUpdateBuffer.resume() Apr 14, 2026

3coins added the "bug" (Something isn't working) label Apr 17, 2026

dlqqq approved these changes Apr 24, 2026

dlqqq merged commit 4109e40 into jupyter-ai-contrib:main May 4, 2026
10 of 12 checks passed

dlqqq mentioned this pull request May 4, 2026

fix: batched catchup in YRoomUpdateChannel.resume() #227

Merged
