fix: use UTF-16 offsets for Text operations#379
Conversation
Set OffsetKind::Utf16 on yrs Doc so the wire format uses UTF-16 code unit offsets, matching JS yjs. Without this, pycrdt uses UTF-8 byte offsets, causing findIndexSS "Unexpected case" crashes when JS yjs clients apply incremental updates containing multi-byte characters. In the Python wrapper, convert character (code point) indices to UTF-16 code unit indices before passing to yrs. This ensures Text.insert(), __setitem__, __delitem__, and format() all work correctly with emoji and other non-BMP characters. Fixes: y-crdt#308 Related: jupyter-ai-contrib/jupyter-server-documents#197
Replace the individual queued message replay in handle_sync_step1 with a single batched diff computed from the pre-sync state vector. The queue+replay approach (Approach B) correctly fixed the handshake race (infra#307) but exposed a pycrdt offset encoding bug (jupyter-server-documents#197) where incremental Text updates after multi-byte characters crash JS yjs with findIndexSS "Unexpected case". The batched catchup avoids the crash because all CRDT struct references are resolvable within a single update, while also covering any mutations that occurred during the handshake gap. Upstream pycrdt fix: y-crdt/pycrdt#379
|
Thanks for the PR. I think this should include comprehensive tests. See jupyter-server/jupyter_ydoc#370 for some prior art (using a different implementation, but you could also pull test cases out of there, though they will need adopting) |
Cover insert, delete, setitem, slice, len, and cross-doc sync with: - emoji (surrogate pairs: 📊 🎉) - CJK (BMP: 价格 世界 特征工程) - Cyrillic (мир) - supplementary plane (𝒜 𠀀) - mixed scripts in one text These all fail on stock pycrdt 0.12.50 and pass with the OffsetKind::Utf16 fix.
11 parametrized test cases adapted from jupyter-server/jupyter_ydoc#370 covering emoji swaps, flags, ZWJ family sequences, combining marks, keycap sequences, RTL/LTR text, Japanese, and math operators. These exercise Text insert/delete/replace via SequenceMatcher-based diffing (the same pattern jupyter_ydoc.YUnicode.set() uses).
Replace the individual queued message replay in handle_sync_step1 with a single batched diff computed from the pre-sync state vector. The queue+replay approach (Approach B) correctly fixed the handshake race (infra#307) but exposed a pycrdt offset encoding bug (jupyter-server-documents#197) where incremental Text updates after multi-byte characters crash JS yjs with findIndexSS "Unexpected case". The batched catchup avoids the crash because all CRDT struct references are resolvable within a single update, while also covering any mutations that occurred during the handshake gap. Upstream pycrdt fix: y-crdt/pycrdt#379
|
@krassowski please take a look now 👌 |
|
@xrl It looks like |
Addresses review feedback from @davidbrochart. Tests cover ASCII (identity), BMP characters, supplementary plane (emoji), multiple emoji, and roundtrip with _char_to_utf16.
|
@davidbrochart take a look now, thanks! |
|
Hmm I'm a bit confused that |
|
@davidbrochart I am currently looking into this, but I need a bit of time. |
|
The from pycrdt import Doc, Text
d = Doc()
t = d["t"] = Text()
t += "A📊B"
t += "X"
print(str(t))which prints A📊XB # contrary to expected 'A📊BX'Internally, |
…_char Text.__iadd__ passed len(self) (Python character count) to the yrs insert, but yrs expects a UTF-16 code unit index — so `t += "X"` after an emoji landed inside the surrogate pair. Convert the index through _char_to_utf16, matching every other mutating method. Also removes _utf16_to_char and its tests, which had no callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
XmlText mirrored the same bug that y-crdt#379 fixed in Text: every mutating method passed a raw Python character index to yrs, but yrs (with OffsetKind::Utf16 set on the Doc) expects UTF-16 code unit offsets. Non-BMP content (emoji, supplementary plane) landed at the wrong position or split surrogate pairs. Convert the index through _char_to_utf16 in insert, insert_embed, format, and __delitem__ (handling surrogate-pair length for single deletes). __iadd__ inherits the fix via self.insert. Also change __len__ to return Python character count, matching Text. Previously it returned yrs' UTF-16 length, which disagreed with the string returned by str(self) for non-BMP characters. Depends on y-crdt#379 for the OffsetKind::Utf16 doc option. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Further review of the surface area turned up three more spots touched by the
|
|
I cannot reproduce the garbling example doc['source'] += '# Analysis 📊\n' # 15 chars, but yrs sees 18 bytes
doc['source'].insert(len(text), 'model()\n') # Python says 15, yrs reads 15
# → offset 15 in bytes lands INSIDE 📊given in the PR description on the When running from pycrdt import Doc, Text
d = Doc()
t = d["t"] = Text()
text = "# Analysis 📊\n"
t += text
t.insert(len(text), 'model()\n')
print(str(t))I get and not Is |
|
@jbdyn I wasn't able to reproduce that specific error either, but I am definitely able to reproduce other issues with # fetch PR
git remote add xrl git@github.com:xrl/pycrdt.git
git fetch xrl
git switch 308-fix-utf16-offset-encoding
# check out new test suite on main branch
git checkout main
git checkout 308-fix-utf16-offset-encoding -- tests/test_text.py
# activate env and run new tests against main
uv venv # (if needed)
source .venv/bin/activate
uv pip install -e .
pytest tests/test_text.pyThe test cases fail on the DetailsThis exactly matches the behavior we are seeing in AI extensions when they attempt to add multi-byte characters like emojis into notebooks using RTC MCP tools. |
dlqqq
left a comment
There was a problem hiding this comment.
@xrl Looks good, thank you so much for working on this! The approach is correct and the test coverage is thorough. Approving.
@davidbrochart I think that this PR is good to merge - what do you think?
Also, I think there are two edge cases that aren't covered by this PR. I don't think that these are worth blocking the PR over since they are not regressions, but would be good to address as a follow-up.
-
sticky_indexfeature:StickyIndex.new()passes the Python code point index directly to yrs without converting to UTF-16 code units, and get_index() returns the raw yrs UTF-16 offset without converting back. This will be wrong for text containing characters above U+FFFF. Not a regression since this wouldn't have worked before anyways. -
TextEventdeltas fromobserve(): The retain/delete lengths in change events come from yrs in UTF-16 code units. We should convert the UTF-16 code unit offsets back to the code point offsets that Python consumers expect. This isn't a regression either, since we weren't converting UTF-8 byte offsets back to code point offsets before anyways.
| self.integrated.remove_range(txn._txn, key, 1) | ||
| utf16_idx = _char_to_utf16(current, key) | ||
| char_at = current[key] | ||
| utf16_len = 2 if ord(char_at) > 0xFFFF else 1 |
There was a problem hiding this comment.
I did some digging into this. This is counting the number of UTF-16 code units per code point. Python counts length in code points.
Each code point is always either 1 or 2 UTF-16 code units in length; this is defined by the UTF-16 encoding itself. A surrogate pair (a Python code point comprised of 2 code units) is always composed of a high surrogate (>0xFFFF) and a low surrogate (<=0xFFFF). So this logic is correct.
Since Python indexes strings using code points, the _char_to_utf16() function can be implemented similarly by iterating over every code point and add 1 for each high surrogate to compute the UTF-16 byte offset. That is what sum(1 for ch in prefix if ord(ch) > 0xFFFF) does. So that function is correct as well.
There was a problem hiding this comment.
Generated a small table for reference:
| Visual character | Grapheme clusters (humans) | Code points (Python) | UTF-16 code units (Yjs) |
|---|---|---|---|
| A | 1 | 1 | 1 |
| é (precomposed, U+00E9) | 1 | 1 | 1 |
| é (decomposed, U+0065 + U+0301) | 1 | 2 | 2 |
| ★ (U+2605) | 1 | 1 | 1 |
| 中 (U+4E2D) | 1 | 1 | 1 |
| 📊 (U+1F4CA) | 1 | 1 | 2 |
| 🇺🇸 (flag) | 1 | 2 | 4 |
| 👨👩👧👦 (family) | 1 | 7 | 11 |
| Hello | 5 | 5 | 5 |
| // Use UTF-16 offsets for compatibility with JS yjs clients. | ||
| // Without this, pycrdt uses UTF-8 byte offsets which causes | ||
| // findIndexSS crashes when JS yjs applies incremental updates | ||
| // containing multi-byte characters. | ||
| options.offset_kind = OffsetKind::Utf16; |
There was a problem hiding this comment.
This looks like the correct approach. Yjs is inherently locked into UTF-16 code unit offsets for indices and lengths because JavaScript strings are encoded in UTF-16. This line configures pycrdt documents to also use UTF-16 code unit offsets for indices and lengths returned by Yrs's methods. This is pretty much required for compatibility with Yjs for containing anything except plain ASCII, since UTF-8 byte offsets != UTF-16 code unit offsets for everything outside plain ASCII (even in the BMP).
Reference: https://github.com/y-crdt/y-crdt/blob/195f0f4627be6b3fbc223bd3bfbadddc874458ca/yrs/src/doc.rs#L948
davidbrochart
left a comment
There was a problem hiding this comment.
I don't think that pycrdt should be hard-coded to use an offset kind of UTF-16. Just like Yrs, it should be parameterizable.
| """ | ||
| with self.doc.transaction() as txn: | ||
| return self.integrated.len(txn._txn) | ||
| # Return Python character count, not yrs UTF-16 code unit count |
There was a problem hiding this comment.
| # Return Python character count, not yrs UTF-16 code unit count |
There was a problem hiding this comment.
Done in 855af5c — dropped the inline comment and tightened the __len__ docstring to match.
| from ._doc import Doc | ||
|
|
||
|
|
||
| def _char_to_utf16(text: str, char_index: int) -> int: |
There was a problem hiding this comment.
| def _char_to_utf16(text: str, char_index: int) -> int: | |
| def get_utf16_index(text: str, char_index: int) -> int: |
|
|
||
| For pure-ASCII / BMP text this is a no-op (returns ``char_index`` | ||
| unchanged). | ||
| """ |
There was a problem hiding this comment.
Please use Google style docstrings (Args, Returns).
There was a problem hiding this comment.
Done in 855af5c — both get_utf16_index and get_utf8_index now use Google-style Args / Returns docstrings.
Agreed, as it decides about
An offset parameter would come in handy during migration. |
|
@davidbrochart With a paramterizable offset, |
I think so, yes. |
|
@davidbrochart Can we prefer UTF-16 as the default since that's the default Yjs uses? It seems more reasonable to follow the standard set by the original, most common implementation of the Y protocol (Yjs). Otherwise every Python consumer of |
|
Can we also allow the offset kind to parametrized in a future PR? This was not parametrized previously so that could be developed in a future PR. IMO it makes sense to merge this as-is since it fixes bugs with the existing |
|
I think it's a smoother path to first allow the offset kind to be parameterized with a default value of UTF-8. |
|
I understand. Are we imagining this to be a parameter passed to For this change to work, I think we would need to configure |
|
Yes it seems that the offset kind would be a parameter to a document initialization. |
|
Hi folks, just checking back here, any update on updating |
Adds an `offset_kind` parameter to `Doc(...)` so callers can choose between UTF-8 byte offsets (the yrs default) and UTF-16 code unit offsets (required for cross-runtime interop with JS yjs). Default is "utf8", matching yrs. The Text wrapper previously assumed UTF-16 unconditionally and passed Python char indices through a UTF-16 conversion. This commit replaces that with a dispatcher that picks UTF-8 or UTF-16 conversion based on the doc's offset_kind, so Text behaves correctly in either mode while the public API still takes Python character indices. src/doc.rs gains a third positional arg on `Doc::new` accepting "utf8" / "utf-8" / "utf16" / "utf-16" / None (None preserves the yrs default). Invalid values raise PyValueError. _from_snapshot_impl now reads the offset kind from the source doc instead of hardcoding it. A new `#[getter] offset_kind` returns the canonical "utf8" or "utf16" string. _base.py forwards the new kwarg, and raises ValueError if both `doc=` (an existing _Doc) and `offset_kind=` are supplied with disagreeing values. _doc.py adds the kwarg to Doc.__init__, documents it, and exposes a read-only `Doc.offset_kind` property. Addresses davidbrochart's review on y-crdt#379 asking for parameterization with a UTF-8 default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses davidbrochart's three inline reviews on y-crdt#379: - _text.py:14 — rename `_char_to_utf16` → `get_utf16_index` (public, per the suggested rename). Add a counterpart `get_utf8_index` since with offset_kind parameterization the conversion is needed in both directions (jbdyn + davidbrochart agreed in the thread). The dispatcher and single-char-length helpers stay private — trivial conditionals best read at the call site. - _text.py:23 — both public helpers now use Google-style docstrings (Args / Returns sections), matching the rest of the module. - _text.py:112 — drop the inline comment that duplicated the docstring; tighten the docstring to match. Re-export the two helpers from `pycrdt.__init__` so consumers can reach them without importing from the underscore module. Drop a now-dead `_char_to_utf16` import in tests/test_text.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an `offset_kind` pytest fixture parametrizing all Unicode tests over both ``"utf8"`` and ``"utf16"`` modes. The same test bodies now exercise the dispatcher in `Text` for both yrs offset configurations, without changing the assertions — public API contract is always Python char indices. `test_unicode_cross_doc_sync` constructs both peers with the same offset_kind (mismatched offset kinds across peers is unsupported by yrs/yjs). New explicit tests cover: - default-is-utf8 - explicit "utf8" / "utf-8" / "utf16" / "utf-16" - ValueError on unknown values - snapshot round-trip preserves offset_kind for both modes - new public `get_utf16_index` / `get_utf8_index` helpers (identity / BMP CJK / non-BMP emoji) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
|
@krassowski take a look now, please |
The Rust-extension stub didn't reflect the new constructor signature or the offset_kind getter, so `mypy python` failed in CI on _base.py and _doc.py with "Too many arguments" / "no attribute offset_kind". Add the third positional `offset_kind: str | None` to `Doc.__init__` and declare the read-only `offset_kind` property. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes the ubuntu python-3.14 coverage regression on y-crdt#379. The new mismatch check at python/pycrdt/_base.py:74 (raise ValueError when both doc= and offset_kind= are passed and they disagree) wasn't exercised by any pytest test, dropping coverage to 99% and tripping the suite's fail-under=100 gate. Add a one-shot regression test that constructs an existing Doc with offset_kind="utf8", then attempts to wrap it as a new Doc with offset_kind="utf16" and asserts ValueError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces configurable text offset semantics so pycrdt can safely operate on Unicode text (including non-BMP emoji) and interoperate with JS yjs when using UTF-16 code unit offsets.
Changes:
- Add
offset_kindtoDocconstruction (Rust + Python) and exposeDoc.offset_kindto Python. - Update Python
Textoperations to convert Python character indices into the doc’s internal offset units (UTF-8 bytes or UTF-16 code units). - Add extensive Unicode-focused tests and export
get_utf8_index/get_utf16_indexhelpers.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/doc.rs |
Adds offset_kind parsing to yrs Options and exposes an offset_kind getter. |
python/pycrdt/_base.py |
Wires offset_kind through to the Rust _Doc and validates consistency when reusing an existing doc. |
python/pycrdt/_doc.py |
Exposes Doc.offset_kind and documents the new constructor option. |
python/pycrdt/_text.py |
Adds UTF-8/UTF-16 index conversion helpers and updates Text mutation methods to use converted offsets. |
python/pycrdt/__init__.py |
Re-exports get_utf8_index / get_utf16_index as public helpers. |
python/pycrdt/_pycrdt.pyi |
Updates type stubs for the new _pycrdt.Doc constructor arg and property. |
tests/test_text.py |
Adds parameterized tests for Unicode behavior across both offset kinds and validates offset_kind API behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| doc = _Doc(client_id, skip_gc, offset_kind) | ||
| elif offset_kind is not None and offset_kind != doc.offset_kind: | ||
| raise ValueError( | ||
| f"offset_kind={offset_kind!r} does not match doc.offset_kind={doc.offset_kind!r}" | ||
| ) |
There was a problem hiding this comment.
The offset_kind mismatch check compares the raw user string to doc.offset_kind, but the constructor accepts hyphenated aliases (e.g. utf-16). Passing an existing _Doc plus offset_kind='utf-16' will currently raise even though it's semantically the same as 'utf16'. Normalize offset_kind to the canonical form (e.g. strip hyphens / lowercase) before comparing, or accept both spellings in the comparison.
| # initial content, then applies a granular edit (using SequenceMatcher on | ||
| # byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies | ||
| # the result is correct. |
There was a problem hiding this comment.
The block comment says the granular edit uses "SequenceMatcher on byte offsets", but _apply_diff() uses SequenceMatcher(a=old_value, b=new_value) on Python strings and then applies edits using character indices. Update the comment to avoid implying byte-based offsets here (since that distinction is important in this PR).
| # initial content, then applies a granular edit (using SequenceMatcher on | |
| # byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies | |
| # the result is correct. | |
| # initial content, then applies a granular edit using SequenceMatcher on | |
| # Python strings / character indices in this helper, and verifies the result | |
| # is correct. |
| @@ -84,10 +89,30 @@ impl Doc { | |||
| .map_err(|_| PyValueError::new_err("skip_gc must be a valid bool"))?; | |||
| options.skip_gc = _skip_gc; | |||
| } | |||
| if !offset_kind.is_none() { | |||
| let _offset_kind: String = offset_kind | |||
| .extract() | |||
| .map_err(|_| PyValueError::new_err("offset_kind must be a string"))?; | |||
| options.offset_kind = match _offset_kind.as_str() { | |||
| "utf8" | "utf-8" => OffsetKind::Bytes, | |||
| "utf16" | "utf-16" => OffsetKind::Utf16, | |||
| _ => return Err(PyValueError::new_err( | |||
| "offset_kind must be 'utf8' or 'utf16'", | |||
| )), | |||
| }; | |||
| } | |||
| let doc = _Doc::with_options(options); | |||
There was a problem hiding this comment.
This constructor only sets options.offset_kind when offset_kind is explicitly provided, which means Doc() will still use the yrs default (OffsetKind::Bytes). That conflicts with the PR description/title claiming the Doc is configured to use UTF-16 offsets by default. Either set options.offset_kind = OffsetKind::Utf16 unconditionally (and adjust tests/docs accordingly) or update the PR description to reflect that UTF-16 is opt-in via offset_kind.
There was a problem hiding this comment.
update the PR description to reflect that UTF-16 is opt-in via offset_kind
I suppose we agreed on this one.
| current = str(self) | ||
| offset = _char_to_offset(current, len(current), self.doc.offset_kind) |
There was a problem hiding this comment.
__iadd__ now calls str(self) just to compute the append offset, which makes text += ... O(n) in the current text size and allocates the whole string. For appends you can avoid this by using the underlying integrated length in the doc’s offset units (e.g. self.integrated.len(txn._txn)) as the insert index, since that already matches the offset kind.
| current = str(self) | |
| offset = _char_to_offset(current, len(current), self.doc.offset_kind) | |
| offset = self.integrated.len(txn._txn) |
| prefix = text[:char_index] | ||
| # Count characters that need a surrogate pair (code point > 0xFFFF) | ||
| extra = sum(1 for ch in prefix if ord(ch) > 0xFFFF) | ||
| return char_index + extra |
There was a problem hiding this comment.
get_utf16_index() returns char_index + extra, which becomes incorrect when char_index is out of range or negative because Python slicing clamps but char_index is not clamped (e.g. get_utf16_index('A📊B', 10) returns 11 instead of the end offset 4). This can produce offsets beyond the actual text length and break insert/delete/format calls in UTF-16 mode. Compute the result from the sliced prefix length (or clamp char_index to [0, len(text)]) so the returned UTF-16 offset always stays within bounds.
| return char_index + extra | |
| return len(prefix) + extra |
Summary
OffsetKind::Utf16on yrsDocso the wire format uses UTF-16 code unit offsets, matching JS yjsTextwrapperWithout this fix, pycrdt uses UTF-8 byte offsets (the yrs default). This causes two problems:
Text.insert(N, ...)puts text at the wrong position when the preceding text contains multi-byte characters.findIndexSS "Unexpected case"errors when applied to a JS yjsY.Doc.Error before fix
The emoji
📊is 1 Python character, 4 UTF-8 bytes, and 2 UTF-16 code units. With the defaultOffsetKind::Bytes, yrs counts it as 4 units. Here's what happens step by step:Python-side result:
'A📊BX'instead of'A📊XB'— theXlands afterB, not after📊.The garbling compounds with each operation. A realistic sequence:
Result:
'# Analysis 📊\nmod📊el()\n'— text inserted in the wrong place, document garbled.Cross-runtime crash: When the second
insertis encoded as an incremental yjs update, it references struct clock 15. But JS yjs (which uses UTF-16) only created a struct of length 14 for'# Analysis 📊\n'(14 UTF-16 code units). Clock 15 doesn't exist →findIndexSSthrows"Unexpected case":The mismatch is exactly the number of extra UTF-8 bytes from multi-byte characters:
📊= 4 bytes - 2 UTF-16 code units = 2 extra. After N emoji, the offset is off by 2N.Works after fix
With
OffsetKind::Utf16, yrs counts📊as 2 units (matching JS yjs). The Python wrapper converts character index 2 → UTF-16 index 3 (1 forA+ 2 for📊):Result:
'A📊XB'— correct.Sequential inserts also work:
Result:
'# Analysis 📊\nmodel()\n'— correct, no garbling.Cross-runtime: The update references struct clock 14 (matching the struct's length of 14 UTF-16 code units). JS yjs applies it without error.
Reproduction
Changes
src/doc.rs: Setoptions.offset_kind = OffsetKind::Utf16when creating aDoc. This makes yrs use UTF-16 code unit offsets in the wire format, matching JS yjs.python/pycrdt/_text.py: Added_char_to_utf16()helper that converts Python character indices to UTF-16 code unit indices. Applied ininsert(),insert_embed(),__setitem__,__delitem__,format(), and__len__(). For pure-ASCII / BMP text this is a no-op.tests/test_text.py: 23 new tests covering emoji, CJK, Cyrillic, supplementary plane, cross-doc sync, and granular diff operations (adapted from jupyter-server/jupyter_ydoc#370). The granular diff tests cover emoji swaps, flags, ZWJ family sequences, combining marks, keycap sequences, RTL/LTR text, Japanese, and math operators.Test results
findIndexSSFixes #308
Related: jupyter-ai-contrib/jupyter-server-documents#197