fix: use UTF-16 offsets for Text operations by xrl · Pull Request #379 · y-crdt/pycrdt

xrl · 2026-04-10T16:24:14Z

Summary

Set OffsetKind::Utf16 on yrs Doc so the wire format uses UTF-16 code unit offsets, matching JS yjs
Convert Python character (code point) indices to UTF-16 code unit indices in the Text wrapper
Add 23 new tests covering Unicode/emoji text operations

Without this fix, pycrdt uses UTF-8 byte offsets (the yrs default). This causes two problems:

Python-side corruption: Text.insert(N, ...) puts text at the wrong position when the preceding text contains multi-byte characters.
JS yjs crash: Incremental updates generated by pycrdt cause findIndexSS "Unexpected case" errors when applied to a JS yjs Y.Doc.

Error before fix

The emoji 📊 is 1 Python character, 4 UTF-8 bytes, and 2 UTF-16 code units. With the default OffsetKind::Bytes, yrs counts it as 4 units. Here's what happens step by step:

doc = Doc()
doc['t'] = Text()
doc['t'] += 'A📊B'    # yrs internal: 6 items (1 + 4 + 1 bytes)
doc['t'].insert(2, 'X')  # Python means "after 📊", but yrs reads 2 as byte offset 2
                          # → inserts inside the emoji's byte sequence

Python-side result: 'A📊BX' instead of 'A📊XB' — the X lands after B, not after 📊.

The garbling compounds with each operation. A realistic sequence:

doc['source'] += '# Analysis 📊\n'           # 15 chars, but yrs sees 18 bytes
doc['source'].insert(len(text), 'model()\n')  # Python says 15, yrs reads 15
                                               # → offset 15 in bytes lands INSIDE 📊

Result: '# Analysis 📊\nmod📊el()\n' — text inserted in the wrong place, document garbled.

Cross-runtime crash: When the second insert is encoded as an incremental yjs update, it references struct clock 15. But JS yjs (which uses UTF-16) only created a struct of length 14 for '# Analysis 📊\n' (14 UTF-16 code units). Clock 15 doesn't exist → findIndexSS throws "Unexpected case":

[yjs findIndexSS] FAILED — clock not found in structs {
  searchClock: 14,
  structsLength: 1,
  firstStruct: { client: 518772262, clock: 0, len: 14 },
  lastStruct: { client: 518772262, clock: 0, len: 14 },
  allRanges: '0..13'
}
Error: Unexpected case

The mismatch is exactly the number of extra UTF-8 bytes from multi-byte characters: 📊 = 4 bytes - 2 UTF-16 code units = 2 extra. After N emoji, the offset is off by 2N.

Works after fix

With OffsetKind::Utf16, yrs counts 📊 as 2 units (matching JS yjs). The Python wrapper converts character index 2 → UTF-16 index 3 (1 for A + 2 for 📊):

doc['t'] += 'A📊B'       # yrs internal: 4 items (1 + 2 + 1 UTF-16 code units)
doc['t'].insert(2, 'X')   # Python index 2 → UTF-16 index 3 → after 📊

Result: 'A📊XB' — correct.

Sequential inserts also work:

doc['source'] += '# Analysis 📊\n'            # 14 UTF-16 code units
doc['source'].insert(len(text), 'model()\n')   # Python 15 → UTF-16 16 → end of text

Result: '# Analysis 📊\nmodel()\n' — correct, no garbling.

Cross-runtime: The update references struct clock 14 (matching the struct's length of 14 UTF-16 code units). JS yjs applies it without error.

Reproduction

import pycrdt

doc = pycrdt.Doc()
doc['t'] = pycrdt.Text()
doc['t'] += 'A📊B'
doc['t'].insert(2, 'X')  # Before fix: 'A📊BX' — After fix: 'A📊XB'

Changes

src/doc.rs: Set options.offset_kind = OffsetKind::Utf16 when creating a Doc. This makes yrs use UTF-16 code unit offsets in the wire format, matching JS yjs.

python/pycrdt/_text.py: Added _char_to_utf16() helper that converts Python character indices to UTF-16 code unit indices. Applied in insert(), insert_embed(), __setitem__, __delitem__, format(), and __len__(). For pure-ASCII / BMP text this is a no-op.

tests/test_text.py: 23 new tests covering emoji, CJK, Cyrillic, supplementary plane, cross-doc sync, and granular diff operations (adapted from jupyter-server/jupyter_ydoc#370). The granular diff tests cover emoji swaps, flags, ZWJ family sequences, combining marks, keycap sequences, RTL/LTR text, Japanese, and math operators.

Test results

pycrdt test suite: **145 passed (122 existing + 23 new)
Cross-runtime (pycrdt → JS yjs): All 6 test cases with emoji/CJK/Cyrillic content pass where they previously crashed with findIndexSS

Fixes #308
Related: jupyter-ai-contrib/jupyter-server-documents#197

Set OffsetKind::Utf16 on yrs Doc so the wire format uses UTF-16 code unit offsets, matching JS yjs. Without this, pycrdt uses UTF-8 byte offsets, causing findIndexSS "Unexpected case" crashes when JS yjs clients apply incremental updates containing multi-byte characters. In the Python wrapper, convert character (code point) indices to UTF-16 code unit indices before passing to yrs. This ensures Text.insert(), __setitem__, __delitem__, and format() all work correctly with emoji and other non-BMP characters. Fixes: y-crdt#308 Related: jupyter-ai-contrib/jupyter-server-documents#197

Replace the individual queued message replay in handle_sync_step1 with a single batched diff computed from the pre-sync state vector. The queue+replay approach (Approach B) correctly fixed the handshake race (infra#307) but exposed a pycrdt offset encoding bug (jupyter-server-documents#197) where incremental Text updates after multi-byte characters crash JS yjs with findIndexSS "Unexpected case". The batched catchup avoids the crash because all CRDT struct references are resolvable within a single update, while also covering any mutations that occurred during the handshake gap. Upstream pycrdt fix: y-crdt/pycrdt#379

krassowski · 2026-04-10T16:54:34Z

Thanks for the PR. I think this should include comprehensive tests.

See jupyter-server/jupyter_ydoc#370 for some prior art (using a different implementation, but you could also pull test cases out of there, though they will need adopting)

Cover insert, delete, setitem, slice, len, and cross-doc sync with: - emoji (surrogate pairs: 📊 🎉) - CJK (BMP: 价格世界特征工程) - Cyrillic (мир) - supplementary plane (𝒜 𠀀) - mixed scripts in one text These all fail on stock pycrdt 0.12.50 and pass with the OffsetKind::Utf16 fix.

11 parametrized test cases adapted from jupyter-server/jupyter_ydoc#370 covering emoji swaps, flags, ZWJ family sequences, combining marks, keycap sequences, RTL/LTR text, Japanese, and math operators. These exercise Text insert/delete/replace via SequenceMatcher-based diffing (the same pattern jupyter_ydoc.YUnicode.set() uses).

Replace the individual queued message replay in handle_sync_step1 with a single batched diff computed from the pre-sync state vector. The queue+replay approach (Approach B) correctly fixed the handshake race (infra#307) but exposed a pycrdt offset encoding bug (jupyter-server-documents#197) where incremental Text updates after multi-byte characters crash JS yjs with findIndexSS "Unexpected case". The batched catchup avoids the crash because all CRDT struct references are resolvable within a single update, while also covering any mutations that occurred during the handshake gap. Upstream pycrdt fix: y-crdt/pycrdt#379

xrl · 2026-04-10T18:53:16Z

@krassowski please take a look now 👌

davidbrochart · 2026-04-13T08:14:55Z

@xrl It looks like _utf16_to_char is not tested.

@davidbrochart

Addresses review feedback from @davidbrochart. Tests cover ASCII (identity), BMP characters, supplementary plane (emoji), multiple emoji, and roundtrip with _char_to_utf16.

xrl · 2026-04-14T00:12:57Z

@davidbrochart take a look now, thanks!

davidbrochart · 2026-04-14T06:42:31Z

Hmm I'm a bit confused that _utf16_to_char is actually not used. Should we remove it and the corresponding tests?

davidbrochart · 2026-04-16T08:49:01Z

@jbdyn Since you looked at that in #129, what do you think about this PR?

jbdyn · 2026-04-16T15:04:55Z

@davidbrochart I am currently looking into this, but I need a bit of time.

jbdyn · 2026-04-16T15:38:12Z

The __iadd__ of Text does not behave:

from pycrdt import Doc, Text

d = Doc()
t = d["t"] = Text()
t += "A📊B"
t += "X"
print(str(t))

which prints

A📊XB  # contrary to expected 'A📊BX'

Internally, Text.integrated.insert(txn, index, chars) needs the UTF-16 index, but Text.__len__ returns the Python character counts.

…_char Text.__iadd__ passed len(self) (Python character count) to the yrs insert, but yrs expects a UTF-16 code unit index — so `t += "X"` after an emoji landed inside the surrogate pair. Convert the index through _char_to_utf16, matching every other mutating method. Also removes _utf16_to_char and its tests, which had no callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

XmlText mirrored the same bug that y-crdt#379 fixed in Text: every mutating method passed a raw Python character index to yrs, but yrs (with OffsetKind::Utf16 set on the Doc) expects UTF-16 code unit offsets. Non-BMP content (emoji, supplementary plane) landed at the wrong position or split surrogate pairs. Convert the index through _char_to_utf16 in insert, insert_embed, format, and __delitem__ (handling surrogate-pair length for single deletes). __iadd__ inherits the fix via self.insert. Also change __len__ to return Python character count, matching Text. Previously it returned yrs' UTF-16 length, which disagreed with the string returned by str(self) for non-BMP characters. Depends on y-crdt#379 for the OffsetKind::Utf16 doc option. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

xrl · 2026-04-17T04:18:22Z

Further review of the surface area turned up three more spots touched by the OffsetKind::Utf16 change. Rather than balloon this PR, I've split them out:

fix: use UTF-16 offsets for XmlText operations #381 — XmlText has the same bug as Text (unconverted offsets in insert/insert_embed/format/__delitem__/__iadd__, plus __len__ returning UTF-16 count). Opened as a draft PR stacked on this one; will mark ready once this lands.
StickyIndex: UTF-16 offsets not converted for Text sequences #382 — StickyIndex doesn't convert in either direction for Text/XmlText. Filed as an issue rather than a PR because the fix has design questions around serialized indices and their source sequence.
TextEvent.delta / XmlEvent.delta — retain/delete counts bubble out of yrs as UTF-16 code units. This might actually be the right contract for yjs wire compat — flagging here for discussion rather than changing unilaterally. Let me know if you'd prefer a separate issue for it.

jbdyn · 2026-04-17T06:00:25Z

I cannot reproduce the garbling example

doc['source'] += '# Analysis 📊\n'           # 15 chars, but yrs sees 18 bytes
doc['source'].insert(len(text), 'model()\n')  # Python says 15, yrs reads 15
                                               # → offset 15 in bytes lands INSIDE 📊

given in the PR description on the main branch.

When running

from pycrdt import Doc, Text

d = Doc()
t = d["t"] = Text()
text = "# Analysis 📊\n"
t += text
t.insert(len(text), 'model()\n')
print(str(t))

I get

# Analysis 📊\nmodel()\n

and not # Analysis 📊\nmod📊el()\n.

Is yjs still crashing then?

dlqqq · 2026-04-20T21:05:11Z

@jbdyn I wasn't able to reproduce that specific error either, but I am definitely able to reproduce other issues with pycrdt's existing handling of multi-byte characters using the test cases added in this PR.

# fetch PR
git remote add xrl git@github.com:xrl/pycrdt.git
git fetch xrl
git switch 308-fix-utf16-offset-encoding

# check out new test suite on main branch
git checkout main
git checkout 308-fix-utf16-offset-encoding -- tests/test_text.py

# activate env and run new tests against main
uv venv # (if needed)
source .venv/bin/activate
uv pip install -e .
pytest tests/test_text.py

The test cases fail on the main branch with some runtime panics (click to show):

Details

----------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------

thread '<unnamed>' (75872785) panicked at /Users/dlq/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/yrs-0.25.0/src/types/text.rs:845:9:
Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
=========================================================================================================================== short test summary info ============================================================================================================================
FAILED tests/test_text.py::test_unicode_emoji_insert - assert 6 == 3
FAILED tests/test_text.py::test_unicode_emoji_len - assert 6 == 3
FAILED tests/test_text.py::test_unicode_emoji_delete - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_emoji_delete_emoji - AssertionError: Got 'A'
FAILED tests/test_text.py::test_unicode_emoji_slice_delete - AssertionError: Got 'A'
FAILED tests/test_text.py::test_unicode_emoji_setitem - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_emoji_slice_setitem - AssertionError: Got 'AXYZ'
FAILED tests/test_text.py::test_unicode_cjk - assert 7 == 3
FAILED tests/test_text.py::test_unicode_mixed_scripts - AssertionError: Got 'Hello 世界 м!ир 📊'
FAILED tests/test_text.py::test_unicode_supplementary_plane - assert 11 == 5
FAILED tests/test_text.py::test_unicode_granular_diff[emoji_swap] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[text_change_emoji_stay] - AssertionError: Got 'Here is a star: ⭐ and a rocketar: 🚀', expected 'Here is a star: ⭐ and a rocket: 🚀'
FAILED tests/test_text.py::test_unicode_granular_diff[combining_marks] - pyo3_runtime.PanicException: Couldn't remove 5 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[flags] - AssertionError: Got 'Flags: 🇨🇦🇬🇧🇸🇬🇧🇨🇦', expected 'Flags: 🇨🇦🇬🇧🇺🇸'
FAILED tests/test_text.py::test_unicode_granular_diff[zwj_family] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[rtl_ltr] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[keycap] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[emoji_boundaries] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[japanese] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[math_operators] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
======================================================================================================================== 20 failed, 16 passed in 0.63s =========================================================================================================================

This exactly matches the behavior we are seeing in AI extensions when they attempt to add multi-byte characters like emojis into notebooks using RTC MCP tools.

dlqqq

@xrl Looks good, thank you so much for working on this! The approach is correct and the test coverage is thorough. Approving.

@davidbrochart I think that this PR is good to merge - what do you think?

Also, I think there are two edge cases that aren't covered by this PR. I don't think that these are worth blocking the PR over since they are not regressions, but would be good to address as a follow-up.

sticky_index feature: StickyIndex.new() passes the Python code point index directly to yrs without converting to UTF-16 code units, and get_index() returns the raw yrs UTF-16 offset without converting back. This will be wrong for text containing characters above U+FFFF. Not a regression since this wouldn't have worked before anyways.
TextEvent deltas from observe(): The retain/delete lengths in change events come from yrs in UTF-16 code units. We should convert the UTF-16 code unit offsets back to the code point offsets that Python consumers expect. This isn't a regression either, since we weren't converting UTF-8 byte offsets back to code point offsets before anyways.

dlqqq · 2026-04-20T21:19:37Z

-                self.integrated.remove_range(txn._txn, key, 1)
+                utf16_idx = _char_to_utf16(current, key)
+                char_at = current[key]
+                utf16_len = 2 if ord(char_at) > 0xFFFF else 1


I did some digging into this. This is counting the number of UTF-16 code units per code point. Python counts length in code points.

Each code point is always either 1 or 2 UTF-16 code units in length; this is defined by the UTF-16 encoding itself. A surrogate pair (a Python code point comprised of 2 code units) is always composed of a high surrogate (>0xFFFF) and a low surrogate (<=0xFFFF). So this logic is correct.

Since Python indexes strings using code points, the _char_to_utf16() function can be implemented similarly by iterating over every code point and add 1 for each high surrogate to compute the UTF-16 byte offset. That is what sum(1 for ch in prefix if ord(ch) > 0xFFFF) does. So that function is correct as well.

Generated a small table for reference:

Visual character Grapheme clusters (humans) Code points (Python) UTF-16 code units (Yjs)

A 1 1 1

é (precomposed, U+00E9) 1 1 1

é (decomposed, U+0065 + U+0301) 1 2 2

★ (U+2605) 1 1 1

中 (U+4E2D) 1 1 1

📊 (U+1F4CA) 1 1 2

🇺🇸 (flag) 1 2 4

👨‍👩‍👧‍👦 (family) 1 7 11

Hello 5 5 5

dlqqq · 2026-04-20T21:37:28Z

+        // Use UTF-16 offsets for compatibility with JS yjs clients.
+        // Without this, pycrdt uses UTF-8 byte offsets which causes
+        // findIndexSS crashes when JS yjs applies incremental updates
+        // containing multi-byte characters.
+        options.offset_kind = OffsetKind::Utf16;


This looks like the correct approach. Yjs is inherently locked into UTF-16 code unit offsets for indices and lengths because JavaScript strings are encoded in UTF-16. This line configures pycrdt documents to also use UTF-16 code unit offsets for indices and lengths returned by Yrs's methods. This is pretty much required for compatibility with Yjs for containing anything except plain ASCII, since UTF-8 byte offsets != UTF-16 code unit offsets for everything outside plain ASCII (even in the BMP).

Reference: https://github.com/y-crdt/y-crdt/blob/195f0f4627be6b3fbc223bd3bfbadddc874458ca/yrs/src/doc.rs#L948

davidbrochart

I don't think that pycrdt should be hard-coded to use an offset kind of UTF-16. Just like Yrs, it should be parameterizable.

davidbrochart · 2026-04-21T05:07:44Z

        """
-        with self.doc.transaction() as txn:
-            return self.integrated.len(txn._txn)
+        # Return Python character count, not yrs UTF-16 code unit count


Suggested change

# Return Python character count, not yrs UTF-16 code unit count

Done in 855af5c — dropped the inline comment and tightened the __len__ docstring to match.

davidbrochart · 2026-04-21T05:10:20Z

    from ._doc import Doc


+def _char_to_utf16(text: str, char_index: int) -> int:


Suggested change

def _char_to_utf16(text: str, char_index: int) -> int:

def get_utf16_index(text: str, char_index: int) -> int:

Renamed in 855af5c. Added a get_utf8_index counterpart since offset_kind is now parameterizable per #379 (commit 41c8cc5) — both are re-exported from the top-level package. The dispatcher itself stays private (one-line conditional).

davidbrochart · 2026-04-21T05:12:53Z

+
+    For pure-ASCII / BMP text this is a no-op (returns ``char_index``
+    unchanged).
+    """


Please use Google style docstrings (Args, Returns).

Done in 855af5c — both get_utf16_index and get_utf8_index now use Google-style Args / Returns docstrings.

jbdyn · 2026-04-21T08:01:47Z

I don't think that pycrdt should be hard-coded to use an offset kind of UTF-16.

Agreed, as it decides about yjs compatibility (thanks @dlqqq for the reference) and thus also about compatibility with other clients using the UTF-8 offset. The user needs to choose one.
Also, yrs might become the core implementation for all language wrappers/bindings at some point:

Eventually Yrs might replace the Yjs module by generating a wasm module from this package.

An offset parameter would come in handy during migration.

jbdyn · 2026-04-21T08:04:56Z

@davidbrochart With a paramterizable offset, Text and other datatypes also need a get_utf8_index conversion, right?

davidbrochart · 2026-04-21T08:11:20Z

@davidbrochart With a paramterizable offset, Text and other datatypes also need a get_utf8_index conversion, right?

I think so, yes.

dlqqq · 2026-04-21T18:55:33Z

@davidbrochart Can we prefer UTF-16 as the default since that's the default Yjs uses? It seems more reasonable to follow the standard set by the original, most common implementation of the Y protocol (Yjs). Otherwise every Python consumer of pycrdt will need to manually set the offset kind to work with Yjs.

dlqqq · 2026-04-21T20:26:03Z

Can we also allow the offset kind to parametrized in a future PR? This was not parametrized previously so that could be developed in a future PR. IMO it makes sense to merge this as-is since it fixes bugs with the existing Text API when handling Unicode characters, as well as Yjs compatibility.

davidbrochart · 2026-04-21T20:43:58Z

I think it's a smoother path to first allow the offset kind to be parameterized with a default value of UTF-8.
We'll see later if we diverge from Yrs, since it would be a breaking change.

dlqqq · 2026-04-21T22:00:44Z

I understand. Are we imagining this to be a parameter passed to Doc.__init__()?

For this change to work, I think we would need to configure jupyter_ydoc to initialize all pycrdt.Doc objects with offset_kind="utf16". Then, the Text methods need to check the offset kind set in self.doc to know which offsets to use when computing indexes and lengths. Does this match with what you're expecting also?

davidbrochart · 2026-04-22T06:28:32Z

Yes it seems that the offset kind would be a parameter to a document initialization.
Let's not talk about jupyter-ydoc here, as pycrdt is independent. Please open an issue there if you want to discuss that topic.

krassowski · 2026-04-28T09:08:29Z

Hi folks, just checking back here, any update on updating offset_kind to allow for merging this?

Adds an `offset_kind` parameter to `Doc(...)` so callers can choose between UTF-8 byte offsets (the yrs default) and UTF-16 code unit offsets (required for cross-runtime interop with JS yjs). Default is "utf8", matching yrs. The Text wrapper previously assumed UTF-16 unconditionally and passed Python char indices through a UTF-16 conversion. This commit replaces that with a dispatcher that picks UTF-8 or UTF-16 conversion based on the doc's offset_kind, so Text behaves correctly in either mode while the public API still takes Python character indices. src/doc.rs gains a third positional arg on `Doc::new` accepting "utf8" / "utf-8" / "utf16" / "utf-16" / None (None preserves the yrs default). Invalid values raise PyValueError. _from_snapshot_impl now reads the offset kind from the source doc instead of hardcoding it. A new `#[getter] offset_kind` returns the canonical "utf8" or "utf16" string. _base.py forwards the new kwarg, and raises ValueError if both `doc=` (an existing _Doc) and `offset_kind=` are supplied with disagreeing values. _doc.py adds the kwarg to Doc.__init__, documents it, and exposes a read-only `Doc.offset_kind` property. Addresses davidbrochart's review on y-crdt#379 asking for parameterization with a UTF-8 default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses davidbrochart's three inline reviews on y-crdt#379: - _text.py:14 — rename `_char_to_utf16` → `get_utf16_index` (public, per the suggested rename). Add a counterpart `get_utf8_index` since with offset_kind parameterization the conversion is needed in both directions (jbdyn + davidbrochart agreed in the thread). The dispatcher and single-char-length helpers stay private — trivial conditionals best read at the call site. - _text.py:23 — both public helpers now use Google-style docstrings (Args / Returns sections), matching the rest of the module. - _text.py:112 — drop the inline comment that duplicated the docstring; tighten the docstring to match. Re-export the two helpers from `pycrdt.__init__` so consumers can reach them without importing from the underscore module. Drop a now-dead `_char_to_utf16` import in tests/test_text.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an `offset_kind` pytest fixture parametrizing all Unicode tests over both ``"utf8"`` and ``"utf16"`` modes. The same test bodies now exercise the dispatcher in `Text` for both yrs offset configurations, without changing the assertions — public API contract is always Python char indices. `test_unicode_cross_doc_sync` constructs both peers with the same offset_kind (mismatched offset kinds across peers is unsupported by yrs/yjs). New explicit tests cover: - default-is-utf8 - explicit "utf8" / "utf-8" / "utf16" / "utf-16" - ValueError on unknown values - snapshot round-trip preserves offset_kind for both modes - new public `get_utf16_index` / `get_utf8_index` helpers (identity / BMP CJK / non-BMP emoji) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

xrl · 2026-04-28T14:30:39Z

@krassowski take a look now, please

The Rust-extension stub didn't reflect the new constructor signature or the offset_kind getter, so `mypy python` failed in CI on _base.py and _doc.py with "Too many arguments" / "no attribute offset_kind". Add the third positional `offset_kind: str | None` to `Doc.__init__` and declare the read-only `offset_kind` property. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes the ubuntu python-3.14 coverage regression on y-crdt#379. The new mismatch check at python/pycrdt/_base.py:74 (raise ValueError when both doc= and offset_kind= are passed and they disagree) wasn't exercised by any pytest test, dropping coverage to 99% and tripping the suite's fail-under=100 gate. Add a one-shot regression test that constructs an existing Doc with offset_kind="utf8", then attempts to wrap it as a new Doc with offset_kind="utf16" and asserts ValueError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces configurable text offset semantics so pycrdt can safely operate on Unicode text (including non-BMP emoji) and interoperate with JS yjs when using UTF-16 code unit offsets.

Changes:

Add offset_kind to Doc construction (Rust + Python) and expose Doc.offset_kind to Python.
Update Python Text operations to convert Python character indices into the doc’s internal offset units (UTF-8 bytes or UTF-16 code units).
Add extensive Unicode-focused tests and export get_utf8_index / get_utf16_index helpers.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`src/doc.rs`	Adds `offset_kind` parsing to yrs `Options` and exposes an `offset_kind` getter.
`python/pycrdt/_base.py`	Wires `offset_kind` through to the Rust `_Doc` and validates consistency when reusing an existing doc.
`python/pycrdt/_doc.py`	Exposes `Doc.offset_kind` and documents the new constructor option.
`python/pycrdt/_text.py`	Adds UTF-8/UTF-16 index conversion helpers and updates `Text` mutation methods to use converted offsets.
`python/pycrdt/__init__.py`	Re-exports `get_utf8_index` / `get_utf16_index` as public helpers.
`python/pycrdt/_pycrdt.pyi`	Updates type stubs for the new `_pycrdt.Doc` constructor arg and property.
`tests/test_text.py`	Adds parameterized tests for Unicode behavior across both offset kinds and validates offset_kind API behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T10:07:26Z

+            doc = _Doc(client_id, skip_gc, offset_kind)
+        elif offset_kind is not None and offset_kind != doc.offset_kind:
+            raise ValueError(
+                f"offset_kind={offset_kind!r} does not match doc.offset_kind={doc.offset_kind!r}"
+            )


The offset_kind mismatch check compares the raw user string to doc.offset_kind, but the constructor accepts hyphenated aliases (e.g. utf-16). Passing an existing _Doc plus offset_kind='utf-16' will currently raise even though it's semantically the same as 'utf16'. Normalize offset_kind to the canonical form (e.g. strip hyphens / lowercase) before comparing, or accept both spellings in the comparison.

Copilot · 2026-04-30T10:07:27Z

+# initial content, then applies a granular edit (using SequenceMatcher on
+# byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies
+# the result is correct.


The block comment says the granular edit uses "SequenceMatcher on byte offsets", but _apply_diff() uses SequenceMatcher(a=old_value, b=new_value) on Python strings and then applies edits using character indices. Update the comment to avoid implying byte-based offsets here (since that distinction is important in this PR).

Suggested change

# initial content, then applies a granular edit (using SequenceMatcher on

# byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies

# the result is correct.

# initial content, then applies a granular edit using SequenceMatcher on

# Python strings / character indices in this helper, and verifies the result

# is correct.

Copilot · 2026-04-30T10:07:27Z

@@ -84,10 +89,30 @@ impl Doc {
                .map_err(|_| PyValueError::new_err("skip_gc must be a valid bool"))?;
            options.skip_gc = _skip_gc;
        }
+        if !offset_kind.is_none() {
+            let _offset_kind: String = offset_kind
+                .extract()
+                .map_err(|_| PyValueError::new_err("offset_kind must be a string"))?;
+            options.offset_kind = match _offset_kind.as_str() {
+                "utf8" | "utf-8" => OffsetKind::Bytes,
+                "utf16" | "utf-16" => OffsetKind::Utf16,
+                _ => return Err(PyValueError::new_err(
+                    "offset_kind must be 'utf8' or 'utf16'",
+                )),
+            };
+        }
        let doc = _Doc::with_options(options);


This constructor only sets options.offset_kind when offset_kind is explicitly provided, which means Doc() will still use the yrs default (OffsetKind::Bytes). That conflicts with the PR description/title claiming the Doc is configured to use UTF-16 offsets by default. Either set options.offset_kind = OffsetKind::Utf16 unconditionally (and adjust tests/docs accordingly) or update the PR description to reflect that UTF-16 is opt-in via offset_kind.

update the PR description to reflect that UTF-16 is opt-in via offset_kind

I suppose we agreed on this one.

Copilot · 2026-04-30T10:07:28Z

+            current = str(self)
+            offset = _char_to_offset(current, len(current), self.doc.offset_kind)


__iadd__ now calls str(self) just to compute the append offset, which makes text += ... O(n) in the current text size and allocates the whole string. For appends you can avoid this by using the underlying integrated length in the doc’s offset units (e.g. self.integrated.len(txn._txn)) as the insert index, since that already matches the offset kind.

Suggested change

current = str(self)

offset = _char_to_offset(current, len(current), self.doc.offset_kind)

offset = self.integrated.len(txn._txn)

Copilot · 2026-04-30T10:07:28Z

+    prefix = text[:char_index]
+    # Count characters that need a surrogate pair (code point > 0xFFFF)
+    extra = sum(1 for ch in prefix if ord(ch) > 0xFFFF)
+    return char_index + extra


get_utf16_index() returns char_index + extra, which becomes incorrect when char_index is out of range or negative because Python slicing clamps but char_index is not clamped (e.g. get_utf16_index('A📊B', 10) returns 11 instead of the end offset 4). This can produce offsets beyond the actual text length and break insert/delete/format calls in UTF-16 mode. Compute the result from the sliced prefix length (or clamp char_index to [0, len(text)]) so the returned UTF-16 offset always stays within bounds.

Suggested change

return char_index + extra

return len(prefix) + extra

xrl mentioned this pull request Apr 10, 2026

fix: batched catchup in YRoomUpdateBuffer.resume() jupyter-ai-contrib/jupyter-server-documents#218

Merged

2 tasks

xrl added 2 commits April 10, 2026 14:44

fix: move SequenceMatcher import to top of file (ruff E402)

7cf5af4

test: add tests for _utf16_to_char helper

b1ed6ae

Addresses review feedback from @davidbrochart. Tests cover ASCII (identity), BMP characters, supplementary plane (emoji), multiple emoji, and roundtrip with _char_to_utf16.

xrl and others added 2 commits April 17, 2026 00:05

[pre-commit.ci] auto fixes from pre-commit.com hooks

01a1756

for more information, see https://pre-commit.ci

This was referenced Apr 17, 2026

fix: use UTF-16 offsets for XmlText operations #381

Draft

StickyIndex: UTF-16 offsets not converted for Text sequences #382

Open

dlqqq approved these changes Apr 20, 2026

View reviewed changes

dlqqq mentioned this pull request Apr 20, 2026

edit_cell() crashes during heavy usage jupyter-ai-contrib/jupyter-ai-tools#22

Closed

davidbrochart requested changes Apr 21, 2026

View reviewed changes

xrl and others added 4 commits April 28, 2026 10:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

b6b938c

for more information, see https://pre-commit.ci

xrl and others added 2 commits April 28, 2026 10:35

krassowski requested a review from Copilot April 30, 2026 10:01

Copilot started reviewing on behalf of krassowski April 30, 2026 10:02 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

Visual character	Grapheme clusters (humans)	Code points (Python)	UTF-16 code units (Yjs)
A	1	1	1
é (precomposed, U+00E9)	1	1	1
é (decomposed, U+0065 + U+0301)	1	2	2
★ (U+2605)	1	1	1
中 (U+4E2D)	1	1	1
📊 (U+1F4CA)	1	1	2
🇺🇸 (flag)	1	2	4
👨‍👩‍👧‍👦 (family)	1	7	11
Hello	5	5	5

		from ._doc import Doc


		def _char_to_utf16(text: str, char_index: int) -> int:

	def _char_to_utf16(text: str, char_index: int) -> int:
	def get_utf16_index(text: str, char_index: int) -> int:

		current = str(self)
		offset = _char_to_offset(current, len(current), self.doc.offset_kind)

	current = str(self)
	offset = _char_to_offset(current, len(current), self.doc.offset_kind)
	offset = self.integrated.len(txn._txn)

Conversation

xrl commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Error before fix

Works after fix

Reproduction

Changes

Test results

Uh oh!

krassowski commented Apr 10, 2026

Uh oh!

xrl commented Apr 10, 2026

Uh oh!

davidbrochart commented Apr 13, 2026

Uh oh!

xrl commented Apr 14, 2026

Uh oh!

davidbrochart commented Apr 14, 2026

Uh oh!

davidbrochart commented Apr 16, 2026

Uh oh!

jbdyn commented Apr 16, 2026

Uh oh!

jbdyn commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xrl commented Apr 17, 2026

Uh oh!

jbdyn commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dlqqq commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dlqqq left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidbrochart left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbdyn commented Apr 21, 2026

Uh oh!

jbdyn commented Apr 21, 2026

Uh oh!

davidbrochart commented Apr 21, 2026

Uh oh!

dlqqq commented Apr 21, 2026

Uh oh!

dlqqq commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidbrochart commented Apr 21, 2026

Uh oh!

dlqqq commented Apr 21, 2026

Uh oh!

davidbrochart commented Apr 22, 2026

Uh oh!

krassowski commented Apr 28, 2026

Uh oh!

xrl commented Apr 28, 2026

Uh oh!

xrl commented Apr 10, 2026 •

edited

Loading

jbdyn commented Apr 16, 2026 •

edited

Loading

jbdyn commented Apr 17, 2026 •

edited

Loading

dlqqq commented Apr 20, 2026 •

edited

Loading

dlqqq commented Apr 21, 2026 •

edited

Loading