Skip to content

fix: use UTF-16 offsets for Text operations#379

Open
xrl wants to merge 13 commits into
y-crdt:mainfrom
xrl:308-fix-utf16-offset-encoding
Open

fix: use UTF-16 offsets for Text operations#379
xrl wants to merge 13 commits into
y-crdt:mainfrom
xrl:308-fix-utf16-offset-encoding

Conversation

@xrl
Copy link
Copy Markdown

@xrl xrl commented Apr 10, 2026

Summary

  • Set OffsetKind::Utf16 on yrs Doc so the wire format uses UTF-16 code unit offsets, matching JS yjs
  • Convert Python character (code point) indices to UTF-16 code unit indices in the Text wrapper
  • Add 23 new tests covering Unicode/emoji text operations

Without this fix, pycrdt uses UTF-8 byte offsets (the yrs default). This causes two problems:

  1. Python-side corruption: Text.insert(N, ...) puts text at the wrong position when the preceding text contains multi-byte characters.
  2. JS yjs crash: Incremental updates generated by pycrdt cause findIndexSS "Unexpected case" errors when applied to a JS yjs Y.Doc.

Error before fix

The emoji 📊 is 1 Python character, 4 UTF-8 bytes, and 2 UTF-16 code units. With the default OffsetKind::Bytes, yrs counts it as 4 units. Here's what happens step by step:

doc = Doc()
doc['t'] = Text()
doc['t'] += 'A📊B'    # yrs internal: 6 items (1 + 4 + 1 bytes)
doc['t'].insert(2, 'X')  # Python means "after 📊", but yrs reads 2 as byte offset 2
                          # → inserts inside the emoji's byte sequence

Python-side result: 'A📊BX' instead of 'A📊XB' — the X lands after B, not after 📊.

The garbling compounds with each operation. A realistic sequence:

doc['source'] += '# Analysis 📊\n'           # 15 chars, but yrs sees 18 bytes
doc['source'].insert(len(text), 'model()\n')  # Python says 15, yrs reads 15
                                               # → offset 15 in bytes lands INSIDE 📊

Result: '# Analysis 📊\nmod📊el()\n' — text inserted in the wrong place, document garbled.

Cross-runtime crash: When the second insert is encoded as an incremental yjs update, it references struct clock 15. But JS yjs (which uses UTF-16) only created a struct of length 14 for '# Analysis 📊\n' (14 UTF-16 code units). Clock 15 doesn't exist → findIndexSS throws "Unexpected case":

[yjs findIndexSS] FAILED — clock not found in structs {
  searchClock: 14,
  structsLength: 1,
  firstStruct: { client: 518772262, clock: 0, len: 14 },
  lastStruct: { client: 518772262, clock: 0, len: 14 },
  allRanges: '0..13'
}
Error: Unexpected case

The mismatch is exactly the number of extra UTF-8 bytes from multi-byte characters: 📊 = 4 bytes - 2 UTF-16 code units = 2 extra. After N emoji, the offset is off by 2N.

Works after fix

With OffsetKind::Utf16, yrs counts 📊 as 2 units (matching JS yjs). The Python wrapper converts character index 2 → UTF-16 index 3 (1 for A + 2 for 📊):

doc['t'] += 'A📊B'       # yrs internal: 4 items (1 + 2 + 1 UTF-16 code units)
doc['t'].insert(2, 'X')   # Python index 2 → UTF-16 index 3 → after 📊

Result: 'A📊XB' — correct.

Sequential inserts also work:

doc['source'] += '# Analysis 📊\n'            # 14 UTF-16 code units
doc['source'].insert(len(text), 'model()\n')   # Python 15 → UTF-16 16 → end of text

Result: '# Analysis 📊\nmodel()\n' — correct, no garbling.

Cross-runtime: The update references struct clock 14 (matching the struct's length of 14 UTF-16 code units). JS yjs applies it without error.

Reproduction

import pycrdt

doc = pycrdt.Doc()
doc['t'] = pycrdt.Text()
doc['t'] += 'A📊B'
doc['t'].insert(2, 'X')  # Before fix: 'A📊BX' — After fix: 'A📊XB'

Changes

src/doc.rs: Set options.offset_kind = OffsetKind::Utf16 when creating a Doc. This makes yrs use UTF-16 code unit offsets in the wire format, matching JS yjs.

python/pycrdt/_text.py: Added _char_to_utf16() helper that converts Python character indices to UTF-16 code unit indices. Applied in insert(), insert_embed(), __setitem__, __delitem__, format(), and __len__(). For pure-ASCII / BMP text this is a no-op.

tests/test_text.py: 23 new tests covering emoji, CJK, Cyrillic, supplementary plane, cross-doc sync, and granular diff operations (adapted from jupyter-server/jupyter_ydoc#370). The granular diff tests cover emoji swaps, flags, ZWJ family sequences, combining marks, keycap sequences, RTL/LTR text, Japanese, and math operators.

Test results

  • pycrdt test suite: **145 passed (122 existing + 23 new)
  • Cross-runtime (pycrdt → JS yjs): All 6 test cases with emoji/CJK/Cyrillic content pass where they previously crashed with findIndexSS

Fixes #308
Related: jupyter-ai-contrib/jupyter-server-documents#197

Set OffsetKind::Utf16 on yrs Doc so the wire format uses UTF-16 code
unit offsets, matching JS yjs. Without this, pycrdt uses UTF-8 byte
offsets, causing findIndexSS "Unexpected case" crashes when JS yjs
clients apply incremental updates containing multi-byte characters.

In the Python wrapper, convert character (code point) indices to
UTF-16 code unit indices before passing to yrs. This ensures
Text.insert(), __setitem__, __delitem__, and format() all work
correctly with emoji and other non-BMP characters.

Fixes: y-crdt#308
Related: jupyter-ai-contrib/jupyter-server-documents#197
xrl added a commit to xrl/jupyter-server-documents that referenced this pull request Apr 10, 2026
Replace the individual queued message replay in handle_sync_step1 with
a single batched diff computed from the pre-sync state vector.

The queue+replay approach (Approach B) correctly fixed the handshake
race (infra#307) but exposed a pycrdt offset encoding bug
(jupyter-server-documents#197) where incremental Text updates after
multi-byte characters crash JS yjs with findIndexSS "Unexpected case".

The batched catchup avoids the crash because all CRDT struct references
are resolvable within a single update, while also covering any mutations
that occurred during the handshake gap.

Upstream pycrdt fix: y-crdt/pycrdt#379
@krassowski
Copy link
Copy Markdown
Member

Thanks for the PR. I think this should include comprehensive tests.

See jupyter-server/jupyter_ydoc#370 for some prior art (using a different implementation, but you could also pull test cases out of there, though they will need adopting)

xrl added 2 commits April 10, 2026 14:44
Cover insert, delete, setitem, slice, len, and cross-doc sync with:
- emoji (surrogate pairs: 📊 🎉)
- CJK (BMP: 价格 世界 特征工程)
- Cyrillic (мир)
- supplementary plane (𝒜 𠀀)
- mixed scripts in one text

These all fail on stock pycrdt 0.12.50 and pass with the OffsetKind::Utf16 fix.
11 parametrized test cases adapted from jupyter-server/jupyter_ydoc#370
covering emoji swaps, flags, ZWJ family sequences, combining marks,
keycap sequences, RTL/LTR text, Japanese, and math operators.

These exercise Text insert/delete/replace via SequenceMatcher-based
diffing (the same pattern jupyter_ydoc.YUnicode.set() uses).
xrl added a commit to xrl/jupyter-server-documents that referenced this pull request Apr 10, 2026
Replace the individual queued message replay in handle_sync_step1 with
a single batched diff computed from the pre-sync state vector.

The queue+replay approach (Approach B) correctly fixed the handshake
race (infra#307) but exposed a pycrdt offset encoding bug
(jupyter-server-documents#197) where incremental Text updates after
multi-byte characters crash JS yjs with findIndexSS "Unexpected case".

The batched catchup avoids the crash because all CRDT struct references
are resolvable within a single update, while also covering any mutations
that occurred during the handshake gap.

Upstream pycrdt fix: y-crdt/pycrdt#379
@xrl
Copy link
Copy Markdown
Author

xrl commented Apr 10, 2026

@krassowski please take a look now 👌

@davidbrochart
Copy link
Copy Markdown
Collaborator

@xrl It looks like _utf16_to_char is not tested.

Addresses review feedback from @davidbrochart. Tests cover ASCII
(identity), BMP characters, supplementary plane (emoji), multiple
emoji, and roundtrip with _char_to_utf16.
@xrl
Copy link
Copy Markdown
Author

xrl commented Apr 14, 2026

@davidbrochart take a look now, thanks!

@davidbrochart
Copy link
Copy Markdown
Collaborator

Hmm I'm a bit confused that _utf16_to_char is actually not used. Should we remove it and the corresponding tests?

@davidbrochart
Copy link
Copy Markdown
Collaborator

@jbdyn Since you looked at that in #129, what do you think about this PR?

@jbdyn
Copy link
Copy Markdown
Contributor

jbdyn commented Apr 16, 2026

@davidbrochart I am currently looking into this, but I need a bit of time.

@jbdyn
Copy link
Copy Markdown
Contributor

jbdyn commented Apr 16, 2026

The __iadd__ of Text does not behave:

from pycrdt import Doc, Text

d = Doc()
t = d["t"] = Text()
t += "A📊B"
t += "X"
print(str(t))

which prints

A📊XB  # contrary to expected 'A📊BX'

Internally, Text.integrated.insert(txn, index, chars) needs the UTF-16 index, but Text.__len__ returns the Python character counts.

xrl and others added 2 commits April 17, 2026 00:05
…_char

Text.__iadd__ passed len(self) (Python character count) to the yrs
insert, but yrs expects a UTF-16 code unit index — so `t += "X"` after
an emoji landed inside the surrogate pair. Convert the index through
_char_to_utf16, matching every other mutating method.

Also removes _utf16_to_char and its tests, which had no callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xrl added a commit to xrl/pycrdt that referenced this pull request Apr 17, 2026
XmlText mirrored the same bug that y-crdt#379 fixed in Text: every mutating
method passed a raw Python character index to yrs, but yrs (with
OffsetKind::Utf16 set on the Doc) expects UTF-16 code unit offsets.
Non-BMP content (emoji, supplementary plane) landed at the wrong
position or split surrogate pairs.

Convert the index through _char_to_utf16 in insert, insert_embed,
format, and __delitem__ (handling surrogate-pair length for single
deletes). __iadd__ inherits the fix via self.insert.

Also change __len__ to return Python character count, matching Text.
Previously it returned yrs' UTF-16 length, which disagreed with the
string returned by str(self) for non-BMP characters.

Depends on y-crdt#379 for the OffsetKind::Utf16 doc option.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xrl
Copy link
Copy Markdown
Author

xrl commented Apr 17, 2026

Further review of the surface area turned up three more spots touched by the OffsetKind::Utf16 change. Rather than balloon this PR, I've split them out:

  • fix: use UTF-16 offsets for XmlText operations #381XmlText has the same bug as Text (unconverted offsets in insert/insert_embed/format/__delitem__/__iadd__, plus __len__ returning UTF-16 count). Opened as a draft PR stacked on this one; will mark ready once this lands.
  • StickyIndex: UTF-16 offsets not converted for Text sequences #382StickyIndex doesn't convert in either direction for Text/XmlText. Filed as an issue rather than a PR because the fix has design questions around serialized indices and their source sequence.
  • TextEvent.delta / XmlEvent.deltaretain/delete counts bubble out of yrs as UTF-16 code units. This might actually be the right contract for yjs wire compat — flagging here for discussion rather than changing unilaterally. Let me know if you'd prefer a separate issue for it.

@jbdyn
Copy link
Copy Markdown
Contributor

jbdyn commented Apr 17, 2026

I cannot reproduce the garbling example

doc['source'] += '# Analysis 📊\n'           # 15 chars, but yrs sees 18 bytes
doc['source'].insert(len(text), 'model()\n')  # Python says 15, yrs reads 15
                                               # → offset 15 in bytes lands INSIDE 📊

given in the PR description on the main branch.

When running

from pycrdt import Doc, Text

d = Doc()
t = d["t"] = Text()
text = "# Analysis 📊\n"
t += text
t.insert(len(text), 'model()\n')
print(str(t))

I get

# Analysis 📊\nmodel()\n

and not # Analysis 📊\nmod📊el()\n.

Is yjs still crashing then?

@dlqqq
Copy link
Copy Markdown
Collaborator

dlqqq commented Apr 20, 2026

@jbdyn I wasn't able to reproduce that specific error either, but I am definitely able to reproduce other issues with pycrdt's existing handling of multi-byte characters using the test cases added in this PR.

# fetch PR
git remote add xrl git@github.com:xrl/pycrdt.git
git fetch xrl
git switch 308-fix-utf16-offset-encoding

# check out new test suite on main branch
git checkout main
git checkout 308-fix-utf16-offset-encoding -- tests/test_text.py

# activate env and run new tests against main
uv venv # (if needed)
source .venv/bin/activate
uv pip install -e .
pytest tests/test_text.py

The test cases fail on the main branch with some runtime panics (click to show):

Details
----------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------

thread '<unnamed>' (75872785) panicked at /Users/dlq/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/yrs-0.25.0/src/types/text.rs:845:9:
Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
=========================================================================================================================== short test summary info ============================================================================================================================
FAILED tests/test_text.py::test_unicode_emoji_insert - assert 6 == 3
FAILED tests/test_text.py::test_unicode_emoji_len - assert 6 == 3
FAILED tests/test_text.py::test_unicode_emoji_delete - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_emoji_delete_emoji - AssertionError: Got 'A'
FAILED tests/test_text.py::test_unicode_emoji_slice_delete - AssertionError: Got 'A'
FAILED tests/test_text.py::test_unicode_emoji_setitem - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_emoji_slice_setitem - AssertionError: Got 'AXYZ'
FAILED tests/test_text.py::test_unicode_cjk - assert 7 == 3
FAILED tests/test_text.py::test_unicode_mixed_scripts - AssertionError: Got 'Hello 世界 м!ир 📊'
FAILED tests/test_text.py::test_unicode_supplementary_plane - assert 11 == 5
FAILED tests/test_text.py::test_unicode_granular_diff[emoji_swap] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[text_change_emoji_stay] - AssertionError: Got 'Here is a star: ⭐ and a rocketar: 🚀', expected 'Here is a star: ⭐ and a rocket: 🚀'
FAILED tests/test_text.py::test_unicode_granular_diff[combining_marks] - pyo3_runtime.PanicException: Couldn't remove 5 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[flags] - AssertionError: Got 'Flags: 🇨🇦🇬🇧🇸🇬🇧🇨🇦', expected 'Flags: 🇨🇦🇬🇧🇺🇸'
FAILED tests/test_text.py::test_unicode_granular_diff[zwj_family] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[rtl_ltr] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[keycap] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[emoji_boundaries] - pyo3_runtime.PanicException: Couldn't remove 1 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[japanese] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
FAILED tests/test_text.py::test_unicode_granular_diff[math_operators] - pyo3_runtime.PanicException: Couldn't remove 2 elements from an array. Only 0 of them were successfully removed.
======================================================================================================================== 20 failed, 16 passed in 0.63s =========================================================================================================================

This exactly matches the behavior we are seeing in AI extensions when they attempt to add multi-byte characters like emojis into notebooks using RTC MCP tools.

Copy link
Copy Markdown
Collaborator

@dlqqq dlqqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xrl Looks good, thank you so much for working on this! The approach is correct and the test coverage is thorough. Approving.

@davidbrochart I think that this PR is good to merge - what do you think?

Also, I think there are two edge cases that aren't covered by this PR. I don't think that these are worth blocking the PR over since they are not regressions, but would be good to address as a follow-up.

  1. sticky_index feature: StickyIndex.new() passes the Python code point index directly to yrs without converting to UTF-16 code units, and get_index() returns the raw yrs UTF-16 offset without converting back. This will be wrong for text containing characters above U+FFFF. Not a regression since this wouldn't have worked before anyways.

  2. TextEvent deltas from observe(): The retain/delete lengths in change events come from yrs in UTF-16 code units. We should convert the UTF-16 code unit offsets back to the code point offsets that Python consumers expect. This isn't a regression either, since we weren't converting UTF-8 byte offsets back to code point offsets before anyways.

Comment thread python/pycrdt/_text.py Outdated
self.integrated.remove_range(txn._txn, key, 1)
utf16_idx = _char_to_utf16(current, key)
char_at = current[key]
utf16_len = 2 if ord(char_at) > 0xFFFF else 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some digging into this. This is counting the number of UTF-16 code units per code point. Python counts length in code points.

Each code point is always either 1 or 2 UTF-16 code units in length; this is defined by the UTF-16 encoding itself. A surrogate pair (a Python code point comprised of 2 code units) is always composed of a high surrogate (>0xFFFF) and a low surrogate (<=0xFFFF). So this logic is correct.

Since Python indexes strings using code points, the _char_to_utf16() function can be implemented similarly by iterating over every code point and add 1 for each high surrogate to compute the UTF-16 byte offset. That is what sum(1 for ch in prefix if ord(ch) > 0xFFFF) does. So that function is correct as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated a small table for reference:

Visual character Grapheme clusters (humans) Code points (Python) UTF-16 code units (Yjs)
A 1 1 1
é (precomposed, U+00E9) 1 1 1
é (decomposed, U+0065 + U+0301) 1 2 2
★ (U+2605) 1 1 1
中 (U+4E2D) 1 1 1
📊 (U+1F4CA) 1 1 2
🇺🇸 (flag) 1 2 4
👨‍👩‍👧‍👦 (family) 1 7 11
Hello 5 5 5

Comment thread src/doc.rs Outdated
Comment on lines +88 to +92
// Use UTF-16 offsets for compatibility with JS yjs clients.
// Without this, pycrdt uses UTF-8 byte offsets which causes
// findIndexSS crashes when JS yjs applies incremental updates
// containing multi-byte characters.
options.offset_kind = OffsetKind::Utf16;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the correct approach. Yjs is inherently locked into UTF-16 code unit offsets for indices and lengths because JavaScript strings are encoded in UTF-16. This line configures pycrdt documents to also use UTF-16 code unit offsets for indices and lengths returned by Yrs's methods. This is pretty much required for compatibility with Yjs for containing anything except plain ASCII, since UTF-8 byte offsets != UTF-16 code unit offsets for everything outside plain ASCII (even in the BMP).

Reference: https://github.com/y-crdt/y-crdt/blob/195f0f4627be6b3fbc223bd3bfbadddc874458ca/yrs/src/doc.rs#L948

Copy link
Copy Markdown
Collaborator

@davidbrochart davidbrochart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that pycrdt should be hard-coded to use an offset kind of UTF-16. Just like Yrs, it should be parameterizable.

Comment thread python/pycrdt/_text.py Outdated
"""
with self.doc.transaction() as txn:
return self.integrated.len(txn._txn)
# Return Python character count, not yrs UTF-16 code unit count
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Return Python character count, not yrs UTF-16 code unit count

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 855af5c — dropped the inline comment and tightened the __len__ docstring to match.

Comment thread python/pycrdt/_text.py Outdated
from ._doc import Doc


def _char_to_utf16(text: str, char_index: int) -> int:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _char_to_utf16(text: str, char_index: int) -> int:
def get_utf16_index(text: str, char_index: int) -> int:

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in 855af5c. Added a get_utf8_index counterpart since offset_kind is now parameterizable per #379 (commit 41c8cc5) — both are re-exported from the top-level package. The dispatcher itself stays private (one-line conditional).

Comment thread python/pycrdt/_text.py

For pure-ASCII / BMP text this is a no-op (returns ``char_index``
unchanged).
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Google style docstrings (Args, Returns).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 855af5c — both get_utf16_index and get_utf8_index now use Google-style Args / Returns docstrings.

@jbdyn
Copy link
Copy Markdown
Contributor

jbdyn commented Apr 21, 2026

I don't think that pycrdt should be hard-coded to use an offset kind of UTF-16.

Agreed, as it decides about yjs compatibility (thanks @dlqqq for the reference) and thus also about compatibility with other clients using the UTF-8 offset. The user needs to choose one.
Also, yrs might become the core implementation for all language wrappers/bindings at some point:

Eventually Yrs might replace the Yjs module by generating a wasm module from this package.

An offset parameter would come in handy during migration.

@jbdyn
Copy link
Copy Markdown
Contributor

jbdyn commented Apr 21, 2026

@davidbrochart With a paramterizable offset, Text and other datatypes also need a get_utf8_index conversion, right?

@davidbrochart
Copy link
Copy Markdown
Collaborator

@davidbrochart With a paramterizable offset, Text and other datatypes also need a get_utf8_index conversion, right?

I think so, yes.

@dlqqq
Copy link
Copy Markdown
Collaborator

dlqqq commented Apr 21, 2026

@davidbrochart Can we prefer UTF-16 as the default since that's the default Yjs uses? It seems more reasonable to follow the standard set by the original, most common implementation of the Y protocol (Yjs). Otherwise every Python consumer of pycrdt will need to manually set the offset kind to work with Yjs.

@dlqqq
Copy link
Copy Markdown
Collaborator

dlqqq commented Apr 21, 2026

Can we also allow the offset kind to parametrized in a future PR? This was not parametrized previously so that could be developed in a future PR. IMO it makes sense to merge this as-is since it fixes bugs with the existing Text API when handling Unicode characters, as well as Yjs compatibility.

@davidbrochart
Copy link
Copy Markdown
Collaborator

I think it's a smoother path to first allow the offset kind to be parameterized with a default value of UTF-8.
We'll see later if we diverge from Yrs, since it would be a breaking change.

@dlqqq
Copy link
Copy Markdown
Collaborator

dlqqq commented Apr 21, 2026

I understand. Are we imagining this to be a parameter passed to Doc.__init__()?

For this change to work, I think we would need to configure jupyter_ydoc to initialize all pycrdt.Doc objects with offset_kind="utf16". Then, the Text methods need to check the offset kind set in self.doc to know which offsets to use when computing indexes and lengths. Does this match with what you're expecting also?

@davidbrochart
Copy link
Copy Markdown
Collaborator

Yes it seems that the offset kind would be a parameter to a document initialization.
Let's not talk about jupyter-ydoc here, as pycrdt is independent. Please open an issue there if you want to discuss that topic.

@krassowski
Copy link
Copy Markdown
Member

Hi folks, just checking back here, any update on updating offset_kind to allow for merging this?

xrl and others added 4 commits April 28, 2026 10:27
Adds an `offset_kind` parameter to `Doc(...)` so callers can choose
between UTF-8 byte offsets (the yrs default) and UTF-16 code unit
offsets (required for cross-runtime interop with JS yjs). Default is
"utf8", matching yrs.

The Text wrapper previously assumed UTF-16 unconditionally and passed
Python char indices through a UTF-16 conversion. This commit replaces
that with a dispatcher that picks UTF-8 or UTF-16 conversion based on
the doc's offset_kind, so Text behaves correctly in either mode while
the public API still takes Python character indices.

src/doc.rs gains a third positional arg on `Doc::new` accepting "utf8"
/ "utf-8" / "utf16" / "utf-16" / None (None preserves the yrs default).
Invalid values raise PyValueError. _from_snapshot_impl now reads the
offset kind from the source doc instead of hardcoding it. A new
`#[getter] offset_kind` returns the canonical "utf8" or "utf16" string.

_base.py forwards the new kwarg, and raises ValueError if both `doc=`
(an existing _Doc) and `offset_kind=` are supplied with disagreeing
values. _doc.py adds the kwarg to Doc.__init__, documents it, and
exposes a read-only `Doc.offset_kind` property.

Addresses davidbrochart's review on y-crdt#379 asking for parameterization
with a UTF-8 default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses davidbrochart's three inline reviews on y-crdt#379:

- _text.py:14 — rename `_char_to_utf16` → `get_utf16_index` (public,
  per the suggested rename). Add a counterpart `get_utf8_index` since
  with offset_kind parameterization the conversion is needed in both
  directions (jbdyn + davidbrochart agreed in the thread). The
  dispatcher and single-char-length helpers stay private — trivial
  conditionals best read at the call site.
- _text.py:23 — both public helpers now use Google-style docstrings
  (Args / Returns sections), matching the rest of the module.
- _text.py:112 — drop the inline comment that duplicated the
  docstring; tighten the docstring to match.

Re-export the two helpers from `pycrdt.__init__` so consumers can
reach them without importing from the underscore module. Drop a
now-dead `_char_to_utf16` import in tests/test_text.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an `offset_kind` pytest fixture parametrizing all Unicode tests
over both ``"utf8"`` and ``"utf16"`` modes. The same test bodies now
exercise the dispatcher in `Text` for both yrs offset configurations,
without changing the assertions — public API contract is always Python
char indices.

`test_unicode_cross_doc_sync` constructs both peers with the same
offset_kind (mismatched offset kinds across peers is unsupported by
yrs/yjs). New explicit tests cover:
- default-is-utf8
- explicit "utf8" / "utf-8" / "utf16" / "utf-16"
- ValueError on unknown values
- snapshot round-trip preserves offset_kind for both modes
- new public `get_utf16_index` / `get_utf8_index` helpers
  (identity / BMP CJK / non-BMP emoji)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xrl
Copy link
Copy Markdown
Author

xrl commented Apr 28, 2026

@krassowski take a look now, please

xrl and others added 2 commits April 28, 2026 10:35
The Rust-extension stub didn't reflect the new constructor signature
or the offset_kind getter, so `mypy python` failed in CI on _base.py
and _doc.py with "Too many arguments" / "no attribute offset_kind".

Add the third positional `offset_kind: str | None` to `Doc.__init__`
and declare the read-only `offset_kind` property.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes the ubuntu python-3.14 coverage regression on y-crdt#379. The new
mismatch check at python/pycrdt/_base.py:74 (raise ValueError when
both doc= and offset_kind= are passed and they disagree) wasn't
exercised by any pytest test, dropping coverage to 99% and tripping
the suite's fail-under=100 gate.

Add a one-shot regression test that constructs an existing Doc with
offset_kind="utf8", then attempts to wrap it as a new Doc with
offset_kind="utf16" and asserts ValueError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces configurable text offset semantics so pycrdt can safely operate on Unicode text (including non-BMP emoji) and interoperate with JS yjs when using UTF-16 code unit offsets.

Changes:

  • Add offset_kind to Doc construction (Rust + Python) and expose Doc.offset_kind to Python.
  • Update Python Text operations to convert Python character indices into the doc’s internal offset units (UTF-8 bytes or UTF-16 code units).
  • Add extensive Unicode-focused tests and export get_utf8_index / get_utf16_index helpers.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/doc.rs Adds offset_kind parsing to yrs Options and exposes an offset_kind getter.
python/pycrdt/_base.py Wires offset_kind through to the Rust _Doc and validates consistency when reusing an existing doc.
python/pycrdt/_doc.py Exposes Doc.offset_kind and documents the new constructor option.
python/pycrdt/_text.py Adds UTF-8/UTF-16 index conversion helpers and updates Text mutation methods to use converted offsets.
python/pycrdt/__init__.py Re-exports get_utf8_index / get_utf16_index as public helpers.
python/pycrdt/_pycrdt.pyi Updates type stubs for the new _pycrdt.Doc constructor arg and property.
tests/test_text.py Adds parameterized tests for Unicode behavior across both offset kinds and validates offset_kind API behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/pycrdt/_base.py
Comment on lines +72 to +76
doc = _Doc(client_id, skip_gc, offset_kind)
elif offset_kind is not None and offset_kind != doc.offset_kind:
raise ValueError(
f"offset_kind={offset_kind!r} does not match doc.offset_kind={doc.offset_kind!r}"
)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset_kind mismatch check compares the raw user string to doc.offset_kind, but the constructor accepts hyphenated aliases (e.g. utf-16). Passing an existing _Doc plus offset_kind='utf-16' will currently raise even though it's semantically the same as 'utf16'. Normalize offset_kind to the canonical form (e.g. strip hyphens / lowercase) before comparing, or accept both spellings in the comparison.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_text.py
Comment on lines +421 to +423
# initial content, then applies a granular edit (using SequenceMatcher on
# byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies
# the result is correct.
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block comment says the granular edit uses "SequenceMatcher on byte offsets", but _apply_diff() uses SequenceMatcher(a=old_value, b=new_value) on Python strings and then applies edits using character indices. Update the comment to avoid implying byte-based offsets here (since that distinction is important in this PR).

Suggested change
# initial content, then applies a granular edit (using SequenceMatcher on
# byte offsets, matching how jupyter_ydoc.YUnicode.set() works), and verifies
# the result is correct.
# initial content, then applies a granular edit using SequenceMatcher on
# Python strings / character indices in this helper, and verifies the result
# is correct.

Copilot uses AI. Check for mistakes.
Comment thread src/doc.rs
Comment on lines 77 to 104
@@ -84,10 +89,30 @@ impl Doc {
.map_err(|_| PyValueError::new_err("skip_gc must be a valid bool"))?;
options.skip_gc = _skip_gc;
}
if !offset_kind.is_none() {
let _offset_kind: String = offset_kind
.extract()
.map_err(|_| PyValueError::new_err("offset_kind must be a string"))?;
options.offset_kind = match _offset_kind.as_str() {
"utf8" | "utf-8" => OffsetKind::Bytes,
"utf16" | "utf-16" => OffsetKind::Utf16,
_ => return Err(PyValueError::new_err(
"offset_kind must be 'utf8' or 'utf16'",
)),
};
}
let doc = _Doc::with_options(options);
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor only sets options.offset_kind when offset_kind is explicitly provided, which means Doc() will still use the yrs default (OffsetKind::Bytes). That conflicts with the PR description/title claiming the Doc is configured to use UTF-16 offsets by default. Either set options.offset_kind = OffsetKind::Utf16 unconditionally (and adjust tests/docs accordingly) or update the PR description to reflect that UTF-16 is opt-in via offset_kind.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the PR description to reflect that UTF-16 is opt-in via offset_kind

I suppose we agreed on this one.

Comment thread python/pycrdt/_text.py
Comment on lines +179 to +180
current = str(self)
offset = _char_to_offset(current, len(current), self.doc.offset_kind)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__iadd__ now calls str(self) just to compute the append offset, which makes text += ... O(n) in the current text size and allocates the whole string. For appends you can avoid this by using the underlying integrated length in the doc’s offset units (e.g. self.integrated.len(txn._txn)) as the insert index, since that already matches the offset kind.

Suggested change
current = str(self)
offset = _char_to_offset(current, len(current), self.doc.offset_kind)
offset = self.integrated.len(txn._txn)

Copilot uses AI. Check for mistakes.
Comment thread python/pycrdt/_text.py
prefix = text[:char_index]
# Count characters that need a surrogate pair (code point > 0xFFFF)
extra = sum(1 for ch in prefix if ord(ch) > 0xFFFF)
return char_index + extra
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_utf16_index() returns char_index + extra, which becomes incorrect when char_index is out of range or negative because Python slicing clamps but char_index is not clamped (e.g. get_utf16_index('A📊B', 10) returns 11 instead of the end offset 4). This can produce offsets beyond the actual text length and break insert/delete/format calls in UTF-16 mode. Compute the result from the sliced prefix length (or clamp char_index to [0, len(text)]) so the returned UTF-16 offset always stays within bounds.

Suggested change
return char_index + extra
return len(prefix) + extra

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unexpected text behavior for XmlText (possibly Text)

6 participants