Skip to content

gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE#113275

Open
qsantos wants to merge 2 commits into
python:mainfrom
qsantos:fix-issue-113274
Open

gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE#113275
qsantos wants to merge 2 commits into
python:mainfrom
qsantos:fix-issue-113274

Conversation

@qsantos
Copy link
Copy Markdown

@qsantos qsantos commented Dec 19, 2023

This PR closes #113274. This is done by changing the UCS-2 codepoint 126 (~, TILDE) to 65374 (~, FULLWIDTH TILDE) in the __jisx0212_decmap reference table. Since no character is added or removed, no other changes are needed.

Bug report

Bug description:

Python decodes the bytes 0x8FA2A7 as ~ (TILDE) in EUC-JP.

assert b'\x8f\xa2\xb7'.decode('euc_jp') == '~'

This reference document is ambiguous in that it shows a simple ~ (TILDE), but most other software (iconv, Vim, Firefox, Rust's encoding_rs) interpret this as ~ (FULLWIDTH TILDE). Note that EUC-JP already includes US-ASCII, and so:

assert '~'.encode('euc-jp') == b'~'

CPython versions tested on:

3.11, CPython main branch

Operating systems tested on:

Linux

Comment thread Modules/cjkcodecs/mappings_jp.h Outdated

static const ucs2_t __jisx0212_decmap[6179] = {
728,711,184,729,733,175,731,730,126,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
728,711,184,729,733,175,731,730,65374,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
Copy link
Copy Markdown
Member

@corona10 corona10 Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// AUTO-GENERATED FILE FROM genmap_japanese.py: DO NOT EDIT

Please take a look at the comment on the first line.
I 've not taken a look at the issue deeply yet, but if you want to change the mapping file, you should modify the generator.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that. I will do that.

Copy link
Copy Markdown
Author

@qsantos qsantos Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how Vim and Firefox proceed (although Vim does fallback to iconv for non-internally-supported encodings). Since I do not expect to get Unicode to update the document of an obsolete standard that they only used to adapt, I see two options:

  1. add a hack in genmap_japanese.py around line 90 after the call to loadmap(jisx0212file) to reassign jisx0212decmap[34][55] = ord('~') (it does fix __jisx0212_decmap as expected and changes nothing else), and comment that properly
  2. switch to WhatWG as the source of truth

1 can be done pretty quickly and with minimal risk of breaking anything. 2 has the advantage that it might help discover other issues in mappings, but is obviously much more involved.

If 1 seems reasonable to you, I will update this PR accordingly. If 2, I will close it and document this in the ticket for later.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this MR with the hack in genmap_japanese.py and regenerated mappings_jp.h.

Copy link
Copy Markdown
Member

@corona10 corona10 Dec 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look at which will be the better, If we decide to solve this issue, option 2 will be the better way.
But I need to check the current status related to JIS0212 and side effect.
Also need to consider stability of whatwg spec.

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Dec 19, 2023

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@qsantos
Copy link
Copy Markdown
Author

qsantos commented Dec 24, 2023

I have made the requested changes; please review again.

@corona10
Copy link
Copy Markdown
Member

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

@corona10 corona10 self-assigned this Dec 24, 2023
@corona10
Copy link
Copy Markdown
Member

corona10 commented Dec 24, 2023

Okay, I like this PR, it will guarantee the round trip

a = b'\x8f\xa2\xb7'
assert a.decode('euc_jp').encode('euc_jp') == a

But as I said, I will take a look at this PR by next week.

@qsantos
Copy link
Copy Markdown
Author

qsantos commented Dec 24, 2023

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

Thanks, I was not sure if I had properly triggered the bot. Enjoy your Christmas as well!

Copy link
Copy Markdown
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here is my temporal conclusion for this change.
(If we decide to adopt the change!)

  1. Let's use uncode.org data as same as now, whatwg is great but not stable. We are not a browser, I am not sure that we can follow every change without side effect.
  2. Let's vendor the modified JIS0212.TXT into https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings and store the diff file to https://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff for tracking changes.

But I need to listen to the opinion of @ezio-melotti, who is a more experienced unicode expert than me.

@qsantos
Copy link
Copy Markdown
Author

qsantos commented Feb 1, 2024

@ezio-melotti What is your take on this?

@qsantos
Copy link
Copy Markdown
Author

qsantos commented Apr 5, 2024

Should we consider this dead?

@python python deleted a comment Apr 7, 2025
@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 15, 2026
@qsantos
Copy link
Copy Markdown
Author

qsantos commented Apr 15, 2026

Not stale, just waiting for review

@github-actions github-actions Bot removed the stale Stale PR or inactive for long period of time. label May 12, 2026
Copy link
Copy Markdown
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with patching the generator if it allows us to move forward.

I have only questions: is EUC-JP the only encoding that needs this fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python decodes EUC-JP 8FA2A7 as TILDE instead of FULLWIDTH TILDE

4 participants