PyThaiNLP v5.3.0 Released!

@bact released this 10 Mar 06:27 · ef81afc

This release modernizes the codebase. It delivers a 62x reduction in peak import memory through lazy loading (thanks to @what-in-the-nim), improves support for offline and read-only environments, and achieves typed-package status through improved type annotations.

Install/upgrade:

    pip install -U pythainlp

  • This is the final minor update for the 5.x series. The upcoming 6.0 release will be a major milestone and is expected to introduce breaking changes.
  • Minimum required Python version is now 3.9. The library ensures compatibility across Python 3.9–3.14.
  • Many updates were AI-assisted; see pull requests for specific prompts and implementation details.
  • Lazy-loaded word lists: Users may notice a slight "cold start" delay on the first function call while word lists initialize; subsequent calls in the same process run at full speed.
  • Documentation: https://pythainlp.github.io/docs/5.3
  • Report bug: https://github.com/PyThaiNLP/pythainlp/issues
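
The "cold start" behavior mentioned in the lazy-loading note can be illustrated with a small standard-library sketch. This is not PyThaiNLP's implementation: load_word_list, is_known_word, and the three-word list are hypothetical stand-ins, and the sleep only simulates the one-time cost of reading a dictionary.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def load_word_list() -> frozenset:
    """Hypothetical loader: builds the word list on the first call only."""
    time.sleep(0.05)  # stand-in for reading a dictionary file from disk
    return frozenset({"แมว", "กิน", "ปลา"})


def is_known_word(word: str) -> bool:
    # Nothing is loaded at import time; the first call pays the loading cost.
    return word in load_word_list()


start = time.perf_counter()
is_known_word("แมว")  # cold: triggers the load
cold = time.perf_counter() - start

start = time.perf_counter()
is_known_word("ปลา")  # warm: served from the cached word list
warm = time.perf_counter() - start

print(f"cold call: {cold:.4f}s, warm call: {warm:.6f}s")
```

Because the cache lives in the running process, a new process pays the loading cost again on its first call.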

What's changed

Added

  • Tapsai et al. 2020 soundex (#1175)
  • Thai profanity detection (#1183)
  • Qwen3-0.6B language model (#1217)
  • Thai-NNER integration with top-level entity filtering (#1221)
  • pythainlp.braille module for Thai braille conversion (#1287)
  • BLEU, ROUGE, WER, and CER metrics to pythainlp.benchmarks (#1295)
  • Attaparse engine to dependency parser (dependency_parsing, engine="attaparse") (#1303)
  • pythainlp.is_offline_mode() helper function; use PYTHAINLP_OFFLINE=1 to disable automatic corpus downloads (#1306)
  • Thai consonant cluster detection (check_khuap_klam) (#1308)
  • pythainlp.is_read_only_mode() helper function; use PYTHAINLP_READ_ONLY=1 to prevent all write operations (#1317)
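
For readers unfamiliar with the WER and CER metrics listed above, both are commonly defined as an edit distance divided by the reference length, computed over words and characters respectively. The following is a minimal standard-library sketch of that definition. It does not use or mirror the pythainlp.benchmarks API; edit_distance and error_rate are hypothetical names.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # skip one reference item
                curr[j - 1] + 1,     # skip one hypothesis item
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# CER compares characters; WER compares pre-tokenized words.
cer = error_rate("แมวกินปลา", "แมวกินปา")
wer = error_rate(["แมว", "กิน", "ปลา"], ["แมว", "กิน", "ปา"])
print(f"CER={cer:.3f} WER={wer:.3f}")
```

The same one-character error costs 1/9 at the character level but 1/3 at the word level, which is why the two metrics are usually reported together for Thai, where word boundaries depend on the tokenizer.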

Changed

  • Optimized for performance (#1182, #1237, #1320)
  • Lazy load dictionaries to reduce memory usage (#1186)
  • Migrate configurations to pyproject.toml (#1188, #1190, #1226, #1239)
  • Update type hints; use Python 3.9 features (#1189, #1190, #1232, #1262, #1263, #1264, #1274, etc.)
  • Make package zip-safe (#1212)
  • Ensure thread-safety for tokenizers (#1213)
  • Replace TNC word frequency dataset with Phupha filtered by ORST words (#1284)
  • Reorganize "noauto" test suite by dependency groups (torch, tensorflow, onnx, cython, network) (#1290)
  • get_corpus_path() now respects PYTHAINLP_OFFLINE env var (follows HF_HUB_OFFLINE convention from Hugging Face): raises FileNotFoundError if the corpus is not cached locally when the var is set; auto-downloads otherwise (#1306)
  • Callers raise FileNotFoundError with download instructions when a corpus path cannot be resolved (#1306)
  • Migrate build backend to hatchling (#1311)
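
The offline behavior described for get_corpus_path() can be sketched roughly as follows. This is a simplified stand-in, not the library's code: resolve_corpus_path, offline_mode, fake_download, and the file name are hypothetical, and treating only the value "1" as enabled is an assumption based on the documented PYTHAINLP_OFFLINE=1 usage.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def offline_mode() -> bool:
    # Assumption: only the documented value "1" is treated as enabled here.
    return os.environ.get("PYTHAINLP_OFFLINE") == "1"


def resolve_corpus_path(name: str, cache_dir: Path,
                        download: Callable[[str, Path], None]) -> Path:
    """Hypothetical resolver: use the cache, else download unless offline."""
    path = cache_dir / name
    if path.exists():
        return path
    if offline_mode():
        raise FileNotFoundError(
            f"Corpus '{name}' is not cached in {cache_dir} and "
            "PYTHAINLP_OFFLINE is set; download it while online first."
        )
    download(name, path)
    return path


def fake_download(name: str, dest: Path) -> None:
    # Stand-in for a network fetch.
    dest.write_text(f"contents of {name}", encoding="utf-8")


cache = Path(tempfile.mkdtemp())

# Offline and not cached: the lookup fails instead of downloading.
os.environ["PYTHAINLP_OFFLINE"] = "1"
try:
    resolve_corpus_path("example_corpus.txt", cache, fake_download)
    offline_error = None
except FileNotFoundError as exc:
    offline_error = exc

# Online: the corpus is fetched and cached.
os.environ.pop("PYTHAINLP_OFFLINE")
online_path = resolve_corpus_path("example_corpus.txt", cache, fake_download)
```

Once the file is cached, the same lookup succeeds even with PYTHAINLP_OFFLINE=1, so a machine can be prepared while online and then run without network access.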

Deprecated

  • PYTHAINLP_DATA_DIR env var; use PYTHAINLP_DATA instead (follows the NLTK_DATA convention from NLTK). PYTHAINLP_DATA_DIR will be removed in a future version (#1306)
  • PYTHAINLP_READ_MODE env var; use PYTHAINLP_READ_ONLY instead. PYTHAINLP_READ_MODE will be removed in a future version (#1317)
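
Scripts that must work during the deprecation window can read the new variable first and fall back to the old one. The helper below is a hypothetical sketch, not part of PyThaiNLP; in particular, giving the new name precedence when both are set is an assumption, since the release notes do not state how the library resolves that case.

```python
import os
import warnings
from typing import Mapping, Optional


def read_env(new: str, old: str,
             env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: prefer `new`, fall back to deprecated `old`."""
    env = os.environ if env is None else env
    if new in env:
        return env[new]  # assumption: the new name wins if both are set
    if old in env:
        warnings.warn(
            f"{old} is deprecated; use {new} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[old]
    return None


# Demo with an explicit mapping so the real environment is left untouched.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data_dir = read_env(
        "PYTHAINLP_DATA",
        "PYTHAINLP_DATA_DIR",
        env={"PYTHAINLP_DATA_DIR": "/tmp/pythainlp-data"},
    )
```

The same helper covers the PYTHAINLP_READ_ONLY and PYTHAINLP_READ_MODE pair.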

Removed

  • Duplicated entries in Volubilis dictionary (#1200)
  • Star imports (#1207)
  • requests dependency (#1211)
  • pythainlp.util.is_native_thai (deprecated since v5.0); use pythainlp.morpheme.is_native_thai instead (#1315)

Fixed

  • royin romanization: Consonant cluster boundary (#1172)
  • check_marttra(): Final consonant classification (#1173)
  • Base dependencies (#1185)
  • tltk transliteration: Kho Khon alphabet issue (#1187)
  • tone_detector and sound_syllable: Bugs (#1197)
  • normalize(): Remove spaces before tone marks and non-base characters (#1222)
  • Suppress Gensim duplicate-word warnings when loading word2vec binary files (#1316)
  • db.json: created lazily only when a corpus is first downloaded (#1317)
  • newmm tokenization: Exponential-time explosion when text has many ambiguous breaking points (#1319)
  • Trie: Reduce memory usage and speed up TCC boundary lookups (#1323)
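
To give a sense of why ambiguous break points can cause exponential running time in dictionary-based segmentation, the toy sketch below counts the segmentations of a string in two ways. It is neither the newmm algorithm nor the fix from #1319; WORDS, count_naive, and count_memo are hypothetical, and memoization is shown only as one common way to avoid re-solving the same suffix.

```python
from functools import lru_cache

WORDS = {"a", "aa"}  # toy dictionary with overlapping entries


def count_naive(text: str) -> int:
    """Enumerate every segmentation recursively (exponential time)."""
    if not text:
        return 1
    return sum(count_naive(text[len(w):]) for w in WORDS if text.startswith(w))


def count_memo(text: str) -> int:
    """Same count, but each start position is solved only once."""

    @lru_cache(maxsize=None)
    def from_pos(i: int) -> int:
        if i == len(text):
            return 1
        return sum(from_pos(i + len(w)) for w in WORDS if text.startswith(w, i))

    return from_pos(0)


print(count_naive("a" * 10))  # 89 segmentations for just 10 characters
print(count_memo("a" * 60))   # still fast, although the count is in the trillions
```

With this dictionary every position is an ambiguous break point, so the number of segmentations grows like the Fibonacci sequence. The naive version visits each one, while the memoized version does work proportional to the text length.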

Security

  • Prevent path traversal and symlink attacks in archive extraction (#1225)
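
The class of problem addressed here can be illustrated with Python's standard tarfile module. The sketch below is not the code from #1225; safe_extract and make_tar are hypothetical names, and rejecting every link member outright is a deliberately strict assumption. It validates all members before writing anything.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar: tarfile.TarFile, dest: str) -> None:
    """Extract only regular files and directories that stay inside dest."""
    root = os.path.realpath(dest)
    for member in tar.getmembers():
        if member.issym() or member.islnk():
            raise ValueError(f"link not allowed in archive: {member.name}")
        if not (member.isfile() or member.isdir()):
            raise ValueError(f"unsupported member type: {member.name}")
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"path escapes destination: {member.name}")
    # Use the built-in "data" extraction filter where the interpreter has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    tar.extractall(root, **kwargs)


def make_tar(name: str) -> tarfile.TarFile:
    """Build a one-file in-memory archive for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


dest = tempfile.mkdtemp()
safe_extract(make_tar("ok.txt"), dest)  # extracted normally

try:
    safe_extract(make_tar("../evil.txt"), dest)
    rejected = False
except ValueError:
    rejected = True  # the traversal attempt is refused before any write
```

Checking the resolved path with os.path.realpath and os.path.commonpath also catches absolute member names, because joining an absolute name onto the destination discards the destination prefix.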

New contributors

Full Changelog: v5.2.0...v5.3.0