PyThaiNLP v5.3.0 Released!
This release modernizes the codebase: it delivers a 62x reduction in peak import memory through lazy loading (thanks to @what-in-the-nim), improves support for offline and read-only environments, and achieves typed-package status through improved type annotations.
Install/upgrade:
pip install -U pythainlp
- This is the final minor update for the 5.x series. The upcoming 6.0 release will be a major milestone and is expected to introduce breaking changes.
- Minimum required Python version is now 3.9. The library ensures compatibility across Python 3.9–3.14.
- Many updates were AI-assisted; see pull requests for specific prompts and implementation details.
- Lazy-loaded word lists: Users may notice a slight "cold start" delay on the first function call while word lists initialize; subsequent calls in the same process run at full speed.
- Documentation: https://pythainlp.github.io/docs/5.3
- Report bug: https://github.com/PyThaiNLP/pythainlp/issues
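
The "cold start" behavior of lazy-loaded word lists can be illustrated with a minimal, generic sketch. This is not PyThaiNLP's implementation: `load_words`, `is_known_word`, and `timed_call` are hypothetical names, and the `time.sleep` call only stands in for reading a large dictionary file.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def load_words() -> frozenset:
    """Hypothetical loader: runs once, on first use, then stays cached."""
    time.sleep(0.05)  # stand-in for reading a large word list from disk
    return frozenset({"กิน", "ข้าว", "แมว"})


def is_known_word(word: str) -> bool:
    """Nothing is loaded at import time; the word list is built on demand."""
    return word in load_words()


def timed_call(word: str):
    """Return the lookup result together with the elapsed time in seconds."""
    start = time.perf_counter()
    result = is_known_word(word)
    return result, time.perf_counter() - start
```

Calling `timed_call` twice in one process shows the pattern: the first call pays the loading cost, and later calls reuse the cached set.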
What's changed
Added
- Tapsai et al. 2020 soundex (#1175)
- Thai profanity detection (#1183)
- Qwen3-0.6B language model (#1217)
- Thai-NNER integration with top-level entity filtering (#1221)
- pythainlp.braille module for Thai braille conversion (#1287)
- BLEU, ROUGE, WER, and CER metrics to pythainlp.benchmarks (#1295)
- Attaparse engine to dependency parser (dependency_parsing, engine="attaparse") (#1303)
- pythainlp.is_offline_mode() helper function; use PYTHAINLP_OFFLINE=1 to disable automatic corpus downloads (#1306)
- Thai consonant cluster detection (check_khuap_klam) (#1308)
- pythainlp.is_read_only_mode() helper function; use PYTHAINLP_READ_ONLY=1 to prevent all write operations (#1317)
Changed
- Optimized for performance (#1182, #1237, #1320)
- Lazy load dictionaries to reduce memory usage (#1186)
- Migrate configurations to pyproject.toml (#1188, #1190, #1226, #1239)
- Update type hints; use Python 3.9 features (#1189, #1190, #1232, #1262, #1263, #1264, #1274, etc.)
- Make package zip-safe (#1212)
- Ensure thread-safety for tokenizers (#1213)
- Replace TNC word frequency dataset with Phupha filtered by ORST words (#1284)
- Reorganize "noauto" test suite by dependency groups (torch, tensorflow, onnx, cython, network) (#1290)
- get_corpus_path() now respects PYTHAINLP_OFFLINE env var (follows HF_HUB_OFFLINE convention from Hugging Face): raises FileNotFoundError if the corpus is not cached locally when the var is set; auto-downloads otherwise (#1306)
- Callers raise FileNotFoundError with download instructions when a corpus path cannot be resolved (#1306)
- Migrate build backend to hatchling (#1311)
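
The offline behavior described for get_corpus_path() can be sketched roughly as follows. This is an illustration of the described contract only, not the library's code: `offline_enabled`, `resolve_corpus`, the `download` callback, and the set of accepted flag values are all assumptions.

```python
import os
from pathlib import Path


def offline_enabled() -> bool:
    """Hypothetical flag check; the values PyThaiNLP accepts are an assumption."""
    value = os.environ.get("PYTHAINLP_OFFLINE", "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_corpus(name, cache_dir, download):
    """Return the cached path, download if allowed, otherwise fail loudly."""
    path = Path(cache_dir) / name
    if path.exists():
        return path  # already cached: no network access needed
    if offline_enabled():
        # Offline mode: never download, tell the user how to recover.
        raise FileNotFoundError(
            f"Corpus {name!r} is not cached in {cache_dir} and "
            "PYTHAINLP_OFFLINE is set; download it first, then retry."
        )
    download(name, path)  # hypothetical downloader supplied by the caller
    return path
```

The design point is that offline mode turns a silent network fetch into an explicit error, which is easier to diagnose in air-gapped or CI environments.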
Deprecated
- PYTHAINLP_DATA_DIR env var; use PYTHAINLP_DATA instead (follows NLTK_DATA convention from NLTK). PYTHAINLP_DATA_DIR will be removed in a future version (#1306)
- PYTHAINLP_READ_MODE env var; use PYTHAINLP_READ_ONLY instead. PYTHAINLP_READ_MODE will be removed in a future version (#1317)
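
For scripts that still set the old variable, a transition helper along these lines may be useful. It is a sketch under assumptions: `data_dir_from_env` is a hypothetical name, and the precedence shown (new variable wins over the deprecated one) is assumed rather than taken from the library.

```python
import os
import warnings


def data_dir_from_env(default: str) -> str:
    """Prefer PYTHAINLP_DATA; fall back to the deprecated PYTHAINLP_DATA_DIR."""
    new_value = os.environ.get("PYTHAINLP_DATA")
    if new_value:
        return new_value
    old_value = os.environ.get("PYTHAINLP_DATA_DIR")
    if old_value:
        # Assumed behavior: keep working, but nudge users to migrate.
        warnings.warn(
            "PYTHAINLP_DATA_DIR is deprecated; set PYTHAINLP_DATA instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return old_value
    return default
```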
Removed
- Duplicated entries in Volubilis dictionary (#1200)
- Star imports (#1207)
- requests dependency (#1211)
- pythainlp.util.is_native_thai (deprecated since v5.0); use pythainlp.morpheme.is_native_thai instead (#1315)
Fixed
- royin romanization: Consonant cluster boundary (#1172)
- check_marttra(): Final consonant classification (#1173)
- Base dependencies (#1185)
- tltk transliteration: Kho Khon alphabet issue in (#1187)
- Fix tone_detector and sound_syllable bugs (#1197)
- normalize(): Remove spaces before tone marks and non-base characters (#1222)
- Suppress Gensim duplicate-word warnings when loading word2vec binary files (#1316)
- db.json: created lazily only when a corpus is first downloaded (#1317)
- newmm tokenization: Exponential-time explosion when text has many ambiguous breaking points (#1319)
- Trie: Reduce memory usage and speed up TCC boundary lookups (#1323)
Security
- Prevent path traversal and symlink attacks in archive extraction (#1225)
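
As background on this class of fix, here is a minimal sketch of the usual defense when extracting a tar archive with the standard library. It is a generic illustration, not the code from #1225; `safe_members` and `safe_extract` are hypothetical names, and rejecting every link member is a deliberately strict assumption.

```python
import os
import tarfile


def safe_members(archive: tarfile.TarFile, dest: str):
    """Yield only members that stay inside dest and are not links."""
    root = os.path.realpath(dest)
    for member in archive.getmembers():
        if member.issym() or member.islnk():
            # Links can point outside dest, so this sketch refuses them.
            raise ValueError(f"Refusing link member: {member.name}")
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            # Catches "../" components and absolute member names.
            raise ValueError(f"Refusing path outside destination: {member.name}")
        yield member


def safe_extract(archive_path: str, dest: str) -> None:
    """Validate every member first, then extract only the approved ones."""
    with tarfile.open(archive_path) as archive:
        members = list(safe_members(archive, dest))
        archive.extractall(dest, members=members)
```

Validation runs over the whole archive before anything is written, so a malicious entry late in the archive cannot leave a partial extraction behind. On Python 3.12 and later, tarfile's built-in extraction filters (for example `filter="data"`) cover similar ground.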
New contributors
- @Copilot made their first contribution in #1172
- @what-in-the-nim made their first contribution in #1185
Full Changelog: v5.2.0...v5.3.0