| SPDX-FileCopyrightText | 2026 PyThaiNLP Project |
|---|---|
| SPDX-FileType | DOCUMENTATION |
| SPDX-License-Identifier | Apache-2.0 |
This directory contains tests that verify the integrity, format, parseability, and catalog functionality of corpus in PyThaiNLP.
These tests are separate from regular unit tests because:
- They test actual file loading and parsing (not mocked)
- Downloadable corpus tests require network access and can be slow
- They verify corpus format and structure
- They test corpus catalog download and query functionality
- They should only run when corpus files or corpus code changes
Tests corpus catalog functionality:
- Catalog download from remote server
- Catalog URL and path validation
- Catalog JSON structure verification
- Querying specific corpus details
- Version information validation
Tests corpus files that are included in the package:
- Text word lists (negations, stopwords, syllables, words, etc.)
- CSV files (provinces)
- Frequency data (TNC, TTC)
- Name lists (family names, person names)
Tests corpus files that need to be downloaded:
- OSCAR word frequencies (96MB)
- TNC bigram/trigram frequencies (41MB + 145MB)
This test will take longer time than others due to size of the downloads.
Run all corpus tests:
python -m unittest discover -s tests/corpus -vRun only catalog tests:
python -m unittest tests.corpus.test_catalog -vRun only built-in corpus tests:
python -m unittest tests.corpus.test_builtin_corpus -vRun only downloadable corpus tests:
python -m unittest tests.corpus.test_downloadable_corpus -vThe corpus test runs automatically via
GitHub Actions workflow (.github/workflows/corpus.yml) when:
- Changes are made to
pythainlp/corpus/** - Changes are made to
tests/corpus/** - The workflow file itself is modified
Each test verifies:
- Loadability: File can be loaded without errors
- Type correctness: Returns expected data type (frozenset, list, dict)
- Non-empty: Contains actual data
- Format validity: Data structure matches expected format
- Content validity: Contains expected content (e.g., Thai characters)
- Catalog functionality: Catalog can be downloaded and queried correctly
When adding a new corpus file or function to pythainlp.corpus:
- Add a test to
test_builtin_corpus.pyif it's included in the package - Add a test to
test_downloadable_corpus.pyif it requires download - Add a test to
test_catalog.pyif it involves catalog operations - Verify the test catches format errors by temporarily breaking the corpus
- Unit tests (
tests/core/test_corpus.py): Use mocks for speed, test code logic - Corpus test (this directory): Use real data, test file integrity and catalog
Both test suites are important and complementary.