feat: add Philippines TIN (PH_TIN) recognizer #2016
@microsoft-github-policy-service agree
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces a new Philippines-specific PII recognizer (PH_TIN) to detect and validate Philippine Taxpayer Identification Numbers (TIN) using regex + context + checksum validation, and wires it into configuration, tests, and docs.
Changes:
- Added `PhTinRecognizer` with regex patterns, PH-specific context terms, and weighted modulo-11 validation.
- Integrated the recognizer into analyzer registries/configs and context-sentence datasets.
- Added/updated unit tests and documentation (supported entities + changelog).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/philippines/ph_tin_recognizer.py | Adds the PhTinRecognizer implementation (patterns, context, checksum validation). |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/philippines/__init__.py | Exposes Philippines recognizers package exports. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Registers PhTinRecognizer in the predefined recognizers public API. |
| presidio-analyzer/tests/test_ph_tin_recognizer.py | Adds unit tests for TIN detection + validation. |
| presidio-analyzer/tests/test_context_support.py | Adds PH_TIN to the context enhancer test harness and updates dataset size check. |
| presidio-analyzer/tests/data/context_sentences_tests.txt | Adds PH TIN context sentences for the context enhancer dataset. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Registers recognizer in default recognizers list (disabled by default). |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Registers recognizer in full analyzer config (disabled by default). |
| presidio-analyzer/presidio_analyzer/conf/slim.yaml | Registers recognizer in slim analyzer config (disabled by default). |
| e2e-tests/resources/test_ollama_enabled_recognizers.yaml | Registers recognizer in e2e test recognizers config (disabled by default). |
| docs/supported_entities.md | Documents PH_TIN as a supported entity. |
| CHANGELOG.md | Adds a changelog entry for the new PH_TIN recognizer. |

    The 9th digit is a check digit calculated using a weighted modulo 11 algorithm.
    The last 3 digits (in the 12-digit version) represent the branch code (default 000).

    Format: XXX-XXX-XXX-XXX or XXXXXXXXXXXX
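
The check-digit rule described in the docstring excerpt above can be sketched roughly as follows. This is a minimal illustration, not the recognizer's actual code: the weights `(9, 8, 7, 6, 5, 4, 3, 2)`, the rule "check digit equals the weighted sum modulo 11", and the helper name `is_valid_ph_tin` are all assumptions, picked only because they agree with the sample values quoted in this PR's tests (`000-123-456` accepted, `000-123-457` and `123456789` rejected). Other weightings would fit the same samples.

```python
# Hypothetical weights for the first eight digits. They are an assumption that
# happens to match the test values quoted in this PR, not the PR's own table.
ASSUMED_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2)


def is_valid_ph_tin(text: str) -> bool:
    """Check a 9- or 12-digit TIN candidate against the assumed checksum rule."""
    digits = text.replace("-", "")
    # Reject anything that is not plain ASCII digits of the expected length.
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (9, 12):
        return False
    weighted_sum = sum(int(d) * w for d, w in zip(digits[:8], ASSUMED_WEIGHTS))
    # The 9th digit is the check digit; a remainder of 10 can never match it.
    # Any branch code (digits 10-12) is ignored by the checksum in this sketch.
    return weighted_sum % 11 == int(digits[8])


print(is_valid_ph_tin("000-123-456-000"))
print(is_valid_ph_tin("000-123-457-000"))
```

Under these assumptions the first call accepts the number and the second rejects it, mirroring the valid and invalid cases in the test data.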

    - name: PhTinRecognizer
      supported_languages:
      - en
      type: predefined

    #### Added
    - Canadian SIN (`CA_SIN`) recognizer for the Canadian Social Insurance Number, using regex pattern matching, context words (English and French), and Luhn checksum validation. Disabled by default.

    - Philippines TIN (`PH_TIN`) recognizer for the Philippines Taxpayer Identification Number, using regex pattern matching, context words, and weighted modulo 11 checksum validation.

    ("My TIN is 000-123-456-000", 1, [(10, 25)], [(0.1, 1.0)]),
    ("BIR TIN: 000123456", 1, [(9, 18)], [(0.1, 1.0)]),
    ("Tax ID: 000-123-456-001", 1, [(8, 23)], [(0.1, 1.0)]),
    # Valid 9-digit with hyphens
    ("TIN 000-123-456", 1, [(4, 15)], [(0.1, 1.0)]),
    # Invalid TINs (wrong checksum)
    ("Invalid TIN 000-123-457-000", 0, [], []),
    ("Not a TIN 123456789", 0, [], []),
    # Context tests
    ("TIN: 000-123-456-000", 1, [(5, 20)], [(0.1, 1.0)]),
    ("Please use 000-123-456-000 as your ID", 1, [(11, 26)], [(0.1, 1.0)]),

    if not pattern_text.isdigit():
        return False

    if len(pattern_text) not in [9, 12]:

    Pattern(
        "TIN (High)",
        r"\b(\d{3}-\d{3}-\d{3}(-\d{3})?)\b",
        0.6,
A base score of 0.6 is very high for such a common pattern. Consider lowering it and letting context and checksum validation boost the score. I would suggest reducing it to 0.05.

    Pattern(
        "TIN (Medium)",
        r"\b(\d{9}|\d{12})\b",
        0.3,
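
To see what the two regexes quoted in these snippets match on their own, before any context boost or checksum validation, they can be tried with Python's standard `re` module. This is only an illustrative sketch: the helper `find_candidates` and the constant names are hypothetical, and no Presidio code is involved.

```python
import re

# The two expressions are copied from the Pattern(...) snippets in this review.
HYPHENATED = re.compile(r"\b(\d{3}-\d{3}-\d{3}(-\d{3})?)\b")
PLAIN = re.compile(r"\b(\d{9}|\d{12})\b")


def find_candidates(text: str) -> list[tuple[int, int]]:
    """Return sorted (start, end) spans matched by either pattern."""
    spans = []
    for pattern in (HYPHENATED, PLAIN):
        spans.extend(match.span() for match in pattern.finditer(text))
    return sorted(spans)


print(find_candidates("My TIN is 000-123-456-000"))    # [(10, 25)]
print(find_candidates("BIR TIN: 000123456"))           # [(9, 18)]
print(find_candidates("Invalid TIN 000-123-457-000"))  # [(12, 27)]
```

The last call shows that the regex alone still flags a number the tests expect to be rejected, so the final decision rests on validation. That is consistent with the reviewer's suggestion to keep the base pattern score low.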

    - name: PhTinRecognizer
      supported_languages:
      - en
Good point. I think adding Filipino is broader than this PR, since Presidio's analyzer and registry languages need to match and there isn't an existing Filipino/Tagalog NLP config.
PH_TIN validation is language-independent, so it still works when the analyzer runs under an enabled language such as `en`.
Ready for another look.
Change Description
This PR introduces the `PhTinRecognizer` (`PH_TIN`) to detect the Philippines Taxpayer Identification Number issued by the Bureau of Internal Revenue (BIR), as part of the broader proposal to add Philippines-specific PII recognizers.

Key changes included:
- Regex patterns for the TIN formats (`XXX-XXX-XXX` and `XXX-XXX-XXX-XXX`).
- Weighted modulo 11 checksum validation in `validate_result` to eliminate false positives and ensure high-confidence matches.
- Unit tests in `tests/test_ph_tin_recognizer.py` verifying valid/invalid structures, checksums, and context enhancement.
- Context-enhancer test coverage (`tests/data/context_sentences_tests.txt` and `tests/test_context_support.py`).
- Registration of `PhTinRecognizer` in `default_recognizers.yaml`, `default_analyzer_full.yaml`, `slim.yaml`, and test configurations, ensuring it is `enabled: false` by default per project guidelines for country-specific recognizers.
- Updates to `docs/supported_entities.md` and `CHANGELOG.md` to reflect the new feature.

Issue reference
Fixes #2015
Checklist