feat: add Philippines TIN (PH_TIN) recognizer #2016
Open
aaronaco wants to merge 6 commits into microsoft:main from aaronaco:feat/philippines-tin-number-recognizer
0f77cb7 feat: add Philippines TIN recognizer and related tests (aaronaco)
8164de0 Merge branch 'main' into feat/philippines-tin-number-recognizer (omri374)
4fed892 fix: address PH_TIN review cleanup (aaronaco)
7b38d7b fix: lower PH_TIN base scores (aaronaco)
cc5e457 test: tighten PH_TIN score expectations (aaronaco)
737c728 fix: restore recognizer config order (aaronaco)
5 changes: 5 additions & 0 deletions
...nalyzer/presidio_analyzer/predefined_recognizers/country_specific/philippines/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| """Philippines-specific recognizers package.""" | ||
|
|
||
| from .ph_tin_recognizer import PhTinRecognizer | ||
|
|
||
| __all__ = ["PhTinRecognizer"] |
106 changes: 106 additions & 0 deletions
...residio_analyzer/predefined_recognizers/country_specific/philippines/ph_tin_recognizer.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,106 @@ | ||
| from typing import List, Optional, Tuple | ||
|
|
||
| from presidio_analyzer import Pattern, PatternRecognizer | ||
|
|
||
|
|
||
| class PhTinRecognizer(PatternRecognizer): | ||
| """ | ||
| Recognizes Philippines Taxpayer Identification Number (TIN). | ||
|
|
||
| The TIN is a 9 or 12-digit number issued by the Bureau of Internal Revenue (BIR). | ||
| The 9th digit is a check digit calculated using a weighted modulo 11 algorithm. | ||
| The last 3 digits (in the 12-digit version) represent the branch code (default 000). | ||
|
|
||
| Formats: XXXXXXXXX, XXXXXXXXXXXX, XXX-XXX-XXX, or XXX-XXX-XXX-XXX | ||
| Reference: https://www.bir.gov.ph/ | ||
|
|
||
| :param patterns: List of patterns to be used by this recognizer | ||
| :param context: List of context words to increase confidence in detection | ||
| :param supported_language: Language this recognizer supports | ||
| :param supported_entity: The entity this recognizer can detect | ||
| :param replacement_pairs: List of tuples with potential replacement values | ||
| """ | ||
|
|
||
| PATTERNS = [ | ||
| Pattern( | ||
| "TIN (Low)", | ||
| r"\b(\d{3}-\d{3}-\d{3}(-\d{3})?)\b", | ||
| 0.05, | ||
| ), | ||
| Pattern( | ||
| "TIN (Very Low)", | ||
| r"\b(\d{9}|\d{12})\b", | ||
| 0.01, | ||
| ), | ||
| ] | ||
|
|
||
| CONTEXT = [ | ||
| "tin", | ||
| "taxpayer identification number", | ||
| "bir", | ||
| "taxpayer id", | ||
| "tax id", | ||
| "rdo", | ||
| "revenue district office", | ||
| ] | ||
|
|
||
| def __init__( | ||
| self, | ||
| patterns: Optional[List[Pattern]] = None, | ||
| context: Optional[List[str]] = None, | ||
| supported_language: str = "en", | ||
| supported_entity: str = "PH_TIN", | ||
| replacement_pairs: Optional[List[Tuple[str, str]]] = None, | ||
| name: Optional[str] = None, | ||
| ): | ||
| self.replacement_pairs = ( | ||
| replacement_pairs if replacement_pairs else [("-", ""), (" ", "")] | ||
| ) | ||
| patterns = patterns if patterns else self.PATTERNS | ||
| context = context if context else self.CONTEXT | ||
| super().__init__( | ||
| supported_entity=supported_entity, | ||
| patterns=patterns, | ||
| context=context, | ||
| supported_language=supported_language, | ||
| name=name, | ||
| ) | ||
|
|
||
| def invalidate_result(self, pattern_text: str) -> bool: | ||
| """ | ||
| Check if the Philippines TIN fails weighted modulo 11 validation. | ||
|
|
||
| :param pattern_text: The text to validate | ||
| :return: True if invalid, False otherwise | ||
| """ | ||
| return not self._is_valid_tin(pattern_text) | ||
|
|
||
| def _is_valid_tin(self, pattern_text: str) -> bool: | ||
| """Validate the Philippines TIN using weighted modulo 11.""" | ||
| # Clean the input | ||
| for search, replace in self.replacement_pairs: | ||
| pattern_text = pattern_text.replace(search, replace) | ||
|
|
||
| if not pattern_text.isdigit(): | ||
| return False | ||
|
|
||
| if len(pattern_text) not in (9, 12): | ||
| return False | ||
|
|
||
| # Weights for the first 8 digits | ||
| weights = [9, 8, 7, 6, 5, 4, 3, 2] | ||
|
|
||
| # Calculate sum of first 8 digits multiplied by weights | ||
| total_sum = 0 | ||
| for i in range(8): | ||
| total_sum += int(pattern_text[i]) * weights[i] | ||
|
|
||
| # Modulo 11 of the sum | ||
| remainder = total_sum % 11 | ||
|
|
||
| # The 9th digit is the check digit | ||
| # Note: a remainder of 10 cannot match a single check digit, so it always fails. | ||
| # Most implementations for BIR TIN treat the remainder as the check digit. | ||
| check_digit = int(pattern_text[8]) | ||
|
|
||
| return remainder == check_digit |
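
The check in `_is_valid_tin` can be followed by hand with a small standalone sketch. It uses only the standard library and mirrors the weights and cleaning steps shown in the diff; the function names `weighted_remainder` and `is_valid_tin` are hypothetical and not part of the PR. It works through the sample values used in the tests.

```python
# Standalone sketch of the weighted modulo 11 check described in the diff.
# The helper names below are illustrative only; they are not Presidio APIs.

WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2)  # weights for the first 8 digits


def weighted_remainder(digits: str) -> int:
    """Return (sum of the first 8 digits times their weights) modulo 11."""
    return sum(int(d) * w for d, w in zip(digits[:8], WEIGHTS)) % 11


def is_valid_tin(tin: str) -> bool:
    """Apply the same cleaning, length, and checksum steps as the diff."""
    digits = tin.replace("-", "").replace(" ", "")
    if not digits.isdigit() or len(digits) not in (9, 12):
        return False
    # The 9th digit is the check digit; a remainder of 10 can never match it.
    return weighted_remainder(digits) == int(digits[8])


if __name__ == "__main__":
    # 000-123-456: 1*6 + 2*5 + 3*4 + 4*3 + 5*2 = 50, and 50 % 11 = 6
    print(weighted_remainder("000123456"))   # 6
    print(is_valid_tin("000-123-456-000"))   # True
    print(is_valid_tin("000-123-457-000"))   # False (check digit 7, remainder 6)
    print(is_valid_tin("600-000-000"))       # False (6*9 = 54, 54 % 11 = 10)
```

The branch code (digits 10 to 12) plays no part in the checksum, which is why `000-123-456-000` and `000-123-456-001` both pass.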
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| import pytest | ||
| from presidio_analyzer.predefined_recognizers.country_specific.philippines import ( | ||
| PhTinRecognizer, | ||
| ) | ||
|
|
||
| from tests import assert_result_within_score_range | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module") | ||
| def recognizer(): | ||
| """Return an instance of the PhTinRecognizer.""" | ||
| return PhTinRecognizer() | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module") | ||
| def entities(): | ||
| """Return the entities supported by this recognizer.""" | ||
| return ["PH_TIN"] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "text, expected_len, expected_positions, expected_score_ranges", | ||
| [ | ||
| # Valid TINs (using weighted modulo 11: 000-123-456-000 -> rem 6) | ||
| ("My TIN is 000-123-456-000", 1, [(10, 25)], [(0.049, 0.051)]), | ||
| ("TIN: 000123456", 1, [(5, 14)], [(0.009, 0.011)]), | ||
| ("BIR TIN: 000123456000", 1, [(9, 21)], [(0.009, 0.011)]), | ||
| ("Tax ID: 000-123-456-001", 1, [(8, 23)], [(0.049, 0.051)]), | ||
| ("TIN 000-123-456", 1, [(4, 15)], [(0.049, 0.051)]), | ||
| # Invalid TINs (wrong checksum) | ||
| ("Invalid TIN 000-123-457-000", 0, [], []), | ||
| ("Not a TIN 123456789", 0, [], []), | ||
| # Remainder 10 cannot match a single check digit. | ||
| ("Invalid TIN 600-000-000", 0, [], []), | ||
| # Raw recognizer tests do not apply context enhancement. | ||
| ("TIN: 000-123-456-000", 1, [(5, 20)], [(0.049, 0.051)]), | ||
| ("Please use 000-123-456-000 as your ID", 1, [(11, 26)], [(0.049, 0.051)]), | ||
| ], | ||
| ) | ||
| def test_ph_tin_recognizer( | ||
| text, expected_len, expected_positions, expected_score_ranges, recognizer, entities | ||
| ): | ||
| """Test the PhTinRecognizer.""" | ||
| results = recognizer.analyze(text, entities) | ||
| assert len(results) == expected_len | ||
| results_with_expectations = zip(results, expected_positions, expected_score_ranges) | ||
| for res, pos, score_range in results_with_expectations: | ||
| assert_result_within_score_range( | ||
| res, entities[0], pos[0], pos[1], score_range[0], score_range[1] | ||
| ) |
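
The expected positions in the parametrized cases can be checked independently of Presidio. The sketch below applies the two regular expressions from the diff with the standard `re` module and reports each match span. This is an approximation: Presidio compiles patterns with its own engine and flags, and the helper `find_candidates` is hypothetical.

```python
import re
from typing import List, Tuple

# The two patterns from the diff, compiled with the standard library.
FORMATTED = re.compile(r"\b(\d{3}-\d{3}-\d{3}(-\d{3})?)\b")
UNFORMATTED = re.compile(r"\b(\d{9}|\d{12})\b")


def find_candidates(text: str) -> List[Tuple[str, int, int]]:
    """Return (pattern label, start, end) for every regex match in text.

    This covers the pattern-matching step only; checksum validation
    is a separate step and is not applied here.
    """
    spans = []
    for label, pattern in (("formatted", FORMATTED), ("unformatted", UNFORMATTED)):
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end()))
    return spans


if __name__ == "__main__":
    print(find_candidates("My TIN is 000-123-456-000"))  # [('formatted', 10, 25)]
    print(find_candidates("BIR TIN: 000123456000"))      # [('unformatted', 9, 21)]
    print(find_candidates("TIN 000-123-456"))            # [('formatted', 4, 15)]
```

Note that `Not a TIN 123456789` still produces a regex candidate at this stage; it is only dropped in the tests because the checksum step rejects it.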
What about Filipino?
Good point. I think adding Filipino is broader than this PR, since Presidio's analyzer and registry languages need to match and there isn't an existing Filipino/Tagalog NLP config.
PH_TIN validation is language-independent, so it still works when the analyzer runs under an enabled language like `en`.