Add US CLIA recognizer for clinical laboratory identifiers #2029
Open
Manas103 wants to merge 1 commit into microsoft:main from Manas103:manas/add-us-clia-recognizer
105 changes: 105 additions & 0 deletions
...alyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_clia_recognizer.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| """Recognizer for US CLIA (Clinical Laboratory Improvement Amendments) numbers.""" | ||
|
|
||
| from typing import List, Optional, Tuple | ||
|
|
||
| from presidio_analyzer import EntityRecognizer, Pattern, PatternRecognizer | ||
|
|
||
|
|
||
| class UsCliaRecognizer(PatternRecognizer): | ||
| """Recognize US CLIA (Clinical Laboratory Improvement Amendments) numbers. | ||
|
|
||
| A CLIA number uniquely identifies a clinical laboratory certified under the | ||
| CLIA program administered by CMS (Centers for Medicare & Medicaid Services). | ||
| CLIA numbers appear on lab orders, lab reports, and Medicare claims for | ||
| laboratory services. | ||
|
|
||
| Format: 10 characters, ``NN D NNNNNNN`` | ||
| - Positions 1-2: 2-digit state code (numeric) | ||
| - Position 3: literal letter ``D`` (designates "lab") | ||
| - Positions 4-10: 7-digit unique sequence | ||
|
|
||
| Example: ``11D2030122`` | ||
|
|
||
| No publicly documented check-digit algorithm exists for CLIA numbers, so | ||
| this recognizer is regex + context only. The base patterns therefore carry | ||
| a low confidence and rely on surrounding context words ("CLIA", "lab", | ||
| "laboratory", "clinical") to reach a meaningful score. | ||
|
|
||
| Reference: https://www.cms.gov/medicare/quality/clinical-laboratory-improvement-amendments | ||
|
|
||
| :param patterns: List of patterns to be used by this recognizer | ||
| :param context: List of context words to increase confidence in detection | ||
| :param supported_language: Language this recognizer supports | ||
| :param supported_entity: The entity this recognizer can detect | ||
| :param replacement_pairs: List of tuples with potential replacement values | ||
| for different strings to be used during pattern matching. | ||
| This can allow a greater variety in input, for example by removing dashes | ||
| or spaces. | ||
| """ | ||
|
|
||
| COUNTRY_CODE = "us" | ||
|
|
||
| PATTERNS = [ | ||
| Pattern( | ||
| "CLIA number (weak)", | ||
| r"\b\d{2}[Dd]\d{7}\b", | ||
| 0.1, | ||
| ), | ||
| Pattern( | ||
| "CLIA number with separators (medium)", | ||
| r"\b\d{2}[ -][Dd][ -]\d{7}\b", | ||
| 0.4, | ||
| ), | ||
| ] | ||
|
|
||
| CONTEXT = [ | ||
| "clia", | ||
| "clia number", | ||
| "clia id", | ||
| "lab", | ||
| "laboratory", | ||
| "clinical laboratory", | ||
| "lab id", | ||
| "lab number", | ||
| ] | ||
|
|
||
| def __init__( | ||
| self, | ||
| patterns: Optional[List[Pattern]] = None, | ||
| context: Optional[List[str]] = None, | ||
| supported_language: str = "en", | ||
| supported_entity: str = "US_CLIA", | ||
| replacement_pairs: Optional[List[Tuple[str, str]]] = None, | ||
| name: Optional[str] = None, | ||
| ): | ||
| self.replacement_pairs = ( | ||
| replacement_pairs if replacement_pairs else [("-", ""), (" ", "")] | ||
| ) | ||
| patterns = patterns if patterns else self.PATTERNS | ||
| context = context if context else self.CONTEXT | ||
| super().__init__( | ||
| supported_entity=supported_entity, | ||
| patterns=patterns, | ||
| context=context, | ||
| supported_language=supported_language, | ||
| name=name, | ||
| ) | ||
|
|
||
| def invalidate_result(self, pattern_text: str) -> bool: # noqa: D102 | ||
| sanitized_value = EntityRecognizer.sanitize_value( | ||
| pattern_text, self.replacement_pairs | ||
| ) | ||
| # Defensive: pattern already enforces length, but guard anyway. | ||
| if len(sanitized_value) != 10: | ||
| return True | ||
| # Position 3 must be a 'D' (case-insensitive); already enforced by the | ||
| # regex, but kept as an explicit assertion for the no-separator path. | ||
| if sanitized_value[2].upper() != "D": | ||
| return True | ||
| # Reject degenerate sequences where all 7 trailing digits are identical | ||
| # (e.g., 11D0000000, 11D1111111). These almost certainly represent | ||
| # placeholders rather than real CLIA numbers. | ||
| trailing = sanitized_value[3:] | ||
| if len(set(trailing)) == 1: | ||
| return True | ||
| return False | ||
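The matching logic in this file can be tried without installing Presidio. The following is a minimal sketch that uses only the standard library: the two regular expressions and the degenerate-sequence rule are copied from the diff above, while the helper names `is_degenerate` and `find_clia_candidates` are hypothetical and not part of the PR. It does not reproduce Presidio's scoring or context enhancement.

```python
import re
from typing import List, Tuple

# Same two expressions as the PATTERNS list in the recognizer above.
CLIA_PLAIN = re.compile(r"\b\d{2}[Dd]\d{7}\b")
CLIA_SEPARATED = re.compile(r"\b\d{2}[ -][Dd][ -]\d{7}\b")


def is_degenerate(candidate: str) -> bool:
    """Return True when the seven trailing digits are all identical."""
    # Mirror the recognizer's default replacement pairs: drop dashes and spaces.
    sanitized = candidate.replace("-", "").replace(" ", "")
    return len(set(sanitized[3:])) == 1


def find_clia_candidates(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every non-degenerate match."""
    hits = []
    for pattern in (CLIA_PLAIN, CLIA_SEPARATED):
        for match in pattern.finditer(text):
            if not is_degenerate(match.group()):
                hits.append((match.start(), match.end(), match.group()))
    return sorted(hits)


if __name__ == "__main__":
    for sample in ("11D2030122", "11-D-2030122", "11D0000000", "11X2030122"):
        print(sample, find_clia_candidates(sample))
```

Because the sketch applies only the regex and the placeholder filter, any match it returns would still carry the low base score in the real recognizer until context words raise it.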
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| import pytest | ||
|
|
||
| from tests import assert_result_within_score_range | ||
| from presidio_analyzer.predefined_recognizers import UsCliaRecognizer | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module") | ||
| def recognizer(): | ||
| return UsCliaRecognizer() | ||
|
|
||
|
|
||
| @pytest.fixture(scope="module") | ||
| def entities(): | ||
| return ["US_CLIA"] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "text, expected_len, expected_positions, expected_score_ranges", | ||
| [ | ||
| # fmt: off | ||
| # Valid CLIA, weak base score (no separators, no context) | ||
| ("11D2030122", 1, ((0, 10),), ((0.0, 0.4),),), | ||
|
Collaborator: Please add another example where the CLIA number is in the middle of a sentence. |
||
| # Valid CLIA, lowercase 'd' is also accepted (case-insensitive global flag) | ||
| ("11d2030122", 1, ((0, 10),), ((0.0, 0.4),),), | ||
| # Valid CLIA with dashes — medium score from the separator pattern | ||
| ("11-D-2030122", 1, ((0, 12),), ((0.3, 0.6),),), | ||
| # Valid CLIA with spaces — medium score from the separator pattern | ||
| ("11 D 2030122", 1, ((0, 12),), ((0.3, 0.6),),), | ||
| # CLIA inside text with explicit "CLIA:" prefix — base regex score; the | ||
| # AnalyzerEngine layer applies context boosting on top of this. | ||
| ( | ||
| "CLIA: 11D2030122", | ||
| 1, ((6, 16),), ((0.0, 0.4),), | ||
| ), | ||
| # CLIA in a sentence with "laboratory" context — bare recognizer score | ||
| ( | ||
| "Laboratory ID 22D9876543 was on the report", | ||
| 1, ((14, 24),), ((0.0, 0.4),), | ||
| ), | ||
| # Multiple CLIA numbers in text | ||
| ( | ||
| "Labs 11D2030122 and 33D4455667 sent results", | ||
| 2, ((5, 15), (20, 30),), ((0.0, 0.4), (0.0, 0.4),), | ||
| ), | ||
| # Invalid: position 3 is not 'D' | ||
| ("11X2030122", 0, (), (),), | ||
| # Invalid: too short (9 chars) | ||
| ("11D203012", 0, (), (),), | ||
| # Invalid: too long (11 chars) | ||
| ("11D20301223", 0, (), (),), | ||
| # Invalid: starts with a letter (positions 1-2 must be digits) | ||
| ("AAD2030122", 0, (), (),), | ||
| # Invalid: all trailing digits identical — degenerate, invalidated | ||
| ("11D0000000", 0, (), (),), | ||
| ("11D1111111", 0, (), (),), | ||
| # fmt: on | ||
| ], | ||
| ) | ||
| def test_when_clia_in_text_then_all_us_clias_are_found( | ||
| text, | ||
| expected_len, | ||
| expected_positions, | ||
| expected_score_ranges, | ||
| recognizer, | ||
| entities, | ||
| max_score, | ||
| ): | ||
| results = recognizer.analyze(text, entities) | ||
| results = sorted(results, key=lambda x: x.start) | ||
| assert len(results) == expected_len | ||
| for res, (st_pos, fn_pos), (st_score, fn_score) in zip( | ||
| results, expected_positions, expected_score_ranges | ||
| ): | ||
| if st_score == "max": | ||
| st_score = max_score | ||
| if fn_score == "max": | ||
| fn_score = max_score | ||
| assert_result_within_score_range( | ||
| res, entities[0], st_pos, fn_pos, st_score, fn_score | ||
| ) | ||
|
|
||
|
|
||
| def test_clia_recognizer_supported_entity(recognizer): | ||
| """Test that recognizer supports the correct entity.""" | ||
| assert recognizer.supported_entities == ["US_CLIA"] | ||
|
|
||
|
|
||
| def test_clia_recognizer_supported_language(recognizer): | ||
| """Test that recognizer supports English by default.""" | ||
| assert recognizer.supported_language == "en" | ||
|
|
||
|
|
||
| def test_clia_recognizer_context_words(recognizer): | ||
| """Test that recognizer has appropriate context words.""" | ||
| expected_context = [ | ||
| "clia", | ||
| "clia number", | ||
| "clia id", | ||
| "lab", | ||
| "laboratory", | ||
| "clinical laboratory", | ||
| "lab id", | ||
| "lab number", | ||
| ] | ||
| assert recognizer.context == expected_context | ||
|
|
||
|
|
||
| def test_clia_recognizer_country_code(recognizer): | ||
| """Test that recognizer is tagged as US-specific.""" | ||
| assert recognizer.COUNTRY_CODE == "us" | ||
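The review comment on the first test row asks for a case with the CLIA number in the middle of a sentence. One way to build such a row is to derive the offsets with the standard library instead of counting characters by hand. This is a sketch only: the sentence is hypothetical, the regex is the no-separator pattern from the recognizer, and the score range is assumed to match the other no-separator rows.

```python
import re

# Hypothetical sentence; the CLIA value is the example from the recognizer docstring.
text = "The specimen was processed by lab 11D2030122 on Tuesday."

# The no-separator pattern from the recognizer's PATTERNS list.
match = re.search(r"\b\d{2}[Dd]\d{7}\b", text)
assert match is not None, "the sample sentence should contain one CLIA number"
start, end = match.span()

# Row in the shape used by the parametrized test:
# (text, expected_len, expected_positions, expected_score_ranges)
# The (0.0, 0.4) range is an assumption copied from the other no-separator rows.
mid_sentence_case = (text, 1, ((start, end),), ((0.0, 0.4),))

if __name__ == "__main__":
    print(mid_sentence_case)
```

The resulting tuple could then be pasted into the parametrize list with the computed offsets written out as literals, matching the style of the existing rows.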
As this doesn't add any guards beyond the regex, I would suggest removing it.