Add US CLIA recognizer for clinical laboratory identifiers #2029
Manas103 wants to merge 1 commit into
CLIA (Clinical Laboratory Improvement Amendments) numbers are 10-character
identifiers issued by CMS to certify clinical laboratories. They appear on
lab orders, lab reports, and Medicare claims for laboratory services.
Format: NN D NNNNNNN (2 digits, literal 'D', 7 digits).
There is no publicly documented checksum for CLIA numbers, so the base
regex carries a low confidence score (0.1) and the recognizer relies on
context words ('CLIA', 'lab', 'laboratory', etc.) to reach a meaningful
score. Degenerate trailing sequences (all identical digits) are rejected
via invalidate_result.
- New recognizer: UsCliaRecognizer (entity US_CLIA, COUNTRY_CODE us)
- Registered in default_recognizers.yaml (enabled: false, matching MBI/NPI)
- 17 parametrized + behavior tests; 100% line coverage on the new module
- CHANGELOG and supported_entities.md updated
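A minimal sketch of the behaviour described in this commit message, written with Python's standard `re` module rather than Presidio's `PatternRecognizer`. The regex, the constant names, and both helper functions are hypothetical stand-ins; only the format (2 digits, literal `D`, 7 digits), the 0.1 base score, and the identical-trailing-digits rule are taken from the description above.

```python
import re

# Hypothetical stand-in for the recognizer's base pattern:
# 2 digits, a literal 'D', then 7 digits, delimited by word boundaries.
CLIA_PATTERN = re.compile(r"\b\d{2}D\d{7}\b")

# Base confidence stated in the commit message for the regex-only match.
BASE_SCORE = 0.1


def is_degenerate(candidate: str) -> bool:
    """Return True when all 7 trailing digits of the candidate are identical."""
    trailing = candidate[-7:]
    return len(set(trailing)) == 1


def find_clia_candidates(text: str) -> list[tuple[int, int, float]]:
    """Return (start, end, score) for each non-degenerate CLIA-shaped match."""
    results = []
    for match in CLIA_PATTERN.finditer(text):
        if is_degenerate(match.group()):
            continue  # obvious placeholder such as 11D0000000
        results.append((match.start(), match.end(), BASE_SCORE))
    return results
```

In the actual recognizer the rejection step lives in `invalidate_result`; the sketch keeps it as a plain function so it can be read without the Presidio base class.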
@Manas103 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
        name=name,
    )

    def invalidate_result(self, pattern_text: str) -> bool:  # noqa: D102
Since this doesn't add any guards beyond what the regex already enforces, I suggest removing it.
    [
        # fmt: off
        # Valid CLIA, weak base score (no separators, no context)
        ("11D2030122", 1, ((0, 10),), ((0.0, 0.4),),),
Please add another example where the CLIA number appears in the middle of a sentence.
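One way such a case could look, as a sketch only. The sentence is invented for illustration, the regex is a hypothetical stand-in for the recognizer's pattern, and the score range is copied from the existing case on the assumption that a mid-sentence match keeps the same weak base score. The tuple shape (text, expected count, positions, score ranges) mirrors the parametrized entry quoted above.

```python
import re

# Hypothetical stand-in for the recognizer's weak pattern.
CLIA_PATTERN = re.compile(r"\b\d{2}D\d{7}\b")

sentence = "Results were reported by facility 11D2030122 on the following Monday."
clia = "11D2030122"

# Derive the expected span from the text instead of hard-coding offsets.
start = sentence.index(clia)
end = start + len(clia)

# Same shape as the existing parametrized case; the score range is an assumption.
mid_sentence_case = (sentence, 1, ((start, end),), ((0.0, 0.4),))

# Check that the stand-in pattern finds exactly that span.
matches = [(m.start(), m.end()) for m in CLIA_PATTERN.finditer(sentence)]
```

Deriving the offsets with `str.index` avoids off-by-one mistakes when the surrounding sentence is edited later.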
Pull request overview
Adds first-class Presidio Analyzer coverage for US CLIA (Clinical Laboratory Improvement Amendments) identifiers by introducing a new predefined, US-specific pattern recognizer and wiring it into the registry, tests, and public documentation.
Changes:
- Added `UsCliaRecognizer` (`US_CLIA`) as a country-specific predefined `PatternRecognizer` with weak/medium regex tiers plus placeholder/degenerate-value invalidation.
- Exported the recognizer through the predefined recognizers package and registered it in `default_recognizers.yaml` (disabled by default).
- Added unit tests, updated supported entities documentation, and documented the change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_clia_recognizer.py | Implements the new US CLIA recognizer with patterns, context terms, and invalidation logic. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/__init__.py | Re-exports UsCliaRecognizer from the US country-specific package. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Re-exports UsCliaRecognizer from the top-level predefined recognizers package. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Registers UsCliaRecognizer as predefined, US-tagged, and disabled by default. |
| presidio-analyzer/tests/test_us_clia_recognizer.py | Adds unit coverage for matching, boundaries, scoring ranges, and metadata expectations. |
| docs/supported_entities.md | Documents the new US_CLIA entity type in the supported entities list. |
| CHANGELOG.md | Adds an Unreleased Analyzer “Added” entry for the new recognizer. |
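The recognizer's reliance on context terms can be illustrated with a deliberately simplified sketch. This is not Presidio's context-enhancement code: the function name, the boost value, the minimum score, and the plain token matching are all assumptions chosen for illustration, and the context words are a subset of those named in this PR.

```python
import re

# Subset of the context words listed in the PR.
CONTEXT_WORDS = ("clia", "lab", "laboratory")

# Illustrative values only; not claimed to be Presidio's configuration.
BOOST = 0.35
MIN_SCORE_WITH_CONTEXT = 0.4


def score_with_context(base_score: float, surrounding_text: str) -> float:
    """Raise a weak regex score when a context word appears near the match."""
    tokens = set(re.findall(r"[a-z]+", surrounding_text.lower()))
    if tokens.isdisjoint(CONTEXT_WORDS):
        return base_score  # no supporting context, keep the weak score
    boosted = max(base_score + BOOST, MIN_SCORE_WITH_CONTEXT)
    return min(boosted, 1.0)  # never exceed full confidence
```

The point of the design is that a bare 10-character match stays near the 0.1 base score, while the same match next to a word such as "CLIA" or "lab" is lifted to a score that downstream thresholds are more likely to accept.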
Change Description
Adds a new predefined recognizer, `UsCliaRecognizer`, for the US Clinical Laboratory Improvement Amendments (CLIA) number, a 10-character identifier issued by CMS to certify clinical laboratories. CLIA numbers appear on lab orders, lab reports, and Medicare claims for laboratory services, and they are a common form of PHI in healthcare text.
Format (per CMS): `NN D NNNNNNN`, where `D` designates "lab".
Example: `11D2030122`.
Reference: https://www.cms.gov/medicare/quality/clinical-laboratory-improvement-amendments
Why no checksum?
CLIA numbers do not have a publicly documented check-digit algorithm. The base regex therefore carries a low confidence (`0.1` weak / `0.4` with separators), and the recognizer leans on context words (`CLIA`, `lab`, `laboratory`, `clinical laboratory`, `lab id`, `lab number`, …) to reach a confident match via the analyzer's context-enhancement layer. Degenerate sequences where all 7 trailing digits are identical (e.g. `11D0000000`, `11D1111111`) are rejected in `invalidate_result` as obvious placeholders. This matches the regex-only-with-context approach already used elsewhere in the codebase (e.g. `UsMbiRecognizer`, `AbaRoutingRecognizer`'s weak tier).
What changed
- `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/us/us_clia_recognizer.py`: `UsCliaRecognizer(PatternRecognizer)` with `COUNTRY_CODE = "us"`, `SUPPORTED_ENTITY = "US_CLIA"`, two patterns (weak / with separators), 8 context words, and an `invalidate_result` that filters degenerate trailing sequences.
- `predefined_recognizers/__init__.py` and `predefined_recognizers/country_specific/us/__init__.py`, with `__all__` updated.
- `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml` with `enabled: false` and `country_code: us`, matching the convention used by `UsMbiRecognizer` and `UsNpiRecognizer`.
- `presidio-analyzer/tests/test_us_clia_recognizer.py` with 12 parametrized cases (valid, lowercased, dashed, spaced, multi-match, malformed, degenerate) plus 5 behavior tests (supported_entity, supported_language, context words, country code).
- `docs/supported_entities.md` row added; `CHANGELOG.md` Unreleased Analyzer/Added bullet.
Issue reference
No existing issue tracks this; CLIA was a notable gap in the US healthcare-PII coverage alongside the already-present `US_NPI`, `US_MBI`, and `MEDICAL_LICENSE` (DEA) recognizers.
How to verify
    cd presidio-analyzer
    pytest tests/test_us_clia_recognizer.py -v
All 17 tests pass. The new module has 100% line coverage (measured via `coverage run --source=...us_clia_recognizer -m pytest tests/test_us_clia_recognizer.py`).
Smoke check:
Checklist