docs: Add custom Docker images documentation for multi-language support #1890
haroldfabla2-hue wants to merge 2 commits into
Addresses issue microsoft#1663

- Added comprehensive guide for building custom Docker images
- Documented how to add support for additional languages
- Included typical pitfalls and troubleshooting tips
- Added memory optimization guidance for multi-language deployments

@haroldfabla2-hue please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

Contributor License Agreement

Contribution License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),

Pull request overview
Adds a new "Building Custom Docker Images" section to docs/installation.md aimed at helping users build multi-language Presidio Docker images (issue #1663). However, the new content describes a configuration mechanism (env-var driven NER_RECOGNIZERS / DEFAULT_LANGUAGES) that does not exist in Presidio, references non-existent spaCy models, and misdescribes the repo's compose files, so it would mislead users rather than help them.
Changes:
- New section in `docs/installation.md` with a custom Dockerfile, multi-language example, and troubleshooting tips.
- Documents (fabricated) env vars `NER_RECOGNIZERS` / `DEFAULT_LANGUAGES` and uses invalid spaCy model names like `es_core_web_sm`.
- Inaccurate one-liner descriptions of the existing `docker-compose-*.yml` files.
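
To make the model-name problem concrete, here is a minimal sketch of a check that could be run over the names used in the new section. It assumes only the five languages that appear in this PR; the `GENRE_BY_LANG` table and the `suggest_model_name` helper are hypothetical names introduced for illustration, not part of Presidio or spaCy. The one fact it relies on is that spaCy publishes its English pipelines as `en_core_web_*`, while the Spanish, French, German, and Italian pipelines are published as `*_core_news_*`.

```python
import re

# Genre segment used in spaCy pipeline names for the languages in this PR.
# Assumption: this table is hand-written for illustration and is not exhaustive.
GENRE_BY_LANG = {"en": "web", "es": "news", "fr": "news", "de": "news", "it": "news"}

# Matches names of the form <lang>_core_<genre>_<size>, e.g. es_core_news_sm.
MODEL_NAME = re.compile(r"^(?P<lang>[a-z]{2})_core_(?P<genre>[a-z]+)_(?P<size>sm|md|lg)$")


def suggest_model_name(name: str) -> str:
    """Return the name with the genre segment expected for its language.

    A valid name is returned unchanged; a name such as es_core_web_sm is
    rewritten to es_core_news_sm.
    """
    match = MODEL_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    lang, size = match["lang"], match["size"]
    if lang not in GENRE_BY_LANG:
        raise ValueError(f"language not covered by this sketch: {lang!r}")
    return f"{lang}_core_{GENRE_BY_LANG[lang]}_{size}"


if __name__ == "__main__":
    for name in ["es_core_web_sm", "fr_core_web_sm", "de_core_news_lg"]:
        fixed = suggest_model_name(name)
        print(name, "->", "ok" if fixed == name else fixed)
```

A check like this only catches the naming pattern; whether a given pipeline is actually installable should still be confirmed against the spaCy model listing.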

#### Step 1: Modify the NLP Configuration

The primary way to add language support is through the `NER_RECOGNIZERS` environment variable. Create a custom Dockerfile:

```dockerfile
FROM mcr.microsoft.com/presidio-analyzer:latest

# Install additional language models
RUN python -m spacy download es_core_news_lg
RUN python -m spacy download fr_core_news_lg
RUN python -m spacy download de_core_news_lg

# Set environment variables for additional languages
ENV NER_RECOGNIZERS='{"en": ["SpacyRecognizer"], "es": ["SpacyRecognizer"], "fr": ["SpacyRecognizer"], "de": ["SpacyRecognizer"]}'
ENV DEFAULT_LANGUAGES="en,es,fr,de"

CMD ["python", "-m", "presidio-analyzer"]
```

#### Step 2: Build Your Custom Image

```bash
docker build -t my-presidio-analyzer:custom -f Dockerfile .
```

#### Step 3: Run with Custom Configuration

```bash
docker run -d -p 5002:3000 \
  -e NER_RECOGNIZERS='{"en": ["SpacyRecognizer"], "es": ["SpacyRecognizer"]}' \
  -e DEFAULT_LANGUAGES="en,es" \
  my-presidio-analyzer:custom
```
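
Whatever configuration mechanism ends up being documented, it helps to verify that the running container really serves the extra language. The following is a rough sketch using only the Python standard library. It assumes the container is published on `localhost:5002` as in the command above, and that the analyzer exposes a `POST /analyze` endpoint accepting a JSON body with `text` and `language` fields; the endpoint is not shown in this PR, and the helper names `build_analyze_request` and `analyze` are hypothetical.

```python
import json
import urllib.request

# Assumption: the container from Step 3 is reachable on this host and port.
ANALYZER_URL = "http://localhost:5002"


def build_analyze_request(
    text: str, language: str, base_url: str = ANALYZER_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the analyzer's /analyze endpoint."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, language: str, timeout: float = 10.0):
    """Send the request and return the decoded JSON response.

    Raises urllib.error.URLError (an OSError) if the container is not
    reachable, and urllib.error.HTTPError if the service rejects the request.
    """
    request = build_analyze_request(text, language)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `analyze("Me llamo David", "es")` against a running container is then a quick smoke test: an error response for a language that was supposedly added suggests the configuration never reached the service.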

### Typical Pitfalls to Avoid

#### Memory Issues with Multiple Languages

!!! warning "Important"

    Adding 10+ languages at once may cause the Docker image to run out of memory during model loading.

**Solution:** Add languages incrementally and optimize model sizes:

```dockerfile
# Use smaller models when possible
RUN python -m spacy download es_core_web_sm # Use small model
# Instead of: RUN python -m spacy download es_core_web_lg # Large model
```

Alternatively, use lazy loading for NLP models.

#### NLP Recognizer Warnings

If you see warnings like:
```
UserWarning: NLP recognizer (e.g. SpacyRecognizer, StanzaRecognizer) is not in the list of recognizers for language en.
```

**Solution:** Ensure your recognizers are properly registered in the `NER_RECOGNIZERS` configuration:

```bash
docker run -d -p 5002:3000 \
  -e NER_RECOGNIZERS='{"en": ["SpacyRecognizer"], "es": ["SpacyRecognizer"]}' \
  my-presidio-analyzer:custom
```

### Complete Example: Multi-Language Support

Create a `Dockerfile.multilang`:

```dockerfile
FROM mcr.microsoft.com/presidio-analyzer:latest

# Install Spanish, French, German, and Italian models
RUN python -m spacy download es_core_web_sm && \
    python -m spacy download fr_core_web_sm && \
    python -m spacy download de_core_web_sm && \
    python -m spacy download it_core_web_sm

# Configure recognizers for all supported languages
ENV NER_RECOGNIZERS='{
"en": ["SpacyRecognizer"],
"es": ["SpacyRecognizer"],
"fr": ["SpacyRecognizer"],
"de": ["SpacyRecognizer"],
"it": ["SpacyRecognizer"]
}'
ENV DEFAULT_LANGUAGES="en,es,fr,de,it"

EXPOSE 3000

CMD ["python", "-m", "presidio-analyzer"]
```

The official Presidio Docker images support English by default. To add support for additional languages, you'll need to build custom images using the provided YAML configuration files.

### Key Configuration Files

The main configuration files for customizing Presidio are located in the root directory:

- `docker-compose.yml` - Main compose file
- `docker-compose-image.yml` - Image-specific configuration
- `docker-compose-text.yml` - Text processing configuration
- `docker-compose-transformers.yml` - Transformer models configuration

### Adding Support for Additional Languages

If your container runs out of memory:
- Reduce the number of languages loaded simultaneously
- Use smaller spaCy models (`_sm` instead of `_lg`)
- Limit the number of NLP engine workers
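
The first two points in the list above can be sketched as a small generator for a file-based NLP configuration that names only the languages actually needed, each with a small model. This is an illustration under assumptions: it presumes a YAML layout with an `nlp_engine_name` key and a `models` list of `lang_code` / `model_name` pairs, which is not shown anywhere in this PR, and `render_nlp_config` is a hypothetical helper rather than a Presidio function.

```python
def render_nlp_config(models: dict[str, str], engine: str = "spacy") -> str:
    """Render a minimal NLP configuration as YAML text.

    `models` maps a language code to the spaCy pipeline to load for it.
    Assumption: the target format uses `nlp_engine_name` and a `models`
    list with `lang_code` and `model_name` entries.
    """
    lines = [f"nlp_engine_name: {engine}", "models:"]
    for lang_code, model_name in models.items():
        lines.append(f"  - lang_code: {lang_code}")
        lines.append(f"    model_name: {model_name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two languages with small models, rather than many languages with large ones.
    print(render_nlp_config({"en": "en_core_web_sm", "es": "es_core_news_sm"}), end="")
```

Keeping the language list in one place like this makes it easier to add languages one at a time and watch memory use after each addition.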
Description
This PR adds comprehensive documentation for building custom Docker images in Presidio, addressing issue #1663.
Changes Made
Related Issue
Fixes #1663
Testing
Documentation build was verified locally.
Screenshots
N/A - Documentation only
Contributed on behalf of Silhouette Team (Alberto Farah)