Replies: 2 comments
-
This is a known issue, and there is good analysis in a related GitHub issue that explains exactly why docling-serve OOMs on a T4 while standalone docling works.

Root cause: even with enrichment features disabled and models loaded lazily, docling-serve can still run out of memory on this card. Additionally, Flash Attention 2 (which reduces memory usage) is not available on the T4, since it requires Ampere or newer architecture (the T4 is Turing). The fundamental constraint is that the Granite Vision model used for chart extraction is ~4 GB, and the T4's 16 GB of VRAM does not leave enough headroom once that is combined with the other pipeline models.
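The architecture constraint above can be checked directly: Flash Attention 2 requires CUDA compute capability 8.0 (Ampere) or newer, while the T4 reports 7.5 (Turing). A minimal sketch of that check in plain Python (with PyTorch, the tuple for the current device comes from `torch.cuda.get_device_capability()`):

```python
# Flash Attention 2 requires CUDA compute capability >= 8.0 (Ampere or newer).
# The T4 is a Turing GPU with compute capability 7.5, so it is excluded.
FLASH_ATTN_MIN_CAPABILITY = (8, 0)

def supports_flash_attention_2(capability: tuple[int, int]) -> bool:
    """Check a device's (major, minor) compute capability against the FA2 floor.

    With PyTorch installed, the capability of the current CUDA device can be
    read via torch.cuda.get_device_capability().
    """
    return capability >= FLASH_ATTN_MIN_CAPABILITY

print(supports_flash_attention_2((7, 5)))  # T4 (Turing) -> False
print(supports_flash_attention_2((8, 0)))  # Ampere-class device -> True
```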
-
We are setting DOCLING_SERVE_LOAD_MODELS_AT_BOOT = False so models are loaded lazily. Even if we set do_ocr=false, do_table_structure=false, and do_formula_enrichment=false, we still get OOM on a T4, which has 16 GB of memory. Nota bene: "normal" docling works fine and I do not get OOM. Does this mean the feature is not working?
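For reference, the configuration described above can be sketched as follows. The environment variable and the three option names are the ones quoted in this comment; the endpoint path and request shape are assumptions that may differ between docling-serve versions, so check them against your deployment:

```shell
# Load models lazily instead of all at boot (env var from the comment above).
export DOCLING_SERVE_LOAD_MODELS_AT_BOOT=False

# Hypothetical request sketch: option names are the ones quoted above;
# the endpoint path and payload layout may vary by docling-serve version.
curl -X POST http://localhost:5001/v1alpha/convert/source \
  -H "Content-Type: application/json" \
  -d '{
        "http_sources": [{"url": "https://example.com/report.pdf"}],
        "options": {
          "do_ocr": false,
          "do_table_structure": false,
          "do_formula_enrichment": false
        }
      }'
```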
-
I use docling-serve, but I continuously hit CUDA OOM when setting do_chart_extraction = True. I have tried minimizing GPU demand with the settings below, but to no avail. I am using docling-serve 1.15.1 on a T4 instance. "Normal" docling works fine, but docling-serve crashes on the first try.
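To make the headroom problem concrete, here is a back-of-envelope VRAM budget helper. Only the ~4 GB Granite Vision figure and the 16 GB T4 capacity appear in this thread; the other entries are hypothetical placeholders to be filled in with measured sizes:

```python
# Back-of-envelope VRAM headroom check for a 16 GB T4.
T4_VRAM_GB = 16.0

# Only the ~4 GB Granite Vision figure comes from this thread; the other
# entries are hypothetical placeholders, not measured values.
hypothetical_budget_gb = {
    "granite_vision (chart extraction)": 4.0,
    "other pipeline models": 0.0,        # fill in measured sizes
    "activations / CUDA context": 0.0,   # fill in measured sizes
}

def headroom_gb(total_gb: float, budget: dict[str, float]) -> float:
    """Remaining VRAM after loading everything in the budget."""
    return total_gb - sum(budget.values())

print(f"Headroom: {headroom_gb(T4_VRAM_GB, hypothetical_budget_gb):.1f} GB")
```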