Replies: 2 comments
-
This is a known issue, and there is good analysis in a related GitHub issue that explains exactly why docling-serve OOMs on a T4 while standalone docling works.

Root cause: even with enrichment features disabled and models loaded lazily, docling-serve can still run out of memory on this card. Additionally, Flash Attention 2 (which reduces memory usage) is not available on the T4, since it requires Ampere or newer architecture (the T4 is Turing). The fundamental constraint is that the Granite Vision model used for chart extraction is ~4 GB, and the T4's 16 GB of VRAM does not leave enough headroom once that is combined with the other pipeline models.
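The architecture constraint above can be checked directly: Flash Attention 2 requires CUDA compute capability 8.0 (Ampere) or newer, while the T4 reports 7.5 (Turing). A minimal sketch of that check in plain Python (with PyTorch, the tuple for the current device comes from `torch.cuda.get_device_capability()`):

```python
# Flash Attention 2 requires CUDA compute capability >= 8.0 (Ampere or newer).
# The T4 is a Turing GPU with compute capability 7.5, so it is excluded.
FLASH_ATTN_MIN_CAPABILITY = (8, 0)

def supports_flash_attention_2(capability: tuple[int, int]) -> bool:
    """Check a device's (major, minor) compute capability against the FA2 floor.

    With PyTorch installed, the capability of the current CUDA device can be
    read via torch.cuda.get_device_capability().
    """
    return capability >= FLASH_ATTN_MIN_CAPABILITY

print(supports_flash_attention_2((7, 5)))  # T4 (Turing) -> False
print(supports_flash_attention_2((8, 0)))  # Ampere-class device -> True
```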
-
We are setting DOCLING_SERVE_LOAD_MODELS_AT_BOOT = False so models are loaded lazily. Even if we set do_ocr=false, do_table_structure=false, and do_formula_enrichment=false, we still get OOM on a T4, which has 16 GB of memory. Nota bene: "normal" docling works fine and I do not get OOM. Does this mean the feature is not working?
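For reference, the configuration described above can be sketched as follows. The environment variable and the three option names are the ones quoted in this comment; the endpoint path and request shape are assumptions that may differ between docling-serve versions, so check them against your deployment:

```shell
# Load models lazily instead of all at boot (env var from the comment above).
export DOCLING_SERVE_LOAD_MODELS_AT_BOOT=False

# Hypothetical request sketch: option names are the ones quoted above;
# the endpoint path and payload layout may vary by docling-serve version.
curl -X POST http://localhost:5001/v1alpha/convert/source \
  -H "Content-Type: application/json" \
  -d '{
        "http_sources": [{"url": "https://example.com/report.pdf"}],
        "options": {
          "do_ocr": false,
          "do_table_structure": false,
          "do_formula_enrichment": false
        }
      }'
```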
-
I use docling-serve, but I continuously hit CUDA OOM when setting do_chart_extraction = True. I have tried minimizing GPU demand with the settings below, but to no avail. I am using docling-serve 1.15.1 on a T4 instance. "Normal" docling works fine, but docling-serve crashes on the first try.
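To make the headroom problem concrete, here is a back-of-envelope VRAM budget helper. Only the ~4 GB Granite Vision figure and the 16 GB T4 capacity appear in this thread; the other entries are hypothetical placeholders to be filled in with measured sizes:

```python
# Back-of-envelope VRAM headroom check for a 16 GB T4.
T4_VRAM_GB = 16.0

# Only the ~4 GB Granite Vision figure comes from this thread; the other
# entries are hypothetical placeholders, not measured values.
hypothetical_budget_gb = {
    "granite_vision (chart extraction)": 4.0,
    "other pipeline models": 0.0,        # fill in measured sizes
    "activations / CUDA context": 0.0,   # fill in measured sizes
}

def headroom_gb(total_gb: float, budget: dict[str, float]) -> float:
    """Remaining VRAM after loading everything in the budget."""
    return total_gb - sum(budget.values())

print(f"Headroom: {headroom_gb(T4_VRAM_GB, hypothetical_budget_gb):.1f} GB")
```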