Replies: 1 comment
Yes, you can offload almost all element description (images, figures, tables, code blocks) to a VLM running in vLLM, keeping Docling’s resource usage minimal. The key is to configure Docling to do only layout and basic structure recognition, then send all detected elements to your vLLM endpoint for description via API. This works well for images, figures, and tables; code blocks can also be described if your VLM model and prompt support it. Minimal pipeline configuration:
```python
# Import paths may vary slightly between Docling versions.
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PictureDescriptionApiOptions,
    ThreadedPdfPipelineOptions,
)

pipeline_options = ThreadedPdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
    generate_page_images=False,
    generate_picture_images=True,
    images_scale=1.5,
    accelerator_options=AcceleratorOptions(
        device=AcceleratorDevice.CUDA,
        num_threads=16,
    ),
    ocr_batch_size=4,
    layout_batch_size=4,
    table_batch_size=4,
    profile_pipeline_timings=False,
    pdf_backend="dlparse_v4",
    do_picture_classification=False,
    do_picture_description=True,
    do_code_enrichment=False,
    enable_remote_services=True,
    picture_description_options=PictureDescriptionApiOptions(
        url="http://your-vllm-server:8000/v1/chat/completions",
        params={
            "model": "your-vlm-model",
            "max_tokens": 2000,
        },
        prompt=(
            "Provide a dense, factual description of the image capturing all key "
            "visual elements for retrieval. If the image contains code, source code, "
            "or terminal output, transcribe it exactly preserving formatting for "
            "retrieval."
        ),
        timeout=60,
        concurrency=32,
        batch_size=32,
        scale=1.5,
        picture_area_threshold=0.0,
    ),
)
```
This setup will extract images, figures, and tables, and send them to your vLLM endpoint for description. For code blocks and tables, make sure your VLM model and prompt are designed to handle those elements; some models (like GraniteDocling, SmolDocling, Qwen, Pixtral) are especially strong at extracting complex tables and converting images to structured text, but prompt engineering is important for best results. If your VLM output does not include explicit structured content (like DocTags or Markdown tables), you may need to adjust the prompt to request it.
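To actually run the conversion with these options, a minimal sketch could look like the following. It assumes Docling's standard DocumentConverter API; the ThreadedStandardPdfPipeline class name and import paths are assumptions and may differ slightly across Docling versions:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

# Route PDFs through the threaded pipeline using the options defined above.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,  # assumption: threaded pipeline class
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("path/to/your.pdf")  # hypothetical input path
doc = result.document  # DoclingDocument used in the verification snippet below
```

With `enable_remote_services=True`, the picture description calls go out to the vLLM endpoint, so Docling itself only needs to run layout and table structure locally.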
Verification:

```python
from docling_core.types.doc import PictureItem

for element, _ in doc.iterate_items():
    if isinstance(element, PictureItem):
        print(f"Found Picture: {element.self_ref}")
        print(f" > Caption: {element.caption_text(doc=doc)}")
        print(f" > VLM Annotations: {element.annotations}")
```

This will show both the PDF captions and the VLM-generated descriptions for each image. Let me know if you need a full pipeline example or want to discuss prompt/model tuning for tables and code blocks! To reply, just mention @dosu.
@dosu To save VRAM, I would like Docling to do only basic layout recognition, and then have all the detected elements (images, figures, code blocks, tables?) described by a VLM running in vLLM. Is that possible? If so, how can I do this? Could you create a working example?
Using ThreadedPdfPipelineOptions and PictureDescriptionApiOptions
Docling famously does not scale well with all this, so I would like to keep its usage minimal.
vLLM, on the other hand, scales beautifully with larger batches, larger VLMs, etc., so I think it would make sense in the end.
My pipeline so far: