Replies: 1 comment
Yes, you can offload almost all element description (images, figures, tables, code blocks) to a VLM running in vLLM, keeping Docling’s resource usage minimal. The key is to configure Docling to do only layout and basic structure recognition, then send all detected elements to your vLLM endpoint for description via API. This works well for images, figures, and tables; code blocks can also be described if your VLM model and prompt support it. Minimal pipeline configuration:
```python
# Import paths may vary slightly between Docling versions.
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PictureDescriptionApiOptions,
    ThreadedPdfPipelineOptions,
)

pipeline_options = ThreadedPdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
    generate_page_images=False,
    generate_picture_images=True,
    images_scale=1.5,
    accelerator_options=AcceleratorOptions(
        device=AcceleratorDevice.CUDA,
        num_threads=16,
    ),
    ocr_batch_size=4,
    layout_batch_size=4,
    table_batch_size=4,
    profile_pipeline_timings=False,
    pdf_backend="dlparse_v4",
    do_picture_classification=False,
    do_picture_description=True,
    do_code_enrichment=False,
    enable_remote_services=True,
    picture_description_options=PictureDescriptionApiOptions(
        url="http://your-vllm-server:8000/v1/chat/completions",
        params={
            "model": "your-vlm-model",
            "max_tokens": 2000,
        },
        prompt=(
            "Provide a dense, factual description of the image capturing all key "
            "visual elements for retrieval. If the image contains code, source code, "
            "or terminal output, transcribe it exactly preserving formatting for "
            "retrieval."
        ),
        timeout=60,
        concurrency=32,
        batch_size=32,
        scale=1.5,
        picture_area_threshold=0.0,
    ),
)
```
This setup will extract images, figures, and tables, and send them to your vLLM endpoint for description. For code blocks and tables, make sure your VLM model and prompt are designed to handle those elements; some models (like GraniteDocling, SmolDocling, Qwen, Pixtral) are especially strong at extracting complex tables and converting images to structured text, but prompt engineering is important for best results. If your VLM output does not include explicit structured content (like DocTags or Markdown tables), you may need to adjust the prompt to request it.
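To actually run the conversion with these options, a minimal sketch could look like the following. It assumes Docling's standard DocumentConverter API; the ThreadedStandardPdfPipeline class name and import paths are assumptions and may differ slightly across Docling versions:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

# Route PDFs through the threaded pipeline using the options defined above.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,  # assumption: threaded pipeline class
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("path/to/your.pdf")  # hypothetical input path
doc = result.document  # DoclingDocument used in the verification snippet below
```

With `enable_remote_services=True`, the picture description calls go out to the vLLM endpoint, so Docling itself only needs to run layout and table structure locally.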
Verification:

```python
from docling_core.types.doc import PictureItem

for element, _ in doc.iterate_items():
    if isinstance(element, PictureItem):
        print(f"Found Picture: {element.self_ref}")
        print(f" > Caption: {element.caption_text(doc=doc)}")
        print(f" > VLM Annotations: {element.annotations}")
```

This will show both the PDF captions and the VLM-generated descriptions for each image. Let me know if you need a full pipeline example or want to discuss prompt/model tuning for tables and code blocks! To reply, just mention @dosu.
@dosu To save VRAM, I would like Docling to do only basic layout recognition, and then have all the detected elements (images, figures, code blocks, tables?) described by a VLM running in vLLM. Is that possible? If so, how can I do this? Could you create a working example?
Using ThreadedPdfPipelineOptions and PictureDescriptionApiOptions
Docling famously does not scale well with all this, so I would like to keep its usage minimal.
vLLM, on the other hand, scales beautifully with larger batches, larger VLMs, etc., so I think it would make sense in the end.
My pipeline so far: