Using Docling for Specific Pdf Pages #2744

egesnr · 2025-12-07T23:34:37Z

egesnr
Dec 7, 2025

@dosu
Can we use Docling on specific PDF pages, as in the Camelot library? For example: tables = camelot.read_pdf(pdf_path, pages=str(page_number), flavor='stream')

egesnr · 2025-12-07T23:36:45Z

dosubot[bot]
Bot Dec 7, 2025

Yes, you can process specific PDF pages with Docling, similar to Camelot, by using the page_range parameter in the Python API or the --page-range option in the CLI. page_range takes a (start, end) pair that is one-based and inclusive. For example, to process only the first page with the Python API:

conv_res = doc_converter.convert(source=pdf_path, page_range=(1, 1))  # one-based, inclusive: page 1 only

or

conv_res = doc_converter.convert(source=pdf_path, page_range=(1, 5))  # one-based, inclusive: pages 1 to 5

If you are on an older Docling release, check its documentation to confirm the indexing convention (reference, reference, reference).

In the CLI:

docling yourfile.pdf --to md --output ./output_folder --page-range 1-5

This is intended to process only pages 1 to 5 (reference). Run docling --help to confirm that --page-range is available in your installed version.

If you need to extract tables or content from a single page, you can also extract that page into a temporary PDF using PyMuPDF (fitz) or pypdf, then process it with Docling (reference).
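As a rough sketch of that workaround, assuming pypdf is installed: the helper below copies a one-based, inclusive page span into a new PDF that can then be passed to convert. The function names zero_based_indices and extract_pages are illustrative and are not part of Docling or pypdf.

```python
from pathlib import Path


def zero_based_indices(first_page: int, last_page: int, num_pages: int) -> range:
    """Map a one-based, inclusive page span onto zero-based page indices."""
    if not 1 <= first_page <= last_page <= num_pages:
        raise ValueError(
            f"invalid span {first_page}-{last_page} for a {num_pages}-page PDF"
        )
    return range(first_page - 1, last_page)


def extract_pages(src: str, dst: str, first_page: int, last_page: int) -> Path:
    """Write pages first_page..last_page of src (one-based, inclusive) to dst."""
    # Imported lazily so the index helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in zero_based_indices(first_page, last_page, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as handle:
        writer.write(handle)
    return Path(dst)
```

Note that page numbers reported by Docling for the extracted file restart at 1, so keep track of the original page number yourself if you need it later.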

For exporting content from a specific page after conversion, use:

md = conv_res.document.export_to_markdown(page_no=1)  # conv_res comes from the convert call above

Note: There are known issues with page-specific export in large documents (like extra tables or page breaks in the output) (reference).

Docling does not natively treat a table that spans multiple pages as a single table, so you may need post-processing if you encounter this case (reference).

For large PDFs, or to reduce memory use, process the document in batches using page_range or page_batch_size, and consider using the PyPdfiumDocumentBackend for better memory handling (reference).
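One way to batch with page_range is to generate consecutive one-based, inclusive spans and convert them one at a time. The sketch below is only illustrative: iter_page_batches is a hypothetical helper, and the total page count is assumed to be known already (for example from a PDF library).

```python
from typing import Iterator, Tuple


def iter_page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield one-based, inclusive (start, end) spans that cover total_pages."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must both be >= 1")
    for start in range(1, total_pages + 1, batch_size):
        yield (start, min(start + batch_size - 1, total_pages))


# Intended use (not executed here, since it needs a configured converter):
# for span in iter_page_batches(total_pages, 20):
#     result = converter.convert(pdf_path, page_range=span)
```

Each yielded tuple has the same shape as the page_range argument, so the last batch is simply shorter when the page count is not a multiple of the batch size.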

To reply, just mention @dosu.


4 replies

egesnr Dec 7, 2025
Author

@dosu
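from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Load models from a local folder and skip OCR
pipeline_options = PdfPipelineOptions(artifacts_path="docling_offline")
pipeline_options.do_ocr = False

# Convert the document
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(temp_path)
docling_doc = result.document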

How can I apply your approach here? I want to select a single page or a range of pages.

dosubot[bot] Bot Dec 7, 2025

To select a single page or a range of pages, pass the page_range parameter directly to converter.convert. The page_range should be a tuple or list of two integers, using one-based inclusive indexing (e.g., (3, 5) for pages 3 to 5). PdfPipelineOptions does not support page_range directly; it must be set at convert time.

Here's how you can update your code:

# To select a single page (e.g., page 2)
result = converter.convert(temp_path, page_range=(2, 2))

# To select a range (e.g., pages 2 to 4)
result = converter.convert(temp_path, page_range=(2, 4))

This works with PyPdfiumDocumentBackend and your current pipeline setup. There are no major caveats, but for best results, process one page at a time if you notice any text ordering issues in the output [source] [source] [source].
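If your existing code builds Camelot-style pages strings, a small adapter can turn them into the tuple that convert expects. This is only a sketch: pages_to_page_range is a hypothetical helper, not part of Docling or Camelot, and it handles only a single page ("3") or one contiguous span ("2-4").

```python
from typing import Tuple


def pages_to_page_range(pages: str) -> Tuple[int, int]:
    """Convert "3" or "2-4" into a one-based, inclusive (start, end) tuple.

    Other Camelot forms such as "1,3", "4-end" or "all" raise ValueError,
    because a single page_range can describe only one contiguous span.
    """
    first, sep, last = pages.strip().partition("-")
    start = int(first)
    end = int(last) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page specification: {pages!r}")
    return (start, end)


# Intended use (not executed here, since it needs a configured converter):
# result = converter.convert(temp_path, page_range=pages_to_page_range(str(page_number)))
```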

If you want to export content from a specific page after conversion, use:

md = result.document.export_to_markdown(page_no=2)

Let me know if you need an example for batch processing or have other questions!


egesnr Dec 7, 2025
Author

@dosu
I'm actually trying to extract tables. If I convert the document as you did, can I extract the tables from one specific page, as in this code?

pipeline_options = PdfPipelineOptions(artifacts_path="docling_offline")
pipeline_options.do_ocr = False

# Convert the document
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(temp_path)
docling_doc = result.document

dfs = [docling_table_to_df(t) for t in docling_doc.tables]

dosubot[bot] Bot Dec 7, 2025

Yes, you can extract tables from a specific page after conversion. Each table in docling_doc.tables has provenance metadata, so you can filter tables by page number like this:

target_page_no = 2  # for example, page 2
tables_on_page = [t for t in docling_doc.tables if t.prov and t.prov[0].page_no == target_page_no]
dfs = [docling_table_to_df(t) for t in tables_on_page]

(reference, reference)

Alternatively, you can process only the page(s) you want by passing page_range=(2, 2) to convert, and then all tables in docling_doc.tables will be from that page (reference, reference).

If you need to filter after conversion, always check table.prov[0].page_no for the page number. This works with your current pipeline setup. Let me know if you want an example for a page range or for exporting to markdown!
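For a page range rather than a single page, the same provenance check can be wrapped in a small function. The sketch below is an assumption-laden illustration: tables_on_pages is a hypothetical helper, it relies only on each table having a prov list whose entries carry page_no (as described above), and it uses stand-in objects instead of real Docling tables so it can run on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def tables_on_pages(tables: Iterable[Any], first_page: int, last_page: int) -> List[Any]:
    """Return tables with at least one provenance entry in first_page..last_page."""
    selected = []
    for table in tables:
        # prov may be empty or missing on some items, so fall back to an empty list.
        page_numbers = [p.page_no for p in (getattr(table, "prov", None) or [])]
        if any(first_page <= page_no <= last_page for page_no in page_numbers):
            selected.append(table)
    return selected


# Stand-in objects that mimic only the attributes used above.
fake_tables = [
    SimpleNamespace(name="a", prov=[SimpleNamespace(page_no=1)]),
    SimpleNamespace(name="b", prov=[SimpleNamespace(page_no=2)]),
    SimpleNamespace(name="c", prov=[]),
    SimpleNamespace(name="d", prov=[SimpleNamespace(page_no=3)]),
]
print([t.name for t in tables_on_pages(fake_tables, 2, 3)])  # ['b', 'd']
```

With real data you would pass docling_doc.tables in place of fake_tables. Unlike the prov[0] check, this version looks at every provenance entry, which is a design choice rather than something Docling requires.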
