Yes, you can process specific PDF pages with Docling, similar to Camelot, by using the page_range parameter in the Python API or the --page-range option in the CLI. For example, to process only the first page with the Python API:

    conv_res = doc_converter.convert(source=pdf_path, page_range=(1, 1))  # one-based, inclusive: just page 1

or, to process the first five pages:

    conv_res = doc_converter.convert(source=pdf_path, page_range=(1, 5))  # one-based, inclusive: pages 1 to 5

Recent Docling releases appear to treat page_range as a one-based, inclusive (start, end) pair, so a start of 0 is likely to be rejected. Check your Docling version for the exact indexing convention (reference, reference, reference).

In the CLI, passing the same range to --page-range processes only pages 1 to 5 (reference).

If you need to extract tables or content from a single page, you can also extract that page into a temporary PDF using PyMuPDF (fitz) or pypdf, then process it with Docling (reference).

For exporting content from a specific page after conversion, use:

    md = conv_res.document.export_to_markdown(page_no=1)

Note: There are known issues with page-specific export in large documents, such as extra tables or page breaks in the output (reference). Docling does not natively support multi-page table extraction (tables spanning multiple pages), so you may need postprocessing if you encounter this case (reference).

For large PDFs or memory efficiency, process in batches using page_range or page_batch_size, and consider using the PyPdfiumDocumentBackend for better memory handling (reference).
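To connect the Camelot-style call from the question with Docling's page_range, here is a minimal sketch that turns a Camelot page string such as "1,3-5" into (start, end) tuples. It assumes page_range is one-based and inclusive, which should be verified against your installed Docling version. The helper names camelot_pages_to_ranges and convert_pages are hypothetical, and Camelot keywords such as "all" or "1-end" are deliberately not handled.

```python
from typing import List, Tuple


def camelot_pages_to_ranges(pages: str) -> List[Tuple[int, int]]:
    """Parse a Camelot-style page string ("3", "1,3-5") into
    one-based, inclusive (start, end) tuples."""
    ranges: List[Tuple[int, int]] = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page spec: {part!r}")
        ranges.append((start, end))
    return ranges


def convert_pages(pdf_path: str, pages: str):
    """Run one Docling conversion per requested range (hypothetical helper).

    Assumes DocumentConverter.convert accepts page_range as a one-based,
    inclusive tuple, as in the reply above.
    """
    # Imported lazily so the parser above works without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return [
        converter.convert(source=pdf_path, page_range=page_range)
        for page_range in camelot_pages_to_ranges(pages)
    ]
```

With this sketch, the Camelot call in the question would map to convert_pages(pdf_path, str(page_number)), which yields a single conversion result for that one page.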
@dosu
Can we use Docling for specific PDF pages, like in the Camelot library? For example: tables = camelot.read_pdf(pdf_path, pages=str(page_number), flavor='stream')