I want to use Docling for legal text processing, and have some questions #3241
I specifically want to use the HTML output so that I can use an HTML library to extract specific texts and do post-processing. Every document should follow the same patterns (headers, text layout, etc.). How much can I trust Docling to produce consistent HTML output? My post-processing will rely on that structure. Essentially, my question is: does the output stay consistent, and is there any way to make it emit custom tags?
Replies: 1 comment
For legal clause extraction specifically, you'd likely want to combine Docling's structural parsing with your own post-processing logic, e.g., detecting clause patterns based on heading hierarchy and numbering, then wrapping them in your own custom tags.
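A minimal sketch of that kind of post-processing, assuming the exported HTML renders clause titles as h1 to h6 headings whose text starts with a number (optionally preceded by "Article", "Section", or "Clause"). The numbering pattern, the function name wrap_clauses, and the data-number attribute are illustrative assumptions and not part of Docling; a real pipeline would probably use a proper HTML parser rather than a regular expression.

```python
import re

# Hypothetical numbering pattern: "1.", "2.3", "Article 4", "Section 5.1".
# Headings that contain inline markup (e.g. <b>) are not matched by this sketch.
CLAUSE_HEADING = re.compile(
    r"<h([1-6])[^>]*>\s*(?:(?:Article|Section|Clause)\s+)?"
    r"(\d+(?:\.\d+)*)\.?\s[^<]*</h\1>",
    re.IGNORECASE,
)


def wrap_clauses(html: str) -> str:
    """Wrap each numbered heading, plus the content up to the next numbered
    heading, in a custom <clause> element. Sub-clauses are not nested; every
    numbered heading starts a new sibling <clause>."""
    # Stop before </body> (if present) so the last clause stays inside the body.
    limit = html.rfind("</body>")
    if limit == -1:
        limit = len(html)

    matches = list(CLAUSE_HEADING.finditer(html, 0, limit))
    if not matches:
        return html

    parts = [html[: matches[0].start()]]  # content before the first clause
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else limit
        number = match.group(2)
        parts.append(
            f'<clause data-number="{number}">{html[match.start():end]}</clause>'
        )
    parts.append(html[limit:])  # closing </body></html>, if any
    return "".join(parts)


sample = (
    "<h1>Service Agreement</h1>\n"
    "<h2>1. Definitions</h2>\n<p>Terms.</p>\n"
    "<h2>2. Payment</h2>\n<p>Net 30.</p>"
)
print(wrap_clauses(sample))
```

Because the wrapper only inserts tags around existing markup, stripping the clause tags again returns the original HTML, which makes it easy to check that nothing was lost.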
A few things that might help here:
HTML consistency: Docling's export_to_html() produces a structured HTML output based on the document element types it detects (headings, paragraphs, tables, lists, etc.) [1]. For documents that follow the same layout, the output structure should be consistent, but it ultimately depends on how well Docling's layout model recognizes each element. If your legal documents have a uniform format (e.g., Word templates), you should get fairly reliable results.
Custom tags like <clause>: Docling does not natively support custom HTML tags. The HTML exporter uses a fixed set of standard HTML tags [1]. For your use case, the recommended approach would be to post-…