Issues with ligatures with "default" model #2807

jameshowison · 2025-12-19T00:52:58Z

I'm seeing great performance with the "default" model in extracting markdown from academic syllabi.

One issue, though: I think I'm running into trouble with the conversion of ligatures (things like the "ff" and "fi" characters that are joined together). I've attached an example PDF that shows the problem: syllabus.pdf

In the conversion, the word "software" is a problem everywhere it appears (because there is a ligature for the "ft" in "software").

https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)

These are not a problem with smoldocling, but it's significantly slower and seems like overkill for my application.

Version details:
docling 2.65.0
docling-core 2.56.0
docling-ibm-models 3.10.3

@jameshowison · 2025-12-19T00:57:12Z

dosubot[bot]
Bot Dec 19, 2025

Hi @jameshowison! I'm Dosu and I’m helping the docling team.

The "default" model in docling doesn't normalize Unicode ligature code points (like ﬀ, ﬁ, ﬂ, ﬅ, etc.) to their decomposed forms during PDF extraction. These ligatures are preserved as-is in the output text, which is why you're seeing issues with words like "software" when the PDF uses a ligature for "ft". Note that Unicode has no dedicated "ft" ligature: U+FB05 (ﬅ) is the long-s-t ligature, so it is worth checking which character actually appears in the extracted text where "ft" should be. There are no built-in options or pipeline flags for ligature normalization in the default pipeline, and the sanitize_text method only handles a few punctuation marks, not ligatures.
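Before choosing replacements, it can help to see which ligature code points the extracted text really contains. The following is a minimal diagnostic sketch using only the standard library; `report_ligatures` is a hypothetical helper name, not part of docling, and it only looks at the Latin ligature range U+FB00 to U+FB06.

```python
import unicodedata
from collections import Counter


def report_ligatures(text):
    """Count Latin ligature code points (U+FB00..U+FB06) found in text.

    Returns a dict mapping "U+XXXX UNICODE NAME" to the number of occurrences.
    """
    counts = Counter(ch for ch in text if "\uFB00" <= ch <= "\uFB06")
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
        for ch, n in sorted(counts.items())
    }


if __name__ == "__main__":
    # Hypothetical sample; in practice pass the markdown string you extracted.
    sample = "so\uFB05ware of\uFB01ce \uFB01t"
    for label, n in report_ligatures(sample).items():
        print(f"{n:3d}  {label}")
```

If the report is empty but words still look wrong, the PDF is probably emitting some other character for the joined glyph, and that character would need its own mapping.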

You can fix this by adding a post-processing step to replace ligature code points with their standard letter sequences. Here’s a simple Python snippet you can use after extraction:

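# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long s + t; Unicode defines no "ft" ligature
    "\uFB06": "st",
    # Add more if needed, e.g. whatever character your PDF emits for "ft"
}

def normalize_ligatures(text):
    """Replace each ligature code point with its plain letter sequence."""
    for lig, repl in LIGATURES.items():
        text = text.replace(lig, repl)
    return text

# Usage:
# output = normalize_ligatures(output)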

You can apply this to your extracted markdown, or extend the sanitize_text method if you want to integrate it deeper into your workflow. No need to switch to smoldocling for this issue—this workaround is lightweight and effective.
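As an alternative to a hand-written table, Unicode compatibility normalization (NFKC) already knows how to expand these ligatures. The sketch below is a different technique from the mapping shown earlier in this reply, not something docling provides: it applies NFKC only to characters in U+FB00 to U+FB06, so other characters that NFKC would also rewrite (superscripts, non-breaking spaces, and so on) are left alone. `expand_ligatures` is a hypothetical helper name.

```python
import re
import unicodedata

# U+FB00..U+FB06: the Latin ligatures in Alphabetic Presentation Forms.
_LATIN_LIGATURES = re.compile("[\uFB00-\uFB06]")


def expand_ligatures(text):
    """Expand Latin ligature code points via NFKC, leaving all other text untouched."""
    return _LATIN_LIGATURES.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )


if __name__ == "__main__":
    # Hypothetical sample; in practice pass the markdown string you extracted.
    print(expand_ligatures("o\uFB03ce hours, work\uFB02ow"))
```

Because NFKC follows the Unicode definitions, U+FB05 expands to "st" here. If your PDF emits that code point for an "ft" glyph, an explicit mapping is the safer choice.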

