Issues with ligatures with "default" model #2807
Replies: 1 comment
Hi @jameshowison! I'm Dosu and I'm helping the docling team.

The "default" model in docling doesn't normalize Unicode ligature code points (such as the single-character forms of ff, fi, and fl in the range U+FB00 to U+FB06) to their constituent letters during PDF extraction. The ligatures are preserved as-is in the output text, which is why words like "software" come out wrong when the PDF uses a ligature for "ft". There are no built-in options or pipeline flags for ligature normalization in the default pipeline, and the sanitize_text method only handles a few punctuation marks, not ligatures.

You can fix this by adding a post-processing step that replaces ligature code points with their standard letter sequences. Here's a simple Python snippet you can use after extraction:

```python
# Map each Latin ligature code point to its plain-letter expansion.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    # U+FB05 is formally the "long s + t" ligature (NFKC expands it to "st").
    # It is mapped to "ft" here on the assumption that this PDF uses the
    # code point for its "ft" glyph; change it to "st" if that is not the case.
    "\uFB05": "ft",
    "\uFB06": "st",
    # Add more if needed
}


def normalize_ligatures(text):
    """Replace every ligature listed in LIGATURES with its letter sequence."""
    for lig, repl in LIGATURES.items():
        text = text.replace(lig, repl)
    return text

# Usage:
# output = normalize_ligatures(output)
```

You can apply this to your extracted markdown, or extend the sanitize_text method if you want to integrate it deeper into your workflow. There's no need to switch to smoldocling for this issue; this workaround is lightweight and effective.
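As an alternative to a hand-written table, the same post-processing step can lean on Python's standard `unicodedata` module. The sketch below is not part of docling; the function name `normalize_ligatures_nfkc` is made up for illustration. It applies NFKC compatibility normalization only to the Latin ligature range U+FB00 to U+FB06, so other characters that NFKC would also rewrite (fractions, superscripts, full-width forms) are left alone. Note that NFKC expands U+FB05 to "st", so if a PDF uses that code point for an "ft" glyph, an explicit mapping table is likely the better fit.

```python
import re
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block: U+FB00..U+FB06.
_LIGATURE_RE = re.compile("[\uFB00-\uFB06]")


def normalize_ligatures_nfkc(text: str) -> str:
    """Expand only Latin ligature code points; leave all other text untouched.

    Hypothetical helper for illustration, not a docling API.
    """
    return _LIGATURE_RE.sub(
        lambda match: unicodedata.normalize("NFKC", match.group()),
        text,
    )


if __name__ == "__main__":
    # Made-up sample text containing the ffi (U+FB03) and fi (U+FB01) ligatures.
    print(normalize_ligatures_nfkc("o\uFB03ce \uFB01le"))
```

Restricting the substitution to a regular-expression range keeps the change narrow: running NFKC over the whole document would also alter characters that are unrelated to the ligature problem.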
-
I'm seeing great performance with the "default" model when extracting markdown from academic syllabi.
One issue, though: I think I'm running into trouble with the conversion of ligatures (joined characters such as ff and fi). I attached an example PDF that shows the issue: syllabus.pdf
In the conversion, the word "software" is a problem everywhere it appears, because there is a ligature for the "ft" in "software". See https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)
These are not a problem with smoldocling, but it's significantly slower and seems like overkill for my application.
Version details:
docling 2.65.0
docling-core 2.56.0
docling-ibm-models 3.10.3
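To check which code points the conversion actually produced for words like "software", the extracted markdown can be scanned for characters in the Latin ligature range. This is a minimal sketch using only the Python standard library; `find_ligatures` and `report_ligatures` are hypothetical helper names, and the sample string is invented rather than taken from syllabus.pdf.

```python
import unicodedata
from collections import Counter


def find_ligatures(text: str) -> Counter:
    """Count characters in the Latin ligature range U+FB00..U+FB06."""
    return Counter(ch for ch in text if "\uFB00" <= ch <= "\uFB06")


def report_ligatures(text: str) -> None:
    """Print each ligature found with its code point, Unicode name, and count."""
    for ch, count in find_ligatures(text).most_common():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {count}")


if __name__ == "__main__":
    # Invented sample; in practice, pass the markdown string from the conversion.
    report_ligatures("so\uFB05ware for the o\uFB03ce, more so\uFB05ware")
```

The printed names show which ligature code points are present, which is useful for deciding what replacement each one should get.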