Skip to content

fix ._hocr#1493

Merged
jbarlow83 merged 1 commit intoocrmypdf:mainfrom
BBC-Esq:main
Apr 6, 2025
Merged

fix ._hocr#1493
jbarlow83 merged 1 commit intoocrmypdf:mainfrom
BBC-Esq:main

Conversation

@BBC-Esq
Copy link
Copy Markdown
Contributor

@BBC-Esq BBC-Esq commented Mar 8, 2025

Fix AttributeError in HocrTransform._get_text_direction when par is None

When processing certain HOCR files, the _get_text_direction method can receive a None value for its par parameter. This occurs when either:

  1. The HOCR file lacks properly structured paragraph elements (<p class='ocr_par'>)
  2. The fallback case is triggered but self.hocr.find(...) returns None

Without a check for this condition, the method attempts to access par.attrib on a None object, resulting in an AttributeError:

AttributeError: 'NoneType' object has no attribute 'attrib'

The Fix

This PR adds a simple null check to handle the case where par is None, returning the default left-to-right text direction instead of crashing.

This change:

  • Prevents crashes when processing HOCR files with missing or malformed paragraph structures
  • Maintains backward compatibility (same return values and behavior for valid inputs)
  • Simplifies error handling for clients of this library
  • Follows the principle of defensive programming

Example File - error occurring when processing Page 115

DivXGuide51.pdf

@BBC-Esq
Copy link
Copy Markdown
Contributor Author

BBC-Esq commented Mar 8, 2025

Please note that this fix isn't perfect. For RTL langauges, for example, you might do a subsequent PR like this:

def _get_text_direction(self, par):
    """Get the text direction of the paragraph.

    Arabic, Hebrew, Persian, are right-to-left languages.
    
    When the paragraph element is None, attempts to infer direction from
    other paragraphs in the document, defaulting to left-to-right.
    """
    if par is None:
        # Try to infer from other paragraphs in the document
        for p in self.hocr.iterfind(self._child_xpath('p', 'ocr_par')):
            if p is not None and p.attrib.get('dir') == 'rtl':
                return TextDirection.RTL
        return TextDirection.LTR
        
    return (
        TextDirection.RTL
        if par.attrib.get('dir', 'ltr') == 'rtl'
        else TextDirection.LTR
    )

However, this PR should resolve a vast majority of the issues.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 15, 2025

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.11%. Comparing base (7b2dd89) to head (0970ceb).
Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
src/ocrmypdf/hocrtransform/_hocr.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
- Coverage   90.14%   90.11%   -0.03%     
==========================================
  Files          95       95              
  Lines        7133     7135       +2     
  Branches      728      729       +1     
==========================================
  Hits         6430     6430              
- Misses        497      498       +1     
- Partials      206      207       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@BBC-Esq
Copy link
Copy Markdown
Contributor Author

BBC-Esq commented Apr 5, 2025

Anyone going to review this?

@jbarlow83 jbarlow83 merged commit 8b1443c into ocrmypdf:main Apr 6, 2025
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants