[Question]: difference between command line and ocrmypdf.ocr() API #1636

@cfoq

Describe the bug

I have a PDF that consists of an SVG of dummy text (attached). If I run OCRmyPDF on it from the command line:
python -m ocrmypdf dummy.pdf dummyocr.pdf
the file dummyocr.pdf is generated as expected, with actual text that can be selected.

If I run the following code in a Python file, as described in the docs:

import ocrmypdf
from ocrmypdf import OcrOptions
if __name__ == '__main__':
    options = OcrOptions(
        input_file="dummy.pdf",
        output_file="dummyocr2.pdf",
    )

    ocrmypdf.ocr(options)

The output file dummyocr2.pdf still appears to contain an SVG, and the text is NOT selectable.

Here is the output from the command line:

Scanning contents    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 [tesseract] lots of diacritics - possibly poor OCR                                                                                    tesseract.py:295
OCR                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Parsing 1 pages with HocrParser                                                                                                                _graft.py:334
Postprocessing...                                                                                                                                 ocr.py:156
[WinError 2] The system cannot find the file specified                                                                                        _windows.py:87
Auto mode: no verapdf available and input is not PDF/A, outputting PDF                                                                     _pipeline.py:1078
Linearizing          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
[WinError 2] The system cannot find the file specified                                                                                        _windows.py:87
[WinError 2] The system cannot find the file specified                                                                                        _windows.py:87
Image optimization ratio: 1.00 savings: 0.0%                                                                                               _pipeline.py:1175
Total file size ratio: 0.93 savings: -7.7%                                                                                                 _pipeline.py:1178
Output file is a PDF (auto mode)            

And here is the output from the API call:

Scanning contents    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
OCR                  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
[WinError 2] The system cannot find the file specified
Linearizing          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
[WinError 2] The system cannot find the file specified
[WinError 2] The system cannot find the file specified

Notably, the command-line run logs "Parsing 1 pages with HocrParser", whereas the API call does not. Do I need to specify different parameters to trigger that step? Or is there another reason the two do not behave the same way?
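
One thing that may help narrow this down: the API output above is missing every log line that carries a source location in the command-line output, not only the HocrParser one. That could mean logging was never configured for the API run, rather than that the step was skipped. The following is a minimal sketch using only the standard library. It assumes OCRmyPDF emits its messages through a logger named "ocrmypdf" (an assumption, not confirmed in this issue), and `enable_ocrmypdf_logging` is a hypothetical helper name.

```python
import logging
import sys


def enable_ocrmypdf_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a stderr handler to the 'ocrmypdf' logger (assumed logger name)."""
    log = logging.getLogger("ocrmypdf")
    log.setLevel(level)
    # Avoid stacking duplicate handlers if this is called more than once.
    if not any(isinstance(h, logging.StreamHandler) for h in log.handlers):
        handler = logging.StreamHandler(sys.stderr)
        # Mimic the "message  file:line" layout seen in the command-line output.
        handler.setFormatter(
            logging.Formatter("%(message)s  %(filename)s:%(lineno)d")
        )
        log.addHandler(handler)
    return log
```

Calling `enable_ocrmypdf_logging()` before `ocrmypdf.ocr(options)` should show whether the API run reaches the same steps as the command line, if the logger-name assumption holds.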

I have tried various combinations of options in the API call, such as force_ocr=True, redo_ocr=True (not together with force_ocr), languages=['eng',], and output_type="pdf", but cannot reproduce the command-line result.

Note: I believe the error "[WinError 2] The system cannot find the file specified" appears because Ghostscript is NOT installed.
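
A quick way to check that belief is to look for the Ghostscript executable on PATH. This is a sketch under stated assumptions: it uses the usual console executable names (`gswin64c` and `gswin32c` on Windows, `gs` elsewhere), `find_ghostscript` is a hypothetical helper, and it does not confirm that Ghostscript is the program behind the WinError 2 messages.

```python
import shutil
import sys
from typing import Optional


def find_ghostscript() -> Optional[str]:
    """Return the path of a Ghostscript executable on PATH, or None."""
    # Windows installs ship console binaries with these names; Unix uses 'gs'.
    names = ["gswin64c", "gswin32c"] if sys.platform == "win32" else ["gs"]
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript() or "Ghostscript not found on PATH")
```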

Any help appreciated. Thank you!

Steps to reproduce

See description

Files

dummy.pdf
dummyocr.pdf
dummyocr2.pdf

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

17.2.0

Relevant log output

see output in description