Describe the bug
I have a file that is a PDF consisting of an SVG of dummy text (attached). If I run from the command line:
python -m ocrmypdf dummy.pdf dummyocr.pdf
The file dummyocr.pdf is generated as expected with actual text that can be selected.
If I run the following code in a Python file, as described in the docs:
import ocrmypdf
from ocrmypdf import OcrOptions

if __name__ == '__main__':
    options = OcrOptions(
        input_file="dummy.pdf",
        output_file="dummyocr2.pdf",
    )
    ocrmypdf.ocr(options)
The output file dummyocr2.pdf still appears to contain only the SVG, and the text is NOT selectable.
Here is the output from the command line:
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
1 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:295
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Parsing 1 pages with HocrParser _graft.py:334
Postprocessing... ocr.py:156
[WinError 2] The system cannot find the file specified _windows.py:87
Auto mode: no verapdf available and input is not PDF/A, outputting PDF _pipeline.py:1078
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
[WinError 2] The system cannot find the file specified _windows.py:87
[WinError 2] The system cannot find the file specified _windows.py:87
Image optimization ratio: 1.00 savings: 0.0% _pipeline.py:1175
Total file size ratio: 0.93 savings: -7.7% _pipeline.py:1178
Output file is a PDF (auto mode)
And here is the output from the API call:
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
[WinError 2] The system cannot find the file specified
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
[WinError 2] The system cannot find the file specified
[WinError 2] The system cannot find the file specified
Notably, the command-line run logs "Parsing 1 pages with HocrParser", whereas the API call does not. Do I need to pass different parameters to trigger that step, or is there another reason the two runs behave differently?
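One thing that might be worth ruling out is whether the API run actually skips that step or just does not report it. The sketch below uses only the standard library and assumes (not verified for this version) that OCRmyPDF emits these messages through Python's logging module under a logger named "ocrmypdf". The helper name enable_ocrmypdf_logging is hypothetical.

```python
import logging


def enable_ocrmypdf_logging(level: int = logging.INFO) -> logging.Logger:
    """Send records from the (assumed) "ocrmypdf" logger to the console."""
    # Assumption: the library logs under the "ocrmypdf" logger hierarchy.
    logger = logging.getLogger("ocrmypdf")
    logger.setLevel(level)
    # Attach a console handler only once so repeated calls do not duplicate output.
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(message)s  %(filename)s:%(lineno)d")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    enable_ocrmypdf_logging()
    # ...then call ocrmypdf.ocr(options) exactly as in the script above.
```

If the "Parsing 1 pages with HocrParser" line then shows up in the API run as well, the difference in console output would be a logging difference only, and the missing text layer would need another explanation.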
I have tried various combinations of options in the API call, such as force_ocr=True, redo_ocr=True (not concurrently with force_ocr), languages=['eng'], and output_type="pdf", but cannot reproduce the command-line result.
Note: I believe the error "The system cannot find the file specified" appears because Ghostscript is NOT installed.
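A quick way to confirm the Ghostscript guess is to look for its executable on PATH. This is a minimal sketch: the executable names are the usual ones (gswin64c and gswin32c on Windows, gs elsewhere), the helper name find_on_path is hypothetical, and OCRmyPDF may search other locations too, so a miss here is only a hint.

```python
import shutil
from typing import Iterable, Optional

# Usual Ghostscript executable names: console builds on Windows, "gs" elsewhere.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_on_path(names: Iterable[str] = GHOSTSCRIPT_NAMES) -> Optional[str]:
    """Return the full path of the first executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_on_path()
    if found:
        print(f"Ghostscript on PATH: {found}")
    else:
        print("Ghostscript not found on PATH")
```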
Any help appreciated. Thank you!
Steps to reproduce
Files
dummy.pdf
dummyocr.pdf
dummyocr2.pdf
How did you download and install the software?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
17.2.0
Relevant log output
see output in description