Hello again,
I noticed that, unlike the "text-extraction" recipe, the "grayscale" recipe doesn't handle corrupted PDFs (see jobdiag).
I think the same kind of error handling as in the "text-extraction" recipe would be nice.
Here is how I modified the recipe (image-conversion/recipe.py):
    for i, sample_file in enumerate(input_filenames):
        prefix = sample_file.split('.')[0]
        suffix = sample_file.split('.')[-1].lower()
        if suffix in Constants.OCR_TYPES:
            try:
                with input_folder.get_download_stream(sample_file) as stream:
                    img_bytes = stream.read()
                if suffix == "pdf":
                    for j, img in enumerate(pdf_to_pil_images_iterator(img_bytes)):
                        img_bytes = convert_image_to_greyscale_bytes(img, quality=params[Constants.QUALITY])
                        output_folder.upload_data("{0}/{0}{1}{2:05d}.jpg".format(prefix, Constants.PDF_MULTI_SUFFIX, j+1), img_bytes)
                else:
                    img = Image.open(BytesIO(img_bytes))
                    img_bytes = convert_image_to_greyscale_bytes(img, quality=params[Constants.QUALITY])
                    output_folder.upload_data("{}.jpg".format(prefix), img_bytes)
                logger.info("OCR - Converted {}/{} images".format(i+1, total_images))
            except Exception as e:
                logger.info("Failed converting file {} to greyscale because: {}".format(sample_file, e))
        else:
            logger.info("OCR - Rejecting {} because it is not a {} file.".format(sample_file, '/'.join(Constants.OCR_TYPES)))
            logger.info("OCR - Rejected {}/{} images".format(i+1, total_images))
I simply added a try/except around the file conversion, with a logger call that reports the error if an exception occurs.
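To show the pattern in isolation, here is a minimal, self-contained sketch. The names convert_all and fake_convert are hypothetical stand-ins and not part of the plugin; the only point is that a failure on one file gets logged and the loop moves on to the next file.

```python
import logging

logger = logging.getLogger("ocr_sketch")


def convert_all(filenames, convert):
    """Convert each file, logging failures instead of aborting the whole run.

    `convert` is a hypothetical callable standing in for the recipe's
    per-file conversion; it is not part of the plugin.
    """
    converted, failed = [], []
    for name in filenames:
        try:
            convert(name)
            converted.append(name)
        except Exception as e:
            # Broad on purpose: one corrupted file must not stop the loop.
            logger.info("Failed converting file {} to greyscale because: {}".format(name, e))
            failed.append(name)
    return converted, failed


def fake_convert(name):
    # Stand-in that fails for one file, to simulate a corrupted PDF.
    if name == "broken.pdf":
        raise ValueError("corrupted PDF")


converted, failed = convert_all(["a.pdf", "broken.pdf", "b.png"], fake_convert)
```

With these inputs, the corrupted file ends up in failed while the other two are still processed.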
You may also notice that I added ".lower()" to the suffix, because ".PDF" files weren't converted: "PDF" is not in the Constants.OCR_TYPES list.
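The effect of the ".lower()" call can be checked on its own with a small sketch. The OCR_TYPES list below is an assumed stand-in for Constants.OCR_TYPES (the real values live in the plugin), and is_supported is a hypothetical helper used only for illustration.

```python
# Assumed stand-in for Constants.OCR_TYPES; the real list lives in the plugin.
OCR_TYPES = ["pdf", "jpg", "jpeg", "png"]


def is_supported(filename):
    # Lower-case the extension before the membership test,
    # so "scan.PDF" is treated like "scan.pdf".
    suffix = filename.split('.')[-1].lower()
    return suffix in OCR_TYPES


upper_ok = is_supported("scan.PDF")
lower_ok = is_supported("scan.pdf")
text_ok = is_supported("notes.txt")
```

Without the ".lower()" call, the first check would fail even though the file is a valid PDF.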
Regards,
Adrien