Add Exception handling for Grayscale recipe

Hello again,

I noticed that, unlike the "text-extraction" recipe, the "grayscale" recipe doesn't handle corrupted PDFs (see [jobdiag](https://dl.dataiku.com/file/c2AvzkGtM9vsiVAm/LzvCSMw5vPC1yNMA/dss-job-diag-POC_LRE_GPT-Build_Temp__NP__2024-03-18T08-46-14.233_.zip?dl=1)).

I think the same kind of error handling as in the "text-extraction" recipe would be nice.

Here is how I modified the recipe (image-conversion/recipe.py):

```python
for i, sample_file in enumerate(input_filenames):
    prefix = sample_file.split('.')[0]
    suffix = sample_file.split('.')[-1].lower()

    if suffix in Constants.OCR_TYPES:
        try:
            with input_folder.get_download_stream(sample_file) as stream:
                img_bytes = stream.read()

            if suffix == "pdf":
                for j, img in enumerate(pdf_to_pil_images_iterator(img_bytes)):
                    img_bytes = convert_image_to_greyscale_bytes(img, quality=params[Constants.QUALITY])
                    output_folder.upload_data("{0}/{0}{1}{2:05d}.jpg".format(prefix, Constants.PDF_MULTI_SUFFIX, j+1), img_bytes)

            else:
                img = Image.open(BytesIO(img_bytes))
                img_bytes = convert_image_to_greyscale_bytes(img, quality=params[Constants.QUALITY])
                output_folder.upload_data("{}.jpg".format(prefix), img_bytes)

            logger.info("OCR - Converted {}/{} images".format(i+1, total_images))
        except Exception as e:
            logger.info("Failed converting file {} to greyscale because: {}".format(sample_file, e))

    else:
        logger.info("OCR - Rejecting {} because it is not a {} file.".format(sample_file, '/'.join(Constants.OCR_TYPES)))
        logger.info("OCR - Rejected {}/{} images".format(i+1, total_images))
``` 

I simply added a try/except around the file conversion, plus a log message with the error if an exception occurs.
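To illustrate the pattern outside the recipe, here is a minimal, self-contained sketch. The names `convert_all`, `convert_one` and `fake_convert` are hypothetical and not part of the plugin. The sketch also logs at warning level and collects the failed file names, which the change above does not do.

```python
import logging

logger = logging.getLogger("greyscale_sketch")


def convert_all(filenames, convert_one):
    """Apply convert_one to each file; log and collect failures instead of aborting."""
    failed = []
    for name in filenames:
        try:
            convert_one(name)
        except Exception as e:
            # One bad file should not stop the whole run.
            logger.warning("Failed converting file %s to greyscale because: %s", name, e)
            failed.append(name)
    return failed


def fake_convert(name):
    # Hypothetical converter that rejects one file to simulate a corrupted PDF.
    if name == "broken.pdf":
        raise ValueError("corrupted PDF")


failed_files = convert_all(["a.pdf", "broken.pdf", "b.png"], fake_convert)
```

Returning the list of failures would make it possible to report them at the end of the job, if that is considered useful.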

Note that I also added ".lower()" to the suffix, because ".PDF" files weren't converted: "PDF" is not in the Constants.OCR_TYPES list.
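As a small sketch of the case-insensitive check, the snippet below normalizes the extension with `os.path.splitext`. The `OCR_TYPES` list here is a hypothetical stand-in for Constants.OCR_TYPES (the plugin's real list may differ), and `split_name` / `is_supported` are illustrative helper names only.

```python
import os

# Hypothetical stand-in for Constants.OCR_TYPES; the plugin's real list may differ.
OCR_TYPES = ["pdf", "jpg", "jpeg", "png", "tiff"]


def split_name(filename):
    """Return (prefix, suffix), with the suffix lower-cased and without the dot."""
    prefix, ext = os.path.splitext(filename)
    return prefix, ext[1:].lower()


def is_supported(filename):
    """True if the lower-cased extension is one of the accepted types."""
    return split_name(filename)[1] in OCR_TYPES
```

Unlike `sample_file.split('.')[0]`, which cuts the name at the first dot, `os.path.splitext` only strips the last extension, so a file such as "scan.v2.PDF" would keep "scan.v2" as its prefix. Whether that matters depends on how input files are named.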

Regards,

Adrien

Add Exception handling for Grayscale recipe #80
