Tesseract HOCR plugin

The TESSERACT_HOCR plugin is commonly used in the Page Process module. This plugin reads the image files listed in the batch.xml file for a batch, generates an HOCR.xml file for each image, and updates the batch.xml accordingly.

Configuration

The following table lists the configurable properties for the TESSERACT_HOCR plugin.

Configurable property | Type of value | Value options | Description
Tesseract Switch | List of values | ON, OFF | Turns this plugin on or off. If this switch is OFF, the plugin does not run.
Tesseract color switch | List of values | ON, OFF | Tesseract cannot read color TIFF files. When this switch is ON, PNG files are sent for OCR processing instead. Set this switch to ON if you expect to have color images.
Tesseract Language | String | N/A | Selects the language to use for OCR. Tesseract currently supports only a single language per image file. For example, specify 'eng' for English or 'tur' for Turkish.
Tesseract Version | String | N/A | Specifies the Tesseract version installed on the system. For example, specify 'tesseract_version_3' for Tesseract 3.0 or 'tesseract_version_2' for Tesseract 2.0.
Tesseract Valid Extensions | Multi-select | tif, gif, png | The file extensions that this plugin supports.
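
As a rough illustration of how these five properties relate, the following Python sketch models them in memory and checks the values against the options listed in the table. The class name, field names, and defaults are hypothetical and are not part of the plugin's API. Only the allowed values (ON/OFF and tif, gif, png) come from the table.

```python
from dataclasses import dataclass

# Extensions listed in the table above.
SUPPORTED_EXTENSIONS = {"tif", "gif", "png"}


@dataclass(frozen=True)
class TesseractHocrSettings:
    """Hypothetical model of the plugin properties (illustration only)."""
    switch: str = "ON"
    color_switch: str = "OFF"
    language: str = "eng"
    version: str = "tesseract_version_3"
    valid_extensions: tuple = ("tif", "gif", "png")

    def problems(self):
        """Return a list of problems; an empty list means the values look consistent."""
        found = []
        for name, value in (("switch", self.switch),
                            ("color_switch", self.color_switch)):
            if value not in ("ON", "OFF"):
                found.append(f"{name} must be ON or OFF, got {value!r}")
        if not self.language.strip():
            found.append("language must not be empty")
        if not self.valid_extensions:
            found.append("no valid extensions are specified")
        unsupported = sorted(set(self.valid_extensions) - SUPPORTED_EXTENSIONS)
        if unsupported:
            found.append(f"unsupported extensions: {unsupported}")
        return found
```

Calling problems() on the default settings returns an empty list. A value such as switch="on" is reported because the check is case-sensitive in this sketch.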

Steps of execution

This plugin runs in the Page Process phase of Transact, after import processing is complete. It performs OCR on every input image and generates HOCR output in the form of HTML and HOCR.xml. When all images have been processed, it writes the name of each HOCR file to the batch.xml file for the batch.
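
The per-image OCR step can be approximated outside Transact with the standard Tesseract command line, which accepts an input image, an output base name, a language (-l), and the hocr config name. The Python sketch below is an assumption about what such a loop might look like, not the plugin's actual implementation. The function names and the injectable runner parameter are hypothetical, and the batch.xml update is reduced to returning a mapping.

```python
import subprocess
from pathlib import Path


def build_hocr_command(image, output_base, language="eng"):
    """Build a Tesseract CLI call that requests HOCR output for one image."""
    # Tesseract appends its own extension to output_base; which extension
    # it uses depends on the installed Tesseract version.
    return ["tesseract", str(image), str(output_base), "-l", language, "hocr"]


def ocr_batch(images, out_dir, language="eng", runner=subprocess.run):
    """Run OCR on each image and return {image name: HOCR output base name}.

    The returned mapping stands in for the names the plugin records in
    batch.xml; the real XML update is not modelled here.
    """
    produced = {}
    for image in images:
        output_base = Path(out_dir) / Path(image).stem
        runner(build_hocr_command(image, output_base, language), check=True)
        produced[Path(image).name] = output_base.name
    return produced
```

Passing a different runner makes it possible to inspect the generated commands without having Tesseract installed. With the default runner, a failed Tesseract call raises subprocess.CalledProcessError.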

Troubleshooting

The following table lists error messages that this plugin can produce and the possible root cause of each.

Error message | Possible root cause
Tesseract Base path not configured. | The environment variable for Tesseract is either not set or the path is configured incorrectly.
Space found in the name of image: xyz.png. So it cannot be processed. | One or more spaces were found in the file name. Remove the spaces from the image file name and restart the batch from the Page Process module.
No valid extensions are specified in resources. | No extensions were specified for this plugin.
Image Processing or XML update failed for image: xyz | The image file being processed has a file extension that is not included in the list of valid extensions for the plugin.
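
Most of these conditions can be checked before a batch reaches the Page Process module. The Python sketch below maps each root cause in the table to a simple check and returns the matching message. It is a troubleshooting aid under stated assumptions, not the plugin's own validation logic. In particular, the environment variable name TESSERACT_BASE_PATH is a placeholder, because this page does not name the variable, and the order of the checks is arbitrary.

```python
import os
from pathlib import Path


def preflight(image_name, valid_extensions,
              base_path_var="TESSERACT_BASE_PATH", env=None):
    """Return the error message an image would likely trigger, or None.

    base_path_var is a hypothetical variable name; substitute the one
    your installation actually uses.
    """
    env = os.environ if env is None else env
    if not env.get(base_path_var):
        return "Tesseract Base path not configured."
    if not valid_extensions:
        return "No valid extensions are specified in resources."
    if " " in image_name:
        return (f"Space found in the name of image: {image_name}. "
                "So it cannot be processed.")
    extension = Path(image_name).suffix.lstrip(".").lower()
    if extension not in {e.lower() for e in valid_extensions}:
        return f"Image Processing or XML update failed for image: {image_name}"
    return None
```

For example, preflight("scan 01.png", ["png"]) reports the space in the file name, provided the environment variable is set.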