Tesseract HOCR plugin
The TESSERACT_HOCR plugin is commonly used in the Page Process module. This plugin reads the image files listed in the batch.xml file for a batch, generates an HOCR.xml file for each image, and updates the batch.xml accordingly.
Configuration
The following table includes the list of configurable properties for the TESSERACT_HOCR plugin.
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Tesseract Switch | List of values | ON, OFF | This switch is used to turn this plugin on or off. If this switch is OFF, this plugin does not function. |
Tesseract color switch | List of values | ON, OFF | Tesseract is unable to read color TIFFs. If a user has color images (and the color switch is set to ON), PNG files are sent for OCR processing. Set the color switch to ON if you expect to have color images. |
Tesseract Language | String | N/A | The language to use for OCR. Tesseract currently supports only a single language per image file. For example, specify 'eng' for English, 'tur' for Turkish, and so on. |
Tesseract Version | String | N/A | The Tesseract version installed on the system. For example, specify 'tesseract_version_3' for Tesseract 3.0, 'tesseract_version_2' for Tesseract 2.0, and so on. |
Tesseract Valid Extensions | Multi-select | tif, gif, png | The file extensions that this plugin supports. |
Steps of execution
This plugin runs in the Page Process phase of Transact, after import processing is complete. It performs OCR on every input image, generates HOCR output in the form of HTML and HOCR.xml, and then writes the name of each HOCR file to the batch.xml file.
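To make the OCR step more concrete, the following minimal Python sketch builds the kind of command line that the Tesseract 3.x CLI accepts for HOCR output (`tesseract <image> <output base> -l <language> hocr`). How the plugin invokes Tesseract internally is not documented here, so the function name `build_hocr_command` and its parameters are hypothetical and serve only as an illustration.

```python
from pathlib import Path


def build_hocr_command(image_path, language="eng", executable="tesseract"):
    """Return an argument list for a Tesseract 3.x HOCR run.

    Hypothetical helper: this is NOT the plugin's own code, only a sketch
    of an equivalent manual invocation.
    """
    image = Path(image_path)
    # Tesseract appends its own extension to the output base name,
    # so the image extension is dropped here.
    output_base = image.with_suffix("")
    # "hocr" is the Tesseract config file that switches output to HOCR.
    return [executable, str(image), str(output_base), "-l", language, "hocr"]


if __name__ == "__main__":
    # Only prints the command; running it requires Tesseract to be installed.
    print(" ".join(build_hocr_command("page1.png", "eng")))
```

On a machine with Tesseract installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to reproduce a single-image OCR run outside of Transact, which can help when checking whether a problem lies in the image or in the batch configuration.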
Dependency
This plugin only requires an image as an input (PNG if the color switch is ON, TIFF if the color switch is OFF). Therefore, either the Create OCR Input plugin or the Create Display Image plugin must run before this plugin.
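The input rule above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and accepting both `.tif` and `.tiff` for the TIFF case is an assumption rather than documented plugin behavior.

```python
def expected_input_extensions(color_switch):
    """Map the color switch value to the image extensions the plugin reads.

    Hypothetical helper; ON selects PNG input, anything else selects TIFF.
    """
    if color_switch.strip().upper() == "ON":
        return (".png",)
    # Assumption: both common TIFF spellings are treated as TIFF input.
    return (".tif", ".tiff")


def select_ocr_inputs(file_names, color_switch):
    """Return the file names that would be sent for OCR, in original order."""
    extensions = expected_input_extensions(color_switch)
    return [name for name in file_names if name.lower().endswith(extensions)]
```

For example, with the color switch set to ON, a folder containing both `page1.tif` and `page1.png` would yield only `page1.png` as OCR input.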
Troubleshooting
The following table lists error messages that this plugin can produce and the possible root cause of each.
Error message | Possible root cause |
---|---|
Tesseract Base path not configured. | The environment variable for Tesseract is either not set or the path is configured incorrectly. |
Space found in the name of image: xyz.png. So it cannot be processed. | One or more spaces were found in the file name. Remove the spaces from the image file name and restart the batch from the Page Process module. |
No valid extensions are specified in resources. | No extensions were specified for this plugin. |
Image Processing or XML update failed for image: xyz | The image file being processed has a file extension that is not included in the list of valid extensions for the plugin. |
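Several of these errors can be caught before a batch reaches the Page Process module. The following Python sketch checks a list of image file names for spaces and for extensions outside the configured list. It is a hypothetical pre-flight helper, not part of the plugin, and the messages it returns are illustrative rather than the plugin's own error text.

```python
from pathlib import Path


def preflight_problems(image_names, valid_extensions):
    """Return a list of human-readable problems found in the inputs.

    Hypothetical helper: mirrors the root causes listed in the table,
    but does not reproduce the plugin's actual checks or messages.
    """
    # Normalize configured extensions, e.g. " .TIF " -> "tif".
    allowed = {
        ext.strip().lower().lstrip(".")
        for ext in valid_extensions
        if ext.strip()
    }
    if not allowed:
        return ["no valid extensions configured"]

    problems = []
    for name in image_names:
        if " " in name:
            problems.append(f"{name}: file name contains a space")
        extension = Path(name).suffix.lower().lstrip(".")
        if extension not in allowed:
            problems.append(
                f"{name}: extension '{extension}' is not in the valid list"
            )
    return problems
```

An empty result means none of the file-name or extension problems from the table were detected. The Tesseract base path error is not covered, because the name of the environment variable is not given here.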