Train Tesseract

Kofax RPA uses either the Tesseract or OmniPage OCR engine to capture text from images and to perform Intelligent Screen Automation (ISA). The OmniPage installation includes all supported languages. The Tesseract installation includes only English. To use another language with Tesseract, supply a .traineddata file for that language.

If you experience issues recognizing specific languages or letters, you can train Tesseract to read the fonts properly.

The scripts that Kofax RPA supplies for preparing training data are intended for Linux operating systems. Currently, Tesseract version 3.4.0 is used.

Prerequisites

Make sure your system complies with the following prerequisites before creating training data.

System requirements for Ubuntu-based systems

Install the required libraries and Git with the sudo apt-get install command as follows.

sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev git

Training prerequisites

Go to nativelib/hub/linux-x64/<hub_id>/tools/tesseract_train/bin in the Kofax RPA installation directory and run the prepare.sh script. For example:

$ cd /home/user88/Kofax_RPA/nativelib/hub/linux-x64/574/tools/tesseract_train/bin

$ ./prepare.sh
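
Because the <hub_id> part of the path differs between installations, it can be convenient to locate the folder instead of typing it. The following is a minimal sketch, not part of the Kofax RPA scripts: the function name find_train_bin is hypothetical, and it assumes the directory layout shown in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every tesseract_train/bin directory found
# under a Kofax RPA installation directory.
# Assumes the layout nativelib/hub/linux-x64/<hub_id>/tools/tesseract_train/bin.
find_train_bin() {
  local install_dir="$1"
  # <hub_id> varies, so match the tail of the path and let find fill in the rest.
  find "$install_dir/nativelib/hub/linux-x64" -maxdepth 4 -type d \
    -path '*/tools/tesseract_train/bin' 2>/dev/null
}
```

For example, with the installation directory used above, `cd "$(find_train_bin /home/user88/Kofax_RPA | head -n 1)"` changes to the first matching folder, after which you can run ./prepare.sh.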

Automatic training

Choose this mode if you have the TTF font file used in the UI that you want to recognize. This mode is simpler than the manual training mode. To create a training data file for the desired font, run the tesstrain_auto.sh script located in the tesseract_train/bin folder, specifying the language code, the font name, and the font file directory.

Run the script with tesseract_train/bin as the working directory, as follows.

$ ./tesstrain_auto.sh --lang eng --fontlist 'Envy Code R' --fonts_dir ..

When the script finishes successfully, it prints a message similar to the following. The temporary directory name varies between runs.

Moving /tmp/tmp.OtEqYbS3qV/eng/eng.traineddata to ../output
Completed training for language 'eng'
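
To confirm that training produced a usable file, you can check the output folder named in the message. The following is a minimal sketch under the assumption that the file is written to ../output as <lang>.traineddata, as the message above indicates. The function name check_traineddata is hypothetical and is not part of the supplied scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether <output_dir>/<lang>.traineddata
# exists and is not empty. output_dir defaults to ../output, the folder
# named in the script's completion message.
check_traineddata() {
  local lang="$1"
  local output_dir="${2:-../output}"
  local file="$output_dir/$lang.traineddata"
  if [ -s "$file" ]; then
    echo "OK: $file"
  else
    echo "Missing or empty: $file" >&2
    return 1
  fi
}
```

From the tesseract_train/bin working directory, `check_traineddata eng` checks ../output/eng.traineddata and returns a nonzero status if the file is absent or empty.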

Now you can use the trained data file in Kofax RPA. See Change Default OCR Language in Configure Automation Device and "Change or add UI recognition language" under the Intelligent Screen Automation topic in Tree Modes.

Manual training

Choose this mode if you do not have the TTF font file used in the UI (so the automatic mode cannot be applied), but you have enough UI screenshots to cover every character you want the robot to recognize. Unlike the automatic mode, where the script creates the training image file for you, manual mode requires you to create the training image yourself. Crafting such a file takes some time and diligence.

Perform the following steps to create a training data file for Tesseract. The training image must contain every character (uppercase and lowercase letters, numbers, punctuation marks, and so on) that the final training data file needs to cover. The partial example below shows how to create training data for use with the following UI.



  1. Determine the full character set to be used. Include at least five samples of each character in the training file, and more samples of the most frequently used characters.

  2. Put all parts of the UI screenshots that will be used for training into a single TIFF file. You can use any image editor for this operation. This example limits the target alphabet to 10-15 English letters. In production, make sure that you have samples of all letters.



  3. Select areas with inverted colors and restore them to normal.



  4. Scale the image using cubic interpolation so that uppercase letters are 36 px high. The scale factor is 36 divided by the current height of the uppercase letters in pixels. In this example, the image was upscaled by a factor of 2.97 (only part of the image is shown).



  5. Rearrange words to form easily detectable text lines without large spaces between text regions. Remove any text you consider redundant, as in the following example (downscaled to fit the page).



  6. Convert the image to grayscale and apply a threshold color effect that produces text of the best quality. Selecting the right threshold can be difficult, so consider applying two or more thresholds and copying the resulting images into a single TIFF file. The training image then contains several different representations of each letter. In this example, thresholds of 125 and 150 were applied in the GIMP editor and the results were copied into one file. Note that the text in the upper half of the image is thinner than in the bottom half (downscaled to fit the page).



  7. Manually remove noise as in the following example (downscaled to fit the page).



  8. Save the image as an uncompressed TIFF file (.tif or .tiff), such as MyFont.tif.

  9. Make a box file. The box file is a text file that lists the characters in the training image, one per line, with the coordinates of the bounding box around each character. See the "Training Tesseract - Make Box Files" page in the Tesseract project on GitHub: https://github.com.

    Copy the box text into a new file, such as MyFont.box.

    In our example, the box file should start with the following lines:

    P 15 1076 39 1108 0
    r 41 1076 53 1100 0
    i 57 1076 62 1108 0
    n 68 1076 89 1100 0
    t 92 1076
    ...
  10. Go to the tesseract_train/bin folder and run the tesstrain_manual.sh script, specifying the language code and the paths to the TIFF image and the box file. For example:

    $ ./tesstrain_manual.sh --lang eng --box_file ../MyFont.box --training_image ../MyFont.tif

    When the script finishes successfully, it prints a message similar to the following. The temporary directory name varies between runs.

    Moving /tmp/tmp.OtEqYbS3qV/eng/eng.traineddata to ../output
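
Before you run tesstrain_manual.sh, it can help to sanity-check the box file from step 9 against the five-sample minimum from step 1. The following is a minimal sketch, not part of the supplied scripts. It assumes the usual Tesseract box line layout of character, left, bottom, right, top, and page, which matches the complete lines in the example above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking a Tesseract box file.
# Assumed line layout: <char> <left> <bottom> <right> <top> <page>

# Print "<line number>: <line>" for each non-blank line that does not have
# six fields, or whose box has no width or height.
malformed_box_lines() {
  awk '
    NF == 0 { next }
    NF != 6 || $2+0 >= $4+0 || $3+0 >= $5+0 { print FNR ": " $0 }
  ' "$1"
}

# Print "<char> <count>" for each character that appears fewer than
# min times (default 5). Characters that never appear are not listed.
undersampled_chars() {
  local box_file="$1"
  local min="${2:-5}"
  awk -v min="$min" '
    NF == 6 { count[$1]++ }
    END { for (c in count) if (count[c] < min) print c, count[c] }
  ' "$box_file" | LC_ALL=C sort
}
```

For example, `malformed_box_lines ../MyFont.box` lists lines to fix by hand, and `undersampled_chars ../MyFont.box` lists characters that need more samples in the training image. Because characters that are missing entirely are not reported, compare the result with the character set you determined in step 1.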

Now you can use the trained data file in Kofax RPA. See Change Default OCR Language in Configure Automation Device and "Change or add UI recognition language" under the Intelligent Screen Automation topic in Tree Modes.

More information is available on the Tesseract wiki pages on the GitHub website.