OCR language selection
You can select the languages used by OCR engines from the Plugin Configuration screen of the applicable plugins. The names of the fields are as follows:
- OmniPage HOCR plugin: OCR Country/Language
- Tesseract HOCR plugin: OCR Language
As you select or type a language name, the widget offers suggestions. To open the complete suggestion list, type the suggestion token, which is a semicolon (;), or click in the field when no language is selected. As you type the first letters of the required language name, the widget narrows the suggestions to the languages that match the letters already entered.
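The suggestion behavior described here can be pictured as a simple prefix filter. The following Python sketch is only an illustration and not the widget's actual implementation: the function name `suggest_languages`, the sample language list, and the case-insensitive matching are all assumptions.

```python
# Hypothetical sample list; the real widget offers the languages
# supported by the configured OCR engine.
LANGUAGES = ["English", "French", "German", "Greek", "Spanish"]

# The semicolon opens the complete suggestion list.
SUGGESTION_TOKEN = ";"


def suggest_languages(typed, languages=LANGUAGES):
    """Return suggestions for the text typed so far (illustrative only)."""
    text = typed.strip()
    # The suggestion token, or an empty field, shows the complete list.
    if text == SUGGESTION_TOKEN or text == "":
        return list(languages)
    # Otherwise keep the languages that start with the letters entered.
    # Case-insensitive matching is an assumption of this sketch.
    prefix = text.lower()
    return [name for name in languages if name.lower().startswith(prefix)]


print(suggest_languages(";"))
print(suggest_languages("G"))
```

In this sketch, typing `;` returns every language in the sample list, while typing `G` returns only the names that begin with that letter.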
The widget has several icons.
| Icon | Description |
|---|---|
| Help | Provides suggestions, such as using a semicolon (;) to display the language suggestion list. |
| Error | Indicates invalid input or an empty field. It also appears when you select a non-licensed language. |
| Warning | Provides information and alerts you to conditions such as missing information. For example, the Tesseract Test-Data folder should contain test data for the selected languages. |
Note the following:
- If you do not specify a language in the HOCR plugin, English is used by default.
- During OCR with the OmniPage OCR engine, the system checks whether all selected languages are licensed. If any selected language is not licensed, an empty HOCR file is generated for all pages and an error is written to the log file.
- To OCR documents in Arabic or Asian languages (Chinese_Simplified, Chinese_Traditional, Japanese, Korean) with the OmniPage OCR engine, you need to purchase additional Transact OCR language licenses.
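The default-language rule and the license check can be summarized in a short Python sketch. This is a simplified model of the documented behavior, not the product's code: the function names, the logger name, and the representation of licenses as a set of language names are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("ocr_sketch")  # hypothetical logger name

# English is used when no language is specified.
DEFAULT_LANGUAGE = "English"


def resolve_languages(selected):
    """Fall back to the default language when nothing is selected."""
    return list(selected) if selected else [DEFAULT_LANGUAGE]


def unlicensed_languages(selected, licensed):
    """Return the selected languages that no license covers."""
    return [lang for lang in resolve_languages(selected) if lang not in licensed]


def can_run_ocr(selected, licensed):
    """Return True only if every selected language is licensed."""
    missing = unlicensed_languages(selected, licensed)
    if missing:
        # Models the documented outcome: an error is logged and no
        # usable OCR output is produced for the pages.
        logger.error("Unlicensed OCR languages: %s", ", ".join(missing))
        return False
    return True
```

For example, under these assumptions `can_run_ocr(["English", "Japanese"], {"English"})` returns `False`, because Japanese requires an additional license.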
Information about the selected languages is now included in the HOCR.xml file. This file contains the <LanguageCode> tag, which specifies the code or codes of the OCR languages used by the OMNIPAGE_HOCR and TESSERACT_HOCR plugins.
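To check which languages were recorded for a document, you could read the <LanguageCode> tag from the file with a short script. The Python sketch below is a hedged example: only the tag name comes from this page, while the surrounding element names, the sample value `eng`, and the function name `language_codes` are assumptions. The layout of a real HOCR.xml file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content. Only the <LanguageCode> tag name is taken
# from the documentation; the other elements and the value are assumed.
SAMPLE_HOCR = """<HocrPages>
  <LanguageCode>eng</LanguageCode>
  <HocrPage><PageID>PG0</PageID></HocrPage>
</HocrPages>"""


def language_codes(xml_text):
    """Collect the text of every LanguageCode element in the document."""
    root = ET.fromstring(xml_text)
    codes = []
    # root.iter() walks the whole tree, so the tag's depth does not matter.
    for element in root.iter():
        # Compare local names so that a namespaced tag would also match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "LanguageCode" and element.text:
            codes.append(element.text.strip())
    return codes


print(language_codes(SAMPLE_HOCR))
```

To run this against an actual file, read the file's contents and pass them to `language_codes` in place of the sample string.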