Support for multilingual files (Machine Learning)
In Transact, machine learning functionality provides multilingual support. You can use machine learning to extract data from documents written not only in English but in any language supported by the underlying Transact OCR engine (RecoStar for Microsoft Windows and OmniPage for Linux). The system picks up the languages specified by the <LanguageCode> tag in the HOCR.xml file and uses this data to learn and extract values from the document.
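As an illustration of how the language information could be inspected outside Transact, the following minimal sketch collects the <LanguageCode> values from an HOCR-style XML string. Only the tag name comes from the description above. The surrounding element names, the sample code values, and the helper name `language_codes` are assumptions made for the example and may not match a real HOCR.xml file.

```python
import xml.etree.ElementTree as ET

# Hypothetical HOCR.xml fragment: only the <LanguageCode> tag name is taken
# from the documentation; the other elements and the values are assumptions.
SAMPLE_HOCR = """<HocrPages>
  <HocrPage>
    <LanguageCode>de</LanguageCode>
  </HocrPage>
  <HocrPage>
    <LanguageCode>fr</LanguageCode>
  </HocrPage>
</HocrPages>"""


def language_codes(hocr_xml: str) -> list[str]:
    """Return the distinct <LanguageCode> values in document order."""
    root = ET.fromstring(hocr_xml)
    codes = []
    for element in root.iter():
        # Compare the local tag name so a namespaced file is handled too.
        local_name = element.tag.rsplit("}", 1)[-1]
        text = (element.text or "").strip()
        if local_name == "LanguageCode" and text and text not in codes:
            codes.append(text)
    return codes


print(language_codes(SAMPLE_HOCR))
```

A check like this can help confirm that a batch was processed with the intended OCR languages before any values are learned.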
To use machine learning for multilingual files:
- Create or open a Batch Class.
- Create a Document Type.
- Create and configure the Index Fields.
- Navigate to the RECOSTAR_HOCR plugin or OMNIPAGE_HOCR plugin in the Page Process module and select the language in the OCR Country/Language field.
You can select one or several languages, separated by semicolons (;). Once a semicolon is typed, the list of available OCR languages appears. This suggestion list contains all the languages currently supported by the application.
If you do not specify the language in the HOCR plugin, English is used by default.
Now, every time you run a batch using this Batch Class, the HOCR.xml file contains the <LanguageCode> tag with the code of the OCR language specified in the RECOSTAR_HOCR or OMNIPAGE_HOCR plugin.
- Go to the Extraction module, add the MACHINE_LEARNING_BASED_EXTRACTION plugin and click Apply to save your changes.
- Navigate to the MACHINE_LEARNING_BASED_EXTRACTION plugin configuration screen, turn the Machine Learning Based Extraction Switch ON and click Apply.
To enable machine learning for tables, turn ON the Machine Learning Based Table Extraction Switch.
- Go to the Upload Batch screen and run the batch.
- If any Index Field is not extracted properly, the batch stops at the Validation stage.
Open the Validation screen and perform machine learning:
- Place your cursor in the text box of the index field to be learned in the middle pane of the Validation screen.
- On the image view pane of the Validation screen, click on the area of the image where the index field is located. An overlay appears on the image and the text box is populated with the index field value.
- Click on the overlay to open the Suggestion View window.
- Select a predefined regex or create a new regex and click OK.
The data is now learned in the language or languages defined in the HOCR.xml file.
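The semicolon-separated value entered in the OCR Country/Language field, together with the English default, can be modelled with a short sketch. The function name `parse_ocr_languages` and the language names used as input are hypothetical. The exact strings that the plugin accepts should be taken from its suggestion list.

```python
def parse_ocr_languages(field_value: str, default: str = "English") -> list[str]:
    """Split a semicolon-separated language value into a clean list.

    Empty entries and repeated names are dropped; if nothing remains,
    the default language is returned, mirroring the documented fallback.
    """
    languages = []
    for part in field_value.split(";"):
        name = part.strip()
        if name and name not in languages:
            languages.append(name)
    return languages or [default]


print(parse_ocr_languages("German; French;"))
print(parse_ocr_languages(""))
```

The first call yields the two named languages, and the second falls back to English because no language was specified.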
On the Validation screen, you can also add custom dictionaries containing data in various languages.
Create a custom dictionary for a specific language
- On the Validation screen, click on the overlay created for the Index Field to open the Suggestion View window.
- Select the Create Type option and in the Type drop-down list select Dictionary.
- Define the Type Name and add as many values for the dictionary as required by using the plus button.
Use the corresponding button to delete any value.
- Click OK to save the custom dictionary.
Now, your dictionary is added to the list of default dictionaries in .txt format. You can find it in the dictionaries folder at the following location:
Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries
This custom dictionary file contains all the values added on the Validation screen.
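Before relying on a custom dictionary, it can be useful to review its contents. The sketch below reads a dictionary text file and reports repeated entries. It assumes one UTF-8 value per line, which is not confirmed by the description above, and the file name, sample values, and helper names are hypothetical.

```python
from pathlib import Path
import tempfile


def load_dictionary(path: Path) -> list[str]:
    """Read dictionary values, assuming one UTF-8 value per line."""
    values = []
    for line in path.read_text(encoding="utf-8").splitlines():
        value = line.strip()
        if value:
            values.append(value)
    return values


def find_duplicates(values: list[str]) -> list[str]:
    """Return values that repeat an earlier entry, ignoring case."""
    seen, duplicates = set(), []
    for value in values:
        key = value.casefold()
        if key in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(key)
    return duplicates


# Write a small sample file so the sketch runs without a Transact install.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "German_Names.txt"
    sample.write_text("Müller\nSchmidt\n\nmüller\n", encoding="utf-8")
    names = load_dictionary(sample)
    print(names)
    print(find_duplicates(names))
```

Reading the file explicitly as UTF-8 matters for multilingual dictionaries, because values with non-ASCII characters can be corrupted when a platform default encoding is used.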
Import a custom dictionary for a specific language
- Navigate to the Batch Class Management screen and select your Batch Class.
- Go to .
- In the Upload Machine Learning Dictionary(s) section, click Select Files or drag and drop the file containing the dictionary into the specified area.
The dictionary is imported successfully.
- When you import the dictionary manually, you are prompted to make changes in the mappings file:
- Navigate to the Folder Management section and select your Batch Class.
- Go to the dictionaries folder (machine-learning-dictionaries\knowledge-base\dictionaries) and select the dictionary_mappings_properties file.
- Click Edit.
- Provide the following information to perform the dictionary mapping.

| Field | Description |
| --- | --- |
| Key | Define the Dictionary name (such as German_Names). This name will appear in the Predefined Types list in the Suggestion View window on the Validation screen. |
| Value | Define the dictionary text file (such as German_Names.txt) and provide the Display value: 0 = do not display the Dictionary Type in the Suggestion View window on the Validation screen; 1 = display the Dictionary Type in the Suggestion View window on the Validation screen. |

- Click Save to save your changes.
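The Key and Value fields described above can be validated with a small sketch before the mappings file is saved. The entry layout used here (`Key=file;display`, with `=` and `;` as separators) is an assumption made for illustration only. Check the actual layout against the existing entries in the mappings file. The names `DictionaryMapping` and `parse_mapping` are hypothetical.

```python
from typing import NamedTuple


class DictionaryMapping(NamedTuple):
    name: str        # Key: name shown in the Predefined Types list
    file_name: str   # dictionary text file, such as German_Names.txt
    display: bool    # True for Display value 1, False for 0


def parse_mapping(line: str, value_separator: str = ";") -> DictionaryMapping:
    """Parse one 'Key=Value' entry; the value layout is an assumption."""
    key, _, value = line.partition("=")
    file_name, _, flag = value.partition(value_separator)
    flag = flag.strip()
    if flag not in ("0", "1"):
        raise ValueError(f"Display value must be 0 or 1, got {flag!r}")
    return DictionaryMapping(key.strip(), file_name.strip(), flag == "1")


print(parse_mapping("German_Names=German_Names.txt;1"))
```

Rejecting any Display value other than 0 or 1 catches typing mistakes early, before a dictionary unexpectedly goes missing from the Suggestion View window.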
Machine learning dictionaries and regex can also be modified in the Folder Management section.
Customize dictionaries and regex for a specific language
To customize dictionaries and regex for a specific language in the Folder Management screen, do the following:
- On the left menu panel, select Folder Management and double-click the selected Batch Class.
- Go to the knowledge-base folder (SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base) to find all stored dictionaries and regex.
- In the dictionaries folder, double-click any dictionary to see its entries.
Here, you can add, delete, and edit values as required.
- Click Save to save your changes.
- In the regex folder, open the regex.txt file to view the list of all pre-defined regular expressions.
Here, you can add, delete, and edit values as required.
- Click Save to save your changes.
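When editing regular expressions by hand, a quick syntax check can prevent a malformed pattern from being saved. The sketch below assumes that regex.txt holds one pattern per line, which is not confirmed above, and it uses Python's `re` module as a stand-in. Transact may use a different regex dialect, so a pattern that compiles here is not guaranteed to behave identically in the product. The sample patterns and helper names are hypothetical.

```python
import re

# Hypothetical regex.txt content: one pattern per line (an assumption).
SAMPLE_REGEX_FILE = r"""
\d{2}\.\d{2}\.\d{4}
[A-Z]{2}\d{6}
[0-9]+(
"""


def check_patterns(text: str) -> tuple[list[re.Pattern], list[str]]:
    """Compile each non-empty line; return (compiled, invalid) patterns."""
    compiled, invalid = [], []
    for line in text.splitlines():
        pattern = line.strip()
        if not pattern:
            continue
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            invalid.append(pattern)
    return compiled, invalid


def matching_patterns(patterns: list[re.Pattern], value: str) -> list[str]:
    """Return the source of every pattern that matches the whole value."""
    return [p.pattern for p in patterns if p.fullmatch(value)]


patterns, invalid = check_patterns(SAMPLE_REGEX_FILE)
print(invalid)
print(matching_patterns(patterns, "31.12.2024"))
```

The third sample line has an unclosed group, so it is reported as invalid, while the date-like value matches only the first pattern.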