Support for multilingual files (Machine Learning)

Machine learning in Transact supports multiple languages. You can use it to extract data from documents written in English or in any other language supported by the underlying Transact OCR engine (RecoStar on Microsoft Windows, OmniPage on Linux). The system reads the languages specified in the <LanguageCode> tag of the HOCR.xml file and uses them to learn and extract values from the document.

The default machine learning dictionaries included in the installer are currently provided only for English. If you need a dictionary for another language, you can create it during data extraction on the Validation screen (see "Create a custom dictionary for a specific language" later in this topic).

To use machine learning for multilingual files:

  1. Create or open a Batch Class.
  2. Create a Document Type.
  3. Create and configure the Index Fields.
  4. Navigate to the RECOSTAR_HOCR plugin or OMNIPAGE_HOCR plugin in the Page Process module and select the language in the OCR Country/Language field.

    You can select one language or several. To select several, separate them with a semicolon (;). When you type a semicolon, the list of available OCR languages appears. This suggestion list contains all the languages currently supported by the application.

    If you do not specify the language in the HOCR plugin, English is used by default.

    Now, every time you run a batch using this Batch Class, the HOCR.xml file contains the <LanguageCode> tag with the code of the OCR language specified in the RECOSTAR_HOCR or OMNIPAGE_HOCR plugin.

  5. Go to the Extraction module, add the MACHINE_LEARNING_BASED_EXTRACTION plugin and click Apply to save your changes.
  6. Navigate to the MACHINE_LEARNING_BASED_EXTRACTION plugin configuration screen, turn the Machine Learning Based Extraction Switch ON and click Apply.

    To enable machine learning for tables, turn ON the Machine Learning Based Table Extraction Switch.

  7. Go to the Upload Batch screen and run the batch.
  8. If any Index Field is not extracted properly, the batch stops at the Validation stage.

    Open the Validation screen and perform machine learning:

    1. Place your cursor in the text box of the index field to be learned in the middle pane of the Validation screen.
    2. On the image view pane of the Validation screen, click on the area of the image where the index field is located. An overlay appears on the image and the text box is populated with the index field value.
    3. Click on the overlay to open the Suggestion View window.
    4. Select a predefined regex or create a new regex and click OK.

The data is now learned in the language or languages defined in the HOCR.xml file.

If you use machine learning to extract data from a single document that contains data in multiple languages, the extraction results can be inconsistent.
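
To check which languages a processed batch was tagged with, you can read the <LanguageCode> elements from its HOCR.xml file. The following Python sketch is an illustration only and is not part of Transact. It assumes only that each language code appears as the text of a <LanguageCode> element. The sample XML, its surrounding element names, the sample value, and the helper name language_codes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <LanguageCode> tag name comes from this topic.
# The surrounding structure and the value are invented for illustration.
SAMPLE_HOCR = """<HocrPages>
  <HocrPage>
    <LanguageCode>de</LanguageCode>
  </HocrPage>
</HocrPages>"""


def language_codes(xml_text):
    """Return the text of every <LanguageCode> element, in document order."""
    root = ET.fromstring(xml_text)
    codes = []
    for element in root.iter():
        # Compare the local name so that a namespaced file is handled as well.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "LanguageCode" and element.text:
            codes.append(element.text.strip())
    return codes


print(language_codes(SAMPLE_HOCR))
```

To run the same check on a real file, read its contents and pass the text to language_codes.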

On the Validation screen, you can also add custom dictionaries containing data in various languages.

Create a custom dictionary for a specific language

  1. On the Validation screen, click on the overlay created for the Index Field to open the Suggestion View window.
  2. Select the Create Type option and in the Type drop-down list select Dictionary.
  3. Define the Type Name and add as many values for the dictionary as required by using the plus button.

    Use the corresponding button to delete any value.

  4. Click OK to save the custom dictionary.

Your dictionary is now added to the list of default dictionaries as a .txt file. You can find it in the dictionaries folder at the following location:

Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries

This custom dictionary file contains all the values added on the Validation screen.
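
If you want to inspect a custom dictionary outside the Validation screen, a short script can list the .txt files in the dictionaries folder and read their values. The Python sketch below works under assumptions: it supposes that each dictionary stores one value per line in UTF-8, which this topic does not state, and the function names and sample values are illustrative. A temporary folder stands in for the dictionaries folder of a real installation.

```python
import tempfile
from pathlib import Path


def list_dictionaries(dictionaries_dir):
    """Return the names of the .txt dictionary files in a folder, sorted."""
    return sorted(p.name for p in Path(dictionaries_dir).glob("*.txt"))


def read_dictionary(path):
    """Read dictionary values, assuming one value per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demonstration on a temporary folder standing in for the dictionaries folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "German_Names.txt"
    sample.write_text("Anna\nJürgen\n\nKlaus\n", encoding="utf-8")
    names = list_dictionaries(tmp)
    values = read_dictionary(sample)

print(names)
print(values)
```

To use it on your own system, pass the path of the dictionaries folder of your Batch Class instead of the temporary folder.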

Import a custom dictionary for a specific language

  1. Navigate to the Batch Class Management screen and select your Batch Class.
  2. Go to machine-learning-dictionaries > knowledge-base > dictionaries.
  3. In the Upload Machine Learning Dictionary(s) section, click Select Files or drag and drop the file containing the dictionary into the specified area.

    The dictionary is imported successfully.

  4. When you import the dictionary manually, you are prompted to make changes in the mappings file:
    1. Navigate to the Folder Management section and select your Batch Class.
    2. Go to the dictionaries folder (machine-learning-dictionaries\knowledge-base\dictionaries) and select the dictionary_mappings_properties file.
    3. Click Edit.
    4. Provide the following information to perform the dictionary mapping.

      Field: Key
      Description: Define the dictionary name (such as German_Names). This name appears in the Predefined Types list in the Suggestion View window on the Validation screen.

      Field: Value
      Description: Define the dictionary text file (such as German_Names.txt) and provide the Display value:

      • 0 = do not display the Dictionary Type in the Suggestion View window on the Validation screen

      • 1 = display the Dictionary Type in the Suggestion View window on the Validation screen

    5. Click Save to save your changes.

Machine learning dictionaries and regex can also be modified in the Folder Management section.

Customize dictionaries and regex for a specific language

To customize dictionaries and regex for a specific language in the Folder Management screen, do the following:

  1. On the left menu panel, select Folder Management and double-click the selected Batch Class.
  2. Go to the knowledge-base folder (SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base) to find all stored dictionaries and regex.
  3. In the dictionaries folder, double-click any dictionary to see its entries.

    Here, you can add, delete, and edit values as required.

  4. Click Save to save your changes.
  5. In the regex folder, open the regex.txt file to view the list of all predefined regular expressions.

    Here, you can add, delete, and edit values as required.

  6. Click Save to save your changes.
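
When you edit a dictionary by hand, duplicates and stray blank lines are easy to introduce. The Python sketch below tidies a list of entries before you paste them back into the editor. It is an illustration only: it assumes one value per line and that exact (case-sensitive) duplicates are unwanted, neither of which this topic specifies, and the function name and sample values are hypothetical.

```python
def clean_entries(lines):
    """Strip whitespace, drop blank lines, and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        value = line.strip()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned


entries = clean_entries(["Anna", " Klaus ", "", "Anna", "Jürgen"])
print(entries)
```

Keeping the original order makes it easier to compare the cleaned list with the entries shown in Folder Management.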