Machine Learning Custom Dictionary support

Dictionaries are part of the machine learning mechanism. They are used to extract types of values for which no specific regex can be defined. Dictionaries contain sets of predefined values such as US states, US cities, and personal names. One of these values is selected at the time of data extraction according to the system settings.

Previously, the dictionaries were provided at the application level and were stored in the META-INF folder (Ephesoft\Application\META-INF).

In Transact, the default dictionaries are on the Batch Class level. The current path for the machine-learning-dictionaries folder is Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries (the folder structure is explained in detail below). You can also add your own custom dictionaries:

  • You can create the dictionaries at the time of DLF learning on the Validation screen. Whenever you create an overlay on the screen and click it, the Suggestion View window pops up with all the predefined and custom regex types as well as dictionaries. Here, you can create new custom types of dictionaries. A custom dictionary can contain any number of values. Once the dictionary values are added and saved, they are used during extraction.
  • You can import the dictionaries from the Batch Class Management screen. The main menu of each Batch Class now includes a new machine-learning dictionaries tab. Here, you can use the Import Machine Learning Dictionary(s) section to upload your dictionaries into the system. The Export button allows you to export selected folders or files.

Machine-learning-dictionaries folder

All the dictionaries are provided at the Batch Class level. The path for the machine-learning-dictionaries folder is:

Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries

This folder has the following subfolders:

  • language-packs: This folder contains language-specific text files with stop words (words such as "and" or "the" that machine learning filters out because they are not to be extracted). The user can add, modify, or delete any file in this folder. By default, language-pack dictionaries are provided for English, German, French, Turkish, Spanish, and Dutch:

    en_stopWords.txt contains English stop words.

    de_stopWords.txt contains German stop words, etc.

  • knowledge-base: This folder contains regex and dictionaries subfolders.

    The regex folder

    The regex folder contains regex-specific text files.

    • The regex.txt file contains simple predefined regex, such as Number, Date, SSN, Amount, Email, etc. as well as custom regex created by a user.

    • The composite.txt file contains the information about the Composite types created by the user via the Suggestion View window on the Validation screen. The Composite type name (or custom block name) is mapped against the Composite type values (either created or predefined). Data is stored in the following format:

      Custom_Block_Name=Custom_Regex_Name/Predefined_Regex_Name|Custom_Regex_Name/Predefined_Regex_Name

      Here, the custom block name is followed by the equal sign, followed by a series of custom regex names or predefined regex/dictionary names separated by the pipe character "|".

      Example: CustomBlock1=CustomId|SSN

      This creates a custom block named "CustomBlock1", which contains the regex "CustomId" (a custom regex type) followed by the regex "SSN" (a predefined regex type).

      The composite block type cannot have composite types as part of its definition.

    • The regex_mappings.properties file contains parent-child relation mappings for regex.

      Child = Parent

      Number = ALL
      Date = ALL
      Amount = Number
      USA_Amount = Amount
      NON_USA_Amount = Amount
      DD_MM_YYYY = DATE
      MM_DD_YYYY = DATE
      etc.
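
The two formats described for the regex folder can be illustrated with a short Python sketch. It is not part of Transact: the function names parse_composite_line and ancestor_chain are hypothetical, and the sketch assumes one entry per line, exact-case type names, and a small hand-written sample of Child = Parent pairs.

```python
def parse_composite_line(line):
    """Split 'Block=TypeA|TypeB' into (block_name, [member type names])."""
    name, _, members = line.strip().partition("=")
    return name.strip(), [m.strip() for m in members.split("|") if m.strip()]


def ancestor_chain(regex_type, child_to_parent):
    """Follow Child = Parent mappings upward until no parent is listed."""
    chain = [regex_type]
    while chain[-1] in child_to_parent:
        parent = child_to_parent[chain[-1]]
        if parent in chain:  # guard against an accidental cycle
            break
        chain.append(parent)
    return chain


# Sample mappings copied from the Amount branch of the list above.
mappings = {
    "Number": "ALL",
    "Amount": "Number",
    "USA_Amount": "Amount",
    "NON_USA_Amount": "Amount",
}

print(parse_composite_line("CustomBlock1=CustomId|SSN"))
# ('CustomBlock1', ['CustomId', 'SSN'])
print(ancestor_chain("USA_Amount", mappings))
# ['USA_Amount', 'Amount', 'Number', 'ALL']
```

Because a composite block type cannot contain other composite types, the member list returned for a block only ever names regex or dictionary types, so no recursion is needed when reading composite.txt.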

    The dictionaries folder

    This folder contains the dictionary files and the dictionary_mappings.properties file.

    • The dictionaries folder includes both default dictionaries and custom dictionaries (created or imported by the user) in .txt format.

    • The dictionary_mappings.properties file contains dictionary types mapped against the corresponding .txt files. Here, you can also specify whether the dictionary should be displayed in the list of Predefined Types in the Suggestion View window on the Validation screen. Each entry has the format Dictionary Type=Dictionary File=Display, where Display is -1, 0, or 1.

      The following dictionaries are provided by default:

      NAME=name.txt

      PERSON_NAME_PREFIX=personNamePrefix.txt

      PERSON_NAME_SUFFIX=personNameSuffix.txt

      USA_CITY=usCity.txt

      PARTIAL_CITY=partialUSCity.txt

      USA_STATE=usState.txt

      PARTIAL_STATE=partialUSState.txt

      COMPANY_SUFFIX=companySuffix.txt

      ORGANIZATION_NAME=organizationName.txt

      Display options:

      -1 = hidden and not loaded into memory (if the dictionary is a part of the composite block type, neither the dictionary nor the composite type will be displayed in the Suggestion View window)

      0 = hidden and loaded into memory (if the dictionary is a part of the composite block type, the dictionary will not be displayed; however, the composite type containing it will be shown in the Suggestion View window)

      1 = displayed and loaded (both the dictionary as well as all composite types containing the dictionary will be displayed in the Suggestion View window)

      By default, the English language dictionary is used if the required dictionary file is not present.
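
As an illustration of the mapping format and the three Display options, the following Python sketch parses one line of dictionary_mappings.properties. The function names are hypothetical and not part of Transact. The sketch assumes the Dictionary Type=Dictionary File=Display layout described above and returns None when no Display value is written, as in the default entries listed earlier; how Transact itself treats a missing value is not stated here.

```python
def parse_dictionary_mapping(line):
    """Return (dictionary_type, file_name, display) for one mapping line."""
    parts = [p.strip() for p in line.strip().split("=")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a dictionary mapping: {line!r}")
    # The Display value is optional in this sketch; None means "not given".
    display = int(parts[2]) if len(parts) > 2 and parts[2] else None
    return parts[0], parts[1], display


def is_loaded(display):
    """0 and 1 both load the dictionary into memory; -1 does not."""
    return display in (0, 1)


def is_shown(display):
    """Only 1 lists the dictionary in the Suggestion View window."""
    return display == 1


print(parse_dictionary_mapping("USA_STATE=usState.txt=0"))
# ('USA_STATE', 'usState.txt', 0)
print(parse_dictionary_mapping("NAME=name.txt"))
# ('NAME', 'name.txt', None)
```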

Create custom dictionary

There are two ways to add a custom dictionary. You can create it on the Validation screen during DLF training or you can import it from the Batch Class Management screen.

To create a custom dictionary on the Validation screen:

  1. Place your cursor in the text box of the index field to be learned in the middle pane of the Validation screen.
  2. On the image view pane of the Validation screen, click the area of the image where the index field is located (right-click to draw an overlay over multiple values).

    An overlay appears on the image and the text box is populated with the index field value.

  3. Click on the overlay to open the Suggestion View window.
  4. Select the Create Type option and from the Type drop-down list, select Dictionary.
  5. Define the Type Name and add as many values for the dictionary as required by using the plus button.

    Also, use the corresponding button to delete any value.

  6. Click OK to save the custom dictionary.
After you save the dictionary, a new .txt file is created in the dictionaries folder (Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries). This custom dictionary file has the same name as given in the Type Name field and contains all the values added on the Validation screen.

Next time, the newly created dictionary will be included in the Predefined Type list on the Validation screen and will be used to extract a value on the basis of the predefined value set.

If two users try to create a new custom dictionary with the same name for the same Batch Class, the dictionary entries will be merged.
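
The merge behavior can be pictured with a small Python sketch. It is only an illustration: merge_dictionary_values is a hypothetical name, and dropping duplicates while keeping first-seen order is an assumption, since the exact merge rules used by Transact are not described here.

```python
def merge_dictionary_values(existing, incoming):
    """Combine two value lists, keeping first-seen order and dropping repeats."""
    merged = []
    seen = set()
    for value in list(existing) + list(incoming):
        value = value.strip()
        if value and value not in seen:
            seen.add(value)
            merged.append(value)
    return merged


first_user = ["Main St", "Oak Ave"]
second_user = ["Oak Ave", "Elm St"]
print(merge_dictionary_values(first_user, second_user))
# ['Main St', 'Oak Ave', 'Elm St']
```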

Modify custom dictionary

If required, you can modify the custom dictionary that you create. This can be done in several ways:

  • As an operator, you can add values to your dictionary on the Validation screen.

  • As an admin, you can modify your dictionary in the Folder Management section of Transact.

  • You can make changes directly in the dictionary .txt file on the Transact server.

Default dictionaries can also be modified in the Folder Management section or on the Transact server.

Add values to your dictionary on the Validation screen

  1. Click on the overlay to open the Suggestion View window.
  2. Select the Create Type option and from the Type drop-down list, select Dictionary.
  3. In the Type Name drop-down, find and select your dictionary name.

    All values contained in the dictionary will be displayed in the Suggestion View window.

  4. Use the plus button to add values to the dictionary.
  5. Click OK to save the changes.
The custom dictionary is now updated according to the changes made on the Validation screen.

Modify your custom dictionary in Folder Management

  1. On the left menu panel, select Folder Management and double-click on the selected Batch Class.
  2. Navigate to the dictionaries folder (SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries) and find your dictionary.
  3. Select the dictionary and click Edit.
  4. Make the required changes in your dictionary. The field is editable.
  5. Click Save to save the changes.
The custom dictionary is now updated according to the changes made in the Folder Management section.

Make changes in the dictionary .txt file

  1. Navigate to the dictionaries folder (Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries) and open the text file containing your dictionary.
  2. Add, remove, or change values as in an ordinary text editor.
  3. Save the changes.
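
If the same edit has to be repeated often, the steps above could be scripted. The following Python sketch is an assumption-laden illustration, not a Transact tool: update_dictionary_file is a hypothetical helper, the file is assumed to be UTF-8 with one value per line, and the demonstration runs against a temporary folder rather than a real SharedFolders path.

```python
import tempfile
from pathlib import Path


def update_dictionary_file(path, add=(), remove=()):
    """Remove and add values in a one-value-per-line dictionary file."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    values = [v.strip() for v in lines if v.strip()]
    drop = {v.strip() for v in remove}
    values = [v for v in values if v not in drop]
    for value in add:
        value = value.strip()
        if value and value not in values:
            values.append(value)
    path.write_text("\n".join(values) + "\n", encoding="utf-8")
    return values


# Demonstration on a throwaway file; the file name is only an example.
with tempfile.TemporaryDirectory() as folder:
    dictionary_file = Path(folder) / "Irvine_streets.txt"
    dictionary_file.write_text("Main St\nOak Ave\n", encoding="utf-8")
    result = update_dictionary_file(dictionary_file, add=["Elm St"], remove=["Oak Ave"])
    saved = dictionary_file.read_text(encoding="utf-8")

print(result)
# ['Main St', 'Elm St']
```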

Export a dictionary

Dictionaries can be exported so you can use the same dictionaries in other Batch Classes. An exported dictionary is downloaded as a .zip file containing the .txt file with associated dictionary values.

You can export the dictionaries from the Batch Class Management section as well as from the Folder Management section.

Export a dictionary from Batch Class Management

  1. Navigate to the Batch Class Management screen and select your Batch Class.
  2. Navigate to the machine-learning-dictionaries > knowledge-base > dictionaries folder.
  3. Select your dictionary and click Export.
  4. Specify the destination folder and click Save.

    The .zip file saved on your local machine contains your dictionary in .txt format along with all associated values.

Export a dictionary from Folder Management

  1. Select your Batch Class.
  2. Navigate to machine-learning-dictionaries > knowledge-base > dictionaries, select your dictionary and right-click.
  3. Select the Download option.
  4. In the dialog window, specify the destination folder and click Save.

If you export the dictionary_mappings.properties file and modify it before importing it again, the system will pick up the changes, and the updated file will be used to perform machine learning.
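
To check what an exported archive contains before importing it elsewhere, a short Python sketch such as the following could be used. It relies only on the standard zipfile module. The helper name read_exported_dictionaries is hypothetical, and the archive layout (UTF-8 .txt files with one value per line) is an assumption, so the demonstration builds a small archive in memory instead of reading a real export.

```python
import io
import zipfile


def read_exported_dictionaries(zip_bytes):
    """Map each .txt entry in the archive to its list of non-empty values."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                text = archive.read(name).decode("utf-8")
                result[name] = [v.strip() for v in text.splitlines() if v.strip()]
    return result


# Build a stand-in for an exported archive; the contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Irvine_streets.txt", "Main St\nOak Ave\n")

exported = read_exported_dictionaries(buffer.getvalue())
print(exported)
# {'Irvine_streets.txt': ['Main St', 'Oak Ave']}
```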

Import a dictionary

To import the dictionary in the Batch Class Management section:

  1. On the Batch Class Management screen, select your Batch Class.
  2. Navigate to machine-learning-dictionaries > knowledge-base > dictionaries.
  3. In the Upload Machine Learning Dictionary(s) section, click Select Files or drag and drop the file containing the dictionary into the specified area.

    The dictionary is imported successfully. Since you are importing the dictionary manually, the following message is displayed: "Please make corresponding changes in the mapping files manually".

  4. To make changes in the mappings file:
    1. In the Folder Management section, select your Batch Class.
    2. Go to the dictionaries folder (machine-learning-dictionaries\knowledge-base\dictionaries) and select the dictionary_mappings.properties file.
    3. Click Edit.
    4. Provide the following information to perform the dictionary mapping.

      Field   Description

      Key     Define the dictionary name (such as Irvine_streets). This name will appear in the Predefined Types list in the Suggestion View window on the Validation screen.

      Value   Define the dictionary text file (such as Irvine_streets.txt) and provide the Display value:
              0 = do not display the Dictionary Type in the Suggestion View window on the Validation screen
              1 = display the Dictionary Type in the Suggestion View window on the Validation screen

    5. Click Save to save your changes.

    Dictionary mapping can also be done directly in the dictionary_mappings.properties file on the Transact server. To do so, navigate to the dictionaries folder (Ephesoft\SharedFolders\BC{Id}\machine-learning-dictionaries\knowledge-base\dictionaries), open the dictionary_mappings.properties file, and perform the mapping as described above.
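
The mapping step can be sketched in Python as follows. This is an illustration under assumptions, not a Transact utility: upsert_mapping is a hypothetical name, and the entry is written as Key=File=Display on a single line, following the Dictionary Type=Dictionary File=Display format described for dictionary_mappings.properties. The function works on the file contents as a string, so reading and writing the actual file is left to the caller.

```python
def upsert_mapping(properties_text, dict_type, file_name, display):
    """Add or replace the mapping line for dict_type and return the new text."""
    if display not in (-1, 0, 1):
        raise ValueError("display must be -1, 0, or 1")
    new_line = f"{dict_type}={file_name}={display}"
    lines = properties_text.splitlines()
    for index, line in enumerate(lines):
        # The key is everything before the first equal sign.
        if line.split("=", 1)[0].strip() == dict_type:
            lines[index] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


current = "NAME=name.txt\n"
print(upsert_mapping(current, "Irvine_streets", "Irvine_streets.txt", 1), end="")
# NAME=name.txt
# Irvine_streets=Irvine_streets.txt=1
```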

If you import a dictionary that already exists in the Batch Class, a pop-up window lists the dictionaries that are already present. You can choose either to override or to merge the dictionary files.