Configure table extraction

The process to configure a document type for table extraction follows these basic steps:

  1. Define the table

  2. Define the table columns

  3. Configure table extraction rules

  4. Test table extraction

To configure table extraction, the following prerequisites must be in place:

  • You need a batch class with a document type configured. For detailed steps, see Add new document type.

  • The TABLE_EXTRACTION plugin must be added to the Extraction module and turned on.

Define the table

First, you need to create a table for your document type.

  1. From the Batch Class Management screen, select and open your batch class.
  2. Go to Document Types > <your document type> > Tables.
  3. Click Add.

    A new entry is added to the grid.

  4. Enter an intuitive Name for the new table.
  5. Configure the remaining fields according to your workflow needs. Refer to the following table for help.
    Field Description

    Validation Rule Operator

    Specifies whether all Table validation rules must be satisfied for a field to be validated (AND), or only one rule (OR).

    Remove Invalid Rows

    When selected, any rows from the extracted data that do not match the validation rule are removed.

    For best results, clear this check box during initial testing, then select it if your test results warrant it. Documents with lower OCR quality may cause matching rows to be removed incorrectly.

    Currency

    Specifies the currency format that should be applied to the extracted data. See Configure currency settings for tables for more information.

    Table Cell Value Change Script

    When selected, enables the trigger field value change script for table data fields. For more information, see the corresponding section in the Ephesoft Transact Developer's Guide.

    Rule Filter DLF Name

    Enter the document-level field (DLF) to be used as filtering criteria, for example Vendor ID or Postal/Zip Code. This enables you to apply table information and rules to specific document variants within a document type. For more information, see "AI Table Rule Builder" in the Ephesoft Transact Developer's Guide.

    Use Rule Filter DLF Name together with a Table Rule Filter ID. The Filter ID can contain a pipe-separated list of DLF values that determine whether the table extraction rule applies, for example a list of Vendor ID or Postal/Zip Code values.

    During Table Extraction processing:

    • If a rule contains a filter value AND the value matches that found in the DLF indicated above, the Table Extraction rule is applied.

    • If a rule contains a filter value BUT the value does NOT match that found in the DLF indicated above, the Table Extraction rule is NOT applied.

    • If a rule contains an empty filter value, the rule is a global rule and is applied in the following conditions:

      • If the Rule Filter DLF Name is empty, all global rules are applied.

      • If a Rule Filter DLF Name is provided, global rules are applied under these scenarios:

        • No other rules are found based on the provided Filter ID and Rule Filter DLF Name.

        • In the dcma-tablefinder.properties file, autotable_rulebuilder.consider_global_table_rules_when_filter_rule_exists is set to true.

  6. Click Apply.
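The filter conditions listed under Rule Filter DLF Name can be sketched in a few lines of Python. This is an illustrative model only, not Transact code: the `TableRule` class, the `select_rules` function, and the vendor IDs are hypothetical, and the `consider_global` argument stands in for the autotable_rulebuilder.consider_global_table_rules_when_filter_rule_exists property. The sketch also assumes that rules with a filter value are skipped when no Rule Filter DLF Name is set, which the description above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class TableRule:
    name: str
    filter_ids: str = ""  # pipe-separated DLF values; empty means a global rule


def _ids(rule):
    """Split a rule's Filter ID string into its individual values."""
    return [v.strip() for v in rule.filter_ids.split("|") if v.strip()]


def select_rules(rules, dlf_name, dlf_value, consider_global=False):
    """Return the names of the rules that would be applied.

    dlf_name        -- the table's Rule Filter DLF Name ("" if not set)
    dlf_value       -- the value extracted for that DLF on the document
    consider_global -- hypothetical stand-in for the properties-file flag
    """
    global_rules = [r for r in rules if not _ids(r)]
    if not dlf_name:
        # No Rule Filter DLF Name: all global rules are applied.
        # Assumption: filtered rules are skipped in this case.
        return [r.name for r in global_rules]
    matched = [r for r in rules if dlf_value in _ids(r)]
    if not matched or consider_global:
        # Global rules apply when nothing matched, or when the flag is set.
        return [r.name for r in matched + global_rules]
    return [r.name for r in matched]


rules = [
    TableRule("acme", "V001|V002"),
    TableRule("globex", "V003"),
    TableRule("fallback"),  # no filter value, so a global rule
]
print(select_rules(rules, "VendorID", "V002"))
print(select_rules(rules, "VendorID", "V999"))
```

With these sample rules, a document whose Vendor ID is V002 selects only the matching rule, while an unknown Vendor ID falls back to the global rule.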

Define the table columns

After defining the table, you need to define the columns within the table.

  1. Open the newly created table from the Tables page. This will open the Table Columns.
  2. Click Add to create a new column.
  3. Enter a Column Name and a Description.
  4. Enter the Column Order. This is the order of the column within the table, left to right, starting with 1.
  5. Configure the remaining fields according to your workflow needs. See the following table for help.

    Field

    Description

    Validation Pattern

    Enter a validation pattern for the new table column using either the Regex Pool or the Regex Builder.

    This field is optional.

    Alternate Values

    Enter a list of alternate values for a particular column. This is used to provide a drop-down list of commonly used values to operators when validating tables. This drop-down list is only visible for blank cells (fields for which there is no extracted value).

    This field is optional.

    OCR Confidence Threshold

    A number between 1 and 100 that defines the minimum OCR confidence for a value to be validated automatically. If the OCR confidence of a value is below this level, the value is marked for operator validation. You may need to test and refine this number for best results.

    The default value is 90.

    Default Value

    If no value is extracted for a row in this column, it is replaced with the default value specified here.

    This field is optional.

    Additional Configuration

    New Row Anchor

    Select this check box to indicate the start of a new row if a value from this column is extracted.

    Required

    Select this check box to mark the column as mandatory for each row. If the row does not have a value from this column, the row is discarded.

    Currency

    Select this check box for columns containing only currency.

    Hidden

    Select this check box for this column to be hidden from operators during Review and Validation.

  6. Repeat the steps for the remaining columns in the table.
  7. Click Apply.
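To make the Column Order, Default Value, and Required settings concrete, the following Python sketch applies them to a few extracted rows. It is a simplified model, not Transact's implementation: `ColumnDef`, `finalize_rows`, and the sample column names are hypothetical, and the sketch assumes that the default value is filled in before the Required check runs, which the field descriptions above do not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDef:
    name: str
    order: int                      # Column Order, left to right, starting at 1
    default_value: Optional[str] = None
    required: bool = False


def finalize_rows(rows, columns):
    """Apply Default Value and Required to rows given as {column name: value}."""
    ordered = sorted(columns, key=lambda c: c.order)
    result = []
    for row in rows:
        out = {}
        keep = True
        for col in ordered:
            # Assumption: an empty cell takes the default before Required is checked.
            value = row.get(col.name) or col.default_value
            if col.required and not value:
                keep = False        # a required cell is empty: discard the row
                break
            out[col.name] = value or ""
        if keep:
            result.append(out)
    return result


columns = [
    ColumnDef("Qty", order=2, default_value="1"),
    ColumnDef("PartNumber", order=1, required=True),
]
rows = [
    {"PartNumber": "A-100", "Qty": "3"},
    {"PartNumber": "B-200"},        # Qty missing: the default is used
    {"Qty": "5"},                   # PartNumber missing: the row is dropped
]
print(finalize_rows(rows, columns))
```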

Configure table extraction rules

After defining the table columns, you need to configure the extraction rule for the table.

  1. In the left navigation, click Table Extraction Rules.
  2. Click Add.

    The Extraction Rule Configuration screen displays.

  3. If no preview image appears, click Select Files in the bottom panel to select and upload the sample file.

Extraction rule

First, you need to specify the basic configurations for the rule in the Extraction Rule box. See the following tables for information on each configurable option.

Field Description

Rule Name

Enter a name for the extraction rule.

Start Pattern

Defines the starting point of the table using a regular expression. This must be unique across all extraction rules in a document type. To configure the start pattern, enter a valid regex in the Start Pattern field, or use the provided overlay:

  1. Move and resize the grey Start Pattern overlay to the starting point of the table.

  2. Click the overlay to display the Suggest Regex box.

  3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

  4. Click OK.

For more information about regular expressions, see "Regular Expressions" in the Ephesoft Transact Developer's Guide.

End Pattern

Defines the end point of the table using a regular expression. To configure the end pattern, enter a valid regex in the End Pattern field, or use the provided overlay:

  1. Move and resize the yellow End Pattern overlay to the end point of the table.

  2. Click the overlay to display the Suggest Regex box.

  3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

  4. Click OK.

Row Exclusion Pattern

Defines the exclusion rule for a row using a regular expression.

Extract Repeating Tables

Select this option if the table may span multiple pages.

Overlapping Columns Table

Select this option if the headers of two columns may overlap one another.

2-Column Layout

Select this option if your table is split into two columns on the same page. For more information about this option, see Table extraction for 2-column layout.

Table Extraction API

Defines the table extraction methods and their operators.

  • Column Coordinates: Extract data based on the defined column coordinates. If you select this API, you must use the overlays to define the coordinates.

    If you use the Column Coordinates extraction method, the start value is set to the lower of the two Column Header and Column Pattern values, and the end value is set to the higher of the two.

  • Column Header: Extract data based on the defined column headers.

  • Regex Extraction: Extract data based on the defined regex patterns.

If you select more than one extraction method, you need to define the rule operator. Rule operators apply to the field to their left.

  • AND: Performs extraction based on all selected methods. This results in a stricter extraction experience.
  • OR: Performs extraction based on one of the selected methods. This results in a less rigid extraction experience.

For best results, begin your configuration process using a single extraction method. This is to make sure that you can successfully extract data from the table, regardless of the quality or accuracy. If further refinement is needed, add more operators.
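As a rough illustration of how the Start Pattern, End Pattern, and Row Exclusion Pattern interact, the following Python sketch delimits a table in a list of text lines. It is a conceptual model only: `table_lines` and the sample invoice text are hypothetical, Transact evaluates its own regex dialect, and the sketch assumes that the lines matching the start and end patterns are not themselves table rows.

```python
import re


def table_lines(lines, start_pattern, end_pattern, row_exclusion=None):
    """Return the lines between the first start match and the next end match."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    exclude = re.compile(row_exclusion) if row_exclusion else None
    inside = False
    rows = []
    for line in lines:
        if not inside:
            if start.search(line):
                inside = True       # table begins after this line
            continue
        if end.search(line):
            break                   # end of the table
        if exclude and exclude.search(line):
            continue                # row matches the exclusion pattern
        rows.append(line)
    return rows


page = [
    "Invoice 1001",
    "Part No   Description   Amount",
    "A-100     Widget        10.00",
    "--- continued ---",
    "B-200     Gadget        25.50",
    "Total                   35.50",
]
for row in table_lines(page, r"Part\s+No", r"^Total\b", r"^-{3}"):
    print(row)
```

The two patterns bracket the table, and the exclusion pattern drops the separator line inside it, leaving only the two item rows.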

Extract Repeating Tables check box

The Extract Repeating Tables check box can be used to extract data from tables that span multiple pages. If the first page and last page are both unique, you can use a unique Start Pattern value from the first page and a unique End Pattern value from the last page. However, some forms require you to duplicate the same page as many times as needed to submit the form. In that case, you cannot use an End Pattern from the bottom of the page, because it would prevent Transact from continuing to the next page to read the continuation of the table.

To work around this issue, define the Start Pattern from the first page as you normally would. Instead of defining the End Pattern from the bottom of the page, define it using a string from the top of the page, and make sure that the Extract Repeating Tables check box is selected. When Transact processes a multi-page document like this and reaches the bottom of the first page, it continues to the following page and stops reading when it hits the End Pattern value. Because Transact is still processing the second page, it then continues reading until it finds the Start Pattern value again. If you use Column Coordinates and Regex Extraction as your table extraction methods, Transact ignores the text at the bottom of the document and does not include it in your table extraction results.
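One way to picture the repeating-page behavior described above is the following Python sketch. It is an interpretation, not Transact's algorithm: `extract_repeating`, the sample pages, and the patterns are hypothetical, and the row regex stands in for the column-level filtering that causes footer text to be ignored.

```python
import re


def extract_repeating(pages, start_pattern, end_pattern, row_pattern):
    """Collect rows across pages; the End Pattern sits at the top of each page."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    row = re.compile(row_pattern)
    inside = False                  # carried across page breaks
    rows = []
    for page in pages:
        for line in page:
            if not inside:
                # Outside the table: scan until the Start Pattern appears.
                inside = bool(start.search(line))
                continue
            if end.search(line):
                inside = False      # top-of-page string ends the current segment
                continue
            if row.search(line):
                rows.append(line)   # non-matching text, such as footers, is skipped
    return rows


pages = [
    ["Schedule of Items", "Item    Amount", "A-100   10.00", "B-200   25.50", "Page 1"],
    ["Schedule of Items", "Item    Amount", "C-300   5.00", "Page 2"],
]
rows = extract_repeating(
    pages,
    start_pattern=r"^Item\s+Amount",
    end_pattern=r"^Schedule of Items",
    row_pattern=r"^[A-Z]-\d+\s+\d+\.\d{2}$",
)
print(rows)
```

Because the end string appears at the top of every page, it is ignored on the first page (extraction has not started yet) and only interrupts extraction briefly on each following page until the start string is found again.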

Column configuration

Next, you need to specify the extraction rules for each table column in the Column Configuration box. Ensure you have defined the table columns before proceeding.

You can collapse the Extraction Rule section to get a better view of the Column Configuration section.

Configuring table column rules follows these general steps:

  1. Select a Table Column from the drop-down list. The first column will be selected by default.
  2. Configure the remaining fields.
  3. Repeat steps 1-2 for each column in the table. All Column Configuration options are tied to the currently selected Table Column.

    See the following table for information on each configurable option.

    Field

    Description

    Table Column

    Select the column for which you are defining the column extraction rule. You can only select columns that you added in Define the table columns.

    Column Header Pattern

    This field is only required if the Table Extraction API is set to Column Header or Column Coordinates.

    Defines the header of the selected column using a regular expression. To configure the column header pattern:

    1. Move and resize the green Column Header overlay to the header text.

    2. Click the overlay to display the Suggest Regex box.

    3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

    4. Click OK.

    When using the Column Coordinates API, ensure the overlay spans the full possible width of each column, as this defines the Start Coordinates and End Coordinates.

    Column Pattern

    Defines the expected data within a selected column using a regular expression. To configure the column pattern, enter a valid regex in the Column Pattern field or use the provided overlays:

    1. Move and resize the red Column Data overlay to a cell in the respective column.

    2. Click the overlay to display the Suggest Regex box.

    3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

    4. Click OK.

    Pattern Left

    Defines the expected data to the left of a selected column using a regular expression. If you use the Regex Extraction API and have issues with data being extracted from the wrong location, you can use this option to better direct the table extraction.

    The expected data must have a unique, reliable pattern, such as a date or part number.

    To configure the left pattern, enter a valid regex in the Pattern Left field, or use the provided overlay:

    1. Move and resize the orange Pattern Left overlay to the area on the left side of the column data.

    2. Click the overlay to display the Suggest Regex box.

    3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

    4. Click OK.

    Pattern Right

    Defines the expected data to the right of a selected column using a regular expression. If you use the Regex Extraction API and have issues with data being extracted from the wrong location, you can use this option to better direct the table extraction.

    The expected data must have a unique, reliable pattern, such as a date or part number.

    To configure the right pattern, enter a valid regex in the Pattern Right field, or use the provided overlay:

    1. Move and resize the purple Pattern Right overlay to the area on the right side of the column data.

    2. Click the overlay to display the Suggest Regex box.

    3. Use a suggested regex from the Regex drop-down list, select a Predefined Type, or enter your own regex.

    4. Click OK.

    Start Coordinate

    This field only applies if the Table Extraction API includes Column Coordinates.

    The Start Coordinate is set automatically when the Column Header or the Column Data overlay is positioned. It is not necessary to use both overlays. If both are used, the Start Coordinate is set to the lowest (left-most) value of both overlays, and the End Coordinate is set to the highest (right-most) value of both overlays.

    End Coordinate

    This field only applies if the Table Extraction API includes Column Coordinates.

    The End Coordinate is automatically configured based on the position of the Column Header or Column Data overlays.

    Extract Data From Column

    This field allows you to pinpoint a portion of the column and extract data from that specific area. It functions as a "search within a search".

    Additional Configurations

    Use Default Column Configuration

    Keep this check box selected to use the default values of the column, as configured above in Define the table columns.

    If this check box is cleared, it enables the remaining Additional Configurations:

    • New Row Anchor
    • Required
    • Currency

    New Row Anchor

    Select this check box to indicate the start of a new row if a value from this column is extracted.

    For best results, enable this feature for columns that have a single value that does not wrap, which will tell the extraction to create a new row each time a new value is found for this column.

    For example, if the Part Number field always has a single value that does not wrap, but Part Description often wraps for several lines, enabling the New Row Anchor for Part Number will tell the system to capture wrapping data for the other columns in the row until a new value is found for Part Number.

    This field is disabled if Use Default Column Configuration is selected.

    Required

    Select this check box to mark the column as mandatory for each row. If the row does not have a value from this column, the row is discarded.

    This field is disabled if Use Default Column Configuration is selected.

    Currency

    Select this check box for columns containing only currency.

    This field is disabled if Use Default Column Configuration is selected.
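The New Row Anchor behavior described above, where wrapped text is collected until the anchor column produces a new value, can be sketched as follows. This is a simplified model under stated assumptions: `group_rows` and the input shape (one dictionary of column fragments per physical text line) are hypothetical and do not reflect how Transact represents extracted data.

```python
def group_rows(lines, anchor_column):
    """Merge physical lines into logical rows using an anchor column.

    lines -- one dict per physical line, mapping column name to the text
             found in that column on that line ("" if nothing was found)
    """
    rows = []
    for fragments in lines:
        if fragments.get(anchor_column):
            # A value in the anchor column starts a new row.
            rows.append(dict(fragments))
        elif rows:
            # No anchor value: treat the text as wrapped content of the current row.
            current = rows[-1]
            for column, text in fragments.items():
                if text:
                    current[column] = (current.get(column, "") + " " + text).strip()
    return rows


lines = [
    {"PartNumber": "A-100", "PartDescription": "Stainless steel"},
    {"PartNumber": "", "PartDescription": "mounting bracket"},
    {"PartNumber": "B-200", "PartDescription": "Hex bolt"},
]
for row in group_rows(lines, "PartNumber"):
    print(row)
```

Here the second physical line has no Part Number, so its description text is appended to the first row instead of starting a new one.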

Test table extraction

After you configure the applicable fields for each Table Column, you need to test the table extraction.

  1. If required, click Validate Regex to check any manually inserted regular expressions.
  2. Determine your testing method based on what kind of documents you expect to receive in production:
    • OCR: If you expect to receive documents that do not contain an E-Text layer, select OCR from the top-right drop-down list.

    • EText: If you expect to receive some documents that contain an E-Text layer (such as documents that were created using a Print to PDF feature), and you have configured your batch class to use the E-Text feature, select EText from the top-right drop-down list.

  3. Click Test Table. The extracted results will display in the Test Table Results panel.
  4. Review the test extraction results. The test extraction works best for single page tables. If you are testing tables that span multiple pages, run test batches through Transact and view the results in the Validation screen.

    Any fields highlighted in red indicate low OCR confidence or validation errors for extracted data.

    For example, if the OCR confidence is lower than the default minimum OCR confidence, the row is highlighted in red. If these highlighted results are extracted correctly and this sample is of high OCR quality, this indicates that the OCR Confidence Threshold for the column should probably be lowered.

  5. Click Close to exit from Test Table Results. You can continue to test and adjust your configuration according to your requirements.
  6. When you are satisfied with your results, click Apply.
  7. Click Apply again to save the rule.

    You may need to create multiple sets of extraction rules depending on the variations of documents that may be processed.

    For example, invoices from two vendors may be fundamentally the same but have different layouts. When incoming documents are processed, Transact executes all table extraction rules against every document, with the Start Pattern value being used to determine whether the row-level extractions should take place.
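The red highlighting in the test results can be reasoned about with a small sketch like the one below. It only models the OCR confidence comparison (validation errors are left out), and `flag_low_confidence` and the sample values are hypothetical; the default of 90 is the documented default OCR Confidence Threshold.

```python
def flag_low_confidence(cells, threshold=90):
    """Return the (value, confidence) pairs that fall below the threshold."""
    return [(value, conf) for value, conf in cells if conf < threshold]


# Extracted values paired with illustrative OCR confidence scores.
cells = [("A-100", 97), ("10.00", 84), ("Widget", 91)]

print(flag_low_confidence(cells))                # flagged at the default threshold
print(flag_low_confidence(cells, threshold=80))  # nothing flagged after lowering it
```

If the flagged values turn out to be correct on a high-quality sample, lowering the threshold for that column, as in the second call, removes the unnecessary flags.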

Filter IDs

If you assign a DLF to a table in the Rule Filter DLF Name column, you can use the Filter Ids field as filtering criteria. Enter the values to be filtered based on the DLF, separated by a pipe (|) character.

For example, if your DLF for the table is ZipCode, you can enter several zip codes, such as 92618|92620.

Table extractions use these filters and values to determine whether a table extraction rule is applied.

  • The rule is applied if:

    • The filter value is empty.

    • The rule has a filter and value, and the value matches that in the referenced DLF.

  • The rule is not applied if:

    • The rule has a filter and value, and the value does not match that found in the referenced DLF.
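The matching of a pipe-separated Filter IDs value against a DLF value can be sketched as below. `rule_applies` is a hypothetical helper, not part of Transact, and it covers only the three conditions in this section; the additional handling of global rules described under Rule Filter DLF Name is not modeled.

```python
def rule_applies(filter_ids, dlf_value):
    """Decide whether a rule applies, given its Filter IDs and the DLF value."""
    values = [v.strip() for v in filter_ids.split("|") if v.strip()]
    if not values:
        return True                 # empty filter value: the rule is applied
    return dlf_value in values      # applied only if the DLF value is listed


print(rule_applies("92618|92620", "92620"))
print(rule_applies("92618|92620", "90210"))
print(rule_applies("", "90210"))
```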