Table Extraction plugin

The Table Extraction plugin extracts tabular data from the documents in a batch. You define basic table information and a set of table columns for each table.

Transact uses one or more rules to perform table extraction.

Note the following factors for table extraction:

  • There may be multiple table extraction rules defined for a table.
  • The extraction rule that yields the most valid column data is selected, and its output is shown as the table extraction result.
  • The validity of extracted columns is based on one or more validation patterns for each column combined with table validation rules that are applied to each row of extracted table data.
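
How Transact scores competing rules is not specified here. As a minimal sketch, assuming the score is simply the number of rows in which every column value matches its validation pattern (table validation rules are left out), rule selection could look like the following. The names count_valid_rows and pick_best_rule and the sample data are hypothetical, not part of the product.

```python
import re

def count_valid_rows(rows, validation_patterns):
    """Count rows in which every column value fully matches its pattern.

    Assumption: a value is valid only on a full match (re.fullmatch).
    """
    valid = 0
    for row in rows:
        if all(re.fullmatch(pattern, row.get(column, ""))
               for column, pattern in validation_patterns.items()):
            valid += 1
    return valid

def pick_best_rule(results_by_rule, validation_patterns):
    """Return the name of the rule whose output has the most valid rows."""
    return max(
        results_by_rule,
        key=lambda name: count_valid_rows(results_by_rule[name], validation_patterns),
    )

# Hypothetical output of two extraction rules for the same table.
results_by_rule = {
    "header-rule": [{"Qty": "2", "Amount": "10.00"}, {"Qty": "x", "Amount": "5.00"}],
    "regex-rule": [{"Qty": "2", "Amount": "10.00"}, {"Qty": "1", "Amount": "5.00"}],
}
patterns = {"Qty": r"\d+", "Amount": r"\d+\.\d{2}"}

best = pick_best_rule(results_by_rule, patterns)
print(best)  # regex-rule
```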

Characteristics

  • Each table extraction rule contains one column extraction rule for each table column. A column extraction rule holds the information that the table extraction APIs use to extract that column, such as the column pattern, column header pattern, start coordinate, end coordinate, multiline anchor, and required flag.
  • For each document, consisting of one or more pages, the table extraction algorithm extracts all tables defined for the document type.
  • The document is parsed from the first page to the last page to identify tables.
  • One table may span one or more pages.
  • A table defined for a document type consists of multiple table columns, table extraction rules, and table validation rules. Table columns and at least one table extraction rule are the minimum required for table extraction to return results.
  • A table extraction rule contains a start pattern and an end pattern that mark the boundaries of the table data to extract. The table extraction API for an extraction rule is a combination (using AND or OR operators) of three kinds of validation:
    • Column Coordinates Validation

    • Column Header Validation

    • Regex Validation

This combination determines the algorithm used to extract the data for every table column in a row.
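
To make the AND/OR combination concrete, here is a small sketch in which each kind of validation is a yes/no check on one OCR word and the selected checks are joined with a single operator. The word and rule dictionaries, the header_span field, and all function names are assumptions for illustration. The actual checks in Transact are more involved, and the product may allow mixed AND/OR expressions.

```python
import re

def coordinate_check(word, rule):
    """Word lies between the column's start and end coordinates."""
    return rule["start"] <= word["x0"] and word["x1"] <= rule["end"]

def header_check(word, rule):
    """Word overlaps horizontally with the matched column header (assumed test)."""
    header_x0, header_x1 = rule["header_span"]
    return word["x0"] < header_x1 and word["x1"] > header_x0

def regex_check(word, rule):
    """Word text fully matches the column pattern."""
    return re.fullmatch(rule["column_pattern"], word["text"]) is not None

def accepts(word, rule, checks, operator):
    """Combine the selected checks with one operator, "AND" or "OR"."""
    results = [check(word, rule) for check in checks]
    return all(results) if operator == "AND" else any(results)

rule = {"start": 400, "end": 520, "header_span": (410, 500),
        "column_pattern": r"\d+\.\d{2}"}
in_column = {"text": "12.50", "x0": 420, "x1": 470}
stray = {"text": "12.50", "x0": 100, "x1": 150}

print(accepts(in_column, rule, [coordinate_check, regex_check], "AND"))  # True
print(accepts(stray, rule, [coordinate_check, regex_check], "AND"))      # False
print(accepts(stray, rule, [coordinate_check, regex_check], "OR"))       # True
```

With AND, a stray amount elsewhere on the line is rejected because it falls outside the column coordinates. With OR, the regex match alone is enough to accept it.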

The following summarizes which column extraction rule fields each table extraction API uses.

Column header validation

Uses the column header pattern. The pattern is either matched as a string, with some fuzziness, or applied as a regex whose best match on the page is taken as the header. The coordinates of the matched header are then used to extract the data beneath it as the column data. Text close to the left or right of the text beneath the header is also appended to the extracted column value.

Column coordinate validation

Uses the start coordinate and end coordinate, which denote the vertical boundaries of the column data on the page. To set them, click the set coordinates button, upload a sample image, and draw overlays for the columns. Clicking OK saves the start and end coordinates to the column extraction rule.

Regex validation

Uses the column pattern, between left pattern, and between right pattern to find the best-matching text in each row for the column data.

  • Column Pattern: Data matching this pattern is extracted as the column value.
  • Between Right Pattern: Data extracted by the column pattern must have data immediately to its right that matches this pattern.
  • Between Left Pattern: Data extracted by the column pattern must have data immediately to its left that matches this pattern.
  • If a between right or between left pattern is specified but does not match the data immediately to the right or left, the data is extracted as invalid data.
  • Only single-word capturing patterns are allowed for the between left and between right patterns.
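
The interplay of the three patterns can be sketched as follows. This is an illustration under stated assumptions, not the product's matching code: a row is treated as a list of whitespace-separated words, the first word that fully matches the column pattern is taken (the text above says "best matched", which is not modeled), and the neighbours are the single words immediately left and right. The function name extract_column is hypothetical.

```python
import re

def extract_column(words, column_pattern, between_left=None, between_right=None):
    """Return (value, is_valid) for the first word matching column_pattern.

    Returns None when no word matches. The value is flagged invalid when a
    between pattern is given but the neighbouring word does not match it.
    """
    for i, word in enumerate(words):
        if not re.fullmatch(column_pattern, word):
            continue
        left = words[i - 1] if i > 0 else ""
        right = words[i + 1] if i + 1 < len(words) else ""
        valid = True
        if between_left is not None and not re.fullmatch(between_left, left):
            valid = False
        if between_right is not None and not re.fullmatch(between_right, right):
            valid = False
        return word, valid
    return None

row = "3 Widget USD 12.50 each".split()
amount = r"\d+\.\d{2}"

print(extract_column(row, amount, between_left=r"USD"))   # ('12.50', True)
print(extract_column(row, amount, between_left=r"EUR"))   # ('12.50', False)
print(extract_column(row, amount, between_right=r"each")) # ('12.50', True)
```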

Configuration

The plugin's configurable properties are defined in dcma-tablefinder.properties, located at {EphesoftHome}\WEB-INF\classes\META-INF\dcma-table-finder*

tablefinder.gap_between_column_words
  Type of value: Integer
  Value options: NA
  Description: Maximum gap, in pixels, between words that belong to the same column value. Used during column header based extraction. The default is 60.

tablefinder.rule_removal_invalid_characters
  Type of value: List of values separated by semicolons (;)
  Value options: NA
  Description: Characters to ignore in an extracted column value before the table rules are applied to the column.
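
As an illustration of what tablefinder.rule_removal_invalid_characters describes, the sketch below splits a semicolon-separated property value and removes those characters from an extracted value. The property value "$;," is a made-up example, and both function names are hypothetical. How Transact itself applies the property is not documented here.

```python
def parse_invalid_characters(property_value):
    """Split the semicolon-separated property value, skipping empty entries."""
    return [entry for entry in property_value.split(";") if entry]

def strip_invalid_characters(value, invalid_characters):
    """Remove every configured character from an extracted column value."""
    for entry in invalid_characters:
        value = value.replace(entry, "")
    return value

# Hypothetical property value: ignore dollar signs and commas.
invalid = parse_invalid_characters("$;,")

print(strip_invalid_characters("$1,250.00", invalid))  # 1250.00
```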

Table configuration

  • Add/Delete table information

    To add or delete any table information, click the corresponding buttons. After you click Add, you can enter values for any property.

  • Test table

    The test table feature lets you check, without running a batch, whether the table configuration extracts the tabular data correctly. Upload a valid image file, or place the image file at the following path:

    {base-folder}\batch-class-id\test-table
  • Configurable properties

    Name
      Type of value: String
      Value options: NA
      Description: Name of the data table.

    Validation Rule Operator
      Type of value: List of values
      Value options: OR, AND
      Description: With AND, a table row is valid only if it satisfies all of the table validation rules defined. With OR, a table row is valid if it satisfies at least one of the validation rules.

    Remove Invalid Rows
      Type of value: Boolean
      Value options: True if selected; False if cleared.
      Description: Whether rows that are invalid according to the table validation rules are removed from the table result data.

    Currency
      Type of value: List of values
      Value options: Supported currencies.
      Description: Name of the currency on which the table's validation rules are based. Every table column whose Currency field is selected in its column extraction rule undergoes currency extraction based on this value before the validation rules are applied.
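
Currency extraction amounts to normalizing an amount before validation. For instance, a US-style amount such as $ 12,000.00 and a European-style amount such as EURO 12.000,00 would both become 12000.00. The sketch below shows one way to do that. The SEPARATORS mapping, its currency keys, and the function name normalize_amount are assumptions for illustration. The list of currencies Transact supports and its internal conversion are not shown in this document.

```python
import re
from decimal import Decimal

# Hypothetical mapping: currency name -> (thousands separator, decimal separator)
SEPARATORS = {
    "USD": (",", "."),
    "EURO": (".", ","),
}

def normalize_amount(text, currency):
    """Convert a formatted amount to a Decimal using the currency's separators."""
    thousands, decimal_separator = SEPARATORS[currency]
    # Drop currency symbols, letters, and spaces; keep digits, separators, sign.
    digits = re.sub(r"[^\d.,-]", "", text)
    digits = digits.replace(thousands, "").replace(decimal_separator, ".")
    return Decimal(digits)

print(normalize_amount("$ 12,000.00", "USD"))     # 12000.00
print(normalize_amount("EURO 12.000,00", "EURO")) # 12000.00
```

Using Decimal rather than float keeps the two decimal places intact, which matters when validation rules compare amounts for equality.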

Table column configuration

  • Add/Delete table column information

    To add or delete table column information, click the corresponding buttons. After you click Add, you can add table column fields.

  • Configurable properties

    Column Name
      Type of value: String
      Value options: NA
      Description: Name of the column.

    Description
      Type of value: String
      Value options: NA
      Description: Description of the column.

    Validation Pattern
      Type of value: String
      Value options: NA
      Description: Validation pattern for the column. This pattern validates the extracted column data for each table row.

    Alternate Values
      Type of value: String
      Value options: NA
      Description: A semicolon-separated list of values entered by the user. These values appear as suggestions for the column in the table view on the validation screen.

Table Extraction Rule configuration

  • Add/Delete Table Extraction Rule

    To add or delete a table extraction rule, click the corresponding button. After you click Add, you can add table extraction rule fields.

  • Test Table Extraction Rule

    The test table extraction rule feature lets you check, without running a batch, whether a table extraction rule configuration extracts the tabular data correctly. Upload or drag and drop a valid image file, or place the image file at the following path:

    {base-folder}\batch-class-id\test-table

  • Configurable properties

    Rule Name
      Type of value: String
      Value options: NA
      Description: Unique name of the table extraction rule.

    Start Pattern
      Type of value: String
      Value options: A keyword or a valid regular expression.
      Description: Either a keyword, matched as a string with a degree of fuzziness that is configurable in the property file, or a regex pattern. It marks the beginning of the table on a page. A correct start pattern must be specified for table data to be extracted. You can validate the pattern with the check button.

    End Pattern
      Type of value: String
      Value options: A keyword or a valid regular expression.
      Description: Either a keyword, matched as a string with a degree of fuzziness that is configurable in the property file, or a regex pattern. It marks the end of the table. You can validate the pattern with the check button.

    Table Extraction API
      Type of value: Combination of Boolean values joined with AND and OR operators.
      Description: A combination of the selected table extraction APIs (column header validation, column coordinate validation, and regex validation), joined with AND/OR operators, that determines the algorithm used to extract table columns.
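
The role of the start and end patterns can be pictured with a short sketch that scans the text lines of a page and keeps only those between the two matches. It makes several assumptions: both patterns are treated as regular expressions (fuzzy keyword matching is not modeled), the lines that match the patterns are themselves excluded, and the function name table_lines and the sample page are hypothetical.

```python
import re

def table_lines(lines, start_pattern, end_pattern):
    """Return the lines after the start match and before the end match."""
    inside = False
    body = []
    for line in lines:
        if not inside:
            # Keep looking for the line that opens the table.
            if re.search(start_pattern, line):
                inside = True
            continue
        if re.search(end_pattern, line):
            break
        body.append(line)
    return body

page = [
    "Invoice 1001",
    "Item Qty Amount",
    "Widget 2 10.00",
    "Gadget 1 5.00",
    "Total 15.00",
    "Thank you",
]

print(table_lines(page, r"Item\s+Qty", r"^Total"))
# ['Widget 2 10.00', 'Gadget 1 5.00']
```

If the start pattern never matches, the result is empty, which mirrors the requirement that a correct start pattern is needed for any table data to be extracted.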

Column Extraction Rule configuration

  • Edit Column Extraction Rule

    Click Edit to edit column extraction rule fields.

  • Configurable properties

    Column Name
      Type of value: String
      Value options: NA
      Description: Name of the column. This field is not editable; it is shown only to identify the corresponding table column.

    Column Pattern
      Type of value: Regular expression
      Value options: Valid regular expression
      Description: The regex pattern for the column data.

    Between Left
      Type of value: Regular expression
      Value options: Valid regular expression
      Description: The regex pattern for the data to the left of the searched column data.

    Between Right
      Type of value: Regular expression
      Value options: Valid regular expression
      Description: The regex pattern for the data to the right of the searched column data.

    Column Header Pattern
      Type of value: Regular expression
      Value options: A keyword or a valid regular expression.
      Description: Either a keyword, searched for on the page as a string with some fuzziness, or a regex pattern whose best match on the page is taken as the column header.

    Start Coordinate
      Type of value: Integer
      Value options: NA
      Description: Start coordinate for the column.

    End Coordinate
      Type of value: Integer
      Value options: NA
      Description: End coordinate for the column.

    Multiline anchor
      Type of value: Boolean
      Value options: True if selected; False if cleared.
      Description: Marks the column as a required column and as the anchor that denotes the start of a new table row on the page. This is useful when one table row spans multiple lines in the document.

    Required
      Type of value: Boolean
      Value options: True if selected; False if cleared.
      Description: If selected, each extracted table row must contain valid data for this column. If invalid data is extracted for the column, the corresponding row is not added to the table data.

    Extract data from column
      Type of value: Drop-down list
      Value options: The names of the other columns in the table. The selected name fills the text box that holds the column for extraction.
      Description: The table column from which the current column's data is extracted when regular expression based extraction is used. If left empty, this setting does not apply.

    Currency
      Type of value: Boolean
      Description: Specifies whether the column is a currency field. If it is a currency field, validation rules are applied according to the currency representation. The conversion is based on the currency chosen at the table information level. If this field is cleared, no currency extraction is done for the column, regardless of the currency chosen at the table information level.
      Example: $ 12,000.00 is converted to 12000.00 for validation. EURO 12.000,00 is converted to 12000.00 for validation.
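
The multiline anchor setting can be illustrated with a sketch that groups physical text lines into logical table rows: a line starts a new row only when the anchor column has a value on it, and any other line is attached to the previous row. The function name group_rows, the anchor test (a leading item number), and the sample lines are assumptions for illustration. The product's actual grouping works on OCR coordinates and is not shown in this document.

```python
import re

def group_rows(lines, anchor_has_value):
    """Group lines into rows; a line with an anchor value starts a new row."""
    rows = []
    for line in lines:
        if anchor_has_value(line) or not rows:
            rows.append([line])
        else:
            # Continuation of the previous table row.
            rows[-1].append(line)
    return rows

# Hypothetical anchor column: an item number at the very start of the line.
def has_item_number(line):
    return re.match(r"\d+\s", line) is not None

lines = [
    "1 Widget, blue 2 10.00",
    "  with extended warranty",
    "2 Gadget 1 5.00",
]

for row in group_rows(lines, has_item_number):
    print(row)
# ['1 Widget, blue 2 10.00', '  with extended warranty']
# ['2 Gadget 1 5.00']
```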

Table Validation Rule configuration

  • Add/Delete Table Validation Rule

    A table validation rule applies to operands (table columns) whose extracted column data must be numeric. Table validation rules are applied to each row of the extracted table data. When multiple rules are defined, they are combined with OR or AND, as set by the Validation Rule Operator at the table information level. An invalid row is shaded orange in the extraction results if Remove Invalid Rows is not selected at the table information level, and is removed from the extraction results if Remove Invalid Rows is selected.

    To add or delete a table validation rule, click the corresponding button. After you click Add, you can add table validation rule fields.

    The first drop-down list contains the list of operands (Table column names). The second drop-down list contains the list of valid mathematical operators for a rule. The Clear button clears the rule.

  • Configurable properties

    Rule
      Type of value: String
      Value options: NA
      Description: A mathematical rule that applies to a combination of column values and governs the validity of a table row.

    Description
      Type of value: String
      Value options: NA
      Description: The rule description. It is shown in the table view when you select a row that does not satisfy the rule.
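
The effect of the validation rule operator and of Remove Invalid Rows can be sketched as below. Rules are modeled as plain Python functions over a row of numeric values. This is an assumption made for the example, since the rule syntax used in the Transact UI is not described here. The names row_is_valid and apply_validation and the sample rows are hypothetical.

```python
def row_is_valid(row, rules, operator):
    """AND: every rule must hold. OR: at least one rule must hold."""
    results = [rule(row) for rule in rules]
    return all(results) if operator == "AND" else any(results)

def apply_validation(rows, rules, operator, remove_invalid_rows):
    """Return (row, is_valid) pairs, dropping invalid rows when requested."""
    checked = [(row, row_is_valid(row, rules, operator)) for row in rows]
    if remove_invalid_rows:
        return [(row, ok) for row, ok in checked if ok]
    return checked

# Hypothetical rules over numeric columns.
rules = [
    lambda row: row["qty"] * row["price"] == row["total"],
    lambda row: row["total"] > 0,
]
rows = [
    {"qty": 2, "price": 5, "total": 10},
    {"qty": 3, "price": 5, "total": 20},  # 3 * 5 != 20
]

print([ok for _, ok in apply_validation(rows, rules, "AND", False)])  # [True, False]
print([ok for _, ok in apply_validation(rows, rules, "OR", False)])   # [True, True]
print(len(apply_validation(rows, rules, "AND", True)))                # 1
```

With Remove Invalid Rows cleared, the invalid row stays in the result with its flag set to False, which corresponds to the row being kept and highlighted rather than dropped.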

Column Header Based Extraction

  1. Enter the column header regex pattern on the following screen:

    [Batch Class List] > [Batch Class] > [Document Type] > [Table Info] > [Table Extraction Rule] > [Table Column Extraction Rule]

  2. You can set the Column header pattern field for each table column extraction rule.

    Table extraction using column headers has one configurable property, defined in:

    {ephesoft-home}\WEB-INF\classes\META-INF\dcma-table-finder*

    tablefinder.gap_between_column_words=60

    This value is specified in pixels. In addition to the words directly below the column header, words to their left or right are also extracted for the column if the gap between them and the extracted data is less than the value of gap_between_column_words.
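
The gap rule described above can be sketched as follows. Words are modeled as (text, x0, x1) tuples for a single row, the words overlapping the header's horizontal span are picked first, and neighbours are then added while the pixel gap to the data collected so far stays below the threshold. Whether Transact chains neighbours in this way is an assumption, as are the function name words_for_column and the sample coordinates.

```python
GAP_BETWEEN_COLUMN_WORDS = 60  # pixels; mirrors tablefinder.gap_between_column_words

def words_for_column(row_words, header_x0, header_x1, gap=GAP_BETWEEN_COLUMN_WORDS):
    """Collect the words under a header plus nearby words within the gap."""
    # Start with the words that overlap the header's horizontal span.
    picked = [w for w in row_words if w[1] < header_x1 and w[2] > header_x0]
    grew = bool(picked)
    while grew:
        grew = False
        left = min(w[1] for w in picked)
        right = max(w[2] for w in picked)
        for w in row_words:
            if w in picked:
                continue
            gap_on_left = left - w[2]   # w sits to the left of the picked data
            gap_on_right = w[1] - right # w sits to the right of the picked data
            if 0 <= gap_on_left < gap or 0 <= gap_on_right < gap:
                picked.append(w)
                grew = True
    return [w[0] for w in sorted(picked, key=lambda w: w[1])]

row_words = [
    ("Blue", 200, 240),
    ("Steel", 250, 290),
    ("Widget", 300, 370),
    ("10.00", 600, 650),
]

print(words_for_column(row_words, 300, 380))         # ['Blue', 'Steel', 'Widget']
print(words_for_column(row_words, 300, 380, gap=5))  # ['Widget']
```

The amount at x = 600 is 230 pixels away from the collected words, so it is left for its own column. Lowering the gap makes the column value tighter, while raising it risks absorbing words from neighbouring columns.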

Regex Based Extraction

A table extraction rule must be defined with valid start and end patterns, and Regex validation must be selected in the table extraction API combination.

You must enter a valid column pattern for each column for regex based extraction. The between left and between right patterns are optional.

Select table extraction technique

Select the table extraction API as a combination of the three techniques, joined with AND or OR operators, on the following screen:

[Batch Class List] > [Batch Class] > [Document Type] > [Table Info] > [Table Extraction Rule]

Dependencies

The Table Extraction plugin has the following dependencies:

  • RECOSTAR_HOCR

  • TESSERACT_HOCR

One of these plugins must be ON, because they extract data from the image and create the hOCR file that table extraction requires.

Troubleshooting

Error message: Table info list is null or empty.
Possible root cause: No table is configured for the document type.

Error message: Table Columns Info list is null or empty.
Possible root cause: No table column is defined for table.

Error message: Table Extraction Rule List is null or empty.
Possible root cause: No table extraction rule is defined for table.

Error message: Exception occurred while validating rule for a table row.
Possible root cause: Table validation rules could not be applied properly on extraction results.

Error message: Skipping Table extraction. Switch set as off.
Possible root cause: Table extraction switch is set to OFF.