Table Extraction plugin
The Table Extraction plugin extracts tabular data from the documents in a batch. The user defines basic table information and a set of table columns for each table.
Transact uses one or more rules to perform table extraction.
Note the following factors for table extraction:
- There may be multiple table extraction rules defined for a table.
- The extraction rule that provides the best and most valid data from table columns is selected to show table extraction results.
- The validity of extracted columns is based on one or more validation patterns for each column combined with table validation rules that are applied to each row of extracted table data.
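The selection among several extraction rules can be pictured with a minimal sketch. It assumes, purely for illustration, that the "best" rule is the one whose extracted rows contain the most cells that fully match their column validation pattern; the actual scoring used by Transact is not described here, and the names `count_valid_cells` and `pick_best_rule` are hypothetical.

```python
import re


def count_valid_cells(rows, patterns):
    """Count cells whose text fully matches the validation pattern of their column."""
    return sum(
        1
        for row in rows
        for column, value in row.items()
        if column in patterns and re.fullmatch(patterns[column], value)
    )


def pick_best_rule(results_by_rule, patterns):
    """Return the name of the rule whose rows have the most valid cells (assumed scoring)."""
    return max(
        results_by_rule,
        key=lambda name: count_valid_cells(results_by_rule[name], patterns),
    )


# Hypothetical validation patterns and extraction results for two rules.
patterns = {"Qty": r"\d+", "Amount": r"\d+\.\d{2}"}
results_by_rule = {
    "header_rule": [{"Qty": "2", "Amount": "10.00"}, {"Qty": "x", "Amount": "5.50"}],
    "regex_rule": [{"Qty": "2", "Amount": "10.00"}, {"Qty": "3", "Amount": "5.50"}],
}
best = pick_best_rule(results_by_rule, patterns)
```

In this sample, `regex_rule` yields four valid cells and `header_rule` three, so `regex_rule` is chosen.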
Characteristics
- Each table extraction rule has table column extraction rules, that is, one extraction rule for each table column. A column extraction rule contains the information used by the table extraction APIs, such as column pattern, column header pattern, start coordinate, end coordinate, multiline anchor, and required.
- For each document, consisting of one or more pages, the table extraction algorithm will extract all tables defined for a document type.
- The document is parsed from the first page to the last page to identify tables.
- One table may span one or more pages.
- A table defined for a document would consist of multiple table columns, table extraction rules and table validation rules. Table columns and at least one table extraction rule are minimum requirements for table extraction to give some results.
- A table extraction rule contains a start pattern and an end pattern that mark the boundaries of the table data for the extraction process. The table extraction API for an extraction rule is a combination (using AND or OR operators) of three kinds of validation:
  - Column Coordinates Validation
  - Column Header Validation
  - Regex Validation
- This API combination determines the algorithm used to extract data for every table column in a row.
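As an illustration of how an AND/OR combination of the three validations might decide whether a candidate cell is accepted, here is a minimal sketch. The left-to-right evaluation order, the `combine` function, and the check names are assumptions made for the example, not the product's documented behavior.

```python
def combine(checks, expression):
    """Evaluate an expression such as ["header", "AND", "regex"] from left to right.

    `checks` maps a validation name to its boolean outcome for one candidate cell.
    Left-to-right evaluation without operator precedence is an assumption.
    """
    result = checks[expression[0]]
    for op, name in zip(expression[1::2], expression[2::2]):
        if op == "AND":
            result = result and checks[name]
        elif op == "OR":
            result = result or checks[name]
        else:
            raise ValueError(f"Unknown operator: {op}")
    return result


# Hypothetical outcomes of the three validations for one cell.
checks = {"header": True, "coordinate": False, "regex": True}

strict = combine(checks, ["header", "AND", "coordinate"])
relaxed = combine(checks, ["header", "AND", "coordinate", "OR", "regex"])
```

Here `strict` is false because the coordinate check failed, while `relaxed` is true because the regex check rescues the cell.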
The following table summarizes which column extraction rule information is used with respect to which table extraction API.
Table extraction rule API | Table column extraction rule fields used |
---|---|
Column header validation | Uses the column header pattern. The header is located either by matching the pattern as a string with some fuzziness or by finding the best match of the pattern as a regex on the page. The coordinates of the matched header are then used to extract the text beneath it as column data. Text in close proximity to the left or right of the text beneath the header is also appended to the extracted column value. |
Column coordinate validation | Uses the start coordinate and end coordinate as the vertical boundaries for the location of the column data on the page. To set them, click the set coordinates button, upload a sample image, and draw overlays for the columns. Clicking the OK button saves the start and end coordinates to the column extraction rule. |
Regex validation | Uses the Column pattern, Between left pattern, and Between right pattern to find the best-matched text in each row for the column data. |
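Coordinate-based extraction can be sketched as follows. The example assumes each OCR word on a text line carries its left and right x-coordinates in pixels and that a word belongs to a column when its span lies within the column's start and end coordinates; the function name and the data layout are hypothetical.

```python
def assign_words_to_columns(words, columns):
    """Place each word in the first column whose [start, end] range contains it.

    `words` is a list of (text, x0, x1) tuples for one text line.
    `columns` maps a column name to its (start coordinate, end coordinate).
    """
    row = {name: [] for name in columns}
    for text, x0, x1 in words:
        for name, (start, end) in columns.items():
            if start <= x0 and x1 <= end:
                row[name].append(text)
                break
    return {name: " ".join(parts) for name, parts in row.items()}


# Hypothetical column boundaries and OCR words.
columns = {"Description": (0, 300), "Amount": (301, 500)}
words = [("Blue", 10, 60), ("widget", 70, 150), ("12.50", 420, 480)]
row = assign_words_to_columns(words, columns)
```

With these sample values, `row` holds "Blue widget" for Description and "12.50" for Amount.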
Configuration
The following table lists the configurable properties for the plugin in dcma-tablefinder.properties, located at {EphesoftHome} WEB-INFclassesMETA-INFdcma-table-finder*
Configurable property | Type of value | Value options | Description |
---|---|---|---|
tablefinder.gap_between_column_words | Integer | NA | Gap, in pixels, between words of the same column data. Used during column header extraction. The default is 60. |
tablefinder.rule_removal_invalid_characters | List of values separated by semicolons (;) | NA | Invalid characters in the extracted column value that must be ignored before the table rule is applied to the columns. |
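The effect of `tablefinder.rule_removal_invalid_characters` can be illustrated with a small sketch. The property value `$;,` used below is a made-up example, and the two helper functions are hypothetical; they only show how a semicolon-separated list might be split and applied to a cell before a rule is evaluated.

```python
def parse_invalid_characters(property_value):
    """Split the semicolon-separated property value into individual entries."""
    return [item for item in property_value.split(";") if item]


def strip_invalid_characters(value, invalid_characters):
    """Remove every configured entry from an extracted column value."""
    for item in invalid_characters:
        value = value.replace(item, "")
    return value


# Hypothetical property value: ignore the dollar sign and the comma.
invalid = parse_invalid_characters("$;,")
cleaned = strip_invalid_characters("$1,200", invalid)
```

Here `cleaned` becomes "1200", which a numeric rule can then evaluate.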
Table configuration
- Add/Delete table information
  To add or delete table information, click the corresponding buttons. After you click Add, you can enter values for any property.
- Test table
  With the test table feature, you can check whether the table configuration extracts tabular data correctly without running a batch. You can upload a valid image file or place the image file at the given path:
  {base-folder}batch-class-id test-table
- Configurable properties
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Name | String | NA | Name for the data table. |
Validation Rule Operator | List of values | OR, AND | With AND, a table row is valid if and only if it satisfies all the table validation rules defined. With OR, a table row is valid if it satisfies at least one of the validation rules. |
Remove Invalid Rows | Boolean | True if selected. False if cleared. | Whether to remove rows that are invalid according to the table validation rules from the table result data. |
Currency | List of values | Supported currencies. | Name of the currency on which validation rules for the table are based. All table columns whose Currency field is selected in a column extraction rule undergo currency extraction based on this value before validation rules are applied. |
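The interaction of Validation Rule Operator and Remove Invalid Rows can be sketched as below. The functions and the row representation (a row label paired with the boolean outcome of each validation rule) are assumptions made for illustration only.

```python
def is_row_valid(rule_results, operator):
    """AND requires every rule to pass; OR requires at least one rule to pass."""
    if operator == "AND":
        return all(rule_results)
    if operator == "OR":
        return any(rule_results)
    raise ValueError(f"Unknown operator: {operator}")


def filter_rows(rows, operator, remove_invalid_rows):
    """Return (row, valid) pairs, dropping invalid rows only when requested."""
    kept = []
    for row, rule_results in rows:
        valid = is_row_valid(rule_results, operator)
        if valid or not remove_invalid_rows:
            kept.append((row, valid))
    return kept


# Hypothetical rows with the outcome of two validation rules each.
rows = [("row1", [True, True]), ("row2", [True, False]), ("row3", [False, False])]

with_and = filter_rows(rows, "AND", remove_invalid_rows=True)
with_or = filter_rows(rows, "OR", remove_invalid_rows=True)
flagged = filter_rows(rows, "AND", remove_invalid_rows=False)
```

With AND, only `row1` survives; with OR, `row1` and `row2` survive. When Remove Invalid Rows is cleared, all rows are kept and the invalid ones are merely flagged.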
Table column configuration
- Add/Delete table column information
  To add or delete table column information, click the corresponding buttons. After you click Add, you can add table column fields.
- Configurable properties
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Column Name | String | NA | Name of the column. |
Description | String | NA | Description of the column. |
Validation Pattern | String | NA | Validation pattern of the column. This pattern validates the extracted column data for each table row. |
Alternate Values | String | NA | A semicolon-separated list of values entered by the user. These values appear as suggestions for the column in the table view on the validation screen. |
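A minimal sketch of how a Validation Pattern and an Alternate Values string might be applied follows. It assumes the pattern must match the whole cell value, which the product may or may not require, and both function names are hypothetical.

```python
import re


def validate_column(values, validation_pattern):
    """Return one boolean per row: does the cell fully match the pattern?"""
    compiled = re.compile(validation_pattern)
    return [bool(compiled.fullmatch(value)) for value in values]


def parse_alternate_values(raw):
    """Split a semicolon-separated string into trimmed, non-empty suggestions."""
    return [item.strip() for item in raw.split(";") if item.strip()]


# Hypothetical column data and configuration values.
validity = validate_column(["12.50", "N/A", "7.00"], r"\d+\.\d{2}")
suggestions = parse_alternate_values("EA; BOX ;CASE;")
```

In this sample, the second cell fails validation, and the suggestions list contains EA, BOX, and CASE.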
Table Extraction Rule configuration
- Add/Delete Table Extraction Rule
  To add or delete a table extraction rule, click the corresponding buttons. After you click Add, you can add table extraction rule fields.
- Test Table Extraction Rule
  With the test table extraction rule feature, you can check whether a table extraction rule configuration extracts tabular data correctly without running a batch. You can upload or drag and drop a valid image file, or place the image file at the given path:
  {base-folder}batch-class-id test-table
- Configurable properties
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Rule Name | String | NA | Unique name of the table extraction rule. |
Start Pattern | String | A keyword or a valid regex expression. | A keyword to be matched as a string, with some fuzziness configurable from the property file, or a regex pattern that matches a string marking the beginning of the table on a page. A correct start pattern must be specified for table data to be extracted. It can be validated using the check button. |
End Pattern | String | A keyword or a valid regex expression. | A keyword to be matched as a string, with some fuzziness configurable from the property file, or a regex pattern that matches a string marking the end of the table. It can be validated using the check button. |
Table Extraction API | Combination of Boolean values using AND and OR operators. | | A combination of the selected table extraction APIs (column header validation, column coordinate validation, and regex validation) with AND/OR operators that decides the algorithm used to extract table columns. |
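The role of the start and end patterns can be sketched as follows. This example treats both patterns as plain regular expressions, ignores the fuzzy keyword matching, and assumes the lines that match the start and end patterns are themselves excluded from the table body; `table_lines` is a hypothetical helper.

```python
import re


def table_lines(page_lines, start_pattern, end_pattern):
    """Return the lines that lie between the start match and the end match."""
    inside = False
    body = []
    for line in page_lines:
        if not inside:
            # Skip everything up to and including the line matching the start pattern.
            if re.search(start_pattern, line):
                inside = True
            continue
        if re.search(end_pattern, line):
            break
        body.append(line)
    return body


# Hypothetical page text, one string per OCR line.
page = [
    "Invoice 42",
    "Item Qty Amount",
    "Widget 2 10.00",
    "Gadget 1 5.50",
    "Total 15.50",
    "Thank you",
]
body = table_lines(page, r"Item\s+Qty", r"^Total")
```

If the start pattern never matches, the function returns an empty list, which mirrors the statement that a correct start pattern is required for any table data to be extracted.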
Column Extraction Rule configuration
- Edit Column Extraction Rule
  Click Edit to edit column extraction rule fields.
- Configurable properties
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Column Name | String | NA | Name of the column. This field is not editable and only references the table column for the table. |
Column Pattern | Regular Expression | Valid regular expression | The regex pattern for the column data. |
Between Left | Regular Expression | Valid regular expression | The regex pattern for data to the left of the searched column. |
Between Right | Regular Expression | Valid regular expression | The regex pattern for data to the right of the searched column. |
Column Header Pattern | Regular Expression | A keyword or a valid regex expression. | A keyword to be searched as a string with some fuzziness on the page, or a regex pattern whose best-matched value on the page is used as the column header. |
Start Coordinate | Integer | NA | Start coordinate for the column. |
End Coordinate | Integer | NA | End coordinate for the column. |
Multiline anchor | Boolean | True if selected. False if cleared. | Marks the column as a required column and as the anchor that denotes the start of a new row in the table on the page. This is useful when one table row spans multiple lines in the document. |
Required | Boolean | True if selected. False if cleared. | If selected, each extracted table row must contain valid data for that column. If invalid data is extracted for the column, the corresponding row is not added to the table data. |
Extract data from column | Drop-down list | List of values containing the names of other columns for the table that can be selected to fill the text box containing the name of the column for extraction. | The table column from which the current column's data is extracted when regular expression-based extraction is used. If left empty, it is not applicable. |
Currency | Boolean | | Specifies whether the column is a currency field. If it is a currency field, validation rules are applied according to the currency representation. Values are manipulated on the basis of the currency chosen at Table Info level. If this field is cleared, no currency extraction is done for the column, irrespective of the value chosen at Table Info level. For example, $ 12,000.00 is manipulated as 12000.00 for validations, and EURO 12.000,00 is manipulated as 12000.00 for validations. |
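The two currency examples can be reproduced with a short sketch. It assumes the decimal separator is known from the currency chosen at Table Info level; how Transact actually derives it is not described here, and `normalize_amount` is a hypothetical helper.

```python
def normalize_amount(text, decimal_separator):
    """Strip currency symbols and grouping marks, returning a plain decimal string.

    `decimal_separator` is "." or ",", assumed to be known for the chosen currency.
    """
    if decimal_separator not in (".", ","):
        raise ValueError("decimal_separator must be '.' or ','")
    # Keep only digits and the two possible separator characters.
    digits = "".join(ch for ch in text if ch.isdigit() or ch in ".,")
    grouping = "," if decimal_separator == "." else "."
    return digits.replace(grouping, "").replace(decimal_separator, ".")


dollars = normalize_amount("$ 12,000.00", ".")
euros = normalize_amount("EURO 12.000,00", ",")
```

Both calls yield "12000.00", so the same numeric validation rule can be applied regardless of how the amount was printed.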
Table Validation Rule configuration
- Add/Delete Table Validation Rule
  A table validation rule applies to operands (table columns) whose extracted column data must be numerical values. Table validation rules are applied to the rows of the extracted table data. Multiple rules are applied to each row in OR or AND fashion, as defined by the Validation Rule Operator at table information level. If Remove Invalid Rows is not selected at table information level, an invalid row is shown shaded in orange in the extraction results. If Remove Invalid Rows is selected, invalid rows are removed from the extraction results.
  To add or delete a table validation rule, click the corresponding button. After you click Add, you can add table validation rule fields.
  The first drop-down list contains the list of operands (table column names). The second drop-down list contains the list of valid mathematical operators for a rule. The Clear button clears the rule.
- Configurable properties
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Rule | String | NA | A mathematical rule that applies to a combination of column values and governs the validity of the data in a table row. |
Description | String | NA | The rule description. This description becomes visible in the table view when you select a row that does not satisfy the rule. |
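A mathematical rule over column values can be pictured with the sketch below. The rule shape (left operand, operator, right operand, expected result column), the column names, and the `rule_holds` function are hypothetical, since the rule syntax used by Transact is not shown here.

```python
import operator
from decimal import Decimal

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def rule_holds(row, left, op, right, result):
    """Check that row[left] <op> row[right] equals row[result].

    Missing columns, unknown operators, non-numeric text, and division by zero
    all make the rule fail instead of raising.
    """
    try:
        computed = OPERATORS[op](Decimal(row[left]), Decimal(row[right]))
        return computed == Decimal(row[result])
    except (KeyError, ArithmeticError):
        return False


# Hypothetical extracted rows with numeric column data held as text.
good_row = {"Qty": "2", "Unit Price": "4.25", "Total": "8.50"}
bad_row = {"Qty": "2", "Unit Price": "4.25", "Total": "9.00"}
garbled_row = {"Qty": "two", "Unit Price": "4.25", "Total": "8.50"}

good = rule_holds(good_row, "Qty", "*", "Unit Price", "Total")
bad = rule_holds(bad_row, "Qty", "*", "Unit Price", "Total")
garbled = rule_holds(garbled_row, "Qty", "*", "Unit Price", "Total")
```

Using `Decimal` instead of floating point avoids rounding surprises when comparing monetary amounts.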
Column Header Based Extraction
- Enter the column header regex pattern in the Column header pattern field.
- You can set the Column header pattern field for each table column extraction rule.
There is a configurable property for table extraction using column header in:
{ephesoft-home}WEB-INFclassesMETA-INFdcma-table-finder*
tablefinder.gap_between_column_words=60
Specify this value in pixels. In addition to the words below the column header, words to the left or right are also extracted for the column when the gap between them and the extracted data is less than the value specified for gap_between_column_words.
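The gap rule can be sketched as follows. The example assumes each word on a text line has left and right x-coordinates in pixels and that neighbours are absorbed one after another while each gap stays below the configured value; that chaining behavior and the function name are assumptions.

```python
def extend_with_neighbours(seed, row_words, max_gap):
    """Grow the word found under the header with close neighbours on the same line.

    `seed` and every entry of `row_words` are (text, x0, x1) tuples in pixels.
    """
    words = sorted(row_words + [seed], key=lambda word: word[1])
    index = words.index(seed)
    left = right = index
    # Absorb neighbours to the left while the horizontal gap is small enough.
    while left > 0 and words[left][1] - words[left - 1][2] < max_gap:
        left -= 1
    # Absorb neighbours to the right in the same way.
    while right < len(words) - 1 and words[right + 1][1] - words[right][2] < max_gap:
        right += 1
    return " ".join(word[0] for word in words[left:right + 1])


# Hypothetical OCR words; "widget" sits directly below the column header.
seed = ("widget", 100, 160)
others = [("Blue", 40, 90), ("large", 170, 220), ("12.50", 400, 450)]
value = extend_with_neighbours(seed, others, max_gap=60)
```

With the default gap of 60 pixels, "Blue" and "large" are appended, while "12.50" is 180 pixels away and is left for another column.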
Regex Based Extraction
A table extraction rule must be defined with valid start and end patterns, and Regex validation must be selected in the table extraction API combination.
You must enter valid column patterns for regex-based extraction. The between left and between right patterns are optional.
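A minimal sketch of regex-based column extraction with optional between left and between right patterns follows. It takes the first match inside the bounded region, whereas the product is described as choosing the best match, so treat the matching strategy and the `extract_column` helper as assumptions.

```python
import re


def extract_column(row_text, column_pattern, between_left=None, between_right=None):
    """Search for the column pattern, optionally bounded by left and right patterns."""
    start, end = 0, len(row_text)
    if between_left:
        match = re.search(between_left, row_text)
        if not match:
            return None
        start = match.end()
    if between_right:
        match = re.search(between_right, row_text[start:])
        if not match:
            return None
        end = start + match.start()
    match = re.search(column_pattern, row_text[start:end])
    return match.group(0) if match else None


# Hypothetical row text: item code, description, quantity, unit, amount.
row = "4471 Widget 2 EA 10.00"
unbounded = extract_column(row, r"\d+")
bounded = extract_column(row, r"\d+", between_left=r"Widget", between_right=r"EA")
```

Without the bounding patterns, the item code "4471" is picked up; with them, the search is limited to the text between "Widget" and "EA" and returns the quantity "2".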
Select table extraction technique
Select a table extraction API by combining the three techniques (column header validation, column coordinate validation, and regex validation) with AND or OR operators.
Dependencies
Table extraction plugin has the following dependencies:
- RECOSTAR_HOCR
- TESSERACT_HOCR
One of the above plugins must be ON, because these plugins extract data from the image and create the hOCR file that table extraction requires.
Troubleshooting
Error message | Possible root cause |
---|---|
Table info list is null or empty. | No table is configured for the document type. |
Table Columns Info list is null or empty. | No table column is defined for the table. |
Table Extraction Rule List is null or empty. | No table extraction rule is defined for the table. |
Exception occurred while validating rule for a table row. | Table validation rules could not be applied properly to the extraction results. |
Skipping Table extraction. Switch set as off. | The table extraction switch is set to OFF. |