Regular Regex Extraction plugin
This plugin extracts index field values based on the pattern defined for each field. A pattern is a semicolon-separated sequence of one or more words followed by a regular expression. The system searches each page for the regular expression. If a match is found, the system looks to the left of the match and checks whether all of the preceding words in the pattern can be found. If all of the words are found (in order), the value is extracted. If only some of the words are found, or none of them, the value is not extracted.
Examples
Consider the following text defined for the pattern field of the InvoiceDate index field: `Invoice;Date;\d{1,2}[/]\d{1,2}[/]\d{2,4}`
- Example 1
Text string in document: Invoice Date 21/03/2012
Result: "21/03/2012" is extracted for the InvoiceDate index field. This happens because "21/03/2012" matches the regular expression, "Date" is found to its left, and "Invoice" is found to the left of "Date".
- Example 2
Text string in document: Date 21/03/2012
Result: Nothing is extracted for this index field. Even though "21/03/2012" matches the regular expression, and "Date" is found to its left, the word "Invoice" is not found to the left of "Date".
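The matching rule behind both examples can be illustrated with a minimal Python sketch. It is not the plugin's implementation. The function name `extract_value` is hypothetical, the digit classes are assumed to be written as `\d`, and the sketch assumes that words are compared as whole whitespace-separated tokens and do not have to sit directly next to the match.

```python
import re
from typing import List, Optional


def extract_value(page_text: str, words: List[str], regex: str) -> Optional[str]:
    """Return the first regex match whose left context contains all words in order."""
    for match in re.finditer(regex, page_text):
        # Tokens that appear before the match, in reading order.
        left_tokens = page_text[:match.start()].split()
        # Walk leftwards from the match, consuming the words from last to first.
        remaining = list(words)
        for token in reversed(left_tokens):
            if remaining and token == remaining[-1]:
                remaining.pop()
        if not remaining:
            # Every word was found to the left, in the required order.
            return match.group(0)
    return None


date_regex = r"\d{1,2}[/]\d{1,2}[/]\d{2,4}"
print(extract_value("Invoice Date 21/03/2012", ["Invoice", "Date"], date_regex))  # 21/03/2012
print(extract_value("Date 21/03/2012", ["Invoice", "Date"], date_regex))          # None
```

The first call mirrors Example 1 and the second mirrors Example 2, where "Invoice" is missing to the left of "Date".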
Plugin configuration
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Regular Regex Extraction Switch | List of Values | | This property determines if the plugin will run or not. Default value is ON. |
Regular Regex Confidence Score | Integer | 0 - 100 | Acts as a multiplier for the confidence score calculated by matching regex. |
In the Pattern column you can enter the semicolon-separated set of words and regular expression for each index field.
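Before entering a pattern, it can help to check that it has the expected shape. The sketch below is a hypothetical pre-check in Python, not part of the plugin. The name `parse_pattern` is illustrative, the regular expression is assumed not to contain a semicolon itself, and Python's `re` module is only a stand-in for whatever regex engine the plugin uses, so dialect differences are possible.

```python
import re
from typing import List, Tuple


def parse_pattern(pattern: str) -> Tuple[List[str], str]:
    """Split 'word;word;regex' into (words, regex); raise ValueError if malformed."""
    parts = [p.strip() for p in pattern.split(";")]
    # At least one word plus the regular expression, and no empty segments.
    if len(parts) < 2 or any(not p for p in parts):
        raise ValueError("Invalid input pattern sequence.")
    *words, regex = parts
    try:
        re.compile(regex)  # only checks syntax, in Python's regex dialect
    except re.error as exc:
        raise ValueError("Invalid input pattern sequence.") from exc
    return words, regex


words, regex = parse_pattern(r"Invoice;Date;\d{1,2}[/]\d{1,2}[/]\d{2,4}")
print(words)  # ['Invoice', 'Date']
print(regex)  # \d{1,2}[/]\d{1,2}[/]\d{2,4}
```

The error text is borrowed from the troubleshooting table only to make the connection visible; the conditions under which the plugin itself raises that message may differ.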
Troubleshooting
Error message | Possible root cause |
---|---|
Invalid input pattern sequence. | The pattern entered is not a valid regular expression, or does not match the proper format. |
No FieldType data found from data base for document type | The FieldType column does not contain a valid value. |