Regular expressions

A simple example for regex: Social Security Numbers

Input:

SSN.fp, ssn.yml

Command:

FormProcApp.exe SSN.fp Example2 Example2\ssn.yml -l en-US

Output:

SSN_ssn_0.json

The following rules match a Social Security number (SSN):

fields:
- id: ssn-anchor
  rules:
  - {re: '.*SSN.*'}

- id: ssn
  rules:
  - [{re: '\b\d\d\d\s*-\s*\d\d\s*-\s*\d\d\d\d\b'}, {below: ssn-anchor}] 

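The metacharacter \b matches a word boundary, which helps avoid false positives such as a match inside a longer run of digits. Although \s* is not needed for this particular input, spurious or real white space can always leak in around the dashes, so allowing for it makes the rule more robust.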
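The effect of \b and \s* can be tried outside the tool. The following is a minimal sketch that uses Python's re module as a stand-in; it is not the regex engine used by FormProcApp, and the function name find_ssn, the sample strings, and the use of search semantics (finding the pattern anywhere in the text) are assumptions made for illustration only.

```python
import re

# Same pattern as in the ssn rule above. Python's re module is only a
# stand-in here; the tool's own engine may differ in details.
SSN_PATTERN = re.compile(r'\b\d\d\d\s*-\s*\d\d\s*-\s*\d\d\d\d\b')


def find_ssn(text):
    """Return every substring of text that matches the SSN pattern."""
    return SSN_PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample inputs, not taken from SSN.fp.
    samples = [
        "448-86-1234",       # plain form
        "448 - 86 -1234",    # stray white space around the dashes
        "448-86-12345",      # too many trailing digits: rejected by \b
        "1448-86-1234",      # too many leading digits: rejected by \b
    ]
    for sample in samples:
        print(repr(sample), find_ssn(sample))
```

Without the two \b anchors, the last two samples would still yield a nine-digit match taken from the middle of a longer number.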

Whether a character needs to be escaped depends on the regex engine in use. The engine used here does not allow escaping the dash character ("-"), so the dash must be written unescaped. For the same reason, do not automatically escape every non-alphanumeric character, as doing so could result in an error.

The result contains both the anchor field and the final SSN field matches:

guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example2\SSN_ssn_0.json
{"SSN.fp":[{"ssn-anchor":[["7. SSN (If Known)",0,0]]},{"ssn":[["448-86-1234",0,0]]}]}

The first example expressed with regex

Input:

Receipt.fp, total.yml

Command:

FormProcApp.exe Receipt.fp Example2 Example2\total.yml -l en-US

Output:

Receipt_total_0.json

A similar anchor can be described with regular expressions instead of the voc rule used in Anchor:

- id: amount-anchor
  rules:
  - {re: '(SUB)?TOTAL.*'}
  - {re: 'PAYMENT.*'}

With this input, the result is the same:

guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example2\Receipt_total_0.json
{"Receipt.fp":[{"amount-anchor":[["Subtotal",0,0],["Total",0,0]]},{"amount":[["55.91",0,0],["55,11",0,0]]}]}

One difference from the voc rule is that voc automatically assumes a word separator. However, if you define a filter like the one below to compensate for OCR inaccuracy with spaces, no word delimiters remain, so the regex still works while the voc rule does not.

- id: amount-anchor
  filters:
  - [[' '], ['']]
  rules:
  - {re: '(SUB)?TOTAL.*'}
  - {re: 'PAYMENT.*'}

For more information on this filter, see Variable, Score, Filter.
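The interaction between the space-removing filter and the regex rules above can be sketched as follows. This is an illustration in Python, not the tool's implementation. It assumes that the filter [[' '], ['']] replaces every space with an empty string, that a rule must match the whole filtered text, and that matching is case-insensitive (suggested by the "Subtotal" and "Total" matches in the output above). The function names are hypothetical.

```python
import re

# The two regex rules of amount-anchor. IGNORECASE is an assumption
# based on the sample output, not a documented property of the tool.
ANCHOR_PATTERNS = [
    re.compile(r'(SUB)?TOTAL.*', re.IGNORECASE),
    re.compile(r'PAYMENT.*', re.IGNORECASE),
]


def apply_space_filter(text):
    """Assumed effect of the filter [[' '], ['']]: drop every space."""
    return text.replace(' ', '')


def is_amount_anchor(text):
    """Return True if the filtered text fully matches one of the rules."""
    cleaned = apply_space_filter(text)
    return any(p.fullmatch(cleaned) for p in ANCHOR_PATTERNS)


if __name__ == "__main__":
    # Hypothetical OCR readings with misplaced spaces.
    for reading in ["SUB TOTAL", "T O T A L", "Pay ment due", "Tax"]:
        print(repr(reading), is_amount_anchor(reading))
```

After filtering, a reading such as "T O T A L" becomes "TOTAL", which has no word delimiters left; the regex rules do not depend on them.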