Vocabularies

The following examples demonstrate the use of larger, file-based vocabularies.

File-based vocabularies

Input:

Receipt.fp, vendor1.yml, USA_CompanyDictionary.txt, English_CompanyEndings.txt

Command:

FormProcApp.exe Receipt.fp Example3 Example3\vendor1.yml -l en-US

Output:

Receipt_vendor1_0.json

For this example, use the company and company ending list files.

Each of the following rules searches one of these dictionaries for keywords. The first matching keyword is written to the output.

- id: Vendor
  rules:
  - {voc: USA_CompanyDictionary}
  - {voc: CompanyEndings}

Only the matching part of the text is written to the output, in the form of the vocabulary entry that fuzzily matches it.

guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example3\Receipt_vendor1_0.json
{"Receipt.fp":[{"Vendor":[["Vitamin Shoppe",0,0],["Vitamin Shoppe",0,0],["Vitamin Shoppe",0,0]]}]}

The vendor is written three times because only the matched form is recorded, not the original fragment. The original fragment can be found by looking up the top coordinates of the box in the .json file. The following text fragments appear in the output, separated into words:

Vitamin Shoppe #177
Vitamin Shoppe values your feedback. *
Vitamin Shoppe Gift Card.
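
The behavior described above, where three different fragments all produce the same vocabulary form, can be illustrated with a minimal Python sketch. This is not how FormProcApp is implemented: the function name first_vocabulary_match and the two-entry vocabulary are hypothetical, and the sketch uses an exact whole-word search instead of the fuzzy search the tool performs.

```python
import re

def first_vocabulary_match(text, vocabulary):
    """Return the first vocabulary entry found in text as whole words, or None."""
    for entry in vocabulary:
        # Word boundaries keep an entry from matching inside a longer word.
        pattern = r"\b" + re.escape(entry) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return entry  # the vocabulary form, not the original fragment
    return None

# Hypothetical stand-in for a few lines of USA_CompanyDictionary.txt.
vocabulary = ["Vitamin Shoppe", "Home Depot"]

fragments = [
    "Vitamin Shoppe #177",
    "Vitamin Shoppe values your feedback. *",
    "Vitamin Shoppe Gift Card.",
]

matches = [first_vocabulary_match(f, vocabulary) for f in fragments]
print(matches)
```

Because the function returns the vocabulary entry instead of the fragment, all three fragments yield the same text, which mirrors the repeated "Vitamin Shoppe" in the output above.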

Vocabularies in regex

Input:

Receipt.fp, vendor2.yml, USA_CompanyDictionary.txt, English_CompanyEndings.txt

Command:

FormProcApp.exe Receipt.fp Example3 Example3\vendor2.yml -l en-US

Output:

Receipt_vendor2_0.json

Vocabularies are listed in the vocabularies section and referred to by these names.

The regex rule is combined with these vocabularies:

- id: Vendor
  rules:
  - [{re: '(?<company>\D.*)', cvoc: {company : USA_CompanyDictionary}}]
  - [{re: '\S+\s+(?<companyEnd>\D\S+)', cvoc: {companyEnd : CompanyEndings}}] 

Vocabularies fall back to a fuzzy search when punctuation removal does not result in a match. The fuzzy search can be disabled by specifying error: 1 as the maximum error, which still allows punctuation removal before or after a sequence of words.

The expr: rule adds a score when such a rule succeeds. The intention is that the address is expected in the upper section of a receipt.

guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example3\Receipt_vendor2_0.json
{"Receipt.fp":[{"Vendor":[["Vitamin Shoppe #177",0,0],["Vitamin Shoppe values your feedback. *",0,0],["Vitamin Shoppe Gift Card.",0,0],["VitaminShoppe com www.BodyTech.com",0,13]]}]} 

The last two entries here are probably unwanted because they are not located in the header.
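
For readers without jq, the projection performed by the jq filter in this example can be sketched in Python. The key names input, fields, id, matches, text, score and error are taken from the filter itself. The sample dictionary is a hypothetical, trimmed-down stand-in for the real result file, which likely contains more keys.

```python
import json

def summarize(result):
    """Mirror the jq filter: {input: [{field id: [[text, score, error], ...]}, ...]}."""
    return {
        result["input"]: [
            {field["id"]: [[m["text"], m["score"], m["error"]] for m in field["matches"]]}
            for field in result["fields"]
        ]
    }

# Hypothetical, trimmed-down result; the real file is assumed to have this shape.
sample = {
    "input": "Receipt.fp",
    "fields": [
        {
            "id": "Vendor",
            "matches": [
                {"text": "Vitamin Shoppe #177", "score": 0, "error": 0},
                {"text": "VitaminShoppe com www.BodyTech.com", "score": 0, "error": 13},
            ],
        }
    ],
}

# Compact separators give the same single-line layout as jq -c.
summary = json.dumps(summarize(sample), separators=(",", ":"))
print(summary)
```

To apply the sketch to a real result, load the file with json.load and pass the resulting object to summarize, assuming the file holds a single JSON object.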

Vocabularies fuzzy match

Input:

Receipt.fp, vendor3.yml, USA_CompanyDictionary.txt, English_CompanyEndings.txt

Command:

FormProcApp.exe Receipt.fp Example3 Example3\vendor3.yml -l en-US

Output:

Receipt_vendor3_0.json

The vocabulary search uses fuzzy matching by default, so it can be useful to limit the allowed errors to punctuation differences only. The limit error: 1 does exactly that:

  - [{re: '(?<company>\D.*)', cvoc: {company : USA_CompanyDictionary}, error: 1}]
  - [{re: '\S+\s+(?<companyEnd>\D\S+)', cvoc: {companyEnd : CompanyEndings}, error: 1}]

Now the fragment "VitaminShoppe" no longer matches fuzzily, because it does not contain the space character:

guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example3\Receipt_vendor3_0.json
{"Receipt.fp":[{"Vendor":[["Vitamin Shoppe #177",0,0],["Vitamin Shoppe values your feedback. *",0,0],["Vitamin Shoppe Gift Card.",0,0]]}]}
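
The difference between a punctuation-only deviation and any other deviation can be illustrated with a short Python sketch. How FormProcApp actually computes the error value is not described here, so the two helper functions below are hypothetical and only show the idea behind the error: 1 limit: punctuation around a sequence of words is ignored, while a missing space is a real difference.

```python
import string

def strip_outer_punctuation(text):
    """Remove punctuation and whitespace before and after a sequence of words."""
    return text.strip(string.punctuation + string.whitespace)

def matches_within_punctuation(fragment, entry):
    """True when fragment and entry differ only by surrounding punctuation."""
    return strip_outer_punctuation(fragment).lower() == strip_outer_punctuation(entry).lower()

entry = "Vitamin Shoppe"

# Differs only by a trailing period, so it is accepted.
punctuation_only = matches_within_punctuation("Vitamin Shoppe.", entry)

# Lacks the space character, so it is rejected once fuzzy matching is off.
missing_space = matches_within_punctuation("VitaminShoppe", entry)

print(punctuation_only, missing_space)
```

Under these assumptions the first comparison succeeds and the second fails, which corresponds to the fourth match disappearing from the output above.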