Blocks, expression
Input: Receipt.fp, address.yml, Post_City.txt
Command: FormProcApp.exe Receipt.fp Example4 Example4\address.yml -l en-US
Output: Receipt_address_0.json
This example demonstrates a use case in which a visually cohesive block of text must be extracted, with the matches inside it specified by rules.
- id: city
  to-be-output: true
  rules:
    - [{voc: cities}, {expr: [TOP, lt, 20]}]
This field is intended to be an anchor, ensuring rows within the same visual block are produced while preserving their order.
The expression rule (expr:) is specified without a score, meaning it acts as a mandatory AND. Such a restriction is necessary to reduce false positives when dealing with large vocabulary databases.
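The mandatory-AND behaviour described above can be illustrated with a minimal sketch. This is not FormProcApp code: the row dictionaries, the `is_anchor` helper, the stand-in vocabulary, and the reading of `[TOP, lt, 20]` as "top edge above 20% of the page height" are all assumptions made for illustration.

```python
# Hypothetical illustration only: not FormProcApp's implementation.
CITIES = {"san jose", "telephone", "severy"}  # stand-in vocabulary


def is_anchor(row, page_height, max_top_percent=20.0):
    """Return True only if BOTH conditions hold (mandatory AND)."""
    in_vocabulary = row["text"].lower() in CITIES
    top_percent = 100.0 * row["top"] / page_height
    return in_vocabulary and top_percent < max_top_percent


rows = [
    {"text": "San Jose", "top": 120},   # in the vocabulary and near the top
    {"text": "Telephone", "top": 900},  # in the vocabulary but too low
    {"text": "Total", "top": 50},       # near the top but not in the vocabulary
]
anchors = [r["text"] for r in rows if is_anchor(r, page_height=1000)]
print(anchors)  # ['San Jose']
```

Only the row that satisfies both conditions survives, which is why a broad vocabulary does not produce anchors all over the page.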
- id: address
  to-be-output: true
  rules:
    - [{re: '.*'}, {same-block: city}]
The cities vocabulary refers to Post_City.txt, which was filtered from the National Address Database for demonstration purposes. Note that even "telephone" is listed as a city there, but it is excluded as an anchor because the anchor search is restricted to the top 20% (the entire block can grow beyond this limit, but the anchor row cannot).
The output will contain multiple city matches as anchors, because the list of USA cities is so extensive.
guide> jq -c '{(.input): [.fields[] | {(.id): [.matches[] | [.text, .score, .error]]}]}' Example4\Receipt_address_0.json
{"Receipt.fp":[{"address":[["San Jose, CA 95117",0,0],["3150 Stevens Creek Blvd",0,0],["Telephone:",0,0],["(408)247-3498",0,0],["Vitamin Shoppe #177",0,0],["Answers For Every Bodr",0,0]]},
{"city":[["San Jose",0,1],["SEVERY",0,29],["Stephens",0,40]]}]}
Multiple cities were recognized, all within the same block, because the error: 1 option was not used in the voc rule.
By design, the output is ordered by the anchor's score, error, and distance. Users may wish to re-sort it by matches.properties.box.top to restore the original row order before passing it to a third-party address parser:
guide> jq -c '.fields[] | select(.id == "address") | .matches | sort_by(.properties.box.top)[] | [.text, .referredFields[].text]' Example4\Receipt_address_0.json
["Answers For Every Bodr","SEVERY"]
["Vitamin Shoppe #177","San Jose","Stephens"]
["3150 Stevens Creek Blvd","San Jose","Stephens"]
["San Jose, CA 95117","San Jose","Stephens"]
["Telephone:","San Jose","Stephens"]
["(408)247-3498","San Jose","Stephens"]
As seen above, a false positive was also found: "Answers For Every Bodr" fuzzily matched the city "SEVERY" (one extra character, compared case-insensitively). It can be identified as a separate block because its referredFields text for the city field differs from that of the other rows, as the first record shows.
Note that not only was the string "San Jose" found in the large block as an anchor, but also "Stephens", which fuzzily matched "Stevens" (the error is recorded in the .json).
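Separating the false-positive block from the real address block can be sketched as a grouping step over the reordered rows. The input below is copied from the jq output shown earlier in this example; the `split_blocks` helper is a hypothetical name, and grouping on the referred city texts is one possible approach rather than a feature of the tool.

```python
from itertools import groupby

# Rows as printed by the reordering jq command: [row text, referred city texts...]
rows = [
    ["Answers For Every Bodr", "SEVERY"],
    ["Vitamin Shoppe #177", "San Jose", "Stephens"],
    ["3150 Stevens Creek Blvd", "San Jose", "Stephens"],
    ["San Jose, CA 95117", "San Jose", "Stephens"],
    ["Telephone:", "San Jose", "Stephens"],
    ["(408)247-3498", "San Jose", "Stephens"],
]


def split_blocks(rows):
    """Group consecutive rows that refer to the same set of anchors."""
    blocks = []
    for anchors, group in groupby(rows, key=lambda r: tuple(r[1:])):
        blocks.append({
            "anchors": list(anchors),
            "rows": [r[0] for r in group],
        })
    return blocks


blocks = split_blocks(rows)
for block in blocks:
    print(block["anchors"], len(block["rows"]))
```

The block whose only anchor is "SEVERY" can then be dropped before the remaining rows are handed to an address parser.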