Attachment detection

An invoice may be scanned or submitted with an attachment. Attachments are documents that are related to the invoice but are not part of the invoice itself. Examples include time sheets, activity reports for a billing period, delivery notes, and even references to other invoices.

These attachments can interfere with invoice extraction because they may contain content that looks plausible for extraction. In the worst case, field values are extracted from the attachment rather than the invoice. It is also possible that two equally confident extraction alternatives are found, one on the attachment and one on the invoice. Since there is no clear winner, the field is flagged as invalid.

If a page is detected as an attachment, that information is used during extraction: either the page is excluded altogether, or its field confidences are lowered enough that any alternatives located on the attachment are unlikely to be included in the overall extraction results. This can improve the quality of extraction.
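The effect of excluding a page or lowering its confidences can be illustrated with a minimal sketch. The `Candidate` type, the `apply_attachment_penalty` function, and the `penalty` factor are hypothetical names chosen for illustration; the product's actual scoring is not described here.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Candidate:
    """Hypothetical extraction alternative for one field."""
    value: str
    page: int          # 0-based page index
    confidence: float  # 0.0 to 1.0


def apply_attachment_penalty(
    candidates: List[Candidate],
    attachment_pages: Set[int],
    exclude: bool = False,
    penalty: float = 0.5,  # assumed factor, not a documented value
) -> List[Candidate]:
    """Drop or down-weight candidates found on attachment pages."""
    result = []
    for candidate in candidates:
        if candidate.page in attachment_pages:
            if exclude:
                continue  # the page is excluded altogether
            # Otherwise lower the confidence of this alternative.
            candidate = Candidate(
                candidate.value, candidate.page, candidate.confidence * penalty
            )
        result.append(candidate)
    return result


totals = [Candidate("100.00", 0, 0.9), Candidate("250.00", 2, 0.9)]
ranked = apply_attachment_penalty(totals, attachment_pages={2})
best = max(ranked, key=lambda c: c.confidence)
```

In this example, the two alternatives start with equal confidence, which would leave no clear winner. After the penalty, the alternative on the invoice page ranks first.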

Attachment detection applies the following methods in sequence:

Detection type

Description

Page numbering

If page numbers are printed on the invoice and on the attachment, checks for a break in numbering can detect an attachment.

For example, if the first two pages are numbered 1 and 2 and the third page is numbered 1 again, the third page is considered an attached document. Similarly, if the first page is numbered Page 1 of 2 and the second page is numbered Page 2 of 2, a third page can be considered an attachment even if it has no page number.

Since not every invoice contains page numbers, additional methods of detecting attachments are needed.
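The two page-numbering examples can be sketched as follows. This is an illustration only: the function name and the input format (a `(number, total)` tuple per page, or `None` for a page without a printed number) are assumptions, not the product's implementation.

```python
from typing import List, Optional, Tuple

# (page number, total pages if printed as "Page N of M", else None)
PageLabel = Optional[Tuple[int, Optional[int]]]


def first_attachment_page(pages: List[PageLabel]) -> Optional[int]:
    """Return the 0-based index of the first page that breaks the
    numbering, or None if no break is found."""
    expected = 1
    known_total = None
    for index, page in enumerate(pages):
        # "Page 2 of 2" was already seen, so any further page is extra.
        if known_total is not None and expected > known_total:
            return index
        if page is None:
            expected += 1  # unnumbered page: no conclusion possible
            continue
        number, total = page
        # A restart or jump in numbering marks an attached document.
        if index > 0 and number != expected:
            return index
        if total is not None:
            known_total = total
        expected = number + 1
    return None


restart = first_attachment_page([(1, None), (2, None), (1, None)])
past_total = first_attachment_page([(1, 2), (2, 2), None])
no_numbers = first_attachment_page([None, None, None])
```

When no page numbers are printed, the function returns `None`, which is why further detection methods are needed.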

Layout

An invoice and an attachment typically have different layouts. Even invoices that span multiple pages usually have repeating text blocks at the top or bottom of each page. A page with a different layout is considered an attachment, unless line items continue onto that page or no plausible amounts are found on previous pages.

If the layouts of an invoice and its attachments are similar, layout alone is not sufficient to detect an attachment. One final method of detecting attachments is available.
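The layout rule and its two exceptions can be sketched as follows. How layouts are actually compared is not documented here, so this sketch assumes a simple overlap measure (Jaccard similarity) on the repeating text blocks. The `Page` type, the `threshold` value, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Page:
    """Hypothetical page summary used only for this illustration."""
    repeating_blocks: Set[str]          # text blocks at top or bottom
    has_line_items: bool = False        # line items continue on this page
    has_plausible_amount: bool = False  # a plausible amount was found


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets of text blocks, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def layout_attachments(pages: List[Page], threshold: float = 0.5) -> List[int]:
    """Return 0-based indexes of pages flagged as attachments by layout."""
    flagged: List[int] = []
    if not pages:
        return flagged
    reference = pages[0].repeating_blocks
    amount_seen = pages[0].has_plausible_amount
    for index in range(1, len(pages)):
        page = pages[index]
        differs = jaccard(reference, page.repeating_blocks) < threshold
        # Exceptions: line items on this page, or no amount found so far.
        if differs and not page.has_line_items and amount_seen:
            flagged.append(index)
        else:
            amount_seen = amount_seen or page.has_plausible_amount
    return flagged


invoice = [
    Page({"ACME Ltd", "Invoice 42"}, has_plausible_amount=True),
    Page({"ACME Ltd", "Invoice 42"}, has_line_items=True),
    Page({"Time sheet"}),
]
flagged_pages = layout_attachments(invoice)
```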

Keywords for attachment headers

Keywords are another way to detect an attachment. If a specific keyword is found in the header section of a page, the page is identified as an attachment.

If data is being extracted from pages that you know are attachments, you can provide a comma-separated list of keywords or phrases that appear on those pages. Add this list to the Attachment Headers field under Settings > Invoice Processing > Capture Profiles > Extraction Settings. Any pages that contain these keywords are identified as attachments, and any data on those pages is unlikely to be included in the final extraction results.
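As an illustration of the comma-separated format, the following sketch splits a value such as the one entered in the Attachment Headers field into individual keywords and phrases. The function name and the sample value are hypothetical, and the trimming of surrounding spaces is an assumption about how the list is read.

```python
from typing import List


def parse_attachment_headers(setting: str) -> List[str]:
    """Split a comma-separated setting into keywords and phrases,
    ignoring empty entries and surrounding whitespace (assumed)."""
    return [part.strip() for part in setting.split(",") if part.strip()]


keywords = parse_attachment_headers("Time sheet, Delivery note ,,Activity report")
```

Note that a phrase such as "Time sheet" stays intact because only commas separate the entries.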

In addition, the following constraints are put on keyword matches:

  • The match must not be on page 1.

  • The match must be in the header section of the page.

  • The match must not be part of floating text.

  • The match must stand alone, or there must be some distance between it and other text boxes located near the match.
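The constraints above can be read as a single check on each keyword match. The following sketch is an illustration under stated assumptions: the `KeywordMatch` fields, the function name, and the `min_distance` value are hypothetical, and the distance that counts as sufficient is not specified in this topic.

```python
from dataclasses import dataclass


@dataclass
class KeywordMatch:
    """Hypothetical description of where a keyword was found."""
    page_number: int                # 1-based page number
    in_header: bool                 # located in the page header section
    in_floating_text: bool          # part of running (floating) text
    stands_alone: bool              # no other text boxes nearby
    distance_to_nearest_box: float  # arbitrary units


def is_valid_attachment_match(match: KeywordMatch, min_distance: float = 20.0) -> bool:
    """Apply the four constraints; min_distance is an assumed value."""
    if match.page_number == 1:
        return False  # the match must not be on page 1
    if not match.in_header:
        return False  # the match must be in the header section
    if match.in_floating_text:
        return False  # the match must not be part of floating text
    # The match must stand alone or be far enough from nearby text boxes.
    return match.stands_alone or match.distance_to_nearest_box >= min_distance


header_hit = KeywordMatch(3, True, False, True, 0.0)
first_page_hit = KeywordMatch(1, True, False, True, 0.0)
crowded_hit = KeywordMatch(2, True, False, False, 5.0)
```

Here only `header_hit` satisfies all four constraints. `first_page_hit` fails because it is on page 1, and `crowded_hit` fails because it neither stands alone nor has enough distance to neighboring text.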

If the keywords do not identify a page as an attachment, the page is most likely not an attachment.
