Automatic Table Extraction
Automatic table extraction runs several algorithms.
- Position-based algorithm
The position-based algorithm tries to find a column on the left that contains regularly increasing position numbers, for example 0010, 0020, 0030, and so on. The algorithm then finds combinations where unit-price, total-price, and quantity are consistent.
Once those amounts and quantities are identified, the algorithm tries to assign the rest of the table row content as description and article code columns.
The table should be at least 10% of the total page width, and should not be located in the right half of the document. Additionally, the gap between the table header and the first position can be no wider than 300 pixels.
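The position-number check and the geometric constraints above can be sketched as follows. This is only an illustration, not the product's implementation: the function names, the parameters, the three-row minimum, and the reading of "not in the right half" as "the left edge of the table lies in the left half of the page" are all assumptions.

```python
import re


def is_position_sequence(tokens, min_rows=3):
    """Return True if the tokens look like regularly increasing position numbers.

    min_rows is an assumed threshold; the documentation does not state one.
    """
    if len(tokens) < min_rows:
        return False
    if not all(re.fullmatch(r"\d+", token) for token in tokens):
        return False
    values = [int(token) for token in tokens]
    steps = {later - earlier for earlier, later in zip(values, values[1:])}
    # A regular sequence has exactly one positive step, e.g. 10 for 0010, 0020, 0030.
    return len(steps) == 1 and steps.pop() > 0


def table_geometry_ok(table_left, table_width, page_width, header_gap_px):
    """Check the documented constraints (hypothetical parameter names, in pixels)."""
    wide_enough = table_width >= 0.10 * page_width
    # Assumption: "not in the right half" means the left edge is in the left half.
    not_in_right_half = table_left < page_width / 2
    header_gap_ok = header_gap_px <= 300
    return wide_enough and not_in_right_half and header_gap_ok
```

For example, `is_position_sequence(["0010", "0020", "0030"])` accepts the column, while a column with an irregular step such as 0010, 0020, 0040 is rejected.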
- Amount-based algorithm
The amount-based algorithm uses a combination of amounts and quantity to locate table lines.
The algorithm takes input from a format locator to locate amounts and quantities on the document.
The algorithm then finds combinations where unit-price, total-price, and quantity are consistent.
If the relation unit-price * quantity = total-price involves a unit factor, the algorithm tests unit factors of 1, 100, and 1000.
Once those amounts and quantities are identified, the algorithm tries to assign the rest of the table row content as description and article code columns.
For pages after the first page, the algorithm derives the column geometry from the table rows found on the first page. The algorithm then tries to use those rows as templates for all subsequent pages.
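The consistency check with unit factors can be illustrated with a short sketch. It assumes that a unit factor means the unit price is quoted per 1, per 100, or per 1000 units, so the product is divided by the factor. The function name, the tolerance, and that direction of the factor are assumptions, not documented behavior.

```python
from decimal import Decimal

# Unit factors named in the documentation.
UNIT_FACTORS = (1, 100, 1000)


def consistent_unit_factor(unit_price, quantity, total_price,
                           tolerance=Decimal("0.01")):
    """Return the first unit factor that makes the three values consistent.

    Assumption: unit_price * quantity / factor should equal total_price.
    Returns None if no tested factor fits within the tolerance.
    """
    for factor in UNIT_FACTORS:
        expected = unit_price * quantity / factor
        if abs(expected - total_price) <= tolerance:
            return factor
    return None
```

With a unit price of 250.00 per 100 units, a quantity of 4, and a total price of 10.00, the sketch returns 100. Values that fit none of the three factors return `None`, so the candidate combination would be discarded.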
- Header-based algorithm
The header-based algorithm uses the table header packs to locate table items.
Using the table header packs, the header-based algorithm locates table cells from the positions of the trained table headers.
The layout of the table header located on your document is applied to your table rows.
Similar to the located table header, table rows may span multiple lines.
To improve the extraction quality, train all column headers for a document, even if those columns are not part of your table model.
You can help the header-based algorithm segregate the table rows into table cells by improving your table header pack.
The table locator expert mode helps you identify header texts that are missing from your table header pack.
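To make the idea of applying the header layout to table rows more concrete, here is a minimal sketch. It assumes that each located header yields a horizontal span for its column and that each word in a row is assigned to the span it overlaps most. The function name and the data shapes are hypothetical and do not reflect the product's internal model.

```python
def assign_to_columns(header_spans, words):
    """Assign the words of one table row to columns.

    header_spans: dict mapping column name -> (left, right), assumed to be
                  derived from the located table header.
    words:        list of (text, left, right) tuples for one row.
    """
    row = {name: [] for name in header_spans}
    for text, left, right in words:
        best_name, best_overlap = None, 0
        for name, (span_left, span_right) in header_spans.items():
            overlap = min(right, span_right) - max(left, span_left)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # Words that overlap no header span stay unassigned in this sketch.
        if best_name is not None:
            row[best_name].append(text)
    return {name: " ".join(texts) for name, texts in row.items()}
```

A row containing "Steel", "bolt", "4", and "10.00" would be split into description, quantity, and total cells according to where each word falls under the header spans.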
- Line-based algorithm
The line-based algorithm uses graphical lines on a document to segregate the table area into table columns, rows, and cells.
After the initial table row candidates are extracted, the remaining words between table rows are merged into existing table rows, preserving the segregation imposed by the vertical and horizontal lines.
The table headers are used to assign the table columns.
To ensure the best results, train the table headers in the table header pack.
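The segregation of a table area by graphical lines can be sketched as a grid lookup. The example assumes that the x-positions of the vertical lines and the y-positions of the horizontal lines are already detected and sorted, and that each word is represented by a single reference point. The names and data shapes are hypothetical.

```python
from bisect import bisect_right


def cell_index(x, y, vertical_lines, horizontal_lines):
    """Return (row, column) of the grid cell containing (x, y), or None.

    vertical_lines and horizontal_lines must be sorted in ascending order.
    """
    column = bisect_right(vertical_lines, x) - 1
    row = bisect_right(horizontal_lines, y) - 1
    inside_columns = 0 <= column < len(vertical_lines) - 1
    inside_rows = 0 <= row < len(horizontal_lines) - 1
    return (row, column) if inside_columns and inside_rows else None


def words_to_cells(words, vertical_lines, horizontal_lines):
    """Group (text, x, y) words by the cell that the lines place them in."""
    cells = {}
    for text, x, y in words:
        index = cell_index(x, y, vertical_lines, horizontal_lines)
        if index is not None:  # words outside the grid are ignored here
            cells.setdefault(index, []).append(text)
    return {index: " ".join(texts) for index, texts in cells.items()}
```

The sketch yields only cell contents by grid position. Naming the columns would still rely on the table headers, as described above.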
- Layout-based algorithm
The main idea of the layout-based table algorithm is that the layout of line items in a table is stable across all lines. For example, the description is often left-aligned alphanumeric text that starts at a certain horizontal position and the unit-price is a right-aligned number at a different horizontal position.
The layout-based table algorithm analyzes documents page by page. For each page, the algorithm searches for left-aligned alphanumeric texts and right-aligned numbers.
If the horizontal positions of the blocks are similar across several lines, a template is generated and applied to all lines where possible.
This algorithm works best with a stable table layout and many line items. It does not work well with only one or a few line items.
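The template idea can be illustrated with a rough sketch that collects left edges of text and right edges of numbers, and keeps only the positions that recur across several lines. The bucketing by a fixed tolerance, the three-line minimum, and the simple number test are assumptions chosen for the example. Positions close to a bucket boundary can land in different buckets, which a real implementation would need to handle.

```python
from collections import Counter


def build_template(lines, tolerance=5, min_lines=3):
    """Return alignment anchors shared by at least min_lines lines.

    lines: list of lines on one page; each line is a list of
           (text, left, right) tuples.
    Result: sorted list of (alignment, position) tuples.
    """
    counts = Counter()
    for line in lines:
        anchors = set()
        for text, left, right in line:
            is_number = text.replace(",", "").replace(".", "").isdigit()
            # Numbers are anchored at their right edge, other text at the left.
            kind, position = ("right", right) if is_number else ("left", left)
            anchors.add((kind, round(position / tolerance) * tolerance))
        counts.update(anchors)  # each anchor counts once per line
    return sorted(anchor for anchor, n in counts.items() if n >= min_lines)
```

With only one or two lines, no anchor reaches the minimum count and the sketch returns an empty template, which mirrors why this approach needs many line items.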