Specific training of trainable locators

All trainable locators provide an algorithm called specific training. Specific training recognizes the layout of a trained document and correlates all possible features that can be used for extraction with that layout.

During extraction, these locators internally perform a layout classification. If the layout is known from a training document, the document is extracted based on information from the training document with the same layout. The disadvantage of this approach is that you must train at least one document for each layout. Using the specific training algorithm in combination with the generic Knowledge Bases significantly reduces the number of necessary training documents.

The tight integration between the specific training algorithm and the Online Learning workflow greatly reduces the effort for additional document training. The specific training algorithm is also robust against unintended training errors, because a training document affects only other documents of the same layout.

Understanding the behind-the-scenes behavior of the specific training algorithm is important when troubleshooting documents that are not extracted as expected. The specific training algorithm classifies and extracts your documents using a combination of virtual classes and templates that do not always have a 1:1 relationship with your configured classes.

Virtual classes in specific training

During extraction, a type of classification is performed on documents using virtual classes. These virtual classes group together documents with similar layouts. You do not need to configure virtual classes. They are created automatically and are used only by the specific training algorithm, and only for extraction.

For example, a virtual class can correspond to a specific invoice vendor. In this case, the classification result is Invoices, but each vendor in that class has its own invoice layout. This results in multiple virtual classes that are all part of the Invoices class in the Transformation Designer.

For specific training, it is necessary to distinguish between the different layouts that documents in the same class may have. This is possible with virtual classes.

When a document is trained, it is classified and internally assigned to a virtual class based on its layout. If the layout of a document during production does not match any existing virtual class, a new virtual class is created for this document. This can result in one class with multiple virtual classes.
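The assign-or-create behavior described above can be illustrated with a minimal sketch. This is not the product's implementation: the VirtualClassRegistry class, the use of string layout features, the Jaccard similarity measure, and the 0.6 threshold are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualClassRegistry:
    """Hypothetical store of layout signatures, one per virtual class."""

    threshold: float = 0.6  # assumed similarity cut-off, not a product setting
    signatures: dict[int, frozenset[str]] = field(default_factory=dict)

    @staticmethod
    def similarity(a: frozenset[str], b: frozenset[str]) -> float:
        # Jaccard overlap of layout features; a stand-in for the real layout comparison.
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    def assign(self, features: set[str]) -> int:
        """Return the ID of the best-matching virtual class, creating one if none matches."""
        sig = frozenset(features)
        best_id, best_score = None, 0.0
        for layout_id, known in self.signatures.items():
            score = self.similarity(sig, known)
            if score > best_score:
                best_id, best_score = layout_id, score
        if best_id is not None and best_score >= self.threshold:
            return best_id
        # No sufficiently similar layout is known: create a new virtual class.
        new_id = len(self.signatures) + 1
        self.signatures[new_id] = sig
        return new_id


registry = VirtualClassRegistry()
vendor_a = registry.assign({"logo:top-left", "total:bottom-right", "table:center"})
vendor_a_again = registry.assign(
    {"logo:top-left", "total:bottom-right", "table:center", "stamp:top-right"}
)
vendor_b = registry.assign({"logo:top-center", "total:bottom-left"})
```

In this sketch the two vendor A documents share one virtual class because most of their layout features overlap, while the vendor B document has no overlap with any known layout and therefore receives a new virtual class.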

The virtual class information for each document is available in the Transformation Designer. The Extraction document set includes a column called Layout ID. The value in this column is an integer, and documents in the same virtual class have the same value.

If you have several documents that are not extracting as expected, check whether their Layout ID matches that of other documents with a similar layout. If it does not, consider adding more training documents.
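When many documents are involved, the Layout ID comparison can be done outside the product. The following sketch assumes that you have copied document names and Layout ID values by hand into a list of pairs. The function names and sample file names are hypothetical, and no product export API is implied.

```python
from collections import defaultdict


def group_by_layout_id(rows):
    """Group document names by their Layout ID value."""
    groups = defaultdict(list)
    for name, layout_id in rows:
        groups[layout_id].append(name)
    return dict(groups)


def singletons(groups):
    """Return documents that are alone in their virtual class."""
    return sorted(names[0] for names in groups.values() if len(names) == 1)


# Sample values typed in from the Layout ID column (illustrative only).
rows = [
    ("acme_001.tif", 3),
    ("acme_002.tif", 3),
    ("acme_003.tif", 7),
    ("globex_001.tif", 5),
    ("globex_002.tif", 5),
]
groups = group_by_layout_id(rows)
suspects = singletons(groups)
```

A document that is alone in its virtual class, such as acme_003.tif in this sample, is a candidate for closer inspection if it visually resembles documents that share another Layout ID.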

Templates in specific training

During extraction, a template defines a set of fields for a specific locator. This means that for every trainable locator, at least one template is created during extraction.

When a document is trained, it is assigned to a virtual class. The specific training algorithm then attempts to find a matching template for the document by extracting fields for the current locator.

If a template is found, the document is assigned to that template. If no matching template is found, a new template is created based on the document.
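The find-or-create step for templates can be sketched in the same way. Everything here is hypothetical: the Template class, the find_or_create_template function, the locator and field names, the decision to scope templates by locator and virtual class, and the rule that a template matches when it covers all fields extracted from the document. The product's actual matching criteria are not described in this topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical template: the fields one locator extracts for one virtual class."""

    locator: str
    layout_id: int
    field_names: frozenset[str]


def find_or_create_template(templates, locator, layout_id, extracted_fields):
    """Reuse a template that covers the document's fields, otherwise add a new one.

    Returns the template and a flag that is True when the template was newly created.
    """
    wanted = frozenset(extracted_fields)
    for template in templates:
        same_scope = template.locator == locator and template.layout_id == layout_id
        # Assumed matching rule: every extracted field is defined by the template.
        if same_scope and wanted <= template.field_names:
            return template, False
    created = Template(locator, layout_id, wanted)
    templates.append(created)
    return created, True


templates = []
t1, created1 = find_or_create_template(
    templates, "InvoiceLocator", 1, {"InvoiceNumber", "Total"}
)
t2, created2 = find_or_create_template(templates, "InvoiceLocator", 1, {"Total"})
t3, created3 = find_or_create_template(templates, "InvoiceLocator", 2, {"Total"})
```

In this sketch the second document reuses the first template, while the third document belongs to a different virtual class and therefore causes a new template to be created.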

Templates and virtual classes work together to generate extraction results, and new virtual classes and templates are created on demand in order to improve extraction results.
