Set up classification

In document capture, classification is the assignment of a document to a category or a class, also known as a document type.

This category is predefined based on your project class hierarchy. Without classification, successful extraction or archiving is impossible.

If a document type that is part of an extraction group is renamed, deleted, or moved from one parent to another elsewhere in Advanced Studio, you must update the project in the Transformation Designer. These updates ensure that the changes to the document type are propagated to the classification and extraction settings in the Transformation Designer.

A document can be automatically classified based on physical layout, content, or generative AI, and the order in which classification is processed determines the final classification result. You can use a combination of pre-production trained documents and classification instructions, or you can use classification online learning, which collects training documents for use in classification while a project is in production. Online learning ensures that any new documents or classes are absorbed by the project easily, without extensive configuration. Generative AI classification does not collect training documents, because the LLM classifies a document based on the description provided for a class.
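The following Python sketch illustrates one way that processing order can affect the final result. It is not the product's API or its documented algorithm: the "first classifier that returns a class wins" rule, the function names, and the toy matching rules are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# A classifier takes document text and returns a class name, or None if undecided.
Classifier = Callable[[str], Optional[str]]


def by_layout(document: str) -> Optional[str]:
    """Hypothetical stand-in for a layout classifier."""
    return "Invoice" if document.startswith("INVOICE") else None


def by_content(document: str) -> Optional[str]:
    """Hypothetical stand-in for a content classifier."""
    return "Order" if "purchase order" in document.lower() else None


def classify_in_order(document: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Assumed rule: the first classifier that returns a class decides the result."""
    for classifier in classifiers:
        result = classifier(document)
        if result is not None:
            return result
    return None


document = "INVOICE for purchase order 42"
layout_first = classify_in_order(document, [by_layout, by_content])
content_first = classify_in_order(document, [by_content, by_layout])
```

Under this assumed rule, the same document receives a different class depending on which classifier runs first, which is why the processing order is worth reviewing when classification results are unexpected.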

To aid in layout and content classification, first perform clustering on a set of documents, and then add the pre-classified documents to your classification training document set so classification can learn by example. You assign sample documents for each class. When the project is trained, the sample documents are analyzed, and important features are extracted and used to define the class. Whether your documents are used for layout or content classification depends on how each class is configured. You do not need training documents at runtime, because the project contains all of the extracted information required for classification.

Before testing your classification settings for a class or project, train your project. After training your project, the documents in your classification training document set are used as a comparison for the documents you are processing. For a document to be successfully classified, the document needs a confidence greater than or equal to the configured classification thresholds.
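As a simplified illustration of the threshold rule, the following Python sketch accepts the best-scoring class only when its confidence is greater than or equal to a threshold. The function name, the single threshold, and the confidence values are hypothetical. The product may apply several thresholds or additional checks, so treat this as a sketch of the comparison only.

```python
from typing import Dict, Optional


def pick_class(confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring class if its confidence meets the threshold.

    Hypothetical helper: returns None when no class reaches the threshold,
    which corresponds to a document that is not classified.
    """
    if not confidences:
        return None
    best_class, best_confidence = max(confidences.items(), key=lambda item: item[1])
    # The comparison is inclusive: a confidence equal to the threshold passes.
    return best_class if best_confidence >= threshold else None


accepted = pick_class({"Invoice": 0.80, "Order": 0.30}, threshold=0.80)
rejected = pick_class({"Invoice": 0.79, "Order": 0.30}, threshold=0.80)
```

In this sketch, `accepted` is the Invoice class because its confidence equals the threshold, while `rejected` is `None` because no class reaches it.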

After changing the properties of your classifiers, or after adding or deleting documents from your training set, you must retrain your project.

Once classification is configured, run some preliminary classification tests. When you are satisfied with the results, you can run more detailed classification benchmarks.

If you define fields at the project level, the extraction results are used to classify a document. For example, you can classify a document by extracting a bar code.
