Statistics Viewer
Document classification and extraction is not a deterministic process with constant results, because it deals with varying input. The results therefore depend on the input data and are, by definition, not predetermined.
The quality of the process is defined by how accurately the document class is assigned and items on the document are identified. This quality is measured by two values called Recall and Precision.
Precision is the percentage of correctly classified documents among all classified documents. Recall is the percentage of correctly classified documents among all documents that should have been classified. Incorrectly classified items are called substitutions. Items that are not classified are called rejects. The goal of the system is to tune the process so that it achieves maximum recall with maximum precision.
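The two measures can be illustrated with a minimal sketch. It assumes that every processed document should have been classified, so the recall denominator is the sum of correct results, substitutions, and rejects. The function and parameter names are illustrative and are not part of the product.

```python
def precision_recall(correct, substitutions, rejects):
    """Return (precision, recall) as fractions between 0 and 1.

    Assumption: every document should have been classified, so the
    recall denominator is correct + substitutions + rejects.
    """
    classified = correct + substitutions   # documents that received a class
    expected = classified + rejects        # documents that should have one
    precision = correct / classified if classified else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Hypothetical run: 90 correct, 5 substitutions, 5 rejects.
precision, recall = precision_recall(correct=90, substitutions=5, rejects=5)
print(f"precision={precision:.1%} recall={recall:.1%}")
```

In this example, rejects lower recall but not precision, whereas substitutions lower both.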
In the document identification process, the rules and patterns that are defined and learned in the project are applied to unknown documents. The quality of these rules and patterns determines the quality of the identification results. For example, if additional samples are trained for classification and extraction, the identification quality can be expected to also increase.
The Project Builder enables you to carry out a multitude of tests to evaluate the identification quality when configuring the project. In contrast, the Statistics Viewer provides an overview of the quality of identification for the current production run. This allows the system administrator to monitor the quality and refine definitions and trained patterns, especially in areas where identification quality is below expectations.
The statistical reports rely on three values that are automatically recorded for each field during runtime (for statistical purposes, the classification result is also treated as a field value):
- Recognition time
- Original recognition value
- Value changed via manual correction
By comparing the original value with the final value, it can be determined whether a field was initially correct. If the value was changed, the initial value is presumed to have been wrong.
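The comparison can be sketched as follows. The record layout and all names here are hypothetical and only mirror the three recorded values. They do not reflect the actual XDocument structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldRecord:
    """Hypothetical per-field record holding the three recorded values."""
    name: str
    recognition_time_ms: float
    original_value: str
    corrected_value: Optional[str] = None  # None: no manual correction

    @property
    def final_value(self) -> str:
        # The final value is the manual correction, if there was one.
        if self.corrected_value is not None:
            return self.corrected_value
        return self.original_value

    @property
    def initially_correct(self) -> bool:
        # A changed value means the original is presumed wrong.
        return self.final_value == self.original_value


untouched = FieldRecord("InvoiceNumber", 120.0, "INV-1001")
corrected = FieldRecord("Total", 95.0, "1O0.00", corrected_value="100.00")
print(untouched.initially_correct, corrected.initially_correct)
```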
During runtime, statistical data is gathered for each document and stored in the XDocument. To collect and use these statistics, the Tungsten Capture Export must be added to the batch class list of queues, and each document class for which statistical data should be gathered must be assigned the export connector Tungsten Transformation Toolkit - Transformation Statistics. The settings for this export connector include the path where the database is stored, the group value, and the cleanup interval. For further details, see Setting Up a Batch Class in Tungsten Capture – Using the Export Connector.
This export connector collects the statistics from each document. Initially, the values are stored individually for each field to allow maximum detail in the reports. Later, the values are condensed to daily or monthly values. This keeps the database smaller while still allowing an overview of historical recognition performance.
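The condensing step can be pictured with a rough sketch that rolls per-field entries up into one summary per day and field. The tuple layout and the summary keys are assumptions made for illustration. The actual database schema and condensing logic of the export connector are not described here.

```python
from collections import defaultdict
from datetime import datetime


def condense_daily(records):
    """Aggregate per-field entries into one summary per (day, field).

    Each record is a hypothetical tuple:
    (timestamp, field_name, was_correct, recognition_time_ms)
    """
    buckets = defaultdict(lambda: {"fields": 0, "correct": 0, "time_ms": 0.0})
    for timestamp, field_name, was_correct, time_ms in records:
        bucket = buckets[(timestamp.date(), field_name)]
        bucket["fields"] += 1
        bucket["correct"] += 1 if was_correct else 0
        bucket["time_ms"] += time_ms
    return dict(buckets)


sample = [
    (datetime(2024, 5, 1, 9, 0), "InvoiceNumber", True, 120.0),
    (datetime(2024, 5, 1, 14, 30), "InvoiceNumber", False, 80.0),
    (datetime(2024, 5, 2, 10, 15), "InvoiceNumber", True, 100.0),
]
daily = condense_daily(sample)
for (day, field), summary in sorted(daily.items()):
    print(day, field, summary)
```

The three detailed entries collapse into two daily summaries, which still support accuracy and average recognition time per day but no longer identify individual documents.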
This document describes the Tungsten Transformation Toolkit - Statistics Viewer, which provides common statistical reports for monitoring the quality and speed of the recognition process.
Statistics Viewer will be deprecated and will no longer be available in a future version of Tungsten Transformation Toolkit.