Benchmark Document Sets
A benchmark set is a group of documents that can be used for separation, classification, and extraction benchmarks. These documents are usually a subset of the documents used for testing your project and do not include poorly scanned or hard-to-read documents. A benchmark set cannot be added, but you can convert a test set into a benchmark set. A benchmark set differs from a test set in that benchmark documents can have class assignments, and it is this class assignment that makes the documents suitable for separation, classification, and extraction benchmarks.
For classification benchmarks, your document benchmark set needs the following:
- Recognition results if you are using content classification
- An assigned class
For separation benchmarks, your document benchmark set needs the following:
- Recognition results if you are using content classification
- An assigned class
- When in the Hierarchy View, no subfolders can exist under the Root Folder
For extraction benchmarks, your document benchmark set needs the following:
- Recognition results
- An assigned class
- Extraction results
- Validated extraction results
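The three requirement lists above can be summarized as a small checklist. The following Python sketch is only an illustration and is not part of the product: the `BenchmarkDoc` record, its field names, and the `missing_requirements` function are all hypothetical, and the flags are assumed to be filled in by hand or by your own tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkDoc:
    """Hypothetical record describing one document in a benchmark set."""
    name: str
    assigned_class: Optional[str] = None
    has_recognition_results: bool = False
    has_extraction_results: bool = False
    has_validated_extraction: bool = False


def missing_requirements(
    doc: BenchmarkDoc,
    benchmark: str,
    uses_content_classification: bool = True,
    root_has_subfolders: bool = False,
) -> List[str]:
    """Return the unmet requirements for one benchmark type (hypothetical helper)."""
    if benchmark not in ("classification", "separation", "extraction"):
        raise ValueError(f"unknown benchmark type: {benchmark}")

    missing = []

    # Every benchmark type needs an assigned class.
    if doc.assigned_class is None:
        missing.append("assigned class")

    # Extraction always needs recognition results; classification and
    # separation need them only when content classification is used.
    needs_recognition = benchmark == "extraction" or uses_content_classification
    if needs_recognition and not doc.has_recognition_results:
        missing.append("recognition results")

    # Separation additionally requires a flat Root Folder.
    if benchmark == "separation" and root_has_subfolders:
        missing.append("no subfolders under the Root Folder")

    # Extraction additionally requires raw and validated extraction results.
    if benchmark == "extraction":
        if not doc.has_extraction_results:
            missing.append("extraction results")
        if not doc.has_validated_extraction:
            missing.append("validated extraction results")

    return missing
```

For example, a document with an assigned class and recognition results passes the classification check, while the extraction check reports that extraction results and validated extraction results are still missing.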
A processed benchmark set is often referred to as a set of golden files.
The following information is important to know when selecting the golden files that are added to your benchmark set:
- Separation benchmarks do not support PDF documents, so do not include PDFs in your benchmark set if you are testing separation.
- Documents with multiple pages need to be combined into a single image file. This simulates separation in production.
- Any rotated documents should be correctly aligned.
- Select typical documents rather than obscure examples for your project classes.
- Select clean documents rather than those with blotches or dark areas that could interfere with recognition results.
- All documents used need to belong to one of the project classes.
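Two of the rules above (no PDFs when testing separation, and every document belonging to a project class) are mechanical enough to pre-screen candidates before adding them to a test set. The sketch below is a hypothetical helper, not a product feature: it assumes that candidates are available as `(file_path, assigned_class)` pairs and that a PDF can be recognized by its file extension. The remaining rules, such as alignment, typicality, and scan quality, still need a visual review.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def select_golden_files(
    candidates: Iterable[Tuple[str, str]],
    project_classes: Set[str],
    testing_separation: bool = False,
) -> List[str]:
    """Filter candidate files by the mechanical selection rules (hypothetical helper)."""
    selected = []
    for file_path, assigned_class in candidates:
        # Every document must belong to one of the project classes.
        if assigned_class not in project_classes:
            continue
        # Separation benchmarks do not support PDF documents.
        if testing_separation and Path(file_path).suffix.lower() == ".pdf":
            continue
        selected.append(file_path)
    return selected
```

Calling `select_golden_files(candidates, {"Invoice"}, testing_separation=True)` keeps only the non-PDF files whose class is `Invoice`.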