Search Classification plugin
This topic describes how to configure and use the Search Classification plugin. The plugin classifies documents in the Page Process module of the workflow using Lucene-based indexing. Classification is how Transact associates a document with a Document Type.
Configure the Search Classification plugin
Perform the following steps to configure the SEARCH_CLASSIFICATION plugin in the Page Process module. You must have administrator rights to complete these steps.
- Launch Transact and navigate to . Enter login credentials when prompted.
- Select an existing batch class and click Open, or create a new batch class. You can also copy or import an existing batch class, then modify it to create a new batch class.
The SEARCH_CLASSIFICATION plugin works independently of the MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN in the Page Process module. Both plugins can be present in the module.
- Select the SEARCH_CLASSIFICATION plugin to set up the configuration. The Plugin Configuration screen for the SEARCH_CLASSIFICATION plugin displays.
- Set the properties.
The following table lists and defines the configurable properties for the Search Classification plugin.
Configurable property | Type of value | Value options | Description |
---|---|---|---|
Lucene Valid Extensions | List of Values | xml, html | Defines the valid extensions of the input file. It is applied when classifying document types for the specified file format. |
Lucene Min Term Frequency | Integer | N/A | Sets the frequency below which terms are ignored in the source document. |
Lucene Min Document Frequency | Integer | N/A | Sets the minimum number of documents in which a word must occur. Words that occur in fewer documents are ignored. |
Lucene Min Word Length | Integer | N/A | Sets the minimum word length. Shorter words in the HOCR content are ignored. |
Lucene Min Query Terms | Integer | N/A | Sets the minimum number of query terms that will be included in any generated query. |
Lucene Top Level Field | String | N/A | Configures the default field for query terms. |
Lucene No Of Pages | Integer | N/A | Specifies the number of documents to be returned in a query search. |
Lucene Index Fields | List of Values | title, summary | Specifies the index fields used when searching for the document type with Lucene. |
Lucene Stop Words | List of Values | title, name | Sets the words to be ignored when classifying a document. |
Search Classification Switch | List of Values | ON, OFF | Enables or disables the SEARCH_CLASSIFICATION plugin for the batch class. |
Search Classification Max Results | Integer | N/A | Defines the maximum number of alternate value results that are generated in the batch.xml. The default value is 5 in Ephesoft Transact, to control the overall size of the batch.xml file. |
First Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the first page type. |
Middle Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the middle page type. |
Last Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the last page type. |
- Click Deploy to save and enable the changes.
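To make the filtering properties in the table above more concrete, the following is a minimal Python sketch of how thresholds such as Lucene Min Term Frequency, Lucene Min Document Frequency, Lucene Min Word Length, and Lucene Stop Words could interact when query terms are selected. It is an illustration only, not the plugin's or Lucene's actual implementation. The function name `select_query_terms`, its parameters, and the sample words are all hypothetical.

```python
from collections import Counter


def select_query_terms(source_words, learned_docs, *, min_term_freq=1,
                       min_doc_freq=1, min_word_length=2, stop_words=()):
    """Return the source words that pass all four filters (illustrative only).

    source_words: words taken from the page being classified.
    learned_docs: list of word lists, one per learned sample document.
    """
    stop = {w.lower() for w in stop_words}
    # How often each word occurs in the source page.
    term_freq = Counter(w.lower() for w in source_words)
    # In how many learned documents each word occurs at least once.
    doc_freq = Counter()
    for doc in learned_docs:
        doc_freq.update({w.lower() for w in doc})

    kept = []
    for word, freq in term_freq.items():
        if word in stop:                      # stop words are always ignored
            continue
        if len(word) < min_word_length:       # too short
            continue
        if freq < min_term_freq:              # too rare in the source page
            continue
        if doc_freq[word] < min_doc_freq:     # too rare across learned documents
            continue
        kept.append(word)
    return sorted(kept)


source = "invoice total invoice due to name".split()
learned = [["invoice", "total"], ["invoice", "due", "name"], ["receipt", "total"]]
print(select_query_terms(source, learned, min_doc_freq=2,
                         min_word_length=3, stop_words=["name"]))
```

In this sample, "due" is dropped because it occurs in only one learned document, "to" is shorter than the minimum word length, and "name" is a stop word. Raising `min_term_freq` to 2 would leave only "invoice".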
Search classification execution process
This plugin operates in the Page Process module after all batch-level import processes are complete.
We recommend completing document learning for the batch class before using this plugin. The plugin classifies incoming document images using Lucene-based indexing, and it works in two stages:
- Learning: The learning process generates indexes for the sample documents. The plugin uses these indexes, stored in the learned files created earlier in the workflow, to classify each document.
- Classification: The plugin takes the HOCR content extracted from the image and compares it against the data from the learning stage, which serves as the reference for assigning a document type.
The plugin generates HOCR content similar to the RecoStar HOCR and Tesseract HOCR plugins.
After all images and documents in the batch instance have been classified, this plugin writes the data to the batch.xml file for the document type that is being classified.
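The two stages described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: it scores by simple word overlap rather than Lucene's scoring, and the names `learn` and `classify`, the `max_results` parameter (loosely modeled on Search Classification Max Results), and the sample text are hypothetical.

```python
from collections import Counter, defaultdict


def learn(samples):
    """Learning stage: build a per-document-type term index from labelled text."""
    index = defaultdict(Counter)
    for doc_type, text in samples:
        index[doc_type].update(text.lower().split())
    return dict(index)


def classify(index, page_text, max_results=5):
    """Classification stage: rank document types by word overlap with the page.

    Returns at most max_results (doc_type, confidence) pairs, best first.
    The confidence is the percentage of distinct page words found in the index.
    """
    words = set(page_text.lower().split())
    scores = []
    for doc_type, terms in index.items():
        overlap = sum(1 for w in words if w in terms)
        confidence = 100.0 * overlap / len(words) if words else 0.0
        scores.append((doc_type, round(confidence, 2)))
    # Highest confidence first; ties broken alphabetically for stable output.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:max_results]


index = learn([
    ("Invoice", "invoice number total amount due"),
    ("Receipt", "receipt paid total thank you"),
])
print(classify(index, "Invoice total due"))
```

Capping the list with `max_results` mirrors the reason the document gives for limiting alternate values: fewer alternates keep the output written to batch.xml small.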
Troubleshooting
The following table lists the error messages that may occur with this plugin, along with the possible root cause of each.
Error message | Possible root cause |
---|---|
No index files exist inside folder | The document learning is not complete for the batch class. |
Page Types not configured in Database. | The index data contains invalid indexes for the batch class. |
CorruptIndexException while reading Index. | The index data is corrupt in the index folder for the batch class. |
IOException while reading Index | The plugin is unable to open the index data due to corruption in the get index file process, or there is a lock on the index file. |
No valid extensions are specified in resources | The page contains an invalid HOCR file for processing. |
No pages found in batch XML. | The pages tag was not found in the incoming batch.xml file. |
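Two of the conditions in the table can be checked before a batch runs. The sketch below is a hypothetical pre-flight helper, not part of Transact: the function name `diagnose` is made up, and the element name `Pages` is an assumption about the batch.xml layout that should be verified against a real file. It takes the list of file names found in the index folder and the batch.xml content as text, so it makes no assumption about folder locations.

```python
import xml.etree.ElementTree as ET


def diagnose(index_files, batch_xml_text, pages_tag="Pages"):
    """Return the matching error messages from the table, or an empty list.

    index_files: file names found in the batch class index folder.
    batch_xml_text: content of the incoming batch.xml as a string.
    pages_tag: ASSUMED element name for the pages container.
    """
    problems = []
    if not index_files:
        # Learning has not produced any index files yet.
        problems.append("No index files exist inside folder")

    root = ET.fromstring(batch_xml_text)
    # Compare local names so a namespaced tag such as {ns}Pages still matches.
    has_pages = any(el.tag.split("}")[-1] == pages_tag for el in root.iter())
    if not has_pages:
        problems.append("No pages found in batch XML.")
    return problems


print(diagnose([], "<Batch><Documents/></Batch>"))
print(diagnose(["segments_1"], "<Batch><Pages><Page/></Pages></Batch>"))
```

In practice, `index_files` could come from `os.listdir` on the index folder and `batch_xml_text` from reading the batch.xml file. Both paths depend on the installation and are left out here.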