Search Classification plugin

This topic describes how to configure and use the Search Classification plugin. The plugin classifies documents in the Page Process module of the workflow using Lucene-based indexing. Classification is how Transact associates a document with a document type.

Configure the Search Classification plugin

Perform the following steps to configure the SEARCH_CLASSIFICATION plugin in the Page Process module. You must have administrator rights to complete these steps.

  1. Launch Transact and navigate to Administrator > Batch Class Management. Enter login credentials when prompted.
  2. Select an existing batch class and click Open, or create a new batch class. You can also copy or import an existing batch class, then modify it to create a new batch class.

    The SEARCH_CLASSIFICATION plugin works independently of the MULTIDIMENSIONAL_CLASSIFICATION_PLUGIN in the Page Process module. Both plugins can be present in the module.

  3. Select the SEARCH_CLASSIFICATION plugin to set up the configuration.

    The Plugin Configuration screen for the SEARCH_CLASSIFICATION plugin appears.

  4. Set the properties.

    The following table lists and defines the configurable properties for the Search Classification plugin.

    | Configurable property | Type of value | Value options | Description |
    |---|---|---|---|
    | Lucene Valid Extensions | List of Values | xml, html | Defines the valid extensions of the input file. The setting is applied when classifying document types for the specified file format. |
    | Lucene Min Term Frequency | Integer | N/A | Sets the frequency below which terms are ignored in the source document. |
    | Lucene Min Document Frequency | Integer | N/A | Sets the minimum number of documents in which a word must occur. Words that occur in fewer documents are ignored. |
    | Lucene Min Word Length | Integer | N/A | Sets the minimum word length. Words in the HOCR content that are shorter than this value are ignored. |
    | Lucene Min Query Terms | Integer | N/A | Sets the minimum number of query terms included in any generated query. |
    | Lucene Top Level Field | String | N/A | Configures the default field for query terms. |
    | Lucene No Of Pages | Integer | N/A | Specifies the number of documents returned by a query search. |
    | Lucene Index Fields | List of Values | title, summary | Specifies the index fields used to search for the document type with Lucene. |
    | Lucene Stop Words | List of Values | title, name | Sets the words to ignore when classifying a document. |
    | Search Classification Switch | List of Values | ON, OFF | Enables or disables the SEARCH_CLASSIFICATION plugin for the batch class. |
    | Search Classification Max Results | Integer | N/A | Defines the maximum number of alternate value results generated in the batch.xml file. The default value is 5 in Ephesoft Transact, which controls the overall size of the batch.xml file. |
    | First Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the first page type. |
    | Middle Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the middle page type. |
    | Last Page Confidence Score Value | Integer | N/A | Used to update the confidence score based on the last page type. |

  5. Click Deploy to save and enable the changes.
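
Several of the properties above (minimum term frequency, minimum document frequency, minimum word length, stop words) resemble the term-selection settings of Lucene's MoreLikeThis query. The following Python sketch illustrates how thresholds of this kind could filter the terms of a page before a query is built. It is an illustration under that assumption, not Transact's implementation: the function name select_query_terms, its parameters, and the sample data are all hypothetical.

```python
import re
from collections import Counter


def select_query_terms(page_text, doc_freq, min_term_freq=1, min_doc_freq=1,
                       min_word_length=3, stop_words=()):
    """Return the terms of page_text that pass every threshold.

    doc_freq maps a term to the number of learned documents containing it.
    All names and default values here are hypothetical.
    """
    stop = {word.lower() for word in stop_words}
    counts = Counter(re.findall(r"[a-z0-9]+", page_text.lower()))
    selected = []
    for term, freq in counts.items():
        if len(term) < min_word_length:
            continue  # shorter than the minimum word length
        if term in stop:
            continue  # listed as a stop word
        if freq < min_term_freq:
            continue  # too rare within this page
        if doc_freq.get(term, 0) < min_doc_freq:
            continue  # occurs in too few learned documents
        selected.append(term)
    return sorted(selected)


# Hypothetical sample data: document frequencies from a learned index.
doc_freq = {"invoice": 4, "total": 3, "amount": 1, "title": 5}
text = "Invoice title: Invoice total amount due. Total of invoice."
terms = select_query_terms(text, doc_freq, min_term_freq=2, min_doc_freq=2,
                           min_word_length=3, stop_words=["title", "name"])
print(terms)
```

Raising any threshold shrinks the set of terms that reach the query, which generally trades recall for precision during classification.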

Search classification execution process

This plugin operates in the Page Process module after all batch-level import processes are complete.

We recommend completing document learning for the batch class before using this plugin. The plugin classifies incoming document images using Lucene-based indexing, and it works in two stages:

  • Learning: Indexes are generated for the learned documents. The plugin uses these indexes, built from the learned files created earlier in the workflow, to classify each document.

  • Classification: The learned data serves as the reference for classification. To classify a document, the plugin takes the HOCR content extracted from the image and checks it against the data from the learning stage.
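
The two stages above can be sketched as follows. This is a simplified Python illustration that scores a page by term overlap with each learned page type. It does not use Lucene and is not Transact's implementation. The function names learn and classify, the page type names, and the sample texts are hypothetical. The max_results parameter mirrors the idea behind the Search Classification Max Results property, which caps the number of alternate results.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(samples):
    """Learning stage: build a term-count index for each page type.

    samples maps a page type name to a list of learned page texts.
    """
    index = {}
    for page_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        index[page_type] = counts
    return index


def classify(index, hocr_text, max_results=5):
    """Classification stage: score the page text against every learned
    page type and return up to max_results (page_type, score) pairs,
    best score first. The score is the percentage of page terms found
    in the learned index for that page type.
    """
    terms = set(tokenize(hocr_text))
    results = []
    for page_type, counts in index.items():
        matched = sum(1 for term in terms if term in counts)
        score = 100.0 * matched / len(terms) if terms else 0.0
        results.append((page_type, round(score, 2)))
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:max_results]


# Hypothetical learned samples for two page types.
samples = {
    "Invoice_First_Page": ["invoice number total amount due"],
    "Letter_First_Page": ["dear sir regards sincerely"],
}
index = learn(samples)
ranked = classify(index, "Invoice total due today", max_results=2)
print(ranked)
```

The highest-scoring entry plays the role of the selected page type, and the remaining entries correspond to alternate values.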

The plugin generates HOCR content similar to the RecoStar HOCR and Tesseract HOCR plugins.

After all images and documents in the batch instance have been classified, this plugin writes the data to the batch.xml file for the document type that is being classified.

Troubleshooting

The following table lists the error messages that may occur with this plugin, along with the possible root cause of each.

| Error message | Possible root cause |
|---|---|
| No index files exist inside folder | The document learning is not complete for the batch class. |
| Page Types not configured in Database. | The index data contains invalid indexes for the batch class. |
| CorruptIndexException while reading Index. | The index data is corrupt in the index folder for the batch class. |
| IOException while reading Index | The plugin is unable to open the index data due to corruption in the get index file process, or there is a lock on the index file. |
| No valid extensions are specified in resources | The page contains an invalid HOCR file for processing. |
| No pages found in batch XML. | The pages tag was not found in the incoming batch.xml file. |
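
When scanning log output, the error messages and root causes listed above can be kept in a small lookup. The Python sketch below is one way to do that. The function name diagnose and the sample log line are hypothetical, and the actual log format of Transact is not assumed beyond the message text appearing somewhere in the line.

```python
# Error messages and possible root causes, copied from the troubleshooting list.
ROOT_CAUSES = {
    "No index files exist inside folder":
        "The document learning is not complete for the batch class.",
    "Page Types not configured in Database.":
        "The index data contains invalid indexes for the batch class.",
    "CorruptIndexException while reading Index.":
        "The index data is corrupt in the index folder for the batch class.",
    "IOException while reading Index":
        "The plugin is unable to open the index data due to corruption in "
        "the get index file process, or there is a lock on the index file.",
    "No valid extensions are specified in resources":
        "The page contains an invalid HOCR file for processing.",
    "No pages found in batch XML.":
        "The pages tag was not found in the incoming batch.xml file.",
}


def diagnose(log_line):
    """Return (message, root cause) pairs whose message occurs in log_line.

    Matching ignores case and any trailing period on the message.
    """
    line = log_line.lower()
    return [
        (message, cause)
        for message, cause in ROOT_CAUSES.items()
        if message.rstrip(".").lower() in line
    ]


# Hypothetical log line used only to demonstrate the lookup.
sample = "ERROR SEARCH_CLASSIFICATION: CorruptIndexException while reading Index."
matches = diagnose(sample)
for message, cause in matches:
    print(message, "->", cause)
```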