Train Transact for document classification

Training registers a small set of well-formed HOCR .xml files, organized in a hierarchy such as Batch Class > Document type > Page type, with the Lucene search engine. This process is called learning because it effectively loads the .xml files into Lucene's memory by creating Lucene indexes. At runtime, the HOCR files in a batch instance are compared against these stored indexes to find the best match and classify each page. Document training is a one-time process, and it makes classification fast because no index has to be generated at runtime to classify the documents.
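The idea of indexing the samples once and then matching incoming pages against that index can be illustrated with a toy sketch. This is not Lucene and does not reflect how Lucene or Transact score matches; it uses plain word counts and cosine similarity as a stand-in, and the function names and the sample label `Invoice_First_Page` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def build_index(samples: dict[str, str]) -> dict[str, Counter]:
    """One-time 'learning' step: store word counts per page-type label."""
    return {label: Counter(tokenize(text)) for label, text in samples.items()}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(index: dict[str, Counter], page_text: str) -> str:
    """Runtime step: return the label whose stored counts best match the page."""
    query = Counter(tokenize(page_text))
    return max(index, key=lambda label: cosine(index[label], query))


# Hypothetical sample text standing in for the content of HOCR files.
index = build_index(
    {
        "Application-Checklist_First_Page": "application checklist applicant name date",
        "Invoice_First_Page": "invoice number total amount due",
    }
)
best_match = classify(index, "Invoice total amount")
```

The point of the sketch is the split between the two phases: `build_index` runs once during training, while `classify` only reads the stored index.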

Train document classification

You can train document classification using single-page or multipage .tif, .tiff, or .pdf documents.

For example, with a document type named Application-Checklist: if the uploaded document has a single page, that page is copied to Application-Checklist_First_Page. If the uploaded document is multipage, the first and last pages are copied to Application-Checklist_First_Page and Application-Checklist_Last_Page, respectively, and all other pages are copied to Application-Checklist_Middle_Page.
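The page-to-folder rule above can be sketched as a small function. This is an illustration of the rule as described here, not Transact's own code; the function name `assign_pages` is hypothetical.

```python
def assign_pages(doc_type: str, page_count: int) -> dict[int, str]:
    """Map each 1-based page number to the sample folder it would be copied to."""
    if page_count < 1:
        raise ValueError("a document needs at least one page")

    first = f"{doc_type}_First_Page"
    middle = f"{doc_type}_Middle_Page"
    last = f"{doc_type}_Last_Page"

    # A single-page document only contributes a first page.
    if page_count == 1:
        return {1: first}

    # Pages strictly between the first and the last are middle pages.
    mapping = {page: middle for page in range(2, page_count)}
    mapping[1] = first
    mapping[page_count] = last
    return mapping
```

For a four-page upload, pages 2 and 3 map to the middle-page folder, and a two-page upload produces no middle pages at all.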

Suppose you have created and saved the Application-Checklist document type in batch class BC1. Saving the document type creates the folders where the .tiff files to be learned can be placed. In this case, the folder is created under the <Ephesoft Transact_install_directory>/SharedFolders/<Batch-Class-Id>/lucene/<classification-method-sample> folder, with the following three subfolders:

  • Application-Checklist_First_Page

  • Application-Checklist_Last_Page

  • Application-Checklist_Middle_Page
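Transact creates these subfolders itself when the document type is saved. The sketch below only mirrors the naming pattern, for example to prepare a comparable layout in a test directory. The function name and the `parent` argument are hypothetical, and the sketch makes no assumption about where the parent folder sits inside the shared folder.

```python
from pathlib import Path

# Suffixes taken from the three subfolder names listed above.
PAGE_TYPE_SUFFIXES = ("First_Page", "Last_Page", "Middle_Page")


def create_page_type_folders(parent: Path, doc_type: str) -> list[Path]:
    """Create <doc_type>_<suffix> subfolders under parent and return their paths."""
    folders = [parent / f"{doc_type}_{suffix}" for suffix in PAGE_TYPE_SUFFIXES]
    for folder in folders:
        # exist_ok makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```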

Follow these steps:

  1. Create the document type.
  2. Select the document that Transact needs to recognize for classification. You can either drag the sample file to the Upload Test Classification File(s) pane, or click the Upload Test Classification File(s) link.

    Transact will process and learn the documents.

Learn files

This feature learns the documents present in the folders of the document type.

When learning, the following actions occur:

  1. HOCR files are created in the folder <Ephesoft Transact-install-dir>/SharedFolders/<Batch-Class-Id>/lucene-search-classification-sample for Lucene learning.
  2. Thumbnails are created in the folder <Ephesoft Transact-install-dir>/SharedFolders/<Batch-Class-Id>/image-classification-sample for image classification.
  3. Indexes are created in the folder <Ephesoft Transact-install-dir>/SharedFolders/<Batch-Class-Id>/learn-index for index learning.
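To locate the three outputs for a given batch class, the folder layout listed above can be expressed as a small lookup. This is a minimal sketch: the function name, the dictionary keys, and the example install path are hypothetical, and the folder names are taken from the list above rather than from a verified installation.

```python
from pathlib import PurePosixPath

# Folder names as given in the list above, keyed by the kind of output.
OUTPUT_FOLDERS = {
    "hocr": "lucene-search-classification-sample",
    "thumbnails": "image-classification-sample",
    "indexes": "learn-index",
}


def learning_output_dirs(install_dir: str, batch_class_id: str) -> dict[str, PurePosixPath]:
    """Return the expected output folder for each kind of learning artifact."""
    base = PurePosixPath(install_dir) / "SharedFolders" / batch_class_id
    return {kind: base / name for kind, name in OUTPUT_FOLDERS.items()}


# "/opt/transact" is a placeholder install directory, not a documented default.
dirs = learning_output_dirs("/opt/transact", "BC1")
```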

View learn files

You can view the learned files of a document type in the UI.

Select any document type and click the View Learn File(s) button.

You can use the keyboard to navigate the learned file results for different document types.

The Result page has the following columns:

| Column name          | Type of value | Value options       | Description                                                                                         |
|----------------------|---------------|---------------------|-----------------------------------------------------------------------------------------------------|
| File Name            | String        | NA                  | The name of the uploaded file.                                                                      |
| Page Type            | String        | FIRST, LAST, MIDDLE | Whether the file was learned as a first, last, or middle page.                                      |
| Image Classification | Boolean       | True, False         | Whether a thumbnail was created. True if the thumbnail was created; false if it was not.            |
| Lucene Learning      | Boolean       | True, False         | Whether an HOCR file was created. True if the HOCR file was created; false if it was not.           |
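If the result listing needs to be processed outside the UI, for example after copying it into a script, one row could be modeled as follows. This is a sketch based only on the columns described above. The class, field, and function names are hypothetical, and Transact is not known to expose the listing in this form.

```python
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    FIRST = "FIRST"
    LAST = "LAST"
    MIDDLE = "MIDDLE"


@dataclass(frozen=True)
class LearnedFileRow:
    file_name: str
    page_type: PageType
    image_classification: bool  # True if a thumbnail was created
    lucene_learning: bool       # True if an HOCR file was created

    @property
    def fully_learned(self) -> bool:
        """True only when both the thumbnail and the HOCR file exist."""
        return self.image_classification and self.lucene_learning


def parse_bool(value: str) -> bool:
    """Interpret 'True'/'False' text case-insensitively."""
    return value.strip().lower() == "true"


def parse_row(file_name: str, page_type: str, image: str, lucene: str) -> LearnedFileRow:
    """Build a row from the four column values given as text."""
    # PageType(...) raises ValueError for anything other than FIRST, LAST, MIDDLE.
    return LearnedFileRow(
        file_name, PageType(page_type.strip().upper()), parse_bool(image), parse_bool(lucene)
    )
```

Rows where `fully_learned` is false are the ones worth re-checking, since either the thumbnail or the HOCR file was not produced.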