Direct TXT Output Converter Module
This module lets you use the output of the recognition module as is, converting recognized text without reading order or paragraph detection. DirectTXT outputs are therefore simpler and faster to produce than the Layout Retention Output conversions available in RecAPIPlus, because they skip these slow detection processes. The following DirectTXT output types are available:
-
DirectTXT Text: A simple text file.
-
DirectTXT CSV: A comma-separated text file, a simple format to represent tables. Microsoft Excel can read this format.
-
DirectTXT Formatted Text: This converter delivers plain text, but attempts to keep the page layout as detected in the original image. It creates a text file that simulates columns and boxes using tab characters.
-
DirectTXT PDF (deprecated): It contains the whole image of the original page and the text behind the image on a separate layer (image on text PDF). These PDF files suit the purpose of page archiving, because they contain both the original image and the recognized text. This format is deprecated; use the new image on text PDF formats instead.
-
DirectTXT XML: Typically used for further processing of recognized data. You can easily parse the output XML file, for example with MSXML, or transform it with XSLT. The format of the XML output is specified by the same schema as the Layout Retention XML Output; see http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd.
-
DirectTXT Binary: Used for creating files directly from the recognition data without any character conversion or formatting. It is the most suitable output format for barcodes that contain binary data, for example Code128 or PDF417 barcodes carrying encrypted data.
-
DirectTXT PDF: It contains the whole image of the original page and the text behind the image on a separate layer. Image on text PDF files suit the purpose of page archiving, because they contain both the original image and recognized text. The format supports adjusting the image quality.
-
DirectTXT MRC PDF: Image on text PDF with MRC (Mixed Raster Content) technology.
-
DirectTXT ALTO XML: ALTO is a standard XML format for preserving layout and content information with the OCR output.
-
DirectTXT hOCR XHTML: The hOCR format preserves the style, layout, and recognition confidence information of pages resulting from OCR.
-
DTXT_XMLIMG: Use this output type to prepare pages for the TableXTract tool in the specified folder. This setting outputs each page as an XML file and a TIFF file. See TableXTract table recognition and data extraction tool for details. Observe the following requirements and limitations:
-
Unlike other output types, DTXT_XMLIMG requires an output folder rather than a file.
-
The output folder must be an existing directory.
-
The number of pages in the output folder is limited to 128.
-
When you specify the name of an existing file, the TXT type outputs are appended to it.
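The DTXT_XMLIMG folder requirements listed above can be checked before a conversion is started. The following Python sketch is not part of the SDK: the function name check_xmlimg_folder is hypothetical, and counting one .xml file per page already in the folder is an assumption about how the output is laid out.

```python
from pathlib import Path

MAX_PAGES = 128  # page limit stated above for a DTXT_XMLIMG output folder


def check_xmlimg_folder(folder, new_pages):
    """Return a list of problems; an empty list means the folder looks usable.

    Hypothetical helper, not an SDK call.
    """
    path = Path(folder)
    if not path.is_dir():
        # Covers both a missing path and a path that points to a file.
        return [f"{path} is not an existing directory"]
    # Assumption: each page already in the folder has exactly one .xml file.
    existing = sum(1 for p in path.iterdir() if p.suffix.lower() == ".xml")
    if existing + new_pages > MAX_PAGES:
        return [f"{existing} existing + {new_pages} new pages exceeds {MAX_PAGES}"]
    return []
```

Calling check_xmlimg_folder(folder, page_count) before the conversion lets an application report a missing folder or an exceeded page limit up front, rather than relying on the converter's error.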
The DTXT module can be especially useful for applications that do not require formatting but for which speed is an important factor, for example indexing, archiving, or some form processing applications. When programming with KernelAPI, this is your only output choice. For technical details, refer to the Direct TXT Output Converter Module section in the API documentation.
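For post-processing of the kind mentioned above, a DirectTXT CSV output can be read with any standard CSV parser. The sketch below uses Python's csv module on an inline sample; the sample content, the comma delimiter, and the quoting of cells that contain commas are assumptions, since the exact CSV dialect the converter writes is not described here.

```python
import csv
import io


def read_csv_rows(text):
    """Parse comma-separated text into a list of rows (lists of cell strings)."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical sample standing in for the content of a DirectTXT CSV file.
sample = 'Item,Qty\n"Paper, A4",2\n'
rows = read_csv_rows(sample)
for row in rows:
    print(row)
```

For a real output file, the text would be read from disk first, with an encoding that matches the converter's settings.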
It is possible to purchase distribution licenses that exclude formatted output; see Prepare distribution file set.