Process a document with TableXTract
Use a simple KernelAPI workflow and the Direct TXT output converter module to save the pages to an existing directory in the appropriate format, then run the TableXTract tool. The tool (TableXTract.exe) is available in the CSDK22\Bin subfolder under the installation path.
-
Prepare the input document.
-
Run preprocessing on the document with the usual KernelAPI functions. See the RecAPI documentation for details.
-
Recognize the text. Due to memory and engine limitations, we recommend processing pages one by one, or at most 10 pages at once.
-
Run the Direct TXT output converter module with the DTXT_XMLIMG setting, which writes the pages in the format that TableXTract requires. For the requirements and limitations of the Direct TXT output converter module and the DTXT_XMLIMG setting, see Direct TXT Output Converter Module.
-
Run TableXTract.exe to process all pages in an input folder. Due to
memory limitations, we recommend to keep the number of pages low. In practice, do not
process more than 25 pages at once. Specify the input folder, the output XML file, and
logging options in the parameters, following the template below.
TableXtract [-l loglevel] [-f logfilename] input_foldername output_xml_filename
-
loglevel: Use the following values to specify what to store in the log:
-
0: No logging
-
1: Full logging with all events
-
2: Logging errors only (default)
-
logfilename: The logfile name must specify a writable file. If no logfile name is specified, the log appears on the console.
-
input_foldername: The path for the input folder containing the pages prepared with the DTXT_XMLIMG setting, as detailed above. This parameter is mandatory.
-
output_xml_filename: The filename for the XML output presenting the extracted data and layout information. This parameter is mandatory.
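As an illustration of the template above, the following Python sketch assembles the argument list in the documented order: optional -l and -f switches first, then the two mandatory parameters. The function name build_tablextract_command and its keyword arguments are hypothetical helpers, not part of the tool, and the executable name is assumed to be reachable on the PATH unless a full path is passed in.

```python
from typing import List, Optional

# Log levels listed in the parameter description:
# 0 = no logging, 1 = full logging, 2 = errors only (the tool's default).
VALID_LOG_LEVELS = {0, 1, 2}


def build_tablextract_command(
    input_folder: str,
    output_xml: str,
    log_level: Optional[int] = None,
    log_file: Optional[str] = None,
    exe: str = "TableXtract",  # assumption: on PATH; otherwise pass the full path
) -> List[str]:
    """Return the argument list for one TableXTract run (hypothetical helper)."""
    if log_level is not None and log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"loglevel must be 0, 1 or 2, got {log_level!r}")

    cmd = [exe]
    if log_level is not None:
        cmd += ["-l", str(log_level)]
    if log_file is not None:
        cmd += ["-f", log_file]
    # The two mandatory parameters always come last, in this order.
    cmd += [input_folder, output_xml]
    return cmd
```

Returning a list rather than a single string avoids quoting problems when folder or file names contain spaces.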
The command returns one of the following values:
-
0: No error
-
1: Command line error
-
2: Program error
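A calling script can check these return values after each run. The sketch below is one possible way to do that in Python with the standard subprocess module. The names describe_return_code and run_tablextract are hypothetical, and treating any value other than 0, 1 or 2 as "unexpected" is an assumption of this example rather than documented behavior.

```python
import subprocess
from typing import Sequence

# Return values as listed above.
RETURN_CODES = {
    0: "No error",
    1: "Command line error",
    2: "Program error",
}


def describe_return_code(code: int) -> str:
    """Translate a TableXTract return value into a readable message."""
    return RETURN_CODES.get(code, f"Unexpected return value {code}")


def run_tablextract(cmd: Sequence[str]) -> None:
    """Run the prepared command and raise if the tool reports an error."""
    completed = subprocess.run(list(cmd), check=False)
    if completed.returncode != 0:
        raise RuntimeError(
            "TableXTract failed: " + describe_return_code(completed.returncode)
        )
```

Here cmd is the full argument list, starting with the path to the executable and ending with the input folder and the output XML filename.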
The tool processes the pages found in the input folder and presents the content, layout and structure of the pages in XML. For details on this XML format, see TableXTract output XML format.
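For longer documents, the recommendation of at most 25 pages per TableXTract run implies splitting the pages into several input folders and running the tool once per folder. The following Python sketch shows only the grouping step. The helper batch_pages is hypothetical, and how each batch is written to its own input folder depends on the files that the DTXT_XMLIMG conversion produces, which this example does not assume.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Upper bound recommended above for a single TableXTract run.
MAX_PAGES_PER_RUN = 25


def batch_pages(pages: Sequence[T], size: int = MAX_PAGES_PER_RUN) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` pages, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield list(pages[start:start + size])


# Example: a 60-page document is split into three runs.
batches = list(batch_pages(list(range(1, 61))))
print([len(b) for b in batches])  # [25, 25, 10]
```

The same helper can be called with size=10 to follow the smaller limit recommended for the recognition step.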