Simple output converters.

Enumerations
enum	DTXTOUTPUTFORMATS { DTXT_TXTS , DTXT_TXTCSV , DTXT_TXTF , DTXT_PDFIOT , DTXT_XMLCOORD , DTXT_BINARY , DTXT_IOTPDF , DTXT_IOTPDF_MRC , DTXT_ALTO , DTXT_HOCR , DTXT_XMLIMG , DTXT_FORMPROC }
	DTXT output formats.

Functions
RECERR RECAPIKRN	kRecSetDTXTFormat (int sid, DTXTOUTPUTFORMATS dFormat)
	Changing DTXT format.

RECERR RECAPIKRN	kRecGetDTXTFormat (int sid, DTXTOUTPUTFORMATS *pdFormat)
	Getting DTXT format.

RECERR RECAPIKRN	kRecConvert2DTXT (int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename)
	Converting pages with DTXT.

RECERR RECAPIKRN	kRecConvert2DTXTEx (int sid, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg, LPCTSTR pFilename)
	Converting pages with DTXT.

RECERR RECAPIKRN	kRecMakePagesSearchable (int sid, LPCTSTR pFilename, int fromPage, const HPAGE *ahPage, int nPage, IMAGEINDEX iiImg)
	Making a PDF page searchable.

Detailed Description

Simple output converters.

This module converts recognized text simply and quickly: it uses the output of the recognition module as is, without reading-order and paragraph detection. The DirectTXT outputs are therefore simpler than the Layout Retention Output conversions (available in RecAPIPlus) and also faster to produce, because they do not include slow detection processes.

There are different functions for starting the DTXT conversion. The older function kRecConvert2DTXT can create all the possible DTXT formats. Its newer successor is kRecConvert2DTXTEx, which has an additional IMAGEINDEX parameter for controlling the orientation of the pages when creating a PDF file. The newer function can also create all DTXT formats.

For existing PDF files a special conversion method can be applied. The function kRecMakePagesSearchable inserts invisible text (coming from a recognition step) into the PDF file (in-place), i.e. it makes the file searchable.

You can control DirectTXT output behavior through various settings. The root of the DirectTXT settings is Kernel.DTxt. These settings can be queried and modified through the Settings Manager Module. The DirectTXT output type is selected by calling kRecSetDTXTFormat; the available types are described below.

The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.

The DirectTXT Text (DTXT_TXTS) output is a simple text file. The settings used by this converter are as follows:

Shared settings
- Kernel.DTxt.UnicodeFileHeader
- Kernel.DTxt.IntelByteOrder
- Kernel.DTxt.PageBreak
- Kernel.DTxt.txt.LineBreak
Exclusive settings
- Kernel.DTxt.txt.IgnoreSpaceAtEOL
- Kernel.DTxt.txt.CellLineBreak
- Kernel.DTxt.txt.BeginCell
- Kernel.DTxt.txt.EndCell
- Kernel.DTxt.txt.CellSeparator
- Kernel.DTxt.txt.ZoneSeparator

The DirectTXT CSV (DTXT_TXTCSV) output is a simple format to represent tables. Microsoft Excel can read this format. The settings used by this converter are as follows:

Shared settings
- Kernel.DTxt.UnicodeFileHeader
- Kernel.DTxt.IntelByteOrder
- Kernel.DTxt.PageBreak
Exclusive settings
- Kernel.DTxt.csv.EndOfRecord
- Kernel.DTxt.csv.BeginField
- Kernel.DTxt.csv.EndField
- Kernel.DTxt.csv.FieldSeparator
- Kernel.DTxt.csv.EndOfLineAsFieldSeparator
- Kernel.DTxt.csv.EndOfCellLineAsFieldSeparator
- Kernel.DTxt.csv.RecordSeparator

When you want to process forms, you can collect the data of each page into one row, e.g. with Kernel.DTxt.PageBreak = "" and Kernel.DTxt.csv.RecordSeparator = 2.

The DirectTXT Formatted Text (DTXT_TXTF) delivers plain text, but attempts to keep layout as detected in the original image: this creates a text file that simulates columns and boxes using tabulators. The settings used by this converter are as follows:

Shared settings
- Kernel.DTxt.UnicodeFileHeader
- Kernel.DTxt.IntelByteOrder
- Kernel.DTxt.PageBreak
- Kernel.DTxt.txt.LineBreak

The newer DirectTXT PDF formats (DTXT_IOTPDF and DTXT_IOTPDF_MRC) (supported on: Windows, Linux, Embedded Linux, MacOS) contain the whole image of the original page and the text behind the image on a separate layer. These PDF files are especially suitable for page archiving, because they contain both the image and the searchable recognized text. The quality of the generated PDF file can be adjusted; for details see the section about newer image formats.

Exclusive settings
- Kernel.DTxt.PDF.CompressContentStream
- Kernel.DTxt.PDF.Linearized
- Kernel.DTxt.PDF.SplitMaxPages
- Kernel.DTxt.PDF.SplitMaxSize
- Kernel.IMF.PDF.Compatibility

Enumeration Type Documentation

◆ DTXTOUTPUTFORMATS

enum DTXTOUTPUTFORMATS

DTXT output formats.

The following output formats can be created by the Direct TXT output converter. All of them except DTXT_ALTO, DTXT_HOCR and DTXT_BINARY are appendable. Some of the selectable output formats can be adjusted through settings. See Settings of the Direct TXT Module.

Enumerator
DTXT_TXTS	Text Standard.
DTXT_TXTCSV	Text CSV.
DTXT_TXTF	Text Formatted.
DTXT_PDFIOT	Deprecated (see usage of new formats). PDF Image on Text. Supported on: Windows, Linux, Embedded Linux, MacOS.
DTXT_XMLCOORD	XML Simple.
DTXT_BINARY	Binary output.
DTXT_IOTPDF	Image on Text PDF with changeable compression level. Supported on: Windows, Linux, Embedded Linux, MacOS.
DTXT_IOTPDF_MRC	Image on Text PDF with MRC technology. Supported on: Windows, Linux, Embedded Linux, MacOS.
DTXT_ALTO	ALTO xml.
DTXT_HOCR	hOCR xhtml.
DTXT_XMLIMG	Preparing pages for the `TableXTract` tool. Supported on: Windows.
DTXT_FORMPROC	FormProc. Feed this output to FormProc as input.

Function Documentation

◆ kRecConvert2DTXT()

RECERR RECAPIKRN kRecConvert2DTXT	(	int	sid,
		const HPAGE *	ahPage,
		int	nPage,
		LPCTSTR	pFilename )

Converting pages with DTXT.

This function converts the given pages using the Direct TXT output converter.

Parameters

[in]	sid	Settings Collection ID.
[in]	ahPage	Array of HPAGEs to be converted.
[in]	nPage	Number of HPAGEs in `ahPage`.
[in]	pFilename	Name of the resulting file.

Return values

RECERR

Note:
- You might find kRecConvert2DTXTEx better to use; it is the suggested function to call. This function is equivalent to kRecConvert2DTXTEx(sid, ahPage, nPage, II_CURRENT, pFilename) in the default case when there is no original image.
- HPAGEs may occupy rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors. All the DTXT types except DTXT_ALTO, DTXT_HOCR and DTXT_BINARY are appendable, so it is recommended to append pages to the same file one by one (or a few at a time) instead of using a large array containing all of the HPAGEs.
- If an HPAGE contains DataStream, this function may put the image into the output file without recompression. See the details in the section about DataStream.
- Some of the settings used by this function can be modified by calling kRecSetDTXTFormat. The rest can be changed only through the Settings Manager Module. For more information see the Settings of the Direct TXT Module or the description of the module.
- The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.

The specification of this function in C# is:
RECERR kRecConvert2DTXT(int sid, IntPtr[] ahPage, string pFilename);

// or

RECERR kRecConvert2DTXT(int sid, IntPtr ahPage, string pFilename);

The specification of this function in Java is:
int kRecConvert2DTXT(int sid, HPAGE[] ahPage, String pFilename)

The specification of this function in Python is:
def kRecConvert2DTXT(sid: int, ahPage: "HPAGE", pFilename: str) -> int
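
The page-by-page appending recommended in the notes could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name convert_page_by_page is hypothetical, REC_OK is assumed to be the success code, the selected format is assumed to be appendable, and the stand-in definitions at the top merely mimic the signatures shown on this page so that the sketch compiles without the SDK.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins so the sketch compiles without the SDK. With the real SDK,
 *     delete this part and include KernelApi.h instead. --- */
typedef int RECERR;
#define REC_OK 0                         /* assumed success code */
struct RECPAGESTRUCT { int unused; };
typedef struct RECPAGESTRUCT *HPAGE;
typedef const char *LPCTSTR;

static struct RECPAGESTRUCT g_page;      /* the one fake page handed out */
static int g_live = 0;                   /* pages loaded but not yet freed */
static int g_appended = 0;               /* pages written to the output */

static RECERR kRecLoadImgF(int sid, LPCTSTR f, HPAGE *ph, int n)
{ (void)sid; (void)f; (void)n; *ph = &g_page; g_live++; return REC_OK; }
static RECERR kRecPreprocessImg(int sid, HPAGE h)
{ (void)sid; (void)h; return REC_OK; }
static RECERR kRecRecognize(int sid, HPAGE h, LPCTSTR f)
{ (void)sid; (void)h; (void)f; return REC_OK; }
static RECERR kRecConvert2DTXT(int sid, const HPAGE *a, int n, LPCTSTR f)
{ (void)sid; (void)a; (void)f; g_appended += n; return REC_OK; }
static RECERR kRecFreeImg(HPAGE h)
{ (void)h; g_live--; return REC_OK; }
/* --- End of stand-ins. --- */

/* Hypothetical helper: converts nPages pages of inFile, appending each one
 * to outFile, so that only one HPAGE is held in memory at a time. */
RECERR convert_page_by_page(int sid, LPCTSTR inFile, int nPages, LPCTSTR outFile)
{
    assert(inFile != NULL && outFile != NULL);
    for (int i = 0; i < nPages; i++) {
        HPAGE hPage = NULL;
        RECERR rc = kRecLoadImgF(sid, inFile, &hPage, i);
        if (rc != REC_OK)
            return rc;                   /* nothing was loaded, nothing to free */
        rc = kRecPreprocessImg(sid, hPage);
        if (rc == REC_OK)
            rc = kRecRecognize(sid, hPage, NULL);
        if (rc == REC_OK)
            rc = kRecConvert2DTXT(sid, &hPage, 1, outFile);   /* append one page */
        kRecFreeImg(hPage);              /* release the page even after a failure */
        if (rc != REC_OK)
            return rc;
    }
    return REC_OK;
}
```

The loop frees each page before loading the next one, which trades a file open/close per page for a small, constant memory footprint.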

◆ kRecConvert2DTXTEx()

RECERR RECAPIKRN kRecConvert2DTXTEx	(	int	sid,
		const HPAGE *	ahPage,
		int	nPage,
		IMAGEINDEX	iiImg,
		LPCTSTR	pFilename )

Converting pages with DTXT.

This function converts the given pages using the Direct TXT output converter.

Parameters

[in]	sid	Settings Collection ID.
[in]	ahPage	Array of HPAGEs to be converted.
[in]	nPage	Number of HPAGEs in `ahPage`.
[in]	iiImg	Index of the image to be saved. (II_CURRENT or II_ORIGINAL)
[in]	pFilename	Name of the resulting file.

Return values

RECERR

Note:
- This function is the successor of the kRecConvert2DTXT function.
- HPAGEs may occupy rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors. All the DTXT types except DTXT_ALTO, DTXT_HOCR and DTXT_BINARY are appendable, so it is recommended to append pages to the same file one by one (or a few at a time) instead of using a large array containing all of the HPAGEs.
- In case of DTXT_XMLCOORD output, iiImg specifies the orientation of the coordinates written into the XML file. II_ORIGINAL can be used even if the original image does not exist.
- If an HPAGE contains DataStream, this function may put the image into the output file without recompression. See the details in the section about DataStream.
- In case of the different PDF outputs, iiImg specifies the image used to create the PDF file. II_ORIGINAL can be used only if the original image or DataStream is available. See also kRecSetPreserveOriginalImg and the documentation of DataStream.
- Some of the settings used by this function can be modified by calling kRecSetDTXTFormat. The rest can be changed only through the Settings Manager Module. For more information see the Settings of the Direct TXT Module or the description of the module.
- The code page used when generating DTXT_TXT* output files can be specified by the setting Kernel.Chr.CodePage or by the function kRecSetCodePage.

The specification of this function in C# is:
RECERR kRecConvert2DTXTEx(int sid, IntPtr[] ahPage, IMAGEINDEX iiImg, string pFilename);

// or

RECERR kRecConvert2DTXTEx(int sid, IntPtr ahPage, IMAGEINDEX iiImg, string pFilename);

The specification of this function in Java is:
int kRecConvert2DTXTEx(int sid, HPAGE[] ahPage, IMAGEINDEX iiImg, String pFilename);

The specification of this function in Python is:
def kRecConvert2DTXTEx(sid: int, ahPage: "HPAGE", iiImg: int, pFilename: str) -> int
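
As a rough illustration of how iiImg might be chosen for PDF output, the sketch below selects DTXT_IOTPDF and passes II_ORIGINAL only when the caller knows that the original image was preserved. The helper name write_image_on_text_pdf and its haveOriginal flag are hypothetical, REC_OK is assumed to be the success code, and the stand-in types and enum values exist only so that the sketch compiles without the SDK; the real values come from KernelApi.h.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins so the sketch compiles without the SDK. With the real SDK,
 *     delete this part and include KernelApi.h instead. --- */
typedef int RECERR;
#define REC_OK 0                                        /* assumed success code */
typedef struct RECPAGESTRUCT *HPAGE;
typedef const char *LPCTSTR;
typedef enum { II_CURRENT, II_ORIGINAL } IMAGEINDEX;    /* placeholder values */
typedef enum { DTXT_TXTS, DTXT_TXTCSV, DTXT_TXTF, DTXT_PDFIOT, DTXT_XMLCOORD,
               DTXT_BINARY, DTXT_IOTPDF, DTXT_IOTPDF_MRC, DTXT_ALTO, DTXT_HOCR,
               DTXT_XMLIMG, DTXT_FORMPROC } DTXTOUTPUTFORMATS;

static DTXTOUTPUTFORMATS g_format = DTXT_TXTS;          /* currently selected format */
static IMAGEINDEX g_lastImg = II_CURRENT;               /* iiImg of the last conversion */

static RECERR kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS f)
{ (void)sid; g_format = f; return REC_OK; }
static RECERR kRecConvert2DTXTEx(int sid, const HPAGE *a, int n, IMAGEINDEX ii, LPCTSTR f)
{ (void)sid; (void)a; (void)n; (void)f; g_lastImg = ii; return REC_OK; }
/* --- End of stand-ins. --- */

/* Hypothetical helper: writes the pages as an Image on Text PDF.
 * haveOriginal tells whether the original image (or DataStream) is
 * available in the HPAGEs; only then is II_ORIGINAL allowed. */
RECERR write_image_on_text_pdf(int sid, const HPAGE *ahPage, int nPage,
                               int haveOriginal, LPCTSTR outFile)
{
    assert(ahPage != NULL && nPage > 0);
    RECERR rc = kRecSetDTXTFormat(sid, DTXT_IOTPDF);
    if (rc != REC_OK)
        return rc;
    IMAGEINDEX iiImg = haveOriginal ? II_ORIGINAL : II_CURRENT;
    return kRecConvert2DTXTEx(sid, ahPage, nPage, iiImg, outFile);
}
```

Leaving the decision to an explicit flag keeps the sketch independent of how the application tracks whether the original image was preserved.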

◆ kRecGetDTXTFormat()

RECERR RECAPIKRN kRecGetDTXTFormat	(	int	sid,
		DTXTOUTPUTFORMATS *	pdFormat )

Getting DTXT format.

This function retrieves the Direct TXT output format.

Parameters

[in]	sid	Settings Collection ID.
[out]	pdFormat	Pointer to a variable that receives the output format.

Return values

RECERR

Note: This function gets the value of the setting Kernel.DTxt.DirectTxtFormat. This setting can be changed by kRecSetDTXTFormat.

The specification of this function in C# is:
RECERR kRecGetDTXTFormat(int sid, out DTXTOUTPUTFORMATS pdFormat);

The specification of this function in Java is:
int kRecGetDTXTFormat(int sid, DTXTOUTPUTFORMATS[] pdFormat);

The specification of this function in Python is:
def kRecGetDTXTFormat(sid: int) -> Tuple[int, int]

◆ kRecMakePagesSearchable()

RECERR RECAPIKRN kRecMakePagesSearchable	(	int	sid,
		LPCTSTR	pFilename,
		int	fromPage,
		const HPAGE *	ahPage,
		int	nPage,
		IMAGEINDEX	iiImg )

Making a PDF page searchable.

This function writes invisible textual information into a PDF file to make it searchable and readable.

Parameters

[in]	sid	Settings Collection ID.
[in]	pFilename	Name of the file to be made searchable.
[in]	fromPage	Index of the first page to be made searchable (zero start index). The function processes the `nPage` pages starting from `fromPage`.
[in]	ahPage	Array of `HPAGE`s containing the searchable/textual data (produced by kRecRecognize).
[in]	nPage	Number of `HPAGE`s in `ahPage`.
[in]	iiImg	Index of the image to use for orienting the pages. (II_ORIGINAL or II_CURRENT)

Return values

RECERR

Note:
- The orientation of the processed page is not changed if iiImg is II_ORIGINAL. (The text might be rotated to the correct position if needed.) On the other hand, when iiImg is II_CURRENT, the function may rotate the pages to make them upright. Even in that case the image itself is not touched; only a rotation matrix is added.
- The file pFilename should not be open during kRecMakePagesSearchable. It is recommended to make the file searchable page by page instead of using an array with lots of HPAGEs. Since each open/close requires considerable resources, grouping HPAGEs is supported. However, HPAGEs may occupy rather large memory areas, so keeping many of them in memory simultaneously may cause memory errors.

Page-by-page sample:
/* Setting used by this sample: */
/* "Kernel.OcrMgr.PDF.ProcessingMode"=PDF_PM_GRAPHICS_ONLY */
/* sid, pFilename and the zero-based page index i come from the caller. */
/* Check rc after every call in production code. */
RECERR rc;
HPAGE hPage;

rc = kRecLoadImgF(sid, pFilename, &hPage, i);    /* load page i of the PDF */
rc = kRecPreprocessImg(sid, hPage);              /* image preprocessing */
rc = kRecRecognize(sid, hPage, NULL);            /* recognize the page */
rc = kRecMakePagesSearchable(sid, pFilename, i, &hPage, 1, II_CURRENT);
rc = kRecFreeImg(hPage);                         /* remove the page from memory */

Page grouping sample:
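/* sid, pFilename, first_page and NPAGES come from the caller. */
/* Check rc after every call in production code. */
HIMGFILE hIFile = NULL;
HPAGE hPage[NPAGES];
RECERR rc;
int ipage;

/* .. */
rc = kRecOpenImgFile(pFilename, &hIFile, IMGF_READ, (IMF_FORMAT)0);
/* .. */
for (ipage = 0; ipage < NPAGES; ipage++)
    rc = kRecLoadImg(sid, hIFile, hPage + ipage, first_page + ipage);
/* .. */
rc = kRecCloseImgFile(hIFile);   /* the file must be closed before it is modified */
/* .. */

/* Preprocessing must be somewhere between load and recognize. */
/* So you can also attach it to the loop of either load or recognize. */
for (ipage = 0; ipage < NPAGES; ipage++)
    rc = kRecPreprocessImg(sid, hPage[ipage]);
/* .. */
for (ipage = 0; ipage < NPAGES; ipage++)
    rc = kRecRecognize(sid, hPage[ipage], NULL);
/* .. */
rc = kRecMakePagesSearchable(sid, pFilename, first_page, hPage, NPAGES, II_CURRENT);
/* .. */
for (ipage = 0; ipage < NPAGES; ipage++)
    rc = kRecFreeImg(hPage[ipage]);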

Note: A trick that can sometimes be useful: this function can change the orientation of the pages in the PDF file even if it gets HPAGEs containing no letters (provided that preprocessing has already been called and II_CURRENT is passed in iiImg).

The specification of this function in C# is:
RECERR kRecMakePagesSearchable(int sid, string pFilename, int fromPage, IntPtr[] ahPage, IMAGEINDEX iiImg);

// or

RECERR kRecMakePagesSearchable(int sid, string pFilename, int fromPage, IntPtr ahPage, IMAGEINDEX iiImg);

The specification of this function in Java is:
int kRecMakePagesSearchable(int sid, String pFilename, int fromPage, HPAGE[] ahPage, IMAGEINDEX iiImg);

The specification of this function in Python is:
def kRecMakePagesSearchable(sid: int, pFilename: str, fromPage: int, ahPage: "HPAGE", iiImg: int) -> int
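
The page-by-page and page-grouping approaches can be combined by processing a long PDF in fixed-size groups. The sketch below is only one possible arrangement: the helper name make_searchable_in_groups and the GROUP_MAX limit are hypothetical, REC_OK is assumed to be the success code, and the stand-in definitions mimic the signatures shown on this page so that the sketch compiles without the SDK. It also assumes that Kernel.OcrMgr.PDF.ProcessingMode has been set as in the page-by-page sample.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins so the sketch compiles without the SDK. With the real SDK,
 *     delete this part and include KernelApi.h instead. --- */
typedef int RECERR;
#define REC_OK 0                                       /* assumed success code */
struct RECPAGESTRUCT { int unused; };
typedef struct RECPAGESTRUCT *HPAGE;
typedef const char *LPCTSTR;
typedef enum { II_CURRENT, II_ORIGINAL } IMAGEINDEX;   /* placeholder values */

static struct RECPAGESTRUCT g_page;
static int g_live = 0;                     /* pages loaded but not yet freed */
static int g_calls = 0;                    /* kRecMakePagesSearchable calls */
static int g_lastFrom = -1, g_lastN = -1;  /* arguments of the last call */

static RECERR kRecLoadImgF(int sid, LPCTSTR f, HPAGE *ph, int n)
{ (void)sid; (void)f; (void)n; *ph = &g_page; g_live++; return REC_OK; }
static RECERR kRecPreprocessImg(int sid, HPAGE h)
{ (void)sid; (void)h; return REC_OK; }
static RECERR kRecRecognize(int sid, HPAGE h, LPCTSTR f)
{ (void)sid; (void)h; (void)f; return REC_OK; }
static RECERR kRecMakePagesSearchable(int sid, LPCTSTR f, int from,
                                      const HPAGE *a, int n, IMAGEINDEX ii)
{ (void)sid; (void)f; (void)a; (void)ii;
  g_calls++; g_lastFrom = from; g_lastN = n; return REC_OK; }
static RECERR kRecFreeImg(HPAGE h)
{ (void)h; g_live--; return REC_OK; }
/* --- End of stand-ins. --- */

#define GROUP_MAX 8   /* hypothetical upper limit of HPAGEs held at once */

/* Hypothetical helper: makes pages 0..totalPages-1 of pdfFile searchable,
 * groupSize pages at a time. */
RECERR make_searchable_in_groups(int sid, LPCTSTR pdfFile, int totalPages, int groupSize)
{
    HPAGE pages[GROUP_MAX];
    assert(pdfFile != NULL && groupSize >= 1 && groupSize <= GROUP_MAX);
    for (int first = 0; first < totalPages; first += groupSize) {
        int want = totalPages - first < groupSize ? totalPages - first : groupSize;
        int loaded = 0;
        RECERR rc = REC_OK;
        while (rc == REC_OK && loaded < want) {        /* load the group */
            rc = kRecLoadImgF(sid, pdfFile, &pages[loaded], first + loaded);
            if (rc == REC_OK)
                loaded++;
        }
        for (int i = 0; rc == REC_OK && i < loaded; i++) {
            rc = kRecPreprocessImg(sid, pages[i]);
            if (rc == REC_OK)
                rc = kRecRecognize(sid, pages[i], NULL);
        }
        if (rc == REC_OK)                              /* one write per group */
            rc = kRecMakePagesSearchable(sid, pdfFile, first, pages, loaded, II_CURRENT);
        for (int i = 0; i < loaded; i++)               /* free even after a failure */
            kRecFreeImg(pages[i]);
        if (rc != REC_OK)
            return rc;
    }
    return REC_OK;
}
```

A larger groupSize means fewer writes to the PDF file but more HPAGEs in memory at once, so the value is a trade-off the application has to choose.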

◆ kRecSetDTXTFormat()

RECERR RECAPIKRN kRecSetDTXTFormat	(	int	sid,
		DTXTOUTPUTFORMATS	dFormat )

Changing DTXT format.

This function changes the Direct TXT output format setting.

Parameters

[in]	sid	Settings Collection ID.
[in]	dFormat	The output format to be set.

Return values

RECERR

Note: This function sets the value of the setting Kernel.DTxt.DirectTxtFormat. This setting can be retrieved by kRecGetDTXTFormat.

The specification of this function in C# is:
RECERR kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS dFormat);

The specification of this function in Java is:
int kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS dFormat);

The specification of this function in Python is:
def kRecSetDTXTFormat(sid: int, dFormat: int) -> int
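
Because the format is stored in a setting of the Settings Collection, a conversion that needs a particular format may want to restore the previous value afterwards. A minimal sketch of that pattern follows; the helper name convert_with_format is hypothetical, REC_OK is assumed to be the success code, and the stand-in definitions only mimic the signatures shown on this page so that the sketch compiles without the SDK.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins so the sketch compiles without the SDK. With the real SDK,
 *     delete this part and include KernelApi.h instead. --- */
typedef int RECERR;
#define REC_OK 0                                  /* assumed success code */
typedef struct RECPAGESTRUCT *HPAGE;
typedef const char *LPCTSTR;
typedef enum { DTXT_TXTS, DTXT_TXTCSV, DTXT_TXTF, DTXT_PDFIOT, DTXT_XMLCOORD,
               DTXT_BINARY, DTXT_IOTPDF, DTXT_IOTPDF_MRC, DTXT_ALTO, DTXT_HOCR,
               DTXT_XMLIMG, DTXT_FORMPROC } DTXTOUTPUTFORMATS;

static DTXTOUTPUTFORMATS g_format = DTXT_TXTS;    /* stands for Kernel.DTxt.DirectTxtFormat */
static DTXTOUTPUTFORMATS g_usedFormat = DTXT_TXTS;/* format seen by the last conversion */

static RECERR kRecGetDTXTFormat(int sid, DTXTOUTPUTFORMATS *p)
{ (void)sid; *p = g_format; return REC_OK; }
static RECERR kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS f)
{ (void)sid; g_format = f; return REC_OK; }
static RECERR kRecConvert2DTXT(int sid, const HPAGE *a, int n, LPCTSTR f)
{ (void)sid; (void)a; (void)n; (void)f; g_usedFormat = g_format; return REC_OK; }
/* --- End of stand-ins. --- */

/* Hypothetical helper: converts the pages with the given format and then
 * puts back whatever format was selected before the call. */
RECERR convert_with_format(int sid, DTXTOUTPUTFORMATS fmt,
                           const HPAGE *ahPage, int nPage, LPCTSTR outFile)
{
    DTXTOUTPUTFORMATS saved;
    assert(ahPage != NULL && nPage > 0);
    RECERR rc = kRecGetDTXTFormat(sid, &saved);       /* remember the old format */
    if (rc != REC_OK)
        return rc;
    rc = kRecSetDTXTFormat(sid, fmt);
    if (rc != REC_OK)
        return rc;
    rc = kRecConvert2DTXT(sid, ahPage, nPage, outFile);
    RECERR rcRestore = kRecSetDTXTFormat(sid, saved); /* restore in any case */
    return rc != REC_OK ? rc : rcRestore;             /* report the first error */
}
```

Restoring the format keeps other code that shares the same Settings Collection unaffected by the temporary change.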

CSDK	ALTO
HPAGE	Page
Zone (WT_GRAPHIC)	Illustration
Zone (WT_TABLE)	ComposedBlock (TYPE="Table")
Zone (WT_FLOW, ...)	TextBlock
LETTER	Glyph
RLINE	GraphicalElement
kRecGetFrameInfo (CELL_INFO.rect)	GraphicalElement

CSDK	hOCR
HPAGE	ocr_page
Zone (WT_GRAPHIC)	ocr_photo
Zone (WT_TABLE)	ocr_table
Zone (WT_FLOW, ...)	ocr_carea and ocr_par
RLINE	ocr_separator
kRecGetFrameInfo (CELL_INFO.rect)	ocr_separator
