RecAPI
Output

Code page

One of the Engine's settings is the code page, which can be set and queried through the functions kRecSetCodePage and kRecGetCodePage, respectively. It is handled by the Language, Character Set and Code Page Handling Module.

Recognized characters are stored internally in the Engine in their UNICODE representation. The current code page is taken into account in two situations: when converting a character to or from this UNICODE representation, and when converting the recognition data to the final output document. The former is done with the kRecConvertCodePage2Unicode and kRecConvertUnicode2CodePage calls.

The output conversion process performs character code conversions from UNICODE into the current code page while producing the final output document.

The kRecGetFirstCodePage and kRecGetNextCodePage function pair can be used to enumerate the available code pages.

The set of characters validated for recognition (see the topic Defining the character set) can conflict with the code page selection, because a selected code page may not support some characters. For example, if you select the Hungarian language and the current code page is Windows ANSI (code page 1252), the final output document will not contain some accented characters of that language. Use the kRecCheckCodePage function to check whether the current code page contains all the characters of the current Language environment (the language selection and the LanguagesPlus characters) and any characters listed as FilterPlus characters. The output of kRecCheckCodePage is a string of the characters not supported by the current code page (non-supported characters).

If there are non-supported characters when output conversion is performed, the Engine tries to replace them with similarly shaped characters in the final output document. This substitution does not work in all cases; it mainly serves to replace non-supported accented characters with unaccented ones. Where a character was recognized correctly but could be neither exported nor substituted, the final output document contains a missing symbol in its place.
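The coverage check described above can be pictured as a set difference between the characters the language needs and the characters the code page offers. The following is a minimal standalone sketch of that idea, not the Engine's implementation: unsupportedChars and cp1252Excerpt are hypothetical names that do not exist in RecAPI, and the code page is modelled by a small hand-picked excerpt rather than the full Windows-1252 table. It relies only on the fact that Windows-1252 lacks the Hungarian double-acute letters U+0151 and U+0171.

```cpp
#include <set>
#include <string>

// Hypothetical helper, not part of RecAPI: returns the characters of
// `required` that are absent from `codePage`, each listed once, in the
// order in which they first appear.
std::u32string unsupportedChars(const std::u32string& required,
                                const std::set<char32_t>& codePage)
{
    std::u32string missing;
    for (char32_t ch : required) {
        const bool supported = codePage.count(ch) != 0;
        const bool alreadyListed = missing.find(ch) != std::u32string::npos;
        if (!supported && !alreadyListed)
            missing.push_back(ch);
    }
    return missing;
}

// Illustrative excerpt of Windows-1252 (an assumption of this sketch, not
// the full table): it holds several accented letters used in Hungarian,
// but not the double-acute ones (U+0151, U+0171).
std::set<char32_t> cp1252Excerpt()
{
    return { U'a', U'o', U'u',
             U'\u00E1', U'\u00E9', U'\u00ED', U'\u00F3',
             U'\u00F6', U'\u00FA', U'\u00FC' };
}
```

Calling unsupportedChars with a Hungarian character list and cp1252Excerpt() yields a string of the letters that would need substitution, which mirrors the kind of result the text attributes to kRecCheckCodePage.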

The application can call kRecSetMissingSymbol to define which character from the current code page should be used to indicate a missing symbol.
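The substitution and missing-symbol behavior described above can be sketched as a three-way decision per character: keep it if the code page supports it, otherwise try a similarly shaped replacement, otherwise emit the missing symbol. The code below is a standalone illustration under stated assumptions, not the Engine's algorithm: convertForCodePage, the fallback table and the set-based code page model are all hypothetical and are not RecAPI calls.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the character handling during output
// conversion; it only illustrates the decision order described in the text.
std::u32string convertForCodePage(const std::u32string& text,
                                  const std::set<char32_t>& codePage,
                                  const std::map<char32_t, char32_t>& fallback,
                                  char32_t missingSymbol)
{
    // The missing symbol must itself be a character of the code page.
    assert(codePage.count(missingSymbol) == 1);

    std::u32string out;
    for (char32_t ch : text) {
        if (codePage.count(ch) != 0) {
            out.push_back(ch);              // supported: exported unchanged
            continue;
        }
        auto it = fallback.find(ch);
        if (it != fallback.end() && codePage.count(it->second) != 0)
            out.push_back(it->second);      // similarly shaped substitute
        else
            out.push_back(missingSymbol);   // neither exportable nor substitutable
    }
    return out;
}
```

With a fallback table that maps U+0151 to 'o' and U+0171 to 'u', the double-acute Hungarian letters come out unaccented, while a character with no entry in the table is replaced by the chosen missing symbol.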

Output conversion

Page-level processing offers only simple output converters (because of page-level requirements). The Direct TXT Output Converter Module implements this step of page processing. The functions kRecSetDTXTFormat and kRecGetDTXTFormat provide access to the setting that selects the output converter. The selected output converter can be any of the following (DTXTOUTPUTFORMATS):

  • binary - non-formatted binary data,
  • standard text,
  • comma separated text,
  • formatted text,
  • image-on-text PDF,
  • simple XML.

The behavior of each converter can be fine-tuned through settings.

The integrating application invokes output conversion by calling kRecConvert2DTXT.

Note
More complex and more accurate conversion can be performed by the document-level converters.

The following code sample processes a multi-page image file containing Hungarian text (loading, preprocessing, recognition and output conversion). The recognition results of all pages are written to a single output document:
RECERR rc;
...
LANG_ENA langs[LANG_SIZE];
HPAGE *hPages;
HIMGFILE hIFile;
int pageCnt, i;
// Error checking of rc is omitted for brevity.
// Select the Hungarian language.
memset(langs, 0, sizeof(langs));
langs[LANG_HUN] = LANG_ENA;
rc = kRecSetLanguages(0, langs);
// Select a code page that covers the Hungarian language.
rc = kRecSetCodePage(0, "Windows Eastern");
// Open the image file.
rc = kRecOpenImgFile("multipage.tif", &hIFile, IMGF_READ, (IMF_FORMAT)0);
// Get the number of pages.
rc = kRecGetImgFilePageCount(hIFile, &pageCnt);
// Create an array for the page handles.
hPages = new HPAGE[pageCnt];
// Cycle through the pages.
for (i = 0; i < pageCnt; i++)
{
    // Load the current page.
    rc = kRecLoadImg(0, hIFile, &(hPages[i]), i);
    // Preprocess the image.
    rc = kRecPreprocessImg(0, hPages[i]);
    // Recognize the page.
    rc = kRecRecognize(0, hPages[i], NULL);
}
// Close the image file.
rc = kRecCloseImgFile(hIFile);
// Set the conversion format to image-on-text PDF.
rc = kRecSetDTXTFormat(0, DTXT_IOTPDF);
// Convert all the pages into a single PDF file.
rc = kRecConvert2DTXT(0, hPages, pageCnt, "multipage.pdf");
// Free the pages and the handle array.
for (i = 0; i < pageCnt; i++)
{
    rc = kRecFreeImg(hPages[i]);
}
delete[] hPages;
...
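The sample stores every return value in rc but never inspects it. One possible way to surface failures is a small checking helper, sketched below. This is an assumption-laden illustration rather than SDK code: check is a hypothetical helper, and RECERR and REC_OK are declared here only as stand-ins so that the sketch compiles without the SDK headers. In a real project they would come from the RecAPI headers, and the stand-in declarations would be removed.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins so this sketch compiles on its own (assumptions, not SDK code).
using RECERR = int;
constexpr RECERR REC_OK = 0;

// Hypothetical helper: throws if a call did not report success.
// `what` names the call that produced `rc`, for the error message.
inline void check(RECERR rc, const char* what)
{
    if (rc != REC_OK) {
        throw std::runtime_error(std::string(what) + " failed, code " +
                                 std::to_string(rc));
    }
}
```

Inside the sample, a call could then be wrapped as check(kRecSetCodePage(0, "Windows Eastern"), "kRecSetCodePage"). Because an exception skips the cleanup at the end of the sample, code that adopts this pattern would also need to release the loaded pages and the image file handle on the error path.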

Declarations referenced in the multi-page code sample above:

  • RECERR RECAPIKRN kRecSetLanguages(int sid, const LANG_ENA *pLanguages) - Setting languages.
  • LANG_ENA - Language enable/disable. Defined at KernelApi.h:1051.
  • RECERR RECAPIKRN kRecSetCodePage(int sid, LPCTSTR pCodePageName) - Setting the code page.
  • LANGUAGES - Possible languages. Defined at KernelApi.h:1106.
  • LANG_SIZE - Defined at KernelApi.h:1275.
  • LANG_HUN - Defined at KernelApi.h:1140.
  • RECERR RECAPIKRN kRecSetDTXTFormat(int sid, DTXTOUTPUTFORMATS dFormat) - Changing DTXT format.
  • RECERR RECAPIKRN kRecConvert2DTXT(int sid, const HPAGE *ahPage, int nPage, LPCTSTR pFilename) - Converting pages with DTXT.
  • DTXT_IOTPDF - Defined at KernelApi.h:9589.
  • RECERR - Error codes. Defined at RECERR_doc.h:19.
  • RECERR RECAPIKRN kRecPreprocessImg(int sid, HPAGE hPage) - Image preprocessing.
  • RECERR RECAPIKRN kRecFreeImg(HPAGE hPage) - Removing a page.
  • struct RECPAGESTRUCT * HPAGE - Handle of a page in memory. Defined at KernelApi.h:289.
  • struct tagIMGFILEHANDLE * HIMGFILE - Handle of image files. Defined at KernelApi.h:11244.
  • IMF_FORMAT - Image formats. Defined at KernelApi.h:11179.
  • RECERR RECAPIKRN kRecCloseImgFile(HIMGFILE hIFile) - Closing an image file.
  • RECERR RECAPIKRN kRecLoadImg(int sid, HIMGFILE hIFile, HPAGE *phPage, int iPage) - Loading a page from an opened image file.
  • RECERR RECAPIKRN kRecGetImgFilePageCount(HIMGFILE hIFile, int *lpPageCount) - Getting the number of pages in an image file.
  • RECERR RECAPIKRN kRecOpenImgFile(LPCTSTR pFilename, HIMGFILE *pHIMGFILE, int mode, IMF_FORMAT filetype) - Opening an image file.
  • #define IMGF_READ - Opening file for reading only. The file can be shared. (Use it with kRecOpenImgFile.... Defined at KernelApi.h:11253.
  • RECERR RECAPIKRN kRecRecognize(int sid, HPAGE hPage, LPCTSTR pFilename) - Recognizing a page.