Letter handling tools.

Classes
struct	LSPC
	Additional information about the space character.
struct	LETTER
	The LETTER structure.
Typedefs
typedef LETTER *	LPLETTER
	Pointer to a structure LETTER.
typedef const LETTER *	LPCLETTER
	Const pointer to a structure LETTER.
Enumerations
enum	LETTERSTRENGTH { LTS_FINAL, LTS_STRONG, LTS_MEDIUM, LTS_WEAK, LTS_SIZE }
	Possible places where letter array is to be copied to.
Functions
RECERR RECAPIKRN	kRecGetLetters (HPAGE hPage, IMAGEINDEX iiImage, LPLETTER *ppLetter, LPLONG pLettersLength)
	Getting recognition result.
RECERR RECAPIKRN	kRecGetLetterPalette (HPAGE hPage, REC_COLOR **ppColours, LPLONG pNum)
	Getting palette of recognition data.
RECERR RECAPIKRN	kRecGetChoiceStr (HPAGE hPage, WCHAR **ppChoices, LPLONG pLength)
	Getting choices.
RECERR RECAPIKRN	kRecGetSuggestionStr (HPAGE hPage, WCHAR **ppSuggestions, LPLONG pLength)
	Getting suggestions.
RECERR RECAPIKRN	kRecGetFontFaceStr (HPAGE hPage, char **ppFontFaces, LPLONG pLength)
	Getting font faces.
RECERR RECAPIKRN	kRecSetLetters (LETTERSTRENGTH towhere, HPAGE hPage, IMAGEINDEX iiImage, LPCLETTER pLetter, LONG LettersLength)
	Putting a letter buffer onto the input of the `PLUS2W` and `PLUS3W` engines or the selected output converter.
RECERR RECAPIKRN	kRecFreeRecognitionData (HPAGE hPage)
	Freeing recognition data.
LETTER::fontAttrib field elements
Possible values of LETTER::fontAttrib field.
#define	R_NO_ITALIC 0x0001
	Not-Italic character. It is not possible for both `R_ITALIC` and `R_NO_ITALIC` to be set. If both are unset we do not know whether it is Italic or not.
#define	R_ITALIC 0x0002
	Italic character. See also `R_NO_ITALIC`.
#define	R_NO_BOLD 0x0004
	Not-Bold character. It is not possible for both `R_BOLD` and `R_NO_BOLD` to be set. If both are unset we do not know whether it is Bold or not.
#define	R_BOLD 0x0008
	Bold character. See also `R_NO_BOLD`.
#define	R_SANSSERIF 0x0010
	Sans Serif character. It is not possible for both `R_SANSSERIF` and `R_SERIF` to be set. If both are unset we do not know whether it is Serif or not.
#define	R_SERIF 0x0020
	Serif character. See also `R_SANSSERIF`.
#define	R_PROPORTIONAL 0x0040
	Proportional character. It is not possible for both `R_PROPORTIONAL` and `R_MONOSPACED` to be set. If both are unset we do not know whether it is Monospaced or not.
#define	R_MONOSPACED 0x0080
	Monospaced character. See also `R_PROPORTIONAL`.
#define	R_SMALLCAPS 0x0100
	Character in a Small Caps word. The code is always upper case! See also `RR_SMALLCAPS_TALL` in the field info.
#define	R_UNDERLINE 0x0200
	Underlined character.
#define	R_STRIKETHROUGH 0x0400
	Struck through character. It is not used. It is only for future versions.
#define	R_SUBSCRIPT 0x0800
	Subscript character.
#define	R_SUPERSCRIPT 0x1000
	Superscript character.
#define	R_DROPCAP 0x2000
	Dropcap character.
#define	R_POPCAP 0x4000
	Popcap character.
#define	R_INVERTED 0x8000
	Inverted character.
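The paired flags above (Italic, Bold) each encode a three-state value: set, explicitly not set, or unknown. The following is a minimal sketch of decoding them, assuming the flag values listed above. The `TriState` type and the helper names are hypothetical and not part of the SDK.

```c
#include <assert.h>

/* Flag values copied from the list above. */
#define R_NO_ITALIC 0x0001
#define R_ITALIC    0x0002
#define R_NO_BOLD   0x0004
#define R_BOLD      0x0008

/* Hypothetical helper type: not part of the SDK. */
typedef enum { TRI_UNKNOWN, TRI_NO, TRI_YES } TriState;

/* Decode a positive/negative flag pair into a three-state value. */
TriState decode_pair(unsigned fontAttrib, unsigned yesFlag, unsigned noFlag)
{
    /* The list above states that both flags of a pair are never set together. */
    assert(!((fontAttrib & yesFlag) && (fontAttrib & noFlag)));
    if (fontAttrib & yesFlag) return TRI_YES;
    if (fontAttrib & noFlag)  return TRI_NO;
    return TRI_UNKNOWN; /* neither flag set: the attribute is not known */
}

TriState italic_state(unsigned fontAttrib)
{
    return decode_pair(fontAttrib, R_ITALIC, R_NO_ITALIC);
}

TriState bold_state(unsigned fontAttrib)
{
    return decode_pair(fontAttrib, R_BOLD, R_NO_BOLD);
}
```

The same `decode_pair` call would apply to the `R_SERIF`/`R_SANSSERIF` and `R_MONOSPACED`/`R_PROPORTIONAL` pairs.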
LETTER::info field macros
Macros that can be used with the LETTER::info field.
#define	RH_OCRENGINE(info) ((RECOGNITIONMODULE)(((info) & RH_OCRENGINE_MASK) >> 5))
	Getting the RECOGNITIONMODULE from the field `info`. This is the module ID of the engine that actually recognized the given character. With the PLUS engines this is usually RM_RESERVED_M.
#define	RH_OCRENGINE_SET(oeng) (((UINT)(oeng)) << 5)
	Setting the RECOGNITIONMODULE into the field `info`.
#define	RH_OCRTYPE(info) ((FILLINGMETHOD)(((info) & RH_OCRTYPE_MASK) >> 10))
	Getting the FILLINGMETHOD from the field `info`.
#define	RH_OCRTYPE_SET(otype) (((UINT)(otype)) << 10)
	Setting the FILLINGMETHOD into the field `info`.
#define	RH_BARTYPE(info) ((BAR_TYPE)(((info) & RH_BARTYPE_MASK) >> 24))
	Getting the BAR_TYPE from the field `info`.
#define	RH_BARTYPE_SET(btype) (((UINT)(btype)) << 24)
	Setting the BAR_TYPE into the field `info`.
Info field bits
Possible flags of LETTER::info field.
#define	RR_BULLET 0x00000001
	Bullet character at bullet position.
#define	RR_SOFTHYPHEN 0x00000004
	Soft hyphen.
#define	RH_OCRENGINE_MASK 0x000003E0
	Mask of RECOGNITIONMODULE.
#define	RH_OCRTYPE_MASK 0x00007C00
	Mask of FILLINGMETHOD.
#define	RH_GTMTCH 0x00008000
	Internal use only.
#define	RR_CONFIDENT_CHAR 0x00010000
	Internal use only.
#define	RR_DISABLED_CHAR 0x00020000
	Internal use only.
#define	RR_VOTED_CHAR 0x00040000
	Internal use only.
#define	RR_NOISY_CHAR 0x00080000
	Internal use only.
#define	RR_EXPANDED 0x00100000
	Internal use only.
#define	RH_MANGO_ISOLATED_CH 0x00200000
	Internal use only.
#define	RH_LA_INTERNAL 0x00400000
	Internal use only.
#define	RH_LA_EXTERNAL 0x00800000
	NO LONGER USED.
#define	RR_PDMERGE_CHAR 0x00800000
	Internal use only.
#define	RH_BARTYPE_MASK 0x3F000000
	Mask of BAR_TYPE.
#define	RR_DICTIONARY_WORD 0x40000000
	Dictionary word. It is set when the word is in at least one dictionary of the currently used ones. See Language of a word.
#define	RR_SMALLCAPS_TALL 0x80000000
	A tall character among the small capitals. See also `R_SMALLCAPS` in the field fontAttrib.
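The `info` field packs three multi-bit values next to its single-bit flags. The following is a minimal sketch of writing one packed value without disturbing the other bits, using the macros and masks listed above. The `typedef`s are stand-ins so that the sketch compiles without the SDK headers, and the `info_with_*` helpers are hypothetical names.

```c
#include <assert.h>

/* Stand-ins for SDK types, so the sketch compiles on its own. */
typedef unsigned int UINT;
typedef int RECOGNITIONMODULE;
typedef int FILLINGMETHOD;
typedef int BAR_TYPE;

/* Masks, flags and macros copied from the lists above. */
#define RH_OCRENGINE_MASK  0x000003E0
#define RH_OCRTYPE_MASK    0x00007C00
#define RH_BARTYPE_MASK    0x3F000000
#define RR_DICTIONARY_WORD 0x40000000

#define RH_OCRENGINE(info)     ((RECOGNITIONMODULE)(((info) & RH_OCRENGINE_MASK) >> 5))
#define RH_OCRENGINE_SET(oeng) (((UINT)(oeng)) << 5)
#define RH_OCRTYPE(info)       ((FILLINGMETHOD)(((info) & RH_OCRTYPE_MASK) >> 10))
#define RH_OCRTYPE_SET(otype)  (((UINT)(otype)) << 10)
#define RH_BARTYPE(info)       ((BAR_TYPE)(((info) & RH_BARTYPE_MASK) >> 24))
#define RH_BARTYPE_SET(btype)  (((UINT)(btype)) << 24)

/* Hypothetical helper: clear the bits under mask, then store the new value. */
UINT info_replace(UINT info, UINT mask, UINT shiftedValue)
{
    return (info & ~mask) | (shiftedValue & mask);
}

UINT info_with_engine(UINT info, RECOGNITIONMODULE engine)
{
    return info_replace(info, RH_OCRENGINE_MASK, RH_OCRENGINE_SET(engine));
}

UINT info_with_type(UINT info, FILLINGMETHOD type)
{
    return info_replace(info, RH_OCRTYPE_MASK, RH_OCRTYPE_SET(type));
}

UINT info_with_bartype(UINT info, BAR_TYPE type)
{
    return info_replace(info, RH_BARTYPE_MASK, RH_BARTYPE_SET(type));
}
```

Masking the shifted value keeps an out-of-range argument from spilling into neighbouring bits. The `*_SET` macros alone do not clear a previously stored value.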
LETTER::makeup field elements
Flags of end-position `LETTERs` (see their usage in the table in the End-position letters section) and direction/orientation flags (see also the section about vertical text support).
#define	R_ENDOFLINE 0x0001
	End of line. In a table zone, the end of all the lines of a cell is marked by this flag.
#define	R_ENDOFPARA 0x0002
	End of paragraph. This flag is used by BAR module only.
#define	R_ENDOFWORD 0x0004
	End of word.
#define	R_ENDOFZONE 0x0008
	End of zone.
#define	R_ENDOFPAGE 0x0010
	End of page.
#define	R_ENDOFCELL 0x0020
	End of table cell.
#define	R_ENDOFROW 0x0040
	End of the last line of the last filled cell in a table row.
#define	R_INTABLE 0x0080
	Letter is in a table cell.
#define	R_TEXT_DIR_MASK 0x0700
	Mask of text direction in makeup field.
#define	R_TEXT_ORIENT_MASK 0x0300
	Mask of text orientation in makeup field.
#define	R_NORMTEXT 0
	Horizontal text.
#define	R_VERTTEXT 0x0100
	Vertical text (CCJK) or neon text (Latin) or upside-down barcode.
#define	R_LEFTTEXT 0x0200
	Left rotated / orientation is upward.
#define	R_RIGHTTEXT 0x0300
	Right rotated / orientation is downward.
#define	R_RTLTEXT 0x0400
	Character from a right-to-left direction word.
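The direction values above can be separated from the end-position flags with the two masks. The following is a minimal sketch. Reading them as a two-bit orientation field plus a separate right-to-left bit is an interpretation of the listed values, not something this page states explicitly, and the helper names are hypothetical.

```c
#include <assert.h>

/* Values copied from the list above. */
#define R_ENDOFLINE        0x0001
#define R_INTABLE          0x0080
#define R_TEXT_DIR_MASK    0x0700
#define R_TEXT_ORIENT_MASK 0x0300
#define R_NORMTEXT         0
#define R_VERTTEXT         0x0100
#define R_LEFTTEXT         0x0200
#define R_RIGHTTEXT        0x0300
#define R_RTLTEXT          0x0400

/* Returns R_NORMTEXT, R_VERTTEXT, R_LEFTTEXT or R_RIGHTTEXT. */
unsigned text_orientation(unsigned makeup)
{
    return makeup & R_TEXT_ORIENT_MASK;
}

/* Non-zero if the character belongs to a right-to-left word. */
int is_rtl(unsigned makeup)
{
    return (makeup & R_RTLTEXT) != 0;
}

/* All direction bits at once, with the end-position flags stripped. */
unsigned text_direction_bits(unsigned makeup)
{
    return makeup & R_TEXT_DIR_MASK;
}
```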
Space type values
Possible space types (LSPC).
#define	SPC_SPACE 0
	Real space.
#define	SPC_TAB 1
	Tabulator (tab character).
#define	SPC_LEADERDOT 2
	Dot leader.
#define	SPC_LEADERLINE 3
	Line leader.
#define	SPC_LEADERHYPHEN 4
	Hyphen leader.
Macros of alternatives of the LETTER
Macros that can be used for processing the alternatives of each LETTER. See also usage of alternatives.
#define	GETFIRSTALTERN(stringstart, ndx) (((const WCHAR*)((stringstart)+(ndx)))+1)
	Getting the first alternative.
#define	GETALTERNLENGTH(str) ((str)[-1])
	Getting the length of the alternative.
#define	GETNEXTALTERN(str) ((str)+GETALTERNLENGTH(str)+2)
	Getting the next alternative.
Defines of confidence handling of the LETTER
See confidence handling and LETTER::err.
#define	RE_SUSPECT_WORD 0x80
	The word is declared suspicious by the recognition engine if the dictionary (if any) does not contain it. This flag does not necessarily reflect whether the word is a dictionary word or not.
#define	RE_SUSPECT_THR 64
	Suspect threshold: if the lower 7 bits of LETTER::err represent a value at or above this (up to 100) it means low confidence.
#define	RE_ERROR_LEVEL_MASK ~RE_SUSPECT_WORD
	Mask for getting the error level of the current letter.
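The three defines above split LETTER::err into a word-level flag and a character-level error value. The following is a minimal sketch of reading both, assuming that LETTER::err is a single byte (suggested by the 0x80 flag and the "lower 7 bits" wording, but not stated here). The `BYTE` typedef is a stand-in and the helper names are hypothetical.

```c
#include <assert.h>

typedef unsigned char BYTE; /* stand-in for the SDK type */

/* Values copied from the list above. */
#define RE_SUSPECT_WORD     0x80
#define RE_SUSPECT_THR      64
#define RE_ERROR_LEVEL_MASK ~RE_SUSPECT_WORD

/* Error level of the character: the lower 7 bits of err. */
int error_level(BYTE err)
{
    return err & RE_ERROR_LEVEL_MASK;
}

/* Low confidence: the error level is at or above the suspect threshold. */
int is_suspect_char(BYTE err)
{
    return error_level(err) >= RE_SUSPECT_THR;
}

/* The engine declared the whole word suspicious. */
int is_suspect_word(BYTE err)
{
    return (err & RE_SUSPECT_WORD) != 0;
}
```

Note that the two tests are independent: a confidently recognized character can still sit in a suspect word, and the reverse.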

Detailed Description

Letter handling tools.

Recognized data is stored in the current HPAGE and it is available as an array of LETTER structures providing significantly more information than the character code itself. This type of output offers the most detailed information on recognition. The information stored in a LETTER structure may belong to the character itself (character code, position, size, confidence level, font attributes, font face, choices, color) or to the word containing the character (suggestions, languages). Word-level information is set in the first LETTER of the word.

NOTE: In both the SDK and its documentation, coordinates are grid coordinates, i.e. they refer to the top or left borders of pixels. Thus a rectangle does not contain the pixels at its right and bottom coordinates.
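The grid-coordinate convention can be sketched as a half-open rectangle. The `GridRect` type and the helper names below are hypothetical and only illustrate the arithmetic. They are not SDK definitions.

```c
#include <assert.h>

/* Hypothetical rectangle in grid coordinates (pixel borders, not pixel centers). */
typedef struct {
    int left, top, right, bottom;
} GridRect;

/* Width and height are plain differences: no "+ 1" is needed. */
int rect_width(const GridRect *r)
{
    assert(r != 0);
    return r->right - r->left;
}

int rect_height(const GridRect *r)
{
    assert(r != 0);
    return r->bottom - r->top;
}

/* A pixel (x, y) is inside if left <= x < right and top <= y < bottom. */
int rect_contains_pixel(const GridRect *r, int x, int y)
{
    assert(r != 0);
    return x >= r->left && x < r->right && y >= r->top && y < r->bottom;
}
```

With this reading, a rectangle with left 10 and right 13 covers the three pixel columns 10, 11 and 12.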

Handling of spaces

Spaces have a special role in the text, thus their handling is also special. There are two kinds of spaces in the recognition result.

One of them is the space-like character. It really appears in the original text and it is represented with a LETTER having a space character in its code field and an LSPC structure containing information about this character. The SPACE and TAB characters and the leaders belong to this type.

The other kind of space is the dummy space. It does not appear in the original text, but it has an individual LETTER object. It indicates the end of the line only when this is also the end of the word (i.e. the last character of the line is not a hyphen). It has a role only when the User writes the recognition result directly from the LETTER array into a pure TXT file without analysing any formatting flags (e.g. font attributes, end of lines, etc.). To handle this case, a space (the dummy space) is inserted between the last word of the line and the first word of the next line.

The LETTER has size information about the represented character. However, the width of the dummy space is zero, because it does not actually appear in the original text.

The barcode module (BAR) has a special binary recognition mode, in which the recognition result contains binary data rather than text. (See the setting Kernel.OcrMgr.BarBinary for more information.) In this case, the content of the barcode is logically one word in one line, and the result gets a dummy space at the end only for uniformity.

The notion of word in CSDK

The last letter (maybe punctuation character or digit) of a word is the LETTER having an R_ENDOFWORD flag. The beginning of a word is the first non-space character after the previous word (or the very first item of the LETTER array). The flag R_ENDOFLINE does not play a role in determining word boundaries (e.g. hyphenation).

Special cases:

Inside a word there may be spaces. In an expanded text all the letters are followed by a space.
In some cases there are no spaces between two words. Examples are words connected with a dash (like Jean-Pierre or drag-and-drop). In such cases the middle R_ENDOFWORD flag is on the character before the dash. This kind of dash is not marked with RR_SOFTHYPHEN even if the dash is at end of line.
Hyphenated words are treated as one, there is no R_ENDOFWORD before the hyphen. The hyphen's LETTER::info field has the RR_SOFTHYPHEN flag.

Word-related information (like the language of a word or RE_SUSPECT_WORD, etc.) is specified on all the characters of the word. The only exception is suggestion handling where suggestions are attached to the first character of the word only. (Note that suggestion handling uses a different word notion: space-separated words.)

End-position letters

The letters in ending positions are marked with particular flags. See the section above for details about the end of a word. In flowing text, the end-of-line flag is generally on the dummy space mentioned above. However, if the last character of a line is a hyphen in a hyphenated word, the R_ENDOFLINE flag is put on the hyphen and the dummy space is missing from this line.
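The rules for the dummy space and the end-of-line hyphen can be sketched as a small text export routine. This is only an illustration under stated assumptions: `SketchLetter` is a hypothetical, reduced stand-in for LETTER (the real structure layout is not shown on this page), any space carrying R_ENDOFLINE is treated as the dummy space, only ASCII codes are handled, and rejoining hyphenated words is a design choice of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values copied from this page. */
#define R_ENDOFLINE   0x0001      /* LETTER::makeup */
#define RR_SOFTHYPHEN 0x00000004  /* LETTER::info */

/* Hypothetical, reduced stand-in for LETTER: only the fields used here. */
typedef struct {
    unsigned short code;
    unsigned short makeup;
    unsigned int   info;
} SketchLetter;

/* Append one character if there is still room for it and the final NUL. */
static void put_char(char *out, size_t cap, size_t *used, char c)
{
    if (*used + 1 < cap)
        out[(*used)++] = c;
}

/* Writes the letters as NUL-terminated text; returns the number of chars written. */
size_t letters_to_text(const SketchLetter *letters, size_t count,
                       char *out, size_t cap)
{
    size_t used = 0;
    assert(letters != NULL && out != NULL && cap > 0);
    for (size_t i = 0; i < count; i++) {
        const SketchLetter *l = &letters[i];
        int eol = (l->makeup & R_ENDOFLINE) != 0;
        if (eol && (l->info & RR_SOFTHYPHEN))
            continue;   /* hyphenated word: drop the hyphen and join the halves */
        if (!(eol && l->code == ' '))
            put_char(out, cap, &used, (char)l->code);  /* ASCII only */
        if (eol)
            put_char(out, cap, &used, '\n');  /* replaces the dummy space */
    }
    out[used] = '\0';
    return used;
}
```

An application that wants to preserve the original line breaks inside hyphenated words would emit the hyphen and a newline instead of skipping the LETTER.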

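In a table, the placement of end-position flags is more complex. The following table shows all the possible combinations of the R_ENDOFLINE (L), R_ENDOFCELL (C), R_ENDOFROW (R) and R_ENDOFZONE (Z) flags. Each row below is a table row with three cells; a slash separates the text lines within a cell, the letters in brackets are the flags set at the end of that line, and (empty) marks an unfilled cell.

text [L,C]	text [L,C]	text [L,C,R]
text [L,C]	more [L] / lines in [L] / a cell [L,C,R]	(empty)
two-line [L] / text [L,C]	(empty)	text [L,C,R]
(empty)	last filled [L] / cell [L,C,R,Z]	(empty)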

Usage of alternatives

The common name for LETTER choices and word suggestions is 'alternatives'. Both kinds are handled in the same way: they are accessed through special WCHAR arrays. Every alternative is a special string with its length in its 0th WCHAR element and a terminating zero WCHAR. You can get the WCHAR arrays listing all the alternatives in the recognition data, one for choices and one for suggestions, with the functions kRecGetChoiceStr and kRecGetSuggestionStr, respectively.

One LETTER contains an index to the list of the alternatives that points to its first alternative and has a counter with the number of its alternatives. All LETTERs can have choices (LETTER::ndxChoices), but only the first LETTER of a word refers to the suggestions (LETTER::ndxSuggestions). The scope of such a suggestion is the space-terminated word. (Note that it can differ from the end of the word notion used by spelling.)

The alternatives of a LETTER can be enumerated using the macros GETFIRSTALTERN, GETNEXTALTERN and GETALTERNLENGTH. See the following sample code on how to use them:

    RECERR err;
    HPAGE hPage;
    LETTER *pLetters;
    WCHAR *pChoices;
    LONG nLetters, choiceStrLen;

    ...
    err = kRecGetLetters(hPage, II_CURRENT, &pLetters, &nLetters);
    if (err != REC_OK)
        ... // Doing some error handling
    ...
    err = kRecGetChoiceStr(hPage, &pChoices, &choiceStrLen);
    if (err != REC_OK)
        ... // Doing some error handling
    for (LONG lettn=0; lettn<nLetters; lettn++)
    {
        ...
        const WCHAR *choice = GETFIRSTALTERN(pChoices, pLetters[lettn].ndxChoices);
        for (BYTE chon=1; chon<pLetters[lettn].cntChoices; chon++)
        {
            ... // Doing some choice handling
            choice = GETNEXTALTERN(choice);
        }
        ...
    }
    ...
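The buffer layout described above (a length element, the characters, then a terminating zero) can be exercised without the SDK. The following is a minimal sketch that walks a hand-built buffer with the three macros. The `WCHAR` typedef is a stand-in, `nth_alternative` is a hypothetical helper, and the sketch assumes that the counter gives the total number of alternatives stored for the LETTER.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WCHAR; /* stand-in for the SDK type */

/* Macros copied from this page. */
#define GETFIRSTALTERN(stringstart, ndx) (((const WCHAR*)((stringstart)+(ndx)))+1)
#define GETALTERNLENGTH(str) ((str)[-1])
#define GETNEXTALTERN(str) ((str)+GETALTERNLENGTH(str)+2)

/*
 * Hypothetical helper: returns the n-th (0-based) alternative of a LETTER
 * whose alternatives start at index ndx, or NULL if n is out of range.
 */
const WCHAR *nth_alternative(const WCHAR *buf, long ndx, int count, int n)
{
    assert(buf != NULL);
    if (n < 0 || n >= count)
        return NULL;
    const WCHAR *alt = GETFIRSTALTERN(buf, ndx); /* skips the length element */
    for (int i = 0; i < n; i++)
        alt = GETNEXTALTERN(alt); /* skips chars, the zero and the next length */
    return alt;
}
```

For example, a buffer holding the three alternatives "a", "rn" and "o" would be laid out as `{1,'a',0, 2,'r','n',0, 1,'o',0}`, and `GETALTERNLENGTH` applied to the second alternative would read the `2` stored just before it.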

Consecutive words can have the same suggestion indices - that is, the given suggestions are common to the group of the given words. This is the case when the suggestion combines two space-separated words into a single one without the space.

Since the first LETTER of a word cannot be a space, spaces do not have suggestions; instead they carry space information (LSPC) in the same union (see Handling of spaces for more information).

Font faces can be accessed in a string of C-type strings. The LETTER indexes into this string at the first character of its font face name.
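Looking up a font face name is then a matter of pointer arithmetic. The following is a minimal sketch, assuming that the length reported for the font face string counts every character including the terminating zeros, and that the index fits in a `long`. The `font_face_at` helper is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the zero-terminated font face name that starts
 * at index ndx of the font face string, or NULL if ndx is out of range.
 */
const char *font_face_at(const char *faces, long length, long ndx)
{
    assert(faces != NULL);
    if (ndx < 0 || ndx >= length)
        return NULL;
    return faces + ndx; /* the name runs up to the next zero byte */
}
```

With a string laid out as `"Arial\0Times"`, index 0 yields "Arial" and index 6 yields "Times".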

Enumeration Type Documentation

enum LETTERSTRENGTH

Possible places where letter array is to be copied to.

Enumerator:

LTS_FINAL	Letters are put directly onto the input of the output conversion step.
LTS_STRONG	Letters are put onto the strong input of the `PLUS2W` and `PLUS3W` engines.
LTS_MEDIUM	Letters are put onto the medium input of the `PLUS2W` and `PLUS3W` engines.
LTS_WEAK	Letters are put onto the weak input of the `PLUS3W` engine.
LTS_SIZE	Number of LETTER indices (for verifying index validity).

Function Documentation

RECERR RECAPIKRN kRecFreeRecognitionData ( HPAGE hPage )

Freeing recognition data.

The kRecFreeRecognitionData function destroys the recognized data (memory object) belonging to the hPage page.

Parameters:

[in] hPage Handle of the page having the data to be removed.

Return values:

RECERR

Note:

The effect of this call is the same as if the application had not called the kRecRecognize function.

The specification of this function in C# is:

 RECERR kRecFreeRecognitionData(IntPtr hPage);

The specification of this function in Java is:

 int kRecFreeRecognitionData(HPAGE hPage)

RECERR RECAPIKRN kRecGetChoiceStr	(	HPAGE	hPage,
		WCHAR **	ppChoices,
		LPLONG	pLength
	)

Getting choices.

The kRecGetChoiceStr function makes the alternative letter choices belonging to the hPage page available to the application by creating a new memory object. This function can be called after a successful kRecRecognize call. The retrieved data is available as an array of WCHAR elements. For more about its internal structure see the usage of alternatives. A LETTER contains the number of its choices and the index of its first choice in this array (LETTER::cntChoices, LETTER::ndxChoices).

Parameters:

[in]	hPage	Handle of the page whose recognized data should be accessed.
[out]	ppChoices	Address of a pointer variable to get the array of the recognized alternative characters and ligatures.
[out]	pLength	Pointer to a variable to hold the length of recognized alternative characters.

Return values:

RECERR

Note:

Since this function creates a new memory object, the application should call the kRecFree function to free this memory area after evaluating the result.

The specification of this function in C# is:

 RECERR kRecGetChoiceStr(IntPtr hPage, out char[] ppChoices);

The specification of this function in Java is:

 int kRecGetChoiceStr(HPAGE hPage, Choices ppChoices)

RECERR RECAPIKRN kRecGetFontFaceStr	(	HPAGE	hPage,
		char **	ppFontFaces,
		LPLONG	pLength
	)

Getting font faces.

The kRecGetFontFaceStr function makes the font face data belonging to the hPage page available to the application by creating a new memory object. This function can be called after a successful kRecRecognize call. The retrieved data is available as a sequence of zero-terminated char strings. A LETTER contains the index of its font face name in this array (LETTER::ndxFontFace).

Parameters:

[in]	hPage	Handle of the page whose recognized data should be accessed.
[out]	ppFontFaces	Address of a pointer variable to get the UTF-8 string of the recognized font faces.
[out]	pLength	Pointer to a variable to hold the length of recognized font face string.

Return values:

RECERR

Note:

Font face information is available only when processing PDF files with an accessible text layer.

Since this function creates a new memory object, after evaluating the result, the application should call the kRecFree function to free this memory area.

The specification of this function in C# is:

 RECERR kRecGetFontFaceStr(IntPtr hPage, out char[] ppFontFaces);

The specification of this function in Java is:

 int kRecGetFontFaceStr(HPAGE hPage, FontFaces ppFontFaces)

RECERR RECAPIKRN kRecGetLetterPalette	(	HPAGE	hPage,
		REC_COLOR **	ppColours,
		LPLONG	pNum
	)

Getting palette of recognition data.

This function makes the palette of the recognition data belonging to the hPage page available to the application by creating a new memory object. This function can be called after a successful kRecRecognize call. It contains both the foreground and background colors of the letters. The LETTER structure has indices into this array for foreground and background colors (LETTER::ndxFGColor, LETTER::ndxBGColor).

Parameters:

[in]	hPage	Handle of the page whose recognized data should be accessed.
[out]	ppColours	Address of a pointer variable to get the address of the palette array.
[out]	pNum	Pointer to a variable to hold the number of colors in palette.

Return values:

RECERR

Note:

The palette can contain the special REC_COLOR values REC_DEFAULT_COLOR and REC_UNDEF_COLOR. A background color can be either of them; both mean white. A foreground color can be REC_DEFAULT_COLOR, which means black.

Since this function creates a new memory object, the application should call the kRecFree function to free this memory area after evaluating the result.

The specification of this function in C# is:

 RECERR kRecGetLetterPalette(IntPtr hPage, out uint[] ppColours);

The specification of this function in Java is:

 int kRecGetLetterPalette(HPAGE hPage, RecColorArray ppColours)

RECERR RECAPIKRN kRecGetLetters	(	HPAGE	hPage,
		IMAGEINDEX	iiImage,
		LPLETTER *	ppLetter,
		LPLONG	pLettersLength
	)

Getting recognition result.

The kRecGetLetters function makes the recognition data belonging to the hPage page available to the application by creating a new memory object containing the recognized data. This function can be called after a successful kRecRecognize call. The recognized data is available as an array of LETTER structures.

Parameters:

[in]	hPage	Handle of the page whose recognized data should be accessed.
[in]	iiImage	Index of the image in the page whose coordinate system is used for the returned coordinates.
[out]	ppLetter	Address of a pointer variable to get the address of the recognized characters.
[out]	pLettersLength	Pointer to a variable to hold the number of recognized characters.

Return values:

RECERR

Note:

Since this function creates a new memory object containing the recognized data, the application should call the kRecFree function to free this memory area after evaluating the result.

The specification of this function in C# is:

 RECERR kRecGetLetters(IntPtr hPage, IMAGEINDEX iiImage, out LETTER[] ppLetter);

The specification of this function in Java is:

 int kRecGetLetters(HPAGE hPage, IMAGEINDEX iiImage, LetterArray ppLetter)

RECERR RECAPIKRN kRecGetSuggestionStr	(	HPAGE	hPage,
		WCHAR **	ppSuggestions,
		LPLONG	pLength
	)

Getting suggestions.

The kRecGetSuggestionStr function makes the word suggestions belonging to the hPage page available to the application by creating a new memory object. This function can be called after a successful kRecRecognize call. The retrieved data is available as an array of WCHAR elements. For more about its internal structure see the usage of alternatives. The first LETTER of a word contains the number of its suggestions and the index of its first suggestion in this array (LETTER::cntSuggestions, LETTER::ndxSuggestions).

Parameters:

[in]	hPage	Handle of the page whose recognized data should be accessed.
[out]	ppSuggestions	Address of a pointer variable to get the array of the recognized suggestions.
[out]	pLength	Pointer to a variable to hold the length of recognized suggestions.

Return values:

RECERR

Note:

Since this function creates a new memory object, the application should call the kRecFree function to free this memory area after evaluating the result.

If the letter is a space, it does not have suggestions, but only space info (see LETTER::spcInfo and LSPC).

The specification of this function in C# is:

 RECERR kRecGetSuggestionStr(IntPtr hPage, out char[] ppSuggestions);

The specification of this function in Java is:

 int kRecGetSuggestionStr(HPAGE hPage, Suggestions ppSuggestions)

RECERR RECAPIKRN kRecSetLetters	(	LETTERSTRENGTH	towhere,
		HPAGE	hPage,
		IMAGEINDEX	iiImage,
		LPCLETTER	pLetter,
		LONG	LettersLength
	)

Putting a letter buffer onto the input of the PLUS2W and PLUS3W engines or the selected output converter.

This function can affect the recognition results and/or the content of the output file. The PLUS modules are voting engines combining results of two or three other OCR engines. The voting method of RM_OMNIFONT_PLUS2W has strong and medium inputs, RM_OMNIFONT_PLUS3W uses an additional weak one as well. You can replace one input (parameter towhere) with your alternative engine result by calling the function kRecSetLetters. The voting method uses your letter buffer as it generates the final OCR result. Stronger input may have greater effect on the recognition result, so you should consider which level you select for your letter buffer.

When the letter buffer is passed at the LTS_FINAL level, the OCR step does not run: at this level kRecSetLetters works as it did in previous versions of CSDK, i.e. the letters are passed directly to the input of the selected output converter.

Parameters:

[in]	towhere	Specifies which input receives the letter buffer: one of the three possible inputs of the Voting Engine (LTS_STRONG, LTS_MEDIUM, LTS_WEAK), or LTS_FINAL for the selected output converter.
[in]	hPage	Handle of the HPAGE the Voting Engine works on.
[in]	iiImage	Index of the image in the page whose coordinate system you have used in defining the boundary box for `LETTER`.
[in]	pLetter	The letter buffer to be given to the engine.
[in]	LettersLength	Size of the letter buffer.

Return values:

RECERR

Note:

After putting letters on the selected levels (even on LTS_FINAL), you should call kRecRecognize.

More than one input of the PLUS engines can be replaced by subsequent calls to kRecSetLetters. Even in that case, kRecRecognize should be called only once.

The following fields of input LETTERs are unused during kRecRecognize: cntChoices, ndxChoices, cntSuggestions, ndxSuggestions, reserved_b, ndxFGColor, ndxBGColor, ndxFontFace, ndxExt and the OCRENGINE bits (RH_OCRENGINE_MASK) of info. These fields are cleared and the original order of LETTERs may be altered after using this function.

The specification of this function in C# is:

 RECERR kRecSetLetters(LETTERSTRENGTH towhere, IntPtr hPage, IMAGEINDEX iiImg, LETTER[] lpLetter);

The specification of this function in Java is:

 int kRecSetLetters(LETTERSTRENGTH towhere, HPAGE hPage, IMAGEINDEX iiImage, LETTER[] pLetter)
