RecAPI
Character Set and Filters in the Engine

The Character Set

The Character Set is the group of characters validated for a given zone. It can differ from zone to zone. The recognition module associated with a zone queries the Character Set assigned to that zone immediately before the recognition step.

The Character Set concept incorporated in the Engine applies to the following text recognition modules: MOR, MTX, FRX, PLUS2W and PLUS3W, DOT, HNR and RER.

The purpose of limiting the Character Set

You can improve text recognition accuracy by narrowing the range of characters validated for recognition. The recognition module then no longer has the difficult job of choosing one solution from all 500 characters (and from multiple shapes of every character) in the Engine's Total Character Set.

Most recognition modules use this information to improve recognition, but some do not. Even the modules that do use it may return filtered-out characters, because the limited character set is only a hint for them.

Limiting the character set assumes that the application or user has some prior knowledge of what types of texts or characters will be encountered on a page or zone, e.g.

  • the language(s) may be known;
  • certain character classes may be excluded (e.g. no lowercase letters);
  • there may be a limited set of permissible characters (e.g. in form-type documents).

The following steps describe how the Character Set can be defined, per image (with global settings) or per zone (with local settings).

Language Environment

Typically texts to be recognized are written by and for people in natural languages. To describe the character components of the languages the Engine provides two tools.

The supported languages (and their combinations) can be selected directly using the LANGUAGES enum.

The language selection may not always fully meet a user’s needs:

  • Linguistic sources do not agree on the full alphabets of some languages.
  • There may be differences between archaic and modern forms or between dialects of a given language.
  • Languages transcribed from other alphabets may use different norms.
  • Some foreign words and names in a text may use accented letters not supported by the language setting.

This is why we provide a second, more flexible tool: the LanguagesPlus setting, which lets you complement or construct the Language environment character by character.

This Language environment is global, i.e., it remains valid for the whole image, and for all future ones until any of its components are changed.

Note
The Language environment does not always equal the Character Set, since filters can be applied, either globally (per image) or locally (per zone), as described under Filtering.

Language selection (global)

This is the most frequently used tool for the limitation of the Character Set. You enable one or more languages. This validates all letters and language-related characters needed for those languages, plus all digits, punctuation and miscellaneous characters.

E.g. selection of German without Spanish enables the typical German letter "O diaeresis" but disables the Spanish "Inverted Question Mark".
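The German/Spanish example can be pictured as a boolean array indexed by language. The sketch below is only a model of that idea, not the Engine's API: the enum, the per-language character lists and the function name are all hypothetical, and treating printable ASCII as "always validated" is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the LANGUAGES enum: each enumerator is the
   array position of one language. The real enum has many more entries. */
enum demo_language { DEMO_LANG_GERMAN, DEMO_LANG_SPANISH, DEMO_LANG_COUNT };

/* Illustrative, incomplete lists of language-related code points. */
static const unsigned demo_german[]  = { 0x00C4, 0x00D6, 0x00DC, 0x00DF };
static const unsigned demo_spanish[] = { 0x00D1, 0x00F1, 0x00A1, 0x00BF };

static bool demo_contains(const unsigned *list, size_t count, unsigned cp)
{
    for (size_t i = 0; i < count; ++i)
        if (list[i] == cp)
            return true;
    return false;
}

/* Returns true if code point cp is validated by the enabled languages.
   enabled[] is indexed by enum demo_language. */
bool demo_language_validates(const bool enabled[DEMO_LANG_COUNT], unsigned cp)
{
    assert(enabled != NULL);

    /* Assumption: printable ASCII stands in for the basic letters, digits,
       punctuation and miscellaneous characters that any selection validates. */
    if (cp >= 0x20 && cp <= 0x7E)
        return true;

    if (enabled[DEMO_LANG_GERMAN] &&
        demo_contains(demo_german, sizeof demo_german / sizeof demo_german[0], cp))
        return true;

    if (enabled[DEMO_LANG_SPANISH] &&
        demo_contains(demo_spanish, sizeof demo_spanish / sizeof demo_spanish[0], cp))
        return true;

    return false;
}
```

With only the German position set to true, U+00D6 (O diaeresis) is validated while U+00BF (Inverted Question Mark) is not; enabling the Spanish position as well validates both.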

Related functions:

Both of these functions take an array of languages as a parameter. The LANGUAGES enum defines which array position corresponds to which language.

Note
The list of supported characters and languages depends on the recognition module applied. Only the RM_OMNIFONT_PLUS2W and RM_OMNIFONT_PLUS3W multi-lingual modules support all languages. For more detail see the topics Recognition module specifications and Languages and modules.
Only the omnifont recognition modules support all punctuation and miscellaneous characters. To see which are supported by the other modules go to the Characters (punctuation / miscellaneous) and modules topic.
In rare cases you may define no language and build the Character Set only from individually defined characters.
For more on languages, characters, modules and Code Pages, see Introduction to language-related topics.

LanguagesPlus characters (global)

Here you define any additionally needed characters, e.g. to handle some foreign words in a text.

Related functions:

Note
The recognition modules RM_OMNIFONT_MOR, RM_OMNIFONT_FRX, RM_OMNIFONT_PLUS2W, RM_OMNIFONT_PLUS3W, RM_DOT and RM_RER accept the LanguagesPlus characters. RM_OMNIFONT_MTX does not support LanguagesPlus characters.
To discover which accented letters are supported for each language, and which modules support them, see Languages and characters.
To revalidate individual characters removed by filtering, use FILTER_PLUS.
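The LanguagesPlus mechanism described in this section can be thought of as a list of extra code points that is merged with the language selection. The following is a minimal sketch of that merge under stated assumptions: the struct, its capacity and both function names are hypothetical and are not taken from the Engine's headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_EXTRA 32  /* arbitrary capacity chosen for this sketch */

/* Hypothetical model of the LanguagesPlus characters: a flat list of
   individually validated code points. */
struct demo_languages_plus {
    unsigned chars[DEMO_MAX_EXTRA];
    size_t   count;
};

/* Adds one code point. Returns false only if the list is full;
   adding a code point that is already present is a no-op. */
bool demo_langplus_add(struct demo_languages_plus *lp, unsigned cp)
{
    assert(lp != NULL);
    for (size_t i = 0; i < lp->count; ++i)
        if (lp->chars[i] == cp)
            return true;
    if (lp->count == DEMO_MAX_EXTRA)
        return false;
    lp->chars[lp->count++] = cp;
    return true;
}

/* A code point belongs to the Language environment if the language
   selection validates it (in_languages) or it was added individually. */
bool demo_in_environment(const struct demo_languages_plus *lp,
                         bool in_languages, unsigned cp)
{
    assert(lp != NULL);
    if (in_languages)
        return true;
    for (size_t i = 0; i < lp->count; ++i)
        if (lp->chars[i] == cp)
            return true;
    return false;
}
```

For instance, a text whose selected languages do not cover U+00E9 (e acute) would reject that character until it is added through the extra list.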

Filtering

Filters can be used to limit the character set defined by the Language environment to specific character categories. This filtering can be a Global filter (applied at image/page level) or a Local filter (applied per zone). FILTER_ALL switches all filtering off, enabling all the characters in the Language environment. A filter can be built up from any combination (bitwise OR-ed) of the following five disjoint elements plus a few special ones:

These elements are rather rigid. To make filtering more flexible, the Engine provides a few special filter bits: FILTER_PLUS, FILTER_PLUS_1, FILTER_PLUS_2 and FILTER_PLUS_3. These bits additionally enable groups of individually validated characters, called the FilterPlus characters, which are set through the kRecSetFilterPlus and kRecSetFilterPlusEx functions. At most four such sets can be defined, indexed from 0 to 3. Each set is enabled with the corresponding FILTER_PLUS* filter bit.

As an example of filtering, when your document is a questionnaire containing only capitals, you can use the filter FILTER_UPPERCASE.
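The questionnaire example can be sketched as a bitmask test. This is a model only: the DEMO_ constants and their numeric values are hypothetical (the real FILTER_* values come from the SDK header), the five category names are an assumption, and ASCII classification through <ctype.h> merely stands in for the Engine's own character categories.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical bit values, one per disjoint character category. */
enum {
    DEMO_FILTER_DIGIT       = 0x01,
    DEMO_FILTER_UPPERCASE   = 0x02,
    DEMO_FILTER_LOWERCASE   = 0x04,
    DEMO_FILTER_PUNCTUATION = 0x08,
    DEMO_FILTER_MISC        = 0x10,
    /* Combined filters are built by OR-ing category bits. */
    DEMO_FILTER_ALPHA = DEMO_FILTER_UPPERCASE | DEMO_FILTER_LOWERCASE,
    DEMO_FILTER_ALL   = 0x1F
};

/* Maps an ASCII character to exactly one category bit. */
unsigned demo_category(unsigned char ch)
{
    if (isdigit(ch)) return DEMO_FILTER_DIGIT;
    if (isupper(ch)) return DEMO_FILTER_UPPERCASE;
    if (islower(ch)) return DEMO_FILTER_LOWERCASE;
    if (ispunct(ch)) return DEMO_FILTER_PUNCTUATION;
    return DEMO_FILTER_MISC;  /* everything else, e.g. space */
}

/* A character passes when its category bit is present in the filter. */
bool demo_filter_accepts(unsigned filter, unsigned char ch)
{
    /* This sketch handles category bits only, no FilterPlus bits. */
    assert((filter & ~(unsigned)DEMO_FILTER_ALL) == 0);
    return (filter & demo_category(ch)) != 0;
}
```

Here an uppercase-only filter accepts 'A' and rejects both 'a' and '7', while OR-ing in the digit bit lets '7' through without admitting lowercase letters.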

Some pre-defined combined filters are available: FILTER_ALPHA covers all letters, and FILTER_NUMBERS covers the digits plus the FilterPlus characters of the set with index 0.

Activation of filters

Each zone in the image has a ZONE structure defining its properties (coordinates, size, filling method and recognition module to be applied etc.). One of the fields in this structure is the filter field.

If automatic decomposition (auto-zoning) detects the zones, this filter field will always have the value FILTER_DEFAULT, which means that for these zones a common page-level filtering, i.e. the Global filter, will be applied.

The application can change this field, or it can create zones with their own filter values, thereby defining Local filters for individual zones.

Related functions and enums:

Note
Remember that some recognition modules impose their own limitations, e.g. RM_HNR is limited to digits plus four symbols. Any filter for a character category can only validate the characters in that category supported by the assigned recognition module.
For the value FILTER_DEFAULT to take effect, it must be the only value in the field, i.e. it must not be combined with any other filter bit.

Global filter

When the Engine is initialized, the Global filter setting takes the value FILTER_ALL (i.e. no filtering). You can set it to any other default value using the function kRecSetDefaultFilter. The Global filter setting is applied to every zone whose ZONE filter field has the value FILTER_DEFAULT.

Related functions and enums:

Local filter

As already stated, each ZONE structure has a field filter. If it is filled with any value other than FILTER_DEFAULT, the zone-level, Local filter will be used and any Global filter is ignored.
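The interplay between the Global and Local filter can be summarized as a single lookup. The sketch below models it with hypothetical types and names; it does not call kRecSetDefaultFilter or touch the real ZONE structure, and representing the "use the Global filter" value as 0 is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; the real constants come from the SDK header. */
enum {
    DEMO_FILTER_DEFAULT   = 0x00,  /* assumed marker: "use the Global filter" */
    DEMO_FILTER_DIGIT     = 0x01,
    DEMO_FILTER_UPPERCASE = 0x02,
    DEMO_FILTER_ALL       = 0x1F
};

struct demo_engine { unsigned global_filter; };
struct demo_zone   { unsigned filter; };

/* On initialization the Global filter means "no filtering". */
void demo_engine_init(struct demo_engine *e)
{
    assert(e != NULL);
    e->global_filter = DEMO_FILTER_ALL;
}

/* Models changing the page-level default. */
void demo_set_default_filter(struct demo_engine *e, unsigned filter)
{
    assert(e != NULL);
    e->global_filter = filter;
}

/* The zone's own filter wins unless it is exactly the default marker. */
unsigned demo_effective_filter(const struct demo_engine *e,
                               const struct demo_zone *z)
{
    assert(e != NULL && z != NULL);
    return (z->filter == DEMO_FILTER_DEFAULT) ? e->global_filter : z->filter;
}
```

In this model a zone left at the default follows every later change of the Global filter, whereas a zone with its own value is unaffected by such changes.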

Note
It is important to note that there are only 4 FilterPlus sets of characters. At zone-level the application can enable or disable the usage of one of these sets, or a combination of them.
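The four FilterPlus sets and their enabling bits can be modelled as follows. This is a sketch under assumptions: the bit values and names are hypothetical, the sets are plain C strings rather than whatever kRecSetFilterPlus and kRecSetFilterPlusEx actually store, and only the FilterPlus part of a filter is evaluated.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_PLUS_SETS 4  /* sets are indexed 0 to 3 */

/* Hypothetical bit values for the four FilterPlus bits. */
enum {
    DEMO_FILTER_PLUS   = 0x100,  /* enables set 0 */
    DEMO_FILTER_PLUS_1 = 0x200,  /* enables set 1 */
    DEMO_FILTER_PLUS_2 = 0x400,  /* enables set 2 */
    DEMO_FILTER_PLUS_3 = 0x800   /* enables set 3 */
};

static const unsigned demo_plus_bit[DEMO_PLUS_SETS] = {
    DEMO_FILTER_PLUS, DEMO_FILTER_PLUS_1, DEMO_FILTER_PLUS_2, DEMO_FILTER_PLUS_3
};

/* sets[i] is a NUL-terminated string holding the characters of set i,
   or NULL if that set has not been defined. Returns true if ch belongs
   to at least one set whose bit is present in the zone's filter. */
bool demo_plus_accepts(unsigned filter,
                       const char *const sets[DEMO_PLUS_SETS], char ch)
{
    assert(sets != NULL);
    if (ch == '\0')
        return false;  /* strchr would otherwise match the terminator */
    for (int i = 0; i < DEMO_PLUS_SETS; ++i) {
        if ((filter & demo_plus_bit[i]) != 0 &&
            sets[i] != NULL && strchr(sets[i], ch) != NULL)
            return true;
    }
    return false;
}
```

A zone can therefore combine sets by OR-ing their bits, for example enabling sets 0 and 1 together while another zone on the same page enables only set 0.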

Related functions:
These zone properties are typically defined by the kRecInsertZone or kRecUpdateZone functions.