RecAPI
Module name: | RER |
Module identifier: | RM_RER |
Filling methods supported: | FM_HANDPRINT, FM_CMC7, FM_OCRA, FM_OCRB, FM_MICR, FM_OMNIFONT (Thai, Hebrew) |
Filters supported: | all filter elements |
Trade-off supported: | TO_ACCURATE, TO_FAST (includes TO_BALANCED) |
Knowledge base files: | kadmos.uk, hand_s.rec, numplus.rec, and the language-specific knowledge base files listed below. |
Knowledge base file for Thai OCR: | kadmos.uk, ttf_s_th.rec |
Knowledge base file for Hebrew OCR: | kadmos.uk, ttf_s_il.rec |
Training file supported: | no |
This module is supported on: Windows, Linux and MacOS x64.
This module is included only in the Professional Recognition Kit (not the OCR kit). To make this technology available in your application, it must be covered by your distribution licensing.
Thai and Hebrew OCR can be purchased as an add-on ("Asian Plus") to either the Professional Recognition Kit or the Professional OCR Kit.
See the topic on Licensing in the General Information help system.
This is a third-party recognition module from re Recognition GmbH (www.rerecognition.com). The Engine contains its KADMOS recognition engine.
This recognition module can be used for recognition of handprinted alphanumerical characters, i.e. upper and lower case letters, the digits and some others. Although it can be used to read flowing text, its main application area is in form-like situations, where the form designer has great control over the content and maybe length of handprinted information given in each zone.
In addition this module recognizes Thai and Hebrew text. It can handle short embedded English texts within such language text. Thai language is accessible from version 19.0, Hebrew from 20.1. See details below.
With the filling method FM_HANDPRINT selected, this module can differentiate 159 characters. These are the digits, 28 punctuation and miscellaneous characters (listed below), and the letters of the English alphabet plus all accented characters necessary for 98 languages. Fifteen languages have dictionary support: Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Spanish and Swedish. Other supported languages include Croatian (with one limitation), Estonian, Gaelic, Indonesian, Latvian, Lithuanian, Slovak, Slovenian, Swahili, Tagalog, Turkish and Welsh (the last two with minor limitations). Cyrillic languages and Greek are not supported. The supported languages can be freely combined, but dictionary support is not available when they are combined.
The following punctuation characters can be recognized:
! | Exclamation Mark |
? | Question Mark |
‘ | Apostrophe-Quote |
" | Quotation Mark |
; | Semicolon |
, | Comma |
: | Colon |
. | Period (Full-stop) |
- | Hyphen-Minus |
( | Opening Parenthesis |
) | Closing Parenthesis |
[ | Opening Square Bracket |
] | Closing Square Bracket |
{ | Opening Curly Bracket |
} | Closing Curly Bracket |
The following miscellaneous characters can be recognized:
# | Number Sign |
% | Percent Sign |
@ | Commercial At |
& | Ampersand |
| | Vertical Bar |
$ | Dollar Sign |
* | Asterisk |
+ | Plus Sign |
= | Equals Sign |
_ | Spacing Underscore |
/ | Slash |
\ | Backslash |
< | Less-Than Sign |
> | Greater-Than Sign |
The other supported filling methods add further character ranges to the capabilities of the RER engine. These ranges are described in OCR special filling methods and in the summary table of OCR Special Characters.
The files with the .rec extension are optional and removable; they can be selected and combined with each other manually. The general knowledge base file hand_s.rec is installed with the module during installation of OmniPage Capture SDK v22. In addition, two language/country-specific knowledge base files are installed, hand_s_us.rec and hand_s_de.rec, together with numplus.rec, which holds the knowledge about numbers and some miscellaneous characters. Other language/country-specific knowledge base files can be found in the folder RER_KBFiles of the install ZIP. (The installed files are also there.) These files are distributed as listed in the table below. Their names are in the form hand_s_??.rec, where the double question mark within the filename should be replaced by a country code as follows:
Code | Language(s) / Country |
al | Albanian |
at | Austrian, German |
be | Belgian, Dutch, French, German |
ch | Swiss, French, German, Italian |
cs | Czech, Slovakian |
cz | Czech |
de | German |
dk | Danish |
ee | Estonian |
es | Spanish |
eu | West-European |
fi | Finnish |
fr | French |
hu | Hungarian |
ie | Irish, English, Gaelic Irish |
it | Italian |
lt | Lithuanian |
lv | Latvian |
nl | Dutch |
no | Norwegian |
pl | Polish |
pt | Portuguese |
ro | Romanian |
se | Swedish |
sf | Scandinavia |
sl | Slovenian |
sk | Slovakian |
tr | Turkish |
uk | UK |
us | USA |
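The naming rule above can be sketched as a small lookup. This is only an illustration: the helper name kb_file_name and the set KB_COUNTRY_CODES are hypothetical and not part of the CSDK; the codes themselves are copied from the table.

```python
# Country codes copied from the table above (30 entries).
KB_COUNTRY_CODES = {
    "al", "at", "be", "ch", "cs", "cz", "de", "dk", "ee", "es",
    "eu", "fi", "fr", "hu", "ie", "it", "lt", "lv", "nl", "no",
    "pl", "pt", "ro", "se", "sf", "sl", "sk", "tr", "uk", "us",
}


def kb_file_name(country_code):
    """Hypothetical helper: build the hand_s_??.rec name for a table code."""
    code = country_code.strip().lower()
    if code not in KB_COUNTRY_CODES:
        raise ValueError("no knowledge base file listed for code %r" % code)
    # The two question marks in hand_s_??.rec stand for the country code.
    return "hand_s_%s.rec" % code
```

For example, kb_file_name("at") yields hand_s_at.rec, the Austrian file mentioned in the text.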
If more than one language/country-specific knowledge base file is in the Engine Binary directory, the system automatically uses the one that matches the current recognition language.
If the User's product will be used only in a specific region, the installed language/country-specific knowledge base files can be removed and replaced manually by the one for that region. This may improve accuracy somewhat. For example, in Austria it is possibly better to use hand_s_at.rec instead of hand_s_de.rec.
For a language spoken in more than one region, there is no point in using all the knowledge base files containing that language at the same time, because the Engine cannot decide well enough between the regions. Since the API does not provide any way to specify the country, the User has to make the decision in advance.
The module requires at least one .rec file in the Engine Binary directory. This file does not have to be hand_s.rec. However, the Distribution Wizard of the CSDK tries to copy only hand_s.rec from the binary folder into the selected file set (and displays a message if this file is not there). Thus, if you want a different subset of optional knowledge base files in your redistributed file set, you should select and copy it manually after running the Distribution Wizard.
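A pre-distribution check of the two conditions above might look like the following sketch. The function names and the returned keys are hypothetical, and matching file names case-insensitively is an assumption made here for convenience, not documented SDK behavior.

```python
from pathlib import Path


def summarize_kb_files(file_names):
    """Hypothetical check over a list of file names from the binary folder."""
    # Assumption: names are compared case-insensitively.
    rec_files = sorted(n for n in file_names if n.lower().endswith(".rec"))
    return {
        "rec_files": rec_files,
        # The module needs at least one .rec file, not necessarily hand_s.rec.
        "module_requirement_met": bool(rec_files),
        # The Distribution Wizard looks only for hand_s.rec.
        "wizard_will_find_hand_s": "hand_s.rec" in [n.lower() for n in rec_files],
    }


def summarize_engine_dir(engine_binary_dir):
    """Run the same check against a real directory on disk."""
    names = [p.name for p in Path(engine_binary_dir).iterdir() if p.is_file()]
    return summarize_kb_files(names)
```

If module_requirement_met is true but wizard_will_find_hand_s is false, the module can run, but the chosen .rec files have to be copied by hand after running the Distribution Wizard.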
Handprint is much harder to recognize accurately than machine-generated text, and success depends very heavily on character quality. The use of structured forms to limit the possible range of characters, together with zone-level filters and individual character validation (see kRecSetFilterPlusEx), can significantly improve accuracy. This recognition module can apply all the Engine’s possible filter elements to the 159-member character set it supports. Handprinted forms are usually filled in by different respondents, which tends to lower accuracy. If respondents can be given clear filling instructions (e.g. a print model to follow) and be motivated to print clearly, success will be higher.
If the handprint contains numbers only, using the RM_HNR module is likely to give better results than the RM_RER module filtered for numbers only. The functioning of the RER module can be influenced by the page-level trade-off setting.
For successful recognition, the characters should not touch each other. Each character can be zoned individually or a zone may contain one or more lines of characters. Each character must have a height of 30-180 pixels. Well-formed characters written in pen are best recognized. Pencil and felt-tip pens give poorer results. When reading from pre-printed forms, dropout colored boxes can be useful to encourage respondents to write characters of even size and spacing. In that case, respondents must not write with a pen of the dropout color.
Maximum number of characters in a line: 200.
Number of lines in a zone: No restriction.
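The numeric limits above (character height of 30-180 pixels, at most 200 characters per line) can be checked before an image is submitted. The sketch below assumes the caller already has per-character heights in pixels for one line, for example from a form layout; the function and constant names are hypothetical.

```python
MIN_CHAR_HEIGHT_PX = 30
MAX_CHAR_HEIGHT_PX = 180
MAX_CHARS_PER_LINE = 200


def check_line(char_heights_px):
    """Return a list of problems found in one line of characters."""
    problems = []
    if len(char_heights_px) > MAX_CHARS_PER_LINE:
        problems.append(
            "line has %d characters, limit is %d"
            % (len(char_heights_px), MAX_CHARS_PER_LINE)
        )
    for index, height in enumerate(char_heights_px):
        if not MIN_CHAR_HEIGHT_PX <= height <= MAX_CHAR_HEIGHT_PX:
            problems.append(
                "character %d is %d px high, expected %d-%d px"
                % (index, height, MIN_CHAR_HEIGHT_PX, MAX_CHAR_HEIGHT_PX)
            )
    return problems
```

An empty list means the line is within the documented limits; it does not guarantee good recognition, which also depends on writing quality.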
The Engine cannot provide access to all the parameters of reRecognition’s KADMOS toolkit. Note, however, that the recognition module can be fine-tuned through parameters placed under the section [Parm] of an INI file. A sample INI file, RM_RER.INI, can be found in the above-mentioned folder RER_KBFiles of the install ZIP. The full path of the INI file can be specified with the setting Kernel.Ocr.RER.UseParamFile, which replaces the function RecSetRMSpecParams of the previous CSDK versions.
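Before passing a file path to Kernel.Ocr.RER.UseParamFile, it can be useful to confirm that the file really contains a [Parm] section and to resolve its full path. The sketch below uses only the Python standard library; the helper names are hypothetical, UTF-8 encoding is an assumption, and the parameter name ExampleKey in the usage note is a placeholder, not a real KADMOS parameter.

```python
import configparser
from pathlib import Path


def parm_section(ini_text):
    """Return the key/value pairs found under [Parm] in INI text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("Parm"):
        raise ValueError("INI text has no [Parm] section")
    return dict(parser["Parm"])


def resolve_param_file(ini_path):
    """Return (full path, [Parm] parameters) for an INI file on disk."""
    path = Path(ini_path).resolve()
    # Assumption: the file is readable as UTF-8.
    params = parm_section(path.read_text(encoding="utf-8"))
    return str(path), params
```

For instance, parm_section("[Parm]\nExampleKey = 1\n") returns a dictionary with the single placeholder entry, and the first element returned by resolve_param_file is the full path that would be supplied to the setting.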
For Thai and Hebrew, the RER recognition module can recognize only machine-printed (FM_OMNIFONT) characters. Handprinted characters are not supported for these languages.
To recognize such text, the language in question should be set (LANG_THA or LANG_HEB) and Western languages should not be set (English is the one exception; see the next paragraph).
The module can recognize short English texts embedded in Thai or Hebrew text. By default this works without the English language being set. Embedded texts in other Latin-alphabet languages can also be recognized; however, accented characters may not always be handled correctly.
IMPORTANT NOTE: For Thai and Hebrew recognition to work correctly, the language must be set before the preprocessing operation.
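The language rules for Thai and Hebrew can be expressed as a simple pre-check on the intended language selection, run before the language is set and before preprocessing starts. This sketch works on plain strings and does not call the CSDK. LANG_THA and LANG_HEB come from the text above; treating LANG_ENG as the identifier for English, and LANG_GER in the usage note, are assumptions for illustration.

```python
RER_OMNIFONT_LANGUAGES = {"LANG_THA", "LANG_HEB"}
# Assumption: LANG_ENG names the English language that may accompany them.
ALLOWED_COMPANIONS = {"LANG_ENG"}


def language_problems(selected):
    """Return problems with a language selection meant for Thai/Hebrew OCR."""
    chosen = set(selected)
    problems = []
    if not chosen & RER_OMNIFONT_LANGUAGES:
        problems.append("neither LANG_THA nor LANG_HEB is set")
    extra = chosen - RER_OMNIFONT_LANGUAGES - ALLOWED_COMPANIONS
    if extra:
        problems.append("other languages set: " + ", ".join(sorted(extra)))
    return problems
```

A selection such as ["LANG_THA", "LANG_GER"] is reported as a problem, while ["LANG_HEB"] or ["LANG_HEB", "LANG_ENG"] passes.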