RecAPI
All Classes Namespaces Functions Variables Typedefs Enumerations Enumerator Properties Modules Pages
Supported characters with the CCJK languages

For each CCJK language, all the generally used characters that are defined in the country's character encoding standard are supported. The recognized Han characters are the ones defined by the following encoding standards:

Encoding standard Level 1 Level 2 CJK Unified Ideographs
Extension A
Simplified Chinese GB 2312 3755 3008
GB 1830-2022 6530
Traditional Chinese Big-5 5401 7652
Japanese JIS X 0208 2965 3390
Korean KS X 1001 4888
Note
GB18030-2022 Level 1 is equivalent to GB 2312 Level 1 + Level 2 + Extension A.

With default settings the set of recognized characters is as follows:

  • Simplified Chinese:
    • All GB 2312 Level 1 and some frequently used Level 2 characters
  • Traditional Chinese:
    • All Level 1 characters
  • Japanese:
    • All Level 1 characters
  • Korean:
    • All common characters

In each case, English letters, numerals, and punctuation are also supported.

Using the Kernel.OcrMgr.Asian.FullCharacterSet setting this set can be extended:

  • Setting it to 1:
    • All Level 2 characters are added to the recognized set for the Chinese and Japanese languages. (Korean doesn't have level distinctions.)
  • Setting it to 2:
    • In case of Simplified Chinese language, the CJK Unified Ideographs Extension A characters are also added, covering Level 1 of the GB18030-2022 standard.

Details:

  • Simplified Chinese:
    • There are 3755 Level 1, 3008 Level 2 characters (as defined by the GB 2312 standard) and 6530 Extension A characters(as defined by the GB18030-2022 standard) supported. 499 frequently used GB 2312 Level 2 characters (see later) are supported even with the default non-FullCharacterSet setting, while the remaining 2509 Level 2 characters are supported when FullCharacterSet is 1. The whole GB18030-2022 Level 1 set is supported when FullCharacterSet is 2.
  • Traditional Chinese:
    • There are 5401 Level 1 and 7652 Level 2 characters (as defined by the Big-5 standard) supported. Level 2 characters are supported when FullCharacterSet is 1.
    • If the setting Kernel.OcrMgr.Asian.CHTIncludesHKSCS is set to TRUE and the selected language is Traditional Chinese, the Asian recognition module can recognize characters also from Hong Kong character set. Note that this setting may cause some decreasing of the accuracy of Chinese Traditional OCR.
  • Japanese:
    • There are 2965 Level 1 and 3390 Level 2 kanji characters (as defined by the JIS X 0208 standard) supported. The 83 Hiragana and 86 Katakana characters are supported as well. Level 2 characters are supported when FullCharacterSet is 1.
    • Our Asian engine recognizes half-width katakana characters, however in its output the unicodes of the corresponding full-width katakana characters appear. So it treats half-width katakanas as if full-width katakanas would be printed in a particular font.
  • Korean:
    • There are 4888 Hanja characters (as defined by the KS X 1001 standard) supported. The 2350 Hangul characters are supported as well.
Default setting Full Character Set = 1 adds: Full Character Set = 2 adds: Non-Han characters English Digits Punct.
Simplified Chinese 3755 (Level 1) + 499 (Level 2) 3008 (Level 2) − 499 (Level 2 used in default setting) 6530 (Extansion A) - 52 10 74
Traditional Chinese 5401 (Level 1) 7652 (Level 2) - - 52 10 86
Japanese 2965 (Level 1) 3390 (Level 2) - Hiragana: 83, Katakana: 86 52 10 92
Korean 4888 - - Hangul: 2350 52 10 87

The 499 Level 2 Simplified Chinese characters are as follows: