For each CCJK language, all the generally used characters that are defined in the country's character encoding standard are supported. The recognized Han characters are the ones defined by the following encoding standards:
| Encoding standard | Level 1 | Level 2 | CJK Unified Ideographs
Extension A |
Simplified Chinese | GB 2312 | 3755 | 3008 | |
GB 1830-2022 | | | 6530 |
Traditional Chinese | Big-5 | 5401 | 7652 | |
Japanese | JIS X 0208 | 2965 | 3390 | |
Korean | KS X 1001 | 4888 | |
- Note
- GB18030-2022 Level 1 is equivalent to GB 2312 Level 1 + Level 2 + Extension A.
With default settings the set of recognized characters is as follows:
- Simplified Chinese:
- All GB 2312 Level 1 and some frequently used Level 2 characters
- Traditional Chinese:
- Japanese:
- Korean:
In each case, English letters, numerals, and punctuation are also supported.
Using the Kernel.OcrMgr.Asian.FullCharacterSet setting this set can be extended:
- Setting it to 1:
- All Level 2 characters are added to the recognized set for the Chinese and Japanese languages. (Korean doesn't have level distinctions.)
- Setting it to 2:
- In case of Simplified Chinese language, the CJK Unified Ideographs Extension A characters are also added, covering Level 1 of the GB18030-2022 standard.
Details:
- Simplified Chinese:
- There are 3755 Level 1, 3008 Level 2 characters (as defined by the GB 2312 standard) and 6530 Extension A characters(as defined by the GB18030-2022 standard) supported. 499 frequently used GB 2312 Level 2 characters (see later) are supported even with the default non-FullCharacterSet setting, while the remaining 2509 Level 2 characters are supported when
FullCharacterSet
is 1. The whole GB18030-2022 Level 1 set is supported when FullCharacterSet
is 2.
- Traditional Chinese:
- There are 5401 Level 1 and 7652 Level 2 characters (as defined by the Big-5 standard) supported. Level 2 characters are supported when
FullCharacterSet
is 1.
- If the setting Kernel.OcrMgr.Asian.CHTIncludesHKSCS is set to
TRUE
and the selected language is Traditional Chinese, the Asian recognition module can recognize characters also from Hong Kong character set. Note that this setting may cause some decreasing of the accuracy of Chinese Traditional OCR.
- Japanese:
- There are 2965 Level 1 and 3390 Level 2 kanji characters (as defined by the JIS X 0208 standard) supported. The 83 Hiragana and 86 Katakana characters are supported as well. Level 2 characters are supported when
FullCharacterSet
is 1.
- Our Asian engine recognizes half-width katakana characters, however in its output the unicodes of the corresponding full-width katakana characters appear. So it treats half-width katakanas as if full-width katakanas would be printed in a particular font.
- Korean:
- There are 4888 Hanja characters (as defined by the KS X 1001 standard) supported. The 2350 Hangul characters are supported as well.
| Default setting | Full Character Set = 1 adds: | Full Character Set = 2 adds: | Non-Han characters | English | Digits | Punct. |
Simplified Chinese | 3755 (Level 1) + 499 (Level 2) | 3008 (Level 2) − 499 (Level 2 used in default setting) | 6530 (Extansion A) | - | 52 | 10 | 74 |
Traditional Chinese | 5401 (Level 1) | 7652 (Level 2) | - | - | 52 | 10 | 86 |
Japanese | 2965 (Level 1) | 3390 (Level 2) | - | Hiragana: 83, Katakana: 86 | 52 | 10 | 92 |
Korean | 4888 | - | - | Hangul: 2350 | 52 | 10 | 87 |
The 499 Level 2 Simplified Chinese characters are as follows: