XML (Extensible Markup Language) is a simple and flexible text format based on SGML. XML is a W3C recommendation (http://www.w3.org/XML). It was designed to describe structured data, to store it and to send it over the Web. XML tags are not predefined; each application has to define its own. Our XML output uses an XSD schema to define its tags and attributes (http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd).

We recommend this format to SDK users who want to send the output over the Web or to process it further. It can be sent easily because it is a plain text file. It can be processed easily because many libraries and tools are available for it. The XML file can be parsed with the help of MSXML, .NET, Xerces … The user can parse the XML file to create a tree (DOM) in memory that contains all the data in the file (e.g. MSXML contains a DOM implementation). The DOM can also be used to modify and save the data.

The XML file can also be parsed with the help of SAX (for example, the SAX part of MSXML). SAX is a read-only, forward-only interface; in exchange, it does not have to read all the data into memory at the same time.

The other way to process the XML file is XSLT. It can transform an XML file into another text file (plain text, HTML, XML). The user has to create an XSL file that contains the rules for the transformation.
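The difference between the DOM and SAX approaches can be illustrated with a minimal sketch. It uses the parsers from the Python standard library rather than MSXML or Xerces, and a small sample document rather than real SDK output; the class name NameCollector and the variable names are illustrative only.

```python
import xml.dom.minidom
import xml.sax

# Small sample document (not real SDK output).
SAMPLE = """<?xml version="1.0"?>
<document author="Name of author" date="Created at">
<title>Document Title</title>
<body>The text of document.</body>
</document>"""

# DOM: the whole document is loaded into an in-memory tree
# that can be read, modified and serialized again.
dom = xml.dom.minidom.parseString(SAMPLE)
title = dom.getElementsByTagName("title")[0]
dom_title = title.firstChild.data          # read the text node
title.firstChild.data = "New Title"        # modify it in place
modified = dom.documentElement.toxml()     # serialize the changed tree


# SAX: the parser reports elements one by one, in document order.
# The handler only sees the current event and cannot modify the document.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


collector = NameCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), collector)
sax_names = collector.names
```

The DOM variant is convenient when the data has to be changed or visited several times; the SAX variant is usually the better fit for large files that only need one pass.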

An XML file contains structured data, represented as a tree. Each node in the tree may include attributes, text and other nodes. The building blocks are tags (e.g. <name>), attributes (<name attribute="value">), text (<name> xy </name>) and comments (<!-- comment -->).

An XML file has to be "Well Formed". A "Well Formed" XML document has correct XML syntax. An XML file may also be "Valid". A "Valid" XML document conforms to its schema. Parsers cannot read an XML file that is not "Well Formed", but some (e.g. MSXML) can read it even if it is not "Valid".

The schema (or DTD) defines the legal elements of an XML document. We support the XSD schema format, which is maintained by the W3C (our schema: http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd), and we can also support the obsolete XDR schema. The schema defines the document structure: the list of legal elements and attributes, the possible relationships between elements, the order and number of child elements, the types of elements and attributes, and so on.
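As an illustration of the "Well Formed" requirement, the following sketch checks whether a string can be parsed at all. It uses the Python standard library; the function name is_well_formed and the sample strings are hypothetical. Checking whether a document is "Valid" against an XSD schema needs a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Correct syntax: quoted attribute value, properly nested tags.
good = '<document author="A"><title>T</title></document>'
# Closing tags are in the wrong order.
bad_nesting = '<document><title>T</document></title>'
# Attribute value is not quoted.
bad_quotes = '<document author=A></document>'
```

Here is_well_formed(good) returns True, while the other two strings are rejected by the parser.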

This is a simple XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<document author="Name of author" date="Created at">
<title>Document Title</title>
<body>The text of document.</body>
</document>

Handling illegal codes

Some character codes are illegal in XML files: control characters and illegal Unicode characters. Both kinds can occur in the LETTER result:

Control characters can come from, for example, a binary barcode when the setting Kernel.OcrMgr.Codes.CtrlOffset is 0.
Illegal Unicode characters can come from a textual PDF.

The XML output of the OmniPage CSDK handles these cases with the settings Kernel.OcrMgr.Codes.CtrlOffsetXml and Kernel.OcrMgr.Codes.CtrlMaskXml. If CtrlOffsetXml is positive, the illegal code is masked with CtrlMaskXml and then shifted by CtrlOffsetXml. If CtrlOffsetXml is negative, the illegal code is replaced with the negative of CtrlOffsetXml. If CtrlOffsetXml is 0, these conversions are effectively disabled, and the XML will be invalid whenever such codes occur.
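The three cases can be sketched as follows. This is only an illustration of the rule as described, not the SDK's implementation: it assumes that "masked" means a bitwise AND and "shifted" means addition, and the function name and the sample setting values are hypothetical.

```python
def convert_illegal_code(code, ctrl_offset_xml, ctrl_mask_xml):
    """Convert one illegal character code according to the two settings.

    Assumptions (not confirmed by the SDK documentation):
    masking is a bitwise AND, shifting is an addition.
    """
    if ctrl_offset_xml > 0:
        # Positive offset: mask the code, then shift it.
        return (code & ctrl_mask_xml) + ctrl_offset_xml
    if ctrl_offset_xml < 0:
        # Negative offset: every illegal code becomes the same replacement.
        return -ctrl_offset_xml
    # Zero: conversion disabled, the code is written unchanged.
    return code


# Hypothetical setting values, chosen only for the example.
shifted = convert_illegal_code(0x0001, 0x2400, 0x00FF)
replaced = convert_illegal_code(0x0001, -0x003F, 0x00FF)
unchanged = convert_illegal_code(0x0001, 0, 0x00FF)
```

With a positive offset, distinct illegal codes stay distinguishable after conversion; with a negative offset, they all collapse into one replacement character.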

The illegal codes that are shifted:

0x0000 - 0x001F
0xD800 - 0xDFFF
0xFFFE - 0xFFFF
0x110000 and above
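A range check for these codes might look like the sketch below. The names are hypothetical, and the open-ended last range is read as "0x110000 and above". The check follows the list literally, so tab (0x09), line feed (0x0A) and carriage return (0x0D) also count as illegal here, although XML 1.0 itself permits them.

```python
# Closed ranges taken from the list above.
ILLEGAL_RANGES = [
    (0x0000, 0x001F),  # control characters
    (0xD800, 0xDFFF),  # surrogate code points
    (0xFFFE, 0xFFFF),  # non-characters at the end of the BMP
]
# Assumption: the last, open-ended range starts here (beyond the Unicode range).
FIRST_OUT_OF_RANGE = 0x110000


def is_illegal_xml_code(code):
    """Return True if the code falls into one of the listed ranges."""
    if code >= FIRST_OUT_OF_RANGE:
        return True
    return any(low <= code <= high for low, high in ILLEGAL_RANGES)
```

For example, is_illegal_xml_code(0x001F) is True, while is_illegal_xml_code(0x0020) (the space character) is False.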