Fraud detection using OCR font switch
Use the Font Recognition switch to detect potential fraud and tampering with processed documents. The HOCR file reflects the font style (Bold, Italics, and Underline) and font size if the Font switch is turned ON in the OMNIPAGE_HOCR plugin.
This allows the user to detect any data that has been manually altered or added to the documents. By default, the Font switch is set to OFF.
For example, suppose a field in a document originally contains the amount 1000 in a size 11 font. If this value is manually changed to 41000 and the 4 is written in a size 12 font, the HOCR file records the font size and style of the text. The mismatch helps the user identify that the document has been tampered with.
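The check described in this example can be sketched in a few lines of Python. This is only an illustration, not part of the product: the function name `flag_size_outliers` and the input format (a list of text and font size pairs already read from the HOCR file) are assumptions made for the sketch.

```python
from collections import Counter


def flag_size_outliers(items):
    """Return (index, text, size) for items whose size differs from the most common size.

    `items` is a hypothetical list of (text, font_size) pairs, for example the
    characters or words of one field as read from an HOCR file.
    """
    if not items:
        return []
    # The most frequent size is treated as the "expected" size of the field.
    size_counts = Counter(size for _, size in items)
    dominant_size, _ = size_counts.most_common(1)[0]
    return [
        (index, text, size)
        for index, (text, size) in enumerate(items)
        if size != dominant_size
    ]


# The amount 41000, where the leading 4 was added in a size 12 font.
amount_field = [("4", 12), ("1", 11), ("0", 11), ("0", 11), ("0", 11)]
print(flag_size_outliers(amount_field))  # [(0, '4', 12)]
```

A flagged item is only a hint that the value should be reviewed, since legitimate documents can also mix font sizes.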
Tesseract does not provide any information on font detection. This feature is available only in the OmniPage OCR engine.
Fraud detection includes the following features:
- The HOCR schema includes font information from the data fetched by the OmniPage OCR engine.
- The OMNIPAGE_HOCR plugin includes an ON/OFF switch which the user can configure to retrieve font information.
- The following Web Services include font information in the HOCR file:
  - ocrClassifyExtract
  - initiateOCRClassifyExtractService
  - OcrClassifyExtractSearchablePDF
  - executeMobileUpload
  - extractFieldFromHocr
  - extractKV
  - classifyImage
  - classifyBarcodeImage
  - classifyHocr
  - classifyMultiPageHocr
  - decryptBatchInstanceHocrXml
  - decryptLuceneClassificationHocrXml
  - decryptTestHocrXml
  - keywordClassification
  - ocrClassify
  - ExtractKVForDocumentType
  - createHOCRforBatchClass
  - tableExtractionHOCR
- The following Web Service can be configured to obtain font information in the HOCR file:
  - createOCR (a new parameter fontSwitch with an ON/OFF setting has been added to the input .xml file)
Fetch font information
Fetch font information through OMNIPAGE_HOCR
- Navigate to and add the OMNIPAGE_HOCR plugin.
- Open the OMNIPAGE_HOCR plugin and turn ON the OmniPage Font Switch.
- Click Apply to update the Batch Class configuration.
- Click Deploy to apply the changes in the workflow.
- Process a batch with the Font Switch ON.
  The newly generated HOCR schema includes the font size of each character in the span. The HOCR file includes the <UnicodeCharacters> tag, which contains the value and size of each character. Whether the OmniPage Font Switch is turned ON or OFF, the style information remains the same: the <Style> tag contains a value of None by default.
- To see the difference in the HOCR schema when the font switch is turned off, do the following:
  - Turn OFF the OmniPage Font Switch and save your changes.
  - Process a batch.
    The information about font family and size is not fetched when the switch is turned OFF.
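To compare the HOCR output of the two runs, the font information can be read with a short script. The following is a minimal sketch that works on a made-up fragment: only the `<UnicodeCharacters>` and `<Style>` tags are named in this document, while the `<Span>`, `<UnicodeCharacter>`, `<Value>`, and `<Size>` element names and their nesting are assumptions chosen for illustration and may not match the actual HOCR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The real HOCR layout may differ; adjust the
# element names below to match the file generated by your batch.
SAMPLE_SPAN = """
<Span>
  <Value>1000</Value>
  <Style>None</Style>
  <UnicodeCharacters>
    <UnicodeCharacter><Value>1</Value><Size>11</Size></UnicodeCharacter>
    <UnicodeCharacter><Value>0</Value><Size>11</Size></UnicodeCharacter>
    <UnicodeCharacter><Value>0</Value><Size>11</Size></UnicodeCharacter>
    <UnicodeCharacter><Value>0</Value><Size>11</Size></UnicodeCharacter>
  </UnicodeCharacters>
</Span>
"""


def read_font_info(span_xml):
    """Return (style, [(character, size), ...]) for one span.

    The character list is empty when the span carries no per-character
    font information, as expected when the font switch is OFF.
    """
    span = ET.fromstring(span_xml)
    style = span.findtext("Style", default="None")
    characters = []
    for char in span.iter("UnicodeCharacter"):  # assumed element name
        value = char.findtext("Value", default="")
        size_text = char.findtext("Size")  # assumed element name
        size = int(size_text.strip()) if size_text else None
        characters.append((value, size))
    return style, characters


print(read_font_info(SAMPLE_SPAN))
print(read_font_info("<Span><Value>1000</Value></Span>"))
```

Running the same function on a span from a batch processed with the switch ON and on one processed with the switch OFF makes the difference in the generated files easy to see.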
The OmniPage OCR engine recognizes combinations of font styles and returns comma-separated values when multiple styles are detected. However, it does not recognize the size of individual characters: all characters in a word are always reported as having the same size, even though some letters might be capitalized.
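A comma-separated style value can be split into individual styles before comparing words. The sketch below is illustrative only: the helper name `parse_styles` is hypothetical, and the exact spelling of the style names in the HOCR file (assumed here to be Bold, Italics, and Underline, with None meaning no style) should be checked against a generated file.

```python
def parse_styles(style_value):
    """Split a comma-separated style value into a set of style names.

    A missing value, an empty value, or the default value "None" yields
    an empty set.
    """
    if style_value is None:
        return set()
    styles = {part.strip() for part in style_value.split(",")}
    styles.discard("")      # ignore empty fragments such as a trailing comma
    styles.discard("None")  # "None" is the default, meaning no style
    return styles


print(parse_styles("Bold,Italics") == {"Bold", "Italics"})  # True
print(parse_styles("None") == set())                        # True
```

Comparing the resulting sets for neighboring words in the same field is one way to spot a word whose style differs from the rest.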