OmniPage Capture SDK 22 provides extensive support for Portable Document Format (PDF) files on both the input and output sides. This datasheet gives an overview of that support.

Both PDF input and output are supported on Windows, Linux, and MacOS. PDF output is additionally supported on Embedded Linux and Android. The same applies to PDF_MRC.

PDF Input is supplied in both the Professional Recognition Kit and the OCR Kit. The PDF Output Kit, however, is an optional add-on. For more details, see the topic on Licensing in the General Information help system.

PDF File Format Summary

Name	Adobe Portable Document Format
Format ID	FF_PDF_*
Image load (read)	Yes
Image save (write)	Yes
Image types supported	Loading creates a B/W, an 8-bit grayscale, a paletted, or a 24-bit true-color image in the Engine.
Multi-page supported	Yes
Special note	Supports standard PDF files compliant up to the PDF v2.0 specification.

PDF Input in CSDK 22

PDF input is supported on: Windows, Linux, MacOS.

If you develop a Linux application using CSDK, please see also the Linux specific notes about PDF input.

By default (this can be changed using different settings), the program handles input PDF files as follows:

Bitmap creation
A bitmap is created from the loaded PDF.
Information extraction
After this, additional information is extracted from the PDF, including the following:
1. Information on fonts and the decision whether font substitution is necessary
2. Information on text with the exact position of its letters
3. TAG information, if any.
Pre-processing
The next step is the pre-processing. This step may vary depending on whether the PDF contains textual information or not.
1. If the PDF does not contain text information (it is an image-only PDF), all pre-processing (deskew, auto-rotation, binarization) and other operations run as they do for other image files (TIFF, JPEG, etc.).
  NOTE: Image-on-text PDFs do not undergo text extraction; during processing they are treated as image-only PDFs.
2. If the PDF does contain text information, a specific binarization (developed for PDF files) runs, but without deskew and auto-rotation.
Layout decomposition
In PDF files that contain textual information, this text is extracted. The OCR engine runs on the image, but mainly to search for text areas and other elements on the page resulting in a zone set. Page layout and spacing are also determined. The generated and zoned bitmap (see Step 1) is collated with the extracted text to ensure its correct positioning on the page, including column placement and transfer of graphic elements.
Recognition
In PDF files with no accessible text layer, OCR runs to generate editable text and perform zoning. In PDF files with an accessible text layer, pages are zoned and word boundaries are determined, as described. Occasionally, text extraction may be imperfect, so as a backup, recognition with two-way voting runs on the images and its result is compared to the text information extracted from the PDF. In case of minor differences, recognized characters are corrected according to the ones in the PDF, since that text is more likely to be correct. In case of major differences, the recognition results serve as the final ones, since it is likely that PDF character encoding identification has failed (a non-standard encoding was not detected). If a text PDF with non-standard encoding contains control characters that the recognition process cannot resolve, they are modified according to the setting Kernel.OcrMgr.Codes.CtrlOffset.
Definition of character attributes
Character attributes, such as size and style (bold, italic), can usually be defined using information extracted from the PDF. When PDF text is written in a typeface that is difficult to identify, a font attribute defining process runs.
Other operations
Character and line spacing, paragraph, and table definitions are handled just as they are for image files.
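
The comparison described under Recognition can be pictured with a small model. This is only a sketch: the similarity measure (Python's difflib) and the 0.8 cut-off between "minor" and "major" differences are assumptions made for illustration, and the function name reconcile is hypothetical. The datasheet does not document the criterion the SDK actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off between "minor" and "major" differences; the real
# criterion used by the SDK is not documented in this datasheet.
MINOR_DIFF_THRESHOLD = 0.8


def reconcile(pdf_text: str, ocr_text: str,
              threshold: float = MINOR_DIFF_THRESHOLD) -> str:
    """Choose the final text for a zone from the PDF layer and the OCR result."""
    similarity = SequenceMatcher(None, pdf_text, ocr_text).ratio()
    if similarity >= threshold:
        # Minor differences: the text extracted from the PDF is more likely
        # to be correct, so it overrides the recognized characters.
        return pdf_text
    # Major differences: the PDF character encoding was probably not
    # identified correctly, so the recognition result is kept.
    return ocr_text
```

With this model, a single misread character leaves the PDF text in charge, while a text layer that decodes to unrelated characters is replaced by the OCR result.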

Resolution of the rendered bitmap

By default:
- If the PDF does not have an image, resolution is determined by the value of the setting Kernel.Imf.DefaultDPI.
- If the PDF has an image and there is only "little" text, the resolution of the rendered bitmap will be the maximal resolution of the images. However, this resolution is limited to 300 DPI.
If the setting Kernel.Imf.PDF.Resolution is not 0:
- The given value will be the resolution.
If the setting Kernel.Imf.PDF.LoadOriginalDPI is TRUE:
- If the PDF does not have an image, resolution is determined as in the default case.
- If the PDF has an image, the resolution of the rendered bitmap will be the maximal resolution of the images. However, this resolution is limited to 600 DPI.
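
The rules above can be summarized in a short model. It is a sketch under stated assumptions, not SDK code: the function and parameter names are hypothetical, a non-zero Kernel.Imf.PDF.Resolution is assumed to take precedence over Kernel.Imf.PDF.LoadOriginalDPI, and a PDF that has an image but more than "little" text is assumed to fall back to Kernel.Imf.DefaultDPI. The datasheet does not state either of the last two points.

```python
def rendered_dpi(has_image: bool, max_image_dpi: int, default_dpi: int,
                 pdf_resolution: int = 0, load_original_dpi: bool = False,
                 little_text: bool = True) -> int:
    """Model of how the resolution of the rendered bitmap is chosen."""
    if pdf_resolution != 0:
        # Kernel.Imf.PDF.Resolution: the given value is used as is.
        # (Assumption: this wins over LoadOriginalDPI.)
        return pdf_resolution
    if not has_image:
        # No image in the PDF: Kernel.Imf.DefaultDPI decides.
        return default_dpi
    if load_original_dpi:
        # Kernel.Imf.PDF.LoadOriginalDPI is TRUE: capped at 600 DPI.
        return min(max_image_dpi, 600)
    if little_text:
        # Default case with an image and only little text: capped at 300 DPI.
        return min(max_image_dpi, 300)
    # Assumption: the remaining case is not described in the datasheet.
    return default_dpi
```

For example, a scanned page whose largest image is 400 DPI would be rendered at 300 DPI by default and at 400 DPI with LoadOriginalDPI enabled.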

Handling Encrypted PDF Files

PDF files may be password-protected. There are two types of password: open (or user) and permissions (or owner, or master).

Open passwords can block file opening. When a PDF requires an open password, CSDK 22 cannot open it without that password. Your application must include an interface to accept a password.

As for permissions passwords, CSDK 22 only checks the permissions that block printing or content-copy from the file.

When a PDF requires a permissions password for content-copy, its text content cannot be copied without that password. CSDK 22, however, lets you process a content-copy protected file without giving a permissions password. In this case the encrypted PDF is treated as an image-only one and no textual information can be extracted.

A PDF may also require a permissions password for printing. CSDK 22 will only load a PDF if its printing is not blocked – that is, the user either has this permissions password, or the file is not protected against printing.

When a PDF file is protected by both an open and a permissions password, only the permissions password needs to be given for full access.
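
The password rules in this section can be condensed into one decision function. The sketch below is a model only: the names Access and pdf_access are hypothetical, and "has a password" means that the correct password was supplied. It does not call the SDK.

```python
from enum import Enum


class Access(Enum):
    REJECTED = "rejected"      # the file cannot be loaded
    IMAGE_ONLY = "image-only"  # loaded, but no text can be extracted
    FULL = "full"              # loaded with its text information


def pdf_access(needs_open_password: bool, copy_blocked: bool,
               print_blocked: bool, has_open_password: bool = False,
               has_permissions_password: bool = False) -> Access:
    """Model of how an encrypted input PDF is treated."""
    if has_permissions_password:
        # The permissions password alone gives full access.
        return Access.FULL
    if needs_open_password and not has_open_password:
        # An open password is required and was not given.
        return Access.REJECTED
    if print_blocked:
        # Printing is blocked and no permissions password was given.
        return Access.REJECTED
    if copy_blocked:
        # Content-copy is blocked: processed as an image-only PDF.
        return Access.IMAGE_ONLY
    return Access.FULL
```

An application could use such a check to decide whether to prompt for a password before loading, or to warn that only OCR text will be available.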

PDF Output in CSDK 22

CSDK 22 can produce PDF files on the KernelAPI, RecAPIPlus, and IPRO layers.

PDF Output in KernelAPI

PDF output in KernelAPI is supported on: Windows, Linux, Embedded Linux, MacOS.

For creating image-only PDF files, KernelAPI offers the following format:

FF_PDF (instead of FF_PDF_MIN, FF_PDF_GOOD, FF_PDF_SUPERB).

For creating DTXT image-on-text PDF output files, KernelAPI offers the following format:

DTXT_IOTPDF (instead of DTXT_PDFIOT).

See Saving to a file and Direct TXT Output Converter Module DirectTXT Outputs.

An image-on-text PDF contains the whole image of the original page, with the text behind the image on a separate layer. These PDF files are especially suited to page archiving, because they contain both the original image and the recognized text.

DTXT image-on-text PDF output also provides different PDF quality levels. Select the quality level by calling kRecSetCompressionLevel. See the settings of DTXT image-on-text PDF output.

Saving MRC PDF files in KernelAPI

There is no MRC compression for black-and-white images. If an MRC format is selected for a B/W image, the corresponding non-MRC format is used instead.

For creating image-only PDF files, KernelAPI offers the following format:

FF_PDF_MRC (instead of FF_PDF_MRC_MIN, FF_PDF_MRC_GOOD, FF_PDF_MRC_SUPERB).

For creating DTXT image-on-text PDF output files, KernelAPI offers the following format:

DTXT_IOTPDF_MRC (instead of DTXT_PDFIOT using MRC).

See Saving to a file and Direct TXT Output Converter Module DirectTXT Outputs.

Both image-only MRC PDF files and image-on-text MRC PDF files can be saved. The former are saved by the Image File Handling Module, the latter by the Direct TXT Output Converter Module. The description below uses the terms of the image-only case; the corresponding terms for the image-on-text case follow naturally. For more information, see the section PDF Output in KernelAPI above.

DTXT image-on-text PDF MRC output also provides different PDF quality levels. Select the quality level by calling kRecSetCompressionLevel. See the settings of DTXT image-on-text PDF output.
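
The black-and-white fallback mentioned at the start of this section can be modeled as a simple lookup. This is a sketch, not SDK code: the format identifiers are represented as strings, the function name is hypothetical, and the pairing of each MRC format with its non-MRC counterpart is inferred from the names used in this datasheet.

```python
# MRC format -> assumed non-MRC counterpart (inferred from the identifiers).
NON_MRC_EQUIVALENT = {
    "FF_PDF_MRC": "FF_PDF",
    "DTXT_IOTPDF_MRC": "DTXT_IOTPDF",
}


def effective_format(requested: str, is_black_and_white: bool) -> str:
    """Return the format that would actually be written for an image."""
    if is_black_and_white:
        # No MRC compression exists for B/W images, so an MRC request
        # degrades to the non-MRC format; other formats are unchanged.
        return NON_MRC_EQUIVALENT.get(requested, requested)
    return requested
```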

Briefly about MRC compression

In case of MRC formats the image is saved in multiple layers:

one background layer containing the graphics and the background behind the text,
one (in the case of Level 4 and 5) or more (in the case of Level 1, 2, 3) foreground layers containing the text,
one selector layer (in the case of Level 4 and 5 MRC format).

The background, foreground and selector layers are compressed using different compression algorithms.

When creating an MRC file, CSDK decomposes the image into multiple layers and sub-images. This process includes an algorithm for detecting text. It works without OCR, but it can benefit from the OCR result: no text detection is needed if the HPAGE contains an II_OCR image. Note that the II_OCR image is created during the recognition process and is freed after recognition by default. To keep it, set Kernel.OcrMgr.Images.KeepOcrImage to TRUE before recognition.

Compression methods:

Selector layer: Group4 or JBIG2, depending on the trade-off parameter or on the setting Kernel.Imf.PDF.BWFormat (default: JBIG2).
Foreground (FG) and Background (BG) layers: JPEG or JPEG2000, depending on the trade-off parameter or on the setting Kernel.Imf.PDF.ColorFormat (default: JPEG2000). Refer to the RecAPI documentation (Kernel.Imf.CompressionTradeoff) for details.

Note: The new MRC compression can be used with 5 different predefined levels (1-5) by calling kRecSetCompressionLevel.

Level 1 provides the smallest file size, achieved by intense filtering and compression of the background and by reducing its resolution to a third. CSDK uses low-quality JPEG compression on the background and blurs the area around characters.
Level 2 applies less intense filtering to the background and saves it with higher-quality JPEG compression.
Level 3 is a compromise between document quality and file size. CSDK filters the background even less and only halves the background resolution. This is the default value.
Level 4 provides superior compression by dividing a document into three layers: a binary mask layer (selector layer), a foreground layer, and a background layer.
Level 5 is a new feature. The Kofax OmniPage Capture SDK 22.0 automatically looks for graphics, diagrams, and photos on the page, and parcels these elements into high-quality sections (called Imagettes), placing them on top of Layer 3 at high resolution. At the same time, it smooths out pixel errors and triples the resolution of the text (selector) layer, greatly improving character contours. As a result, PDF readers display text contours more smoothly than in the source image, even at high zoom levels. The quality of the photos stays close to the original. On the other hand, the same image compressed with Level 5 MRC produces a larger file than with Level 4.

For details, see the section about newer image formats and the MRC image compression level comparison subsection of the Imaging Module in the Kofax OmniPage Capture SDK User's Guide.

Deprecated:

FF_PDF_MIN	Minimum image file size
FF_PDF_GOOD	Medium image file size
FF_PDF_SUPERB	Large image file size with high quality
FF_PDF_MRC_MIN	MRC-compressed PDF optimized for minimum file size
FF_PDF_MRC_GOOD	MRC-compressed PDF of medium file size
FF_PDF_MRC_SUPERB	MRC-compressed PDF providing large file size, but high quality

PDF Output in RecAPIPlus and IPRO

PDF output at the RecAPIPlus level is supported on: Windows, Linux, MacOS.

In RecAPIPlus and IPRO, the following PDF output formats and converters are available:

PDF with image on text (Searchable PDF in OmniPage terminology) – A PDF converter where the original (input) image is retained in the foreground with the recognized text hidden in the background (and in the correct position). This format allows the content of an image PDF to become searchable without disrupting the original due to the hidden text layer. Text in a Searchable PDF is positioned directly behind the corresponding image text and is selectable and searchable in popular PDF viewers. This format especially suits archiving and indexing purposes.

PDF - A highly configurable, general PDF output converter. It supports many PDF features, but relies heavily on the position of the recognized characters.

PDF with image substitutes - A special PDF converter, where the suspect words are covered by their image cut out from the original image.

PDF edited – This PDF converter does not rely on the position of the recognized characters, so it can be used even after large new portions of text have been inserted in the editor.

All PDF Output converters have the following features in common:

Compression options, including (for details see the Compression settings):
- Content stream compression (flate)
- JBIG2 compression for black/white images
  (available from PDF v1.4)
- JPEG2000 compression for color images
  (available from PDF v1.5)
- Compression of embedded font files
Appending the output to an existing PDF file
MRC compression of the image for even smaller size.
MRC is short for Multi-Raster Content Technology. It segments images into layers and applies different compression algorithms to each layer, thus optimizing both file size and quality.
Creation of fillable PDF forms (with LFR – Logical Form Recognition)
Compatibility settings for PDF versions 1.0 - 1.7, 2.0
PDF/A compliant output (by modifying the setting Compatibility of the selected PDF converter)
PDF/A is an ISO-standardized PDF file specification (its first part, PDF/A-1, is based on PDF 1.4) designed for two main purposes:
- To generate PDF files that display and handle uniformly over the broadest possible range of operating systems, environments and PDF viewers or editors.
- To generate PDF files that will remain viewable over a long period of time, so that archived material is protected against obsolescence due to technological innovation.
Predefined settings for highest quality or for smallest file size
Tagged PDF file creation based on our layout recognition
Font embedding
Selectable quality for image compression, image resolution and color depth
Outline tree creation for document and page thumbnails
Ability to exclude the text of headers and footers from the output
Digital signing of the created PDF files (see PDF converter settings Converters.Text.PDF*.Signature.*)
Security settings (extract content, modify content, print, etc.) and the setting of open and permissions passwords
Content encryption (40, 128 or 256 bit)
Highlighting of the recognized URLs in the text and/or turning them into clickable links

Saving MRC PDF files using the output converters (in RecAPIPlus)

When a document is saved in a PDF format, the image can be saved using MRC technology. In this case the same compression methods and compression quality are used, and the resolution of the layers is the same, as described previously.

The converter settings Compatibility, UseMRC, Compression.UseJBIG2 and Compression.UseJPEG2000 affect the compression. They can be used in the following combinations:

`Compatibility`	`UseMRC`	`Compression.UseJBIG2`	`Compression.UseJPEG2000`
`R2ID_PDF_FORCESIZE`	`R2ID_PDFMRC_MIN`	TRUE	TRUE
`R2ID_PDF_FORCEQUALITY`	`R2ID_PDFMRC_NO`
`R2ID_PDF20`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDF17`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDF16`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDF15`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDF14`	any possible value	TRUE/FALSE	FALSE
`R2ID_PDF13`	any possible value	FALSE	FALSE
`R2ID_PDF12` and below	`R2ID_PDFMRC_NO`
`R2ID_PDFA` (deprecated)	any possible value	TRUE/FALSE	FALSE
`R2ID_PDFA1B` (instead of R2ID_PDFA)	any possible value	TRUE/FALSE	FALSE
`R2ID_PDFA2B`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA3B`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA2U`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA3U`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA1A`	any possible value	TRUE/FALSE	FALSE
`R2ID_PDFA2A`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA3A`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA4`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA4E`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFA4F`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFUA1`	any possible value	TRUE/FALSE	TRUE/FALSE
`R2ID_PDFUA2`	any possible value	TRUE/FALSE	TRUE/FALSE
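
The table can also be read as a validity check for a planned combination of settings. The sketch below transcribes the JBIG2 and JPEG2000 columns into a lookup; it is illustrative only, the names CODEC_SUPPORT and combination_allowed are hypothetical, and the identifiers are handled as plain strings rather than SDK constants. The two preset rows and `R2ID_PDF12` and below are left out because the table gives fixed or no values for them.

```python
# (JBIG2 allowed, JPEG2000 allowed) for each Compatibility value,
# transcribed from the table above.
CODEC_SUPPORT = {"R2ID_PDF13": (False, False)}
for _name in ("R2ID_PDF14", "R2ID_PDFA", "R2ID_PDFA1B", "R2ID_PDFA1A"):
    CODEC_SUPPORT[_name] = (True, False)
for _name in ("R2ID_PDF20", "R2ID_PDF17", "R2ID_PDF16", "R2ID_PDF15",
              "R2ID_PDFA2B", "R2ID_PDFA3B", "R2ID_PDFA2U", "R2ID_PDFA3U",
              "R2ID_PDFA2A", "R2ID_PDFA3A", "R2ID_PDFA4", "R2ID_PDFA4E",
              "R2ID_PDFA4F", "R2ID_PDFUA1", "R2ID_PDFUA2"):
    CODEC_SUPPORT[_name] = (True, True)


def combination_allowed(compatibility: str, use_jbig2: bool,
                        use_jpeg2000: bool) -> bool:
    """Check a Compatibility / UseJBIG2 / UseJPEG2000 combination."""
    jbig2_ok, jpeg2000_ok = CODEC_SUPPORT[compatibility]
    # A codec may always be switched off; it may be switched on only
    # where the table lists TRUE/FALSE for it.
    return (jbig2_ok or not use_jbig2) and (jpeg2000_ok or not use_jpeg2000)
```

The pattern in the table matches the availability noted earlier: JBIG2 from PDF v1.4 and JPEG2000 from PDF v1.5.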

The Pictures setting may modify the resolution of the layers. If the resolution of a layer is higher than the resolution specified by the Pictures setting, the layer is transformed to the specified resolution, so it is recommended to leave this setting at its default (R2_DPI_ORIGINAL) when saving MRC PDF.

The PictureColor setting may change the bit depth of the layers, so it is recommended to leave it at its default (R2_BPP_ORIGINAL).

Saving new encrypted PDF files

CSDK can save encrypted PDF files protected by an open (or user) password or a permissions (or owner, or master) password (see also the description of encrypted PDF files above). This option is controlled entirely by settings. The setting Kernel.Imf.PDF.PDFSecurity.Type determines the encryption method used. The passwords can also be stored in settings (Kernel.Imf.PDF.PDFSecurity.OwnerPassword and UserPassword).

In addition, permission flags can be set for the created PDF file to enable:

modifying the document contents,
extracting text and graphics from the document to support accessibility for users with disabilities,
copying text and graphics,
adding or modifying text annotations,
printing the document,
filling in forms and signing the document,
assembling the document: inserting, rotating and deleting pages.

The created PDF file can be opened for the enabled operations by using the open password. If one has the permissions password, all the operations are enabled.
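
The relationship between permission flags and the two passwords can be sketched as follows. The bit positions are those of the standard security handler in the PDF specification, not SDK setting names, which this datasheet does not list; the class and function names are hypothetical.

```python
from enum import IntFlag


class Permission(IntFlag):
    # Bit positions (1-based) as defined by the PDF specification.
    PRINT = 1 << 2                      # bit 3: printing the document
    MODIFY = 1 << 3                     # bit 4: modifying the contents
    COPY = 1 << 4                       # bit 5: copying text and graphics
    ANNOTATE = 1 << 5                   # bit 6: text annotations
    FILL_FORMS = 1 << 8                 # bit 9: filling in forms, signing
    EXTRACT_FOR_ACCESSIBILITY = 1 << 9  # bit 10: accessibility extraction
    ASSEMBLE = 1 << 10                  # bit 11: insert, rotate, delete pages


ALL_PERMISSIONS = Permission(0)
for _flag in Permission:
    ALL_PERMISSIONS |= _flag


def effective_permissions(granted: Permission, password: str) -> Permission:
    """Operations available after opening with the given password type."""
    if password == "permissions":
        # The permissions password enables every operation.
        return ALL_PERMISSIONS
    if password == "open":
        # The open password enables only what the flags grant.
        return granted
    raise ValueError("password must be 'open' or 'permissions'")
```

For a file that grants only printing and copying, a reader who opens it with the open password cannot modify it, while the holder of the permissions password can.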

Modifying an existing encrypted PDF file

Existing encrypted PDF files can also be modified and saved. In this case the password of the existing file is needed to process it. The file and the given password determine which operations can be performed, so the setting Kernel.Imf.PDF.PDFSecurity.Type has no effect. The password (either the open or the permissions one) can be specified in the setting Kernel.Imf.PDF.PDFSecurity.ProcessPassword.

Improvements after CSDK 15

Generating editable output from PDF files has been sped up: more advanced technology is applied to make zoning faster, achieve higher OCR accuracy, improve output quality and make the resulting files more usable when edited further in target applications. This is achieved by creating two images whenever the input is a PDF or XPS file with a text layer. One is a composite image with all PDF information; the second contains only a background image without any text. This is especially useful for pages where text wraps around pictures irregularly. Further speed-up is achieved by assessing image quality and layout complexity. A faster OCR algorithm is now applied to high-quality pages with simple layouts.

This technology cannot be applied to active PDF forms, and works only in accurate mode. In cases where this technology cannot be applied, there is an automatic and seamless fall-back to the old algorithm.

Another innovation is support for creating linearized PDF files, which are optimized for efficient web display. The resulting PDF adheres to Appendix F (Linearized PDF) of the PDF Reference. After creating the PDF in the usual way (any PDF flavor), the CSDK reorders the file contents and adds hint tables. As a result, the first page of the PDF loads quickly into a web page, with the remaining pages loaded while it is being viewed. Browsers can determine which page elements to present first (typically headings and text) and which can follow (heavier pictures etc.). Linearization also makes jumping to new pages in the PDF document more efficient.

Settings relating to the creation of linearized PDF are: Kernel.Imf.PDF.Linearized, Kernel.DTxt.PDF.Linearized and Converters.Text.PDF*.Linearized.

Linearized file creation also works with Asian-language PDF files.
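
To verify that an output file was written in linearized form, an application can look for the linearization parameter dictionary, which the PDF specification requires to lie within the first 1024 bytes of the file. The following is a heuristic sketch, independent of the SDK; the function name is hypothetical, and the check does not validate the hint tables.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic test on the first bytes of a PDF file.

    A linearized file starts with the PDF header and carries a dictionary
    with a /Linearized key inside its first 1024 bytes.
    """
    return head.startswith(b"%PDF-") and b"/Linearized" in head[:1024]
```

In practice the argument would be the result of reading the first 1024 bytes of the output file in binary mode.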

Support is introduced for PDF versions 1.6 and 1.7, including support for the AES encryption system (128 and 256 bit). File opening is handled through existing mechanisms, while new saving options are provided for applying AES encryption to files.

Original orientation can be forced for PDF Searchable output

In response to client requests, it is now possible to preserve the original orientation when outputting to PDF Searchable (Image-on-text) files. To do this, the setting Kernel.Img.KeepOriginalImage or the corresponding function kRecSetPreserveOriginalImg must be used for each page involved. If these images are kept, the orientation on PDF Searchable output pages remains the same as that of the input, overriding any auto-rotation decisions that may have been made during preprocessing.

RecPDF API for managing page-level manipulations of PDF files

This part of the SDK is an extension to KernelAPI and RecAPIPlus. It manages PDF files at the page level: it can copy, move, or delete pages of PDF files, extract information from them, or change their pages. RecPDF is a mostly operation-based API. Page-level modifications are passed to an operation, and at the end the operation executes all of the changes at once. An operation can also be cancelled if it turns out that no modification is needed. For more information, see the documentation of the RecPDF Module.
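
The operation-based pattern can be illustrated with a small stand-alone model. Every name in it (PageOperation, move, delete, cancel, execute) is hypothetical and chosen for this sketch; it does not reproduce the RecPDF function set, which is described in the RecPDF Module documentation. Pages are represented by a plain list.

```python
class PageOperation:
    """Collects page-level changes and applies them in one step."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._pending = []

    def move(self, src: int, dst: int) -> None:
        # Queue a move; nothing is changed yet.
        self._pending.append(("move", src, dst))

    def delete(self, index: int) -> None:
        # Queue a deletion; nothing is changed yet.
        self._pending.append(("delete", index))

    def cancel(self) -> None:
        # Drop all queued changes, e.g. when no modification is needed.
        self._pending.clear()

    def execute(self):
        # Apply the queued changes in order to a working copy and
        # return the resulting page list.
        pages = list(self._pages)
        for change in self._pending:
            if change[0] == "move":
                pages.insert(change[2], pages.pop(change[1]))
            else:
                del pages[change[1]]
        self._pending.clear()
        return pages
```

Queuing the changes first means the source file is rewritten once rather than after every individual modification.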