OmniPage Capture SDK 22 provides extensive support for Portable Document Format (PDF) files on both the input and output sides. This datasheet gives an overview of this area.
Both PDF input and output are supported on: Windows, Linux, MacOS. In addition, PDF output is supported on Embedded Linux and Android. The same applies to PDF_MRC.
PDF Input is supplied in both the Professional Recognition Kit and the OCR Kit. The PDF Output Kit, however, is an optional add-on. For more details, see the topic on Licensing in the General Information help system.
Name | Adobe Portable Document Format |
Format ID | FF_PDF_* |
Image load (read) | Yes |
Image save (write) | Yes |
Image types supported | As the result of the image loading process, either a B/W, an 8-bit grayscale, paletted, or a 24-bit true-color image is created in the Engine. |
Multi-page supported | Yes |
Special note | Supports standard PDF files compliant up to the PDF v2.0 specification. |
PDF input is supported on: Windows, Linux, MacOS.
If you develop a Linux application using CSDK, see also the Linux-specific notes about PDF input.
By default (this can be changed using different settings), the program handles input PDF files as follows:
Kernel.Imf.DefaultDPI.
Kernel.Imf.PDF.Resolution is not 0:
Kernel.Imf.PDF.LoadOriginalDPI is TRUE:
PDF files may be password-protected. Passwords have two types: open (or user) and permissions (or owner, or master).
Open passwords can block file opening. When a PDF requires an open password, CSDK 22 cannot open it without that password, so your application must include an interface to accept one.
As for permissions passwords, CSDK 22 only checks the permissions that block printing or content-copy from the file.
When a PDF requires a permissions password for content-copy, its text content cannot be copied without that password. CSDK 22, however, lets you process a content-copy protected file without giving a permissions password. In this case the encrypted PDF is treated as an image-only one and no textual information can be extracted.
A PDF may also require a permissions password for printing. CSDK 22 will only load a PDF if its printing is not blocked – that is, the user either has this permissions password, or the file is not protected against printing.
When a PDF file is protected by both an open and a permissions password, only the permissions password needs to be given for full access.
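The password rules above can be summarized as a small decision function. The following is a minimal Python sketch that only models the documented behavior; it does not call the SDK. The names `LoadResult` and `classify_pdf_load` are hypothetical, and the sketch assumes that any password supplied is the correct one.

```python
from enum import Enum


class LoadResult(Enum):
    REFUSED = "refused"        # the file cannot be loaded
    IMAGE_ONLY = "image-only"  # loaded, but no text can be extracted
    FULL = "full"              # loaded with access to the text content


def classify_pdf_load(needs_open_pw, printing_blocked, copy_blocked,
                      has_open_pw=False, has_permissions_pw=False):
    """Hypothetical model of the rules described above (not an SDK call)."""
    # The permissions password alone is enough for full access.
    if has_permissions_pw:
        return LoadResult.FULL
    # Without it, an open-protected file needs the open password.
    if needs_open_pw and not has_open_pw:
        return LoadResult.REFUSED
    # A print-protected file is loaded only with the permissions password.
    if printing_blocked:
        return LoadResult.REFUSED
    # A copy-protected file is still processed, but as an image-only PDF.
    if copy_blocked:
        return LoadResult.IMAGE_ONLY
    return LoadResult.FULL
```

For example, `classify_pdf_load(False, False, True)` yields `LoadResult.IMAGE_ONLY`: the file is processed, but its text layer is not available.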
CSDK 22 can produce PDF files on the KernelAPI, RecAPIPlus, and IPRO layers.
PDF output in KernelAPI is supported on: Windows, Linux, Embedded Linux, MacOS.
For creating image-only PDF files, KernelAPI offers the following format:
For creating DTXT image-on-text PDF output files, KernelAPI offers the following format:
See Saving to a file and Direct TXT Output Converter Module DirectTXT Outputs.
A DTXT image-on-text PDF contains the whole image of the original page, with the text behind the image on a separate layer. These PDF files are especially suited to page archiving, because they contain both the original image and the recognized text.
DTXT image-on-text PDF output also offers different PDF quality levels. Select the desired quality level by calling kRecSetCompressionLevel. See the settings of DTXT image-on-text PDF output.
There is no MRC compression for black-and-white images. Thus, if an MRC format is selected for a B/W image, the corresponding non-MRC format is used.
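The black-and-white fallback can be pictured as a simple lookup. The Python sketch below is illustrative only and does not call the SDK. It uses the FF_PDF_* format IDs that appear in this datasheet, but the pairing of each MRC ID with the non-MRC ID of the same quality name is an assumption, and `effective_format` is a hypothetical helper.

```python
# Assumed pairing: each MRC format ID falls back to the non-MRC ID
# with the same quality name. The datasheet does not spell this out.
NON_MRC_FALLBACK = {
    "FF_PDF_MRC_MIN": "FF_PDF_MIN",
    "FF_PDF_MRC_GOOD": "FF_PDF_GOOD",
    "FF_PDF_MRC_SUPERB": "FF_PDF_SUPERB",
}


def effective_format(requested_format, is_black_and_white):
    """Return the format that would be used for the given image type."""
    if is_black_and_white:
        # MRC is not applied to B/W images; other formats pass through.
        return NON_MRC_FALLBACK.get(requested_format, requested_format)
    return requested_format
```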
For creating image-only PDF files, KernelAPI offers the following format choices:
For creating DTXT image-on-text PDF output files, KernelAPI offers the following format:
See Saving to a file and Direct TXT Output Converter Module DirectTXT Outputs.
Both image-only MRC files and image-on-text MRC PDF files can be saved. The former are saved by the Image File Handling Module, the latter by the Direct TXT Output Converter Module. The description below uses the terms of the image-only case; the corresponding terms for the image-on-text case follow naturally. For more information, see the above section about PDF Output in KernelAPI.
DTXT image-on-text PDF MRC output also offers different PDF quality levels. Select the desired quality level by calling kRecSetCompressionLevel. See the settings of DTXT image-on-text PDF output.
Briefly about MRC compression
In case of MRC formats the image is saved in multiple layers:
The background, foreground and selector layers are compressed using different compression algorithms.
When creating an MRC file, CSDK decomposes the image into multiple layers and sub-images. This process includes an algorithm for detecting text. It works without OCR, though it can benefit from the OCR result: no text detection is needed if the HPAGE contains an II_OCR image. Note that the II_OCR image is created during the recognition process and by default is freed after recognition. To keep it, the setting Kernel.OcrMgr.Images.KeepOcrImage must be set to TRUE before the recognition.
Compression methods:
Note: The new MRC compression can be used with 5 different predefined levels (1-5) by calling kRecSetCompressionLevel.
For details see the section about newer image formats. See the Kofax Omnipage Capture SDK User's Guide for more details, in the Imaging Module, MRC image compression level comparison subsection.
Deprecated:
FF_PDF_MIN | Minimum image file size |
FF_PDF_GOOD | Medium image file size |
FF_PDF_SUPERB | Large image file size with high quality |
FF_PDF_MRC_MIN | MRC-compressed PDF optimized for minimum file size |
FF_PDF_MRC_GOOD | MRC-compressed PDF of medium file size |
FF_PDF_MRC_SUPERB | MRC-compressed PDF providing large file size, but high quality |
PDF output at the RecAPIPlus level is supported on: Windows, Linux, MacOS.
In RecAPIPlus and IPRO, the following PDF output formats and converters are available:
PDF with image on text (Searchable PDF in OmniPage terminology) – A PDF converter where the original (input) image is retained in the foreground with the recognized text hidden in the background (and in the correct position). This format allows the content of an image PDF to become searchable without disrupting the original due to the hidden text layer. Text in a Searchable PDF is positioned directly behind the corresponding image text and is selectable and searchable in popular PDF viewers. This format especially suits archiving and indexing purposes.
PDF - A highly configurable, general PDF output converter. It supports many PDF features, but relies heavily on the position of the recognized characters.
PDF with image substitutes - A special PDF converter, where the suspect words are covered by their image cut out from the original image.
PDF edited – This PDF converter does not rely on the position of the recognized characters, so it can be used even after large new portions of text have been inserted in the editor.
All PDF Output converters have the following features in common:
When a document is saved in a PDF format, the image can be saved using MRC technology. In this case, the same compression methods and compression quality are used, and the resolution of the layers is also the same as described previously.
The following converter settings affect the compression:
The following combinations can be used:
Compatibility | UseMRC | Compression.UseJBIG2 | Compression.UseJPEG2000 |
R2ID_PDF_FORCESIZE | R2ID_PDFMRC_MIN | TRUE | TRUE |
R2ID_PDF_FORCEQUALITY | R2ID_PDFMRC_NO | ||
R2ID_PDF20 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF17 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF16 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF15 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDF14 | any possible value | TRUE/FALSE | FALSE |
R2ID_PDF13 | any possible value | FALSE | FALSE |
R2ID_PDF12 and below | R2ID_PDFMRC_NO | ||
R2ID_PDFA (deprecated) | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA1B (instead of R2ID_PDFA) | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA2B | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3B | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA2U | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3U | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA1A | any possible value | TRUE/FALSE | FALSE |
R2ID_PDFA2A | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA3A | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA4 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA4E | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFA4F | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFUA1 | any possible value | TRUE/FALSE | TRUE/FALSE |
R2ID_PDFUA2 | any possible value | TRUE/FALSE | TRUE/FALSE |
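The table above can be encoded as data and used to check a planned combination of settings before saving. The following Python sketch is only a model of the table and does not call the SDK. The helper names `rule` and `is_allowed` are hypothetical, the setting values are represented as plain strings and booleans, and blank cells are treated as unconstrained, which is an assumption.

```python
ANY = None  # "any possible value", "TRUE/FALSE", or a blank cell in the table


def rule(use_mrc=ANY, jbig2=ANY, jpeg2000=ANY):
    """One table row: the value each setting must have, or ANY."""
    return {"UseMRC": use_mrc, "UseJBIG2": jbig2, "UseJPEG2000": jpeg2000}


RULES = {
    "R2ID_PDF_FORCESIZE": rule("R2ID_PDFMRC_MIN", True, True),
    "R2ID_PDF_FORCEQUALITY": rule("R2ID_PDFMRC_NO"),
    "R2ID_PDF13": rule(jbig2=False, jpeg2000=False),
    "R2ID_PDF12": rule("R2ID_PDFMRC_NO"),  # table row "R2ID_PDF12 and below"
}
# Rows where only JPEG2000 must be FALSE.
for name in ("R2ID_PDF14", "R2ID_PDFA", "R2ID_PDFA1B", "R2ID_PDFA1A"):
    RULES[name] = rule(jpeg2000=False)
# Rows without any restriction.
for name in ("R2ID_PDF20", "R2ID_PDF17", "R2ID_PDF16", "R2ID_PDF15",
             "R2ID_PDFA2B", "R2ID_PDFA3B", "R2ID_PDFA2U", "R2ID_PDFA3U",
             "R2ID_PDFA2A", "R2ID_PDFA3A", "R2ID_PDFA4", "R2ID_PDFA4E",
             "R2ID_PDFA4F", "R2ID_PDFUA1", "R2ID_PDFUA2"):
    RULES[name] = rule()


def is_allowed(compatibility, use_mrc, use_jbig2, use_jpeg2000):
    """Check a combination of settings against the row for `compatibility`."""
    wanted = {"UseMRC": use_mrc, "UseJBIG2": use_jbig2,
              "UseJPEG2000": use_jpeg2000}
    constraints = RULES[compatibility]
    return all(v is ANY or v == wanted[k] for k, v in constraints.items())
```

For instance, `is_allowed("R2ID_PDF13", "R2ID_PDFMRC_MIN", True, False)` is false, because the R2ID_PDF13 row requires Compression.UseJBIG2 to be FALSE.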
The Pictures setting may modify the resolution of the layers. If the resolution of a layer is higher than the resolution specified by the Pictures setting, the layer is transformed to the specified resolution, so it is suggested to leave this setting in its default state (R2_DPI_ORIGINAL) when saving MRC PDF.
The PictureColor setting may change the bit depth of the layers, so it is suggested to leave it in its default state (R2_BPP_ORIGINAL).
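The effect of the Pictures setting on a layer's resolution amounts to a cap. The short Python sketch below illustrates it; it does not call the SDK. `effective_layer_dpi` is a hypothetical helper, and `None` stands in for the default R2_DPI_ORIGINAL.

```python
def effective_layer_dpi(layer_dpi, pictures_dpi=None):
    """Resolution a layer ends up with; None models R2_DPI_ORIGINAL."""
    if pictures_dpi is None:
        return layer_dpi  # default: the layer keeps its own resolution
    # Only layers above the specified resolution are transformed.
    return min(layer_dpi, pictures_dpi)
```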
CSDK can save encrypted PDF files using either an open (or user) or a permissions (or owner, or master) password (see also the above description about encrypted PDF files). This option is fully setting-controlled. The setting Kernel.Imf.PDF.PDFSecurity.Type determines the encryption method used. The passwords can also be stored in settings (Kernel.Imf.PDF.PDFSecurity.OwnerPassword and UserPassword).
In addition, permission flags can be set for the created PDF file, enabling:
The created PDF file can be opened for the enabled operations by using the open password. If one has the permissions password, all the operations are enabled.
Existing encrypted PDF files can be modified and saved as well. In this case, the password of the existing file is needed for processing it. The file and the given password determine which operations can be performed, so the setting Kernel.Imf.PDF.PDFSecurity.Type has no effect. The password (either the open or the permissions one) can be specified in the setting Kernel.Imf.PDF.PDFSecurity.ProcessPassword.
Generating editable output from PDF files has been sped up: more advanced technology is applied to make zoning faster, achieve higher OCR accuracy, improve output quality and make the resulting files more usable when being further edited in target applications. This is achieved by creating two images whenever the input is a PDF or XPS file with a text layer. One is a composite image with all PDF information; the second contains only a background image without any text. This is especially useful for pages where text wraps around pictures irregularly. Further speed-up is achieved by assessing image quality and layout complexity: a faster OCR algorithm is now applied to high-quality pages with simple layouts.
This technology cannot be applied to active PDF forms, and works only in accurate mode. In cases where this technology cannot be applied, there is an automatic and seamless fall-back to the old algorithm.
Another innovation is support for creating linearized PDF files, which are optimized for efficient web display. The resulting PDF adheres to Appendix F (Linearized PDF) of the PDF Reference: after creating the PDF in the usual way (any PDF flavor), the CSDK reorders the file contents and adds hint tables. As a result, the first page of the PDF loads quickly into a web page, with the remaining pages loaded while it is being viewed. Browsers can determine which page elements to present first (typically headings and texts) and which can follow (heavier pictures etc.). Linearization also makes jumping to new pages in the PDF document more efficient.
Settings relating to the creation of linearized PDF are: Kernel.Imf.PDF.Linearized, Kernel.DTxt.PDF.Linearized and Converters.Text.PDF*.Linearized.
Linearized file creation also works with Asian-language PDF files.
Support is introduced for PDF versions 1.6 and 1.7, including support for the AES encryption system (128 and 256 bit). File opening is handled through existing mechanisms, while new saving options are provided for applying AES encryption to files.
In response to client requests, it is now possible to have the original orientations conserved when outputting to PDF Searchable (Image-on-text) files. To implement this, the setting Kernel.Img.KeepOriginalImage or the corresponding function kRecSetPreserveOriginalImg must be used for each page involved. If these images are kept, the orientation on PDF Searchable output pages remains the same as that of the input, overriding any auto-rotation decisions that may have been performed during preprocessing.
This part of the SDK is an extension to KernelAPI and RecAPIPlus. It manages PDF files at the page level: it can copy, move, or delete pages of PDF files, extract information from them, or change their pages. RecPDF is a mostly operation-based API. The page-level modifications are passed to an operation, and at the end the operation executes all of the changes at the same time. Operations can also be cancelled if it turns out that no modification is needed. For more information, see the documentation of the RecPDF Module.
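The operation-based pattern described above (queue the page-level changes, then execute them together or cancel) can be illustrated with a toy model. The Python sketch below is not the RecPDF API. The class `PageOperation` and its methods are hypothetical, the document is modelled as a plain list of page labels, and queued edits are assumed to be applied in the order they were added.

```python
class PageOperation:
    """Hypothetical stand-in: queues page edits and applies them together."""

    def __init__(self, pages):
        self._pages = pages   # the document, modelled as a list of pages
        self._pending = []    # queued edits, not yet applied

    def delete_page(self, index):
        self._pending.append(("delete", index))

    def move_page(self, src, dst):
        self._pending.append(("move", src, dst))

    def cancel(self):
        """Drop every queued edit; the document stays untouched."""
        self._pending.clear()

    def execute(self):
        """Apply all queued edits, in order, to a copy, then swap it in."""
        result = list(self._pages)
        for edit in self._pending:
            if edit[0] == "delete":
                del result[edit[1]]
            else:
                _, src, dst = edit
                result.insert(dst, result.pop(src))
        self._pages[:] = result
        self._pending.clear()
        return self._pages
```

Nothing changes in the page list until `execute()` is called, which is the point of the pattern: a caller can queue several edits, inspect the situation, and call `cancel()` if no modification turns out to be needed.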