Form Data Extraction Module

Data collection from filled forms with CSV output. Form processing is supported on: Windows, Linux, Mac OS X.


RECERR RECAPIPLS  RecProcessFormPagesPDF (int sid, LPCTSTR sampleFormFile, LPCTSTR pageRange, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile)
  It collects data from input form files.
RECERR RECAPIPLS  RecProcessFormPagesTemplate (int sid, LPCTSTR *inputFormTemplFiles, LPCTSTR *inputFormFiles, LPCTSTR outoutDTXTFile)
  It collects data from input form files.

Detailed Description

Data collection from filled forms with CSV output. Form processing is supported on: Windows, Linux, Mac OS X.

The Toolkit already provides form handling capabilities, using Logical Form Recognition® Technologies. This allows form templates to be created and/or designed, so that sets of forms can be processed.

A new type of form handling, called Form Data Extraction (FDE), was introduced in CSDK 16. It can be considered a simplified, more direct workflow for extracting and collating form data. The two offerings compare as follows:

In each entry, LFR stands for Logical Form Recognition and FDE for Form Data Extraction.

Template source
  LFR: Any supported image file type or by scanning a paper form. PDF files as input are treated as image-only. Only one page can be processed at a time. The form must be unfilled.
  FDE: Must be an active PDF form - single or multi-page, filled or unfilled.
Template page range
  LFR: If used, must specify a single page.
  FDE: Can specify any number of the existing pages, but must harmonize with the forms to be processed (see below).
Template design
  LFR: Form objects can be auto-detected and/or manually added, deleted or modified.
  FDE: No changes permitted; the specified template file must be suitable for the task.
Form field types
  LFR: Check boxes, circle texts, comb fields, tables/cells, graphics, lines and text boxes.
  FDE: Check boxes, text boxes, option (radio) buttons.
Form field names
  LFR: Can be set with kRecSetFormFieldName.
  FDE: Must be pre-defined as meta-data in the PDF template form.
Anchors
  LFR: Four pre-defined form controls set in template as anchors, must appear on all filled forms.
  FDE: Automatic. Four text strings identified from fixed text on template, searched on all forms being processed.
Forms being processed
  LFR:
    • PDF/XPS forms – any type
    • Image files
    • Scanned paper forms
  FDE:
    • PDF/XPS forms – any type
    • Image files
    • Scanned paper forms
Multi-page forms
  LFR: Can be handled, but a separate template is needed for each page; the application and end user must ensure template/form matching.
  FDE: Can be handled; the page range for the template must be in harmony with the number of pages in forms being processed (see below).
OCR usage during processing
  LFR: Used only for data extraction, and only as necessary.
  FDE: Used twice, to find anchor points and for data extraction - only when a usable text layer is not detected.
Resizing tolerance
  LFR: 10% (was 1.5% previously).
  FDE: 10% or more.
Recognition restrictions
  LFR: Regular expressions or conditions like ‘Numbers only’ can be set for each form control.
  FDE: No restriction of field data.
Output
  LFR:
    • Letter structure (full info)
    • CSV (with or without append)
    • Any other supported output.
  FDE: CSV Text only: by default data from all forms enters one file - each form becomes a row, each field a column.
Upside-down pages
  LFR: Auto-orientation should be able to resolve such cases.
  FDE: Auto-orientation is on by default and should correct such errors.
Error handling
  LFR: There are three levels:
    • Warn and deliver result
    • Skip form and continue
    • Close processing.
    The developer can decide how to handle each case.
  FDE: The same three levels exist, but in general the program decides which to apply, within the general workflow error handling system. Since FDE processing is more limited than LFR, there is a lower likelihood of serious errors arising.

Neither FDE nor LFR is designed to handle forms in Asian languages (including CCJK, Arabic, Thai and Hebrew).

To summarize, Form Data Extraction extracts data from sets of forms and collates it into a comma-separated values (CSV) text file. The file can be opened in database or spreadsheet programs, where each form is represented by a row and each detected form control by a column.

A form template must be specified for each form type to be handled; in addition to a file name, a page range can be specified. This template file must be an active PDF form; it can be single- or multi-page, filled or unfilled. It must contain active tagged form controls, which can be text boxes, check boxes and option (radio) buttons. The form field names (labels) must be defined in the PDF template form; they will appear as the column headers in the target application. The PDF output converters can save such active PDF forms; this feature is controlled through the settings PDFForms and PDFFormVisuality. PowerPDF can also generate such active PDF forms.

The forms to be processed must have a layout and content corresponding to the defined form template (page size, number of pages per form, location of controls, etc.). The forms can be:

  • PDF/XPS forms – any type
  • Image files
  • Scanned paper forms

A page range can be useful to exclude pages with form filling instructions or other unneeded content. It can be specified for the FDE template file and it must harmonize with the forms that are later processed, as shown in the following example for the page range 3-5:

                                     Pages 1-2     Pages 3-5     Pages 6 onward
Active PDF form template file        excluded      in range      excluded
PDF/XPS files with text layers (*)   must exist    must exist    need not exist
Scanned forms and image files (**)   no            yes           no

(*) That is, Normal or Searchable PDF or XPS files, including Active PDF forms.

(**) In other words if the template defines a three-page form, each scanned filled form must contain three pages, each in the correct order. The same applies to image files, but the pages can be in three single-page files, one three-page file or any other combination (1+2 or 2+1). A set of forms can be presented in a single multi-page file, so long as each form contains three pages in the correct order. If a mismatch is detected, the program attempts to match the template to neighboring pages and may be able to continue processing.

Multiple page ranges are also acceptable, e.g. 3-5, 8, 11-14. In that case all PDF/XPS forms with a text layer must have all the pages corresponding to the template, up to and including the last validated template page.
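
The page-range syntax can be illustrated with a small C sketch. The helper below, expand_page_range, is a hypothetical name and not part of the CSDK; it only shows how a string such as "3-5,8,11-14" expands into individual page numbers. The SDK's own parsing rules (for example, how it treats spaces or reversed ranges) are not documented here and may differ.

```c
#include <stdlib.h>

/*
 * Expand a page-range string such as "3-5,8,11-14" into individual
 * 1-based page numbers. Illustrative helper only, NOT a CSDK function.
 * Assumption: spaces after commas are tolerated, reversed ranges are
 * rejected. A NULL range ("all pages") must be handled by the caller,
 * because this helper does not know the page count of the file.
 *
 * Returns the number of pages written to `pages`, or -1 on a syntax
 * error, a reversed range, or when more than `max_pages` are selected.
 */
int expand_page_range(const char *range, int *pages, int max_pages)
{
    int count = 0;
    const char *p = range;

    if (range == NULL || *range == '\0')
        return -1;

    for (;;) {
        char *end;
        long first, last;

        while (*p == ' ')                 /* skip spaces before an item */
            p++;
        if (*p < '0' || *p > '9')
            return -1;
        first = strtol(p, &end, 10);
        last = first;
        p = end;

        if (*p == '-') {                  /* "first-last" form */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
            last = strtol(p, &end, 10);
            p = end;
        }
        if (first < 1 || last < first)
            return -1;

        for (long n = first; n <= last; n++) {
            if (count >= max_pages)
                return -1;
            pages[count++] = (int)n;
        }

        if (*p == '\0')
            break;
        if (*p != ',')
            return -1;
        p++;                              /* move past the comma */
    }
    return count;
}
```

For the range "3-5,8,11-14" this yields eight pages, the last of which is 14, so under the rule above every PDF/XPS form with a text layer would need at least 14 pages.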

For FDE the first task is to prepare a suitable template to be selected in step two of the following procedure. FDE processing is performed through workflows; only three steps are allowed:

  1. Image Input Step
    Here the file set to be processed is defined using the usual Workflow input step to scan pages or load image files. File names can be given in advance or run-time prompting can be specified. Folder input is also available; it is then the user’s responsibility to ensure that all files reaching the chosen folder are suitable for the current FDE process and its assigned template.
  2. Extract Form Data Step
    Here the PDF template file is selected, with full name and path, and, if desired, a page range. Recognition language(s) are chosen and, optionally, vertical dictionary support.
  3. Save Results Step
    The file type is fixed as CSV Text; the saving location must be given unless runtime prompting is specified. The default file saving option is to have all pages entered into a single CSV file, but other options are available. Timestamp folders can be specified to separate results from multiple processing sessions.

Function Documentation

RECERR RECAPIPLS RecProcessFormPagesPDF ( int  sid,
    LPCTSTR  sampleFormFile,
    LPCTSTR  pageRange,
    LPCTSTR *  inputFormFiles,
    LPCTSTR  outoutDTXTFile )

It collects data from input form files.

This function collects data from a set of filled forms for further processing in databases or spreadsheets. The layout and location of form elements are defined by a sample form file, which is used for generating form templates. The forms to be processed must be filled in by machine (for example, on a computer), not by hand. The output is a CSV file.

The sample file must be an active, non-image-only PDF form containing suitable Acro-Form controls for form objects. It can be either filled or unfilled. It can be a multi-page form and a page range can be specified to eliminate non-form pages such as filling instructions, etc.

Parameters:
[in] sid Setting Collection.
[in] sampleFormFile Name of the sample form file.
[in] pageRange Page range specifying which pages should be processed. This is a string that contains a decimal number (e.g. "4"), or two numbers separated by a '-' sign (e.g. "3-5"), or any comma-separated combination of these (e.g. "3-5,8,11-14"). It may be NULL, in which case all pages are selected.
[in] inputFormFiles Array of pointers to the names of the form files to be processed. The last pointer must be NULL.
[in] outoutDTXTFile Name of the output CSV file.
Return values:
ZONE_SIZE_WARN At least one zone was truncated, because it extends beyond the image.
ZONE_SIZE_ERR At least one zone was not loaded, because it extends beyond the image.
IMG_ANCHOR_WARN Some of the anchors were not found.
IMG_ANCHOR_NOT_FOUND No anchor was found or the sample form does not contain anchor zones.
REC_OK Successful.
Form processing is supported on: Windows, Linux, Mac OS X.
Processing filled PDF files: all PDF flavors are accepted except image-only ones, and the files can be either static or active. Each form must be located in a separate PDF file. If a page range is chosen for the sample, both the sample and the form files must contain all pages up to the last page selected by the range. Otherwise, the form files must contain at least as many pages as the sample file. If the sample file contains non-form pages, the input files must contain the corresponding pages as well.
Processing filled forms saved as image files: all image file formats supported by CSDK are accepted, as are image-only PDF files. In this case the function works with input pages rather than input files. It creates a queue of all pages from the input files (single- or multi-page) in order of appearance. It then takes the sample pages one by one, pairing each with the next input page from the queue; after the last sample page it starts again from the first. Consequently, the total number of input pages must be a multiple of the number of pages in the sample file. When a page range is used, the input pages are paired only with the sample pages selected by the range (that is, the input must not contain the non-form pages), so the total number of input pages must be a multiple of the number of selected sample pages.
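
The page pairing for image input can be modelled with a short C sketch. This is only an illustration of the rule described above, not the SDK's implementation; the names PagePairing and pair_input_pages are hypothetical, and the page counts of the input files are assumed to be known to the caller.

```c
#include <stddef.h>

/* One queued input page paired with one selected sample page. */
typedef struct {
    int form_index;   /* 0-based index of the filled form the page belongs to */
    int sample_slot;  /* 0-based position within the selected sample pages    */
} PagePairing;

/*
 * Queue the pages of all input files in order and pair them cyclically
 * with the selected sample pages. Hypothetical helper, NOT a CSDK call.
 *
 * pages_per_file : page count of each input file, in input order
 * file_count     : number of input files
 * sample_pages   : number of sample pages in use (all of them, or only
 *                  those selected by the page range)
 * out            : receives one entry per queued input page
 * out_capacity   : number of entries available in `out`
 *
 * Returns the number of complete forms, or -1 if the total page count is
 * not a multiple of `sample_pages` or does not fit in `out`.
 */
int pair_input_pages(const int *pages_per_file, size_t file_count,
                     int sample_pages, PagePairing *out, size_t out_capacity)
{
    size_t total = 0;

    if (sample_pages <= 0)
        return -1;
    for (size_t i = 0; i < file_count; i++) {
        if (pages_per_file[i] < 0)
            return -1;
        total += (size_t)pages_per_file[i];
    }
    if (total % (size_t)sample_pages != 0 || total > out_capacity)
        return -1;

    /* Page q of the queue belongs to form q / n and sample slot q % n. */
    for (size_t q = 0; q < total; q++) {
        out[q].form_index = (int)(q / (size_t)sample_pages);
        out[q].sample_slot = (int)(q % (size_t)sample_pages);
    }
    return (int)(total / (size_t)sample_pages);
}
```

With a three-page sample and input files of 1, 2 and 3 pages, the six queued pages form two complete forms; input files of 2 and 2 pages would be rejected because four is not a multiple of three.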
Each form element becomes a table column in the output file and the data from each form is presented in a single row. The form elements are typically fillable fields, check boxes and option buttons. If the output file already exists, the collected data rows will be appended to the end of the file.
The function automatically rotates the input images if the current IMG_ROTATE setting is ROT_AUTO (except when the input is a PDF that is not image-only); otherwise the images are not rotated. Auto-rotation is not applied to the sample form file, so the sample must already be correctly oriented.
See details about size limits of input images.
The specification of this function in C# is:
 RECERR RecProcessFormPagesPDF(int sid, string sampleFormFile, string pageRange, string[] inputFormFiles, string outputDTXTFile); 
The specification of this function in Java is:
 int RecProcessFormPagesPDF(int sid, String sampleFormFile, String pageRange, String[] inputFormFiles, String outoutDTXTFile) 
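
The inputFormFiles argument (and inputFormTemplFiles in the template-based function) is a pointer array whose last entry must be NULL. A minimal C sketch of building such an array follows. The helpers make_file_list and file_list_length are hypothetical and not part of the CSDK, and plain char strings are assumed; LPCTSTR may be a wide-character type in some builds, in which case the element type would have to change accordingly.

```c
#include <stdlib.h>

/*
 * Build a NULL-terminated pointer array from `count` file names, in the
 * shape expected for inputFormFiles. Hypothetical helper, NOT a CSDK
 * function. The names themselves are not copied, so they must outlive
 * the returned array. Release the array with free().
 * Assumption: file names are plain char strings.
 */
const char **make_file_list(const char *const *names, size_t count)
{
    const char **list = malloc((count + 1) * sizeof *list);
    if (list == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        list[i] = names[i];
    list[count] = NULL;          /* terminator required by the API */
    return list;
}

/* Count the entries up to, but not including, the terminating NULL. */
size_t file_list_length(const char *const *list)
{
    size_t n = 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

With the SDK headers available, the array returned by make_file_list would be passed as inputFormFiles (a cast may be needed, depending on how LPCTSTR is defined in the build), and the RECERR result compared with REC_OK.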
RECERR RECAPIPLS RecProcessFormPagesTemplate ( int  sid,
    LPCTSTR *  inputFormTemplFiles,
    LPCTSTR *  inputFormFiles,
    LPCTSTR  outoutDTXTFile )

It collects data from input form files.

This function collects data from a set of filled forms for further processing in databases or spreadsheets. The layout and location of form elements are defined by form template files. The forms to be processed must be filled in by machine (for example, on a computer), not by hand. The output is a CSV file.

Parameters:
[in] sid Setting Collection.
[in] inputFormTemplFiles Array of pointers to the names of the form template files. The last pointer must be NULL.
[in] inputFormFiles Array of pointers to the names of the form files to be processed. The last pointer must be NULL.
[in] outoutDTXTFile Name of the output CSV file.
Return values:
ZONE_SIZE_WARN At least one zone was truncated, because it extends beyond the image.
ZONE_SIZE_ERR At least one zone was not loaded, because it extends beyond the image.
IMG_ANCHOR_WARN Some of the anchors were not found.
IMG_ANCHOR_NOT_FOUND No anchor was found or the template does not contain anchor zones.
REC_OK Successful.
Form processing is supported on: Windows, Linux, Mac OS X.
The form template files must be created by CSDK using the kRecSaveFormTemplate function. See how to create a form template file.
The function works with input pages rather than input files. It creates a queue of all pages from the input files (single- or multi-page, PDF or image files) in order of appearance. It then takes the form templates one by one, pairing each with the next input page from the queue; after the last template it starts again from the first. Consequently, the total number of input pages must be a multiple of the number of template files. If the input includes non-form pages, the form template list must contain an empty form template paired with each of them.
Each form element becomes a table column in the output file and the data from each form is presented in a single row. The form elements are typically fillable fields, check boxes and option buttons. If the output file already exists, the collected data rows will be appended to the end of the file.
The function automatically rotates the input images if the current IMG_ROTATE setting is ROT_AUTO (except when the input is a PDF that is not image-only); otherwise the images are not rotated.
See details about size limits of input images.
The specification of this function in C# is:
 RECERR RecProcessFormPagesTemplate(int sid, string[] inputFormTemplFiles, string[] inputFormFiles, string outputDTXTFile); 
The specification of this function in Java is:
 int RecProcessFormPagesTemplate(int sid, String[] inputFormTemplFiles, String[] inputFormFiles, String outoutDTXTFile)