Paragraph extraction

Paragraph extraction performs the following actions:

  • Extracts values embedded inside a paragraph.

  • Defines the paragraph boundary using keywords.

  • Uses a pattern (regular expression) to search for desired data.

The pattern for the regular expression cannot exceed 31 characters in length.
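The length limit can be checked before a rule is saved. The following is a minimal sketch, assuming Python's re module as a stand-in for the regex engine that Transact uses; the function name validate_pattern and the constant MAX_PATTERN_LENGTH are hypothetical and are not part of the product.

```python
import re

# Limit stated in this section; the constant name is illustrative only.
MAX_PATTERN_LENGTH = 31


def validate_pattern(pattern: str) -> re.Pattern:
    """Reject patterns longer than the limit, then compile the pattern."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError(
            f"pattern is {len(pattern)} characters long; "
            f"the limit is {MAX_PATTERN_LENGTH}"
        )
    return re.compile(pattern)


# A 17-character date pattern passes the check.
date_pattern = validate_pattern(r"\d{2}/\d{2}/\d{4}")
print(date_pattern.search("Due by 03/15/2024").group(0))  # 03/15/2024
```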

First, a paragraph is identified. Then, from within that paragraph, a value that matches a given regex pattern is extracted.
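The two steps can be illustrated with a short sketch. It assumes the paragraphs have already been separated into strings, and it uses Python's re module rather than the regex engine in Transact. The function name extract_from_paragraph and the sample text are hypothetical, and returning the first matching value is a simplifying assumption.

```python
import re
from typing import List, Optional


def extract_from_paragraph(
    paragraphs: List[str], start_pattern: str, value_pattern: str
) -> Optional[str]:
    """Step 1: find a paragraph that begins with start_pattern.
    Step 2: return the first value_pattern match inside that paragraph."""
    start_re = re.compile(start_pattern)
    value_re = re.compile(value_pattern)
    for paragraph in paragraphs:
        # match() anchors at the beginning of the paragraph text.
        if start_re.match(paragraph):
            found = value_re.search(paragraph)
            if found:
                return found.group(0)
    return None


paragraphs = [
    "Invoice summary\nThank you for your business.",
    "Payment terms\nPayment is due by 03/15/2024 to avoid late fees.",
]
print(extract_from_paragraph(paragraphs, r"Payment terms", r"\d{2}/\d{2}/\d{4}"))
# 03/15/2024
```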

Paragraphs are identified on the basis of the following conditions:

  • A regex match for the start pattern is treated as the start of a paragraph only if no span (word) is present to the left of the match.

  • Transact calculates the average white space between lines and splits the text body wherever the white space between two lines is larger than that average.

  • If an End pattern is defined and a line ends with it, the End pattern takes priority over the line-spacing mechanism: the paragraph ends on that line even if the next lines satisfy the spacing condition.
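The three conditions can be sketched together as follows. This is one interpretation of the rules above and not the Transact implementation: the Line class, its top and bottom coordinates, and the find_paragraph function are hypothetical, and line text is assumed to be trimmed so that a match at position 0 means that no word sits to its left.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    text: str
    top: float     # y coordinate of the top edge of the line
    bottom: float  # y coordinate of the bottom edge of the line


def find_paragraph(
    lines: List[Line], start_pattern: str, end_pattern: Optional[str] = None
) -> List[str]:
    """Return the text lines of the first paragraph that begins with start_pattern."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(f"(?:{end_pattern})$") if end_pattern else None

    # White space between each pair of consecutive lines, and its average.
    gaps = [nxt.top - cur.bottom for cur, nxt in zip(lines, lines[1:])]
    average_gap = sum(gaps) / len(gaps) if gaps else 0.0

    # Condition 1: the start pattern must match at the very beginning of a line.
    start = next(
        (i for i, line in enumerate(lines) if start_re.match(line.text)), None
    )
    if start is None:
        return []

    paragraph = []
    for j in range(start, len(lines)):
        paragraph.append(lines[j].text)
        # Condition 3: a line ending with the End pattern closes the paragraph first.
        if end_re and end_re.search(lines[j].text):
            break
        # Condition 2: a gap larger than the average closes the paragraph.
        if j < len(lines) - 1 and gaps[j] > average_gap:
            break
    return paragraph


lines = [
    Line("Header text", 0, 10),
    Line("Terms and conditions", 30, 40),
    Line("Payment is due within", 42, 52),
    Line("30 days of invoice.", 54, 64),
    Line("Late fees may apply.", 66, 76),
    Line("Footer", 96, 106),
]
print(len(find_paragraph(lines, r"Terms")))                # 4
print(len(find_paragraph(lines, r"Terms", r"invoice\.")))  # 3
```

In the sample data the gaps are 20, 2, 2, 2, and 20, so the average is 9.2. Spacing alone ends the paragraph before "Footer", while the End pattern ends it one line earlier.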

The start pattern for a paragraph can be the title of the paragraph or its opening words. You can configure the extraction rule accordingly. During extraction, the Paragraph Extraction Rule handles paragraph wrapping by default.
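One way to picture paragraph wrapping is a value that is split across a line break. The sketch below joins the lines of a paragraph before searching, which is an assumption about the general technique and not a description of how Transact handles wrapping internally. The function name search_wrapped and the sample text are hypothetical.

```python
import re
from typing import List, Optional


def search_wrapped(paragraph_lines: List[str], value_pattern: str) -> Optional[str]:
    """Join wrapped lines with single spaces, then search the joined text."""
    joined = " ".join(line.strip() for line in paragraph_lines)
    found = re.search(value_pattern, joined)
    return found.group(0) if found else None


paragraph_lines = [
    "Policy number: AB-1234 issued on 12",
    "March 2024 to the insured.",
]
# The date spans the line break, so a line-by-line search would miss it.
print(search_wrapped(paragraph_lines, r"\d{1,2} [A-Z][a-z]+ \d{4}"))  # 12 March 2024
```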

This functionality enables you, as an administrator of batch classes, to configure extraction rules for index fields.