Configure chunking settings for documents

You can configure the chunking settings for TotalAgility (Capture data) and non-TotalAgility documents. For TotalAgility documents, you can set chunking for specific document types.

  1. Navigate to System > System settings > System > Chunking settings.

    The Chunking settings dialog box is displayed.

  2. Configure the settings for the following, as needed:

    • TotalAgility documents

    • Non-TotalAgility documents

    • Specific document type

  3. Click Save.

    • When you execute the Add to knowledge base activity, the chunking settings configured for TotalAgility documents and non-TotalAgility documents are applied.

    • When you add a TotalAgility document to the AI knowledge base, chunking settings specific to that document type are applied. If there are no specific settings configured for the document type, the default settings configured for TotalAgility documents are applied.

    • When you add a non-TotalAgility document to the AI knowledge base, the default chunking settings configured for non-TotalAgility documents are applied.
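The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not the product's implementation: `ChunkSettings`, `resolve_chunk_settings`, and the sample values for the TotalAgility default and the "Invoice" override are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChunkSettings:
    chunk_type: str              # "section", "page", or "fixed"
    chunk_size: Optional[int]    # characters; used only by "fixed"
    overlap_percent: int         # 0 means no repeated content


def resolve_chunk_settings(
    is_totalagility: bool,
    document_type: Optional[str],
    type_overrides: Dict[str, ChunkSettings],
    totalagility_default: ChunkSettings,
    non_totalagility_default: ChunkSettings,
) -> ChunkSettings:
    """Pick the settings that apply to one document (hypothetical helper)."""
    if not is_totalagility:
        # Non-TotalAgility documents always use their own default.
        return non_totalagility_default
    if document_type is not None and document_type in type_overrides:
        # A document type with its own settings overrides the default.
        return type_overrides[document_type]
    # No specific settings: fall back to the TotalAgility default.
    return totalagility_default


# Illustrative values only. The non-TotalAgility default matches this topic;
# the TotalAgility default and the "Invoice" override are made up.
NON_TA_DEFAULT = ChunkSettings("fixed", 2000, 10)
TA_DEFAULT = ChunkSettings("section", None, 20)
OVERRIDES = {"Invoice": ChunkSettings("fixed", 500, 10)}

invoice = resolve_chunk_settings(True, "Invoice", OVERRIDES, TA_DEFAULT, NON_TA_DEFAULT)
valuation = resolve_chunk_settings(True, "Property valuation", OVERRIDES, TA_DEFAULT, NON_TA_DEFAULT)
external = resolve_chunk_settings(False, None, OVERRIDES, TA_DEFAULT, NON_TA_DEFAULT)
```

In this sketch, the invoice picks up its own override, the property valuation falls back to the TotalAgility default, and the external document uses the non-TotalAgility default.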

Chunking settings for TotalAgility documents

The following are the settings for the TotalAgility documents.

Setting

Description

Chunk type

The method used to divide a document into smaller sections, known as chunks, when the document is added to the AI knowledge base. Different chunk types allow the system to handle content effectively based on its structure and requirements. Available chunk types are:

Section

Divides the document into chunks based on sections, headings, or subheadings. Each section becomes a chunk. This chunk type is useful for maintaining the context of the information.

Page

Splits the document by page, so that each page becomes its own chunk. This chunk type is useful when the page layout is significant. Page-based chunking keeps the page structure intact, which makes it easier to search for and reference specific pages. It is beneficial for large documents in distributed systems.

Fixed size

Divides the document into chunks of a specified size, regardless of the document's content structure. This chunk type is useful where a uniform chunk size is necessary for processing.

Chunk size

Indicates the size, in characters or words, of each chunk into which a document is divided. This setting determines how much of the document is processed at one time when it is added to a knowledge base. Breaking large documents into smaller, manageable chunks improves processing and information retrieval. (Default: 2000 characters; minimum: 200 characters.)

The Chunk size setting is available only for the Fixed size chunk type for TotalAgility documents.

Overlap chunk

Specifies the amount of content that is repeated between consecutive chunks. Overlap helps preserve context, particularly when important information spans the boundary between two chunks; without overlap, that context can be lost at the boundary. (Default for Section and Page: 20% of the number of characters in the chunk; default for Fixed size: 10% of the number of characters in the chunk. An overlap value of 0 means that no content is repeated between chunks.)

Microsoft Word

Chunk type

The chunk type for Microsoft Word documents. Available options are Section (default) and Fixed size.

Chunk size

The Chunk size setting is available only for the Fixed size chunk type for non-TotalAgility Microsoft Word documents. (Default: 2000 characters; minimum: 200 characters.)

Overlap chunk

The overlap chunk percentage. (Default for Section: 20% of the number of characters in the chunk; default for Fixed size: 10% of the number of characters in the chunk.)
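To make the interaction between Chunk size and Overlap chunk concrete, the following is a minimal sketch of fixed-size chunking with a percentage overlap. It assumes that the overlap is measured in characters and is taken from the end of the previous chunk; the function name `chunk_fixed` and its parameters are hypothetical, and the product may compute chunk boundaries differently.

```python
from typing import List


def chunk_fixed(text: str, chunk_size: int = 2000, overlap_percent: int = 10) -> List[str]:
    """Split text into fixed-size chunks that share an overlapping tail.

    Illustrative only: assumes character-based sizes and that each chunk
    starts `chunk_size - overlap` characters after the previous one.
    """
    if chunk_size < 200:
        raise ValueError("chunk_size must be at least 200 characters")
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be between 0 and 99")

    overlap = chunk_size * overlap_percent // 100   # repeated characters
    step = chunk_size - overlap                     # distance between chunk starts

    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# A 5000-character sample made of repeating digits.
sample = "".join(str(i % 10) for i in range(5000))
chunks = chunk_fixed(sample)                              # defaults: 2000 characters, 10%
no_overlap = chunk_fixed(sample, overlap_percent=0)
```

With the defaults, the last 200 characters of each chunk are repeated at the start of the next one. With an overlap of 0, the chunks are simply consecutive slices of the text.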

Chunking settings for non-TotalAgility documents

Non-TotalAgility documents support only the Fixed size chunk type.

  • The default chunk size is 2000 characters, and the minimum is 200 characters.

  • The default overlap chunk is 10% of the number of characters in the chunk.
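As a rough worked example of these defaults, 10% of 2000 characters is 200 characters of overlap, so each new chunk advances 1800 characters through the document. The sketch below estimates how many chunks a document of a given length would produce under that assumption. The function `estimate_chunk_count` is hypothetical, and the result is an estimate under a character-based model, not a figure reported by the product.

```python
import math


def estimate_chunk_count(doc_length: int, chunk_size: int = 2000, overlap_percent: int = 10) -> int:
    """Estimate the number of fixed-size chunks for a document (illustrative)."""
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be between 0 and 99")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1  # the whole document fits in one chunk

    overlap = chunk_size * overlap_percent // 100   # 200 with the defaults
    step = chunk_size - overlap                     # 1800 with the defaults
    # One full first chunk, then one more chunk per `step` of remaining text.
    return 1 + math.ceil((doc_length - chunk_size) / step)


count_default = estimate_chunk_count(10_000)            # defaults: 2000 characters, 10%
count_minimum = estimate_chunk_count(1_000, chunk_size=200)
```

Under these assumptions, a 10,000-character document yields 6 chunks with the default settings.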

Chunking settings for a document type

You can configure the chunk settings for each type of Capture document. For example, you may want an invoice to be chunked differently than a property valuation.

  1. Click .

    The Add document type chunk settings dialog box is displayed.

  2. On the Document type list, select the Capture document type for which you want to override the default settings for Capture data and Microsoft Word documents.

    The document types defined in the extraction group appear on the Document type list.

  3. Configure the settings as needed. See Chunking settings for TotalAgility documents.
  4. Click Save.

    The document type is listed in the table. You can modify or delete the configured document type chunking settings.