Optimization Settings Window
The Optimize Content Classification feature can be further configured using the options on the Optimization Settings window.
Many of the options have a Min. value, a Max. value, and a Step width. This means that each value is tested, starting from the Min. value and ending with the Max. value. The Step width specifies the increment between tested values. For example, if you have a minimum value set to 3, a maximum value set to 9, and a step width of 3, the optimization process tests the values 3, 6, and 9.
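The Min./Max./Step width behavior described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name `sweep_values` is hypothetical, and the sketch assumes that a value above the maximum is never tested, even when the step width does not land exactly on the maximum.

```python
def sweep_values(minimum, maximum, step):
    """Return the values from minimum to maximum, inclusive, in increments of step."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = minimum
    # Assumption: stop as soon as the next value would exceed the maximum.
    while current <= maximum:
        values.append(current)
        current += step
    return values


# The example from the text: minimum 3, maximum 9, step width 3.
print(sweep_values(3, 9, 3))  # [3, 6, 9]
```

Calling `sweep_values(16, 64, 8)` in the same way yields 16, 24, 32, 40, 48, 56, and 64.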
- Optimization Targets
-
This group has the following options:
- Max. number of features
-
With respect to optimization, a feature is a string that is unique to a class. Trigrams (three-letter sequences) or words are useful for helping with classification, but they are not unique enough, especially if you are using a fuzzy string match. Features are strong strings. For example, "Invoice" might be found in all classes, but "Invoice statement" is unique to one class and "This is the final statement" is unique to another class.
Select this option to configure a minimum value and a maximum value as well as a step width for the maximum number of features.
This option is selected by default. The default values for Min. values and Step width are both set to 1000. The default value for Max. values is set to 10,000.
The feature length includes spaces. This means that if the maximum length is set to 7, this includes any spaces between words.
- Min. feature length
-
The minimum feature length determines the shortest length of a feature or string. Select this option to configure a minimum value and a maximum value as well as a step width for the minimum feature length.
This option is selected by default. The default value for Min. values is set to 3, the default value for Max. values is set to 5, and the default value for Step width is set to 1. This means that when optimization is performed, the minimum feature length is tested for features that are 3, 4, and 5 characters long as a minimum.
- Max. feature length
-
The maximum feature length determines the longest length of a feature or string. Select this option to configure a minimum value and a maximum value as well as a step width for the maximum feature length.
This option is selected by default. The default value for Min. values is set to 16, the default value for Max. values is set to 64, and the default value for Step width is set to 8. This means that when optimization is performed, the maximum feature length is tested for features that are 16, 24, 32, 40, 48, 56, and 64 characters long as a maximum.
- Min. feature frequency
-
Select this option to specify how many times a feature must be present in order for a document to be classified.
This option is selected by default. The default value for Min. values is set to 2, the default value for Max. values is set to 5, and the default value for Step width is set to 1. This means that when optimization is performed, the minimum feature frequency is tested with 2, 3, 4, and 5 instances, so there must always be at least 2 instances of a feature in order for a document to be classified.
- Min. class entropy
-
Entropy is the level of uniqueness in a project. A higher value indicates that the features are more unique. The more classes a project has, the more entropy it has.
This option is selected by default. The default value for Min. values is set to 0.2, the default value for Max. values is set to 0.8, and the default value for Step width is set to 0.1. This means that the values 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8 are tested.
- Use words only
-
Select this option to use full words only. This option is selected by default.
Only full words are recognized. There are no partial words or stems for verbs or partial compound words.
- Use fuzzy string match
-
Select this option to enable fuzzy string matches. This option is selected by default.
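As a rough summary of the numeric options above, the following Python sketch lists the values that each default Min./Max./Step width triple would produce. The dictionary keys and the `default_sweep` function are hypothetical names chosen for illustration, and the sketch makes no claim about how the product combines these values internally. `Decimal` is used so that the 0.1 step for the class entropy does not pick up floating-point rounding errors.

```python
from decimal import Decimal

# Default (min, max, step) triples as stated in this section.
# The keys are illustrative names, not identifiers used by the product.
DEFAULTS = {
    "max_number_of_features": ("1000", "10000", "1000"),
    "min_feature_length": ("3", "5", "1"),
    "max_feature_length": ("16", "64", "8"),
    "min_feature_frequency": ("2", "5", "1"),
    "min_class_entropy": ("0.2", "0.8", "0.1"),
}


def default_sweep(name):
    """Return the list of values tested for one option under its default settings."""
    low, high, step = (Decimal(text) for text in DEFAULTS[name])
    values = []
    current = low
    while current <= high:
        values.append(current)
        current += step
    return values


for name in DEFAULTS:
    print(name, [str(value) for value in default_sweep(name)])
```

With the defaults, this gives 10 values for the maximum number of features, 3 for the minimum feature length, 7 for the maximum feature length, 4 for the minimum feature frequency, and 7 for the minimum class entropy.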
Definitions for the buttons at the bottom of this window can be found in Common Transformation Designer Buttons.