DCTool command-line supplement

DCTool is a command-line supplement to the Document Classifier Assistant and is included in the installation.

DCTool offers the same functionality as the Document Classifier Assistant application. Developers can use DCTool to invoke CSDK Document Classifier API functions in their custom applications, using the command line, a custom Windows UI or a web interface.

The default location of DCTool depends on your CSDK edition:

  • Windows, 32-bit edition:

    c:\Program Files (x86)\OmniPage\CSDK22\Bin\DCTool.exe
  • Windows, 64-bit edition:

    c:\Program Files\OmniPage\CSDK22\Bin\DCTool.exe

DCTool can create or modify a Document Classifier project file, run a training, or test a trained project.

At the command prompt, switch to the folder appropriate for your operating system, and type DCTool (or dctool under Linux) without parameters to display the help for commands and switches.

The details, commands, and parameters in the following sections complement the DCTool command-line help.

Specify the project definition file

A project definition file is a text file in JSON format that describes the properties (such as classes, documents, stopwords, and metawords) of the source Document Classifier project. To specify the project definition file, use the <definition-file> parameter with the modify or get_def_file commands.

Syntax:

DCTool get_def_file <project-file> -def_file <definition-file>

Under Linux, use the lowercase executable name: dctool.

Parameter            Description
get_def_file         The command for specifying the project definition file or folder.
<project-file>       The path to the existing Document Classifier project file.
-def_file            This switch sets the output format as definition file.
<definition-file>    The path to the output project definition file.

Example:

DCTool get_def_file myproject.dcp -def_file myprojectdef.json
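When the export is driven from a script rather than typed at the prompt, the command can be assembled as an argument list. The following is a minimal sketch, not part of the product: the function names are hypothetical, it assumes DCTool can be found on the PATH, and it assumes that a failed run is reported through a non-zero exit code, which this document does not state.

```python
import subprocess


def build_get_def_file_cmd(project_file, definition_file, executable="DCTool"):
    # Mirrors the documented syntax:
    #   DCTool get_def_file <project-file> -def_file <definition-file>
    return [executable, "get_def_file", project_file, "-def_file", definition_file]


def export_definition(project_file, definition_file, executable="DCTool"):
    # Runs the command. check=True raises CalledProcessError on a non-zero
    # exit code (assumption: DCTool reports failures that way).
    cmd = build_get_def_file_cmd(project_file, definition_file, executable)
    subprocess.run(cmd, check=True)


# Only build and show the command here; nothing is executed.
cmd = build_get_def_file_cmd("myproject.dcp", "myprojectdef.json")
print(" ".join(cmd))
```

Under Linux, pass executable="dctool" to match the lowercase executable name.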

Project definition file samples

The following examples demonstrate the structure of the project definition files (JSON format) to use with the modify and get_def_file commands.

The project data is serialized with a struct Project instance as the root object. The serialized types are declared as follows:

struct POINT {
	long x;
	long y;
};

struct SIZE {
	int cx;
	int cy;
};

// see KernelApi.h!
// NOTE: enum values are serialized now using their 'int' representation!
enum LANGUAGES {
	// ...
};

// NOTE: enum values are serialized now using their 'int' representation!
enum Ngram_weights {
	NGW_NON = -1,
	NGW_PROHIBITED,
	NGW_NORMAL,
	NGW_LEVEL1,
	NGW_LEVEL2,
	NGW_MANDATORY
};

// NOTE: enum values are serialized now using their 'int' representation!
enum DCProjectType {
	DCPT_OPEN = 0,
	DCPT_CLOSED
};

struct MetaWord {
	string pattern;
	string value;
};

typedef string StopWord;

struct Phrase {
	string text;
	string value;
};

struct ClassPhrase : public Phrase {
	Ngram_weights weight;
	POINT location; // serialized as location.x, location.y
	SIZE size; // serialized as size.cx, size.cy
	SIZE drift; // serialized as drift.cx, drift.cy
	int groupId = -1;
};

struct Document {
	string imagePath;
	unsigned page = 0;
	bool isHidden = false;
};

typedef vector<ClassPhrase> ClassPhraseVect;
typedef map<LANGUAGES, ClassPhraseVect> ClassPhrasesMap;

struct Class {
	string name;
	bool isHidden = false;
	vector<Document> documents;
	ClassPhrasesMap classPhrasesMap;
};

typedef set<LANGUAGES> LangSet;
typedef vector<MetaWord> MetaWordVect;
typedef map<LANGUAGES, MetaWordVect> MetaWordsMap;
typedef vector<StopWord> StopWordVect;
typedef map<LANGUAGES, StopWordVect> StopWordsMap;

struct Project {
	string projectPath;
	DCProjectType projectType = DCPT_CLOSED;
	LangSet langSet;
	vector<Class> classes;
	MetaWordsMap metaWordsMap;
	StopWordsMap stopWordsMap;
};
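Because enum values are serialized using their int representation, a script that reads a definition file may want to map the numbers back to names. The sketch below mirrors the two enums whose values are listed above. The Python class names are hypothetical, and LANGUAGES is left out because its values are defined in KernelApi.h and are not listed here.

```python
from enum import IntEnum


class NgramWeights(IntEnum):
    # Same order as the Ngram_weights declaration: the first value is -1
    # and each following value increases by one.
    NGW_NON = -1
    NGW_PROHIBITED = 0
    NGW_NORMAL = 1
    NGW_LEVEL1 = 2
    NGW_LEVEL2 = 3
    NGW_MANDATORY = 4


class DCProjectType(IntEnum):
    DCPT_OPEN = 0
    DCPT_CLOSED = 1


# Decode integers as they would appear in the JSON file.
print(DCProjectType(1).name)   # DCPT_CLOSED
print(NgramWeights(4).name)    # NGW_MANDATORY
```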

A simplified project definition file sample:

{
	"project": {
		"projectPath": "e:\\dc\\test_project\\test.dcproj",
		"projectType" : 1,
		"langSet" : [ 0 ],
		"classes" : [ {
			"name": "BusinessCard",
			"isHidden" : false,
			"documents" : [ {
				"imagePath": "e:\\dc\\test_input\\BusinessCard\\02.jpg",
				"page" : 0,
				"isHidden" : false },{
				"imagePath": "e:\\dc\\test_input\\BusinessCard\\10.jpg",
				"page" : 0,
				"isHidden" : false } ],
			"classPhrasesMap": [ {
				"key": 0,
				"value" : [] } ] },
		{
			"name": "Receipt",
			"isHidden" : false,
			"documents" : [ {
				"imagePath": "e:\\dc\\test_input\\Receipt\\03.jpg",
				"page" : 0,
				"isHidden" : false },{
				"imagePath": "e:\\dc\\test_input\\Receipt\\08.jpg",
				"page" : 0,
				"isHidden" : false } ],
			"classPhrasesMap": [ {
				"key": 0,
				"value" : [] } ] }
		],
		"metaWordsMap": [
			{
				"key": 0,
				"value" : [
					{
						"pattern": "fax|facsimile",
						"value" : "fax"
					},
					{
						"pattern": "tel|telephone|phone|mobil[e]",
						"value" : "tel"
					}]
			}
		],
		"stopWordsMap": [
			{
				"key": 0,
				"value" : [
					"a",
					"about",
					"above"
				]
			}
		]
	}
}
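In the sample, the map-typed members (classPhrasesMap, metaWordsMap, stopWordsMap) appear as arrays of key/value objects rather than as JSON objects. The following sketch shows one way to read such a file with the Python standard library. The helper names are hypothetical, and the embedded text is a shortened copy of the sample above.

```python
import json

# Shortened copy of the sample definition file (raw string keeps the
# JSON backslash escapes intact).
SAMPLE = r"""
{
  "project": {
    "projectPath": "e:\\dc\\test_project\\test.dcproj",
    "projectType": 1,
    "langSet": [0],
    "classes": [
      {"name": "BusinessCard", "isHidden": false,
       "documents": [
         {"imagePath": "e:\\dc\\test_input\\BusinessCard\\02.jpg",
          "page": 0, "isHidden": false}
       ],
       "classPhrasesMap": [{"key": 0, "value": []}]}
    ],
    "metaWordsMap": [
      {"key": 0, "value": [{"pattern": "fax|facsimile", "value": "fax"}]}
    ],
    "stopWordsMap": [{"key": 0, "value": ["a", "about", "above"]}]
  }
}
"""


def pairs_to_dict(pairs):
    # Convert a serialized map (a list of key/value objects) into a dict.
    return {entry["key"]: entry["value"] for entry in pairs}


def documents_per_class(definition):
    # Return {class name: number of documents} for the project.
    project = definition["project"]
    return {cls["name"]: len(cls["documents"]) for cls in project["classes"]}


definition = json.loads(SAMPLE)
stop_words = pairs_to_dict(definition["project"]["stopWordsMap"])
print(documents_per_class(definition))
print(stop_words[0])
```

The integer keys are language identifiers (LANGUAGES values) serialized as int.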

JSON output with the test command

JSON output format is an alternative to CSV. To use JSON as the output format, use the -json parameter with the test command and specify the output file name. Either use the project training set, or specify a test directory containing training files sorted in subfolders.

  • The following example uses the training set:

    dctool test myproject.dcp -test_training_set -json mytestoutput.json

  • The following example uses a test directory:

    dctool test myproject.dcp -test_dir mytestdir -json mytestoutput.json
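The two variants above differ only in the input switch, so a script can build both from one helper. This is a sketch with a hypothetical function name. It uses only the switches shown in the examples and does not run the tool.

```python
def build_test_cmd(project_file, json_output, test_dir=None, executable="dctool"):
    # Without a test directory, test against the project training set;
    # otherwise pass the directory with -test_dir.
    cmd = [executable, "test", project_file]
    if test_dir is None:
        cmd.append("-test_training_set")
    else:
        cmd += ["-test_dir", test_dir]
    cmd += ["-json", json_output]
    return cmd


print(" ".join(build_test_cmd("myproject.dcp", "mytestoutput.json")))
print(" ".join(build_test_cmd("myproject.dcp", "mytestoutput.json", test_dir="mytestdir")))
```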

The test output data is serialized with a struct TestOutputData instance as the root object. The serialized types are declared as follows:

struct StatData {
	unsigned count = 0;
	float percent = 0.0;
};

struct FileMatchInfo {
	string docName;
	string targetClass;
	string predictedClass;
	int confidence;
	bool isConfident;
	string result;
};

struct ConfusionMatrix {
	struct Row {
		string className;
		vector<unsigned> numOfDocsVect;
	};
	vector<string> header;
	vector<Row> matrix;
};

struct TotalStatistics {
	unsigned alienNum = 0;
	float falsePosPercentInAliens;
	StatData falsePos;
	StatData falseNegOrRejected; // open project: false negative; closed project: rejected
	StatData misclassified;
	StatData totalError;
	StatData correct;
};

struct BestConfidenceThresholdInfo {
	float falseNegativeOrRejectedWeight;
	float falsePositiveWeight;
	float misclassifiedWeight;
	unsigned bestConfidenceThreshold;
};

struct TestOutputData {
	vector<FileMatchInfo> fileMatchInfoVect;
	ConfusionMatrix confusionMatrix;
	TotalStatistics totalStatistics;
	BestConfidenceThresholdInfo bestConfidenceThresholdInfo;
};

A simplified JSON test output sample:

{
	"testOutputData": {
		"fileMatchInfoVect": [
			{
				"docName": "02.jpg#1",
				"targetClass" : "BusinessCard",
				"predictedClass" : "BusinessCard",
				"confidence" : 100,
				"isConfident" : true,
				"result" : "Correct"
			},
			{
				"docName": "10.jpg#1",
				"targetClass" : "BusinessCard",
				"predictedClass" : "BusinessCard",
				"confidence" : 100,
				"isConfident" : true,
				"result" : "Correct"
			},
			{
				"docName": "03.jpg#1",
				"targetClass" : "Receipt",
				"predictedClass" : "Receipt",
				"confidence" : 100,
				"isConfident" : true,
				"result" : "Correct"
			},
			{
				"docName": "08.jpg#1",
				"targetClass" : "Receipt",
				"predictedClass" : "Receipt",
				"confidence" : 100,
				"isConfident" : true,
				"result" : "Correct"
			}
		],
		"confusionMatrix": {
			"header": [
				"BusinessCard",
				"Receipt",
				"<Rejected>"
			],
			"matrix" : [
				{
					"className": "BusinessCard",
					"numOfDocsVect" : [
						2,
						0,
						0
					]
				},
				{
					"className": "Receipt",
					"numOfDocsVect" : [
						0,
						2,
						0
					]
				},
				{
					"className": "Alien documents",
					"numOfDocsVect" : [
						0,
						0,
						0
					]
				}
			]
		},
		"totalStatistics": {
			"alienNum": 0,
			"falsePosPercentInAliens" : 0.0,
			"falsePos" : {
				"count": 0,
				"percent" : 0.0
			},
			"falseNegOrRejected" : {
				"count": 0,
				"percent" : 0.0
			},
			"misclassified" : {
				"count": 0,
				"percent" : 0.0
			},
			"totalError" : {
				"count": 0,
				"percent" : 0.0
			},
			"correct" : {
				"count": 4,
				"percent" : 100.0
			}
		},
		"bestConfidenceThresholdInfo": {
			"falseNegativeOrRejectedWeight": 1.0,
			"falsePositiveWeight" : 1.0,
			"misclassifiedWeight" : 1.0,
			"bestConfidenceThreshold" : 49
		}
	}
}
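A test output file of this shape can be summarized with a few lines of script. The sketch below uses hypothetical helper names and a shortened copy of the sample. It assumes that each matrix row belongs to a target class and that the columns follow the order of header, which matches the sample but is not stated explicitly. "Correct" is the only result value shown in this document, so only that value is checked.

```python
import json

# Shortened copy of the sample test output.
SAMPLE = """
{
  "testOutputData": {
    "fileMatchInfoVect": [
      {"docName": "02.jpg#1", "targetClass": "BusinessCard", "predictedClass": "BusinessCard", "confidence": 100, "isConfident": true, "result": "Correct"},
      {"docName": "10.jpg#1", "targetClass": "BusinessCard", "predictedClass": "BusinessCard", "confidence": 100, "isConfident": true, "result": "Correct"},
      {"docName": "03.jpg#1", "targetClass": "Receipt", "predictedClass": "Receipt", "confidence": 100, "isConfident": true, "result": "Correct"},
      {"docName": "08.jpg#1", "targetClass": "Receipt", "predictedClass": "Receipt", "confidence": 100, "isConfident": true, "result": "Correct"}
    ],
    "confusionMatrix": {
      "header": ["BusinessCard", "Receipt", "<Rejected>"],
      "matrix": [
        {"className": "BusinessCard", "numOfDocsVect": [2, 0, 0]},
        {"className": "Receipt", "numOfDocsVect": [0, 2, 0]},
        {"className": "Alien documents", "numOfDocsVect": [0, 0, 0]}
      ]
    }
  }
}
"""


def count_correct(data):
    # Return (number of "Correct" results, total number of documents).
    matches = data["testOutputData"]["fileMatchInfoVect"]
    correct = sum(1 for m in matches if m["result"] == "Correct")
    return correct, len(matches)


def diagonal_counts(data):
    # For each row whose class also appears in the header, return the
    # count in the column of the same class.
    cm = data["testOutputData"]["confusionMatrix"]
    header = cm["header"]
    counts = {}
    for row in cm["matrix"]:
        name = row["className"]
        if name in header:
            counts[name] = row["numOfDocsVect"][header.index(name)]
    return counts


data = json.loads(SAMPLE)
print(count_correct(data))
print(diagonal_counts(data))
```

Rows without a matching column, such as "Alien documents", are skipped by diagonal_counts.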

CSV output with the test command

Use the -csv parameter with the test command to create a CSV (Comma Separated Values) output file that contains the test result information in a format similar to the one displayed by the DCAssistant tool. You can open this CSV file in Excel or other spreadsheet applications. To build the command, adapt the examples in the JSON output with the test command section.

Excel uses the default list separator character to separate columns in the CSV file. To configure the default list separator under Windows, do the following:

  1. On the Start menu, click Control Panel.

  2. Click Region.

  3. In the Region dialog box, click the Formats tab.

  4. Click Additional settings.

  5. Check the current value or type a new separator in the List separator box.

  6. Click OK twice.

DCTool uses the "," (comma) character as the default list separator. To specify an alternative list separator character for DCTool (for example, to align with the default list separator character configured in Control Panel), use the -csv_separator parameter with the test command.
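A script that reads the CSV output needs to use the same separator that was used to write it. The sketch below reads separator-delimited text with the Python standard library. The function name is hypothetical, and the column names and rows in the sample text are invented for illustration only; they are not the actual layout of the DCTool CSV output.

```python
import csv
import io


def read_rows(text, separator=","):
    # Parse separator-delimited text into a list of rows (lists of strings).
    return list(csv.reader(io.StringIO(text), delimiter=separator))


# Hypothetical content written with ";" as the list separator.
sample = "docName;targetClass;predictedClass\n02.jpg#1;BusinessCard;BusinessCard\n"
rows = read_rows(sample, separator=";")
print(rows[0])
print(rows[1])
```

To read a real file, open it with newline="" and pass the file object to csv.reader with the same delimiter argument.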