What is image classification?

Image classification is a fast, reliable algorithm that identifies documents by their layout. It uses the geometrical features of a document to match it against other, similar documents. These features are not restricted to logos or lines: they cover the complete structure of a page, including text, logos, boxes, lines, and other graphical elements.
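As a rough illustration of what "layout as geometrical features" can mean, the sketch below reduces a black-and-white page to a coarse grid of ink densities. This is not the product's actual feature extraction, which is not described here. The function name `layout_signature`, the grid size, and the toy page are all assumptions made for the example.

```python
def layout_signature(page, rows=4, cols=4):
    """Reduce a binary page (list of rows of 0/1) to a rows x cols grid of ink densities.

    Hypothetical helper for illustration only. Assumes the page is at least
    rows x cols pixels so that no grid cell is empty.
    """
    height, width = len(page), len(page[0])
    signature = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            ink = sum(page[y][x] for y in range(y0, y1) for x in range(x0, x1))
            row.append(ink / ((y1 - y0) * (x1 - x0)))
        signature.append(row)
    return signature


# Toy 8x8 page: a horizontal rule along the top and a filled box at the lower left.
page = [[0] * 8 for _ in range(8)]
for x in range(8):
    page[0][x] = 1
for y in range(4, 8):
    for x in range(4):
        page[y][x] = 1

signature = layout_signature(page, rows=2, cols=2)
print(signature)  # [[0.25, 0.25], [1.0, 0.0]]
```

Two pages with the same rule and box produce the same signature regardless of what the text inside them says, which is the sense in which the layout, rather than the content, identifies the document.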

The relevant features of a document are determined automatically. The system disregards variable information, such as hand-printed text or stains, and keeps the relevant information. If several documents share similar characteristics, the classification algorithm can use those characteristics to separate or classify the documents automatically. This is typically the case for structured documents such as forms. Image classification can also be used for correspondence such as invoices and sales orders, because each sender's layout can easily be distinguished from the others.
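One simple way to picture how relevant features could be separated from variable ones is to compare several samples of the same document type and keep only the cells on which they agree. The sketch below does this on small binary grids. It is an assumed, simplified stand-in for the automatic feature selection described above, and `learn_pattern` and `min_agreement` are hypothetical names.

```python
def learn_pattern(samples, min_agreement=0.8):
    """Build a pattern from equally sized binary grids (illustrative sketch).

    A cell keeps its value (0 or 1) when at least `min_agreement` of the
    samples agree on it. Otherwise it is set to None, meaning "variable,
    ignore this cell".
    """
    rows, cols = len(samples[0]), len(samples[0][0])
    pattern = []
    for y in range(rows):
        row = []
        for x in range(cols):
            share_of_ink = sum(sample[y][x] for sample in samples) / len(samples)
            if share_of_ink >= min_agreement:
                row.append(1)       # stable printed element
            elif 1 - share_of_ink >= min_agreement:
                row.append(0)       # stable blank area
            else:
                row.append(None)    # varies between samples (e.g. handwriting)
        pattern.append(row)
    return pattern


# Three scans of the same hypothetical form. The top row is printed layout;
# the bottom-right cell holds handwriting that differs from scan to scan.
samples = [
    [[1, 1, 1], [0, 0, 1]],
    [[1, 1, 1], [0, 0, 0]],
    [[1, 1, 1], [0, 0, 1]],
]
pattern = learn_pattern(samples)
print(pattern)  # [[1, 1, 1], [0, 0, None]]
```

The printed top row survives as a feature, while the cell that changes between scans is dropped from the pattern.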

The algorithm uses very fast matching to compare document images with a learned pattern. The match is fault-tolerant, almost independent of contrast, and redundant, so small local differences between learned images and classified images are ignored as long as the majority of features are similar. Stamps and stains therefore do not cause problems. Shifts, skews, and size differences are accepted within a limit of about four percent. Because the algorithm is very fast and needs no OCR, this classification method can be used in time-critical applications.
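The tolerance described above can be sketched as a match score that is computed for every small shift within about four percent of the page size, keeping the best result. This is a minimal illustration under stated assumptions, not the actual matching algorithm: it handles shifts only (not skew or scaling), works on binary grids, and counts blank cells as well as ink. The names `match_score` and `best_match` and the 25 x 25 test pages are invented for the example.

```python
def match_score(pattern, page, dy=0, dx=0):
    """Share of overlapping cells that agree when the page is shifted by (dy, dx)."""
    rows, cols = len(pattern), len(pattern[0])
    agree = compared = 0
    for y in range(rows):
        for x in range(cols):
            py, px = y + dy, x + dx
            if 0 <= py < rows and 0 <= px < cols:
                compared += 1
                agree += pattern[y][x] == page[py][px]
    return agree / compared if compared else 0.0


def best_match(pattern, page, tolerance=0.04):
    """Best score over all shifts within `tolerance` of the page size."""
    max_dy = round(len(pattern) * tolerance)
    max_dx = round(len(pattern[0]) * tolerance)
    return max(
        match_score(pattern, page, dy, dx)
        for dy in range(-max_dy, max_dy + 1)
        for dx in range(-max_dx, max_dx + 1)
    )


def page_with_rules(row, col, size=25):
    """Blank page with one horizontal rule at `row` and one vertical rule at `col`."""
    page = [[0] * size for _ in range(size)]
    for i in range(size):
        page[row][i] = 1
        page[i][col] = 1
    return page


learned = page_with_rules(row=5, col=10)

# Same layout shifted by one cell (4% of 25), plus a 3x3 "stamp".
scanned = page_with_rules(row=6, col=11)
for y in range(18, 21):
    for x in range(18, 21):
        scanned[y][x] = 1

# A different layout altogether.
other = page_with_rules(row=20, col=3)

print(best_match(learned, scanned) > 0.95)  # True: shift and stamp are tolerated
print(best_match(learned, other) < 0.90)    # True: different layout scores lower
```

Because the score counts all cells, the stamp lowers it only slightly, and a document is accepted when the score stays above a chosen threshold. No text is read at any point, which is why an approach of this kind needs no OCR.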