TableXTract output XML format
The following annotated example walks through the TableXTract output XML format line by line.
<?xml version="1.0" encoding="utf-8"?>
The <Document> tag specifies the number of pages in the npages attribute.
<Document npages="2" version="2" encrypted="false">
The description of recognized pages begins here.
<Pages>
Pages are defined by <Pag> tags with the following attributes:
- id: The ID of the page
- wd: Page width in pixels
- ht: Page height in pixels
<Pag id="0" wd="4950" ht="3825"><Img vs="0" im="D:\Temp\tablextract\test\out0.tif"/></Pag>
<Pag id="1" wd="4950" ht="3825"><Img vs="0" im="D:\Temp\tablextract\test\out1.tif"/></Pag>
</Pages>
<Metadata>
</Metadata>
The <Elements> tag below includes each recognized word as an <Elm> tag with the following attributes:
- id: The ID of the element
- pg: The ID of the page
- rt: The rectangle enclosing the word [left,top-right,bottom]
<Elements>
<Elm id="0" pg="0" rt="[20,59-124,79]">Plant:</Elm>
<Elm id="1" pg="0" rt="[152,60-225,79]">SANF</Elm>
<Elm id="2" pg="0" rt="[1370,59-1595,85]">Productivity</Elm>
...
The word elements for the next page:
<Elm id="787" pg="1" rt="[920,217-1012,237]">Clock</Elm>
<Elm id="788" pg="1" rt="[1032,218-1086,237]">Hrs</Elm>
<Elm id="789" pg="1" rt="[1164,166-1237,191]">Temp</Elm>
...
</Elements>
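The rt values shown above can be split into their four coordinates with a small helper. The following is a minimal sketch, not part of TableXTract: the names Rect and parse_rt are hypothetical, and it assumes that all coordinates are non-negative integers, as they are in this example.

```python
import re
from typing import NamedTuple


class Rect(NamedTuple):
    """Rectangle taken from an rt attribute (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


# Assumption: coordinates are non-negative integers in the
# [left,top-right,bottom] layout described above.
_RT_PATTERN = re.compile(r"\[(\d+),(\d+)-(\d+),(\d+)\]")


def parse_rt(rt: str) -> Rect:
    """Parse an rt value such as "[20,59-124,79]" into a Rect."""
    match = _RT_PATTERN.fullmatch(rt.strip())
    if match is None:
        raise ValueError(f"unexpected rt value: {rt!r}")
    return Rect(*(int(group) for group in match.groups()))


# The first word of the example ("Plant:").
rect = parse_rt("[20,59-124,79]")
width = rect.right - rect.left
height = rect.bottom - rect.top
```

The same helper applies to the rt attribute of the <Fld> tags, which uses the same layout.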
The description of recognized table components begins here. The <Fields> tag describes the table parts with <Fld> tags. <Fld> tags represent a field, an area in a table: either a row, column, or cell.
<Fields>
The main attributes of the <Fld> tag:
- id: The ID of the field
- pg: The page number. The numbering is zero-based.
- rt: The rectangle enclosing the field [left,top-right,bottom]
The lb (label) attribute defines what the <Fld> tag represents:
- lb="_Row": a row in the table. rt: the rectangle enclosing the row [left,top-right,bottom]
- lb="_Column": a column in the table. rt: the rectangle enclosing the column [left,top-right,bottom]
Row definitions follow.
<Fld id="0" pg="0" rt="[7,59-3461,243]" lb="_Row" cf="@@@@H@CO"></Fld>
<Fld id="1" pg="0" rt="[7,319-3461,346]" lb="_Row" cf="@@@@H@CO"></Fld>
Rows with lb="_RowDescriptor" are not divided into cells. They typically contain explanatory text and do not follow the overall column layout of the table.
<Fld id="21" pg="1" rt="[159,1371-2208,1392]" lb="_RowDescriptor" cf="@@@@H@CO">At 31 Dec</Fld>
<Fld id="22" pg="1" rt="[159,1402-2208,1423]" lb="_RowDescriptor" cf="@@@@H@CO">2018</Fld>
..
Column definitions follow.
The <Fld> tags here represent headerless columns. In this case, the <Fld> tag surrounds the following string: "(empty)".
<Fld id="35" pg="1" rt="[566,1831-727,2142]" lb="_Column" cf="@@@@H@CO">(empty)</Fld>
<Fld id="36" pg="1" rt="[834,1831-904,2142]" lb="_Column" cf="@@@@H@CO">(empty)</Fld>
..
The following <Fld> tags represent columns with headers. In this case, the <Fld> tag surrounds the column header text, for example "Cost Cntr" in the first line below.
<Fld id="41" pg="2" rt="[20,166-93,2973]" lb="_Column" cf="@@@@H@CO">Cost Cntr</Fld>
<Fld id="42" pg="2" rt="[112,217-488,2973]" lb="_Column" cf="@@@@H@CO">Description</Fld>
..
Cell definitions follow the row and column definitions. Cells appear in consecutive rows, ordered from left to right within a row.
The header cells appear first if there are any headers defined above. Cells with lb="_ColumnDescriptor" describe a merged header cell (a header cell that spans multiple columns).
<Fld id="57" pg="0" rt="[7,26-495,131]" lb="_ColumnDescriptor" cf="@@@@H@CO">Plant: SANF</Fld>
<Fld id="58" pg="0" rt="[495,26-2446,131]" lb="_ColumnDescriptor" cf="@@@@H@CO">Productivity</Fld>
..
Cells with lb="_ColumnHeader" describe a regular header cell. The column header text surrounded by the <Fld> tag matches the related column header text defined above. While the column definitions above refer to the whole column, the definition below describes the header cell only. Compare the rectangle (rt) attributes to tell the two areas apart.
<Fld id="60" pg="0" rt="[20,166-93,243]" lb="_ColumnHeader" cf="@@@@H@CO">Cost Cntr</Fld>
<Fld id="61" pg="0" rt="[112,217-488,243]" lb="_ColumnHeader" cf="@@@@H@CO">Description</Fld>
..
The columns are named if there are any headers defined above. Cells can refer to the columns by specifying these names in the label (lb) attribute. In the example below, lb="Cost Cntr" assigns the cell containing the string "4104" to the column with the header text "Cost Cntr". (This header text is set in the <Fld> element with id="41".)
<Fld id="77" pg="0" rt="[20,319-93,346]" lb="Cost Cntr" cf="@@@@H@CO">4104</Fld>
<Fld id="78" pg="0" rt="[112,319-488,346]" lb="Description" cf="@@@@H@CO">Training/Meeting Car</Fld>
..
The columns have no name if there are no headers defined above. In this case, lb="(empty)".
<Fld id="184" pg="1" rt="[1481,482-1581,577]" lb="(empty)" cf="@@@@H@CO">60 and 90 , $be</Fld>
<Fld id="185" pg="1" rt="[1671,482-1785,577]" lb="(empty)" cf="@@@@H@CO">90 and 180 days $bn</Fld>
..
</Fields>
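The lb values described above can be used to sort the <Fld> tags into rows, columns, header cells, and data cells. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name group_fields and the "cell" bucket are hypothetical, the sample reuses a few lines from this example, and the sketch assumes that every lb value outside the five structural labels marks a data cell.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A few <Fld> lines copied from the example above.
SAMPLE = """<Fields>
<Fld id="0" pg="0" rt="[7,59-3461,243]" lb="_Row" cf="@@@@H@CO"></Fld>
<Fld id="41" pg="2" rt="[20,166-93,2973]" lb="_Column" cf="@@@@H@CO">Cost Cntr</Fld>
<Fld id="60" pg="0" rt="[20,166-93,243]" lb="_ColumnHeader" cf="@@@@H@CO">Cost Cntr</Fld>
<Fld id="77" pg="0" rt="[20,319-93,346]" lb="Cost Cntr" cf="@@@@H@CO">4104</Fld>
<Fld id="184" pg="1" rt="[1481,482-1581,577]" lb="(empty)" cf="@@@@H@CO">60 and 90 , $be</Fld>
</Fields>"""

# Labels that describe table structure rather than a data cell.
STRUCTURAL_LABELS = {
    "_Row", "_RowDescriptor", "_Column", "_ColumnDescriptor", "_ColumnHeader",
}


def group_fields(fields_root):
    """Group <Fld> elements by role; non-structural labels go to "cell"."""
    groups = defaultdict(list)
    for fld in fields_root.iter("Fld"):
        label = fld.get("lb", "")
        kind = label if label in STRUCTURAL_LABELS else "cell"
        groups[kind].append(fld)
    return groups


groups = group_fields(ET.fromstring(SAMPLE))
# For data cells, lb holds the column name (or "(empty)").
cells = [(fld.get("lb"), fld.text) for fld in groups["cell"]]
```

Here cells pairs each data cell with its column name, so the cell "4104" is paired with "Cost Cntr" and the headerless cell is paired with "(empty)".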
Table definitions begin here with the <Tables> tag.
<Tables>
Within the <Tables> tag, <Tbl> tags specify the table connections and basic layout. The main attributes of the <Tbl> tag:
- nm: The page number and the number of the table within the page, in English, according to the following template: Page x (table y). Both numbers are zero-based. In the example below, the table is the first table on the first page.
- rs: The number of rows in the table.
- cs: The number of columns in the table.
<Tbl nm="Page 0 (table 0)" rs="40" cs="17" vrt="0" nxt="1" prv="-1" hdr="0.950000">
Tables with the same structure can be continued in sequence using the following attributes:
- nxt: The number of the next connected table. The value is "-1" if there are no further tables in the sequence.
- prv: The number of the previous connected table. The value is "-1" if there is no previous table in the sequence.
For connected consecutive tables, the nm attribute contains the "CONTINUED" string.
The following example describes a table that continues table 0 (prv="0") and has no further connected table (nxt="-1"):
<Tbl nm="CONTINUED" rs="36" cs="17" vrt="0" nxt="-1" prv="0" hdr="0.950000">
There are no connected tables in the following example:
<Tbl nm="Page 0 (table 1)" rs="8" cs="13" vrt="0" nxt="-1" prv="-1" hdr="0.880000">
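The nxt and prv attributes make it possible to collect the parts of a continued table. The following is a minimal sketch under a loudly stated assumption: it treats the table number in nxt and prv as the zero-based position of the <Tbl> tag within <Tables>, which matches the three examples above but is not confirmed by this description. The function name table_chains is hypothetical.

```python
import xml.etree.ElementTree as ET

# The three <Tbl> start tags from the examples above, closed for parsing.
SAMPLE = """<Tables>
<Tbl nm="Page 0 (table 0)" rs="40" cs="17" vrt="0" nxt="1" prv="-1" hdr="0.950000"></Tbl>
<Tbl nm="CONTINUED" rs="36" cs="17" vrt="0" nxt="-1" prv="0" hdr="0.950000"></Tbl>
<Tbl nm="Page 0 (table 1)" rs="8" cs="13" vrt="0" nxt="-1" prv="-1" hdr="0.880000"></Tbl>
</Tables>"""


def table_chains(tables_root):
    """Return lists of <Tbl> positions, one list per connected sequence."""
    tbls = tables_root.findall("Tbl")
    chains = []
    for index, tbl in enumerate(tbls):
        if tbl.get("prv", "-1") != "-1":
            continue  # A continuation; it is reached from its first part.
        chain = [index]
        nxt = int(tbl.get("nxt", "-1"))
        # Follow nxt links; stop on -1, a bad index, or a repeated table.
        while 0 <= nxt < len(tbls) and nxt not in chain:
            chain.append(nxt)
            nxt = int(tbls[nxt].get("nxt", "-1"))
        chains.append(chain)
    return chains


tables = ET.fromstring(SAMPLE)
tbls = tables.findall("Tbl")
chains = table_chains(tables)
# Sum of the rs attributes over each connected sequence.
total_rows = [sum(int(tbls[i].get("rs")) for i in chain) for chain in chains]
```

With the three sample tags, the first two tables form one sequence and the third stands alone. The row totals are plain sums of the rs attributes and do not account for header rows that may repeat in a continuation.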
A <Fld> tag follows, specifying the location and size of the table with the following attributes:
- pg: The number of the page. The numbering is zero-based.
- rt: The rectangle enclosing the table [left,top-right,bottom]
<Fld id="0" pg="0" rt="[7,26-3461,2973]" lb="Generic" cf="@@@@H@CO"></Fld>
Table structure definitions begin here.
<Structure>
<Title fi="-1"></Title>
Row structure definitions come first, with fi attributes linking to the id attribute of the related <Fld> definition above.
<SRow fi="0"></SRow>
<SRow fi="1"></SRow>
...
Column structure definitions follow, with fi attributes linking to the id attribute of the related <Fld> definition above.
<SCol fi="40"></SCol>
<SCol fi="41"></SCol>
...
Column structure descriptor definitions link to <Fld> tags with the _ColumnDescriptor label (lb="_ColumnDescriptor"). Those are typically merged cells in table headers. The following <SColDesc> tag refers to the <Fld> tag above with id="57", enclosing the text "Plant: SANF".
<SColDesc fi="57"></SColDesc>
<SColDesc fi="58"></SColDesc>
</Structure>
Cell structure definitions follow the row and column structure definitions. Cells appear in consecutive rows, ordered from left to right within a row. The following fi attributes link to the id attribute of the related <Fld> definition above.
<Row>
<Col fi="60"></Col>
<Col fi="61"></Col>
Empty cells are represented by the following tag:
<Empt/>
Then cells may continue until the row is filled:
<Col fi="63"></Col>
<Col fi="64"></Col>
...
</Row>
The structure definition continues row by row.
<Row>
<Col fi="77"></Col>
<Col fi="78"></Col>
...
</Row>
...
</Tbl>
The structure continues with the next table, which can be on the same page or on a following page. The table example below continues the table detailed above:
<Tbl nm="CONTINUED" rs="36" cs="17" vrt="0" nxt="-1" prv="0" hdr="0.950000">
<Fld id="0" pg="1" rt="[7,26-3461,3023]" lb="Generic" cf="@@@@H@CO"></Fld>
<Structure>
<Title fi="-1"></Title>
<SRow fi="740"></SRow>
<SRow fi="741"></SRow>
....
<SCol fi="776"></SCol>
<SCol fi="777"></SCol>
...
<SColDesc fi="793"></SColDesc>
</Structure>
<Row>
<Col fi="796"></Col>
<Col fi="797"></Col>
...
</Row>
<Row>
...
</Row>
</Tbl>
</Tables>
</Document>
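Putting the pieces together, the cell text of each table can be recovered by resolving the fi attribute of every <Col> tag against the id of the <Fld> tags under <Fields>. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name table_grids is hypothetical, and the sample is a shortened document assembled from lines of this example, with the rs and cs values and the position of <Empt/> chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative document built from lines of the example above.
SAMPLE = """<Document npages="2" version="2" encrypted="false">
<Fields>
<Fld id="60" pg="0" rt="[20,166-93,243]" lb="_ColumnHeader" cf="@@@@H@CO">Cost Cntr</Fld>
<Fld id="61" pg="0" rt="[112,217-488,243]" lb="_ColumnHeader" cf="@@@@H@CO">Description</Fld>
<Fld id="77" pg="0" rt="[20,319-93,346]" lb="Cost Cntr" cf="@@@@H@CO">4104</Fld>
<Fld id="78" pg="0" rt="[112,319-488,346]" lb="Description" cf="@@@@H@CO">Training/Meeting Car</Fld>
</Fields>
<Tables>
<Tbl nm="Page 0 (table 0)" rs="2" cs="3" vrt="0" nxt="-1" prv="-1" hdr="0.950000">
<Fld id="0" pg="0" rt="[7,26-3461,2973]" lb="Generic" cf="@@@@H@CO"></Fld>
<Row><Col fi="60"></Col><Col fi="61"></Col><Empt/></Row>
<Row><Col fi="77"></Col><Col fi="78"></Col><Empt/></Row>
</Tbl>
</Tables>
</Document>"""


def table_grids(document):
    """Return one list of rows per <Tbl>; each row is a list of cell texts."""
    # Index only the <Fld> tags directly under <Fields>. The <Fld> inside
    # each <Tbl> describes the table area and reuses ids such as "0".
    text_by_id = {
        fld.get("id"): (fld.text or "")
        for fld in document.findall("Fields/Fld")
    }
    grids = []
    for tbl in document.findall("Tables/Tbl"):
        grid = []
        for row in tbl.findall("Row"):
            cells = []
            for child in row:
                if child.tag == "Col":
                    cells.append(text_by_id.get(child.get("fi"), ""))
                elif child.tag == "Empt":
                    cells.append("")  # Empty cell placeholder.
            grid.append(cells)
        grids.append(grid)
    return grids


grids = table_grids(ET.fromstring(SAMPLE))
```

Each grid keeps the order described above: rows appear consecutively and cells run from left to right, with an empty string standing in for every <Empt/> tag.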