The evaluation tool can be downloaded here.
The ICDAR 2019 cTDaR competition evaluates two aspects of table analysis: table detection and table recognition. We use metric (i) to evaluate table region detection and metric (ii) to evaluate table recognition. These measures allow the overall performance of different algorithms to be compared with each other.
Metric for table region detection task
Intersection over Union (IoU) is calculated to decide whether a table region is correctly detected. It measures the overlap between the detected polygon and the ground-truth polygon:
$$IoU=\frac{area(GTP \cap DTP)}{area(GTP \cup DTP)}$$
where GTP denotes the Ground Truth Polygon of the table region and DTP denotes the Detected Table Polygon. IoU ranges from 0 to 1, where 1 indicates the best possible segmentation. During evaluation, different IoU thresholds are used to decide whether a region counts as correctly detected. Precision and recall are then computed from a method’s ranked output. Recall is the proportion of all positive examples that are ranked above a given rank. Precision is the proportion of all examples above that rank that belong to the positive class. The F1 score is the harmonic mean of precision and recall. Precision, recall and F1 score are each calculated at IoU thresholds of 0.6, 0.7, 0.8 and 0.9.
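The detection metric can be illustrated with a minimal Python sketch. It makes several assumptions that are not part of the protocol text: regions are simplified to axis-aligned rectangles rather than general polygons, detections are matched greedily and one-to-one to ground-truth regions, a match counts when IoU is greater than or equal to the threshold, and the ranking of the output is ignored. The names `iou` and `detection_scores` are hypothetical and do not refer to the official evaluation tool.

```python
from typing import List, Tuple

# Assumption: a region is an axis-aligned box (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(gt: List[Box], det: List[Box],
                     threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 at one IoU threshold (greedy matching)."""
    matched = set()  # indices of ground-truth boxes already matched
    true_pos = 0
    for d in det:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt):
            if j in matched:
                continue
            value = iou(g, d)
            if value > best_iou:
                best_j, best_iou = j, value
        # Assumption: ">=" is used; the text does not fix the comparison.
        if best_j >= 0 and best_iou >= threshold:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(det) if det else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    ground_truth = [(0, 0, 10, 10)]
    detected = [(0, 0, 10, 8), (20, 20, 30, 30)]
    for t in (0.6, 0.7, 0.8, 0.9):
        print(t, detection_scores(ground_truth, detected, t))
```

In this toy case the first detection overlaps the ground truth with IoU 0.8 and the second does not overlap at all, so the detection counts as correct at the lower thresholds but not at 0.9.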
Metric for table recognition task
This track is evaluated by comparing the structure of a table, which is defined as a matrix of cells. For each cell, participants are required to return the coordinates of a polygon defining the cell (historical documents) or a polygon defining the convex hull of the cell's contents (modern documents). Additionally, participants must provide the start/end column/row information for each cell. We propose the following metric: Cell’s adjacency relation-based table structure evaluation (inspired by Gobel’s method [2]).
To compare two cell structures, we proceed as follows: for each table region, we align each groundtruth cell to the predicted cell with IoU > σ, identify the valid predicted cells, and then generate a list of adjacency relations between each valid cell and its nearest neighbor in the horizontal and vertical directions. Blank cells are not represented in the grid. No adjacency relations are generated between blank cells or between a blank cell and a content cell. This 1-D list of adjacency relations can be compared to the groundtruth using precision and recall measures. If both cells are identical and the direction matches, the relation is marked as correctly retrieved; otherwise it is marked as incorrect.
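The adjacency-relation comparison can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the official evaluation code: cells are given by hypothetical identifiers with inclusive start/end row and column indices, the predicted cells are assumed to have already been aligned to groundtruth cells (so matched cells share an identifier), blank cells are simply absent from the input, and the nearest neighbor is searched to the right and downward, skipping empty grid positions.

```python
from typing import Dict, Set, Tuple

# Assumption: cell id -> (start_row, end_row, start_col, end_col), inclusive.
Cells = Dict[str, Tuple[int, int, int, int]]
Relation = Tuple[str, str, str]  # (cell, neighbor, direction)


def adjacency_relations(cells: Cells) -> Set[Relation]:
    """Relations between each cell and its nearest right/lower neighbor."""
    grid = {}  # (row, col) -> cell id; blank positions are not stored
    for cid, (r0, r1, c0, c1) in cells.items():
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[(r, c)] = cid
    if not grid:
        return set()
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1

    relations: Set[Relation] = set()
    for cid, (r0, r1, c0, c1) in cells.items():
        # Nearest neighbor to the right, for every row the cell spans.
        for r in range(r0, r1 + 1):
            for c in range(c1 + 1, n_cols):
                if (r, c) in grid:
                    relations.add((cid, grid[(r, c)], "horizontal"))
                    break
        # Nearest neighbor below, for every column the cell spans.
        for c in range(c0, c1 + 1):
            for r in range(r1 + 1, n_rows):
                if (r, c) in grid:
                    relations.add((cid, grid[(r, c)], "vertical"))
                    break
    return relations


def relation_scores(gt: Set[Relation],
                    pred: Set[Relation]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over two sets of adjacency relations."""
    correct = len(gt & pred)  # same two cells and same direction
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gt_cells = {"a": (0, 0, 0, 0), "b": (0, 0, 1, 1),
                "c": (1, 1, 0, 0), "d": (1, 1, 1, 1)}
    pred_cells = {"a": (0, 0, 0, 0), "b": (0, 0, 1, 1), "c": (1, 1, 0, 0)}
    print(relation_scores(adjacency_relations(gt_cells),
                          adjacency_relations(pred_cells)))
```

Storing relations as a set of (cell, neighbor, direction) triples makes the comparison a plain set intersection, which mirrors the rule that a relation is correct only when both cells and the direction agree.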
Precision, recall and F1 score are calculated at IoU thresholds of 0.6, 0.7, 0.8 and 0.9, as in the evaluation for track A.
The final ranking of teams is decided by the weighted average F1 (WAvg. F1) value over the whole dataset for each track. The WAvg. F1 value is defined as:
$$ WAvg. F1 = \frac{\sum\limits_{i=1}^4 IoU_i \cdot F1@IoU_i}{\sum\limits_{i=1}^4 IoU_i}$$
that is, each F1 value is weighted by its corresponding IoU threshold, with the thresholds being 0.6, 0.7, 0.8 and 0.9. We consider results at higher IoU thresholds more important than those at lower ones, so we use the IoU threshold as the weight of each F1 value to obtain a single performance score for convenient comparison.
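As a worked illustration of the formula, the short Python sketch below computes the weighted average from a mapping of IoU threshold to F1 score. The function name `weighted_avg_f1` and the F1 values in the example are hypothetical and chosen only to show the arithmetic.

```python
from typing import Dict


def weighted_avg_f1(f1_by_iou: Dict[float, float]) -> float:
    """Weighted average F1, using each IoU threshold as its own weight."""
    weight_sum = sum(f1_by_iou.keys())
    if weight_sum == 0:
        raise ValueError("at least one non-zero IoU threshold is required")
    weighted = sum(threshold * f1 for threshold, f1 in f1_by_iou.items())
    return weighted / weight_sum


if __name__ == "__main__":
    # Hypothetical F1 scores at the four thresholds used in the competition.
    scores = {0.6: 0.9, 0.7: 0.8, 0.8: 0.7, 0.9: 0.6}
    print(round(weighted_avg_f1(scores), 4))
```

With these example values the numerator is 0.54 + 0.56 + 0.56 + 0.54 = 2.2 and the denominator is 3.0, giving roughly 0.733, slightly below the unweighted mean of 0.75 because the weaker scores sit at the more heavily weighted thresholds.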
We will also release a number of tools to enable the participants to automatically compare their result to the groundtruth.
[1] L. Gao, X. Yi, Z. Jiang, L. Hao and Z. Tang, “ICDAR 2017 POD Competition,” in ICDAR, 2017, pp. 1417-1422.
[2] M. C. Gobel, T. Hassan, E. Oro and G. Orsi, “ICDAR 2013 Table Competition,” in Proc. of the 12th ICDAR (IEEE, 2013), pp. 1449-1453.