Table Extraction from Scanned Documents
Optical character recognition converts document images into text, but text is only part of what documents contain. Tables encode structured relationships — financial figures aligned in columns, experimental results organized by condition, schedules mapping time to activity. Standard OCR treats a table as lines of text, losing the spatial relationships that give the data meaning. The number "42.5" means nothing without knowing which row and column it belongs to.
Table extraction recovers this structure. It detects where tables appear in a document, identifies the arrangement of rows, columns, and cells, and maps recognized text into a structured format that preserves the relationships the original author intended. This is one of the most challenging problems in document understanding, requiring both visual analysis of layout and semantic understanding of content organization.
Why Tables Are Hard
Tables seem visually simple — grids of text separated by lines. But the variation in real documents makes automatic extraction surprisingly difficult.
Ruling lines are optional. Many tables use horizontal and vertical rules to delineate cells, but many do not. Financial statements, academic papers, and government forms routinely use whitespace alignment instead of drawn borders. A table with no visible lines requires the extraction system to infer cell boundaries from text alignment alone.
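For borderless tables, one workable heuristic is to project the OCR word boxes onto the x-axis and treat uncovered spans as column separators. A minimal sketch (the `infer_column_gaps` helper and its inputs are illustrative, not from any particular library):

```python
def infer_column_gaps(word_boxes, page_width, min_gap=10):
    """Infer column separators from whitespace alignment: x-ranges that
    no word box covers are candidate gaps between columns.

    word_boxes: list of (x0, x1) horizontal extents of recognized words.
    Returns (gap_start, gap_end) spans at least min_gap pixels wide.
    """
    covered = [False] * page_width
    for x0, x1 in word_boxes:
        for x in range(max(0, x0), min(page_width, x1)):
            covered[x] = True
    gaps, start = [], None
    for x, ink in enumerate(covered):
        if not ink and start is None:
            start = x                      # gap opens
        elif ink and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))    # gap closes, wide enough to keep
            start = None
    if start is not None and page_width - start >= min_gap:
        gaps.append((start, page_width))   # trailing gap at page edge
    return gaps
```

In practice the gap threshold must exceed normal inter-word spacing, or every space between words becomes a spurious column boundary.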
Spanning cells break the grid. Headers that span multiple columns, row labels that span multiple rows, and merged cells create irregular structures that do not fit a simple row-column model. A table with a header "Revenue (USD millions)" spanning four quarterly columns must be recognized as a single cell covering multiple grid positions.
Nested and hierarchical headers. Complex tables use multi-level headers where columns are grouped under parent categories. A table might have "2024" and "2025" as top-level headers, each containing "Q1", "Q2", "Q3", "Q4" sub-headers. Recognizing this hierarchy is necessary for correct data extraction.
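One common way to make such a hierarchy usable downstream is to flatten it into compound column names such as "2024 Q1". A toy sketch, assuming the structure recognizer already reports top-level labels with their column spans:

```python
def flatten_headers(top_row, sub_row):
    """Flatten a two-level header into compound column names.

    top_row: list of (label, colspan) pairs, e.g. [("2024", 4), ("2025", 4)].
    sub_row: flat list of sub-header labels, one per leaf column.
    """
    names = []
    i = 0
    for label, span in top_row:
        for _ in range(span):
            # Each leaf column inherits its spanning parent's label
            names.append(f"{label} {sub_row[i]}")
            i += 1
    return names
```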
Visual ambiguity. Lists, forms, and multi-column text layouts can visually resemble tables without being tables. Conversely, some tables use such minimal visual structure that they are difficult to distinguish from regular text. The preprocessing pipeline must handle both false positives and false negatives.
Table Detection
The first step is finding tables in the document image. Table detection answers the question: where are the tables on this page?
Early approaches used hand-crafted rules based on line detection (finding horizontal and vertical ruling lines) and whitespace analysis (detecting aligned gaps between text blocks). These methods worked well for simple tables with clear borders but failed on borderless tables and documents with complex layouts.
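The line-detection side of these classical methods can be illustrated with a projection profile: sum the ink pixels along each row and flag rows that are almost entirely ink as ruling lines (transposing the image finds vertical rules the same way). A simplified sketch on a binarized image:

```python
def find_ruling_lines(binary, min_fill=0.9):
    """Classical projection-profile detection of horizontal ruling lines.

    binary: 2D list of 0/1 pixels (1 = ink). A row whose fraction of
    ink pixels is at least min_fill is treated as a horizontal rule.
    """
    rules = []
    width = len(binary[0])
    for y, row in enumerate(binary):
        if sum(row) / width >= min_fill:
            rules.append(y)
    return rules
```

The failure mode is visible in the threshold itself: a borderless table has no row that clears `min_fill`, so this detector returns nothing, which is exactly why such rules gave way to learned detectors.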
Deep learning transformed table detection into an object detection problem. Models like Faster R-CNN and, more recently, transformer-based detectors locate tables as bounding boxes in the document image, trained on annotated examples rather than hand-crafted rules.
Schreiber et al. [1] demonstrated that deep learning achieves strong table detection without requiring heuristics or PDF metadata, working directly from document images.
The challenge scales with document diversity. A detector trained on scientific papers may miss tables in financial reports, because the visual conventions differ. Fine-tuning on domain-specific documents is typically necessary for production accuracy.
Table Structure Recognition
Detection finds the table; structure recognition understands its internal organization. This means identifying rows, columns, cell boundaries, and the relationships between header cells and data cells.
Segmentation-Based Approaches
One approach treats structure recognition as a semantic segmentation problem, classifying each pixel in the table region as belonging to a row separator, column separator, cell content, or background. Paliwal et al. [2] proposed TableNet, which jointly performs detection and structure recognition by segmenting table and column regions from the document image.
The segmentation approach works well for tables with consistent structure but struggles with irregular layouts, spanning cells, and tables where columns are defined by text alignment rather than visual separators.
Transformer-Based Approaches
The Table Transformer, built on the DETR (DEtection TRansformer) architecture, treats table structure recognition as an object detection problem at the cell level. Rather than segmenting regions, it detects individual cells, rows, columns, and headers as separate objects, then assembles them into a coherent table structure.
Smock, Pesala, and Abraham [3] demonstrated that a single transformer architecture handles detection, structure recognition, and functional analysis (distinguishing headers from data cells) without task-specific customization.
This approach benefits from the transformer's ability to model global relationships — understanding that a header cell relates to all data cells below it, or that a spanning cell covers multiple grid positions. The self-attention mechanism naturally captures these long-range spatial dependencies.
```python
import torch
from PIL import Image
from transformers import TableTransformerForObjectDetection

def extract_table_structure(table_image_path, model, processor):
    """Detect rows, columns, and cells in a table image."""
    image = Image.open(table_image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)

    # Post-process raw detections into labelled boxes in image coordinates
    target_sizes = [image.size[::-1]]  # (height, width)
    results = processor.post_process_object_detection(
        outputs,
        threshold=0.7,
        target_sizes=target_sizes,
    )[0]

    cells = []
    for score, label, box in zip(
        results["scores"], results["labels"], results["boxes"]
    ):
        cells.append({
            "type": model.config.id2label[label.item()],
            "confidence": score.item(),
            "bbox": box.tolist(),
        })
    return cells
```
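The detector returns flat lists of row, column, and cell boxes; a standard post-processing step intersects the sorted row and column boxes to get a grid of cell regions. A simplified sketch, assuming axis-aligned (x0, y0, x1, y1) boxes and no spanning cells:

```python
def boxes_to_grid(row_boxes, col_boxes):
    """Assemble grid-cell bounding boxes from detected rows and columns.

    Each grid cell is the intersection of one row box with one column box:
    sort rows top-to-bottom, columns left-to-right, then take the x-extent
    from the column and the y-extent from the row.
    """
    rows = sorted(row_boxes, key=lambda b: b[1])  # by top edge
    cols = sorted(col_boxes, key=lambda b: b[0])  # by left edge
    grid = []
    for r in rows:
        grid_row = [(c[0], r[1], c[2], r[3]) for c in cols]
        grid.append(grid_row)
    return grid
```

Spanning cells complicate this: a detected cell box that overlaps several grid positions must be recorded with rowspan/colspan rather than forced into one slot.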
Training Data and Benchmarks
Progress in table extraction has been driven by increasingly large and well-annotated datasets.
TableBank
Li et al. [4] created TableBank by exploiting the native markup structure of Word and LaTeX documents, a form of weak supervision that produced over 417,000 labelled table images without manual annotation.
This scale — orders of magnitude larger than prior manually-labelled datasets — enabled training of deeper models and more robust generalization. The weak supervision approach trades annotation precision for volume, which proves effective for table detection where the visual signal is strong.
PubTabNet
Zhong, ShafieiBavani, and Yepes [5] took annotation further by providing structured HTML representations for each table, not just bounding boxes. PubTabNet contains over 568,000 table images with their corresponding HTML structure, enabling training of models that output complete table markup.
PubTabNet also introduced the Tree-Edit-Distance-based Similarity (TEDS) metric, which evaluates table recognition by comparing the predicted HTML tree structure against the ground truth. TEDS has become the standard evaluation metric because it captures structural correctness — whether cells are in the right positions with the right spanning attributes — rather than just text content.
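True TEDS computes a tree edit distance over the predicted and ground-truth HTML trees. The normalization idea can be illustrated with a cruder stand-in that flattens each tree to a token sequence and applies Levenshtein distance; this under-weights structure, so treat it only as an illustration of the formula, not a replacement for TEDS:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def teds_like(pred_tokens, true_tokens):
    """TEDS-style similarity in [0, 1]: 1 - distance / max tree size.
    Real TEDS uses tree edit distance on the HTML trees themselves."""
    if not pred_tokens and not true_tokens:
        return 1.0
    return 1 - edit_distance(pred_tokens, true_tokens) / max(
        len(pred_tokens), len(true_tokens)
    )
```

A prediction missing one of three cells scores 2/3 under this normalization, which matches the intuition that the penalty scales with how much of the structure is wrong.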
Community Benchmarks
The ICDAR competition series has driven progress through standardized evaluation. The 2019 competition on Table Detection and Recognition (cTDaR) established benchmarks on both modern and historical documents [6].
Table extraction from historical documents faces additional challenges beyond modern documents. Hand-drawn ruling lines are irregular and often broken. Handwritten cell content requires HTR rather than OCR. And the table conventions of earlier centuries may not follow modern layout patterns. The ICDAR 2019 cTDaR competition included an archival document track specifically to benchmark progress on these challenges.
The Extraction Pipeline
A complete table extraction system combines detection, structure recognition, and text recognition into an end-to-end pipeline.
Step 1: Table detection. Identify table regions in the document image using an object detector. Output bounding boxes for each detected table.
Step 2: Structure recognition. For each detected table, identify rows, columns, cell boundaries, and spanning relationships. Output a grid structure mapping each cell to its row and column positions.
Step 3: Cell text extraction. Crop each cell region and apply OCR to extract the text content. This step uses the same character recognition technology as standard OCR but applied to small, isolated text regions.
Step 4: Assembly. Combine the grid structure with the cell text to produce a structured output — CSV, JSON, HTML, or database records. Handle spanning cells, empty cells, and multi-line cell content.
```python
def assemble_table(cells):
    """Combine per-cell text and grid positions into a list-of-lists table.

    Each cell dict carries its "row", "col", and recognized "text".
    """
    rows = {}
    for cell in cells:
        row_idx = cell["row"]
        col_idx = cell["col"]
        text = cell["text"].strip()
        if row_idx not in rows:
            rows[row_idx] = {}
        rows[row_idx][col_idx] = text

    if not rows:
        return []

    # Convert to list of lists, filling empty cells
    max_col = max(c for row in rows.values() for c in row.keys())
    table_data = []
    for row_idx in sorted(rows.keys()):
        row_data = []
        for col_idx in range(max_col + 1):
            row_data.append(rows[row_idx].get(col_idx, ""))
        table_data.append(row_data)
    return table_data
```
Practical Considerations
Document type matters. Financial documents, scientific papers, government forms, and medical records each use different table conventions. A model trained on scientific papers may perform well on regular grids with clear headers but struggle with the complex nested structures common in financial statements. Domain-specific fine-tuning is usually necessary.
Quality depends on OCR accuracy. Table structure recognition can be perfect, but if the OCR step misreads cell content — particularly numbers — the extracted data is unreliable. For numerical tables, post-OCR correction and validation against expected formats (dates, currencies, percentages) catch errors that aggregate metrics miss.
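Format validation can be as simple as a table of regular expressions per column type; any cell that fails its pattern is flagged for review. A sketch with a few illustrative patterns (the pattern set would be tuned to the document domain):

```python
import re

# Illustrative format patterns; a production system would cover
# locale-specific separators, parenthesized negatives, etc.
PATTERNS = {
    "currency": re.compile(r"^-?\$?\d{1,3}(,\d{3})*(\.\d+)?$"),
    "percent": re.compile(r"^-?\d+(\.\d+)?%$"),
    "date_iso": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def validate_cell(text, expected):
    """Check OCR'd cell text against the format its column should contain."""
    return bool(PATTERNS[expected].match(text.strip()))
```

A classic OCR confusion like "l" for "1" turns "1,234.50" into "l,234.50", which no currency pattern matches, so the error surfaces instead of silently entering the database.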
Scale and performance. Processing large document collections requires batch processing strategies. Table detection adds computational cost because every page must be analyzed for table presence, even pages that contain no tables. Two-stage systems — a fast classifier that flags pages likely to contain tables, followed by detailed extraction on flagged pages — improve throughput.
Output format. The right output format depends on the downstream use case. CSV works for flat tables. JSON preserves hierarchical header relationships. HTML with rowspan and colspan attributes captures spanning cells. Database insertion requires schema mapping. The extraction pipeline should produce the format the consumer needs, not a generic intermediate representation.
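The flat serializations are straightforward once the grid is assembled; a sketch using only the standard library (helper names are illustrative):

```python
import csv
import io
import json

def table_to_csv(table_data):
    """Serialize a list-of-lists table to CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(table_data)
    return buf.getvalue()

def table_to_json(table_data):
    """Serialize with the first row treated as the header."""
    header, *rows = table_data
    return json.dumps([dict(zip(header, row)) for row in rows])
```

Note that the JSON form silently assumes a single flat header row; tables with hierarchical headers need the compound-name flattening discussed earlier before this mapping is meaningful.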
Conclusion
Table extraction from scanned documents has evolved from rule-based line detection to transformer-based systems that understand table structure at a semantic level. The key advances:
- Deep learning replaced hand-crafted rules for both detection and structure recognition, enabling generalization across document types
- Transformer architectures (DETR, Table Transformer) model global spatial relationships, handling spanning cells and irregular layouts that confound segmentation approaches
- Large-scale datasets (TableBank, PubTabNet, PubTables-1M) made robust model training feasible through weak supervision and automated annotation
- Evaluation metrics like TEDS capture structural correctness, not just text accuracy, measuring what matters for downstream data use
- The full pipeline — detection, structure recognition, OCR, assembly — must be evaluated end-to-end because errors compound across stages
For practitioners building document processing pipelines, table extraction is often the highest-value capability after basic text OCR. Structured data from tables feeds directly into databases, spreadsheets, and analytical workflows — turning scanned documents from passive archives into active data sources.
References
[1] Schreiber, S., Agne, S., Wolf, I., Dengel, A. & Ahmed, S. (2017). DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. Proceedings of ICDAR 2017, pp. 1162–1167.
[2] Paliwal, S.S., D, V., Rahul, R., Sharma, M. & Vig, L. (2019). TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images. Proceedings of ICDAR 2019, pp. 128–133.
[3] Smock, B., Pesala, R. & Abraham, R. (2022). PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents. IEEE/CVF CVPR 2022.
[4] Li, M., Cui, L., Huang, S., Wei, F., Zhou, M. & Li, Z. (2020). TableBank: Table Benchmark for Image-based Table Detection and Recognition. Proceedings of LREC 2020, pp. 1918–1925.
[5] Zhong, X., ShafieiBavani, E. & Jimeno Yepes, A. (2020). Image-based Table Recognition: Data, Model, and Evaluation. Proceedings of ECCV 2020, pp. 564–580.
[6] Gao, L., Huang, Y., Dejean, H., Meunier, J.-L., Yan, Q., Fang, Y., Kleber, F. & Lang, E. (2019). ICDAR 2019 Competition on Table Detection and Recognition (cTDaR). Proceedings of ICDAR 2019, pp. 1510–1515.