PDF → Excel — Finding the table on the page
A PDF page is a flat collection of disconnected drawing operations. There is no “table” object, only text at coordinates and, sometimes, ruled lines. The converter has to decide which rectangular regions deserve to be treated as tables. A missed region produces no table at all; a false positive floods Excel with garbage rows.
The easy case: ruled tables
A fully bordered table is a rectangle of intersecting horizontal and vertical strokes with text in the cells. Most converters handle it in roughly the same way:
- Collect every line on the page.
- Filter to lines that are thin (roughly 0.25–1 pt, the typical border thickness) and orthogonal.
- Compute every intersection.
- Group connected intersections into candidate grids.
- Validate: at least 3 lines in each direction, regular spacing, minimum size.
A grid that survives validation becomes a table.
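The steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not any particular converter's implementation: the `Segment` type, the tolerances, and the reading of "at least 3 lines" as three distinct rows and three distinct columns of intersections are all assumptions, and the spacing and size checks are left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Segment:  # hypothetical input type: one stroked line on the page
    x0: float
    y0: float
    x1: float
    y1: float
    width: float  # stroke width in points


def is_horizontal(s, tol=0.5):
    return abs(s.y0 - s.y1) <= tol < abs(s.x0 - s.x1)


def is_vertical(s, tol=0.5):
    return abs(s.x0 - s.x1) <= tol < abs(s.y0 - s.y1)


def is_table_line(s, min_w=0.25, max_w=1.0):
    # Thin and orthogonal, as in the filtering step.
    return min_w <= s.width <= max_w and (is_horizontal(s) or is_vertical(s))


def crossing(h, v, tol=1.0):
    """Point where horizontal h meets vertical v, or None if they miss."""
    hx0, hx1 = sorted((h.x0, h.x1))
    vy0, vy1 = sorted((v.y0, v.y1))
    if hx0 - tol <= v.x0 <= hx1 + tol and vy0 - tol <= h.y0 <= vy1 + tol:
        return (v.x0, h.y0)
    return None


def candidate_grids(segments, min_lines=3):
    horiz = [s for s in segments if is_table_line(s) and is_horizontal(s)]
    vert = [s for s in segments if is_table_line(s) and is_vertical(s)]
    parent = list(range(len(horiz) + len(vert)))  # union-find over lines

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    hits = []  # (index of horizontal line, intersection point)
    for (hi, h), (vi, v) in product(enumerate(horiz), enumerate(vert)):
        point = crossing(h, v)
        if point is not None:
            parent[find(hi)] = find(len(horiz) + vi)  # crossing lines share a grid
            hits.append((hi, point))

    groups = {}
    for hi, point in hits:
        groups.setdefault(find(hi), set()).add(point)

    # Assumed validation: at least min_lines distinct columns and rows.
    return [
        sorted(points)
        for points in groups.values()
        if len({x for x, _ in points}) >= min_lines
        and len({y for _, y in points}) >= min_lines
    ]
```

Each returned grid is a sorted list of intersection points; a real converter would go on to check spacing and size before turning the points into cells.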
The hard case: tabular text without lines
Many modern documents look like this:
    Name       Age  City
    Alice      25   London
    Bob        30   New York
    Charlotte  28   Berlin
There are no separators. The structure exists only in your visual
cortex, which sees the alignment and infers columns. To a naive
line-detection algorithm there is nothing to detect, and the region
falls through to the text-extraction layer as ordinary paragraphs. Excel
ends up with a single cell per row holding "Alice 25 London",
useless for analysis.
Recognizing this layout requires column detection from whitespace gaps.
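One simple way to find columns from whitespace is to project every word's horizontal extent onto the x-axis and look for vertical strips that no word touches. The sketch below assumes words have already been grouped into rows and carry `(x0, x1, text)` coordinates in points; the function names, the coordinates, and the 6 pt minimum gap are illustrative assumptions, not values taken from any tool.

```python
def column_gaps(rows, min_gap=6.0):
    """rows: list of rows, each a list of (x0, x1, text).
    Returns the x position at the middle of every empty vertical strip."""
    intervals = sorted((x0, x1) for row in rows for x0, x1, _ in row)
    if not intervals:
        return []
    gaps = []
    covered_to = intervals[0][1]  # rightmost x covered by any word so far
    for x0, x1 in intervals[1:]:
        if x0 - covered_to >= min_gap:
            gaps.append((covered_to + x0) / 2)
        covered_to = max(covered_to, x1)
    return gaps


def split_row(row, gaps):
    """Assign each word to the column its left edge falls in."""
    cells = [[] for _ in range(len(gaps) + 1)]
    for x0, _, text in sorted(row):
        column = sum(1 for g in gaps if x0 > g)
        cells[column].append(text)
    return [" ".join(words) for words in cells]


# Made-up coordinates for the example table above.
rows = [
    [(50, 80, "Name"), (130, 150, "Age"), (180, 205, "City")],
    [(50, 82, "Alice"), (130, 142, "25"), (180, 220, "London")],
    [(50, 70, "Bob"), (130, 142, "30"), (180, 200, "New"), (203, 225, "York")],
    [(50, 105, "Charlotte"), (130, 142, "28"), (180, 215, "Berlin")],
]
gaps = column_gaps(rows)
table = [split_row(row, gaps) for row in rows]
```

Because the gaps are computed across all rows at once, the narrow space between "New" and "York" does not become a column: other rows cover that strip with text.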
The major open-source tools split along these lines:
- Camelot: Python, four modes since v1.0: lattice (lines), stream (alignment), network (character bounding boxes), and hybrid (network + lattice).
- Tabula: Java, strongest in stream mode.
- pdfplumber: pdfminer-based, with fine-grained table_settings (snap, join, edge tolerance).
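As an illustration of the kind of tuning pdfplumber exposes, here is a settings dictionary that mixes text-based column detection with line-based row detection. The keys are ones listed in pdfplumber's table-extraction documentation as far as I know; the values are illustrative starting points rather than recommendations, and the file name in the commented usage is hypothetical. Check the names against the version you have installed.

```python
# Illustrative values only; tune per document.
table_settings = {
    "vertical_strategy": "text",     # infer column edges from word alignment
    "horizontal_strategy": "lines",  # use ruled lines for row edges
    "snap_tolerance": 3,             # snap nearly aligned parallel lines to one position
    "join_tolerance": 3,             # join collinear segments whose ends nearly touch
    "edge_min_length": 10,           # discard edges shorter than this
    "intersection_tolerance": 3,     # slack when deciding whether two edges cross
}

# Hypothetical usage, assuming pdfplumber is installed and report.pdf exists:
# import pdfplumber
# with pdfplumber.open("report.pdf") as pdf:
#     rows = pdf.pages[0].extract_table(table_settings)
```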
In commercial products since 2023–2024, ML models (TATR, LayoutLMv3, Donut, and proprietary variants) sit on top of or in place of these heuristics.
The middle ground: partial rules
Most real tables fall between the extremes. Common patterns: top and bottom horizontals with no verticals, a single line under the header, every third row outlined as a zebra stripe.
Pure line detection produces a weak result here — too few intersections, an incomplete grid. The converter has to use lines where they exist, fill the gaps with alignment-based detection, and stitch the two into one grid. Many tools don’t try: if there are lines they treat the region as a table, otherwise as text.
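The stitching step can be sketched as a merge of two boundary lists: positions taken from ruled lines and positions inferred from text alignment. The rule used here, that a ruled line wins whenever an inferred boundary lies within a small tolerance of it, is an assumption made for illustration, as are the function name and the 3 pt tolerance.

```python
def stitch_boundaries(ruled, inferred, tol=3.0):
    """Merge boundary positions from ruled lines and from text alignment.
    An inferred boundary is kept only if nothing already accepted lies within tol."""
    merged = sorted(ruled)
    for pos in sorted(inferred):
        if all(abs(pos - kept) > tol for kept in merged):
            merged.append(pos)
    return sorted(merged)


# A table with a top rule, a rule under the header, and a bottom rule;
# the remaining row boundaries come from text alignment (made-up values).
ruled_rows = [100, 120, 200]
inferred_rows = [101.5, 120.5, 140, 160, 180, 199]
row_edges = stitch_boundaries(ruled_rows, inferred_rows)
```

The same function would be applied to column boundaries, which in this pattern come almost entirely from alignment.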
Which lines count as table lines
Not every stroke is structural. Decorative rules under section headings, figure frames, chart gridlines, and underline marks in captions are noise. Useful filters:
- Color — table borders are almost always black; colored lines tend to be decoration.
- Thickness — 0.25–1 pt suggests a border; anything over 2 pt is decorative.
- Length — very short strokes are usually rounded-corner artifacts.
- Context — lines with no nearby text are almost never table lines.
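The four filters combine naturally into a single predicate. In the sketch below the `Stroke` fields and every threshold other than the 0.25–1 pt width range are assumptions chosen for illustration; a real converter would tune them per document class.

```python
from dataclasses import dataclass


@dataclass
class Stroke:  # hypothetical record for one drawn line
    length: float         # points
    width: float          # points
    rgb: tuple            # (r, g, b), each channel 0.0 to 1.0
    text_distance: float  # distance to the nearest text, in points


def looks_structural(s, min_width=0.25, max_width=1.0,
                     min_length=10.0, max_text_distance=20.0,
                     max_channel=0.2):
    dark = max(s.rgb) <= max_channel             # color: black or near black
    thin = min_width <= s.width <= max_width     # thickness: border range
    long_enough = s.length >= min_length         # length: drop corner artifacts
    near_text = s.text_distance <= max_text_distance  # context: text nearby
    return dark and thin and long_enough and near_text


border = Stroke(length=300, width=0.5, rgb=(0, 0, 0), text_distance=4)
heading_rule = Stroke(length=450, width=2.5, rgb=(0.1, 0.3, 0.8), text_distance=6)
corner_bit = Stroke(length=2, width=0.5, rgb=(0, 0, 0), text_distance=3)
```

Treating the filters as a hard conjunction is the simplest choice; a scoring scheme that lets a strong signal outweigh a weak one is an obvious alternative.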
Which grids count as tables
A candidate grid has to pass three more checks.
Minimum size. A 1×1 grid is just a rectangle. The floor is 2×2 cells.
Fill ratio. A 5×5 grid with text in only 2 cells is likely decorative art. Commercial engines typically threshold at 25–40%.
Regularity. Roughly uniform cells behave like a table. Wildly varying sizes signal a layout block masquerading as one.
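The three checks can be written as one function over a grid's column edges, row edges, and the set of cells that contain text. The 25% fill floor comes from the range quoted above; measuring regularity with the coefficient of variation of cell sizes, and the cutoff of 1.0, are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def is_table(col_edges, row_edges, filled_cells, min_fill=0.25, max_cv=1.0):
    """col_edges, row_edges: sorted, distinct boundary positions.
    filled_cells: set of (row, col) pairs that contain text."""
    n_cols, n_rows = len(col_edges) - 1, len(row_edges) - 1

    # Minimum size: anything below 2x2 is just a rectangle.
    if n_cols < 2 or n_rows < 2:
        return False

    # Fill ratio: share of cells that contain text.
    if len(filled_cells) / (n_cols * n_rows) < min_fill:
        return False

    # Regularity: spread of cell sizes relative to their mean.
    def variation(edges):
        sizes = [b - a for a, b in zip(edges, edges[1:])]
        return pstdev(sizes) / mean(sizes)

    return variation(col_edges) <= max_cv and variation(row_edges) <= max_cv


edges = [0, 10, 20, 30, 40, 50]          # a 5x5 grid
sparse = {(0, 0), (4, 4)}                # text in 2 of 25 cells
dense = {(r, c) for r in range(2) for c in range(5)}  # text in 10 of 25 cells
```

With these inputs the sparse grid is rejected on fill ratio and the dense one passes all three checks.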
Full-page tables vs embedded tables
Full-page tables (financial reports, schedules, lists) fill the page out to the margins. The recognized grid spans nearly the full page; in Excel it fills a sheet.
Embedded tables sit alongside other content — one or two tables in a sea of prose. The detector has to draw a tight box and resist absorbing adjacent paragraphs. In Excel they take up part of a sheet or get one of their own.
Multiple tables on one page
Reports often place “Revenue by category” next to “Expenses by category” on the same page. The algorithm splits them by checking that their bounding boxes don’t overlap and that each has its own intersection cluster, then decides how to lay them out. Stacking them on one sheet with a blank row between is the safe default; no reliable automated way exists to tell whether they’re semantically related.
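A minimal sketch of the overlap check and the stacking default follows. It assumes each detected table arrives as a `(bbox, rows)` pair with boxes given as `(x0, y0, x1, y1)` and y growing downward from the top of the page; the function names and the sample figures are made up for illustration.

```python
def overlaps(a, b):
    """Boxes are (x0, y0, x1, y1); boxes that only touch do not overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def stack_tables(tables):
    """tables: list of (bbox, rows). Returns one list of sheet rows
    with a blank row between consecutive tables."""
    for i, (box_a, _) in enumerate(tables):
        for box_b, _ in tables[i + 1:]:
            if overlaps(box_a, box_b):
                raise ValueError("overlapping boxes: not two separate tables")

    # Reading order: top to bottom, then left to right.
    ordered = sorted(tables, key=lambda t: (t[0][1], t[0][0]))
    sheet = []
    for n, (_, rows) in enumerate(ordered):
        if n:
            sheet.append([])  # blank separator row
        sheet.extend(rows)
    return sheet


expenses = ((300, 100, 500, 200), [["Category", "Expenses"], ["Rent", "40"]])
revenue = ((50, 100, 250, 200), [["Category", "Revenue"], ["Books", "120"]])
sheet = stack_tables([expenses, revenue])
```

Side-by-side tables end up one above the other on the sheet, which loses the page layout but keeps each table's columns intact.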
Distributing tables across sheets
Three policies are standard.
One table, one sheet. Predictable, but 30 tables produce 30 sheets.
One PDF page, one sheet. Cluttered when pages are dense.
Merge homogeneous tables. If pages 1–35 all carry “date | amount | category”, fold them into one. Usually what the user wants; almost no converter does it automatically because recognizing homogeneity is a hard problem.
Most tools default to one of the first two.
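For the third policy, the crudest possible homogeneity test is an exact match on normalized header text between consecutive tables. The sketch below does only that; it is a naive stand-in for the hard problem described above, it assumes every table repeats its header row, and the function name is hypothetical.

```python
def merge_homogeneous(tables):
    """tables: tables in page order, each a list of rows with the header first.
    Consecutive tables with the same normalized header are folded into one."""
    merged = []
    for table in tables:
        if not table:
            continue
        header = tuple(cell.strip().lower() for cell in table[0])
        if merged and merged[-1]["header"] == header:
            merged[-1]["rows"].extend(table[1:])  # drop the repeated header
        else:
            merged.append({"header": header, "rows": [list(r) for r in table]})
    return [m["rows"] for m in merged]


page1 = [["Date", "Amount", "Category"], ["2024-01-02", "10.00", "Food"]]
page2 = [["date ", "amount", "category"], ["2024-01-03", "25.50", "Fuel"]]
summary = [["Category", "Total"], ["Food", "10.00"]]
sheets = merge_homogeneous([page1, page2, summary])
```

Here the first two tables merge into one three-row sheet and the summary table stays separate. Headers that differ by wording, column order, or a missing repeat would defeat this check, which is part of why converters rarely attempt the merge.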
Quality you can expect
- Full-page tables with explicit lines — recovered cleanly in most cases.
- Embedded tables with lines — noticeably worse.
- Borderless tables — low without a tool targeting them specifically.
- Hybrid (partial lines) — varies wildly by tool.
- Scanned tables — bottlenecked by OCR.
If the tables are mostly borderless, budget for cleanup or pick a tool that promises lineless detection.