
PDF → Excel — Finding the table on the page

A PDF page is a flat collection of disconnected drawing operations. There is no “table” object, only text at coordinates and, sometimes, ruled lines. The converter has to decide which rectangular regions deserve to be treated as tables. A missed region never reaches Excel as a table; a false positive floods it with garbage rows.

The easy case: ruled tables

A fully bordered table is a rectangle of intersecting horizontal and vertical strokes with text in the cells. Most converters handle it with the same basic pipeline:

Collect every line on the page.
Filter to lines that are thin (roughly 0.25–1 pt, the typical border thickness) and axis-aligned (horizontal or vertical).
Compute every intersection.
Group connected intersections into candidate grids.
Validate: at least 3 lines in each direction (the minimum for a 2×2 grid), regular spacing, minimum size.

A grid that survives validation becomes a table.
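The steps above can be sketched roughly as follows. This is a minimal illustration, not any particular converter's implementation: the `Line` record, the `find_grid` name, and the tolerances are assumptions. It also treats every intersection on the page as part of one candidate grid, so the grouping step and the regular-spacing check are left out for brevity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Line:
    """Hypothetical stroke record: endpoints plus stroke thickness, all in points."""
    x0: float
    y0: float
    x1: float
    y1: float
    width: float


def is_horizontal(line, tol=0.5):
    return abs(line.y0 - line.y1) <= tol


def is_vertical(line, tol=0.5):
    return abs(line.x0 - line.x1) <= tol


def find_grid(lines, min_width=0.25, max_width=1.0, tol=1.0):
    """Return (xs, ys) grid-line positions, or None if validation fails.

    Simplification: all intersections are assumed to belong to one grid.
    """
    # Keep only thin strokes (the 0.25-1 pt range from the text).
    thin = [l for l in lines if min_width <= l.width <= max_width]
    horiz = [l for l in thin if is_horizontal(l)]
    vert = [l for l in thin if is_vertical(l) and not is_horizontal(l)]

    xs, ys = set(), set()
    for h, v in product(horiz, vert):
        x, y = v.x0, h.y0
        # The crossing point must lie on both segments (within tol).
        on_h = min(h.x0, h.x1) - tol <= x <= max(h.x0, h.x1) + tol
        on_v = min(v.y0, v.y1) - tol <= y <= max(v.y0, v.y1) + tol
        if on_h and on_v:
            xs.add(round(x, 1))
            ys.add(round(y, 1))

    # Validation: 3 lines in each direction are needed for a 2x2 grid.
    if len(xs) < 3 or len(ys) < 3:
        return None
    return sorted(xs), sorted(ys)
```

A real implementation would cluster intersections by connectivity before validating, so that two separate tables on one page do not collapse into a single grid.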

The hard case: tabular text without lines

Many modern documents look like this:

Name	Age	City
Alice	25	London
Bob	30	New York
Charlotte	28	Berlin

There are no separators. The structure exists only in your visual cortex, which sees the alignment and infers columns. To a naive line-detection algorithm there is nothing to detect, and the region falls through to the text-extraction layer as ordinary paragraphs. Excel ends up with a single cell per row holding "Alice 25 London", useless for analysis.

Recognizing this layout requires column detection from whitespace gaps.
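One simple way to do that is to project every word's horizontal extent onto the x-axis, merge the overlapping extents, and treat any wide empty band between them as a column separator. The sketch below assumes words arrive as `(x0, x1)` pairs in points; the function names and the `min_gap` threshold are illustrative choices, not taken from any specific tool.

```python
from bisect import bisect_right


def column_boundaries(extents, min_gap=5.0):
    """extents: (x0, x1) horizontal extents of every word in the region.

    Returns the x position in the middle of each whitespace band
    at least min_gap wide (min_gap is an assumed threshold).
    """
    spans = sorted(extents)
    if not spans:
        return []
    # Merge overlapping extents into occupied bands.
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    # A wide gap between two occupied bands is a column separator.
    return [
        (left[1] + right[0]) / 2
        for left, right in zip(merged, merged[1:])
        if right[0] - left[1] >= min_gap
    ]


def split_row(row_words, boundaries):
    """row_words: (text, x0, x1) tuples for one text line, left to right.

    Assigns each word to a column by its horizontal center.
    """
    cells = [[] for _ in range(len(boundaries) + 1)]
    for text, x0, x1 in row_words:
        cells[bisect_right(boundaries, (x0 + x1) / 2)].append(text)
    return [" ".join(c) for c in cells]
```

With made-up coordinates for the sample table, the narrow space inside "New York" stays below `min_gap`, so the two words land in the same cell while the three columns are separated.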

The major open-source tools split along these lines:

Camelot — Python, four modes since v1.0: lattice (lines), stream (alignment), network (character bounding boxes), and hybrid (network + lattice).
Tabula — Java, strongest in stream mode.
pdfplumber — pdfminer-based, with fine-grained table_settings (snap, join, edge tolerance).

Since 2023–2024, commercial products have increasingly put ML models (TATR, LayoutLMv3, Donut, and proprietary variants) on top of these heuristics or in place of them.

The middle ground: partial rules

Most real tables fall between the extremes. Common patterns: top and bottom horizontals with no verticals, a single line under the header, or rules around only some of the rows, zebra-stripe style.

Pure line detection produces a weak result here — too few intersections, an incomplete grid. The converter has to use lines where they exist, fill the gaps with alignment-based detection, and stitch the two into one grid. Many tools don’t try: if there are lines they treat the region as a table, otherwise as text.
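The stitching step can be as simple as taking the union of the two boundary sets and letting a drawn line win whenever an inferred boundary lands almost on top of it. The function below is a hypothetical sketch of that idea; the name `merge_boundaries` and the 2 pt tolerance are assumptions.

```python
def merge_boundaries(ruled, inferred, tol=2.0):
    """Combine boundary positions from drawn lines and from alignment.

    ruled:    positions of actual strokes (trusted as-is)
    inferred: positions guessed from whitespace or text alignment
    An inferred position within tol of a ruled one is dropped as a duplicate.
    """
    merged = list(ruled)
    for pos in inferred:
        if all(abs(pos - r) > tol for r in ruled):
            merged.append(pos)
    return sorted(merged)
```

For a table with only a top and bottom rule at y = 700 and y = 620, and row breaks inferred at 680, 660, 640, and 620.8, the result is `[620, 640, 660, 680, 700]`: the inferred 620.8 is absorbed by the drawn line at 620.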

Which lines count as table lines

Not every stroke is structural. Decorative rules under section headings, figure frames, chart gridlines, and underline marks in captions are noise. Useful filters:

Color — table borders are almost always black; colored lines tend to be decoration.
Thickness — 0.25–1 pt suggests a border; anything over 2 pt is decorative.
Length — very short strokes are usually rounded-corner artifacts.
Context — lines with no nearby text are almost never table lines.
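Put together, the four filters might look like the predicate below. Only the 2 pt thickness cutoff comes from the list above; the `Stroke` record, the darkness test, the minimum length, and the text-distance limit are assumed values that a real converter would tune.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """Hypothetical stroke summary; lengths and distances in points."""
    length: float
    width: float
    rgb: tuple                # (r, g, b), each in 0..1
    distance_to_text: float   # distance to the nearest text box


def looks_like_table_line(s, max_width=2.0, min_length=10.0,
                          max_text_distance=20.0, dark=0.2):
    is_dark = max(s.rgb) <= dark                       # color: black or near-black
    is_thin = s.width <= max_width                     # thickness: over 2 pt is decorative
    is_long = s.length >= min_length                   # length: drop corner artifacts
    near_text = s.distance_to_text <= max_text_distance  # context: text nearby
    return is_dark and is_thin and is_long and near_text
```

Hard cutoffs like these are the simplest option. A scoring scheme that weighs the four signals would be more forgiving of, say, a dark blue border, at the cost of more tuning.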

Which grids count as tables

A candidate grid has to pass three more checks.

Minimum size. A 1×1 grid is just a rectangle, so the floor is 2×2.

Fill ratio. A 5×5 grid with text in only 2 cells is likely decorative art. Commercial engines typically threshold at 25–40%.

Regularity. Roughly uniform cells behave like a table. Wildly varying sizes signal a layout block masquerading as one.
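As a rough sketch, the three checks could be combined like this. The 25% fill threshold sits at the low end of the range quoted above; measuring regularity with the coefficient of variation, and the `max_cv` limit of 1.0, are assumptions made for illustration rather than a known industry rule.

```python
from statistics import mean, pstdev


def is_table(cells, col_widths, row_heights, min_fill=0.25, max_cv=1.0):
    """cells: list of rows, each a list of cell strings ("" for empty).

    col_widths / row_heights: sizes of the grid's columns and rows in points.
    """
    n_rows = len(cells)
    n_cols = max((len(row) for row in cells), default=0)

    # Minimum size: anything below 2x2 is just a box.
    if n_rows < 2 or n_cols < 2:
        return False

    # Fill ratio: share of cells that contain any text.
    filled = sum(1 for row in cells for c in row if c.strip())
    if filled / (n_rows * n_cols) < min_fill:
        return False

    # Regularity (assumed metric): standard deviation relative to the mean.
    def cv(sizes):
        return pstdev(sizes) / mean(sizes)

    return cv(col_widths) <= max_cv and cv(row_heights) <= max_cv
```

A 5×5 grid with text in 2 cells has a fill ratio of 8% and is rejected, matching the example above.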

Full-page tables vs embedded tables

Full-page tables (financial reports, schedules, lists) fill the page out to the margins. The recognized grid spans nearly the full page; in Excel it fills a sheet.

Embedded tables sit alongside other content — one or two tables in a sea of prose. The detector has to draw a tight box and resist absorbing adjacent paragraphs. In Excel they take up part of a sheet or get one of their own.

Multiple tables on one page

Reports often place “Revenue by category” next to “Expenses by category” on the same page. The algorithm splits them by checking that their bounding boxes don’t overlap and that each has its own intersection cluster, then decides how to lay them out. Stacking them on one sheet with a blank row between is the safe default, because there is no reliable automated way to tell whether they’re semantically related.
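Both halves of that are small. The sketch below assumes boxes are `(x0, y0, x1, y1)` tuples and tables are lists of rows; the function names are made up for illustration.

```python
def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def stack_on_sheet(tables):
    """Concatenate tables top to bottom with one blank row between them.

    tables: list of tables, each a list of rows (lists of cell values).
    """
    rows = []
    for i, table in enumerate(tables):
        if i > 0:
            rows.append([])  # blank separator row
        rows.extend(table)
    return rows
```

Two side-by-side tables at `(0, 0, 100, 50)` and `(120, 0, 220, 50)` do not overlap, so they are kept as separate tables and then stacked.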

Distributing tables across sheets

Three policies are standard.

One table, one sheet. Predictable, but 30 tables produce 30 sheets.

One PDF page, one sheet. Cluttered when pages are dense.

Merge homogeneous tables. If pages 1–35 all carry “date | amount | category”, fold them into one table. This is usually what the user wants, but almost no converter does it automatically because recognizing homogeneity is a hard problem.

Most tools default to one of the first two.
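The third policy is tractable in the easy case, where every page repeats an identical header row. The sketch below handles only that case and is an assumption about how one might approach it: it merges consecutive tables whose headers match after trimming and lowercasing, and does nothing for headerless continuation pages or renamed columns, which is where the hard part lives.

```python
def merge_homogeneous(tables):
    """Merge consecutive tables that share the same header row.

    tables: tables in document order, each a list of rows with row 0
    as the header. The repeated header is dropped from later tables.
    """
    def key(header):
        return tuple(cell.strip().lower() for cell in header)

    merged = []
    for table in tables:
        if not table:
            continue
        if merged and key(merged[-1][0]) == key(table[0]):
            merged[-1].extend(table[1:])      # same header: append the data rows
        else:
            merged.append([list(row) for row in table])  # new table: copy it
    return merged
```

The output can then go through either of the first two policies, with far fewer tables to distribute.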

Quality you can expect

Full-page tables with explicit lines — recovered cleanly in most cases.
Embedded tables with lines — noticeably worse.
Borderless tables — low without a tool targeting them specifically.
Hybrid (partial lines) — varies wildly by tool.
Scanned tables — bottlenecked by OCR.

If the tables are mostly borderless, budget for cleanup or pick a tool that promises lineless detection.