PDF → Excel — Why this conversion is harder than it looks
PDF→Excel conversion tries to recover typed data (numbers, dates, and the relationships between them) from a document that knows nothing about types or relationships. That gap is why the resulting spreadsheets almost always need a second pass by hand.
What an .xlsx actually is
An Excel file is a ZIP container of XML parts:
- /xl/workbook.xml — sheet list, global settings.
- /xl/worksheets/sheet1.xml, … — the cells of each sheet.
- /xl/sharedStrings.xml — a deduplicated string table. The alternative is inlineStr, where the value lives inside the cell. PDF→Excel converters usually pick inlineStr because they generate the file in a single streaming pass and can’t easily build a dedupe table.
- /xl/styles.xml — number formats, fonts, colors, alignment.
- A theme file for color palettes.
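Because an .xlsx is just a ZIP, the container layout can be seen with nothing but the standard library. The sketch below builds an in-memory ZIP holding stub versions of the parts listed above; any real workbook can be inspected the same way with `zipfile.ZipFile(path).namelist()`.

```python
import io
import zipfile

# Build an in-memory ZIP with the part names an .xlsx contains.
# Contents are stubbed out; this illustrates the container layout only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for part in [
        "xl/workbook.xml",
        "xl/worksheets/sheet1.xml",
        "xl/sharedStrings.xml",
        "xl/styles.xml",
        "xl/theme/theme1.xml",
    ]:
        z.writestr(part, "<stub/>")

# List the parts back, exactly as you would for a real workbook.
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        print(name)
```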
A sheet is a 2D grid addressed by A1, B2, and so on. Each cell carries three things:
- A type, set by the OOXML t attribute: n (number), s (index into shared strings), inlineStr (string written directly), b (boolean), e (error), str (string result of a formula). Dates are not a separate type — a date is a number with a date format applied. A formula is a separate <f> element inside the cell.
- A value.
- A style — display format, color, font, alignment.
The grid is rigid. Numbers are summed, strings are sorted, dates are filtered. Get a type wrong and the cell silently stops behaving like data.
What PDF doesn’t have
A PDF has no concept of a cell, a row, a column, a sheet, a header, a formula, or the difference between 100 and "100". To the file format, every glyph is a character at some coordinates, and every line is a stroked path. Tabular meaning is something the human eye assembles from spacing and rules.
Reconstructing a typed grid from that means doing four things in order:
- Find which regions of which pages are tables.
- Recover the row/column structure of each region.
- Decide what type each cell holds.
- Pack everything into one or more sheets.
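Step 2 (grid assembly) can be sketched as two 1-D clustering passes over word coordinates: nearby y values become rows, nearby x values become columns. The input format and the tolerances below are assumptions for illustration; real engines also use ruling lines, font metrics, and span detection.

```python
def cluster(values, tol):
    """Group sorted 1-D coordinates into bands no more than tol apart."""
    bands = []
    for v in sorted(values):
        if bands and v - bands[-1][-1] <= tol:
            bands[-1].append(v)
        else:
            bands.append([v])
    return [sum(b) / len(b) for b in bands]  # band centers

def assemble_grid(words, row_tol=3, col_tol=15):
    # words: hypothetical (x, y, text) tuples, y growing downward.
    rows = cluster([y for _, y, _ in words], row_tol)
    cols = cluster([x for x, _, _ in words], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in words:
        r = min(range(len(rows)), key=lambda i: abs(rows[i] - y))
        c = min(range(len(cols)), key=lambda j: abs(cols[j] - x))
        grid[r][c] = text
    return grid

words = [(10, 10, "Date"), (80, 10, "Amount"),
         (10, 30, "2026-01-05"), (80, 30, "100.50")]
print(assemble_grid(words))
```

The fragility is visible in the parameters: a slightly skewed scan or a tight column pair pushes coordinates across the tolerance and merges or splits cells.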
Steps 2 and 3 (grid assembly and typing) are where things go quietly wrong: the converter produces an .xlsx that opens fine, but a SUM down a column returns zero or the wrong total because cells that look like numbers were written as text.
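Step 3 (typing) is where the silent SUM-returns-zero failures originate. A toy version of the decision, assuming the converter sees only the recovered string; the patterns are illustrative, real converters carry locale-aware tables for thousands separators, currency symbols, and date formats:

```python
import re

NUMBER = re.compile(r"^-?\$?[\d,]+(\.\d+)?%?$")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_type(cell: str) -> str:
    s = cell.strip()
    if ISO_DATE.match(s):
        return "date"    # written as t="n" plus a date number format
    if NUMBER.match(s):
        return "number"  # t="n"
    return "text"        # inlineStr or shared string

for v in ["1,234.50", "2026-01-05", "INV-0042"]:
    print(v, "->", infer_type(v))
```

Anything these patterns miss, say a European "1.234,50", falls through to "text", which is exactly the cell that looks right on screen but contributes nothing to a SUM.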
Why anyone bothers
Four scenarios drive most demand. Financial reports and bank statements need to be summed, audited, or reconciled. Archives of historical reports have to be retypeable in bulk. Scanned invoices and waybills enter accounting workflows through OCR. Government statistics are still routinely published as PDFs with no underlying CSV. In each case the user wants arithmetic on numbers PDFs don’t type.
Why the output is rarely final
For most document conversions the goal is visual fidelity. For Excel it is the opposite: change the representation entirely so visual coordinates become grid addresses, glyphs become typed values, and layout becomes structure.
Many of the resulting questions have no algorithmic answer. Which column is a date, which is an amount, which is a category code? Which row is the header, which is a subtotal? When a cell holds two lines, is it one address or two records? When a page has three tables, do they belong on three sheets or one? Without a human hint, every converter guesses, and guesses compound — so the .xlsx almost always needs a manual pass before it's usable.
The source dictates the ceiling
The biggest predictor of conversion quality is what kind of PDF you started with.
Fully ruled tables (accounting reports, bank statements, most government forms) give line detection something to bite on. Every cell has a border, intersections form a clean grid, and the converter rebuilds the structure almost completely.
Borderless tabular text (modern designed reports, marketing decks) lays out data in columns with no visible separators. The converter guesses boundaries from alignment alone; quality drops sharply. Manual cleanup is mandatory.
Scans add an OCR layer. Final quality is gated by scan resolution, skew correction, and the OCR engine.
Tagged PDFs marked up per PDF/UA with <Table>, <TR>, <TH>, <TD> make conversion trivial because the structure is explicit. In the wild this is rare, mostly confined to government forms and regulated corporate templates.
The 2025–2026 ML overlay
Commercial services have largely moved to neural models. The open-source reference is Microsoft’s Table Transformer (TATR), trained on PubTables-1M (~948k tables). Vision-language models (GPT-4o/5, Claude, Gemini, Qwen3-VL) extract tables directly from page images into Markdown, CSV, or JSON.
Heuristics remain the baseline ML sits on top of. VLMs are weak exactly where heuristics are strong: long multi-page tables blow past context windows, individual digits get hallucinated, and runs aren’t reproducible. A 2026 production pipeline typically uses TATR or a commercial engine for the grid and a VLM for disputed cells, with column totals checked against the source.
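The totals check mentioned above is the cheapest defense against hallucinated digits: recompute a column sum from the extracted cells and compare it with the total the source document itself prints. A minimal sketch; the function name and tolerance are assumptions:

```python
def column_checks_out(cells, printed_total, tol=0.005):
    """Compare a recomputed column sum with the total printed in the PDF."""
    return abs(sum(cells) - printed_total) <= tol

# Cells a VLM or TATR pipeline produced for one column:
extracted = [100.50, 23.10, 7.40]

ok = column_checks_out(extracted, 131.00)   # matches the printed total
bad = column_checks_out(extracted, 131.10)  # one wrong digit is caught
print(ok, bad)
```

A failed check doesn’t locate the bad cell, but it flags the column for the VLM (or a human) to re-read, which is exactly the division of labor the pipeline above relies on.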