PDF → Excel — Hard cases: rotation, multi-line, nested
A small fraction of tables breaks the standard algorithms entirely. Three categories dominate: rotated text, multi-line cells, nested tables. Each fails for a different reason; a fix doesn’t always exist.
Rotated text
Tables with a wide categorical axis frequently rotate the column headers 90° to fit — month names, country codes, year labels written sideways above narrow numeric columns.
PDF stores text under a transformation matrix that includes rotation. Recovering the headers requires three things: detect the rotation (π/2 here), translate the text coordinates from the local rotated system into the page’s global system, and store the text with a rotation flag.
A converter that ignores rotation reads the rotated runs as ordinary horizontal text at meaningless coordinates. It can’t bind them to columns; the headers go missing or land in the wrong cells.
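The three steps above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: it assumes the extractor already hands you the combined text-to-page matrix as a six-number tuple `(a, b, c, d, e, f)` in the usual PDF layout, and the names `TextRun`, `rotation_degrees`, `to_page_coords` and `make_run` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TextRun:
    text: str
    x: float        # run origin in page coordinates
    y: float
    rotation: int   # degrees counter-clockwise, rounded to an integer


def rotation_degrees(matrix):
    """Baseline angle of a run, read from a PDF-style matrix (a, b, c, d, e, f)."""
    a, b, _c, _d, _e, _f = matrix
    # For a pure rotation, a = cos(theta) and b = sin(theta).
    return round(math.degrees(math.atan2(b, a)))


def to_page_coords(matrix, x, y):
    """Map a point from the run's local system into page coordinates."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def make_run(text, matrix):
    """Store the text with its page-space origin and a rotation flag."""
    x, y = to_page_coords(matrix, 0.0, 0.0)
    return TextRun(text, x, y, rotation_degrees(matrix))
```

With a 90° matrix such as `(0, 1, -1, 0, 100, 200)`, a step along the local baseline moves the point up the page rather than to the right, which is what lets a later stage bind the header to the column beneath it.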
A handler that recognizes rotation can preserve orientation in Excel through a cell-style alignment:
<xf ...><alignment textRotation="90"/></xf>
The cell <c r="A1" s="1"/> references the style through its s attribute, an index into the workbook’s list of cell formats. Excel supports rotation between −90° and +90°; in OOXML, the textRotation attribute takes 0–180, where 0–90 is counter-clockwise rotation and 91–180 encodes rotation below the horizontal as “90 + angle”.
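The encoding described above is easy to get backwards, so a small helper is worth having. A minimal sketch, assuming the input is a signed angle in whole degrees with positive meaning counter-clockwise; the function name `to_text_rotation` is made up for this example.

```python
def to_text_rotation(degrees: int) -> int:
    """Convert a signed angle (-90..90, positive = counter-clockwise)
    into the 0..180 value used by the textRotation attribute."""
    if not -90 <= degrees <= 90:
        raise ValueError("Excel cell rotation is limited to -90..90 degrees")
    if degrees >= 0:
        return degrees        # 0..90 is stored as-is
    return 90 - degrees       # below the horizontal: 90 + |angle|
```

For example, a header rotated 45° clockwise (−45) is stored as 135, and −90 is stored as 180.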
Most mass-market converters don’t handle rotation correctly; specialized financial and table-oriented tools do better.
Multi-line cells
A cell with several lines of text — a long address, a description, a bullet list:
| Name | Address |
|---|---|
| John Smith | 10 Baker St., Apt. 5,<br>London, NW1 6XE |
| Mary Jones | 25 Park Ave., Apt. 12,<br>New York, NY 10024 |
Each address occupies two physical rows, so the detector sees two rows per logical record: one with both columns populated and one with only the address column. A naive algorithm treats each physical row as a separate record:
| Name | Address |
|---|---|
| John Smith | 10 Baker St., Apt. 5, |
| (empty) | London, NW1 6XE |
| Mary Jones | 25 Park Ave., Apt. 12, |
| (empty) | New York, NY 10024 |
Sorting by name now scrambles the addresses. The fix: if a row has data in only one column, and the previous row also has data in that column, treat it as a continuation. Concatenate with a line break (what Alt+Enter produces in Excel):
<c r="B1" t="inlineStr">
<is><t>10 Baker St., Apt. 5,
London, NW1 6XE</t></is>
</c>
The pattern fails when continuation rows have data in multiple columns (read as separate records), when many columns have long values broken across different line counts (no consistent continuation pattern), or when every column has partial data (no clear boundaries). A human has to step in.
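The continuation rule can be sketched as follows. This is a minimal illustration under simple assumptions: rows arrive as equal-length lists of strings, an empty string means an empty cell, and the function name `merge_continuations` is hypothetical. It implements only the single-column rule, so it inherits the failure modes listed above.

```python
def merge_continuations(rows):
    """Fold single-column continuation rows into the previous record.

    A row counts as a continuation when exactly one of its cells is
    non-empty and the previous record also has data in that column.
    """
    merged = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell.strip()]
        if merged and len(filled) == 1 and merged[-1][filled[0]].strip():
            col = filled[0]
            # Join with a line break, the in-cell equivalent of Alt+Enter.
            merged[-1][col] += "\n" + row[col].strip()
        else:
            merged.append(list(row))
    return merged
```

Applied to the four physical rows of the address example, this yields two records, each with a two-line address. For the line break to display in Excel, the cell's style generally also needs text wrapping switched on.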
Nested tables
Some PDFs put a small table inside a cell of a larger one:
| Category | Subcategories |
|---|---|
| Income | (inner table) |
| Expenses | (inner table) |
PDF doesn’t model nesting. The detector sees a big table and several small ones whose coordinates overlap it, with nothing marking one as the parent of the others. Outcomes:
- Both recognized as separate tables, written independently to Excel.
- Only the big one recognized, the inner ones treated as garbled content.
- Only the small ones recognized, the big one lost.
- Neither recognized.
Most converters end up with the second or fourth outcome and drop the inner tables.
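A detector that wants to avoid dropping inner tables first has to notice that the overlap is containment. One way to do that, sketched here under the assumption that each detected table comes with an axis-aligned bounding box in page coordinates; `Box`, `contains` and `find_nested` are illustrative names and the one-point tolerance is an arbitrary choice.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer, inner, tol=1.0):
    """True when `inner` lies inside `outer`, allowing `tol` points of slack."""
    return (outer.x0 - tol <= inner.x0 and outer.y0 - tol <= inner.y0
            and inner.x1 <= outer.x1 + tol and inner.y1 <= outer.y1 + tol)


def find_nested(tables):
    """Map the index of each outer table to the indices of tables inside it."""
    nested = {}
    for i, outer in enumerate(tables):
        for j, inner in enumerate(tables):
            # Skip self-pairs and near-identical boxes that contain each other.
            if i != j and contains(outer, inner) and not contains(inner, outer):
                nested.setdefault(i, []).append(j)
    return nested
```

This only establishes the parent-child relationship between tables; deciding which cell of the outer table an inner table belongs to would need the outer table's cell grid as well.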
Excel itself has no model for nested tables. A cell is atomic. You can embed a formula or an OLE object, but neither is the same thing. Even with perfect detection, there’s no clean way to write a nested table back. Realistic options: a summary value (a sum, a count) in the cell, the inner table in a neighboring region, or cell references linking the two — none of which a general-purpose tool attempts automatically. Nested tables are a category where automated conversion is structurally incomplete.
The other hard cases
Cross-tabs. Pivot-style tables with a header row, a header column, and a data matrix, often with row and column totals. The algorithm finds the grid but loses the role of each cell — header vs data vs subtotal.
Tables across page breaks with structural changes. Page 1 has 5 columns, page 2 has 4. Nothing in the PDF says which column was dropped or merged, so the columns can’t be aligned automatically with any confidence.
Tables with embedded charts or images. Excel has floating-image support (anchored to a cell rather than embedded in it). A sloppy converter drops the image; a careful one anchors it over the cell.
Footnotes inside tables. A cell with a
² marker referencing a footnote at the bottom. Excel
doesn’t have footnotes the way Word does. Choices: store the footnote as
an Excel Note (a hover popup, historically called “Comment” in classic
Excel; when threaded comments arrived in Microsoft 365, “Comment” was
reassigned and the old popups became “Notes”); inline the footnote text
into the cell, cluttering it; or ignore the marker. Most converters
ignore it.
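Whichever of these options a converter picks, it first has to separate the marker from the cell value. A minimal sketch, assuming the marker survived extraction as trailing Unicode superscript digits; in many PDFs it is instead an ordinary digit set smaller and raised, which this does not cover. The name `split_footnote_marker` is hypothetical.

```python
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")


def split_footnote_marker(cell_text):
    """Return (value, footnote_number); the number is None if no marker is found."""
    body = cell_text.rstrip()
    end = len(body)
    # Walk back over any trailing superscript digits.
    while end > 0 and body[end - 1] in SUPERSCRIPT_DIGITS:
        end -= 1
    marker = body[end:]
    if not marker:
        return body, None
    return body[:end].rstrip(), int(marker.translate(TO_ASCII))
```

The returned number can then be used to look up the footnote text at the bottom of the table and attach it to the cell as a Note.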
Currency placement by locale. $1,000
(US, prefix), 1.000 € (DE, suffix), 1 000 ₽
(RU, suffix with space). Without locale awareness the value is
parsed wrongly: read under US conventions, 1.000 becomes one rather than one thousand.
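The effect of locale on parsing can be shown with a small hand-rolled parser. This is a sketch, not a replacement for a real locale library: the `CONVENTIONS` table is written by hand for the three examples above, the locale keys are just labels, and `parse_amount` is a hypothetical name.

```python
from decimal import Decimal

# Hand-written for illustration: (thousands separators, decimal separator).
CONVENTIONS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "ru_RU": (" \u00a0\u202f", ","),   # plain, no-break and narrow no-break space
}
CURRENCY_SYMBOLS = "$€₽"


def parse_amount(text, locale_name):
    """Parse a currency string into a Decimal using the given conventions."""
    thousands, decimal_sep = CONVENTIONS[locale_name]
    # Drop a leading or trailing currency symbol and surrounding whitespace.
    digits = text.strip().strip(CURRENCY_SYMBOLS).strip()
    for ch in thousands:
        digits = digits.replace(ch, "")
    return Decimal(digits.replace(decimal_sep, "."))
```

Parsed with the matching conventions, all three example strings come out as one thousand. Parsing `1.000 €` with the `en_US` entry instead gives 1, which is the silent error a locale-unaware converter makes.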
Quality
The hard cases are the weakest area of PDF→Excel:
- Rotated headers — usually lost (10–30% handled correctly).
- Multi-line cells — handled poorly (40–60% correct).
- Nested tables — almost always lost (no adequate Excel model).
- Cross-tabs — structure typically falls apart.
If your tables are simple, conversion works well. If they’re complex, manual cleanup is non-negotiable.
The escape hatch: PDF/A-3 and ZUGFeRD/Factur-X
For documents you control, PDF/A-3 and PDF/A-4f let you embed the
source .xlsx or .csv inside the PDF as an
Associated File. The European e-invoicing standard ZUGFeRD /
Factur-X uses this mechanism: one file holds the
human-readable printable form and the machine-readable XML data. With a
PDF prepared this way, no table detection is needed; the data extracts
from the embedded file.