PDF → Excel — Hard cases: rotation, multi-line, nested

A small fraction of tables breaks the standard algorithms entirely. Three categories dominate: rotated text, multi-line cells, nested tables. Each fails for a different reason; a fix doesn’t always exist.

Rotated text

Tables with a wide categorical axis frequently rotate the column headers 90° to fit — month names, country codes, year labels written sideways above narrow numeric columns.

PDF stores text under a transformation matrix that includes rotation. Recovering the headers requires three things: detect the rotation (π/2 here), translate the text coordinates from the local rotated system into the page’s global system, and store the text with a rotation flag.
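
The first two steps can be sketched in a few lines. This is a minimal illustration, not a full text extractor: it assumes the combined text matrix is already available in the usual six-number PDF form [a b c d e f], and the function names are hypothetical.

```python
import math

def rotation_degrees(a: float, b: float) -> float:
    """Counter-clockwise angle encoded in a PDF matrix [a b c d e f].

    For a rotation by theta (with any uniform scale), a is proportional to
    cos(theta) and b to sin(theta), so atan2(b, a) recovers the angle.
    """
    return math.degrees(math.atan2(b, a)) % 360

def to_page_coords(matrix, x, y):
    """Map a point from the local (rotated) text space into page space."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

# A header rotated 90 degrees counter-clockwise, placed at (100, 200).
header_matrix = (0, 1, -1, 0, 100, 200)
angle = rotation_degrees(header_matrix[0], header_matrix[1])
end_of_run = to_page_coords(header_matrix, 10, 0)
```

Here a run that advances 10 units along its own baseline ends at (100, 210) on the page: it moves up, not right, which is exactly what a rotation-blind converter misses.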

A converter that ignores rotation reads the rotated runs as ordinary horizontal text at meaningless coordinates. It can’t bind them to columns; the headers go missing or land in the wrong cells.

A handler that recognizes rotation can preserve orientation in Excel through a cell-style alignment:

<xf ...><alignment textRotation="90"/></xf>

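The cell <c r="A1" s="1"/> references the style through s, a zero-based index into the cellXfs list in styles.xml. Excel supports rotation between −90° and +90°; in OOXML, the textRotation attribute takes 0–180, where 0–90 is counter-clockwise rotation and 91–180 encodes rotation below the horizontal as “90 + angle” (so −45° is stored as 135). The special value 255 marks vertically stacked text.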
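
The encoding is easy to get backwards, so a small helper is worth having. The sketch below assumes the angle has already been normalized to the −90…+90 range that Excel supports; the function name is hypothetical.

```python
def ooxml_text_rotation(angle_deg: int) -> int:
    """Convert a signed angle (-90..90, positive = counter-clockwise)
    into the value stored in the textRotation attribute."""
    if not -90 <= angle_deg <= 90:
        raise ValueError("Excel only supports rotation from -90 to +90 degrees")
    # 0..90 is stored as is; angles below the horizontal become 90 + |angle|.
    return angle_deg if angle_deg >= 0 else 90 - angle_deg

def alignment_xml(angle_deg: int) -> str:
    """Build the alignment element for a cell style."""
    return f'<alignment textRotation="{ooxml_text_rotation(angle_deg)}"/>'
```

For example, alignment_xml(90) yields the element shown above, while alignment_xml(-45) yields textRotation="135".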

Most mass-market converters don’t handle rotation correctly; specialized financial and table-oriented tools do better.

Multi-line cells

A cell with several lines of text — a long address, a description, a bullet list:

Name	Address
John Smith	10 Baker St., Apt. 5,
	London, NW1 6XE
Mary Jones	25 Park Ave., Apt. 12,
	New York, NY 10024

Each address occupies two physical rows, so the detector sees two rows per logical record: one with both columns populated, then one with only the address column. A naive algorithm emits each physical row as its own record:

John Smith	10 Baker St., Apt. 5,
(empty)	London, NW1 6XE
Mary Jones	25 Park Ave., Apt. 12,

Sorting by name now scrambles the addresses. The fix: if a row has data in only one column, and the previous row also has data in that column, treat it as a continuation. Concatenate with a line break (Alt+Enter in Excel):

<c r="B1" t="inlineStr">
  <is><t>10 Baker St., Apt. 5,
London, NW1 6XE</t></is>
</c>
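
The continuation rule can be sketched as follows. This is a minimal version under the assumption that the detector already delivers rows as lists of strings with empty strings for blank cells; merge_continuations is a hypothetical name.

```python
def merge_continuations(rows):
    """Fold single-column continuation rows into the record above them."""
    merged = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell.strip()]
        # A continuation has exactly one populated column, and the record
        # above it must also have data in that same column.
        is_continuation = bool(
            merged and len(filled) == 1 and merged[-1][filled[0]].strip()
        )
        if is_continuation:
            col = filled[0]
            merged[-1][col] += "\n" + row[col].strip()
        else:
            merged.append(list(row))
    return merged

physical_rows = [
    ["John Smith", "10 Baker St., Apt. 5,"],
    ["", "London, NW1 6XE"],
    ["Mary Jones", "25 Park Ave., Apt. 12,"],
    ["", "New York, NY 10024"],
]
records = merge_continuations(physical_rows)
```

With the sample above, records holds two rows, each with the full address joined by a line break. The heuristic is deliberately simple: a genuine record that happens to fill only one column is merged into the row above it.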

The pattern fails when continuation rows have data in multiple columns (read as separate records), when many columns have long values broken across different line counts (no consistent continuation pattern), or when every column has partial data (no clear boundaries). A human has to step in.

Nested tables

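Some PDFs put a small table inside a cell of a larger one. Excel has no adequate model for a table inside a cell, so the inner structure is almost always lost.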

The other hard cases

Cross-tabs. Pivot-style tables with a header row, a header column, and a data matrix, often with row and column totals. The algorithm finds the grid but loses the role of each cell — header vs data vs subtotal.

Tables across page breaks with structural changes. Page 1 has 5 columns, page 2 has 4. No way to align the columns automatically.

Tables with embedded charts or images. Excel has floating-image support (anchored to a cell rather than embedded in it). A sloppy converter drops the image; a careful one anchors it over the cell.

Footnotes inside tables. A cell with a ² marker referencing a footnote at the bottom. Excel doesn’t have footnotes the way Word does. Choices: store the footnote as an Excel Note (a hover popup, historically called “Comment” in classic Excel; when threaded comments arrived in Microsoft 365, “Comment” was reassigned and the old popups became “Notes”); inline the footnote text into the cell, cluttering it; or ignore the marker. Most converters ignore.

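Currency placement by locale. $1,000 (US, prefix, comma as thousands separator), 1.000 € (DE, suffix, period as thousands separator), 1 000 ₽ (RU, suffix, space as thousands separator). Without locale awareness the value is stored as text or parsed as the wrong number: read with US conventions, 1.000 € becomes one euro rather than one thousand.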
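
A minimal sketch of locale-aware parsing follows. The convention table and the function name are illustrative assumptions covering only the three examples above; a production converter would take these rules from a proper locale database.

```python
from decimal import Decimal

# Hypothetical table: locale tag -> (thousands separator characters, decimal separator).
CONVENTIONS = {
    "en-US": (",", "."),
    "de-DE": (".", ","),
    "ru-RU": (" \u00a0\u202f", ","),  # plain, no-break and narrow no-break spaces
}
CURRENCY_SYMBOLS = "$€₽"

def parse_amount(text: str, locale_tag: str) -> Decimal:
    """Strip the currency symbol and normalize separators for one locale."""
    thousands, decimal_sep = CONVENTIONS[locale_tag]
    digits = text.strip().strip(CURRENCY_SYMBOLS).strip()
    for ch in thousands:
        digits = digits.replace(ch, "")
    return Decimal(digits.replace(decimal_sep, "."))

right = parse_amount("1.000 €", "de-DE")   # one thousand
wrong = parse_amount("1.000 €", "en-US")   # one: the period is read as a decimal point
```

The same string gives two different numbers depending on the assumed locale, which is why the locale has to be decided before any cell is typed as numeric.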

Quality

The hard cases are the weakest area of PDF→Excel:

Rotated headers — usually lost (10–30% handled correctly).
Multi-line cells — handled poorly (40–60% correct).
Nested tables — almost always lost (no adequate Excel model).
Cross-tabs — structure typically falls apart.

If your tables are simple, conversion works well. If they’re complex, manual cleanup is non-negotiable.

The escape hatch: PDF/A-3 and ZUGFeRD/Factur-X

For documents you control, PDF/A-3 and PDF/A-4f let you embed the source .xlsx or .csv inside the PDF as an Associated File. The European e-invoicing standard ZUGFeRD / Factur-X uses this mechanism: one file holds the human-readable printable form and the machine-readable XML data. With a PDF prepared this way, no table detection is needed; the data extracts from the embedded file.