
PDF → Excel — Hard cases: rotation, multi-line, nested

[Figure: rotated column headers. In the PDF, the headers January to June are rotated 90° above the values 100, 120, 90, 110, 130, 115. In the Excel XML, the style is <xf ...><alignment textRotation="90"/></xf>; textRotation ranges over 0–180, where 0–90 is counter-clockwise and 91–180 is below horizontal.]

A small fraction of tables breaks the standard algorithms entirely. Three categories dominate: rotated text, multi-line cells, nested tables. Each fails for a different reason; a fix doesn’t always exist.

Rotated text

Tables with a wide categorical axis frequently rotate the column headers 90° to fit — month names, country codes, year labels written sideways above narrow numeric columns.

[Figure: column headers rotated 90° in the source PDF. Headers Jan, Feb, Mar, Apr; rows Alice, Bob, Charlotte; values 1–12.]

PDF stores text under a transformation matrix that includes rotation. Recovering the headers requires three things: detect the rotation (π/2 here), translate the text coordinates from the local rotated system into the page’s global system, and store the text with a rotation flag.
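The first two steps can be sketched in a few lines. This assumes the text run's matrix is available as the six PDF numbers (a, b, c, d, e, f), which is how PDF writes an affine transform; the function names are hypothetical and not taken from any particular library.

```python
import math

def rotation_degrees(matrix):
    """Rotation encoded in a PDF matrix (a, b, c, d, e, f), in whole degrees 0-359."""
    a, b, _c, _d, _e, _f = matrix
    # The first row (a, b) is where the unit x-vector lands; its direction is the rotation.
    return round(math.degrees(math.atan2(b, a))) % 360

def to_page_coords(matrix, x, y):
    """Map a point from the rotated local system into page coordinates."""
    a, b, c, d, e, f = matrix
    # PDF convention: [x' y' 1] = [x y 1] * [[a b 0], [c d 0], [e f 1]]
    return (a * x + c * y + e, b * x + d * y + f)

# A header rotated 90° counter-clockwise and placed at (100, 200) on the page.
header_matrix = (0, 1, -1, 0, 100, 200)
angle = rotation_degrees(header_matrix)
end_of_run = to_page_coords(header_matrix, 10, 0)
```

Here a run that advances 10 units along its own baseline ends up 10 units higher on the page, which is the fact a column binder needs in order to place the header over the right column.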

A converter that ignores rotation reads the rotated runs as ordinary horizontal text at meaningless coordinates. It can’t bind them to columns; the headers go missing or land in the wrong cells.

A handler that recognizes rotation can preserve orientation in Excel through a cell-style alignment:

<xf ...><alignment textRotation="90"/></xf>

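The cell <c r="A1" s="1"/> references the style through its s attribute, an index into the workbook's cellXfs list. Excel supports rotation between −90° and +90°. In OOXML, the textRotation attribute takes 0–180: values 0–90 rotate the text counter-clockwise by that many degrees, and values 91–180 encode a rotation below the horizontal as “90 + angle”, so text tilted 45° downward is stored as 135.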
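The mapping from a signed angle to the stored attribute is easy to get wrong by one branch, so here is a minimal sketch. The helper names are hypothetical; the only thing taken from the format is the 0–180 encoding described above.

```python
def to_text_rotation(angle_deg: int) -> int:
    """Convert a signed angle (-90..90, positive = counter-clockwise) to OOXML textRotation."""
    if not -90 <= angle_deg <= 90:
        raise ValueError("Excel only rotates cell text between -90 and +90 degrees")
    # Non-negative angles are stored as-is; downward angles are stored as 90 + |angle|.
    return angle_deg if angle_deg >= 0 else 90 - angle_deg

def alignment_xml(angle_deg: int) -> str:
    """Build the <alignment> element that goes inside the cell style's <xf>."""
    return f'<alignment textRotation="{to_text_rotation(angle_deg)}"/>'
```

For the sideways month headers, alignment_xml(90) yields the element shown above.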

Most mass-market converters don’t handle rotation correctly; specialized financial and table-oriented tools do better.

Multi-line cells

A cell with several lines of text — a long address, a description, a bullet list:

Name       | Address
John Smith | 10 Baker St., Apt. 5,
           | London, NW1 6XE
Mary Jones | 25 Park Ave., Apt. 12,
           | New York, NY 10024

Each address occupies two physical rows. The detector sees two rows per logical record: one with both columns populated, then one with only the address column. A naive algorithm produces:

John Smith | 10 Baker St., Apt. 5,
(empty)    | London, NW1 6XE
Mary Jones | 25 Park Ave., Apt. 12,
(empty)    | New York, NY 10024

Sorting by name now scrambles the addresses. The fix: if a row has data in only one column and that column matches the previous row’s populated column, it’s a continuation. Concatenate with a line break (Alt+Enter in Excel):

<c r="B1" t="inlineStr">
  <is><t>10 Baker St., Apt. 5,
London, NW1 6XE</t></is>
</c>
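The continuation rule can be sketched as follows, assuming the detector hands over rows as lists of strings with empty strings for blank cells. The function name is hypothetical.

```python
def merge_continuation_rows(rows):
    """Fold single-column rows into the previous row, joined by a line break."""
    merged = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell.strip()]
        # Continuation: exactly one populated column, and the previous row
        # also has data in that same column.
        if merged and len(filled) == 1 and merged[-1][filled[0]].strip():
            col = filled[0]
            merged[-1][col] = merged[-1][col] + "\n" + row[col].strip()
        else:
            merged.append(list(row))
    return merged

rows = [
    ["John Smith", "10 Baker St., Apt. 5,"],
    ["", "London, NW1 6XE"],
    ["Mary Jones", "25 Park Ave., Apt. 12,"],
    ["", "New York, NY 10024"],
]
records = merge_continuation_rows(rows)
```

On the address table this yields two records, each with a two-line address. The same rule would also swallow a genuine record that happens to have only one populated column, so it is a heuristic rather than a guarantee.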

The pattern fails when continuation rows have data in multiple columns (read as separate records), when many columns have long values broken across different line counts (no consistent continuation pattern), or when every column has partial data (no clear boundaries). A human has to step in.

Nested tables

Some PDFs put a small table inside a cell of a larger one:

Category | Subcategories
Income   | Salary  1000
         | Bonus    500
Expenses | Rent     800
         | Food     300

PDF doesn’t model nesting. The detector sees two levels of table, a big one and several small ones, with overlapping coordinates. Outcomes:

  1. Both recognized as separate tables, written independently to Excel.
  2. Only the big one recognized, the inner ones treated as garbled content.
  3. Only the small ones recognized, the big one lost.
  4. Neither recognized.

Most converters take option 2 or 4 and drop the inner tables.
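A converter that wants to keep both levels first has to notice that one detected table sits inside another. A minimal sketch of that check, assuming each detected table comes with a bounding box in page coordinates; BBox, contains, and find_parents are hypothetical names, and the one-unit tolerance is an arbitrary choice.

```python
from typing import List, NamedTuple, Optional

class BBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

def area(box: BBox) -> float:
    return (box.x1 - box.x0) * (box.y1 - box.y0)

def contains(outer: BBox, inner: BBox, tol: float = 1.0) -> bool:
    """True if inner lies within outer (with some slack) and is strictly smaller."""
    return (
        outer.x0 - tol <= inner.x0
        and outer.y0 - tol <= inner.y0
        and inner.x1 <= outer.x1 + tol
        and inner.y1 <= outer.y1 + tol
        and area(outer) > area(inner)
    )

def find_parents(boxes: List[BBox]) -> List[Optional[int]]:
    """For each table, the index of the smallest table that contains it, or None."""
    parents = []
    for i, inner in enumerate(boxes):
        candidates = [j for j, outer in enumerate(boxes) if j != i and contains(outer, inner)]
        parents.append(min(candidates, key=lambda j: area(boxes[j]), default=None))
    return parents

tables = [BBox(0, 0, 100, 100), BBox(50, 10, 95, 40), BBox(50, 60, 95, 90)]
parents = find_parents(tables)
```

With one outer table and two inner ones, the result marks the outer table as a root and both small tables as its children. Detection is the easier half of the problem; writing the result to a spreadsheet is where it gets stuck.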

Excel itself has no model for nested tables. A cell is atomic. You can embed a formula or an OLE object, but neither is the same thing. Even with perfect detection, there’s no clean way to write a nested table back. Realistic options: a summary value (a sum, a count) in the cell, the inner table in a neighboring region, or cell references linking the two — none of which a general-purpose tool attempts automatically. Nested tables are a category where automated conversion is structurally incomplete.
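To make the workaround concrete, here is one way the three options could be combined: the inner rows go to a neighboring region, and the parent cell holds a SUM formula that references them. This is only a sketch. The function name, the choice of columns A/B for the outer table and D/E for the inner rows, and the blank spacer row are all arbitrary assumptions.

```python
def layout_nested(outer_rows):
    """Map (label, [(name, amount), ...]) records to a dict of cell address -> value."""
    cells = {}
    inner_row = 1
    for r, (label, inner) in enumerate(outer_rows, start=1):
        cells[f"A{r}"] = label
        first = inner_row
        for name, amount in inner:
            # Inner table rows live in columns D and E, beside the outer table.
            cells[f"D{inner_row}"] = name
            cells[f"E{inner_row}"] = amount
            inner_row += 1
        if inner:
            # The parent cell summarizes its inner table by reference.
            cells[f"B{r}"] = f"=SUM(E{first}:E{inner_row - 1})"
            inner_row += 1  # leave a blank row between inner tables
    return cells

sheet = layout_nested([
    ("Income", [("Salary", 1000), ("Bonus", 500)]),
    ("Expenses", [("Rent", 800), ("Food", 300)]),
])
```

The sheet stays sortable and the numbers stay linked, but the visual nesting of the original is gone, which is the structural loss described above.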

The other hard cases

Cross-tabs. Pivot-style tables with a header row, a header column, and a data matrix, often with row and column totals. The algorithm finds the grid but loses the role of each cell — header vs data vs subtotal.

Tables across page breaks with structural changes. Page 1 has 5 columns, page 2 has 4. No way to align the columns automatically.

Tables with embedded charts or images. Excel has floating-image support (anchored to a cell rather than embedded in it). A sloppy converter drops the image; a careful one anchors it over the cell.

Footnotes inside tables. A cell carries a ² marker referencing a footnote at the bottom. Excel doesn’t have footnotes the way Word does. Choices: store the footnote as an Excel Note (a hover popup, called “Comment” in classic Excel; when threaded comments arrived in Microsoft 365, “Comment” was reassigned to them and the old popups became “Notes”); inline the footnote text into the cell, cluttering it; or ignore the marker. Most converters ignore it.
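Whichever choice is made, the marker first has to be separated from the value. A minimal sketch, assuming the marker arrives as trailing Unicode superscript digits; many PDFs instead draw an ordinary digit in a smaller, raised font, which this does not cover. The function name is hypothetical.

```python
import re

SUPERSCRIPT_TO_DIGIT = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")
TRAILING_MARKER = re.compile(r"([⁰¹²³⁴⁵⁶⁷⁸⁹]+)$")

def split_footnote(cell_text):
    """Return (value, footnote_number); footnote_number is None when there is no marker."""
    match = TRAILING_MARKER.search(cell_text)
    if not match:
        return cell_text, None
    value = cell_text[:match.start()].rstrip()
    return value, int(match.group(1).translate(SUPERSCRIPT_TO_DIGIT))

value, note_number = split_footnote("1 250²")
```

The cell value then stays clean and the footnote number can be resolved to text for a Note. A unit such as m² would be misread as a marker, so a real implementation needs a whitelist or a check against the footnotes actually present on the page.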

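Currency placement by locale. $1,000 (US, prefix), 1.000 € (DE, suffix), 1 000 ₽ (RU, suffix with space). Without locale awareness the value gets the wrong type or magnitude: 1.000 € read with US rules becomes 1 rather than 1000, and an unrecognized format is stored as text.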
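A small sketch shows how much depends on the locale. The separator table is a hand-written assumption covering only the three examples, not a full locale database, and parse_amount is a hypothetical helper.

```python
from decimal import Decimal

# (grouping separator, decimal mark) per locale; illustrative only.
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "ru_RU": (" ", ","),
}

def parse_amount(text, locale):
    """Strip the currency symbol and parse the number using the locale's separators."""
    group, decimal_mark = SEPARATORS[locale]
    # Keep ASCII digits, sign, and the two punctuation separators; this drops
    # currency symbols and every kind of space, including non-breaking ones.
    kept = "".join(ch for ch in text if ch in "0123456789-.,")
    if group.strip():
        kept = kept.replace(group, "")
    return Decimal(kept.replace(decimal_mark, "."))

us = parse_amount("$1,000", "en_US")
de = parse_amount("1.000 €", "de_DE")
ru = parse_amount("1 000 ₽", "ru_RU")
wrong = parse_amount("1.000 €", "en_US")  # German text, US rules
```

The first three calls all give 1000. The last one gives 1, a thousandfold error that raises no exception and looks like a perfectly valid number in the sheet.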

Quality

The hard cases are the weakest area of PDF→Excel.

If your tables are simple, conversion works well. If they’re complex, manual cleanup is non-negotiable.

The escape hatch: PDF/A-3 and ZUGFeRD/Factur-X

For documents you control, PDF/A-3 and PDF/A-4f let you embed the source .xlsx or .csv inside the PDF as an Associated File. The European e-invoicing standard ZUGFeRD / Factur-X uses this mechanism: one file holds the human-readable printable form and the machine-readable XML data. With a PDF prepared this way, no table detection is needed; the data extracts from the embedded file.

[Figure: PDF/A-3 / Factur-X, the data embedded inside the PDF. A PDF/A-3 (or PDF/A-4f) file holds a human-readable part (the invoice page for printing and viewing) and an Associated File attachment (XML, xlsx, or csv) with the machine-readable representation of the data, for example <Invoice><Date>2026-04-25</Date><Total>1234.56</Total><Lines>...</Lines></Invoice>. Example: ZUGFeRD / Factur-X, one file for printing and automated import into accounting.]
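Once the attachment is pulled out of the PDF, reading it is ordinary XML parsing with no geometry involved. The sketch below uses the simplified invoice from the figure as its input. Real Factur-X files follow the namespaced UN/CEFACT Cross Industry Invoice schema, so the element names here are illustrative only, and extracting the attachment itself depends on the PDF library and is not shown.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Stand-in for the bytes of the embedded file; simplified, not the real schema.
EMBEDDED_XML = """<Invoice>
  <Date>2026-04-25</Date>
  <Total>1234.56</Total>
</Invoice>"""

def read_invoice(xml_text):
    """Read the date and total straight from the machine-readable part."""
    root = ET.fromstring(xml_text)
    return {
        "date": root.findtext("Date"),
        "total": Decimal(root.findtext("Total")),
    }

invoice = read_invoice(EMBEDDED_XML)
```

The total arrives as an exact decimal with no rotation, continuation, or locale guessing, which is the whole appeal of the embedded-data route.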