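Whether a cell behaves like data depends on its type. In OOXML a cell’s t attribute can be n (number, the default when the attribute is absent), s (an index into the shared strings table), inlineStr (a string stored in the cell itself), b (boolean), e (error), or str (string result of a formula). Dates aren’t a s…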
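To see which type a converter actually wrote, you can read the t attribute straight out of the worksheet XML. The following is a minimal sketch using only the Python standard library; the worksheet fragment, the `cell_types` helper, and the `TYPE_NAMES` labels are made up for illustration and do not come from a real file.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical worksheet fragment for illustration only.
SHEET_XML = """
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1"><v>42.5</v></c>
      <c r="C1" t="b"><v>1</v></c>
      <c r="D1" t="inlineStr"><is><t>1,234.50</t></is></c>
      <c r="E1" t="e"><v>#DIV/0!</v></c>
      <c r="F1" t="str"><f>A1&amp;"x"</f><v>Totalx</v></c>
    </row>
  </sheetData>
</worksheet>
"""

# Human-readable labels (our own wording, not part of the file format).
TYPE_NAMES = {
    "n": "number",
    "s": "shared string",
    "inlineStr": "inline string",
    "b": "boolean",
    "e": "error",
    "str": "formula string",
}


def cell_types(sheet_xml):
    """Map each cell reference to a label for its t attribute."""
    root = ET.fromstring(sheet_xml)
    result = {}
    for cell in root.iterfind(".//m:c", NS):
        t = cell.get("t", "n")  # a missing t attribute means numeric
        result[cell.get("r")] = TYPE_NAMES.get(t, t)
    return result


for ref, kind in cell_types(SHEET_XML).items():
    print(ref, kind)
```

In this fragment D1 looks like a number to a human reader but is stored as an inline string, so it would not take part in sums until it is converted.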
A small fraction of tables breaks the standard algorithms entirely. Three categories dominate: rotated text, multi-line cells, nested tables. Each fails for a different reason; a fix doesn’t always…
PDF→Excel conversion tries to recover typed data (numbers, dates, and the relationships between them) from a document that knows nothing about types or relationships. That gap is why Excel results almost alwa…
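The type-recovery step can be illustrated with a small heuristic that takes the text of one extracted cell and guesses whether it is a date, a number, or plain text. This is a sketch under stated assumptions, not how any particular converter works: the `infer_value` name, the list of date formats, and the US-style number conventions (comma as thousands separator, parentheses for negatives) are all choices made for the example.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Assumed date layouts; a real document may use others, and a string
# such as 03/04/2024 is ambiguous without knowing the locale.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def infer_value(text):
    """Guess a typed value for one cell's text; fall back to the string."""
    s = text.strip()
    if not s:
        return None

    # Try dates first, because 2024-03-01 is not a valid number anyway.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass

    # Assumption: comma is a thousands separator, (12) means -12.
    cleaned = s.replace(",", "").replace("\u00a0", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        number = Decimal(cleaned)
    except InvalidOperation:
        return s  # not a number: keep it as text
    if not number.is_finite():
        return s  # treat "NaN" or "Infinity" as text, not data
    return -number if negative else number


for raw in ["1,234.50", "(12)", "2024-03-01", "Total"]:
    print(repr(raw), "->", repr(infer_value(raw)))
```

Every branch here is a guess about intent. The PDF itself records only the glyphs, which is why the same string can legitimately convert to different values under different locale assumptions.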
If you control how the PDF is generated and someone downstream is going to extract its tables into Excel, a handful of choices at generation time will improve the result by an order of magnitude. M…
A PDF page is a flat collection of disconnected drawing operations. There is no “table” object, only text at coordinates and, sometimes, ruled lines. The converter has to decide which rectangular r…
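To make the "text at coordinates" point concrete, here is a minimal sketch of the first step a converter might take: grouping positioned words into rows by their vertical position. The input format (a list of `(x, y, text)` tuples with y increasing down the page), the `group_rows` name, and the tolerance value are assumptions for illustration; real extractors expose richer data, and finding column boundaries is a separate problem not shown here.

```python
def group_rows(words, y_tol=2.0):
    """Group (x, y, text) words into rows, top to bottom, left to right.

    Words whose y coordinate lies within y_tol of the first word in the
    current row are treated as belonging to that row.
    """
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order each row's words by x and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical words as they might come out of a page, in no useful order.
words = [
    (200, 113.6, "12"),
    (72, 100.0, "Item"),
    (72, 114.0, "Bolt"),
    (200, 100.5, "Qty"),
]

print(group_rows(words))
```

With this input the function returns `[['Item', 'Qty'], ['Bolt', '12']]`. The baselines differ by fractions of a point, so an exact-match comparison on y would have split each row in two, which is why the tolerance exists at all.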