← Blog

PDF → Excel — Typing the cell: number, text, or date

Cell typing: the traps $1,234.56 — number with currency usually stays text (has $ and commas); user formats manually 01/02/2024 — date February 1 (Europe) or January 2 (US)? — can't tell without context SKU: 12345 — NOT a number if the converter "normalizes" to a number — leading zeros and context are lost 1 000 000,50 — Russian separator per GOST 8.417 — a non-breaking space U+00A0; an ordinary \s parser may misfire

Whether a cell behaves like data depends on its type. In OOXML the t attribute can be n (number), s/inlineStr (string), b (boolean), e (error), or str (string result of a formula). Dates aren’t a separate type: a date is a number with a date format applied. A formula is a separate <f> element inside the cell.

Numbers can be summed, averaged, sorted numerically. A date (really a number under a date format) can be filtered by range and used in date arithmetic. Strings sort alphabetically. Mistype a column of dollar amounts as text and the column goes inert; the user notices when SUM returns zero.

A PDF stores none of this. Every cell is a sequence of glyphs; the converter guesses.

The first pass: regex

Numbers

^-?\d+([.,]\d+)?$ — optional minus, digits, optional decimal separator and more digits. The fine print is where everything goes wrong:

Dates

A few patterns cover most cases:

The pitfalls are familiar: is 01/02/2024 February 1 or January 2? Are month abbreviations English, German, or French? Is there a time component glued onto the date?

Booleans

true/false, yes/no, Y/N, +/- could all be booleans. Excel’s boolean type only stores 1 or 0; anything else has to be normalized by the converter or it stays as text.

Text

Anything that didn’t match.

Where pattern matching breaks

Real documents contain plenty of things that look numeric but aren’t.

Digit-shaped strings that aren’t numbers. SKU: 12345. Order # 1234. Excel happily strips leading zeros from 007005 and destroys the meaning. A prefix or suffix lets the converter keep it as text; a naked number is harder.

Versions. 1.2.3 is three numbers, not one decimal.

Phone numbers. +1 555 123-45-67 is a “number” only in the loosest sense.

Postal codes. 90210 is a number; 007005 becomes 7005 if treated as one. The right answer depends on what country you’re in.

Ratios and fractions. 1/2 could be a date. 3:4 could be a time.

The second pass: column context

A serious converter doesn’t type cells in isolation; it looks at the column.

  1. Apply the regex pass to every cell in a column.
  2. Tally the resulting types.
  3. If more than ~80% of cells share a type, assign that type to the column.
  4. Cells that don’t match the dominant type are either coerced (often dangerous) or kept as text.

This catches the SKU column with one accidentally-numeric value and the date column with one stray free-text entry.

The third pass: header dictionaries

If the first row is recognized as a header, the column name itself predicts the type:

A multilingual dictionary won’t catch every business glossary, but it raises the floor noticeably for typical reports.

Storing numbers correctly

A recognized number has to be stored as a raw value with a separate display format. Store 1234.56 as the value and #,##0.00 as the format and Excel displays 1,234.56 while doing arithmetic on 1234.56. Store $1,234.56 as a string and Excel treats it as text; formulas won’t touch it.

Most simple converters take the second path; the user fixes it by hand.

The display formats correspond to:

Storing dates correctly

Same architecture. The raw value is an Excel serial number; the format dictates the display. By convention serial 1 is January 1, 1900, but for Lotus 1-2-3 compatibility Excel treats 1900 as a leap year (which it wasn’t). After February 28, 1900 the serial is effectively the count of days since December 30, 1899. Older Mac versions defaulted to the 1904 epoch, which differs from the 1900 epoch by 1462 days. A file with the wrong epoch shifts every date by four years and one day.

A correct converter parses the string with locale-aware logic, stores the serial (e.g. 45323 for February 1, 2024), and applies a format like dd.mm.yyyy. A simple converter stores the literal string "01.02.2024" and Excel treats it as text.

When in doubt, leave it as text

If the type is uncertain, text is the safer fallback. $1,234.56 stored as text loses no information. Coerced to 1234.56, it loses the currency.

Formulas

A PDF has no formulas. Every total was computed before export and lives as a hard-coded number. After conversion the total is a value; change the inputs and it won’t recalculate.

A clever converter can attempt to reconstruct formulas:

In practice this is rare. The cost of a wrong guess (a SUM that silently includes the header row) is too high. Most converters keep values literal and let the user replace them.

The XML form for a formula plus its cached result:

<c r="C5"><f>SUM(C2:C4)</f><v>150</v></c>

<v> is the last computed value, so Excel doesn’t have to recalculate on open.

Quality

Plain numbers without decoration and plain text type correctly. Numbers with currency or percent often stay as text because the regex doesn’t match. Dates type with mixed success; format ambiguity (01/02/2024) is the main culprit. SKUs, identifiers, and version strings get mistyped as numbers more often than not.

After conversion, audit the type of any column you intend to do arithmetic on.