PDF → Excel — Typing the cell: number, text, or date
Whether a cell behaves like data depends on its type. In OOXML the
t attribute can be n (number),
s/inlineStr (string), b
(boolean), e (error), or str (string result of
a formula). Dates aren’t a separate type: a date is a number with a date
format applied. A formula is a separate <f> element
inside the cell.
Numbers can be summed, averaged, sorted numerically. A date (really a number under a date format) can be filtered by range and used in date arithmetic. Strings sort alphabetically. Mistype a column of dollar amounts as text and the column goes inert; the user notices when SUM returns zero.
A PDF stores none of this. Every cell is a sequence of glyphs; the converter guesses.
The first pass: regex
Numbers
^-?\d+([.,]\d+)?$ — optional minus, digits, optional
decimal separator and more digits. The fine print is where everything
goes wrong:
- Decimal separator: period (US/UK) or comma (continental Europe)? You either infer from document context or force the user to set a locale.
- Thousands separator: space, comma, period, or
apostrophe.
1,000,000.50(US),1.000.000,50(DE),1 000 000,50(RU/FR). Russia’s GOST 8.417 specifies a non-breaking space (U+00A0) for digit groups; a parser using\swill happily glue the non-breaking space to a unit of measurement that follows the number. - Currency symbol:
$100,€100,100₽. The symbol has to be peeled off before the value parses. - Percent:
25%should become a value of 0.25 with a percent format applied — not the literal number 25.
Dates
A few patterns cover most cases:
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}for European and American formats.\d{4}-\d{2}-\d{2}for ISO.\d{1,2} (Jan|Feb|...) \d{4}for textual months.
The pitfalls are familiar: is 01/02/2024 February 1 or
January 2? Are month abbreviations English, German, or French? Is there
a time component glued onto the date?
Booleans
true/false, yes/no, Y/N,
+/- could all be booleans. Excel’s boolean type only stores
1 or 0; anything else has to be normalized by
the converter or it stays as text.
Text
Anything that didn’t match.
Where pattern matching breaks
Real documents contain plenty of things that look numeric but aren’t.
Digit-shaped strings that aren’t numbers.
SKU: 12345. Order # 1234. Excel happily strips
leading zeros from 007005 and destroys the meaning. A
prefix or suffix lets the converter keep it as text; a naked number is
harder.
Versions. 1.2.3 is three numbers, not
one decimal.
Phone numbers. +1 555 123-45-67 is a
“number” only in the loosest sense.
Postal codes. 90210 is a number;
007005 becomes 7005 if treated as one. The
right answer depends on what country you’re in.
Ratios and fractions. 1/2 could be a
date. 3:4 could be a time.
The second pass: column context
A serious converter doesn’t type cells in isolation; it looks at the column.
- Apply the regex pass to every cell in a column.
- Tally the resulting types.
- If more than ~80% of cells share a type, assign that type to the column.
- Cells that don’t match the dominant type are either coerced (often dangerous) or kept as text.
This catches the SKU column with one accidentally-numeric value and the date column with one stray free-text entry.
The third pass: header dictionaries
If the first row is recognized as a header, the column name itself predicts the type:
- “Date”, “Datum”, “Period”, “Période” → date.
- “Amount”, “Price”, “Cost”, “Importe”, “Preis” → number with currency.
- “Quantity”, “Qty”, “Cantidad”, “Anzahl” → integer.
- “Name”, “Nombre”, “Description” → text.
A multilingual dictionary won’t catch every business glossary, but it raises the floor noticeably for typical reports.
Storing numbers correctly
A recognized number has to be stored as a raw value with a separate
display format. Store 1234.56 as the value and
#,##0.00 as the format and Excel displays
1,234.56 while doing arithmetic on 1234.56.
Store $1,234.56 as a string and Excel treats it as text;
formulas won’t touch it.
Most simple converters take the second path; the user fixes it by hand.
The display formats correspond to:
- Integer:
1234. - Decimal:
1234.56. - Currency:
$1,234.56. - Percent:
25%(value 0.25). - Scientific:
1.23E+09.
Storing dates correctly
Same architecture. The raw value is an Excel serial number; the format dictates the display. By convention serial 1 is January 1, 1900, but for Lotus 1-2-3 compatibility Excel treats 1900 as a leap year (which it wasn’t). After February 28, 1900 the serial is effectively the count of days since December 30, 1899. Older Mac versions defaulted to the 1904 epoch, which differs from the 1900 epoch by 1462 days. A file with the wrong epoch shifts every date by four years and one day.
A correct converter parses the string with locale-aware logic, stores
the serial (e.g. 45323 for February 1, 2024), and applies a format like
dd.mm.yyyy. A simple converter stores the literal string
"01.02.2024" and Excel treats it as text.
When in doubt, leave it as text
If the type is uncertain, text is the safer fallback.
$1,234.56 stored as text loses no information. Coerced to
1234.56, it loses the currency.
Formulas
A PDF has no formulas. Every total was computed before export and lives as a hard-coded number. After conversion the total is a value; change the inputs and it won’t recalculate.
A clever converter can attempt to reconstruct formulas:
- A bottom-row value equal to the sum of the column above it might be
=SUM(B2:B10). - A “total” column equal to “price” × “quantity” might be
=B2*C2.
In practice this is rare. The cost of a wrong guess (a SUM that silently includes the header row) is too high. Most converters keep values literal and let the user replace them.
The XML form for a formula plus its cached result:
<c r="C5"><f>SUM(C2:C4)</f><v>150</v></c><v> is the last computed value, so Excel doesn’t
have to recalculate on open.
Quality
Plain numbers without decoration and plain text type correctly.
Numbers with currency or percent often stay as text because the regex
doesn’t match. Dates type with mixed success; format ambiguity
(01/02/2024) is the main culprit. SKUs, identifiers, and
version strings get mistyped as numbers more often than not.
After conversion, audit the type of any column you intend to do arithmetic on.