Field-level vs invoice-level accuracy vs touchless rate - why a 99% claim can still mean half your invoices need a human

Reference guide to AI accuracy claims and how to verify them, covering AI concepts, data requirements, control questions, and finance-team decisions.

These three numbers measure completely different things, and vendors choose whichever flatters them. Field-level accuracy is the share of individual fields correct. Invoice-level accuracy is the share of invoices with *every* field correct - far lower, because errors compound across fields. Touchless rate is the share needing zero human action. If field errors are roughly independent, a 99% field-level claim on a 20-field invoice implies only ~82% of invoices are fully correct.

At a Glance

Aspect	Short Answer	Why It Matters
Field-level vs invoice-level accuracy vs touchless rate	These three numbers measure completely different things, and vendors choose whichever flatters them.	Keeps vendor records and payment decisions reliable.
The definitional trap: why "99% accurate" tells you almost nothing	Run the math.	Keeps finance analysis useful, explainable, and governed.
Vendor impact	Demand four things.	Keeps vendor records and payment decisions reliable.
Audit evidence	Pull a random sample of recently processed invoices (a couple hundred gives a usable read), and for each, score every field against the source document: correct, wrong, or missing.	Keeps evidence clear and reduces control risk.
What does a vendor's "success rate" actually measure	"Success rate" is undefined until the vendor specifies it.	Keeps vendor records and payment decisions reliable.

The definitional trap: why "99% accurate" tells you almost nothing

Run the math. If each field is 99% accurate, errors are independent across fields, and an invoice has 20 fields, the probability that *all* fields are right is 0.99²⁰ ≈ 82% - so nearly one in five invoices still has at least one error needing correction, despite the impressive headline. Stretch to a 30-field multi-line invoice and full-invoice accuracy drops to about 74%. This is why a field-level accuracy number, quoted without the field count and without the invoice-level figure, is close to meaningless for predicting how much human work remains. The number that actually governs your staffing is invoice-level accuracy (or its cousin, touchless rate) - and it's the number vendors quote least often. Always ask: accuracy of *what*, measured *how*, and what's the corresponding invoice-level or touchless figure?
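
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes every field has the same accuracy and that field errors are independent; real invoices violate both assumptions to some degree, so treat the result as a rough estimate. The function name `invoice_level_accuracy` is illustrative, not part of any product.

```python
def invoice_level_accuracy(field_accuracy: float, fields_per_invoice: int) -> float:
    """Chance that every field on an invoice is correct.

    Assumes all fields share one accuracy and errors are independent.
    """
    if not 0.0 <= field_accuracy <= 1.0:
        raise ValueError("field_accuracy must be between 0 and 1")
    if fields_per_invoice < 1:
        raise ValueError("fields_per_invoice must be at least 1")
    return field_accuracy ** fields_per_invoice


for n_fields in (20, 30):
    p = invoice_level_accuracy(0.99, n_fields)
    print(f"{n_fields} fields: {p:.1%} fully correct, {1 - p:.1%} need a correction")
# 20 fields: 81.8% fully correct, 18.2% need a correction
# 30 fields: 74.0% fully correct, 26.0% need a correction
```

Plugging in a vendor's quoted field-level figure and your own typical field count gives a quick sanity check on what the headline number implies for human workload.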

What should I actually ask to tell what's real before betting my finance ops on a vendor's AI?

Demand four things. (1) The definition: is the number field-level, invoice-level, or touchless - and how many fields per invoice? (2) The denominator: measured on what mix - clean digital PDFs only, or your paper-and-scan reality? (3) Field-by-field breakdown on *your* invoices: invoice numbers and amounts should be near-perfect; line items and GL coding are where real systems show their true accuracy. (4) Production evidence, not a curated demo set - ask for results from a customer with an invoice mix like yours. A vendor that can't or won't give you field-by-field stats on your own sample is quoting a marketing aggregate, and marketing aggregates don't survive contact with production.

How do I independently measure the accuracy of our current invoice automation - the audit method, sample size, and fields to score?

Pull a random sample of recently processed invoices (a couple hundred gives a usable read), and for each, score every field against the source document: correct, wrong, or missing. Compute field-level accuracy per field type *and* invoice-level accuracy (invoices with zero errors). Stratify the sample to match your real mix - include the scans and the multi-line invoices, not just clean PDFs - or you'll measure the easy cases and be surprised in production.
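
One way to turn the scored sample into numbers is sketched below. The record layout (one dictionary per invoice, mapping a field name to "correct", "wrong", or "missing") and the function names are assumptions made for illustration. The Wilson score interval is a standard way to show how much uncertainty a sample of a given size leaves; it is added here as a suggestion, not as part of the audit method described above.

```python
import math
from collections import Counter


def score_audit(sample):
    """Return (per-field accuracy, invoice-level accuracy) for a scored sample.

    `sample` is a list of dicts: field name -> "correct", "wrong", or "missing".
    Anything other than "correct" counts as an error.
    """
    if not sample:
        raise ValueError("sample must contain at least one invoice")
    totals, correct = Counter(), Counter()
    clean_invoices = 0
    for scores in sample:
        for field, score in scores.items():
            totals[field] += 1
            if score == "correct":
                correct[field] += 1
        if all(score == "correct" for score in scores.values()):
            clean_invoices += 1
    per_field = {field: correct[field] / totals[field] for field in totals}
    return per_field, clean_invoices / len(sample)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Tiny hypothetical sample, for illustration only.
sample = [
    {"invoice_number": "correct", "total": "correct", "gl_code": "correct"},
    {"invoice_number": "correct", "total": "correct", "gl_code": "wrong"},
    {"invoice_number": "correct", "total": "missing", "gl_code": "correct"},
    {"invoice_number": "correct", "total": "correct", "gl_code": "correct"},
]
per_field, invoice_level = score_audit(sample)
print(per_field)       # invoice_number 1.0, total 0.75, gl_code 0.75
print(invoice_level)   # 0.5

# With 200 invoices of which 160 are error-free, the interval is about 11 points wide.
low, high = wilson_interval(160, 200)
print(f"{low:.1%} to {high:.1%}")
```

To respect the stratification advice, run `score_audit` separately on each document type as well as on the whole sample.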

What does a vendor's "success rate" actually measure, and how do definitions get gamed?

"Success rate" is undefined until the vendor specifies it - it can mean extraction coverage, field accuracy, end-to-end posting without intervention, or "the demo worked." Common games: counting a field the AI left blank as "not an error," counting a human-approved AI suggestion as "touchless," measuring only header fields, or measuring on a curated clean-PDF set. Pin the definition to a sentence you'd put in a contract, and the gaming has nowhere to hide.

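The blank-field game described above is easy to see with numbers. The sketch below uses made-up counts (80 correct, 5 wrong, 15 left blank) purely for illustration; the function name and the `blanks_count_as_errors` flag are hypothetical.

```python
def field_accuracy(scores, blanks_count_as_errors=True):
    """Share of fields scored "correct".

    With blanks_count_as_errors=False, fields the system left blank are
    dropped from the denominator, which inflates the result.
    """
    correct = scores.count("correct")
    wrong = scores.count("wrong")
    missing = scores.count("missing")
    denominator = correct + wrong + (missing if blanks_count_as_errors else 0)
    if denominator == 0:
        raise ValueError("no scored fields")
    return correct / denominator


# Hypothetical scores for 100 fields.
scores = ["correct"] * 80 + ["wrong"] * 5 + ["missing"] * 15

strict = field_accuracy(scores, blanks_count_as_errors=True)
gamed = field_accuracy(scores, blanks_count_as_errors=False)
print(f"blanks count as errors: {strict:.1%}")   # 80.0%
print(f"blanks ignored:         {gamed:.1%}")    # 94.1%
```

The same data yields two very different headlines, which is why the contract sentence needs to say how blanks are counted.
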
What accuracy evidence should a vendor be able to show - field-by-field stats on my invoice mix, not a marketing aggregate?

A credible vendor can produce per-field accuracy (invoice number, date, amounts, line items, GL coding) measured on a sample resembling your mix, plus the invoice-level/touchless figure that field stats roll up to. The marketing aggregate ("99% accurate") is the answer to avoid; the useful answer is "on invoices like yours, amounts run X%, line items Y%, GL coding Z%, and full-invoice touchless is W%." Insist on the breakdown.

Realistic AI accuracy by field - invoice number vs amounts vs line items vs GL coding - what's normal?

Structured header fields (invoice number, date) are typically the highest-accuracy fields because they're discrete and well-defined. Amounts, including the total, are high but sensitive to OCR quality on poor scans. Line items run lower because table structure varies wildly by vendor. GL coding is the hardest because it depends on learning your conventions, and the correct code usually can't be read off the document itself. Any vendor quoting one number across all of these is hiding the line-item and coding reality.

Vendor claimed 95%+ accuracy but our real-world experience is much worse - why doesn't demo accuracy survive production?

Demos run on clean, curated invoices the vendor knows the system handles; production includes scans, photos, novel vendors, multi-page backup, and your specific coding complexity. Demo numbers also often measure field-level on easy fields; production reality is invoice-level across all fields including line items and coding. The gap isn't usually deception - it's the difference between a controlled sample and your actual mix, which is precisely why POCs must run on your own invoices.

Precision vs recall in invoice AI - why "it never makes mistakes" and "it catches everything" are different claims?

Precision asks: of the things the AI flagged or filled, how many were right? Recall asks: of the things it should have flagged or filled, how many did it catch? A system tuned for high precision is rarely wrong but stays silent often (low coverage); one tuned for high recall catches nearly everything but with more false positives. "Never makes mistakes" is a precision claim; "catches everything" is a recall claim - in practice, pushing one up tends to pull the other down, and a vendor conflating them is either confused or selling.
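
The two definitions can be written out directly. The counts below are hypothetical and chosen only to contrast a cautious system with an eager one; `precision_recall` is an illustrative helper, not a vendor API.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw counts.

    precision = right answers / everything the system filled or flagged
    recall    = right answers / everything it should have filled or flagged
    """
    attempted = true_positives + false_positives
    expected = true_positives + false_negatives
    if attempted == 0 or expected == 0:
        raise ValueError("need at least one attempted and one expected item")
    return true_positives / attempted, true_positives / expected


# Cautious system: rarely wrong, but silent on 40 of 100 fields it should fill.
cautious = precision_recall(true_positives=60, false_positives=1, false_negatives=40)

# Eager system: fills almost everything, with more wrong values along the way.
eager = precision_recall(true_positives=95, false_positives=25, false_negatives=5)

print(f"cautious: precision {cautious[0]:.1%}, recall {cautious[1]:.1%}")
print(f"eager:    precision {eager[0]:.1%}, recall {eager[1]:.1%}")
# cautious: precision 98.4%, recall 60.0%
# eager:    precision 79.2%, recall 95.0%
```

Asking a vendor for both numbers on the same sample makes it clear which of the two claims they are actually making.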

How should accuracy be measured for scanned, faxed, and emailed-photo invoices vs clean digital PDFs?

Measure them as separate populations, because they behave completely differently - clean PDFs read near-perfectly while degraded images depend on vision-based fallback and run materially lower. Report accuracy by document type, and weight by your actual mix: a construction or healthcare shop drowning in scans should never evaluate on a vendor's clean-PDF benchmark. Your blended accuracy is the volume-weighted average of the per-channel figures, so the larger your share of low-quality documents, the further it falls below the clean-PDF number.
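
Weighting by mix is a simple weighted average. The sketch below uses invented shares and accuracies for three channels, so the figures illustrate the mechanics only and say nothing about any real system; `blended_accuracy` is a hypothetical helper.

```python
def blended_accuracy(channels):
    """Volume-weighted accuracy across document channels.

    `channels` maps a channel name to (share_of_volume, accuracy);
    shares must sum to 1.
    """
    total_share = sum(share for share, _ in channels.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("channel shares must sum to 1")
    return sum(share * accuracy for share, accuracy in channels.values())


# Hypothetical vendor benchmark: clean PDFs only.
benchmark = {"clean_pdf": (1.0, 0.98)}

# Hypothetical scan-heavy shop with the same per-channel accuracies.
scan_heavy = {
    "clean_pdf": (0.3, 0.98),
    "scan_or_fax": (0.5, 0.85),
    "emailed_photo": (0.2, 0.75),
}

print(f"benchmark mix:  {blended_accuracy(benchmark):.1%}")   # 98.0%
print(f"scan-heavy mix: {blended_accuracy(scan_heavy):.1%}")  # 86.9%
```

The per-channel accuracies are identical in both cases; only the mix changes, which is the whole reason to evaluate on your own document mix.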

Should accuracy SLAs go in the contract - what's enforceable and what's a meaningless number?

A bare "99% accuracy" SLA is unenforceable because the definition is contestable - accuracy of what, measured how, on whose data? Enforceable commitments are specific: a defined metric (e.g., invoice-level touchless on a named invoice population), a measurement method both sides agree to, and a remedy. If a vendor won't define the metric precisely enough to measure, the SLA is decoration. Better leverage often comes from a paid pilot on your data than from a contractual number nobody can adjudicate.

Stampli perspective

Stampli deliberately leads with coverage plus human validation rather than a fragile precision number. The approved proof point - Stampli AI performs on average 87% of finance work across 2,700+ unique fields - is explicitly defined as *suggestion coverage* (how often the system proposes a value for a structured, ERP-aligned field), not field accuracy and not autonomous decisioning, and it always carries the caveat that every suggestion is reviewed and approved by a human before posting. Stampli's stance is that accuracy is validated and audit-ready, not a standalone "99.9%" claim - because a high extraction percentage without governance context is exactly the trap buyers should distrust.