Klarefi

How to evaluate AI document extraction for regulated workflows

A buyer's checklist for AI document extraction when the output has to defend itself in front of a regulator, an auditor, and an angry claimant.

Mike Cooper, Founder

11.5% of the data in a US mortgage file is wrong or missing. That number has not moved in five years. Your team still types numbers from a tax return into the origination system, and on a €11,800 loan you cannot afford the rework.

Document extraction vendors will demo on their own corpus. They will show you 93% field accuracy on a clean form. Then you point them at your actual loan packet, your actual FNOL inbox, your actual UBO declaration, and the number drops twenty points. Here is how to evaluate before you sign anything.

Run the test on your documents, not theirs

Most pilots fail because the buyer accepts a demo on a vendor corpus. Demand the opposite. Give them fifty of your worst files. Tax returns missing the schedule. Police reports scanned from a fax. Insurance certificates with the endorsement on a separate page. Photos of a passport on a coffee table. That is your real distribution. If the system breaks there, it will break in production every day.

The seven questions that decide the pilot

Score every vendor on these. Refuse to advance a vendor without a yes on each one.

  • Evidence. Does every extracted value carry a quote, a page index, and the bounding box? A reviewer needs to click the value and see the source.
  • Sufficiency. Can the system decide the packet is incomplete, or does it just extract what is there and call it done? A document set that is missing the schedule is not "92% confident", it is insufficient.
  • Gaps. Can it flag an expired ID, a missing endorsement, a contradictory date, an unsigned signature page? Gap detection is the work, not extraction.
  • Context. Does it use the form answers, the chat history, and prior facts in the case? An EBITDA on page 12 only matches if the legal entity on page 1 matches the borrower on the application.
  • Review. Can a human correct a value without breaking the audit trail? Every edit needs to be recorded, attributed, and reversible.
  • Outage handling. When the model fails, does the case move to review or fall back to a generic answer? Silent fallback is the failure mode that ends careers in regulated work.
  • Throughput honesty. What does the system do under load? If 30% of cases need human review, your unit economics have changed. Demand the review rate.

If a vendor cannot answer every one of these in a 45-minute call, the answer is no.
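One way to keep the scoring honest is to write it down as a gate. The sketch below is illustrative only: the `VendorScorecard` class and the question keys are hypothetical names for this example, not part of any product. It treats an unanswered question the same as a no.

```python
from dataclasses import dataclass, field

# Short keys for the seven questions in the checklist (names are illustrative).
QUESTIONS = (
    "evidence",
    "sufficiency",
    "gaps",
    "context",
    "review",
    "outage_handling",
    "throughput_honesty",
)


@dataclass
class VendorScorecard:
    vendor: str
    # Maps question key -> True (yes) or False (no). A missing key means unanswered.
    answers: dict[str, bool] = field(default_factory=dict)

    def unanswered(self) -> list[str]:
        """Questions the vendor has not answered at all."""
        return [q for q in QUESTIONS if q not in self.answers]

    def failed(self) -> list[str]:
        """Questions the vendor answered with an explicit no."""
        return [q for q in QUESTIONS if self.answers.get(q) is False]

    def advances(self) -> bool:
        """Advance only when every question has an explicit yes."""
        return all(self.answers.get(q) is True for q in QUESTIONS)
```

A vendor with six yeses and one blank does not advance. That is the point of the gate.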

What not to optimize for

Speed alone is a vanity metric. A vendor brags about extracting 200 fields in 4 seconds. Your underwriter still spends 20 minutes verifying each one because none of them carry a citation. The pipeline got faster. The job did not.

Confidence scores alone are worse. "92%" is meaningless to a regulator. A quote on page 4, line 18 of the 2024 tax return is not. Score systems that emit citations. Reject systems that emit only numbers.
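To make the difference concrete, here is a minimal sketch of what "emits citations" could mean as a data shape. Every name here (`Citation`, `ExtractedValue`, `is_defensible`) is an assumption made for illustration, and the zero-based page index is an assumed convention. The check ignores the confidence score on purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    quote: str                                # verbatim text from the source document
    page_index: int                           # zero-based page number (assumed convention)
    bbox: tuple[float, float, float, float]   # x0, y0, x1, y1 on that page


@dataclass(frozen=True)
class ExtractedValue:
    field_name: str
    value: str
    confidence: Optional[float] = None        # optional, and never sufficient on its own
    citation: Optional[Citation] = None


def is_defensible(v: ExtractedValue) -> bool:
    """True only if the value points at a non-empty quote on a valid page."""
    c = v.citation
    if c is None:
        return False
    return bool(c.quote.strip()) and c.page_index >= 0


def uncited(values: list[ExtractedValue]) -> list[str]:
    """Field names a reviewer would have to verify by hand."""
    return [v.field_name for v in values if not is_defensible(v)]
```

Run `uncited` over a vendor's output for one of your files. The length of that list is the work your reviewers still do.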

Accuracy on clean forms is a third trap. Your packets are not clean. If a vendor's accuracy collapses on scans, on rotated pages, on tables that span three pages, you will pay the difference in human review hours forever.

How to price the rework

Build a simple model before the pilot. Three inputs, plus your fully loaded hourly cost of review:

  • Number of files per month.
  • Average minutes of human review per file today.
  • Average minutes of human review per file under the new system, including exceptions.

A pilot that drops review time from 40 minutes to 12 on 2,000 files a month saves 28 minutes per file, roughly 933 hours a month. At €50 per hour fully loaded, that is about €47k a month, or €560k a year. Now you can negotiate from a number, not a vibe.
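The arithmetic fits in a few lines, which makes it easy to rerun with your own numbers. This is a sketch; the function name `rework_savings` and its parameter names are made up for this example.

```python
def rework_savings(
    files_per_month: int,
    minutes_now: float,
    minutes_new: float,
    hourly_cost_eur: float,
) -> dict[str, float]:
    """Estimate review hours and euros saved, from per-file review minutes."""
    minutes_saved = files_per_month * (minutes_now - minutes_new)
    hours_per_month = minutes_saved / 60
    eur_per_month = hours_per_month * hourly_cost_eur
    return {
        "hours_per_month": hours_per_month,
        "eur_per_month": eur_per_month,
        "eur_per_year": eur_per_month * 12,
    }


# The example from the text: 40 minutes down to 12, 2,000 files, EUR 50 per hour.
result = rework_savings(2000, 40, 12, 50)
print(f"{result['hours_per_month']:.0f} hours a month")  # 933 hours a month
print(f"EUR {result['eur_per_month']:,.0f} a month")     # EUR 46,667 a month
print(f"EUR {result['eur_per_year']:,.0f} a year")       # EUR 560,000 a year
```

Put the measured exception rate into `minutes_new` before you trust the result. A system that needs a full manual pass on a third of files has a much higher average than its demo suggests.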

If the vendor will not let you measure that, the vendor knows the answer is bad.

What we ship at Klarefi, and what we will not

Every fact resolves to one of three states. Resolved with cited evidence. Needs input from the applicant. Failed and escalated to a human. There is no fourth state. There is no "85% confident, here is your answer". A reviewer clicks any value in the case and lands on the exact quote in the source PDF.
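As an illustration of the idea, and not a description of Klarefi's actual code, a three-state fact can be modelled so that a value without evidence cannot be constructed at all. The names `FactState` and `Fact`, and the field layout, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FactState(Enum):
    RESOLVED = "resolved"          # resolved with cited evidence
    NEEDS_INPUT = "needs_input"    # waiting on the applicant
    ESCALATED = "escalated"        # failed and handed to a human


@dataclass(frozen=True)
class Fact:
    name: str
    state: FactState
    value: Optional[str] = None
    quote: Optional[str] = None        # verbatim source text backing the value
    page_index: Optional[int] = None   # page in the source PDF (assumed zero-based)
    reason: Optional[str] = None       # why input is needed or why it was escalated

    def __post_init__(self) -> None:
        if self.state is FactState.RESOLVED:
            # A resolved fact must carry both a value and its evidence.
            if self.value is None or not self.quote or self.page_index is None:
                raise ValueError("a resolved fact must carry a value and cited evidence")
        elif self.value is not None:
            # No "probably this" answers in the other two states.
            raise ValueError("only resolved facts may carry a value")
```

There is no confidence field in this sketch. That omission is the design choice.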

If a model call fails, the case moves to a human queue with the reason attached. We do not retry silently and emit a fabricated answer. That choice costs us some "automation rate" in the demo. It earns us deployments in regulated production.
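A minimal sketch of that behaviour, under stated assumptions: `HumanQueue` is a stand-in for whatever review queue you run, and `call_model` is any callable that raises on failure. Neither is a real API. The function never substitutes a default answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HumanQueue:
    """In-memory stand-in for a review queue (illustrative only)."""
    items: list[tuple[str, str]] = field(default_factory=list)

    def enqueue(self, case_id: str, reason: str) -> None:
        self.items.append((case_id, reason))


def extract_or_escalate(
    case_id: str,
    call_model: Callable[[str], dict],
    queue: HumanQueue,
) -> Optional[dict]:
    """Return the model's result, or escalate with the failure reason attached."""
    try:
        return call_model(case_id)
    except Exception as exc:
        # No silent retry and no generic fallback: the case goes to a human.
        queue.enqueue(case_id, f"{type(exc).__name__}: {exc}")
        return None
```

A caller that gets `None` back knows the case is in the queue and why. Ask a vendor to show you the equivalent path in their system.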

The move

Do not buy on a demo. Buy on your worst fifty files, scored on the seven questions above, with a euro number attached. If the vendor cannot commit to those terms, you have your answer.