Turning messy documents into data with vision AI

For most of my career, the dullest and most expensive problem in data work has been the same one: a pile of documents that clearly contain the numbers you need, in a form a computer flatly refuses to read. PDFs, scanned images, supplier emails, weekly promo leaflets. The data is right there. Getting it into a table reliably was, until recently, a quiet nightmare.

I want to talk about why that problem stayed unsolved for so long, what genuinely changed with vision models and structured outputs, and — because the model is the easy part — what you still have to build around them to get something you'd trust in production. I'll use a project of mine, Leaflet Analyzer, as the running example throughout.

Why OCR and regex always broke

The classic recipe was OCR to pull raw text off the page, then a thicket of regular expressions and heuristics to find the bits you cared about. It works in a demo. It falls apart in the real world, and it falls apart in a specific, maddening way: it's fragile against layout.

A retail promo leaflet is a perfect adversary. Prices sit in starbursts and coloured boxes, not in neat rows. A "was" price is struck through; the "now" price is three times the size next to it. Two unrelated products share a column because the designer liked the look. OCR flattens all of that spatial meaning into a stream of characters, and your regex is left guessing which number is a price, which is a pack size, and which is a phone number on the footer. Every new retailer, every seasonal redesign, broke the rules you'd painstakingly tuned for the last one. You were never finished — you were just between failures.

OCR told you what characters were on the page. It never told you what they meant. That gap is where a decade of brittle pipelines lived and died.

What changes with a vision model and a schema

Two things shifted at once, and they matter more together than apart.

The first is that modern vision models read the page as a page — layout, proximity, struck-through text, the visual link between a product photo and its price box. They're not transcribing characters and hoping; they're interpreting a document the way a person glancing at the leaflet would. That alone removes most of the brittleness that killed OCR pipelines.

The second, and the part people underrate, is structured outputs. Instead of asking the model for prose and parsing its answer, you hand it a schema and require the response to conform to it. Anthropic, OpenAI and others now support this directly: you supply a JSON Schema, through OpenAI's structured outputs or Anthropic's tool and JSON support, and the model is constrained to return data that fits it. The difference in practice is enormous: you stop writing fragile parsers for free-form text and start receiving rows that already have the right shape.

[Figure: The pipeline in four moves. The schema in the middle is the small idea that makes the whole thing dependable.]

The shape is the contract

Here's roughly the schema Leaflet Analyzer asks the model to fill for every product it sees on a leaflet. It's deliberately boring — and that's the point. A clear shape is a contract the model has to honour and your downstream code can rely on.

// one promotional product, extracted from a leaflet page
{
  "product":      string,   // "Whole-bean coffee 1kg"
  "brand":        string,   // "Lavazza"
  "price":        number,   // current promo price, e.g. 39.90
  "prev_price":   number | null, // struck-through "was" price, if shown
  "discount_pct": number | null, // derived or printed, e.g. 25
  "valid_from":   string,   // ISO date "2025-08-18"
  "valid_to":     string    // ISO date "2025-08-24"
}

Note what the nullable fields do. Not every leaflet prints a previous price or a discount badge, and forcing the model to invent one would be worse than useless. The schema lets it say "not shown" cleanly, which is itself a piece of honest data. The model returns an array of these objects per page, and suddenly the leaflet is just rows.
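To make the contract concrete, here is a minimal sketch of that shape written as a JSON Schema in a Python dict, plus a small standard-library check on what comes back. The names PRODUCT_SCHEMA, shape_errors and parse_products are mine for illustration, not Leaflet Analyzer's code, and the checker only understands the handful of schema keywords used here; in a real pipeline you would more likely pass the schema to the provider's structured-output feature and validate with a full JSON Schema library.

```python
import json

# Illustrative JSON Schema for one promotional product (mirrors the shape above).
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "brand": {"type": "string"},
        "price": {"type": "number"},
        "prev_price": {"type": ["number", "null"]},    # nullable: "not shown"
        "discount_pct": {"type": ["number", "null"]},  # nullable: "not shown"
        "valid_from": {"type": "string"},
        "valid_to": {"type": "string"},
    },
    "required": [
        "product", "brand", "price", "prev_price",
        "discount_pct", "valid_from", "valid_to",
    ],
    "additionalProperties": False,
}

_JSON_TYPES = {"string": str, "number": (int, float), "null": type(None)}


def matches_type(value, type_spec):
    """True if value matches a JSON type name or any name in a list of them."""
    names = [type_spec] if isinstance(type_spec, str) else type_spec
    for name in names:
        # bool is a subclass of int in Python, but it is not a JSON number.
        if name == "number" and isinstance(value, bool):
            continue
        if isinstance(value, _JSON_TYPES[name]):
            return True
    return False


def shape_errors(row, schema=PRODUCT_SCHEMA):
    """List the ways a row departs from the schema (empty list means it fits)."""
    if not isinstance(row, dict):
        return ["row is not an object"]
    props = schema["properties"]
    errors = [f"missing field: {f}" for f in schema["required"] if f not in row]
    for field, value in row.items():
        if field not in props:
            errors.append(f"unexpected field: {field}")
        elif not matches_type(value, props[field]["type"]):
            errors.append(f"wrong type for {field}")
    return errors


def parse_products(response_text):
    """Split a JSON array of rows into (rows that fit, rows that do not)."""
    good, bad = [], []
    for row in json.loads(response_text):
        (bad if shape_errors(row) else good).append(row)
    return good, bad
```

Even with constrained decoding on the provider side, a cheap check like this at the boundary means a malformed row is caught where it enters the pipeline rather than three stages later.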

The model is the easy part

This is the bit I'd most want a younger version of myself to hear. Getting a good first extraction out of a vision model is genuinely the easy 20% of the work. Everything that makes it production-grade lives in the unglamorous 80% around it.

Validation. A schema guarantees the right shape; it does not guarantee the right meaning. So after extraction I check the business logic: is price below prev_price? Does a printed discount_pct agree with the one you'd compute from the two prices? Are the validity dates sane and in order? Rows that fail get flagged, not silently trusted.
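The three checks above can be sketched in a few lines of Python. This is an illustration under my own assumptions rather than the project's actual rules: the function name, the flag strings and the one-point tolerance on the discount are all invented for the example, and it expects rows that have already passed the shape check.

```python
from datetime import date


def business_rule_flags(row, tolerance_pct=1.0):
    """Return a list of rule violations for one extracted row (empty if clean)."""
    flags = []
    price = row["price"]
    prev = row.get("prev_price")
    printed = row.get("discount_pct")

    if price <= 0:
        flags.append("non_positive_price")

    # A promo price should be lower than the struck-through "was" price.
    if prev is not None and price >= prev:
        flags.append("price_not_below_prev_price")

    # A printed discount should roughly match the one implied by the prices.
    if prev is not None and prev > 0 and printed is not None:
        computed = (1 - price / prev) * 100
        if abs(computed - printed) > tolerance_pct:
            flags.append("discount_mismatch")

    # Validity dates must parse as ISO dates and be in order.
    try:
        start = date.fromisoformat(row["valid_from"])
        end = date.fromisoformat(row["valid_to"])
    except ValueError:
        flags.append("bad_date")
    else:
        if start > end:
            flags.append("dates_out_of_order")

    return flags
```

Returning flags instead of raising keeps the row in the dataset with its problems attached, which is what "flagged, not silently trusted" needs.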

Deduplication. The same product turns up across overlapping leaflets, multiple pages, and re-captures of the same source. Without dedup you don't have a dataset, you have a tally of how many times you ran the pipeline. Leaflet Analyzer has a dedicated normalize stage for exactly this — brand and category tagging, then collapsing duplicates — before anything is delivered.
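As a rough sketch of the collapsing step, assuming "the same offer" means the same brand, product, price and validity window after light text normalisation. That definition is my assumption for the example, and both function names are hypothetical; the real normalize stage may well key on something different.

```python
import re


def dedup_key(row):
    """Build a comparison key that ignores case and stray whitespace."""
    def norm(text):
        return re.sub(r"\s+", " ", text).strip().casefold()

    return (
        norm(row["brand"]),
        norm(row["product"]),
        row["price"],
        row["valid_from"],
        row["valid_to"],
    )


def collapse_duplicates(rows):
    """Keep the first row seen for each key, preserving input order."""
    seen = {}
    for row in rows:
        seen.setdefault(dedup_key(row), row)
    return list(seen.values())
```

Keeping price and dates in the key is deliberate: the same product at a different price next week is a new data point, not a duplicate.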

Resumability. Vision calls cost money and take time, so re-processing work you've already done is pure waste. Leaflet Analyzer runs as a five-stage, resumable pipeline — capture, archive, extract, normalize, deliver — where each stage keeps a SQL view of what's still outstanding. That makes the whole thing idempotent: it can stop and restart without redoing a single leaflet it has already handled. When you're paying per page, "never extract the same thing twice" is a feature, not a nicety.
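Here is a minimal sketch of the "view of what's still outstanding" idea for a single stage, using SQLite from Python's standard library. The table, view and function names are invented for illustration and this is not Leaflet Analyzer's actual schema; extract_fn stands in for whatever makes the vision call.

```python
import sqlite3

# Hypothetical schema: pages captured so far, extractions completed so far,
# and a view listing pages that still need extracting.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (page_id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS extractions (
    page_id TEXT PRIMARY KEY,
    payload TEXT NOT NULL
);
CREATE VIEW IF NOT EXISTS pending_extract AS
    SELECT p.page_id
    FROM pages p
    LEFT JOIN extractions e ON e.page_id = p.page_id
    WHERE e.page_id IS NULL;
"""


def open_db(path=":memory:"):
    """Open (or create) the pipeline database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def run_extract_stage(conn, extract_fn):
    """Process only outstanding pages; return how many were pending."""
    pending = [
        row[0]
        for row in conn.execute(
            "SELECT page_id FROM pending_extract ORDER BY page_id"
        )
    ]
    for page_id in pending:
        payload = extract_fn(page_id)
        conn.execute(
            "INSERT INTO extractions (page_id, payload) VALUES (?, ?)",
            (page_id, payload),
        )
        conn.commit()  # commit per page so a crash loses at most one page
    return len(pending)
```

Because the stage asks the view what is left instead of keeping its own cursor, running it a second time is a no-op, and a run that dies halfway simply picks up the remainder.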

Hard-won tip: Make the unit of work small and idempotent: one page, keyed by a content hash. Then a crash, a rate limit, or a model timeout costs you one page, not a re-run. A resumable pipeline isn't an optimisation you add later; it's the thing that lets you operate at all without dreading every restart.
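Keying by content hash is only a few lines. In this sketch the store is any dict-like object (in practice it would be a database table), and page_key, extract_once and extract_fn are hypothetical names for the example.

```python
import hashlib


def page_key(page_bytes: bytes) -> str:
    """Identify a page by its content, not by filename or capture time."""
    return hashlib.sha256(page_bytes).hexdigest()


def extract_once(page_bytes, store, extract_fn):
    """Call extract_fn only if this exact page has not been handled before."""
    key = page_key(page_bytes)
    if key not in store:
        store[key] = extract_fn(page_bytes)
    return key, store[key]
```

A re-capture of the same image under a different filename hashes to the same key, so it never reaches the model a second time.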

Cost, latency, and keeping a human in the loop

Vision inference isn't free, and it isn't instant. That shapes the architecture more than people expect. Batch the work, cache aggressively, and only ever send a page to the model once — the resumable design above is as much about the bill as it is about robustness. For a weekly cadence of leaflets, latency barely matters; for something interactive it would, and you'd trade model size against speed accordingly.

And I keep a human spot-check, on purpose. Not reviewing every row — that would defeat the point — but sampling. A quick eyeball on a slice of each batch catches the failure modes validation can't, like a model confidently misreading a stylised "9" as a "0". The goal was never to remove people from the loop. It was to move them from typing the data in to checking that the data is right, which is a far better use of an analyst's attention.
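Sampling for the spot-check can be as plain as the sketch below. The 5% fraction and the floor of five rows are placeholder numbers I picked for the example, not a recommendation, and the function name is hypothetical.

```python
import random


def spot_check_sample(rows, fraction=0.05, minimum=5, seed=None):
    """Pick a random slice of a batch for a human to eyeball."""
    if not rows:
        return []
    size = min(len(rows), max(minimum, round(len(rows) * fraction)))
    # A fixed seed makes the sample reproducible, e.g. for audit trails.
    return random.Random(seed).sample(rows, size)
```

Passing a seed derived from the batch identifier lets a reviewer and an auditor look at the same rows later.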

Where this leaves us

The long-unsolved problem — clean structured data out of unstructured documents — isn't fully "solved", but the balance has genuinely tipped. For years the honest answer to "can we automate this?" was "sort of, until the next layout breaks it." Now the model reads the page like a person and hands back a shape you defined, and the engineering effort moves to where it should have been all along: validation, deduplication, idempotent pipelines, and a sensible human check.

That's a much better place to be spending the work. It's also a useful reminder that the impressive demo and the dependable system are different artefacts. The vision model gets you the first; the boring scaffolding around it gets you the second. Leaflet Analyzer turns a pile of weekly leaflets into a competitive-pricing feed teams can actually act on — and almost all of what makes that true is the unglamorous 80%.