Validation corpus · Updated May 18, 2026
How accurate is Koji, really?
Every change to the extraction pipeline is validated against a public corpus of real-world documents. Schemas, expected outputs, and benchmark runs all live in getkoji/corpus. No cherry-picked numbers, no hidden regressions — what you see below is what nightly CI reports.
01 Methodology
How we measure this.
The number at the top of this page isn't a marketing claim — it's a weighted aggregate over a specific corpus, measured with a specific definition of "accuracy," produced by a specific pipeline run on specific models. Here's the fine print.
- 01
What "accuracy" means
Field-level exact match after normalization. For each field in the schema, we check whether the extracted value matches the expected value. Dates, numbers, currencies, and enum mappings are compared in normalized form, so equivalent representations match ("April 12, 2026" ≡ "2026-04-12"). "Accuracy" reported on this page is fields correct ÷ fields checked, not document-level completion.
- 02
What's in the corpus
Real documents, not synthetic. Public-domain sources only: SEC EDGAR, IRS forms, state insurance filings, SROIE, court records. Every document has a committed schema and a ground-truth JSON of expected values. No PII, no proprietary content, no customer data. Rotating mix of clean digital PDFs, scanned PDFs, and multi-column layouts.
- 03
How runs are reproducible
Each run pins the model version (e.g. gpt-4o-mini-2024-07-18), the prompt version, and the Koji commit SHA. Temperature is set to 0. Results write to corpus/.benchmarks/<date>/<category>/<model>.json with full traceability, so you can diff two runs to see exactly what changed.
- 04
How we aggregate
Per-category accuracy is weighted by fields, not documents. A category with 200 small docs doesn't dominate one with 20 large docs. The overall number is a weighted average across all categories. The trend line is one data point per full corpus run.
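The metric and aggregation described in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Koji's implementation: the function names (normalize, count_fields, overall_accuracy) are hypothetical, the date and number formats handled are a small assumed subset, and the sketch assumes the overall number weights each category by its count of checked fields.

```python
from datetime import datetime

# Assumed date formats, chosen only for this illustration.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize(value):
    """Reduce a value to a canonical string so equivalent forms compare equal."""
    text = str(value).strip()
    # Dates: try each assumed format and emit ISO 8601.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    # Numbers and currencies: drop the dollar sign and thousands separators.
    cleaned = text.replace("$", "").replace(",", "")
    try:
        return f"{float(cleaned):.2f}"
    except ValueError:
        # Everything else: case-insensitive string comparison.
        return text.casefold()


def count_fields(extracted, expected):
    """Return (fields correct, fields checked) for one document."""
    correct = sum(
        1
        for name, want in expected.items()
        if name in extracted and normalize(extracted[name]) == normalize(want)
    )
    return correct, len(expected)


def overall_accuracy(per_category):
    """per_category maps a category name to (fields correct, fields checked).

    Pooling the counts is the same as averaging per-category accuracy
    weighted by each category's number of checked fields (an assumption).
    """
    total_correct = sum(correct for correct, _ in per_category.values())
    total_checked = sum(checked for _, checked in per_category.values())
    return total_correct / total_checked if total_checked else 0.0
```

With this sketch, a document whose expected fields are a date, a total, and a vendor name counts as three checked fields, and a missing or mismatched field lowers the ratio without failing the whole document.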
02 By category
Every document type, measured in public.
The corpus is organized by domain: invoices, insurance policies, SEC filings, contracts, government forms, receipts. Each category has its own schema and expected-output fixtures. The accuracy figures below are for Koji's intelligent pipeline running openai/gpt-4o-mini end-to-end.
- Insurance policies: insurance_policies
- IRS forms: irs_forms
- Insurance claims: insurance_claims
- Invoices: invoices
- Insurance certificates: insurance_certificates
- Adversarial: adversarial
- Receipts (SROIE): receipts
- Multi-format: multi_format
03 By model
Your model, your trade-off.
Koji runs on any OpenAI-compatible endpoint or local model via Ollama. Different models hit different points on the accuracy/latency/cost curve, and switching between them is a config change, not a code change.
| Model | Accuracy | Latency | Cost per doc |
|---|---|---|---|
| GPT-4o mini (OpenAI) | 96.1% | 2.1s | $0.0009 |
All runs use the same pipeline configuration: parse via Docling, intelligent extraction (map → route → extract → validate → reconcile), full corpus. Latency and cost are averages across categories. See getkoji/corpus for the raw benchmark JSON.
04 Accuracy over time
Getting better with every release.
Each data point is a nightly CI run against the full corpus. Regressions get flagged in pull requests and fixed before they merge.
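A regression check of this kind can be as simple as comparing per-category accuracy between a baseline run and a candidate run. The sketch below is an assumption about how such a gate might look, not Koji's CI code: the function name find_regressions, the tolerance parameter, and the dictionary shape (category name mapped to accuracy) are all hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return categories whose accuracy dropped by more than `tolerance`.

    Both arguments map a category name to an accuracy in [0, 1]
    (a hypothetical shape chosen for this example). Categories present
    in only one run are ignored here; a real gate might flag them.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


def gate(baseline, candidate, tolerance=0.0):
    """Return a process exit code: 1 if any category regressed, else 0."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for name, (before, after) in sorted(regressions.items()):
        print(f"REGRESSION {name}: {before:.4f} -> {after:.4f}")
    return 1 if regressions else 0
```

A CI job could call gate with the previous nightly result and the pull request's result, then exit with the returned code so that a drop blocks the merge.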
05 Honest numbers
We publish regressions too.
Accuracy numbers from vendor-published benchmarks are usually theoretical: they pick the best documents, the best schema, and the best model, run it once, and screenshot the result. We don't do that.
The corpus is public and versioned. Every document, every schema, every expected output, every nightly CI run — all committed to getkoji/corpus. When we break something, the numbers drop and the PR gets flagged before it merges. When we fix something, the numbers move up and you can diff the run to see why.
If you find a document type where Koji underperforms, submit it to the corpus. We'll add it, run it against the pipeline, and publish the result — regression or not. That's the whole point.
Run it yourself
Don't trust the numbers? Reproduce them.
The corpus is public. The CLI is open source. Clone the repo, run koji bench against any category, and compare your numbers to ours. No sign-up, no vendor lock-in, no "contact sales to see the benchmark."
$ git clone https://github.com/getkoji/corpus
$ uv tool install git+https://github.com/getkoji/koji.git
$ koji bench --corpus ./corpus --category insurance
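To compare a local run with a published one, the two result files can be diffed leaf by leaf. The sketch below makes no assumption about the schema of the benchmark JSON: it flattens any nested structure into dotted paths and reports the values that differ. The function names (flatten, diff_runs, diff_files) are hypothetical, and the file paths you pass in are whatever your own runs produced.

```python
import json

MISSING = "<missing>"  # marker for a path present in only one file


def flatten(node, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf value}."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def diff_runs(old, new):
    """Return {path: (old value, new value)} for every leaf that differs."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path, MISSING), b.get(path, MISSING))
        for path in sorted(a.keys() | b.keys())
        if a.get(path, MISSING) != b.get(path, MISSING)
    }


def diff_files(old_path, new_path):
    """Load two JSON files and diff their contents."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return diff_runs(json.load(f_old), json.load(f_new))
```

Calling diff_files with two result files returns an empty dict when the runs agree, and otherwise a map from each changed path to its old and new value.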