Validation corpus · Updated May 18, 2026

How accurate is
Koji, really?

Every change to the extraction pipeline is validated against a public corpus of real-world documents. Schemas, expected outputs, and benchmark runs all live in getkoji/corpus. No cherry-picked numbers, no hidden regressions — what you see below is what nightly CI reports.

653 documents
96.1% field accuracy
3,961 fields checked

How we
measure this.

The number at the top of this page isn't a marketing claim — it's a weighted aggregate over a specific corpus, measured with a specific definition of "accuracy," produced by a specific pipeline run on specific models. Here's the fine print.

  1. What "accuracy" means

    Field-level exact match after normalization. For each field in the schema, we check whether the extracted value matches the expected value. Dates, numbers, currencies, and enum mappings are fuzzy-matched, so "April 12, 2026" counts as a match for "2026-04-12". "Accuracy" reported on this page is fields correct ÷ fields checked, not document-level completion.

  2. What's in the corpus

    Real documents, not synthetic. Public-domain sources only: SEC EDGAR, IRS forms, state insurance filings, SROIE, court records. Every document has a committed schema and a ground-truth JSON of expected values. No PII, no proprietary content, no customer data. Rotating mix of clean digital PDFs, scanned PDFs, and multi-column layouts.

  3. How runs are reproducible

    Each run pins the model version (e.g. gpt-4o-mini-2024-07-18), the prompt version, and the Koji commit SHA. Temperature 0. Results write to corpus/.benchmarks/<date>/<category>/<model>.json with full traceability — you can diff two runs to see exactly what changed.

  4. How we aggregate

    Accuracy is weighted by fields, not documents. Each category's score is fields correct ÷ fields checked, so a category with 200 small docs doesn't dominate one with 20 large docs. The overall number is the field-weighted average across all categories. The trend line shows one data point per full corpus run.
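The "exact match after normalization" rule in item 1 can be illustrated with a short Python sketch. This is not Koji's actual normalizer: the function names, the list of date formats, and the sample field names are all assumptions made for illustration, and enum mapping is left out.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical date formats to try; a real normalizer would likely accept more.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize(value):
    """Reduce a raw value to a canonical string so equal values compare equal."""
    text = str(value).strip()
    # Dates: try each known format and emit ISO 8601.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    # Numbers and currencies: drop "$" and thousands separators, then
    # canonicalize so "1200.00" and "1200" produce the same string.
    try:
        return str(Decimal(text.replace("$", "").replace(",", "")).normalize())
    except InvalidOperation:
        pass
    # Everything else: lowercase with collapsed whitespace.
    return " ".join(text.lower().split())


def field_accuracy(expected, extracted):
    """Return (fields_correct, fields_checked) for one document."""
    correct = sum(
        1
        for name, want in expected.items()
        if name in extracted and normalize(extracted[name]) == normalize(want)
    )
    return correct, len(expected)


# Illustrative data only; these field names are not taken from the corpus.
expected = {"issue_date": "2026-04-12", "total": "1200.00", "vendor": "Acme Corp"}
extracted = {"issue_date": "April 12, 2026", "total": "$1,200", "vendor": "ACME Corp."}

correct, checked = field_accuracy(expected, extracted)
print(f"{correct}/{checked} fields correct")  # 2/3 fields correct
```

The vendor field fails here because the trailing period survives normalization, which is the kind of near miss an exact-match metric is meant to count as an error.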


Every document type,
measured in public.

The corpus is organized by domain: invoices, insurance policies, SEC filings, contracts, government forms, receipts. Each category has its own schema and expected-output fixtures. The accuracy figures below come from Koji's intelligent pipeline running openai/gpt-4o-mini end-to-end.

SEC filings

sec_filings
99.2% accuracy
102 documents
383 fields checked

Insurance policies

insurance_policies
99.1% accuracy
97 documents
763 fields checked

IRS forms

irs_forms
100.0% accuracy
20 documents
180 fields checked

Insurance claims

insurance_claims
95.5% accuracy
152 documents
1,020 fields checked

Invoices

invoices
95.3% accuracy
155 documents
1,085 fields checked

Insurance certificates

insurance_certificates
94.4% accuracy
61 documents
305 fields checked

Adversarial

adversarial
96.7% accuracy
11 documents
60 fields checked

Receipts (SROIE)

receipts
81.6% accuracy
52 documents
147 fields checked

Multi-format

multi_format
100.0% accuracy
3 documents
18 fields checked
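
As a sanity check, the per-category figures above can be combined into the headline number using field weighting. The sketch below copies the published percentages and field counts from this page; because those percentages are rounded to one decimal place, the recovered total is an approximation of what the raw benchmark JSON would give. The function name is an assumption.

```python
# (category, accuracy in percent, fields checked), copied from the cards above.
CATEGORIES = [
    ("sec_filings", 99.2, 383),
    ("insurance_policies", 99.1, 763),
    ("irs_forms", 100.0, 180),
    ("insurance_claims", 95.5, 1020),
    ("invoices", 95.3, 1085),
    ("insurance_certificates", 94.4, 305),
    ("adversarial", 96.7, 60),
    ("receipts", 81.6, 147),
    ("multi_format", 100.0, 18),
]


def overall_accuracy(categories):
    """Field-weighted average: total correct fields / total checked fields."""
    total_fields = sum(fields for _, _, fields in categories)
    # Rounded percentages only give approximate correct-field counts.
    approx_correct = sum(acc / 100 * fields for _, acc, fields in categories)
    return 100 * approx_correct / total_fields


print(sum(fields for _, _, fields in CATEGORIES))  # 3961
print(round(overall_accuracy(CATEGORIES), 1))      # 96.1
```

Weighting by fields means the two largest categories (invoices and insurance claims) contribute the most to the aggregate, while the three-document multi_format category barely moves it.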

Your model,
your trade-off.

Koji runs on any OpenAI-compatible endpoint or local model via Ollama. Different models hit different points on the accuracy/latency/cost curve, and switching between them is a config change, not a code change.

Model                 Accuracy  Latency  Cost per doc
GPT-4o mini (OpenAI)  96.1%     2.1s     $0.0009

All runs use the same pipeline configuration: parse via Docling, intelligent extraction (map → route → extract → validate → reconcile), full corpus. Latency and cost are averages across categories. See getkoji/corpus for the raw benchmark JSON.


Getting better
with every release.

Each data point is a nightly CI run against the full corpus. Regressions get flagged in pull requests and fixed before they merge.

+4.9pt accuracy since Dec 2025
+141 documents added
[Chart: overall field accuracy per nightly CI run, Dec 2025 through May 2026; latest value 96.1%]

We publish regressions too.

Accuracy numbers from vendor-published benchmarks are usually theoretical: they pick the best documents, the best schema, and the best model, run it once, and screenshot the result. We don't do that.

The corpus is public and versioned. Every document, every schema, every expected output, every nightly CI run — all committed to getkoji/corpus. When we break something, the numbers drop and the PR gets flagged before it merges. When we fix something, the numbers move up and you can diff the run to see why.

If you find a document type where Koji underperforms, submit it to the corpus. We'll add it, run it against the pipeline, and publish the result — regression or not. That's the whole point.
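
Diffing two runs to see why a number moved could look roughly like the sketch below. The layout of the benchmark JSON is not described on this page, so the structure used here (a mapping of document ID to per-field pass/fail booleans) is purely an assumption, as are the function name and the sample document ID; the real files in getkoji/corpus may be organized differently.

```python
def diff_runs(before, after):
    """Return (regressions, fixes): fields whose correctness changed.

    Both arguments use a hypothetical layout: {doc_id: {field_name: bool}}.
    """
    regressions, fixes = [], []
    for doc_id, fields in after.items():
        for field, ok in fields.items():
            was_ok = before.get(doc_id, {}).get(field)
            if was_ok is True and ok is False:
                regressions.append((doc_id, field))
            elif was_ok is False and ok is True:
                fixes.append((doc_id, field))
    return sorted(regressions), sorted(fixes)


# Illustrative in-memory runs; real runs would be loaded from the JSON files.
before = {"inv_001": {"total": True, "due_date": False}}
after = {"inv_001": {"total": False, "due_date": True}}

regressions, fixes = diff_runs(before, after)
print("regressions:", regressions)  # regressions: [('inv_001', 'total')]
print("fixes:", fixes)              # fixes: [('inv_001', 'due_date')]
```

Fields that appear in only one of the two runs are ignored in this sketch, so newly added documents do not show up as regressions or fixes.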

Run it yourself

Don't trust the numbers?
Reproduce them.

The corpus is public. The CLI is open source. Clone the repo, run koji bench against any category, and compare your numbers to ours. No sign-up, no vendor lock-in, no "contact sales to see the benchmark."

$ git clone https://github.com/getkoji/corpus
$ uv tool install git+https://github.com/getkoji/koji.git
$ koji bench --corpus ./corpus --category insurance