Validation corpus · Updated May 18, 2026

How accurate is Koji, really?

Every change to the extraction pipeline is validated against a public corpus of real-world documents. Schemas, expected outputs, and benchmark runs all live in getkoji/corpus. No cherry-picked numbers, no hidden regressions — what you see below is what nightly CI reports.

653 documents

96.1% field accuracy

3,961 fields checked

01 Methodology

How we measure this.

The number at the top of this page isn't a marketing claim — it's a weighted aggregate over a specific corpus, measured with a specific definition of "accuracy," produced by a specific pipeline run on specific models. Here's the fine print.

01

What "accuracy" means

Field-level exact match after normalization. For each field in the schema, we check whether the extracted value matches the expected value. Dates, numbers, currencies, and enum mappings are converted to a canonical form before comparison, so "April 12, 2026" and "2026-04-12" count as the same value. "Accuracy" reported on this page is fields correct ÷ fields checked, not document-level completion.
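
To make "exact match after normalization" concrete, here is a minimal sketch of a normalize-then-compare check. The function names (normalize_date, normalize_number, fields_match), the list of accepted date layouts, and the whitespace-only text fallback are illustrative assumptions, not Koji's actual matcher.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical list of accepted date layouts; a real matcher may accept more.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize_date(value: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no layout fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_number(value: str) -> Optional[Decimal]:
    """Strip a dollar sign and thousands separators, then parse as a decimal."""
    cleaned = value.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def fields_match(extracted: str, expected: str) -> bool:
    """Compare two values after normalization: dates, then numbers, then text."""
    for normalize in (normalize_date, normalize_number):
        a, b = normalize(extracted), normalize(expected)
        if a is not None and b is not None:
            return a == b
    # Assumed fallback: collapse whitespace, then require an exact string match.
    return " ".join(extracted.split()) == " ".join(expected.split())
```

Under these assumptions, fields_match("April 12, 2026", "2026-04-12") is true, while a date that is off by one day still counts as a miss.
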
02

What's in the corpus

Real documents, not synthetic. Public-domain sources only: SEC EDGAR, IRS forms, state insurance filings, SROIE, court records. Every document has a committed schema and a ground-truth JSON of expected values. No PII, no proprietary content, no customer data. Rotating mix of clean digital PDFs, scanned PDFs, and multi-column layouts.
03

How runs are reproducible

Each run pins the model version (e.g. gpt-4o-mini-2024-07-18), the prompt version, and the Koji commit SHA, and runs at temperature 0. Results are written to corpus/.benchmarks/<date>/<category>/<model>.json with full traceability, so you can diff two runs to see exactly what changed.
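
As an illustration of diffing two runs, the sketch below compares two result files field by field. The JSON layout it assumes (a "results" list of objects with "doc", "field", and "correct" keys) is hypothetical, as are the function names; check the files under corpus/.benchmarks for the real shape before relying on it.

```python
import json
from pathlib import Path


def load_results(path: Path) -> dict[tuple[str, str], bool]:
    """Map (document, field) to a correct/incorrect flag for one run.

    Assumes a hypothetical layout:
    {"results": [{"doc": ..., "field": ..., "correct": ...}, ...]}
    """
    data = json.loads(path.read_text())
    return {(r["doc"], r["field"]): bool(r["correct"]) for r in data["results"]}


def diff_runs(
    before: dict[tuple[str, str], bool],
    after: dict[tuple[str, str], bool],
) -> dict[str, list[tuple[str, str]]]:
    """List the fields that flipped between two runs of the same category."""
    shared = before.keys() & after.keys()
    return {
        "regressed": sorted(k for k in shared if before[k] and not after[k]),
        "fixed": sorted(k for k in shared if not before[k] and after[k]),
    }
```

Calling diff_runs(load_results(old_path), load_results(new_path)) on two files for the same category and model would then list which fields regressed and which were fixed.
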
04

How we aggregate

Accuracy is weighted by fields checked, not by document count. Within a category, every field counts equally, so a category with 200 small docs doesn't dominate one with 20 large docs. The overall number is a field-weighted average across all categories: total fields correct ÷ total fields checked. The trend line is one data point per full corpus run.
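
The aggregation can be checked by hand from the per-category figures published in the By category section of this page. The sketch below recomputes the overall number as a field-weighted average. Because the published percentages are rounded to one decimal place, the recovered correct-field counts are approximate; the function and variable names are illustrative.

```python
# (accuracy %, fields checked) per category, as published on this page.
CATEGORIES = {
    "sec_filings": (99.2, 383),
    "insurance_policies": (99.1, 763),
    "irs_forms": (100.0, 180),
    "insurance_claims": (95.5, 1020),
    "invoices": (95.3, 1085),
    "insurance_certificates": (94.4, 305),
    "adversarial": (96.7, 60),
    "receipts": (81.6, 147),
    "multi_format": (100.0, 18),
}


def field_weighted_accuracy(categories: dict[str, tuple[float, int]]) -> float:
    """Overall accuracy = sum(fields correct) / sum(fields checked), in percent."""
    total_fields = sum(fields for _, fields in categories.values())
    # Recover approximate correct-field counts from the rounded percentages.
    total_correct = sum(acc / 100 * fields for acc, fields in categories.values())
    return 100 * total_correct / total_fields


overall = field_weighted_accuracy(CATEGORIES)
unweighted = sum(acc for acc, _ in CATEGORIES.values()) / len(CATEGORIES)
print(f"field-weighted: {overall:.1f}%")    # field-weighted: 96.1%
print(f"unweighted mean: {unweighted:.1f}%")  # unweighted mean: 95.8%
```

The field-weighted result reproduces the 96.1% headline, while a plain mean of the nine category percentages would give a different figure, which is why the weighting matters.
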

02 By category

Every document type, measured in public.

The corpus is organized by domain — invoices, insurance policies, SEC filings, contracts, government forms, receipts. Each category has its own schema and expected-output fixtures. Accuracy below is Koji's intelligent pipeline running openai/gpt-4o-mini end-to-end.

Category	Slug	Accuracy	Documents	Fields checked
SEC filings	sec_filings	99.2%	102	383
Insurance policies	insurance_policies	99.1%	97	763
IRS forms	irs_forms	100.0%	20	180
Insurance claims	insurance_claims	95.5%	152	1,020
Invoices	invoices	95.3%	155	1,085
Insurance certificates	insurance_certificates	94.4%	61	305
Adversarial	adversarial	96.7%	11	60
Receipts (SROIE)	receipts	81.6%	52	147
Multi-format	multi_format	100.0%	3	18

03 By model

Your model, your trade-off.

Koji runs on any OpenAI-compatible endpoint or on a local model via Ollama. Different models hit different points on the accuracy/latency/cost curve, and switching between them is a config change, not a code change.

Model	Provider	Accuracy	Latency	Cost per doc
GPT-4o mini	OpenAI	96.1%	2.1s	$0.0009

All runs use the same pipeline configuration: parse via Docling, intelligent extraction (map → route → extract → validate → reconcile), full corpus. Latency and cost are averages across categories. See getkoji/corpus for the raw benchmark JSON.

04 Accuracy over time

Getting better with every release.

Each data point is a nightly CI run against the full corpus. Regressions get flagged in pull requests and fixed before they merge.
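
A regression gate of the kind described here can be sketched as a comparison between a baseline run and a candidate run. The function name, the per-category accuracy inputs, and the 0.5-point tolerance are all assumptions for illustration, and the candidate numbers are made up; the actual CI check may use different rules.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance_pts: float = 0.5,  # hypothetical threshold, in percentage points
) -> dict[str, float]:
    """Return categories whose accuracy dropped by more than the tolerance.

    Values are the change in percentage points (negative means worse).
    """
    return {
        category: round(candidate[category] - base, 2)
        for category, base in baseline.items()
        if category in candidate and base - candidate[category] > tolerance_pts
    }


# Baseline values are taken from this page; candidate values are invented.
baseline = {"invoices": 95.3, "receipts": 81.6}
candidate = {"invoices": 95.9, "receipts": 79.5}

regressions = find_regressions(baseline, candidate)
if regressions:
    print("flag PR:", regressions)
```

In this made-up case only receipts is flagged, since invoices improved and a small tolerance absorbs run-to-run noise.
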

+4.9pt accuracy since Dec 25

+141 documents added

05 Honest numbers

We publish regressions too.

Accuracy numbers from vendor-published benchmarks are usually theoretical: they pick the best documents, the best schema, and the best model, run it once, and screenshot the result. We don't do that.

The corpus is public and versioned. Every document, every schema, every expected output, every nightly CI run — all committed to getkoji/corpus. When we break something, the numbers drop and the PR gets flagged before it merges. When we fix something, the numbers move up and you can diff the run to see why.

If you find a document type where Koji underperforms, submit it to the corpus. We'll add it, run it against the pipeline, and publish the result — regression or not. That's the whole point.

Run it yourself

Don't trust the numbers? Reproduce them.

The corpus is public. The CLI is open source. Clone the repo, run koji bench against any category, and compare your numbers to ours. No sign-up, no vendor lock-in, no "contact sales to see the benchmark."

$ git clone https://github.com/getkoji/corpus
$ uv tool install git+https://github.com/getkoji/koji.git
$ koji bench --corpus ./corpus --category insurance
