Open source · Cloud or self-hosted · Config-driven

Documents in.
Structured data out.

Koji turns any document — PDFs, Word, scans, contracts, forms — into clean, structured JSON. Run it on our cloud in two minutes, or self-host the open source core on your own infrastructure. Same product, your call.

$ uv tool install git+https://github.com/getkoji/koji.git
$ koji init myproject --template invoice
$ koji process invoice.pdf
→ output/invoice.json  (97% confidence, 2.4s)

Your documents.
Your infrastructure.
Your rules.

Every contract, invoice, and form in your business contains data worth extracting. But the tools that extract it expect you to hand your data over — to their cloud, their servers, their terms of service.

Koji doesn't work that way. It runs on your infrastructure. It uses the models you choose. It shows you exactly what it's doing. And it's open source, forever.

— The thing we wished existed when we needed it.


From document to data
in three unhurried steps.

No black boxes. Every stage is inspectable, configurable, and replaceable. The pipeline is just a YAML file.

i

Define a schema

Declare what you want to extract in a few lines of YAML. Field names, types, descriptions. The description guides the model.

name: invoice
fields:
  invoice_number:
    type: string
    required: true
  total_amount:
    type: number
  line_items:
    type: array

ii

Run the pipeline

Point Koji at a file or a directory. It parses, classifies, routes, and extracts — locally, in containers you control.

$ koji process ./invoice.pdf \
    --schema schemas/invoice.yaml

→ parsing   [████████] 1.2s
→ extracting[████████] 2.4s
→ validating[████████] 0.3s
  ✓ 11 fields, 97% confidence

iii

Get structured data

Clean JSON, per-field confidence scores, and the chunks each field came from. Pipe it anywhere.

{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "total_amount": 4250.00,
  "currency": "USD"
}
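
As one illustration of "pipe it anywhere", the sketch below loads a result like the one above and flattens it into a single CSV line. It relies only on the fields shown in the sample. The helper name `to_csv_row` is hypothetical and not part of Koji.

```python
import csv
import io
import json

# Sample result, copied from the JSON shown above.
SAMPLE = """
{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "total_amount": 4250.00,
  "currency": "USD"
}
"""


def to_csv_row(result: dict, columns: list[str]) -> str:
    """Render one extraction result as a single CSV line (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="")
    # Missing fields become empty cells instead of raising KeyError.
    writer.writerow([result.get(column, "") for column in columns])
    return buffer.getvalue()


result = json.loads(SAMPLE)
row = to_csv_row(result, ["invoice_number", "vendor", "total_amount", "currency"])
print(row)  # prints: INV-2026-0042,Acme Corp,4250.0,USD
```

The same pattern applies to any downstream target: the output is plain JSON, so standard tooling is enough.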

Describe what you want.
Get exactly that.

Schemas are YAML. Pipelines are YAML. The model sees your field descriptions, not your prose. No prompt engineering, no fragile string templates — just a config file you can version, diff, and share.

schemas/invoice.yaml INPUT
name: invoice
description: Commercial invoice extraction

fields:
  invoice_number:
    type: string
    required: true
    description: Unique invoice identifier

  invoice_date:
    type: date
    required: true

  total_amount:
    type: number
    required: true
    description: Grand total including tax

  line_items:
    type: array
    description: Items purchased
    items:
      type: object
      properties:
        description: { type: string }
        quantity: { type: number }
        amount: { type: number }

output/invoice.json OUTPUT
{
  "invoice_number": "INV-2026-0042",
  "invoice_date": "2026-03-15",
  "total_amount": 4250.00,
  "line_items": [
    {
      "description": "Strategic consulting",
      "quantity": 8,
      "amount": 1600
    },
    {
      "description": "Documentation package",
      "quantity": 1,
      "amount": 350
    },
    {
      "description": "Training workshop",
      "quantity": 2,
      "amount": 1150
    }
  ]
}
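
To make the schema-to-output relationship concrete, here is a minimal sketch that checks the JSON above against the schema above: required fields must be present, and each value must match its declared type. This is not Koji's own validator. The schema is transcribed by hand into a Python dict, and the type semantics (for example, `date` meaning an ISO 8601 string) are assumptions.

```python
from datetime import date

# Schema from schemas/invoice.yaml above, transcribed by hand into a dict
# so the sketch needs no YAML parser.
SCHEMA = {
    "invoice_number": {"type": "string", "required": True},
    "invoice_date": {"type": "date", "required": True},
    "total_amount": {"type": "number", "required": True},
    "line_items": {"type": "array"},
}


def matches_type(value, type_name: str) -> bool:
    """Check one value against one schema type name (assumed semantics)."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    if type_name == "array":
        return isinstance(value, list)
    return False


def check(result: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the result fits."""
    problems = []
    for name, spec in schema.items():
        if name not in result:
            if spec.get("required", False):
                problems.append(f"missing required field: {name}")
            continue
        if not matches_type(result[name], spec["type"]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


result = {
    "invoice_number": "INV-2026-0042",
    "invoice_date": "2026-03-15",
    "total_amount": 4250.00,
    "line_items": [
        {"description": "Strategic consulting", "quantity": 8, "amount": 1600},
        {"description": "Documentation package", "quantity": 1, "amount": 350},
        {"description": "Training workshop", "quantity": 2, "amount": 1150},
    ],
}

print(check(result, SCHEMA))                    # the sample output fits
print(check({"total_amount": "4250"}, SCHEMA))  # a deliberately broken result
```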

Cloud or self-hosted.
Same product, either way.

Most platforms force you to pick a side — hosted forever, or roll your own forever. Koji is built to be both. One product, two operators. Pick the path that fits, swap later if it stops fitting.

Hosted · Coming soon

Koji Cloud

We run it. You upload and go.

  • Two-minute signup. No Docker, no infrastructure, no YAML on day one.
  • Free tier. 100 documents every month, forever. No credit card.
  • Pro at $49/month. 500 docs, per-field confidence, email support.
  • Isolated by design. Per-tenant encryption. Models are BYO or ours. Your data is yours.
Request early access

For: Teams without a DevOps function, small operators, anyone who wants to start processing documents this afternoon.

Self-hosted · Available today

Koji OSS

You run it. Full control, full source.

  • Apache 2.0. Free forever. Real open source, not open-core theatre.
  • Five-minute install. uv tool install git+https://github.com/getkoji/koji.git, then koji start. Docker Compose under the hood.
  • Your infrastructure. Laptop, VPC, Kubernetes. Documents never leave the network you control.
  • Your models. Local via Ollama, API providers, or your own inference endpoint. Mix per pipeline step.
Install with uv

For: Teams with compliance requirements, data sovereignty needs, custom model stacks, or anyone who just prefers to run their own tools.

One codebase. Same dashboard, same schemas, same pipelines, same CLI. The only difference is who operates the servers.


Your data is yours.
Always.

Most document processing platforms were designed around a single assumption: that you'll upload your data to their cloud and trust them with it. Koji starts from the opposite premise, and Koji Cloud inherits that premise, too.

01

Sovereign by design

Run it where you need it. Our cloud, your VPC, your laptop, your air-gapped Kubernetes cluster. The architecture is built for the self-hosted case — which means the cloud case inherits real isolation, not a retrofit.

02

Model-agnostic

Local models via Ollama, API providers like OpenAI and Anthropic, or your own inference endpoint. Mix and match per pipeline step. Change a line of YAML, not a line of code.

03

Open source core

Apache 2.0 licensed. The cloud runs the same code you can read, fork, and self-host. No proprietary extraction logic, no closed model pipeline, no "here's the community edition and here's the real one."


Numbers on the
public record.

Every change to the extraction pipeline runs against a public corpus of real-world documents — invoices, insurance policies, SEC filings, contracts, forms, receipts. Accuracy is measured, versioned, and published. Regressions get flagged in PRs before they merge.

See full benchmarks
653 documents in the corpus

96.1% overall field accuracy

3,961 fields checked nightly
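
For readers who want to see what a field-level accuracy number means, here is a minimal sketch of one way to compute it. It assumes exact-match scoring of each expected field against the extracted value, which may differ from how the published benchmark scores fields. The two toy documents are made up for illustration and are not from the corpus.

```python
def field_accuracy(pairs):
    """Count matching fields over (expected, extracted) dict pairs, one pair per document."""
    checked = 0
    correct = 0
    for expected, extracted in pairs:
        for name, want in expected.items():
            checked += 1
            # A field missing from the extraction counts as wrong.
            if name in extracted and extracted[name] == want:
                correct += 1
    return correct, checked


# Made-up toy data: the second document has one wrong field.
corpus = [
    ({"invoice_number": "A-1", "total_amount": 10.0},
     {"invoice_number": "A-1", "total_amount": 10.0}),
    ({"invoice_number": "B-2", "total_amount": 20.0},
     {"invoice_number": "B-2", "total_amount": 25.0}),
]

correct, checked = field_accuracy(corpus)
print(f"{correct}/{checked} fields correct ({correct / checked:.1%})")
# prints: 3/4 fields correct (75.0%)
```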

Simple pricing.
No surprises.

Pay a flat monthly fee for an allowance of documents. When you need more, overage is a flat per-doc rate. No per-token math, no credit systems, no "platform fees" at the bottom of the invoice.
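
The billing model described above reduces to a single line of arithmetic. The sketch below uses the Pro figures quoted on this page ($49/month, 500 documents included). The overage rate is a made-up placeholder, since no per-document rate is stated here.

```python
def monthly_bill_cents(docs_processed: int, base_fee_cents: int,
                       included_docs: int, overage_cents_per_doc: int) -> int:
    """Flat fee plus a flat per-document rate for anything over the allowance."""
    extra_docs = max(0, docs_processed - included_docs)
    return base_fee_cents + extra_docs * overage_cents_per_doc


# $49/month and 500 included docs come from the Pro tier described on this page.
# The 10-cent overage rate is a hypothetical placeholder, not a published price.
bill = monthly_bill_cents(docs_processed=620, base_fee_cents=4900,
                          included_docs=500, overage_cents_per_doc=10)
print(f"${bill / 100:.2f}")  # prints: $61.00
```

Amounts are kept in integer cents so the total never picks up floating-point rounding.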

BYO endpoint. All tiers run on your own AI endpoint: OpenAI, Azure OpenAI, Bedrock, Anthropic, Ollama, or any custom model. You pay your model vendor directly; we never touch the inference bill. No markup on tokens we didn't generate.

Free

$0 forever

Evaluate with real documents. 100 docs/month, no credit card.

Documents / month
100
Pages / document
100
Concurrent jobs
1
Overage
  • Full dashboard access
  • All schema features
  • All extraction strategies
  • Community support

Enterprise

Custom

Compliance, long documents, and a dedicated team behind you.

Documents / month
Custom
Pages / document
Unlimited
Concurrent jobs
Custom
Overage
Negotiated
  • SSO / SAML / SCIM
  • RBAC and audit logs
  • Dedicated tenant or BYOC
  • SOC 2, DPA, custom terms
  • Named support engineer
  • SLA with credits

Prefer to self-host? Koji OSS is free forever under the Apache 2.0 license. No limits, no metering, no phone-home. See it on GitHub →

Two ways to start today

Try Koji
on something real.

Point it at an invoice, a contract, a form — anything you've been meaning to turn into data. Start in the cloud if you want speed, or install the open source core if you want control.

01 Koji Cloud

Sign up, upload, done.

Two minutes from zero to your first extraction. 100 docs/month free forever, $49/month Pro with 500 docs included. Launching soon.

Request early access
02 Koji OSS

One command, go.

Apache 2.0 licensed. Full product, no feature gates. Runs on Docker, Docker Compose, or Kubernetes. Your infrastructure, your rules.

Install
$ uv tool install git+https://github.com/getkoji/koji.git
View on GitHub →