Engineering blog
How we build Koji — extraction pipelines, benchmarking methodology, schema design, and lessons from running document AI in production.
-
Why Open Source for Document AI
We made Koji open source because the security claims that matter most are the ones you can verify yourself.
-
Where Your Documents Go During Extraction
The first question every security team asks when evaluating document AI: 'If I upload a policy PDF, who sees it?' Here's exactly what happens at every stage.
-
Null Semantics: When "Nothing" Is the Right Answer
Every extraction system can pull values out of documents. The harder problem is knowing when a value isn't there — and handling that correctly.
-
Rate Limits, Retries, and the Hidden Accuracy Killer in LLM Pipelines
We spent weeks investigating a 6% accuracy variance. The root cause wasn't the model or the prompts — it was silent HTTP 429 errors treated as 'field not found.'
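A minimal sketch of the distinction at stake, not Koji's actual implementation: a rate-limited call should be retried and, if it never succeeds, reported as an error, so that `None` is reserved for a field that really is absent. The `extract_field` helper, the `RateLimitedError` class, and the `(status, value)` return shape are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


class RateLimitedError(Exception):
    """Raised when every attempt was rejected with HTTP 429."""


def extract_field(
    call: Callable[[], Tuple[int, Optional[str]]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call a (hypothetical) extraction endpoint, retrying on HTTP 429.

    `call` returns (http_status, value). A None value with status 200
    means the field is genuinely absent from the document.
    """
    for attempt in range(max_attempts):
        status, value = call()
        if status == 429:
            # Back off exponentially, but do not sleep after the last attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
            continue
        if status != 200:
            raise RuntimeError(f"unexpected HTTP status {status}")
        return value
    # Never collapse "we could not ask" into "field not found".
    raise RateLimitedError(f"rate limited on all {max_attempts} attempts")
```

With this shape, a caller that sees `None` can trust it as "field not found", while an exhausted retry budget surfaces as an exception that can be counted and excluded from accuracy figures.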
-
Why Heuristic Routing Fails on Long Documents
When a 120-page insurance policy goes through extraction, the AI sees fragments. If the router picks the wrong chunks, the AI can't extract what isn't in front of it.
-
Benchmarking Document Extraction: How We Measure Accuracy Across 1,100 Documents
Every document extraction vendor claims 95%+ accuracy. None of them publish how they measure it. We built an open, reproducible benchmark — here's the methodology.
-
Schema-Driven Extraction: Configuration Over Code for Document AI
Most extraction approaches rely on prompt engineering and hope that the output keeps its shape. Schema-driven extraction replaces that hope with a contract: typed fields, validation rules, and routing hints in a YAML file.
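To make the idea of a contract concrete, here is a rough sketch of what such a schema might look like once the YAML is parsed into a Python dict, together with a small validator. The field names, rule keys (`type`, `pattern`, `min`, `required`, `hint`), and the `validate` function are hypothetical and are not taken from Koji's schema format.

```python
import re
from typing import Any, Dict, List

# Hypothetical schema, as it might look after loading a YAML file.
# "hint" stands in for a routing hint; this validator does not use it.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "policy_number": {
        "type": "string",
        "pattern": r"^[A-Z]{3}-\d+$",
        "required": True,
        "hint": "declarations page",
    },
    "premium": {"type": "number", "min": 0, "required": False},
}

TYPES = {"string": str, "number": (int, float)}


def validate(record: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable violations; empty means the record conforms."""
    errors: List[str] = []
    for name, rule in schema.items():
        value = record.get(name)
        if value is None:
            # A missing optional field is allowed; a missing required one is not.
            if rule.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        if not isinstance(value, TYPES[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "pattern" in rule and not re.match(rule["pattern"], value):
            errors.append(f"{name}: does not match {rule['pattern']}")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    return errors
```

The point of the sketch is that the rules live in data rather than in prompt wording, so changing what counts as a valid extraction means editing the schema, not the code.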