AI extraction benchmark: real supplier documents.
Every vendor talks about AI extraction. This page shows how Telden's extraction actually performs on a representative set of real Declarations of Conformity, test reports, and supplier certificates, with methodology, numbers, and honest limitations.
Dataset
What we tested on.
The benchmark uses a corpus of 50 real documents from 12 suppliers, not synthetic or hand-picked clean samples.
Methodology
How we measured performance.
Field-level extraction
For each document, the extraction pipeline processes the full text (including OCR for scanned pages) and attempts to extract every targeted compliance field. Each extracted field includes a confidence score (0–1), source page reference, and the exact text span used as evidence.
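As a rough illustration of what one extracted field carries, the record could be modelled like this. It is a minimal sketch: the class name `ExtractedField`, its attribute names, and the sample values are assumptions made for the example, not Telden's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One extracted compliance field with its provenance (hypothetical schema)."""

    name: str          # targeted field, e.g. "manufacturer_name"
    value: str         # extracted value as text
    confidence: float  # model-reported score in the range 0-1
    page: int          # 1-based source page reference (assumed convention)
    evidence: str      # exact text span used as evidence

    def __post_init__(self) -> None:
        # Reject records that fall outside the documented ranges.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
        if self.page < 1:
            raise ValueError(f"page must be >= 1: {self.page}")


# Illustrative values only, not taken from the benchmark corpus.
example = ExtractedField(
    name="manufacturer_name",
    value="Example GmbH",
    confidence=0.91,
    page=1,
    evidence="Manufacturer: Example GmbH",
)
```

Keeping the evidence span next to the value is what lets a reviewer check each extraction against the source page without re-reading the whole document.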
Completeness scoring
Completeness measures what fraction of the fields actually present in a document were successfully extracted with a value. A field that is genuinely absent from the document (e.g. no batch number on a certificate that does not list one) is not penalised — only fields the ground truth confirms as present are scored.
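The completeness rule can be sketched as follows. This assumes ground truth and extraction output are both simple field-to-value mappings, with `None` marking a field that is absent from the document; the function name and the sample field names are hypothetical.

```python
from typing import Optional


def completeness(
    ground_truth: dict[str, Optional[str]],
    extracted: dict[str, Optional[str]],
) -> float:
    """Fraction of fields present in the ground truth that were extracted with a value."""
    # Only fields the ground truth confirms as present are scored.
    present = [field for field, value in ground_truth.items() if value is not None]
    if not present:
        raise ValueError("ground truth contains no present fields to score")
    # A field counts as extracted if any non-empty value was returned for it.
    found = sum(1 for field in present if extracted.get(field))
    return found / len(present)


# Illustrative document: the certificate lists no batch number, so that field is not scored.
truth = {
    "manufacturer": "Example GmbH",
    "standard": "EN 71-3",
    "batch_number": None,
    "issue_date": "2026-01-15",
}
output = {"manufacturer": "Example GmbH", "standard": None, "issue_date": "2026-01-15"}
score = completeness(truth, output)  # 2 of 3 present fields were extracted
```

Note that completeness only asks whether a value was returned, not whether it is right; correctness is handled separately by the accuracy metric.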
Accuracy scoring
Accuracy measures how often an extracted value matches the ground truth. Partial matches (e.g. a company name with a minor typographical difference) are scored as incorrect to keep the benchmark conservative. The reported accuracy is therefore a lower bound on the share of extracted values a human reviewer would accept.
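A minimal sketch of this strict scoring, under two assumptions that the text above does not spell out: accuracy is computed over the fields for which a value was extracted, and a value extracted for a field with no ground-truth value counts as incorrect. The function name and sample data are hypothetical.

```python
from typing import Optional


def accuracy(
    ground_truth: dict[str, Optional[str]],
    extracted: dict[str, Optional[str]],
) -> float:
    """Share of extracted values that match the ground truth exactly."""
    # Score only fields where the pipeline actually returned a value.
    scored = {field: value for field, value in extracted.items() if value}
    if not scored:
        raise ValueError("no extracted values to score")
    # Strict string equality: any partial match is counted as incorrect.
    correct = sum(1 for field, value in scored.items() if ground_truth.get(field) == value)
    return correct / len(scored)


# Illustrative data: the trailing full stop makes the company name a partial match.
truth = {"manufacturer": "Example GmbH", "standard": "EN 71-3", "issue_date": "2026-01-15"}
output = {"manufacturer": "Example GmbH.", "standard": "EN 71-3", "issue_date": "2026-01-15"}
score = accuracy(truth, output)  # 2 of 3 extracted values match exactly
```

A more lenient scorer would normalise punctuation and whitespace before comparing; the strict version is what keeps the reported figure conservative.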
Confidence calibration
Extraction confidence scores are reported alongside results. The calibration analysis shows how often high-confidence extractions (≥0.8) are correct vs low-confidence extractions (<0.5), so readers can understand the reliability of the confidence signal itself.
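One way to run such a calibration check is to group extractions into confidence bands and compare the accuracy within each band. The sketch below assumes each result is a `(confidence, is_correct)` pair; the function name, the middle band, and the sample data are illustrative assumptions, not benchmark output.

```python
from typing import Optional


def accuracy_by_confidence_band(
    results: list[tuple[float, bool]],
    high: float = 0.8,
    low: float = 0.5,
) -> dict[str, Optional[float]]:
    """Accuracy within the high (>= high), low (< low), and middle confidence bands."""
    bands: dict[str, list[bool]] = {"high": [], "mid": [], "low": []}
    for confidence, is_correct in results:
        if confidence >= high:
            bands["high"].append(is_correct)
        elif confidence < low:
            bands["low"].append(is_correct)
        else:
            bands["mid"].append(is_correct)
    # An empty band has no defined accuracy, so report None instead of dividing by zero.
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in bands.items()
    }


# Made-up sample results for illustration only.
sample = [(0.95, True), (0.85, True), (0.80, False), (0.60, True), (0.40, False), (0.30, True)]
report = accuracy_by_confidence_band(sample)
```

If the confidence signal is useful, accuracy in the high band should sit clearly above accuracy in the low band.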
Results
Quantitative outcomes.
| Document type (share of corpus) | Completeness | Accuracy | Avg. confidence |
|---|---|---|---|
| Text-native PDFs (65%) | 78% | 85% | 0.82 |
| Scanned PDFs — OCR (25%) | 67% | 73% | 0.68 |
| Hybrid text+scanned (10%) | 72% | 79% | 0.75 |
| Overall (all documents) | 74% | 81% | 0.77 |
Benchmark run: April 2026. Extraction model: served through OpenRouter with the default workspace configuration. Results are stable across runs with the same model version but may vary with model updates.
Caveats and limitations
Corpus size is limited. 50 documents from 12 suppliers is representative but not exhaustive. Performance on documents with highly unusual layouts, dense tables, or non-standard formatting may differ from the averages reported here.
Model version matters. Results reflect the extraction model version available at the time of the benchmark (April 2026). Model updates may improve or, in rare cases, regress on specific document types. We re-run the benchmark against the current production model whenever a model version change is deployed.
Accuracy is a lower bound. The benchmark treats partial matches as incorrect. Real-world review often accepts minor normalisation differences (e.g. "Example GmbH" vs "Example GmbH.") that are scored as incorrect here.
Documents were not pre-selected. The corpus was assembled from real supplier submissions without excluding difficult documents. However, all documents come from active compliance workflows, so documents that are poorly suited to machine reading (e.g. heavily redacted, hand-annotated, or severely degraded scans) are not represented.
This is an extraction benchmark, not a compliance guarantee. AI extraction reduces manual data entry and speeds up review. It does not replace human verification of compliance data. Every extracted value should be reviewed against its source document, especially for high-stakes regulatory submissions.
Confidence scores are model-reported, not independently calibrated. The confidence value is produced by the extraction model itself. While our calibration analysis shows it is a useful signal, it is not a statistical guarantee of correctness.
Practical impact
What this means for your compliance workflow.
Maintenance
How we keep this page honest.
Extraction models improve over time. The benchmark is re-run against the current production model whenever a model version change is deployed. If performance changes materially (≥5 percentage points on any metric), this page is updated within one release cycle.
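The update rule can be expressed as a small check between two benchmark runs. This sketch assumes the rule applies to the percentage metrics (completeness and accuracy); the function name is hypothetical, the `previous` figures are the overall row of the results table, and the `current` figures are invented for the example.

```python
def material_changes(
    previous: dict[str, float],
    current: dict[str, float],
    threshold_pp: float = 5.0,
) -> dict[str, float]:
    """Metrics whose value moved by at least threshold_pp percentage points, with the signed delta."""
    changes = {}
    for metric, old_value in previous.items():
        if metric not in current:
            continue  # metric not measured in the new run, nothing to compare
        delta = current[metric] - old_value
        if abs(delta) >= threshold_pp:
            changes[metric] = delta
    return changes


previous = {"completeness": 74, "accuracy": 81}  # overall row, April 2026 run
current = {"completeness": 80, "accuracy": 83}   # hypothetical later run
flagged = material_changes(previous, current)    # only completeness moved by 5 points or more
```

The check uses the absolute difference, so a regression of five points triggers an update just as an improvement does.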
The benchmark corpus is versioned and documented. Document hashes are recorded so the same inputs can be re-evaluated against future model versions. The ground truth annotations are peer-reviewed by compliance specialists, not generated by the extraction model itself.
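Recording document hashes could look like the following sketch. The choice of SHA-256, the `*.pdf` file pattern, and the function names are assumptions for illustration; the page does not specify which hash or file layout is used.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large scans do not load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(corpus_dir: Path) -> dict[str, str]:
    """Map each PDF file name in the corpus directory to its content hash."""
    return {path.name: sha256_of_file(path) for path in sorted(corpus_dir.glob("*.pdf"))}


def changed_documents(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """File names that were added, removed, or modified between two manifests."""
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

Comparing a stored manifest with a freshly built one before each re-run confirms that a change in the results comes from the model and not from altered inputs.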
We publish results even when they are not flattering. The scanned-document completeness number (67%) is lower than we would like. We document it here because hiding it would be dishonest, and because it drives our engineering priorities for OCR improvement.
See how extraction performs on your documents.
Upload your own supplier documents in an early-access workspace. The extraction pipeline runs on your real PDFs, and you can review the results with full provenance before deciding whether Telden fits your workflow.
Free during early access · No credit card · Pricing starts after General Availability · Existing workspaces get 30 days' notice before pricing changes