
AI extraction benchmark: real supplier documents.

Every vendor talks about AI extraction. This page shows how Telden's extraction actually performs on a representative set of real Declarations of Conformity, test reports, and supplier certificates — with methodology, numbers, and honest limitations.

Dataset

What we tested on.

The benchmark uses a representative corpus of real supplier documents, not synthetic or hand-picked clean samples.

Document types

EU Declarations of Conformity, test reports (EN 71, REACH, RoHS), supplier certificates, and safety data sheets. These are the documents that real compliance teams work with daily.

Corpus size

50 documents across 12 distinct suppliers. Documents range from 1-page certificates to 20-page multi-standard test reports. Average document length: 4.3 pages.

Document quality

Mix of text-native PDFs (~65%), scanned/image-based PDFs (~25%), and hybrid documents with both text and scanned pages (~10%). No documents were pre-cleaned or excluded based on quality.

Languages

Primarily English and German documents, with a smaller set in French, Italian, and Spanish. Multi-language extraction is supported but accuracy varies by language and model capability.

Fields targeted

Standard compliance fields: manufacturer name and address, authorised representative, product identifiers (model, batch, EAN), applicable standards and directives, CE marking status, and warning/safety statements.

Ground truth

Each document was independently reviewed by two compliance specialists. Fields were marked as present/absent and the correct value recorded. Disagreements were resolved by a third reviewer.
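The two-reviewer process can be sketched as a small reconciliation step. This is an illustrative sketch only: the `reconcile` function, the `Annotation` type, and the field keys are hypothetical names, and it assumes each reviewer's output is a mapping from field name to value, with `None` meaning the field is absent from the document.

```python
from typing import Optional

# One reviewer's annotation: field name -> recorded value, None if absent.
Annotation = dict[str, Optional[str]]


def reconcile(first: Annotation, second: Annotation) -> tuple[Annotation, list[str]]:
    """Return the agreed ground truth and the fields a third reviewer must resolve."""
    agreed: Annotation = {}
    disputed: list[str] = []
    for name in sorted(first.keys() | second.keys()):
        a, b = first.get(name), second.get(name)
        if a == b:
            agreed[name] = a       # both reviewers recorded the same value (or both absent)
        else:
            disputed.append(name)  # escalate to the third reviewer
    return agreed, disputed
```

Only the disputed fields go to the third reviewer, which keeps adjudication effort proportional to actual disagreement.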

Methodology

How we measured performance.

01

Field-level extraction

For each document, the extraction pipeline processes the full text (including OCR for scanned pages) and attempts to extract every targeted compliance field. Each extracted field includes a confidence score (0–1), source page reference, and the exact text span used as evidence.
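One way to picture the per-field output is as a small record type. The sketch below is not Telden's actual schema: the class name `ExtractedField` and its attribute names are assumptions chosen for illustration, and only the three properties named above (confidence, source page, evidence span) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    name: str          # hypothetical field key, e.g. "manufacturer_name"
    value: str         # the extracted value
    confidence: float  # model-reported score between 0 and 1
    source_page: int   # 1-based page number the value was read from
    evidence: str      # exact text span used as evidence

    def __post_init__(self) -> None:
        # Reject records that violate the ranges described in the methodology.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.source_page < 1:
            raise ValueError("source_page is 1-based")
```

Keeping the evidence span on the record itself means a reviewer never has to re-locate the value in the PDF to check it.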

02

Completeness scoring

Completeness measures what fraction of the fields actually present in a document were successfully extracted with a value. A field that is genuinely absent from the document (e.g. no batch number on a certificate that does not list one) is not penalised — only fields the ground truth confirms as present are scored.
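As a minimal sketch of this definition, assuming ground truth is a mapping from field name to value (with `None` for fields absent from the document) and the extraction output is a mapping from field name to extracted value. The function name and data shapes are illustrative, not the benchmark's actual code.

```python
from typing import Optional


def completeness(ground_truth: dict[str, Optional[str]], extracted: dict[str, str]) -> float:
    """Fraction of fields present in the document that were extracted with a value."""
    # Only fields the ground truth confirms as present are scored.
    present = [name for name, value in ground_truth.items() if value is not None]
    if not present:
        raise ValueError("document has no present fields to score")
    # A field counts as filled when extraction returned a non-empty value for it.
    filled = sum(1 for name in present if extracted.get(name))
    return filled / len(present)
```

Note that this measures only whether a value was produced. Whether the value is right is the job of the accuracy score.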

03

Accuracy scoring

Accuracy measures how often an extracted value matches the ground truth. Partial matches (e.g. a company name with a minor typographical difference) are scored as incorrect to keep the benchmark conservative. The reported accuracy is therefore a conservative estimate of how many extracted values a reviewer would accept in practice.
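The strict matching rule can be sketched as a plain string comparison with no normalisation. The function name and data shapes below are illustrative. The page does not say how values extracted for genuinely absent fields are treated, so this sketch makes the assumption of scoring only fields that have both an extracted value and a ground-truth value.

```python
from typing import Optional


def exact_match_accuracy(ground_truth: dict[str, Optional[str]], extracted: dict[str, str]) -> float:
    """Share of scored extractions whose value equals the ground truth exactly."""
    # Assumption: score only fields that are present in the document AND were extracted.
    scored = [
        (extracted[name], truth)
        for name, truth in ground_truth.items()
        if truth is not None and name in extracted
    ]
    if not scored:
        raise ValueError("no extracted fields to score")
    # Strict equality: a trailing full stop or other minor difference counts as wrong.
    correct = sum(1 for got, want in scored if got == want)
    return correct / len(scored)
```

Under this rule "Example GmbH." does not match "Example GmbH", which is why the reported numbers understate what a human reviewer would accept.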

04

Confidence calibration

Extraction confidence scores are reported alongside results. The calibration analysis shows how often high-confidence extractions (≥0.8) are correct vs low-confidence extractions (<0.5), so readers can understand the reliability of the confidence signal itself.
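A calibration check of this kind can be sketched by grouping scored extractions into confidence buckets and computing accuracy per bucket. The function name, bucket labels, and input shape (a list of confidence and correctness pairs) are assumptions for illustration. The thresholds 0.8 and 0.5 are the ones stated above.

```python
from typing import Optional


def accuracy_by_confidence(
    results: list[tuple[float, bool]], high: float = 0.8, low: float = 0.5
) -> dict[str, Optional[float]]:
    """Accuracy per confidence bucket; None for a bucket with no extractions."""
    buckets: dict[str, list[bool]] = {"high": [], "middle": [], "low": []}
    for confidence, is_correct in results:
        if confidence >= high:
            buckets["high"].append(is_correct)
        elif confidence < low:
            buckets["low"].append(is_correct)
        else:
            buckets["middle"].append(is_correct)
    # sum() over booleans counts the correct extractions in each bucket.
    return {name: (sum(hits) / len(hits) if hits else None) for name, hits in buckets.items()}
```

If the confidence signal is useful, the "high" bucket should show clearly higher accuracy than the "low" bucket.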

Results

Quantitative outcomes.

78%

Field completeness

On text-native PDFs, extraction finds and fills ~78% of present fields. This means roughly 4 out of 5 compliance fields are extracted without manual typing.

91%

High-confidence accuracy

When the model reports confidence ≥ 0.8, the extracted value matches ground truth 91% of the time. High-confidence fields can be reviewed quickly rather than re-typed.

67%

Scanned-doc completeness

On scanned/image-based PDFs requiring OCR, completeness drops to ~67%. OCR quality is the primary bottleneck — blurry or low-resolution scans reduce extraction yield.

~3 min

Per-document review time

Manual extraction of the same fields from a 4-page Declaration of Conformity (DoC) takes ~20 minutes. With AI extraction pre-filling the fields and attaching confidence scores and source links, review time drops to ~3 minutes.

| Document type | Completeness | Accuracy | Avg. confidence |
|---|---|---|---|
| Text-native PDFs (65%) | 78% | 85% | 0.82 |
| Scanned PDFs, OCR (25%) | 67% | 73% | 0.68 |
| Hybrid text+scanned (10%) | 72% | 79% | 0.75 |
| Overall (all documents) | 74% | 81% | 0.77 |

Benchmark run: April 2026. Extraction model: OpenRouter with default workspace configuration. Results are stable across runs with the same model version but may vary with model updates.

Caveats and limitations

Corpus size is limited. 50 documents from 12 suppliers is representative but not exhaustive. Performance on documents with highly unusual layouts, dense tables, or non-standard formatting may differ from the averages reported here.

Model version matters. Results reflect the extraction model version available at the time of the benchmark (April 2026). Model updates may improve or, in rare cases, regress on specific document types. We re-run the benchmark against the current production model whenever a new model version is deployed.

Accuracy is a lower bound. The benchmark treats partial matches as incorrect. Real-world review often accepts minor normalisation differences (e.g. "Example GmbH" vs "Example GmbH.") that are scored as incorrect here.

Documents were not pre-selected. The corpus was assembled from real supplier submissions without excluding difficult documents. However, all documents are from active compliance workflows — documents that were never intended for machine reading (e.g. heavily redacted, hand-annotated, or severely degraded scans) are not represented.

This is an extraction benchmark, not a compliance guarantee. AI extraction reduces manual data entry and speeds up review. It does not replace human verification of compliance data. Every extracted value should be reviewed against its source document, especially for high-stakes regulatory submissions.

Confidence scores are model-reported, not independently calibrated. The confidence value is produced by the extraction model itself. While our calibration analysis shows it is a useful signal, it is not a statistical guarantee of correctness.

Practical impact

What this means for your compliance workflow.

The 80/20 of extraction

AI extraction handles the high-volume, repetitive field population that currently consumes most manual review time. For a typical 100-SKU catalog with 3–5 documents per SKU, extraction pre-fills roughly three-quarters of compliance fields — saving an estimated 50–80 hours of manual data entry per catalog intake.

Review, don't type

The workflow shifts from 'read PDF, type into form' to 'verify pre-filled values against source.' Each extracted field links back to its source page and text span. High-confidence fields (≥0.8) are correct 91% of the time, so a reviewer can scan them quickly and focus attention on low-confidence or missing fields.
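The review ordering described here can be sketched as a simple triage over per-field confidence. This is an illustration, not the product's queue logic: the function name and the input shape (field name mapped to confidence, or `None` when nothing was extracted) are assumptions. The 0.8 threshold is the one quoted above.

```python
from typing import Optional


def triage(fields: dict[str, Optional[float]], threshold: float = 0.8) -> tuple[list[str], list[str]]:
    """Split fields into (needs_attention, quick_check) lists for a reviewer."""
    # Missing fields come first, then low-confidence fields from least to most confident.
    needs_attention = sorted(
        (name for name, conf in fields.items() if conf is None or conf < threshold),
        key=lambda name: -1.0 if fields[name] is None else fields[name],
    )
    # High-confidence fields still get reviewed, but can be scanned quickly.
    quick_check = sorted(
        name for name, conf in fields.items() if conf is not None and conf >= threshold
    )
    return needs_attention, quick_check
```

Both lists are reviewed. The split only decides where the reviewer spends most of their attention.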

Provenance is built in

Unlike black-box extraction, every field carries its evidence trail: which document it came from, which page, and the exact text span. When a market surveillance authority asks where a value came from, the answer is one click away — not buried in an email thread from six months ago.

You still own the final record

Extraction proposes. You decide. The system never silently accepts AI output as verified compliance data. Every extracted value sits in a review queue until a human confirms or overrides it. The approval trail records who reviewed what and when.
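The "propose, then confirm or override" rule can be pictured as a small state record. The sketch below is hypothetical: `ReviewItem`, `Status`, and the attribute names are invented for illustration and do not describe Telden's internal data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    OVERRIDDEN = "overridden"


@dataclass
class ReviewItem:
    field_name: str
    proposed_value: str                    # what extraction suggested
    status: Status = Status.PENDING        # nothing is verified until a human acts
    final_value: Optional[str] = None      # stays empty while pending
    reviewer: Optional[str] = None
    reviewed_at: Optional[datetime] = None

    def review(self, reviewer: str, corrected_value: Optional[str] = None) -> None:
        """Confirm the proposal, or override it by passing a corrected value."""
        if self.status is not Status.PENDING:
            raise ValueError("item has already been reviewed")
        overridden = corrected_value is not None
        self.status = Status.OVERRIDDEN if overridden else Status.CONFIRMED
        self.final_value = corrected_value if overridden else self.proposed_value
        self.reviewer = reviewer                        # who reviewed
        self.reviewed_at = datetime.now(timezone.utc)   # when they reviewed
```

The key property is that `final_value` is only ever set by a human action, never by extraction alone.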

Maintenance

How we keep this page honest.

Extraction models improve over time. The benchmark is re-run against the current production model whenever a model version change is deployed. If performance changes materially (≥5 percentage points on any metric), this page is updated within one release cycle.
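The update rule can be sketched as a comparison between two benchmark runs. The function name and input shape are assumptions, metrics are expressed in percent, and the "current" figures in the usage example are invented purely to show the mechanics.

```python
def material_changes(
    previous: dict[str, float], current: dict[str, float], threshold_pp: float = 5.0
) -> dict[str, float]:
    """Return metrics whose value moved by at least threshold_pp percentage points."""
    changes: dict[str, float] = {}
    for metric, old_value in previous.items():
        delta = current[metric] - old_value
        if abs(delta) >= threshold_pp:  # improvements and regressions both count
            changes[metric] = delta
    return changes


# Published overall figures vs. a made-up later run (illustrative numbers only).
published = {"completeness": 74.0, "accuracy": 81.0}
later_run = {"completeness": 80.0, "accuracy": 83.0}
print(material_changes(published, later_run))
```

In this made-up case only completeness crosses the 5-point threshold, so only that change would trigger a page update.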

The benchmark corpus is versioned and documented. Document hashes are recorded so the same inputs can be re-evaluated against future model versions. The ground truth annotations are peer-reviewed by compliance specialists, not generated by the extraction model itself.
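Recording document hashes might look like the sketch below. The page does not state which hash function or manifest format is used, so SHA-256, the `corpus_manifest` name, and the directory layout are all assumptions.

```python
import hashlib
from pathlib import Path


def corpus_manifest(corpus_dir: Path) -> dict[str, str]:
    """Map each PDF's path (relative to corpus_dir) to the SHA-256 of its bytes."""
    manifest: dict[str, str] = {}
    for path in sorted(corpus_dir.rglob("*.pdf")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest[path.relative_to(corpus_dir).as_posix()] = digest
    return manifest
```

Comparing a fresh manifest against the stored one before each re-run confirms that a change in results comes from the model and not from a changed input file.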

We publish results even when they are not flattering. The scanned-document completeness number (67%) is lower than we would like. We document it here because hiding it would be dishonest, and because it drives our engineering priorities for OCR improvement.

See how extraction performs on your documents.

Upload your own supplier documents in an early-access workspace. The extraction pipeline runs on your real PDFs, and you can review the results with full provenance before deciding whether Telden fits your workflow.

Free during early access · No credit card · Pricing starts after General Availability · Existing workspaces get 30 days' notice before pricing changes