# Provenance

## Purpose

Every SDF document carries a provenance object that records how the document was produced. This enables consumers to assess document quality, detect stale conversions, deduplicate content, and audit the conversion pipeline.
## Provenance fields

| Field | Type | Required | Description |
|---|---|---|---|
| `converter` | string | Yes | Identifier of the conversion system |
| `converter_version` | string | Yes | Version of the converter |
| `model` | string | Yes | Model used for extraction |
| `conversion_timestamp` | string (ISO 8601) | Yes | When the conversion was performed |
| `conversion_confidence` | number (0–1) | Yes | Overall confidence score for the conversion |
| `content_hash` | string | Yes | Content hash for deduplication (format: `algorithm:hex`) |
| `processing_chain` | array&lt;object&gt; | No | Ordered list of processing steps |
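The required/optional split above can be checked mechanically. A minimal sketch, assuming the provenance object has already been parsed into a dict (`REQUIRED` and `missing_fields` are illustrative names, not part of SDF):

```python
# Sketch: verify the six required provenance fields from the table above.
# REQUIRED and missing_fields are illustrative names, not part of SDF.
REQUIRED = (
    "converter", "converter_version", "model",
    "conversion_timestamp", "conversion_confidence", "content_hash",
)

def missing_fields(provenance: dict) -> list:
    """Return the required fields absent from a provenance object."""
    return [f for f in REQUIRED if f not in provenance]

provenance = {
    "converter": "sdf-pipeline",
    "converter_version": "0.2.0",
    "model": "sdf-extractor-3b",
    "conversion_timestamp": "2025-03-15T15:02:30Z",
    "conversion_confidence": 0.91,
    "content_hash": "sha256:8a3b1c4d",  # illustrative, truncated hash
}
print(missing_fields(provenance))  # []
```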
### converter

Identifies the system that produced the SDF document. This is a free-form string but should be consistent across conversions from the same system.
Examples: "sdf-pipeline", "acme-converter", "publisher-native"
### model

The ML model or extraction engine used. For multi-model pipelines, this is the primary extraction model.
Examples: "sdf-extractor-3b", "gpt-4o", "custom-ner-v2"
### processing_chain

For multi-stage pipelines, the processing chain records each step:

```json
{
  "processing_chain": [
    { "step": "classification", "model": "sdf-classifier-1.5b", "duration_ms": 45, "output": "article.news" },
    { "step": "extraction", "model": "sdf-extractor-3b", "duration_ms": 320, "output": "document" },
    { "step": "validation", "model": "json-schema", "duration_ms": 12, "output": "valid" }
  ]
}
```

Each step records the operation name, the model used, processing duration, and a summary of the output.
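As a sketch of how a consumer might read this structure, the snippet below parses the chain shown above and totals the per-step timings:

```python
import json

# The chain below mirrors the processing_chain example above, parsed from
# JSON text the way an agent receiving an SDF document would parse it.
doc = json.loads("""{
  "processing_chain": [
    { "step": "classification", "model": "sdf-classifier-1.5b", "duration_ms": 45, "output": "article.news" },
    { "step": "extraction", "model": "sdf-extractor-3b", "duration_ms": 320, "output": "document" },
    { "step": "validation", "model": "json-schema", "duration_ms": 12, "output": "valid" }
  ]
}""")

# Sum the duration_ms of every step to get total conversion time.
total_ms = sum(step["duration_ms"] for step in doc["processing_chain"])
print(total_ms)  # 377
```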
### content_hash

A hash of the source content, used for deduplication and cache validation. Format is `algorithm:hex_digest`.

```json
"content_hash": "sha256:8a3b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b"
```

Agents use the content hash to determine if a document has changed since a previous fetch. If the hash matches, the agent can skip re-processing.
Recommended algorithm: SHA-256. The hash is computed over the raw source content (HTML) before conversion.
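A minimal sketch of producing such a hash, assuming the raw HTML has already been fetched (the sample string is illustrative only):

```python
import hashlib

# Hash the raw source HTML before conversion, per the recommendation above,
# and format the result as "algorithm:hex_digest".
raw_html = "<html><body><h1>Example article</h1></body></html>"  # illustrative
digest = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
content_hash = f"sha256:{digest}"

# "sha256:" prefix (7 characters) plus 64 hex characters
print(len(content_hash))  # 71
```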
### conversion_confidence

A float between 0 and 1 representing the converter’s overall confidence in the extraction quality. This is typically an aggregate of per-field confidences.
| Range | Interpretation |
|---|---|
| 0.90 - 1.00 | High confidence — suitable for automated consumption |
| 0.70 - 0.89 | Moderate confidence — review recommended for critical applications |
| 0.50 - 0.69 | Low confidence — extraction may have significant gaps |
| Below 0.50 | Very low confidence — manual verification recommended |
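The bands in the table above amount to a simple threshold check. A sketch (`confidence_band` is a hypothetical helper, not part of SDF):

```python
# Map a conversion_confidence score to the interpretation bands in the
# table above. confidence_band is an illustrative name, not part of SDF.
def confidence_band(score: float) -> str:
    if score >= 0.90:
        return "high"       # suitable for automated consumption
    if score >= 0.70:
        return "moderate"   # review recommended for critical applications
    if score >= 0.50:
        return "low"        # extraction may have significant gaps
    return "very low"       # manual verification recommended

print(confidence_band(0.91))  # high
print(confidence_band(0.72))  # moderate
print(confidence_band(0.40))  # very low
```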
## Example

```json
{
  "provenance": {
    "converter": "sdf-pipeline",
    "converter_version": "0.2.0",
    "model": "sdf-extractor-3b",
    "conversion_timestamp": "2025-03-15T15:02:30Z",
    "conversion_confidence": 0.91,
    "content_hash": "sha256:8a3b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b",
    "processing_chain": [
      { "step": "fetch", "model": null, "duration_ms": 230, "output": "200 OK" },
      { "step": "classify", "model": "sdf-classifier-1.5b", "duration_ms": 42, "output": "article.news" },
      { "step": "extract", "model": "sdf-extractor-3b", "duration_ms": 315, "output": "18 fields extracted" },
      { "step": "validate", "model": "json-schema-draft-2020-12", "duration_ms": 8, "output": "valid" }
    ]
  }
}
```

## Consumer guidance

Agents should use provenance data to:
- **Filter by confidence** — Skip documents below a confidence threshold for the use case
- **Deduplicate** — Use `content_hash` to avoid processing the same content twice
- **Audit** — Track which converter and model versions produced consumed documents
- **Freshness** — Compare `conversion_timestamp` against acceptable staleness windows
- **Pipeline transparency** — Inspect `processing_chain` to understand extraction methodology
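A hedged sketch combining the confidence, deduplication, and freshness checks above; `should_process`, `seen_hashes`, and the default threshold and staleness window are illustrative choices, not part of SDF:

```python
from datetime import datetime, timedelta, timezone

def should_process(provenance: dict, seen_hashes: set,
                   min_confidence: float = 0.70,
                   max_age: timedelta = timedelta(days=30)) -> bool:
    """Decide whether an agent should process a document, per the list above."""
    if provenance["conversion_confidence"] < min_confidence:
        return False  # filter by confidence
    if provenance["content_hash"] in seen_hashes:
        return False  # deduplicate: same content already processed
    ts = datetime.fromisoformat(
        provenance["conversion_timestamp"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - ts > max_age:
        return False  # stale conversion
    return True

prov = {
    "conversion_confidence": 0.91,
    "content_hash": "sha256:8a3b1c4d",  # illustrative, truncated hash
    "conversion_timestamp": datetime.now(timezone.utc).isoformat(),
}
print(should_process(prov, set()))                # True
print(should_process(prov, {"sha256:8a3b1c4d"}))  # False
```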