Provenance

Every SDF document carries a provenance object that records how the document was produced. This enables consumers to assess document quality, detect stale conversions, deduplicate content, and audit the conversion pipeline.

| Field                 | Type              | Required | Description                                            |
|-----------------------|-------------------|----------|--------------------------------------------------------|
| converter             | string            | Yes      | Identifier of the conversion system                    |
| converter_version     | string            | Yes      | Version of the converter                               |
| model                 | string            | Yes      | Model used for extraction                              |
| conversion_timestamp  | string (ISO 8601) | Yes      | When the conversion was performed                      |
| conversion_confidence | number (0-1)      | Yes      | Overall confidence score for the conversion            |
| content_hash          | string            | Yes      | Content hash for deduplication (format: algorithm:hex) |
| processing_chain      | array<object>     | No       | Ordered list of processing steps                       |

converter

Identifies the system that produced the SDF document. This is a free-form string, but it should be consistent across conversions from the same system.

Examples: "sdf-pipeline", "acme-converter", "publisher-native"

model

The ML model or extraction engine used. For multi-model pipelines, this is the primary extraction model.

Examples: "sdf-extractor-3b", "gpt-4o", "custom-ner-v2"

processing_chain

For multi-stage pipelines, the processing chain records each step:

{
  "processing_chain": [
    {
      "step": "classification",
      "model": "sdf-classifier-1.5b",
      "duration_ms": 45,
      "output": "article.news"
    },
    {
      "step": "extraction",
      "model": "sdf-extractor-3b",
      "duration_ms": 320,
      "output": "document"
    },
    {
      "step": "validation",
      "model": "json-schema",
      "duration_ms": 12,
      "output": "valid"
    }
  ]
}

Each step records the operation name, the model used, processing duration, and a summary of the output.
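
Consumers can walk the chain to audit how a document was produced. A minimal sketch in Python (assuming the document has already been parsed from JSON into a dict; no SDF-specific library is assumed):

import json

def summarize_chain(provenance: dict) -> None:
    # Walk the ordered steps, printing each one and totaling pipeline latency.
    chain = provenance.get("processing_chain", [])
    for step in chain:
        print(f"{step['step']}: model={step['model']}, "
              f"{step['duration_ms']} ms, output={step['output']}")
    total_ms = sum(step.get("duration_ms", 0) for step in chain)
    print(f"total: {total_ms} ms across {len(chain)} steps")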

content_hash

A hash of the source content, used for deduplication and cache validation. Format is algorithm:hex_digest.

"content_hash": "sha256:8a3b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b"

Agents use the content hash to determine if a document has changed since a previous fetch. If the hash matches, the agent can skip re-processing.

Recommended algorithm: SHA-256. The hash is computed over the raw source content (HTML) before conversion.
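
A minimal sketch of hash computation and comparison in Python (the function names are illustrative, not part of the spec):

import hashlib

def compute_content_hash(raw_html: bytes) -> str:
    # SHA-256 over the raw source content, rendered as "algorithm:hex_digest".
    return "sha256:" + hashlib.sha256(raw_html).hexdigest()

def is_unchanged(raw_html: bytes, previous_hash: str) -> bool:
    # A matching hash means the source has not changed; re-processing can be skipped.
    return compute_content_hash(raw_html) == previous_hash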

conversion_confidence

A float between 0 and 1 representing the converter’s overall confidence in the extraction quality. This is typically an aggregate of per-field confidences.

| Range       | Interpretation                                                     |
|-------------|--------------------------------------------------------------------|
| 0.90 - 1.00 | High confidence: suitable for automated consumption                |
| 0.70 - 0.89 | Moderate confidence: review recommended for critical applications  |
| 0.50 - 0.69 | Low confidence: extraction may have significant gaps               |
| Below 0.50  | Very low confidence: manual verification recommended               |
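
A filtering sketch along these lines (the 0.90 threshold is illustrative; choose one suited to the application):

def meets_confidence(provenance: dict, threshold: float = 0.90) -> bool:
    # Gate automated consumption on the converter's aggregate confidence score.
    return provenance["conversion_confidence"] >= threshold
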
Provenance Block
{
  "provenance": {
    "converter": "sdf-pipeline",
    "converter_version": "0.2.0",
    "model": "sdf-extractor-3b",
    "conversion_timestamp": "2025-03-15T15:02:30Z",
    "conversion_confidence": 0.91,
    "content_hash": "sha256:8a3b1c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b",
    "processing_chain": [
      {
        "step": "fetch",
        "model": null,
        "duration_ms": 230,
        "output": "200 OK"
      },
      {
        "step": "classify",
        "model": "sdf-classifier-1.5b",
        "duration_ms": 42,
        "output": "article.news"
      },
      {
        "step": "extract",
        "model": "sdf-extractor-3b",
        "duration_ms": 315,
        "output": "18 fields extracted"
      },
      {
        "step": "validate",
        "model": "json-schema-draft-2020-12",
        "duration_ms": 8,
        "output": "valid"
      }
    ]
  }
}

Agents should use provenance data to:

  1. Filter by confidence: skip documents below a confidence threshold for the use case (a combined sketch follows this list)
  2. Deduplicate: use content_hash to avoid processing the same content twice
  3. Audit: track which converter and model versions produced the documents they consume
  4. Check freshness: compare conversion_timestamp against acceptable staleness windows
  5. Inspect the pipeline: read processing_chain to understand the extraction methodology
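
A minimal sketch combining checks 1, 2, and 4 in Python (field names follow the spec above; the thresholds and the seen-hash store are illustrative):

from datetime import datetime, timedelta, timezone

def should_process(doc: dict, seen_hashes: set[str],
                   min_confidence: float = 0.70,
                   max_age: timedelta = timedelta(days=7)) -> bool:
    # Confidence gate, content-hash dedup, and staleness check in one pass.
    prov = doc["provenance"]
    if prov["conversion_confidence"] < min_confidence:
        return False
    if prov["content_hash"] in seen_hashes:
        return False
    # Timestamps are ISO 8601; normalize a trailing "Z" for older Pythons.
    converted = datetime.fromisoformat(
        prov["conversion_timestamp"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - converted <= max_age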