
Document Model

An SDF document is a JSON object containing the following top-level fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| sdf_version | string | Yes | Protocol version (semver). Current: "0.2.0" |
| id | string | Yes | Unique identifier, prefixed sdf_ |
| parent_type | string | Yes | One of 10 parent types |
| type | string | Yes | Qualified type as parent_type.subtype |
| aspects | array<string> | No | Additional type aspects for multi-type content |
| source | object | Yes | Origin metadata |
| summary | object | Yes | Multi-level summarization |
| entities | array<object> | Yes | Extracted named entities |
| claims | array<object> | No | Factual assertions |
| topics | array<string> | No | Topic labels |
| relationships | array<object> | No | Entity relationship triples |
| type_data | object | Yes | Type-specific structured fields |
| sections | array<object> | No | Document structural sections |
| metadata | object | No | Page-level metadata |
| provenance | object | Yes | Conversion audit trail |
| temporal | object | No | Time-related metadata |
| links | array<object> | No | Outbound link analysis |
| embeddings | object | No | Vector representations |
| extensions | object | No | Vendor-namespaced custom fields |
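
As a quick illustration of the table above, the following sketch checks a parsed document for the fields marked as required and for the documented sdf_ prefix on id. The helper names (REQUIRED_FIELDS, missing_required_fields, has_valid_id) are hypothetical and not part of SDF; this is a minimal presence check, not a full schema validator.

```python
# Required top-level fields, copied from the rows marked "Yes" in the table.
REQUIRED_FIELDS = (
    "sdf_version", "id", "parent_type", "type", "source",
    "summary", "entities", "type_data", "provenance",
)


def missing_required_fields(doc: dict) -> list:
    """Return the required top-level fields that are absent from doc."""
    return [name for name in REQUIRED_FIELDS if name not in doc]


def has_valid_id(doc: dict) -> bool:
    """Check only the documented constraint: id is a string starting with 'sdf_'."""
    value = doc.get("id")
    return isinstance(value, str) and value.startswith("sdf_")
```

A caller would typically run json.loads on the raw document first and reject it if missing_required_fields returns a non-empty list.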

The source object holds origin metadata for the converted content.

{
  "source": {
    "url": "https://example.com/page",
    "domain": "example.com",
    "fetch_timestamp": "2025-03-15T14:30:00Z",
    "http_status": 200,
    "content_type": "text/html",
    "language": "en",
    "canonical_url": "https://example.com/page"
  }
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | URL that was fetched |
| domain | string | Yes | Domain of the source URL |
| fetch_timestamp | string (ISO 8601) | Yes | When the content was fetched |
| http_status | integer | Yes | HTTP status code of the fetch |
| content_type | string | Yes | MIME type of the fetched content |
| language | string | No | BCP 47 language tag |
| canonical_url | string | No | Canonical URL if different from fetch URL |
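
The required source fields can be assembled from the result of a fetch. The sketch below is one possible way to do it, assuming that domain is simply the hostname of the fetched URL and that fetch_timestamp is a UTC timestamp in the format shown in the example; build_source is a hypothetical helper, not part of SDF.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def build_source(url, http_status, content_type, fetched_at=None):
    """Build the required fields of a source object.

    Assumption: domain is the URL's hostname. fetched_at should be a
    timezone-aware datetime; it defaults to the current UTC time.
    """
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "domain": urlsplit(url).hostname or "",
        # ISO 8601 in UTC with a trailing "Z", as in the example above.
        "fetch_timestamp": fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "http_status": http_status,
        "content_type": content_type,
    }
```

The optional language and canonical_url fields are left out here; a converter would add them only when it has detected them.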

The summary object holds a multi-level summarization of the content.

{
  "summary": {
    "one_line": "Brief one-sentence summary.",
    "key_points": [
      "First key point",
      "Second key point"
    ],
    "abstract": "Paragraph-length summary providing more context and detail."
  }
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| one_line | string | Yes | Single-sentence summary |
| key_points | array<string> | Yes | Bullet-point key takeaways |
| abstract | string | No | Paragraph-length summary |

The entities array lists named entities extracted from the content.

{
  "entities": [
    {
      "name": "OpenAI",
      "type": "organization",
      "role": "subject",
      "salience": 0.92,
      "description": "AI research company"
    }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Entity name as it appears in content |
| type | string | Yes | Entity type (person, organization, location, etc.) |
| role | string | No | Role in the content (subject, quoted_source, etc.) |
| salience | number (0-1) | No | Relevance score |
| description | string | No | Brief entity description |
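
Because salience is a relevance score between 0 and 1, a consumer can use it to rank entities. The sketch below shows one way to do that; top_entities is a hypothetical helper, and treating a missing salience as 0.0 is an assumption made here, since the field is optional.

```python
def top_entities(entities, limit=5):
    """Return up to `limit` entities, most salient first.

    Assumption: entities without a salience value rank as 0.0.
    """
    ranked = sorted(entities, key=lambda e: e.get("salience", 0.0), reverse=True)
    return ranked[:limit]
```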

The claims array lists factual assertions extracted from the content.

{
  "claims": [
    {
      "claim": "Revenue increased 23% year-over-year",
      "source_type": "direct",
      "confidence": 0.95,
      "supporting_entities": ["Acme Corp"]
    }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| claim | string | Yes | The factual assertion |
| source_type | string | Yes | direct, inferred, or attributed |
| confidence | number (0-1) | Yes | Extraction confidence |
| supporting_entities | array<string> | No | Entity names supporting this claim |
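
The constraints in the table above (three allowed source_type values, confidence in the range 0 to 1) lend themselves to a small per-claim check. The following is a sketch only; claim_errors and SOURCE_TYPES are hypothetical names, and the error messages are illustrative.

```python
# Allowed source_type values, as listed in the table.
SOURCE_TYPES = {"direct", "inferred", "attributed"}


def claim_errors(claim: dict) -> list:
    """Return human-readable problems with one claim object (empty if none)."""
    errors = []
    if not isinstance(claim.get("claim"), str):
        errors.append("claim must be a string")
    if claim.get("source_type") not in SOURCE_TYPES:
        errors.append("source_type must be direct, inferred, or attributed")
    confidence = claim.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence must be a number between 0 and 1")
    return errors
```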

The relationships array lists subject-predicate-object triples.

{
  "relationships": [
    {
      "subject": "Google",
      "predicate": "acquired",
      "object": "DeepMind",
      "confidence": 0.97
    }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| subject | string | Yes | Subject entity |
| predicate | string | Yes | Relationship verb |
| object | string | Yes | Object entity |
| confidence | number (0-1) | No | Confidence score |
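
Triples of this shape are easy to turn into a lookup structure. As a minimal sketch, the hypothetical helper index_by_subject below groups the (predicate, object) pairs of each subject so that a consumer can ask what is known about one entity.

```python
from collections import defaultdict


def index_by_subject(relationships):
    """Map each subject to the list of (predicate, object) pairs about it."""
    index = defaultdict(list)
    for rel in relationships:
        index[rel["subject"]].append((rel["predicate"], rel["object"]))
    return dict(index)
```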

The type_data object holds type-specific structured fields. Its schema is determined by the parent_type field; see the Type Reference for the field definitions of each type.
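
Since type is documented as parent_type.subtype, a consumer can check that the two fields agree before choosing a type_data schema. The sketch below assumes that the parent type itself contains no dot, so the split happens at the first dot; split_type and type_matches_parent are hypothetical helpers.

```python
def split_type(qualified_type: str):
    """Split 'parent_type.subtype' at the first dot.

    Assumption: parent types contain no dot themselves.
    """
    parent, sep, subtype = qualified_type.partition(".")
    if not parent or not sep or not subtype:
        raise ValueError(f"expected parent_type.subtype, got {qualified_type!r}")
    return parent, subtype


def type_matches_parent(doc: dict) -> bool:
    """True when the prefix of doc['type'] equals doc['parent_type']."""
    parent, _ = split_type(doc["type"])
    return parent == doc["parent_type"]
```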

The sections array gives the structural breakdown of the document.

{
  "sections": [
    {
      "heading": "Introduction",
      "level": 1,
      "content": "Section content text...",
      "word_count": 245
    }
  ]
}
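
A section object of this shape could be produced as follows. The counting rule for word_count is not defined here, so the sketch assumes a plain whitespace-delimited token count; make_section is a hypothetical helper.

```python
def make_section(heading: str, level: int, content: str) -> dict:
    """Build one section object.

    Assumption: word_count is the number of whitespace-separated tokens
    in content; the actual counting rule may differ.
    """
    return {
        "heading": heading,
        "level": level,
        "content": content,
        "word_count": len(content.split()),
    }
```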

The metadata object holds page-level metadata extracted from the HTML <head> and from structured data.

{
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "keywords": ["keyword1", "keyword2"],
    "og_image": "https://example.com/image.jpg",
    "author": "Author Name"
  }
}

The provenance object is the conversion audit trail. See Provenance for full details.

The temporal object holds time-related metadata.

{
  "temporal": {
    "publish_date": "2025-03-15",
    "modified_date": "2025-03-16",
    "expiry_date": null,
    "temporal_coverage": {
      "start": "2025-01-01",
      "end": "2025-03-15"
    }
  }
}
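
The example above uses YYYY-MM-DD strings and allows null. The sketch below parses such values and checks that a coverage range does not end before it starts. Both the date format and the ordering rule are assumptions drawn from the example, not stated requirements; parse_date and coverage_is_ordered are hypothetical helpers.

```python
from datetime import date


def parse_date(value):
    """Parse a YYYY-MM-DD string; pass null (None) through unchanged."""
    return None if value is None else date.fromisoformat(value)


def coverage_is_ordered(temporal: dict) -> bool:
    """True unless temporal_coverage has both bounds and start is after end."""
    coverage = temporal.get("temporal_coverage") or {}
    start = parse_date(coverage.get("start"))
    end = parse_date(coverage.get("end"))
    return start is None or end is None or start <= end
```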

The links array holds the outbound link analysis.

{
  "links": [
    {
      "url": "https://other.com/resource",
      "text": "link anchor text",
      "relationship": "reference",
      "domain": "other.com"
    }
  ]
}
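
One simple analysis over this array is a count of outbound links per domain. In the sketch below, count_link_domains is a hypothetical helper; falling back to the hostname of url when domain is absent is an assumption made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_link_domains(links):
    """Count outbound links per domain.

    Assumption: when a link has no domain field, the hostname of its
    url is used instead.
    """
    counts = Counter()
    for link in links:
        domain = link.get("domain") or urlsplit(link["url"]).hostname or ""
        counts[domain] += 1
    return counts
```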

The embeddings object holds optional vector representations for semantic search.

{
  "embeddings": {
    "model": "text-embedding-3-small",
    "dimensions": 1536,
    "vectors": {
      "summary": [0.012, -0.034, ...],
      "full_content": [0.008, -0.021, ...]
    }
  }
}
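
For semantic search, a consumer would typically compare these vectors with cosine similarity. The sketch below also checks that every vector has the declared length, on the assumption that dimensions states the length of each entry in vectors; both function names are hypothetical.

```python
import math


def vectors_match_dimensions(embeddings: dict) -> bool:
    """Assumption: every vector's length should equal embeddings['dimensions']."""
    expected = embeddings["dimensions"]
    return all(len(vec) == expected for vec in embeddings["vectors"].values())


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Vectors are only comparable when they come from the same model, so a search index would normally also key on the model field.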

The extensions object holds vendor-namespaced custom fields. See Content Negotiation for the extension mechanism.

{
  "extensions": {
    "x-acme": {
      "internal_id": "ACM-12345",
      "department": "research"
    }
  }
}

All extension keys must be prefixed with x- followed by a vendor identifier.
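
The prefix rule can be checked mechanically. The sketch below flags keys that lack the x- prefix or have nothing after it; invalid_extension_keys is a hypothetical helper, and it does not check what counts as a well-formed vendor identifier, since that is not specified here.

```python
def invalid_extension_keys(extensions: dict) -> list:
    """Return keys that do not start with 'x-' followed by at least one character."""
    return [
        key for key in extensions
        if not (key.startswith("x-") and len(key) > len("x-"))
    ]
```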

SDF Document Skeleton
{
  "sdf_version": "0.2.0",
  "id": "sdf_abc123",
  "parent_type": "article",
  "type": "article.news",
  "source": { "url": "...", "domain": "...", "fetch_timestamp": "...", "http_status": 200, "content_type": "text/html" },
  "summary": { "one_line": "...", "key_points": ["..."], "abstract": "..." },
  "entities": [{ "name": "...", "type": "...", "salience": 0.9 }],
  "claims": [{ "claim": "...", "source_type": "direct", "confidence": 0.9 }],
  "topics": ["..."],
  "relationships": [{ "subject": "...", "predicate": "...", "object": "..." }],
  "type_data": { },
  "sections": [{ "heading": "...", "level": 1, "content": "..." }],
  "metadata": { "title": "..." },
  "provenance": { "converter": "...", "model": "...", "content_hash": "sha256:..." },
  "temporal": { "publish_date": "..." },
  "links": [{ "url": "...", "text": "...", "relationship": "reference" }],
  "extensions": { }
}