
Why SDF?

Every AI agent that consumes a web page performs the same work independently:

  1. Fetch the raw HTML
  2. Strip boilerplate (nav, ads, scripts, footers)
  3. Extract the main content
  4. Identify entities, claims, and relationships
  5. Classify the content type
  6. Structure the result for downstream use
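To make the duplicated work concrete, here is a minimal sketch of steps 2 and 3 using only Python's standard library. The list of boilerplate tags and the names `MainTextExtractor` and `extract_main_text` are illustrative assumptions, not part of any SDF tooling. The fetch step is replaced by an inline string, and steps 4 through 6 are omitted because they usually depend on a model.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption,
# not a rule taken from any SDF specification).
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Step 2 (strip boilerplate) and step 3 (extract main content)."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Step 1 (fetch) is replaced by an inline page for the sake of the example.
page = (
    "<html><body><nav>Home | Docs</nav>"
    "<article><h1>Why SDF?</h1><p>Agents repeat the same work.</p></article>"
    "<footer>Copyright</footer><script>track()</script></body></html>"
)
print(extract_main_text(page))  # Why SDF? Agents repeat the same work.
```

Every agent that reads the page has to run some variant of this, and a different tag list or heuristic yields different text.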

When 100 agents consume the same page, this work happens 100 times. Each agent uses different heuristics and different models, so each produces different results. There is no shared representation.

| Metric | Raw HTML | SDF |
| --- | --- | --- |
| Avg. page size | 89 KB | ~2 KB |
| Token count (LLM) | ~73,000 | ~750 |
| Extraction required | Yes, every time | No, pre-extracted |
| Schema validation | None | JSON Schema enforced |
| Entity consistency | Varies by model | Canonical |
| Cacheable | URL-level only | Content-hash verified |

Raw HTML is a delivery format for browsers, not for agents. It contains navigation, advertising, tracking scripts, and layout markup that is irrelevant to semantic understanding. Feeding raw HTML to an LLM wastes context window, increases latency, and introduces extraction variance.
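As a rough worked example, the token figures quoted in this section can be scaled to the 100-agent scenario. The per-page numbers are the approximate averages stated here; the variable names are only for illustration.

```python
# Approximate per-page token counts quoted in this section.
HTML_TOKENS_PER_PAGE = 73_000
SDF_TOKENS_PER_PAGE = 750
AGENTS = 100  # the 100-agent scenario described earlier

html_total = HTML_TOKENS_PER_PAGE * AGENTS  # tokens read if every agent ingests raw HTML
sdf_total = SDF_TOKENS_PER_PAGE * AGENTS    # tokens read if every agent ingests SDF
reduction = HTML_TOKENS_PER_PAGE / SDF_TOKENS_PER_PAGE

print(f"raw HTML: {html_total:,} tokens")   # raw HTML: 7,300,000 tokens
print(f"SDF:      {sdf_total:,} tokens")    # SDF:      75,000 tokens
print(f"reduction: ~{reduction:.0f}x")      # reduction: ~97x
```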

SDF occupies the same position in the agent stack that compiled bytecode occupies in a software stack. Source code (HTML) is authored by humans; compiled output (SDF) is consumed by machines.

HTML (source) → SDF Converter → SDF Document (compiled) → Agent (consumer)

The conversion happens once at the publisher or intermediary level. Every downstream agent consumes the same canonical representation.
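The convert-once idea can be sketched as a cache keyed by a hash of the page content. This is a simplified illustration under stated assumptions: `toy_converter` stands in for a real converter, the `source_hash` field name is hypothetical, and SHA-256 is an arbitrary choice of hash rather than something this section specifies.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the page content (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ConvertOnceCache:
    """Runs the converter once per distinct content and reuses the result."""

    def __init__(self, converter):
        self.converter = converter
        self.store = {}
        self.conversions = 0  # how many times the converter actually ran

    def get(self, html: str) -> dict:
        key = content_hash(html)
        if key not in self.store:
            self.store[key] = self.converter(html)
            self.conversions += 1
        return self.store[key]


def toy_converter(html: str) -> dict:
    # Stand-in for a real HTML-to-SDF converter; field names are hypothetical.
    return {"source_hash": content_hash(html), "body": html.strip()}


def is_fresh(doc: dict, html: str) -> bool:
    """True if the document was produced from exactly this content."""
    return doc["source_hash"] == content_hash(html)


cache = ConvertOnceCache(toy_converter)
page = "<p>Agents repeat the same work.</p>"
docs = [cache.get(page) for _ in range(100)]  # 100 agents request the same page

print(cache.conversions)                     # 1
print(is_fresh(docs[0], page))               # True
print(is_fresh(docs[0], page + " edited"))   # False
```

Keying on content rather than URL means a changed page produces a new key, so a stale document can be detected by comparing hashes.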

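| Approach | Structured | Semantic | Type-aware | Schema-validated | Agent-optimized |
| --- | --- | --- | --- | --- | --- |
| Raw HTML | No | No | No | No | No |
| Schema.org / JSON-LD | Partial | Partial | Limited | Optional | No |
| RSS / Atom | Minimal | No | No | DTD only | No |
| llms.txt | No | No | No | No | Partial |
| RAG pipelines | Varies | Varies | No | No | Partial |
| SDF | Yes | Yes | Yes (50+ types) | Yes (JSON Schema) | Yes |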

Schema.org provides vocabulary but not document structure. A page may have JSON-LD markup for an Article, but the actual content — entities, claims, relationships — is still locked in HTML.

RSS/Atom provides syndication metadata (title, date, summary) but not semantic extraction.

llms.txt provides a plain-text rendering hint but no structured fields, no schema, and no type system.

RAG pipelines chunk and embed content for retrieval but discard document structure and type semantics in the process.

SDF combines these properties in a single artifact: a schema-validated, type-aware, semantically extracted JSON document that agents consume without further processing.
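To illustrate what consuming without further processing could look like, the sketch below checks a pre-extracted document against a small set of required fields and then reads it directly. The field names (`type`, `title`, `entities`, `claims`) and the hand-rolled `validate` function are assumptions made for this example; they are not the actual SDF schema, and a real consumer would presumably use a full JSON Schema validator instead.

```python
import json

# Hypothetical required fields and their expected Python types.
# These are illustrative only, not taken from the SDF schema.
REQUIRED_FIELDS = {
    "type": str,
    "title": str,
    "entities": list,
    "claims": list,
}


def validate(doc: dict) -> list[str]:
    """Returns a list of problems; an empty list means the document passed."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


raw = json.dumps({
    "type": "article",
    "title": "Why SDF?",
    "entities": ["SDF", "HTML"],
    "claims": ["Conversion happens once at the publisher level."],
})

doc = json.loads(raw)
problems = validate(doc)
if not problems:
    # No HTML parsing or extraction step: the agent reads fields directly.
    print(doc["type"], "|", doc["title"], "|", len(doc["entities"]), "entities")

print(validate({"type": "article", "title": 42}))
```

The second call shows the failure path: it reports the wrongly typed `title` and the two missing fields.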