Read the Spec
See the full Protocol v0.2 specification.
Every AI agent that consumes a web page performs the same extraction work independently.
When 100 agents consume the same page, this work happens 100 times. Each agent uses different heuristics and different models, and produces different results. There is no shared representation.
| Metric | Raw HTML | SDF |
|---|---|---|
| Avg. page size | 89 KB | ~2 KB |
| Token count (LLM) | ~73,000 | ~750 |
| Extraction required | Yes, every time | No — pre-extracted |
| Schema validation | None | JSON Schema enforced |
| Entity consistency | Varies by model | Canonical |
| Cacheable | URL-level only | Content-hash verified |
Raw HTML is a delivery format for browsers, not for agents. It contains navigation, advertising, tracking scripts, and layout markup, all of which are irrelevant to semantic understanding. Feeding raw HTML to an LLM wastes context window, increases latency, and introduces extraction variance.
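To make the context-window cost concrete, the sketch below multiplies the approximate per-page token counts from the table by the 100-agent scenario described earlier. The constant and function names are illustrative only, and the figures are the table's approximations, not new measurements.

```python
# Per-page figures come from the comparison table; the agent count matches
# the 100-agent scenario. Names here are illustrative, not part of SDF.
RAW_HTML_TOKENS = 73_000  # approx. tokens for one raw HTML page
SDF_TOKENS = 750          # approx. tokens for the equivalent SDF document
AGENTS = 100


def total_tokens(tokens_per_page: int, agents: int) -> int:
    """Tokens consumed when every agent reads the same page once."""
    return tokens_per_page * agents


raw_total = total_tokens(RAW_HTML_TOKENS, AGENTS)
sdf_total = total_tokens(SDF_TOKENS, AGENTS)
reduction = 1 - sdf_total / raw_total  # fraction of tokens avoided

print(f"raw HTML:  {raw_total:,} tokens")
print(f"SDF:       {sdf_total:,} tokens")
print(f"reduction: {reduction:.1%}")
```

Under these figures, a single popular page read by 100 agents accounts for roughly 7.3 million tokens as raw HTML versus 75,000 as SDF.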
SDF occupies the same position in the agent stack that compiled bytecode occupies in a software stack. Source code (HTML) is authored by humans; compiled output (SDF) is consumed by machines.
HTML (source) → SDF Converter → SDF Document (compiled) → Agent (consumer)

The conversion happens once at the publisher or intermediary level. Every downstream agent consumes the same canonical representation.
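The convert-once flow can be sketched as a cache keyed by a content hash. This is a minimal illustration under stated assumptions: `convert_to_sdf` and `SdfCache` are hypothetical names, the converter body is a placeholder, and SHA-256 is an assumed choice of hash, since the text does not say which algorithm SDF uses.

```python
import hashlib


def convert_to_sdf(html: str) -> dict:
    """Hypothetical stand-in for an SDF converter (placeholder logic only)."""
    return {"source_length": len(html), "content": html.strip()}


class SdfCache:
    """Keeps one converted document per distinct piece of HTML."""

    def __init__(self) -> None:
        self._store = {}      # content hash -> converted document
        self.conversions = 0  # how many times the converter actually ran

    def get(self, html: str) -> dict:
        # SHA-256 is an assumption made for this sketch.
        key = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = convert_to_sdf(html)
            self.conversions += 1
        return self._store[key]


cache = SdfCache()
page = "<html><body><p>Hello</p></body></html>"

# 100 agents request the same page; the converter runs only on the first call.
results = [cache.get(page) for _ in range(100)]
print(cache.conversions)
```

Keying on the content rather than the URL means a changed page produces a new hash and therefore a fresh conversion, while unchanged content is served from the cache.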
| Approach | Structured | Semantic | Type-aware | Schema-validated | Agent-optimized |
|---|---|---|---|---|---|
| Raw HTML | No | No | No | No | No |
| Schema.org / JSON-LD | Partial | Partial | Limited | Optional | No |
| RSS / Atom | Minimal | No | No | DTD only | No |
| llms.txt | No | No | No | No | Partial |
| RAG pipelines | Varies | Varies | No | No | Partial |
| SDF | Yes | Yes | Yes (50+ types) | Yes (JSON Schema) | Yes |
Schema.org provides vocabulary but not document structure. A page may have JSON-LD markup for an Article, but the actual content — entities, claims, relationships — is still locked in HTML.
RSS/Atom provides syndication metadata (title, date, summary) but not semantic extraction.
llms.txt provides a plain-text rendering hint but no structured fields, no schema, and no type system.
RAG pipelines chunk and embed content for retrieval but discard document structure and type semantics in the process.
SDF provides the complete package: a schema-validated, type-aware, semantically extracted JSON document that agents consume without further processing.
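As a rough picture of what consuming such a document might look like, the sketch below parses a JSON document and checks it before use. The field names (`type`, `title`, `entities`, `claims`) are hypothetical and chosen only for illustration; the actual schema is defined in the specification. The hand-written `validate` function is a simplified stand-in for real JSON Schema validation.

```python
import json

# Hypothetical required fields and their expected Python types.
# The real SDF schema lives in the spec, not in this sketch.
REQUIRED_FIELDS = {"type": str, "title": str, "entities": list, "claims": list}


def validate(doc: dict) -> list:
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


raw = '{"type": "article", "title": "Example", "entities": ["SDF"], "claims": []}'
doc = json.loads(raw)
errors = validate(doc)

if not errors:
    # The agent reads fields directly; no HTML parsing or extraction step.
    print(doc["title"], doc["entities"])
```

An agent working this way rejects malformed documents up front and otherwise reads typed fields directly.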
View Research
Review the production findings from 2,335 documents.