SDF Protocol

Convert once, consume everywhere. SDF defines a canonical JSON format for pre-compiled semantic representations of web pages — entities, claims, relationships, type-specific data — that AI agents consume directly.


The open protocol for structured, agent-readable web content

SDF (Structured Data Format) is a JSON-based protocol that eliminates redundant extraction work across AI agents. Instead of every agent independently parsing raw HTML, publishers convert once and agents consume a canonical, schema-validated representation.

Schema-Validated

Every SDF document validates against JSON Schema (draft 2020-12). Structural guarantees mean agents skip parsing heuristics entirely.
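
As a rough illustration of what structural guarantees buy a consumer, the sketch below checks a document against a tiny schema fragment before reading any fields. It implements only the `required` and `type` keywords with the standard library, not full JSON Schema draft 2020-12. The schema and the field names `title` and `entities` are hypothetical and not taken from the SDF spec.

```python
import json

# Hypothetical schema fragment; the real SDF schema is defined by the spec.
SCHEMA = {
    "type": "object",
    "required": ["content_hash", "title", "entities"],
    "properties": {
        "content_hash": {"type": "string"},
        "title": {"type": "string"},
        "entities": {"type": "array"},
    },
}

# Maps JSON Schema type names to Python types (small subset only).
JSON_TYPES = {"object": dict, "array": list, "string": str}


def check(doc, schema):
    """Return a list of violations for the `type` and `required` keywords."""
    if not isinstance(doc, JSON_TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    for key in schema.get("required", []):
        if key not in doc:
            errors.append(f"missing required field: {key}")
    for key, sub in schema.get("properties", {}).items():
        if key in doc and not isinstance(doc[key], JSON_TYPES[sub["type"]]):
            errors.append(f"{key}: expected {sub['type']}")
    return errors


good = json.loads('{"content_hash": "abc", "title": "Example", "entities": []}')
bad = json.loads('{"title": 42}')

good_errors = check(good, SCHEMA)
bad_errors = check(bad, SCHEMA)
print(good_errors)
print(bad_errors)
```

A production consumer would more likely hand the published schema to a full draft 2020-12 validator. The point here is only that a document that passes can be read by key, with no fallback heuristics.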

Type-Aware

10 parent types, 50+ subtypes. Articles, commerce, code, events, discussions — each with type-specific structured fields that agents can query directly.
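
To make the idea of querying typed fields concrete, here is a minimal sketch in which an agent filters documents by parent type and reads a type-specific field directly. The keys `type`, `subtype`, and `type_data`, and the sample values, are assumptions chosen for illustration; the spec defines the actual names.

```python
# Hypothetical SDF-like documents; key names are illustrative assumptions.
docs = [
    {"type": "commerce", "subtype": "product",
     "type_data": {"price": 19.99, "currency": "USD"}},
    {"type": "article", "subtype": "news",
     "type_data": {"author": "A. Writer"}},
    {"type": "commerce", "subtype": "product",
     "type_data": {"price": 5.00, "currency": "USD"}},
]


def by_type(documents, parent, subtype=None):
    """Select documents by parent type and, optionally, subtype."""
    return [
        d for d in documents
        if d["type"] == parent and (subtype is None or d["subtype"] == subtype)
    ]


# No HTML parsing: the price is already a typed field.
prices = [d["type_data"]["price"] for d in by_type(docs, "commerce", "product")]
cheapest = min(prices)
print(cheapest)
```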

Cacheable

Content hashing via content_hash enables deduplication and cache validation. Agents skip re-processing when content has not changed.
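
The caching behaviour described above might look like the following sketch. It assumes `content_hash` is a SHA-256 hex digest of the page content, which the text does not state. The `cache` dictionary, the `process` function, and the URL are likewise hypothetical.

```python
import hashlib

cache = {}   # url -> (content_hash, processed result); hypothetical store
calls = []   # records which documents were actually processed


def sha256_hex(text):
    # Assumption: the hash is SHA-256 over the UTF-8 encoded content.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def process(doc):
    """Stand-in for expensive downstream work on an SDF document."""
    calls.append(doc["url"])
    return len(doc["body"])


def consume(doc):
    """Reuse the cached result when content_hash is unchanged."""
    cached = cache.get(doc["url"])
    if cached is not None and cached[0] == doc["content_hash"]:
        return cached[1]
    result = process(doc)
    cache[doc["url"]] = (doc["content_hash"], result)
    return result


def make_doc(url, body):
    return {"url": url, "body": body, "content_hash": sha256_hex(body)}


consume(make_doc("https://example.com/a", "hello"))
consume(make_doc("https://example.com/a", "hello"))      # cache hit, skipped
consume(make_doc("https://example.com/a", "hello v2"))   # content changed
print(len(calls))
```

Comparing hashes rather than bodies keeps the check cheap even for large documents, and the same key can be used to deduplicate identical content fetched from different URLs.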

Provenance

Full audit trail: converter identity, model used, processing chain, confidence scores. Consumers know exactly how the document was produced.
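
One way a consumer could act on provenance is to flag sections whose confidence is too low to trust. This is only a sketch: the `provenance` block, its key names, the sample values, and the 0.8 threshold are all assumptions, not part of the spec as quoted here.

```python
# Hypothetical provenance block; key names and values are assumptions.
doc = {
    "provenance": {
        "converter": "example-converter/1.0",
        "model": "example-model",
        "processing_chain": ["fetch", "classify", "extract"],
        "confidence": {"entities": 0.93, "claims": 0.71},
    }
}


def low_confidence_fields(document, threshold=0.8):
    """List the sections whose confidence score falls below the threshold."""
    scores = document["provenance"]["confidence"]
    return sorted(name for name, score in scores.items() if score < threshold)


flagged = low_confidence_fields(doc)
print(flagged)
```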

Why not just use HTML?

A typical web page is 89 KB of HTML — ads, nav, scripts, boilerplate. That becomes ~73,000 tokens for an LLM to process. The same content in SDF is ~750 tokens: pre-extracted entities, claims, relationships, and type-specific structured data.
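
The reduction implied by these figures can be checked with simple arithmetic. The numbers below are the approximate ones quoted above, and the variable names are only illustrative.

```python
html_tokens = 73_000   # approximate token count for the raw HTML page
sdf_tokens = 750       # approximate token count for the SDF document

reduction = 1 - sdf_tokens / html_tokens   # fraction of tokens saved
ratio = html_tokens / sdf_tokens           # how many times smaller

print(f"{reduction:.1%} fewer tokens, about {ratio:.0f}x smaller")
```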

SDF is the compilation layer between web content and AI agents.

99% fewer tokens

From raw HTML to structured JSON. Agents process only the semantic content.

90% extraction accuracy

Schema-constrained extraction with type normalization achieves 90% exact match in production.

4.1x faster pipeline

A cascade of specialized 1.5B and 3B models runs 4.1x faster than extraction with a single monolithic 14B model.

2,335 documents

Validated across 74 type combinations in production deployment.

Research

The SDF whitepaper is available as a preprint on Zenodo:

Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” DOI: 10.5281/zenodo.18559223