SDF (Structured Data Format) is a JSON-based protocol that eliminates redundant extraction work across AI agents. Instead of every agent independently parsing raw HTML, publishers convert once and agents consume a canonical, schema-validated representation.
Schema-Validated
Every SDF document validates against JSON Schema (draft 2020-12). Structural guarantees mean agents skip parsing heuristics entirely.
Type-Aware
10 parent types, 50+ subtypes. Articles, commerce, code, events, discussions — each with type-specific structured fields that agents can query directly.
Cacheable
Content hashing via content_hash enables deduplication and cache validation. Agents skip re-processing when content has not changed.
Provenance
Full audit trail: converter identity, model used, processing chain, confidence scores. Consumers know exactly how the document was produced.
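How an agent might use the Cacheable and Provenance guarantees can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the SDF reference implementation: only the `content_hash` field name comes from the text above. The SHA-256 hashing scheme, the `SdfCache` class, the `meets_confidence` helper, and the `provenance.confidence` field layout are hypothetical and chosen for illustration.

```python
import hashlib
from typing import Callable


def compute_content_hash(text: str) -> str:
    """Hypothetical hashing scheme (SHA-256 over UTF-8 text); SDF's actual algorithm may differ."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SdfCache:
    """Caches processed results per URL and revalidates them with content_hash."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, object]] = {}  # url -> (content_hash, result)
        self.process_calls = 0  # number of cache misses, tracked for illustration

    def get_or_process(self, url: str, doc: dict, process: Callable[[dict], object]) -> object:
        cached = self._entries.get(url)
        if cached is not None and cached[0] == doc["content_hash"]:
            return cached[1]  # content unchanged: skip re-processing
        result = process(doc)
        self.process_calls += 1
        self._entries[url] = (doc["content_hash"], result)
        return result


def meets_confidence(doc: dict, minimum: float) -> bool:
    """Hypothetical provenance gate: assumes doc["provenance"]["confidence"] is a float in [0, 1]."""
    return doc.get("provenance", {}).get("confidence", 0.0) >= minimum


# Usage: the second fetch of unchanged content is served from the cache.
cache = SdfCache()
doc = {"content_hash": compute_content_hash("Hello, SDF"), "provenance": {"confidence": 0.93}}
cache.get_or_process("https://example.com/a", doc, lambda d: sorted(d))
cache.get_or_process("https://example.com/a", doc, lambda d: sorted(d))
print(cache.process_calls)         # 1
print(meets_confidence(doc, 0.9))  # True
```

Keying the cache by URL and comparing hashes keeps the check cheap: the agent re-processes a document only when the publisher's `content_hash` changes.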
A typical web page is 89 KB of HTML — ads, nav, scripts, boilerplate. That becomes ~73,000 tokens for an LLM to process. The same content in SDF is ~750 tokens: pre-extracted entities, claims, relationships, and type-specific structured data.
SDF is the compilation layer between web content and AI agents.
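The token figures above can be checked with simple arithmetic. The sketch below uses only the two numbers quoted in the text (~73,000 tokens of raw HTML and ~750 tokens of SDF); the function name `token_reduction` is illustrative.

```python
def token_reduction(raw_tokens: int, sdf_tokens: int) -> float:
    """Return the fraction of tokens saved by consuming SDF instead of raw HTML."""
    return 1.0 - sdf_tokens / raw_tokens


reduction = token_reduction(73_000, 750)
print(f"{reduction:.1%}")  # 99.0%
```

In other words, the SDF representation is roughly 1% of the raw page's token count.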
99% fewer tokens
From raw HTML to structured JSON. Agents process only the semantic content.
90% extraction accuracy
Schema-constrained extraction with type normalization achieves 90% exact match in production.
4.1x faster pipeline
A specialized cascade of 1.5B and 3B models runs 4.1x faster than monolithic 14B extraction.
2,335 documents
Validated across 74 type combinations in production deployment.
The SDF whitepaper is available as a preprint on Zenodo:
Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” DOI: 10.5281/zenodo.18559223