Key Research Findings

SDF has been validated in a production deployment across 2,335 documents spanning all 10 parent types and 74 unique type combinations. The results below summarize pipeline performance, extraction accuracy, and token efficiency.

| Metric | Value |
| --- | --- |
| Total documents processed | 2,335 |
| Parent types covered | 10 (all) |
| Unique type combinations | 74 |
| JSON validity rate | 100% |
| Schema validation pass rate | 100% |
| Average extraction confidence | 0.91 |
| Average processing time | ~400 ms/document |

The SDF conversion pipeline uses a two-stage model cascade:

| Stage | Model | Parameters | Purpose |
| --- | --- | --- | --- |
| Classification | sdf-classifier | 1.5B | Determine parent_type and type |
| Extraction | sdf-extractor | 3B | Extract all document fields |

This specialized cascade is 4.1x faster than a monolithic 14B parameter baseline that performs both classification and extraction in a single pass.

| Approach | Parameters | Speed | Accuracy |
| --- | --- | --- | --- |
| Monolithic baseline | 14B | 1.0x | 87% |
| SDF cascade | 1.5B + 3B | 4.1x | 90% |

The smaller, specialized models outperform the larger generalist model on both speed and accuracy because:

  1. The classifier is optimized for a constrained output space (10 parent types, 50+ subtypes)
  2. The extractor receives the type classification as input, narrowing its extraction schema
  3. Each model is fine-tuned on its specific task rather than trained on the joint task
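The three factors above can be sketched as a minimal two-stage cascade. This is a sketch, not the SDF implementation: `classify` and `extract` are hypothetical stand-ins for calls to sdf-classifier (1.5B) and sdf-extractor (3B), and their return values are illustrative.

```python
# Hypothetical sketch of the two-stage SDF cascade.
# `classify` and `extract` stand in for the real model calls; neither
# signature nor return value is taken from the SDF codebase.

def classify(markdown: str) -> dict:
    """Stage 1: classifier with a constrained output space
    (10 parent types, 50+ subtypes)."""
    # Placeholder result; a real call would run the 1.5B classifier.
    return {"parent_type": "article", "type": "article.news"}

def extract(markdown: str, doc_type: dict) -> dict:
    """Stage 2: the type classification narrows the extraction schema
    the 3B extractor has to fill."""
    return {"type": doc_type["type"], "summary": "...", "entities": []}

def convert(markdown: str) -> dict:
    doc_type = classify(markdown)       # cheap pass over the whole document
    return extract(markdown, doc_type)  # schema-directed extraction

print(convert("# Example article")["type"])  # article.news
```

Passing the stage-1 result into stage 2 is what lets each model stay small: neither ever has to solve the joint classify-and-extract task.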

Overall extraction accuracy: 90% exact match across all field types.

| Field Category | Exact Match | Notes |
| --- | --- | --- |
| Type classification | 95.2% | After normalization cascade |
| Entity extraction | 91.3% | Name + type + salience |
| Claim extraction | 87.6% | Claim text + confidence |
| Relationship extraction | 85.4% | Subject + predicate + object |
| Summary generation | 92.1% | Evaluated by human raters |
| type_data fields | 89.8% | Type-specific structured fields |

The 5-stage type normalization cascade corrected 63 non-standard type inventions across the 2,335-document corpus:

| Normalization Stage | Corrections | Example |
| --- | --- | --- |
| Exact match | 0 (pass-through) | article.news → article.news |
| Alias resolution | 24 | blog_post → article.blog |
| Parent type inference | 15 | article → article.news (from domain) |
| Fuzzy matching | 18 | artcle.analyis → article.analysis |
| Fallback classification | 6 | custom.misc → reference.wiki |
| Total corrections | 63 | |

After normalization: 100% taxonomy conformance.
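The cascade's control flow can be sketched as a chain of progressively looser matchers. This is a minimal illustration, not the production code: the taxonomy, alias table, and fallback type below are tiny stand-ins for the real ones, and stdlib `difflib` stands in for whatever fuzzy matcher the pipeline actually uses.

```python
# Illustrative sketch of the 5-stage type normalization cascade.
# TAXONOMY, ALIASES, DEFAULT_SUBTYPE, and FALLBACK are hypothetical
# miniatures of the real tables.
import difflib

TAXONOMY = {"article.news", "article.blog", "article.analysis", "reference.wiki"}
ALIASES = {"blog_post": "article.blog"}
DEFAULT_SUBTYPE = {"article": "article.news"}  # e.g. inferred from domain
FALLBACK = "reference.wiki"

def normalize(raw: str) -> str:
    if raw in TAXONOMY:                    # 1. exact match (pass-through)
        return raw
    if raw in ALIASES:                     # 2. alias resolution
        return ALIASES[raw]
    if raw in DEFAULT_SUBTYPE:             # 3. parent type inference
        return DEFAULT_SUBTYPE[raw]
    close = difflib.get_close_matches(raw, TAXONOMY, n=1, cutoff=0.8)
    if close:                              # 4. fuzzy matching (typos)
        return close[0]
    return FALLBACK                        # 5. fallback classification

print(normalize("artcle.analyis"))  # article.analysis
```

Ordering the stages cheapest-first means the common case (an already-valid type) costs a single set lookup, and the fallback guarantees every document ends up somewhere in the taxonomy, which is how 100% conformance is reached.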

SDF achieves a 99% token reduction from raw HTML:

| Representation | Avg. Size | Avg. Tokens | Reduction from HTML |
| --- | --- | --- | --- |
| Raw HTML | 89 KB | ~73,000 | — |
| Markdown (cleaned) | 12 KB | ~4,200 | 94.2% |
| SDF JSON | 2 KB | ~750 | 98.9% |

From markdown to SDF, token count drops a further 82% (~4,200 to ~750). This matters because many agent pipelines already convert HTML to markdown as a first step; SDF provides further compression by extracting only semantic content.
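The reduction percentages follow directly from the average token counts in the table above; a quick arithmetic check:

```python
# Reproducing the reduction figures from the table's average token counts.
tokens = {"html": 73_000, "markdown": 4_200, "sdf": 750}

def reduction(frm: int, to: int) -> float:
    """Percentage reduction when going from `frm` tokens to `to` tokens."""
    return (frm - to) / frm * 100

print(f"HTML -> Markdown: {reduction(tokens['html'], tokens['markdown']):.1f}%")  # 94.2%
print(f"HTML -> SDF:      {reduction(tokens['html'], tokens['sdf']):.1f}%")       # 99.0%
print(f"Markdown -> SDF:  {reduction(tokens['markdown'], tokens['sdf']):.1f}%")   # 82.1%
```

(The exact HTML-to-SDF figure is 98.97%, which the table reports as 98.9%.)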

| Section | Avg. Tokens | % of Document |
| --- | --- | --- |
| summary | 180 | 24% |
| entities | 120 | 16% |
| type_data | 150 | 20% |
| claims | 95 | 13% |
| relationships | 65 | 9% |
| source + provenance | 80 | 11% |
| Other fields | 60 | 8% |

Across all 2,335 documents, 100% produced valid JSON. This is achieved through:

  1. Schema-constrained generation (the extraction model outputs against a JSON schema)
  2. Post-generation validation with automatic repair for minor formatting issues
  3. Retry logic for generation failures (< 0.5% of documents require retry)
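The validate-repair-retry path can be sketched as follows. This is a hypothetical illustration: the trailing-comma fix stands in for whatever "minor formatting issues" the real repair step handles, and schema validation (step 1) is omitted for brevity.

```python
# Sketch of the post-generation validation path: parse, attempt a minor
# repair, and surface a failure for retry. The trailing-comma repair is
# one illustrative fix, not the full repair logic.
import json
import re

def repair(text: str) -> str:
    """Fix a minor formatting issue: trailing commas before } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", text)

def validate_or_repair(text: str, max_attempts: int = 2) -> dict:
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            text = repair(text)  # attempt automatic repair, then re-parse
    # < 0.5% of documents reach this point and are re-generated.
    raise ValueError("unrepairable output: document queued for model retry")

print(validate_or_repair('{"type": "article.news",}'))  # {'type': 'article.news'}
```

Cheap deterministic repair before an expensive model retry is what keeps the retry rate under 0.5% while still ending at 100% valid JSON.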

If you reference this research, please cite:

Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” Zenodo. DOI: 10.5281/zenodo.18559223