Key Research Findings

SDF has been validated in a production deployment across 2,335 documents spanning all 10 parent types and 74 unique type combinations. The results below summarize pipeline performance, extraction accuracy, and token efficiency.

| Metric | Value |
| --- | --- |
| Total documents processed | 2,335 |
| Parent types covered | 10 (all) |
| Unique type combinations | 74 |
| JSON validity rate | 100% |
| Schema validation pass rate | 100% |
| Average extraction confidence | 0.91 |
| Average processing time | ~400 ms/document |

The SDF conversion pipeline uses a two-stage model cascade:

| Stage | Model | Parameters | Purpose |
| --- | --- | --- | --- |
| Classification | sdf-classifier | 1.5B | Determine parent_type and type |
| Extraction | sdf-extractor | 3B | Extract all document fields |

This specialized cascade is 4.1x faster than a monolithic 14B parameter baseline that performs both classification and extraction in a single pass.

| Approach | Parameters | Speed | Accuracy |
| --- | --- | --- | --- |
| Monolithic baseline | 14B | 1.0x | 87% |
| SDF cascade | 1.5B + 3B | 4.1x | 90% |

The smaller, specialized models outperform the larger generalist model on both speed and accuracy because:

  1. The classifier is optimized for a constrained output space (10 parent types, 50+ subtypes)
  2. The extractor receives the type classification as input, narrowing its extraction schema
  3. Each model is fine-tuned on its specific task rather than trained on the joint task
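The three factors above can be sketched as a minimal two-stage cascade. This is a sketch, not the SDF implementation: `classify` and `extract` are hypothetical stand-ins for calls to sdf-classifier (1.5B) and sdf-extractor (3B), and their return values are illustrative.

```python
# Hypothetical sketch of the two-stage SDF cascade.
# `classify` and `extract` stand in for the real model calls; neither
# signature nor return value is taken from the SDF codebase.

def classify(markdown: str) -> dict:
    """Stage 1: classifier with a constrained output space
    (10 parent types, 50+ subtypes)."""
    # Placeholder result; a real call would run the 1.5B classifier.
    return {"parent_type": "article", "type": "article.news"}

def extract(markdown: str, doc_type: dict) -> dict:
    """Stage 2: the type classification narrows the extraction schema
    the 3B extractor has to fill."""
    return {"type": doc_type["type"], "summary": "...", "entities": []}

def convert(markdown: str) -> dict:
    doc_type = classify(markdown)       # cheap pass over the whole document
    return extract(markdown, doc_type)  # schema-directed extraction

print(convert("# Example article")["type"])  # article.news
```

Passing the stage-1 result into stage 2 is what lets each model stay small: neither ever has to solve the joint classify-and-extract task.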

Overall extraction accuracy: 90% exact match across all field types.

| Field Category | Exact Match | Notes |
| --- | --- | --- |
| Type classification | 95.2% | After normalization cascade |
| Entity extraction | 91.3% | Name + type + salience |
| Claim extraction | 87.6% | Claim text + confidence |
| Relationship extraction | 85.4% | Subject + predicate + object |
| Summary generation | 92.1% | Evaluated by human raters |
| type_data fields | 89.8% | Type-specific structured fields |

The 5-stage type normalization cascade corrected 63 non-standard type inventions across the 2,335-document corpus:

| Normalization Stage | Corrections | Example |
| --- | --- | --- |
| Exact match | 0 (pass-through) | article.news → article.news |
| Alias resolution | 24 | blog_post → article.blog |
| Parent type inference | 15 | article → article.news (from domain) |
| Fuzzy matching | 18 | artcle.analyis → article.analysis |
| Fallback classification | 6 | custom.misc → reference.wiki |
| Total corrections | 63 | |

After normalization: 100% taxonomy conformance.
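The cascade's control flow can be sketched as a chain of progressively looser matchers. This is a minimal illustration, not the production code: the taxonomy, alias table, and fallback type below are tiny stand-ins for the real ones, and stdlib `difflib` stands in for whatever fuzzy matcher the pipeline actually uses.

```python
# Illustrative sketch of the 5-stage type normalization cascade.
# TAXONOMY, ALIASES, DEFAULT_SUBTYPE, and FALLBACK are hypothetical
# miniatures of the real tables.
import difflib

TAXONOMY = {"article.news", "article.blog", "article.analysis", "reference.wiki"}
ALIASES = {"blog_post": "article.blog"}
DEFAULT_SUBTYPE = {"article": "article.news"}  # e.g. inferred from domain
FALLBACK = "reference.wiki"

def normalize(raw: str) -> str:
    if raw in TAXONOMY:                    # 1. exact match (pass-through)
        return raw
    if raw in ALIASES:                     # 2. alias resolution
        return ALIASES[raw]
    if raw in DEFAULT_SUBTYPE:             # 3. parent type inference
        return DEFAULT_SUBTYPE[raw]
    close = difflib.get_close_matches(raw, TAXONOMY, n=1, cutoff=0.8)
    if close:                              # 4. fuzzy matching (typos)
        return close[0]
    return FALLBACK                        # 5. fallback classification

print(normalize("artcle.analyis"))  # article.analysis
```

Ordering the stages cheapest-first means the common case (an already-valid type) costs a single set lookup, and the fallback guarantees every document ends up somewhere in the taxonomy, which is how 100% conformance is reached.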

SDF achieves a 99% token reduction from raw HTML:

| Representation | Avg. Size | Avg. Tokens | Reduction from HTML |
| --- | --- | --- | --- |
| Raw HTML | 89 KB | ~73,000 | — |
| Markdown (cleaned) | 12 KB | ~4,200 | 94.2% |
| SDF JSON | 2 KB | ~750 | 98.9% |

From markdown to SDF, token count drops a further 82% (~4,200 to ~750). This matters because many agent pipelines already convert HTML to markdown as a first step; SDF provides further compression by extracting only semantic content.
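The reduction percentages follow directly from the average token counts in the table above; a quick arithmetic check:

```python
# Reproducing the reduction figures from the table's average token counts.
tokens = {"html": 73_000, "markdown": 4_200, "sdf": 750}

def reduction(frm: int, to: int) -> float:
    """Percentage reduction when going from `frm` tokens to `to` tokens."""
    return (frm - to) / frm * 100

print(f"HTML -> Markdown: {reduction(tokens['html'], tokens['markdown']):.1f}%")  # 94.2%
print(f"HTML -> SDF:      {reduction(tokens['html'], tokens['sdf']):.1f}%")       # 99.0%
print(f"Markdown -> SDF:  {reduction(tokens['markdown'], tokens['sdf']):.1f}%")   # 82.1%
```

(The exact HTML-to-SDF figure is 98.97%, which the table reports as 98.9%.)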

| Section | Avg. Tokens | % of Document |
| --- | --- | --- |
| summary | 180 | 24% |
| entities | 120 | 16% |
| type_data | 150 | 20% |
| claims | 95 | 13% |
| relationships | 65 | 9% |
| source + provenance | 80 | 11% |
| Other fields | 60 | 8% |

Across all 2,335 documents, 100% produced valid JSON. This is achieved through:

  1. Schema-constrained generation (the extraction model outputs against a JSON schema)
  2. Post-generation validation with automatic repair for minor formatting issues
  3. Retry logic for generation failures (< 0.5% of documents require retry)
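The validate-repair-retry path can be sketched as follows. This is a hypothetical illustration: the trailing-comma fix stands in for whatever "minor formatting issues" the real repair step handles, and schema validation (step 1) is omitted for brevity.

```python
# Sketch of the post-generation validation path: parse, attempt a minor
# repair, and surface a failure for retry. The trailing-comma repair is
# one illustrative fix, not the full repair logic.
import json
import re

def repair(text: str) -> str:
    """Fix a minor formatting issue: trailing commas before } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", text)

def validate_or_repair(text: str, max_attempts: int = 2) -> dict:
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            text = repair(text)  # attempt automatic repair, then re-parse
    # < 0.5% of documents reach this point and are re-generated.
    raise ValueError("unrepairable output: document queued for model retry")

print(validate_or_repair('{"type": "article.news",}'))  # {'type': 'article.news'}
```

Cheap deterministic repair before an expensive model retry is what keeps the retry rate under 0.5% while still ending at 100% valid JSON.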

If you reference this research, please cite:

Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” Zenodo. DOI: 10.5281/zenodo.18559223