
Downstream Evaluation

The SDF protocol’s central claim is “convert once, consume many” — that pre-extracting structured data is more efficient for AI agents than re-parsing raw content each time. This experiment provides the consumer-side evidence for that claim.

| Parameter | Value |
| --- | --- |
| Consumer model | qwen2.5:7b-instruct-q4_K_M (general purpose, not fine-tuned for SDF) |
| Documents tested | 30 (10 parent types, 1–5 documents each) |
| Question categories | 5 |
| Questions per document | 5 (one per category) |
| Total LLM calls | 300 (30 docs × 5 questions × 2 paths) |

  1. Raw path: Full markdown content + question → 7B model → answer
  2. SDF path: Compact SDF JSON (summary, entities, claims, type_data, topics, relationships) → 7B model → answer
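The two paths differ only in what the consumer model is shown. A minimal sketch, assuming a dict-shaped SDF document and hypothetical helper names (`build_raw_prompt`, `build_sdf_prompt` are illustrative, not the experiment's actual harness):

```python
import json

SDF_FIELDS = ("summary", "entities", "claims", "type_data", "topics", "relationships")

def build_raw_prompt(markdown_text: str, question: str) -> str:
    """Raw path: the consumer model sees the full markdown content."""
    return f"Content:\n{markdown_text}\n\nQuestion: {question}\nAnswer in JSON."

def build_sdf_prompt(sdf_doc: dict, question: str) -> str:
    """SDF path: the consumer model sees only the compact pre-extracted fields."""
    compact = {k: sdf_doc[k] for k in SDF_FIELDS if k in sdf_doc}
    return f"Content:\n{json.dumps(compact)}\n\nQuestion: {question}\nAnswer in JSON."
```

Only the six SDF fields listed above are serialized, which is what makes the SDF path's input so much smaller than the raw markdown.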

Ground truth is the SDF document itself (generated by the 14B extraction model, quality-scored).

| Category | Question Template | Ground Truth Source |
| --- | --- | --- |
| Type identification | "What type of content is this?" | parent_type, type |
| Entity extraction | "List the main entities" | entities[] |
| Key facts | "What are the 3 most important facts?" | summary.key_points |
| Type-specific | Varies by parent type (e.g., author/date for articles, price for commerce) | type_data fields |
| Relationships | "How are [entity A] and [entity B] related?" | relationships[] |

| Category | Method |
| --- | --- |
| Type identification | Exact match (1.0 if both parent and sub-type match, 0.5 if parent only) |
| Entity extraction | F1 score on entity names (case-insensitive, substring matching) |
| Key facts | Fraction of ground-truth key_points covered (30% token-overlap threshold) |
| Type-specific | Fraction of ground-truth fields matched in the response |
| Relationships | Fraction of ground-truth triples matched (subject + object fuzzy match) |
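Two of these scoring rules can be sketched as follows. The function names and the whitespace tokenization are assumptions; the experiment's exact matching code is not shown here:

```python
def entity_f1(predicted: list[str], truth: list[str]) -> float:
    """F1 on entity names: case-insensitive, substring match in either direction."""
    pred = [p.lower() for p in predicted]
    gold = [g.lower() for g in truth]
    if not pred or not gold:
        return 0.0
    def hit(name: str, pool: list[str]) -> bool:
        return any(name in other or other in name for other in pool)
    precision = sum(hit(p, gold) for p in pred) / len(pred)
    recall = sum(hit(g, pred) for g in gold) / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def key_point_coverage(answer: str, key_points: list[str], threshold: float = 0.3) -> float:
    """Fraction of ground-truth key points whose tokens overlap the answer above the threshold."""
    answer_tokens = set(answer.lower().split())
    def covered(point: str) -> bool:
        tokens = set(point.lower().split())
        return bool(tokens) and len(tokens & answer_tokens) / len(tokens) >= threshold
    return sum(covered(kp) for kp in key_points) / len(key_points) if key_points else 0.0
```

Substring matching makes the entity scorer tolerant of partial names ("OpenAI" vs "OpenAI Inc"), while the 30% overlap threshold lets paraphrased answers still count as covering a key point.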

| Metric | Raw Path | SDF Path | Delta |
| --- | --- | --- | --- |
| Mean Accuracy | 0.352 | 0.739 | +0.387 |
| Median Accuracy | 0.333 | 1.000 | +0.667 |
| JSON Valid Rate | 99.3% | 100.0% | +0.7 pp |
| Mean Input Tokens | 1,731 | 834 | -51.8% |
| Mean Latency (ms) | 3,872 | 1,609 | -58.5% |

SDF achieves 0.739 mean accuracy compared to 0.352 for the raw path — a 110% improvement — while using 51.8% fewer tokens and completing 58.5% faster.

| Category | Raw | SDF | Delta | SDF Wins | Ties | Raw Wins |
| --- | --- | --- | --- | --- | --- | --- |
| Type identification | 0.200 | 0.733 | +0.533 | 18 | 12 | 0 |
| Entity extraction | 0.298 | 0.842 | +0.544 | 29 | 0 | 1 |
| Key facts | 0.451 | 0.808 | +0.357 | 24 | 5 | 1 |
| Type-specific | 0.483 | 0.772 | +0.289 | 12 | 16 | 2 |
| Relationships | 0.327 | 0.538 | +0.211 | 19 | 11 | 0 |

Entity extraction shows the largest improvement (+0.544). SDF pre-extracts entities with types, roles, and salience scores, eliminating the need for the consumer model to perform NER on raw text. SDF wins on 29 out of 30 documents.

Type identification shows a nearly identical delta (+0.533). The raw-path model struggles to infer content types from unstructured markdown, while SDF provides the classification directly.

| Parent Type | N | Raw | SDF | Delta |
| --- | --- | --- | --- | --- |
| article | 5 | 0.397 | 0.805 | +0.409 |
| documentation | 3 | 0.460 | 0.844 | +0.384 |
| reference | 3 | 0.288 | 0.800 | +0.512 |
| discussion | 3 | 0.308 | 0.722 | +0.414 |
| commerce | 3 | 0.296 | 0.891 | +0.596 |
| data | 3 | 0.297 | 0.596 | +0.299 |
| code | 3 | 0.268 | 0.523 | +0.256 |
| media | 1 | 0.390 | 0.767 | +0.377 |
| profile | 3 | 0.354 | 0.673 | +0.319 |
| event | 3 | 0.460 | 0.741 | +0.281 |

Commerce shows the largest improvement (+0.596), reflecting the difficulty of extracting structured product data from heavily templated HTML. SDF pre-extracts price, availability, and product attributes into typed fields.

HTML tokens estimated from raw fetched HTML size. Markdown and SDF tokens from experiment input.

| Parent Type | HTML Tokens (est.) | Markdown Tokens | SDF Tokens | HTML→SDF | Markdown→SDF |
| --- | --- | --- | --- | --- | --- |
| article | 101,694 | 1,570 | 922 | -99.1% | -41.3% |
| documentation | 107,524 | 1,166 | 658 | -99.4% | -43.6% |
| reference | 84,436 | 2,812 | 822 | -99.0% | -70.8% |
| discussion | 54,515 | 1,061 | 1,384 | -97.5% | +30.4% |
| commerce | 205,464 | 2,710 | 776 | -99.6% | -71.4% |
| data | 141,940 | 878 | 769 | -99.5% | -12.4% |
| code | 45,671 | 1,337 | 645 | -98.6% | -51.8% |
| media | 69,049 | 278 | 662 | -99.0% | +138.1% |
| profile | 93,012 | 1,708 | 801 | -99.1% | -53.1% |
| event | 105,057 | 2,922 | 726 | -99.3% | -75.2% |
| Overall | 103,013 | 1,731 | 834 | -99.2% | -51.8% |

The three-tier reduction — HTML to markdown to SDF — shows that SDF provides value even over markdown-cleaned content. The 99.2% reduction from HTML is the headline number, but the 51.8% reduction from markdown matters for agent pipelines that already strip HTML.
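The reduction columns are plain percentage changes over the mean token counts; for the overall row:

```python
# Overall token means from the table above
html_tokens, md_tokens, sdf_tokens = 103_013, 1_731, 834

# Percentage change from each source representation to SDF
html_to_sdf = (sdf_tokens - html_tokens) / html_tokens * 100  # ≈ -99.2% after rounding
md_to_sdf = (sdf_tokens - md_tokens) / md_tokens * 100        # ≈ -51.8% after rounding
```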

Note: Discussion and media types show SDF token inflation over markdown. Discussion threads include pre-extracted answer metadata, and media documents add structured metadata that exceeds the short markdown source. The HTML→SDF reduction remains >97% for all types.

| Test | Statistic | Result |
| --- | --- | --- |
| Paired t-test | t(29) = 11.890 | p < 0.05 (significant) |

The paired t-test was conducted on per-document average accuracy (df=29, alpha=0.05, t_crit=2.045). The observed t-statistic of 11.890 far exceeds the critical value, confirming the SDF advantage is statistically significant.
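The t-statistic can be reproduced from per-document accuracy pairs with the standard paired-t computation. The scores below are illustrative only, since the per-document accuracies are not published here:

```python
import math

def paired_t(raw: list[float], sdf: list[float]) -> tuple[float, int]:
    """Paired t-statistic on per-document accuracy differences; df = n - 1."""
    diffs = [s - r for r, s in zip(raw, sdf)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n), n - 1

# Illustrative per-document accuracies (not the experiment's data):
t, df = paired_t(raw=[0.2, 0.3, 0.4], sdf=[0.7, 0.7, 0.8])
```

With df = 29 and alpha = 0.05, any t above the critical value 2.045 rejects the null hypothesis of no difference; the reported 11.890 clears it by a wide margin.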

  • Sample size: 30 documents / 150 questions.
  • Circularity: Ground truth is the SDF document itself, so entity/relationship improvements partially reflect extraction quality rather than independent consumer improvement.
  • Single consumer model: Only a 7B parameter model was tested. Results may differ at other scales.
  • Template-based questions: Questions were generated deterministically, not sampled from real agent workloads.
  • Token inflation: Two content types (discussion, media) show SDF token inflation over markdown, though accuracy still improves.

If you reference this research, please cite:

Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” Zenodo. DOI: 10.5281/zenodo.18559223

Key Research Findings

Pipeline performance and extraction accuracy across 2,335 documents. See Key Findings.

Protocol Specification

Full protocol specification and document model. See Protocol v0.2.