Key Research Findings
Pipeline performance and extraction accuracy across 2,335 documents. See Key Findings.
The SDF protocol’s central claim is “convert once, consume many” — that pre-extracting structured data is more efficient for AI agents than re-parsing raw content each time. This experiment provides the consumer-side evidence for that claim.
| Parameter | Value |
|---|---|
| Consumer model | qwen2.5:7b-instruct-q4_K_M (general purpose, not fine-tuned for SDF) |
| Documents tested | 30 across 10 parent types (1–5 per type; see per-type table) |
| Question categories | 5 |
| Questions per document | 5 (one per category) |
| Total LLM calls | 300 (30 docs x 5 questions x 2 paths) |
Ground truth is the SDF document itself (generated by the 14B extraction model, quality-scored).
| Category | Question Template | Ground Truth Source |
|---|---|---|
| Type identification | "What type of content is this?" | parent_type, type |
| Entity extraction | "List the main entities" | entities[] |
| Key facts | "What are the 3 most important facts?" | summary.key_points |
| Type-specific | Varies by parent type (e.g., author/date for articles, price for commerce) | type_data fields |
| Relationships | "How are [entity A] and [entity B] related?" | relationships[] |
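Each of the 300 calls pairs one question template with one of the two input paths. A minimal sketch of how those calls could be assembled (the prompt wording and function names here are illustrative, not the experiment's exact prompts):

```python
import json

QUESTION_TEMPLATES = {
    "type_identification": "What type of content is this?",
    "entity_extraction": "List the main entities",
    "key_facts": "What are the 3 most important facts?",
}

def build_prompt(question: str, content: str, path: str) -> str:
    """Assemble a consumer prompt for one of the two paths.

    path="raw" feeds markdown-converted page text; path="sdf" feeds the
    pre-extracted SDF JSON. Wording is illustrative only.
    """
    label = "Raw page content" if path == "raw" else "SDF document (JSON)"
    return f"{label}:\n{content}\n\nQuestion: {question}\nRespond in JSON."

# One call on the SDF path with the entity-extraction question.
sdf_doc = {"parent_type": "article", "entities": [{"name": "SDF", "type": "protocol"}]}
prompt = build_prompt(QUESTION_TEMPLATES["entity_extraction"], json.dumps(sdf_doc), "sdf")
```

The same question is asked on both paths, so each document contributes a paired (raw, SDF) score per category.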
| Category | Method |
|---|---|
| Type identification | Exact match (1.0 if both parent+sub match, 0.5 if parent only) |
| Entity extraction | F1 score on entity names (case-insensitive, substring matching) |
| Key facts | Fraction of ground-truth key_points covered (30% token overlap threshold) |
| Type-specific | Fraction of ground-truth fields matched in response |
| Relationships | Fraction of ground-truth triples matched (subject+object fuzzy match) |
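The first two scoring rules can be sketched directly from their descriptions; this is a plausible reading of "exact match" and "F1 with case-insensitive substring matching", not the experiment's verbatim scorer:

```python
def type_score(pred_parent, pred_sub, true_parent, true_sub):
    # Exact match: 1.0 if parent and sub-type both match, 0.5 if parent only.
    if pred_parent != true_parent:
        return 0.0
    return 1.0 if pred_sub == true_sub else 0.5

def entity_f1(predicted, truth):
    # F1 on entity names, case-insensitive; a name counts as matched if it
    # contains (or is contained in) a name from the other list.
    pred = [p.lower().strip() for p in predicted]
    true = [t.lower().strip() for t in truth]
    if not pred or not true:
        return 0.0
    matched = lambda a, pool: any(a in b or b in a for b in pool)
    precision = sum(matched(p, true) for p in pred) / len(pred)
    recall = sum(matched(t, pred) for t in true) / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Substring matching is forgiving of surface variation ("Apple" vs "Apple Inc."), which matters because both paths return free-form entity names.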
| Metric | Raw Path | SDF Path | Delta |
|---|---|---|---|
| Mean Accuracy | 0.352 | 0.739 | +0.387 |
| Median Accuracy | 0.333 | 1.000 | +0.667 |
| JSON Valid Rate | 99.3% | 100.0% | +0.7 pp |
| Mean Input Tokens | 1,731 | 834 | -51.8% |
| Mean Latency (ms) | 3,872 | 1,609 | -58.5% |
SDF achieves 0.739 mean accuracy compared to 0.352 for the raw path — a 110% improvement — while using 51.8% fewer tokens and completing 58.5% faster.
| Category | Raw | SDF | Delta | SDF Wins | Ties | Raw Wins |
|---|---|---|---|---|---|---|
| Type identification | 0.200 | 0.733 | +0.533 | 18 | 12 | 0 |
| Entity extraction | 0.298 | 0.842 | +0.544 | 29 | 0 | 1 |
| Key facts | 0.451 | 0.808 | +0.357 | 24 | 5 | 1 |
| Type-specific | 0.483 | 0.772 | +0.289 | 12 | 16 | 2 |
| Relationships | 0.327 | 0.538 | +0.211 | 19 | 11 | 0 |
Entity extraction shows the largest improvement (+0.544). SDF pre-extracts entities with types, roles, and salience scores, eliminating the need for the consumer model to perform NER on raw text. SDF wins on 29 out of 30 documents.
Type identification shows a nearly as large delta (+0.533). The raw-path model struggles to infer content types from unstructured markdown, while SDF provides the classification directly.
| Parent Type | N | Raw | SDF | Delta |
|---|---|---|---|---|
| article | 5 | 0.397 | 0.805 | +0.409 |
| documentation | 3 | 0.460 | 0.844 | +0.384 |
| reference | 3 | 0.288 | 0.800 | +0.512 |
| discussion | 3 | 0.308 | 0.722 | +0.414 |
| commerce | 3 | 0.296 | 0.891 | +0.596 |
| data | 3 | 0.297 | 0.596 | +0.299 |
| code | 3 | 0.268 | 0.523 | +0.256 |
| media | 1 | 0.390 | 0.767 | +0.377 |
| profile | 3 | 0.354 | 0.673 | +0.319 |
| event | 3 | 0.460 | 0.741 | +0.281 |
Commerce shows the largest improvement (+0.596), reflecting the difficulty of extracting structured product data from heavily templated HTML. SDF pre-extracts price, availability, and product attributes into typed fields.
HTML tokens estimated from raw fetched HTML size. Markdown and SDF tokens from experiment input.
| Parent Type | HTML Tokens (est.) | Markdown Tokens | SDF Tokens | HTML→SDF | Markdown→SDF |
|---|---|---|---|---|---|
| article | 101,694 | 1,570 | 922 | -99.1% | -41.3% |
| documentation | 107,524 | 1,166 | 658 | -99.4% | -43.6% |
| reference | 84,436 | 2,812 | 822 | -99.0% | -70.8% |
| discussion | 54,515 | 1,061 | 1,384 | -97.5% | +30.4% |
| commerce | 205,464 | 2,710 | 776 | -99.6% | -71.4% |
| data | 141,940 | 878 | 769 | -99.5% | -12.4% |
| code | 45,671 | 1,337 | 645 | -98.6% | -51.8% |
| media | 69,049 | 278 | 662 | -99.0% | +138.1% |
| profile | 93,012 | 1,708 | 801 | -99.1% | -53.1% |
| event | 105,057 | 2,922 | 726 | -99.3% | -75.2% |
| Overall | 103,013 | 1,731 | 834 | -99.2% | -51.8% |
The three-tier reduction — HTML to markdown to SDF — shows that SDF provides value even over markdown-cleaned content. The 99.2% reduction from HTML is the headline number, but the 51.8% reduction from markdown matters for agent pipelines that already strip HTML.
Note: Discussion and media types show SDF token inflation over markdown. Discussion threads include pre-extracted answer metadata, and media documents add structured metadata that exceeds the short markdown source. The HTML→SDF reduction remains >97% for all types.
| Test | Statistic | Result |
|---|---|---|
| Paired t-test | t(29) = 11.890 | p < 0.05 (significant) |
The paired t-test was conducted on per-document average accuracy (df=29, alpha=0.05, t_crit=2.045). The observed t-statistic of 11.890 far exceeds the critical value, confirming the SDF advantage is statistically significant.
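The statistic follows the standard paired form, t = mean(d) / (stdev(d) / sqrt(n)) over per-document differences. A minimal sketch with toy scores (illustrative values, not the experiment's data):

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    # d = per-document accuracy difference a - b; returns (t, degrees of freedom).
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Toy per-document accuracies for four documents.
sdf = [0.9, 0.8, 1.0, 0.9]
raw = [0.4, 0.4, 0.4, 0.4]
t, df = paired_t(sdf, raw)
```

With 30 documents the experiment has df = 29, matching the reported t(29) = 11.890 against t_crit = 2.045.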
If you reference this research, please cite:
Sarkar, P. (2026). “Convert Once, Consume Many: SDF for Cacheable, Typed Semantic Extraction from Web Pages.” Zenodo. DOI: 10.5281/zenodo.18559223