
Type System

SDF uses a two-level type system. The parent_type field is one of 10 top-level categories. The type field gives a finer-grained classification, written as parent_type.subtype.

parent_type: "article"
type: "article.news"

The parent type determines which type_data schema applies to the document. The subtype provides additional classification within that schema.
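The relationship between the two fields can be checked mechanically. The following is a minimal Python sketch, not part of SDF itself: the function name `split_type` is hypothetical, and it assumes a document is available as a dict carrying the `parent_type` and `type` fields shown above.

```python
def split_type(doc: dict) -> tuple[str, str]:
    """Split a document's type into (parent, subtype) and check it against parent_type.

    Hypothetical helper: splits on the first dot and assumes `doc` has
    both a `parent_type` and a `type` field.
    """
    parent, sep, subtype = doc["type"].partition(".")
    if not sep or not subtype:
        raise ValueError(f"type {doc['type']!r} is not of the form parent_type.subtype")
    if parent != doc["parent_type"]:
        raise ValueError(
            f"type {doc['type']!r} does not belong to parent_type {doc['parent_type']!r}"
        )
    return parent, subtype


doc = {"parent_type": "article", "type": "article.news"}
print(split_type(doc))  # ('article', 'news')
```

A mismatch such as parent_type "commerce" with type "article.news" raises a ValueError instead of silently selecting the wrong type_data schema.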

article

Content produced by an author for publication.

Subtypes: news, blog, opinion, review, analysis, press_release

documentation

Technical or instructional content for a product, API, or platform.

Subtypes: tutorial, api_reference, guide, faq, changelog, troubleshooting

commerce

Product or service listings and transactional content.

Subtypes: product, category, comparison, marketplace, service, pricing

discussion

Community-generated Q&A and forum content.

Subtypes: forum, qa, comment_thread, review_thread, poll

reference

Encyclopedic and reference material.

Subtypes: encyclopedia, dictionary, legal, academic, specification, wiki

data

Data-centric content: datasets, statistics, reports.

Subtypes: dataset, statistics, report, dashboard, financial, scientific

code

Software repositories, packages, and code-related content.

Subtypes: repository, package, snippet, gist, notebook, documentation

profile

Entity profiles: people, organizations, places.

Subtypes: person, organization, place, product_profile, portfolio

event

Event listings and schedules.

Subtypes: conference, meetup, webinar, concert, sports, workshop

media

Audio, video, and multimedia content.

Subtypes: video, podcast, music, image_gallery, livestream, animation

| Parent Type | Subtypes | Description |
| --- | --- | --- |
| article | news, blog, opinion, review, analysis, press_release | Authored content for publication |
| documentation | tutorial, api_reference, guide, faq, changelog, troubleshooting | Technical and instructional content |
| commerce | product, category, comparison, marketplace, service, pricing | Product/service listings |
| discussion | forum, qa, comment_thread, review_thread, poll | Community Q&A and forums |
| reference | encyclopedia, dictionary, legal, academic, specification, wiki | Encyclopedic / reference material |
| data | dataset, statistics, report, dashboard, financial, scientific | Data-centric content |
| code | repository, package, snippet, gist, notebook, documentation | Software and code content |
| profile | person, organization, place, product_profile, portfolio | Entity profiles |
| event | conference, meetup, webinar, concert, sports, workshop | Event listings |
| media | video, podcast, music, image_gallery, livestream, animation | Audio/video/multimedia |
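For validation, the taxonomy in the table above can be held as a plain lookup structure. This is a sketch only: the names `TAXONOMY` and `is_canonical` are hypothetical and not an SDF API, and the entries are copied from the table.

```python
# Canonical taxonomy copied from the table above: parent type -> allowed subtypes.
TAXONOMY: dict[str, frozenset[str]] = {
    "article": frozenset({"news", "blog", "opinion", "review", "analysis", "press_release"}),
    "documentation": frozenset({"tutorial", "api_reference", "guide", "faq", "changelog", "troubleshooting"}),
    "commerce": frozenset({"product", "category", "comparison", "marketplace", "service", "pricing"}),
    "discussion": frozenset({"forum", "qa", "comment_thread", "review_thread", "poll"}),
    "reference": frozenset({"encyclopedia", "dictionary", "legal", "academic", "specification", "wiki"}),
    "data": frozenset({"dataset", "statistics", "report", "dashboard", "financial", "scientific"}),
    "code": frozenset({"repository", "package", "snippet", "gist", "notebook", "documentation"}),
    "profile": frozenset({"person", "organization", "place", "product_profile", "portfolio"}),
    "event": frozenset({"conference", "meetup", "webinar", "concert", "sports", "workshop"}),
    "media": frozenset({"video", "podcast", "music", "image_gallery", "livestream", "animation"}),
}


def is_canonical(type_value: str) -> bool:
    """Return True if `type_value` is a parent_type.subtype pair listed in the taxonomy."""
    parent, _, subtype = type_value.partition(".")
    return subtype in TAXONOMY.get(parent, frozenset())


print(len(TAXONOMY))                                  # 10
print(sum(len(subs) for subs in TAXONOMY.values()))   # 58
print(is_canonical("code.documentation"))             # True
print(is_canonical("documentation.api"))              # False
```

Note that "documentation" appears both as a parent type and as a subtype of code, so a lookup has to key on the full parent_type.subtype pair rather than on the subtype alone.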

Some content spans multiple types. A product review is both article.review and commerce.product. The aspects field captures secondary type classifications:

{
  "parent_type": "article",
  "type": "article.review",
  "aspects": ["commerce.product"]
}

Aspects do not change the type_data schema — the primary parent_type determines that. They provide additional classification signals for agents that need multi-dimensional type filtering.
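As an illustration of multi-dimensional filtering, an agent could treat a document as a match when the requested type is either its primary type or one of its aspects. The helper `has_type` and the sample documents below are hypothetical, and the sketch assumes a missing aspects field means "no aspects".

```python
def has_type(doc: dict, wanted: str) -> bool:
    """Return True if `wanted` is the document's primary type or one of its aspects."""
    return wanted == doc["type"] or wanted in doc.get("aspects", [])


# Hypothetical sample documents; only the first one carries an aspect.
docs = [
    {"parent_type": "article", "type": "article.review", "aspects": ["commerce.product"]},
    {"parent_type": "commerce", "type": "commerce.product"},
    {"parent_type": "article", "type": "article.news"},
]

hits = [d for d in docs if has_type(d, "commerce.product")]
print([d["type"] for d in hits])         # ['article.review', 'commerce.product']

# The schema is still selected by parent_type, never by an aspect.
print([d["parent_type"] for d in hits])  # ['article', 'commerce']
```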

LLM-based extraction often produces non-standard type values. The SDF pipeline applies a 5-stage normalization cascade to ensure all documents conform to the canonical taxonomy:

Stage 1 (exact match). If the extracted type exactly matches a canonical type, accept it.

Stage 2 (alias mapping). Map known aliases to canonical types:

  • "blog_post" → "article.blog"
  • "product_page" → "commerce.product"
  • "q_and_a" → "discussion.qa"

Stage 3 (subtype inference). If only a parent type is provided, infer the most likely subtype based on content signals:

  • "article" + news domain → "article.news"
  • "code" + GitHub URL → "code.repository"

Stage 4 (fuzzy matching). Apply string similarity matching against the canonical taxonomy:

  • "artcle.news" → "article.news" (edit distance 1)
  • "documentation.api" → "documentation.api_reference" (prefix match)

Stage 5 (fallback re-classification). If all previous stages fail, re-classify using content-based heuristics and assign the closest canonical type.
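The five stages above can be sketched as a single function that tries each stage in order and reports which one produced the result. This is an illustration under stated assumptions, not the SDF pipeline's code: every name here is hypothetical, the only content signal implemented is the GitHub URL check, fuzzy matching uses the standard library's `difflib` similarity ratio (with an arbitrary 0.85 cutoff) in place of a true edit-distance measure, and the final re-classifier is passed in as a callback.

```python
import difflib
from typing import Callable, Optional
from urllib.parse import urlparse

# Canonical taxonomy as listed in this document: parent type -> subtypes.
TAXONOMY = {
    "article": ("news", "blog", "opinion", "review", "analysis", "press_release"),
    "documentation": ("tutorial", "api_reference", "guide", "faq", "changelog", "troubleshooting"),
    "commerce": ("product", "category", "comparison", "marketplace", "service", "pricing"),
    "discussion": ("forum", "qa", "comment_thread", "review_thread", "poll"),
    "reference": ("encyclopedia", "dictionary", "legal", "academic", "specification", "wiki"),
    "data": ("dataset", "statistics", "report", "dashboard", "financial", "scientific"),
    "code": ("repository", "package", "snippet", "gist", "notebook", "documentation"),
    "profile": ("person", "organization", "place", "product_profile", "portfolio"),
    "event": ("conference", "meetup", "webinar", "concert", "sports", "workshop"),
    "media": ("video", "podcast", "music", "image_gallery", "livestream", "animation"),
}
CANONICAL = [f"{parent}.{sub}" for parent, subs in TAXONOMY.items() for sub in subs]

# The three aliases named in the text; a real alias table would be longer.
ALIASES = {
    "blog_post": "article.blog",
    "product_page": "commerce.product",
    "q_and_a": "discussion.qa",
}


def infer_subtype(parent: str, url: str) -> Optional[str]:
    """Stage 3 stand-in: one URL-based signal (assumption: real signals are richer)."""
    host = urlparse(url).hostname or ""
    if parent == "code" and (host == "github.com" or host.endswith(".github.com")):
        return "code.repository"
    return None


def fuzzy_match(raw: str) -> Optional[str]:
    """Stage 4 stand-in: unique prefix match first, then difflib similarity."""
    prefixed = [c for c in CANONICAL if c.startswith(raw)]
    if len(prefixed) == 1:
        return prefixed[0]
    close = difflib.get_close_matches(raw, CANONICAL, n=1, cutoff=0.85)
    return close[0] if close else None


def normalize_type(raw: str, url: str, reclassify: Callable[[], str]) -> tuple[str, int]:
    """Return (canonical_type, stage_number) for an extracted type value."""
    if raw in CANONICAL:                      # Stage 1: exact match
        return raw, 1
    if raw in ALIASES:                        # Stage 2: alias mapping
        return ALIASES[raw], 2
    if raw in TAXONOMY:                       # Stage 3: parent only, infer subtype
        inferred = infer_subtype(raw, url)
        if inferred is not None:
            return inferred, 3
    matched = fuzzy_match(raw)                # Stage 4: string similarity
    if matched is not None:
        return matched, 4
    return reclassify(), 5                    # Stage 5: content-based fallback


fallback = lambda: "article.blog"  # placeholder for a content-based classifier
print(normalize_type("article.news", "", fallback))                      # ('article.news', 1)
print(normalize_type("blog_post", "", fallback))                         # ('article.blog', 2)
print(normalize_type("code", "https://github.com/org/repo", fallback))   # ('code.repository', 3)
print(normalize_type("artcle.news", "", fallback))                       # ('article.news', 4)
print(normalize_type("documentation.api", "", fallback))                 # ('documentation.api_reference', 4)
```

Returning the stage number alongside the result makes it straightforward to count how many documents each stage corrected.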

In a production deployment across 2,335 documents, the normalization cascade corrected 63 invented type combinations, achieving 100% conformance to the canonical taxonomy.

See Type Taxonomy for the full reference with production statistics.