Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2023

RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets extracted from the October 2023 Common Crawl

The Web Data Commons RDFa, Microdata and Microformats data sets have been extracted from the September/October 2023 release of the Common Crawl. In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (50.60%). These pages originate from 15 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (42.89%). Altogether, the extracted data sets consist of 86 billion RDF quads.
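To illustrate what a single RDF quad looks like, the following minimal sketch splits one line into its four terms. It assumes the data is serialized as N-Quads with the source page URL as the fourth term, which this record does not state. The sample line is invented for illustration and is not taken from the data set. The function name `parse_quad` and the simplified term pattern are hypothetical; a full RDF parser should be used for real work.

```python
import re

# Hypothetical sample line, invented for illustration only.
SAMPLE = '_:b0 <http://schema.org/name> "Example Product" <http://example.com/page.html> .'

# Simplified term pattern (an assumption, not a complete N-Quads grammar):
# an IRI, a blank node, or a literal with optional language tag or datatype.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
)

def parse_quad(line):
    """Return (subject, predicate, object, graph) for one quad line."""
    terms = TERM.findall(line)
    if len(terms) != 4:
        raise ValueError(f"expected 4 terms, found {len(terms)}")
    return tuple(terms)

subject, predicate, obj, graph = parse_quad(SAMPLE)
print(graph)  # the fourth term, assumed here to identify the source page
```

Under the stated assumption, grouping quads by their fourth term would collect all statements extracted from the same page.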

Identifier
DOI https://doi.org/10.7801/429
Related Identifier IsDocumentedBy https://madoc.bib.uni-mannheim.de/id/eprint/64408
Metadata Access https://api.datacite.org/dois/10.7801/429
Provenance
Creator Brinkmann, Alexander; Bizer, Christian
Publisher Mannheim University Library
Publication Year 2024
OpenAccess true
Representation
Resource Type Dataset
Version 1
Discipline Social Sciences