@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python · 453 stars · 93 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 212 stars · 16 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 126 stars · 15 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java · 38 stars · 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 28 stars · 4 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook · 64 stars · 11 forks

Repositories

Showing 10 of 81 repositories
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 28 stars · 4 forks · 0 issues · 1 pull request · Updated Mar 13, 2026
  • cc-warc-examples Public Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java · 38 stars · MIT · 45 forks · 0 issues · 0 pull requests · Updated Mar 12, 2026
  • nutch Public Forked from Aloisius/nutch

    Common Crawl fork of Apache Nutch

    Java · 40 stars · Apache-2.0 · 1,272 forks · 11 issues (1 issue needs help) · 3 pull requests · Updated Mar 12, 2026
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 212 stars · Apache-2.0 · 16 forks · 2 issues · 2 pull requests · Updated Mar 12, 2026
  • ia-web-commons Public Forked from Aloisius/ia-web-commons

    Web archiving utility library

    Java · 11 stars · Apache-2.0 · 78 forks · 4 issues · 1 pull request · Updated Mar 11, 2026
  • cc-webgraph-statistics Public

    Statistics of Common Crawl monthly Web Graphs

    JavaScript · 5 stars · Apache-2.0 · 1 fork · 0 issues · 0 pull requests · Updated Mar 6, 2026
  • cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 126 stars · Apache-2.0 · 15 forks · 6 issues · 1 pull request · Updated Mar 4, 2026
  • cc-downloader Public

    A polite and user-friendly downloader for Common Crawl data

    Rust · 70 stars · Apache-2.0 · 4 forks · 4 issues (1 issue needs help) · 1 pull request · Updated Mar 3, 2026
  • cc-index-annotations Public

    Example code to join an annotation to a host or url index

    Python · 1 star · 0 forks · 0 issues · 0 pull requests · Updated Mar 2, 2026
  • eot2020-host-index Public

    Tools to work with the preliminary End of Term Archive host index

    Python · 0 stars · 0 forks · 0 issues · 0 pull requests · Updated Mar 2, 2026