
Defining Web pages, Web sites and Web captures | Internet Archive Blogs


Posted on October 23, 2016 by Vinay Goel


The Internet Archive has been archiving the web for 20 years and has preserved billions of webpages from millions of websites. These webpages are often made up of, and link to, many images, videos, style sheets, scripts and other web objects. Over the years, the Archive has saved over 510 billion such time-stamped web objects, which we term web captures.

We define a webpage as a valid web capture that is an HTML document, a plain text document, or a PDF.
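As a rough illustration, the webpage definition above can be written as a predicate over capture records. This is only a sketch: the `Capture` record and its field names are hypothetical, the MIME type strings are assumed stand-ins for "HTML document, plain text document, or PDF", and treating "valid" as an HTTP 200 response is an assumption, not something the definition states.

```python
from dataclasses import dataclass

# Assumed MIME types for the three document kinds named in the definition.
WEBPAGE_MIME_TYPES = {"text/html", "text/plain", "application/pdf"}


@dataclass(frozen=True)
class Capture:
    """Hypothetical record for one time-stamped web capture."""
    url: str
    timestamp: str   # capture time, e.g. "20161023000000"
    status: int      # HTTP status code recorded at capture time
    mime_type: str   # Content-Type header value, possibly with parameters


def is_webpage(capture: Capture) -> bool:
    """Return True if the capture counts as a webpage under the definition.

    Assumption: a "valid" capture is one with HTTP status 200.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = capture.mime_type.split(";")[0].strip().lower()
    return capture.status == 200 and mime in WEBPAGE_MIME_TYPES
```

Under these assumptions, an image or script capture is still a web capture, but it does not count as a webpage.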

A domain on the web is an owned section of the internet namespace, such as google.com, archive.org or bbc.co.uk. A host on the web is identified by a fully qualified domain name (FQDN), which specifies its exact location in the tree hierarchy of the Domain Name System. An FQDN consists of two parts: a hostname and a domain name. For example, the host blog.archive.org has the hostname blog and is located within the domain archive.org.
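The split of an FQDN into hostname and domain can be sketched as follows. Deciding where the owned domain begins requires knowing which suffixes are open for registration (org, co.uk and so on). The tiny `SAMPLE_PUBLIC_SUFFIXES` set below is a hypothetical stand-in for a full list such as the Public Suffix List, and `split_host` is an illustrative name, not a function from any Internet Archive tool.

```python
from typing import Tuple

# Hypothetical sample only; a real system would load a complete suffix list.
SAMPLE_PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def split_host(fqdn: str) -> Tuple[str, str]:
    """Split an FQDN into (hostname, domain).

    The domain is the matched public suffix plus one label to its left.
    The hostname is whatever remains and may be empty.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    # Scanning from the left means the first match is the longest suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SAMPLE_PUBLIC_SUFFIXES:
            if i == 0:
                raise ValueError(f"{fqdn!r} is itself a public suffix")
            domain = ".".join(labels[i - 1:])
            hostname = ".".join(labels[:i - 1])
            return hostname, domain
    raise ValueError(f"no known public suffix in {fqdn!r}")


print(split_host("blog.archive.org"))  # ('blog', 'archive.org')
print(split_host("www.bbc.co.uk"))     # ('www', 'bbc.co.uk')
```

A host such as archive.org, which has no label in front of its domain, comes back with an empty hostname.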

We define a website to be a host that has served webpages and has at least one incoming link from a webpage belonging to a different domain.
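The website definition above amounts to a filter over the link graph. The sketch below assumes three hypothetical inputs that the post does not describe: the set of hosts that have served at least one webpage, a mapping from each host to its domain, and a list of (source host, target host) pairs for links found on webpages. How the Archive actually computes this is not stated here.

```python
from typing import Dict, Iterable, Set, Tuple


def find_websites(
    hosts_with_webpages: Set[str],
    host_domain: Dict[str, str],
    links: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return hosts that served webpages and have at least one incoming
    link from a webpage on a different domain.

    Each link is (source_host, target_host); the source is assumed to be
    the host of a webpage.
    """
    websites: Set[str] = set()
    for source_host, target_host in links:
        if target_host not in hosts_with_webpages:
            continue  # target never served a webpage
        if host_domain[source_host] != host_domain[target_host]:
            websites.add(target_host)  # cross-domain incoming link found
    return websites


host_domain = {
    "archive.org": "archive.org",
    "blog.archive.org": "archive.org",
    "www.bbc.co.uk": "bbc.co.uk",
}
links = [
    ("archive.org", "blog.archive.org"),    # same domain, does not count
    ("blog.archive.org", "www.bbc.co.uk"),  # different domain, counts
]
print(find_websites({"blog.archive.org", "www.bbc.co.uk"}, host_domain, links))
# {'www.bbc.co.uk'}
```

In this toy graph blog.archive.org is not counted, because its only incoming link comes from its own domain.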

As of today, the Internet Archive officially holds 273 billion webpages from over 361 million websites, taking up 15 petabytes of storage.

About Vinay Goel

Web Search & Data Mining Lead, Senior Data Engineer

This entry was posted in Announcements, News, Wayback Machine - Web Archive.
