Fixing Broken Links on the Internet
Posted on October 25, 2013 by Alexis Rossi
Today the Internet Archive announces a new initiative to fix broken links across the Internet. We have 360 billion archived URLs, and now we want you to help us bring those pages back out onto the web to heal broken links everywhere.
When I discover the perfect recipe for Nutella cookies, I want to make sure I can find those instructions again later. But if the average lifespan of a web page is 100 days, bookmarking a page in your browser is not a great plan for saving information. The Internet echoes with the empty spaces where data used to be. Geocities – gone. Friendster – gone. Posterous – gone. MobileMe – gone.
Imagine how critical this problem is for those who want to cite web pages in dissertations, legal opinions, or scientific research. A recent Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are dead now. Those decisions affect everyone in the U.S., but the evidence the opinions are based on is disappearing.
In 1996 the Internet Archive started saving web pages with the help of Alexa Internet. We wanted to preserve cultural artifacts created on the web and make sure they would remain available for the researchers, historians, and scholars of the future. We launched the Wayback Machine in 2001 with 10 billion pages. For many years we relied on donations of web content from others to build the archive. In 2004 we started crawling the web on behalf of a few big partner organizations, and of course that content also went into the Wayback Machine. In 2006 we launched Archive-It, a web archiving service that allows librarians and others interested in saving web pages to create curated collections of valuable web content. In 2010 we started archiving wide portions of the Internet on our own behalf. Today, between our donating partners, thousands of librarians and archivists, and our own wide crawling efforts, we archive around one billion pages every week. The Wayback Machine now contains more than 360 billion URL captures.
FTC.gov directed people to the Wayback Machine during the recent shutdown of the U.S. federal government.
We have been serving archived web pages to the public via the Wayback Machine for twelve years now, and it is gratifying to see how this service has become a medium of record for so many. Wayback pages are cited in papers, referenced in news articles and submitted as evidence in trials. Now even the U.S. government relies on this web archive.
We’ve also had some problems to overcome. This time last year the contents of the Wayback Machine were at least a year out of date. There was no way for individuals to ask us to archive a particular page, so you could only cite an archived page if we already had the content. And you had to know about the Wayback Machine and come to our site to find anything. We have set out to fix those problems, and hopefully we can fix broken links all over the Internet as a result.
Up to date. Newly crawled content appears in the Wayback Machine about an hour or so after we get it. We are constantly crawling the Internet and adding new pages, and many popular sites get crawled every day.
Save a page. We have added the ability to archive a page instantly and get back a permanent URL for that page in the Wayback Machine. This service allows anyone (Wikipedia editors, scholars, legal professionals, students, or home cooks like me) to create a stable URL to cite, share or bookmark any information they want to still have access to in the future. Check out the new front page of the Wayback Machine and you’ll see the “Save Page” feature in the lower right corner.
Do we have it? We have developed an Availability API that will let developers everywhere build tools to make the web more reliable. We have built a few tools of our own as a proof of concept, but what we really want is to allow people to take the Wayback Machine out onto the web.
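As a rough illustration of what a tool built on the Availability API might look like, here is a minimal Python sketch. This post does not give the endpoint address or the response format, so the URL and the JSON field names below (`archived_snapshots`, `closest`, `available`, `url`) are assumptions based on the publicly documented API and should be checked against the current documentation. The helper names `build_query`, `closest_snapshot`, and `lookup` are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: not stated in the post, verify before relying on it.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the request URL asking whether `url` has an archived copy."""
    params = {"url": url}
    if timestamp:
        # Assumed optional parameter (YYYYMMDDhhmmss) to ask for the
        # capture closest to a given moment.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the service over the network and return the snapshot URL, if any."""
    with urlopen(build_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

A link checker could call `lookup("example.com/old-page")` for each dead link it finds and offer the returned address as a replacement. Parsing is kept separate from the network call so it can be exercised without contacting the service.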
Fixing broken links. We started archiving the web before Google, before YouTube, before Wikipedia, before people started to treat the Internet as the world’s encyclopedia. With all of the recent improvements to the Wayback Machine, we now have the ability to start healing the gaping holes left by dead pages on the Internet. We have started by working with a couple of large sites, and we hope to expand from there.
WordPress.com is one of the top 20 sites in the world, with hundreds of millions of users each month. We worked with Automattic to get a feed of new posts made to WordPress.com blogs and self-hosted WordPress sites. We crawl the posts themselves, as well as all of their outlinks and embedded content – about 3,000,000 URLs per day. This is great for archival purposes, but we also want to use the archive to make sure WordPress blogs are reliable sources of information. To start with, we worked with Janis Elsts, a developer from Latvia who focuses on WordPress plugin development, to put suggestions from the Wayback into his Broken Link Checker plugin. This plugin has been downloaded 2 million times, and now when his users find a broken link on their blog they can instantly replace it with an archived version. We continue to work with Automattic to find more ways to fix or prevent dead links on WordPress blogs.
Wikipedia.org is one of the most popular information resources in the world with almost 500 million users each month. Among their millions of amazing articles that all of us rely on, there are 125,000 of them right now with dead links. We have started crawling the outlinks for every new article and update as they are made – about 5 million new URLs are archived every day. Now we have to figure out how to get archived pages back into Wikipedia to fix some of those dead links. Kunal Mehta, a Wikipedian from San Jose, recently wrote a prototype bot that can add archived versions to any link in Wikipedia so that when those links are determined to be dead the links can be switched over automatically and continue to work. It will take a while to work this through the process the Wikipedia community of editors uses to approve bots, but that conversation is under way.
Every webmaster. Webmasters can add a short snippet of code to their 404 page that will let users know if the Wayback Machine has a copy of the page in our archive – your web pages don’t have to die!
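The snippet itself is not reproduced in this post. As a stand-in, here is a minimal server-side sketch in Python of the same idea: a 404 page that points visitors to an archived copy when one exists. It assumes the archived address has already been looked up (for example through the Availability API) and is passed in as `snapshot_url`. The function name `not_found_body` and the page wording are hypothetical, not the Internet Archive's own code.

```python
from html import escape
from typing import Optional


def not_found_body(requested_url: str, snapshot_url: Optional[str]) -> str:
    """Render a simple 404 page, linking to an archived copy when one is known."""
    lines = [
        "<h1>404 Not Found</h1>",
        # Escape the requested address so it cannot inject markup into the page.
        f"<p>{escape(requested_url)} is no longer here.</p>",
    ]
    if snapshot_url:
        lines.append(
            f'<p><a href="{escape(snapshot_url, quote=True)}">'
            "View an archived copy in the Wayback Machine</a></p>"
        )
    return "\n".join(lines)
```

When no snapshot is known, the function returns an ordinary 404 message, so the page still works if the lookup fails or the archive has no copy.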
We started with a big goal: to archive the Internet and preserve it for history. This year we started looking at the smaller goals: archiving a single page on request, making pages available more quickly, and letting you get information back out of the Wayback in an automated way. We have spent 17 years building this amazing collection. Let’s use it to make the web a better place.
Thank you so much to everyone who has helped to build such an outstanding resource, in particular:
13 Responses to Fixing Broken Links on the Internet
- Keith Swenson says:
October 25, 2013 at 5:49 pm
Excellent! I wrote earlier this year about this sort of thing, and had been *planning* among all the other things I have to do, to put together something like this. THANK YOU SO MUCH for doing it for me!
- A concerned archivist says:
October 25, 2013 at 8:18 pm
Big question – what happens when the Internet Archive dies? Who watches the watchers and all that. Private donations can dry up and public funds can disappear so I’m wondering how permanent even this endeavor is (although definitely more permanent than most!) Keep up the good work though, IA is much loved. 🙂
- Nemo says:
October 28, 2013 at 2:21 pm
On Wikipedia: can you confirm that it’s all language editions of Wikipedia, and also the other Wikimedia projects?
- Eric Kansa says:
October 28, 2013 at 3:43 pm
This is great news!
I’m wondering if the “Save Page” feature can be implemented with the API? It would be great to integrate with a tool like Zotero, so people can use the Internet Archive to archive a Web resource that they are using Zotero to capture for citation.
There are lots of Digital Humanities programs that lack institutional repositories that would really benefit from integrating a Save Page feature via an API. Or is this idea too close to the Archive-It service?
- cara membuat says:
October 29, 2013 at 5:00 pm
What this all means for calculations of the average longevity of a webpage is that, while Internet Archive’s estimates may be the best available, there are key limitations and caveats behind any of the numbers proffered to date. Unfortunately, it’s unlikely that we’ll have objective measurements better than the gross methodologies permitted by automated link checking any time soon.
- Architrivus says:
October 30, 2013 at 11:38 am
It seems to only capture the top page and not the other links. Can we have more layers please?
- James Jacobs says:
October 30, 2013 at 6:43 pm
Great work as always! Whatever happened to Zotero Commons? Is it still running and able to accept snapshots for pages saved to Zotero? Inquiring minds want to know.
- SJ Klein says:
October 30, 2013 at 7:52 pm
Thanks for moving this beautiful and necessary work forward.
Nemo beat me to it, but: I hope you can run the same spiders across all of the Wikimedia wikis. And I hope that at least the Wikipedias that track their own categories for deadlinks can start running bots to insert archive-links.
Some of the other wikis may not yet have clean templates to showcase archived links, but we should preserve the sources they link to, before their 100±3σ days expire.
PS. I would be interested to know if the snapshot-series for Wikimedia Commons is significantly heavier than those for articles, given the way you are handling media. There the need for an archive is primarily to preserve the context of the original media, not necessarily an additional full-resolution copy – which IA nicely preserves elsewhere.
- Sushubh says:
November 4, 2013 at 8:32 pm
Is there a bookmarklet that I can use to instantly save the current webpage I am on in your database?
- Steve says:
November 4, 2013 at 9:16 pm
The next stage is the ability to search the Wayback archives.
Once a link goes dead, it disappears from normal search results (Google etc). Even if it is archived at Wayback, no one knows it’s there. The forgotten pages are most of them.
- Ricky says:
December 19, 2013 at 4:40 am
This is probably one of the best features of archive.org. Love it, thanks.