Way Back on The Wayback Machine

The Internet Archive is Growing by the Petabyte

Those of us unfamiliar with the term ‘petabyte’ may be surprised to learn that a petabyte is a measurement of data equal to one quadrillion bytes, or 1000 terabytes. A million gigabytes, a billion megabytes, or simply Tons o’ Data. . . The Internet Archive has a few of them stored within its digital library of websites and other cultural artifacts in digital form.
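For anyone who wants to check the petabyte arithmetic, here is a small Python sketch. It assumes decimal (SI) units, where each step up is a factor of 1000; the function name `petabytes_in` is made up for this illustration.

```python
# Decimal (SI) storage units, each 1000 times larger than the one before it.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes", "petabytes"]


def petabytes_in(unit: str, petabytes: int = 1) -> int:
    """Return how many of `unit` make up the given number of petabytes."""
    # Count how many factor-of-1000 steps separate `unit` from petabytes.
    steps = UNITS.index("petabytes") - UNITS.index(unit)
    return petabytes * 1000 ** steps


if __name__ == "__main__":
    for unit in UNITS[:-1]:
        print(f"1 petabyte = {petabytes_in(unit):,} {unit}")
```

Operating systems often report sizes in binary units (factors of 1024), so the numbers shown on your own drive may differ slightly from these.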

The Wayback Machine is a ‘digital time capsule’ that allows people to visit older versions of websites, many of which were archived more than ten years ago!

The Internet Archive itself is definitely worth a return visit, with searchable collections of digital content, organized into distinct ‘Media Types’ including: Web; Moving Images; Texts; Audio; and Software. Each of these Media Types is sub-categorized into semi-logical groupings to help the visitor narrow down the peta-choices.

“Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive’s mission is to help preserve those artifacts. . .”

Their banner states “Universal Access to All Knowledge” and they’re correct, to a point. It took me less than a minute to extract the Complete Letters of Mark Twain, 1835–1910, from Project Gutenberg files in text format. Opening a new window to the Audio section, I was able to listen to some old Jazz recordings (I chose New Orleans Traditional) while I perused Mark Twain’s musings from the 19th century.

The Web Archive is the real showstopper here, using the Wayback Machine to travel back in time (albeit only a decade or so) and discover previous versions of websites published long ago. Most of the websites we visit every day constantly change their content, and many have undergone a design makeover or two over the years.

The Wayback Machine claims to host over 150 billion pages, including hundreds of my own. . . yikes! For example, my very first web design was launched in 1999, so I was curious enough to investigate its dubious upgrade path. The Heritrix web crawler captured my site 17 times between 2003 and 2008. Sadly not much had changed, which is probably why it lost interest. . .

According to their FAQ there are some rules to follow, whether you want to get a new site indexed — or prevent an existing site from being archived.

  1. To get a site archived, it must first be listed in the Open Directory Project.
  2. To prevent a site from being archived, use the robots.txt file.
  3. It’s not an instant process. There can be a time lag of anywhere from 6 to 24 months between when Alexa crawls your site and when it appears in the Web archive.
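
To make rule 2 a little more concrete, here is a rough Python sketch that checks whether a robots.txt file would turn a crawler away. It assumes the Alexa crawler identifies itself as `ia_archiver`, which is worth confirming against the current FAQ; the sample robots.txt text and the helper name `is_blocked` are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one crawler to stay out of the whole site.
# The "ia_archiver" user-agent name is an assumption, not taken from the FAQ text above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str = "/") -> bool:
    """Return True if the robots.txt rules disallow `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("ia_archiver blocked:", is_blocked(ROBOTS_TXT, "ia_archiver"))
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot"))
```

A robots.txt file is only a request, so this tells you what a well-behaved crawler should do, not what any particular crawler actually does.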