Linux Format

Saving the web

Jonni Bidwell interviews digital librarian par excellence Alexis Rossi, a director of the Internet Archive.

-

Liberating government docs that should be free in the first place

Alexis Rossi on the Internet Archive

Alexis Rossi is a director at the Internet Archive, dubbed the Library of Alexandria 2.0, but which she jocosely describes as the biggest website no one’s ever heard of. We caught up with her to talk about big data, APIs, clay figures and to confess that we’ve been using the Archive to store our DVD images.

Linux Format: Tell us what led you to start working at the Internet Archive?

Alexis Rossi: When I was in college I wanted to be a book editor; I used to edit cookbooks and event guides and things like that. I used to work for a non-fiction publisher. I kind of fell in love with the Internet in 1994 – just the idea of being able to find other people who were interested in the same things as I was, because I came from a pretty small place, being able to communicate with friends who’d moved halfway round the world and not having to pay money for it. So I decided to switch from book publishing to Internet publishing and I worked for, I think the first official news aggregator on the web, ClariNet. They don’t exist anymore, but they were founded in 1989 so when I started there in 1996 they were still publishing news in newsgroups, but very quickly thereafter started publishing on the web. So I guess I got sucked into technology because of my love for being able to communicate and share.

LXF: What do you do at the Internet Archive?

AR: I’m a director there and basically what I do is I’m in charge of all the digital media in the archive, whether it’s stuff from the Wayback Machine or movies or whatever. I’m also in charge of access projects. We just recently finished redoing the website, we’re still working on some improvements for that, but everything switched over big time around May. That’s been about a year and a half in the making. I also do a lot of speaking to people about the Archive in particular. I’ve been working with them off and on since 2000 – a long time.

LXF: Tell me a bit about the Internet Archive. How long has it been archiving the internet?

AR: Sure. The IA was established in 1996 and we started out by archiving the Internet. We are a non-profit digital library, and we are recognised as a library by the state of California. The first service we came out with was the Wayback Machine in 2001. In 2002 we started hosting music and some videos as well. Since then we’ve really expanded, so we have about 425 billion things in the Wayback Machine.

LXF: Yikes.

AR: Yikes indeed: We have about 8 million texts, about 2 million movies, 2.5 million audio items and we archive about 60 channels of television 24 hours a day. We also archive software, that’s one of the things a lot of people know us for these days.

LXF: Ha, ha. Our magazine has started piggy-backing off that a wee bit.

AR: Really, how so?

LXF: We used to seed torrents of our cover disc images on our own server, but disk space started to become a limiting factor. It was much easier to dump hundreds of gigabytes onto archive.org – they even provide torrent files, so we link to those from our own archive and everything works seamlessly. Someone even set up a dedicated collection for us. So we’re very grateful for all of that.

AR: Great, are you using the S3-like API that we have?

LXF: No, I just push the big friendly upload button. Tell me of this API.

AR: We have a bunch of APIs available at http://archive.org/help where you’ll see pointers to a lot of different things. One of those things is an upload API that is very similar to Amazon’s S3, so anyone who’s ever used S3 will be able to use this right out of the box. It’s really good for bulk uploading. You can also use it to get data out, but it’s more designed for the other direction. So if you have large collections of things we can create a collection for you so you have your own place in the archive. Then you can upload as much stuff as you want into it. Hopefully in about a year or so you’ll be able to create your own collections, but right now we do it. We provide a lot of guidance around metadata – it’s kind of pointless to put things in the archive without metadata because then nobody else can find them.
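As a rough illustration of what that S3-like upload looks like on the wire, here is a sketch in Python. The endpoint and header names follow the ias3 documentation linked from http://archive.org/help; the item identifier, filename and keys below are made-up placeholders, and the exact metadata fields you need depend on your collection.

```python
# Sketch of uploading to the Internet Archive's S3-like API with a
# plain HTTP PUT. An item behaves like an S3 bucket; metadata is
# attached via x-archive-meta-* headers rather than a separate call.
from urllib.request import Request

S3_ENDPOINT = "https://s3.us.archive.org"

def build_upload_request(item, filename, data, metadata, access_key, secret_key):
    """Build a PUT request that creates (or extends) an item."""
    headers = {
        # IA uses its own "LOW access:secret" scheme, not AWS signatures.
        "authorization": f"LOW {access_key}:{secret_key}",
        # Create the bucket automatically if it doesn't exist yet.
        "x-archive-auto-make-bucket": "1",
    }
    # Item-level metadata is passed as x-archive-meta-* headers.
    for field, value in metadata.items():
        headers[f"x-archive-meta-{field}"] = value
    url = f"{S3_ENDPOINT}/{item}/{filename}"
    return Request(url, data=data, headers=headers, method="PUT")

req = build_upload_request(
    "lxf-cover-disc-demo",                      # hypothetical identifier
    "lxf123.iso",
    b"...disc image bytes...",
    {"title": "LXF cover disc 123", "mediatype": "software"},
    "ACCESSKEY", "SECRETKEY",
)
# urllib.request.urlopen(req) would perform the actual upload.
```

The same request can be made with any HTTP client – curl with a couple of `--header` flags works just as well for one-off uploads.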

LXF: What are some interesting ways that people are using some of the new Internet Archive APIs?

AR: My favourite one is called the RECAP archive, which allows people access to PACER documents (PACER is a paywall-controlled system for accessing federal court records in the US). The RECAP software was developed jointly at Princeton and Harvard. What the RECAP people did was make a browser plugin for people who are paying for PACER records to automatically upload them to the Internet Archive, where they can then be accessed by anyone for free. So it’s basically liberating government documents that should have been free in the first place. We’ve got over a million court records through that and it’s all done through APIs; people don’t have to sit there and hold everybody’s hand. One of the main contributors to this project was the late Aaron Swartz, who liberated a tranche of PACER documents in 2008. This corpus was used to seed RECAP’s database, which you can browse at http://archive.recapthelaw.com.

LXF: So speaking of legal stuff, European law has recently introduced so-called ‘right to be forgotten’ legislation, which forces Google to remove (or at least think about removing) links from its search results based on user requests.

AR: We have always taken things down from the Internet Archive for various reasons, primarily when somebody has a copyright claim and they submit a DMCA takedown request. We do occasionally take things out of the Wayback Machine for other reasons, when someone is trying to protect their address, or other personal privacy type issues.

LXF: What about people just trying to photoshop their digital pasts?

AR: For the most part we’re reluctant to take things down; on the other hand, when you have 420 billion things, taking down, say, two of them probably isn’t going to be a problem. You have to weigh up how important something is against any personal privacy concerns.

LXF: So I read somewhere that you store about 12 petabytes of data?

AR: Oh, we store way more than that now. Unique data is about 24PB now, everything is doubled though because we store everything twice, and then there is some storage on top of that that we use for running the search engine and the website and those sorts of things. I think all in all it’s about 55PB of spinning disks at this point.

LXF: Again, yikes. And at what rate is this data monster growing?

AR: I would say that we probably get about 5PB of new unique data in a year… ish.

LXF: So as a non-profit, where are the funds coming from to support this kind of behemoth storage?

AR: About 40% of our revenue comes from scanning books, so libraries hire us to scan physical books and turn them into eBooks for them. Another 20% comes from us archiving the web for people. We have a service called http://archive-it.org, which is a subscription-based web archiving service that allows non-technical people to give us a list of things that they want crawled and tell us how often they want to crawl it, and we go out and perform those crawls for them. Then they can download things and they end up with a searchable collection of websites. This will often be libraries, archives, museums or state government organisations, eg they might ask to archive all the government websites of South Carolina.

We also do domain-sized crawls for national libraries – the National Library of Australia, the Library of Congress or whoever hires us – so they say “Give me all of New Zealand” and we go and crawl that; they take the data and we also put the data in the Wayback Machine, and the same thing is true for Archive-It.

Our mission is universal access to all knowledge, so all of the for-pay work we do, whether it’s digitising books or archiving the web, ends up in the public. So if somebody pays us to crawl all of New Zealand, then all of New Zealand ends up in the Wayback Machine. The other 40% of our funding comes from foundations and donations.

LXF: Forgive me this short digression, but the Futurama episode The Why of Fry featured some information-obsessed aliens whose goal is to collect ‘all the information in the universe’. To this end they construct a Death Star-like object called the Infosphere, which is a biological memory bank. The final stage of their data pull involves scanning the sphere itself.

AR: Cool.

LXF: Indeed, a fine episode, but only tenuously relevant to my next question, which is how exactly do you go about scanning the web without running into these sorts of recursive difficulties? What other difficulties do you come up against trying to harvest all this data?

AR: Well, it’d be good for redundancy if someone/thing out there was archiving the Wayback Machine, but obviously we don’t crawl our own stuff. But there’s also a lot of stuff we can’t crawl, we can’t crawl behind passwords and paywalls, we can’t crawl something that is dynamic – we can’t crawl all of Google because robots don’t work that way. There are more and more technologies out there on the web that are difficult for robots to deal with. So we run into walls all over the place, but we do crawl about a billion pages per week.
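To make the “robots don’t work that way” point concrete, here is a toy sketch of what a polite breadth-first crawl does. A production crawler like Heritrix does vastly more (politeness delays, WARC output, deduplication); this only illustrates why a robot sees just the pages it is permitted to fetch and can reach by following links. The site, link graph and robots.txt below are invented for the example.

```python
# A minimal breadth-first "crawl": visit pages, follow links, and
# skip anything the site's robots.txt disallows.
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin

def crawl_order(start, links, robots_txt, limit=10):
    """Return the order pages would be visited, given a link graph
    (page -> list of hrefs) and the site's robots.txt text."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    seen, queue, visited = {start}, deque([start]), []
    while queue and len(visited) < limit:
        page = queue.popleft()
        if not rules.can_fetch("*", page):
            continue  # robots.txt disallows this URL, so skip it
        visited.append(page)
        for href in links.get(page, []):
            url = urljoin(page, href)
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return visited

# Toy site: /private is disallowed, so it never gets "archived".
site = {
    "https://example.org/": ["/a", "/private/secret"],
    "https://example.org/a": ["/"],
}
order = crawl_order("https://example.org/", site,
                    "User-agent: *\nDisallow: /private")
# order -> ['https://example.org/', 'https://example.org/a']
```

Anything behind a login form, generated on the fly by JavaScript, or simply never linked to falls outside this loop entirely, which is exactly the wall the Archive keeps running into.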

LXF: Tell me a little about the crawling software you use?

AR: We use an open source crawler called Heritrix, which we developed in conjunction with the Nordic national libraries, starting in 2003. We’ve also been experimenting with some new ways of crawling stuff that do a better job of mimicking a browser, rather than just being a robot. But we are still running into problems all over the place and the Archive-It team spends a lot of its time changing its crawling practices and developing ways to crawl different kinds of media. That works better on the more small-scale crawls that Archive-It does because they’re more able to do focussed QA with their customers, but for the large, one billion pages per week crawls that my team does, it becomes difficult to keep up with what you’re missing.

We end up having to do a lot of analysis to figure out where the gaps are. You make changes that are going to get the big wins as opposed to the little wins. The web is becoming more and more difficult to archive.

LXF: What other open source software do you use?

AR: Well, the Wayback Machine itself is all open source code, our book reader software is open source and basically as much of our stack as can be open source is open source. So we’re all Linux-based, of course. We have a website called Open Library ( https://openlibrary.org) where we catalogue millions of books for people to download (copyright permitting) or borrow. That is all open source and people do contribute to it. Our experience starting our own open source projects has been, for the most part, we make all the changes. There are definitely things that we do just for the sake of expediency that we don’t open source. But we also use JWPlayer for audio and video playback, we also try to make sure that we’re still deriving open source formats – so we still derive OGGs from MP3s even though most people don’t use them.

We’ve been trying to work with Wikipedia a little bit – they have Wikimedia Commons which holds a lot of media, but they only want things in open formats. People don’t know how to create OGG files so Wikimedia ends up not getting a lot of things. So we’ve been doing a little experiment with Wikimedia in Germany that would allow people to upload whatever they want, whether it’s an MP4 or a WMV or whatever to Wikimedia, we’ll create the OGGs and Wikimedia can grab them and put them in Wikimedia Commons. I do think it’s important to do that transcoding, even if the open formats tend not to be the ones accessed on our site.
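The transcoding step described above can be sketched with ffmpeg. The Archive’s actual derive pipeline is more involved; the codec choices here (Theora video and Vorbis audio in an Ogg container) and the helper below are assumptions for illustration only.

```python
# Sketch of deriving open Ogg formats from whatever a user uploads:
# video files become Theora/Vorbis .ogv, MP3s become Vorbis .ogg.
import subprocess
from pathlib import Path

def ogg_derive_command(source):
    """Build the ffmpeg command that derives an Ogg file from source."""
    src = Path(source)
    out = src.with_suffix(".ogg" if src.suffix == ".mp3" else ".ogv")
    if out.suffix == ".ogg":
        codecs = ["-vn", "-c:a", "libvorbis"]              # audio-only
    else:
        codecs = ["-c:v", "libtheora", "-c:a", "libvorbis"]
    return ["ffmpeg", "-i", str(src), *codecs, str(out)]

cmd = ogg_derive_command("upload.mp4")
# subprocess.run(cmd, check=True) would perform the transcode.
```

Running this for every incoming file is cheap to automate, which is the point: uploaders never have to know what an Ogg file is.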

LXF: Tell us a little about the OSCON talk you did this year.

AR: It was a joint effort with Vicky Brasseur who used to work at the Archive, she actually lives here in Portland now and does her own consulting work, but basically she’s an engineer. So I gave an overview of the Internet Archive and she talked about the APIs that you can use to get things into and out of the archive. I think it’s valuable for people, our organisation is very friendly, but not very slick. We definitely don’t have the best documentation, nor do we have 15 people waiting to answer your query. So I generally think it’s helpful to do this talk at places like OSCON because it’s a chance for people to get a good overview of what we have so that they can figure out what to ask questions about. Because we do want to help people, we do want them to put things in the Archive and use things therefrom.

LXF: OSCON’s a reasonably diverse and very friendly conference, but what historically have your experiences been as a woman in technology?

AR: Oh gosh! It is way better now than it used to be. I started back in ‘96 and it was, um, interesting: I was asked to leave meetings every once in a while with comments like “Oh, the talk’s going to get too technical for you, sweetie”, or “the receptionist who’s the only other female in this company is on her break, would you mind answering the phones?”. But that doesn’t happen as much anymore, the Internet Archive in particular is a very diverse place, we have all sorts. It is still male-dominated, that is the reality of the situation as far as engineering talent goes. These sorts of conferences are always going to be pretty male dominated, I walked in this morning and I had to walk past about 400 people to get in and I think I saw maybe 10 women.

Long-serving Archive members are honoured with a three-foot ceramic statue by sculptor Nuala Creed. Image credit: Jason Scott (CC BY 2.0)
