Status:
Active Project
Project Leaders: Institut National de l'Audiovisuel, Netarchive.dk
Approved August 2012
Technical context and motivations
Technically, Web archiving can be split into three distinct activities:
- identifying contents that need to be archived;
- acquiring these contents;
- storing these contents for future access.
The second activity, also known as Web Scraping, consists of downloading an HTML document and interpreting its content to determine which resources are required to render it. This is what Web Browsers specialize in.
A way to achieve this, often favoured for economic reasons, consists of building custom Web crawlers (e.g. Heritrix) that try to mimic Web Browsers. These can be tuned to be extremely fast and to run on very few resources, so that many of them can run on a single server and harvest hundreds of Web Pages at the same time. Specialized crawlers can be built for specific needs (targeted harvests) using various technologies (e.g. Python, Perl, Java, cURL or Wget). They are relatively cheap to develop and tend to achieve good results.
As the Web evolved towards more and more interaction-rich content featuring complex JavaScript programs embedded in webpages, the task of crawlers that try to mimic Web Browsers became more and more tedious. Their approach is still necessary, but is no longer sufficient. New approaches are being considered and tested, including automating whole Web Browsers (e.g. Selenium), operating Web Browsers manually, or building crawlers on top of browser engines stripped down to the bare minimum (e.g. PhantomJS).
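As a rough illustration of the "interpreting its content" step, the sketch below parses an HTML document and lists the URLs of resources a browser would fetch to render it. The names `RESOURCE_ATTRS`, `ResourceExtractor` and `extract_resources` are hypothetical, and the tag table is an assumption that is far from exhaustive (it ignores CSS imports, `srcset`, and anything loaded by JavaScript).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a small, non-exhaustive table of tags and the attribute
# that points at a resource needed for rendering.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
}


class ResourceExtractor(HTMLParser):
    """Collects absolute URLs of resources referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.resources.append(urljoin(self.base_url, value))


def extract_resources(html, base_url):
    """Return the resource URLs found in `html`, in document order."""
    parser = ResourceExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

A crawler built this way only sees what is written in the markup, which is why pages that assemble their content with JavaScript are difficult for this approach.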
These new complementary approaches share a common problem: a storage backend needs to be written for each new crawler. This introduces code duplication that is expensive (more code to maintain) and can be very problematic (errors fixed in one version but not the other). It also makes content deduplication more complicated because of the lack of a centralized writing agent. Furthermore, Browsers that can be automated to work as crawlers have notoriously poor programming interfaces for reaching the binary content of the resources in a Web page, which raises the question of how the resources to be stored can actually be acquired.
A possible solution to these issues is to develop an HTTP proxy that captures all communications between the crawler and the Web and adds them to the archive. Such a proxy addresses content capture, code duplication and the lack of a centralized writing agent, as many crawlers can use the same proxy at the same time. It presents some challenges but has many advantages in terms of both cost efficiency and technical design.
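To illustrate why a centralized writing agent simplifies deduplication, here is a minimal sketch of a digest index that a single writer could consult before storing a payload. The class name `DedupIndex` and the choice of SHA-1 are assumptions made for the example, not part of the project.

```python
import hashlib


class DedupIndex:
    """Remembers payload digests so identical content is stored once."""

    def __init__(self):
        # Maps a payload digest to the first URL stored with that payload.
        self._seen = {}

    def record(self, url, payload):
        """Return ("store", digest) for new content, or
        ("duplicate", first_url) when the same bytes were already seen."""
        digest = hashlib.sha1(payload).hexdigest()
        first_url = self._seen.get(digest)
        if first_url is None:
            self._seen[digest] = url
            return ("store", digest)
        return ("duplicate", first_url)
```

When each crawler has its own storage backend, each one would need its own copy of such an index, and the copies would have to be kept in sync.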
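The idea can be sketched with Python's standard library: a small forwarding proxy that records every response it relays in one shared log. This is only an illustration under strong assumptions, not the Live Archiving Proxy itself. The names `CaptureLog` and `make_proxy` are hypothetical, only plain-HTTP GET requests are handled, most headers are dropped, and redirects are followed by the proxy rather than recorded hop by hop.

```python
import http.client
import http.server
import threading
import urllib.error
import urllib.request


class CaptureLog:
    """Single writing agent shared by every crawler using the proxy."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []  # list of (url, status, body)

    def write(self, url, status, body):
        with self._lock:
            self.records.append((url, status, body))


def make_proxy(log, host="127.0.0.1", port=0):
    """Build (but do not start) a capturing proxy server."""
    # An empty ProxyHandler makes the proxy contact origin servers directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            url = self.path  # proxy clients send the absolute URL here
            try:
                with opener.open(url, timeout=10) as upstream:
                    status, body = upstream.status, upstream.read()
                    content_type = upstream.headers.get("Content-Type")
            except urllib.error.HTTPError as err:
                # 4xx/5xx answers are still content worth archiving.
                status, body = err.code, err.read()
                content_type = err.headers.get("Content-Type")
            except (OSError, ValueError, http.client.HTTPException):
                # Unreachable server, malformed URL or malformed response.
                self.send_error(502, "Bad gateway")
                return
            log.write(url, status, body)
            self.send_response(status)
            if content_type:
                self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return http.server.ThreadingHTTPServer((host, port), Handler)
```

Calling `make_proxy(CaptureLog(), port=8080).serve_forever()` would start it on a fixed port. Any crawler or automated browser configured to use that address as its HTTP proxy then feeds the same log, without needing a storage backend of its own.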
Project goal
Our goal for this project is to deliver a Live Archiving Proxy that is production-ready, highly performant, easily distributable and archive format agnostic. We also want to take advantage of existing code as much as possible.
Production-ready
The proxy will be resilient to “bad” HTTP clients (malformed requests) and servers (malformed responses). It will correctly handle as much of the HTTP protocol as possible and fail gracefully otherwise.
Highly performant
The proxy will be able to handle at least 250 requests per second on current average hardware.
Easily distributable
The proxy will be developed in Perl and distributed as a standalone package working out of the box on all major Linux distributions (64-bit only). No compilation will be required. The distributed package and sources will be released on GitHub or on Ina’s public SVN server, under the Artistic License 2.0.
Archive format agnostic
The proxy will use a “writer plugin” API to enable archives to be written in any format. The plugin API will be based on sockets so that plugins can be developed in any language. However, a generic plugin codebase in Java will be distributed along with the Live Archiving Proxy to handle the socket manipulation work and facilitate plugin development for partners.
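A socket-based plugin API is language-neutral because both sides only need to agree on how bytes are framed on the wire. The sketch below shows one possible framing, written in Python for brevity: a 4-byte length, a JSON metadata header, then the raw body. This wire format and the function names `send_record` and `receive_record` are assumptions made for illustration. They do not describe the actual plugin protocol.

```python
import json
import socket
import struct


def send_record(sock, url, status, body):
    """Proxy side: frame one captured response and send it to a plugin."""
    header = json.dumps(
        {"url": url, "status": status, "length": len(body)}
    ).encode("utf-8")
    # 4-byte big-endian header length, then the header, then the body.
    sock.sendall(struct.pack("!I", len(header)) + header + body)


def _read_exact(sock, count):
    """Read exactly `count` bytes or raise if the peer closes early."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("socket closed in the middle of a record")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def receive_record(sock):
    """Plugin side: read one framed record and return (metadata, body)."""
    (header_len,) = struct.unpack("!I", _read_exact(sock, 4))
    meta = json.loads(_read_exact(sock, header_len).decode("utf-8"))
    body = _read_exact(sock, meta["length"])
    return meta, body
```

A writer plugin would loop on `receive_record` and serialize each record into its archive format of choice. The same loop could be written in Java or any other language with socket support.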
Reusing existing code
After reviewing existing HTTP proxy solutions, we found that they fall into two distinct categories. The first category is focused on caching and performance. This category includes Squid, which is well known and robust, but written in C++ and with a very large codebase. Because of this, we found that getting familiar with the framework would take too long and be too costly.
The second category is focused on monitoring and debugging applications that use HTTP as a transport layer. It includes Fiddler, which is not open source, as well as several open source Java-based proxies (Membrane Monitor, WebScarab, Sahi, LittleProxy). Among these, performance and stability are modest, as production-readiness is not the main goal. None of these solutions offers the robustness, flexibility and tuning possibilities of the libraries already developed and tested at Ina in other “real world” projects. Using our own existing code will also save us a lot of time when debugging and tuning performance.
Project phases
Phase 1: Ina develops a first prototype of the Proxy to share with partners;
Phase 2: Netarchive develops the Java WARC writer plugin and tests it with the Proxy prototype;
Phase 3: Designated partners test the Proxy and WARC writer, report bugs and give feedback;
Phase 4: The Proxy and WARC writer are modified according to feedback.
Deliverables (Ina)
- A standalone package that contains the Live Archiving Proxy software (via GitHub);
- The source code of the Live Archiving Proxy (via GitHub or SVN);
- A generic Writer Plugin library in Java (via Maven);
- The source code of the generic Writer Plugin library (via Maven);
- Documentation for the Live Archiving Proxy.
Deliverables (Netarchive)
- A WARC Writer Plugin library in Java;
- The source code of the WARC Writer Plugin library;
- Documentation for the WARC Writer Plugin.




