Live Archiving HTTP Proxy

Status: Past Project

Project documentation

Author: David Rapin

 LIVE ARCHIVING PROXY PROJECT REPORT, 2013

 

Project overview

Project Leaders: Institut National de l’Audiovisuel, Netarchive.dk

Approved August 2012
The Live Archiving Proxy (LAP) is an HTTP proxy that captures the traffic flowing through it. The LAP delegates the handling of the captured data to one or more writers using a simple network protocol. Writers exist for the DAFF, WARC and ARC formats. Using an HTTP proxy for Web archiving makes it possible to crawl with any HTTP client (Heritrix, PhantomJS, HTTrack, Scrapy, etc.) while keeping a unified and simple storage backend. The LAP is designed to be high-performance, easy to use and archive-format agnostic. It runs on any 64-bit Linux system.
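
To illustrate how an ordinary HTTP client could crawl through a capturing proxy such as the LAP, here is a minimal sketch using only the Python standard library. The proxy address (localhost:8080) and the helper names build_lap_opener and fetch_through_proxy are assumptions made for illustration; this report does not specify the LAP's listening address or any client-side API.

```python
import urllib.request

# Hypothetical address: the report does not state the LAP's host or port.
LAP_PROXY = "http://localhost:8080"


def build_lap_opener(proxy_url: str = LAP_PROXY) -> urllib.request.OpenerDirector:
    """Return an opener that routes plain HTTP requests through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(proxy_handler)


def fetch_through_proxy(url: str, proxy_url: str = LAP_PROXY,
                        timeout: float = 10.0) -> bytes:
    """Fetch url via the proxy and return the response body.

    The client only sees a normal HTTP response; whatever the proxy
    records and forwards to its writers happens on the proxy side.
    """
    opener = build_lap_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With a proxy actually listening at the assumed address, a call such as fetch_through_proxy("http://example.org/") would retrieve the page as usual. The client needs no archive-specific code, which is the point of placing the capture logic in the proxy.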

Project phases

Phase 1: Ina develops a first prototype of the Proxy to share with partners;
Phase 2: Netarchive develops the Java WARC writer plugin and tests it with the Proxy prototype;
Phase 3: Designated partners test the Proxy and WARC writer, report bugs and give feedback;
Phase 4: The Proxy and WARC writer are modified according to feedback.

Deliverables (Ina)

  • A standalone package that contains the Live Archiving Proxy software (via GitHub);
  • The source code of the Live Archiving Proxy (via GitHub or SVN);
  • A generic Writer Plugin library in Java (via Maven);
  • The source code of the generic Writer Plugin library (via Maven);
  • Documentation for the Live Archiving Proxy.

Deliverables (Netarchive)

  • A WARC Writer Plugin library in Java;
  • The source code of the WARC Writer Plugin library;
  • Documentation for the WARC Writer Plugin.