a project of the Electronic Archives Project of the Graduate School of Library and Information Science at the University of Illinois

1.0 Introduction

The CEP Project (a.k.a. PEP, Preserving Electronic Publications) is a web site archiving system developed with Open Source software for Unix/Linux. CEP makes it possible for organizations to periodically download and retain archival copies of their evolving web site(s). CEP uses a web spider, wget, to traverse and download a target web site's pages, and CVS to archive the pages and their subsequent changes. CEP also uses a variety of software packages to create and maintain historical data and to provide summary statistics about the web site's content.
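
To illustrate what archiving "the pages and their subsequent changes" involves, the following Python sketch compares two downloaded copies of a site and sorts the files into added, changed, and removed. It is not CEP's own code: the function names snapshot and classify_changes are hypothetical, and the use of SHA-256 digests to detect changes is an assumption made for this example (CVS performs its own comparison at check-in).

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file path (relative to root) to a SHA-256 digest of its bytes.

    Files inside CVS administrative directories are skipped, since they
    belong to the working copy rather than to the harvested site.
    """
    root = Path(root)
    result = {}
    for p in sorted(root.rglob("*")):
        rel = p.relative_to(root)
        if p.is_file() and "CVS" not in rel.parts[:-1]:
            result[rel.as_posix()] = hashlib.sha256(p.read_bytes()).hexdigest()
    return result


def classify_changes(previous, current):
    """Compare two snapshots and list the added, changed, and removed paths."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(
        path
        for path in set(previous) & set(current)
        if previous[path] != current[path]
    )
    return {"added": added, "changed": changed, "removed": removed}
```

In a CVS working copy, files in the "added" list would typically need cvs add, files in the "removed" list would need cvs remove, and a single cvs commit would then record all three groups as one new revision of the site.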

This document addresses only the installation of the constituent software units onto the CEP host computer. A companion document, the CEP Operations Guide, addresses ongoing operator actions in control of the CEP system, and how its harvested materials are managed and utilized.

The packages used to create CEP include Fedora, Apache, CVS, Perl, GD graphic tools, TreeTagger, CVS ChangeLogBuilder, XMLFile OAI-PMH Data Provider, and wget. Many of these packages are included in a Red Hat/Fedora installation; a few must be downloaded from their individual sites, and many are also available from http://rpmfind.net/.

CEP integrates these stand-alone packages into a single system through the use of CGI, Perl and Java processes. The CEP system uses the wget web spider to retrieve web pages from a target web site defined by an XML configuration file, which is itself defined through a CEP-provided web page. After a site is retrieved, other processes generate metadata and statistical data, and the web site data is then presented to the operator for manual or automatic check-in to the CVS archive. An overview of the data flow is represented in Figure 1.

Government agency web site data flows through the wget web spider, then through CEP post-spider processing and CVS acceptance to both CVS version-controlled storage and statistics and metadata outputs.
Data from an XML configuration file, itself arising from a web editing form, feeds into and guides the wget web spider.
A spider web control table provides control inputs to the editing form, the wget web spider, and the post-spider processing and CVS acceptance.
Automatic controls alternatively provide control inputs only to the wget web spider and to post-spider processing and CVS acceptance.
Figure 1 - CEP Data Flow
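
The first step of this data flow, in which an XML configuration file guides the wget web spider, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CEP's implementation: the XML element names (site, name, start_url, max_depth, wait_seconds), the example URL, and the function build_wget_command are all hypothetical, because the actual CEP configuration schema is not shown in this guide. Only the wget options themselves (--recursive, --level, --no-parent, --wait, --directory-prefix) are standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout, invented for this example only.
EXAMPLE_CONFIG = """\
<site>
  <name>example-site</name>
  <start_url>http://www.example.org/</start_url>
  <max_depth>5</max_depth>
  <wait_seconds>1</wait_seconds>
</site>
"""


def build_wget_command(config_xml, download_root):
    """Translate the (assumed) XML settings into a wget argument list."""
    site = ET.fromstring(config_xml)
    name = site.findtext("name")
    url = site.findtext("start_url")
    # Fall back to conservative defaults when an optional element is absent.
    depth = site.findtext("max_depth", default="5")
    wait = site.findtext("wait_seconds", default="1")
    return [
        "wget",
        "--recursive",              # follow links from the start page
        "--level=" + depth,         # limit recursion depth
        "--no-parent",              # never climb above the start URL
        "--wait=" + wait,           # pause between requests
        "--directory-prefix=" + download_root + "/" + name,
        url,
    ]
```

The function only builds the argument list, so the command can be inspected or logged before anything is downloaded. Passing the list to subprocess.run would launch the spider, after which the post-spider processing and CVS check-in stages would follow.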

CEP was developed by Larry S. Jackson of the Graduate School of Library and Information Science at the University of Illinois, Urbana-Champaign (UIUC), with funding from Institute of Museum and Library Services (IMLS) National Leadership Grants and from the Illinois State Library (ISL). We encourage you to visit the CEP home page for all of the details.

This guide provides information about CEP installation, configuration, and frequently asked questions. It walks you through a CEP installation on a default Fedora 3 system. Deviations from the default install will require the user to adjust paths and configuration files to match the local system.


Copyright © 2001-2007, University of Illinois. All rights reserved. The contents of these electronic reference materials may not be reproduced in whole or in part without the prior permission of the University of Illinois. For information concerning reproduction of this material, please contact the Principal Investigator.