GSLIS Electronic Archives Project

The GSLIS Electronic Archives project began in October 2001 with an Institute of Museum and Library Services (IMLS) 'National Leadership Grant' to the Illinois State Library (ISL), the State Library of Ohio, the Illinois State Archives, and the Graduate School of Library and Information Science (GSLIS) at the University of Illinois, Urbana-Champaign (UIUC).

The work of this group now includes four major initiatives: (1) development and extensive operation of an open-source, highly automated website harvesting and archival system (CEP), used so far by Illinois, Alaska, Arizona, Montana, North Carolina, Utah, and Wisconsin to archive over 2.6 million files (as of July 2005); (2) design and operation of the Illinois Government Information search engine, based on SWISH-E and on accessible document-access surrogates generated from CEP operations, serving over 225 websites and 500,000 documents of Illinois State Government; (3) the Illinois Electronic Documents Initiative (ILEDI), providing permanent, accessible, 'digital library' access to archived official publications of the State of Illinois which are in electronic form; and (4) participation, with OCLC and several State Libraries, in the 'Echo DEPository' Library of Congress NDIIPP grant, exploring tools for the discovery of high-value web documents and the possible interoperability of various popular online access systems. The remainder of this page addresses the CEP project. For more information on the other projects, see the Principal Investigator's homepage.

Subsequent funding has come from (1) another related IMLS National Leadership Grant, (2) a related IMLS Library Services and Technology Act grant, (3) grants awarded by the Illinois State Library, a Division of the Office of Secretary of State, (4) participation in a Library of Congress NDIIPP grant, and (5) GSLIS itself.

The major papers, reports, and public presentations of the group are online.

GSLIS contact: Larry S. Jackson, Principal Investigator

Capturing Electronic Publications (CEP)

former title: Preserving Electronic Publications (PEP)

At GSLIS, development of harvesting and version control facilities has produced a system that anyone may use at no charge in activities related to digital information archiving. CEP provides web-based means to configure and control one or more groups of web spiders, each of which harvests one website. In the interest of disk space economy, and for the generation of some interesting managerial reports, harvested materials are deposited into repositories of the "Concurrent Versions System" (CVS) software. Once the materials are copied into CVS, the original downloads may be discarded (using a program supplied in CEP) to free up disk space.

This system also provides many valuable information management by-products:
(1) an alerting service, informing of additions/changes/deletions within a watched website,
(2) statistical summaries of website sizes, constituent files, and metadata use, such as the most recent result set for all of Illinois,
(3) a very extensive statistical, graphical, and interactive summary of the history and contents of each archived website,
(4) a list of all host computers currently mentioned within CEP spider configuration files,
(5) a list of the broken links observed during a harvesting of a website,
(6) a list of website host computers not already known to CEP in either a spider definition or a file of known hosts not to be archived (these entries might suggest a need to define a new harvester), and
(7) an OAI-PMH metadata server, providing metadata from harvesting results to other researchers, such as the continually updated metadata server for all Illinois websites.
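
To make by-products (1) and (6) more concrete, here is a minimal Python sketch of how such reports could be derived from harvesting results. It is an illustration only and is not taken from the CEP software: the function names (digest, diff_snapshots, unknown_hosts), the snapshot layout (a mapping from file path to content digest), and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib
from urllib.parse import urlparse


def digest(content: bytes) -> str:
    """Return a hex digest identifying one harvested file's content (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two harvests of one website, each given as {path: digest}.

    Returns the paths that were added, changed, or deleted between them,
    which is the information an alerting service would report.
    """
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    changed = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "changed": changed, "deleted": deleted}


def unknown_hosts(links: list[str], known_hosts: set[str]) -> list[str]:
    """List hosts seen in harvested links that appear in no known-host list.

    known_hosts stands in for the union of hosts named in spider definitions
    and hosts deliberately excluded from archiving (an assumed representation).
    """
    known = {h.lower() for h in known_hosts}
    hosts = {urlparse(url).hostname for url in links}
    hosts.discard(None)  # links such as mailto: carry no host
    return sorted(h for h in hosts if h not in known)
```

A host returned by unknown_hosts would correspond to an entry that might suggest defining a new harvester, as described in item (6).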

Please also see the PEP project homepage and the CEP project homepage at ISL.

Downloads

The downloadable PEP software and its documentation, version 1.1.0, are provided here. Previous versions have been withdrawn as obsolete, since their command-line initiation of functions remains possible within the current version (with considerably fewer bugs). Our portion of the software is provided under a slightly modified form of an open-source license. Other software has its own terms of use, and while these are mostly unrestrictive, you should read their requirements to ensure you qualify. If you will be using CEP in your research or your work, please drop a note to the Principal Investigator and let us know of your activities and results.

Research Assistants

The following are gratefully acknowledged for their assistance in the University of Illinois portion of this research: Sai Deng, Tim Donohue, Sarah L. Eiben, Sungok Hong, Xiao Hu, Kyung-Ja (KJ) Hyun, M. Sharif Islam, Annette B. Jackson, Lanie Klinkner, Wing Yee (Vincci) Kwong, Weiguo (David) Liao, Guixian Lin, Edith Pfeifer List, Terry McLaren, Ozwaldo (Ozzie) Meza, Karishma Muntashir, Brynnen Owen, Haiyan Pei, Jian Wu, Xuan Xie, Peng Xu, Lin Yang, Huamin Yuan, Yiyi Zeng, Yan Zhang, Jing Zhang, Jing Zhao, and Guojun Zhu.

Affiliated Agencies

The Institute of Museum and Library Services
1100 Pennsylvania Avenue
Washington, DC 20506 USA
(202) 606-8536 voice
http://www.imls.gov/ . imlsinfo@imls.gov

 
The Illinois State Library
300 S. 2nd Street
Springfield, IL 62701-1796 USA
(217) 785-5600 voice
http://www.cyberdriveillinois.com/library/isl/isl.html

 
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street
Champaign, IL 61820-6211 USA
(217) 333-7197 voice . (217) 244-3302 fax
http://www.lis.uiuc.edu/ . gslis@uiuc.edu