A major and invaluable undertaking, the HRWA reflects major research institutions' growing recognition of the importance of web archiving: capturing and preserving websites and other web-only materials for future research. Earlier this month the Round Table had the opportunity to conduct a brief interview with Tessa Fallon, Web Collection Curator for the HRWA. Many thanks to Tessa for her insightful replies, which highlight some of the complex issues at work in the HRWA and also touch upon future directions for the project:
Q: What are your primary responsibilities as Web Collection Curator?
A: My primary responsibilities revolve around the maintenance and development of our web archive collections. This includes (but is not limited to): selecting new sites, requesting permission from site owners, sending cataloging requests to our catalogers, testing sites for technical suitability, and managing crawls of our selected sites. In addition to the HRWA (managed jointly by me and co-curator Alex Thurman), I also manage the new Burke New York City Religions and the Rare Book and Manuscript Library web archives (both collections are in stages of development). Alex manages the Avery Architectural Library web archive, which includes sites related to historic preservation and architecture in New York City, and the University Archives collection.
Q: One of the main criteria for website inclusion in the HRWA is a perceived risk of disappearance. How do you determine that a website is at risk of disappearing?
A: This is a perennially tricky question, and we are continually refining our perception of what "at risk" means for a website. Some might argue that given the ephemerality of the web, all websites are at risk. For the HRWA, there are some criteria that are clearly defined: organizations that are at risk of persecution from hostile governments or other groups, organizations that have limited or threatened access to the internet, and sites that are static and presumably abandoned (no longer updated, in some cases for years). In our experience, sites may also disappear and reappear without notice, which makes risk difficult to gauge.
Q: Can you briefly explain the process of how a website is captured for inclusion in the archive?
A: The (very) abbreviated version of How Web Crawling Works: Sites are captured using a tool called a web crawler. A crawler can capture web content by crawling from link to link on a given site. So, if I sent a crawler to capture this blog, the crawler, starting at "nyhistoryblog.com," would capture all of the content on nyhistoryblog.com at that moment in time. The crawler creates a file (called a WARC file) that is then used by a program like the Wayback Machine to show the archived site (content captured by the crawler).
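Production crawls of the kind described above are run with dedicated tools (the Internet Archive's Heritrix crawler, for instance, writes the WARC files mentioned here). As a rough illustration only, the link-discovery step at the heart of any crawler can be sketched in a few lines of standard-library Python: extract the links on a page, resolve them against the page's address, and keep only those that stay on the same site. The function and class names below are invented for this sketch, and the blog URL is simply the one used in the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect the href targets of anchor tags -- the links a crawler would follow."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links ("/about") against the page URL.
                    self.links.add(urljoin(self.base_url, value))

def same_site_links(base_url, html):
    """Return only the links on the same host: a site-scoped crawler
    would queue these for capture and skip external destinations."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return {u for u in parser.links if urlparse(u).netloc == host}

page = '<a href="/about">About</a> <a href="http://other.example/x">Elsewhere</a>'
print(same_site_links("http://nyhistoryblog.com/", page))
# Only the on-site /about link survives; the external link is dropped.
```

A real crawler repeats this on every captured page, fetching each newly discovered in-scope link until the whole site has been walked, and records each response in a WARC file along the way.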
Q: The HRWA website states that the project team is currently pursuing other means of making the archive available in addition to the project page on the Internet Archive. What additional means are you considering?
A: As part of the grant, we are attempting to develop a portal that would allow us to provide a local index and interface for our archived web sites. This is not yet available to the public. Portal development is spearheaded by Stephen Davis, Director of Library Digital Programs Division, programmer David Arjaniks, and web designers Erik Ryerson and Eric O’Hanlon.
If you’d like to learn more about the HRWA, check out the highly informative