Research and other libraries are priceless: they hold, organize and provide sensible, generally open access to many of the treasures of our species' knowledge. When mass media, popular culture and/or governmental information flows fail to make available the objective and relevant information needed to enable us to make informed decisions about our lives, businesses or planet, libraries usually will provide it. As both a first and last refuge for knowledge, libraries could come to play an increasingly expansive and critical role in society, given this need, if we can develop and better keep pace with the expanding role of technology in scholarly and educational communication and information access. Librarians, and the technological directions we choose to develop and/or follow, are more important on a societal scale than most of us think. Libraries and related services are not inexpensive to develop and maintain, though, and, with the information boom that both preceded the advent of the Web (in print resources) and continues to follow it (in both print and digital resources), they are not keeping up with the large numbers of significant information resources being produced. An important contribution toward helping libraries work better and have impact on a more expansive scale is therefore the development and judicious use of machine-assistance software, technologies and services that amplify expertise in library collection building. Discussed in this article are two projects in this area.
Data Fountains and iVia described
Challenges of context — Cooperation and better engagement with new technologies
Specific iVia technologies
Product definition, metadata modification and record export
Our projects, iVia (http://ivia.ucr.edu) and Data Fountains (http://datafountains.ucr.edu), are ongoing efforts to, respectively, develop new open source software for and with the library community and, based on this, create new digital library/library finding tool services. Both are public domain, open source, open service efforts in machine assistance that will enable digital libraries and libraries to create or augment metadata/data collections and associated finding tools (e.g., library catalogs, portals, subject directories) for digital information objects through:
A metadata generation utility and service for identifying and applying natural language fields including significant key phrases and contents for descriptions/summaries.
A metadata generation utility and service for applying controlled subject vocabularies/schema that have functioned for decades as knowledge community standards (i.e., Library of Congress Classifications and Library of Congress Subject Headings or LCC and LCSH). Use of these standards enables subject metadata for both print and digital records to be smoothly discovered and accessed without using often awkward subject metadata crosswalks/intermappings. This also allows us to seamlessly provide subject access to the great number of records for print resources held by libraries.
A metadata extraction utility and service for extracting natural language and other fields when metadata is supplied in the form of HTML/Dublin Core metatags on resources (e.g., titles, creators, etc. ... there are more than 20 of these currently handled).
A selected, rich full-text identification and extraction utility and service. Rich text is that natural language, placed in conventional document structures (e.g., abstracts and introductions), that is intended by authors to indicate what a resource is concerned with topically and otherwise. Rich text can be retained as full text or processed into key phrases.
An Internet resource discovery utility and service using expert guided and focused topic crawlers.
Both metadata generation and resource discovery made available in semi-automated (emphasizing expert interactive input and refinement) and fully automated modes.
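To make the first utility in the list concrete, the kind of key phrase identification it performs can be sketched as follows. This is a toy frequency-based illustration, not the actual iVia implementation (which is written in C++ and weights candidates far more carefully); the function name and stopword list are illustrative only.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}

def key_phrases(text, top_n=5):
    """Rank candidate phrases (runs of 1-3 adjacent non-stopword tokens)
    by how often they recur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    phrases = Counter()
    run = []
    for tok in tokens + [""]:  # trailing "" flushes the final run
        if tok and tok not in STOPWORDS:
            run.append(tok)
            continue
        # Count every sub-phrase of length 1-3 inside the run
        for i in range(len(run)):
            for j in range(i + 1, min(i + 4, len(run) + 1)):
                phrases[" ".join(run[i:j])] += 1
        run = []
    return [p for p, _ in phrases.most_common(top_n)]
```

Phrases that recur across a page (or across a site's rich text) float to the top and can seed descriptions or feed the subject classifiers discussed later.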
The software and services are of use to all who create and maintain portals, subject directories, catalogs or databases consisting of collections of Internet resources. They are intended to help these collections scale and meet the challenges of:
keeping up with the growing numbers of Internet resources of value to library end users;
the relatively small size of most collections and catalogs, and searches yielding few results;
moderating the high costs of manually created metadata; and,
charting new service areas.
The intent is to achieve these goals by assisting and amplifying, not replacing, the expertise of collection development and metadata experts. The emphasis is on partially redirecting the expertise now spent on routine tasks toward tasks actually requiring subject and metadata expertise. In achieving this, these tools and services should help our community, as a whole, better apply and extend its expertise in the form of larger and richer metadata collections, and thereby improve their use value for traditional, new and/or more specialized user communities that are increasingly being served by non-library interests/institutions.
Among our goals more generally is to better explore and chart the vast area between the pole of the MARC record and elaborate, rich, hand-crafted metadata, on the one hand, and the pole of Google-style data-swatch records (e.g., text representing the gist of a resource), on the other, in the interests of extending library/digital library collections and services into new areas. This is a large territory.
Data Fountains and iVia described
iVia is the open source (LGPL and GPL) system or code base upon which Data Fountains, National Science Digital Library Data Fountains, Library of Congress Exploratory Data Fountains and INFOMINE (http://infomine.ucr.edu) are built. Data Fountains, an evolved variant of the iVia system, is intended to be a self-service resource discovery, metadata generation and rich text extraction utility for collection building.
The iVia/Data Fountains code base has been under ongoing development for the last several years (Mitchell, 1997; Mason, et al., 2000; Mitchell, et al., 2003; Mitchell, 2005). It includes not only the resource discovery, metadata generation and extraction tools discussed in this article but very powerful back-end/archival/repository/portal management and retrieval engine features (Mitchell, 2005; Mitchell, et al., 2003; Paynter, 2005). Over 230,000 lines of C++ code constitute the system, which relies upon the open source Debian Linux operating system (and many other popular variants of Linux) and the MySQL database management software. The code is standardized, uniform and well-written. C++ has been chosen given its advantages over Java and other languages in performance-intensive tasks. We believe the technologies, applications, coding and protocols addressed are, for libraries, mission critical, widely applicable for thousands of libraries/digital libraries, and need to endure; therefore we have not chosen lightweight approaches or protocols in development. Finally, the code base is modularly designed so that it is relatively easy to interchange many components (e.g., swap out database management software, crawlers, and/or classifiers) to meet the unique needs of implementing institutions.
Our services and software are open, community-based and cooperative:
iVia and Data Fountains software are open source (GPL/LGPL) and freely available to the community.
The software and service are being integrated into the National Science Digital Library (NSDL) to be of assistance to NSDL-associated projects and are currently being explored by the Library of Congress and others.
Data Fountains’ metadata generation, rich text extraction and resource discovery services are available through what will become a Data Fountains cooperative, which will operate on a cost-recovery, nonprofit basis.
Researchers and librarians at the Library of Congress, Indian Institute of Technology (Bombay), Cornell University, University of Massachusetts (Amherst), California Digital Library (University of California), and California State University (Sacramento) are among those working with us in new research to further the work.
Challenges of context — Cooperation and better engagement with new technologies
The community and organizational contexts within which we have been working have been complex and interesting, showing many new opportunities as well as continuing barriers to our type of effort. As a whole, libraries represent an increasingly impoverished community that is generally quite underserved by usually small, expensive and not terribly responsive commercial software and service vendors. This is unfortunate given that libraries understand knowledge organization across a wide spectrum of applications and have proven that they can generally do well with very limited resources. It therefore seems evident that, allied, libraries could do better on their own.
This assumes, though, that libraries could develop approaches (or organizations) generating more meaningful cooperation, make decisions in a timely manner, and rid themselves of knee-jerk risk aversion. The inability to accomplish this has generally proven to be quite a barrier. Often the problem is that the larger institutions that manage and enable libraries, and in which they are embedded (such as universities and many city/county governments), neither understand our capabilities nor see us very clearly in new roles. Additional challenges faced by anyone doing this work (in either the public or commercial domains) are that software development costs are significant; the work is hard and complex, with expertise sometimes difficult to find and retain; and the computer and information sciences and, most importantly, the basic computing power underlying these disciplines and technologies, represent a moving target frozen in fast forward.
These are large challenges but, considering the alternative of increasing disintermediation (often by mediocre services, organizations and tools), should serve to again underline the importance of cooperation in developing new, mutually useful tools and organizations. The challenges also should serve to indicate that this effort represents a community-sized set of tasks if we wish to take advantage of new technologies that enable new services of promise. For example, our project works with only a handful of the classification algorithms relevant to controlled subject vocabulary application when considering all of those that have some relevance to this type of research and development. Libraries represent a large community with many common interests in these technologies and tools.
The final observation here is that developing cooperation, so that we can work together and start to own core oncoming technologies that apply to our service areas, will probably become as important as owning the buildings that house our physical collections has been heretofore. As an institution and community, we want to own the machine-learning-based, machine-assistance technologies described in this article, among others. At the very least, to best benefit from these technologies implies that we become much more directly and intensively involved in guiding and designing them than we have been. In this way we are more likely to ensure the best possible outcomes for information seekers.
Specific iVia technologies
New resource identification through guided or focused Web crawling
These are important because they are appropriately scaled tools with which librarians and other subject experts can employ and amplify their domain expertise in finding new resources for collections. Their value lies in the following:
They are intended to be used in Internet resource identification geared not towards the whole Internet but towards identifying resources of value to specific subject and other communities.
In comparison with large scale crawlers, smaller scale crawling can cover specialized topics in more depth and keep the crawl fresher or more current because there is less to cover for each crawler.
There is potentially considerably more precision (while retaining a significant level of recall) in using these tools than with other approaches given their targeted (i.e., TLC and EGC) or exemplar-based (i.e., NiFC) approaches (i.e., use of large seed sets of very on-topic, expert-provided exemplars).
Among our crawlers are Targeted Link Crawler (TLC), Expert (or Manually) Guided Crawler (EGC) and Nalanda iVia Focused Crawler (NiFC). TLC is provided a URL, or list of URLs, and crawls those only, generating a metadata record and rich text for each URL supplied.
EGC is a crawler that mines a site from a specified start URL, from which it can drill down into the site (a user-specified number of levels) or drill out to external links only. In this manner it follows internal and/or external links in the attempt to find new resources, for which metadata records are created and rich text extracted.
Figure 1: Expert Guided Crawler Settings.
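The drill-down/drill-out traversal that EGC performs can be sketched as a depth-limited breadth-first walk. This is a minimal illustration, not iVia's implementation: fetching and HTML parsing are abstracted behind a `get_links` callback (hypothetical), and real crawlers add politeness, de-duplication and record generation at each page.

```python
from collections import deque
from urllib.parse import urlparse

def guided_crawl(start_url, get_links, max_depth=2, external_only=False):
    """Breadth-first crawl from a start URL: drill down into the site to
    a user-specified depth, or ("drill out") follow only links that
    leave the start site.

    `get_links` maps a URL to the URLs linked from that page; in a real
    crawler it would fetch and parse the HTML.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    found = []
    while queue:
        url, depth = queue.popleft()
        found.append(url)  # a metadata record would be built here
        if depth == max_depth:
            continue
        for link in get_links(url):
            internal = urlparse(link).netloc == start_host
            if external_only and internal:
                continue  # drill-out mode: skip same-site links
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found
```

The depth limit corresponds to the "user specified number of levels" setting described above.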
Nalanda iVia Focused Crawler focuses on and crawls interlinked resources within a topic community, the assumption being that those interested in and putting up resources on a specific topic generally link to or co-cite one another. NiFC finds these through user experts supplying on-topic exemplars (i.e., a list or seed set of URLs). With these, NiFC utilizes:
Web graph or co-citation analysis to determine important sites within a subject community (the most intensely interlinked or co-cited).
Similarity analysis that compares key phrase profiles for a prospective new resource with key phrase profiles from the known high value, highly relevant resources in a subject that have been provided by the user expert in the exemplars.
Preferential focused crawling which allows it to identify and follow only the most relevant links on a page. NiFC features an apprentice learner program that is able to determine, through cues in an HTML page, the most promising links to crawl. This makes focused crawling more efficient by reducing the total number of links that are crawled.
A combined HITS and PageRank algorithm to improve the crawling.
Figure 2: Nalanda iVia Focused Crawler Settings.
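The similarity analysis in the list above can be sketched as a cosine comparison between a candidate page's term profile and the aggregate profile of the expert-supplied seed set. This is a toy sketch under simplifying assumptions (bag-of-words profiles, no trained classifier, no link-structure score); function names are illustrative, not NiFC's API.

```python
import math
from collections import Counter

def profile(texts):
    """Aggregate term-frequency profile over a set of documents."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(seed_texts, candidates, threshold=0.3):
    """Keep candidate pages whose term profile resembles the
    expert-supplied seed set, most similar first."""
    seed = profile(seed_texts)
    scored = [(cosine(profile([text]), seed), url) for url, text in candidates]
    return [url for score, url in sorted(scored, reverse=True)
            if score >= threshold]
```

Candidates scoring below the threshold are pruned, which is what keeps a focused crawl on topic rather than wandering off into the general Web.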
Rich, full-text identification and harvest
Rich natural language text, as mentioned, is the text most likely to include author-intended descriptions of the themes of a resource (e.g., abstracts or introductions). Different resource types often differ in where rich text can be found and in the types of rich text present. Rich text can greatly improve end-user retrieval (via proximity operators) for finely granular terms or phrases and is one critical step in improving the generation of other forms of metadata. Simple semantic rules (i.e., aboutness cues such as the words introduction, faq or abstract, or author emphasis via large fonts or bolding) are used to identify rich text (see Figure 3). Rich text can be extracted as found or processed as key phrases in context. One to three pages can be kept (this limit is arbitrary and could be greatly expanded; see Figure 4).
Figure 3: Rich, Full-text Settings: Aboutness Terms.
Figure 4: Rich, Full-text Settings: Number of Pages and Amount of Text.
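A minimal sketch of rich-text identification by aboutness cues follows. It assumes the document has already been parsed into (heading, body) sections; the term list and the character cap are illustrative stand-ins for the settings shown in Figures 3 and 4, and the real rules also consider font size, bolding and document position.

```python
# Illustrative aboutness cues of the kind Figure 3 configures.
ABOUTNESS_TERMS = ("abstract", "introduction", "summary", "about", "faq", "overview")

def find_rich_text(sections, max_chars=4000):
    """Return (heading, passage) pairs whose headings carry aboutness cues.

    `sections` is a list of (heading, body) pairs standing in for a parsed
    document; a real implementation would work on HTML structure.
    """
    rich = []
    total = 0
    for heading, body in sections:
        if any(term in heading.lower() for term in ABOUTNESS_TERMS):
            take = body[: max_chars - total]
            rich.append((heading, take))
            total += len(take)
            if total >= max_chars:
                break  # cap mirrors the (arbitrary) one-to-three page limit
    return rich
```

The harvested passages can then be kept as full text or fed to the key phrase processing described earlier.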
Automated record building
Approximately 35 fields are populated through a variety of means including extracting what is present on the page as metatagged data or fulltext, by developing original metadata as gisted from text and key phrases, or by a combination of both approaches (e.g., in some cases title). Both uncontrolled terms (e.g., natural language key phrases and descriptions) and controlled vocabularies/schema (representing library standards) are used to indicate topic. Users can specify the amount and type of metadata they wish generated (see Figure 5).
Figure 5: Metadata Choice.
Metadata creation through extraction and classification occurs in the following steps:
HTML and Dublin Core metatag data is identified and harvested (e.g., creator, title, etc., if any) when present on the page. If there is none, it is then generated automatically in the cases of core fields (e.g., title, key phrases, subjects, description). If the tagged data supplied is incorrect for a page (often the case where standard HTML templates have been used to describe the overall topic of a site, which may not well represent specific components, but which are still repeated throughout the site), it can be overridden as a user choice in favor of originally created metadata.
Rich, full-text (e.g., abstracts, about pages, summaries) is identified and one to three pages harvested and processed into key phrases or key terms.
The most significant of these natural language key phrases are identified and extracted.
Annotation-like descriptions are developed from the full text and key phrases, if not present in metatags.
Classifiers are used to model and map key terms to controlled subject vocabularies/schema, currently and notably Library of Congress Classifications (LCC) and Library of Congress Subject Headings (LCSH).
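The steps above can be sketched as a single extract-then-generate pipeline. This is an illustration of the flow, not iVia's API: `generate` and `classify` are hypothetical stand-ins for the metadata generation and classification components, and the field list is truncated for brevity.

```python
def build_record(page, generate, classify):
    """Sketch of the extraction-then-generation flow described above.

    `page` carries any metatag data found on the resource; `generate`
    creates a field value from rich text when the tag is absent or
    overridden; `classify` maps key phrases to controlled vocabularies
    (LCC/LCSH). All three are stand-ins for iVia components.
    """
    record = {}
    for field in ("title", "creator", "description", "keyphrases"):
        value = page.get("metatags", {}).get(field)
        if not value:  # fall back to originally generated metadata
            value = generate(field, page.get("richtext", ""))
        record[field] = value
    # Controlled subjects are always generated, via classifiers
    record["lcc"], record["lcsh"] = classify(record["keyphrases"])
    return record
```

The per-field fallback also models the user choice, noted above, of overriding boilerplate metatags (e.g., from site-wide HTML templates) in favor of generated metadata.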
Controlled subject generation
In more detail, LCC and LCSH are applied through classification algorithms that build models mapping natural language to the controlled classifications or headings. The classification algorithms require hundreds of thousands or millions of training examples, which we've taken primarily from library catalog records (i.e., those which reference an Internet resource and have a URL). Algorithms used have included k-Nearest-Neighbor/Naïve Bayes, then Logistic Regression, and shortly Support Vector Machines (being revisited) combined with Hidden Markov Models, among others.
While results with LCC and LCSH assignment have been quite mixed, current IMLS-supported research, running three years, will explore and improve specific classifiers, as well as develop hybrids and suites of classifiers. The major difficulty has been not having enough training data and, to a lesser degree, training data that can be dirty and inconsistently applied. Still, there is promise here: we have found that with more than 200 training examples per class, classifier results greatly improve. We have been fortunate to receive training data from the California Digital Library, University of California Riverside Library, Library of Congress, Cornell University Library and OCLC.
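One of the algorithm families named above, Naïve Bayes, is simple enough to sketch end to end. The toy below trains on (text, class) pairs of the kind a catalog record yields (title/subject words paired with an LCC class letter); real training uses hundreds of thousands of records and far richer features, so this is only a sketch of the idea.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing, trained on
    (text, label) pairs such as catalog-derived examples pairing
    record text with an LCC class."""

    def fit(self, examples):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                # Laplace (add-one) smoothing handles unseen words
                lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

The observation above that results improve sharply past roughly 200 examples per class is typical of such learners: smoothing can only paper over sparse classes, not replace data.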
It is important to note what our subject vocabulary/schema standards, including LCC and LCSH, represent here. Though subjects can be applied automatically or semi-automatically through new technologies and algorithms, their application always occurs via expert initial profiling and interaction. Indeed, very significantly, the models developed for automated controlled subject vocabulary application are distilled from exemplars (training data) of those vocabularies/schema in use and from the knowledge base (i.e., all those MARC records with LCC and LCSH) built over decades by thousands of libraries and librarians in the community. This is another reason our work is open source. The models and mappings, as well as the classes themselves, can only exist as based on the community knowledge base from which they draw.
Product definition, metadata modification and record export
Though folded in with the above, the system provides expert-based, flexible means of: metadata/field selection; product/service selection; rich text selection; and crawler guidance/control. Metadata selected can range from very little to all 35, mostly Dublin Core, fields plus rich text. Records can be archived, modified/edited and searched through the iVia system database archive, or simply exported in a number of formats for inclusion/post-processing in the participant's native database. Export formats currently include: Standard Delimited Format (comma separated); zipped SDF; and OAI-PMH. Shortly, export in MARC and HTML/XML will be supported.
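The comma-separated export option can be sketched as below. This is a generic delimited-export illustration, not the exact iVia field layout or SDF dialect; the function name and field list are hypothetical.

```python
import csv
import io

def export_sdf(records, fields):
    """Write records as a comma-separated (SDF-style) export: a header
    row naming the fields, then one quoted-as-needed row per record."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    for rec in records:
        # Missing fields export as empty columns
        writer.writerow([rec.get(field, "") for field in fields])
    return buf.getvalue()
```

A participant's native database can then ingest the delimited output directly, which is the post-processing path the text describes; OAI-PMH and, later, MARC serve the same role for repository-to-repository exchange.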
Discussed here have been machine learning based software, technologies and services developed by a library for the library community. Showing modest success, they are intended to amplify and augment librarian effort in collection building by providing experts with machine assistance, with the goal of developing greater reach and coverage for the services libraries offer their users. A fascinating area of research and development, the work also involves technologies that are maturing rapidly and will affect our community greatly. They are, therefore, technologies with which we need to better engage, and which we need to guide, in order to ensure the most productive and useful outcomes for library patrons, libraries and library support organizations.
About the author
Steve Mitchell is iVia and Data Fountains Projects Director. Steve has fourteen years of Internet service provision experience in libraries and was a science reference librarian for sixteen years. He is co-founder of INFOMINE (http://infomine.ucr.edu), a scholarly Internet resources directory and one of the first Web-based services offered by a library. He has a B.A. in sociology from the University of California, Santa Barbara, and an M.L.I.S. from the University of California, Berkeley.
The author would like to acknowledge the generous support of the U.S. Institute of Museum and Library Services (IMLS), the National Science Foundation's National Science Digital Library (NSDL) and the Library of the University of California, Riverside. Also deserving a great deal of thanks are Johannes Ruscheinski, lead programmer, Paul Vander Griend, Walter Howard, Jason Scheier and Gordon Paynter, former lead programmer, of our projects. I would finally like to thank Soumen Chakrabarti (Indian Institute of Technology, Bombay), Thorsten Joachims (Cornell University), Rich Caruana (Cornell University), Andrew McCallum (University of Massachusetts, Amherst), John Saylor (National Science Digital Library and Cornell University Library), Diane Hillmann (Cornell University Library), Jon Phipps (Cornell University Library), Jan Herd (Library of Congress), Carolyn Larson (Library of Congress) and Carlos Rodriguez (California State University, Sacramento) for their support and insights over the years. The viewpoints represented in this article are those solely of the author.
Julie Mason, Steve Mitchell, Margaret Mooney, Lynne Reasoner and Carlos Rodriguez, June 2000. INFOMINE: Promising Directions in Virtual Library Development, First Monday, volume 5, number 6 (June), at http://firstmonday.org/issues/issue5_6/mason/. http://dx.doi.org/10.5210/fm.v5i6.763
Steve Mitchell, 2005. Collaboration Enabling Internet Resource Collection-building Software and Technologies, Library Trends, volume 53, number 4 (Spring), pp. 604–619, and at http://www.findarticles.com/p/articles/mi_m1387/is_4_53/ai_n14703165.
Steve Mitchell, 1997. INFOMINE: The First Three Years of a Virtual Library for the Biological, Agricultural and Medical Sciences, Proceedings of the Contributed Papers Session, Biological Sciences Division, Special Libraries Association Annual Conference, Seattle (11 June).
Steve Mitchell, Margaret Mooney, Julie Mason, Gordon W. Paynter, Johannes Ruscheinski, Artur Kedzierski, and Keith Humphreys, 2003. iVia: Open Source Virtual Library Software, D-Lib Magazine, volume 9, number 1 (January), at http://www.dlib.org/dlib/january03/mitchell/01mitchell.html. http://dx.doi.org/10.1045/january2003-mitchell
Gordon Paynter, 2005. Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources, Proceedings of the Joint Conference on Digital Libraries (JCDL), at http://ivia.ucr.edu/projects/publications/Paynter-2005-JCDL-Metadata-Assignment.pdf.
Paper received 17 May 2006; accepted 18 July 2006.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.
Machine-assisted Metadata Generation and New Resource Discovery: Software and Services by Steve Mitchell
First Monday, volume 11, number 8 (August 2006).