Wednesday, May 25, 2011

Curating a Discovery Environment

[Update: 11/8/11: Book published last week].  Late last year, my good friend David Swords of EBL asked me to contribute the opening chapter to a forthcoming book entitled "Patron-Driven Acquisitions: History and Best Practices." This collection of essays and research studies will be published in July by De Gruyter, and includes contributions from a range of librarians, vendors, and publishers on this hottest of topics.

Samuel Johnson aside ("none but a blockhead ever wrote except for money"), the discipline of writing always leads to learning, and with luck a good concept or two. As I thought about the changes in collection development and management that have taken place in the past decade, it struck me that the work of selectors now emphasizes:
  1. Collecting for the Moment: given the growing availability of book content in electronic form--and the growth of secure digital archives such as Hathi Trust--it is no longer necessary for individual libraries to collect for the ages. Rather, the task is to assure that content can be delivered at the moment it is wanted--from wherever it may be. This links "collections" (or more properly, access) much more closely to discovery.
  2. Curating a Discovery Environment: this is now the central task of the activity formerly known as collection development. Or, as outlined in the chapter I have called "Collecting for the Moment: Patron-Driven Acquisition as a Disruptive Technology":
The philosophical shift underlying PDA is profound and multi-dimensional. Instead of curating collections of tangible materials, libraries have begun to adopt a new role: curating a discovery environment for both digital and tangible materials. Instead of deliberately trying to identify titles most relevant to curriculum and research interests within a discipline, broad categories of material that may be relevant are enhanced for optimum discoverability, immediate delivery, and partial or temporary use. Instead of purchasing materials just in case a scholar may one day need them, PDA offers “just in time” access to needed titles or portions of titles. Instead of collecting for the ages, libraries are using PDA to enable more targeted collecting for the moment.
This same logic applies to deselection. As we begin to come to terms with digital books and widely-shared print collections, it is critically important to assure that items removed from the library shelves remain discoverable and deliverable through other means. We continue to curate the discovery environment, and to enhance delivery options. Paradoxically, it may actually make sense to enrich the metadata associated with deselected items. This might include not only descriptive metadata, but also "availability" metadata: a URL to a Hathi Trust public domain copy; print holdings in a shared archive; commercial e-book availability; print-on-demand; or links to used book dealers.
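To make that concrete, here is a minimal Python sketch of what an enriched record for a deselected title might look like. Every field name, identifier, and URL below is hypothetical, shown only to illustrate the kinds of descriptive and "availability" metadata described above; it is not a real record format.

```python
# A minimal sketch of "availability" metadata for a deselected title.
# All field names, identifiers, and URLs here are made up for illustration.
deselected_record = {
    "title": "An Example Monograph",
    "oclc_number": "12345678",          # placeholder identifier
    "status": "withdrawn",
    "availability": {
        "hathitrust_public_domain_url": "https://catalog.hathitrust.org/Record/000000000",
        "shared_print_archive": ["Hypothetical Shared Print Partner"],
        "commercial_ebook": True,
        "print_on_demand": False,
        "used_book_dealers": ["https://example.com/used-copy-listing"],
    },
}

# A discovery layer could surface whichever delivery options exist,
# even though the physical volume is no longer on the shelf.
for source, value in deselected_record["availability"].items():
    if value:
        print(f"{source}: {value}")
```

The point is not this particular format, but that the record keeps answering the question a patron actually asks: "Where can I get this now?"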

Monday, May 16, 2011

The Cost of Deselection (10): Summing Up

The Project:  To remove 10,000 volumes from a 250,000-volume library to make room in the stacks for 1-2 years' collection growth.

Assumptions: Librarian time @ median rate of $35/hour (including benefits) and a 40-hour work week. Batch data comparison for 250,000 titles would cost a minimum of $10,000 to prepare, trouble-shoot, and execute. (This is essentially an educated guess, based on logic outlined here and here.)

The Process and Estimated Costs:

     Project Design and Management            100 hours     $ 3,500
     Data Extract                              20 hours     $   700
     Develop Deselection Criteria             100 hours     $ 3,500
     Communication w/Stakeholders             100 hours     $ 3,500

     Data Comparison w/ 3 targets (batch)                   $10,000
     Title Review from List only                            $ 5,810
     Disposition & Record Maintenance                       $ 5,000

     ==============================================================

     TOTAL with list-only review                            $32,010
     Price per volume:                                      $  3.20

     TOTAL with in-stack review                             $39,000
     Price per volume:                                      $  3.90

     TOTAL with staged review                               $41,400
     Price per volume:                                      $  4.14
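For anyone who wants to adjust the assumptions, the totals reduce to simple arithmetic. The short Python sketch below recomputes the list-only scenario from the figures above; the variable names are ours, and the numbers are simply the estimates from this series.

```python
# Recompute the list-only review scenario from the assumptions above:
# librarian time at $35/hour plus the estimated fixed costs.
HOURLY_RATE = 35   # dollars per hour, including benefits
VOLUMES = 10_000   # volumes to be withdrawn

labor_hours = {
    "project design and management": 100,
    "data extract": 20,
    "develop deselection criteria": 100,
    "communication with stakeholders": 100,
}
fixed_costs = {
    "data comparison, 3 targets (batch)": 10_000,
    "title review from list only": 5_810,
    "disposition and record maintenance": 5_000,
}

total = sum(h * HOURLY_RATE for h in labor_hours.values()) + sum(fixed_costs.values())
print(f"Total with list-only review: ${total:,}")   # $32,010
print(f"Cost per volume: ${total / VOLUMES:.2f}")   # $3.20
```

Swapping in the estimates for in-stack or staged review yields the other two totals above.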


The numbers are easy to poke holes in, so please feel free to do so. We have laid them out as fully as we currently understand them, but they will vary with local conditions. There is still a great deal to learn here, and we are interested in other perspectives and experiences. In most cases, the per-volume cost will decrease substantially with volume--i.e., deselection criteria, once developed, can be applied to many more volumes for only incremental cost.

The steps in the process may also vary somewhat, but there are really none that can be skipped entirely. Unless a library manager is willing to forgo communication with stakeholders, to prevent review of candidate titles, or to act based solely on usage data, the deselection process will require meetings and discussions, decisions and adjustments. These take time, and probably more time than estimated here.

These are intended as reasonable estimates, and as a starting point for discussion. Comments are welcome and revisions are likely.

Monday, May 9, 2011

The Cost of Deselection (9): Data Comparisons Revisited

Time to exercise our perpetual beta clause. In trolling through this growing string of posts on deselection costs--with an eye toward toting them up--it struck me that the entry on "Data Comparisons" could benefit from an attempt at more specific estimates, both for manual searching and batch matching of candidate files to comparator data sets. In the example outlined in that post, the task was to compare 300,000 candidates to three targets: WorldCat, CHOICE, and Resources for College Libraries--essentially, to perform 900,000 searches, then record and compile the results.

This is an area where costs are very difficult to estimate. Some libraries may be able to forgo the task of assembling deselection metadata that extends beyond circulation and in-house use, but probably not many. Withdrawing monographs can be controversial. Evidence of holdings in other libraries, in print archives, or in digital form is essential to making and defending responsible deselection decisions. Comparing low-circulation titles to WorldCat holdings is the most helpful first step, especially since the holdings data gleaned there can include Hathi Trust titles, some print archives, and specified consortial partners. At minimum, all low-circulation titles in the target set should be searched in WorldCat.

There are a couple of ways to do this. Manual searches based on OCLC control number, LCCN, or title offer one approach, but it is time-consuming to key the searches and to interpret and record the results. To date, the smallest data set SCS has worked with included 10,000 candidates; the largest, 750,000. At 60 searches per hour (1 minute each), it would require 5,000 hours to search 300,000 titles against a single target. At $8/hour for student workers, the cost would be $40,000; at $20.68/hour for a library technician, $103,400. Searching 750,000 items in this manner, besides equating to cruel and unusual employment, would require 12,500 hours. Clearly, this is not a realistic option--so much so that we won't even bother to put a dollar figure on it.
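The back-of-envelope arithmetic behind those figures is simple enough to script; the times and rates below are just the assumptions stated above.

```python
# Rough cost of manual searching, using the assumptions above:
# roughly one minute per lookup (60 searches per hour), one target.
titles = 300_000
hours = titles / 60   # 5,000 hours

for label, rate in [("student worker", 8.00), ("library technician", 20.68)]:
    print(f"{label}: {hours:,.0f} hours, about ${hours * rate:,.0f}")
# student worker: 5,000 hours, about $40,000
# library technician: 5,000 hours, about $103,400
```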

The obvious answer, for a candidate file of any significant size, is to use computing power for what it's best at -- automated batch matching. Over the past six months, my partners and I at Sustainable Collection Services have worked closely with library data sets ranging in size from 10,000 titles to 750,000 titles. We have learned a great deal about how to prepare data for large-scale batch matching, and how to shape results usefully. Among other things, we have learned that there are many variables in both source and target data sets.
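In its simplest form, a batch comparison is just set matching on a shared identifier. The sketch below assumes two hypothetical CSV files--one of withdrawal candidates, one of target holdings--each with an "oclc_number" column; real source and target data are rarely this tidy, which is exactly the point of the paragraphs that follow.

```python
import csv

def load_oclc_numbers(path, column="oclc_number"):
    """Read a CSV and return the set of (lightly normalized) OCLC numbers."""
    with open(path, newline="", encoding="utf-8") as f:
        return {
            row[column].strip().lstrip("0")
            for row in csv.DictReader(f)
            if (row.get(column) or "").strip()
        }

# Hypothetical file names; in practice these would be an ILS extract and a
# target list (WorldCat holdings, CHOICE, RCL, a shared print archive...).
candidates = load_oclc_numbers("withdrawal_candidates.csv")
target = load_oclc_numbers("target_holdings.csv")

matched = candidates & target
print(f"{len(matched):,} of {len(candidates):,} candidates matched the target")
```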

We have worked with clean data and not-so-clean data. We have found many mutations of control numbers. We have learned to identify malformed OCLC numbers, and a good deal about how to correct them using LCCN, ISBN, and string-similarity matching. Our business partner and Chief Architect Eric Redman even invoked the fearsome-sounding Levenshtein Distance algorithm, which sounds cool and turns out to be not so fearsome after all. But a few snippets from our internal correspondence may provide some local color on these data issues.
1. Matching by title: I have implemented a basic starting point that can be enhanced as we gain experience, using the very simple Levenshtein Distance algorithm. We can plug in more sophisticated algorithms later. With this basic starting point I am able to determine which LCCNs are suspect and which OCLC numbers are suspect. When only one is suspect on a record, I can use the other to correct the suspect value.
2. Using this approach, I can detect [in this record set] a pattern of a leading zero of the OCLC number having been replaced by a "7." This occurs in 1621 records. I can correct this.
3. I should begin to catalog these patterns of errors with LCCNs, OCLC Numbers, and ISBNs. I can create some generalized routines that we can plug in depending on the "profile" of the library's bibliographic data.
Or, as we began to extrapolate our response to these anomalies, and to build a more replicable process, which we would embed into SCS ingest and normalization routines:
1. Do a title similarity check on OCLC Numbers by comparing the title retrieved from WorldCat via OCLC Number to the title from the library's bib file.
2. If step 1 yields a similarity measure of .3 or less, where 0 is a perfect match, then stop. Otherwise, use the LCCN from the bib file to do a WorldCat title similarity check.
3. If step 2 yields a similarity measure of .3 or less, use the OCLC Number from the step 2 WorldCat record to replace the library's OCLC Number. Otherwise, stop.
4. If step 3 did not yield a match, log the library's bib control number and OCLC Number for manual review. The manual review is where we'll identify patterns like the spurious 7 in the Library's initial file.
This particular exchange actually concerned the smallest file we've yet handled, a mere 10,000 records! As we fix these errors in the data extract provided to us by the library, it seems clear that an opportunity also exists to provide some sort of remediation service to the library, improving its overall data integrity. But that's a story for some other day. 
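For the curious, here is a rough Python sketch of the verification steps quoted above. The similarity measure is a plain Levenshtein edit distance normalized by title length (so 0 is a perfect match), and the two WorldCat lookup functions are placeholders for whatever retrieval mechanism a library actually has available; none of this is SCS production code.

```python
def normalized_distance(a, b):
    """Levenshtein edit distance divided by the longer string's length
    (0.0 is a perfect match, 1.0 is no resemblance)."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))


# Placeholder lookups -- stand-ins for a real WorldCat query, not an actual API.
def worldcat_title_by_oclc(oclc_number):
    raise NotImplementedError

def worldcat_record_by_lccn(lccn):
    """Would return (title, oclc_number) for the record found by LCCN."""
    raise NotImplementedError


THRESHOLD = 0.3  # "a similarity measure of .3 or less" from the steps above

def verify_oclc_number(bib_title, oclc_number, lccn):
    """Follow the four steps sketched above; return (oclc_number, note)."""
    # Steps 1-2: does the library's OCLC number retrieve a matching title?
    if normalized_distance(bib_title, worldcat_title_by_oclc(oclc_number)) <= THRESHOLD:
        return oclc_number, "oclc number confirmed"
    # Steps 2-3: fall back to the LCCN; if its title matches, adopt its OCLC number.
    wc_title, wc_oclc = worldcat_record_by_lccn(lccn)
    if normalized_distance(bib_title, wc_title) <= THRESHOLD:
        return wc_oclc, "oclc number replaced via lccn match"
    # Step 4: no confident match -- log for manual review, where patterns like
    # the spurious leading "7" get spotted.
    return oclc_number, "flagged for manual review"
```

In practice both the threshold and the normalization would need tuning, and the lookups are where capacity limits bite--which is part of what the next paragraph is about.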

Our learning curve has been steep and steady. We have now worked extensively with the WorldCat API and learned some lessons about capacity the hard way. We have worked with target data sets that require match points other than OCLC number. We have made mistakes and we have fixed mistakes. We have made more mistakes and fixed those too. And ultimately we have managed to produce some useful and convenient analyses and withdrawal candidate lists for our library partners. We have performed multiple iterations of criteria and lists in order to focus withdrawals by subject or location. In short, we have done what any individual library would need to do in order to run similar data comparisons.

None of these is a trivial task, and everything gets more interesting as files grow larger. SCS has found that effective data comparison requires top-flight technical skills for data normalization and remediation. It requires configuring and launching multiple virtual processors to improve matching capacity. It requires creation of a massive cloud-based data warehouse, already populated with tens of millions of lines. It requires high-level ability with SQL and a frighteningly deep facility with Excel. Formatting withdrawal lists for printing as picklists is also much more time-consuming than one might expect. Because SCS is doing this work repeatedly, we continue to learn and improve. We have developed some efficiencies and will continue to do so. But most individual libraries will not have that advantage. If you plan to do this in your own library, also plan on your own steep and steady learning curve.

Which, finally, brings us back to the question of costs. Automated batch matching is clearly the way to go, but automating the work and creating the batches may require significant investment and a surprisingly wide range of high-end technical skills. These are not cheap. While it remains difficult to know what dollar figure to ascribe to data normalization, comparison, trouble-shooting, and related processes, it is clearly substantial.
