Monday, May 9, 2011

The Cost of Deselection (9): Data Comparisons Revisited

Time to exercise our perpetual beta clause. In trolling through this growing string of posts on deselection costs--with an eye toward toting them up--it struck me that the entry on "Data Comparisons" could benefit from an attempt at more specific estimates, both for manual searching and batch matching of candidate files to comparator data sets. In the example outlined in that post, the task was to compare 300,000 candidates to three targets: WorldCat, CHOICE, and Resources for College Libraries--essentially, to perform 900,000 searches, then record and compile the results.

This is an area where costs are very difficult to estimate. Some libraries may be able to forgo the task of assembling deselection metadata that extends beyond circulation and in-house use, but probably not too many. Withdrawing monographs can be controversial. Evidence of holdings in other libraries, in print archives, or in digital form is essential to making and defending responsible deselection decisions. Comparing low-circulation titles to WorldCat holdings is the most helpful first step, especially since the holdings data gleaned there can include Hathi Trust titles, some print archives, and specified consortial partners. At minimum, all low-circulation titles in the target set should be searched in WorldCat.

There are a couple of ways to do this. Manual searches based on OCLC control number, LCCN, or title offer one approach, but it is time-consuming to key the searches and to interpret and record the results. To date, the smallest data set SCS has worked with included 10,000 candidates; the largest, 750,000. At 60 searches per hour (1 minute each), it would require 5,000 hours to search 300,000 titles. At $8/hour for student workers, the cost would be $40,000; at $20.68/hour for a library technician, $103,400. Searching 750,000 items in this manner, besides equating to cruel and unusual employment, would require 12,500 hours. Clearly, this is not a realistic option--so much so that we won't even bother to calculate the cost.
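For anyone who wants to test different assumptions about wages or search rates, the arithmetic is simple enough to script. Here is a minimal Python sketch that simply restates the figures above; nothing in it is new data, only the assumptions already named:

```python
# Back-of-the-envelope cost of manual searching, using only the assumptions
# stated above (1 minute per search; $8/hour student, $20.68/hour technician).

CANDIDATES = 300_000           # low-circulation titles to check against WorldCat
SEARCHES_PER_HOUR = 60         # one minute per search, including recording results

hours = CANDIDATES / SEARCHES_PER_HOUR       # 5,000 hours
student_cost = hours * 8.00                  # $40,000
technician_cost = hours * 20.68              # $103,400

print(f"{hours:,.0f} hours; "
      f"${student_cost:,.0f} with students, ${technician_cost:,.0f} with a technician")
```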

The obvious answer, for a candidate file of any significant size, is to use computing power for what it's best at -- automated batch matching. Over the past six months, my partners and I at Sustainable Collection Services have worked closely with library data sets ranging in size from 10,000 titles to 750,000 titles. We have learned a great deal about how to prepare data for large-scale batch matching, and how to shape results usefully. Among other things, we have learned that there are many variables in both source and target data sets.

We have worked with clean data and not-so-clean data. We have found many mutations of control numbers. We have learned to identify malformed OCLC numbers, and a good deal about how to correct them using LCCN, ISBN, and string-similarity matching. Our business partner and Chief Architect Eric Redman even invoked the fearsome-sounding Levenshtein Distance Algorithm, which sounds cool and turns out to be not so fearsome after all. But a few snippets from our internal correspondence may provide some local color related to data issues (a rough sketch of the title check appears after the first excerpt).
1. Matching by title: I have implemented a basic starting point that can be enhanced as we gain experience, using the very simple Levenshtein Distance algorithm. We can plug in more sophisticated algorithms later. With this basic starting point I am able to determine which LCCNs are suspect and which OCLC numbers are suspect. When only one is suspect on a record, I can use the other to correct the suspect value.
2. Using this approach, I can detect [in this record set] a pattern of a leading zero of the OCLC number having been replaced by a "7." This occurs in 1621 records. I can correct this.
3. I should begin to catalog these patterns of errors with LCCNs, OCLC Numbers, and ISBNs. I can create some generalized routines that we can plug in depending on the "profile" of the library's bibliographic data.
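For concreteness, here is a rough Python sketch of the kind of check described in that first excerpt: a plain Levenshtein distance, normalized so that 0 means a perfect match, plus the "leading 7" repair. It is an illustration only, not SCS code, and the normalization choices are assumptions:

```python
# Illustration of a title-similarity check and the "spurious 7" fix described
# above. Field handling and normalization are assumptions for the example.

def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def title_distance(library_title: str, worldcat_title: str) -> float:
    """Normalized distance: 0.0 is a perfect match, 1.0 is completely different."""
    a = library_title.lower().strip()
    b = worldcat_title.lower().strip()
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def repair_spurious_seven(oclc_number: str) -> str:
    """Undo the pattern noted above (a leading zero replaced by a '7').
    Whether the repair is right still has to be confirmed by a title check."""
    if oclc_number.startswith("7"):
        return "0" + oclc_number[1:]
    return oclc_number
```

A record whose OCLC-derived title fails the check but passes it after repair_spurious_seven() is a reasonable candidate for the kind of automated correction described in the excerpt.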
Or, as we began to extrapolate our response to these anomalies and to build a more replicable process to embed in SCS ingest and normalization routines (a code sketch follows these steps):
1. Do a title similarity check on OCLC Numbers by comparing the title retrieved from WorldCat via OCLC Number to the title from the library's bib file.
2. If step 1 yields a similarity measure of .3 or less, where 0 is a perfect match, then stop. Otherwise, use the LCCN from the bib file to do a WorldCat title similarity check.
3. If step 2 yields a similarity measure of .3 or less, use the OCLC Number from the step 2 WorldCat record to replace the library's OCLC Number.
4. If neither step 1 nor step 2 yielded a confident match, log the library's bib control number and OCLC Number for manual review. The manual review is where we'll identify patterns like the spurious 7 in the library's initial file.
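In code form, the cascade might look something like the sketch below. The lookup callables and record fields are placeholders rather than a real WorldCat client, and title_distance() is the helper from the earlier sketch (0.0 means a perfect match):

```python
# Sketch of the reconciliation cascade above. title_by_oclc and record_by_lccn
# are placeholder callables, not an actual WorldCat API; title_distance() is
# defined in the previous sketch.

THRESHOLD = 0.3   # 0.0 is a perfect title match

def reconcile_oclc_number(bib, title_by_oclc, record_by_lccn, log_for_review):
    """bib is a dict with 'bib_id', 'title', 'oclc_number', and 'lccn' keys."""
    # Step 1: does the title behind the library's OCLC number match the bib title?
    wc_title = title_by_oclc(bib["oclc_number"])
    if wc_title and title_distance(bib["title"], wc_title) <= THRESHOLD:
        return bib                                   # OCLC number looks sound; stop.

    # Step 2: the OCLC number is suspect, so try the LCCN as a second match point.
    wc_record = record_by_lccn(bib.get("lccn"))
    if wc_record and title_distance(bib["title"], wc_record["title"]) <= THRESHOLD:
        # Step 3: the LCCN-based match is good; adopt its OCLC number.
        bib["oclc_number"] = wc_record["oclc_number"]
        return bib

    # Step 4: neither match point produced a confident match; flag for manual
    # review, which is where patterns like the spurious leading 7 tend to surface.
    log_for_review(bib["bib_id"], bib["oclc_number"])
    return bib
```

Passing the lookups in as arguments keeps the sketch independent of any particular WorldCat client or target data set.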
This particular exchange actually concerned the smallest file we've yet handled, a mere 10,000 records! As we fix these errors in the data extract provided to us by the library, it seems clear that an opportunity also exists to provide some sort of remediation service to the library, improving its overall data integrity. But that's a story for some other day. 

Our learning curve has been steep and steady. We have now worked extensively with the WorldCat API and learned some lessons about capacity the hard way. We have worked with target data sets that require match points other than OCLC number. We have made mistakes and we have fixed mistakes. We have made more mistakes and fixed those too. And ultimately we have managed to produce some useful and convenient analyses and withdrawal candidate lists for our library partners. We have performed multiple iterations of criteria and lists in order to focus withdrawals by subject or location. In short, we have done what any individual library would need to do in order to run similar data comparisons.
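Without going into specifics, one generic way to manage lookup capacity is to pace and retry bulk requests rather than fire them as fast as possible. The sketch below is illustrative only: lookup_holdings stands in for whatever WorldCat or other client a library uses (it is not an actual OCLC API call), and the rate and retry settings are arbitrary:

```python
# Generic pacing-and-retry wrapper for bulk lookups. lookup_holdings is a
# placeholder callable supplied by the caller; rates and retry counts are
# arbitrary examples, not recommended settings.

import time

def batch_lookup(oclc_numbers, lookup_holdings, per_second=2.0, max_retries=3):
    """Yield (oclc_number, result) pairs, pacing requests and retrying failures."""
    delay = 1.0 / per_second
    for number in oclc_numbers:
        result = None
        for attempt in range(1, max_retries + 1):
            try:
                result = lookup_holdings(number)
                break
            except Exception:
                if attempt < max_retries:
                    time.sleep(delay * 2 ** attempt)   # back off, then try again
        yield number, result                # result is None if all retries failed
        time.sleep(delay)                   # keep a steady request rate
```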

None of these is a trivial task, and everything gets more interesting as files grow larger. SCS has found that effective data comparison requires top-flight technical skills for data normalization and remediation. It requires configuring and launching multiple virtual processors to improve matching capacity. It requires creation of a massive cloud-based data warehouse, already populated with tens of millions of lines. It requires high-level ability with SQL and a frighteningly deep facility with Excel. Formatting withdrawal lists for printing as picklists is also much more time-consuming than one might expect. Because SCS is doing this work repeatedly, we continue to learn and improve. We have developed some efficiencies and will continue to do so. But most individual libraries will not have that advantage. If you plan to do this in your own library, also plan on your own steep and steady learning curve.
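On the SQL side, the individual queries are rarely exotic; the work is in keeping millions of rows organized and iterating on criteria quickly. Purely as an illustration, with invented table names, column names, and thresholds, and a local SQLite file standing in for a cloud warehouse:

```python
# Illustrative candidate-list query. Schema, column names, and thresholds are
# invented for the example; SQLite stands in for a larger data warehouse.

import sqlite3

CANDIDATE_QUERY = """
SELECT b.bib_id, b.title, b.call_number, b.location
FROM bibs AS b
JOIN usage AS u      ON u.bib_id = b.bib_id
JOIN worldcat AS w   ON w.oclc_number = b.oclc_number
WHERE u.total_circs = 0              -- no recorded circulation
  AND w.us_holdings >= 100           -- widely held elsewhere
  AND b.location = :location         -- focus withdrawals by location
ORDER BY b.call_number;
"""

def candidate_list(db_path: str, location: str):
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        return conn.execute(CANDIDATE_QUERY, {"location": location}).fetchall()
```

In practice, each iteration of criteria amounts to a variation on a query like this one, re-run and re-exported for review.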

Which, finally, brings us back to the question of costs. Automated batch matching is clearly the way to go, but automation and creation of batches may require significant investment and a surprisingly wide range of high-end technical skills. These are not cheap. While it remains difficult to know what dollar figure to ascribe to data normalization, comparison, troubleshooting, and related processes, it is clearly substantial.
