This is an area where costs are very difficult to estimate. Some libraries may be able to forgo the task of assembling deselection metadata that extends beyond circulation and in-house use, but probably not many. Withdrawing monographs can be controversial. Evidence of holdings in other libraries, in print archives, or in digital format is essential to making and defending responsible deselection decisions. Comparing low-circulation titles to WorldCat holdings is the most helpful first step, especially since the holdings data gleaned there can include HathiTrust titles, some print archives, and specified consortial partners. At minimum, all low-circulation titles in the target set should be searched in WorldCat.
The obvious answer, for a candidate file of any significant size, is to use computing power for what it's best at -- automated batch matching. Over the past six months, my partners and I at Sustainable Collection Services have worked closely with library data sets ranging in size from 10,000 titles to 750,000 titles. We have learned a great deal about how to prepare data for large-scale batch matching, and how to shape results usefully. Among other things, we have learned that there are many variables in both source and target data sets.
We have worked with clean data and not-so-clean data. We have found many mutations of control numbers. We have learned to identify malformed OCLC numbers, and a good deal about how to correct them using LCCN, ISBN, and string-similarity matching. Our business partner and Chief Architect Eric Redman even invoked the fearsome-sounding Levenshtein Distance Algorithm, which sounds cool and turns out to be not so fearsome after all. But a few snippets from our internal correspondence may provide some local color related to data issues.
1. Matching by title: I have implemented a basic starting point that can be enhanced as we gain experience. This uses the very simple Levenshtein Distance algorithm. We can plug in more sophisticated algorithms later. With this basic starting point I am able to determine which LCCNs are suspect and which OCLC numbers are suspect. When only one is suspect on a record, I can use the other to correct the suspect value.
2. Using this approach, I can detect [in this record set] a pattern of a leading zero of the OCLC number having been replaced by a "7." This occurs in 1621 records. I can correct this.
3. I should begin to catalog these patterns of errors with LCCNs, OCLC Numbers, and ISBNs. I can create some generalized routines that we can plug in depending on the "profile" of the library's bibliographic data.
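To make the snippets above a little more concrete, here is a minimal sketch of a Levenshtein-based title comparison and of a check for the leading-"7" pattern. It is not SCS's actual code. The title normalization, the `WORLDCAT_TITLES` dictionary (a stand-in for a real WorldCat lookup), the sample numbers and titles, and the function names are all assumptions made for illustration. The .3 cutoff is borrowed from the steps quoted later in this post.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalize_title(title: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(kept.split())


def title_distance(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 0 means identical titles."""
    a, b = normalize_title(a), normalize_title(b)
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


# Hypothetical stand-in for a WorldCat lookup: OCLC number -> title.
WORLDCAT_TITLES = {"01234567": "The Origin of Species"}


def fix_leading_seven(oclc: str, local_title: str,
                      threshold: float = 0.3) -> Optional[str]:
    """Return the number with a leading '7' swapped back to '0', but only
    when the title found under the repaired number matches the local title."""
    if not oclc.startswith("7"):
        return None
    candidate = "0" + oclc[1:]
    remote_title = WORLDCAT_TITLES.get(candidate)
    if remote_title is not None and title_distance(local_title, remote_title) <= threshold:
        return candidate
    return None
```

Run over a whole file, a routine like `fix_leading_seven` would also give a count of how many records follow the pattern, which is the kind of tally that could feed a catalog of error patterns per library "profile."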
Or, as we began to generalize our response to these anomalies and to build a more replicable process that we could embed into SCS ingest and normalization routines:
1. Do a title similarity check on OCLC Numbers by comparing the title retrieved from WorldCat via OCLC Number to the title from the library's bib file.
2. If step 1 yields a similarity measure of .3 or less, where 0 is a perfect match, then stop. Otherwise, use the LCCN from the bib file to do a WorldCat title similarity check.
3. If step 2 yields a similarity measure of .3 or less, use the OCLC Number from the step 2 WorldCat record to replace the library's OCLC Number.
4. If step 3 did not yield a match, log the library's bib control number and OCLC Number for manual review. The manual review is where we'll identify patterns like the spurious 7 in the library's initial file.
This particular exchange actually concerned the smallest file we've yet handled, a mere 10,000 records! As we fix these errors in the data extract provided to us by the library, it seems clear that an opportunity also exists to provide some sort of remediation service to the library, improving its overall data integrity. But that's a story for some other day.
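For readers who think in code, the four numbered steps above might be sketched roughly as follows. This is an illustration under stated assumptions, not our production routine. The dictionaries `BY_OCLC` and `BY_LCCN` stand in for real WorldCat lookups, the record values are invented, and the distance measure here is Python's `difflib.SequenceMatcher` used as a convenient substitute for Levenshtein. The sketch also reads step 3 as "a close title match on the LCCN lookup supplies the replacement OCLC number," which is an interpretation of the steps.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.3  # from the steps above: 0 is a perfect match

# Hypothetical stand-ins for WorldCat lookups (invented sample data).
BY_OCLC = {"1111": "Birds of North America", "2222": "Organic Chemistry"}
BY_LCCN = {"a-100": ("2222", "Organic Chemistry")}  # LCCN -> (OCLC number, title)


def distance(a: str, b: str) -> float:
    """0 = identical, 1 = nothing in common (difflib ratio, not Levenshtein)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(bib_id, title, oclc, lccn, review_log):
    """Return the OCLC number to keep, or None if manual review is needed."""
    # Step 1: compare the library's title with the title behind its OCLC number.
    remote_title = BY_OCLC.get(oclc)
    if remote_title is not None and distance(title, remote_title) <= THRESHOLD:
        return oclc  # the existing number looks fine, so stop here

    # Step 2: fall back to the LCCN and repeat the title check.
    hit = BY_LCCN.get(lccn)

    # Step 3: a close match on the LCCN record supplies the corrected number.
    if hit is not None and distance(title, hit[1]) <= THRESHOLD:
        return hit[0]

    # Step 4: neither identifier checks out, so log the record for manual review.
    review_log.append((bib_id, oclc))
    return None
```

A call such as `reconcile("b2", "Organic chemistry", "1111", "a-100", log)` returns `"2222"`, because the title fails the check against OCLC number 1111 but matches the record found through the LCCN.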
Our learning curve has been steep and steady. We have now worked extensively with the WorldCat API and learned some lessons about capacity the hard way. We have worked with target data sets that require match points other than OCLC number. We have made mistakes and we have fixed mistakes. We have made more mistakes and fixed those too. And ultimately we have managed to produce some useful and convenient analyses and withdrawal candidate lists for our library partners. We have performed multiple iterations of criteria and lists in order to focus withdrawals by subject or location. In short, we have done what any individual library would need to do in order to run similar data comparisons.
Which, finally, brings us back to the question of cost. Automated batch matching is clearly the way to go, but automation and creation of batches may require significant investment and a surprisingly wide range of high-end technical skills. These are not cheap. While it remains difficult to know what dollar figure to ascribe to data normalization, comparison, troubleshooting and related processes, it is clearly substantial.
Links to related posts:
- Cost of Deselection (1)
- Cost of Deselection (2): Fixed Costs
- Cost of Deselection (3): Wage Rates
- Cost of Deselection (4): Data Comparisons
- Cost of Deselection (5): Title Review from Lists
- Cost of Deselection (6): In-Stack Review
- Cost of Deselection (7): Staged Review
- Cost of Deselection (8): Disposition Options
- Cost of Deselection (10): Summing Up