Sample & Hold: Rick Lugg's Blog: The Cost of Deselection (4): Data Comparisons

The time and effort required to acquire and use title-level deselection metadata depend on several factors:

What does the library want to know about withdrawal candidates, in addition to the fact that they have not circulated in x years? What deselection metadata is needed?
What match points are available in both the library's data and the comparator target?
To what degree can comparisons be batched and automated?

To look more closely at this, let's consider a recent project with Grand Valley State University, one of our valued library partners. GVSU's circulating monographs collection includes just over 300,000 titles. At the highest level of analysis, they were interested in creating withdrawal candidate lists based on these criteria:

No circulations since 1998
Publication year before 2000
More than 100 US Holdings
More than 10 In-State Holdings (Michigan)
Not listed in Resources for College Libraries
Never reviewed in CHOICE

GVSU was also interested in titles that might be important to retain or preserve as a contribution to the collective collection. Characteristics of these titles include:

Fewer than 10 US holdings
Not represented in Hathi Trust
No circulations since 1998
Publication year before 2000
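Once the comparison data has been gathered, the two sets of criteria above amount to simple filters over title-level records. The following is a minimal sketch, not the method actually used for GVSU. The `Title` record and its field names are hypothetical, and it assumes "no circulations since 1998" means the last recorded circulation, if any, fell before 1998.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Title:
    # All field names are hypothetical; a real ILS export will differ.
    last_circ_year: Optional[int]  # None means no circulation on record
    pub_year: int
    us_holdings: int
    state_holdings: int
    in_rcl: bool
    reviewed_in_choice: bool
    in_hathi: bool


def is_low_use(t: Title) -> bool:
    """Criteria shared by both lists: no recent circulation, older imprint."""
    # Assumption: "since 1998" treats 1998 itself as recent use.
    no_recent_circ = t.last_circ_year is None or t.last_circ_year < 1998
    return no_recent_circ and t.pub_year < 2000


def is_withdrawal_candidate(t: Title) -> bool:
    """Low use, widely held, and not endorsed by RCL or CHOICE."""
    return (
        is_low_use(t)
        and t.us_holdings > 100
        and t.state_holdings > 10
        and not t.in_rcl
        and not t.reviewed_in_choice
    )


def is_preservation_candidate(t: Title) -> bool:
    """Low use, but scarce nationally and absent from Hathi Trust."""
    return is_low_use(t) and t.us_holdings < 10 and not t.in_hathi
```

Keeping the thresholds in small functions like these makes it easy to adjust a number and rerun the lists, which matters because the criteria tend to be revised iteratively.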

The cost of developing these criteria has already been considered in the fixed costs portion of our conceptual model. But the actual data comparison incurs additional costs, some of which are variable, growing with the number of titles involved. To generate statistics based on GVSU's criteria, it was first necessary to compare 300,000 titles against several external sources, including:

OCLC WorldCat, using the WorldCat API (220 million records)
Hathi Trust (4.6 million book titles)
CHOICE (156,000 titles)
Resources for College Libraries (60,000 titles)

A couple of cost factors come into play here. First, some of the comparator targets require the purchase of a subscription. If these subscriptions are already in place for another purpose, it seems reasonable to omit them from deselection costs. But if subscriptions are added directly in support of deselection, the cost should be included.

The bigger cost, however, is embedded in the work of executing those millions of comparisons and capturing the results. One option, of course, is to search each title manually, but at this volume that is impractical and would ultimately cost an enormous amount in staff time. Remember that all 300,000 items had to be searched in four separate places (now three, since Hathi holdings are represented in WorldCat). It is clearly preferable to match via batch processes, unless the target collection is very small.

A second alternative is to work from a report generated by the library's ILS. Some systems can produce these routinely, but others require local creation of queries or scripts. Even with a good batch of bibliographic data, there are not always mechanisms for matching conveniently against multiple comparator files--these processes must be devised. To interact with the resulting data effectively, it may also be necessary to store results, so that criteria can be modified and new results produced iteratively. At a reasonable scale, much of this can be handled by Excel or Access or their open-source counterparts. But all of this may require a substantial amount of skilled staff time--with the clock running at a minimum of $35/hour.
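To put the staff-time point in rough numbers, here is a back-of-the-envelope sketch. The title count, the number of sources, and the $35/hour floor come from the discussion above; the minutes-per-search figure is purely a placeholder assumption, not a measured value, and the function names are hypothetical.

```python
def staff_hours(titles: int, sources: int, minutes_per_search: float) -> float:
    """Total staff hours to search every title in every source."""
    return titles * sources * minutes_per_search / 60


def search_cost(titles: int, sources: int, minutes_per_search: float,
                hourly_rate: float = 35.0) -> float:
    """Staff cost in dollars at the given hourly rate."""
    return staff_hours(titles, sources, minutes_per_search) * hourly_rate


# Hypothetical input: one minute per manual search (an assumption).
hours = staff_hours(300_000, 3, 1.0)
cost = search_cost(300_000, 3, 1.0)
print(f"{hours:,.0f} hours, ${cost:,.0f}")
```

Whatever per-search time a library plugs in, the cost grows linearly with both the number of titles and the number of comparator sources, which is why the same arithmetic looks so different for a collection of 5,000 titles.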

Even batch-matching costs can vary, depending on what match points are available in both the candidate list and the target data. If the library has OCLC numbers in most records, this speeds matching with WorldCat and Hathi. If the library does not retain OCLC numbers in its local records, LCCN and ISBN matches are possible, but create more work and more exceptions. And some comparator targets, such as CHOICE and RCL, don't include OCLC numbers; different matching routines may be required for a single library.
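One way to picture the match-point problem is as a fallback chain: try the OCLC number first, then LCCN, then ISBN, and set aside anything that matches on none of them. The sketch below assumes records are plain dictionaries with hypothetical keys `oclc`, `lccn`, and `isbn`; it is an illustration, not the routine used by any particular vendor or system.

```python
def normalize_isbn(raw: str) -> str:
    """Drop hyphens and spaces so differently formatted ISBNs compare equal."""
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def build_index(records, key):
    """Map each non-empty value of `key` to the first record that carries it."""
    index = {}
    for rec in records:
        value = rec.get(key)
        if value:
            index.setdefault(value, rec)
    return index


def match_record(local, indexes, keys=("oclc", "lccn", "isbn")):
    """Return (match_point, target_record), or (None, None) if nothing matches."""
    for key in keys:
        value = local.get(key)
        if value and value in indexes.get(key, {}):
            return key, indexes[key][value]
    return None, None


def match_all(local_records, target_records, keys=("oclc", "lccn", "isbn")):
    """Split local records into matched pairs and exceptions for manual review."""
    indexes = {key: build_index(target_records, key) for key in keys}
    matched, exceptions = [], []
    for rec in local_records:
        key, hit = match_record(rec, indexes, keys)
        if hit is not None:
            matched.append((rec, key, hit))
        else:
            exceptions.append(rec)
    return matched, exceptions
```

For a target such as CHOICE or RCL that carries no OCLC numbers, the same functions can be called with `keys=("lccn", "isbn")`. This sketch expects ISBNs to be normalized beforehand with `normalize_isbn` and does not reconcile ISBN-10 with ISBN-13, which is one source of the extra exceptions mentioned above.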

Yet another approach, with its own associated cost, is to use a third-party tool or service such as WorldCat Collection Analysis, Library Dynamics, the GIST GDM, or the one offered by SCS. In various ways, these services allow the library to outsource some portion of the data comparison work--once the necessary data has been made available to the vendor. This has some advantages, but also some costs. As an example of completed data comparison, this SCS Collection Summary shows the preliminary results from GVSU:

In this instance, the combined effect of their deselection criteria resulted in 53,000 withdrawal candidates and 382 preservation candidates. The effect of individual data comparisons is reflected in the bottom section of the chart. The exact cost of completing this sort of analysis remains to be fully understood, as we are just beginning this work. But it is important to see all costs related to batch processes and tools in context. Any project involving more than 5,000 titles will be very time-consuming to do manually. It is also important to remember that we are still gathering information--and that several more steps remain to get the books off the shelves. Next up: selector review and staging.


Tuesday, March 29, 2011
