Thursday, March 1, 2012

Data with Benefits

[Image: Initial batch data extract from library]
Sustainable Collection Services (SCS), the company I run with three business partners, provides decision support for print monograph deselection. SCS processes are built on data and batch processing. We first import the library's bibliographic, item, and circulation data. That data is then normalized, cleansed, and compared to other data sets such as HathiTrust, WorldCat, peer libraries, and authoritative lists. Library-defined rules then operate against the resulting superset of data, enabling selectors or administrators to gauge the effect of different deselection criteria. Ultimately, candidate lists for withdrawal and preservation are produced.
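
To make the idea of library-defined rules a little more concrete, here is a minimal Python sketch of how two rule sets might be run against the same merged data to compare their effect. Everything in it is an assumption for illustration: the field names (`total_circs`, `us_holdings`, `in_hathitrust`), the thresholds, the sample titles, and the `withdrawal_candidates` helper are hypothetical, not the actual SCS schema or rules.

```python
from dataclasses import dataclass


@dataclass
class Title:
    """One merged record (hypothetical fields, for illustration only)."""
    local_id: str
    pub_year: int
    total_circs: int
    us_holdings: int       # assumed: holdings count from a WorldCat lookup
    in_hathitrust: bool    # assumed: flag from a HathiTrust comparison


@dataclass
class Rules:
    """A library-defined set of deselection criteria (hypothetical)."""
    published_before: int
    max_circs: int
    min_us_holdings: int
    require_hathitrust: bool = False


def is_candidate(title: Title, rules: Rules) -> bool:
    """True only if the title satisfies every criterion in the rule set."""
    if title.pub_year >= rules.published_before:
        return False
    if title.total_circs > rules.max_circs:
        return False
    if title.us_holdings < rules.min_us_holdings:
        return False
    if rules.require_hathitrust and not title.in_hathitrust:
        return False
    return True


def withdrawal_candidates(titles, rules):
    """Return the local control numbers of titles that meet the rules."""
    return [t.local_id for t in titles if is_candidate(t, rules)]


# Made-up sample data.
titles = [
    Title("b1001", 1975, 0, 250, True),
    Title("b1002", 1982, 1, 40, False),
    Title("b1003", 2005, 0, 300, True),
    Title("b1004", 1968, 0, 120, False),
]

loose = Rules(published_before=2000, max_circs=1, min_us_holdings=30)
strict = Rules(published_before=2000, max_circs=0, min_us_holdings=100,
               require_hathitrust=True)

for name, rules in (("loose", loose), ("strict", strict)):
    print(name, withdrawal_candidates(titles, rules))
```

Running the same titles through both rule sets shows how quickly the candidate list shrinks as criteria tighten, which is the kind of comparison a selector would want to see before committing to a scenario.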

That is the service we planned to build. It is the service we have actually built and applied to numerous library projects over the past year. But it turns out to be only part of our business. Working with large monograph data sets also creates opportunities for validation, remediation, analysis, and batch processing. Initially, we regarded these as side benefits. Now we are beginning to think of them as integral to the overall SCS service. Consider some simple examples:

  • Missing or Invalid OCLC Control Numbers: It is fairly common for some portion of a library's cataloging records to lack OCLC numbers. In other cases, those numbers may be truncated or malformed. For records without valid OCLC numbers, SCS uses a combination of LCCN and string-similarity matching to identify likely record matches and corresponding control numbers. These can be returned to the library in a batch to enable update of its catalog. 
  • OCLC Holdings Not Set: As SCS queries the WorldCat API to look up summary and peer holdings, it becomes apparent that in some cases the library's own holding has not been set. We can report these instances to the library, and produce a list that enables batch holdings update--sort of a miniature reclamation project.
  • Profile of a Group Collection: In one recent project with a pilot group of seven libraries, SCS identified uniquely held titles for each participant, as well as the degree of overlap for all other titles. Combined with corresponding circulation data, this enabled identification of a sweet spot for shared print commitments. There are many possibilities in this area.
  • Print/E-Book Overlap: Provided SCS has a library's records for both print and electronic books, it is increasingly possible to determine whether a low-circulation print title is also held as an e-book. There are many caveats here (e.g., it may be important to distinguish whether the e-book is owned, rather than simply available as part of a package or a patron-driven acquisition record). But this overlap is of interest to many libraries.
  • FRBR-on/FRBR-off: Edition matching is a critical element of deselection and collection analysis. For archiving and preservation purposes, exact matches are imperative. For user purposes, exact matches are sometimes important and sometimes not. SCS holdings lookups start with FRBR groupings off, a conservative approach that ensures edition-specific matches. For titles that return few holdings, we then re-run the lookups with FRBR groupings on, returning these "softer" matches to the library for review.
  • Batch Processing Support: Deselection projects create record maintenance work, regardless of whether titles will be transferred or withdrawn. Some record maintenance steps (e.g., suppression, location changes) can be completed as batch processes, based on lists that include local control numbers and necessary data elements. Often, SCS can produce labor-saving lists of this sort from the data we hold. 
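
The first example above (missing or invalid OCLC control numbers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual SCS matching logic: the record layout (`lccn`, `title`, `oclc` keys), the 0.9 threshold, and the helper names are hypothetical, and Python's standard-library `difflib.SequenceMatcher` stands in for whatever string-similarity measure is really used. The OCLC and LCCN values are made up.

```python
import re
from difflib import SequenceMatcher


def normalize_oclc(raw):
    """Return the bare digits of an OCLC number, or None if malformed.

    Strips the common "(OCoLC)", "ocm", "ocn", and "on" prefixes and any
    leading zeros. A number truncated to fewer digits still looks valid
    here; catching that requires a lookup against WorldCat.
    """
    if not raw:
        return None
    s = raw.strip().lower()
    s = re.sub(r"^\(ocolc\)", "", s).strip()
    s = re.sub(r"^(ocm|ocn|on)", "", s).strip()
    if not re.fullmatch(r"[0-9]+", s):
        return None
    return s.lstrip("0") or None


def norm_title(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between two normalized titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, norm_title(a), norm_title(b)).ratio()


def best_match(record, candidates, threshold=0.9):
    """Suggest an OCLC number: exact LCCN match first, then title similarity."""
    if record.get("lccn"):
        for cand in candidates:
            if cand.get("lccn") == record["lccn"]:
                return cand["oclc"]
    if not candidates:
        return None
    score, best = max(
        ((similarity(record["title"], c["title"]), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    return best["oclc"] if score >= threshold else None


# Made-up candidate records, as if returned by an earlier lookup.
candidates = [
    {"oclc": "111", "lccn": "75000001",
     "title": "Introduction to organic chemistry"},
    {"oclc": "222", "lccn": None,
     "title": "History of the decline and fall of the Roman Empire"},
]

record = {"lccn": None,
          "title": "The history of the decline and fall of the Roman Empire."}

print(normalize_oclc("(OCoLC)ocm00012345"))
print(best_match(record, candidates))
```

Matches that come only from title similarity are best treated as suggestions for review rather than as confirmed control numbers, since different editions often share nearly identical titles.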
[Image: Remediated batch of data en route to library...]
In each project, we encounter additional opportunities to derive new value from the data. In shared print projects, for instance, it will increasingly prove useful to highlight retention commitments as well as withdrawal opportunities. These are most efficiently handled as batch processes. As always, we are limited only by the data itself and our own creativity. We will continue to look for more ways to benefit from the effort that goes into deselection projects. Some solutions may be partial in scope, but in a large data set even partial solutions can save many hours of staff time.
