Tuesday, August 16, 2011

As Good As The Data

My partners and I at Sustainable Collection Services (SCS) coined the term 'data-driven deselection' as shorthand for our service offering and web application. We believe that solid data can help rationalize the necessary drawdown of print monograph collections, a process that sometimes elicits strong emotions. Good data lays out the facts and sets context. Consideration of circulation rates, the number of other copies in the state, region, or nation, and the existence and accessibility of secure digital versions makes informed retention decisions possible. Data-driven deselection ensures that withdrawals take place only when a title is well secured in the collective collection. The very same data ensures that the collective collection does not remain overloaded with copies of low-use books. It enables intelligent action.
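
For readers who like to see the logic spelled out, here is a minimal sketch, in Python, of how those criteria might be combined into a single retention test. The thresholds and field names are placeholders of my own, not SCS's actual rules, which are set with each library.

    # A minimal sketch (not SCS's actual rules) of a retention test that
    # combines usage, copy counts, and digital availability for one title.
    from dataclasses import dataclass

    @dataclass
    class TitleFacts:
        total_circulations: int   # checkouts over the review period
        national_holdings: int    # copies reported in WorldCat
        regional_holdings: int    # copies held by consortial partners
        in_hathitrust: bool       # a secure digital version exists

    def withdrawal_candidate(t, max_circs=2, min_national=100, min_regional=5):
        """True only when use is low AND the title is well secured elsewhere.
        The thresholds are placeholders; real projects set them by policy."""
        low_use = t.total_circulations <= max_circs
        well_secured = (t.national_holdings >= min_national
                        and t.regional_holdings >= min_regional
                        and t.in_hathitrust)
        return low_use and well_secured

In practice each library tunes those thresholds to its own comfort level; the data's job is to make the trade-off visible.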

Data, of course, is not monolithic. Despite the prevalence of library standards and agreed practices, there can be substantial differences among data from seemingly similar collections. For monograph deselection, the working data set includes not only bibliographic records, but also item/holdings information (e.g. location, barcode number, enumeration) and circulation data. Even when bibliographic data is relatively consistent, item records and circulation data often vary a great deal. A few observations from our early experience:

There are no perfect catalogs. Not exactly a news flash, but all catalogs include a healthy number of mistakes. Some matter more than others, and some matter more to users than to the analytics SCS is performing. Our work allows us to ignore most problems related to descriptive cataloging, but SCS does rely heavily (though not exclusively) on control numbers. The OCLC number, LCCN, and ISBN comprise the holy trinity for matching a library's holdings against WorldCat, HathiTrust and other target data sets. Control numbers seem straightforward, and in fact are--assuming that:  

1) they are actually present;
2) they are formed correctly;
3) prefixes are entered consistently; and
4) mysterious errors such as the insertion of a '7' in front of some OCLC numbers have not occurred.

Suffice it to say that data normalization on these fields is an essential first step.
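
As a rough illustration of what that normalization involves, here is a Python sketch for OCLC numbers and ISBNs. The prefix patterns shown are the common ones; real exports always turn up more variants, and this does no checksum validation.

    import re
    from typing import Optional

    def normalize_oclc(raw: str) -> Optional[str]:
        """Strip common prefixes ((OCoLC), ocm, ocn, on) and leading zeros."""
        if not raw:
            return None
        value = re.sub(r'^\(OCoLC\)\s*', '', raw.strip())
        value = re.sub(r'^(ocm|ocn|on)', '', value, flags=re.IGNORECASE)
        value = value.lstrip('0')
        return value if value.isdigit() else None   # reject malformed numbers

    def normalize_isbn(raw: str) -> Optional[str]:
        """Keep only digits and a possible final X; accept 10- or 13-digit forms."""
        if not raw:
            return None
        cleaned = re.sub(r'[^0-9Xx]', '', raw).upper()
        return cleaned if len(cleaned) in (10, 13) else None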

There are even fewer perfect inventories. Even the loveliest bibliographic record cannot directly answer the question 'is this item really on the shelf?' Shelf-reading and regular inventories are the sorts of tasks that libraries often defer in the press of other business. This is a logical trade-off in an era when print use is declining. But like all deferred maintenance, it eventually bites back. As shared print collections become more important, reliable inventory data is essential. That reliability is not a given at present; just ask any ILL librarian. Therefore, the date--and results--of the library's most recent inventory should be articulated in any deselection project.

Errors and anomalies also scale. Efficiency, batch processing, and scale are essential to library operations. It is important to remember that the flip side of these approaches is the possibility of systemic error. In a large data set, even a minuscule rate of errors or gaps can result in a sizable raw number of exceptions. A set of 1 million records that is 95% accurate includes 50,000 errors or questions. Such a data set would be rated AAA--and probably doesn't exist.
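
The arithmetic is worth making concrete; a few lines of code show how quickly small error rates become large piles of exceptions.

    # Raw exception counts at various accuracy rates for a million-record set.
    records = 1_000_000
    for accuracy in (0.999, 0.99, 0.95):
        exceptions = round(records * (1 - accuracy))
        print(f"{accuracy:.1%} accurate -> {exceptions:,} records to review")
    # 99.9% accurate -> 1,000 records to review
    # 99.0% accurate -> 10,000 records to review
    # 95.0% accurate -> 50,000 records to review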

Holdings in WorldCat and regional union catalogs are not always current. Spurred by journal de-accessioning projects, many libraries in recent years have embarked on OCLC "reclamation" projects to ensure that all holdings in the library's catalog are represented in WorldCat. This sort of recalibration is a good thing; the more libraries that pursue it, the more reliable the holdings information in WorldCat becomes. Reclamation also benefits monographs, and improves the accuracy of the WorldCat data on the number of copies held in the collective collection. The integrity of the shared print collection depends on verified holdings. The final step in any withdrawal project should be the removal of holdings from OCLC (or, as shared print archiving grows, the replacement of the library's holding symbol with that of the regional storage facility upon which it relies).

Circulation data varies widely and wildly. This point has come home to us with vigor recently, as we begin work with a small group of libraries seeking to share responsibility for retention of low-use print monographs. The first task is to identify those low-circulation titles, which requires combining and normalizing circulation data. This is more difficult than it sounds. Three different library systems are in use among the group, which means that circulation data is captured in different ways. Some libraries have total circulations back to 1988; others only for a few years. Some libraries retain the date of last circulation (at least for some segment of the data); others do not. Some libraries include in-house use, ILL, and reserve transactions in their circulation counts; others do not. Some libraries use their circulation module to 'check out' books to Acquisitions or Cataloging or Bindery while they are in process; others do not. What common usage data exists across all participating libraries? What level of analysis will the data support? Stay tuned on that one; there's a good deal of work to do first.
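
To make the normalization problem concrete, here is a hypothetical sketch of the lowest-common-denominator record we end up working toward. The system names and export fields are invented for illustration; the real point is that the differences get flagged rather than papered over.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CommonCirc:
        barcode: str
        total_checkouts: Optional[int]  # None when no running total was kept
        last_checkout: Optional[str]    # ISO date, or None if never retained
        includes_inhouse_use: bool      # flag the difference, don't hide it

    def from_system_a(row: dict) -> CommonCirc:
        # Hypothetical System A export: totals back to 1988 plus a last-use date.
        return CommonCirc(
            barcode=row["barcode"],
            total_checkouts=int(row["charges"]),
            last_checkout=row.get("last_charge_date"),
            includes_inhouse_use=False,
        )

    def from_system_b(row: dict) -> CommonCirc:
        # Hypothetical System B export: in-house use folded into the count, no dates.
        return CommonCirc(
            barcode=row["item_barcode"],
            total_checkouts=int(row["circ_count"]),
            last_checkout=None,
            includes_inhouse_use=True,
        )

Only once every participating library's export maps cleanly into something like this can we say what common usage data actually exists.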

These are just examples, of course. But they begin to illustrate the need for caution and precision in handling the data on which deselection decisions will be based. At present, when so many libraries have so much overlapping and unused content, it is possible to set aside any items with questionable data and still have plenty of scope to act. There are enough items with good data to achieve first-round deselection targets. For now, we can make significant progress by acting on only what the best data supports. Longer-term, this will get more complicated. As a community, we'll need to improve the data, or agree to run bigger risks.
