Tuesday, July 5, 2011

Disturbing Dust and Data

There must be something in the Australian autumn air.

On March 8, 2011, the Sydney Morning Herald reported that the University of New South Wales Library "is throwing away thousands of books and scholarly journals as part of a policy that critics say is turning its library into a Starbucks." The initiative, which aimed to remove 50,000 volumes per year from the stacks, outraged faculty, students, and librarians.  

On one side: "They're getting rid of books to make space for students to sit around, have lunch, and plug their laptops in."  

On the other: "The library has an ongoing program to remove print journals where online archival access is provided. Our academic community prefers to use the online versions and they use them very heavily."

For the record, UNSW's deselection policy prohibits discard of the last Australian copy of any book.

http://myartwithlines.blogspot.com/2011_01_01_archive.html
At about the same time, the University of Sydney announced plans to remove 500,000 low-use books and journals as part of a major renovation of its Fisher Library. University Librarian John Shipp noted that "volumes not borrowed in the past five years will be removed [to remote storage]." He referred to a "dust test" that indicated that books not borrowed are also not read. No books will be discarded; they will be moved to one of two offsite locations.
  
Picture by Melvyn Knipe
In response, University of Sydney students organized a protest on Facebook. They planned a "mass-borrowing" action to save the books and "disturb the dust", with each of the expected 50 protestors checking out the maximum allowable number of books. In the event, more than 500 students turned up. It is unclear how many books were checked out as part of the protest. An average of 10 per borrower seems like a reasonable estimate, although one woman reportedly arrived with a book trolley and the intent to save an entire section. Central News Magazine described the protest as follows:
Students and staff who oppose the move said Wednesday’s action was an attempt to prevent their removal.
“Our strategy is to borrow as many ‘old’ books as possible at lunchtime on Wednesday, to highlight the ridiculous notion that books have ... an expiry date,” the [protest] Facebook page states.
It is certainly laudable that undergraduates in particular were roused to defend their print collection. This speaks to the value that the idea of an academic library holds in users' imaginations. But between idea and reality falls...the dust. No matter how much anyone wants it to be otherwise, the fact remains that these 500,000 books have not been used for at least five years--and in many cases much longer. According to Paul Courant's estimate, it costs $4.26 per volume per year to retain these low-use titles in central stacks. Removal to a high-density storage facility reduces that cost to $.86 per volume per year. These books will remain available to users, and no content is being lost or even put at risk.
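For a sense of scale, Courant's figures can be applied to the Fisher Library numbers directly. A back-of-the-envelope sketch; the per-volume rates and the volume count come from the sources cited above:

```python
# Rough annual retention costs for the 500,000 low-use volumes,
# using Paul Courant's per-volume estimates cited above.
VOLUMES = 500_000
OPEN_STACKS_COST = 4.26   # $/volume/year, central stacks
HIGH_DENSITY_COST = 0.86  # $/volume/year, high-density storage

open_stacks_total = VOLUMES * OPEN_STACKS_COST
storage_total = VOLUMES * HIGH_DENSITY_COST
annual_savings = open_stacks_total - storage_total

print(f"Open stacks:     ${open_stacks_total:,.0f}/year")
print(f"Offsite storage: ${storage_total:,.0f}/year")
print(f"Savings:         ${annual_savings:,.0f}/year")
```

On these figures, the move saves the library on the order of $1.7 million per year, with no content lost.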

There are also opportunity costs that argue against the status quo. Clearly, like most academic libraries, the Fisher Library needs more space for students to study and collaborate. On upper floors, the stacks are reportedly too close together to allow adequate access for disabled users, a requirement the library is legally bound to meet, and wider aisles mean fewer shelves. And, of course, many of those students "having lunch and plugging in their laptops" are in fact accessing the library's electronic resources. All in all, University Librarian John Shipp has made a good case for a sensible proposal--one that balances responsiveness to users and collection integrity.

The students' response, led by history majors, appears both heartfelt and media-savvy. The organizers have clearly recognized that use (or rather, lack of use) determines how many books and which books will move offsite. Checking out thousands of older titles is an inspired strategy. It beefs up the circulation statistics, and may exempt thousands of titles from being moved to storage. It makes for compelling photographs and good news stories. Young people are rising up to protect our cultural and scholarly record!

Picture by Melvyn Knipe
But while the protest makes a good point with good theater, it also seems disingenuous. The "use" generated by this mass borrowing action is entirely artificial. It's hard to imagine that the "protest books" will be read, consulted, or cited in the same manner as those that actually relate to an assignment or research project. They are being used in a different way, to make a rhetorical point. (Of course, one might argue that this is the highest use of a book!) They are being used as a symbol or a category; the actual titles don't matter, so long as the volumes are dusty. In the end, this is not an argument about content at all. It's an argument about the idea of a library versus the reality of a library.

This is an important debate to have within our communities. But the discussion, however spirited, needs to occur honestly. One consequence (unintended?) of the mass borrowing: distorted circulation data. Thousands of titles now appear to have circulated that, if we are honest about it, would not otherwise have done so. We now have a somewhat false -- and essentially romantic -- picture of collection use. The picture has been shaded toward what University of Sydney students and faculty would like it to be. The ballot box has been stuffed, a thumb placed on the scale. And while the effect in this case is not statistically significant (perhaps 1%-2% of the proposed withdrawal candidates have been affected), we need to be aware that the data, along with the dust, has been disturbed.
 
One checkout...ah-ah-ah...
As we grapple with the future of print collections, it is vital to retain the distinction between what we wish were happening and what is actually happening. Circulation and in-house use data, coupled with data on lifecycle costs, offer the most reliable picture of what is actually happening. That's the best place to start, whatever we ultimately decide to do...at least according to our recently-hired Director of Statistics, pictured at right.

Wednesday, June 29, 2011

Front-End Alignment

To remain vital, sustainable systems require a dynamic balance of inflows and outflows. For print collections, one way to achieve this is to determine the library's 'carrying capacity' for shelving and to manage both acquisitions and withdrawals to that footprint. This assures that the collection does not grow beyond what the environment can support, and that space allocated to print contains titles most likely to be used. This blog often focuses on deselection and managed drawdown of existing print collections. But it is equally important to attend to inputs. Achieving a sustainable collection may also require managing down the acquisition of print.
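The 'carrying capacity' idea reduces to simple arithmetic. A minimal sketch, with entirely hypothetical figures (no actual library's data):

```python
# Sketch of 'carrying capacity' management for a print collection.
# All figures are hypothetical, chosen only to illustrate the balance.
capacity = 250_000    # shelf capacity, in volumes
holdings = 240_000    # current print holdings
intake = 8_000        # print volumes acquired per year
withdrawals = 8_000   # volumes deselected per year

def years_until_full(holdings, capacity, intake, withdrawals):
    """Years until the collection exceeds shelf capacity.

    Returns None when outflows match or exceed inflows --
    the collection is sustainable at its current footprint.
    """
    net_growth = intake - withdrawals
    if net_growth <= 0:
        return None
    return (capacity - holdings) / net_growth

print(years_until_full(holdings, capacity, intake, withdrawals))
```

With intake and withdrawals balanced, the function returns None: the collection never outgrows its environment. Cut withdrawals to 5,000 while acquiring 10,000, and the same library hits capacity in two years.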

Patron-Driven Acquisitions (PDA) shows great promise as a technique for controlling the front end of the print lifecycle. In an environmental context, PDA is the 'Reduce' in 'Reduce, Re-Use, Recycle.' In automotive terms, PDA can be seen as a front-end alignment, the process that assures a car steers and handles properly. In concert with a program of active deselection, PDA can assure that a print collection strategy stays on course.

There are print PDA programs in place at some universities (Vermont, Cornell, UC-Riverside), and such programs may serve as a bridge strategy, especially since only 30% of newly-published books are available simultaneously in print and electronic form. (In an interesting development here, Cornell and Coutts have collaborated to enable automated query of Coutts inventory from the Library's Voyager system.) But the full potential for space savings and sustainable collection management clearly lies in patron-driven eBooks. Their immediate availability and new business models such as short-term loans make the digital version of PDA especially attractive.

The ALCTS Acquisitions Section Technology Committee explored PDA in detail at a day-long preconference last week in New Orleans. As moderator of the session, I enjoyed a ringside seat as the 64 attendees and 11 presenters described new approaches to aligning selection of materials with user demand -- a key element in achieving print collection sustainability.

In a show-of-hands survey, attendees predicted that 30%-50% (and in a couple of cases 70%) of their monographs budgets will be dedicated to PDA within the next five years. The early evidence suggests that this approach will serve users better, will slow the rate of growth of physical collections, and will ultimately reduce the need for deselection. In a particularly compelling example, Doug Way of Grand Valley State University highlighted the savings after one year of a PDA program with EBL:
  • GVSU experienced 10,514 uses against the 50,000 PDA records loaded
  • 5,251 of these uses were short-term loans. Only 343 were used enough to trigger purchase.
  • Had all 10,514 been purchased, GVSU would have spent $550,464. Instead, the few purchases and many short-term loans cost just under $70,000. GVSU saved $481,625.
  • Space-wise, the Library satisfied user demand while putting 10,000 fewer books on its shelves. 
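The reported figures are easy to reconstruct -- a quick check of the arithmetic above:

```python
# Reconstructing the GVSU/EBL first-year arithmetic reported above.
would_have_spent = 550_464   # cost if all triggering uses had been purchases
reported_savings = 481_625   # savings as reported by GVSU

actual_spend = would_have_spent - reported_savings
print(f"Actual spend: ${actual_spend:,}")  # $68,839 -- "just under $70,000"
```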

As always, your results may vary, but there is clearly great potential here. There are also many issues, ranging from objections to giving users too much power to concerns about archiving, metadata management, workflow design, and coordination with print approval plans. As I have noted before, PDA represents a profound change in the role of collection developers, which increasingly tends toward 'curating a discovery environment.' The many facets and many players involved in PDA are captured nicely in the presentations from this session, which will soon be available on ALA Connect. Have a look: there is a great deal of innovation and experience represented there.

Pictured below: Our Illustrious Panel of Supreme Magnitude, along with the Technology Committee members who organized this session. Thanks to everyone who contributed and attended for an excellent day.
SEATED: Doug Way (GVSU); Michael Levine-Clark (Univ of Denver); Barbara Kawecki (YBP); Robin Champieux (EBL). STANDING: Janet Hulm (Ohio Univ); Sadie Williams (YBP); Clare Appavoo (Coutts); Annette Day (NCSU); Rick Lugg (R2/SCS); Matt Barnes (ebrary); Lai-Ying Hsiung (UC/Santa Cruz); Jesse Koennecke (Cornell); Suzanne Ward (Purdue); Boaz Nadav-Manes (Cornell). Not Pictured: Mandy Havert (Notre Dame)

Tuesday, June 21, 2011

Philistines and Dinosaurs

Last week, I traveled to St. Louis to deliver a presentation on "Rethinking Library Resources: Sustainable Print Collections in a Digital Age." From a speaker's vantage point, it was an excellent experience, equal parts CNN and Fox News. How, you may well ask, is that possible?

CNN first. The event's sponsor was Missouri Library Network Corporation, better known as MLNC, led by the energetic Tracy Byerly. Among other services, MLNC provides training sessions for its members in Missouri, Illinois, and Kansas. To enable broad and low-cost participation, MLNC's Associate Director Keith Gaertner has put together a multi-faceted infrastructure for distributing presentations: a mix of live audience, teleconference, and webcasting. My in-person session was broadcast to five remote sites, with reciprocal audio/video, and over the web via WebEx. This enabled dozens more people to listen and watch, comment and question, all without leaving their own libraries.

The MLNC Situation Room
For a presenter, the experience is slightly disorienting. It's as if you've suddenly become Wolf Blitzer in the Situation Room, surrounded by multiple video displays, each showing a panel of librarian-avatars--like a live version of Second Life or something. All that's missing is the John King-style touchscreen. In the photo at left, Tracy checks in with four of the remote sites just prior to the presentation. In short, the full CNN.

The Fox News sensation dawned later, after 90 minutes of making the case for data-driven deselection. I had described the low and declining circulation of print books. I had outlined the lifecycle costs of managing monographs on the shelf and in storage. I had talked about other potential uses of library space. I had presented options for carefully managing down redundant print collections in the context of secure digital versions (Hathi Trust), accessible digital versions (eBooks), emerging shared print archives, and multiple copies available through ILL. Ultimately, I suggested that there exists--right now--ample opportunity to begin considered and coordinated drawdown of print collections, and to contribute to the collective collection, assuring that no content disappears.

At the conclusion of Q&A, a distinguished gentleman in the back row commented how much he appreciated the neutral tone of the presentation. In his experience, discussions around deselection too often posit the conservators as "dinosaurs" who can't recognize the decline of print and the activists as "philistines" who are oblivious to the value of what they are dismantling. "Rethinking Library Resources" looks at data, at middle ground, and at tools for making progress. In effect, he validated our argument as "fair and balanced", resulting in my unlikely Fox News moment of zen, to mix a couple of cable metaphors.

Strangely enough, media and politics can play major roles in the relatively mundane activity of weeding library collections. The prospect of removing books from library shelves can trigger primal reactions, as the University of Sydney recently discovered. This is exactly why the "data-driven" aspect of deselection espoused by SCS is so important. It is critical that withdrawal candidates be identified in context and that withdrawal or preservation decisions be informed by archival commitments and availability of other copies. There are millions of low-use books that are held by hundreds of libraries; many of them are also available digitally or on-demand. This leaves plenty of scope for immediate action without endangering the integrity of the scholarly and cultural record.

Rescued from caricature, the views of both the dinosaurs and the philistines merit attention. There are real issues to be debated here.

Point: Dinosaurs: Print still has enormous value. Not everything is available digitally -- or available in satisfactory form or resolution. Sufficient copies of print must be retained to assure that no content is ever lost, and to rectify problems found in digital versions. Print sometimes includes content or context lacking in the digital version. To some degree, the original artifacts matter. Full books are still mostly read in print form. Libraries typically own print, which assures access over time. There is always the risk of discarding something valuable. Research libraries are as much about future use as present use.

Counterpoint: Philistines: On the other hand, print use actually is declining. In many libraries, low-use books limit space available for users. It costs serious money to retain monographs on open shelves ($4.26/year per volume) and noticeable money to store them in high-density facilities ($.86/year per volume). Electronic access is in most cases preferable, especially for remote users and for multiple simultaneous users, not to mention procrastinators. A great deal of low-use content is readily re-obtainable in the unlikely event that it is needed. Retention of no-use print locks up resources that could be used for other purposes--this is on some level a misuse of scarce resources.

To husband our collective resources effectively, we need to respect both of these viewpoints. Although it can be great fun, we need to avoid polarizing the discussion, and to proceed cautiously and rationally, balancing data and experience. But we do need to proceed. As a profession, we have a problem to solve. And we do need to learn how to make a case that honors both past and future--while allowing us to survive and serve our users in the present.

Thursday, June 2, 2011

The Hathi Effect

The HathiTrust database is of fundamental and growing importance to any deselection project. To some degree, this is true for any library, though, as always, membership has its privileges. As of today, 4,788,131 book titles reside in Hathi in secure, TRAC-certified digital form. The academic community as a whole can rest assured that these titles will not disappear from the cultural record. If Hathi offered nothing more than preservation and security of this sort, its value would still be elephantine.

 But in fact, the Trust provides a great deal more. As described in Heather Christenson's excellent article [PDF] in the April 2011 issue of Library Resources & Technical Services, the repository "is providing full-text search across more than 2.8 billion pages." In itself, this is a major enhancement to discoverability, but Hathi also provides direct access to full-text content. This occurs to different extents for different classes of users and depends on the copyright status of each title. And while those who contribute most naturally benefit the most, virtually any library can obtain some degree of access to this 'research library at Web scale.'

I've taken the liberty of re-arranging one paragraph from the Christenson article, and adding a comment or two, to highlight the potential for using Hathi as one element in a surrogate collection strategy.

Public Domain Titles: View Online
  • "All public domain titles can be viewed on the web in a page-turner application."
Public Domain Titles: Downloads
  • "Google-digitized public domain volumes are available in a full PDF download to authenticated users from partner institutions"
  • "public domain volumes digitized via Internet Archive and locally by partners are available in full PDF to all."
Public Domain Titles: Printed Versions
  • "Printed versions of public domain books from some partners are now offered via a link within the HathiTrust Interface to print-on-demand service."
In-Copyright Titles: 
  • "Volumes that are in copyright are discoverable via large-scale search, and users may view a list of pages on which their search term appears."
  • Users can also 'find in a library' via an embedded link to WorldCat. [added comment].
The ready availability of digital versions of these titles greatly reduces any risk associated with removing physical copies from library shelves, especially when those physical copies have never circulated. Among the first five libraries with which SCS has worked, Hathi public domain matches range from 3% to 5%--typically not enough to generate significant space savings, but an uncontroversial place to start.

This first step is especially attractive given the convenience of adding Hathi URLs to existing MARC records. Yesterday's announcement that the California Digital Library has opened its HathiTrust SFX Target to the broader SFX community adds yet another convenient access path.

Wednesday, May 25, 2011

Curating a Discovery Environment

[Update: 11/8/11: Book published last week].  Late last year, my good friend David Swords of EBL asked me to contribute the opening chapter to a forthcoming book entitled "Patron-Driven Acquisitions: History and Best Practices." This collection of essays and research studies will be published in July by DeGruyter, and includes contributions from a range of librarians, vendors, and publishers on this hottest of topics.

Samuel Johnson aside ("none but a blockhead ever wrote except for money"), the discipline of writing always leads to learning, and with luck a good concept or two. As I thought about the changes in collection development and management that have taken place in the past decade, it struck me that the work of selectors now emphasizes:
  1. Collecting for the Moment: given the growing availability of book content in electronic form--and the growth of secure digital archives such as Hathi Trust--it is no longer necessary for individual libraries to collect for the ages. Rather, the task is to assure that content can be delivered at the moment it is wanted--from wherever it may be. This links "collections" (or more properly, access) much more closely to discovery.
  2. Curating a Discovery Environment: this is now the central task of the activity formerly known as collection development. Or, as outlined in the chapter I have called "Collecting for the Moment: Patron-Driven Acquisition as a Disruptive Technology":
The philosophical shift underlying PDA is profound and multi-dimensional. Instead of curating collections of tangible materials, libraries have begun to adopt a new role: curating a discovery environment for both digital and tangible materials. Instead of deliberately trying to identify titles most relevant to curriculum and research interests within a discipline, broad categories of material that may be relevant are enhanced for optimum discoverability, immediate delivery, and partial or temporary use. Instead of purchasing materials just in case a scholar may one day need them, PDA offers “just in time” access to needed titles or portions of titles. Instead of collecting for the ages, libraries are using PDA to enable more targeted collecting for the moment.
This same logic applies to deselection. As we begin to come to terms with digital books and widely-shared print collections, it is critically important to assure that items removed from the library shelves remain discoverable and deliverable through other means. We continue to curate the discovery environment, and to enhance delivery options. Paradoxically, it may actually make sense to enrich the metadata associated with deselected items. This might include not only descriptive metadata, but also "availability" metadata: a URL to a Hathi Trust public domain copy; print holdings in a shared archive; commercial e-book availability; print-on-demand; or links to used book dealers.
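As a sketch of what such an enriched record might look like -- a hypothetical structure, not any actual system's schema; every field name and value below is illustrative:

```python
# Hypothetical "availability" metadata for a deselected title,
# sketching the enrichment described above. All values are illustrative.
deselected_title = {
    "oclc_number": "00123456",          # illustrative identifier
    "title": "Example Monograph",
    "withdrawn_from_shelves": True,
    "availability": {
        "hathitrust_url": "https://catalog.hathitrust.org/Record/...",  # public domain copy
        "shared_print_archive": "WEST",  # holding print archive, if any
        "ebook_vendor": "EBL",           # commercial e-book availability
        "print_on_demand": True,
        "used_book_dealers": ["Example Dealer"],
    },
}

# A discovery layer could surface delivery options even after withdrawal:
options = [k for k, v in deselected_title["availability"].items() if v]
print(options)
```

The point is that withdrawal from the shelves need not mean withdrawal from the discovery environment; the record simply points elsewhere for delivery.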

Monday, May 16, 2011

The Cost of Deselection (10): Summing Up

The Project:  To remove 10,000 volumes from a 250,000-volume library to make room in the stacks for 1-2 years' collection growth.

Assumptions: Librarian time @ median rate of $35/hour (including benefits) and 40-hour work week. Batch data comparison for 250,000 titles would cost a minimum of $10,000 to prepare, trouble-shoot and execute. (This is essentially an educated guess, based on logic outlined here and here.)

The Process and Estimated Costs:

     Project Design and Management          100 hours      $ 3,500
     Data Extract                            20 hours      $   700
     Develop Deselection Criteria           100 hours      $ 3,500
     Communication w/Stakeholders           100 hours      $ 3,500

     Data Comparison w/3 Targets (batch)                   $10,000
     Title Review from List Only                           $ 5,810
     Disposition & Record Maintenance                      $ 5,000

===========================================================================

     TOTAL with list-only review                           $32,010
     Price per volume:                                     $  3.20

     TOTAL with in-stack review                            $39,000
     Price per volume:                                     $  3.90

     TOTAL with staged review                              $41,400
     Price per volume:                                     $  4.14
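The list-only total can be reconstructed from the line items -- a quick check of the arithmetic above:

```python
# Recomputing the list-only estimate from the line items above.
RATE = 35  # $/hour, median librarian rate including benefits

hours = {
    "project design and management": 100,
    "data extract": 20,
    "develop deselection criteria": 100,
    "communication w/stakeholders": 100,
}
fixed = {
    "data comparison (batch)": 10_000,
    "title review from list only": 5_810,
    "disposition & record maintenance": 5_000,
}

total = sum(h * RATE for h in hours.values()) + sum(fixed.values())
per_volume = total / 10_000  # volumes withdrawn

print(f"Total: ${total:,}; ${per_volume:.2f} per volume")
```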


The numbers are easy to poke holes in, so please feel free to do so. We have laid them out as fully as we currently understand them, but they will vary with local conditions. There is still a great deal to learn here. We are interested in other perspectives and experiences. In most cases, the per-volume cost will decrease substantially with volume--i.e., deselection criteria, once developed, can be applied to many more volumes for only incremental cost.

The steps in the process may also vary somewhat, but there are really none that can be skipped entirely. Unless a library manager is willing to forgo communication with stakeholders, to prevent review of candidate titles, or to act based solely on usage data, the deselection process will require meetings and discussions, decisions and adjustments. These take time, and probably more time than estimated here.

These are intended as reasonable estimates, and as a starting point for discussion. Comments are welcome and revisions are likely.


Monday, May 9, 2011

The Cost of Deselection (9): Data Comparisons Revisited

Time to exercise our perpetual beta clause. In trolling through this growing string of posts on deselection costs--with an eye toward toting them up--it struck me that the entry on "Data Comparisons" could benefit from an attempt at more specific estimates, both for manual searching and batch matching of candidate files to comparator data sets. In the example outlined in that post, the task was to compare 300,000 candidates to three targets: WorldCat, CHOICE, and Resources for College Libraries--essentially, to perform 900,000 searches, then record and compile the results.

This is an area where costs are very difficult to estimate. Some libraries may be able to forgo the task of assembling deselection metadata that extends beyond circulation and in-house use, but probably not too many. Withdrawing monographs can be controversial. Evidence of holdings in other libraries, print archives or in digital format is essential to making and defending responsible deselection decisions. Comparing low-circulation titles to WorldCat holdings is the most helpful first step, especially since the holdings data gleaned there can include Hathi Trust titles, some print archives, and specified consortial partners. At minimum, all low-circulation titles in the target set should be searched in WorldCat.

There are a couple of ways to do this. Manual searches based on OCLC control number, LCCN, or title offer one approach, but it is time-consuming to key the searches, interpret and record the results. To date, the smallest data set SCS has worked with included 10,000 candidates; the largest 750,000. At 60 searches per hour (1 minute each), it would require 5,000 hours to search 300,000 titles. At $8/hour for student workers, the cost would be $40,000; at $20.68 for a library technician, the cost would be $103,400. Searching 750,000 items in this manner, besides equating to cruel and unusual employment, would require 12,500 hours. Clearly, this is not a realistic option--so much so that we won't even calculate the time or money involved.
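The arithmetic is simple enough to make explicit -- a sketch of the estimates above:

```python
# Cost of manually searching every candidate title in WorldCat,
# at the rates assumed above.
def manual_search_cost(titles, rate_per_hour, searches_per_hour=60):
    """Return (hours required, labor cost) for one search per title."""
    hours = titles / searches_per_hour
    return hours, hours * rate_per_hour

hours, student_cost = manual_search_cost(300_000, 8.00)     # student workers
_, technician_cost = manual_search_cost(300_000, 20.68)     # library technician

print(f"{hours:,.0f} hours; students ${student_cost:,.0f}; "
      f"technician ${technician_cost:,.0f}")
```

And that is one search per title against a single target; three targets triple everything.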

The obvious answer, for a candidate file of any significant size, is to use computing power for what it's best at -- automated batch matching. Over the past six months, my partners and I at Sustainable Collection Services have worked closely with library data sets ranging in size from 10,000 titles to 750,000 titles. We have learned a great deal about how to prepare data for large-scale batch matching, and how to shape results usefully. Among other things, we have learned that there are many variables in both source and target data sets.

We have worked with clean data and not-so-clean data. We have found many mutations of control numbers. We have learned to identify malformed OCLC numbers, and a good deal about how to correct them using LCCN, ISBN, and string-similarity matching. Our business partner and Chief Architect Eric Redman even invoked the fearsome-sounding Levenshtein distance algorithm, which sounds cool and turns out to be not so fearsome after all. But a few snippets from our internal correspondence may provide some local color related to data issues.
1. Matching by title: I have implemented a basic starting point that can be enhanced as we gain experience, using the very simple Levenshtein distance algorithm. We can plug in more sophisticated algorithms later. With this basic starting point I am able to determine which LCCNs are suspect and which OCLC numbers are suspect. When only one is suspect on a record, I can use the other to correct the suspect value.
2. Using this approach, I can detect [in this record set] a pattern of a leading zero of the OCLC number having been replaced by a "7." This occurs in 1621 records. I can correct this.
3. I should begin to catalog these patterns of errors with LCCNs, OCLC Numbers, and ISBNs. I can create some generalized routines that we can plug in depending on the "profile" of the library's bibliographic data.
Or, as we began to extrapolate our response to these anomalies, and to build a more replicable process, which we would embed into SCS ingest and normalization routines:
1. Do a title similarity check on OCLC Numbers by comparing the title retrieved from WorldCat via OCLC Number to the title from the library's bib file.
2. If step 1 yields a similarity measure of .3 or less, where 0 is a perfect match, then stop. Otherwise, use the LCCN from the bib file to do a WorldCat title similarity check.
3. If step 2 yields a similarity measure of .3 or less then stop. Otherwise use the OCLC Number from the step 2 WorldCat rec to replace the library's OCLC Number.
4. If step 3 did not yield a match, log the library's bib control number and OCLC Number for manual review. The manual review is where we'll identify patterns like the spurious 7 in the Library's initial file.
This particular exchange actually concerned the smallest file we've yet handled, a mere 10,000 records! As we fix these errors in the data extract provided to us by the library, it seems clear that an opportunity also exists to provide some sort of remediation service to the library, improving its overall data integrity. But that's a story for some other day. 
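For the curious, the title-similarity check described in the correspondence can be sketched with a normalized Levenshtein distance. The 0.3 threshold comes from the steps above; the function names are illustrative, and in practice the comparison title would come from a WorldCat lookup by control number:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def title_distance(a: str, b: str) -> float:
    """Normalized distance: 0.0 is a perfect match, 1.0 total mismatch."""
    if not a and not b:
        return 0.0
    return levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def oclc_number_is_plausible(local_title: str, worldcat_title: str,
                             threshold: float = 0.3) -> bool:
    """Trust the control number only if the two titles agree closely."""
    return title_distance(local_title, worldcat_title) <= threshold
```

When the titles disagree, the same check can be repeated with the record retrieved by LCCN, as in steps 2 and 3 above; records that fail both checks are logged for manual review.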

Our learning curve has been steep and steady. We have now worked extensively with the WorldCat API and learned some lessons about capacity the hard way. We have worked with target data sets that require match points other than OCLC number. We have made mistakes and we have fixed mistakes. We have made more mistakes and fixed those too. And ultimately we have managed to produce some useful and convenient analyses and withdrawal candidate lists for our library partners. We have performed multiple iterations of criteria and lists in order to focus withdrawals by subject or location. In short, we have done what any individual library would need to do in order to run similar data comparisons.

None of these is a trivial task, and everything gets more interesting as files grow larger. SCS has found that effective data comparison requires top-flight technical skills for data normalization and remediation. It requires configuring and launching multiple virtual processors to improve matching capacity. It requires creation of a massive cloud-based data warehouse, already populated with tens of millions of lines. It requires high-level ability with SQL and a frighteningly deep facility with Excel. Formatting withdrawal lists for printing as picklists is also much more time-consuming than one might expect. Because SCS is doing this work repeatedly, we continue to learn and improve. We have developed some efficiencies and will continue to do so. But most individual libraries will not have that advantage. If you plan to do this in your own library, also plan on your own steep and steady learning curve.

Which, finally, brings us back to the question of costs. Automated batch matching is clearly the way to go, but automation and creation of batches may require significant investment and a surprisingly wide range of high-end technical skills. These are not cheap. While it remains difficult to know what dollar figure to ascribe to data normalization, comparison, trouble-shooting and related processes, it is clearly substantial.
