A Brief Tour of the System-Wide Print Book Collection

Ubiquity, Volume 2006 Issue September | BY Brian F. Lavoie, Roger C. Schonfeld


Print collections will likely undergo significant transformation as libraries continue to reshape themselves in the networked digital age. Some transformations will occur at the local level to meet the particular needs and requirements of a single institution and its users. However, it is likely that many more transformations will take place within a system-wide context, in which individual library collections are treated not as isolated units but as parts of the aggregate library collection, the combined holdings of multiple libraries. Of course, the system can be defined at various levels of aggregation: by state, by region, nationwide, or even all libraries everywhere. [2] In all cases, however, the key point is that decisions regarding local collections will eventually and inevitably be taken with system-wide implications in mind.

Today, decision making in a number of important areas would benefit from consideration of the system-wide context. Mass digitization programs such as the Google Print and Open Content Alliance projects raise fundamental questions about the size and scope of the entire system-wide collection. Given resource constraints, digitization efforts cannot hope to digitize every book in every library; some trade-offs will be necessary, such as digitizing selectively in certain disciplines, certain languages, or even certain libraries. Decision makers need to know the resources held in the system as a whole if they are to consider how multiple strategies might complement one another, avoid duplicative effort, and allocate resources in ways that maximize value and cost-effectiveness. [3]

A system-wide perspective is also useful in formulating collection management and preservation strategies. For example, retention, storage, and preservation decisions would benefit from knowledge of collection overlap across the system. Which resources are held redundantly by many libraries? Which resources are held by only a few libraries, or perhaps even a single library? Is the degree of overlap in the system acceptable to ensure resource survivability, given the risks of loss? Answers to these questions are necessary to inform the "weeding" of print collections, targeting scarce preservation funds to where they are needed most, and shared strategies for storage and preservation. Again, analysis of the system as a whole is necessary to support these decisions.

Digital and network technologies are breaking down the boundaries between local collections. Of course, this process has been at work for some time, as the increase in resource sharing has blurred the distinction between local and external collections. But now digitization, in combination with network connectivity, has accelerated the process, with one "copy" of a resource potentially being shared across many libraries. Digitization and online availability have opened up heretofore local collections to geographically dispersed audiences; these technologies therefore present new opportunities to expose users to resources beyond the local collection. Knowing what is in the system-wide collection, and how it is distributed across libraries, is an essential first step toward making that collection available to a system-wide audience.

In short, all of these factors (mass digitization, optimized collection management and preservation, and widespread access), as well as others, have contributed toward a shift in focus to the resources of the system, rather than individual library collections. But even as momentum in the library community has begun to shift towards a system perspective, the data needed to support and understand this perspective have not been widely available. In particular, data to support management and policy making at the system level, or at least with system implications in mind, are not routinely produced and analyzed. Yet this type of data will be increasingly critical as libraries and library collections become more deeply intertwined with the networked digital environment. System-wide analysis is certainly not a new concept; previous research studies have adopted, to a greater or lesser degree, a system perspective. [4] But in the networked age, the need to think "systematically" has never been greater.

In this paper, we focus on the collection of print books across libraries. Print books have been of particular interest recently, with the announcement of several mass digitization initiatives aimed at library print-book collections. This is not to say, of course, that print books are the only materials that would benefit from analysis at a system level. Given their rapidly progressing digital transformation, the serials literature (especially in the sciences) would provide another important terrain for system-wide analysis. However, analysis of the serials literature is sufficiently complex to warrant a separate study in its own right, and we therefore leave this for future work.

Questions addressed in this paper touch on some of the salient features of the system-wide print-book collection. How many titles does the system contain? What holdings patterns prevail within the system, especially in regard to the degree of overlap and the incidence of rare or unique materials? What are some of the characteristics of the print books in the system-wide collection, such as date of publication and language? This study offers only a brief sketch of the system-wide print book collection, with the objective of providing some examples of the kinds of data that could usefully be collected and applied. Ultimately, we hope this view of the system-wide print book collection helps libraries gain a fresh perspective on their collections, especially as they evaluate future needs and opportunities.

Defining the System

Any effort to count and characterize the components of a system naturally raises basic questions about how the system itself is defined. In regard to the system-wide print book collection that is the subject of this paper, the system in question should, ideally, consist of all print books held by libraries everywhere. [5] But assembling data on the system defined in this way presents practical difficulties that would be immensely difficult, if not impossible, to overcome. As a best approximation, we decided to define the boundaries of the system by the largest single source of cross-institutional bibliographic data available, OCLC's WorldCat database.

WorldCat is the world's largest and most comprehensive bibliographic resource. It currently contains more than 60 million bibliographic records and more than one billion holdings, reflecting the collections of more than 20,000 libraries worldwide. For the purposes of this paper, the print books represented in WorldCat serve as the system under study.

Defining a system in the context of WorldCat has several limitations. Not all print books held by libraries have been cataloged in WorldCat, nor have all libraries set their print book holdings in WorldCat. Moreover, WorldCat largely reflects North American library collections. But while WorldCat does not represent the entire universe of library collections, it provides the closest approximation to the ideal, and therefore is the best choice as a data source for a general overview of the system-wide collection of print books. The statistics reported in this paper are based on the version of the WorldCat database from January 2005, containing about 55 million records and 950 million holdings.

The System-Wide Print Book Collection

The bibliographic records in WorldCat describe a wide variety of information resources, manifested in a range of formats. Figure 1 illustrates how the WorldCat database can be progressively filtered down to the subset of records describing print books.

Figure 1: Print Books in WorldCat

Of the approximately 55 million records in WorldCat as of January 2005, about 41 million described monographic language-based materials, which for the purposes of this study are considered "books." [6] Records describing theses or dissertations and government documents were then removed from this total; those materials are generally acquired and managed as separate segments of a library collection and therefore were excluded from the analysis. This lowered the total to about 35 million records. Finally, the scope of the analysis is limited to print books only, so all other formats, such as digital, microform, or Braille, were removed. This produced the final total of about 32 million print books cataloged in WorldCat.
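The progressive filtering described above can be sketched in a few lines of Python. This is only an illustration of the funnel logic, not the procedure actually run against WorldCat: the `Record` class and its field names are hypothetical stand-ins for whatever coded values a real bibliographic record would carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; real catalog records encode these differently.
    is_monographic_language_material: bool  # counted as a "book" in this study
    is_thesis_or_gov_doc: bool              # managed as a separate collection segment
    fmt: str                                # e.g. "print", "digital", "microform", "braille"


def filter_print_books(records):
    """Apply the three filters in order and report how many records survive each stage."""
    books = [r for r in records if r.is_monographic_language_material]
    core = [r for r in books if not r.is_thesis_or_gov_doc]
    print_books = [r for r in core if r.fmt == "print"]
    return {
        "all records": len(records),
        "books": len(books),
        "books excluding theses and government documents": len(core),
        "print books": len(print_books),
    }
```

Applied to the January 2005 database, the four stages correspond to the approximate totals reported above: 55 million, 41 million, 35 million, and 32 million records.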

The remainder of this paper discusses some of the salient characteristics of the system-wide collection represented by the 32 million print books cataloged in WorldCat.

Works and Manifestations

FRBR (Functional Requirements for Bibliographic Records) is a framework for understanding the relationships between various bibliographic entities, including works, expressions, manifestations, and items. Two bibliographic entities of interest for this study are works and manifestations. FRBR defines a work as "a distinct intellectual or artistic creation." Thus, Macbeth is a work. A manifestation, on the other hand, is a physical embodiment of an expression of a work. Thus, the Folger Shakespeare Library edition of Macbeth, published in paperback by Washington Square Press in 2004, is a manifestation of the work Macbeth. A single work can have multiple manifestations.

WorldCat records describe manifestations. Therefore, the 32 million print books cataloged in WorldCat represent 32 million distinct print book manifestations. Most of the analysis reported in this paper concerns manifestations, but it is also useful for some purposes to consider system-wide implications in terms of works. For example, works can shed additional light on questions of collection overlap beyond what is possible with manifestations alone. To explore some of these implications, the FRBR work set algorithm, developed by OCLC Research, was used to cluster the 32 million records in the system-wide print book collection into their associated works. [7]

There are a little over 26 million distinct works represented in the 32 million print book manifestations in the system-wide collection. This suggests that on average, each of these works contains just over one (actually, 1.2) print book manifestation, although certainly there is a subset of works containing many more.
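To make the work/manifestation relationship concrete, the sketch below groups manifestation records into works using a normalized author/title key. It is a toy illustration only and should not be read as the OCLC Research work set algorithm, which is not described in detail in this paper; the dictionary keys `author` and `title` are assumed names.

```python
import re
from collections import defaultdict


def _normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def cluster_into_works(manifestations):
    """Group manifestation records that share a normalized author/title key."""
    works = defaultdict(list)
    for record in manifestations:
        key = (_normalize(record["author"]), _normalize(record["title"]))
        works[key].append(record)
    return dict(works)


def mean_manifestations_per_work(n_manifestations, n_works):
    """Average number of manifestations per work."""
    return n_manifestations / n_works
```

With the totals reported above, `mean_manifestations_per_work(32_000_000, 26_000_000)` gives a value a little above 1.2, in line with the average cited in the text.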

By definition, each of the 26 million works associated with the system-wide print book collection contains at least one print book manifestation, but as Figure 2 illustrates, less than half a percent contains both a print and a digital manifestation.

Figure 2: Print Manifestations, Print Works, and Digital Manifestations

This result must be interpreted carefully, because no one knows the exact proportion of digital resources held by libraries that are cataloged in WorldCat. But even if this figure is in reality 100 times greater, it is clear that only a small proportion of print books have been digitized. The transition of "legacy print content" to digital form has barely begun.

Holdings Patterns

The degree of collection overlap is a key issue in several areas, including digitization and preservation. Our examination of holdings patterns provides insight into the portions of the system-wide print book collection held redundantly by many libraries, and the portions held by only a few libraries, or perhaps even a single library.

Figure 3 summarizes some of the holdings patterns in the system-wide print book collection, showing the number of works held uniquely, those held twice, and those held more frequently throughout the system. By analyzing these statistics, we obtain a view of the maximum degree of overlap within the system-wide collection, since multiple manifestations of the same work are not regarded as distinct holdings.

Figure 3: Books Held Multiple Times

Our data indicate the presence of about 9.5 million works that are held uniquely within the system, or about 36 percent of the total, and only approximately 2.4 million works with 50 or more holdings. While at first glance these figures may seem alarming, they deserve careful scrutiny. The framework for book survivability has generally relied on two components: the careful stewardship of rare books in segregated collections, and the overlap, or "preservation through proliferation," across the components of general circulating collections. [8] To be sure, if the 9.5 million works held uniquely are the circulating materials found in general collections, then something is amiss in the library system. If, on the other hand, these are largely rare books that are treated as such, then the situation appears far different.

We therefore examined a sample of 100 uniquely held works. We estimate that about 50% of these are in languages other than English which, as we will see momentarily, is not dramatically different from the overall share of books printed in languages other than English. Many of the English-language materials appear to be locally produced ephemera rather than traditional published books. Nevertheless, there are some items that would be recognized as traditionally published 20th-century books that appear to be held uniquely within the system. The conclusions to be drawn from this sampling will vary for different institutions, but we are comfortable that they represent a fair view of the system's holdings.

Our sampling effort was designed to provide some context, but we believe that significant additional research is needed to understand uniquely held works. If, in fact, there are books that are not widely held but not treated as rare, a more systematic search for uniquely held, endangered books might be in order. The results of such a search could let us know how urgent it is to develop paper repositories to ensure that print works are not lost to the system. Such repositories also enable libraries to take more aggressive local approaches to collections management. [9]

At the other extreme, only 301,000 works are held 500 or more times -- a relatively small share of the works in the system-wide collection. Because the system includes many public and school libraries along with academic and research libraries, it is all the more striking that there are so few high-overlap works.

While our analysis is nothing more than a back-of-the-envelope assessment of collection overlap within the system, even a simple analysis such as this raises many questions that merit further research and policy debates.
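The holdings tallies discussed in this section could be produced along the following lines. This is a minimal sketch that assumes the input is simply one holdings count per work; the bucket boundaries mirror the thresholds mentioned in the text (held uniquely, held twice, 50 or more, 500 or more), and the function names are illustrative.

```python
from collections import Counter


def holdings_buckets(holdings_per_work):
    """Tally works by how many holdings each has across the system."""
    buckets = Counter()
    for n in holdings_per_work:
        if n < 1:
            raise ValueError("every cataloged work should have at least one holding")
        if n == 1:
            buckets["held uniquely"] += 1
        elif n == 2:
            buckets["held twice"] += 1
        elif n < 50:
            buckets["3-49 holdings"] += 1
        elif n < 500:
            buckets["50-499 holdings"] += 1
        else:
            buckets["500+ holdings"] += 1
    return buckets


def count_held_at_least(holdings_per_work, threshold):
    """Number of works with at least `threshold` holdings."""
    return sum(1 for n in holdings_per_work if n >= threshold)
```

Note that "50 or more holdings" in the text includes the works held 500 or more times, which is why the second helper counts against a threshold rather than reading a single bucket.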

Date of Publication

The rate of publication of print books has grown steadily over time. Figure 4 illustrates the distribution of books (manifestations) in the system-wide collection by year of publication. It is interesting to note the ebb and flow of book publication accompanying several important historical events, including a dramatic peak at the turn of the 20th century; troughs during the two World Wars and the Great Depression; and perhaps most importantly of all, the dramatic increase in publishing associated with the expansion of higher education and scientific research accompanying the start of the Cold War.

Figure 4: Print Manifestations by Year of Publication, 1800-2000

Cumulatively, the post-war increase in book publication is the dominating characteristic, as Figure 4 illustrates. Approximately half of all books held in the system-wide collection were published after 1977. The share of these books published prior to 1923 -- a rough cut-off point for in-copyright vs. out-of-copyright materials, according to U.S. copyright law -- is only 18%. Although the true share of out-of-copyright print books is undoubtedly higher than this due to non-renewal of copyright for books published prior to the 1976 copyright law changes, the key point to be drawn from this figure is that a date-based approach to copyright permissions is not likely to yield a high proportion of books for mass digitization.
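The two summary figures in this paragraph, the share of books published before a cut-off year and the year that splits the collection in half, can be computed from a table of counts by publication year. The sketch below assumes such a table is available as a dictionary mapping year to number of manifestations; the function names are illustrative.

```python
def share_published_before(counts_by_year, cutoff_year):
    """Fraction of books whose publication year is earlier than cutoff_year."""
    total = sum(counts_by_year.values())
    if total == 0:
        raise ValueError("no books to summarize")
    earlier = sum(c for year, c in counts_by_year.items() if year < cutoff_year)
    return earlier / total


def median_publication_year(counts_by_year):
    """Earliest year by which at least half of all books had been published."""
    total = sum(counts_by_year.values())
    if total == 0:
        raise ValueError("no books to summarize")
    running = 0
    for year in sorted(counts_by_year):
        running += counts_by_year[year]
        if running * 2 >= total:
            return year
```

Run against the system-wide collection, `share_published_before(counts, 1923)` would correspond to the 18% figure above, and the median year to the observation that about half of all books were published after 1977.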


Language

There are approximately 450 languages represented in the system-wide print book collection. The distribution of these languages across the books in that collection is, of course, highly skewed. As Figure 5 illustrates, a little more than half of the print books in the collection were published in English.

Figure 5: English Language

The incidence of all other languages in the system-wide collection is shown in Figure 6. German, French, and Spanish are the most common languages after English; Chinese- and Japanese-language books make a surprisingly strong showing, probably reflecting at least in part the strength of area studies programs at some of the major research libraries. At the same time, we note the absence of any of the sub-continental languages among the top 25; Hindi, Urdu, Bengali, and Tamil all fall within the top 40, however, and in combination would tie with Korean. This relative absence of sub-continental languages may be explained in part by the significant amount of English-language publishing in the region. And, in reference to the recent French concerns that US-based library digitization projects will omit French-language materials and thereby threaten the cultural balance of power, [10] we note that the system-wide collection -- which, as noted above, is heavily oriented toward North American libraries -- contains more French-language books than books in any other language except English and German.

Figure 6: All Other Languages

Collections as a Share of Book Production

Embedded in these preliminary steps to characterize the system-wide collection are critical archiving questions. Fundamentally, we might ask, what does the system-wide collection not contain? What portion of our printed cultural heritage is unavailable? What portion has been lost? One way to approach these sensitive and complicated questions is to examine the share of total book production over time that is now a part of the system-wide collection.

This approach has several shortcomings. Most importantly, our data source incorporates many, but not all, of the collections held by libraries across the globe, which means it is likely that more book titles have survived and are available somewhere than our system-wide collection currently reflects. In addition, the data on total book production over time are estimates at best. Finally, the unavailability of a given book title need not imply that it should have remained available -- the values that inform preservation choices, and the judgment calls needed to implement them, cannot be revealed through statistical methodology. Given these shortcomings, the estimates that we will present in this section raise questions that we believe merit further examination.

In estimating both total book availability and book availability by year, we follow in the tradition of earlier researchers who were interested in both book production and its availability. Iwinski estimated total historical book production as of 1911 at 10,378,365 books. [11] His methodology in arriving at this estimate would probably have led him to undercount historical book production. By comparison, our figures show 4,568,987 print books with a publication date of 1911 or earlier. This could suggest that as many as 5.8 million book titles (all of which would be out of copyright today) may be absent from the system-wide collection.

In 1940, Merritt updated Iwinski's estimate, calculating that the historical total had grown to 15,277,276 published books, implying annual production since 1911 of about 156,000. [12] By comparison, our figures show 7,290,290 print books with a publication date of 1940 or earlier. That means the system-wide collection contains an average of 93,838 print books per year with a publication date from 1912 through 1940. The book deficit by 1940 may have grown as high as 8 million; for the period 1912 to 1940 perhaps as many as 60,000 books per year did not, for one reason or another, enter the system-wide collection.

As Figure 7 shows, although the number of books unavailable in the system increased with the publishing output over this 29-year period, the share of titles presently available actually grew somewhat.

Figure 7: System-wide Book Gap, 1911 and 1940

The "book gap" illustrated in Figure 7 is an extremely rough estimate; further work is needed to estimate this gap with more accuracy.
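The arithmetic behind the book gap can be reproduced directly from the figures quoted in this section. The sketch below uses only the production estimates attributed to Iwinski and Merritt and the system-wide counts reported above; the variable and function names are illustrative.

```python
# (estimated cumulative production, print books in the system-wide collection)
ESTIMATES = {
    1911: (10_378_365, 4_568_987),
    1940: (15_277_276, 7_290_290),
}


def book_gap(produced, held):
    """Return (titles absent from the system, share of production that is held)."""
    return produced - held, held / produced


def average_annual_held(estimates, start_year, end_year):
    """Average number of held books per publication year after start_year through end_year."""
    held_in_period = estimates[end_year][1] - estimates[start_year][1]
    return held_in_period / (end_year - start_year)


for year, (produced, held) in ESTIMATES.items():
    gap, share = book_gap(produced, held)
    print(f"{year}: gap of {gap:,} titles; {share:.1%} of estimated production held")

print(f"1912-1940 average held per year: {average_annual_held(ESTIMATES, 1911, 1940):,.0f}")
```

The gaps come to roughly 5.8 million titles for 1911 and roughly 8 million for 1940, while the held share rises from about 44% to about 48%, which is the pattern summarized in Figure 7.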

Although it would be highly desirable to perform similar calculations bringing us forward to the present, we have been unable to identify sufficiently reliable estimates of world book production for the latter half of the 20th century, and performing such an estimate was out of the scope of the present project. We can, however, use estimates of annual book production in recent years to see the share of present publications that are being collected. In recent years, UNESCO has attempted to calculate worldwide annual book production, with estimates in the range of 1 million book titles published per year. [13] In the system-wide collection, there are 689,496 books that were published in 2000; if book production in that year was of the magnitude estimated by UNESCO (roughly 1 million books) then the system is currently collecting about two-thirds of total book production.

Some Implications and Future Research Opportunities

Taking a system-wide view of library collections offers the opportunity to estimate the current size of our printed book heritage and to begin exploring some of its characteristics. It also suggests several important directions for further research and areas of caution for policy makers.

The public domain, relative to more recent publishing activity, is much smaller than many observers anticipated. There are important public policy implications to this finding, not least of which relate to digitization and "orphan works." It would be useful to understand better the characteristics of public-domain titles relative to those that remain under copyright: Are they more or less likely to be widely held? Does their country or language of publication differ substantially from in-copyright materials? What share of in-copyright books is out-of-print or "orphaned"?

We need to learn more about the characteristics of the rare and unique titles. Is the rareness that we identified a specter, brought about by cataloging shortcomings, or is rareness truly this pervasive within library collections? Can the rare materials be characterized in terms of subject matter, book type, and year, language, and location of publication? Are libraries holding these materials aware of their rareness? Are these books being adequately cared for?

Such analyses would shed light on some of the preservation issues raised in this paper, as well as provide a strong basis for policymaking. It would help us evaluate frameworks for dealing with print in an environment of large-scale digitization. As the first paper repositories are being developed, the library community should identify the optimal number of copies of a non-circulating book that should be preserved to guarantee its survival. This is fundamentally a risk-analysis question, and one that researchers can answer using tools such as actuarial analysis.
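As a rough illustration of the kind of risk analysis meant here, the sketch below estimates how many copies would be needed for at least one to survive over a given horizon. It rests on strong simplifying assumptions that are ours, not findings of this paper: each copy is lost independently, with the same constant annual probability of loss. The loss rate, horizon, and target are hypothetical inputs that a real actuarial analysis would need to estimate.

```python
def survival_probability(copies, annual_loss_rate, years):
    """Probability that at least one of `copies` survives `years` years,
    assuming independent copies with a constant annual loss rate."""
    if copies < 1 or years < 0 or not 0 <= annual_loss_rate <= 1:
        raise ValueError("invalid inputs")
    per_copy_survival = (1 - annual_loss_rate) ** years
    return 1 - (1 - per_copy_survival) ** copies


def copies_needed(target, annual_loss_rate, years, max_copies=10_000):
    """Smallest number of copies whose survival probability meets `target`."""
    for n in range(1, max_copies + 1):
        if survival_probability(n, annual_loss_rate, years) >= target:
            return n
    raise ValueError("target not reachable within max_copies")
```

Because real losses are often correlated (a single fire or flood can destroy several copies held together), a model like this would tend to understate the number of copies required; it is meant only to show the shape of the calculation.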

With roughly 32 million books in the system, mass digitization could create a collection that is significantly larger than the collections of our largest research libraries. Yet the fact that these titles are widely dispersed across the system presents significant organizational and information-sharing challenges to any mass digitization effort.

We also need to better analyze the holdings distribution across the system, especially in regard to rare or unique titles. Are the rare and unique titles concentrated in a set of major research libraries or distributed more broadly?

Some of our conclusions would be strengthened by comparative analyses on other union catalogs, especially overseas catalogs. This would allow a more detailed mapping of the system-wide print book collection, and raise new issues for analysis, especially about the "missing pieces." For new publications, what languages and regions are not collected commensurately with title output? Can we characterize the new publications that are not being collected? How do they differ from those that are typically accessioned? With more complete bibliographic data, is the "book gap" as large as it appeared to us?

Finally, it may be desirable to extend some of these techniques and recommendations to other formats beyond books. The system-wide serials collection, in particular, would merit study, although we believe the research task there to be much more complicated.

As the digital transformation reshapes the nature of print collections, these and many other issues will require the attention of librarians and other decision makers. As we learn how the system-wide collection contextualizes local collections, we might be able to develop new strategies for print-collection management that reflect system-wide, rather than purely local, considerations. The observations and findings discussed in this paper are only a first step in this direction, but we hope they may help set the direction for discussions about the future of print books in the digital age.

Brian Lavoie is a Consulting Research Scientist in the Office of Research at OCLC Online Computer Library Center, Inc. Since joining OCLC in 1996, he has worked on a variety of projects, such as revising and expanding the Cutter tables, developing metadata for digital preservation, and analyzing the size and scope of the Web. Brian's research interests include data-mining, digital preservation, and the economics of information.

Roger C. Schonfeld is Manager of Research for Ithaka, a not-for-profit organization closely affiliated with JSTOR, ARTstor, Portico, Aluka, and NITLE, that is helping academia transition to an increasingly electronic environment. Roger's current research projects include a series of user, usage, and citation studies; surveys of faculty and librarians; and an examination of the history of book survivability over time. At Ithaka, Roger has written The Nonsubscription Side of Periodicals (Council on Library and Information Resources, 2004), and he is also the author of JSTOR: A History (Princeton University Press, 2003), which documented the development and growth of JSTOR as a self-sustaining archive of digitized journal literature. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.


