Bookworm is a tool for visualizing and analyzing usage trends in collections of texts. The HathiTrust Digital Library is one such collection: it consists of materials from the digitized holdings of some of the most important research libraries in the world (which form part of the HathiTrust consortium), and currently contains digitized versions of approximately twelve million physical volumes.
Some of the participants in the project were involved in the work that led to the Google Books Ngram Viewer. The HathiTrust + Bookworm (HT+BW) project applies a similar idea to the vast collection of digitized books in the HathiTrust Digital Library (HTDL). The library’s metadata will be used to support sophisticated, fine-grained trend analysis over the collection.
What distinguishes university library research collections is that they are rich in metadata, because the volumes in these collections were carefully catalogued. This has created a new opportunity, which the HT+Bookworm project is leveraging: the ability to show trends in custom-built subsets of the collection, created on the fly. The importance of this for research is intuitive. A researcher may want, for example, to see the trend for the word “bairn” (a Scottish term for “child”) only among books in English published in Scotland during the nineteenth century. From the metadata for the books, which usually includes place of publication, one could easily create, on the fly, the subset of books that meet these criteria, and plot the trends for these books and these books alone. In other words, the HathiTrust “data” on which Bookworm will operate (that is, the digitized text) can be “sliced and diced,” so to speak, based on metadata criteria, and the trends within the selected slices can be plotted or otherwise visualized.
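The slicing idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the record layout, the field names (`year`, `language`, `place`, `tokens`), the sample volumes, and the function names are all hypothetical, chosen only to show how a metadata filter selects a subset and how a per-year relative frequency is then computed within that subset alone.

```python
from collections import defaultdict

# Hypothetical toy records; a real collection would hold catalogued
# metadata and full text for millions of volumes.
SAMPLE_VOLUMES = [
    {"year": 1820, "language": "eng", "place": "Scotland",
     "tokens": ["the", "bairn", "slept"]},
    {"year": 1820, "language": "eng", "place": "Scotland",
     "tokens": ["a", "wee", "bairn", "and", "another", "bairn"]},
    {"year": 1850, "language": "eng", "place": "Scotland",
     "tokens": ["the", "child", "slept", "soundly"]},
    {"year": 1850, "language": "eng", "place": "England",
     "tokens": ["the", "bairn"]},
    {"year": 1910, "language": "eng", "place": "Scotland",
     "tokens": ["bairn"]},
]


def matches(volume, criteria):
    """Return True if the volume's metadata equals every criterion."""
    return all(volume.get(field) == value for field, value in criteria.items())


def word_trend(volumes, word, criteria, first_year, last_year):
    """Relative frequency of `word` per year, within the selected subset only."""
    hits = defaultdict(int)    # occurrences of the word, per year
    totals = defaultdict(int)  # all tokens in the subset, per year
    for volume in volumes:
        year = volume["year"]
        if not (first_year <= year <= last_year):
            continue
        if not matches(volume, criteria):
            continue
        totals[year] += len(volume["tokens"])
        hits[year] += volume["tokens"].count(word)
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


if __name__ == "__main__":
    subset = {"language": "eng", "place": "Scotland"}
    print(word_trend(SAMPLE_VOLUMES, "bairn", subset, 1801, 1900))
```

Only the volumes that satisfy every metadata criterion and fall inside the year range contribute to either the numerator or the denominator, which is what makes the resulting trend specific to the chosen slice.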
Right now, the project is focusing on the back end of the visualization service. The HathiTrust Digital Library’s text data is indexed by means of Solr, while Bookworm is currently set up to work with MySQL, a relational database management system. The work now under way focuses on making Bookworm work with a Solr index, so that it can work with data from the HTDL. Once this functionality is in place, we expect the solution to be highly portable, because most large libraries, especially academic ones, are already set up to work with Solr (or very similar) indices. The same kind of integration as between Bookworm and the HathiTrust Digital Library should therefore become achievable, with relatively little effort, between Bookworm and most other digital libraries’ collections in the future.
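To make the Solr side more concrete, the sketch below builds a request URL for a metadata-restricted query. It relies only on standard Solr request parameters (`q`, `fq`, `rows`, `facet`, `facet.field`, `wt`) and does not contact any server. The base URL, the core name `volumes`, and the field names `text`, `language`, `publish_place`, and `publish_year` are assumptions made for illustration. The source does not describe the HTDL schema or how Bookworm will actually talk to Solr.

```python
from urllib.parse import urlencode


def build_trend_query(base_url, word, filter_queries,
                      facet_field="publish_year", text_field="text"):
    """Build a Solr select URL restricted by metadata filters.

    All field names here are hypothetical placeholders, not the HTDL schema.
    """
    params = [
        ("q", f'{text_field}:"{word}"'),  # volumes containing the word
        ("rows", "0"),                    # only the counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),     # group matching volumes by year
        ("wt", "json"),
    ]
    # Each metadata criterion becomes its own filter query (fq).
    params.extend(("fq", fq) for fq in filter_queries)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical example: English-language books published in Scotland
# during the nineteenth century that contain the word "bairn".
EXAMPLE_URL = build_trend_query(
    "http://localhost:8983/solr/volumes",
    "bairn",
    ['language:"eng"', 'publish_place:"Scotland"',
     "publish_year:[1801 TO 1900]"],
)

if __name__ == "__main__":
    print(EXAMPLE_URL)
```

Note that field faceting of this kind returns the number of matching volumes per year, not the number of occurrences of the word. A word-frequency trend would need additional per-year token counts, which this sketch does not attempt to compute.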
Much of the collection of a digital library like the HTDL is under stringent copyright protection and is, for that reason, unavailable for bulk download. Applying analytics to the text content in order to discover, discern and display trends can, however, work on copyrighted text without violating copyright law, because useful information is extracted only at an aggregate level, without any actual consumption of the text by a reader. As such, this approach is a form of non-consumptive reading.
Last but not least, integrating Bookworm with the HathiTrust Digital Library’s collection provides an advantage that has to do with scale. Already at twelve million volumes, the HTDL’s collections keep growing as more libraries join the HathiTrust consortium as partners, and as each library digitizes more of its collection and contributes it to the HathiTrust Digital Library. In other words, the HTDL corpus is a growing corpus, and it will keep growing for the foreseeable future. This creates a strong incentive to build scalable solutions.