---
tags: openVirus
---

# 1. Searching eTheses for the openVirus project

## Introduction

The COVID-19 outbreak is an unprecedented global crisis that has prompted an unprecedented global response. I've been particularly interested in how academic scholars and publishers have responded:

- Academic publishers [making COVID-19 and coronavirus-related publications openly available](https://wellcome.ac.uk/press-release/publishers-make-coronavirus-covid-19-content-freely-available-and-reusable)
- Academics and publishers assembling datasets of publications for analysis, like the [COVID-19 Open Research Dataset](https://www.semanticscholar.org/cord19) and the [WHO COVID-19 Database](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov)
- Various projects aiming to make these datasets of publications searchable, like [Semantic Scholar](https://www.semanticscholar.org/search?q=covid-19&sort=pub-date), [WHO](https://search.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/), [COVID Scholar](https://covidscholar.org/), [COVIDSeer](https://covidseer.ist.psu.edu/), [SketchEngine](https://www.sketchengine.eu/covid19/), [Vespa.ai](https://cord19.vespa.ai/) and [fatcat COVID-19](https://covid19.fatcat.wiki/) from the Internet Archive.

It's impressive how much has been done in such a short time! But I also saw one comment that really stuck with me:

> "Our digital libraries and archives may hold crucial clues and content about how to help with the #covid19 outbreak: particularly this is the case with scientific literature. Now is the time for institutional bravery around access!"
> -- [@melissaterras](https://twitter.com/melissaterras/status/1245645959876378625)

Clearly, academic scholars and publishers are already collaborating. What could digital libraries and archives do to help?

## Scale, Audience & Scope

Almost all the efforts I've seen so far are focused on helping _scientists working on the COVID-19 response_ to find information from publications that are _directly related to coronavirus epidemics_. The outbreak is much bigger than this.

In terms of scope, it's not just about understanding the coronavirus itself. The outbreak raises many broader questions, like:

- What types of personal protective equipment are appropriate for different medical procedures?
- How effective are the different kinds of masks when it comes to protecting others?
- What coping strategies have proven useful for people in isolation?

(These are just the examples I've personally seen requests for. There will be more.)

Similarly, the audience is much wider than the scientists working directly on the COVID-19 response: from medical professionals wanting to know more about protective equipment, to journalists looking for context and counter-arguments.

As a technologist working at the British Library, I felt there must be some way I could help: some way to help a wider audience dig out any potentially relevant material we might hold.

## The openVirus Project

While looking out for inspiration, I found [Peter Murray-Rust's](https://en.wikipedia.org/wiki/Peter_Murray-Rust) [openVirus project](https://github.com/petermr/openVirus). Peter is a vocal supporter of open source and open data, and had launched an ambitious attempt to aggregate information relating to viruses and epidemics from scholarly publications.
In contrast to the other efforts I'd seen, Peter wanted to focus on novel data-mining methods, and on pulling in less well-known sources of information. This dual focus on text analysis and on opening up under-utilised resources appealed to me. And I already had a _particular_ resource in mind...

## EThOS

Of course, the British Library has a very wide range of holdings, but as an ex-academic scientist I've always had a soft spot for [EThOS](https://ethos.bl.uk/), which provides electronic access to UK theses.

Through the web interface, users can search the metadata and abstracts of over half a million theses. Furthermore, to support data mining and analysis, [the EThOS metadata has been published as a dataset](https://bl.iro.bl.uk/work/bb0b3ec4-4667-436a-8e6a-d2e8e5383726). This dataset includes links to institutional repository pages for many of the theses.

Although Ph.D. theses are not generally considered to be as important as journal articles, they are a rich and under-utilised source of information, capable of carrying much more context and commentary than a brief article[^3].

## The Idea

Having identified EThOS as a source of information, the idea was to see if I could use [our existing UK Web Archive tools](https://github.com/ukwa/) to collect and index the _full-text_ of these theses, build a simple faceted search interface, and perform some basic data-mining operations. If that worked, it would allow relevant theses to be discovered and passed to the [openVirus tools](https://github.com/petermr/ami3) for more sophisticated analysis.

## Preparing the data sources

The links in the [EThOS dataset](https://bl.iro.bl.uk/work/bb0b3ec4-4667-436a-8e6a-d2e8e5383726) point to the HTML landing-page for each thesis, rather than to the full text itself. To get to the text, the best approach would be to write a crawler to find the PDFs. However, it would take a while to create something that could cope with the variety of ways the landing pages tend to be formatted. For machines, it's not always easy to find the link to the actual thesis!

However, many of the universities involved have given the EThOS team permission to download a copy of their theses for safe-keeping. The URLs of the full-text files are only used once (to collect each thesis shortly after publication), but have nevertheless been kept in the EThOS system since then. These URLs are considered transient (i.e. likely to 'rot' over time) and come with no guarantees of longer-term availability (unlike the landing pages), so they are not included in the main EThOS dataset. Nevertheless, the EThOS team were able to give me the list of PDF URLs, making it easier to get started quickly.

This is far from ideal: we will miss theses that have been moved to new URLs, and theses from universities that do not take part (which, notably, includes Oxford and Cambridge). This skew could be avoided by using the landing-page URLs provided for _all_ UK digital theses to crawl for the PDFs, but we need to move quickly. So, while keeping these caveats in mind, the first task was to crawl the URLs and see if the PDFs were still there...

## Collecting the PDFs

A simple [Scrapy](http://scrapy.org/) crawler was created, one that could read the PDF URLs and download them without overloading the host repositories.
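To give a flavour of what such a crawler involves, here's a minimal sketch of that kind of spider (not the production crawler, which is linked below). It assumes a hypothetical `pdf-urls.txt` input file with one PDF URL per line, and leans on Scrapy's built-in settings to avoid hammering any single repository:

```python
# A minimal sketch, not the real crawler: read PDF URLs from a hypothetical
# 'pdf-urls.txt' file (one URL per line) and record basic facts about each
# response, while being polite to the host repositories.
import pathlib

import scrapy


class EthosPdfSpider(scrapy.Spider):
    name = "ethos_pdfs"
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per repository
        "DOWNLOAD_DELAY": 5.0,                # pause between requests to the same host
        "ROBOTSTXT_OBEY": True,
    }

    def start_requests(self):
        for url in pathlib.Path("pdf-urls.txt").read_text().splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # The spider only yields a summary record per URL; the full requests
        # and responses are captured elsewhere (see below).
        yield {
            "url": response.url,
            "status": response.status,
            "content_type": response.headers.get("Content-Type", b"").decode(),
            "size": len(response.body),
        }
```

A spider like this could be run with something like `scrapy runspider ethos_pdfs.py -o results.jsonl`, with Scrapy configured to send its HTTP traffic through a proxy.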
[The crawler itself](https://github.com/ukwa/golem/blob/master/golem/golem/spiders/ethos.py) does nothing with the downloaded files, but by running it behind [warcprox](https://github.com/internetarchive/warcprox) the web requests and responses (including the PDFs) can be captured in the [standardised Web ARChive (WARC) format](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/).

Over 35 hours, the crawler attempted to download the 130,330 PDF URLs. Quite a lot of URLs had already changed, but 111,793 documents were successfully downloaded. Of these, 104,746 were PDFs. All the requests and responses generated by the crawler were captured in 1,433 WARCs, each around 1GB in size, totalling around 1.5TB of data.

## Processing the WARCs

We [already have tools for handling WARCs](https://github.com/ukwa/webarchive-discovery), so the task was to re-use them and see what we get. As this collection is mostly PDFs, [Apache Tika](https://tika.apache.org/) and [PDFBox](https://pdfbox.apache.org/) do most of the work, but the [webarchive-discovery](https://github.com/ukwa/webarchive-discovery) wrapper helps run them at scale and add in additional metadata. The WARCs were transferred to our internal Hadoop cluster, and in just over an hour the text and associated metadata were available as about 5GB of _compressed_ [JSON Lines](http://jsonlines.org/).

## A Legal Aside

Before proceeding, there's a legal problem that we need to address. Despite being freely available over the open web, the rights and licences under which these documents are being made available can be extremely varied and complex.

There's no problem gathering the content and using it for data mining. The problem is that there are limitations on what we can _redistribute without permission_: we can't redistribute the original PDFs, or any close approximation. However, collections of facts about the PDFs are fine.

But for the other _openVirus_ tools to do their work, we need to be able to find out what each thesis is about. So how can we make this work?

One answer is to generate statistical summaries of the contents of the documents. For example, we can break the text of each document up into individual words, and count how often each word occurs. These word frequencies are no substitute for the real text, but they are redistributable and suitable for answering simple queries.

These simple queries can be used to narrow down the overall dataset, picking out a relevant subset. Once the list of documents of interest is down to a manageable size, an individual researcher can download the original documents themselves, from the original hosts[^2]. As the researcher now has local copies, they can run their own tools over them, including the openVirus tools.

## Working with Word Frequencies

A [second, simpler Hadoop job](https://github.com/ukwa/ukwa-hadoop-tasks/blob/master/ethos/ethos_wf.py) was created, post-processing the raw text and replacing it with the word frequency data (a minimal sketch of this step is shown below). This produced 6GB of _uncompressed_ JSON Lines data, which could then be loaded into an instance of the [Apache Solr](https://lucene.apache.org/solr/) search tool[^1].

While Solr provides a user interface, it's not really suitable for general users, nor is it entirely safe to expose to the World Wide Web. To mitigate this, the index was built on a virtual server well away from any production systems, and wrapped with a web server configured in a way that should prevent problems.
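As promised above, here's a minimal, standalone sketch of the word-frequency step (the real version runs as a Hadoop job over the full JSON Lines output). The `content` and `word_frequencies` field names are illustrative assumptions rather than the actual schema:

```python
# A minimal sketch of the word-frequency post-processing step (the real
# version runs as a Hadoop job). Field names like 'content' and
# 'word_frequencies' are illustrative assumptions, not the actual schema.
import collections
import json
import re
import sys

WORD_RE = re.compile(r"[a-z0-9']+")


def to_word_frequencies(record):
    """Replace the extracted text with counts of each (lower-cased) word."""
    text = record.pop("content", "") or ""
    counts = collections.Counter(WORD_RE.findall(text.lower()))
    record["word_frequencies"] = dict(counts)
    return record


if __name__ == "__main__":
    # Read JSON Lines on stdin, write transformed JSON Lines on stdout, e.g.
    #   zcat extracted-text.jsonl.gz | python word_freq.py > word-freq.jsonl
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(to_word_frequencies(json.loads(line))))
```

Because only the counts are kept, the original prose can't be reconstructed from these records, but they are enough to support the simple term queries described above.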
The API provided by this wrapped Solr service (see [the Solr documentation](https://lucene.apache.org/solr/guide/8_4/overview-of-searching-in-solr.html) for details) enables us to find which theses include which terms. Here are some example queries:

- [_coronavirus_](https://services.anjackson.net/solr/discovery/select?q=coronavirus&wt=json&df=text)
- [_coronavirus_ OR _coronaviruses_](https://services.anjackson.net/solr/discovery/select?q=coronavirus%20OR%20coronaviruses&wt=json&df=text)
- [(SARS OR MERS) AND virus](https://services.anjackson.net/solr/discovery/select?q=(SARS%20OR%20MERS)%20AND%20virus&wt=json&df=text)
- [face AND mask AND virus*](https://services.anjackson.net/solr/discovery/select?q=face%20AND%20mask%20AND%20virus*&wt=json&df=text)

This is fine for programmatic access, but with a little extra wrapping we can make it more useful to more people.

### APIs & Notebooks

For example, I was able to create _live_ API documentation and a simple user interface using Google's Colaboratory:

<center>

[Using the openVirus EThOS API](https://colab.research.google.com/drive/1STOcumLxab3-g_--YEzrJq5BW0EtEbdK)

</center>

Google Colaboratory is a proprietary platform, but those notebooks can be exported as more standard [Jupyter Notebooks](https://jupyter.org). See [here](https://github.com/anjackson/contentminer/blob/master/openVirus_EThOS_API.ipynb) for an example.

### Faceted Search

Having carefully exposed the API to the open web, I was also able to take [an existing browser-based faceted search interface](https://github.com/evolvingweb/ajax-solr) and modify it to suit our use case:

<center>

[EThOS Faceted Search Prototype](https://ajax-solr-ethos.glitch.me/examples/reuters/)

</center>

Best of all, this is running on the [Glitch](https://glitch.com/) collaborative coding platform, so you can go and look at the source code and _remix_ it yourself, if you like:

<center>

[EThOS Faceted Search Prototype -- Glitch project](https://glitch.com/~ajax-solr-ethos)

</center>

## Limitations

The main limitation of using word frequencies instead of full text is that phrase search is broken. Searching for _face AND mask_ will work as expected, but searching for _"face mask"_ won't.

Another problem is that the EThOS metadata has not been integrated with the raw-text search. Integrating the two would give us a much richer experience, with accurate publication years and more helpful facets.[^4]

In terms of user interface, the faceted search UI above is very basic, but for the openVirus project the API is likely to be of more use in the short term.

## Next Steps

To make the search more usable, the next logical step is to attempt to integrate the full-text search with the EThOS metadata. Then, if the results look good, we can start to work out how to feed the results into the workflow of the openVirus tool suite.

[^1]: We use Apache Solr a lot, so this was the simplest choice for us.
[^2]: This is similar to the data-sharing pattern used by Twitter researchers. See, for example, the [DocNow Catalogue](https://catalog.docnow.io/).
[^3]: Even things like _negative_ results, which are informative but can be difficult to publish in article form.
[^4]: Note that since writing this post, this limitation has been rectified.