> [name=Philipp Zumstein] The following is an automatic translation with www.DeepL.com/Translator with a few manual fixes from the German journal article:
Klein, A. (2017). Von der Schneeflocke zur Lawine: Möglichkeiten der Nutzung freier Zitationsdaten in Bibliotheken. o-bib. Das offene Bibliotheksjournal / herausgegeben vom VDB, 4(4), 127-136. https://doi.org/10.5282/o-bib/2017H4S127-136 ![CC-BY-4-0](http://i.creativecommons.org/l/by/4.0/88x31.png)
# From snowflake to avalanche: Possibilities of using free citation data in libraries
*Annette Klein, Mannheim University Library*
**Summary:** Citations play an important role in scientific discourse, in the practice of information retrieval, and in bibliometrics. Recently, there have been a growing number of initiatives which make citations freely available as open data. The article describes the current status of these initiatives and shows that a critical mass of data could be made available in the near future. New opportunities could arise from that, especially for libraries. The DFG funded project Linked Open Citation Database (LOC-DB) is presented as a practical way for libraries to participate.
## 1. Introduction
Citation is an essential element of scientific discourse. Listing the cited literature in a bibliography is a requirement of "good scientific practice" because it makes it possible to trace which external sources the author of the citing work has drawn on. In this way, content-related relations between scientific publications are made transparent. "Citations are the links that knit together our scientific and cultural knowledge", summarises the Initiative for Open Citations (I4OC).[^1]
[^1]: "About," I4OC, last checked 11/28/2017, https://i4oc.org/#about.
Citations are also important for libraries. In many introductions to literature research, librarians recommend the "snowball method": starting from a single suitable source on a specific topic, a large number of further relevant sources can be identified quickly by evaluating the literature it cites. Anyone who has ever practised this procedure with printed literature knows how tedious it can be. It is therefore an obvious step to integrate the functionality of "tracking" citation relations directly into research systems such as the online catalogues of academic libraries. If citations are comprehensively integrated into a research system, then, in contrast to the analogue procedure, not only the cited (and thus earlier) literature can be linked, but also the later publications which in turn cite the source. Such clickable citation networks offer users real added value in research.
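At its core, such a clickable citation network is a graph that must be traversable in both directions: from a work to the literature it cites, and from a work to the later literature citing it. The following is a minimal sketch of such a bidirectional index, using invented identifiers for illustration only:

```python
from collections import defaultdict

class CitationGraph:
    """Bidirectional citation index: for each work, the works it cites
    (earlier literature) and the works that cite it (later literature)."""

    def __init__(self):
        self._cites = defaultdict(set)     # work -> works it cites
        self._cited_by = defaultdict(set)  # work -> works citing it

    def add_citation(self, citing, cited):
        self._cites[citing].add(cited)
        self._cited_by[cited].add(citing)

    def references(self, work):
        """Earlier literature cited by `work`."""
        return sorted(self._cites[work])

    def citations(self, work):
        """Later literature that cites `work`."""
        return sorted(self._cited_by[work])

# Hypothetical DOIs, for illustration only.
g = CitationGraph()
g.add_citation("doi:10.1000/b", "doi:10.1000/a")
g.add_citation("doi:10.1000/c", "doi:10.1000/a")
print(g.references("doi:10.1000/b"))  # ['doi:10.1000/a']
print(g.citations("doi:10.1000/a"))   # ['doi:10.1000/b', 'doi:10.1000/c']
```

The point of storing both directions is exactly the added value described above: the analogue snowball method only follows citations backwards, whereas the index also answers the forward question "who cites this work?".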
Furthermore, citation data are fundamental for most quantitative publication analyses in the field of bibliometrics. For some time now, libraries have increasingly been engaging with such bibliometric analyses and offering services for research evaluation.[^2] As a rule, high-priced commercial citation databases such as Scopus and the Web of Science serve as the data basis. In this context, it is repeatedly criticised that these analyses are only of limited reliability due to the limited completeness and quality of the data,[^3] although they nevertheless play an existential role in the careers of young researchers in some disciplines.[^4] To compensate for the known shortcomings of the commercial citation databases, many institutions invest considerable additional resources in cleaning and improving the data, on top of the licensing costs.[^5]
[^2]: For example, the Technical University of Munich (see "Bibliometrie," TUM, last checked on 28 November 2017, https://www.ub.tum.de/bibliometrie) and the University Library of Vienna (see "Bibliometrics at the University of Vienna," University of Vienna, last checked on 30 August 2017, https://bibliothek.univie.ac.at/bibliometrie/).
[^3]: See Benjamin Walker et al., "Inter-rater Reliability of H-Index Scores Calculated by Web of Science and Scopus for Clinical Epidemiology Scientists," Health Information and Libraries Journal 33, no. 2 (2016): 140-149, http://doi.org/10.1111/hir.12140.
[^4]: See Maximilian Fochler, Ulrike Felt and Ruth Müller, "Unsustainable Growth, Hyper-Competition, and Worth in Life Science Research: Narrowing Evaluative Repertoires in Doctoral and Postdoctoral Scientists' Work and Lives," Minerva 54, no. 2 (2016): 175-200, http://doi.org/10.1007/s11024-016-9292-y.
[^5]: For example, the University of Vienna, which provides a very professional service in this area, has a "Department of Bibliometrics and Publication Strategies" with six employees.
In all this, it should not be forgotten that citations are ultimately not research content, but 'only' metadata, or relationships between metadata. That such data should be free of charge and accessible without obstacles may well be regarded as a broad consensus among the many players in the publishing market. The following sections illustrate, on the basis of a number of initiatives, how a trend towards the mass production and release of citation data has developed in recent years, and what perspectives this could open up for libraries.
## 2. The Snowflake: CiteSeer/CiteSeerX
The CiteSeer service, currently operated under the name CiteSeerX[^6] by the College of Information Sciences and Technology at Pennsylvania State University, has indexed scientific articles in computer science since 1998. The database contains over 6 million documents and 120 million citations, available under the CC-BY-NC-SA license.
[^6]: CiteSeerX, http://citeseerx.ist.psu.edu/index.
The data are automatically extracted from PDF files found online by a web crawler. The software is available as open source[^7] and is used, among others, by RePEc (Research Papers in Economics). Through the early development of a scalable, automated procedure for indexing publications and citations, CiteSeer provided an important impetus for the collection of free citation data - it was, so to speak, the first snowflake, which together with many others has the potential to trigger an avalanche.
[^7]: CiteSeerX, https://github.com/SeerLabs.
However, the fully automated process, which works on unstructured data - the text of PDF files of various origins - has its limits. Even the basic metadata (author and title) of the processed articles are not always recognized correctly (see Fig. 1). Complex relationships between the various hierarchical levels of a publication and the citations associated with them can hardly be captured with the technology used. Figure 2 illustrates such a case: the automatically extracted data suggest that the author Wendy Hall published an article entitled "Linked Open Data" which contains no citations. In fact, however, "Linked Open Data" is the title of a special issue of the journal ERCIM News, in which the author wrote a one-page article entitled "Linked Data - the Quiet Revolution"[^8], which does contain a reference list with six entries. Neither these citations nor the actual article can be found in CiteSeerX. All of the information mentioned appears on the title page of the special issue - it was merely compiled incorrectly, or not related to the information relevant for indexing within the issue.
[^8]: Wendy Hall, "Linked Data: The Quiet Revolution," ERCIM News 96 (2014): 4, last checked on 07/17/2017, https://ercim-news.ercim.eu/en96/keynote.
Problems of this kind can only be solved with a new methodical approach. One possibility is to build on already existing structured metadata describing the citing works, and thus at least to exclude errors in their identification. This is possible if indexing is limited to data sources in which electronic full texts are available together with descriptive metadata, under licences that allow automated further processing.
Fig. 1: Left: title page of an evaluated paper; right: Metadata of this paper in CiteSeerX
Fig. 2: Mixture of different hierarchical levels in recognized metadata of CiteSeerX
## 3. The Snowball: OpenCitations
This approach has been chosen by OpenCitations[^9]. Since 2010, the biologist David Shotton (Oxford) has been working on building a repository of open citation data with a focus on the life sciences, since October 2015 together with Silvio Peroni (Bologna). The service is operated by the British non-profit organisation Infrastructure Services for Open Access,[^10] which also hosts the Directory of Open Access Journals (DOAJ).
[^9]: OpenCitations, http://opencitations.net/.
[^10]: Infrastructure Services for Open Access, https://is4oa.org/.
OpenCitations is currently indexing the Open Access articles in the PubMed Central database with fully automated procedures. The results are offered as Linked Open Data in a triple store with a SPARQL endpoint under the CC0 license (public domain). At present the corpus comprises about 8.4 million citations from 198,000 articles (as of 18.07.2017). For the future, a substantial extension of content and functionality is planned: "our ultimate objective [...] is to provide an open alternative to Web of Knowledge and Scopus, covering all the disciplines."[^11] To this end, new data sources such as arXiv and Crossref are to be included and the processing speed is to be increased significantly. New tools for visualization, search and browsing are also planned.
[^11]: David Shotton, private e-mail to the author from 15.05.2017.
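In the OpenCitations data model, a citation is expressed with the `cito:cites` property of the CiTO ontology, and the triple store can be queried via SPARQL. As a minimal sketch (the property name and the SPARQL 1.1 JSON results format are taken from the published specifications; the resource URIs are invented, and the exact endpoint URL should be checked against the current OpenCitations documentation), the following builds such a query and parses a canned response offline:

```python
import json

# cito:cites links a citing work to a cited work (CiTO/SPAR ontologies).
QUERY_TEMPLATE = """
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?cited WHERE {{ <{citing}> cito:cites ?cited }} LIMIT 10
"""

def build_query(citing_uri):
    """Fill the citing work's URI into the query template."""
    return QUERY_TEMPLATE.format(citing=citing_uri)

def parse_bindings(sparql_json):
    """Extract the cited URIs from the SPARQL 1.1 JSON results format."""
    return [b["cited"]["value"]
            for b in sparql_json["results"]["bindings"]]

# Offline demonstration with a canned response in the standard format.
sample = json.loads("""{
  "results": {"bindings": [
    {"cited": {"type": "uri", "value": "https://w3id.org/oc/corpus/br/2"}}
  ]}
}""")
print(parse_bindings(sample))  # ['https://w3id.org/oc/corpus/br/2']
```

In practice the query string would be sent to the public endpoint with an HTTP client; parsing the returned JSON then works exactly as shown.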
It can already be said that OpenCitations has achieved a new quality in the indexing of free citation data: by using structured data from defined sources, the data quality is better than that of CiteSeerX, for example. The use of Linked Data technology and the CC0 license also means that the conditions for reuse and for further linking and networking of the data are very good. Collaborations with other actors to cover further subject areas and develop new functionality suggest themselves. With the new technological and methodological approach of OpenCitations, the snowflakes of the earlier initiatives are, so to speak, compacted into a solid snowball, with which a targeted throw at the goal of a comprehensive free citation database no longer seems impossible.
In fact, several projects were launched in the past year that deal with citations from various angles. We will discuss the Linked Open Citation Database (LOC-DB) project in more detail below; in addition, there are the DFG project EXCITE[^12] and the WikiCite[^13] initiative of the Wikimedia Foundation.
[^12]: The aim of the EXCITE project is to develop improved software components for the automatic extraction of citations from texts in existing specialist databases, cf. "DFG-Project: EXCITE - Extraction of Citations from PDF Documents," Universität Koblenz-Landau, Institute for Web Science and Technologies, last checked on 28 November 2017, http://west.uni-koblenz.de/en/research/excite.
[^13]: See WikiCite, https://meta.wikimedia.org/wiki/WikiCite. The WikiCite initiative deals first of all with bibliographic metadata in the various projects of the Wikimedia Foundation. Citations are of particular importance here, however, as they can provide an indication of the reliability of content (cf. Dario Taraborelli, "WikiCite: The Journey and the Road Ahead," last modified on 23.05.2017, http://doi.org/10.6084/m9.figshare.5032235.v1).
## 4. The beginning of an avalanche? The Initiative for Open Citations
A completely new dynamic developed in the spring of 2017 with the Initiative for Open Citations (I4OC).[^14] Various actors (including the Wikimedia Foundation, PLOS, DataCite and OpenCitations) have joined forces to encourage scientific publishers to make the citations in their publications freely available via the Crossref[^15] platform. As of mid-July 2017, 46 publishers are already participating, including Cambridge University Press, De Gruyter, Sage, Springer, Taylor & Francis and Wiley. This means that the references of 45% of the approx. 35 million publications with reference lists in Crossref are already open and accessible via the Crossref REST API. In the future, OpenCitations will make these data available in RDF format as Linked Open Data.
[^14]: I4OC, https://i4oc.org/.
[^15]: Crossref (https://www.crossref.org/) is operated by a non-profit organization and assigns DOIs for electronic publications. The Crossref platform and its interfaces make the DOIs, together with the associated metadata, searchable and usable, e.g. for link resolvers.
Fig. 3: Citation data for Crossref[^16]
[^16]: The example shown is an excerpt from an article in the Springer magazine Nature, cf. https://api.crossref.org/works/10.1038/227680a0 (last checked on 28.11.2017).
Has this already triggered the avalanche, and are the days of commercial citation databases numbered? There is no doubt that the sheer mass of disclosed data opens up entirely new possibilities. However, the goal has not yet been reached. On the one hand, Crossref focuses on electronic publications, since its primary function is to assign DOIs. On the other hand, the quality of the citation data varies widely. In the example in Figure 3, the title of the article is missing for both references listed. Only in the second case is a DOI linked, so that the publication can be uniquely identified, enriched with additional metadata and linked. The DOI may either already have been included in the metadata provided by the publisher, or it may have been added by a Crossref linking service.[^17] The latter only works, however, if the quality of the source data is sufficient and the cited publication is actually registered with Crossref - which was evidently not the case for the first reference. In a (non-representative) sample of 2,501 citations available via the Crossref API and found with the keyword "social", only 33% contain a DOI, and 14% of the citations lack even such basic information as the title of the cited work.
[^17]: "Reference Linking," CrossRef, last checked 11/28/2017, https://www.crossref.org/services/reference-linking.
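Coverage figures like those in the sample above can be computed directly over records from the Crossref REST API, whose works may carry a `reference` list in which each entry can contain fields such as `DOI` and `article-title`. The following sketch computes such statistics offline over invented sample entries (the field names follow the Crossref API schema; the records themselves are fabricated for illustration):

```python
def reference_coverage(references):
    """Share of reference entries that carry a DOI, and share that
    lack even a title, as fractions of the total."""
    total = len(references)
    if total == 0:
        return {"doi": 0.0, "no_title": 0.0}
    with_doi = sum(1 for r in references if r.get("DOI"))
    no_title = sum(1 for r in references
                   if not (r.get("article-title") or r.get("volume-title")))
    return {"doi": with_doi / total, "no_title": no_title / total}

# Invented entries mimicking the Crossref `reference` field.
sample = [
    {"DOI": "10.1000/xyz", "article-title": "Some article"},
    {"article-title": "Another article"},
    {"journal-title": "Journal name only, no title or DOI"},
]
print(reference_coverage(sample))  # both shares are 1/3 here
```

Run over a real sample fetched from the API, this is essentially the analysis behind the 33% / 14% figures quoted above.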
The discovery system Primo with the Primo Central Index gives an impression of the extent to which the citation data in Crossref, at its current quality and coverage, can already be helpful in an existing research service for scientific users. Since May 2016, citation data from Crossref have been included here; according to the vendor Ex Libris, 124,341,120 citation links are currently included (as of 19.07.2017) and are updated weekly. At Mannheim University Library, Primo Central is used together with the local Mannheim catalogue data for a broad search across "articles and UB holdings"[^18], whereby only those holdings available in Mannheim are displayed by default. If one enters the keyword "biology" in this search, citations are displayed for 45 of the first 100 entries in the results list; for "social", this is the case for only 17 of 100 entries. If the search is restricted to the resource type "Books" or the format "Print Media", not a single entry with citations appears among the first 100 results of either query.
[^18]: Cf. "Primo," Mannheim University Library, last checked on 28.11.2017, http://primo.bib.uni-mannheim.de/primo_library/libweb/action/search.do?mode=Basic&vid=MAN_UB&tab=man_all.
While electronic journal articles are already quite well covered, and there is a realistic chance that the situation will improve considerably as the I4OC initiative progresses, books and printed literature of all kinds have hardly been covered so far. How problematic this is presumably depends on the research goal and the subject area. Thinking back to the snowball method as a research strategy, however, it is generally recommended to start with a relevant survey work or a thematically well-fitting monograph and to follow up the works cited there before turning to more specific treatises. What is currently missing for the technical implementation of such a strategy in our research systems is the data basis.
## 5. Perspectives for Libraries: The Linked Open Citation Database (LOC-DB) Project
This is where the DFG project Linked Open Citation Database (LOC-DB) comes into play.[^19] The project started in October 2016 and will run for 24 months. The partners involved are the German Research Institute for Artificial Intelligence in Kaiserslautern (Prof. Andreas Dengel), the University of Media in Stuttgart (Prof. Kai Eckert), the German National Library of Economics (ZBW) - Leibniz Information Centre for Economics in Kiel/Hamburg (Prof. Ansgar Scherp) and the Mannheim University Library.
[^19]: LOC-DB, https://locdb.bib.uni-mannheim.de/.
The aim of the project is to demonstrate that libraries can make an efficient and sustainable contribution to the indexing of citation data. Precisely because existing services already cover certain areas well, libraries can concentrate on complementing and optimizing them. The strength of libraries is their extensive experience with indexing processes and bibliographic data. Trained staff are available and in any case process the various media the library acquires. It is therefore an obvious step to embed the acquisition of citation data into these existing processes, provided that extensive automation keeps the effort reasonable.
To achieve this, existing data and automatic methods are to be used wherever possible. In addition, however, intellectual control and correction will take place so that data can be produced at a reliably high level of quality, which in turn can serve as a "gold standard" to improve automatic indexing procedures. To close a systematic gap in the free citation data available so far, print literature is to be included in any case; ultimately, however, it must be possible to process all publication types and media types adequately in practice-oriented workflows.
For this purpose, an editorial system is being developed that integrates the various process steps and supports the workflow as efficiently as possible. For print literature, the first step is to scan the bibliographies contained in the works - similar to what a number of libraries already practise for tables of contents. These scans, or alternatively bibliographies already available in electronic form, are linked in the editorial system with the metadata of the publication from which they originate. They are then processed with methods of automatic text recognition (OCR) and individual citations are extracted.[^20] The system generates suggestions for linking the detected references with existing bibliographic data, preferably from data sources of high quality such as the German library networks or the OLC databases of the GBV, but also from Crossref or Google Scholar. If a link to a high-quality data set can be established, the extracted data need no further processing and the process can be completed quickly. If this is not possible, the detected data are corrected and completed manually. If structured electronic citation data are already available from sources such as Crossref or OpenCitations, the text recognition step can be skipped; even then, however, reviewing, linking and, where necessary, supplementing the existing data is quite reasonable given the quality problems described above.
[^20]: Kai Eckert, Anne Lauscher and Akansha Bhardwaj, "LOC-DB: A Linked Open Citation Database Provided by Libraries: Motivation and Challenges" (lecture at the EXCITE Workshop 2017, 30-31 March 2017), presentation slides, last checked on 17.07.2017, https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf, 25-27, gives an impression of this. In contrast to existing procedures, deep learning methods are used for the extraction of references. A scientific publication on this subject is in preparation.
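The linking step just described - matching an OCR-extracted reference string against candidate records from catalogue data, with manual correction as the fallback - can be illustrated with a simple similarity ranking. LOC-DB itself uses more elaborate methods, so the following is only a sketch of the principle, built on Python's standard-library `difflib`:

```python
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(s.lower().split())

def best_match(extracted, candidates, threshold=0.8):
    """Rank candidate record titles against an OCR-extracted reference
    string; return the best candidate above the threshold, else None
    (i.e. fall back to manual correction)."""
    scored = [(SequenceMatcher(None, normalize(extracted),
                               normalize(c)).ratio(), c)
              for c in candidates]
    score, cand = max(scored)
    return cand if score >= threshold else None

# Hypothetical OCR output with typical recognition errors.
ocr = "Linked Data: The Ouiet Revolutlon"
candidates = ["Linked Data: The Quiet Revolution",
              "Linked Open Data in Libraries"]
print(best_match(ocr, candidates))  # Linked Data: The Quiet Revolution
```

The threshold embodies the workflow decision above: a confident match closes the record automatically, while anything below it is routed to intellectual control.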
Figure 4 shows the central screen of the editorial system, in which the scan to be processed is displayed on the left and the process steps to be carried out on the right. The data produced in the LOC-DB project are provided as Linked Open Data under the CC0 license. The data model follows that of OpenCitations, with a slight extension required for linking scanned bibliographies. It has been agreed with the operators of OpenCitations that the further development of the data model will be coordinated, so that the produced data can be linked and exchanged without problems in the future.
Fig. 4: Editorial system of LOC-DB (first prototype)
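Because the produced data follow the OpenCitations model, a citation ultimately reduces to RDF triples using the CiTO `cites` property. As a minimal sketch, the following serializes citation pairs as N-Triples (the property URI is from the published SPAR ontologies; the resource URIs are invented, and the LOC-DB extension for scanned bibliographies is not modelled here):

```python
# CiTO `cites` property from the SPAR ontologies.
CITO_CITES = "http://purl.org/spar/cito/cites"

def to_ntriples(citations):
    """Serialize (citing, cited) URI pairs as N-Triples lines."""
    return "\n".join(f"<{s}> <{CITO_CITES}> <{o}> ."
                     for s, o in citations)

# Invented bibliographic-resource URIs, for illustration only.
pairs = [("https://example.org/br/1", "https://example.org/br/2")]
print(to_ntriples(pairs))
```

Data in this shape can be loaded into any triple store and merged with the OpenCitations corpus without conversion, which is exactly what the coordinated data model is meant to guarantee.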
Within a two-year project in which the software must first be developed, the amount of data that can actually be produced is limited. Since Mannheim University Library has a collection focus on the social sciences, and the social sciences are also underrepresented in the existing citation data, a sub-collection from this subject area was selected as a test corpus. Approximately 500 books published in 2011 from the reference collection of Mannheim University Library, as well as the 2011 volumes of 104 sociology journals, are being indexed retrospectively. In addition, new acquisitions for the social sciences reference collection have been processed on a regular basis since July 2017. In the course of the year, newly published articles from the 104 selected journals are also to be included in the continuous processing.
This sample should suffice to provide a reliable cost-benefit analysis at the end of the project. One thing is certain: no library will be able to index all the citation data relevant to scholarship on its own - this is only possible in cooperation with a large number of partners. It would be an obvious step, for example, to integrate the task into the specialist information services; but parts of the holdings of "normal" academic libraries (e.g. the publications of one's own university's staff, or special collections and subject specialisations) could also be covered in a targeted manner. However, such decisions can only be taken once it is clear which resources are needed to produce a certain amount of data at a certain quality.
The LOC-DB project will present its first results at a workshop in autumn 2017. Interested parties from libraries or related projects, as well as developers interested in the methods used, can gain an impression of the system's functionality on this occasion.
## 6. Conclusion
Given the large amount of citation data already freely available and the dynamic development in this area, an almost complete indexing of the citations of all scientifically relevant publications has come within reach. Libraries could play an important role in this, and they could benefit considerably from it: in library catalogues and other research systems, the inclusion of citation data could create added value for users, and free citation data could form a transparent and extensible data basis for bibliometric analyses. Even if the quality of free citation data is not yet comparable to that of commercial providers, any effort invested in improving free data leads to sustainable benefits. The LOC-DB project is developing a solution with which a distributed infrastructure for open citations at libraries could be realized efficiently and sustainably.
## References

- Eckert, Kai, Anne Lauscher and Akansha Bhardwaj. "LOC-DB: A Linked Open Citation Database Provided by Libraries: Motivation and Challenges." Lecture at the EXCITE Workshop 2017, 30-31 March 2017. Presentation slides. Last checked on 17.07.2017. https://locdb.bib.uni-mannheim.de/wordpress/wp-content/uploads/2016/11/LOC-DB@EXCITE.pdf.
- Fochler, Maximilian, Ulrike Felt and Ruth Müller. "Unsustainable Growth, Hyper-Competition, and Worth in Life Science Research: Narrowing Evaluative Repertoires in Doctoral and Postdoctoral Scientists' Work and Lives." Minerva 54, no. 2 (2016): 175-200. http://doi.org/10.1007/s11024-016-9292-y.
- Hall, Wendy. "Linked Data: The Quiet Revolution." ERCIM News 96 (2014): 4. Last checked on 17.07.2017. https://ercim-news.ercim.eu/en96/keynote.
- Taraborelli, Dario. "WikiCite: The Journey and the Road Ahead." Last modified on 23.05.2017. http://doi.org/10.6084/m9.figshare.5032235.v1.
- Walker, Benjamin, Sepand Alavifard, Surain Roberts, Andrea Lanes, Tim Ramsay and Sylvain Boet. "Inter-rater Reliability of H-Index Scores Calculated by Web of Science and Scopus for Clinical Epidemiology Scientists." Health Information and Libraries Journal 33, no. 2 (2016): 140-149. http://doi.org/10.1111/hir.12140.