When is a persistent identifier not persistent? Or an identifier?

Ever wondered what that bar code on the back of every book is? It's an ISBN: an International Standard Book Number. Every modern book published has an ISBN, which uniquely identifies that book, and anyone publishing a book can get an ISBN for it whether an individual or a huge publishing house. It's a little more complex than that in practice but generally speaking it's 1 book, 1 ISBN. Right? Right.

Except…

If you search an online catalogue, such as WorldCat or The British Library, for the ISBN 9780393073775 (or the 10-digit equivalent, 0393073777), you'll find results for two completely different books:

Waal FD. The Bonobo and the Atheist: In Search of Humanism Among the Primates. New York: W. W. Norton & Co.; 2013. 304 p. http://www.worldcat.org/oclc/1167414372
Lodge HC. The Storm Has Many Eyes; a Personal Narrative. 1st edition. New York: New York Norton; 1973. http://www.worldcat.org/oclc/989188234

In fact, things are so confused that the cover of one book gets pulled in for the other as well. Investigate further and you'll see that it's not a glitch: both books have been assigned the same ISBN. Others have found the same:

"However, if the books do not match, it's usually one of two issues. First, if it is the same book but with a different cover, then it is likely the ISBN was reused for a later/earlier reprinting. … In the other case of duplicate ISBNs, it may be that an ISBN was reused on a completely different book. This shouldn't happen because ISBNs are supposed to be unique, but exceptions have been found." –- GoodReads Librarian Manual: ISBN-10, ISBN-13 and ASINS

While most publishers stick to the rules about never reusing an ISBN, it's apparently common knowledge in the book trade that ISBNs from old books get reused for newer books, sometimes accidentally (due to a typo), sometimes intentionally (to save money), and that has some tricky consequences.

I recently attended a webinar entitled "Identifiers in Heritage Collections - how embedded are they?" from the Persistent Identifiers as IRO Infrastructure ("HeritagePIDs") project, part of AHRC's Towards a National Collection programme. As quite often happens, the question was raised: what Persistent Identifier (PID) should we use for books and why can't we just use ISBNs? Rod Page, who gave the demo that prompted this discussion, also wrote a short follow-up blog post about what makes PIDs work (or not) which is worth a look before you read the rest of this.

These are really valid questions and worth considering in more detail, and to do that we need to understand what makes a PID special. We call them persistent, and indeed we expect some sort of guarantee that a PID remains valid for the long term, so that we can use it as a link or placeholder for the referent without worrying that the link will get broken. But we also expect PIDs to be actionable: each one can be made into a valid URL by following some rules, so that we can directly obtain the object referenced or at least some information about it.

Actionability implies two further properties: an actionable identifier must be

Unique: guaranteed to have only one identifier for a given object (of a given type); and
Unambiguous: guaranteed that a single identifier refers to only one object
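To make those two properties concrete, here is a minimal Python sketch that checks a list of (identifier, object) pairs for both. The function name `check_identifiers` is made up for illustration, and treating a WorldCat OCLC number as a stand-in for "the object" is an assumption; the two records are the ones quoted earlier.

```python
from collections import defaultdict


def check_identifiers(pairs):
    """Return (non_unique, ambiguous) for an iterable of (identifier, object_key) pairs.

    non_unique: objects that carry more than one identifier
    ambiguous:  identifiers that point at more than one object
    """
    ids_by_object = defaultdict(set)
    objects_by_id = defaultdict(set)
    for identifier, obj in pairs:
        ids_by_object[obj].add(identifier)
        objects_by_id[identifier].add(obj)
    non_unique = {o: sorted(i) for o, i in ids_by_object.items() if len(i) > 1}
    ambiguous = {i: sorted(o) for i, o in objects_by_id.items() if len(o) > 1}
    return non_unique, ambiguous


# Assumption: each OCLC number stands for one distinct book.
records = [
    ("9780393073775", "oclc:1167414372"),  # The Bonobo and the Atheist
    ("9780393073775", "oclc:989188234"),   # The Storm Has Many Eyes
]
non_unique, ambiguous = check_identifiers(records)
print(ambiguous)
```

With only these two records the ISBN shows up as ambiguous; nothing is flagged as non-unique because each record has just the one identifier.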

Where does this leave us with ISBNs?

Well, first up, they're not actionable: given an ISBN, there's no canonical way to obtain information about the book referenced, although in practice there are a number of databases that can help. There is, in fact, an actionable ISBN standard: ISBN-A permits converting an ISBN into a DOI with all the benefits of the underlying DOI and Handle infrastructure. Sadly, creation of an ISBN-A isn't automatic and publishers have to explicitly create the ISBN-A DOI in addition to the already-created ISBN; most don't.
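As a rough sketch of what that conversion looks like, the Python below builds an ISBN-A string from a hyphenated ISBN-13. It assumes the ISBN-A syntax is `10.` + the 978/979 prefix + `.` + the publisher prefix (registration group plus registrant) + `/` + the remaining digits, so treat it as an illustration and check it against the ISBN-A documentation. The function names and the hyphenation `978-0-393-07377-5` are my assumptions too. A DOI built this way only resolves if the publisher has actually registered it.

```python
def isbn_to_isbn_a(hyphenated_isbn13: str) -> str:
    """Build an ISBN-A style DOI name from a fully hyphenated ISBN-13.

    The hyphens are needed because the split between publisher prefix and
    title number cannot be read off the bare digits.
    """
    parts = hyphenated_isbn13.strip().split("-")
    if len(parts) != 5:
        raise ValueError("expected five hyphen-separated elements")
    prefix, group, registrant, publication, check = parts
    if prefix not in ("978", "979"):
        raise ValueError("not an ISBN-13 prefix")
    return f"10.{prefix}.{group}{registrant}/{publication}{check}"


def doi_url(doi: str) -> str:
    """Turn a DOI name into a URL via the public doi.org resolver."""
    return "https://doi.org/" + doi


isbn_a = isbn_to_isbn_a("978-0-393-07377-5")  # hyphenation is an assumption
print(doi_url(isbn_a))
```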

More than that though, it's hard to make them actionable since ISBNs fail on both uniqueness and unambiguity. Firstly, as seen in the example I gave above, ISBNs do get recycled. They're not supposed to be:

"Once assigned to a monographic publication, an ISBN can never be reused to identify another monographic publication, even if the original ISBN is found to have been assigned in error." –- International ISBN Agency. ISBN Users’ Manual [Internet]. Seventh Edition. London, UK: International ISBN Agency; 2017 [cited 2020 Jul 23]. Available from: https://www.isbn-international.org/content/isbn-users-manual

Yet they are, so we can't rely on their precision^[1].

Secondly, and perhaps more problematic in day-to-day use, a given book may have multiple ISBNs. To an extent this is reasonable: different editions of the same book may have different content, or at the very least different page numbering, so a PID should be able to distinguish these for accurate citation. Unfortunately the same edition of the same book will frequently have multiple ISBNs; in particular each different format (hardback, paperback, large print, ePub, MOBI, PDF, …) is expected to have a distinct ISBN. Even if all that changes is the publisher, a new ISBN is still created:

"We recently encountered a case where a publisher had licensed a book to another publisher for a different geographical market. Both books used the same ISBN. If the publisher of the book changes (even if nothing else about the book has changed), the ISBN must also change." –- Everything you wanted to know about the ISBN but were too afraid to ask

Again, this is reasonable since the ISBN is primarily intended for stockkeeping by book sellers^[2], and for them the difference between a hardback and paperback is important because they differ in price if nothing else. This has bitten more than one librarian when trying to merge data from two different sources (such as usage and pricing) using the ISBN as the "obvious" merge key. It makes bibliometrics harder too, since you can't easily pull out a list of all citations of a given edition in the literature, just from a single ISBN.
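Even before you hit the format problem, the same ISBN can turn up in 10-digit or 13-digit form depending on the source, so it's worth normalising before any merge. Here's a small Python sketch of the ISBN-10 to ISBN-13 conversion: put 978 on the front, drop the old check digit and recalculate it with alternating weights of 1 and 3. The function name is mine, and it doesn't validate the incoming ISBN-10 check digit.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to the equivalent 978-prefixed ISBN-13."""
    digits = isbn10.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError("expected a 10-character ISBN")
    # Drop the ISBN-10 check character (which may be 'X') and add the prefix.
    body = "978" + digits[:9]
    # Weights alternate 1, 3, 1, 3, ... across the twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


print(isbn10_to_isbn13("0393073777"))  # → 9780393073775
```

That gets both forms onto one merge key, but it does nothing about hardback and paperback carrying different ISBNs in the first place.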

So where does this leave us?

I'm not really sure yet. ISBNs as they are currently specified and used by the book industry aren't really fit for purpose as a PID. But they're there and they sort-of work and establishing a more robust PID for books would need commitment and co-operation from authors, publishers and libraries. That's not impossible: a lot of work has been done recently to make the ISSN (International Standard Serial Number, for journals) more actionable.

But perhaps there are other options. Where publishers, booksellers and libraries are primarily interested in IDs for stock management, authors, researchers and scholarly communications librarians are more interested in the scholarly record as a whole and tracking the flow of ideas (and credit for those) which is where PIDs come into their own. Is there an argument for a coalition of these groups to establish a parallel identifier system for citation & credit that's truly persistent? It wouldn't be the first time: ISNIs (International Standard Name Identifiers) and ORCIDs (Open Researcher and Contributor IDs) both identify people, but for different purposes in different roles and with robust metadata linking the two where possible.

I'm not sure where I'm going with this train of thought so I'll leave it there for now, but I'm sure I'll be back. The more I dig into this the more there is to find, including the mysterious, long-forgotten and no-longer accessible Book Item & Component Identifier proposal. In the meantime, if you want a persistent identifier and aren't sure which one you need, these Guides to Choosing a Persistent Identifier from Project FREYA should get you started.

^[1] Actually, as my colleague pointed out, even DOIs potentially have this problem, although I feel they can mitigate it better with metadata that allows rich expression of relationships between DOIs.
^[2] In fact, the newer ISBN-13 standard is simply an ISBN-10 encoded as an "International Article Number", the standard barcode format for almost all retail products, by sticking the "Bookland" country code of 978 on the front and recalculating the check digit.