The term "chain of trust" comes from the domain of computer security, where it is one of the core principles behind verifying digital signatures. To know whether a signature is valid, a list of endorsing signatures is added to the original signature. These endorsing signatures can in turn have endorsing signatures themselves to help verify them, and so forth. When one or more signatures along this chain comes from an entity you explicitly trust (a family member, friend or trustworthy institution), the original signature can be trust implicitly.
This document outlines an experiment to see what it would take to create a chain of trust for sentences in a text article. The idea is to track down articles that support the claim made in the sentence, then track down articles that support the supporting articles, and so forth, until articles from explicitly trusted sources are reached (mainly peer reviewed journal articles). Moreover, we want to visualize this chain of trust so we get an overview of the origins of a claim, including which articles support and refute it, and how trustworthy those articles are in turn. While with digital signatures, we can deal in absolutes (either a signature endorses another signature or not), with text we will probably have to settle for a score indicating the degree of trust.
In academic articles, the idea of a "chain of trust" is explicitly encoded in the usage of citations. Hence, with these sorts of articles it should be considerably easier to get a basic system going to visualize this chain of trust than with, for example, a newspaper article.
The main difficulty here is that a citation usually does not mention the specific sentence in the cited article that directly supports the claim. If we want to follow the chain of trust deeper than a single link, we must have a specific sentence to check. Sentence embeddings, such as those produced by BERT and many other models, may prove useful here. A sentence embedding is a numerical representation of a sentence. (In the case of BERT, each sentence is converted into a vector of 768 numbers.) These numbers act as coordinates in a vast semantic space. Sentences that are close in meaning ("The weather is good", "We are having fine weather") will be assigned embeddings that lie close together. In general, the distance between embeddings (measured as cosine distance) indicates the degree of semantic similarity. So, if we want to track down a sentence in an article that supports a specific claim, one approach is to compute embeddings for all sentences in that article and compare their distances to the embedding of the claim.
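To make this concrete, here is a minimal sketch using the sentence-transformers package, with the same model that is used later in this experiment:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")

sentences = ["The weather is good",
             "We are having fine weather",
             "The stock market fell sharply"]

# Each sentence becomes a fixed-length vector (384 numbers for this model).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine distance = 1 - cosine similarity; smaller means closer in meaning.
print(1 - util.cos_sim(embeddings[0], embeddings[1]).item())  # small distance
print(1 - util.cos_sim(embeddings[0], embeddings[2]).item())  # larger distance
```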
Let's begin with a simple case: one of my journal articles citing another journal article by Hauk et al.
Elsevier offers a text-mining API that serves up the full text of articles in XML format. This XML can be parsed to extract each sentence, along with any citations supporting the claim made in that sentence.
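As a rough illustration, the parsing step might look like the sketch below. It assumes Elsevier's "ce" (common elements) namespace with `ce:para` and `ce:cross-ref` tags, and it splits sentences naively; the actual code in the repository may differ.

```python
from lxml import etree

# Namespace of Elsevier's common elements (an assumption; check the actual files).
NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

def sentences_with_citations(xml_path):
    """Yield (sentence, citation_ids) pairs from an Elsevier full-text XML file."""
    tree = etree.parse(xml_path)
    for para in tree.iterfind(".//ce:para", NS):
        # Bibliography entries cited anywhere in this paragraph.
        refs = [xref.get("refid") for xref in para.iterfind(".//ce:cross-ref", NS)]
        # Flatten the paragraph to plain text and normalize whitespace.
        text = " ".join("".join(para.itertext()).split())
        # Naive sentence split; a real pipeline would use a proper tokenizer.
        for sentence in text.split(". "):
            yield sentence, refs
```

Note that this crude version attributes every citation in a paragraph to every sentence of that paragraph; proper per-sentence attribution requires tracking where each `ce:cross-ref` occurs within the text.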
All sentences were processed by the all-MiniLM-L12-v2 model, which is based on Microsoft's MiniLM and distributed through HuggingFace.
The code and the articles in XML format can be found at https://github.com/wmvanvliet/tcot.
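The matching step itself boils down to a nearest-neighbour search in embedding space. A minimal sketch, with hypothetical placeholder sentences standing in for the ones parsed from the XML:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")

# Hypothetical placeholders; the real pipeline reads these from the parsed XML.
claim_sentences = ["<sentence from the citing article>"]
cited_sentences = ["<sentence 1 of the cited article>",
                   "<sentence 2 of the cited article>",
                   "<sentence 3 of the cited article>"]

claim_embs = model.encode(claim_sentences, convert_to_tensor=True)
cited_embs = model.encode(cited_sentences, convert_to_tensor=True)

# For each claim, list the k sentences of the cited article closest in meaning.
k = 2
for claim, sims in zip(claim_sentences, util.cos_sim(claim_embs, cited_embs)):
    print(claim)
    for idx in sims.argsort(descending=True)[:k]:
        j = int(idx)
        print(f"  #{j:03d}  distance={1 - sims[j].item():.3f}  {cited_sentences[j]}")
```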
For this test, only the 4 sentences that cite the Hauk et al. article are considered. For each of those sentences, here are the 5 closest matches in the cited article, along with the sentence number and cosine distance:
The model performs poorly for the first sentence (#019), which might be because the text becomes a bit garbled once the reference markers are stripped out. For the other three sentences, the model performs well!