<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/1612.03975) | [Note link](https://blog.csdn.net/boywaiter/article/details/103982824) | [Code link](https://github.com/commonsense/conceptnet5) | AAAI 2017

:::success
**Thoughts**
- With ConceptNet, word embeddings can be improved by including a source of relational knowledge.
- The resulting embeddings represent word meaning better than embeddings built only from distributional semantics (word2vec, GloVe, and LexVec).
- ConceptNet Numberbatch is the combination of the two (ConceptNet PPMI + distributional semantics).
:::

## Abstract

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include
- expert-created resources
- crowd-sourcing
- games with a purpose

It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use.

## Introduction

ConceptNet is a knowledge graph that connects words and phrases of natural language (*terms*) with labeled, weighted edges (*assertions*). This paper describes the release of ConceptNet 5.5.

## Structure of ConceptNet

ConceptNet is the knowledge graph version of the Open Mind Common Sense project, a common sense knowledge base of the most basic things a person knows. ConceptNet’s role compared to other knowledge resources is to **provide a sufficiently large, free knowledge graph that focuses on the common-sense meanings of words (not named entities) as they are used in natural language.**

### Knowledge Sources

- Facts acquired from Open Mind Common Sense (OMCS)
- Information extracted from parsing Wiktionary (the largest source)
- “Games with a purpose” designed to collect common knowledge
- Open Multilingual WordNet
- JMDict
- OpenCyc
- A subset of DBPedia

### [Relations](https://github.com/commonsense/conceptnet5/wiki/Relations)

ConceptNet uses a closed class of selected relations such as *IsA*, *UsedFor*, and *CapableOf*, intended to represent a relationship independently of the language or the source of the terms it connects. ConceptNet 5.5 aims to align its knowledge resources on its core set of 36 relations.
- Symmetric relations: 7
- Asymmetric relations: 29

![image](https://hackmd.io/_uploads/H1xPyz0ca.png)

### Term Representation

ConceptNet represents terms in a standardized form. The tokens are joined with underscores, and this text is prepended with the URI $\verb|/c/lang|$, where *lang* is the [BCP 47 language code](https://tools.ietf.org/html/bcp47) for the language the term is in. As an example, the English term “United States” becomes $\verb|/c/en/united_states|$.

Relations have a separate namespace of URIs prefixed with $\verb|/r|$, such as $\verb|/r/PartOf|$. These relations are given artificial names in English, but apply to all languages.

### Vocabulary

ConceptNet’s representation allows for more specific, disambiguated versions of a term. The URI $\verb|/c/en/lead/n|$ refers to noun senses of the word “lead”. It is effectively included within $\verb|/c/en/lead|$ when searching or traversing ConceptNet, and linked to it with the implicit relation *SenseOf*.

### Linked Data

A term that is imported from another knowledge graph will be connected to ConceptNet nodes via the relation *ExternalURL*, pointing to an absolute URL that represents that term in that external resource.
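The standardized term form described above is easy to illustrate. The snippet below is a minimal sketch of turning a phrase into a ConceptNet-style URI; the function name `concept_uri` and the whitespace-only lowercasing tokenizer are assumptions of this sketch, not the actual `conceptnet5` library code, which does more careful language-specific text standardization.

```python
def concept_uri(text: str, lang: str = "en") -> str:
    """Build a ConceptNet-style term URI: lowercase the phrase,
    join its tokens with underscores, and prepend /c/<lang>/.

    Illustrative sketch only; the real library handles punctuation
    and language-specific tokenization.
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    return f"/c/{lang}/" + "_".join(tokens)


print(concept_uri("United States"))         # /c/en/united_states
print(concept_uri("jazz saxophone", "en"))  # /c/en/jazz_saxophone
```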
## Applying ConceptNet to Word Embeddings

### Computing ConceptNet Embeddings Using PPMI

For performance reasons, when building this matrix, they prune the ConceptNet graph by discarding terms connected to fewer than three edges. They calculate word embeddings directly from this sparse matrix: they apply context distributional smoothing, compute pointwise mutual information, and clip the negative values to yield positive pointwise mutual information (PPMI). This gives a matrix of word embeddings they call ConceptNet-PPMI.

### Combining ConceptNet with Distributional Word Embeddings

Retrofitting is a process that adjusts an existing matrix of word embeddings using a knowledge graph. It infers new vectors $q_i$ with the objective of being close to their original values, $\hat{q}_i$, and also close to their neighbors in the graph with edges $E$, by minimizing this objective function:

$$
\Psi(Q) = \sum_{i=1}^n \left[ \alpha_i \| q_i - \hat{q}_i \|^2 + \sum_{(i, j) \in E} \beta_{ij} \| q_i - q_j \|^2 \right]
$$

The particular benefit of applying expanded retrofitting to ConceptNet is that it can take advantage of the multilingual connections in ConceptNet. They add one more step to retrofitting: subtract the mean of the vectors that result from retrofitting, then renormalize them to unit vectors, which helps to ensure that terms remain distinguishable from each other. (A small sketch of this procedure appears at the end of this note.)

## Evaluation

They compare the embeddings on
- Intrinsic evaluations of word relatedness
- Downstream tasks
  - Solving proportional analogies
  - Choosing the sensible ending to a story

### Evaluations of Word Relatedness

Ask the system to rank the relatedness of pairs of words, and compare its judgments to human judgments. A good semantic space will provide a ranking of relatedness that is highly correlated with the human gold-standard ranking, as measured by its Spearman correlation ($\rho$).

Evaluation datasets:
- MEN-3000
- RW
- WordSim-353
- MTurk-771

### Solving SAT-style Analogies

Proportional analogies are statements of the form “$a_1$ is to $b_1$ as $a_2$ is to $b_2$”. Each candidate answer pair is scored with:

$$
s = a_1 \cdot a_2 + b_1 \cdot b_2 + w_1 (b_2 - a_2) \cdot (b_1 - a_1) + w_2 (b_2 - b_1) \cdot (a_2 - a_1)
$$

The weights found for ConceptNet Numberbatch 16.09 were $w_1 = 0.2$ and $w_2 = 0.6$. This indicates, surprisingly, that the comparisons made by the transposed form of the analogy were often more important than the directly stated form for choosing the best answer pair.

### An Evaluation of Common-Sense Stories

Their preliminary attempt to apply ConceptNet Numberbatch to the Story Cloze Test is a very simple “bag-of-vectors” model: average the embeddings of the words in each sentence and choose the ending whose average is closest.

## Results and Discussion

**Word Relatedness**

![image](https://hackmd.io/_uploads/HJJuJ7A96.png)

**SAT Analogies**

![image](https://hackmd.io/_uploads/Sk79yXCqp.png)

![image](https://hackmd.io/_uploads/Hky_UQAqp.png)

**Story Cloze Test**

The performance of their system on the Story Cloze Test was acceptable but unremarkable: ConceptNet Numberbatch chose the correct ending 59.4% of the time.

---

ConceptNet can make word embeddings more robust and more correlated with human judgments, as shown by the state-of-the-art results that ConceptNet Numberbatch achieves at matching human annotators on multiple evaluations.
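As promised above, here is a minimal numpy sketch of the retrofitting objective and the extra post-processing step (subtract the mean, renormalize to unit vectors). It assumes uniform weights standing in for $\alpha_i$ and $\beta_{ij}$ and uses plain rather than expanded retrofitting, so it is an illustration of the idea, not the authors' code.

```python
import numpy as np

def retrofit(q_hat, edges, alpha=1.0, beta=1.0, iters=10):
    """Retrofitting sketch: pull each vector toward its original value
    and toward its graph neighbors, then de-mean and renormalize.

    q_hat : (n, d) array of original (distributional) embeddings
    edges : list of (i, j) index pairs from the knowledge graph
    alpha, beta : uniform weights standing in for alpha_i and beta_ij
    """
    q = q_hat.copy()
    # build undirected adjacency lists
    nbrs = [[] for _ in range(len(q_hat))]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    for _ in range(iters):
        for i, js in enumerate(nbrs):
            if not js:
                continue  # isolated terms keep their original vectors
            # closed-form coordinate update minimizing the objective in q_i
            q[i] = (alpha * q_hat[i] + beta * q[js].sum(axis=0)) / (alpha + beta * len(js))

    # extra step from the paper: subtract the mean, renormalize to unit length
    q -= q.mean(axis=0)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    return q
```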
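The SAT-analogy scoring formula is likewise easy to compute once each term has an embedding. The sketch below is a minimal numpy version using the weights reported in the paper; normalizing the vectors to unit length so that dot products behave like cosine similarities is an assumption of this sketch.

```python
import numpy as np

def unit(v):
    """Normalize to a unit vector (assumed here, not stated in the paper)."""
    return v / np.linalg.norm(v)

def analogy_score(a1, b1, a2, b2, w1=0.2, w2=0.6):
    """Score 'a1 : b1 :: a2 : b2' with the weighted formula above.

    (a1, b1) is the stem pair; (a2, b2) is a candidate answer pair.
    w1 = 0.2 and w2 = 0.6 are the weights reported for
    ConceptNet Numberbatch 16.09.
    """
    a1, b1, a2, b2 = map(unit, (a1, b1, a2, b2))
    return (a1 @ a2
            + b1 @ b2
            + w1 * (b2 - a2) @ (b1 - a1)
            + w2 * (b2 - b1) @ (a2 - a1))
```

For a multiple-choice question, the answer pair with the highest score is chosen.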