<!-- Adjust the default reveal.js style -->
<style type="text/css">
@import "https://hackmd.io/build/reveal.js/css/theme/white.css";
.reveal h1, .reveal h2, .reveal h3 { text-align: left; text-transform: none }
.reveal p, .reveal ul, .reveal ol { text-align: left; display: block; }
.reveal section img { border: none; box-shadow: none; }
</style>
<!--
At https://hackmd.io/p/S1YmXWC0e the slides are shown in presentation mode with reveal.js. Press "S" to show the speaker notes.
-->
# Wikidata as authority linking hub
*Joachim Neubert (ZBW - Leibniz Information Centre for Economics, Kiel/Hamburg)*
*Jakob Voß (Verbundzentrale des GBV, Göttingen)*
ELAG
Athens, June 8, 2017
`Online version (with links): https://hackmd.io/p/S1YmXWC0e`
---
# Introduction
---
## Authority files
Consistently refer to entities
* Via identifier ("things, not strings")
* GND, MeSH, STW, ISIL, RePEc Author ID...
---
## Linking hubs
Connect identifiers among authority files
* `owl:sameAs`, `skos:exactMatch`, `skos:closeMatch`...
* [VIAF](http://viaf.org), sameAs.org, [Wikidata](https://wikidata.org)...
![](https://i.imgur.com/dgRaN33.png)
Note: Two general models: 1:1 links or a central hub. The hubs differ:
* VIAF - libraries / algorithmically generated clusters
* sameAs.org - Semantic Web / coreference links harvested from the web
* Wikidata - community / curated content
---
## Wikidata
* Knowledge base of Wikimedia projects
* All kinds of entities
* concepts, places, people, works...
---
## Wikidata usage
* Editable by anyone
* via Website and API
* via apps that use the API
* Data available
* http://query.wikidata.org/ (SPARQL)
* JSON API & database dumps
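A minimal sketch of a query against the SPARQL endpoint (prefixes like `wd:` and `wdt:` are predefined at http://query.wikidata.org/; results are live data):

```sparql
# Ten items that are instances of "human" (Q5), with their English labels
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
```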
---
## Wikidata statements
![](https://upload.wikimedia.org/wikipedia/commons/c/ce/Wikidata_statement_with_ids.svg)
Note: Wikidata consists of
* Items
* uniquely identified by an item-id (Q...)
* having a label in each language
* Statements
* each with a property
* uniquely identified with a property-id (P...)
* and a value (string, date, item, external identifier...)
* and optional qualifiers and references to sources
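In SPARQL, these parts map directly onto triple patterns; a minimal sketch (Q42 is Douglas Adams, P227 the GND ID property):

```sparql
# Value of the GND ID statement (P227) on item Q42
SELECT ?gnd WHERE {
  wd:Q42 wdt:P227 ?gnd .
}
```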
---
## Wikidata item example
![](https://i.imgur.com/VNaeUMK.jpg)
Note: on the example
* This is a more condensed display of a Wikidata item, based on the "SQID" Wikidata browser.
* Note the list of identifiers for this item/person, linking to other authority files.
* Property URL pattern + inserted ID value = link
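The URL pattern comes from the property's formatter URL ([P1630](https://www.wikidata.org/wiki/Property:P1630)); a sketch of looking it up for the GND ID property, where `$1` is replaced by the identifier value to form the link:

```sparql
# Formatter URL of the GND ID property (P227), e.g. "https://d-nb.info/gnd/$1"
SELECT ?url WHERE {
  wd:P227 wdt:P1630 ?url .
}
```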
---
## Authority file identifiers in Wikidata
More than half of all Wikidata properties
* Datatype external identifier (~1,750)
* [Properties for authority control](https://www.wikidata.org/wiki/Q18614948) (~1,500)
* Properties with corresponding KOS (~220)
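The first figure can be approximated directly on the query service; a minimal sketch (the count changes as properties are added):

```sparql
# Count properties with datatype "external identifier"
SELECT (COUNT(?property) AS ?count) WHERE {
  ?property wikibase:propertyType wikibase:ExternalId .
}
```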
Note: The total number is hard to measure because of the "organic" organization of Wikidata
---
# Wikidata---ISIL (organizations)
Example:
*Neuschwanstein Castle* ([Q4152](https://www.wikidata.org/wiki/Q4152))
ISIL ([P791](https://www.wikidata.org/wiki/Property:P791)): *DE-MUS-051612*
Current state:
* lobid.org: ~15,000 ISIL (DACH only)
* Wikidata: ~6,500 ISIL
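The Wikidata figure can be reproduced with a query like this (a sketch; the number grows as items are edited):

```sparql
# Count items carrying an ISIL (P791)
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P791 ?isil .
}
```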
---
## Tool: *Mix'n'match*
* Web-based mapping tool
* Helps to add 1-to-1 mappings
https://tools.wmflabs.org/mix-n-match/
---
## Step 1: Upload ISIL list with names
![](https://i.imgur.com/Sh89f3l.png)
---
## Step 2: Confirm match candidates
![Automatically matched](https://i.imgur.com/st847ta.png)
---
![Visual Mode](https://i.imgur.com/9WWCrCa.png)
Note: The visual mode is even handier for confirming mappings.
The ISIL mapping is finished for German museums (lobid-museums).
---
# GND---RePEc Author ID
* In the [EconBiz](http://www.econbiz.de) economics search portal, authors are identified differently:
* by **GND ID** in data from ZBW's *Econis* catalog
* by **RePEc Author ID** in data from *Research Papers for Economics*
* Large volumes: 450,000 vs. 50,000 distinct persons
* ~3,000 pairs of IDs discovered in a previous project
Note: Characteristics
* GND = well known and interlinked
* RAS = high-quality data, curated by the authors themselves, incentivized by rankings - but: linked to nothing else in the world
---
### Person identifiers in Wikidata and EconBiz
![](https://i.imgur.com/1J7I3aN.jpg)
Note: Maintaining a custom mapping environment would mean
* custom software, database, operations
* access limited to ZBW staff
---
## Utilizing Wikidata as linking hub
* Wikidata properties exist for both identifier systems
* GND ID ([P227](https://www.wikidata.org/wiki/Property:P227)): ~375,000 items which are humans
* RePEc Short-ID ([P2428](https://www.wikidata.org/wiki/Property:P2428)): ~2,200 items
* Since every identifier should identify exactly one person, we can derive
* GND ID ⟶ Wikidata ID ⟶ RePEc ID
* RePEc ID ⟶ Wikidata ID ⟶ GND ID
where both properties have values (~760 items, as of 2017-04-25)
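The derivation boils down to a single SPARQL query against http://query.wikidata.org/ (a minimal sketch):

```sparql
# Items carrying both a GND ID (P227) and a RePEc Short-ID (P2428),
# i.e. a direct GND <-> RePEc mapping via Wikidata
SELECT ?item ?gnd ?repec WHERE {
  ?item wdt:P227 ?gnd ;
        wdt:P2428 ?repec .
}
```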
Note: I describe the process in 6 steps, which should be portable to other similar mappings. All numbers above as of April 25th 2017
---
## Step 1: Supplement WD items with RePEc Author IDs
* 77 WD items with a GND ID but no RePEc Short-ID
* Transform to *[QuickStatements2](https://tools.wmflabs.org/quickstatements/)* input file ([SPARQL query](https://github.com/zbw/repec-ras/blob/ELAG2017/sparql/missing_ids_in_wikidata_from_mapping.rq), [script](https://github.com/zbw/repec-ras/blob/ELAG2017/bin/create_missing_ids_in_wikidata_from_mapping.pl))
* Copy & paste to *QuickStatements2*
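The generated input consists of plain tab-separated lines, one statement each; an illustrative sketch (item IDs and RePEc IDs below are made up):

```
Q123456	P2428	"pxy99"
Q234567	P2428	"pab12"
```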
---
### Bulk editing with *QuickStatements2*
![](https://i.imgur.com/yYx3Vxn.png)
Further simplification with the upcoming release of the *[wdmapper](https://wdmapper.readthedocs.io/)* command-line tool
---
## Step 2: Supplement WD items with GND IDs
* 384 WD items with a RePEc Short-ID but no GND ID
* *same process as other direction*
---
## Step 3: Add "most important" authors with RePEc identifiers
* Scraped from ranking pages ([Top 10% economists](https://ideas.repec.org/top/top.person.all.html), [Top 10% female economists](https://ideas.repec.org/top/top.women.html))
* Transform and load into *Mix'n'match* ([RePEc Top](https://tools.wmflabs.org/mix-n-match/#/catalog/406), [RePEc Top Female](https://tools.wmflabs.org/mix-n-match/#/catalog/404))
* *Same process as ISIL use case*
* Confirm match candidates (~1,600 of 4,600)
---
## Step 4: Add "most important" authors with GND identifiers
* 18,000 GND authors with >30 publications in EconBiz
* Transform and load into *Mix'n'match* ([GND economists (de)](https://tools.wmflabs.org/mix-n-match/#/catalog/431))
* 25% matched automatically with Wikidata items
* Confirm match candidates (~1,500)
---
## Step 5: Rinse and repeat
* Repeat the *Mix'n'match* "sync" operation before starting manual work
* often, people are adding data at a fast rate!
* Repeat the bulk addition of missing identifiers to pick up complementary identifiers added in the meantime
---
## Step 6: Add missing Wikidata items
* Verify that the missing authors are indeed not in Wikidata (steps 3 and 4)
* Generate Wikidata item data from existing mappings (or lists, e.g., top female economists) in *QuickStatements2* input format ([SPARQL query](https://github.com/zbw/repec-ras/blob/ELAG2017/sparql/ras_missing_in_wikidata.rq), [script](https://github.com/zbw/repec-ras/blob/ELAG2017/bin/create_missing_wikidata.pl))
<!-- ![Using Wikidata's QuickStatements tool](https://i.imgur.com/h95sTQl.png) -->
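In *QuickStatements2* input, creating an item with a few statements looks roughly like this (illustrative values; Q5 = human, Q188094 = economist, `Len` sets the English label):

```
CREATE
LAST	Len	"Jane Doe"
LAST	P31	Q5
LAST	P106	Q188094
LAST	P227	"123456789X"
LAST	P2428	"pdo1"
```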
---
### Synthesized Wikidata items from ZBW's GND--RePEc mapping
2,179 new Wikidata items created
* occupation "economist" from RePEc Author Service
* gender and date of birth/death (if available) from GND
* description from GND's info field and RePEc's "works for"
---
### Recommendations for item creation
* Explain your plan and ask for feedback in the [Wikidata project chat](https://www.wikidata.org/wiki/Wikidata:Project_chat)
* Pay attention to [Wikidata's notability criteria](https://www.wikidata.org/wiki/Wikidata:Notability)
* [Apply for a bot account](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot) to make mass edits ([example](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/JneubertAutomated_2))
* Use *QuickStatements2* (more convenient input format, batch mode, sources)
* Source every statement ([hints](https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat/Archive/2017/05#Source_statements_for_items_synthesized_from_authorities_-_recommendations.3F))
---
## Result: the mapping in Wikidata
* 4,087 matching GND - RePEc author IDs
* 3,081 matches from ZBW's mapping
* 1,006 matches contributed by Wikidata users
* 468 additional RePEc author IDs
* 52,444 additional GND IDs (present in EconBiz)
(as of 2017-06-04)
Note: This looks even better if we focus on the "most important" persons.
---
### Top 10% RAS and frequent (>30) GND in EconBiz
![](https://i.imgur.com/LC3HQOb.jpg)
Note: The mapping currently covers more than 60% of RePEc's Top 10% Economists.
---
### Further results
* Mappings to other authorities *for free* (e.g., currently ~1,550 RePEc ⟷ VIAF)
* Identifiers and items inserted by individual Wikidata contributors add up continuously
* Mapping steps can be repeated with additional data (e.g., [top economists from Latin America](https://ideas.repec.org/top/top.lamerica.html#authors))
* Further identifiers (VIAF, ORCID...) provide more opportunities for indirect matching
Results from _every step in the mapping process_ and all individual efforts are immediately available
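For example, the RePEc ⟷ VIAF mapping mentioned above falls out of a query like this (a sketch; P214 is the VIAF ID property):

```sparql
# Indirect RePEc <-> VIAF mapping via items carrying both identifiers
SELECT ?repec ?viaf WHERE {
  ?item wdt:P2428 ?repec ;
        wdt:P214 ?viaf .
}
```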
---
# Tools
* *Mix'n'match* (system-supported intellectual matching)
* *QuickStatements2* (bulk addition of generated statements and items)
* [*wdmapper*](https://github.com/gbv/wdmapper) (harvest, diff & add mappings)
* Supports indirect mappings (e.g., GND-WD-RePEc) in one step
* Work in progress (adding not supported yet)
* Daily harvested mappings in multiple formats:
http://coli-conc.gbv.de/concordances/wikidata/
Tools for mass editing require an approved bot account.
---
## Limitations
* Only basic mapping algorithms to find match candidates
* Limited to simple 1-to-1 relationships
* part-whole relations not covered
* often new Wikidata items required
* depends on the use case
* Large sets of mappings and results
* Regular review required for maintenance
Note:
* We just relied on the basic mapping algorithm built into Mix'n'match and added already existing mappings to Wikidata
* Each tool has its limitations with large result sets
---
# Benefits
* Outsourced interface, storage, and operation
* Crowdsourced mapping maintenance
* Wikidata has policies and tools for data quality
* Open Data for multiple and unknown uses
* Additional benefits:
* multilingual Wikipedia links
* lots of (formatted) data, nice pictures, ...
* links to multiple other vocabularies
---
# Contact
Joachim Neubert
j.neubert@zbw.eu
http://zbw.eu/labs
![](https://i.imgur.com/66sn2BY.gif)
Jakob Voss
jakob.voss@gbv.de
http://jakoblog.de/
![](https://i.imgur.com/N9sWh6L.jpg)
{"metaMigratedAt":"2023-06-14T12:47:38.269Z","metaMigratedFrom":"YAML","title":"Wikidata as authority linking hub","breaks":true,"lang":"en","slideOptions":"{\"README\":\"https://github.com/hakimel/reveal.js/#configuration\",\"slideNumber\":false,\"showNotes\":false,\"theme\":\"white\"}","contributors":"[{\"id\":\"6bfccf0f-3d06-485f-82a5-680e0616648a\",\"add\":4,\"del\":4}]"}