Wikidata as authority linking hub
Joachim Neubert (ZBW - Leibniz Information Centre for Economics, Kiel/Hamburg)
Jakob Voß (Verbundzentrale des GBV, Göttingen)
ELAG
Athens, June 8, 2017
Online version (with links): https://hackmd.io/p/S1YmXWC0e
Authority files
Consistently refer to entities
Via identifier ("things, not strings")
GND, MeSH, STW, ISIL, RePEc Author ID …
Linking hubs
Connect identifiers among authority files
VIAF - Libraries / algorithmically generated clusters
sameAs.org - Semantic Web / harvested coreference links from the web
Wikidata - Community / curated content
Wikidata
Knowledge base of Wikimedia projects
All kinds of entities
concepts, places, people, works …
Wikidata usage
Editable by anyone
via Website and API
via apps that use the API
Data available
Wikidata statements
Items
uniquely identified by an item-id (Q … )
having a label in each language
Statements
each with a property
uniquely identified with a property-id (P … )
and a value (string, date, item, external identifier … )
and optional qualifiers and references to sources
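A minimal sketch of this item/statement structure in Python, for illustration only. The class and field names are hypothetical and much simpler than Wikidata's actual data model; the sample values are the Neuschwanstein Castle example used later in this talk.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    property_id: str   # property-id, e.g. "P791" (ISIL)
    value: str         # string, date, item-id, external identifier ...
    qualifiers: dict = field(default_factory=dict)  # optional
    references: list = field(default_factory=list)  # optional sources

@dataclass
class Item:
    item_id: str                                    # item-id, e.g. "Q4152"
    labels: dict = field(default_factory=dict)      # language code -> label
    statements: list = field(default_factory=list)

    def values(self, property_id):
        """Return all values stated for one property."""
        return [s.value for s in self.statements if s.property_id == property_id]

castle = Item(
    item_id="Q4152",
    labels={"en": "Neuschwanstein Castle"},
    statements=[Statement("P791", "DE-MUS-051612")],
)
print(castle.values("P791"))
```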
Wikidata item example
This is a more condensed display of a Wikidata item, based on the "SQID" Wikidata browser.
Note the list of identifiers for this item/person, linking to other authority files.
Property URL pattern + inserted ID value = link
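The link construction can be sketched as follows. This assumes the URL pattern marks the position of the identifier with `$1` (the convention used in Wikidata's formatter URLs) and that the GND pattern is `https://d-nb.info/gnd/$1`; the function name and the sample ID are illustrative.

```python
from urllib.parse import quote

def build_link(url_pattern: str, identifier: str) -> str:
    """Insert an identifier value into a URL pattern containing "$1"."""
    if "$1" not in url_pattern:
        raise ValueError("URL pattern has no $1 placeholder")
    # Percent-encode the ID so that characters such as spaces cannot break the URL.
    return url_pattern.replace("$1", quote(identifier, safe=""))

# Assumed pattern for GND IDs; check the property's formatter URL on Wikidata.
GND_PATTERN = "https://d-nb.info/gnd/$1"
print(build_link(GND_PATTERN, "118540238"))
```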
Authority file identifiers in Wikidata
More than half of all Wikidata properties
Wikidata – ISIL (organizations)
Example:
Neuschwanstein Castle ( Q4152 )
ISIL ( P791 ): DE-MUS-051612
Current state:
lobid.org : ~15,000 ISIL (DACH only)
Wikidata: ~6,500 ISIL
Step 1: Upload ISIL list with names
Step 2: Confirm match candidates
GND – RePEc Author ID
In the EconBiz economics search portal authors are identified differently:
by GND ID in data from ZBW's Econis catalog
by RePEc Author ID in data from Research Papers for Economics
Large volumes: 450,000 vs. 50,000 distinct persons
~3,000 pairs of IDs discovered in a previous project
GND = well known and interlinked
RAS (RePEc Author Service) = high-quality data, curated by the authors themselves, incentivized by rankings - but: linked to nothing else in the world
Person identifiers in Wikidata and EconBiz
custom software, database, operations
access limited to ZBW staff
Utilizing Wikidata as linking hub
Wikidata-Properties for both identifier systems
GND ID ( P227 ): ~375,000 items which are humans
RePEc Short-ID ( P2428 ): ~2,200 items
Since every identifier should identify exactly one person, we can derive
GND ID ⟶ Wikidata ID ⟶ RePEc ID
RePEc ID ⟶ Wikidata ID ⟶ GND ID
where both properties have values (~760 items, as of 2017-04-25)
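A sketch of this derivation in Python. The SPARQL query is an assumed formulation for the Wikidata Query Service, and the sample rows are made up; the function only processes rows that have already been fetched.

```python
# Assumed query: items that carry both a GND ID and a RePEc Short-ID.
QUERY = """
SELECT ?item ?gnd ?repec WHERE {
  ?item wdt:P227 ?gnd ;
        wdt:P2428 ?repec .
}
"""

def derive_mappings(rows):
    """Build both lookup directions from (item, gnd, repec) rows.

    Items with more than one GND ID or RePEc ID are skipped, because the
    derivation only holds when each identifier names exactly one person.
    """
    gnd_ids, repec_ids = {}, {}
    for item, gnd, repec in rows:
        gnd_ids.setdefault(item, set()).add(gnd)
        repec_ids.setdefault(item, set()).add(repec)
    gnd_to_repec, repec_to_gnd = {}, {}
    for item in gnd_ids:
        if len(gnd_ids[item]) == 1 and len(repec_ids[item]) == 1:
            (gnd,) = gnd_ids[item]
            (repec,) = repec_ids[item]
            gnd_to_repec[gnd] = repec
            repec_to_gnd[repec] = gnd
    return gnd_to_repec, repec_to_gnd

# Made-up placeholder rows; real rows would come from running QUERY.
sample_rows = [
    ("item-1", "gnd-a", "pab1"),
    ("item-2", "gnd-b", "pcd2"),
    ("item-2", "gnd-c", "pcd2"),  # ambiguous: two GND IDs on one item
]
print(derive_mappings(sample_rows))
```

Skipping ambiguous items is a design choice of this sketch; such items are better reviewed by hand than mapped automatically.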
Step 1: Supplement WD items with RePEc Author IDs
Bulk editing with QuickStatements2
Further simplification with upcoming release of wdmapper command line tool
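A sketch of how input for such a bulk edit might be generated. It assumes the tab-separated QuickStatements command syntax (item, property, value, with string values in double quotes); the item IDs and RePEc IDs below are made-up placeholders.

```python
def identifier_commands(additions, property_id="P2428"):
    """Return one tab-separated command line per (item_id, identifier) pair."""
    lines = []
    for item_id, identifier in additions:
        # External identifiers are passed as quoted strings.
        lines.append(f'{item_id}\t{property_id}\t"{identifier}"')
    return "\n".join(lines)

# Placeholder pairs: a Wikidata item ID and the RePEc Short-ID to add to it.
additions = [("Q00000001", "pab1"), ("Q00000002", "pcd2")]
print(identifier_commands(additions))
```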
Step 2: Supplement WD items with GND IDs
384 WD items with RePEc Short-ID without GND ID
same process as other direction
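Finding the items to supplement can be sketched as a simple difference. The SPARQL query is an assumed formulation for the Wikidata Query Service; the function works on lookups that have already been fetched, and the sample data is made up.

```python
# Assumed query: items with a RePEc Short-ID but no GND ID.
QUERY_MISSING_GND = """
SELECT ?item ?repec WHERE {
  ?item wdt:P2428 ?repec .
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }
}
"""

def items_missing_gnd(repec_by_item, gnd_by_item):
    """Return the sorted item IDs that have a RePEc ID but no GND ID."""
    return sorted(item for item in repec_by_item if item not in gnd_by_item)

# Made-up placeholder data: item ID -> identifier.
repec_by_item = {"item-1": "pab1", "item-2": "pcd2", "item-3": "pef3"}
gnd_by_item = {"item-1": "gnd-a"}
print(items_missing_gnd(repec_by_item, gnd_by_item))
```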
Step 3: Add "most important" authors with RePEc identifiers
Step 4: Add "most important" authors with GND identifiers
18,000 GND authors with >30 publications in EconBiz
Transform and load into Mix'n'match ( GND economists (de) )
25% matched automatically with Wikidata items
Confirm match candidates (~ 1500)
Step 5: Rinse and repeat
Repeat Mix'n'match "sync" operation before starting to work manually
often, people are adding data at a fast rate!
Repeat bulk adding of missing identifiers to make use of complementing identifiers added meanwhile
Step 6: Add missing Wikidata items
Verify missing authors indeed are not in Wikidata (step 3 and 4)
Generate Wikidata item data from existing mappings (or lists, e.g., top female economists) in QuickStatements2 input format ( SPARQL query , script )
Synthesized Wikidata items from ZBW's GND – RePEc mapping
2179 new Wikidata items created
occupation "economist" from RePEc Author Service
gender and date of birth/death (if available) from GND
description from GND's info field and RePEc's "works for"
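A sketch of generating item-creation input from one mapping record. It assumes the QuickStatements `CREATE` / `LAST` command syntax and the Wikidata IDs P31 (instance of), Q5 (human), P106 (occupation) and Q188094 (economist), which should be verified before use. The record is made up, and gender and dates are left out for brevity.

```python
def create_item_commands(label, description, gnd_id, repec_id, lang="en"):
    """Return QuickStatements commands that create one new person item."""
    rows = [
        ("CREATE",),
        ("LAST", f"L{lang}", f'"{label}"'),        # label
        ("LAST", f"D{lang}", f'"{description}"'),  # description
        ("LAST", "P31", "Q5"),                     # instance of: human
        ("LAST", "P106", "Q188094"),               # occupation: economist (assumed ID)
        ("LAST", "P227", f'"{gnd_id}"'),           # GND ID
        ("LAST", "P2428", f'"{repec_id}"'),        # RePEc Short-ID
    ]
    return "\n".join("\t".join(row) for row in rows)

# Made-up record; real records would come from the GND - RePEc mapping.
print(create_item_commands("Jane Example", "economist", "gnd-a", "pab1"))
```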
Recommendations for item creation
Result: the mapping in Wikidata
4087 matching GND - RePEc author IDs
3081 matches from ZBW's mapping
1006 matches contributed by Wikidata users
468 additional RePEc author IDs
52,444 additional GND IDs (present in EconBiz)
(as of 2017-06-04)
Top 10% RAS and frequent (>30) GND in EconBiz
Further results
Mappings to other authorities for free (e.g., currently ~1550 RePEc ⟷ VIAF)
Identifiers and items inserted by individual Wikidata contributors add up continuously
Mapping steps can be repeated with additional data (e.g., top economists from Latin America )
Further identifiers (VIAF, ORCID … ) provide more opportunities for indirect matching
Results from every step in the mapping process and all individual efforts are immediately available
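Indirect matching through a further identifier can be sketched as a join on that identifier. This is an illustrative assumption about how such matching could work, not the method used in the project; the lookups and IDs are made up, and the results are only candidates that still need confirmation.

```python
def indirect_candidates(repec_to_viaf, viaf_to_gnd):
    """Propose RePEc -> GND candidates via a shared VIAF ID."""
    candidates = {}
    for repec_id, viaf_id in repec_to_viaf.items():
        gnd_id = viaf_to_gnd.get(viaf_id)
        if gnd_id is not None:
            candidates[repec_id] = gnd_id
    return candidates

# Made-up lookups: RePEc -> VIAF (e.g. from Wikidata), VIAF -> GND (e.g. from VIAF).
repec_to_viaf = {"pab1": "viaf-1", "pcd2": "viaf-2"}
viaf_to_gnd = {"viaf-1": "gnd-a"}
print(indirect_candidates(repec_to_viaf, viaf_to_gnd))
```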
Tools
Mix'n'match (system-supported intellectual matching)
QuickStatements2 (addition of generated properties and items)
wdmapper (harvest, diff & add mappings)
Tools for mass editing require an approved bot account.
Limitations
Mapping algorithms to find mapping candidates
Limitation to easy 1:1 relationships
part-whole
often new Wikidata items required
depends on the use case
Large sets of mappings and results
Regular review required for maintenance
We just relied on the basic mapping algorithm built into Mix'n'match and added already existing mappings to Wikidata
Each tool has its limitations on large result sets
Benefits
Outsourced interface, storage, and operation
Crowdsourced mapping maintenance
Wikidata has policies and tools for data quality
Open Data for multiple and unknown uses
Additional benefits:
multilingual Wikipedia links
lots of (formatted) data, nice pictures, …
links to multiple other vocabularies