<!-- Adjust the default reveal.js style -->
<style type="text/css">
@import "https://hackmd.io/build/reveal.js/css/theme/white.css";
.reveal h1, .reveal h2, .reveal h3 { text-align: left; text-transform: none }
.reveal p, .reveal ul, .reveal ol { text-align: left; display: block; }
.reveal section img { border: none; box-shadow: none; }
</style>

<!-- At https://hackmd.io/p/S1YmXWC0e the slides are shown in presentation mode with reveal.js. Press "S" to show the speaker notes. -->

# Wikidata as authority linking hub

*Joachim Neubert (ZBW - Leibniz Information Centre for Economics, Kiel/Hamburg)*

*Jakob Voß (Verbundzentrale des GBV, Göttingen)*

ELAG Athens, June 8, 2017

`Online version (with links): https://hackmd.io/p/S1YmXWC0e`

---

# Introduction

---

## Authority files

Consistently refer to entities

* Via identifier ("things, not strings")
* GND, MeSH, STW, ISIL, RePEc Author ID...

---

## Linking hubs

Connect identifiers among authority files

* `owl:sameAs`, `skos:exactMatch`, `skos:closeMatch`...
* [VIAF](http://viaf.org), sameAs.org, [Wikidata](https://wikidata.org)...

![](https://i.imgur.com/dgRaN33.png)

Note:
Two general models: 1:1 or hub. Differences between hubs

* VIAF - libraries / algorithmically generated clusters
* sameAs.org - Semantic Web / coreference links harvested from the web
* Wikidata - community / curated content

---

## Wikidata

* Knowledge base of Wikimedia projects
* All kinds of entities
  * concepts, places, people, works...

---

## Wikidata usage

* Editable by anyone
  * via website and API
  * via apps that use the API
* Data available
  * http://query.wikidata.org/ (SPARQL)
  * JSON API & database dumps

---

## Wikidata statements

![](https://upload.wikimedia.org/wikipedia/commons/c/ce/Wikidata_statement_with_ids.svg)

Note:
Wikidata consists of

* Items
  * uniquely identified by an item ID (Q...)
  * having a label in each language
* Statements
  * each with a property
  * uniquely identified by a property ID (P...)
  * and a value (string, date, item, external identifier...)
  * and optional qualifiers and references to sources

---

## Wikidata item example

![](https://i.imgur.com/VNaeUMK.jpg)

Note:
On the example

* This is a more condensed display of a Wikidata item, based on the "SQID" Wikidata browser.
* Note the list of identifiers for this item/person, linking to other authority files.
* Property URL pattern + inserted ID value = link

---

## Authority file identifiers in Wikidata

More than half of all Wikidata properties

* Datatype external identifier (~1,750)
* [Properties for authority control](https://www.wikidata.org/wiki/Q18614948) (~1,500)
* Properties with corresponding KOS (~220)

Note:
The total number is hard to measure because of the "organic" organization of Wikidata

---

# Wikidata---ISIL (organizations)

Example: *Neuschwanstein Castle* ([Q4152](https://www.wikidata.org/wiki/Q4152))

ISIL ([P791](https://www.wikidata.org/wiki/Property:P791)): *DE-MUS-051612*

Current state:

* lobid.org: ~15,000 ISIL (DACH only)
* Wikidata: ~6,500 ISIL

---

## Tool: *Mix'n'match*

* Web application mapping tool
* Helps to add 1-to-1 mappings

https://tools.wmflabs.org/mix-n-match/

---

## Step 1: Upload ISIL list with names

![](https://i.imgur.com/Sh89f3l.png)

---

## Step 2: Confirm match candidates

![Automatically matched](https://i.imgur.com/st847ta.png)

---

![Visual Mode](https://i.imgur.com/9WWCrCa.png)

Note:
The visual mode is even more convenient for confirming mappings. The ISIL mapping is finished for German museums (lobid-museums).
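---

### Checking the result with SPARQL

A minimal query sketch (runnable at http://query.wikidata.org/, where the `wdt:` and `wikibase:` prefixes are predefined) that lists all items carrying an ISIL, e.g. to compare coverage with the lobid.org list:

```sparql
# All items with an ISIL (P791), plus a human-readable label
SELECT ?item ?itemLabel ?isil WHERE {
  ?item wdt:P791 ?isil .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
}
```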
---

# GND---RePEc Author ID

* In the [EconBiz](http://www.econbiz.de) economics search portal, authors are identified differently:
  * by **GND ID** in data from ZBW's *Econis* catalog
  * by **RePEc Author ID** in data from *Research Papers in Economics*
* Large volumes: 450,000 vs. 50,000 distinct persons
* ~3,000 pairs of IDs discovered in a previous project

Note:
Characteristics

* GND = well known and interlinked
* RAS = high quality data, curated by the authors themselves, incentivized by rankings - but linked to nothing else in the world

---

### Person identifiers in Wikidata and EconBiz

![](https://i.imgur.com/1J7I3aN.jpg)

Note:
Maintaining a custom mapping environment would require

* custom software, database, operations
* access limited to ZBW staff

---

## Utilizing Wikidata as linking hub

* Wikidata properties for both identifier systems
  * GND ID ([P227](https://www.wikidata.org/wiki/Property:P227)): ~375,000 items which are humans
  * RePEc Short-ID ([P2428](https://www.wikidata.org/wiki/Property:P2428)): ~2,200 items
* Since every identifier should identify exactly one person, we can derive
  * GND ID ⟶ Wikidata ID ⟶ RePEc ID
  * RePEc ID ⟶ Wikidata ID ⟶ GND ID

  where both properties have values (~760 items, as of 2017-04-25); see the query sketch on the next slide

Note:
I describe the process in 6 steps, which should be portable to other similar mappings. All numbers above as of April 25th, 2017
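---

### Harvesting the derived mapping

A sketch of the derivation as a SPARQL query; the production queries linked in the following steps do more bookkeeping. Every item holding both identifiers yields one GND ⟷ RePEc pair:

```sparql
# GND <-> RePEc pairs from items that carry both identifiers
SELECT ?item ?gndId ?repecId WHERE {
  ?item wdt:P227  ?gndId ;    # GND ID
        wdt:P2428 ?repecId .  # RePEc Short-ID
}
```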
---

## Step 1: Supplement WD items with RePEc Author IDs

* 77 WD items with GND ID but without RePEc Short-ID
* Transform to *[QuickStatements2](https://tools.wmflabs.org/quickstatements/)* input file ([SPARQL query](https://github.com/zbw/repec-ras/blob/ELAG2017/sparql/missing_ids_in_wikidata_from_mapping.rq), [script](https://github.com/zbw/repec-ras/blob/ELAG2017/bin/create_missing_ids_in_wikidata_from_mapping.pl))
* Copy & paste to *QuickStatements2*

---

### Bulk editing with *QuickStatements2*

![](https://i.imgur.com/yYx3Vxn.png)

Further simplification with the upcoming release of the *[wdmapper](https://wdmapper.readthedocs.io/)* command line tool

---

## Step 2: Supplement WD items with GND IDs

* 384 WD items with RePEc Short-ID but without GND ID
* *Same process as in the other direction*

---

## Step 3: Add "most important" authors with RePEc identifiers

* Scraped from ranking pages ([Top 10% economists](https://ideas.repec.org/top/top.person.all.html), [Top 10% female economists](https://ideas.repec.org/top/top.women.html))
* Transform and load into *Mix'n'match* ([RePEc Top](https://tools.wmflabs.org/mix-n-match/#/catalog/406), [RePEc Top Female](https://tools.wmflabs.org/mix-n-match/#/catalog/404))
* *Same process as in the ISIL use case*
* Confirm match candidates (~1,600 of 4,600)

---

## Step 4: Add "most important" authors with GND identifiers

* 18,000 GND authors with >30 publications in EconBiz
* Transform and load into *Mix'n'match* ([GND economists (de)](https://tools.wmflabs.org/mix-n-match/#/catalog/431))
* 25% matched automatically with Wikidata items
* Confirm match candidates (~1,500)

---

## Step 5: Rinse and repeat

* Repeat the *Mix'n'match* "sync" operation before starting to work manually
  * often, people are adding data at a fast rate!
* Repeat the bulk adding of missing identifiers to make use of complementary identifiers added in the meantime

---

## Step 6: Add missing Wikidata items

* Verify that missing authors are indeed not in Wikidata (steps 3 and 4)
* Generate Wikidata item data from existing mappings (or lists, e.g., top female economists) in *QuickStatements2* input format ([SPARQL query](https://github.com/zbw/repec-ras/blob/ELAG2017/sparql/ras_missing_in_wikidata.rq), [script](https://github.com/zbw/repec-ras/blob/ELAG2017/bin/create_missing_wikidata.pl))

<!-- ![Using Wikidata's QuickStatements tool](https://i.imgur.com/h95sTQl.png) -->

---

### Synthesized Wikidata items from ZBW's GND--RePEc mapping

2,179 new Wikidata items created

* occupation "economist" from RePEc Author Service
* gender and date of birth/death (if available) from GND
* description from GND's info field and RePEc's "works for"
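---

### Sketch: *QuickStatements* input for a new item

A hypothetical *QuickStatements* (version 1 format) batch creating one such item. All names and identifier values are invented; columns must be tab-separated (shown with spaces here for readability), and the source columns recommended on the next slide are omitted for brevity. `CREATE` starts a new item, `LAST` refers to it, `Len`/`Den` set the English label and description, P31 Q5 declares a human, P106 Q188094 the occupation economist, P21 Q6581072 female, P569 a date of birth with year precision (`/9`), and P227/P2428 carry the two identifiers as strings:

```
CREATE
LAST  Len   "Jane Doe"
LAST  Den   "economist (University of Anywhere)"
LAST  P31   Q5
LAST  P106  Q188094
LAST  P21   Q6581072
LAST  P569  +1956-00-00T00:00:00Z/9
LAST  P227  "123456789X"
LAST  P2428 "pdo42"
```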
---

### Recommendations for item creation

* Explain your plan and ask for feedback in the [Wikidata project chat](https://www.wikidata.org/wiki/Wikidata:Project_chat)
* Pay attention to [Wikidata's notability criteria](https://www.wikidata.org/wiki/Wikidata:Notability)
* [Apply for a bot account](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot) to make mass edits ([example](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/JneubertAutomated_2))
* Use *QuickStatements2* (more convenient input format, batch mode, sources)
* Source every statement ([hints](https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat/Archive/2017/05#Source_statements_for_items_synthesized_from_authorities_-_recommendations.3F))

---

## Result: the mapping in Wikidata

* 4,087 matching GND and RePEc author IDs
  * 3,081 matches from ZBW's mapping
  * 1,006 matches contributed by Wikidata users
* 468 additional RePEc author IDs
* 52,444 additional GND IDs (present in EconBiz)

(as of 2017-06-04)

Note:
This looks even better if we focus on the "most important" persons.

---

### Top 10% RAS and frequent (>30) GND in EconBiz

![](https://i.imgur.com/LC3HQOb.jpg)

Note:
The mapping currently covers more than 60% of RePEc's Top 10% Economists.

---

### Further results

* Mappings to other authorities *for free* (e.g., currently ~1,550 RePEc ⟷ VIAF; see the query sketch on the next slide)
* Identifiers and items inserted by individual Wikidata contributors add up continuously
* Mapping steps can be repeated with additional data (e.g., [top economists from Latin America](https://ideas.repec.org/top/top.lamerica.html#authors))
* Further identifiers (VIAF, ORCID...) provide more opportunities for indirect matching

Results from _every step in the mapping process_ and all individual efforts are immediately available
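---

### Sketch: RePEc ⟷ VIAF *for free*

The by-product mapping can be harvested the same way as the GND--RePEc pairs; in this sketch, P214 (VIAF ID) simply takes the place of P227:

```sparql
# RePEc <-> VIAF pairs that emerge from items carrying both identifiers
SELECT ?repecId ?viafId WHERE {
  ?item wdt:P2428 ?repecId ;   # RePEc Short-ID
        wdt:P214  ?viafId .    # VIAF ID
}
```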
{"title":"Wikidata as authority linking hub","lang":"en","slideOptions":{"README":"https://github.com/hakimel/reveal.js/#configuration","slideNumber":false,"showNotes":false,"theme":"white"}}