# Adding your own stuff to Wikidata

Workshop at the SWIB conference 2018 in Bonn, November 26th (13-19h)

## Preparation

Tools:

* [Wikidata workshop page](https://www.wikidata.org/wiki/Wikidata:Events/SWIB_2018)
* This page: https://hackmd.io/s/rk32W1C3m#
* Pin board/cards/... to collect and note topics and tools

Handouts:

* https://www.wikidata.org/wiki/Wikidata:In_one_page (2 pages)

## Intro: Round of introductions _(Joachim)_

* Who are you and why are you here?
* What kind of data do you work with?

### Intro: Questions to the audience

* Who has queried Wikidata via the SPARQL endpoint?
* Who has already edited Wikidata?
* Who has used tools to add data to Wikidata?

## Basics (*Jakob, 13-14*)

A short introduction to Wikidata (*Jakob*)

### Basic Skills

* Editing items
* Finding properties
* Basic editing of wiki pages for discussion (e.g. `{{Q|...}}`)
* Understanding of the data model
  * items, properties, data types
  * qualifiers, references
  * constraints
  * expression in RDF

### Practice: Editing and querying a set of items

In teams of two:

* get a list of zoos (possibly restricted to a subset)
* add missing items (at least two each) and data (coordinates, dates...)

Useful links:

* https://tools.wmflabs.org/sqid/#/view?id=Q43501
* [Filter by latest additions](https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service/queries#Items_created_today?)
* [zoo items added lately](https://query.wikidata.org/#select%20%3Fitem%20%3FitemLabel%0Awhere%20%7B%0A%20%20%23%20restrict%20to%20some%20class%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ43501%20.%0A%20%20bind%28xsd%3Ainteger%28strafter%28str%28%3Fitem%29%2C%20%27Q%27%29%29%20as%20%3Fid%29%0A%20%20service%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%2Cde%2Cfr%2Ces%2Cpt%2Cfi%2Csv%2Cno%2Clv%2Cnl%2Cpl%22%20.%7D%0A%7D%0Aorder%20by%20desc%28%3Fid%20%29%0Alimit%2010) (shown in full below)
* Show different ways to make use of the data
  * https://wikitable.info/compare/zoos
  * http://histropedia.com/
  * ...
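For reference, this is the query behind the "zoo items added lately" link above: it sorts zoo items (instance of Q43501) by descending numeric Q number, so the most recently created items come first.

~~~
select ?item ?itemLabel
where {
  # restrict to some class
  ?item wdt:P31 wd:Q43501 .
  # numeric part of the Q number, used as a proxy for creation order
  bind(xsd:integer(strafter(str(?item), 'Q')) as ?id)
  service wikibase:label { bd:serviceParam wikibase:language "en,de,fr,es,pt,fi,sv,no,lv,nl,pl" .}
}
order by desc(?id )
limit 10
~~~

Replacing `wd:Q43501` with another class adapts the query to a different set of items.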
---- *short break* (~14:00)

## Mapping

### Mix-n-Match (*Joachim*)

* **Demo**: [GND persons (economists)](https://tools.wmflabs.org/mix-n-match/#/catalog/431)
  * different lists
  * look up one side or the other via link
  * multiple matches
  * '#' page
  * Google search - select WP article - WD link - set QID
  * create new item
  * [Prepared inserts](http://zbw.eu/beta/tmp/stw_qs_create.html) (from STW thesaurus) - along with Mix-n-Match
* **Hands on** (single or groups of 2):
  * work through the list of ["automatically matched" GND economists](https://tools.wmflabs.org/mix-n-match/#/list/431/auto) _[assign half a page to everybody]_
  * decide if a person can be confirmed as matched, or remove the match (don't just leave it)
  * Guideline: it's easier to merge than to split - don't add an ID to a well-defined item if you are not sure
  * Try to match some persons from the ["unmatched" list](https://tools.wmflabs.org/mix-n-match/#/list/431/unmatched) _[better not from the first page]_
* **Presentation/Demo**: Loading data into Mix-n-Match
  * example file ([GND economists](http://zbw.eu/beta/swib18_wikidata_examples/gnd_mix_n_match.tsv.txt))
    * sequence matters
    * condensed information in the description field (adapt to the order of WD entries)
  * [Import interface](https://tools.wmflabs.org/mix-n-match/import.php) (open Mix'n'match and log into WiDaR before)
  * [Hints for manual and script-based input file creation from SPARQL query results](https://github.com/jneubert/doc/wiki/Wikidata-Workflows#load-mixnmatch)

### Cocoda _(Jakob)_

Short demo of Cocoda with GND as example:

* <http://coli-conc.gbv.de/concordances/>
* <http://coli-conc.gbv.de/cocoda/>

Example: "Tuberkelbakterium" (GND-WD)

## Data modeling (Jakob)

![](http://format.gbv.de/img/gdm.jpg)

Basics

* Wikidata (meta)data model revisited

Where to go, look and ask

* Lists of properties
  * https://tools.wmflabs.org/hay/propbrowse/
  * https://www.wikidata.org/wiki/Template:Bibliographic_properties
  * ...
* SQID
* WikiProjects

How to extend and change the model

* Create/add/modify classes
* wdtaxonomy (see the query sketch after this list)
  * Try it out on your machine or on <https://paws.wmflabs.org/> in a new bash notebook, `PATH=$(npm bin):$PATH`
  * `npm install wikidata-taxonomy`
* Property proposals
* Bot requests (larger data ingests, e.g. performing arts) *mention only briefly*
* Mapping properties (example: CSL ontology)
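Before creating or modifying classes it helps to inspect the existing hierarchy. A minimal sketch of a subclass query, roughly what wdtaxonomy runs against the query service, here using the zoo class (Q43501) from the practice session:

~~~
# all direct and indirect subclasses of zoo (Q43501), with English labels
SELECT ?class ?classLabel WHERE {
  ?class wdt:P279* wd:Q43501 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
~~~

Once the package is installed, `wdtaxonomy Q43501` prints a similar hierarchy on the command line.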
Quality control and checks

* Queries
* Constraints
* Listeria

--- *coffee break* (~15:30) ---

## Larger data ingestion process _(Joachim)_

Companies/organizations from the [20th Century Press Archives](https://en.wikipedia.org/wiki/20th_Century_Press_Archives) (PM20) ([web site](http://webopac.hwwa.de/pressemappe20), [example company](http://purl.org/pressemappe20/folder/co/051838))

Makes use of a [SPARQL endpoint](http://zbw.eu/beta/sparql-lab/?endpoint=http://zbw.eu/beta/sparql/pm20/query&queryRef=https://api.github.com/repos/zbw/cdv2018-pressemappe20/contents/sparql/search_folders_by_text.rq) with data extracted from the application -> workflow based on LOD technologies

### Check for already existing items _(Presentation)_

#### Insert automatically derived mappings

* Using the [QuickStatements tool](https://tools.wmflabs.org/quickstatements/)
* Lookup and alignment via existing identifiers (GND) - [federated query producing statements](http://zbw.eu/beta/sparql-lab/?endpoint=https://query.wikidata.org/sparql&queryRef=https://api.github.com/repos/zbw/sparql-queries/contents/wikidata/missing_pm20_id_via_gnd.rq)
  * manual insert of a single ID
  * batch inserts
* Repeat the process from time to time

#### Mix-n-Match catalogs

* M-n-M companies ([international](https://tools.wmflabs.org/mix-n-match/#/catalog/622), [German](https://tools.wmflabs.org/mix-n-match/#/catalog/623))
* "Manually sync catalog" (_from_ Wikidata _to_ Mix-n-Match, not the other direction)
* lookup/disambiguation

### Excursus: Cleaning up along the way _(Demo)_

Merge and split items _(Please don't touch the examples!)_

* Duplicate Wikidata items<br />Example: [Bilfinger & Berger Bau AG](http://purl.org/pressemappe20/folder/co/003135) -> [Bilfinger SE](https://www.wikidata.org/wiki/Q284672) and [Bilfinger (Germany)](https://www.wikidata.org/wiki/Q30255514)<br/>Example: [Legal services](http://zbw.eu/stw/version/latest/descriptor/13389-2/about) -> in Wikidata: [legal advice](https://www.wikidata.org/wiki/Q220117) and [legal activities](https://www.wikidata.org/wiki/Q29584787)
  * check the type
  * check Wikipedia links
  * check incoming links
  * [Help:Merge](https://www.wikidata.org/wiki/Help:Merge)
* Messed-up existing Wikidata items<br/>Example: [Hutschenreuter](https://www.wikidata.org/wiki/Q430822)
  * Recommended process: create a new item, [merge content from the old one](https://www.wikidata.org/wiki/Special:MergeItems), clean up both
  * [Help:Split an item](https://www.wikidata.org/wiki/Help:Split_an_item)

### Model new entities to insert _(Group work, Jakob)_

* Modeling the class
* Modeling the fields
  * existence from/to
  * headquarters (how to deal with not-yet aligned fields?)
  * industry
* Rely on prior work (in this case, the [property list of WikiProject Companies](https://www.wikidata.org/wiki/Wikidata:WikiProject_Companies/Properties)) - see the query sketch after this list
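As a starting point for the modeling discussion, a query like the following can show which properties are commonly used on existing company items. This is an illustrative sketch, not part of the PM20 workflow; the class Q4830453 ("business") and the sample size of 500 items are assumptions chosen for speed.

~~~
# most frequently used properties on a sample of existing company items
SELECT ?prop (COUNT(*) AS ?uses) WHERE {
  { SELECT ?item WHERE { ?item wdt:P31 wd:Q4830453 } LIMIT 500 }
  ?item ?p ?value .
  ?prop wikibase:directClaim ?p .
}
GROUP BY ?prop
ORDER BY DESC(?uses)
LIMIT 25
~~~

The resulting property IDs can be looked up in the property browser or SQID mentioned above.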
### Data preparation and insertion _(Demo/Presentation)_

* Select the data to insert
  * _only_ de-duplicated data (keep track of the Mix-n-Match catalogs, possibly via transformation to a named graph - [script](https://github.com/zbw/sparql-queries/blob/master/bin/mnm2graph.pl))
  * exclude data where the external identifier already exists in Wikidata
  * align names/item labels to naming conventions (lower/upper case, drop the company's legal form)
  * source information (source even "trivial" statements - e.g., for the class - as provenance hints)
    * technical limitation: no source for the item itself, its labels and aliases
  * content of the description field (helpful for quick disambiguation in autosuggest)
    * special use: unformatted info ([example](https://www.wikidata.org/wiki/Q47163891)), useful for disambiguation and further extension
* Prepare the data _(Demo)_
  * Create and execute the [query](http://zbw.eu/beta/sparql-lab/?endpoint=http://zbw.eu/beta/sparql/pm20/query&queryRef=https://api.github.com/repos/zbw/sparql-queries/contents/pm20/companies_missing_in_wikidata.rq)
  * Transform via [script](https://github.com/zbw/sparql-queries/blob/master/bin/create_missing_wikidata.pl) to [statements](http://zbw.eu/beta/swib18_wikidata_examples/companies_missing_in_wikidata.2018-11-25.qs.txt) for [QuickStatements](https://tools.wmflabs.org/quickstatements/)
  * Insert data for a few example items
* Seek community consensus
  * (If you are not sure at all that your data fits in, you should probably discuss your ideas very early in the project)
  * Discuss your approach on the [Wikidata Project chat](https://www.wikidata.org/wiki/Wikidata:Project_chat) ([example](https://www.wikidata.org/wiki/Wikidata:Project_chat#Preparing_a_data_import_for_companies_from_the_first_half_of_the_20th_century))
  * Other places to ask: property talk pages, user talk pages, WikiProject talk pages
  * Apply for bot permission ([policy](https://www.wikidata.org/wiki/Wikidata:Bots), [request form](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot), [example](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/JneubertAutomated_3))
* Batch data insertion _(Demo)_
  * insert via [QuickStatements](https://tools.wmflabs.org/quickstatements/), using the bot account
  * use batch mode and keep track of batch IDs

### Gradual improvement of existing items

* e.g., adding the class (P31) for companies from PM20 ([query](http://zbw.eu/beta/sparql-lab/?endpoint=https://query.wikidata.org/sparql&queryRef=https://api.github.com/repos/zbw/sparql-queries/contents/wikidata/missing_class_via_pm20.rq))
* Repeating pattern (see the sketch after this list):
  * federated SPARQL query
  * exclude items where the property already exists
* further properties (e.g., location of headquarters, industry) can be added once the corresponding vocabularies are matched
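A minimal sketch of that repeating pattern against the Wikidata endpoint, assuming P4293 is the PM20 folder ID property; the linked query additionally federates with the PM20 endpoint to fetch the values to be added.

~~~
# items carrying a PM20 identifier but no class (P31) statement yet
SELECT ?item ?pm20Id WHERE {
  ?item wdt:P4293 ?pm20Id .                      # P4293 = PM20 folder ID (assumed)
  FILTER NOT EXISTS { ?item wdt:P31 ?anyClass }  # exclude items where the property already exists
}
LIMIT 100
~~~

The result can then be transformed into QuickStatements input, as in the batch insertion demo above.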
### Two more overall recommendations

Taken from Julia Beck, Beat Estermann, Birk Weiberg: [Wikidata Case Report: Ingesting Production Databases of the Performing Arts](https://www.wikidata.org/wiki/Wikidata:WikiProject_Performing_arts/Reports/Ingesting_Production_Databases_of_the_Performing_Arts) (2018):

* Always keep track of your work steps, both for yourself (as the data ingestion process may take longer than expected, and in case of errors you may need to repeat some of the steps) and for posterity, as other people may want to ingest similar datasets.
* Document your reflections regarding the various data modelling issues encountered. This is useful both in the context of property discussions and in view of future ingests or data cleansing activities on Wikidata.

---- *short break (approx. 17:00)*

## Manual alignment: OpenRefine _(Jakob)_

After a short [Introduction to OpenRefine](https://librarycarpentry.org/lc-open-refine/01-introduction/index.html) we will focus on key features specific to working with Wikidata, especially *importing* data. This can only scratch the surface of what is possible!

1. Introduction
   * [Importing data](https://librarycarpentry.org/lc-open-refine/02-importing-data/index.html): we focus on Google spreadsheets
   * [Facets and filters](https://librarycarpentry.org/lc-open-refine/04-faceting-and-filtering/index.html): we will focus on filtering to show subsets
   * *Thanks to [Antonin Delpeuch](https://github.com/wetneb) for bringing OpenRefine and Wikidata together!*
2. Reconcile with Wikidata: [Wikidata Reconciliation Service]
   * Reconcile via name and type
   * Reconcile via identifier (e.g. ISSN, ORCID, GND...)
   * Reconcile via name and additional properties
   * Example data (journal, ISSN):
     * Code4Lib Journal, 1940-5758
     * International Journal of Digital Library Services, 2250-1142
3. Augment data with data from Wikidata
   * *Edit column > Add column based on this column...* (add name as string)
   * *Edit column > Add columns from reconciled values...* (for instance URL)
4. Edit Wikidata
   * Demo example:
     * add homepage http://www.ijodls.in/
     * select "new item"
   * *create data model / Wikidata schema*
   * Export to QuickStatements (for preview)
   * Upload to Wikidata (*cross fingers!*)

**Task:** Add Code4Lib bibliographic data. Requires OpenRefine installed. Groups of 2-3 people.

- Draft a data model (5 minutes + 5 minutes to agree on it and document it)
- Allocate work (groups)
- Create items! (15 minutes)
- Wrap-up (10 minutes)

**Resources:**

* [Library Carpentry: OpenRefine](https://librarycarpentry.org/lc-open-refine/) *--- we start with "Advanced OpenRefine functions"*
* [Wikidata:Tools/OpenRefine](https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine) -- *several subpages, tutorials, screencasts...*
* [Wikidata Reconciliation Service]
* [QuickStatements]

[QuickStatements]: https://tools.wmflabs.org/quickstatements/
[Wikidata Reconciliation Service]: https://tools.wmflabs.org/openrefine-wikidata/

--- *short break*

## Further tools _(Jakob)_

### QuickStatements

* [CSV to QuickStatements](https://tools.wmflabs.org/ash-django/csv2qs/)

Another example: bibliographic data from the Code4Lib Journal

~~~
SELECT ?article ?articleLabel WHERE {
  ?article wdt:P1433 wd:Q27042382 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
~~~

<https://tools.wmflabs.org/scholia/venue/Q27042382>

### SourceMD

[SourceMD](https://www.wikidata.org/wiki/Wikidata:SourceMD) is an example of a domain-specific tool for importing data into Wikidata. Similar tools can be developed with Wikidata programming libraries.

### wikidata-cli

In particular, see https://github.com/maxlath/wikidata-edit/blob/master/docs/how_to.md#create-entity

    wd set-description $QID en "journal article"

### Programming libraries

* [Wikidata Integrator](https://github.com/SuLab/WikidataIntegrator) + Pywikibot (Python)
* wikidata-sdk/wikidata-cli (JavaScript)
* Wikidata-Toolkit (Java)
* wdmapper, wdtaxonomy...

### Access tools

* [Wikipedia Tools for Google Spreadsheets](https://chrome.google.com/webstore/detail/wikipedia-and-wikidata-to/aiilcelhmpllcgkhhpifagfehbddkdfp)
* ...
### Custom tool for mapping from Wikidata items to your own concepts

* [WD items, selected by RePEc ID, to GND](http://zbw.eu/beta/sparql-lab/?endpoint=https://query.wikidata.org/bigdata/namespace/wdq/sparql&queryRef=https://api.github.com/repos/zbw/sparql-queries/contents/wikidata/missing_property.rq)<br />Example: Ekaterina Zhuravskaya<br />Example: Kim Hawtrey, Jason Barr (duplicate?)

### Import from Zotero

* [Zotkat](https://github.com/UB-Mannheim/zotkat) exports Zotero resources in QuickStatements format for import into Wikidata

# Links

* https://github.com/jneubert/doc/wiki/Wikidata-Links