# How to Get Wikidata Using SPARQL <img src="https://upload.wikimedia.org/wikipedia/commons/0/0b/Wikidata_and_SPARQL.png" alt="Wikidata and SPARQL" width="50%"/> Introductions: Kat & Leigh disclaimer: I think this should be about two hours of content, but this is the first time I've given the workshop, so mileage might vary --- ## The Purpose of the Workshop This workshop will not make you an expert on wikidata or using SPARQL. The goal is to build your knowledge toolbelt so you can read and build simple and moderately complex queries, and then have the vocabulary that when you search for help on how to build your own complex queries, you're able to read and comprehend the answers that you find. I've found Stack Overflow very helpful in finding answers about queries using SPARQL and the Wikidata Query Service (WDQS). And, this is hands-on, so you will retain more if you're able to follow and type as we go along. --- ## What Do You Know Already? Wikidata, SPARQL, and the (WDQS) are built largely from other concepts. So, if you already have experience with things like SQL and writing computer programs/scripts, we can leverage that as transferable knowledge that we can apply here. This quick survey just lets us know how much time we might need to spend on certain subjects. [What is your experience with](https://www.mentimeter.com/app/presentation/alh1wbwd8cmhkzxqrpwpxkh16rh3s9zs/e65n9ytckhjb) - Wikidata - The structure of wikidata items - SPARQL - SQL - variables in a programming language (Python, R, etc) - comments in a programming language (Python, R, etc) - Boolean Operators (AND/OR) - Conditional statements (if/then) - using a function in a programming language (Python, R, etc) --- ## Overview of Wikidata Records - similar to wikipedia - user pages, etc - no articles, but records of items (and properties), which have a structure - number of items in wikidata vs number of articles in english wikipedia - items have unique Q number - all languages - statements - `property` -> `value` - properties have p number - `instance of` -> `human` - not trivial, although we might think it is - compare with `instance of` -> `fictional characters` - some properties can have more than one value - qualifiers - Items in wikidata are supposed to [adhere to a few restrictions](https://www.wikidata.org/wiki/Help:Items). They need to be one of the following - they can have a corresponding page on a Wikimedia site such as Wikipedia, Wikivoyage, Wikisource, Wikiquote, Wikiversity, or Wikimedia Commons, - be notable in some way, similar to wikipedia. The specific wording for this policy is currently, "a clearly identifiable conceptual or material entity that can be described using serious and publicly available references." but, there are often arguments amongst the editors over what denotes notibility... similar to wikipedia - or, the item can fulfill a structural need. - there is a page that covers [the notability policy](https://www.wikidata.org/wiki/Wikidata:Notability). --- ## Wikidata Query Service (WDQS) query wikidata - query service - query.wikidata.org - power of being able to items using their structure - query helper (close) - what is SPARQL - easier to describe once we see the syntax, so let's use an example - examples - cats - run - when this query is run, it outputs a list of items, based on the query - links take to item - describe the components of a query example - comments are prefaced with # - normally, especially with self-documenting code, your comments should be why you do something. Here, because the query statement is slightly obscure, you might want to use it to describe *what* you are doing. - variables are prefaced with `?` - variables can be any letter or number, but only an underscore as a special character. - should be meaningful - query line ends in full stop : `?item wdt:P31 wd:Q146 .` - they like to use the term "triples" for `subject predicate object` which is a hold-over from the RDF framework, or semantic triple, which codifies a set of three things (entities) as a statement: the common is subject-predicate-object, but it can be any three entity representation, such as entity-attribute-value. So, let's see this in action: - `?item` is the variable that will represent one or more items that meet the following statment - `wdt` and `wd` are prefixes that are used to help build statements for your query. - `wdt:` stands for, well, I've found one claim on Stackoverflow that comes out and says the `t` stands for *truthy*, but the background is quite a lot more - and it goes into [the data model for wikibase](https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer), which is the infrastructure that wikidata uses. There's actually this play on language and semantics, where they say, and I'm quoting here - > "One of the requirements is that *"Wikibase will not be about the truth, but about statements and their references."* This means that in Wikibase we do not actually model the items themselves, but statements about them. We do not say that Berlin has a population of 3,5 M, we say that there is this statement about Berlin's population being 3,5 M as of 2011 according to the German statistical office." - *kat's note: I'm not sure, but this might lead to cognitive overload* - there's actually a pretty good [blog post](https://www.bobdc.com/blog/the-wikidata-data-model-and-yo/) on this, and the semantics of truthy statements, but for the purposes of learning SPARQL I think it can be summarized by saying `wdt:` will indicate the property you're interested in (objects w/`P` at the beginning of the number), and `wd:` will indicate the known value of the property, which should be an item (object w/`Q` at the beginning of the number). - you will sometimes see other prefixes, so here is [a link to the known SPARQL prefixes](https://en.wikibooks.org/wiki/SPARQL/Prefixes) that the Wikidata Query Service (WDQS) recognizes - hover over item to see value - `ctrl-space` will allow you to search entities, but I don't find it very useful as a someone beginning to use the WDQS, and this is why: - there are over 100 million items on wikidata - these are the objects that have numbers that start with `Q`. Because the items can be linked, that means each of those 100 million items have the possibility to be a value for a property. - there are over [10,000 properties](https://www.wikidata.org/wiki/Wikidata:List_of_properties) - when trying to either contruct queries, or - describe what `SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language` is --- ## The Results Let's look at the results. We already saw how we can click on the item links, which are in the column **item**, and the label for each item record is listed in the **itemLabel** column. - number of results Let's look at the other parts - table vs graph - </> code - Download - Link --- ## Boolean Logic: `AND` now what we have an idea of how a simple query is formed, let's start building more complex queries. Boolean operators - the most common ones being `AND` and `OR` - help narrow or expand search results. For example, if we wanted a list of all item records of humans in wikidata, that would be pretty long. If we wanted a list of item records of real OR fictional humans, that would be even longer. Chances are the query would time out. So, we can narrow the search result by using the boolean logic `AND`. To do this, we just need to add a second statement referring to the variable `?item` demonstrate cats who are listed as female. Add `?item wdt:P21 wd:Q43445 . ` ## Challenge Human Academics Edit one of the WDQS example queries to make a list of all wikidata item records of humans who have an occupation as an academic. https://w.wiki/6JGk ``` # Occupations of Academics SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q5 . # instance of (P31) human (Q5) ?item wdt:P106 wd:Q3400985 . # occupation (P106) academic (Q3400985) SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language } ``` Simply adding the second line tells the query service that we want item records where they are instances of humans **AND** have an occupation of academic. --- ## How Do You Build Your Own Complex Query? ### or, how I learned to stop worrying and love sub-classes If you want to build a complex query, the easier way to do it isn't starting at the `ctrl-space` in the query service. Using that you need to have good knowledge of the structure of item records, including which properties are used. Remember, there are over ten-thousand properties, with potential values coming from millions of item records. So, how should you go about building a complex query? First, start with your question. #### What are all of the streets in LA County? Then, start by looking at one or more item records which represent well known answer(s) to that question. In this case, we'll identify three streets - Alvarado Street - Olvera Street - Wilshire Blvd Let's start by looking at the record for first on the list: Alvarado Street. And, we're doing this to see how it's described; how is it described as a street, and how is it described as being in LA County. It has `instance of` a `road`, and for the geography, the property is `located in the administrative territorial entity` with a value of `Los Angeles County`. Which is perfect - this is what we want to run a query to find. https://w.wiki/6JHu ``` # What are all of the streets in LA county? SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q34442 . ?item wdt:P131 wd:Q104994 . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } ``` And then we run the search, which gives us 115 results. 115? That's not many. Even using strictist notability rules, there are many more roads in LA County. So, why are there so few in the query results? Let's go down the list of the test items we identified to see if they are in the search results. While Alvarado Street is in the results, Olvera Street and Wilshire Boulevard are both missing. Both are notable, especially in Los Angeles, and so should have item records; let's look at those. Looking at Olvera Street, we see that it is an instance of a road, but the item record locates it in Downtown Los Angeles, and Los Angeles - and, looking at the LA record, this Los Angeles is the city, and not the county. Wilshire Boulevard is also listed as a road, but the only location listed is Beverly Hills. Looking at the item records of both Los Angeles and Beverly Hills, one finds that both of those item records list the administrative territorial entity as "Los Angeles County." Now, we've found a few things. First, that the claims for Wilshire Blvd are incomplete. It is over 15 miles long, running from nearly the Pacific Ocean in Santa Monica to Downtown Los Angeles; so, it's not only in Beverly Hills, but also in Santa Monica, City of LA. You might want to note that for later, if you decide to be an active editor. But, the more important thing that we discovered is how SPARQL queries return wikidata items. Queries only return exactly what you asked it to return. Since we asked it to return item records of roads that are listed in LA County, it only returned those records that have LA County directly recorded, not any road which are in a city in LA County, even though those roads are technically in LA County. So, how do we get those other records? We could try listing all of the possible items, but that seems excessive and onerous. It's also unneeded. These cities are already listed in LA County according to the structure of their record, so we can use that to revise our query. In order to include all items in a subclass of a property or item, we need to amend our query as so. ``` ?item wdt:P131*/wdt:P279* wd:Q104994 . ``` This tells the query service that we want both the classes and subclasses related to LA County as a territorial unit. We're adding two things; first, we're adding `P297` because we want subclasses of the original property, but we're also adding an asterisk, which is saying we want zero or more of this element. Meaning that it can return a subclass of a subclass of a subclass... etc. So, let's try the query now. https://w.wiki/6JLP ``` # What are all of the streets in LA County? SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q34442 . ?item wdt:P131*/wdt:P279* wd:Q104994 . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } ``` It's returned a few more - now it's at 139 - but the results are still far fewer than what we should expect. Let's check our known answers. And yes, it now returns Wilshire and Olvera. Let's try a few other well known streets. I'm going to start listing some, because I happen to have a deep subject knowledge of LA geography. - Huntington Drive - Colorado Blvd - Burbank Blvd - Victory Blvd These last two are missing from the list. Let's see if they're in Wikidata. Burbank Boulevard is fairly well known in the San Fernando Valley. But, it doesn't have an item record for some reason. But, when we search, one of the results is Warner Boulevard, which is where the main Warner Brothers Studios is located. Let's take a look at that record. Looking back at the search results, we see that it is missing. So, let's look at the record. The item record lists this boulevard as an instance of a street, not road, so let's take a look at its record, to see how it's classified. There we go, a street is a subclass of a road, so we need to amend our original statement, too ``` ?item wdt:P31*/wdt:P279* wd:Q34442 . ``` Now, we have 517 results, but when we sort them, we see there are a lot of duplicates - which are being returned because of some reason - but we can get rid of them by adding this term at the end of our query outside of the `where` statement: ``` GROUP BY ?item ?itemLabel ``` ``` # What are all of the streets in LA County? SELECT ?item ?itemLabel WHERE { ?item wdt:P31*/wdt:P279* wd:Q34442 . ?item wdt:P131*/wdt:P279* wd:Q104994 . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } GROUP BY ?item ?itemLabel ``` And now the results are 184. Let's look at another While wikidata won't give you an exhaustive list of roads and streets in Los Angeles County, it will allow you to add supplemental information about those which it knows about. Once you've created a good query, you can change one aspect to adapt it to a slightly different query, such as changing the county to look up streets for a different location. https://w.wiki/6Lvy --- ## Equivalent to Boolean Operator `OR` : `UNION` We should be familiar with the boolean operator *OR*. It's used often when you query a database and you want it to return different words that have similar meaning. The way queries are structured in SPARQL to return items that are, going back to the streets in LA county example, either in the administrative district or maintained by the city of los angeles, you'd use the operator `UNION`. What it does specifically, is connects two blocks, so the results are either from one block or another. ``` # What are all of the streets in LA County? SELECT ?item ?itemLabel WHERE { ?item wdt:P31*/wdt:P279* wd:Q34442 . { ?item wdt:P131*/wdt:P279* wd:Q104994 .} UNION { ?item wdt:P126*/wdt:P279* wd:Q65 . } SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } GROUP BY ?item ?itemLabel ``` --- ## Equivalent to Boolean Operator `NOT` : `MINUS` subclasses of roads, admin or maintained by LA: 186 results change to thoroughfare https://w.wiki/6JLg 261 results remove all instances of roads and subclasses of roads to see if there are items that are misclassified https://w.wiki/6JLh 75 results change to thoroughfare and subclasses doesn't add any meaningful records. --- ## Challenge: Human Academics https://w.wiki/6JHA ``` # List of Human Academics SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q5 . # instance of (P31) human (Q5) ?item wdt:P106 wd:Q3400985 . # occupation (P106) academic (Q3400985) ?item wdt:P27 wd:Q30 . # US citizen SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language } ``` https://w.wiki/6JHC ``` # List of Human Academics SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q5 . # instance of (P31) human (Q5) ?item wdt:P106 wd:Q3400985 . # occupation (P106) academic (Q3400985) MINUS { # remove every item record which has ?item wdt:P27 wd:Q30 . # US citizen } SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language } ``` --- ## Limit After the where block. `LIMIT 20` --- ## More Labels Lists are nice, but how do you get more information about the item records returned? We can use variables two ways: the first is the variable which represents the item records returned. We've seen this in a lot of https://w.wiki/6JMA ``` # Occupations of Academics SELECT ?item ?itemLabel ?occupation ?occupationLabel WHERE { ?item wdt:P31 wd:Q5 . # instance of (P31) human (Q5) ?item wdt:P106 wd:Q3400985 . # occupation (P106) academic (Q3400985) ?item wdt:P106 ?occupation . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language } ``` how to include those which don't have a variable in occupation add `OPTIONAL` keyword at front of where block where we want the label added, but it might not have a value assigned --- ## Filter based on linked item within item record *Which academics had a parent who is an academic?* build up, first item records which have a mother listed `?item wdt:P25 ?mother . ` then item records whose mother as a listed occupation of an academic `?mother wdt:P106 wd:Q3400985 .` add `UNION` with father record show two different `UNION` formations this times out ``` { ?item wdt:P22 ?father . } UNION { ?item wdt:P25 ?mother . } { ?father wdt:P106 wd:Q3400985 . } UNION { ?mother wdt:P106 wd:Q3400985 . } ``` this does not ``` { ?item wdt:P22 ?father . ?father wdt:P106 wd:Q3400985 . } UNION { ?item wdt:P25 ?mother . ?mother wdt:P106 wd:Q3400985 . } ``` Final query ``` # Occupations of Academics, whose parent was also an Academic SELECT ?item ?itemLabel ?occupation ?occupationLabel ?fatherLabel ?motherLabel WHERE { ?item wdt:P31 wd:Q5 . # instance of (P31) human (Q5) ?item wdt:P106 wd:Q3400985 . # occupation (P106) academic (Q3400985) ?item wdt:P106 ?occupation . { ?item wdt:P22 ?father . ?father wdt:P106 wd:Q3400985 . } UNION { ?item wdt:P25 ?mother . ?mother wdt:P106 wd:Q3400985 . } SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } # Helps get the label in your language, if not, then en language } ``` https://w.wiki/6LjZ --- ## Other Filters ### Number ranges Number range: first pull the variable, then filter based on that Select ?item ?dob [...] ?item wdt:P569 ?dob FILTER (YEAR{?dob) > 1950) ### Items that are missing values in a property https://w.wiki/6JsW --- ## Return Search Results Based on Text I'll use searches similar to this to see if an item exists. It's useful because I'm able to use it computationally, then store the results in a json file to evaluate later - https://w.wiki/6JrZ - https://github.com/kekoziar/wikidata/ --- ## Converting a Query List to Useable CSV File --- ## Where to Get Help - find examples - WDQS examples - also listed here https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples - [Read the Docs](https://www.w3.org/TR/2013/REC-sparql11-query-20130321/) - SPARQL has a large number of functions you can use to select and filter, such as YEAR() under the WDQS help submenu - request a query: https://www.wikidata.org/wiki/Wikidata:Request_a_query - archived queries: https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive - https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service - Stack Overflow - Ask me- workshop feedback https://forms.gle/CZzu75AjxKhmY1Bj7
{"metaMigratedAt":"2023-06-17T19:29:22.526Z","metaMigratedFrom":"YAML","title":"How to Get Wikidata Using SPARQL","breaks":true,"slideOptions":"{\"transition\":\"slide\",\"theme\":\"white\"}","contributors":"[{\"id\":\"8f316b4f-9ce2-4d07-beb7-71d5f949884c\",\"add\":30125,\"del\":9510},{\"id\":\"b6ba77cf-9caf-42fd-835a-e00ac9eddd35\",\"add\":14,\"del\":9}]"}
    623 views