# Software Design Course exercise sheet: part 1

The main teaching method in this course is the implementation and refinement of a not-too-small codebase that you will write yourself. We will have two rounds of review in which your code will be read by other students and you will read code written by other students. To streamline the review process, every student will implement the same data analysis pipeline using the same programming language.

## Overall goal of the data analysis pipeline

You are provided with data that was scraped from the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) and [ArXiv](https://arxiv.org) databases of academic publications. The data contains bibliographic records of all publications from 500 authors. For part 1 of the exercise, the goal is to aggregate this data by country and produce a table that lists, for each country, how many authors in the dataset are affiliated with institutions in that country.

## The tools we will use

The programming language we will use is [Python](https://python.org) (version 3.5 or above). It comes with an extensive [standard library](https://docs.python.org/3/library/) of functions and classes that can be used to perform every part of this exercise. Take note that the standard library includes, and you are encouraged to make use of, the [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) XML parser. While third-party modules are not needed for this part of the exercise, you are allowed to use them as long as:

- You document these dependencies carefully, for example in a `requirements.txt` file.
- You don't use any modules that implement large chunks of this exercise. For example, don't use pre-made PubMed and ArXiv parsers. However, you are free to draw inspiration from such modules for your own designs.

## Obtaining the data and description of its contents

The data can be downloaded from Google Drive here: https://drive.google.com/file/d/1DDu3wiRTO1vJl4kMyqrzUxdmqH1vl5VR/view?usp=sharing

The `.zip` file contains two folders, `pubmed/` and `arxiv/`, each containing 500 `.xml` files with the bibliographic data of the same 500 authors. The `.xml` files contain the raw data as obtained through the PubMed and ArXiv APIs. The two platforms use different conventions for their data. A summary of the relevant XML tags is given in the detailed instructions below.

## Detailed instructions for the data analysis

As stated in the summary, the end result of the data analysis pipeline should be a summary table of authors grouped by country. For each country, we are interested in the total number of authors that have one or more affiliations with institutions within that country. Here are a few example rows:

Country        | Number of Authors
---------------|------------------
United States  | 85062
China          | 5515
Finland        | 3717
Japan          | 3649
Germany        | 2777
France         | 2603
India          | 2393

In order to construct this table, you will have to parse all the XML files to retrieve the affiliations of all authors of all publications. We recommend using the [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) module in the Python standard library for this task.

### PubMed XML format

There are many XML tags used in the data returned by the PubMed API. For this exercise, you only need to concern yourself with a few of them:

```xml
<PubMedArticleSet>            <!-- Top-level element -->
  <PubMedArticle>             <!-- Start of data for a single article -->
    <MedlineCitation>
      <Article>               <!-- Relevant article data starts here -->
        <ArticleTitle>The title of the article</ArticleTitle>
        <AuthorList>          <!-- List of all authors -->
          <Author>            <!-- Data for a single author -->
            <LastName>The last name of the author</LastName>
            <ForeName>All forenames of the author</ForeName>
            <Initials>All initials of the author</Initials>
            <AffiliationInfo>
              <Affiliation>
                Names of all affiliations of the author
              </Affiliation>
            </AffiliationInfo>
          </Author>
          <Author>            <!-- Next author -->
            ...
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubMedArticle>
  <PubMedArticle>             <!-- Next article -->
    ...
  </PubMedArticle>
</PubMedArticleSet>
```

### ArXiv XML format

There are many XML tags used in the data returned by the ArXiv API. In this format, all XML elements are part of the `http://www.w3.org/2005/Atom` namespace. For this exercise, you only need to concern yourself with a few of them:

```xml
<feed>                        <!-- Top-level element -->
  <entry>                     <!-- Data for a single article -->
    <title>Title of the article</title>
    <author>                  <!-- Data for a single author -->
      <name>Full name of the author</name>
      <affiliation>Full names of the affiliations of the author</affiliation>
    </author>
    <author>                  <!-- Next author -->
      ...
    </author>
  </entry>
  <entry>                     <!-- Next article -->
    ...
  </entry>
</feed>
```
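To make the difference between the two formats concrete, here is a minimal sketch of how author names and affiliations could be pulled out of both kinds of files with `ElementTree`. The helper names (`pubmed_authors`, `arxiv_authors`) are ours, composing a PubMed name from `<ForeName>` and `<LastName>` is just one possible choice, and error handling is omitted:

```python
import xml.etree.ElementTree as ET

# ArXiv elements must be addressed with their Atom namespace prefix
ATOM = '{http://www.w3.org/2005/Atom}'

def pubmed_authors(filename):
    """Yield (name, list of affiliation strings) for each author in a PubMed file."""
    for author in ET.parse(filename).iter('Author'):
        name = (author.findtext('ForeName', '') + ' '
                + author.findtext('LastName', '')).strip()
        yield name, [a.text for a in author.iter('Affiliation')]

def arxiv_authors(filename):
    """Yield (name, list of affiliation strings) for each author in an ArXiv file."""
    for author in ET.parse(filename).iter(ATOM + 'author'):
        name = author.findtext(ATOM + 'name', '')
        yield name, [a.text for a in author.iter(ATOM + 'affiliation')]
```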
### Deduplicating authors

You will find that authors appear numerous times in the XML files, as they have co-authored multiple papers. Author names are not always consistently entered. For example, these all refer to the same author:

- Wile E. Coyote
- Wile Ethelbert Coyote
- W. E. Coyote
- WE Coyote
- Wilé E. Coyoté

Making sense of names from different cultures is a daunting task that is well outside the scope of this exercise. Here, we will make the following simplifying assumptions:

- The last word in a name is always the family name
- The capital letters in all words, excluding the last word, form the initials
- When comparing names, accented characters such as `é`, `ä`, `ö`, etc. match their Latin counterparts `e`, `a`, `o`, etc. See the implementation tips below for how to strip accents from characters in Python.
- Two names refer to the same author when both the family name and all initials match

According to the above rules, the following names do **not** match the names above:

- Wile Coyote
- W. Coyote
- Wile E. F. Coyote
- Coyote, Wile E.

We acknowledge that these rules do not cover all cases. For example, these names match:

- Wile Coyote
- William Coyote

but we're going to accept this for this exercise.
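One way to make these rules operational is to reduce every name to a comparison key; two names then refer to the same author exactly when their keys are equal. Here is a minimal sketch of that idea. The name `canonical_name` is ours, and the function relies on the `strip_accents` helper shown in the implementation tips below:

```python
def canonical_name(name):
    """Reduce a name to an (initials, family name) key under the rules above."""
    words = strip_accents(name).split()
    family_name = words[-1]
    # The capital letters of all words except the last form the initials
    initials = ''.join(c for word in words[:-1] for c in word if c.isupper())
    return (initials, family_name)

# All of these produce the key ('WE', 'Coyote'):
#   canonical_name('Wile E. Coyote'), canonical_name('W. E. Coyote'),
#   canonical_name('WE Coyote'), canonical_name('Wilé E. Coyoté')
```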
### Matching affiliations to countries

The data does not explicitly record a country for authors. Instead, it often provides author affiliations, which lie inside a country. However, we do not have an unambiguous list that maps institutions to the country they reside in. Instead, we are going to keep it simple:

- An author can have zero, one, or multiple affiliation tags.
- A single affiliation tag contains one or more affiliation names.
- Affiliation names are separated by a `;` mark in the affiliation string.
- If the name of a country appears inside an affiliation name, that is the country in which the affiliation is situated (see below for obtaining a list of all countries).
- If no country name appears in an affiliation name, we give up on trying to deduce the country from that affiliation name and move on to the next.
- If the affiliation string points to multiple countries, we count the corresponding author for all mentioned countries.

### Country names

A list of countries is provided in the `AltCountries.csv` file. For each country, the file lists the official country code, the country's commonly used name, and a tab-separated list of alternative names for that country. When looking for country names in an affiliation string, make sure to look for both the common name and all alternative names.

## Implementation tips

### Stripping accents from characters in Python

The [`unicodedata`](https://docs.python.org/3/library/unicodedata.html) module in the Python standard library offers several Unicode normalization schemes. By using them cleverly, a function can be derived that strips accents from characters (https://stackoverflow.com/questions/517923):

```python
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```

### Quick lookup using `set` and `dict`

Be aware that checking whether an item exists in a list (for example, whether an author with a given name is present in a list of authors) is very slow: the item is compared to every item in the list until a match is found. The `set` and `dict` data structures are much faster in this regard.

```python
# Slow: scans the list item by item
seen_authors_list = ['Road Runner', 'ACME', 'Anvil', 'Dynamite']
print('Wile E. Coyote' in seen_authors_list)

# Fast: hash-based lookup
seen_authors_set = {'Road Runner', 'ACME', 'Anvil', 'Dynamite'}
print('Wile E. Coyote' in seen_authors_set)

# Fast: hash-based lookup on the keys
seen_authors_dict = {'Road Runner': 1, 'ACME': 2, 'Anvil': 3, 'Dynamite': 4}
print('Wile E. Coyote' in seen_authors_dict)
```

### Store your parsed database

Parsing the database will likely take quite some time, so it makes sense to store the resulting data once you have the parsing working. The easiest option is probably to use [`pickle`](https://docs.python.org/3/library/pickle.html):

```python
import pickle

# store the data
with open('datafile', 'wb') as file:
    pickle.dump(data, file)

# load the data
with open('datafile', 'rb') as file:
    data = pickle.load(file)
```
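### Producing the final table

Putting the pieces together, the last step is to count, for each country, the deduplicated authors affiliated with it. The following is only a sketch: `authors` (a dict mapping each canonical author key to all of that author's affiliation strings) and `country_names` (a dict mapping each country's common name to all names it may appear under) are hypothetical structures you would build while parsing.

```python
from collections import Counter

def count_authors_by_country(authors, country_names):
    """Return a Counter mapping country -> number of affiliated authors."""
    counts = Counter()
    for author_key, affiliation_strings in authors.items():
        countries = set()  # count each author at most once per country
        for affiliation_string in affiliation_strings:
            for affiliation_name in affiliation_string.split(';'):
                for country, aliases in country_names.items():
                    if any(alias in affiliation_name for alias in aliases):
                        countries.add(country)
        counts.update(countries)
    return counts

# Print the summary table, most frequent country first
counts = count_authors_by_country(authors, country_names)
for country, n in counts.most_common():
    print('{:18} | {}'.format(country, n))
```

Note how the inner `set` implements the rules above: an author with several affiliations in the same country is counted once for that country, while an author with affiliations in several countries is counted for each of them.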