# Software Design Course exercise sheet: part 1
The main teaching method in this course is the implementation and refinement of a moderately sized codebase that you will write yourself.
We will have two rounds of review in which your code will be read by other students and you will read code written by other students.
To streamline the review process, every student will implement the same data analysis pipeline using the same programming language.
## Overall goal of the data analysis pipeline
You are provided with data that was scraped from the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) and [ArXiv](https://arxiv.org) databases of academic publications.
The data contains bibliographic records of all publications from 500 authors.
For part 1 of the exercise, the goal is to aggregate this data by country and produce a table that lists for each country, how many authors in the dataset are affiliated with institutions in that country.
## The tools we will use
The programming language we will use is [Python](https://python.org) (version 3.5 or above).
It comes with an extensive [standard library](https://docs.python.org/3/library/) of functions and classes that can be used to perform every part of this exercise.
Note that the standard library includes the [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) XML parser, which you are encouraged to make use of.
While third party modules are not needed for this part of the exercise, you are allowed to use them as long as:
- You document these dependencies carefully, for example in a `requirements.txt` file.
- You don't use any modules that implement large chunks of this exercise. For example, don't use pre-made PubMed and ArXiv parsers. However, you are free to draw inspiration from such modules for your own designs.
## Obtaining the data and description of its contents
The data can be downloaded from Google Drive here: https://drive.google.com/file/d/1DDu3wiRTO1vJl4kMyqrzUxdmqH1vl5VR/view?usp=sharing
The `.zip` file contains two folders, `pubmed/` and `arxiv/`, each holding 500 `.xml` files with the bibliographic data of the same 500 authors.
The `.xml` files contain the raw data as obtained through the PubMed and ArXiv APIs.
The two platforms use different conventions for their data.
A summary of the relevant XML tags is given in the detailed instructions below.
## Detailed instructions for the data analysis
As stated in the summary, the end result of the data analysis pipeline should be a summary table of authors grouped by country.
For each country, we are interested in the total number of authors that have one or more affiliations with institutions within that country.
Here are a few example rows:
Country | Number of Authors
-------------------|------------------
United States | 85062
China | 5515
Finland | 3717
Japan | 3649
Germany | 2777
France | 2603
India | 2393
In order to construct this table, you will have to parse all the XML files to retrieve the affiliations of all authors of all publications.
We recommend using the [`ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) module in the Python standard library for this task.
### PubMed XML format
There are many XML tags used in the data returned by the PubMed API.
For this exercise, you only need to concern yourself with a few of them:
```xml
<PubMedArticleSet> <!-- Top-level element -->
  <PubMedArticle> <!-- Start of data for a single article -->
    <MedlineCitation>
      <Article> <!-- Relevant article data starts here -->
        <ArticleTitle>The title of the article</ArticleTitle>
        <AuthorList> <!-- List of all authors -->
          <Author> <!-- Data for a single author -->
            <LastName>The last name of the author</LastName>
            <ForeName>All forenames of the author</ForeName>
            <Initials>All initials of the author</Initials>
            <AffiliationInfo>
              <Affiliation>
                Names of all affiliations of the author
              </Affiliation>
            </AffiliationInfo>
          </Author>
          <Author> <!-- Next author -->
            ...
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubMedArticle>
  <PubMedArticle> <!-- Next article -->
    ...
  </PubMedArticle>
</PubMedArticleSet>
```
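As a starting point, here is a minimal sketch of how such a file could be walked with `ElementTree`. The tag names follow the summary above; the function name `parse_pubmed_file` is only an illustration, not a prescribed interface:

```python
import xml.etree.ElementTree as ET

def parse_pubmed_file(filename):
    """Yield (last_name, fore_name, affiliations) for every author in one file."""
    root = ET.parse(filename).getroot()  # <PubMedArticleSet>
    for article in root.iter('Article'):
        for author in article.iter('Author'):
            last_name = author.findtext('LastName', default='')
            fore_name = author.findtext('ForeName', default='')
            # Each <Author> may carry zero or more <Affiliation> tags.
            affiliations = [aff.text or '' for aff in author.iter('Affiliation')]
            yield last_name, fore_name, affiliations
```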
### ArXiv XML format
There are many XML tags used in the data returned by the ArXiv API.
In this format, all XML elements are part of the `http://www.w3.org/2005/Atom` namespace.
For this exercise, you only need to concern yourself with a few of them:
```xml
<feed> <!-- Top-level element -->
  <entry> <!-- Data for a single article -->
    <title>Title of the article</title>
    <author> <!-- Data for a single author -->
      <name>Full name of the author</name>
      <affiliation>Full names of the affiliations of the author</affiliation>
    </author>
    <author> <!-- Next author -->
      ...
    </author>
  </entry>
  <entry> <!-- Next article -->
    ...
  </entry>
</feed>
```
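Because all elements live in the Atom namespace, every tag name passed to `ElementTree` must carry the namespace prefix. A minimal sketch (again, `parse_arxiv_file` is only an illustration):

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'  # namespace prefix for all tags

def parse_arxiv_file(filename):
    """Yield (name, affiliations) for every author in one file."""
    root = ET.parse(filename).getroot()  # <feed>
    for entry in root.iter(ATOM + 'entry'):
        for author in entry.iter(ATOM + 'author'):
            name = author.findtext(ATOM + 'name', default='')
            affiliations = [aff.text or ''
                            for aff in author.iter(ATOM + 'affiliation')]
            yield name, affiliations
```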
### Deduplicating authors
You will find that authors appear numerous times in the XML files, as they have co-authored multiple papers.
Author names are not always consistently entered.
For example, these all refer to the same author:
- Wile E. Coyote
- Wile Ethelbert Coyote
- W. E. Coyote
- WE Coyote
- Wilé E. Coyoté
Making sense of names from different cultures is a daunting task that is well outside the scope of this exercise.
Here, we will make the following simplifying assumptions:
- The last word in a name is always the family name
- The capital letters in all words, excluding the last word, form the initials
- When comparing names, accented characters such as `é`, `ä`, `ö`, etc. match their Latin counterparts `e`, `a`, `o`, etc. See the implementation tips below for how to strip accents from characters in Python.
- Two names refer to the same author when both the family name and all initials match
According to the above rules, the following names do **not** match the names above:
- Wile Coyote
- W. Coyote
- Wile E. F. Coyote
- Coyote, Wile E.
We acknowledge that these rules do not cover all cases. For example, these names match:
- Wile Coyote
- William Coyote
but we accept this limitation for this exercise.
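One way to implement these rules is to reduce every name to a normalized key and compare the keys. A minimal sketch, relying on the `strip_accents` helper from the implementation tips below:

```python
def name_key(full_name):
    """Reduce a name to an (initials, family name) key under the rules above.

    'Wile Ethelbert Coyote', 'W. E. Coyote' and 'Wilé E. Coyoté'
    all map to ('WE', 'Coyote').
    """
    words = strip_accents(full_name).split()
    if not words:
        return '', ''
    family_name = words[-1]  # the last word is always the family name
    # Collect the capital letters of every word except the last one.
    initials = ''.join(c for word in words[:-1] for c in word if c.isupper())
    return initials, family_name
```

Two names then refer to the same author exactly when their keys are equal, which also makes the keys suitable for the `set`/`dict` lookups discussed in the implementation tips.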
### Matching affiliations to countries
The data does not explicitly record a country for each author.
Instead, it often provides author affiliations, each of which is located in some country.
However, we do not have an unambiguous list that maps institutions to their countries, so we are going to keep it simple:
- An author can have zero, one, or multiple affiliation tags.
- A single affiliation tag contains one or more affiliation names.
- Affiliation names are separated by a `;` mark in the affiliation string.
- If the name of a country appears inside an affiliation name, that is the country in which the affiliation is situated (see below for obtaining a list of all countries).
- If no country name appears in an affiliation name, we give up on trying to deduce the country from the affiliation name and move on to the next.
- If the affiliation string points to multiple countries, we count the corresponding author for all mentioned countries.
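A minimal sketch of these rules, assuming a `country_lookup` dict that maps every known country name to that country's common name (built from the file described in the next section):

```python
def countries_of_author(affiliation_tags, country_lookup):
    """Return the set of countries mentioned in an author's affiliation tags."""
    countries = set()
    for tag in affiliation_tags:
        # A single tag may hold several affiliation names, separated by ';'.
        for affiliation_name in tag.split(';'):
            for known_name, common_name in country_lookup.items():
                if known_name in affiliation_name:
                    countries.add(common_name)
    return countries
```

Affiliation names in which no country name is found simply contribute nothing to the result, matching the "give up and move on" rule above.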
### Country names
A list of countries is provided in the `AltCountries.csv` file.
For each country, the file lists the official country code, the country's commonly used name, and a tab-separated list of alternative names for that country.
When looking for country names in an affiliation string, make sure to look for both the common name and all alternative names.
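Here is a sketch of how such a lookup table could be built with the [`csv`](https://docs.python.org/3/library/csv.html) module from the standard library. It assumes that each row holds exactly three comma-separated fields, with the alternative names separated by tabs inside the third field; verify this against the actual file:

```python
import csv

def load_country_lookup(filename='AltCountries.csv'):
    """Map every known spelling of a country to the country's common name."""
    lookup = {}
    with open(filename, newline='', encoding='utf-8') as f:
        for code, common_name, alternatives in csv.reader(f):
            lookup[common_name] = common_name
            for alt in alternatives.split('\t'):
                if alt:  # skip empty entries
                    lookup[alt] = common_name
    return lookup
```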
## Implementation tips
### Stripping accents from characters in Python
The [`unicodedata`](https://docs.python.org/3/library/unicodedata.html) module in the Python standard library offers several Unicode normalization schemes.
Using these, you can write a function that strips accents from characters (see https://stackoverflow.com/questions/517923):
```python
import unicodedata

def strip_accents(s):
    # Decompose accented characters (NFD), then drop the combining marks ('Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```
### Quick lookup using `set` and `dict`
Be aware that checking whether an item exists in a list (for example, whether an author with a given name is present in a list of authors) is slow:
the item is compared to every item in the list until a match is found, so the lookup time grows with the length of the list.
The `set` and `dict` data structures are much faster in this regard, offering near-constant-time lookups.
```python
# Slow
seen_authors_list = ['Road Runner', 'ACME', 'Anvil', 'Dynamite']
print('Wile E. Coyote' in seen_authors_list)
# Fast
seen_authors_set = {'Road Runner', 'ACME', 'Anvil', 'Dynamite'}
print('Wile E. Coyote' in seen_authors_set)
# Fast
seen_authors_dict = {'Road Runner': 1, 'ACME': 2, 'Anvil': 3, 'Dynamite': 4}
print('Wile E. Coyote' in seen_authors_dict)
```
### Store your parsed database
Parsing the database will likely take quite some time, so it makes sense to store the resulting data once you have the parsing working. The easiest option is probably to use [`pickle`](https://docs.python.org/3/library/pickle.html):
```python
import pickle

# store the data
with open('datafile', 'wb') as file:
    pickle.dump(data, file)

# load the data
with open('datafile', 'rb') as file:
    data = pickle.load(file)
```