# Collaborative Document. Day 1, Sept 26th
2022-09-26 R for Social Scientists
Welcome to The Workshop Collaborative Document
This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
All content is publicly available under the Creative Commons Attribution License
https://creativecommons.org/licenses/by/4.0/
----------------------------------------------------------------------------
This is the Document for today: [link](https://hackmd.io/@o3DWHyfCQNqBUaAA1JO-_A/HJkxiTA-o)
## 👮Code of Conduct
* Participants are expected to follow those guidelines:
* Use welcoming and inclusive language
* Be respectful of different viewpoints and experiences
* Gracefully accept constructive criticism
* Focus on what is best for the community
* Show courtesy and respect towards other community members
## ⚖️ License
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
## 🙋Getting help
to ask a question, type in the chat window
to get help, type in the chat window
you can ask questions in the document or chat window and helpers will try to help you
## 🖥 Workshop website
[link](https://steltenpower.github.io/2022-09-26-dc-socsci-R-nlesc-dccpo-online/)
### 🛠 Setup
[link](https://datacarpentry.org/socialsci-workshop/setup-r-workshop.html)
#### For today:
- Excel of LibreOffice
- OpenRefine (http://openrefine.org/download.html)
- You need to download some files to follow this lesson:
Download the following three files:
[SAFI_clean.csv](https://ndownloader.figshare.com/files/11492171)
[SAFI_messy.xlsx](https://ndownloader.figshare.com/files/11502824)
[SAFI_dates.xlsx](https://ndownloader.figshare.com/files/11502827)
- Place these 3 files in a folder you can easily find and access on your computer (for instance in a datacarpentry-spreadsheets folder on your Desktop or within your Home folder).
## About the data
For more information about the dataset and to download it from [Figshare](http://www.datacarpentry.org/socialsci-workshop/data), check out the Social Sciences workshop data page.
## 👩🏫👩💻🎓 Instructors
Ruud Steltenpool, Rick de Klerk
## 🧑🙋 Helpers
Rins Rutgers, Margriet Miedema
## Check in
## Introduction
### Exercise
**How many people have used spreadsheets in their research?**
- Rick: Als ik snel iets wil bekijken!
- Judith: veel gebruikt in onderzoek, met formules, voor de pre-processing
**How many people have accidentally done something that made them frustrated or sad?**
- Judith: Als onderzoeker heb ik met excel een probleem gehad met de invoer van de geboortedata van proefpersonen, het format (UK / US) was veranderd in een gedeeld document, waardoor ik de leeftijd van de proefpersonen niet goed kon berekenen, en allemaal handmatig moest aanpassen
- Voorbeeld genetica: https://www.nature.com/articles/d41586-021-02211-4
## Data formatting
### Data formatting problems
### Keeping track of your analyses
### Structuring data in spreadsheets
Alle variabelen in een kolom
Elke observatie een rij
Geen cellen combineren
Bewaar de ruwe data!
Exporteer naar een csv
**Opdracht:**
We’re going to take a messy version of the SAFI data and describe how we would clean it up.
1. Download the [messy data](https://ndownloader.figshare.com/files/11502824).
1. Open up the data in a spreadsheet program.
1. Notice that there are two tabs. Two researchers conducted the interviews, one in Mozambique and the other in Tanzania. They both structured their data tables in a different way. Now, you’re the person in charge of this project and you want to be able to start analyzing the data.
1. With the person next to you, identify what is wrong with this spreadsheet. Discuss the steps you would need to take to clean up the two tabs, and to put them all together in one spreadsheet.
**Important** Do not forget our first piece of advice, to create a new file (or tab) for the cleaned data, never modify your original (raw) data.
After you go through this exercise, we’ll discuss as a group what was wrong with this data and how you would fix it.
### Findings
- missing number als -99 (gaat mis als je getallen gaat optellen etc)
- voetnoot in geel, daar kan computer niets mee
- verschillende spelling van roof-type bijv mabatisloping en mabati_sloping en floortype bijvoorbeeld errth en earth
- Misschien de opmaak met tabel randen?
- Verschillende manier van opschrijven tussen Mozambique en Tanzania
- Bij Mozambique wordt er meerdere informatiepunten in één cel geschreven
- Niet alles in rijen
- Kleurformatting in dwelling tabel (Mozambique)
- Meerdere stukjes informatie in één kolom (Mozambique)
- Geen eenduidige formatting in plots (Mozambique)
- Aantal kolommen lifestocktabel kan worden verminderd (Tanzania)
- niet één variabele per kolom, variabelen verdeeld over meerdere kolommen
- Typefouten (errth/earth)
- Onduidelijke tabelnamen/
- 3 tabelen in 1 sheet, met dezelfde key_id --> ik vroeg mij af, staan de - key_id's voor hetzelfde?
### Metadata
- Standaarden
- Codebooks
- [DDI]()
- https://frictionlessdata.io/
-- https://lov.linkeddata.es/dataset/lov/
-wat zou een goede meta data bescrhijving zijn voor deze set (vraag Tineke)
**Opdracht**
Download a [clean version of this dataset](https://ndownloader.figshare.com/files/11492171) and open the file with your spreadsheet program. This data has many more variables that were not included in the messy spreadsheet and is formatted according to tidy data principles.
Discuss this data with a partner and make a list of some of the types of metadata that should be recorded about this dataset. It may be helpful to start by asking yourself, “What is not immediately obvious to me about this data? What questions would I need to know the answers to in order to analyze and interpret this data?”
### Findings
- Uitleg van variabelenamen - wat is in 'room' (incl. of excl. 'barn')? Wat is bedoeld met 'affect_conflicts'? Etc.
- Vreemd formaat van de tijd-datum
- Meer key-ids dan regels - somige data valt uit?
## Dates as data
**Opdracht**
> Download and open the [SAFI_dates.xlsx](https://ndownloader.figshare.com/files/11502827) file. This file
> contains a subset of the data from the SAFI interviews, including the dates on which the
> interviews were conducted.
>
> Choose the tab of the spreadsheet that corresponds to the way you format dates in your
> location (either day first `DD_MM_YEAR`, or month first `MM_DD_YEAR`).
>
> Extract the components of the date to new columns. For this we
> can use the built in Excel functions:
>
> `=MONTH()`
> `=DAY()`
> `=YEAR()`
>
> Apply each of these formulas to its entire column.
> Make sure the new column is formatted as a number and not as a date.
>
> We now have each component of our date isolated in its own column. This will allow us
> to group our data with respect to month, year, or day of month for our analyses and will
> also prevent problems when passing data between different versions of spreadsheet
> software (as for example when sharing data with collaborators in different countries).
```=DAY([@[interview_date]])```
## Quality assurance
Looking again at the [clean version of the SAFI
dataset](https://ndownloader.figshare.com/files/11492171), we see that there are several columns with numeric data. One example of this is the column `no_membrs` representing the number of people in the household. We would expect this always to be a positive integer, and so we should reject values like `1.5` and `-8` as entry errors. We would also reject values over a certain maximum - for example an entry like `90` is probably the result of the researcher inputting `9` and their finger slipping and also hitting the `0` key. It is up to you as the researcher to decide what a reasonable maximum value would be for your data, here we will assume that there are no families with greater than 30 members.
**Opdracht**
> Apply a new data validation rule to one of the other numeric
> columns in this data table. Discuss with the person sitting next
> to you what a reasonable rule would be for the column you've selected. Be sure to create an informative input message.
## Exporting data
To save a file you have opened in Excel in CSV format:
1. From the top menu select `File` and `Save as`.
2. In the `Format` field, from the list, select `Comma Separated Values` (`*.csv`).
3. Double check the file name and the location where you want to save it and hit `Save`.
Note: selecting and exporting in UTF-8 file format ensures all special characters will be corectly saved.
# OpenRefine
Data: https://ndownloader.figshare.com/files/11502815
* Data is often very messy. OpenRefine provides a set of tools to allow you to identify and amend the messy data.
* It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
* All actions are easily reversed in OpenRefine.
* If you save your work it will be to a new file. OpenRefine always uses a copy of your data and _does not_ modify your original dataset.
* Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.
* Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
**Opdracht**
> 1. Using faceting, find out how many different `interview_date` values there are in the survey results.
>
> 2. Is the column formatted as Text or Date?
>
> 3. Use faceting to produce a timeline display for `interview_date`. You will need to use `Edit cells` > `Common transforms` > `To date` to convert this column to dates.
>
> 4. During what period were most of the interviews collected?