
# Collaborative Document
2022-05-30 Data Carpentry with Python (day 1)
Welcome to The Workshop Collaborative Document.
This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
----------------------------------------------------------------------------
This is the Document for today: [link](https://tinyurl.com/python-socsci-day1)
Collaborative Document day 1: [link](https://tinyurl.com/python-socsci-day1)
Collaborative Document day 2: [link](https://tinyurl.com/python-socsci-day2)
Collaborative Document day 3: [link](https://tinyurl.com/python-socsci-day3)
Collaborative Document day 4: [link](https://tinyurl.com/python-socsci-day4)
## 👮Code of Conduct
Participants are expected to follow these guidelines:
* Use welcoming and inclusive language.
* Be respectful of different viewpoints and experiences.
* Gracefully accept constructive criticism.
* Focus on what is best for the community.
* Show courtesy and respect towards other community members.
## ⚖️ License
All content is publicly available under the Creative Commons Attribution License: [creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/).
## 🙋Getting help
To ask a question, type `/hand` in the chat window.
To get help, type `/help` in the chat window.
You can ask questions in the document or chat window and helpers will try to help you.
## 🖥 Workshop information
:computer:[Workshop website](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/)
🛠 [Setup](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/#setup)
:arrow_down: Download the following files and place them in a single, easily findable folder:
- [SAFI_clean.csv](https://ndownloader.figshare.com/files/11492171)
- [SAFI_messy.xlsx](https://ndownloader.figshare.com/files/11502824)
- [SAFI_dates.xlsx](https://ndownloader.figshare.com/files/11502827)
- [SAFI_openrefine.csv](https://ndownloader.figshare.com/files/11502815)
## 👨🏫👩💻🎓 Instructors
Barbara Vreede
Francesco Nattino
## 🧑🙋 Helpers
Suvayu Ali
Dafne van Kuppevelt
Candace Makeda Moore
## 👩💻👩💼👨🔬🧑🔬🧑🚀🧙♂️🔧 Roll Call
Name/ pronouns (optional) / job, role / social media (twitter, github, ...) / background or interests (optional) / city
- Francesco Nattino (he, him) / Research Software Engineer at eScience Center / gh: fnattino / Leiden
- Candace Makeda Moore (she, her)/ Research Software engineer at the Netherlands eScience Center/ gh: drcandacamakedamoore/ Medicine/ Zaandam
- Dafne van Kuppevelt / she,her / Research Software Engineer @ eScience Center / @dafnevk / Machine Learning, Natural language processing, teaching / Utrecht
- Lieke de Boer / she, her / Community Manager @ eScience Center / Open Science, Neuroscience
- Philippine Waisvisz / she, her / Researcher Digital Ethics @ Hogeschool Utrecht / Social Psychology (motivation, emotion & behavior), data annotation, inclusive AI
- Barbara Vreede / she, her / Research Software Engineer @eScience Center / gh:@bvreede / biology, climate / Utrecht
- Kasimir Dederichs, PhD student in Sociology, Nuffield College, Uni Oxford. Voluntary organizations / Social integration / Segregation
- Adri Mul (He), AmsterdamUMC, research analyst
- Kevin Wittenberg (he, him), PhD in Sociology, Utrecht University / interests in network analysis, human cooperation, convergence of classical statistics & ML / Joining ODISSEI summer school
- Hekmat Alrouh (he,him) / PhD student @ Vrije Universiteit Amsterdam / @hekmatalrouh / epidemiology, obesity
- Melisa Diaz (she, her) a post-doc researcher (Politecnico di Milano & VU) joining the odissei summer school. My field of research is economics of education
- Kyri Janssen (she, her) Researcher at Delft University of Technology. Joining ODISSEI summer school
- Carissa Champlin/ Postdoc at TU Delft/ Delft-Enschede / starting 1 June as Assistant Professor in Community-Level Design for Climate Action
- Anne Maaike Mulders (she, her) / PhD student at Radboud University Nijmegen
- Signe Glæsel (she, her) / Student Assistant Data Manager @ DANS / Psychology, Statistics / Leiden
- Yahua Zi (she, her)/ PhD in FGB/ heritability of motor development/ Amsterdam
- Dominika Czerniawska/ she her/ postdoc Leiden University / sociology, social networks/ Oddisei Summer School prep
- Babette Vlieger (she,her)/PhD student/ University of Amsterdam/ Plant Phytopathology
- Roxane Snijders (she, her) / PhD student / University of Amsterdam / Plant Physiology
- Daniela Negoita / Junior Researcher / European Values Study (Tilburg University) / Tilburg
- Cecilia Potente / Postdoc / University of Zurich / Sociology, Biodemography / Odissei Summer School
- Michael Quinlan (he, him) / Researcher Observational Technology at KNMI / BS UMass - Lowell in Meteorology / MS Utrecht University in Applied Data Science / https://github.com/mquinlan0824 / Amsterdam
- Ranran Li (she/her) / PhD student in Psychology / VU University Amsterdam/ ODISSEI summer school
- Swee Chye (he, him) / overseas, interests in Data Science, Data Analytics, AI
- Yunfeng Liu (he, him) / PhD student in Biomedical Data Science / Leiden University Medical Center
- Reshmi Gopalakrishna Pillai/Lecturer, Informatics Institute, University of Amsterdam/Summer School, ODISSEI/social media, natural language processing, psycholinguistics/
- Samareen Zubair(she/her)/Research Masters Social and Behavioral Sciences/Tilburg University/Summer School ODISSEI
- Lianne Bakkum / Postdoc @ Vrije Universiteit Amsterdam / Clinical child and family studies / ODISSEI Summer school
- Rael Onyango(She/her)/Phd candidate in SBE /VU University Amsterdam
## 🗓️ Agenda
| Time | Topic |
|------:|:----------------------------------|
| 9:00 | Welcome & introduction |
| 9:30 | Data Organization in Spreadsheets |
| 10:45 | Coffee break |
| 11:00 | OpenRefine for Data Cleaning |
| 12:00 | Coffee break |
| 12:15 | Introduction to Python |
| 12:45 | Day 1 wrap-up |
| 13:00 | END |
## Ice-breaker: Your favorite children's book?
- Lieke: Heksenkind by Monica Furlong
- Candace Makeda: Le petit prince :prince:
- Roxane: Puk van de Pettenflat
- Barbara: The Neverending Story (Het Oneindige Verhaal)
- Dafne: Roald Dahl - Matilda :+1:
- Signe: Harry Potter and the Prisoner of Azkaban (Lianne: +1)
- Hekmat: Majid
- Cecilia: Matilda Roald Dahl
- Babette: Rupsje nooitgenoeg
- Anne Maaike: Despereaux
- Kasimir: The little prince
- Carissa: Bartholomew and the Oobleck
- Samareen Zubair: The night of the living dummy (R.L Stine)
- Daniela: The Chronicles of Narnia
- Dominika: Pchła Szachrajka by Jan Brzechwa
- Philippine: The little Prince
- Francesco: All Richard Scarry's
- Adri: BFG
- Swee Chye: Hans Christian Andersen, Ugly Duckling
- Kyri: Watership Down :rabbit2:
- Jeanette: Lotta on Troublemakerstreet by Astrid Lindgren
- Michael: Le Petit Prince
- Reshmi : Harry Potter (all books :))
## 🔧 Exercises
**Answers moved below**
### Status chart
#### Symbols:
:heavy_check_mark: I am done!
:question: I am confused
#### Assignment [what are we working on?]
| Name | Done? |
|:------------|:-------------------|
| Barbara | :question: |
| Francesco | |
| Suvayu | :heavy_check_mark: |
### Messy data exercise
We’re going to take a messy version of the SAFI data and describe how we would clean it up.
Download the [messy data](https://ndownloader.figshare.com/files/11502824).
1. Open up the data in a spreadsheet program.
2. Notice that there are two tabs. Two researchers conducted the interviews, one in Mozambique and the other in Tanzania. They both structured their data tables in a different way. Now, you’re the person in charge of this project and you want to be able to start analyzing the data.
3. In your breakout room, identify what is wrong with this spreadsheet. Discuss the steps you would need to take to clean up the two tabs and to put them all together in one spreadsheet. Write this down as a list in the collaborative document.
4. **Important** You don't have to clean the data, just write down the possible improvements.
After you go through this exercise, we will discuss as a group what was wrong with this data and how you would fix it.
#### Room 1
- start a new spreadsheet
- remove headers: dwelling, livestock, plots
- one variable per column
- create a new column to indicate the country
- Code Mozambique livestock as Tanzania livestock
- Combine IDs
- Convert yes/no to 1 and 0 and coherent response for missing data
#### Room 2
- shift key_id to first column
- Variables top row
- label Mozambique/Tanzania with Mz/Tnz for example with key_id so they can all be combined in one table (not two tabs)
- make one table by key_id
- spread contents of 3, (animal, animal animal) over more cells (not everything in one cell)
- in Tanzania table: make variable types consistent
- Cowshed into two variables (new variable)
- make all NA values consistent (-999, NULL, -99) to NA
- spread and dichotomize variables (livestock)
#### Room 3
- choose wide (√) or long format
- add variable/column for country
- fix IDs
- decide how to code for missing data (NA)
- have a codebook for all variables / make list of all variables and their data type
- check incorrect codings (row 2 in tab 2, table 2)
#### Room 4
- One table for all data types (dwelling, livestock, plots)
- Merge all datasets and include a country variable
- Additional column with notes for exceptions on how responses are coded
- Are the interviews about dwelling, livestock, and plots conducted with the same individual? That is, does the key-id across these tables identify the same person/plot?
- Include the different types of livestock as separate columns, as in Tanzania, but then ensure consistent responses, so either a y/n or the number.
- Different IDs across the two countries
- Spellcheck
- Consistent coding of responses (e.g. dummy variables: y/n or 0/1)
#### Room 5
* include metadata
* single type vars (no combos of alpha and numeric)
* one entry per field - separate out sub-entries (i.e., oxen, cows, goats, etc.)
* identify missing/bad data
* eliminate formatting
* create a 'remarks/notes' columns for metadata (e.g., "includes barn")
* spell-check (errth v. earth)
* include location identifier for Mozambique v. Tanzania
* use consistent coding for responses (thus not both "mabati_sloping" and "mabatisloping")
* check if tables can be combined using a unique identifier for each case (does not seem to be the case)
- make sure ID numbers are unique
- create new column for country
- code missing data
- check if column names are the same for each tab
## 🧠 Collaborative Notes
---
#### eScience Center and Workshop Welcome
(Lieke):
Website shared. The Center works with researchers and has open calls; link provided in chat. We also sponsor NL-RSE, which organizes meet-ups to discuss topics in research software engineering.
(Barbara):
We will use open tools to help with automation and reproducibility over the next four days. Today covers a bit of data organization and how to clean data using OpenRefine. The next three days will focus on Python. We are using a HackMD document. (Explanation of how to use the collaborative document. Roll call.)
Agenda. Ice-breaker.
#### Spreadsheets.
Crowd/participant survey: do you want more Python (and not spreadsheets)? Have you used spreadsheets? Have you ever gotten frustrated with spreadsheets? We will review one or two tricks in Excel, but mostly we will tell you not to use Excel. We will cover how to go from spreadsheets to something a computer can work with. The pipeline we will show uses spreadsheets, but only as a tool for entering data. We will even take the cleaning steps away from the spreadsheet, and the final steps will be in Python.
What is data that can be used well by a computer? Tidy data: each variable forms its own column, each observation forms its own row, and each value has its own cell. There are many ways to diverge from this, but that makes it harder for a computer to work with. "Every tidy dataset is alike, but every messy dataset is messy in its own way." A tidy dataset streamlines the process of further downstream analysis (a small pandas sketch of this idea follows the discussion notes below). We will work with the SAFI dataset. It includes interviews, data on households, agricultural practices, etc. We will start by looking at this dataset together. In breakout rooms, we will start looking at the data and discuss; write the steps needed to put the two tabs of data together in the collaborative document. Discussion on steps by groups for a solution to this particular 'messy data'. Important notes from the discussion:
- There are various ways of indicating missing data, but being consistent in a merged set is important.
- Some problems can be fixed with a codebook. Codebooks can be an important tool to make data re-usable.
- We must encode data in text or some other way the computer code we later write will 'understand' (not color or some other cue which may be easy for us to see, but not for the computer).
- Perhaps use a notes column for further data.
- Naming of key IDs has specific considerations. Consider making the ID an 'obscure' string to avoid accidental analysis of it.
- Questions on how to take the spreadsheet with you as you go to interviews or other data collection: the advice is to keep it 'clean', but it doesn't have to be tidy. The steps between a clean dataset and a tidy dataset can be automated.
- Importance of metadata. Metadata is often formatted in a specific way (vital for re-usability).
- Spell check.
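As a minimal illustration of the tidy vs. messy idea (a sketch, not part of the SAFI materials; the column names are made up), here is how a small 'wide' table with one column per livestock type can be reshaped in Python/pandas so that each row holds one observation:
```python
import pandas as pd

# Hypothetical mini-dataset: one row per household, one column per livestock type.
messy = pd.DataFrame({
    "key_id": [1, 2],
    "cows": [3, 0],
    "goats": [5, 2],
})

# Tidy version: one row per (household, livestock type, count) observation.
tidy = messy.melt(id_vars="key_id", var_name="livestock", value_name="count")
print(tidy)
```
Either shape can be valid depending on the analysis; the point is that the tidy shape is easy to produce automatically from a clean data-entry table.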
Further notes: When you start processing your data, do not touch the raw data! You will need it for many reasons. Keep it stored away, read-only. We want you to see the spreadsheet as a data entry tool, not a tool for analysis. You can do a quick and dirty plot, but we hope you will do these steps with other tools, e.g. Python. Other tools allow you to standardize, automate, and replicate; you can retrieve your steps in a script.
Dates are worthy of special attention. Example: the 7th of March may read incorrectly as the 3rd of July based on how Excel is formatting it. Remember that files may be loaded in a different program. Consider different ways to store dates, e.g. day, month, and year in separate columns.
Another important issue is measurements. Watch out for units. Units should be in a codebook, a header, or both. Don't put the unit in the cell (then you can't do operations on it). Formatting best practices are covered in the links/resources. Finally, save data in an open format. Open file formats are .csv or even .tsv. The advantage is that both are open, so you won't need licenses to open the files. Another advantage is that they are text-based, as opposed to Excel files, which cannot be opened and read in a text reader.
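To make the date pitfall concrete, here is a hedged sketch (not from the lesson) of handling an ambiguous date in Python/pandas by stating the format explicitly and storing an unambiguous ISO 8601 string:
```python
import pandas as pd

# "07/03/2016" could be read as 7 March or 3 July depending on locale settings.
ambiguous = pd.Series(["07/03/2016"])

# Being explicit about the expected format removes the ambiguity.
parsed = pd.to_datetime(ambiguous, format="%d/%m/%Y")  # 7 March 2016

# ISO 8601 strings (YYYY-MM-DD) are unambiguous and sort correctly even as text.
print(parsed.dt.strftime("%Y-%m-%d"))  # 2016-03-07
```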
Questions on different types of CSVs. There are issues when commas are used as the separator and also appear within values. There are ways around the problem, e.g. using a different separator (perhaps a very obscure character). OpenOffice may give a bit more control over how CSVs are stored.
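As a hedged illustration of the separator issue (the file and column names below are invented): most tools quote fields that contain the delimiter, and you can also pick a different separator such as a tab:
```python
import pandas as pd
from io import StringIO

# A comma inside a quoted field is not treated as a column separator.
text = 'key_id,items_owned\n1,"bicycle, radio, cow_plough"\n'
df = pd.read_csv(StringIO(text))
print(df.loc[0, "items_owned"])  # bicycle, radio, cow_plough

# Alternatively, write (and later read) the data tab-separated as a .tsv file.
df.to_csv("example_subset.tsv", sep="\t", index=False)
```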
Take-home points:
- Use spreadsheets only to enter data
- Use OpenRefine to clean
- Use tools like Python for analysis
- Do not touch your raw data
### OpenRefine
Data pipeline:
```mermaid
graph LR
A[Data Entry] --> B[Data Cleaning] --> C[Data Analysis]
style B stroke-width:5px
```
Data cleaning with **OpenRefine**:
* Open source ([source on GitHub](https://github.com/OpenRefine/OpenRefine)).
* A large growing community, from novice to expert, ready to help.
* Works with large-ish datasets (100,000 rows).
* OpenRefine always keeps your data private on your own computer.
(Francesco) OpenRefine:
We are looking at data cleaning. What can we do with OpenRefine? Find errors, make data consistent. There are many tools to do this. OpenRefine is free and open source, and has a growing community (you can search online and talk with others about issues). OpenRefine keeps your data private. It does not require an internet connection (even though you use a browser). Follow-along hands-on demonstration. Note: OpenRefine should work on all systems, Mac included. Any browser should be fine; perhaps on a Mac, Safari is better.
We open the data with OpenRefine. If you realize the file is encoded in another way, you can modify the encoding on import. One useful feature of OpenRefine is facets. Facets are a way to group values in columns. We can get an overview of the values in a dataset this way (by looking at facets).
We will look at automatic ways to identify errors.
Exercise with the SAFI_openrefine.csv file (if you have not downloaded it, it is also available in the chat).
- Timeline facet for dates: produces something seemingly nonsensical. The column was very likely imported as text. Use Edit cells > Common transforms > To date to convert the column. Then OpenRefine will understand these values as dates, they can be shown as a histogram, and they can be manipulated correctly.
- Close facets with the top-left click.
- Clustering: here we will cluster values on the basis of some phonetic features. OpenRefine suggests some values for us, but we don't have to follow them. There are different ways to cluster; check the other clustering methods.
- Can we define our own clustering methods? Probably not (yet) implemented.
We can edit our steps. Be careful in going back.
Undo/redo tab
Columns can be complex for various reasons. We may need to get rid of special characters and do other steps.
To get rid of special characters: Edit cells > Transform (follow the pop-up windows). We can type expressions in multiple languages, including Python. But we can also use the OpenRefine expression language (there is lots of documentation for this). Example: removing an opening square bracket. We can copy-paste to speed this up.
A question on undoing one specific step in the middle of a sequence is answered: you can't do it that way; you could undo step by step.
There are tools and methods to separate out certain values in columns.
Way to see all the column names: all, dropdown, edit columns, reorder.
Continued exercise, a special type of faceting: custom text facet.
#### Separate items in the `items_owned` column
Click on the arrow next to 'items_owned'
In the 'expression' box, type: `value.split(";")`
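For comparison, a rough Python/pandas equivalent of this step (the file name is assumed to be a local copy of the lesson data; adjust the path as needed):
```python
import pandas as pd

df = pd.read_csv("SAFI_openrefine.csv")  # assumed local copy of the lesson file

# Same idea as the expression value.split(";"): one list of items per row.
df["items_owned_list"] = df["items_owned"].str.split(";")
print(df["items_owned_list"].head())
```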
#### Filtering & sorting
Filtering: Select a column (e.g. 'Village') > text filter > type in the box (left part of screen) to filter the data
Sorting: Select column, e.g. 'gps_Longitude' > Sort > Use the dialog to sort, even if there is text, you can still sort by number
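The same kind of filtering and sorting looks roughly like this in pandas (a sketch; the column names follow the lesson data but the exact spelling may differ in your copy):
```python
import pandas as pd

df = pd.read_csv("SAFI_openrefine.csv")  # assumed local copy

# Text filter: keep rows whose village name contains a given substring.
filtered = df[df["village"].str.contains("Chirod", case=False, na=False)]

# Sort by longitude as numbers, turning any stray text into NaN first.
df["gps_Longitude"] = pd.to_numeric(df["gps_Longitude"], errors="coerce")
sorted_by_longitude = df.sort_values("gps_Longitude")
```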
#### Exercise solution
1. Get rid of special characters in `months_lack_food`:
   - Click on Edit cells > Transform, use expression: `value.replace("[","").replace("]","").replace("'","").replace(" ","")` (check 'History' to reuse expressions!)
2. Look at the different months and see the frequency:
   Facet > Custom text facet, use expression `value.split(";")`
3. Sort by 'count'
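An assumed Python/pandas translation of this solution (column name from the lesson, file name assumed): strip the bracket/quote characters, split on ';', and count how often each month appears:
```python
import pandas as pd

df = pd.read_csv("SAFI_openrefine.csv")  # assumed local copy

# 1. Remove the special characters, mirroring the chained replace() calls above.
cleaned = (df["months_lack_food"]
           .str.replace("[", "", regex=False)
           .str.replace("]", "", regex=False)
           .str.replace("'", "", regex=False)
           .str.replace(" ", "", regex=False))

# 2. and 3. Split on ";" and count the months, most frequent first.
month_counts = cleaned.str.split(";").explode().value_counts()
print(month_counts)
```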
#### How to export the operations?
This way, others can apply the same changes to their raw dataset.
1. Undo/Redo tab >
2. extract >
3. tick boxes of steps to export
This code is in .json format, and you can copy it and paste it into an empty (plain text) document.
Apply someone else's code:
1. Open the same file as they have
2. Open the Undo/redo tab
3. Click 'apply'
4. Paste the steps (same .json format)
5. Click 'perform operation'
OpenRefine stores your project history; when opening OpenRefine you can find projects in progress under 'Open Projects'.
Another option is to export the history and the data both:
Export > OpenRefine project archive to file
This generates a compressed archive that includes the entire project, data and cleaning steps included.
#### How to export the clean data?
Export > Tab-separated value (.tsv) or Comma-separated value (.csv)
This exports the data into a .tsv or .csv file that can be used for further analysis.
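Once exported, the cleaned file can be read straight into Python for the next step of the pipeline (the file name below is just a placeholder for whatever you exported):
```python
import pandas as pd

# Comma-separated export:
clean = pd.read_csv("SAFI_cleaned.csv")

# A tab-separated export would use:
# clean = pd.read_csv("SAFI_cleaned.tsv", sep="\t")

print(clean.shape)     # number of rows and columns
print(clean.columns)   # column names
```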
### Python
Mac/Linux: open terminal, enter:
```
jupyter lab
```
Windows: start 'Anaconda Navigator', choose 'JupyterLab'
OR: open the Anaconda Prompt and type
```
jupyter lab
```
When you are done, enter it in the [status chart](https://hackmd.io/2DOOHZ-YQxeRnRKSEZwl_g?both#Check-in-opening-JupyterLab)!
From jupyter lab: click 'Notebook' > 'Python 3'
Rename the file by right-clicking on the notebook (on the left side), click 'rename'.
`01_introduction_python.ipynb`
Type in the cell:
```python=
a = 2
b = 3
a*b
```
Click 'play' (the triangle icon) to run the cell.
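As a small follow-up (not shown in the session notes): Jupyter displays the value of the last expression in a cell, and variables defined in one cell remain available in later cells, e.g. in a new cell:
```python
# a and b were defined in the cell above; they are still available here.
c = a * b
print(c)         # 6
print(type(c))   # <class 'int'>
```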
### Check-in for opening OpenRefine
| Name | Done with opening OpenRefine? |
|:------------------- |:----------------------------- |
| Barbara (example) | :question: |
| Francesco (example) | :question: |
| Suvayu(example) | :heavy_check_mark: |
| Lianne | :heavy_check_mark: |
| Anne Maaike | :heavy_check_mark: |
| Kyri | :heavy_check_mark: |
| Carissa | :heavy_check_mark: |
| Babette | :heavy_check_mark: |
| Signe | :heavy_check_mark: |
| Daniela | :heavy_check_mark: |
| Melisa | :heavy_check_mark: |
| Marilù | :heavy_check_mark: |
| Adri | :heavy_check_mark: |
| Kevin | :heavy_check_mark: |
| Reshmi | :heavy_check_mark: |
| Michael Q | :heavy_check_mark: |
| Kasimir | :heavy_check_mark: |
| Hekmat | :heavy_check_mark: |
| Roxane | :heavy_check_mark: |
| Cecilia | :heavy_check_mark: |
| Ilaria | :heavy_check_mark: |
| Swee Chye | :heavy_check_mark: |
| Rael O. | :heavy_check_mark: |
| Yunfeng | :heavy_check_mark: |
| Jeanette | :heavy_check_mark: |
| Samareen | :heavy_check_mark: |
| ruidong | :heavy_check_mark: |
| Yahua Zi | :heavy_check_mark: |
| Philippine | :heavy_check_mark: |
### Exercise: Faceting
1. Using faceting, find out how many different `interview_date` values there are in the survey results.
2. Is the column formatted as Text or Date?
3. Use faceting to produce a timeline display for `interview_date`. You will need to use `Edit cells` > `Common transforms` > `To date` to convert this column to dates.
4. During what period were most of the interviews collected?
--Lianne
1. 19 choices
2. Text
3.
4. 16 November - 21 November
--Carissa
1.19 choices
2. Text
3. done
4. 16-21 nov 2016
--Reshmi
1. 19 choices
2. Text
3. Done.
4. 2016-11-16T00:00:00Z
-- Swee Chye:
1. 19 choices
2. Text
3.
4. 2016-11-16T00:00:00Z
--Roxane
1. 19
2. Text
3. 16 nov 2016
--MD
1. 19 choices: from Nov/16 to May/17
2. Text
3. 2016-11-16 01:00:00 — 2017-06-04 02:00:00
4. November 2016
Babette
1. 19
2. Text
3. 2016-11-16 --> 16 nov 2016
Samareen
1. 19
2. Text
3. Done :heavy_check_mark:
4. 2016-11-16T00:00:00Z (26 entries)
Kyri
1. 19
2. text
3. 16/11/2016 - 21/11/2016
Dominika
1. 19
2. text
3. Nov 2016
Michael
1. 19
2. text
3. 2016-11-16 (23 entries)
Signe:
1. 19
2. Text
3. :heavy_check_mark:
4. November 16th - November 21st 2016
Hekmat:
1. 19
2. Text
Adri
1. 19 choices
2. Text
3. :heavy_check_mark:
4. 2016-11-16T00:00:00Z
--Kevin
1. 19
2. Text
4. 16th nov - 25th nov 2016
-- Daniela
1. 19
2. text
3.
4. 16-21 NOV
--Cecilia
1. 19
2. text
3.
-- Kasimir
1. 19
2. text
3. November 16th - 21st 2016
-- Ranran
1. 19
2. Text
3.
### Exercise: Transforming Data
Perform the same clean up steps and customized text faceting for the `months_lack_food` column. **Which month were farmers more likely to lack food?**
-- Swee Chye
* Oct
* Nov
-- Yahua Zi
--Reshmi
November - 71 October - 59
-- Dominika - Nov (71), Oct (59)
-- Signe
* November (71)
* October (59)
--MD: Nov (71 times)
- Kyri: November, October
Adri
1. :heavy_check_mark:
2. Nov, Oct
-- Roxane
1. Oct & Nov
-- Anne Maaike
-- Carissa
1. Oct and Nov
-- Kevin
November
--Michael
November (71)
### Check in: opening JupyterLab
| Name | Done |
|:------------------- |:----------------------------- |
| Barbara (example) | :question: |
| Francesco (example) | :heavy_check_mark: |
| Lianne | :heavy_check_mark: |
| Anne Maaike | :heavy_check_mark: |
| Kyri | :heavy_check_mark: |
| Carissa | :heavy_check_mark: |
| Babette | :heavy_check_mark: |
| Signe | :heavy_check_mark: |
| Daniela | :heavy_check_mark: |
| Melisa | :heavy_check_mark: |
| Marilù | :heavy_check_mark: |
| Adri | :heavy_check_mark: |
| Kevin | :heavy_check_mark: |
| Reshmi | :heavy_check_mark: |
| Michael Q | :heavy_check_mark: |
| Kasimir | :heavy_check_mark: |
| Hekmat | :heavy_check_mark: |
| Roxane | :heavy_check_mark: |
| Cecilia | :heavy_check_mark: |
| Ilaria | :heavy_check_mark: |
| Swee Chye | :heavy_check_mark: |
| Rael O. | _left?_ |
| Yunfeng | |
| Jeanette | :heavy_check_mark: |
| Samareen | :heavy_check_mark: |
| ruidong | :heavy_check_mark: |
| Yahua Zi | :heavy_check_mark: |
| Dominika | :heavy_check_mark: |
| Ranran Li | :heavy_check_mark: |
### Tip: what can we improve?
- Specify which steps are required to complete the data cleaning exercise
- I would've liked a bit more reasoning on why OpenRefine is used / when to use which program
- The faceting function seems like a nice tool for data cleaning. What is not clear to me is what to do with the sorted/filtered data: is this to select a subset of the data for analysis?
- Provide link to the online lessons/syllabus from carpentry
- Provide zoom link for the day sessions, did not get it by email
- We could start the first exercise all together and then move to the groups?
- A handout with instructions/commands so we can quickly get along with the demo
- more time to try and test and get along - so it is more like a training and less like a demo
- Provide link to the online lessons/syllabus ahead of the training so we can prepare
### Questions / Remarks
1. Is there a guide for when to use "custom text facet" and when not?
2. Maybe not a tip, but more a remark: the Data Carpentry for social sciences in R provides the same materials and steps regarding OpenRefine and Excel. It is likely that many people follow both of these Carpentries.
3. I generally clean data with dplyr/tidyverse in R. Would you recommend OpenRefine over it (what are the pros and cons)?
4. OpenRefine seems like a great tool to add to our toolboxes for quick exploratory data analysis, to get a sense of the dataset.
### Top: what went well?
- Collaborative document works really well! +1 +1 +1 +1
- Clear instructions
- Friendly instructors +1 +1
- I liked the balance of the class (theory/instructions/exercises)
- Sufficient pace, enough time +1 +1 +1
- General notes that can be used in the future
- Good intro to tidy data
- The topic comes across a lot less intimidating than expected
- I liked that the instructors were very attentive of people asking questions or struggling
- very nice, knowledgeable instructors
- great we had all installation instructions beforehand so we could start immediately +1
- useful exercises
- L +1
- Long and timely breaks so I can handle my emails
## 📚 Resources
[Good Enough Practices for Scientific Computing](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)
[Karl W. Broman & Kara H. Woo, Data Organization in Spreadsheets](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989)
[Hadley Wickham, Tidy Data](http://www.jstatsoft.org/v59/i10)