Barbara Vreede
![](https://i.imgur.com/iywjz8s.png)

# Collaborative Document

2022-05-30 Data Carpentry with Python (day 1)

Welcome to The Workshop Collaborative Document.

This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

----------------------------------------------------------------------------

This is the Document for today: [link](https://tinyurl.com/python-socsci-day1)

Collaborative Document day 1: [link](https://tinyurl.com/python-socsci-day1)

Collaborative Document day 2: [link](https://tinyurl.com/python-socsci-day2)

Collaborative Document day 3: [link](https://tinyurl.com/python-socsci-day3)

Collaborative Document day 4: [link](https://tinyurl.com/python-socsci-day4)

## 👮Code of Conduct

Participants are expected to follow these guidelines:
* Use welcoming and inclusive language.
* Be respectful of different viewpoints and experiences.
* Gracefully accept constructive criticism.
* Focus on what is best for the community.
* Show courtesy and respect towards other community members.

## ⚖️ License

All content is publicly available under the Creative Commons Attribution License: [creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/).

## 🙋Getting help

To ask a question, type `/hand` in the chat window.

To get help, type `/help` in the chat window.

You can ask questions in the document or chat window and helpers will try to help you.

## 🖥 Workshop information

:computer:[Workshop website](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/)

🛠 [Setup](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/#setup)

:arrow_down: Download the following files and place them in a single, easily findable folder:
- [SAFI_clean.csv](https://ndownloader.figshare.com/files/11492171)
- [SAFI_messy.xlsx](https://ndownloader.figshare.com/files/11502824)
- [SAFI_dates.xlsx](https://ndownloader.figshare.com/files/11502827)
- [SAFI_openrefine.csv](https://ndownloader.figshare.com/files/11502815)

## 👨‍🏫👩‍💻🎓 Instructors

Barbara Vreede

Francesco Nattino

## 🧑‍🙋 Helpers

Suvayu Ali

Dafne van Kuppevelt

Candace Makeda Moore

## 👩‍💻👩‍💼👨‍🔬🧑‍🔬🧑‍🚀🧙‍♂️🔧 Roll Call

Name/ pronouns (optional) / job, role / social media (twitter, github, ...) / background or interests (optional) / city

- Francesco Nattino (he, him) / Research Software Engineer at eScience Center / gh: fnattino / Leiden
- Candace Makeda Moore (she, her) / Research Software Engineer at the Netherlands eScience Center / gh: drcandacamakedamoore / Medicine / Zaandam
- Dafne van Kuppevelt / she, her / Research Software Engineer @ eScience Center / @dafnevk / Machine Learning, Natural language processing, teaching / Utrecht
- Lieke de Boer / she, her / Community Manager @ eScience Center / Open Science, Neuroscience
- Philippine Waisvisz / she, her / Researcher Digital Ethics @ Hogeschool Utrecht / Social Psychology (motivation, emotion & behavior), data annotation, inclusive AI
- Barbara Vreede / she, her / Research Software Engineer @eScience Center / gh:@bvreede / biology, climate / Utrecht
- Kasimir Dederichs, PhD student in Sociology, Nuffield College, Uni Oxford. Voluntary organizations / Social integration / Segregation
- Adri Mul (he), AmsterdamUMC, research analyst
- Kevin Wittenberg (he, him), PhD in Sociology, Utrecht University / interests in network analysis, human cooperation, convergence of classical statistics & ML / Joining ODISSEI summer school
- Hekmat Alrouh (he, him) / PhD student @ Vrije Universiteit Amsterdam / @hekmatalrouh / epidemiology, obesity
- Melisa Diaz (she, her), a post-doc researcher (Politecnico di Milano & VU) joining the ODISSEI summer school. My field of research is economics of education
- Kyri Janssen (she, her) Researcher at Delft University of Technology. Joining ODISSEI summer school
- Carissa Champlin / Postdoc at TU Delft / Delft-Enschede / starting 1 June as Assistant Professor in Community-Level Design for Climate Action
- Anne Maaike Mulders (she, her) / PhD student at Radboud University Nijmegen
- Signe Glæsel (she, her) / Student Assistant Data Manager @ DANS / Psychology, Statistics / Leiden
- Yahua Zi (she, her) / PhD in FGB / heritability of motor development / Amsterdam
- Dominika Czerniawska / she, her / postdoc Leiden University / sociology, social networks / ODISSEI Summer School prep
- Babette Vlieger (she, her) / PhD student / University of Amsterdam / Plant Phytopathology
- Roxane Snijders (she, her) / PhD student / University of Amsterdam / Plant Physiology
- Daniela Negoita / Junior Researcher / European Values Study (Tilburg University) / Tilburg
- Cecilia Potente / Postdoc / University of Zurich / Sociology, Biodemography / ODISSEI Summer School
- Michael Quinlan (he, him) / Researcher Observational Technology at KNMI / BS UMass - Lowell in Meteorology / MS Utrecht University in Applied Data Science / https://github.com/mquinlan0824 / Amsterdam
- Ranran Li (she/her) / PhD student in Psychology / VU University Amsterdam / ODISSEI summer school
- Swee Chye (he, him) / overseas, interests in Data Science, Data Analytics, AI
- Yunfeng Liu (he, him) / PhD student in Biomedical data science / Leiden University Medical Center
- Reshmi Gopalakrishna Pillai / Lecturer, Informatics Institute, University of Amsterdam / Summer School, ODISSEI / social media, natural language processing, psycholinguistics
- Samareen Zubair (she/her) / Research Masters Social and Behavioral Sciences / Tilburg University / Summer School ODISSEI
- Lianne Bakkum / Postdoc @ Vrije Universiteit Amsterdam / Clinical child and family studies / ODISSEI Summer school
- Rael Onyango (she/her) / PhD candidate in SBE / VU University Amsterdam

## 🗓️ Agenda

| Time  | Topic                             |
|------:|:----------------------------------|
| 9:00  | Welcome & introduction            |
| 9:30  | Data Organization in Spreadsheets |
| 10:45 | Coffee break                      |
| 11:00 | OpenRefine for Data Cleaning      |
| 12:00 | Coffee break                      |
| 12:15 | Introduction to Python            |
| 12:45 | Day 1 wrap-up                     |
| 13:00 | END                               |

## Ice-breaker: Your favorite children's book?

- Lieke: Heksenkind by Monica Furlong
- Candace Makeda: Le petit prince :prince:
- Roxane: Puk van de Pettenflat
- Barbara: The Neverending Story (Het Oneindige Verhaal)
- Dafne: Roald Dahl - Mathilda :+1:
- Signe: Harry Potter and the Prisoner of Azkaban (Lianne: +1)
- Hekmat: Majid
- Cecilia: Matilda Roald Dahl
- Babette: Rupsje nooitgenoeg
- Anne Maaike: Despereaux
- Kasimir: The little prince
- Carissa: Bartholomew and the Oobleck
- Samareen Zubair: The night of the living dummy (R.L Stine)
- Daniela: The Chronicles of Narnia
- Dominika: Pchła Szachrajka by Jan Brzechwa
- Philippine: The little Prince
- Francesco: All Richard Scarry's
- Adri: BFG
- Swee Chye: Hans Christian Andersen, Ugly Duckling
- Kyri: Water ship down :rabbit2:
- Jeanette: Lotta on Troublemakerstreet by Astrid Lindgren
- Michael: Le Petit Prince
- Reshmi: Harry Potter (all books :))

## 🔧 Exercises

**Answers moved below**

### Status chart

#### Symbols:

:heavy_check_mark: I am done!

:question: I am confused

#### Assignment [what are we working on?]

| Name        | Done?              |
|:------------|:-------------------|
| Barbara     | :question:         |
| Francesco   |                    |
| Suvayu      | :heavy_check_mark: |

### Messy data exercise

We’re going to take a messy version of the SAFI data and describe how we would clean it up. Download the [messy data](https://ndownloader.figshare.com/files/11502824).

1. Open up the data in a spreadsheet program.
2. Notice that there are two tabs. Two researchers conducted the interviews, one in Mozambique and the other in Tanzania. They both structured their data tables in a different way. Now, you’re the person in charge of this project and you want to be able to start analyzing the data.
3. In your breakout room, identify what is wrong with this spreadsheet. Discuss the steps you would need to take to clean up the two tabs, and to put them all together in one spreadsheet. Write this down in a list in the collaborative document.
4. **Important** You don't have to clean the data, just write down the possible improvements.

After you go through this exercise, we will discuss as a group what was wrong with this data and how you would fix it.

#### Room 1

- start a new spreadsheet
- remove headers: dwelling, livestock, plots
- one variable per column
- create a new column to indicate the country
- Code Mozambique livestock as Tanzania livestock
- Combine IDs
- Convert yes/no to 1 and 0 and use a coherent response for missing data

#### Room 2

- shift key_id to first column
- Variables in the top row
- label Mozambique/Tanzania with Mz/Tnz, for example with key_id, so they can all be combined in one table (not two tabs)
- make one table by key_id
- spread contents of 3, (animal, animal animal) over more cells (not everything in one cell)
- in Tanzania table: make variable types consistent
- Cowshed into two variables (new variable)
- make all NA consistent (-999, NULL, -99) to NA
- spread and dichotomize variables (livestock)

#### Room 3

- choose wide (√) or long format
- add variable/column for country
- fix IDs
- decide how to code for missing data (NA)
- have a codebook for all variables / make list of all variables and their data type
- check incorrect codings (row 2 in tab 2, table 2)

#### Room 4

- One table for all data types (dwelling, livestock, plots)
- Merge all datasets and include a country variable
- Additional column with notes for exceptions on how responses are coded
- Are the interviews about dwelling, livestock and plots conducted with the same individual? So does the key-id across these tables identify the same person/plot?
- Include the different types of livestock as separate columns, as in Tanzania, but then ensure consistent responses, so either y/n or the number.
- Different IDs across the two countries
- Spellcheck
- Consistent coding of responses (e.g. dummy variables: y/n or 0/1)

#### Room 5

* include metadata
* single type vars (no combos of alpha and numeric)
* one entry per field - separate out sub-entries (i.e., oxen, cows, goats, etc.)
* identify missing/bad data
* eliminate formatting
* create a 'remarks/notes' column for metadata (e.g., "includes barn")
* spell-check (errth v. earth)
* include location identifier for Mozambique v. Tanzania
* use consistent coding for responses (thus not both "mabati_sloping" and "mabatisloping")
* check if tables can be combined using a unique identifier for each case (does not seem to be the case)

- make sure ID numbers are unique
- create new column for country
- code missing data
- check if column names are the same for each

## 🧠 Collaborative Notes

---

#### eScience Center and Workshop Welcome

(Lieke): Website shared. The Center tries to work with researchers and has open calls. Link provided in chat. We also sponsor NL-RSE, which organizes meet-ups to discuss topics in research software engineering.

(Barbara): We will use open tools to help with automation and reproducibility over the next four days. Today covers a bit of data and how to clean it up using OpenRefine. The next three days will focus on Python. We are using a HackMD document. (Explanation of how to use the collaborative document. Roll call.) Agenda. Ice-breaker.

#### Spreadsheets

Crowd/participant survey on whether you want more Python (not spreadsheets), if you used spreadsheets, if you ever got frustrated with spreadsheets... We will review one or two tricks in Excel, but mostly we will tell you not to use Excel. We will cover how we go from spreadsheets to something a computer can work with. The pipeline we will show uses spreadsheets, but as a tool for entering data. We will even take cleaning steps away from the spreadsheet, and the final steps will be in Python.

What is data that can be used well by a computer? Tidy data: each variable forms its own column, each observation has its own row, and each value has its own cell. There are many ways to diverge from this, but that makes it harder for a computer to work with. "Every tidy dataset is alike, but every messy dataset is messy in its own way".
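
As an illustration of the tidy-data idea above, here is a minimal Python sketch using only the standard library. The records, the column names and the set of missing-data codes are made up for this example (loosely inspired by the suggestions from the breakout rooms); they are not taken from the SAFI files.

```python
# Hypothetical messy records: several values share one cell, and missing
# data is coded in different ways (all values are made up for illustration).
messy_rows = [
    {"key_id": "1", "country": "Mozambique", "livestock": "oxen;cows;goats"},
    {"key_id": "2", "country": "Tanzania", "livestock": "-999"},
    {"key_id": "3", "country": "Tanzania", "livestock": "poultry"},
]

# Assumed set of codes that all mean "missing" in this made-up dataset.
MISSING_CODES = {"-999", "-99", "NULL", ""}


def tidy_livestock(rows):
    """Return one row per household and animal; missing values become None."""
    tidy = []
    for row in rows:
        cell = row["livestock"].strip()
        animals = [None] if cell in MISSING_CODES else cell.split(";")
        for animal in animals:
            tidy.append(
                {
                    "key_id": row["key_id"],
                    "country": row["country"],
                    "livestock": animal,
                }
            )
    return tidy


tidy_rows = tidy_livestock(messy_rows)
for row in tidy_rows:
    print(row)
```

Each printed row holds exactly one value per cell, and all missing-data codes end up as the same marker (`None`), which is the kind of consistency the rooms asked for.
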
A tidy dataset streamlines the process of further downstream analysis.

We will work with the SAFI dataset. It includes interviews, data on households, agricultural practices, etc. We will start by looking at this dataset together. In breakout rooms, we will look at the data, discuss it, and write down in the collaborative document the steps needed to put the two tabs of data together.

Discussion on the steps proposed by the groups to solve this particular 'messy data'. Important notes from the discussion:

- There are various ways of indicating missing data, but being consistent in a merged set is important.
- There are some problems that can be fixed with a codebook. Codebooks can be an important tool to make data re-usable.
- We must encode data in text or some other way that the computer code we later write will 'understand' (not color or some other way which may be easy for us to see, but not for the computer).
- Perhaps use a notes column for further data.
- Naming of key IDs has specific considerations. Consider making it an 'obscure' string to avoid accidental analysis of it.
- Questions on how to take the spreadsheet with you as you go to interviews or other data collection. Advice: keep it 'clean', but it doesn't have to be tidy. Steps between a clean dataset and a tidy dataset can be automated.
- Importance of metadata. Metadata is often formatted in a specific way (vital for re-usability).
- Spell check

Further notes:

When you start processing your data, do not touch raw data! You will need it for many reasons. Keep it stored away, read only.

We want you to see this tool as a data entry tool, not a tool for analysis. You can do a quick and dirty plot, but we hope you will do these steps with other tools, e.g. Python. Other tools allow you to standardize, automate, and replicate. You can retrieve your steps in a script.

Dates are worthy of special attention. Example: the 7th of March may be read incorrectly as the 3rd of July, depending on how Excel is formatting it. Remember that files may be loaded in a different program.
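
The date ambiguity mentioned above can be reproduced in a few lines of Python. This is only a sketch: the cell value is made up, and the two format strings stand in for two programs (or locale settings) that assume a different day/month order.

```python
from datetime import datetime

raw = "07/03/2016"  # made-up spreadsheet cell

# The same text, read with two different assumptions about the order.
day_first = datetime.strptime(raw, "%d/%m/%Y")    # 7th of March
month_first = datetime.strptime(raw, "%m/%d/%Y")  # 3rd of July

print(day_first.date().isoformat())    # 2016-03-07
print(month_first.date().isoformat())  # 2016-07-03
```

Writing dates as year-month-day, as in the printed output, leaves no room for swapping day and month.
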
Consider different ways to store dates, e.g. day, month and year in separate columns.

Another important issue is measurements. Watch out for units. Units should be in a codebook, a header or both. Don't put the unit in the cell (then you can't do operations on it). Formatting best practices are covered in the links/resources.

Finally, save data in an open format. Open file formats are .csv or even .tsv. The advantage is that both are open, so you won't need licenses to open the files. Another advantage is that they are text based, as opposed to Excel files, which cannot be opened and read in a text reader.

Questions on different types of csvs. Issues when commas are used as separator. There are ways around the problem, e.g. using a different separator (perhaps a very obscure character). Open Office may give a bit more control over how csvs are stored.

Take home points:

- Use spreadsheets only to enter data
- Use OpenRefine to clean
- Use tools like Python
- Do not touch your raw data

### Open Refine

Data pipeline:

```mermaid
graph LR
A[Data Entry] --> B[Data Cleaning] --> C[Data Analysis]
style B stroke-width:5px
```

Data cleaning with **OpenRefine**:
* Open source ([source on GitHub](https://github.com/OpenRefine/OpenRefine)).
* A large growing community, from novice to expert, ready to help.
* Works with large-ish datasets (100,000 rows).
* OpenRefine always keeps your data private on your own computer.

(Francesco) OpenRefine: We are looking at data cleaning. What can we do with OpenRefine? Find errors, make data consistent. There are many tools to do this. OpenRefine is free and open source, and has a growing community (you can google and talk with others about issues). OpenRefine keeps your data private. It does not require an internet connection (but you use a browser).

Follow-along hands-on demonstration. Note that OpenRefine should work on all systems, Mac included. Any browser should be fine; on a Mac, Safari is perhaps better.

We open data with OpenRefine.
If you realize the file is encoded in another way, you may modify the encoding. One useful feature of OpenRefine is facets. Facets are a way to group values in columns. We can get an overview of the values in a dataset this way (by looking at facets). We will look at automatic ways to identify errors.

Exercise with the SAFI_openrefine.csv file (if you have not downloaded it, it is also available in chat).

- Timeline facet for dates: produces something that seems nonsensical. Very likely the dates were imported as text. In this case we need to use edit cells > common transforms > to date to convert the column. Then OpenRefine will understand these as dates, and they can be made into a histogram and manipulated correctly.
- Close facets with a click on the top left.
- Clustering: here we will cluster values on the basis of some phonetic features. OpenRefine suggests some values for us, but we don't have to follow them. There are different ways to cluster. Check the other clustering methods.
- Can we define our own clustering methods? Probably not (yet) implemented.

We can edit our steps. Be careful in going back. Undo/redo tab.

Columns can be complex for various reasons. We may need to get rid of special characters, and do other steps. To get rid of special characters: edit cells, transform (follow the pop-up windows). We can type expressions in multiple languages, including Python. But we can also use the OpenRefine language (there is lots of documentation for this). Example of removing the opening square bracket. We can copy-paste to speed this up.

A question on undoing one specific step in the middle of a sequence is answered with the idea that you can't do it that way. You could undo step by step.

There are tools and methods to separate out certain values in columns.

Way to see all the column names: all, dropdown, edit columns, reorder.

Continued exercise, special type of faceting: custom text.

#### Separate items in the `items_owned` column

Click on the arrow next to 'items_owned'.

In the 'expression' box, type: `value.split(";")`

#### Filtering & sorting

Filtering: Select a column (e.g. 'Village') > text filter > type in the box (left part of screen) to filter the data

Sorting: Select a column, e.g. 'gps_Longitude' > Sort > Use the dialog to sort; even if there is text, you can still sort by number

#### Exercise solution

1. Get rid of special characters in `months_lack_food`: click on edit cells > transform, use expression: `value.replace("[","").replace("]","").replace("'","").replace(" ","")` (check 'History' to reuse expressions!)
2. Look at the different months and see the frequency: custom text > facet, use expression `value.split(";")`
3. Sort by 'count'

#### How to export the operations?

This way, others can apply the same changes to their raw dataset.

1. Undo/Redo tab
2. extract
3. tick boxes of steps to export

This code is in .json format, and you can copy it and paste it into an empty (plain text) document.

Apply someone else's code:

1. Open the same file as they have
2. Open the Undo/redo tab
3. Click 'apply'
4. Paste the steps (same .json format)
5. Click 'perform operation'

OpenRefine stores your project history; when opening OpenRefine you can find projects in progress under 'Open Projects'.

Another option is to export both the history and the data: Export > OpenRefine project archive to file

This generates a compressed archive that includes the entire project, data and cleaning steps included.

#### How to export the clean data?

Export > Tab-separated value (.tsv) or Comma-separated value (.csv)

This exports the data into a .tsv or .csv file that can be used for further analysis.

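
For comparison, the clean-up from the exercise solution above (strip the special characters, split on `;`, count the months) can be sketched in plain Python. This is not how OpenRefine works internally, only a rough equivalent of the two expressions; the sample cell values are made up in the style of the `months_lack_food` column, and `clean` is a hypothetical helper name.

```python
from collections import Counter

# Made-up cell values in the style of the `months_lack_food` column.
raw_values = [
    "['Jan' ; 'Nov']",
    "['Oct' ; 'Nov']",
    "['Nov']",
]


def clean(value):
    """Remove brackets, quotes and spaces, like the chained replace() calls."""
    for char in "[]' ":
        value = value.replace(char, "")
    return value


# Split every cleaned cell on ";" and count how often each month occurs,
# similar to a custom text facet sorted by count.
counts = Counter(
    month for value in raw_values for month in clean(value).split(";")
)

for month, count in counts.most_common():
    print(month, count)
```
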
### Python

Mac/Linux: open terminal, enter:

```
jupyter lab
```

Windows: start 'anaconda navigator', choose 'Jupyter Lab'

OR: open anaconda prompt, and type

```
jupyter lab
```

When you are done, enter it in the [status chart](https://hackmd.io/2DOOHZ-YQxeRnRKSEZwl_g?both#Check-in-opening-JupyterLab)!

From jupyter lab: click 'Notebook' > 'Python 3'

Rename the file by right-clicking on the notebook (on the left side), click 'rename'.

`01_introduction_python.ipynb`

Type in the cell:

```python
a = 2
b = 3
a*b
```

Click 'play' (the triangle icon) to run the cell.

### Check-in for opening OpenRefine

| Name                | Done with opening OpenRefine? |
|:------------------- |:----------------------------- |
| Barbara (example)   | :question:                    |
| Francesco (example) | :question:                    |
| Suvayu (example)    | :heavy_check_mark:            |
| Lianne              | :heavy_check_mark:            |
| Anne Maaike         | :heavy_check_mark:            |
| Kyri                | :heavy_check_mark:            |
| Carissa             | :heavy_check_mark:            |
| Babette             | :heavy_check_mark:            |
| Signe               | :heavy_check_mark:            |
| Daniela             | :heavy_check_mark:            |
| Melisa              | :heavy_check_mark:            |
| Marilù              | :heavy_check_mark:            |
|                     | :question:                    |
| Adri                | :heavy_check_mark:            |
| Kevin               | :heavy_check_mark:            |
| Reshmi              | :heavy_check_mark:            |
| Michael Q           | :heavy_check_mark:            |
| Kasimir             | :heavy_check_mark:            |
| Hekmat              | :heavy_check_mark:            |
| Roxane              | :heavy_check_mark:            |
| Cecilia             | :heavy_check_mark:            |
| Ilaria              | :heavy_check_mark:            |
| Swee Chye           | :heavy_check_mark:            |
| Rael O.             | :heavy_check_mark:            |
| Yunfeng             | :heavy_check_mark:            |
| Jeanette            | :heavy_check_mark:            |
| Samareen            | :heavy_check_mark:            |
| ruidong             | :heavy_check_mark:            |
| Yahua Zi            | :heavy_check_mark:            |
| Philippine          | :heavy_check_mark:            |

### Exercise: Faceting

1. Using faceting, find out how many different `interview_date` values there are in the survey results.
2. Is the column formatted as Text or Date?
3. Use faceting to produce a timeline display for `interview_date`. You will need to use `Edit cells` > `Common transforms` > `To date` to convert this column to dates.
4. During what period were most of the interviews collected?

--Lianne
1. 19 choices
2. Text
3.
4. 16 November-21 November

--Carissa
1. 19 choices
2. Text
3. done
4. 16-21 nov 2016

--Reshmi
1. 19 choices
2. Text
3. Done.
4. 2016-11-16T00:00:00Z

-- Swee Chye:
1. 19 choices
2. Text
3.
4. 2016-11-16T00:00:00Z

--Roxane
1. 19
2. Text
3. 16 nov 2016

--MD
1. 19 choices: from Nov/16 to May/17
2. Text
3. 2016-11-16 01:00:00 to 2017-06-04 02:00:00
4. November 2016

Babette
1. 19
2. Text
3. 2016-11-16 --> 16 nov 2016

Samareen
1. 19
2. Text
3. Done :heavy_check_mark:
4. 2016-11-16T00:00:00Z (26 entries)

Kyri
1. 19
2. text
3. 16/11/2016 - 21/11/2016

Dominika
1. 19
2. text
3. Nov 2016

Michael
1. 19
2. text
3. 2016-11-16 (23 entries)

Signe:
1. 19
2. Text
3. :heavy_check_mark:
4. November 16th - November 21st 2016

Hekmat:
1. 19
2. Text

Adri
1. 19 choices
2. Text
3. :heavy_check_mark:
4. 2016-11-16T00:00:00Z

--Kevin
1. 19
2. Text
4. 16th nov - 25th nov 2016

-- Daniela
1. 19
2. text
3.
4. 16-21 NOV

--Cecilia
1. 19
2. text
3.

-- Kasimir
1. 19
2. text
3. November 16th - 21st 2016

-- Ranran
1. 19
2. Text
3.

### Exercise: Transforming Data

Perform the same clean up steps and customized text faceting for the `months_lack_food` column. **Which month were farmers more likely to lack food?**

-- Swee Chye
* Oct
* Nov

-- Yahua Zi

--Reshmi
November - 71
October - 59

-- Dominika
- Nov (71), Oct (59)

-- Signe
* November (71)
* October (59)

--MD: Nov (71 times)

- Kyri: November, October

Adri
1. :heavy_check_mark:
2. Nov, Oct

-- Roxane
1. Oct & Nov

-- Anne Maaike

-- Carissa
1. Oct and Nov

-- Kevin
November

--Michael
November (71)

### Check in: opening JupyterLab

| Name                | Done                          |
|:------------------- |:----------------------------- |
| Barbara (example)   | :question:                    |
| Francesco (example) | :heavy_check_mark:            |
| Lianne              | :heavy_check_mark:            |
| Anne Maaike         | :heavy_check_mark:            |
| Kyri                | :heavy_check_mark:            |
| Carissa             | :heavy_check_mark:            |
| Babette             | :heavy_check_mark:            |
| Signe               | :heavy_check_mark:            |
| Daniela             | :heavy_check_mark:            |
| Melisa              | :heavy_check_mark:            |
| Marilù              | :heavy_check_mark:            |
| Adri                | :heavy_check_mark:            |
| Kevin               | :heavy_check_mark:            |
| Reshmi              | :heavy_check_mark:            |
| Michael Q           | :heavy_check_mark:            |
| Kasimir             | :heavy_check_mark:            |
| Hekmat              | :heavy_check_mark:            |
| Roxane              | :heavy_check_mark:            |
| Cecilia             | :heavy_check_mark:            |
| Ilaria              | :heavy_check_mark:            |
| Swee Chye           | :heavy_check_mark:            |
| Rael O.             | _left?_                       |
| Yunfeng             |                               |
| Jeanette            | :heavy_check_mark:            |
| Samareen            | :heavy_check_mark:            |
| ruidong             | :heavy_check_mark:            |
| Yahua Zi            | :heavy_check_mark:            |
| Dominika            | :heavy_check_mark:            |
| Ranran Li           | :heavy_check_mark:            |

### Tip: what can we improve?

- Specify which steps are required to complete the data cleaning exercise
- I would've liked a bit more reasoning on why OpenRefine is used / when to use which programme
- The faceting function seems like a nice tool for data cleaning. What is not clear to me is what to do with sorted/filtered data: is this to select a subset of data for analysis?
- Provide link to the online lessons/syllabus from carpentry
- Provide zoom link for the day sessions, did not get it by email
- We could start the first exercise all together and then move to the groups?
- A handout with instructions/commands so we can quickly get along with the demo
- more time to try and test and get along, so it is more like a training and less like a demo
- Provide link to the online lessons/syllabus ahead of the training so we can prepare

### Questions / Remarks

1. Is there a guide for when to use "custom text facet" and when not?
2. Maybe not a tip, but more a remark. The Data Carpentry for social science in R provides the same materials and steps regarding OpenRefine and Excel. It is likely that more people follow both these carpentries.
3. I generally clean data with dplyr/tidyverse in R. Would you recommend OpenRefine over it (what are the pros and cons)?
4. OpenRefine seems like a great tool to add to our toolboxes for quick exploratory data analysis, to get a sense of the dataset.

### Top: what went well?

- Collaborative document works really well! +1 +1 +1 +1
- Clear instructions
- Friendly instructors +1 +1
- I liked the balance of the class (theory/instructions/exercises)
- Sufficient pace, enough time +1 +1 +1
- General notes that can be used in the future
- Good intro to tidy data
- The topic comes across a lot less intimidating than expected
- I liked that the instructors were very attentive to people asking questions or struggling
- very nice, knowledgeable instructors
- great we had all installation instructions beforehand so we could start immediately +1
- useful exercises - L +1
- Long and timely breaks so I can handle my emails

## 📚 Resources

[Good Enough Practices for Scientific Computing](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)

[Karl W. Broman & Kara H. Woo, Data Organization in Spreadsheets](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989)

[Hadley Wickham, Tidy Data](http://www.jstatsoft.org/v59/i10)
