# 2022-05-30-dc-socsci Exercises document

## All on participant list

| Name | :heavy_check_mark: or :x: |
|------------|-|
| Name one | |
| Name two | |

## Present on Monday

| Name | :heavy_check_mark: or :x: |
|------------|-|

## Present on Tuesday

| Name | :heavy_check_mark: or :x: |
|------------|-|

## Present on Wednesday

| Name | :heavy_check_mark: or :x: |
|------------|-|

## Present on Thursday

| Name | :heavy_check_mark: or :x: |
|------------|-|

## Check in template

### Check in: [activity]

| Name | Done |
|:------------------- |:----------------------------- |
| Barbara (example) | :question: |
| Francesco (example) | :heavy_check_mark: |
| Adri | |
| Anne Maaike | |
| Babette | |
| Carissa | |
| Cecilia | |
| Daniela | |
| Dominika | |
| Hekmat | |
| Ilaria | |
| Jeanette | |
| Kasimir | |
| Kevin | |
| Kyri | |
| Lianne | |
| Marilù | |
| Melisa | |
| Michael Q | |
| Philippine | |
| Rael O. | |
| Ranran Li | |
| Reshmi | |
| Roxane | |
| ruidong | |
| Samareen | |
| Signe | |
| Swee Chye | |
| Yahua Zi | |
| Yunfeng | |
| | |

Day 4 (students who are present)

| Name | Done |
|:------------------- |:----------------------------- |
| Barbara (example) | :question: |
| Francesco (example) | :heavy_check_mark: |
| Adri | |
| Anne Maaike | |
| Babette | |
| Carissa | |
| Cecilia | |
| Daniela | |
| Dominika | |
| Hekmat | |
| Ilaria | |
| Jeanette | |
| Kasimir | |
| Kevin | |
| Kyri | |
| Lianne | |
| Marilù | |
| Melisa | |
| Michael Q | |
| Philippine | |
| Rael O. | |
| Ranran Li | |
| Reshmi | |
| Roxane | |
| ruidong | |
| Samareen | |
| Signe | |
| Swee Chye | |
| Yahua Zi | |
| Yunfeng | |
| | |

Name list:

* Adri
* Anne Maaike
* Babette
* Carissa
* Cecilia
* Daniela
* Dominika
* Hekmat
* Ilaria
* Jeanette
* Kasimir
* Kevin
* Kyri
* Lianne
* Marilù
* Melisa
* Michael Q
* Philippine
* Rael O.
* Ranran Li
* Reshmi
* Roxane
* ruidong
* Samareen
* Signe
* Swee Chye
* Yahua Zi
* Yunfeng

## Spreadsheets

### Messy data exercise

We’re going to take a messy version of the SAFI data and describe how we would clean it up. Download the [messy data](https://ndownloader.figshare.com/files/11502824).

1. Open up the data in a spreadsheet program.
2. Notice that there are two tabs. Two researchers conducted the interviews, one in Mozambique and the other in Tanzania. They each structured their data tables differently. Now, you’re the person in charge of this project and you want to be able to start analyzing the data.
3. In your breakout room, identify what is wrong with this spreadsheet. Discuss the steps you would need to take to clean up the two tabs and combine them into one spreadsheet. Write these steps down as a list in the collaborative document.
4. **Important:** You don't have to clean the data, just write down the possible improvements.

After you go through this exercise, we will discuss as a group what was wrong with this data and how you would fix it.

## OpenRefine

### Exercise: Faceting

1. Using faceting, find out how many different `interview_date` values there are in the survey results.
2. Is the column formatted as Text or Date?
3. Use faceting to produce a timeline display for `interview_date`. You will need to use `Edit cells` > `Common transforms` > `To date` to convert this column to dates.
4. During what period were most of the interviews collected?

### Exercise: Transforming Data

Perform the same clean-up steps and customized text faceting for the `months_lack_food` column. In which month were farmers most likely to lack food?

### Exercise: Filtering

1. What roof types are selected by this procedure?
2. How would you restrict this to only one of the roof types?
## Python 1st part

### Exercise: Arithmetic and printing

Create a new cell and paste the code from the example into it:

```python
# a and b are assumed to be defined already, as in the example.
print("a =", a, "and b =", b)
print(a + 2*b)
print(a + (2*b))
print((a + b)*2)
```

1. Remove all of the calls to the `print` function so you only have the expressions that were to be printed, and run the code. What is returned?
2. Now remove all but the first line (with the 4 items in it) and run the cell again. How does this output differ from when we used the `print` function?

Bonus if you are done early:

* Practice assigning values to variables using as many different operators as you can think of.
* Create some expressions to be evaluated, using parentheses to enforce the order of mathematical operations that you require.

### Exercise: string operations

In this exercise you are going to explore a method that can operate on a string.

1. Create a string object.
2. Look at the different methods available to this string, using the function `dir()`.
3. Choose a method, and apply it to your string. (Remember: `mystring.isalpha()` applies the `isalpha()` method to `mystring`.)
   * a. What method did you choose?
   * b. Can you see what it does by applying it?
   * c. If not, use `help()` or [shift] + [tab] to read the docstring. What does your method do?

### Exercise: Boolean values

How does Python interpret different data types when converting them to boolean? Explore the following values. Can you define what data is seen as `True` when forced to be boolean (using the `bool()` function)? And what data is converted to `False`?
```python
# String values
bool_val1 = 'TRUE'
print('read as type ', type(bool_val1))
print('value when cast to bool', bool(bool_val1))

bool_val2 = 'FALSE'
print('read as type ', type(bool_val2))
print('value when cast to bool', bool(bool_val2))

# Integer values
bool_val3 = 1
print('read as type ', type(bool_val3))
print('value when cast to bool', bool(bool_val3))

bool_val4 = 0
print('read as type ', type(bool_val4))
print('value when cast to bool', bool(bool_val4))

bool_val5 = -1
print('read as type ', type(bool_val5))
print('value when cast to bool', bool(bool_val5))
```

If you have time left, explore other values than the ones mentioned above!

### Exercise: List indexing

1. Create a cell with the code below.

```python
num_list = [4,5,6,11]
```

2. Select the 1st element from this list. (Your code should return `4`)
3. Select the last element from this list. (Your code should return `11`)
4. Make a new list with the second and fourth element in this list. (Your code should return `[5,11]`)

### Exercise: Ranges

1. Create a list with `range()`:

```python
list(range(3,12,2))
```

2. What happens if you change the step value (the third argument) to 1? And to 3? Or to -1? Did you expect this?
3. Create a list using the `range()` function, which contains the even numbers between 1 and 10 in reverse order: `[10,8,6,4,2]`.

### Exercise: if-statements

1. Start with the following code:

```python
apple_cost = 0.5
bread_cost = 2.5
money = 2
```

2. Write an `if`-statement, replacing the ____ in the code below, that checks whether you can buy bread with your money:

```python
if _________:
    print("I can buy bread!")
```

3. Add an `elif` to your statement, where you check if you can buy an apple instead.

```python
elif _________:
    print("At least I can buy an apple!")
```

4. Bonus: write a final `else` for the tragic scenario where you can buy neither bread nor an apple.
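
As a syntax reminder before filling in the blanks, here is a minimal sketch of an `if`/`elif`/`else` chain on an unrelated example. The names `temperature` and `advice` and the thresholds are made up for illustration and are not part of the exercise.

```python
# Hypothetical example, unrelated to the bread/apple exercise.
temperature = 18  # degrees Celsius (made-up value)

# Conditions are checked from top to bottom; only the first
# branch whose condition is True is executed.
if temperature >= 25:
    advice = "Wear shorts."
elif temperature >= 15:
    advice = "Bring a light jacket."
else:
    advice = "Wear a coat."

print(advice)
```

Note that each condition line ends with a colon and the body of each branch is indented.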
### Exercise: for-loop

Suppose that we have a string containing a set of 4 different types of values separated by commas, like this:

```python
variablelist = "01/01/2010,34.5,Yellow,True"
```

Research the `split()` method and write a for-loop that prints each of the 4 components of `variablelist`.

### Exercise: Creating functions

1. Write a function definition to calculate the volume of a cuboid. The function will use three parameters `h`, `w` and `l` and return the volume.
2. Suppose that, in addition to the volume, I also wanted to calculate the surface area and the sum of all of the edges. Would I (or should I) have three separate functions, or could I write a single function to provide all three values together?

## Python 2nd part

### Exercise: Pandas sep

1. What happens if you forget to specify `sep='\t'` when reading a tab-delimited dataset?
2. (Optional): Use the help of the `read_csv` function. How do you read in the first 10 lines of the .tsv file? How do you read in a subset of the columns?
3. (Very optional): Write a function that reads in the data and returns a pandas DataFrame.

### Exercise: head and tail

1. As well as the `head()` method there is a `tail()` method. What do you think it does? Try it.
2. (Optional) Both methods accept a single numeric parameter. What do you think it does? Try it.

### Exercise: Print all columns

1. When we asked for the column names and their data types, the output was abridged, i.e. we didn’t get the values for all of the columns. Can you write a small piece of code which will print all of the values on separate lines? Paste your code in the collaborative document.
2. (Optional): Using if statements, write a piece of code that prints 'big dataset' if the number of columns is larger than 10, and 'small dataset' if the number of columns is smaller.
3. (Even more optional): Write a function that returns whether a dataset has more than 10 columns. (It should return a boolean value.)

### Exercise: Pandas columns

What happens if you:

1. List the columns you want out of order from the way they appear in the file?
2. Put the same column name in twice?
3. Put in a non-existing column name? (a.k.a. a typo)

### Exercise: Selecting rows and columns

1. Select all the rows for which `Q1` is equal to `Q2` or `numkid` is equal to 3, then select the columns `daily1` and `daily2`.
2. (Optional) Take the first 10 rows of all the columns whose header starts with `daily`.

### Exercise: Number of values

Compare the count values returned for the `B_no_membrs` and the `E19_period_use` variables.

1. Why do you think they are different?
2. How does this affect the calculation of the mean values?
3. (Optional): Use your search engine to find how to deal with missing values in pandas.

### Exercise: aggregation

In breakout rooms, discuss the answers in your group and write them in the collaborative document.

1. Read in the `SAFI_results.csv` dataset.
2. Get a list of the different `E26_affect_conflicts` values.
3. Group by `E26_affect_conflicts` and describe the results.
4. How many of the respondents never had any conflicts?
5. (Optional) Using `groupby`, find out whether farms that use water (`E01_water_use`) have more plots (`D_plots_count`) than farms that do not use water.

## Python 3rd part

### Exercise: Practice with data

1. Using the `SN7577i_aa.csv` and `SN7577i_bb.csv` files, create a DataFrame which is the result of an outer join using the `Id` column to join on.
2. What do you notice about the column names in the new DataFrame?
3. How would you rewrite the code so that all the columns which are common to both files are joined in the resulting DataFrame?

### Exercise: Plotting with Pandas

1. Make a histogram of the number of buildings in the compound (`buildings_in_compound`). Determine the appropriate number of bins, then include the `bins` argument in your function to improve the chart.
2. Make a scatter plot of `years_farm` vs `years_liv` and color the points by `buildings_in_compound`.
3. (Optional) Make a bar plot of the mean number of rooms per wall type (use columns `rooms` and `respondent_wall_type`). Hint: check out the function `plot.bar`, and recall how to use `groupby` to apply statistics to grouped data.

### Exercise: Customize your plot

Revisit your favorite plot we’ve made so far, or make one with your own data, then:

- add axis labels
- add a title
- save it in two different formats

## Wrap-up last day

### Exercise: Recap

Take a few minutes to write down your thoughts on what we learned in this course:

* What questions do you still have?
* Are there any incremental improvements that could benefit your projects?
* What did we learn that is nice, but overkill for your current work?