
# Collaborative Document
2022-06-02 Data Carpentry with Python (day 4)
Welcome to The Workshop Collaborative Document.
This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
----------------------------------------------------------------------------
This is the Document for today: [link](https://tinyurl.com/python-socsci-day4)
Collaborative Document day 1: [link](https://tinyurl.com/python-socsci-day1)
Collaborative Document day 2: [link](https://tinyurl.com/python-socsci-day2)
Collaborative Document day 3: [link](https://tinyurl.com/python-socsci-day3)
Collaborative Document day 4: [link](https://tinyurl.com/python-socsci-day4)
## Code of Conduct
Participants are expected to follow these guidelines:
* Use welcoming and inclusive language.
* Be respectful of different viewpoints and experiences.
* Gracefully accept constructive criticism.
* Focus on what is best for the community.
* Show courtesy and respect towards other community members.
## License
All content is publicly available under the Creative Commons Attribution License: [creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/).
## Getting help
To ask a question, type `/hand` in the chat window.
To get help, type `/help` in the chat window.
You can ask questions in the document or chat window and helpers will try to help you.
## Workshop information
:computer:[Workshop website](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/)
[Setup](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/#setup)
:arrow_down: Download [pandas-data.zip](https://github.com/esciencecenter-digital-skills/2022-05-30-dc-socsci-python-nlesc/raw/main/files/pandas-data.zip)
- [Post-workshop survey](https://carpentries.typeform.com/to/UgVdRQ?slug=2022-05-30-dc-socsci-python-nlesc&typeform-source=esciencecenter-digital-skills.github.io)
## Instructors
Barbara Vreede
Francesco Nattino
## Helpers
Suvayu Ali
Dafne van Kuppevelt
Candace Makeda Moore
## Roll Call
Name/ pronouns (optional) / job, role / social media (twitter, github, ...) / background or interests (optional) / city
* Suvayu Ali (Helper) / RSE @ NLeSC / Experimental Particle Physics
* Francesco Nattino / RSE @ NLeSC / Leiden
* Signe Glæsel (she, her) / Student Assistant Datamanager @ DANS / Psychology, Statistics / Leiden
* Barbara Vreede (she, her) / Research Software Engineer @netherlands eScience center / Biology, climate / Utrecht
* Dafne van Kuppevelt / she,her / Research Software Engineer @netherlands eScience center / Machine Learning, Natural language processing / Utrecht
* Kevin Wittenberg (he, him), PhD in Sociology, Utrecht University / interests in network analysis, human cooperation, convergence of classical statistics & ML / Joining ODISSEI summer school
* Reshmi G Pillai (she,her), Lecturer at the University of Amsterdam/ Social media, natural language processing, computational social science/ Joining ODISSEI summer school
* Roxane Snijders (she, her) / PhD / Amsterdam University / Plant Physiology
* Hekmat Alrouh / PhD student / Vrije Universiteit Amsterdam / behavioral genetics / Joining ODISSEI summer school
* Daniela Negoita / Junior Researcher / European Values Study (Tilburg University)
* Adri Mul (He), AmsterdamUMC, research analyst
* Melisa Diaz / she, her/ Post-doc researcher / VU-Politecnico di Milano
* Kyri Janssen (She, Her)/ PhD / Delft / Human Geography / Joining SICSS ODISSEI
* Samareen Zubair / Research Master's / Tilburg University / ODISSEI Summer School
* Ilaria Piovesan / Lecturer @ Maastricht University
* Rael Onyango / PhD in Entrepreneurial Economics / VU Universiteit Amsterdam
* Lianne Bakkum / postdoc at VU Amsterdam, Clinical child and family studies / ODISSEI summer school
* Babette Vlieger / PhD at Molecular Plant Pathology/ University of Amsterdam
* Anne Maaike Mulders (she, her) / PhD at Radboud University / Social networks and inequality
* Swee Chye (he, him)
* Jeanette Hadaschik, University of Twente & Maastricht, Behavioural sciences/psychology, www.linkedin.com/in/jeanette-hadaschik
* Yahua Zi / PhD at Vrije Universiteit Amsterdam, Biological Psychology
* Michael Quinlan (he, him) / Researcher of Observational Technology at KNMI / Meteorology & Physics
* Ranran Li (she/her)/ PhD in Psychology @ VU University Amsterdam
* Rui Dong (he/him) Radboud University
* Carissa Champlin, TU Delft
## Agenda
| Time | Topic |
|--:|:---|
| 9:00 | Welcome, recap of day 3 |
| 9:15 | Joining Pandas Dataframes |
| 10:15 | Coffee break |
| 10:30 | Data visualisation using Matplotlib |
| 11:30 | Coffee break |
| 11:45 | Data visualisation using Matplotlib |
| 12:45 | Workshop wrap-up & [post-workshop survey](https://carpentries.typeform.com/to/UgVdRQ?slug=2022-05-30-dc-socsci-python-nlesc) |
| 13:00 | END |
## Exercises
### Exercise: Practice with data
1. Using the `SN7577i_aa.csv` and `SN7577i_bb.csv` files, create a Dataframe which is the result of an outer join using the `Id` column to join on.
2. What do you notice about the column names in the new Dataframe?
3. How would you re-write the code so that all the columns which are common to both files are joined in the resulting Dataframe?
Adri:
``` python
1. df_assignment_merged_outer = pd.merge(df_aa,df_bb, how="outer", on="Id")
2. Has an _x and _y extra
3. common_cols = list((df_aa.columns).intersection(df_bb.columns))
df_assignment_merged_outer_same = pd.merge(df_aa,df_bb, how="outer", on=common_cols)
```
Roxane:
``` python
1. df_merged_aabb = pd.merge(df_aa, df_bb, how="outer", on="Id")
2. the column names are referred to as Q_x and Q_y (x for df_aa and y for df_bb)
3. df_merged_aabb = pd.merge(df_aa, df_bb, how="outer", on=["Id", "Q1", "Q2"])
```
Michael:
``` python
1. df_12 = pd.merge(df_1, df_2, how="outer", on="Id")
2. common column names are given an additional suffix to distinguish which dataframe they came from
3. comment out the 'on' argument: df_12 = pd.merge(df_1, df_2, how="outer")#, on="Id")
could also use: df_12 = pd.merge(df_1, df_2, how="outer", on=["Id", "Q1", "Q2"])
```
Carissa:
1. df_merged_outer = pd.merge(df_aa, df_bb, how="outer", on="Id")
2. Column names that are the same in both df's receive additional indicator.
3. df_merged_outer = pd.merge(df_aa, df_bb, how="outer", on=["Id","Q1", "Q2"])
df_merged_outer
Kevin:
```python=
1. df_outer = pd.merge(pd.read_csv("SN7577i_aa.csv"), pd.read_csv("SN7577i_bb.csv"), on="Id", how="outer")
2. df_outer #Column Q1 and Q2 occur twice (and we assume are the same item) but not specified so they appear twice and are annotated based on origin.
3. df_outer = pd.merge(pd.read_csv("SN7577i_aa.csv"), pd.read_csv("SN7577i_bb.csv"), on=["Id", "Q1", "Q2"], how="outer")
```
Babette
```python=
df_aabb_merged_outer = pd.merge(df_aa,df_bb, how="outer", on="Id")
df_aabb_merged_outer
df_aabb_merged_outer2 = pd.merge(df_aa,df_bb, how="outer", on=["Id", "Q1", "Q2"])
df_aabb_merged_outer2
```
Signe
``` python
# Loading data and viewing columns
df_aa = pd.read_csv("SN7577i_aa.csv")
df_aa.head(2)
df_bb = pd.read_csv("SN7577i_bb.csv")
df_bb.head(2)
# Merging by `Id`
df_merge_outer = pd.merge(df_aa, df_bb, how = "outer", on = "Id")
df_merge_outer
```
2. One can see that the column names are repeated, with an extension indicating that they come from different data frames. The merged data frame consequently contains a lot of missing values.
3. Dropping the `on = "Id"` argument and rewriting it as below fixes the issue.
``` python
# Merging without `Id`
df_merge_outer2 = pd.merge(df_aa, df_bb, how = "outer")
df_merge_outer2
```
Melisa
``` python
df_exe=pd.merge(df_aa, df_bb, how="outer",on="Id")
#Ids do not match, the size of the new table is double and Q1 & Q2 are duplicated
df_exe=pd.merge(df_aa, df_bb, how="outer",on=["Id", "Q1", "Q2"])
```
Ranran:
``` python=
df_aa = pd.read_csv("SN7577i_aa.csv")
df_bb = pd.read_csv("SN7577i_bb.csv")
pd.merge(df_aa, df_bb, how="outer", on=["Id", "Q1", "Q2"])
```
Daniela
```python=
1.
df1=pd.read_csv("pandas-data/SN7577i_aa.csv")
df2=pd.read_csv("pandas-data/SN7577i_bb.csv")
merged_outter=pd.merge(df1, df2, how="outer", on = "Id")
2. Cols present in both dataframes are annotated with an "_x" & "_y"
3. merged_outter=pd.merge(df1, df2, how="outer", on = ["Id", "Q1", "Q2"])
```
Kyri
```python=
#merge1
merge_on_2 = pd.merge(df_aa, df_bb, how ="outer" , on = "Id" )
#combining columns
merge_on_2 = pd.merge(df_aa, df_bb, how ="outer" )
```
Hekmat
1.
```python=
df_merged_ex1 = pd.merge(df_aa, df_bb, how="outer", on="Id")
```
2. _x and _y are appended to the names of the common columns
3.
```python=
df_merged_ex2 = pd.merge(df_aa, df_bb, how="outer", on=("Id","Q1","Q2"))
```
Samareen:
```python=
df_merged_outer_ex = pd.merge(df_aa,df_bb, how = 'outer',on = 'Id')
df_merged_outer_ex = pd.merge(df_aa,df_bb, how = 'outer',on = ['Id','Q1','Q2'])
```
Anne Maaike
``` python
df_aa = pd.read_csv("pandas-data/pandas-data/SN7577i_aa.csv")
df_bb = pd.read_csv("pandas-data/pandas-data/SN7577i_bb.csv")
mergerex = pd.merge(df_aa, df_bb, how="outer", on="Id")
mergerex.head()
# it shows column names with suffixes _x and _y for variables coming from both datasets respectively
```
---Reshmi
```python
1. df_aa=pd.read_csv("SN7577i_aa.csv")
df_bb=pd.read_csv("SN7577i_bb.csv")
df_merged_outer=pd.merge(df_aa,df_bb, how="outer",on="Id")
2. For columns Q1, Q2 which are present in both dataframes, new columns Q1_x, Q1_y and Q2_x, Q2_y are created.
3. df_merged_outer=pd.merge(df_aa,df_bb, how="outer",on=["Id","Q1","Q2"])
```
Swee Chye:
``` python
1. df_aa = pd.read_csv("../pandas-data/SN7577i_aa.csv")
df_bb = pd.read_csv("../pandas-data/SN7577i_bb.csv")
pd.merge(df_aa, df_bb, how="outer", on="Id")
2. Id Q1_x Q2_x Q3 Q1_y Q2_y Q4 Has an _x and _y for Q1, Q2
3. pd.merge(df_aa, df_bb, how="outer", on=["Id", "Q1", "Q2"])
```
### Exercise: Plotting with Pandas
1. Make a histogram of the number of buildings in the compound (`buildings_in_compound`). Determine the appropriate number of bins, then include the `bins` argument in your function to improve the chart.
2. Make a scatter plot of `years_farm` vs `years_liv` and color the points by `buildings_in_compound`.
3. (Optional) Make a bar plot of the mean number of rooms per wall type (use columns `rooms` and `respondent_wall_type`). Hint: check out the function `plot.bar`, and recall how to use `groupby` to apply statistics to grouped data.
Barbara (Instructor):
```python=
df.hist("buildings_in_compound")
```
The default bin boundaries are fractional, so let's inspect the data:
```python=
df.buildings_in_compound.describe()
```
| statistic | value |
|--|--:|
| count | 131.000000 |
| mean | 2.068702 |
| std | 1.241530 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 2.000000 |
| 75% | 3.000000 |
| max | 8.000000 |

Name: buildings_in_compound, dtype: float64

And then pick an appropriate number of bins (the maximum value is 8):
```python=
df.hist("buildings_in_compound", bins=8)
```
Scatter plot:
```python=
df.plot.scatter(x="years_farm", y="years_liv", c="buildings_in_compound", colormap="cool", sharex=False)
```
Group by:
```python=
rooms_mean = df.groupby("respondent_wall_type").mean()["rooms"]
rooms_mean.plot.bar()
```
-Reshmi
```python=
1. df['buildings_in_compound'].hist(bins=10)
2. df.plot.scatter(x='years_farm',y='years_liv',c='buildings_in_compound', colormap='viridis')
3.
```
-Michael
```python=
1. df['buildings_in_compound'].hist(bins=7)
2. df.plot.scatter(x='years_farm', y='years_liv', c='gps_Altitude', colormap='viridis', figsize=(10,5), sharex=False)
3.
```
Carissa
1. df["buildings_in_compound"].hist(bins=15)
2. df.plot.scatter(x="years_farm", y="years_liv", c="buildings_in_compound", colormap="viridis")
3.
Samareen:
```python=
df.hist(column = 'buildings_in_compound', bins = 8)
df.plot.scatter(x = 'years_farm',y = 'years_liv',c = 'buildings_in_compound')
walls_rooms = df.groupby('respondent_wall_type').describe()
walls_rooms
walls_rooms.plot.bar(x = 'mean', y = 'count')
```
Adri:
```python=
1. df_plot.hist(column = 'buildings_in_compound',bins = 8) #bins = 10 does not add anything, is default
2. df_plot.plot.scatter(x="years_farm", y="years_liv", c="buildings_in_compound", colormap="viridis",sharex=False)
3. ??
```
Babette
```python
df['buildings_in_compound'].hist(bins=8)
df.plot.scatter(x='years_farm', y='years_liv', c='buildings_in_compound', colormap = 'viridis', sharex=False)
```
Signe
```python=
1) df.hist(column ='buildings_in_compound', bins=7)
2) df.plot.scatter(x='years_farm', y='years_liv', c='buildings_in_compound', sharex= False)
```
Daniela
```python=
df['buildings_in_compound'].hist()
df['buildings_in_compound'].hist(bins=8)
df.plot.scatter(x='years_farm', y = 'years_liv', c="buildings_in_compound", colormap= 'viridis', sharex= False)
```
Roxane
```python=
1. df.hist(column ='buildings_in_compound', bins=7)
2. df.plot.scatter(x='years_farm', y='years_liv', c='buildings_in_compound', colormap='viridis', sharex=False)
3.
```
Anne Maaike
``` python
df.hist(column = 'buildings_in_compound', bins = 7)
df.plot.scatter(x='years_farm', y='years_liv', c='buildings_in_compound', colormap='autumn', sharex=False)
```
Hekmat
```python=
1. df.hist(column="buildings_in_compound", bins=8)
2. df.plot.scatter(x="years_farm", y="years_liv", c="buildings_in_compound", colormap="viridis")
```
Melisa
```python=
1.
df.hist(column='buildings_in_compound',bins=8)
2.
df.plot.scatter(x='years_farm', y='years_liv', c='buildings_in_compound', colormap='viridis', sharex=False)
3.
grouped_data=df.groupby('respondent_wall_type')['rooms'].mean()
```
Swee Chye:
``` python
1. df["buildings_in_compound"].hist()
df["buildings_in_compound"].hist(bins=14)
2. df.plot.scatter(x="years_farm", y="years_liv", c="buildings_in_compound", colormap="viridis", sharex=False, rot=45)
3.
```
### Exercise: Customize your plot
Revisit your favorite plot we've made so far, or make one with your own data, then:
- add axes labels
- add a title
- save it in two different formats
Roxane
``` python
plt.title("Time devoted to farming")
plt.ylabel("Years farming in area")
plt.xlabel("Years lived in area")
plt.scatter(x=parents_from_area['years_liv'], y=parents_from_area['years_farm'], label="one or both parents from area")
plt.scatter(x=parents_not_from_area['years_liv'], y=parents_not_from_area['years_farm'], label="no parents from area")
plt.legend()
plt.savefig("farming_parents.png", dpi=150)  # save before plt.show(), which clears the figure
plt.savefig("farming_parents.pdf", dpi=150)
plt.show()
```
Melisa
``` python
parent_not_area['source']="parent_not_area"
parent_from_area['source']="parent_from_area"
new_df = pd.concat([parent_from_area, parent_not_area])
sns.lmplot(data=new_df, x='years_liv', y='years_farm', hue='source')
```
Signe
```python=
sns.violinplot(data = df, x = "village", y = "buildings_in_compound", palette = "rocket")
plt.title("Buildings in compound per village") # Plot Title
plt.ylabel("Number of buildings in compound") # Name y-axis
plt.xlabel("Village") # Name x-axis
# setting labels after plotting, since the plotting function sets them equal to variable names
plt.show()
```
---
## Collaborative Notes
### Recap of day 3
- Introduced a new library: Pandas
```python=
import pandas as pd
```
- Learned to load datasets from a file
```python=2
df = pd.read_csv(...)
```
- Accessing columns in a dataframe
```python=3
df["A"] # single column
df.A
df[["A", "B"]] # multiple columns
```
- Accessing rows
```python=6
df[0:5] # first five rows (positions 0 to 4)
df.iloc[0]
df.iloc[[0, 1, 2]]
df.iloc[0, 1]
df.loc[0:5, ["A", "B"]]
```
- Filters/masks
```python=11
mask = df["A"] == -1 # condition
df[mask]
df[df["A"] == -1] # equivalent
mask = (df["A"] == -1) & (df["B"] == 1)
df[mask]
```
- Missing values
```python=16
import numpy as np
df.isnull() # cells with missing values
df.fillna(0) # replace NA with 0
df.replace(0, np.nan) # replace 0 with NA
```
- overview of a dataframe
```python=20
df.describe() # overview of numerical data
df.groupby("A").describe() # summary statistics per category in column A
```
### Joining Pandas dataframes
- appending rows
- appending columns
- more complex join: merging both rows & columns
#### Appending rows
We want to append rows when, say, we have different sets of measurements for the same experiment. First let's read the files:
```python=
df_a = pd.read_csv("SN7577i_a.csv")
df_b = pd.read_csv("SN7577i_b.csv")
```
You can inspect each dataframe with `df.head()`:
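For instance, to peek at the first rows of each file:
```python
df_a.head()  # first rows of SN7577i_a.csv
df_b.head()  # first rows of SN7577i_b.csv
```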
Let's merge the rows:
```python=3
df_merge_rows = pd.concat([df_a, df_b], ignore_index=True)
```
Note that without `ignore_index=True` the row indices of `df_a` and `df_b` are kept, so index values are repeated in the result; passing `ignore_index=True` (as above) resets the index.
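A minimal sketch of the difference (the exact index values depend on how many rows each file has):
```python
pd.concat([df_a, df_b]).index                     # original indices are kept, so values repeat
pd.concat([df_a, df_b], ignore_index=True).index  # a fresh RangeIndex from 0 to n-1
```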
Now let's look at the case where the columns of the two files do not match:
```python=
df_aa = pd.read_csv("SN7577i_aa.csv") # Q1, Q2, Q3
df_bb = pd.read_csv("SN7577i_bb.csv") # Q1, Q2, Q4
```
If we concatenate them:
```python=3
df_merge_row = pd.concat([df_aa, df_bb], ignore_index=True)
```
| | Id | Q1| Q2 | Q3| Q4
|--|--|--|--|--|--|
|0 | 1| 1| -1 | 1.0 | NaN
|1 | 2| 3| -1| 1.0| NaN
|2 | 3| 10 | 3 | 2.0 | NaN |
|...|
|10 | 1277 | 10 | 10 | NaN | 6.0 |
|11 | 1278 | 2 | -1 | NaN | 4.0 |
|12 | 1279 | 2 | -1 | NaN | 5.0 |
- Columns missing from one dataframe are filled with NaN for the rows coming from the other dataframe
- Column data types are coerced to a type (float) that can hold both the original integers and NaN, because NaN can only be represented as a float
*Tip:* If you want to keep integer values as integers, you can convert the columns to the pandas-specific nullable integer type:
```python=4
df_merge_row.astype("Int64")
```
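For instance, to convert only the columns that picked up NaNs (a sketch; `Q3` and `Q4` are the affected columns in the example above):
```python
# nullable integer columns keep NaN while storing whole numbers as integers
df_merge_row[["Q3", "Q4"]] = df_merge_row[["Q3", "Q4"]].astype("Int64")
```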
#### Appending columns from a different dataframe
```python=
df_c = pd.read_csv("SN7577i_c.csv")
df_d = pd.read_csv("SN7577i_d.csv")
```
Merge the columns:
```python=
df_merge_cols = pd.concat([df_c, df_d], axis="columns")
```
| | Id| maritl| numhhd| Id| Q1| Q2|
|--|--|--|--|--|--|--|
| 0| 1| 6| 3| 1| 1| -1|
| 1 | 2| 4 | 3| 2| 3| -1
|2 | 3 | 6 | 2 | 3 | 10| 3
*Note:* common columns are repeated. This can be avoided by selecting only the columns you want before concatenating (see the sketch after the table below). When repeated columns are present, accessing by label returns both columns:
```python=
df_merge_cols["Id"]
```
|| Id| Id|
|--|--|--|
|0 | 1 | 1
|1 | 2 | 2
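For example, to avoid the duplicated `Id` column you can keep only the columns you need from one of the dataframes before concatenating (a sketch; column names taken from the tables above):
```python
# take all of df_c, but only Q1 and Q2 from df_d, so Id appears once
df_merge_cols = pd.concat([df_c, df_d[["Q1", "Q2"]]], axis="columns")
```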
#### Joining data
##### Inner join
```python=
df_merged_inner = pd.merge(df_c, df_d, how="inner")
```
|| Id | maritl| numhhd| Q1| Q2|
|--|--|--|--|--|--|
|0 |1 | 6 | 3 | 1 | -1
|1 |2 | 4 | 3 | 3 | -1
|2 |3 | 6| 2 | 10| 3
|3 | 4 | 4 | 1 | 9 | -1
- must have at least one column in common
- only rows where the values in the common column match will be retained
- under the hood, Pandas finds the common column(s) and merges based on those
- it is good practice to explicitly specify the column to use
```python=2
pd.merge(df_c, df_d, how="inner", on="Id")
```
##### Outer join
```python=3
pd.merge(df_c[0:3], df_d[5:10], how="outer", on="Id")
```
|| Id| maritl| numhhd| Q1| Q2|
|--|--|--|--|--|--|
| 0| 1| 6.0| 3.0 | NaN |NaN
| 1| 2| 4.0 | 3.0 | NaN| NaN
|2| 3| 6.0 | 2.0 | NaN| NaN
|3| 6| NaN | NaN | 1.0| -1.0
|4| 7 | NaN | NaN | 1.0 | -1.0
|5| 8 | NaN | NaN | 1.0 | -1.0
- the result is the union of the rows: all values of the common column from both dataframes are kept, with NaN filled in where the other dataframe has no matching row
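If you want to see which dataframe each row came from, `pd.merge` also accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both` (a small sketch, not part of the session code):
```python
# same outer join as above, with an extra column showing the origin of each row
pd.merge(df_c[0:3], df_d[5:10], how="outer", on="Id", indicator=True)
```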
Let's try to merge dataframes with more than one common column:
```python=4
df_c_q = pd.concat([df_c, df_d["Q1"]], axis=1) # dataframe with >1 common column
pd.merge(df_c_q, df_d, on="Id")
```
|| Id| maritl| numhhd| Q1_x| Q1_y| Q2|
|--|--|--|--|--|--|--|
|0 | 1| 6| 3| 1 | 1 | -1
| 1 | 2| 4 | 3 | 3 | 3 | -1
|2| 3| 6| 2 | 10| 10| 3
- any common column not used in the merge (i.e. not listed in `on=...`) is kept from both dataframes, with the suffixes `_x` and `_y` appended to its name, and the values are retained
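If you prefer more descriptive names than the default `_x`/`_y`, `pd.merge` takes a `suffixes` argument (a sketch using the same dataframes):
```python
# Q1 from df_c_q becomes Q1_c, Q1 from df_d becomes Q1_d
pd.merge(df_c_q, df_d, on="Id", suffixes=("_c", "_d"))
```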
We can also merge on multiple columns:
```python=6
pd.merge(df_c_q, df_d, on=["Id", "Q1"])
```
|| Id| maritl| numhhd| Q1| Q2
|--|--|--|--|--|--|
|0| 1| 6 | 3 | 1| -1
|1 | 2| 4 | 3| 3| -1
### Plotting with Pandas and Matplotlib
Let's work with the SAFI dataset, and get a histogram:
```python=
df = pd.read_csv("SAFI_full_shortname.csv")
df["years_liv"].hist() # histogram
```
Let's customise a bit and specify the number of bins:
```python=3
df["years_liv"].hist(bins=20)
```
Let's group by:
```python=4
df.hist(column="years_liv", by="village")
df["years_liv"].hist(by=df["village"]) # alternate
```
*Note:* the second method is less preferable because it is easier to make a mistake and pass an incorrect `by=...`.
Change the layout and figure size:
```python=6
df.hist(column="years_liv", by="village", layout=(1,3), figsize=(10, 5))
```
#### Scatter plots
```python=
df.plot.scatter(x="gps_Latitude", y="gps_Longitude")
```
Adding more information (see [Resources](#resources) for more colormaps):
```python=2
df.plot.scatter(x="gps_Latitude", y="gps_Longitude", c="gps_Altitude", colormap="viridis")
```
*Note:* to get the axis tick labels to show up properly, see this [SO answer](https://stackoverflow.com/questions/66444722/scatter-plot-x-axis-tick-labels-not-showing-up).
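One workaround, used in several of the exercise answers above, is to pass `sharex=False` so the x tick labels stay visible when the colorbar is added:
```python
df.plot.scatter(x="gps_Latitude", y="gps_Longitude", c="gps_Altitude",
                colormap="viridis", sharex=False)
```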
#### Boxplot and other plot types
The default pandas style isn't very readable:
```python=
df.boxplot(by ='village',column=['buildings_in_compound'])
```
Using `seaborn`, we can get much cleaner plots:
```python=
import seaborn as sns
sns.boxplot(data=df, x="respondent_wall_type", y="buildings_in_compound")
```
```python=
sns.violinplot(data=df, x="respondent_wall_type", y="buildings_in_compound")
```
```python=
sns.lmplot(data=df, x='years_farm', y='years_liv',hue='village')
```
### Using matplotlib directly
Let's get the data we want to plot:
```python=
parents_from_area = df[(df.parents_liv=="yes") | (df.sp_parents_liv=="yes")] # at least one parent lives in the area
parents_not_area = df[(df.parents_liv=="no") & (df.sp_parents_liv=="no")] # neither parent lives in the area
```
Build the plot gradually:
```python=
import matplotlib.pyplot as plt
plt.title("Time devoted to farming")
plt.ylabel("Years farming in area")
plt.xlabel("Years lived in area")
# scatter plot from two different datasets
plt.scatter(x=parents_from_area["years_liv"], y=parents_from_area["years_farm"],
label="one or both parents from area")
plt.scatter(x=parents_not_area["years_liv"], y=parents_not_area["years_farm"],
label="neither parent from area")
plt.legend() # include legend
plt.show() # not necessary when using Jupyter
```
To save a plot to a file, add `plt.savefig(...)` before you call `plt.show()`:
```python
plt.savefig("farming_parents.png", dpi=150)
```
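Calling `savefig` again with a different file extension gives a second format, e.g. a PDF (as in the exercise answers above):
```python
plt.savefig("farming_parents.pdf", dpi=150)
```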
---
## Tips
*What you think we should improve*
* Missed the ice breaker today
* Slightly abrupt ending today with the plotting. Lost the overview a little because we went into 3 different packages with plots. Would prefer 1 big final exercise where we bring everything together (e.g. loop over dataframes to merge them, then define a function that prints a plot based on datatype with if statements, for example)
* Could have done tip / top verbally.
* Overall recap at the end would be nice.
* If possible, can do a recording of each of the 4 days, so that we can review and go over those areas of doubts, uncertainty, and help to recap and recall what we learnt over the 4 days
## Tops
*What you liked about the workshop*
* recaps and overviews always help tie everything together +1 +1
* Number of helpers makes sure everything can continue, and helpers could split up to help people/search for links etc. +1 +1 :+1:
* Love how interactive it is. +1
* The collaborative document is very helpful! Also thanks for synchronizing notes here.
* You all seem to know what you are doing, and when you don't you find solutions.
---
## Resources
- Pandas user guide: [Merge, join, concatenate and compare](https://pandas.pydata.org/docs/user_guide/merging.html)
- Matplotlib [color maps](https://matplotlib.org/stable/gallery/color/colormap_reference.html)
- Seaborn [API docs](https://seaborn.pydata.org/api.html)
- [Post-workshop survey](https://carpentries.typeform.com/to/UgVdRQ?slug=2022-05-30-dc-socsci-python-nlesc&typeform-source=esciencecenter-digital-skills.github.io)