![](https://i.imgur.com/iywjz8s.png)

# Collaborative Document 2022-06-01 Data Carpentry with Python (day 3)

Welcome to The Workshop Collaborative Document. This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

----------------------------------------------------------------------------

This is the Document for today: [link](https://tinyurl.com/python-socsci-day3)

Collaborative Document day 1: [link](https://tinyurl.com/python-socsci-day1)
Collaborative Document day 2: [link](https://tinyurl.com/python-socsci-day2)
Collaborative Document day 3: [link](https://tinyurl.com/python-socsci-day3)
Collaborative Document day 4: [link](https://tinyurl.com/python-socsci-day4)

## 👮 Code of Conduct

Participants are expected to follow these guidelines:

* Use welcoming and inclusive language.
* Be respectful of different viewpoints and experiences.
* Gracefully accept constructive criticism.
* Focus on what is best for the community.
* Show courtesy and respect towards other community members.

## ⚖️ License

All content is publicly available under the Creative Commons Attribution License: [creativecommons.org/licenses/by/4.0/](https://creativecommons.org/licenses/by/4.0/).

## 🙋 Getting help

To ask a question, type `/hand` in the chat window.

To get help, type `/help` in the chat window.

You can ask questions in the document or chat window and helpers will try to help you.

## 🖥 Workshop information

:computer: [Workshop website](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/)

🛠 [Setup](https://esciencecenter-digital-skills.github.io/2022-05-30-dc-socsci-python-nlesc/#setup)

:arrow_down: Download [pandas-data.zip](https://github.com/esciencecenter-digital-skills/2022-05-30-dc-socsci-python-nlesc/raw/main/files/pandas-data.zip)

## 👩‍🏫👩‍💻🎓 Instructors

Barbara Vreede, Francesco Nattino

## 🧑‍🙋 Helpers

Suvayu Ali, Dafne van Kuppevelt, Candace Makeda Moore

## 👩‍💻👩‍💼👨‍🔬🧑‍🔬🧑‍🚀🧙‍♂️🔧 Roll Call

Name / pronouns (optional) / job, role / social media (twitter, github, ...) / background or interests (optional) / city
* Lieke de Boer (she, her) / Community Manager @ eScience Center / Open Science, Neuroscience
* Suvayu Ali / RSE @ NL eScience Center
* Candace Makeda Moore (she, her) / Research Software Engineer @ eScience Center / Medicine
* Francesco Nattino / RSE @ eScience Center / Leiden
* Kevin Wittenberg (he, him), PhD in Sociology, Utrecht University / interests in network analysis, human cooperation, convergence of classical statistics & ML / Joining ODISSEI summer school
* Daniela Negoita / Junior Researcher / European Values Study (Tilburg University)
* Roxane Snijders (she, her) / PhD / Amsterdam University / Plant Physiology
* Carissa Champlin / Assistant Professor, Human-Centered Design, TU Delft, Planning support methods for participatory design
* Adri Mul (he), AmsterdamUMC, research analyst
* Signe Glæsel (she, her) / Student Assistant Datamanager @ DANS / Psychology, Statistics / Leiden
* Michael Quinlan (he, him), Researcher of Observational Technology at KNMI / meteorology, physics, data science / Amsterdam / https://www.linkedin.com/in/michaeljohnquinlan/
* Jeanette Hadaschik, University of Twente & Maastricht, Behavioural sciences/psychology, www.linkedin.com/in/jeanette-hadaschik
* Kyri Janssen (she, her) / PhD researcher / Delft
* Melisa Diaz / she, her / Post-doc researcher / VU-Politecnico di Milano
* Anne Maaike Mulders (she, her) / PhD in Sociology at Radboud University / Social networks and inequality
* Lianne Bakkum / postdoc @ VU Amsterdam
* Samareen Zubair / Research Masters / Tilburg University
* Swee Chye (he, him)
* Reshmi G Pillai (she, her) / Lecturer at the University of Amsterdam / social media text, natural language processing, computational social science
* Babette Vlieger / PhD at Molecular Plant Pathology / University of Amsterdam
* Rael Onyango / PhD in Entrepreneurial Economics @ VU Amsterdam
* Ranran Li (she/her) / PhD in Psychology at the VU University Amsterdam

## Ice breaker

What is your favourite video game (or analog one if not into video games) from "old times"?

* Francesco: Super Mario Land - Game Boy
* Lieke: I used to play Olympic games on MS-DOS, which was incredibly tiring because you had to press the space bar really quickly to, for example, run.
* Suvayu: Quake; these days I play a bit of Stellaris
* Adri: Old times: Boulderdash, Heroes, Maze of Galious, Nemesis. Now: Cities
* If we liberally define 2012 as "old times", Guild Wars 2. Or OG Star Wars Battlefront (2004)
* Makeda: chess
* Pokemon!
* Lianne: Super Mario Bros 3 on the NES
* Roxane: Harry Potter 2 on the PS2
* Anne Maaike: Animal Crossing: New Horizons
* Signe: Halo Reach
* Kyri: the Sims
* Carissa: The Oregon Trail
* Jeanette: Super Mario on GameBoy
* Melisa: Mario Kart - Nintendo 64
* Michael: Mario
* Swee Chye: Pacman
* Hekmat: Age of Empires
* Samareen: Pacman, Super Mario Bros, Pokemon, Street Fighter
* Reshmi: Road Rash

## 🗓️ Agenda

| Time | Topic |
|--:|:---|
| 9:00 | Welcome, recap of day 2 |
| 9:15 | Reading data from a file using Pandas |
| 10:15 | Coffee break |
| 10:30 | Extracting rows and columns |
| 11:30 | Coffee break |
| 11:45 | Data Aggregation using Pandas |
| 12:45 | Day 3 wrap-up |
| 13:00 | END |

## 🔧 Exercises

**Did you manage to complete the exercise?**

| Name | Done |
|:--- |:--- |
| Barbara (example) | :question: |
| Francesco (example) | :heavy_check_mark: |
| Adri | |
| Anne Maaike | |
| Babette | |
| Carissa | |
| Cecilia | |
| Daniela | :heavy_check_mark: |
| Hekmat | |
| Ilaria | |
| Jeanette | |
| Kasimir | |
| Kevin | |
| Kyri | |
| Lianne | |
| Melisa | |
| Michael Q | :heavy_check_mark: |
| Philippine | |
| Rael O. | |
| Ranran Li | |
| Reshmi | |
| Roxane | |
| Samareen | |
| Signe | |
| Swee Chye | |

### Exercise: head and tail

1. As well as the `head()` method there is a `tail()` method. What do you think it does? Try it.
2. (Optional) Both methods accept a single numeric parameter. What do you think it does? Try it.

* Michael: The parameter returns the number of rows in the dataframe; if head(5), first five rows, if tail(5), last five rows.
* Roxane: 1. The
* Kevin: 1. Last (5) rows 2. Specify number of rows
* Hekmat: 1. last 5 rows 2. number of rows
* Babette: 1. Last 5 rows 2. specify nr of rows shown

### Exercise: Print all columns

1. When we asked for the column names and their data types, the output was abridged, i.e. we didn't get the values for all of the columns. Can you write a small piece of code which will print all of the values on separate lines? Paste your code in the collaborative document.
2. (optional): Using if statements, write a piece of code that prints 'big dataset' if the number of columns is larger than 10, and 'small dataset' if the number of columns is smaller.
3. (even more optional): Write a function that returns whether a dataset has more than 10 columns (it should return a boolean value).

* Reshmi:
  1. `for column_name in df.columns: print(column_name)`
  2. `if (len(df.columns)>10): print("Big Dataset") elif (len(df.columns)<10): print("Small Dataset")`
  3. `def is_big_dataset(df): return (len(df.columns)>10)`

* Adri:
```python
# 1.
for i in df.columns:
    print(i)
# 2.
if len(df.columns) > 10:
    print("Big data")
else:
    print("Small data")
# 3.
def countcolumns(df):
    return len(df.columns) > 10
```

* Roxane:
```python
for i in df.columns:
    print(i)
```

* Michael:
```python
# (1)
columns = [print(col) for col in df.columns]
# (2)
if len(columns) > 10:
    print('big dataset')
elif len(columns) < 10:
    print('small dataset')
# (3)
def more_than_ten_columns(df):
    '''
    input: dataframe
    output: boolean value (true/false) based on data set greater than 10 columns
    '''
    columns = [col for col in df.columns]
    if len(columns) > 10:
        return True
```

* Kasimir: `for i in df.columns: print(i)`

* Melisa:

1.
```python
for col_name in df.columns:
    print(col_name)
```
2.
```python
if len(df.columns) > 10:
    print("big dataset")
else:
    print("small dataset")
```

* Babette:
```python=
for i in df.columns:
    print (i)
```

* Kevin:
```python=
# 1.
%%capture
with pd.option_context('display.max_rows', None):
    print(df.dtypes)
# 2.
if len(df.dtypes) > 10:
    print("Big dataset")
else:
    print("small dataset")
# 3.
def big_data(df):
    return len(df.dtypes) > 10
```

* Swee Chye:
```python
for col in df.columns:
    print(col)

if len(df.columns) > 10:
    print("big dataset")
else:
    print("small dataset")
```

* Anne Maaike:
```python
list(df.columns)

if (len(df.columns) > 10):
    print("big dataset")
elif (len(df.columns) <= 10):
    print("small dataset")
```

* Signe:
```python
if len(df) > 10:
    print("BIG DATA")
else:
    print("lil data")

for i in df.columns:
    for j in df.dtypes:
        print("Name: " + str(i) + " Type: " + str(j))
```

* Kyri:
```python
for i in df.columns:
    print(i)

for i in df.columns:
    if len(df.columns) < 10:
        print("small dataset")
    elif len(df.columns) > 10:
        print("big dataset")
```

* Lianne: `for i in df.columns: print(i)`

### Exercise: Pandas columns

What happens if you:

1. List the columns you want out of order from the way they appear in the file?
2. Put the same column name in twice?
3. Put in a non-existing column name? (a.k.a Typo)

* Reshmi:
```python=
# 1.
df.reindex(columns=["Q2","Q1"])
# 2.
df.loc[0:9,["Q1","Q2","Q2","Q3"]]
# 3.
```

* Michael:
```python
data = "SN7577.tab"
sep = "\t"
cols = ['Q5avi', 'Q3', 'Q1', 'Q2']
df_few_columns = pd.read_csv(data, sep, usecols = cols)
df_few_columns
# (1) Column output is in order of how they are organized in df.
# (2) If same column name is selected twice, only one will appear in output.
# (3) Error message: ValueError (unidentified column name)
```

* Kevin:
```python=
# 1.
df.loc[:6, ["Q3", "Q2"]]  # They appear in mentioned order in new df
# 2.
df.loc[:6, ["Q3", "Q3"]]  # They appear twice in new df
# 3.
df.loc[:6, ["Q3", "Z1"]]  # Error - execution halted
```

* Roxane:
  1. They appear in the specified order
  2. The columns appear twice
  3. Error

* Melisa:
```python=
df[1:10][["Q2","Q3","Q1"]]    # appear in the selected order
df[1:10][["Q2","Q2","Q1"]]    # Q2 appears twice
df[1:10][["Q2","Q2","Q100"]]  # error
```

* Babette:
```python=
df[["Q2", "Q1", "Q3"]]  # appear in the specified order
df[["Q1", "Q1"]]        # Q1 appears twice
df[["Q1", "Qx"]]        # error
```

* Adri:
```python=
1 [["Q2", "Q1", "Q3"]]
2 [["Q2", "Q2", "Q3"]]  # it is there twice
```

* Kyri:
```python
df[["Q2", "Q1", "Q3"]]                 # same order as notation
df.[:10, ["Q2", "Q1", "Q3", "Q2"]]     # Q2 appears twice
df[["Q2", "Q1", "Q3", "Q2", "blurp"]]  # error
```

* Hekmat:
```python=
# 1.
df[["Q1", "Q3", "Q2"]]   # shows columns in specified order
# 2.
df[["Q1", "Q3", "Q3"]]   # repeats column as specified
# 3.
df[["Q1", "Q23", "Q2"]]  # program throws a fit
```

* Signe:
  1. `df[["Q1", "Q3", "Q2"]]` They appear in the order put.
  2. `df[["Q1", "Q2", "Q2"]]` It will appear twice.
  3. `df[["Q1", "Q2", "Qx"]]` An error will occur.

* Anne Maaike:
```python
df.loc[1:10, ["Q2", "Q1"]]  # they appear in the given order
df.loc[1:10, ["Q2", "Q1"]]  # column is simply repeated
df.loc[1:10, ["q1"]]        # key error
```

* Swee Chye:
```python
# out of order
df[1:5][["Q3", "Q1", "Q2"]]
# same column name
df[6:10][["Q3", "Q3", "Q1"]]
# typo / non-existing column
df[11:15][["Q3", "Q3", "Q10"]]
```

* Daniela: `df[["Q3", "Q2", "Q5aiii"]]`, `df[["Q4", "Q4"]]`, `df[["Q5"]]`
### Exercise: aggregation

In breakout rooms. Discuss the answers in your group and write the answers in the collaborative document.

1. Read in the SAFI_results.csv dataset.
2. Get a list of the different E26_affect_conflicts values.
3. Groupby E26_affect_conflicts and describe the results.
4. How many of the respondents never had any conflicts?
5. (optional) Using groupby, find out whether farms that use water ('E01_water_use') have more plots ('D_plots_count') than farms that do not use water. (A short sketch follows the answers below.)

* Reshmi:
```python=
# 2.
pd.unique(df_SAFI['E26_affect_conflicts'])
# array([nan, 'once', 'never', 'more_once', 'frequently'], dtype=object)
# 3.
grouped_data_byE26 = df_SAFI.groupby('E26_affect_conflicts')
# 4. 46
# 5.
```

* Melisa: 4. 46 of the respondents never had a conflict, 9 instead had a conflict frequently

* Roxane:
```python=
# 2.
pd.unique(df_SAFI['E26_affect_conflicts'])
# 3.
grouped_data_E26 = df_SAFI.groupby('E26_affect_conflicts').describe()
# 4. 46
```

* Babette:
```python=
df_SAFI = pd.read_csv("SAFI_results.csv")
pd.unique(df_SAFI['E26_affect_conflicts'])
grouped_data_E26 = df_SAFI.groupby('E26_affect_conflicts')
grouped_data_E26.describe()
# 46 respondents never had any conflicts
```

* Kyri:
```python=
pd.unique(df_SAFI["E26_affect_conflicts"])
grouped_data3 = df_SAFI.groupby(["E26_affect_conflicts"])
grouped_data3.describe()
# 46
```

* Signe:
```python=
# 2.
pd.unique(df_SAFI['E26_affect_conflicts'])
# 3.
grouped_data.describe()
# 4.
len(df_SAFI[df_SAFI["E26_affect_conflicts"] == "never"])  # could also just read previous table
# 5.
df_SAFI.groupby(['E01_water_use', 'D_plots_count']).describe()
```

* Daniela:
```python=
# 1.
df = pd.read_csv("pandas-data/SAFI_results.csv")
df
# 2.
df['E26_affect_conflicts'].describe()
# 3.
df.groupby("E26_affect_conflicts").describe().T
# 4. 46 people
```

* Anne Maaike:
```python
pd.unique(df_SAFI['E26_affect_conflicts'])
groupi = df_SAFI.groupby('E26_affect_conflicts')
groupi.describe()
# 46 respondents never experienced conflicts
```

* Hekmat:
```python=
df2 = pd.read_csv("SAFI_results.csv")
pd.unique(df2['E26_affect_conflicts'])
grouped_df2 = df2.groupby('E26_affect_conflicts')
grouped_df2.describe()
grouped_E01 = df2.groupby('E01_water_use')
grouped_E01['D_plots_count'].describe()
# 4. 46
# 5. means: yes=2.5, no=1.58
```
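A minimal sketch for question 5, which several of the answers above left open (assuming `df_SAFI` was read from `SAFI_results.csv` as in question 1, and using the column names given in the exercise):

```python
import pandas as pd

df_SAFI = pd.read_csv("SAFI_results.csv")

# summarise the number of plots separately for farms that do and do not use water
grouped_by_water = df_SAFI.groupby("E01_water_use")
grouped_by_water["D_plots_count"].describe()
```

Comparing the group means (as in Hekmat's answer above, roughly 2.5 vs 1.6) suggests that farms using water report more plots on average.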
## 🧠 Collaborative Notes

On day 2 we learned how to:

* Work in the Jupyter environment
* Create variables and assign values to them
* Check the type of a variable (integer, string, boolean, list)
* Perform simple arithmetic operations
* Specify parameters when using built-in functions
* Get help for functions
* Conditionally execute a section of code (`if` statement)
* Execute a section of code on a list of items (`for` loop)
* Create and use simple functions
* Use functions written by others by importing libraries into your code

### Concept recap

Variables:

```python=
a = 3              # integer
b = "hello world"  # string
c = True           # boolean
d = [1, 2, 3, 4]   # list
```

Function (call):

```python=5
print(a)  # built-in function to print
# 3
```

Methods:

```python=7
b.split(" ")
# ['hello', 'world']
```

Control structures:

```python=9
if condition:
    # code if condition is True

for i in iterable:
    # code to repeat for each of `i`
```

Function (definition):

```python=14
def function_name(arg1, arg2):
    """
    What does the function do (this is a docstring)
    """
    # body of function
    return variable
```

List vs Tuple:

```python=
a = [1, 2, 3, 4]  # list: mutable
b = (1, 2, 3, 4)  # tuple: immutable
```

### Working with Pandas

- Reading data stored in CSV files (other file formats can be read as well)
- Slicing and subsetting data in Dataframes (= tables)
- Dealing with missing data
- Inserting and deleting columns from data structures
- Aggregating data using grouping facilities (the split-apply-combine paradigm)
- Joining of datasets (after they have been loaded into Dataframes)
- Creating simple plots

```python=
import pandas as pd
```

We will look at `Series` and `DataFrame`.

- `Series`: like a column, or a vector in other languages
- `DataFrame`: like a table, a 2D array

*Tip:* to figure out the current working directory:

- from Python:

```python
import os
os.getcwd()
# /path/to/working/directory
```

- inside a Jupyter cell, type `!pwd`

#### Load Data

We read a tab-separated file and save the result in a variable `df`.

```python=2
df = pd.read_csv("SN7577.tab", sep="\t")
```

*Tip:* You can find all supported options in the docstring of `read_csv`: `help(pd.read_csv)`

You can inspect the dataframe by executing the following:

```python
df  # print out of the dataframe
```

The output can be limited to fewer rows by calling `df.head()` or `df.tail()`. Both take a numeric argument which specifies the number of rows to include.

### Exploring the dataframe

```python=3
len(df)   # length/number of rows
# 1286
df.shape  # number of rows and columns
# (1286, 202)
df.size   # number of cells (rows x columns)
# 259772
```

*Note:* unlike `.head()` or `.tail()`, `.shape` and `.size` are attributes, not functions. You can inspect them either by looking at the completion menu shown in Jupyter when you press `tab`, or by looking at their type: `type(df.shape)`.

```python=9
df.columns
# Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av',
#        'Q5avi',
#        ...
#        'numhhd', 'numkid', 'numkid2', 'numkid31', 'numkid32', 'numkid33',
#        'numkid34', 'numkid35', 'numkid36', 'wts'],
#       dtype='object', length=202)

df.dtypes
# Q1            int64
# Q2            int64
# Q3            int64
# Q4            int64
# Q5ai          int64
#               ...
# numkid33      int64
# numkid34      int64
# numkid35      int64
# numkid36      int64
# wts         float64
# Length: 202, dtype: object
```

*Note:* When printing long output, Pandas abridges the output to fit on the screen. To inspect all of the values, we can resort to plain Python:

```python=
for col in df.columns:
    print(col)
# Q1
# Q2
# ...
```
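An alternative to the loop (used in Kevin's exercise answer above) is to temporarily lift Pandas' display limit so that the full listing prints without being abridged; a minimal sketch:

```python
# temporarily allow an unlimited number of rows in the printed output
with pd.option_context('display.max_rows', None):
    print(df.dtypes)
```

Outside the `with` block the usual display limit applies again.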
### Selecting columns & rows

Read the first 3 columns:

```python=
df_few_cols = pd.read_csv("SN7577.tab", sep='\t', usecols=[0, 1, 2])
```

or, by name:

```python=2
df_few_cols = pd.read_csv("SN7577.tab", sep='\t', usecols=["Q1", "Q2", "Q3"])
```

Selecting a column from a dataframe:

```python=3
df["Q1"]
df.Q1  # supported, but not recommended as it might not always work
```

Select multiple columns:

```python=5
df[["Q1", "Q2", "Q3"]]
```

Selecting rows:

```python=6
df[0:10]  # select first 10 rows
```

*Note:* using just the number will not work:

```python
df[0]  # gives `KeyError`
```

This is because Pandas is looking for a column called `0`.

### Indexing by position

```python=
df.iloc[[0, 1, 2]]  # first 3 rows
```

You can also select columns by providing a second argument:

```python=2
df.iloc[1:10, [0, 1, 2]]  # select rows 2-10 and columns 1-3
```

*Note:* row and column numbers are 0-indexed (counting starts at 0).

Indexing by name:

```python=3
df.loc[[0, 1, 2], ["Q1", "Q2", "Q3"]]  # select the first 3 rows and columns Q1-Q3
```

*Tip:* The indexer method (`.iloc[]` or `.loc[]`) accepts:

- slices: `0:4` (0-3)
- lists: `[0, 1, 2, 3]` or `["Q1", "Q2", "Q3"]`
- a name: for a single row or column, just the label, say `0` or `"Q1"`

Some other column manipulations:

```python=4
df[["Q2", "Q1"]]   # reorder columns
df[["Q1", "Q1"]]   # repeat columns
df["nonexistent"]  # raises `KeyError`
```

### Filtering rows

Filter all rows where Q1 is equal to 1:

```python=
mask = df["Q1"] == 1
df[mask]           # rows where Q1 is 1
df[df["Q1"] == 1]  # equivalent to the above
```

Combining multiple conditions:

```python=4
mask = (df["Q1"] == 1) & (df["Q2"] == -1)
df[mask]
```

*Note:* the `&` operator is overloaded in Pandas to do an element-wise boolean operation between arrays. In plain Python, the ampersand is a bitwise operator; the two should not be confused.

To select specific columns after applying our mask, we can do:

```python=6
df[mask][["Q1", "Q2"]]  # select columns Q1 and Q2 after filtering
```

### Summarising data

```python=
df_SAFI = pd.read_csv("SAFI_results.csv")
df_SAFI.describe()    # summary of numeric columns
df_SAFI.describe().T  # transposes the output, useful if you have lots of columns
```

*Tip:* `df.info()` gives you a summary of data types.

```python=4
df_SAFI.min().T
```

Some text columns are included here, because text can be sorted (alphabetically).

```python=5
df_SAFI.count()
```

Note that some columns have a different count, because they have missing values, and missing values are not included in summary statistics.

```python=6
df_SAFI["B_no_membrs"].describe()
```

We can also call `.describe()` on individual columns.

```python=7
print(df_SAFI['B_no_membrs'].min())
print(df_SAFI['B_no_membrs'].max())
print(df_SAFI['B_no_membrs'].mean())
print(df_SAFI['B_no_membrs'].std())
print(df_SAFI['B_no_membrs'].count())
print(df_SAFI['B_no_membrs'].sum())
# 2
# 19
# 7.190839694656488
# 3.1722704895263734
# 131
# 942
```

The same works for the other summary statistics methods.
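Pulling the filtering and summarising steps together: a boolean mask (as in the Filtering rows section above) can be combined with a column selection before calling `.describe()`. A minimal sketch, assuming the `E01_water_use` column is coded as 'yes'/'no' (as the aggregation exercise answers suggest):

```python
# household size summary, restricted to farms that report using water for irrigation
mask = df_SAFI["E01_water_use"] == "yes"
df_SAFI[mask]["B_no_membrs"].describe()
```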
### Dealing with missing values

We can count missing values in each column with:

```python=
df_SAFI.isnull().sum()
```

We can also do this for individual columns:

```python=2
df_SAFI['E19_period_use'].isnull().sum()
df_SAFI['E19_period_use'].describe()
```

In this column, a missing value means the land wasn't irrigated, so we can fill the missing values with 0.

```python=4
df_SAFI['E19_period_use'].fillna(0, inplace=True)
```

*Note:* `inplace=True` overwrites the existing dataframe; this is usually not considered good practice.

If not using `inplace`, we can save the new filled column into the original dataframe as:

```python=
df_SAFI['E19_period_use_noNA'] = df_SAFI['E19_period_use'].fillna(0)
df_SAFI['E19_period_use'].isnull().sum()
# 39
df_SAFI['E19_period_use_noNA'].isnull().sum()
# 0
```

Similarly, we can also replace existing values:

```python=6
import numpy as np  # need numpy for np.NaN
df_SAFI["E19_period_use_withNA"] = df_SAFI['E19_period_use_noNA'].replace(0, np.NaN)
```

We can see the columns we manipulated like this:

```python=8
df_SAFI[[col for col in df_SAFI.columns if col.startswith("E19")]].describe()
```

|       | E19_period_use | E19_period_use_noNA | E19_period_use_withNA |
|-------|---------------:|--------------------:|----------------------:|
| count | 92.000000 | 131.000000 | 92.000000 |
| mean  | 12.043478 | 8.458015 | 12.043478 |
| std   | 8.583031 | 9.062399 | 8.583031 |
| min   | 1.000000 | 0.000000 | 1.000000 |
| 25%   | 4.000000 | 0.000000 | 4.000000 |
| 50%   | 10.000000 | 5.000000 | 10.000000 |
| 75%   | 20.000000 | 15.500000 | 20.000000 |
| max   | 45.000000 | 45.000000 | 45.000000 |

*Note:* we are indexing by creating a filtered list of columns that start with `"E19"`:

```python
[col for col in df_SAFI.columns if col.startswith("E19")]  # list comprehension
# ['E19_period_use', 'E19_period_use_noNA', 'E19_period_use_withNA']
```

### Categorical variables

```python=
df_SAFI.dtypes[:20]  # look at the data types of the first 20 columns
```

Let's look at `"C01_respondent_roof_type"`:

```python=2
pd.unique(df_SAFI['C01_respondent_roof_type'])
# array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
```

We can group by the roofing type:

```python=4
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()
```

We can inspect a specific column:

```python=6
grouped_data["A11_years_farm"].describe()
```

| C01_respondent_roof_type | count | mean | std | min | 25% | 50% | 75% | max |
|--|--|--|--|--|--|--|--|--|
| grass | 73.0 | 14.986301 | 11.635068 | 1.0 | 6.0 | 12.0 | 20.00 | 60.0 |
| mabatipitched | 10.0 | 21.300000 | 11.916841 | 9.0 | 17.0 | 20.0 | 20.75 | 53.0 |
| mabatisloping | 48.0 | 15.979167 | 9.315943 | 2.0 | 10.0 | 16.0 | 20.25 | 50.0 |

We can also group by multiple columns:

```python=7
grouped_data2 = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
grouped_data2["A11_years_farm"].describe()
```

| C01_respondent_roof_type | C02_respondent_wall_type | count | mean | std | min | 25% | 50% | 75% | max |
|--|--|--|--|--|--|--|--|--|--|
| grass | burntbricks | 22.0 | 16.772727 | 9.159600 | 5.0 | 11.25 | 15.0 | 20.75 | 41.0 |
| | muddaub | 42.0 | 13.904762 | 13.110803 | 1.0 | 4.25 | 11.0 | 20.00 | 60.0 |
| | sunbricks | 9.0 | 15.666667 | 10.087121 | 6.0 | 10.00 | 12.0 | 17.00 | 35.0 |
| mabatipitched | burntbricks | 6.0 | 18.000000 | 4.857983 | 9.0 | 17.00 | 20.0 | 20.75 | 22.0 |
| | muddaub | 3.0 | 28.333333 | 21.733231 | 12.0 | 16.00 | 20.0 | 36.50 | 53.0 |
| | sunbricks | 1.0 | 20.000000 | NaN | 20.0 | 20.00 | 20.0 | 20.00 | 20.0 |
| mabatisloping | burntbricks | 39.0 | 14.666667 | 8.285477 | 2.0 | 9.00 | 15.0 | 20.00 | 40.0 |
| | cement | 1.0 | 10.000000 | NaN | 10.0 | 10.00 | 10.0 | 10.00 | 10.0 |
| | muddaub | 1.0 | 22.000000 | NaN | 22.0 | 22.00 | 22.0 | 22.00 | 22.0 |
| | sunbricks | 7.0 | 23.285714 | 12.632159 | 10.0 | 18.00 | 22.0 | 22.50 | 50.0 |
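As an aside: if only a single statistic is needed rather than the full `.describe()` table, the grouped object also exposes the individual summary methods shown earlier (e.g. `.mean()`). A minimal sketch using the same groupings as above:

```python
# mean number of years farming per roof type (one value per group)
grouped_data["A11_years_farm"].mean()

# the same works for the two-level grouping
grouped_data2["A11_years_farm"].mean()
```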
## Questions

* Recommendations for sites/books with more exercises for Python for this particular purpose?

## 📚 Resources

* [Dataset: Audit of Political Engagement 11, 2013](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7577&type=data%20catalogue#!/details)
* [Data science with Python tutorial](https://www.geeksforgeeks.org/data-science-tutorial/)
* [Pandas documentation](https://pandas.pydata.org/docs)
* [Python documentation](https://docs.python.org/3/index.html)

## Feedback

### Tips (what can be improved)

* More time spent on exercises.
* At the end of the day, let us know what has to be installed/downloaded/uploaded for the next day.
* The collaborative document has its use but is rather messy to work in (like when writing this, my cursor was moved mysteriously several times) <- This; whenever we have a table with our names it's less messy, maybe we can have that for all exercises.
* Perhaps a few 'take-home' practice exercises for people who still struggled with a particular topic would be useful. Or even some 'take-home' work to prepare for the next session if people feel like they would like to be a bit more prepared. :+1:

### Tops (what went well)

* Nice live code-along. Seeing instructors troubleshoot their own errors makes the learning process more natural.
* The pace was pretty doable for me; I liked that it picked up a bit more at the end.
* Great that we now have time to solve issues without falling behind on the lecture.
* The choice to either work alone or use a break-out room.
* :+1:
* Instructions in the collaborative document are very clear. I missed the second half of day 3 but have managed to work through the material on my own. Thank you!!!
