
Workshop Details
Dates: September 6th - 13th, 2022
Time: 9am - 12pm

Workshop Agenda:
https://ucsdlib.github.io/2022-09-06-carpentries-uc/

Workshop Lesson:
http://swcarpentry.github.io/python-novice-gapminder/

Day 1 - 3: Introduction to Python

Software Installation:
Anaconda
https://www.anaconda.com/download/

  • download latest version - 64-bit installer for Windows 10
  • This application is used to install and run Jupyter Notebooks
  • Google Colab: https://colab.research.google.com (for use if there are problems during the workshop)

Lesson Data (download)

NOTES:

A copy of the instructor live session notes will be made available to participants upon request at the end of the workshop.

Jupyterlab will be used for the lessons
[m] Markdown cell = notes
[#] also works in a code cell for notes
[b] = add cell below; [a] = add cell above
[r] Raw cells cannot have text edits

(for Python lessons)
https://www.markdownguide.org/getting-started/
https://www.markdownguide.org/basic-syntax/

Workshop Day 1

First name and Last Name/Organization/Dept./Email

Name (first & last) Organization Dept. Email
(example) Jane Doe UCSD IT jdoe1@ucsd.edu
Kat Koziar (Helper) UCR Library katherine.koziar@ucr.edu
Jacob sola UCR Chemistry/Biomedical jsola032@ucr.edu
Douglas Zhang UCSD Chemistry/Biochemistry doz023@ucsd.edu
Jacqueline Giacoman UC Merced Political Science jgiacoman@ucmerced.edu
Jose Hernandez UCB Library jose1991@berkeley.edu
John Thompson UC Merced Molecular & Cellular Biology jthompson44@ucmerced.edu
Derek Devnich UC Merced
Sam Erickson UC Merced Physics serickson3@ucmerced.edu
Dilawer Ali UC Merced Mechanical Engineering dali4@ucmerced.edu
Igor Aprelev UCSD Mathematics and Economics iaprelev@ucsd.edu
Benjamin Nauman UCLA Geography bnauman@ucla.edu
Mohit Saraswat UC Merced Chemistry msaraswat@ucmerced.edu
Jacob Ross UCSD Anesthesiology jaross@ucsd.edu
Jay Colond UCM Sociology jcolond@ucmerced.edu
Zhaoning (Johnny) Wang UCSD CMM zhw063@health.ucsd.edu
Lillie Pennington UC Merced Life and Environmental Sciences lpennington@ucmerced.edu
Christian Henry UC Berkeley Integrative Biology chrishenry@berkeley.edu
Belina Chong UCLA Ecology and Evolutionary Biology moonmoon394@ucla.edu
Josiah Piceno UCM MBSE jpiceno3@ucmerced.edu
Jun Tan UCSD Economics j4tn@ucsd.edu
Jon Dean UCSD Anesthesiology j1dean@health.ucsd.edu
Tahirah Williams UCM QSB twilliams76@ucmerced.edu
Liam de Villa Bourke UCLA Institute of the Environment and Sustainability liamdevilla@g.ucla.edu
Rukmini Ravi UCSD San Diego Supercomputer Center ruravi@ucsd.edu
Amber Heidbrink UCSD Cell and Developmental Biology aheidbrink@ucsd.edu
Haley Potts UCSD Math & Economics hpotts@ucsd.edu
isabella schaedle UCSD MMMMMMMM
Apisit Kaewsanit UCSF Epidemiology and Biostatistics apisit.kaewsanit@ucsf.edu
Ivan Felix Rios UCSD Mathematics & Economics ifelixrios@ucsd.edu
Christian Corrales UCLA Neurology ccorrales@mednet.ucla.edu
Michael Woller UCLA Psychology michaelwoller@g.ucla.edu
Stella Yuan UCLA Ecology and Evolutionary Biology scy8@g.ucla.edu
Jonathan Le UCR Mathematics jle173@ucr.edu
Laika Aguinaldo UCSD Psychiatry laaguinaldo@ucsd.edu
Chris Gray UCR Data Science cgray024@ucr.edu
Ana Carolina Dantas Machado UCSD Medicine adantasmachado@ucsd.edu
Jason Ngo UC Merced Bioengineering jngo42@ucmerced.edu
Yibing Zhang UC Merced Bioengineering yzhang291@ucmerced.edu
Ashwin Thomas UC Merced Environmental Systems athomas59@ucmerced.edu
Eric Hyde UCSD Epidemiology ehyde@health.ucsd.edu
Bineh Ndefru UCLA Materials Science bndefru@ucla.edu
Vishakha Malhotra UCSF Biostatistics and Epidemiology
Bruce Hamilton UCSD School of Medicine bah@ucsd.edu
Kazuma Nagatsuka UCSD Robotics(Mechanical Engineering) kngatsuka@ucsd.edu
Caitlin Tribelhorn UCSD Pediatrics ctribelh@ucsd.edu
Vikram Jambulapati UCSD Economics vjambula@ucsd.edu
Simran Kanal UCSF Biostatistics and Epidemiology simran.kanal@ucsf.edu
Daryl Han UC Irvine Student Center and Event Services ddhan@uci.edu
Charles Faulhaber UC Berkeley Bancroft Library / Dept. of Spanish
Mario Cuaya UCR Computer Science mcuay001@ucr.edu
Waleed Rajabally UC Merced Sociology wrajabally@ucmerced.edu
Junxiao Gao UCSF Biostatistics and Epidemiology Junxiao.Gao@ucsf.edu
Jay Chi UCSB ETS jaychi@ucsb.edu
Vishakha Malhotra UCSF Biostatistics and Epidemiology vishakha.malhotra@ucsf.edu

Day 1 Questions:

Please enter any questions not answered during live session here:
1.

Day 1 Live Class Notes:

Download link: https://www.anaconda.com/products/distribution
Working in Anaconda JupyterLab
GUI (middle-man, colloquially pronounced as "gooey") vs command-line
Today's workshop is strictly in JupyterLab GUI

Computer programming languages - there are a lot of them. What they do is similar, and syntax is also similar between languages (although each has its own specifics), so you can learn the basics in one and apply them to other languages.
Your favorite search engine is a good resource when you're looking for answers to your programming questions (kat's note: I <3 Stack Exchange)

working directory - in JupyterLab, the working directory is shown on the left sidebar. The left sidebar also shows tabs, such as the file browser (where you can select your working directory and create new files/folders), a list of which terminals are running, etc. The left sidebar can also be collapsed or expanded. Anaconda JupyterLab runs locally on your computer, so when you're using a public computer, any files are saved on that public computer.

new file - Day1_Python_LiveNotes.ipynb (to rename, right click on file to bring up submenu)

Interface - menu bar at top contains more options than the tabs in the left sidebar quicklinks

Command and Edit modes - pressing B creates a new cell below the current cell

  • code cell will allow you to enter code
  • markdown cell doesn't run code, it's only notes (formatted in markdown). You can change a cell into a markdown cell by pressing m; switch between code and markdown cells by pressing the y and m keys.
  • print('Hello') will show 'Hello' right below the cell if it's executed in a code cell
  • a-key creates a cell above
  • ctrl-enter will run the cell, either execute the command in a code cell or render the markdown in a markdown cell
  • menu -> Kernel -> restart and clear all output will clear all output and saved variables, but keep the text in the cells.
  • Markdown cells are stylized text
  • Hello There

  • Bye
  • code cells are plain text and are executable
  • the octothorpe (also called pound sign, number sign, or hash) # is used for comments in code
  • comments are used to explain why/what your code is doing - comments are a love note to your future self
  • to create a list in markdown, bullets are created using a - or * followed by a space. Different levels are created using indentation (tabs).
    • example of level 2
      • level 3

Numbered lists

  1. level1
    2. level 2, also requires tabs

A tool like HackMD lets you practice markdown.

Bold and italics

  • bold is surrounded by two asterisks
  • italics is surrounded by single asterisks or underscores

In JupyterLab markdown cells, you can combine some html elements, such as <br>
backslash \ before the less-than-symbol will escape the character so it isn't read as html \<br>

Mixed list

  1. level 1
    • Level 2
  2. level 1
    • Level 2
  3. level 1
    • Level 2

Headings use # to create different sizes

largest

one smaller

smaller

etc

even smaller

Markdown Cheatsheet

Save and save often

  • Always shut down your kernel (menu -> Kernel -> Shut Down Kernel) when you're finished
    • this makes sure your file/project isn't continuing to use resources when not intended - especially useful when you're using an HPCC environment.

Challenge #1

Lesson 2: Variables and Assignments

age = 42
first_name = 'Ahmed'

  • variable_name = value
  • computer only recognizes the values assigned to the variable after the code cell is executed
  • variable name rules
    • can only contain letters, digits, underscores (a dash is a minus sign in code!)
    • use underscore or camelCase to help human readability
    • thisisaverylongnamethatishardforahumantoread = "Jimmy"
    • this_is_more_readable = "Jimmy"
    • thisIsCamelCase = "jimmy"
    • variable names cannot start with a number
    • use self-describing short variable names (x is not self-describing, age or weight are self-describing)
    • variable names are CaseSensitive
    • variables that start with an underscore have a special meaning (_dont_use_until_you_understand_what_it_means)
    • you will get a syntax error if the variable name doesn't follow the rules, such as 3age (starts with a number) or read@one (uses a symbol other than the underscore _)
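
One way to check a candidate name without triggering a syntax error is the string method `isidentifier()`. This is only a minimal sketch: the result variable names are made up for illustration, and the method does not flag reserved words such as `class`.

```python
# Ask Python whether each piece of text would be a legal variable name
valid_snake = 'first_name'.isidentifier()        # letters and underscore
valid_camel = 'thisIsCamelCase'.isidentifier()   # letters only
starts_with_digit = '3age'.isidentifier()        # starts with a number
has_symbol = 'read@one'.isidentifier()           # contains the @ symbol

print(valid_snake, valid_camel, starts_with_digit, has_symbol)
```

The first two checks come back True and the last two come back False, matching the rules listed above.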

Built in functions

  • print() prints things as text
  • print(first_name, 'is', age, 'years old') will print Ahmed is 42 years old
  • built-in functions are native to Python, and are functions that are commonly used by programmers
  • print() automatically separates its arguments with single spaces.
  • print(argument1, argument2, argument3, argument4)
  • functions are self-contained - will take in arguments and provide output.
  • functions allow you to easily reuse code
  • not all functions require arguments. some functions require a certain number of arguments.

Variables

  • must be created before they are used.
  • print(myval) will give an error if myval isn't already created with a value

This will throw an error because last_name does not have an assigned value

print(last_name)
last_name = "Smith" 

This will not throw an error

last_name = "Smith" 
print(last_name)

Challenge #2
Assign the variable named color1 to the value red and the variable named color2 to the value blue. Then print red is not blue using the variable names as input (or arguments)

color1 = 'red'
color2 = 'blue'
print(color1, 'is not', color2)
print(color1, 'is', 'not', color2)

Blocks of text

  • you can surround a block of text with triple quotes, like so: """ My very long block of text """

variables used in calculations

  • need to be a certain datatype for calculations - num type, integer or float
  • age = age + 3
  • 3 + 5 * 4 calculates according to math rules (order of operations), not read left to right
    • parentheses/brackets, exponents/radicals, multiplication/division, addition/subtraction
  • 3 + 5 * 4 = 23
  • (3 + 5) * 4 = 32

Challenge #3
Write the code for the following: number1 is 22, number2 is 5, and number3 is 100. Multiply number1 by number3, then divide by number2. The answer to the calculation should be stored in number4. Finally, output 'The answer is number4' - with the value displaying rather than the variable name.
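
The notes don't record a solution for this challenge, so here is one possible sketch using only what was covered so far (assignment, arithmetic, and print()):

```python
number1 = 22
number2 = 5
number3 = 100

# Multiply number1 by number3, then divide the product by number2
number4 = number1 * number3 / number2

# Passing the variable (not the text 'number4') prints its value
print('The answer is', number4)
```

Because / always produces a float in Python 3, the value displayed is 440.0 rather than 440.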

Built-in functions

  • indexing (square brackets) gives you a single character from a string
    • in python, indices start with 0 (zero)
atom_name = 'helium'
print(atom_name[0])

output is h

  • indexing uses the variable name, followed by square brackets around the index number you want to obtain

datatype strings are text surrounded by single or double-quotes (pair single-quotes with single-quotes, don't interchange 'like this")

id_number = 2587464
print(id_number[2])

will result in error because id_number is an integer, and not a string
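
If you do need a single digit from a number, one option is to convert it to a string first. A minimal sketch (the names id_text and third_digit are made up for illustration):

```python
id_number = 2587464

# Convert the integer to a string so it can be indexed
id_text = str(id_number)

# Index 2 is the third character, because indices start at 0
third_digit = id_text[2]
print(third_digit)
```

Note that third_digit is a string ('8'), not an integer; use int() if you need to do math with it.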

list

my_list = ['apple', 'pear', 'peach']
print(my_list[1])

output is pear

slices

  • slice is a substring or subset
  • slice is variable[start position: stop position(not including)]
# string example
atom_name = 'sodium'
print(atom_name[0:3])

output is sod

# list example
many_atoms = ['oxygen', 'carbon', 'nitrogen', 'neon', 'iron', 'zinc']
print(many_atoms[1:4])

output will be ['carbon', 'nitrogen', 'neon'] (notice how it outputs in a list format!)

how long are things?

  • function is len()
  • finds the length of a string or list
  • lets you know how long a string is, or how many elements are in a list
#string example
print(len('helium'))

output is 6 (counts number of characters)

# list example
my_list2 = ['a', '1', '43', 'dream', 'please']
print(len(my_list2))

output is 5 (counts number of elements in list)

Challenge #4

  1. what does thing[:] (just a colon) do?
  2. What does thing[number:some-negative-number] do?
  3. What does the following program print?
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])

Solution #4

  1. returns everything
  2. returns a slice from number up to (but not including) the position counted back from the end of the variable
#example
atom_name = 'carbon'
print(atom_name[1:-4])

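output is a (the slice runs from index 1 up to, but not including, index -4, which is the same position as index 2)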

  3. output is atom_name[1:3] is: ar
    • (remember, the number that is the stop position in the slice isn't included.)
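
A few more slices of the same string may help make the start/stop rules concrete. This is just an illustrative sketch; the variable names on the left are made up.

```python
atom_name = 'carbon'

whole_copy = atom_name[:]     # no start or stop: everything
middle = atom_name[1:-1]      # from index 1 up to (not including) the last character
last_two = atom_name[-2:]     # from the second-to-last character to the end

print(whole_copy, middle, last_two)
```

This prints carbon arbo on.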

Data types & type conversion

  • all data that python reads is associated with a data type. Types we've covered so far are string, integer, floats, which are the three commonly used data types.
  • Type conversion means you're converting data from one type to another
  • integers : whole numbers
    • type conversion use int()
  • floats : also called floating points, they are decimal (real) numbers
    • type conversion use float()
  • strings : sequence of characters, written inside quotes
    • type conversion use str()
  • to identify the type of data, use type()

type(52) will output int

print(type(52)) will output <class 'int'>

fitness = 'average'
print(type(fitness))

output is <class 'str'>

print(type(hair)) will throw an error, because Python is reading hair as a variable name, which isn't defined.

print(type(3.4))

output is <class 'float'>

print (5-2)

will output 3

print ('hello'-'h')

will throw an error because you can't subtract strings

You can use '+' and '*' on integers, floats, and strings, but they operate differently on strings: '+' joins strings together and '*' repeats them

print (4+5)

output is 9

print ("Ahmed"+"Walch")

output is AhmedWalch

print ('Ahmed'*10)

output is AhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmed

  • Cannot mix strings with integers/floats for mathematical purposes
print (1 + '2')

will throw an error.

however,

print (1 + int('2'))

will output 3 because '2' is type cast as an integer, allowing math operations.

print (str(1) + '2')

will output 12 (which is actually a string, not a number!)

print ('Gene'+str(23455685))

will output Gene23455685, which allows easy labels!

Variables only change values once the value is (re-)assigned

if you need to keep an original value of a variable, create a new variable name, otherwise you're overwriting the original value.
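
A small sketch of the difference between overwriting a value and keeping the original (the variable names here are made up for illustration):

```python
# Overwriting: the old value of age is gone after re-assignment
age = 42
age = age + 3
print(age)

# Keeping the original: store the new value under a new name
original_age = 42
age_in_three_years = original_age + 3
print(original_age, age_in_three_years)
```

The first print shows 45; the second shows 42 45, because original_age was never re-assigned.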

LIVE LESSON NOTES: https://drive.google.com/file/d/1TSm1bA55RwQu5-iqdnBNRU47U3os9x86/view?usp=sharing

End Day 1

Workshop Day 2

First name and Last Name/Organization/Dept./Email

Name (first & last) Organization Dept. Email
Geno Sanchez (helper) UCLA Library genosanchez@library.ucla.edu
Amber Heidbrink UCSD Cell and Developmental Biology aheidbrink@ucsd.edu
Kat Koziar UCR Library katherine.koziar@ucr.edu
Yibing Zhang UCM Bioengineering yzhang291@ucmerced.edu
Douglas Zhang UCSD Chemistry and Biochemistry doz023@ucsd.edu
Kazuma Nagatsuka UCSD Robotics(Mechanical Engineering) knagatsuka@ucsd.edu
Jay Colond UCM Sociology jcolond@ucmerced.edu
Belina Chong UCLA Ecology and Evolutionary Biology moonmoon394@ucla.edu
Jonathan Le UCR Mathematics jle173@ucr.edu
Caitlin Tribelhorn UCSD Pediatrics ctribelh@ucsd.edu
Igor Aprelev UCSD Mathematics and Economics iaprelev@ucsd.edu
Sam Erickson UC Merced Physics serickson3@ucmerced.edu
Jay Chi UCSB ETS jaychi@ucsb.edu
Apisit Kaewsanit UCSF Epidemiology and Biostatistics apisit.kaewsanit@ucsf.edu
Benjamin Nauman UCLA Geography bnauman@ucla.edu
Suzanne Paulson UCLA AOS paulson@atmos.ucla.edu
Liam de Villa Bourke UCLA IOES liamdevilla@g.ucla.edu
Mario Cuaya UCR Computer Science mcuay001@ucr.edu
Josiah Piceno UCM MBSE jpiceno3@ucmerced.edu
John Thompson UC Merced Cell & Molecular Biology jthompson44@ucmerced.edu
Bineh Ndefru UCLA Material Science bndefru@ucla.edu
Zhiyuan Yao UCLA Data Science Center zyao@ucla.edu
Tahirah Williams UCM QSB twilliams76@ucmerced.edu
Haley Potts UCSD Math & Econ hpotts@ucsd.edu
Zhaoning (Johnny) Wang UCSD CMM zhw063@health.ucsd.edu
Daryl Han UC Irvine Student Center and Event Services ddhan@uci.edu
Simran Kanal UCSF Epidemiology and Biostatistics simran.kanal@ucsf.edu
Jon Dean UCSD Anesthesiology j1dean@health.ucsd.edu
Junxiao Gao UCSF Biostatistics and Epidemiology Junxiao.Gao@ucsf.edu
Stella Yuan UCLA Ecology and Evolutionary Biology scy8@g.ucla.edu
Waleed Rajabally UC Merced Sociology wrajabally@ucmerced.edu
Jun Tan UCSD Economics j4tan@ucsd.edu
Christian Henry UC Berkeley Integrative Biology chrishenry@berkeley.edu
Jacob Ross UCSD Anesthesiology jaross@ucsd.edu
Christopher Gray UCR Computer Science cgray024@ucr.edu

Day 2 Questions:

Please enter any questions not answered during live session here:
1.

Day 2 Live Class Notes:

Gapminder data download: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip

Lesson 5 Libraries

Most of the power of a programming language is in its libraries.

A library is a collection of files (called modules) that contains functions for use by other programs.

  • May also contain data values
  • Pandas - widely used library often used in the science world
  • Many are open source
  • The Python standard library is an extensive suite of modules that comes with Python itself.

A program must import a library module before using it.

Use import to load a library module into a program’s memory.

import math
print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))

pi is 3.141592653589793
cos(pi) is -1.0

Have to refer to each item with the module’s name.

Use help to learn about the contents of a library module.

help(math)
Help on module math:

NAME
    math

MODULE REFERENCE
    http://docs.python.org/3/library/math
...

Import specific items from a library module to shorten programs.

from math import cos, pi

print('cos(pi) is', cos(pi))

cos(pi) is -1.0

Create an alias for a library module when importing it to shorten programs.

import math as m

print('cos(pi) is', m.cos(m.pi))

cos(pi) is -1.0

  • Use import as to give a library a short alias while importing it.
  • Then refer to items in the library using that shortened name
import matplotlib as mpl

Challenge

  1. Fill in the blanks so that the program below prints 90.0.
  2. Rewrite the program so that it uses import without as.
  3. Which form do you find easier to read?
import math as m
angle = ____.degrees(____.pi / 2)
print(____)

Solution:

# 1
import math as m
angle = m.degrees(m.pi / 2)
print(angle)

# 2
import math
angle = math.degrees(math.pi / 2)
print(angle)
90.0

Lesson 6: Writing Functions

Define a function using def with a name, parameters, and a block of code.

# you need to declare a new function with the keyword 'def'.
# you need to include a 'name()'.
def say_hello():
    print("hello!")
  • Begin the definition of a new function with def
  • Followed by the name of the function.
    • Must obey the same rules as variable names
    • Names may use letters, digits, and underscores, but cannot start with a digit.
  • Then parameters in parentheses
    • Empty parentheses if the function doesn't take any input
  • Then a colon is used
  • Next line of code is indented
  • Some functions require arguments to be passed in order to execute; others do not.
# After defining a function, you must 'call' a function to execute it.

say_hello()

hello!

# Let's make a function that prints a date as an example of a function that takes an argument.

def print_date(year, month, day): # year, month, and day are the three arguments this function requires
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
    
print_date(2022, 1, 2)

2022/1/2

print_date(month = 1, year = 2019, day = 23)

2019/1/23

Defining a function that uses a return statement.

def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)

avg = average([1,3,4])

print(avg)

emptyAvg = average([])

print(emptyAvg)

2.6666666666666665
None

# A function that has no return statement returns None
result = print_date(1871, 3, 19)
print('result of print_date', result)

1871/3/19
result of print_date None
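
For comparison, here is a sketch of a variant that returns the text instead of printing it. The name format_date is made up for this example and is not part of the lesson.

```python
def format_date(year, month, day):
    # Build the string and hand it back to the caller instead of printing it
    return str(year) + '/' + str(month) + '/' + str(day)

result = format_date(1871, 3, 19)
print('result of format_date', result)
```

Here result holds the string 1871/3/19 rather than None, so it can be reused later in the program.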

Challenge

What is wrong with this example?

#Example
result = print_time(11,37,59)

def print_time(hour, minute, second):
    time_string = str(hour) + ':' + str(minute)+ ':' + str(second)
    print(time_string)
# After fix (the call is moved below the function definition):
result = print_time(11, 37, 59)
print('result of call is:', result)

11:37:59
result of call is: None

Reading tabular data into data frames

import os

#Get our current working directory
print(os.getcwd())

#List the contents of this directory
print(os.listdir())
import pandas as pd

data = pd.read_csv("gapminder_gdp_oceania.csv")
#Reading data from a subfolder
#data = pd.read_csv("subfolder/gapminder_gdp_oceania.csv")
print(data)
       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  
data

        country 	gdpPercap_1952 	gdpPercap_1957 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972 	gdpPercap_1977 	gdpPercap_1982 	gdpPercap_1987 	gdpPercap_1992 	gdpPercap_1997 	gdpPercap_2002 	gdpPercap_2007
0 	Australia 	10039.59564 	10949.64959 	12217.22686 	14526.12465 	16788.62948 	18334.19751 	19477.00928 	21888.88903 	23424.76683 	26997.93657 	30687.75473 	34435.36744
1 	New Zealand 	10556.57566 	12247.39532 	13175.67800 	14463.91893 	16046.03728 	16233.71770 	17632.41040 	19007.19129 	18363.32494 	21050.41377 	23189.80135 	25185.00911

# lets identify our rows by country not index number

data = pd.read_csv("gapminder_gdp_oceania.csv", index_col = "country")
 	        gdpPercap_1952 	gdpPercap_1957 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972 	gdpPercap_1977 	gdpPercap_1982 	gdpPercap_1987 	gdpPercap_1992 	gdpPercap_1997 	gdpPercap_2002 	gdpPercap_2007
country 												
Australia 	10039.59564 	10949.64959 	12217.22686 	14526.12465 	16788.62948 	18334.19751 	19477.00928 	21888.88903 	23424.76683 	26997.93657 	30687.75473 	34435.36744
New Zealand 	10556.57566 	12247.39532 	13175.67800 	14463.91893 	16046.03728 	16233.71770 	17632.41040 	19007.19129 	18363.32494 	21050.41377 	23189.80135 	25185.00911
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gdpPercap_1952  2 non-null      float64
 1   gdpPercap_1957  2 non-null      float64
 2   gdpPercap_1962  2 non-null      float64
 3   gdpPercap_1967  2 non-null      float64
 4   gdpPercap_1972  2 non-null      float64
 5   gdpPercap_1977  2 non-null      float64
 6   gdpPercap_1982  2 non-null      float64
 7   gdpPercap_1987  2 non-null      float64
 8   gdpPercap_1992  2 non-null      float64
 9   gdpPercap_1997  2 non-null      float64
 10  gdpPercap_2002  2 non-null      float64
 11  gdpPercap_2007  2 non-null      float64
dtypes: float64(12)
memory usage: 208.0+ bytes

Summary statistics of your data

data.describe()
 	gdpPercap_1952 	gdpPercap_1957 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972 	gdpPercap_1977 	gdpPercap_1982 	gdpPercap_1987 	gdpPercap_1992 	gdpPercap_1997 	gdpPercap_2002 	gdpPercap_2007
count 	2.000000 	2.000000 	2.000000 	2.000000 	2.00000 	2.000000 	2.000000 	2.000000 	2.000000 	2.000000 	2.000000 	2.000000
mean 	10298.085650 	11598.522455 	12696.452430 	14495.021790 	16417.33338 	17283.957605 	18554.709840 	20448.040160 	20894.045885 	24024.175170 	26938.778040 	29810.188275
std 	365.560078 	917.644806 	677.727301 	43.986086 	525.09198 	1485.263517 	1304.328377 	2037.668013 	3578.979883 	4205.533703 	5301.853680 	6540.991104
min 	10039.595640 	10949.649590 	12217.226860 	14463.918930 	16046.03728 	16233.717700 	17632.410400 	19007.191290 	18363.324940 	21050.413770 	23189.801350 	25185.009110
25% 	10168.840645 	11274.086022 	12456.839645 	14479.470360 	16231.68533 	16758.837652 	18093.560120 	19727.615725 	19628.685412 	22537.294470 	25064.289695 	27497.598692
50% 	10298.085650 	11598.522455 	12696.452430 	14495.021790 	16417.33338 	17283.957605 	18554.709840 	20448.040160 	20894.045885 	24024.175170 	26938.778040 	29810.188275
75% 	10427.330655 	11922.958888 	12936.065215 	14510.573220 	16602.98143 	17809.077558 	19015.859560 	21168.464595 	22159.406358 	25511.055870 	28813.266385 	32122.777858
max 	10556.575660 	12247.395320 	13175.678000 	14526.124650 	16788.62948 	18334.197510 	19477.009280 	21888.889030 	23424.766830 	26997.936570 	30687.754730 	34435.367440
data.columns
# or
print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')

Dataframes

A dataframe is a collection of columns. All values within a single column have to be the same data type (e.g. float, int, str).

Challenge

  1. Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
  2. After reading the data for the Americas, use help(americas.head) and help(americas.tail) to find out what DataFrame.head and DataFrame.tail do.
  3. How can you display the first three rows of this data?

solution:

americas = pd.read_csv("data/gapminder_gdp_americas.csv", index_col = "country")

print(americas.head(3))

print(americas.describe())
          continent  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                               
Argentina  Americas     5911.315053     6856.856212     7133.166023   
Bolivia    Americas     2677.326347     2127.686326     2180.972546   
Brazil     Americas     2108.944355     2487.365989     3336.585802   

           gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
country                                                                     
Argentina     8052.953021     9443.038526    10079.026740     8997.897412   
Bolivia       2586.886053     2980.331339     3548.097832     3156.510452   
Brazil        3429.864357     4985.711467     6660.118654     7030.835878   

           gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
country                                                                     
Argentina     9139.671389     9308.418710    10967.281950     8797.640716   
Bolivia       2753.691490     2961.699694     3326.143191     3413.262690   
Brazil        7807.095818     6950.283021     7957.980824     8131.212843   

           gdpPercap_2007  
country                    
Argentina    12779.379640  
Bolivia       3822.137084  
Brazil        9065.800825 




       gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
count       25.000000       25.000000       25.000000       25.000000   
mean      4079.062552     4616.043733     4901.541870     5668.253496   
std       3001.727522     3312.381083     3421.740569     4160.885560   
min       1397.717137     1544.402995     1662.137359     1452.057666   
25%       2428.237769     2487.365989     2750.364446     3242.531147   
50%       3048.302900     3780.546651     4086.114078     4643.393534   
75%       3939.978789     4756.525781     5180.755910     5788.093330   
max      13990.482080    14847.127120    16173.145860    19530.365570   

       gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
count       25.000000       25.000000       25.000000       25.000000   
mean      6491.334139     7352.007126     7506.737088     7793.400261   
std       4754.404329     5355.602518     5530.490471     6665.039509   
min       1654.456946     1874.298931     2011.159549     1823.015995   
25%       4031.408271     4756.763836     4258.503604     4140.442097   
50%       5305.445256     6281.290855     6434.501797     6360.943444   
75%       6809.406690     7674.929108     8997.897412     7807.095818   
max      21806.035940    24072.632130    25009.559140    29884.350410   

       gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
count       25.000000       25.000000       25.000000       25.000000  
mean      8044.934406     8889.300863     9287.677107    11003.031625  
std       7047.089191     7874.225145     8895.817785     9713.209302  
min       1456.309517     1341.726931     1270.364932     1201.637154  
25%       4439.450840     4684.313807     4858.347495     5728.353514  
50%       6618.743050     7113.692252     6994.774861     8948.102923  
75%       8137.004775     9767.297530     8797.640716    11977.574960  
max      32003.932240    35767.433030    39097.099550    42951.653090  

Getting data out of your data frame

# get a column
data = pd.read_csv("gapminder_gdp_europe.csv", index_col = "country")
data.columns
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')
col1 = data["gdpPercap_1957"] # getting data by column label

print(col1)

country
Albania                    1942.284244
Austria                    8842.598030
Belgium                    9714.960623
Bosnia and Herzegovina     1353.989176
Bulgaria                   3008.670727
Croatia                    4338.231617
Czech Republic             8256.343918
Denmark                   11099.659350
Finland                    7545.415386
France                     8662.834898
Germany                   10187.826650
Greece                     4916.299889
Hungary                    6040.180011
Iceland                    9244.001412
Ireland                    5599.077872
Italy                      6248.656232
Montenegro                 3682.259903
Netherlands               11276.193440
Norway                    11653.973040
Poland                     4734.253019
Portugal                   3774.571743
Romania                    3943.370225
Serbia                     4981.090891
Slovak Republic            6093.262980
Slovenia                   5862.276629
Spain                      4564.802410
Sweden                     9911.878226
Switzerland               17909.489730
Turkey                     2218.754257
United Kingdom            11283.177950
Name: gdpPercap_1957, dtype: float64
# Pandas introduces new data types

print(type(data))

<class 'pandas.core.frame.DataFrame'>
print(type(col1))


<class 'pandas.core.series.Series'>

Get data subsets by position

subset1 = data.iloc[0, 0]
print(subset1)

1601.056136

Get data subsets by label

subset2 = data.loc["Albania", "gdpPercap_1952"]
print(subset2)

1601.056136

Get row by label

data.loc["Albania",:]


gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64
country_subset = data.loc["Italy":"Poland", "gdpPercap_1962":"gdpPercap_1972"]
country_subset


	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	8243.582340 	10022.401310 	12269.273780
Montenegro 	4649.593785 	5907.850937 	7778.414017
Netherlands 	12790.849560 	15363.251360 	18794.745670
Norway 	13450.401510 	16361.876470 	18965.055510
Poland 	5338.752143 	6557.152776 	8006.506993
print(type(country_subset))
print(country_subset.describe())


<class 'pandas.core.frame.DataFrame'>
       gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
count        5.000000        5.000000        5.000000
mean      8894.635868    10842.506571    13162.799194
std       4093.410673     4855.106424     5517.298708
min       4649.593785     5907.850937     7778.414017
25%       5338.752143     6557.152776     8006.506993
50%       8243.582340    10022.401310    12269.273780
75%      12790.849560    15363.251360    18794.745670
max      13450.401510    16361.876470    18965.055510
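
The label slice `"Italy":"Poland"` used above keeps both endpoints, whereas a position slice with `iloc` stops before its end position. A minimal sketch of that difference, using the 1962 values from the notes in a hand-built Series (the name `gdp_1962` is an assumption for illustration):

```python
import pandas as pd

# Five-row excerpt of the 1962 column, values copied from the notes.
gdp_1962 = pd.Series(
    [8243.582340, 4649.593785, 12790.849560, 13450.401510, 5338.752143],
    index=["Italy", "Montenegro", "Netherlands", "Norway", "Poland"],
    name="gdpPercap_1962",
)

label_slice = gdp_1962.loc["Italy":"Norway"]  # includes Norway
position_slice = gdp_1962.iloc[0:3]           # stops before position 3 (Norway)

print(list(label_slice.index))
print(list(position_slice.index))
```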

# Select the rows for two specific countries by passing a list of labels; the result is a DataFrame

data.loc[["Italy","Poland"], :]


 	gdpPercap_1952 	gdpPercap_1957 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972 	gdpPercap_1977 	gdpPercap_1982 	gdpPercap_1987 	gdpPercap_1992 	gdpPercap_1997 	gdpPercap_2002 	gdpPercap_2007
country 												
Italy 	4931.404155 	6248.656232 	8243.582340 	10022.401310 	12269.273780 	14255.984750 	16537.483500 	19207.234820 	22013.644860 	24675.02446 	27968.09817 	28569.71970
Poland 	4029.329699 	4734.253019 	5338.752143 	6557.152776 	8006.506993 	9508.141454 	8451.531004 	9082.351172 	7738.881247 	10159.58368 	12002.23908 	15389.92468

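Alternative solution (first three years only; concat stacks the two Series end to end, so the country labels are not kept):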

italy = data.loc["Italy", "gdpPercap_1952":"gdpPercap_1962"]
poland = data.loc["Poland", "gdpPercap_1952":"gdpPercap_1962"]
pd.concat([italy, poland])


gdpPercap_1952    4931.404155
gdpPercap_1957    6248.656232
gdpPercap_1962    8243.582340
gdpPercap_1952    4029.329699
gdpPercap_1957    4734.253019
gdpPercap_1962    5338.752143
dtype: float64
data.iloc[0:2, 0:2]


 	        gdpPercap_1952 	gdpPercap_1957
country 		
Albania 	1601.056136 	1942.284244
Austria 	6137.076492 	8842.598030

Filter data

#Filtering data by a criterion

country_subset


 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	8243.582340 	10022.401310 	12269.273780
Montenegro 	4649.593785 	5907.850937 	7778.414017
Netherlands 	12790.849560 	15363.251360 	18794.745670
Norway 	13450.401510 	16361.876470 	18965.055510
Poland 	5338.752143 	6557.152776 	8006.506993

country_subset > 10000


	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	False 	True 	True
Montenegro 	False 	False 	False
Netherlands 	True 	True 	True
Norway 	True 	True 	True
Poland 	False 	False 	False
country_subset[country_subset > 10000]


 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	NaN 	10022.40131 	12269.27378
Montenegro 	NaN 	NaN 	NaN
Netherlands 	12790.84956 	15363.25136 	18794.74567
Norway 	13450.40151 	16361.87647 	18965.05551
Poland 	NaN 	NaN 	NaN
# Using the where() method for filtering

country_subset.where(country_subset > 10000)


 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	NaN 	10022.40131 	12269.27378
Montenegro 	NaN 	NaN 	NaN
Netherlands 	12790.84956 	15363.25136 	18794.74567
Norway 	13450.40151 	16361.87647 	18965.05551
Poland 	NaN 	NaN 	NaN
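
The two filtering forms above give the same result. The sketch below checks this on a hand-built three-row excerpt (values copied from the notes; the name `subset` is hypothetical) and also shows the `other` argument of `where()`, which replaces non-matching values with something other than NaN.

```python
import pandas as pd

# Three-row excerpt of country_subset, values copied from the notes.
subset = pd.DataFrame(
    {
        "gdpPercap_1962": [8243.582340, 4649.593785, 12790.849560],
        "gdpPercap_1967": [10022.401310, 5907.850937, 15363.251360],
    },
    index=pd.Index(["Italy", "Montenegro", "Netherlands"], name="country"),
)

mask = subset > 10000                 # DataFrame of True/False
masked = subset[mask]                 # NaN wherever the mask is False
via_where = subset.where(mask)        # same result as subset[mask]
filled = subset.where(mask, other=0)  # use 0 instead of NaN

print(masked.equals(via_where))
print(filled)
```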
# Method chaining

country_subset.where(country_subset > 10000).describe()


 	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
count 	2.000000 	3.000000 	3.000000
mean 	13120.625535 	13915.843047 	16676.358320
std 	466.373656 	3408.589070 	3817.597015
min 	12790.849560 	10022.401310 	12269.273780
25% 	12955.737548 	12692.826335 	15532.009725
50% 	13120.625535 	15363.251360 	18794.745670
75% 	13285.513522 	15862.563915 	18879.900590
max 	13450.401510 	16361.876470 	18965.055510
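
In the `describe()` output above, the count for 1962 drops to 2 because summary statistics skip the NaN values left by `where()`. A small sketch with the 1962 column from the notes (rebuilt by hand as a Series named `col`, which is an assumption for illustration):

```python
import pandas as pd

# The 1962 column of country_subset, values copied from the notes.
col = pd.Series(
    [8243.582340, 4649.593785, 12790.849560, 13450.401510, 5338.752143],
    index=["Italy", "Montenegro", "Netherlands", "Norway", "Poland"],
    name="gdpPercap_1962",
)

filtered = col.where(col > 10000)  # values at or below 10000 become NaN

print(filtered.count())  # number of non-missing values
print(filtered.mean())   # mean of the non-missing values only
```

The mean printed here should agree with the 1962 mean in the `describe()` table above.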
country_subset.rank()


 	       gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	    3.0 	3.0 	3.0
Montenegro 	1.0 	1.0 	1.0
Netherlands 4.0 	4.0 	4.0
Norway    	5.0 	5.0 	5.0
Poland  	2.0 	2.0 	2.0
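# An elaborate chaining example: rank each column, then compute the Kendall rank correlation between the columns
country_subset.rank().corr("kendall")

# Save the subset to a CSV file
country_subset.to_csv("country_subset.csv")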

country_subset

	gdpPercap_1962 	gdpPercap_1967 	gdpPercap_1972
country 			
Italy 	8243.582340 	10022.401310 	12269.273780
Montenegro 	4649.593785 	5907.850937 	7778.414017
Netherlands 	12790.849560 	15363.251360 	18794.745670
Norway 	13450.401510 	16361.876470 	18965.055510
Poland 	5338.752143 	6557.152776 	8006.506993
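
The notes save the subset with `to_csv("country_subset.csv")`. As a hedged sketch of reading such a file back, the example below writes to an in-memory string instead of a file on disk (an assumption made so the example leaves no files behind) and restores the `country` index with `index_col`. The two-row frame named `subset` is hand-built from values in the notes.

```python
import io

import pandas as pd

# Two-row excerpt of country_subset, values copied from the notes.
subset = pd.DataFrame(
    {
        "gdpPercap_1962": [8243.582340, 5338.752143],
        "gdpPercap_1967": [10022.401310, 6557.152776],
    },
    index=pd.Index(["Italy", "Poland"], name="country"),
)

csv_text = subset.to_csv()  # with no path, to_csv returns the CSV text
restored = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(csv_text)
print(restored.index.name, list(restored.columns))
```

Without `index_col="country"`, the country names would come back as an ordinary column and the rows would get a default integer index.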

LIVE LESSON NOTES:
Day 2 live notes A
Day 2 live notes B

End Day 2

Workshop Day 3

First name and Last Name/Organization/Dept./Email

Name (first & last) Organization Dept. Email
Zhiyuan Yao UCLA Data Science Center zyao@ucla.edu
Mario Cuaya UCR Computer Science mcuay001@ucr.edu
Amber Heidbrink UCSD Cell and Developmental Biology aheidbrink@ucsd.edu
Douglas Zhang UCSD Chemistry and Biochemistry doz023@ucsd.edu
Stella Yuan UCLA Ecology and Evolutionary Biology scy8@ucla.edu
Benjamin Nauman UCLA Geography bnauman@ucla.edu
Belina Chong UCLA Ecology and Evolutionary Biology moonmoon394@ucla.edu
Haley Potts UCSD Math & Economics hpotts@ucsd.edu
Igor Aprelev UCSD Mathematics and Economics iaprelev@ucsd.edu
Jun Tan UCSD Economics j4tan@ucsd.edu
Jonathan Le UCR Mathematics jle173@ucr.edu
Bineh Ndefru UCLA Materials Science bndefru@ucla.edu
Jay Chi UCSB ETS jaychi@ucsb.edu
Kazuma Nagatsuka UCSD Robotics(Mechanical Engineering) knagatsuka@ucsd.edu
Josiah Piceno UCM MBSE jpiceno3@ucmerced.edu
Yibing Zhang UCM Bioengineering yzhang291@ucmerced.edu
Simran Kanal UCSF Epidemiology and Biostatistics simran.kanal@ucsf.edu
Dilawer Ali UC Merced Mechanical Engineering dali4@ucmerced.edu
Tahirah Williams UCM QSB twilliams76@gmail.com
Christian Henry UC Berkeley UC Berkeley chrishenry@berkeley.edu
Zhaoning (Johnny) Wang UCSD CMM zhw063@health.ucsd.edu
Daryl Han UC Irvine Student Center and Event Services ddhan@uci.edu
Jacob Ross UCSD Anesthesiology jaross@ucsd.edu
Jay Colond UCM Sociology jcolond@ucmerced.edu
John Thompson UC Merced Molecular & Cellular Biology jthompson44@ucmerced.edu
Apisit Kaewsanit UCSF Epidemiology and Biostatistics apisit.kaewsanit@ucsf.edu
Caitlin Tribelhorn UCSD Pediatrics ctribelh@ucsd.edu
Waleed Rajabally UCM Sociology wrajabally@ucmerced.edu
Junxiao Gao UCSF Epidemiology and Biostatistics Junxiao.Gao@ucsf.edu
Sam Erickson UC Merced Physics serickson3@ucmerced.edu
Christopher Gray UCR Computer Science cgray024@ucr.edu

Day 3 Questions:

Please enter any questions not answered during live session here:
1.

Day 3 Live Class Notes:

# Day 3
# Lists
# use square brackets []
# can hold different data types
# a list is mutable; a character string is not
# you can extend/append a list to make it longer
pressure = [0.6, 0.7, 0.8, 0.9]
print(pressure)

#output
[0.6, 0.7, 0.8, 0.9]

list_a = ['a', 'b', 4, 6.7]
print(list_a)

#output
['a', 'b', 4, 6.7]

#array
import numpy as np
a = np.array

len(list_a)

#output
4

list_a[1]

#output
'b'

pressure

#output
[0.6, 0.7, 0.8, 0.9]

# assign a new value to a list
pressure[3] = 5
pressure

#output
[0.6, 0.7, 0.8, 5]

# extend or append new values to make a list longer
# append adds its argument as a single item, so appending a list nests it
a = [1,2,3,4]
b = [5,6,7,8,9]
a.append(b)
print(a)

#output
[1, 2, 3, 4, [5, 6, 7, 8, 9]]

a[4][1]

#output
6

a = [1,2,3,4]
a.append(8)
print(a)

#output
[1, 2, 3, 4, 8]

# extend adds each item of the other list one by one
a = [1,2,3,4]
b = [5,6,7,8,9]
a.extend(b)
print(a)

#output
[1, 2, 3, 4, 5, 6, 7, 8, 9]
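
A compact side-by-side of the `append` and `extend` calls above, as a sketch. The copies and the `+` concatenation are additions for illustration, so that the starting list `a` stays unchanged between steps.

```python
a = [1, 2, 3, 4]
b = [5, 6, 7, 8, 9]

appended = a.copy()
appended.append(b)    # b becomes one nested item at the end

extended = a.copy()
extended.extend(b)    # the items of b are added one by one

concatenated = a + b  # builds a new list; a and b are unchanged

print(len(appended))
print(len(extended))
print(concatenated == extended)
```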

list_empty = []
print(list_empty)

#output
[]

# a character string is immutable
string_list = 'address'
string_list[3]

#output
'r'

string_list[3] = 'o'

#output
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-0babce904faa> in <module>
----> 1 string_list[3] = 'o'

TypeError: 'str' object does not support item assignment

# But if you convert the string to a list, you can use an index to change the list
string_list = list('address')
string_list[3] = 'o'
print(string_list)

#output
['a', 'd', 'd', 'o', 'e', 's', 's']

# from a string to a list and back
string_a = 'gold'
string_list = list('gold')
print(string_list)

#output
['g', 'o', 'l', 'd']

string_list[0]

#output
'g'

# convert a list to a string using ''.join()
print(''.join(string_list))

#output
gold
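
Putting the steps above together, one way to "change" a character in a string is to go through a list and join the pieces back. The slicing alternative at the end is an addition for comparison and was not part of the session; the variable names are hypothetical.

```python
word = "address"

letters = list(word)        # ['a', 'd', 'd', 'r', 'e', 's', 's']
letters[3] = "o"            # lists are mutable, so item assignment works
changed = "".join(letters)  # join the characters back into one string

# Alternative without a list: build a new string from slices.
changed_by_slicing = word[:3] + "o" + word[4:]

print(changed)
print(changed_by_slicing)
print(word)  # the original string is untouched
```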

# Stepping through a list
string_list = list('address')
print(string_list)

#output
['a', 'd', 'd', 'r', 'e', 's', 's']

# slice syntax is [start:stop:step]; leaving start and stop empty covers the whole list
# a step of 1 returns every value
string_list[::1]

#output
['a', 'd', 'd', 'r', 'e', 's', 's']

# a step of 2 returns every other (2nd) value
string_list[::2]

#output
['a', 'd', 'e', 's']

# a start of 2 begins at index 2, which omits the first two items
string_list[2::]

#output
['d', 'r', 'e', 's', 's']
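
The slices above all follow the general pattern `[start:stop:step]`. The sketch below repeats two of them and adds two variations that were not shown in the session (a negative step and a slice that sets all three parts) as assumptions for illustration.

```python
letters = list("address")

every_other = letters[::2]       # indices 0, 2, 4, 6
from_index_2 = letters[2:]       # index 2 to the end
reversed_letters = letters[::-1] # a negative step walks backwards
middle = letters[1:5:2]          # indices 1 and 3 (stop index 5 is excluded)

print(every_other)
print(from_index_2)
print(reversed_letters)
print(middle)
```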

# Difference between sort and sorted using a list
string_list = list('gold')
result = sorted(string_list)
print(result)

#output (sorted in alphabetical order)
['d', 'g', 'l', 'o']

print(string_list)

#output (the original list is unchanged)
['g', 'o', 'l', 'd']

string_list = list('gold')
result = string_list.sort()
print(result)
print(string_list)

#output
None
['d', 'g', 'l', 'o']

# Use sorted(variable) to get a new sorted list and leave the original list unchanged
list_num = [10,2,5,7,8,4]
result_num = sorted(list_num)
print(result_num)
print(list_num)

#output
[2, 4, 5, 7, 8, 10]
[10, 2, 5, 7, 8, 4]

list_num = [10,2,5,7,8,4]
result_num = list_num.sort()
print(result_num)
print(list_num)

#output
None
[2, 4, 5, 7, 8, 10]

# variable.sort() is a list method that sorts the list in place and returns None
# This changes the list itself
list_num.sort()
print(list_num)

#output
[2, 4, 5, 7, 8, 10]
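
A short recap of the `sorted()` versus `.sort()` behaviour above, as a sketch. The `reverse=True` argument and the call on a plain string are additions that were not covered in the session.

```python
numbers = [10, 2, 5, 7, 8, 4]

new_list = sorted(numbers)                  # returns a new list
descending = sorted(numbers, reverse=True)  # new list, largest first
unchanged = numbers.copy()                  # snapshot before sorting in place
returned = numbers.sort()                   # sorts in place and returns None

print(new_list)
print(descending)
print(returned)
print(numbers)
print(sorted("gold"))  # sorted() accepts any iterable, not only lists
```

Because `.sort()` returns `None`, writing `numbers = numbers.sort()` would throw the list away.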

## Lesson: Plotting
import matplotlib.pyplot as plt

time = [1, 2, 3, 4]
position = [100, 200, 300, 400]
plt.plot(time, position, label = 'Position changes during time')
plt.xlabel('Time')
plt.ylabel('Position')
plt.legend()
plt.title('Position changes during time')

#output
Text(0.5, 1.0, 'Position changes during time')
#graph

# Plot directly from a dataframe
import pandas as pd

# import the data and save as a dataframe
data_oceania = pd.read_csv('gapminder_gdp_oceania.csv', index_col = 'country')

# Let's remove part of the column names to keep only the year
data_oceania.columns = data_oceania.columns.str.strip('gdpPercap_')

# Now let's convert the years to integers
# astype() returns a new Index; the result is not assigned here, so the columns stay strings (dtype='object' in the output)
data_oceania.columns.astype(int)

print(data_oceania.columns)  # the column labels of the dataframe
print(data_oceania.index)    # the row labels of the dataframe

#output
Index(['1952', '1957', '1962', '1967', '1972', '1977', '1982', '1987', '1992',
       '1997', '2002', '2007'],
      dtype='object')
Index(['Australia', 'New Zealand'], dtype='object', name='country')
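
Two details of the column clean-up above are easy to miss: `str.strip()` treats its argument as a set of characters to trim from both ends, not as a prefix, and `astype()` returns a new Index instead of changing the existing one. The sketch below works on a hand-built Index (an assumption, so no CSV file is needed) and uses `str.replace()` as one alternative that removes the exact substring.

```python
import pandas as pd

# Three of the column labels from the notes, built by hand.
columns = pd.Index(["gdpPercap_1952", "gdpPercap_1957", "gdpPercap_2007"])

# strip() trims any of the characters g, d, p, P, e, r, c, a, _ from both ends.
stripped = columns.str.strip("gdpPercap_")

# replace() removes the exact substring wherever it occurs.
replaced = columns.str.replace("gdpPercap_", "", regex=False)

# astype() returns a new Index, so the result has to be kept.
years = replaced.astype(int)

print(list(stripped))
print(list(years))
```

Both approaches give the same labels here because the years contain only digits. In a dataframe, the converted labels would be assigned back with `data_oceania.columns = ...` to take effect.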

# This plot doesn't make much sense
data_oceania.plot()

#output
<AxesSubplot:xlabel='country'>
#graph has several unreadable lines

# Use transpose 'T' to swap rows and columns so the years run along the x axis
data_oceania.T.plot()
plt.ylabel('GDP Per Capita')  # here we added a y axis label
plt.xticks(rotation = 90)     # here we rotated the x axis labels

#output
(array([-2., 0., 2., 4., 6., 8., 10., 12.]),
 [Text(-2.0, 0, '2002'),
  Text(0.0, 0, '1952'),
  Text(2.0, 0, '1962'),
  Text(4.0, 0, '1972'),
  Text(6.0, 0, '1982'),
  Text(8.0, 0, '1992'),
  Text(10.0, 0, '2002'),
  Text(12.0, 0, '')])
# graph has two lines, one for each country

# Using different plot styles with ggplot
plt.style.use('ggplot')
data_oceania.T.plot()

#output
# graph

plt.style.use('seaborn')
# Let's plot one country against the other country
# s changes the size
# c changes the color
# marker changes the type of marker
data_oceania.T.plot(kind = 'scatter', x = 'New Zealand', y = 'Australia', s = 60, c = 'orange', marker = '3')

#output
# graph

Challenges

Challenge #1

Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)

Challenge #1 solution

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')
data_europe.max().plot(label='max')
plt.legend(loc='best')
plt.xticks(rotation=90)
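
To see what the solution plots, it may help to look at what `min()` and `max()` return before `.plot()` is called: one value per column (year), taken across all rows (countries). A sketch on a hand-built two-country excerpt, with values copied from the Albania and Austria rows in the Day 2 notes (the name `sample` is hypothetical):

```python
import pandas as pd

# Two-country excerpt, values copied from the Day 2 notes.
sample = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492],
        "gdpPercap_1957": [1942.284244, 8842.598030],
    },
    index=pd.Index(["Albania", "Austria"], name="country"),
)

col_min = sample.min()  # Series: smallest value in each column
col_max = sample.max()  # Series: largest value in each column

print(col_min)
print(col_max)
```

Each result is a Series indexed by the column labels, which is why plotting it puts the years on the x axis.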

Challenge #2

Fill in the blanks so that the program below produces the output shown.

values = ____
values.____(1)
values.____(3)
values.____(5)
print('first time:', values)
values = values[____]
print('second time:', values)

# output 
first time: [1, 3, 5]
second time: [3, 5]

Challenge #2 solution

values = []
values.append(1)
values.append(3)
values.append(5)
print('first time:', values)
values = values[1:]
print('second time:', values)

Challenge #3

Fill in the blanks in each of the programs below to produce the indicated result.

# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

Challenge #3 solution

total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)
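
The loop in the solution is an accumulator: a running total updated once per item. For comparison, the sketch below repeats it and adds the built-in `sum()` as an equivalent shortcut (the shortcut is an addition, not part of the challenge).

```python
words = ["red", "green", "blue"]

# Accumulator pattern from the solution.
total = 0
for word in words:
    total = total + len(word)

# Equivalent one-liner using sum() over a generator expression.
total_builtin = sum(len(word) for word in words)

print(total)
print(total_builtin)
```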

Challenge #4

Fill in the blanks so that this program creates a new list containing zeroes where the original list’s values were negative and ones where the original list’s values were positive.

original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = ____
for value in original:
    if ____:
        result.append(0)
    else:
        ____
print(result)
# output 
[0, 1, 1, 1, 0, 1]

Challenge #4 solution

original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = []
for value in original:
    if value < 0.0:
        result.append(0)
    else:
        result.append(1)
print(result)
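
The same result can also be written as a list comprehension with a conditional expression. This is an alternative sketch, not the form the challenge asks for:

```python
original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]

# 0 for negative values, 1 for everything else (including 0.0).
result = [0 if value < 0.0 else 1 for value in original]

print(result)
```

Note that `0.0` maps to 1 in both versions, because the test is `value < 0.0`.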

LIVE Session Notes: https://drive.google.com/file/d/1y8A0xUEWSdSrAhS9Sbvx39Etb1Vnn4rM/view?usp=sharing

End Day 3