Workshop Details
Dates: September 6th - 13th, 2022
Time: 9am - 12pm
Workshop Agenda:
https://ucsdlib.github.io/2022-09-06-carpentries-uc/
Workshop Lesson:
http://swcarpentry.github.io/python-novice-gapminder/
Software Installation:
Anaconda
https://www.anaconda.com/download/
Lesson Data (download)
A copy of the instructor live session notes will be made available to participants upon request at the end of the workshop.
Jupyterlab will be used for the lessons
[m] Markdown cell = notes
[#]also works in code cell for notes
[b] = add cell below [a] is above
[r]Raw cells cannot have text edits
(for Python lessons)
https://www.markdownguide.org/getting-started/
https://www.markdownguide.org/basic-syntax/
Name (first & last) | Organization | Dept. | Email |
---|---|---|---|
(example) Jane Doe | UCSD | IT | jdoe1@ucsd.edu |
Kat Koziar (Helper) | UCR | Library | katherine.koziar@ucr.edu |
Jacob sola | UCR | Chemistry/Biomedical | jsola032@ucr.edu |
Douglas Zhang | UCSD | Chemistry/Biochemistry | doz023@ucsd.edu |
Jacqueline Giacoman | UC Merced | Political Science | jgiacoman@ucmerced.edu |
Jose Hernandez | UCB | Library | jose1991@berkeley.edu |
John Thompson | UC Merced | Molecular & Cellular Biology | jthompson44@ucmerced.edu |
Derek Devnich | UC Merced | ||
Sam Erickson | UC Merced | Physics | serickson3@ucmerced.edu |
Dilawer Ali | UC Merced | Mechanical Engineering | dali4@ucmerced.edu |
Igor Aprelev | UCSD | Mathematics and Economics | iaprelev@ucsd.edu |
Benjamin Nauman | UCLA | Geography | bnauman@ucla.edu |
Mohit Saraswat | UC Merced | Chemistry | msaraswat@ucmerced.edu |
Jacob Ross | UCSD | Anesthesiology | jaross@ucsd.edu |
Jay Colond | UCM | Sociology | jcolond@ucmerced.edu |
Zhaoning (Johnny) Wang | UCSD | CMM | zhw063@health.ucsd.edu |
Lillie Pennington | UC Merced | Life and Environmental Sciences | lpennington@ucmerced.edu |
Christian Henry | UC Berkeley | Integrative Biology | chrishenry@berkeley.edu |
Belina Chong | UCLA | Ecology and Evolutionary Biology | moonmoon394@ucla.edu |
Josiah Piceno | UCM | MBSE | jpiceno3@ucmerced.edu |
Jun Tan | UCSD | Economics | j4tn@ucsd.edu |
Jon Dean | UCSD | Anesthesiology | j1dean@health.ucsd.edu |
Tahirah Williams | UCM | QSB | twilliams76@ucmerced.edu |
Liam de Villa Bourke | UCLA | Institute of the Environment and Sustainability | liamdevilla@g.ucla.edu |
Rukmini Ravi | UCSD | San Diego Supercomputer Center | ruravi@ucsd.edu |
Amber Heidbrink | UCSD | Cell and Developmental Biology | aheidbrink@ucsd.edu |
Haley Potts | UCSD | Math & Economics | hpotts@ucsd.edu |
isabella schaedle | UCSD | MMMMMMMM | |
Apisit Kaewsanit | UCSF | Epidemiology and Biostatistics | apisit.kaewsanit@ucsf.edu |
Ivan Felix Rios | UCSD | Mathematics & Economics | ifelixrios@ucsd.edu |
Christian Corrales | UCLA | Neurology | ccorrales@mednet.ucla.edu |
Michael Woller | UCLA | Psychology | michaelwoller@g.ucla.edu |
Stella Yuan | UCLA | Ecology and Evolutionary Biology | scy8@g.ucla.edu |
Jonathan Le | UCR | Mathematics | jle173@ucr.edu |
Laika Aguinaldo | UCSD | Psychiatry | laaguinaldo@ucsd.edu |
Chris Gray | UCR | Data Science | cgray024@ucr.edu |
Ana Carolina Dantas Machado | UCSD | Medicine | adantasmachado@ucsd.edu |
Jason Ngo | UC Merced | Bioengineering | jngo42@ucmerced.edu |
Yibing Zhang | UC Merced | Bioengineering | yzhang291@ucmerced.edu |
Ashwin Thomas | UC Merced | Environmental Systems | athomas59@ucmerced.edu |
Eric Hyde | UCSD | Epidemiology | ehyde@health.ucsd.edu |
Bineh Ndefru | UCLA | Materials Science | bndefru@ucla.edu |
Vishakha Malhotra | UCSF | Biostatistics and Epidemiology | |
Bruce Hamilton | UCSD | School of Medicine | bah@ucsd.edu |
Kazuma Nagatsuka | UCSD | Robotics(Mechanical Engineering) | kngatsuka@ucsd.edu |
Caitlin Tribelhorn | UCSD | Pediatrics | ctribelh@ucsd.edu |
Vikram Jambulapati | UCSD | Economics | vjambula@ucsd.edu |
Simran Kanal | UCSF | Biostatistics and Epidemiology | simran.kanal@ucsf.edu |
Daryl Han | UC Irvine | Student Center and Event Services | ddhan@uci.edu |
Charles Faulhaber | UC Berkeley | Bancroft Library / Dept. of Spanish | |
Mario Cuaya | UCR | Computer Science | mcuay001@ucr.edu |
Waleed Rajabally | UC Merced | Sociology | wrajabally@ucmerced.edu |
Junxiao Gao | UCSF | Biostatistics and Epidemiology | Junxiao.Gao@ucsf.edu |
Jay Chi | UCSB | ETS | jaychi@ucsb.edu |
Vishakha Malhotra | UCSF | Biostatistics and Epidemiology | vishakha.malhotra@ucsf.edu |
Please enter any questions not answered during live session here:
1.
Download link: https://www.anaconda.com/products/distribution
Working in Anaconda JupyterLab
GUI (middle-man, colloquially pronounced as "gooey") vs command-line
Today's workshop is strictly in JupyterLab GUI
Computer programming languages - there are a lot of them. What they do is similar, and the syntax is also similar between languages (although each has its own specifics), so you can learn the basics in one language and apply them to other languages.
Your favorite search engine is a good resource when you're looking for answers to your programming questions (kat's note: I <3 Stack Exchange)
working directory - in JupyterLab, the working directory is shown in the left sidebar. The left sidebar also holds tabs such as the file browser (where you can select your working directory and create new files/folders) and a list of running terminals. The left sidebar can be collapsed or expanded. Anaconda JupyterLab runs locally on your computer, so if you use a public computer, your files are saved on that public computer
new file - Day1_Python_LiveNotes.ipynb (to rename, right click on file to bring up submenu)
Interface - menu bar at top contains more options than the tabs in the left sidebar quicklinks
Command and Edit modes - in Command mode, pressing B creates a new cell below the current cell
Numbered lists
A tool like HackMD lets you practice markdown.
Bold and italics
In JupyterLab markdown cells, you can combine some html elements, such as <br>
backslash \ before the less-than-symbol will escape the character so it isn't read as html \<br>
Mixed list
Headings use # to create different sizes
age = 42
first_name = 'Ahmed'
thisisaverylongnamethatishardforahumantoread = "Jimmy"
this_is_more_readable = "Jimmy"
thisIsCamelCase = "jimmy"
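Variable names should be self-describing (x is not self-describing, age or weight are self-describing)
Names that start with an underscore have a special meaning (_dont_use_until_you_understand_what_it_means)
Names cannot start with a number (3age) or use any symbol other than the underscore _ (read@one)
Built in functions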
print()
prints things as text
print(first_name, 'is', age, 'years old')
will print Ahmed is 42 years old
print()
will automatically add single spaces between arguments in the current version of Python.
print(argument1, argument2, argument3, argument4)
Variables
print(myval)
will give an error if myval isn't already created with a value
This will throw an error because last_name does not have an assigned value
print(last_name)
last_name = "Smith"
This will not throw an error
last_name = "Smith"
print(last_name)
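A minimal sketch of the same rule, with the error caught so the cell keeps running. The variable name `middle_name` and its value are hypothetical, and `try`/`except` is not part of this lesson; it is used here only to display the error message.

```python
# Hypothetical variable name, used only for this illustration.
try:
    print(middle_name)  # not assigned yet, so Python raises a NameError
except NameError as err:
    error_message = str(err)
    print("Error:", error_message)

middle_name = "Lee"  # assign a value first...
print(middle_name)   # ...then the variable can be used
```

The order of the lines matters: Python runs a cell from top to bottom, so a variable has to be assigned before the line that uses it.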
Challenge #2
Assign the value red to a variable named color1 and the value blue to a variable named color2. Then print red is not blue using the variable names as input (or arguments)
color1 = 'red'
color2 = 'blue'
print(color1, 'is not', color2)
print(color1, 'is', 'not', color2)
Blocks of text
variables used in calculations
age = age + 3
3 + 5 * 4
calculates according to math rules (order of operations), not read left to right
3 + 5 * 4
= 23
(3 + 5) * 4
= 32
Challenge #3
Write the code for the following: number1 is 22, number2 is 5, and number3 is 100. Multiply number1 by number3, then divide by number2. The result of the calculation should be stored in number4. Finally, output 'The answer is number4' - with the value displaying rather than the variable name.
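The notes do not record a solution for this challenge, so the following is one possible sketch rather than the instructor's answer. It uses only variables and `print()` as covered above.

```python
number1 = 22
number2 = 5
number3 = 100

# Multiply first, then divide; in Python 3 the / operator always returns a float.
number4 = number1 * number3 / number2

# Passing the variable (no quotes) prints its value rather than its name.
print('The answer is', number4)
```

This prints `The answer is 440.0`; the trailing `.0` appears because division produces a float.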
Built-in functions
Indexing
gives you a single character from a string (counting starts at 0)
atom_name = 'helium'
print(atom_name[0])
output is h
Indexing
uses the variable name, then square brackets around the number of the index you want to obtain
datatype strings are text surrounded by single or double-quotes (pair single-quotes with single-quotes, don't interchange 'like this")
id_number = 2587464
print(id_number[2])
will result in error because id_number
is an integer, and not a string
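If a digit of the number is what you need, one workaround is to convert the integer to a string first. This is a small sketch; the names `id_text` and `third_digit` are introduced here for illustration.

```python
id_number = 2587464

id_text = str(id_number)   # '2587464' is a string, so it can be indexed
third_digit = id_text[2]   # index 2 is the third character
print(third_digit)
```

The result is the string `'8'`, not the number 8; wrap it in `int()` if you need to do math with it.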
list
my_list = ['apple', 'pear', 'peach']
print(my_list[1])
output is pear
slices
variable[start position: stop position(not including)]
# string example
atom_name = 'sodium'
print(atom_name[0:3])
output is sod
# list example
many_atoms = ['oxygen', 'carbon', 'nitrogen', 'neon', 'iron', 'zinc']
print(many_atoms[1:4])
output will be ['carbon', 'nitrogen', 'neon']
(notice how it outputs in a list format!)
how long are things?
len()
#string example
print(len('helium'))
output is 6
(counts number of characters)
# list example
my_list2 = ['a', '1', '43', 'dream', 'please']
print(len(my_list2))
output is 5
(counts number of elements in list)
Challenge #4
What does thing[:] (just a colon) do?
What does thing[number:some-negative-number] do?
atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])
Solution #4
thing[number:some-negative-number] slices from number
to the negative count from the end of the variable
#example
atom_name = 'carbon'
print(atom_name[1:-4])
output is a
atom_name[1:3] is: ar
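The slicing cases from this challenge can be checked side by side. This is an illustrative sketch using the same `atom_name`; the other variable names are made up for the example.

```python
atom_name = 'carbon'

whole_copy = atom_name[:]   # no start or stop: the whole string
trimmed = atom_name[1:-1]   # from index 1 up to, but not including, the last character
last_two = atom_name[-2:]   # a negative start counts from the end

print(whole_copy)  # carbon
print(trimmed)     # arbo
print(last_two)    # on
```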
int()
float()
float()
type()
type(52)
will output int
print(type(52))
will output <class 'int'>
fitness = 'average'
print(type(fitness))
output is <class 'str'>
print(type(hair))
will throw an error, because Python is reading hair
as a variable name, which isn't defined.
print(type(3.4))
output is <class 'float'>
print (5-2)
will output 3
print ('hello'-'h')
will throw an error because you can't subtract strings
You can use '+' and '*' on integers, floats, and strings, but they operate differently on strings
print (4+5)
output is 9
print ("Ahmed"+"Walch")
output is AhmedWalch
print ('Ahmed'*10)
output is AhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmedAhmed
print (1 + '2')
will throw an error.
however,
print (1 + int('2'))
will output 3
because '2'
is type cast as an integer, allowing math operations.
print (str(1) + '2')
will output 12
(which is actually a string, not a number!)
print ('Gene'+str(23455685))
will output Gene23455685
, which allows easy labels!
if you need to keep an original value of a variable, create a new variable name, otherwise you're overwriting the original value.
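A short sketch of that advice combined with type conversion: the converted value goes into a new variable, so the original string is still available. The variable names and values are hypothetical.

```python
gene_count_text = '23'                 # a string, for example as typed by a user
gene_count = int(gene_count_text)      # new variable holds the integer; the original is untouched
label = 'Gene' + str(gene_count + 1)   # convert back to a string to build a label

print(type(gene_count_text), type(gene_count))
print(label)
```

Because `gene_count_text` was never reassigned, it still holds `'23'` after the conversion.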
LIVE LESSON NOTES: https://drive.google.com/file/d/1TSm1bA55RwQu5-iqdnBNRU47U3os9x86/view?usp=sharing
Name (first & last) | Organization | Dept. | Email |
---|---|---|---|
Geno Sanchez (helper) | UCLA | Library | genosanchez@library.ucla.edu |
Amber Heidbrink | UCSD | Cell and Developmental Biology | aheidbrink@ucsd.edu |
Kat Koziar | UCR | Library | katherine.koziar@ucr.edu |
Yibing Zhang | UCM | Bioengineering | yzhang291@ucmerced.edu |
Douglas Zhang | UCSD | Chemistry and Biochemistry | doz023@ucsd.edu |
Kazuma Nagatsuka | UCSD | Robotics(Mechanical Engineering) | knagatsuka@ucsd.edu |
Jay Colond | UCM | Sociology | jcolond@ucmerced.edu |
Belina Chong | UCLA | Ecology and Evolutionary Biology | moonmoon394@ucla.edu |
Jonathan Le | UCR | Mathematics | jle173@ucr.edu |
Caitlin Tribelhorn | UCSD | Pediatrics | ctribelh@ucsd.edu |
Igor Aprelev | UCSD | Mathematics and Economics | iaprelev@ucsd.edu |
Sam Erickson | UC Merced | Physics | serickson3@ucmerced.edu |
Jay Chi | UCSB | ETS | jaychi@ucsb.edu |
Apisit Kaewsanit | UCSF | Epidemiology and Biostatistics | apisit.kaewsanit@ucsf.edu |
Benjamin Nauman | UCLA | Geography | bnauman@ucla.edu |
Suzanne Paulson | UCLA | AOS | paulson@atmos.ucla.edu |
Liam de Villa Bourke | UCLA | IOES | liamdevilla@g.ucla.edu |
Mario Cuaya | UCR | Computer Science | mcuay001@ucr.edu |
Josiah Piceno | UCM | MBSE | jpiceno3@ucmerced.edu |
John Thompson | UC Merced | Cell & Molecular Biology | jthompson44@ucmerced.edu |
Bineh Ndefru | UCLA | Material Science | bndefru@ucla.edu |
Zhiyuan Yao | UCLA | Data Science Center | zyao@ucla.edu |
Tahirah Williams | UCM | QSB | twilliams76@ucmerced.edu |
Haley Potts | UCSD | Math & Econ | hpotts@ucsd.edu |
Zhaoning (Johnny) Wang | UCSD | CMM | zhw063@health.ucsd.edu |
Daryl Han | UC Irvine | Student Center and Event Services | ddhan@uci.edu |
Simran Kanal | UCSF | Epidemiology and Biostatistics | simran.kanal@ucsf.edu |
Jon Dean | UCSD | Anesthesiology | j1dean@health.ucsd.edu |
Junxiao Gao | UCSF | Biostatistics and Epidemiology | Junxiao.Gao@ucsf.edu |
Stella Yuan | UCLA | Ecology and Evolutionary Biology | scy8@g.ucla.edu |
Waleed Rajabally | UC Merced | Sociology | wrajabally@ucmerced.edu |
Jun Tan | UCSD | Economics | j4tan@ucsd.edu |
Christian Henry | UC Berkeley | Integrative Biology | chrishenry@berkeley.edu |
Jacob Ross | UCSD | Anesthesiology | jaross@ucsd.edu |
Christopher Gray | UCR | Computer Science | cgray024@ucr.edu |
Please enter any questions not answered during live session here:
1.
Gapminder data download: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
A library is a collection of files (called modules) that contains functions for use by other programs.
Use import
to load a library module into a program’s memory.
import math
print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))
pi is 3.141592653589793
cos(pi) is -1.0
help(math)
Help on module math:
NAME
math
MODULE REFERENCE
http://docs.python.org/3/library/math
...
from math import cos, pi
print('cos(pi) is', cos(pi))
cos(pi) is -1.0
import math as m
print('cos(pi) is', m.cos(m.pi))
cos(pi) is -1.0
import matplotlib as mpl
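Challenge: fill in the blanks so that the program below prints 90.0, then rewrite the program so that it uses import without as.
import math as m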
angle = ____.degrees(____.pi / 2)
print(____)
Solution:
import math as m
#1
angle = m.degrees(m.pi / 2)
print(angle)
#2
import math
angle = math.degrees(math.pi / 2)
print(angle)
90.0
# you need to declare a new function with the keyword 'def'.
# you need to include a 'name()'.
def say_hello():
    print("hello!")
# After defining a function, you must 'call' a function to execute it.
say_hello()
hello!
# Let's make a function that prints a date as an example of a function that takes an argument.
def print_date(year, month, day): # the function requires three arguments: year, month, day
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)
print_date(2022, 1, 2)
2022/1/2
print_date(month = 1, year = 2019, day = 23)
2019/1/23
Use return to give a value back to the function call.
def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)
avg = average([1,3,4])
print(avg)
emptyAvg = average([])
print(emptyAvg)
2.6666666666666665
None
#
result = print_date(1871, 3, 19)
print('result of print_date', result)
1871/3/19
result of print_date None
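# Example: calling a function before it is defined raises a NameError
result = print_time(11,37,59)
def print_time(hour, minute, second):
    time_string = str(hour) + ':' + str(minute) + ':' + str(second)
    print(time_string)
# After fix (the function definition is run before the call):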
result = print_time(11, 37, 59)
print('result of call is:', result)
11:37:59
result of call is: None
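The call above gives `None` because `print_time` prints its text but never returns it. A minimal sketch of a variant that returns the string instead; `format_time` is a hypothetical name, not a function from the lesson.

```python
def format_time(hour, minute, second):
    """Return the time as a string instead of printing it."""
    return str(hour) + ':' + str(minute) + ':' + str(second)

result = format_time(11, 37, 59)
print('result of call is:', result)
```

Returning the value lets the caller decide what to do with it: print it, store it, or pass it to another function.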
import os
#Get our current working directory
print(os.getcwd())
#List the contents of this directory
print(os.listdir())
import pandas as pd
data = pd.read_csv("gapminder_gdp_oceania.csv")
#Reading data from a subfolder
#data = pd.read_csv("subfolder/gapminder_gdp_oceania.csv")
print(data)
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
0 Australia 10039.59564 10949.64959 12217.22686
1 New Zealand 10556.57566 12247.39532 13175.67800
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \
0 14526.12465 16788.62948 18334.19751 19477.00928
1 14463.91893 16046.03728 16233.71770 17632.41040
gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \
0 21888.88903 23424.76683 26997.93657 30687.75473
1 19007.19129 18363.32494 21050.41377 23189.80135
gdpPercap_2007
0 34435.36744
1 25185.00911
data
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
0 Australia 10039.59564 10949.64959 12217.22686 14526.12465 16788.62948 18334.19751 19477.00928 21888.88903 23424.76683 26997.93657 30687.75473 34435.36744
1 New Zealand 10556.57566 12247.39532 13175.67800 14463.91893 16046.03728 16233.71770 17632.41040 19007.19129 18363.32494 21050.41377 23189.80135 25185.00911
# lets identify our rows by country not index number
data = pd.read_csv("gapminder_gdp_oceania.csv", index_col = "country")
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Australia 10039.59564 10949.64959 12217.22686 14526.12465 16788.62948 18334.19751 19477.00928 21888.88903 23424.76683 26997.93657 30687.75473 34435.36744
New Zealand 10556.57566 12247.39532 13175.67800 14463.91893 16046.03728 16233.71770 17632.41040 19007.19129 18363.32494 21050.41377 23189.80135 25185.00911
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gdpPercap_1952 2 non-null float64
1 gdpPercap_1957 2 non-null float64
2 gdpPercap_1962 2 non-null float64
3 gdpPercap_1967 2 non-null float64
4 gdpPercap_1972 2 non-null float64
5 gdpPercap_1977 2 non-null float64
6 gdpPercap_1982 2 non-null float64
7 gdpPercap_1987 2 non-null float64
8 gdpPercap_1992 2 non-null float64
9 gdpPercap_1997 2 non-null float64
10 gdpPercap_2002 2 non-null float64
11 gdpPercap_2007 2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes
data.describe()
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
count 2.000000 2.000000 2.000000 2.000000 2.00000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
mean 10298.085650 11598.522455 12696.452430 14495.021790 16417.33338 17283.957605 18554.709840 20448.040160 20894.045885 24024.175170 26938.778040 29810.188275
std 365.560078 917.644806 677.727301 43.986086 525.09198 1485.263517 1304.328377 2037.668013 3578.979883 4205.533703 5301.853680 6540.991104
min 10039.595640 10949.649590 12217.226860 14463.918930 16046.03728 16233.717700 17632.410400 19007.191290 18363.324940 21050.413770 23189.801350 25185.009110
25% 10168.840645 11274.086022 12456.839645 14479.470360 16231.68533 16758.837652 18093.560120 19727.615725 19628.685412 22537.294470 25064.289695 27497.598692
50% 10298.085650 11598.522455 12696.452430 14495.021790 16417.33338 17283.957605 18554.709840 20448.040160 20894.045885 24024.175170 26938.778040 29810.188275
75% 10427.330655 11922.958888 12936.065215 14510.573220 16602.98143 17809.077558 19015.859560 21168.464595 22159.406358 25511.055870 28813.266385 32122.777858
max 10556.575660 12247.395320 13175.678000 14526.124650 16788.62948 18334.197510 19477.009280 21888.889030 23424.766830 26997.936570 30687.754730 34435.367440
data.columns
# or
print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
dtype='object')
Dataframes are a collection of columns. Within a column it has to be the same data type (e.g. float, int, str)
Challenge: read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
Use help(americas.head) and help(americas.tail) to find out what DataFrame.head and DataFrame.tail do.
solution:
americas = pd.read_csv("data/gapminder_gdp_americas.csv", index_col = "country")
print(americas.head(3))
print(americas.describe())
continent gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \
country
Argentina Americas 5911.315053 6856.856212 7133.166023
Bolivia Americas 2677.326347 2127.686326 2180.972546
Brazil Americas 2108.944355 2487.365989 3336.585802
gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \
country
Argentina 8052.953021 9443.038526 10079.026740 8997.897412
Bolivia 2586.886053 2980.331339 3548.097832 3156.510452
Brazil 3429.864357 4985.711467 6660.118654 7030.835878
gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \
country
Argentina 9139.671389 9308.418710 10967.281950 8797.640716
Bolivia 2753.691490 2961.699694 3326.143191 3413.262690
Brazil 7807.095818 6950.283021 7957.980824 8131.212843
gdpPercap_2007
country
Argentina 12779.379640
Bolivia 3822.137084
Brazil 9065.800825
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \
count 25.000000 25.000000 25.000000 25.000000
mean 4079.062552 4616.043733 4901.541870 5668.253496
std 3001.727522 3312.381083 3421.740569 4160.885560
min 1397.717137 1544.402995 1662.137359 1452.057666
25% 2428.237769 2487.365989 2750.364446 3242.531147
50% 3048.302900 3780.546651 4086.114078 4643.393534
75% 3939.978789 4756.525781 5180.755910 5788.093330
max 13990.482080 14847.127120 16173.145860 19530.365570
gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \
count 25.000000 25.000000 25.000000 25.000000
mean 6491.334139 7352.007126 7506.737088 7793.400261
std 4754.404329 5355.602518 5530.490471 6665.039509
min 1654.456946 1874.298931 2011.159549 1823.015995
25% 4031.408271 4756.763836 4258.503604 4140.442097
50% 5305.445256 6281.290855 6434.501797 6360.943444
75% 6809.406690 7674.929108 8997.897412 7807.095818
max 21806.035940 24072.632130 25009.559140 29884.350410
gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
count 25.000000 25.000000 25.000000 25.000000
mean 8044.934406 8889.300863 9287.677107 11003.031625
std 7047.089191 7874.225145 8895.817785 9713.209302
min 1456.309517 1341.726931 1270.364932 1201.637154
25% 4439.450840 4684.313807 4858.347495 5728.353514
50% 6618.743050 7113.692252 6994.774861 8948.102923
75% 8137.004775 9767.297530 8797.640716 11977.574960
max 32003.932240 35767.433030 39097.099550 42951.653090
# get a column
data = pd.read_csv("gapminder_gdp_europe.csv", index_col = "country")
data.columns
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
dtype='object')
col1 = data["gdpPercap_1957"] # getting data by column label
print(col1)
country
Albania 1942.284244
Austria 8842.598030
Belgium 9714.960623
Bosnia and Herzegovina 1353.989176
Bulgaria 3008.670727
Croatia 4338.231617
Czech Republic 8256.343918
Denmark 11099.659350
Finland 7545.415386
France 8662.834898
Germany 10187.826650
Greece 4916.299889
Hungary 6040.180011
Iceland 9244.001412
Ireland 5599.077872
Italy 6248.656232
Montenegro 3682.259903
Netherlands 11276.193440
Norway 11653.973040
Poland 4734.253019
Portugal 3774.571743
Romania 3943.370225
Serbia 4981.090891
Slovak Republic 6093.262980
Slovenia 5862.276629
Spain 4564.802410
Sweden 9911.878226
Switzerland 17909.489730
Turkey 2218.754257
United Kingdom 11283.177950
Name: gdpPercap_1957, dtype: float64
# Pandas introduces new data types
print(type(data))
<class 'pandas.core.frame.DataFrame'>
print(type(col1))
<class 'pandas.core.series.Series'>
subset1 = data.iloc[0, 0]
print(subset1)
1601.056136
subset2 = data.loc["Albania", "gdpPercap_1952"]
print(subset2)
1601.056136
data.loc["Albania",:]
gdpPercap_1952 1601.056136
gdpPercap_1957 1942.284244
gdpPercap_1962 2312.888958
gdpPercap_1967 2760.196931
gdpPercap_1972 3313.422188
gdpPercap_1977 3533.003910
gdpPercap_1982 3630.880722
gdpPercap_1987 3738.932735
gdpPercap_1992 2497.437901
gdpPercap_1997 3193.054604
gdpPercap_2002 4604.211737
gdpPercap_2007 5937.029526
Name: Albania, dtype: float64
country_subset = data.loc["Italy":"Poland", "gdpPercap_1962":"gdpPercap_1972"]
country_subset
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
print(type(country_subset))
print(country_subset.describe())
<class 'pandas.core.frame.DataFrame'>
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
count 5.000000 5.000000 5.000000
mean 8894.635868 10842.506571 13162.799194
std 4093.410673 4855.106424 5517.298708
min 4649.593785 5907.850937 7778.414017
25% 5338.752143 6557.152776 8006.506993
50% 8243.582340 10022.401310 12269.273780
75% 12790.849560 15363.251360 18794.745670
max 13450.401510 16361.876470 18965.055510
# Gives you a dataframe containing the rows for 2 specific countries in your data
data.loc[["Italy","Poland"], :]
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
country
Italy 4931.404155 6248.656232 8243.582340 10022.401310 12269.273780 14255.984750 16537.483500 19207.234820 22013.644860 24675.02446 27968.09817 28569.71970
Poland 4029.329699 4734.253019 5338.752143 6557.152776 8006.506993 9508.141454 8451.531004 9082.351172 7738.881247 10159.58368 12002.23908 15389.92468
alt solution:
italy = data.loc["Italy", "gdpPercap_1952":"gdpPercap_1962"]
poland = data.loc["Poland", "gdpPercap_1952":"gdpPercap_1962"]
pd.concat([italy, poland])
gdpPercap_1952 4931.404155
gdpPercap_1957 6248.656232
gdpPercap_1962 8243.582340
gdpPercap_1952 4029.329699
gdpPercap_1957 4734.253019
gdpPercap_1962 5338.752143
dtype: float64
data.iloc[0:2, 0:2]
gdpPercap_1952 gdpPercap_1957
country
Albania 1601.056136 1942.284244
Austria 6137.076492 8842.598030
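One difference between the two selectors is easy to miss: `iloc` slices by position and leaves out the stop position, while `loc` slices by label and includes the stop label. The sketch below uses a small made-up table (not the gapminder values) so that it runs without the data files.

```python
import pandas as pd

# Small made-up table, for illustration only (not the gapminder values).
toy = pd.DataFrame(
    {"y1952": [1.0, 2.0, 3.0], "y1957": [4.0, 5.0, 6.0], "y1962": [7.0, 8.0, 9.0]},
    index=["A", "B", "C"],
)

by_position = toy.iloc[0:2, 0:2]              # stop position excluded: rows 0-1, columns 0-1
by_label = toy.loc["A":"B", "y1952":"y1957"]  # stop label included: rows A-B, both columns

print(by_position.equals(by_label))
```

Both selections return the same 2 x 2 block, which is why `"Italy":"Poland"` in the earlier example includes Poland.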
#Filtering data by a criterion
country_subset
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
country_subset > 10000
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy False True True
Montenegro False False False
Netherlands True True True
Norway True True True
Poland False False False
country_subset[country_subset > 10000]
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy NaN 10022.40131 12269.27378
Montenegro NaN NaN NaN
Netherlands 12790.84956 15363.25136 18794.74567
Norway 13450.40151 16361.87647 18965.05551
Poland NaN NaN NaN
# Using the where() method for filtering
country_subset.where(country_subset > 10000)
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy NaN 10022.40131 12269.27378
Montenegro NaN NaN NaN
Netherlands 12790.84956 15363.25136 18794.74567
Norway 13450.40151 16361.87647 18965.05551
Poland NaN NaN NaN
# Method chaining
country_subset.where(country_subset > 10000).describe()
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
count 2.000000 3.000000 3.000000
mean 13120.625535 13915.843047 16676.358320
std 466.373656 3408.589070 3817.597015
min 12790.849560 10022.401310 12269.273780
25% 12955.737548 12692.826335 15532.009725
50% 13120.625535 15363.251360 18794.745670
75% 13285.513522 15862.563915 18879.900590
max 13450.401510 16361.876470 18965.055510
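The count row above is 2 for 1962 and 3 for the other years because the statistics skip the NaN values that `where()` leaves behind. A small sketch with made-up numbers (not gapminder data) shows the same effect on a single column.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({"gdp": [5.0, 12.0, 15.0]}, index=["A", "B", "C"])

masked = toy.where(toy > 10)       # values that fail the test become NaN
n_kept = masked["gdp"].count()     # count() skips NaN
mean_kept = masked["gdp"].mean()   # mean() skips NaN too

print(n_kept, mean_kept)
```

Only the two values above 10 are counted and averaged, so the filter changes the statistics without removing any rows.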
country_subset.rank()
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 3.0 3.0 3.0
Montenegro 1.0 1.0 1.0
Netherlands 4.0 4.0 4.0
Norway 5.0 5.0 5.0
Poland 2.0 2.0 2.0
# An elaborate chaining example
country_subset.rank().corr("kendall")
country_subset.to_csv("country_subset.csv")
country_subset
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
LIVE LESSON NOTES:
Day 2 live notes A
Day 2 live notes B
Name (first & last) | Organization | Dept. | Email |
---|---|---|---|
Zhiyuan Yao | UCLA | Data Science Center | zyao@ucla.edu |
Mario Cuaya | UCR | Computer Science | mcuay001@ucr.edu |
Amber Heidbrink | UCSD | Cell and Developmental Biology | aheidbrink@ucsd.edu |
Douglas Zhang | UCSD | Chemistry and Biochemistry | doz023@ucsd.edu |
Stella Yuan | UCLA | Ecology and Evolutionary Biology | scy8@ucla.edu |
Benjamin Nauman | UCLA | Geography | bnauman@ucla.edu |
Belina Chong | UCLA | Ecology and Evolutionary Biology | moonmoon394@ucla.edu |
Haley Potts | UCSD | Math & Economics | hpotts@ucsd.edu |
Igor Aprelev | UCSD | Mathematics and Economics | iaprelev@ucsd.edu |
Jun Tan | UCSD | Economics | j4tan@ucsd.edu |
Jonathan Le | UCR | Mathematics | jle173@ucr.edu |
Bineh Ndefru | UCLA | Materials Science | bndefru@ucla.edu |
Jay Chi | UCSB | ETS | jaychi@ucsb.edu |
Kazuma Nagatsuka | UCSD | Robotics(Mechanical Engineering) | knagatsuka@ucsd.edu |
Josiah Piceno | UCM | MBSE | jpiceno3@ucmerced.edu |
Yibing Zhang | UCM | Bioengineering | yzhang291@ucmerced.edu |
Simran Kanal | UCSF | Epidemiology and Biostatistics | simran.kanal@ucsf.edu |
Dilawer Ali | UC Merced | Mechanical Engineering | dali4@ucmerced.edu |
Tahirah Williams | UCM | QSB | twilliams76@gmail.com |
Christian Henry | UC Berkeley | UC Berkeley | chrishenry@berkeley.edu |
Zhaoning (Johnny) Wang | UCSD | CMM | zhw063@health.ucsd.edu |
Daryl Han | UC Irvine | Student Center and Event Services | ddhan@uci.edu |
Jacob Ross | UCSD | Anesthesiology | jaross@ucsd.edu |
Jay Colond | UCM | Sociology | jcolond@ucmerced.edu |
John Thompson | UC Merced | Molecular & Cellular Biology | jthompson44@ucmerced.edu |
Apisit Kaewsanit | UCSF | Epidemiology and Biostatistics | apisit.kaewsanit@ucsf.edu |
Caitlin Tribelhorn | UCSD | Pediatrics | ctribelh@ucsd.edu |
Waleed Rajabally | UCM | Sociology | wrajabally@ucmerced.edu |
Junxiao Gao | UCSF | Epidemiology and Biostatistics | Junxiao.Gao@ucsf.edu |
Sam Erickson | UC Merced | Physics | serickson3@ucmerced.edu |
Christopher Gray | UCR | Computer Science | cgray024@ucr.edu |
Please enter any questions not answered during live session here:
1.
# Day 3 Lists
# brackets[]
# can have different data types
# it is mutable - a character string is not mutable
# you can extend/append a list to make it longer
pressure = [0.6, 0.7, 0.8, 0.9]
print(pressure)
#output
[0.6, 0.7, 0.8, 0.9]
list_a = ['a', 'b', 4, 6.7]
print(list_a)
#output
['a', 'b', 4, 6.7]
#array
import numpy as np
a = np.array(pressure) # convert the list to a NumPy array
len(list_a)
#output
4
list_a[1]
#output
'b'
pressure
#output
[0.6, 0.7, 0.8, 0.9]
# assign a new value to a list
pressure[3] = 5
pressure
#output
[0.6, 0.7, 0.8, 5]
# extend or append new values to make a list longer
a = [1,2,3,4]
b = [5,6,7,8,9]
a.append(b)
print(a)
#output
[1, 2, 3, 4, [5, 6, 7, 8, 9]]
a[4][1]
#output
6
a = [1,2,3,4]
a.append(8)
print(a)
#output
[1, 2, 3, 4, 8]
# extend
a = [1,2,3,4]
b = [5,6,7,8,9]
a.extend(b)
print(a)
#output
[1, 2, 3, 4, 5, 6, 7, 8, 9]
list_empty = []
print(list_empty)
#output
[]
# a character string is immutable
string_list = 'address'
string_list[3]
#output
'r'
string_list[3] = 'o'
#output
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-0babce904faa> in <module>
----> 1 string_list[3] = 'o'
TypeError: 'str' object does not support item assignment
# But if you convert the string to a list, you can use index to change the list
string_list = list('address')
string_list[3] = 'o'
print(string_list)
#output
['a', 'd', 'd', 'o', 'e', 's', 's']
# from a string to a list and back
string_a = 'gold'
string_list = list('gold')
print(string_list)
#output
['g', 'o', 'l', 'd']
string_list[0]
#output
'g'
#convert a list to a string using string.join()
string_list
print(''.join(string_list))
#output
gold
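Putting the pieces together, a string can be "edited" by going through a list and back. This is an illustrative sketch; `letters` and `new_word` are names introduced here.

```python
word = 'address'

letters = list(word)          # strings are immutable, lists are not
letters[3] = 'o'              # change one element in place
new_word = ''.join(letters)   # glue the characters back into a string

print(new_word)
print(word)
```

The original `word` is unchanged; `join` builds a new string from the modified list.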
# Stepping through a list
string_list = list('address')
print(string_list)
#output
['a', 'd', 'd', 'r', 'e', 's', 's']
# the double colon with a step of 1 returns every value in the list
string_list[::1]
#output
['a', 'd', 'd', 'r', 'e', 's', 's']
# putting 2 instead of 1 returns every other (every 2nd) value
string_list[::2]
#output
['a', 'd', 'e', 's']
# putting 2 before the colons starts at index 2, omitting the first two values
string_list[2::]
#output
['d', 'r', 'e', 's', 's']
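The general form is `list[start:stop:step]`, and the three parts can be combined. A short sketch on the same list of letters; the variable names are made up for the example.

```python
letters = list('address')

every_third = letters[::3]    # indices 0, 3, 6
middle_step = letters[1:6:2]  # indices 1, 3, 5
backwards = letters[::-1]     # a negative step walks from the end to the start

print(every_third)
print(middle_step)
print(backwards)
```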
# Difference between sort and sorted using a list
string_list = list('gold')
result = sorted(string_list)
print(result)
# the output is sorted in alphabetical order
['d', 'g', 'l', 'o']
print(string_list)
#output
['g', 'o', 'l', 'd']
string_list = list('gold')
result = string_list.sort()
print(result)
print(string_list)
#output
None
['d', 'g', 'l', 'o']
# sorted(variable) returns a new sorted list that you can assign to a new variable; the original list is unchanged
list_num = [10,2,5,7,8,4]
result_num = sorted(list_num)
print(result_num)
print(list_num)
#output
[2, 4, 5, 7, 8, 10]
[10, 2, 5, 7, 8, 4]
list_num = [10,2,5,7,8,4]
result_num = list_num.sort()
print(result_num)
print(list_num)
#output
None
[2, 4, 5, 7, 8, 10]
# Use variable.sort() as a method acting on the list to sort the list in place
# This changes the list itself and returns None, so do not assign its result
list_num.sort()
print(list_num)
#output
[2, 4, 5, 7, 8, 10]
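Both sorted() and .sort() accept the optional `reverse` and `key` arguments. A minimal sketch, assuming the same list_num as above; the `words` list is made up for illustration.

```python
list_num = [10, 2, 5, 7, 8, 4]

# reverse=True sorts from largest to smallest; list_num itself is unchanged
descending = sorted(list_num, reverse=True)

# key=len sorts by the length of each word instead of alphabetically
words = ['gold', 'silver', 'tin', 'lead']   # illustrative values
by_length = sorted(words, key=len)

# A common slip: .sort() returns None, so assigning its result loses the list
lost = list_num.copy().sort()

print(descending)
print(by_length)
print(lost)
```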
## Lesson: Plotting
import matplotlib.pyplot as plt
time = [1, 2, 3, 4]
position = [100, 200, 300, 400]
plt.plot(time,position, label = 'Position changes during time')
plt.xlabel('Time')
plt.ylabel('Position')
plt.legend()
plt.title('Position changes during time')
#output
Text(0.5, 1.0, 'Position changes during time')
#graph
# Plot directly from a dataframe
import pandas as pd
# import the data and save as a dataframe
data_oceania = pd.read_csv('gapminder_gdp_oceania.csv', index_col = 'country')
# Remove the 'gdpPercap_' part of each column name so only the year remains
data_oceania.columns = data_oceania.columns.str.strip('gdpPercap_')
# astype(int) returns a new Index; the result is not assigned here, so the columns stay strings
# (dtype='object' in the output below). To keep the integers, assign the result back to data_oceania.columns
data_oceania.columns.astype(int)
print(data_oceania.columns) # the column labels (years)
print(data_oceania.index) # the row labels (countries)
#output
Index(['1952', '1957', '1962', '1967', '1972', '1977', '1982', '1987', '1992',
'1997', '2002', '2007'],
dtype='object')
Index(['Australia', 'New Zealand'], dtype='object', name='country')
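A note on strip: pandas `.str.strip` behaves like Python's built-in `str.strip`, which removes any of the listed characters from both ends rather than an exact prefix. That works for the year columns here, but can surprise on other labels. The sketch below uses plain Python strings; `year_from_label` is a hypothetical helper and 'gdpPercap_area' is a made-up label.

```python
prefix = 'gdpPercap_'

def year_from_label(label):
    """Hypothetical helper: drop the exact prefix and return the year as an int."""
    if label.startswith(prefix):
        label = label[len(prefix):]   # remove only the leading prefix
    return int(label)

# strip() treats its argument as a set of characters to remove from both ends
stripped_year = 'gdpPercap_1952'.strip(prefix)
stripped_word = 'gdpPercap_area'.strip(prefix)   # every character is in the set

print(repr(stripped_year))
print(repr(stripped_word))
print(year_from_label('gdpPercap_1952'))
```

The year survives because digits are not among the stripped characters; the made-up label is stripped down to an empty string.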
# This plot doesn't make much sense
data_oceania.plot()
#output
<AxesSubplot:xlabel='country'>
#graph has several unreadable lines
# Use transpose 'T' to swap rows and columns so the years are on the x axis and each country is its own line
data_oceania.T.plot()
plt.ylabel('GDP Per Capita') # here we added a y axis label
plt.xticks(rotation = 90) # here we rotated the x axis labels
#output
(array([-2., 0., 2., 4., 6., 8., 10., 12.]),
[Text(-2.0, 0, '2002'),
Text(0.0, 0, '1952'),
Text(2.0, 0, '1962'),
Text(4.0, 0, '1972'),
Text(6.0, 0, '1982'),
Text(8.0, 0, '1992'),
Text(10.0, 0, '2002'),
Text(12.0, 0, '')])
# graph now has one line per country (two lines in total)
# Using different plot styles with ggplot
plt.style.use('ggplot')
data_oceania.T.plot()
#output
# graph
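# Note: Matplotlib 3.6 and later name this style 'seaborn-v0_8'; use that name if 'seaborn' is not found
plt.style.use('seaborn')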
# Let's plot one country against the other country
# s changes the size
# c changes the color
# marker changes the type of marker
data_oceania.T.plot(kind = 'scatter', x = 'New Zealand', y = 'Australia', s = 60, c = 'orange', marker = '3')
#output
# graph
Fill in the blanks below to plot the minimum GDP per capita over time for all the countries in Europe. Modify it again to plot the maximum GDP per capita over time for Europe.
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.____.plot(label='min')
data_europe.____
plt.legend(loc='best')
plt.xticks(rotation=90)
# solution
data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data_europe.min().plot(label='min')
data_europe.max().plot(label='max')
plt.legend(loc='best')
plt.xticks(rotation=90)
Fill in the blanks so that the program below produces the output shown.
values = ____
values.____(1)
values.____(3)
values.____(5)
print('first time:', values)
values = values[____]
print('second time:', values)
# output
first time: [1, 3, 5]
second time: [3, 5]
# solution
values = []
values.append(1)
values.append(3)
values.append(5)
print('first time:', values)
values = values[1:]
print('second time:', values)
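The slice `values[1:]` in this exercise builds a new list and leaves the original alone until it is reassigned. For comparison, a sketch of removing the first value in place with `pop(0)`, which was not part of the exercise:

```python
values = []
for number in [1, 3, 5]:
    values.append(number)

tail = values[1:]        # a new list; values still holds all three numbers
print(values, tail)

first = values.pop(0)    # removes index 0 in place and returns the removed value
print(first, values)
```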
Fill in the blanks in each of the programs below to produce the indicated result.
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)
# solution
total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)
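The loop above is the accumulator pattern: start from zero and add to the running total on each pass. As a sketch, the same total can also be computed with the built-in `sum()`; this shortcut was not part of the exercise.

```python
words = ["red", "green", "blue"]

# Accumulator pattern, as in the exercise
total = 0
for word in words:
    total = total + len(word)

# Same result in one line: sum() adds up the length of each word
total_builtin = sum(len(word) for word in words)

print(total, total_builtin)
```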
Fill in the blanks so that this program creates a new list containing zeroes where the original list’s values were negative and ones where the original list’s values were positive.
original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = ____
for value in original:
    if ____:
        result.append(0)
    else:
        ____
print(result)
# output
[0, 1, 1, 1, 0, 1]
# solution
original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = []
for value in original:
    if value < 0.0:
        result.append(0)
    else:
        result.append(1)
print(result)
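The same result can be written more compactly with a list comprehension and a conditional expression. This is an optional alternative, not the form used in the exercise:

```python
original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]

# 0 for negative values, 1 for everything else (including 0.0)
result = [0 if value < 0.0 else 1 for value in original]

print(result)
```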
LIVE Session Notes: https://drive.google.com/file/d/1y8A0xUEWSdSrAhS9Sbvx39Etb1Vnn4rM/view?usp=sharing