{%hackmd @rpyapp/theme %}

# DataCamp Python Notes
###### tags: `Python` `DataCamp` `Notes`

[TOC]

## [Python Data Science Toolbox (Part 1)](https://www.datacamp.com/courses/python-data-science-toolbox-part-1)

### 1. Writing your own functions
* #### Docstrings:
    > Make a habit of starting every function with a docstring.
    ```python=
    def raise_both(value1, value2):
        """Raise value1 to the power of value2 and vice versa."""
        new_value1 = value1 ** value2
        new_value2 = value2 ** value1
        new_tuple = (new_value1, new_value2)
        return new_tuple
    ```

* #### Unpacking tuples:
    ```python=
    even_nums = (2, 4, 6)
    a, b, c = even_nums
    print(a)
    # 2
    ```

---

### 2. Default arguments, variable-length arguments and scope
* #### Global vs. local scope
    ```python=
    new_val = 10

    def square(value):
        """Returns the square of the global new_val (the argument is ignored)."""
        global new_val
        new_val = new_val ** 2
        return new_val

    square(3)  # 100
    new_val    # 100
    ```

* #### Nested functions:
    ```python=
    def mod2plus5(x1, x2, x3):
        """Returns the remainder plus 5 of three values."""
        def inner(x):
            """Returns the remainder plus 5 of a value."""
            return x % 2 + 5
        return (inner(x1), inner(x2), inner(x3))
    ```

* #### Returning functions:
    ```python=
    def raise_val(n):
        """Return the inner function."""
        def inner(x):
            """Raise x to the power of n."""
            raised = x ** n
            return raised
        return inner

    square = raise_val(2)
    cube = raise_val(3)
    print(square(2), cube(4))
    # 4 64
    ```

* #### Using nonlocal:
    ```python=
    def outer():
        """Prints the value of n."""
        n = 1
        def inner():
            nonlocal n
            n = 2
            print(n)
        inner()
        print(n)

    outer()
    # 2
    # 2
    ```

* #### `*args`:
    ```python=
    def add_all(*args):
        """Sum all values in *args together."""
        # Initialize sum
        sum_all = 0
        # Accumulate the sum
        for num in args:
            sum_all += num
        return sum_all

    add_all(5, 10, 15, 20)
    # 50
    ```

* #### `**kwargs`:
    ```python=
    def print_all(**kwargs):
        """Print out key-value pairs in **kwargs."""
        # Print out the key-value pairs
        for key, value in kwargs.items():
            print(key + ": " + value)

    print_all(name="dumbledore", job="headmaster")
    # name: dumbledore
    # job: headmaster
    ```

* [`*args` vs `**kwargs`](https://skylinelimit.blogspot.com/2018/04/python-args-kwargs.html)

---

### 3. Lambda functions and error-handling
* #### Anonymous functions:
    > map() applies the function to ALL elements in the sequence
    ```python=
    nums = [48, 6, 9, 21, 1]
    square_all = map(lambda num: num ** 2, nums)
    print(list(square_all))
    # [2304, 36, 81, 441, 1]
    ```

* #### Errors and exceptions:
    ```python=
    def sqrt(x):
        """Return the square root of a non-negative number."""
        try:
            # The comparison sits inside the try block so that a
            # non-numeric x (e.g. a string) is caught as a TypeError.
            if x < 0:
                raise ValueError('x must be non-negative')
            return x ** 0.5
        except TypeError:
            print('x must be an int or float')
    ```

---

## [Python Data Science Toolbox (Part 2)](https://www.datacamp.com/courses/python-data-science-toolbox-part-2)

### 1. Using iterators in PythonLand
* #### Iterators vs. iterables:
    * Iterable
        * Examples: lists, strings, dictionaries, file connections
        * An object with an associated iter() method
        * Applying iter() to an iterable creates an iterator
    * Iterator
        * Produces next value with next()

* #### Iterating over iterables: next():
    ```python=
    word = 'Da'
    it = iter(word)
    next(it)  # 'D'
    next(it)  # 'a'
    next(it)  # raises StopIteration: the iterator is exhausted
    ```

* #### Iterating at once with *:
    ```python=
    word = 'Data'
    it = iter(word)
    print(*it)  # D a t a
    print(*it)  # prints an empty line: the iterator is already exhausted
    ```

* #### Iterating over dictionaries:
    ```python=
    my_dict = {'a': 1, 'b': 2}  # placeholder dictionary
    for key, value in my_dict.items():
        print(key, value)
    ```

* #### Using enumerate():
    ```python=
    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    e = enumerate(avengers)
    e_list = list(e)
    print(e_list)
    # [(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]
    ```
    ```python=+
    for index, value in enumerate(avengers):
        print(index, value)
    # 0 hawkeye
    # 1 iron man
    # 2 thor
    # 3 quicksilver
    ```
    ```python=+
    for index, value in enumerate(avengers, start=10):
        print(index, value)
    # 10 hawkeye
    # 11 iron man
    # 12 thor
    # 13 quicksilver
    ```

* #### Using zip():
    ```python=
    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    names = ['barton', 'stark', 'odinson', 'maximoff']
    z = zip(avengers, names)
    print(*z)
    # ('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')
    z_list = list(zip(avengers, names))  # z was exhausted by print(*z), so zip again
    print(z_list)
    # [('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]
    ```
    ```python=+
    for z1, z2 in zip(avengers, names):
        print(z1, z2)
    # hawkeye barton
    # iron man stark
    # thor odinson
    # quicksilver maximoff
    ```

* #### Using iterators for big data:
    ```python=
    import pandas as pd

    total = 0
    # Read the file 1000 rows at a time instead of loading it all at once
    for chunk in pd.read_csv('data.csv', chunksize=1000):
        total += sum(chunk['x'])
    print(total)
    # 9999999
    ```

---

### 2. List comprehensions
* Basic:
    ```python=
    result = [num for num in range(11)]
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    ```
    > Shorter, but sometimes at the cost of readability.

    > Outer for, inner for
    ```python=
    pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
    # [(0, 6), (0, 7), (1, 6), (1, 7)]
    ```

* Advanced:
    > for first, then an if filter
    ```python=
    [num ** 2 for num in range(10) if num % 2 == 0]
    # [0, 4, 16, 36, 64]
    ```
    > if/else in the output expression first, then for
    ```python=
    [num ** 2 if num % 2 == 0 else 0 for num in range(10)]
    # [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
    ```

* Dict comprehensions:
    ```python=
    pos_neg = {num: -num for num in range(5)}
    # {0: 0, 1: -1, 2: -2, 3: -3, 4: -4}
    ```

* Generators:
    ```python=
    result = (num for num in range(5))
    print(result)
    # <generator object <genexpr> at 0x1046bf888>
    ```
    ```python=
    result = (num for num in range(5))
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4
    ```
    ```python=
    result = (num for num in range(5))
    print(list(result))
    # [0, 1, 2, 3, 4]
    ```
    ```python=
    result = (num for num in range(5))
    print(next(result))
    # 0
    ```
    ```python=
    even_nums = (num for num in range(10) if num % 2 == 0)
    print(list(even_nums))
    # [0, 2, 4, 6, 8]
    ```

* List comprehensions vs. generators:
    * List comprehension
        * Builds the entire list immediately
    * Generators
        * Produce values only when they are requested, which makes them useful for large or streaming data

* Generator functions:
    > Use `yield` to emit values
    ```python=
    def num_sequence(n):
        """Generate values from 0 up to (but not including) n."""
        i = 0
        while i < n:
            yield i
            i += 1
    ```
    ```python=
    result = num_sequence(5)
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4
    ```

---

## [Cleaning Data in Python](https://www.datacamp.com/courses/cleaning-data-in-python)

### 1. Exploring your data
* #### First look:
    ```python=
    df.head()
    df.tail()
    df.columns  # an attribute, not a method
    df.shape
    df.info()  # structure: index, column names, column dtypes, ...
    df.columnA.value_counts(dropna=False)  # categorical column
    df.columnA.value_counts(dropna=False).head()
    df.describe()  # numeric columns
    ```

* #### Plots:
    ```python=
    import matplotlib.pyplot as plt
    df.columnA.plot(kind='hist')
    plt.show()
    df.boxplot(column='population', by='continent')
    plt.show()
    ```

---

### 2. Tidying data for analysis
* melt
    ```python=
    pd.melt(frame=df, id_vars='name',
            value_vars=['treatment a', 'treatment b'],
            var_name='treatment', value_name='result')
    ```
    ```python=
    pd.melt(frame=tb, id_vars=['country', 'year'])
    ```

* pivot
    ```python=
    weather_tidy = weather.pivot(index='date',
                                 columns='element',
                                 values='value')
    ```
    ```python=
    # pivot_table aggregates duplicate index/column pairs instead of failing
    weather2_tidy = weather.pivot_table(index='date',
                                        columns='element',
                                        values='value',
                                        aggfunc='mean')
    ```

---

### 3. Combining data for analysis
* row bind
    > Remember to handle the index
    ```python=
    concatenated = pd.concat([weather_p1, weather_p2], ignore_index=True)
    ```

* Combining split files
    ```python=
    import glob
    import pandas as pd

    csv_files = glob.glob('*.csv')
    list_data = []
    for filename in csv_files:
        data = pd.read_csv(filename)
        list_data.append(data)
    pd.concat(list_data)
    ```
    ```python=
    pd.merge(left=state_populations, right=state_codes,
             on=None, left_on='state', right_on='name')
    ```

---

### 4. Cleaning data for analysis
* Converting types
    > Converting to `category` reduces memory usage and is also useful for analysis in some packages
    ```python=
    df['treatment b'] = df['treatment b'].astype(str)
    df['sex'] = df['sex'].astype('category')
    # errors='coerce' converts even when some values are text; those become NaN
    df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
    ```
    ```python=
    import re
    pattern = re.compile(r'\$\d*\.\d{2}')  # raw string avoids invalid-escape warnings
    result = pattern.match('$17.89')
    bool(result)
    # True
    ```

* Mean of rows/columns
    ```python=
    import numpy as np
    df.apply(np.mean, axis=0)  # mean of each column
    df.apply(np.mean, axis=1)  # mean of each row
    ```

* apply function
    ![image alt](https://i.imgur.com/eqF7KFY.png)
    ```python=
    import re
    import numpy as np

    pattern = re.compile(r'^\$\d*\.\d{2}$')

    def diff_money(row, pattern):
        icost = row['Initial Cost']
        tef = row['Total Est. Fee']

        if bool(pattern.match(icost)) and bool(pattern.match(tef)):
            icost = icost.replace("$", "")
            tef = tef.replace("$", "")

            icost = float(icost)
            tef = float(tef)

            return icost - tef
        else:
            return np.nan

    df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
    print(df_subset.head())
    ```

* Duplicates and missing values
    ```python=
    df = df.drop_duplicates()  # drop duplicate rows
    tips_dropped = tips_nan.dropna()  # drop every row that contains a NaN
    tips_nan['sex'] = tips_nan['sex'].fillna('missing')
    tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
    mean_value = tips_nan['tip'].mean()
    tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
    ```

* assert
    > Used to confirm that the data looks the way we expect
    ```python=
    assert df.columnA.notnull().all()
    # raises AssertionError if any value in columnA is missing
    ```

---

## [Pandas Foundations](https://www.datacamp.com/courses/pandas-foundations)

### 1. Data ingestion & inspection
* #### slicing:
    ```python=
    AAPL.iloc[:5, :]    # first 5 rows
    AAPL.iloc[-5:, :]   # last 5 rows
    AAPL.iloc[::3, -1]  # every 3rd row of the last column
    ```

* #### Series:
    ```python=
    lows = AAPL['Low'].values
    type(lows)
    # numpy.ndarray
    ```

* #### DataFrames from dict:
    ```python=
    data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
            'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
            'visitors': [139, 237, 326, 456],
            'signups': [7, 12, 3, 5]}
    users = pd.DataFrame(data)
    ```

* #### read_csv:
    ```python=
    col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']
    sunspots = pd.read_csv(filepath,
                           header=None,
                           names=col_names,
                           na_values={'sunspots': ['-1']},  # NA markers for a specific column
                           parse_dates=[[0, 1, 2]])  # combine columns 0, 1, 2 into one Y-M-D date
    # Note: the nested-list form of parse_dates is deprecated in recent pandas releases
    ```

* #### Using dates as index:
    ```python=
    sunspots.index = sunspots['year_month_day']
    sunspots.index.name = 'date'
    ```

* #### Plotting Series:
    ```python=
    import matplotlib.pyplot as plt
    aapl['close'].plot()
    plt.show()
    ```

* #### Fixing scales:
    ```python=
    aapl.plot()
    plt.yscale('log')
    plt.show()
    ```

* #### Customizing plots:
    ```python=
    aapl['open'].plot(color='b', style='.-', legend=True)
    plt.axis(('2001', '2002', 0, 100))  # zoom in: x start, x end, y start, y end
    ```

* #### Saving plots:
    ```python=
    aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']].plot()
    plt.savefig('aapl.png')  # jpg, pdf...
    plt.show()
    ```

---

### 2. Exploratory data analysis
* #### Line/Scatter/Box/Histogram plot:
    ```python=
    iris.plot(x='sepal_length', y='sepal_width', kind='scatter')  # or kind='box', 'hist'
    plt.xlabel('sepal length (cm)')
    plt.ylabel('sepal width (cm)')
    plt.show()
    ```

* #### Statistical exploratory data analysis:
    ```python=
    iris.describe()
    iris[['petal_length', 'petal_width']].count()
    # numeric_only=True skips the text column 'species'
    iris.mean(numeric_only=True)  # also: std, median, min, max
    iris.quantile([0.25, 0.75], numeric_only=True)
    iris.plot(kind='box')
    ```

* #### Exploring categorical variables:
    ```python=
    iris['species'].describe()
    iris['species'].unique()
    ```

---

### 3. Time series in pandas
* #### Parse dates:
    ```python=
    sales = pd.read_csv('sales-feb-2015.csv',
                        parse_dates=True,
                        index_col='Date')
    ```

* #### selection:
    ```python=
    sales.loc['2015-2-5']  # Selecting whole day
    sales.loc['2015-2']    # Selecting whole month
    sales.loc['2015']      # Selecting whole year
    sales.loc['2015-2-16':'2015-2-20']
    ```

* #### Convert strings to datetime:
    ```python=
    evening_2_11 = pd.to_datetime(['2015-2-11 20:00',
                                   '2015-2-11 21:00',
                                   '2015-2-11 22:00',
                                   '2015-2-11 23:00'])
    ```

---
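The time series selections and the `pd.to_datetime` call above can be tried without the course's CSV file. The following is a minimal sketch, assuming a small made-up `sales` DataFrame: the column name `Units`, the timestamps, and the values are hypothetical and not taken from the course data.

```python
import pandas as pd

# Hypothetical sales records; the timestamps start out as plain strings.
timestamps = ['2015-2-5 09:00', '2015-2-5 17:30', '2015-2-11 20:00',
              '2015-2-16 08:15', '2015-2-20 12:00', '2015-3-1 10:00']
units = [3, 7, 5, 2, 9, 4]

# Convert the strings to datetimes and use them as the index.
sales = pd.DataFrame({'Units': units}, index=pd.to_datetime(timestamps))
sales.index.name = 'Date'

one_day = sales.loc['2015-2-5']                  # every row on that day
one_month = sales.loc['2015-2']                  # every row in February 2015
week_slice = sales.loc['2015-2-16':'2015-2-20']  # both end dates are included

print(len(one_day), len(one_month), len(week_slice))
# 2 5 2
```

Partial-string selection like this relies on the index being a `DatetimeIndex`, which is why the strings are converted before they become the index.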