
DataCamp Python Notes

tags: Python DataCamp Notes

Python Data Science Toolbox (Part 1)

1. Writing your own functions

  • Docstrings:

    It's good practice to start every function with a docstring describing what it does.

    def raise_both(value1, value2):
        """Raise value1 to the power of value2 and vice versa."""
        new_value1 = value1 ** value2
        new_value2 = value2 ** value1
        new_tuple = (new_value1, new_value2)
        return new_tuple
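    A quick sketch of why the docstring is worth writing: it is what help() and the __doc__ attribute report (this assumes the raise_both function defined above).

    print(raise_both.__doc__)
    # Raise value1 to the power of value2 and vice versa.

    help(raise_both)
    # prints the signature followed by the same docstring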
  • Unpacking tuples:

    even_nums = (2, 4, 6)
    a, b, c = even_nums
    print(a)
    # 2

2. Default arguments, variable-length arguments and scope

  • Global vs. local scope

    new_val = 10

    def square(value):
        """Returns the square of a number."""
        global new_val
        new_val = new_val ** 2
        return new_val

    square(3)
    # 100
    new_val
    # 100
  • Nested functions:

    def mod2plus5(x1, x2, x3):
        """Returns the remainder plus 5 of three values."""
        def inner(x):
            """Returns the remainder plus 5 of a value."""
            return x % 2 + 5
        return (inner(x1), inner(x2), inner(x3))
  • Returning functions:

    def raise_val(n):
        """Return the inner function."""
        def inner(x):
            """Raise x to the power of n."""
            raised = x ** n
            return raised
        return inner

    square = raise_val(2)
    cube = raise_val(3)
    print(square(2), cube(4))
    # 4 64
  • Using nonlocal:

    def outer():
        """Prints the value of n."""
        n = 1
        def inner():
            nonlocal n
            n = 2
            print(n)
        inner()
        print(n)

    outer()
    # 2
    # 2
  • *args:

    def add_all(*args):
        """Sum all values in *args together."""
        # Initialize sum
        sum_all = 0
        # Accumulate the sum
        for num in args:
            sum_all += num
        return sum_all

    add_all(5, 10, 15, 20)
    # 50
  • **kwargs:

    def print_all(**kwargs):
        """Print out key-value pairs in **kwargs."""
        # Print out the key-value pairs
        for key, value in kwargs.items():
            print(key + ": " + value)

    print_all(name="dumbledore", job="headmaster")
    # name: dumbledore
    # job: headmaster
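    The * and ** syntax also works in the other direction when calling a function, spreading a sequence or a dict into individual arguments; a small sketch reusing the add_all and print_all functions defined above:

    nums = [5, 10, 15, 20]
    info = {'name': 'dumbledore', 'job': 'headmaster'}

    add_all(*nums)     # equivalent to add_all(5, 10, 15, 20)
    # 50
    print_all(**info)  # equivalent to print_all(name='dumbledore', job='headmaster')
    # name: dumbledore
    # job: headmaster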

3. Lambda functions and error-handling

  • Anonymous functions:

    map() applies the function to ALL elements in the sequence

    nums = [48, 6, 9, 21, 1]
    square_all = map(lambda num: num ** 2, nums)
    print(list(square_all))
    # [2304, 36, 81, 441, 1]
  • Errors and exceptions:

    def sqrt(x):
        if x < 0:
            raise ValueError('x must be non-negative')
        try:
            return x ** 0.5
        except TypeError:
            print('x must be an int or float')
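    A short usage sketch of the function above: the raised ValueError can be caught by the caller like any built-in exception.

    sqrt(4.0)
    # 2.0

    try:
        sqrt(-9)
    except ValueError as err:
        print(err)
    # x must be non-negative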

Python Data Science Toolbox (Part 2)

1. Using iterators in PythonLand

  • Iterators vs. iterables:

    • Iterable
      • Examples: lists, strings, dictionaries, file connections
      • An object with an associated iter() method
      • Applying iter() to an iterable creates an iterator
    • Iterator
      • Produces next value with next() (see the sketch after this list)
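    A minimal sketch of the distinction (the list here is just an example): a list is iterable but is not itself an iterator, so next() only works on the object returned by iter().

    flash = ['jay garrick', 'barry allen', 'wally west']
    # next(flash)  # TypeError: 'list' object is not an iterator
    superhero = iter(flash)
    print(next(superhero))
    # jay garrick
    print(next(superhero))
    # barry allen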
  • Iterating over iterables: next():

    word = 'Da'
    it = iter(word)
    next(it)
    # 'D'
    next(it)
    # 'a'
    next(it)
    # StopIteration: the iterator is exhausted
  • Iterating at once with *:

    word = 'Data'
    it = iter(word)
    print(*it)
    # D a t a
    print(*it)
    # (prints nothing: the iterator is already exhausted)
  • Iterating over dictionaries:

    # my_dict is any dictionary; avoid naming a variable `dict`,
    # which would shadow the built-in type
    for key, value in my_dict.items():
        print(key, value)
  • Using enumerate():

    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    e = enumerate(avengers)
    e_list = list(e)
    print(e_list)
    # [(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]

    for index, value in enumerate(avengers):
        print(index, value)
    # 0 hawkeye
    # 1 iron man
    # 2 thor
    # 3 quicksilver

    for index, value in enumerate(avengers, start=10):
        print(index, value)
    # 10 hawkeye
    # 11 iron man
    # 12 thor
    # 13 quicksilver
  • Using zip():

    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    names = ['barton', 'stark', 'odinson', 'maximoff']
    z = zip(avengers, names)
    print(*z)
    # ('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')

    z = zip(avengers, names)  # recreate: the previous zip object is already exhausted
    z_list = list(z)
    print(z_list)
    # [('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]

    for z1, z2 in zip(avengers, names):
        print(z1, z2)
    # hawkeye barton
    # iron man stark
    # thor odinson
    # quicksilver maximoff
  • Using iterators for big data:

    import pandas as pd

    total = 0
    for chunk in pd.read_csv('data.csv', chunksize=1000):
        total += sum(chunk['x'])
    print(total)
    # 9999999

2. List comprehensions

  • Basic form:

    result = [num for num in range(11)]
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    A comprehension is shorter, but sometimes at the cost of readability.

    Nested loops: the outer for comes first, then the inner for.

    pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
    # [(0, 6), (0, 7), (1, 6), (1, 7)]
  • Advanced forms:

    Filtering: the for first, then an if condition on the iterable.

    [num ** 2 for num in range(10) if num % 2 == 0]
    # [0, 4, 16, 36, 64]

    Conditional output: if/else on the expression, before the for.

    [num ** 2 if num % 2 == 0 else 0 for num in range(10)]
    # [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
  • Dict comprehensions:

    pos_neg = {num: -num for num in range(5)}
    # {0: 0, 1: -1, 2: -2, 3: -3, 4: -4}
  • Generators:

    result = (num for num in range(5))
    print(result)
    # <generator object <genexpr> at 0x1046bf888>

    result = (num for num in range(5))
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4

    result = (num for num in range(5))
    print(list(result))
    # [0, 1, 2, 3, 4]

    result = (num for num in range(5))
    print(next(result))
    # 0

    even_nums = (num for num in range(10) if num % 2 == 0)
    print(list(even_nums))
    # [0, 2, 4, 6, 8]
  • List comprehensions vs. generators:

    • List comprehension
      • Builds the entire list in memory at once
    • Generators
      • Produce values lazily, only when requested, which makes them useful for large datasets or streaming data (see the sketch below)
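    A minimal sketch of the memory difference (the exact byte counts vary between Python builds):

    import sys

    big_list = [num for num in range(1000000)]  # every element exists in memory now
    big_gen = (num for num in range(1000000))   # nothing has been computed yet

    print(sys.getsizeof(big_list))  # several megabytes
    print(sys.getsizeof(big_gen))   # ~100 bytes, regardless of the range size
    print(next(big_gen))            # values are produced only on demand
    # 0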
  • Generator functions:

    Use yield to produce values one at a time.

    def num_sequence(n):
        """Generate values from 0 to n."""
        i = 0
        while i < n:
            yield i
            i += 1

    result = num_sequence(5)
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4

Cleaning Data in Python

1. Exploring your data

  • First look:

    df.head()
    df.tail()
    df.columns  # column names (an attribute, not a method)
    df.shape
    df.info()   # index, column names, dtypes, non-null counts...
    df.columnA.value_counts(dropna=False)         # categorical column
    df.columnA.value_counts(dropna=False).head()
    df.describe()                                 # numeric columns
  • Plots:

    import matplotlib.pyplot as plt

    df.columnA.plot(kind='hist')
    plt.show()

    df.boxplot(column='population', by='continent')
    plt.show()

2. Tidying data for analysis

  • melt

    pd.melt(frame=df, id_vars='name',
            value_vars=['treatment a', 'treatment b'],
            var_name='treatment', value_name='result')

    pd.melt(frame=tb, id_vars=['country', 'year'])
  • pivot

    weather_tidy = weather.pivot(index='date',
                                 columns='element',
                                 values='value')

    import numpy as np
    weather2_tidy = weather.pivot_table(index='date',
                                        columns='element',
                                        values='value',
                                        aggfunc=np.mean)

3. Combining data for analysis

  • row bind

    Remember to handle the index (pass ignore_index=True to renumber the rows), as shown below.

    concatenated = pd.concat([weather_p1, weather_p2], ignore_index=True)
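    A small sketch (with two hypothetical pieces of a split dataset) of why ignore_index matters: by default concat keeps the original row labels, so the index repeats.

    import pandas as pd

    p1 = pd.DataFrame({'temp': [61, 65]})
    p2 = pd.DataFrame({'temp': [70, 72]})

    print(pd.concat([p1, p2]).index.tolist())
    # [0, 1, 0, 1]
    print(pd.concat([p1, p2], ignore_index=True).index.tolist())
    # [0, 1, 2, 3]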
  • Combining many split files

    import glob
    import pandas as pd

    csv_files = glob.glob('*.csv')
    list_data = []
    for filename in csv_files:
        data = pd.read_csv(filename)
        list_data.append(data)
    pd.concat(list_data)
    pd.merge(left=state_populations, right=state_codes,
             on=None, left_on='state', right_on='name')

4. Cleaning data for analysis

  • Converting types

    Converting to category saves memory and is required by some libraries for analysis.

    df['treatment b'] = df['treatment b'].astype(str)
    df['sex'] = df['sex'].astype('category')
    df['treatment a'] = pd.to_numeric(df['treatment a'],
                                      errors='coerce')  # non-numeric strings become NaN instead of raising
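    A rough sketch (hypothetical column) of the memory saving from astype('category') on a repetitive string column; the exact numbers depend on the pandas version.

    import pandas as pd

    sex = pd.Series(['male', 'female'] * 50000)  # few distinct values, many rows

    print(sex.memory_usage(deep=True))                     # object dtype: several MB
    print(sex.astype('category').memory_usage(deep=True))  # category dtype: far smaller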
    import re

    pattern = re.compile(r'\$\d*\.\d{2}')
    result = pattern.match('$17.89')
    bool(result)
    # True
  • Row/column means

    df.apply(np.mean, axis=0)  # one mean per column
    df.apply(np.mean, axis=1)  # one mean per row
  • apply function


    import re
    import numpy as np

    pattern = re.compile(r'^\$\d*\.\d{2}$')

    def diff_money(row, pattern):
        icost = row['Initial Cost']
        tef = row['Total Est. Fee']
        if bool(pattern.match(icost)) and bool(pattern.match(tef)):
            icost = icost.replace("$", "")
            tef = tef.replace("$", "")
            icost = float(icost)
            tef = float(tef)
            return icost - tef
        else:
            return np.nan

    df_subset['diff'] = df_subset.apply(diff_money,
                                        axis=1,
                                        pattern=pattern)
    print(df_subset.head())
    • Missing values

    df = df.drop_duplicates()         # drop duplicate rows
    tips_dropped = tips_nan.dropna()  # drop any row containing NaN
    tips_nan['sex'] = tips_nan['sex'].fillna('missing')
    tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
    mean_value = tips_nan['tip'].mean()
    tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
  • assert

    Used to verify that the data looks the way we expect.

    assert df.columnA.notnull().all()
    # raises AssertionError if any value in columnA is null
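    A few more sketch-level checks in the same spirit (df here is a hypothetical, already-cleaned frame); each assert raises AssertionError and stops the script if its condition fails.

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 0.0, 2.5]})

    assert pd.notnull(df).all().all()  # no missing values anywhere in the frame
    assert (df >= 0).all().all()       # no negative values anywhere in the frame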

Pandas Foundations

1. Data ingestion & inspection

  • slicing:

    AAPL.iloc[:5, :]    # first 5 rows
    AAPL.iloc[-5:, :]   # last 5 rows
    AAPL.iloc[::3, -1]  # every 3rd row of the last column
  • Series:

    lows = AAPL['Low'].values
    type(lows)
    # numpy.ndarray
  • DataFrames from dict:

    data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
            'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
            'visitors': [139, 237, 326, 456],
            'signups': [7, 12, 3, 5]}
    users = pd.DataFrame(data)
  • read_csv:

    col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']
    sunspots = pd.read_csv(filepath,
                           header=None,
                           names=col_names,
                           na_values={'sunspots': ['-1']},  # NA values for a specific column
                           parse_dates=[[0, 1, 2]])         # combine columns 0, 1, 2 into a single date column
  • Using dates as index:

    sunspots.index = sunspots['year_month_day']
    sunspots.index.name = 'date'
  • Plotting Series:

    import matplotlib.pyplot as plt

    aapl['close'].plot()
    plt.show()
  • Fixing scales:

    aapl.plot()
    plt.yscale('log')
    plt.show()
  • Customizing plots:

    aapl['open'].plot(color='b', style='.-', legend=True)
    plt.axis(('2001', '2002', 0, 100))  # zoom in: xmin, xmax, ymin, ymax
  • Saving plots:

    aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']].plot()
    plt.savefig('aapl.png')  # also .jpg, .pdf, ...
    plt.show()

2. Exploratory data analysis

  • Line/Scatter/Box/Histogram plot:

    iris.plot(x='sepal_length', y='sepal_width', kind='scatter')  # also kind='box', 'hist'
    plt.xlabel('sepal length (cm)')
    plt.ylabel('sepal width (cm)')
    plt.show()
  • Statistical exploratory data analysis:

    iris.describe()
    iris[['petal_length', 'petal_width']].count()
    iris.mean()  # also std(), median(), min(), max()
    iris.quantile([0.25, 0.75])
    iris.plot(kind='box')
  • Exploring categorical variables:

    iris['species'].describe()
    iris['species'].unique()

3. Time series in pandas

  • Parse dates:

    sales = pd.read_csv('sales-feb-2015.csv', parse_dates=True, index_col='Date')
  • selection:

    sales.loc['2015-2-5']               # select a whole day
    sales.loc['2015-2']                 # select a whole month
    sales.loc['2015']                   # select a whole year
    sales.loc['2015-2-16':'2015-2-20']  # select a date range
  • Convert strings to datetime:

    evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                                   '2015-2-11 22:00', '2015-2-11 23:00'])