
DataCamp Python Notes

tags: Python DataCamp Notes

Python Data Science Toolbox (Part 1)

1. Writing your own functions

  • Docstrings:

    It's good practice to start every function with a docstring describing what it does.

    def raise_both(value1, value2):
        """Raise value1 to the power of value2 and vice versa."""
        new_value1 = value1 ** value2
        new_value2 = value2 ** value1
        new_tuple = (new_value1, new_value2)
        return new_tuple
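    A quick sketch of why the docstring is worth writing: it is what help() and the __doc__ attribute report (this assumes the raise_both function defined above).

    print(raise_both.__doc__)
    # Raise value1 to the power of value2 and vice versa.

    help(raise_both)
    # prints the signature followed by the same docstring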
  • Unpacking tuples:

    even_nums = (2, 4, 6)
    a, b, c = even_nums
    print(a)
    # 2

2. Default arguments, variable-length arguments and scope

  • Global vs. local scope

    new_val = 10

    def square(value):
        """Returns the square of a number."""
        global new_val
        new_val = new_val ** 2
        return new_val

    square(3)
    # 100
    new_val
    # 100
  • Nested functions:

    def mod2plus5(x1, x2, x3):
        """Returns the remainder plus 5 of three values."""
        def inner(x):
            """Returns the remainder plus 5 of a value."""
            return x % 2 + 5
        return (inner(x1), inner(x2), inner(x3))
  • Returning functions:

    def raise_val(n):
        """Return the inner function."""
        def inner(x):
            """Raise x to the power of n."""
            raised = x ** n
            return raised
        return inner

    square = raise_val(2)
    cube = raise_val(3)
    print(square(2), cube(4))
    # 4 64
  • Using nonlocal:

    def outer():
        """Prints the value of n."""
        n = 1
        def inner():
            nonlocal n
            n = 2
            print(n)
        inner()
        print(n)

    outer()
    # 2
    # 2
  • *args:

    def add_all(*args):
        """Sum all values in *args together."""
        # Initialize sum
        sum_all = 0
        # Accumulate the sum
        for num in args:
            sum_all += num
        return sum_all

    add_all(5, 10, 15, 20)
    # 50
  • **kwargs:

    def print_all(**kwargs):
        """Print out key-value pairs in **kwargs."""
        # Print out the key-value pairs
        for key, value in kwargs.items():
            print(key + ": " + value)

    print_all(name="dumbledore", job="headmaster")
    # name: dumbledore
    # job: headmaster
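    The * and ** syntax also works in the other direction when calling a function, spreading a sequence or a dict into individual arguments; a small sketch reusing the add_all and print_all functions defined above:

    nums = [5, 10, 15, 20]
    info = {'name': 'dumbledore', 'job': 'headmaster'}

    add_all(*nums)     # equivalent to add_all(5, 10, 15, 20)
    # 50
    print_all(**info)  # equivalent to print_all(name='dumbledore', job='headmaster')
    # name: dumbledore
    # job: headmaster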

3. Lambda functions and error-handling

  • Anonymous functions:

    map() applies the function to ALL elements in the sequence

    nums = [48, 6, 9, 21, 1]
    square_all = map(lambda num: num ** 2, nums)
    print(list(square_all))
    # [2304, 36, 81, 441, 1]
  • Errors and exceptions:

    def sqrt(x):
        if x < 0:
            raise ValueError('x must be non-negative')
        try:
            return x ** 0.5
        except TypeError:
            print('x must be an int or float')
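    A short usage sketch of the function above: the raised ValueError can be caught by the caller like any built-in exception.

    sqrt(4.0)
    # 2.0

    try:
        sqrt(-9)
    except ValueError as err:
        print(err)
    # x must be non-negative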

Python Data Science Toolbox (Part 2)

1. Using iterators in PythonLand

  • Iterators vs. iterables:

    • Iterable
      • Examples: lists, strings, dictionaries, file connections
      • An object with an associated iter() method
      • Applying iter() to an iterable creates an iterator
    • Iterator
      • Produces next value with next() (see the sketch after this list)
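    A minimal sketch of the distinction (the list here is just an example): a list is iterable but is not itself an iterator, so next() only works on the object returned by iter().

    flash = ['jay garrick', 'barry allen', 'wally west']
    # next(flash)  # TypeError: 'list' object is not an iterator
    superhero = iter(flash)
    print(next(superhero))
    # jay garrick
    print(next(superhero))
    # barry allen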
  • Iterating over iterables: next():

    word = 'Da'
    it = iter(word)
    next(it)
    # 'D'
    next(it)
    # 'a'
    next(it)
    # StopIteration: the iterator is exhausted
  • Iterating at once with *:

    word = 'Data'
    it = iter(word)
    print(*it)
    # D a t a
    print(*it)
    # (prints nothing: the iterator is already exhausted)
  • Iterating over dictionaries:

    # my_dict is any dictionary; avoid naming a variable `dict`,
    # which would shadow the built-in type
    for key, value in my_dict.items():
        print(key, value)
  • Using enumerate():

    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    e = enumerate(avengers)
    e_list = list(e)
    print(e_list)
    # [(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]

    for index, value in enumerate(avengers):
        print(index, value)
    # 0 hawkeye
    # 1 iron man
    # 2 thor
    # 3 quicksilver

    for index, value in enumerate(avengers, start=10):
        print(index, value)
    # 10 hawkeye
    # 11 iron man
    # 12 thor
    # 13 quicksilver
  • Using zip():

    avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
    names = ['barton', 'stark', 'odinson', 'maximoff']
    z = zip(avengers, names)
    print(*z)
    # ('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')

    z = zip(avengers, names)  # recreate: the previous zip object is already exhausted
    z_list = list(z)
    print(z_list)
    # [('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]

    for z1, z2 in zip(avengers, names):
        print(z1, z2)
    # hawkeye barton
    # iron man stark
    # thor odinson
    # quicksilver maximoff
  • Using iterators for big data:

    import pandas as pd

    total = 0
    for chunk in pd.read_csv('data.csv', chunksize=1000):
        total += sum(chunk['x'])
    print(total)
    # 9999999

2. List comprehensions

  • Basic form:

    result = [num for num in range(11)]
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    A comprehension is shorter, but sometimes at the cost of readability.

    Nested loops: the outer for comes first, then the inner for.

    pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
    # [(0, 6), (0, 7), (1, 6), (1, 7)]
  • Advanced forms:

    Filtering: the for first, then an if condition on the iterable.

    [num ** 2 for num in range(10) if num % 2 == 0]
    # [0, 4, 16, 36, 64]

    Conditional output: if/else on the expression, before the for.

    [num ** 2 if num % 2 == 0 else 0 for num in range(10)]
    # [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
  • Dict comprehensions:

    pos_neg = {num: -num for num in range(5)}
    # {0: 0, 1: -1, 2: -2, 3: -3, 4: -4}
  • Generators:

    result = (num for num in range(5))
    print(result)
    # <generator object <genexpr> at 0x1046bf888>

    result = (num for num in range(5))
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4

    result = (num for num in range(5))
    print(list(result))
    # [0, 1, 2, 3, 4]

    result = (num for num in range(5))
    print(next(result))
    # 0

    even_nums = (num for num in range(10) if num % 2 == 0)
    print(list(even_nums))
    # [0, 2, 4, 6, 8]
  • List comprehensions vs. generators:

    • List comprehension
      • Builds the entire list in memory at once
    • Generators
      • Produce values lazily, only when requested, which makes them useful for large datasets or streaming data (see the sketch below)
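    A minimal sketch of the memory difference (the exact byte counts vary between Python builds):

    import sys

    big_list = [num for num in range(1000000)]  # every element exists in memory now
    big_gen = (num for num in range(1000000))   # nothing has been computed yet

    print(sys.getsizeof(big_list))  # several megabytes
    print(sys.getsizeof(big_gen))   # ~100 bytes, regardless of the range size
    print(next(big_gen))            # values are produced only on demand
    # 0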
  • Generator functions:

    Use yield to produce values one at a time.

    def num_sequence(n):
        """Generate values from 0 to n."""
        i = 0
        while i < n:
            yield i
            i += 1

    result = num_sequence(5)
    for num in result:
        print(num)
    # 0
    # 1
    # 2
    # 3
    # 4

Cleaning Data in Python

1. Exploring your data

  • First look:

    df.head()
    df.tail()
    df.columns  # column names (an attribute, not a method)
    df.shape
    df.info()   # index, column names, dtypes, non-null counts...
    df.columnA.value_counts(dropna=False)         # categorical column
    df.columnA.value_counts(dropna=False).head()
    df.describe()                                 # numeric columns
  • Plots:

    import matplotlib.pyplot as plt

    df.columnA.plot(kind='hist')
    plt.show()

    df.boxplot(column='population', by='continent')
    plt.show()

2. Tidying data for analysis

  • melt

    pd.melt(frame=df, id_vars='name',
            value_vars=['treatment a', 'treatment b'],
            var_name='treatment', value_name='result')

    pd.melt(frame=tb, id_vars=['country', 'year'])
  • pivot

    weather_tidy = weather.pivot(index='date',
                                 columns='element',
                                 values='value')

    import numpy as np
    weather2_tidy = weather.pivot_table(index='date',
                                        columns='element',
                                        values='value',
                                        aggfunc=np.mean)

3. Combining data for analysis

  • row bind

    Remember to handle the index (pass ignore_index=True to renumber the rows), as shown below.

    concatenated = pd.concat([weather_p1, weather_p2], ignore_index=True)
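    A small sketch (with two hypothetical pieces of a split dataset) of why ignore_index matters: by default concat keeps the original row labels, so the index repeats.

    import pandas as pd

    p1 = pd.DataFrame({'temp': [61, 65]})
    p2 = pd.DataFrame({'temp': [70, 72]})

    print(pd.concat([p1, p2]).index.tolist())
    # [0, 1, 0, 1]
    print(pd.concat([p1, p2], ignore_index=True).index.tolist())
    # [0, 1, 2, 3]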
  • Combining many split files

    import glob
    import pandas as pd

    csv_files = glob.glob('*.csv')
    list_data = []
    for filename in csv_files:
        data = pd.read_csv(filename)
        list_data.append(data)
    pd.concat(list_data)
    pd.merge(left=state_populations, right=state_codes,
             on=None, left_on='state', right_on='name')

4. Cleaning data for analysis

  • Converting types

    Converting to category saves memory and is required by some libraries for analysis.

    df['treatment b'] = df['treatment b'].astype(str)
    df['sex'] = df['sex'].astype('category')
    df['treatment a'] = pd.to_numeric(df['treatment a'],
                                      errors='coerce')  # non-numeric strings become NaN instead of raising
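    A rough sketch (hypothetical column) of the memory saving from astype('category') on a repetitive string column; the exact numbers depend on the pandas version.

    import pandas as pd

    sex = pd.Series(['male', 'female'] * 50000)  # few distinct values, many rows

    print(sex.memory_usage(deep=True))                     # object dtype: several MB
    print(sex.astype('category').memory_usage(deep=True))  # category dtype: far smaller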
    import re

    pattern = re.compile(r'\$\d*\.\d{2}')
    result = pattern.match('$17.89')
    bool(result)
    # True
  • Row/column means

    df.apply(np.mean, axis=0)  # one mean per column
    df.apply(np.mean, axis=1)  # one mean per row
  • apply function


    import re
    import numpy as np

    pattern = re.compile(r'^\$\d*\.\d{2}$')

    def diff_money(row, pattern):
        icost = row['Initial Cost']
        tef = row['Total Est. Fee']
        if bool(pattern.match(icost)) and bool(pattern.match(tef)):
            icost = icost.replace("$", "")
            tef = tef.replace("$", "")
            icost = float(icost)
            tef = float(tef)
            return icost - tef
        else:
            return np.nan

    df_subset['diff'] = df_subset.apply(diff_money,
                                        axis=1,
                                        pattern=pattern)
    print(df_subset.head())
    • Missing values

    df = df.drop_duplicates()         # drop duplicate rows
    tips_dropped = tips_nan.dropna()  # drop any row containing NaN
    tips_nan['sex'] = tips_nan['sex'].fillna('missing')
    tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
    mean_value = tips_nan['tip'].mean()
    tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
  • assert

    Used to verify that the data looks the way we expect.

    assert df.columnA.notnull().all()
    # raises AssertionError if any value in columnA is null
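    A few more sketch-level checks in the same spirit (df here is a hypothetical, already-cleaned frame); each assert raises AssertionError and stops the script if its condition fails.

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 0.0, 2.5]})

    assert pd.notnull(df).all().all()  # no missing values anywhere in the frame
    assert (df >= 0).all().all()       # no negative values anywhere in the frame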

Pandas Foundations

1. Data ingestion & inspection

  • slicing:

    AAPL.iloc[:5, :]    # first 5 rows
    AAPL.iloc[-5:, :]   # last 5 rows
    AAPL.iloc[::3, -1]  # every 3rd row of the last column
  • Series:

    lows = AAPL['Low'].values
    type(lows)
    # numpy.ndarray
  • DataFrames from dict:

    data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
            'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
            'visitors': [139, 237, 326, 456],
            'signups': [7, 12, 3, 5]}
    users = pd.DataFrame(data)
  • read_csv:

    col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']
    sunspots = pd.read_csv(filepath,
                           header=None,
                           names=col_names,
                           na_values={'sunspots': ['-1']},  # NA values for a specific column
                           parse_dates=[[0, 1, 2]])         # combine columns 0, 1, 2 into a single date column
  • Using dates as index:

    sunspots.index = sunspots['year_month_day']
    sunspots.index.name = 'date'
  • Plotting Series:

    import matplotlib.pyplot as plt

    aapl['close'].plot()
    plt.show()
  • Fixing scales:

    aapl.plot()
    plt.yscale('log')
    plt.show()
  • Customizing plots:

    aapl['open'].plot(color='b', style='.-', legend=True)
    plt.axis(('2001', '2002', 0, 100))  # zoom in: xmin, xmax, ymin, ymax
  • Saving plots:

    aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']].plot()
    plt.savefig('aapl.png')  # also .jpg, .pdf, ...
    plt.show()

2. Exploratory data analysis

  • Line/Scatter/Box/Histogram plot:

    iris.plot(x='sepal_length', y='sepal_width', kind='scatter')  # also kind='box', 'hist'
    plt.xlabel('sepal length (cm)')
    plt.ylabel('sepal width (cm)')
    plt.show()
  • Statistical exploratory data analysis:

    iris.describe()
    iris[['petal_length', 'petal_width']].count()
    iris.mean()  # also std(), median(), min(), max()
    iris.quantile([0.25, 0.75])
    iris.plot(kind='box')
  • Exploring categorical variables:

    iris['species'].describe()
    iris['species'].unique()

3. Time series in pandas

  • Parse dates:

    sales = pd.read_csv('sales-feb-2015.csv', parse_dates=True, index_col='Date')
  • selection:

    sales.loc['2015-2-5']               # select a whole day
    sales.loc['2015-2']                 # select a whole month
    sales.loc['2015']                   # select a whole year
    sales.loc['2015-2-16':'2015-2-20']  # select a date range
  • Convert strings to datetime:

    evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                                   '2015-2-11 22:00', '2015-2-11 23:00'])