Python Data Science Toolbox (Part 2)

# Python Data Science Toolbox (Part 2)
###### tags: `Datacamp` `python` `function` `Python Programming`

>**Author: 何彥南**
>Datacamp course: [Python Data Science Toolbox (Part 2)](https://www.datacamp.com/courses/python-data-science-toolbox-part-2)

**Notes:**
1. `df` is short for a pandas DataFrame.
2. `pd` is the usual alias for the pandas package.
3. Treat the official documentation, [pandas doc](https://pandas.pydata.org/pandas-docs/stable/), as the authoritative reference.
4. Check your pandas version; some features may not be available in newer releases.
5. In the code blocks, lines marked with `#` show the output.

[TOC]

# Iterators in Pythonland
## [1-1]Introduction to iterators
> An introduction to iteration (loops)

### 1. Iterating with a for loop
* Three simple uses of the for loop

> **Method 1:** For a list, the for loop iterates over it and acts on each element.

```python=
employees = ['Nick', 'Lore', 'Hugo']
# Run
for employee in employees:
    print(employee)
#[Out]:
# Nick
# Lore
# Hugo
```

> **Method 2:** For a string, the for loop iterates over it and acts on each character.

```python=
for letter in 'DataCamp':
    print(letter)
#[Out]:
# D
# a
# t
# a
# C
# a
# m
# p
```
* How it works: every character in a str has an index (position), and the for loop simply visits each position in order.
* Example:
```python=
a = 'DataCamp'
# Run
print(a[0])
#[Out]: D
print(a[4])
#[Out]: C
```

> **Method 3:** With range(), the for loop iterates over the numbers in the range.

```python=
for i in range(4):
    print(i)
#[Out]:
# 0
# 1
# 2
# 3
```
* `range(4)` above produces the same values as the list [0, 1, 2, 3], that is, 0 through (4-1).

### 2. Iterators vs. iterables
* **Iterable**: an object that can be iterated over; `iter(iterable)` turns it into an iterator.
    * Common iterables:
        * string
        * list
        * dictionaries
        * file connections
* **Iterator**: calling `next(iterator)` on it returns the next element.

### 3. Iterating over iterables: next()
> Using next()

```python=
word = 'Da'
it = iter(word)
next(it)
#[Out]: 'D'
next(it)
#[Out]: 'a'
next(it)
#[Out]:
'''
-----------------------------------------------------------------
StopIteration                   Traceback (most recent call last)
<ipython-input-11-2cdb14c0d4d6> in <module>()
----> 1 next(it)
StopIteration:
'''
```
* On the last call to next(it) the iterator has no element left, so Python raises StopIteration.
* Note: `iter(word)` must be bound to a name (here `it`). Otherwise every call to iter() creates a fresh iterator that starts again from the first item.
* Demonstration:
```python=
word = 'Da'
next(iter(word))
#[Out]: 'D'
next(iter(word))
#[Out]: 'D'
```

### 4. Iterating at once with *
> `*it` unpacks, in one go, every value of the iterator (here `it`) that has not been consumed yet.

```python=
word = 'Data'
it = iter(word)
print(*it)
#[Out]: D a t a
print(*it)
#[Out]:
```
* Once every value in `it` has been consumed, unpacking it again yields nothing, so the second print outputs an empty line.
* Supplement:
```python=
word = 'Data'
it = iter(word)
next(it)
#[Out]: 'D'
print(*it)
#[Out]: a t a
```
* Here 'D' has already been consumed by next(), so `*it` can only produce 'a t a'. Consuming values from an iterator created with iter() is irreversible.

### 5. Iterating over dictionaries
> Iterating over a dict

```python=
pythonistas = {'hugo': 'bowne-anderson', 'francis': 'castro'}
# Run
for key, value in pythonistas.items():
    print(key, value)
#[Out]:
# hugo bowne-anderson
# francis castro
```
* items() returns an iterable view of the dict's (key, value) pairs. Since Python 3.7 the pairs come out in insertion order.
* With a for loop you can then work on each key and its value.

### 6. Iterating over file connections
> Files can be iterated over as well

```python=
file = open('file.txt')
it = iter(file)
print(next(it))
#[Out]: This is the first line.
print(next(it))
#[Out]: This is the second line.
```
* Iteration goes line by line: each call to next() returns the next line of the file.

## [1-2]Playing with iterators
### 1. Using enumerate()
> Let's see what enumerate() does to a list

```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
e = enumerate(avengers)  # wrap the list in an enumerate object
print(type(e))
#[Out]: <class 'enumerate'>
e_list = list(e)  # convert back to a list
print(e_list)
#[Out]: [(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]
```
* Each element of the list, such as 'hawkeye', becomes a pair like (0, 'hawkeye'), as if every element had been given a label.
* Each item an enumerate object yields is a pair (index, value), that is, (label, element).

### 2. enumerate() and unpack
> Here we use a for loop to print a list that has been passed through enumerate().

```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
for index, value in enumerate(avengers):
    print(index, value)
#[Out]:
# 0 hawkeye
# 1 iron man
# 2 thor
# 3 quicksilver
for index, value in enumerate(avengers, start=10):  # using start
    print(index, value)
#[Out]:
# 10 hawkeye
# 11 iron man
# 12 thor
# 13 quicksilver
```
* Inside the for loop we can work with both index (the label) and value (the element).
* The start argument of enumerate() sets the number the index begins at.

### 3. Using zip()
> Use zip() to pair up the elements of two lists

```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(type(z))
#[Out]: <class 'zip'>
z_list = list(z)
print(z_list)
#[Out]: [('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]
```
* The elements at the same position in the two lists are combined into a tuple, and the tuples are collected into a list.

### 4. zip() and unpack
> We can also use a for loop to print the elements of the two zipped lists separately

```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
for z1, z2 in zip(avengers, names):
    print(z1, z2)
#[Out]:
# hawkeye barton
# iron man stark
# thor odinson
# quicksilver maximoff
```
* Here z1 and z2 stand for the elements of the first list (avengers) and the second list (names).

### 5. Print zip with *
> A zip object can likewise be unpacked all at once with `*`

```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(*z)
#[Out]: ('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')
```
* Supplement: zip() can combine more than two lists
```python=
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
mutants = ['charles xavier', 'bobby drake', 'kurt wagner', 'max eisenhardt']
z = zip(avengers, names, mutants)
for list1, list2, list3 in z:  # similar to print(*z), but one group per line
    print(list1, list2, list3)
```

## [1-3]Using iterators to load large files into memory
### 1. Loading data in chunks
* When a file is too large for the computer's RAM to handle, read it in chunks.
* chunk
    * A large file can be read piece by piece, one chunk at a time.
    * With a while or for loop, each chunk can be processed separately.
    * pandas' read_csv has a chunksize parameter that sets how many rows are read at a time.

### 2. Iterating over data
> Here we use chunks and pandas to sum the values in column 'x' of data.csv.

```python=
import pandas as pd

path = r'D:\data\data.csv'  # set this to your own path
result = []
for chunk in pd.read_csv(path, chunksize=1000):
    result.append(sum(chunk['x']))
total = sum(result)
print(total)
#[Out]: 4252532
```
* First create an empty list. Each chunk loads 1000 rows; for every chunk, the values in column 'x' are summed and the sum is appended to the list.
* Summing the per-chunk results in the list then gives the total of 'x' over the whole file.

> The same idea, but accumulating into a number instead of an empty list

```python=
import pandas as pd

total = 0
for chunk in pd.read_csv('data.csv', chunksize=1000):
    total += sum(chunk['x'])
print(total)
#[Out]: 4252532
```
> Supplement: combining this with the functions learned earlier

```python=
import pandas as pd

def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of occurrences as value for each key."""
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):
        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries("tweets.csv", 10, "lang")
# Print result_counts
print(result_counts)
```
* The function takes three parameters, csv_file, c_size, and colname, which control how the chunked read is carried out.
* It reads the file chunk by chunk, counts how many times each value appears in the given column, and returns the result as a dict.

# List comprehensions and generators
## [2-1]List comprehensions
### 1. Populate a list with a for loop
> Most people work on a list with a for loop

```python=
nums = [12, 8, 21, 3, 16]
new_nums = []
for num in nums:
    new_nums.append(num + 1)
print(new_nums)
#[Out]: [13, 9, 22, 4, 17]
```
* This approach is handy and intuitive, but the code takes up more lines.

### 2. A list comprehension
> Using a list comprehension
* It is made of three main parts:
    * Iterable: nums
    * Iterator variable (stands for each member of the iterable): num
    * Output expression: num + 1

```python=
nums = [12, 8, 21, 3, 16]
new_nums = [num + 1 for num in nums]
print(new_nums)
#[Out]: [13, 9, 22, 4, 17]
```
* With a list comprehension, a single line of code achieves the same result as the for loop above.

> List comprehension compared with a for loop:

![](https://i.imgur.com/BQkDFpk.png)

### 3. List comprehension with range()
> List comprehension combined with range()

```python=
result = [num for num in range(11)]
print(result)
#[Out]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

### 4. Nested loops
> Nested loops

```python=
pairs_1 = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))  # append takes one argument, so pass a tuple
print(pairs_1)
#[Out]: [(0, 6), (0, 7), (1, 6), (1, 7)]
```
> A list comprehension works too

```python=
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
print(pairs_2)
#[Out]: [(0, 6), (0, 7), (1, 6), (1, 7)]
```
* Keep readability in mind: a list comprehension is more concise, but it does not always make the code easier to read, so use your own judgment.

## [2-2]Advanced comprehensions
### 1. Conditionals in comprehensions
> Adding conditions to list comprehensions

> On the iterable

```python=
[num ** 2 for num in range(10) if num % 2 == 0]
#[Out]: [0, 4, 16, 36, 64]
# For comparison (no condition)
[num ** 2 for num in range(10)]
#[Out]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
* The condition (if num % 2 == 0) goes **at the very end** and filters which elements are kept.
* `%`: remainder (modulo)

> On the output expression

```python=
[num ** 2 if num % 2 == 0 else 0 for num in range(10)]
#[Out]: [0, 0, 4, 0, 16, 0, 36, 0, 64, 0]
# For comparison (no condition)
[num ** 2 for num in range(10)]
#[Out]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
* The conditional (if num % 2 == 0 else 0) goes **in the middle**, before the `for`, as part of the output expression. Every element is kept; only its output value changes.

### 2. Dict comprehensions
> Using dict comprehensions

```python=
pos_neg = {num: num+1 for num in range(9)}
print(pos_neg)
#[Out]: {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9}
print(type(pos_neg))
#[Out]: <class 'dict'>
```
* The syntax is very similar to a list comprehension. Replace [ ] with { }; because a dict needs both a key and a value, join the two with `:` (key: value).

> Supplement

```python=
{num: num+1 for num in range(9)}
#[Out]: {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9}
{num**2: num+1 for num in range(9)}
#[Out]: {0: 1, 1: 2, 4: 3, 9: 4, 16: 5, 25: 6, 36: 7, 49: 8, 64: 9}
```
* In other words, the key and the value in (key: value) are separate expressions. Both depend on the same iterator variable (num), and each can be set as needed.
* Supplement: in a dict comprehension the filtering `if` goes at the end and applies to the whole key: value pair. A conditional expression (`a if cond else b`) can still be used inside the key or the value on its own.

## [2-3]Introduction to generator expressions
### 1. Generator expressions
> Very similar to a list comprehension, except that [ ] is replaced with ( )

```python=
(2 * num for num in range(10))
#[Out]: <generator object <genexpr> at 0x1046bf888>
```

### 2. List comprehensions vs. generators
* List comprehension: returns a list
* Generator expression: returns a generator object
* Both can be iterated over

### 3. Printing values from generators
> Generators can be iterated over

```python=
result = (num for num in range(6))
for num in result:
    print(num)
#[Out]:
# 0
# 1
# 2
# 3
# 4
# 5
```
> They can also be converted directly to a list

```python=
result = (num for num in range(6))  # fresh generator; the one above is exhausted
print(list(result))
#[Out]: [0, 1, 2, 3, 4, 5]
```
> next() and `*` work as well

```python=
result = (num for num in range(6))  # fresh generator again
print(next(result))
#[Out]: 0
print(*result)
#[Out]: 1 2 3 4 5
```

### 4. Generators vs list comprehensions
> When the result would need a very large amount of RAM, a list comprehension becomes unusable (the machine usually freezes)

![](https://i.imgur.com/bPPrliZ.png)
![](https://i.imgur.com/x3DonHa.png)

> A generator has no such problem, because it produces values one at a time

![](https://i.imgur.com/nxMt5qm.png)

### 5. Conditionals in generator expressions
> Conditions can be added here too

```python=
even_nums = (num for num in range(10) if num % 2 == 0)
print(list(even_nums))
#[Out]: [0, 2, 4, 6, 8]
```

### 6. Generator functions
* Produces generator objects when called
* Defined like a regular function - def
* Yields a sequence of values instead of returning a single value
* Generates a value with yield keyword

### 7. Build a generator function
> This produces an iterable that runs from 0 up to n - 1

```python=
def num_sequence(n):
    """Generate values from 0 up to (but not including) n."""
    i = 0
    while i < n:
        yield i
        i += 1
```

### 8. Use a generator function
> As shown here, calling this function produces a generator

```python=
result = num_sequence(5)
print(type(result))
#[Out]: <class 'generator'>
for item in result:
    print(item)
#[Out]:
# 0
# 1
# 2
# 3
# 4
```

## [2-4]Wrapping up comprehensions and generators.
### 1. Re-cap: list comprehensions
>General form: [output expression + for iterator variable in iterable]

>Advanced form: [output expression + conditional on output + for iterator variable in iterable + conditional on iterable]

# test
## test1
> Complete the code below so that it picks the strings out of the list x

```python=
x=[7,'D','E',8,9,'F']
strings=[___]
print(strings)
#[Out]:['D','E','F']
```
* Answer: y for y in x if type(y)==str

## test2
> What is the output?

```python=
y = range(0, 5)
print(list(y))
```
* Answer: [0,1,2,3,4]

## test3
> What is the output?

```python=
int_list = [-2, 4, 1, 6, -3]
print(x for x in int_list if x > 0)
```
1. [4,1,6]
2. a generator object
3. (False,True,True,True,False)
4. (4,1,6)

* Answer: 2

## test4
> What is the output?

```python=
str1 = 'AB'
str2 = '34'
[x + y for y in str1 for x in str2]
```
* Answer: ['3A', '4A', '3B', '4B']

## test5
> Complete the code below

```python=
x = [2, 4, 1, 5]
squares = { ___ ___ ___ ___ ___}
print(squares)
#[Out]:{1: 1, 2: 4, 4: 16, 5: 25}
```
* Answer: y:y**2 for y in sorted(x)

## test6
> What does the following code output?

```python=
int_list = [-2, 4, 1, 6, -3]
print(x for x in int_list if x > 0)
```
* Answer: generator object

## test7
> What does the following code output?

```python=
teams = [['barry', 'cisco', 'caitlin'], ['oliver', 'john', 'felicity']]
[member[-1] for member in teams]
```
* Answer: ['caitlin', 'felicity']
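
> Supplement: the quiz answers can be checked by running them. The script below is only a sketch and is not part of the course material; the names `values`, `positives`, `is_generator`, `pairs`, and `last_members` are chosen here for illustration.

```python
import types

# test1: keep only the strings in the list.
x = [7, 'D', 'E', 8, 9, 'F']
strings = [y for y in x if type(y) == str]

# test2: range(0, 5) stops before 5.
numbers = list(range(0, 5))

# test3 and test6: parentheses build a generator, not a list or a tuple.
int_list = [-2, 4, 1, 6, -3]
positives = (n for n in int_list if n > 0)
is_generator = isinstance(positives, types.GeneratorType)

# test4: the first `for` is the outer loop, so 'A' is combined
# with both '3' and '4' before 'B' is used.
str1 = 'AB'
str2 = '34'
pairs = [a + b for b in str1 for a in str2]

# test5: sorted() fixes the order in which the keys are inserted.
values = [2, 4, 1, 5]
squares = {y: y ** 2 for y in sorted(values)}

# test7: member[-1] is the last name in each inner list.
teams = [['barry', 'cisco', 'caitlin'], ['oliver', 'john', 'felicity']]
last_members = [member[-1] for member in teams]

print(strings)
print(numbers)
print(is_generator)
print(pairs)
print(squares)
print(last_members)
```

* The generator in test3 and test6 is left unconsumed on purpose; wrapping it in list() afterwards shows the values it would produce.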