Counter | Tips in Python

--- title: Counter | Tips in Python description: How to use counter/defualtdict/heapq in Python. tags: python-tips, tips, python, programming, coding, defaultdict, counters, collections GA: UA-206161787-1 lang: zh --- # [EP0] 計數 | Python小技巧 Python從1991年發表到現在也已經三十年了。在這三十年的過程中，語言本身也隨著使用者的需求而成長。身為碼農的我們，也需要與時俱進。這邊我們透過一個常見的例子，來看看這三十年來，在Python中，對於同一個問題的解法，是如何演化的吧！ ## 在一個陣列中，尋找出現次數最多的前三名元素。這種統計次數的問題在生活中非常實用，就拿台灣最喜歡的選舉來說，今天我們要統計選票，在不考慮廢票的情況下，每一張票都有一個指定的候選人。那出來的資料可能是像下面這樣：（這邊就以`P0`到`P9`來代表十組候選人的名字） ```python= # Input candidates = ["P1", "P2", "P1", "P5", "P3", ...] ``` 我們最後希望能找到得票數最高的前三名候選人，列出他們的`名字`以及`得票數`。 ```python= # Output winners = [("P3", 556), ("P1", 318), ("P7", 242)] ``` ### 方法一：1991年的直觀解法 ```python= def top3(candidates: List) -> List[Tuple[str, int]]: counts = dict() for c in candidates: if c not in counts: counts[c] = 1 else: counts[c] += 1 result = [(name, count) for name, count in counts.items()] # Return the second element in the tuple for sorting. cmp_fn = lambda t: t[1] result.sort(key=cmp_fn, reverse=True) return result[:3] ``` 最簡單直觀的想法，我們就一個一個數，最後得到十組候選人的票數，最後再依照得票數排序，選出前三名。即使是古老的Python，整體的程式碼還不算太糟糕，用到`Dictionary`, `List comprehension`, `sort(key=,)`的排列函式，以及`[:3]`的語法來取出排序後的前三個元素。 ### 方法二：Python 2.3 (2003)的改進為了找到前三名，我們就對整個序列的十名候選人去做排序（`O(NLogN)`），顯然是有點浪費。這種取前N名的問題，我們其實可以利用資料結構裡的`Min Heap`來加速（`O(N)`），在這邊先不討論`Min Heap`的細節，我們只要知道他是一種容器，每當在加入物件時，能維持整體的大小排序。在Python 2.3中，加入了`heapq`這個模組到內建的函式庫裡，讓我們可以很簡單的來應用這個資料結構。 ```python= import heapq def top3_heapq(candidates): # Same counts = dict() for c in candidates: if c not in counts: counts[c] = 1 else: counts[c] += 1 # --- New --- result = heapq.nlargest(3, [(counts[name], name) for name in counts]) return [(name, count) for count, name in result] ``` ### 方法三：Python 2.5 (2005)的`defaultdict` 我們改進了後半段的程式，那麼前半段在統計各個候選人的得票部分呢？大家有沒有發現，在統計每張票時，我們都得檢查該候選人的名字有沒有在`dictionary`裡面了，如果沒有，我們得先在`dictionary`中為他建立一個`key`並且給他第一張票。 ```python= # Same counts = dict() for c in candidates: if c not in counts: counts[c] = 1 else: counts[c] += 1 ``` 在Python2.5中，引入了`collections`的模組，其中的`defaultdict`是我們今天的主角。每當使用`defaultdict`建立`dictionary`時，我們可以傳入一個工廠函式（factory function），之後當我們存取某個不存在的`key`時，他會自動呼叫這個工廠函式來為這個`key`建立一個預設的值。 ```python= import collections import heapq def top3_defaultdict(candidates): # --- New --- # python中，呼叫內建的int函式而不提供任何參數，會返回0。 # int() == 0 counts = collections.defaultdict(int) for c in candidates: counts[c] += 1 # Same result = heapq.nlargest(3, [(counts[name], name) for name in counts]) return [(name, count) for count, name in result] ``` ### 方法三：Python 2.7 (2007)的`Counter` 由於計數這件事實在是太常見了，在2007的Python2.7中，加入了`Counter`到`collections`的內建模組下，我們就不用再關心`heapq`到底該如何使用，我們直接透過簡單的幾行程式碼就可以完成一樣的工作了！（甚至有更好的效能） ```python= def top3_counter(candidates): counts = collections.Counter(candidates) return counts.most_common(3) ``` ## 效能新的方法不只有可讀性更高的程式碼，在效能上也是有大大的提升。像這邊使用`Counter`比起我們最一開始的方法快了兩倍。 ``` naive_time: 31.5s heapq_time: 35.3s defaultdict_time: 27.1s counter_time: 16.0s ``` ## 後記 - 程式碼範例: https://colab.research.google.com/drive/1ToHXKGUjUyiCs_IEzZkPsHEYT3W3T_b3?usp=sharing