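# Python YouTube Downloader Notes

## Installing packages

```
pip install pytube
pip install beautifulsoup4
pip install requests
pip install lxml   # if missing, the error only shows up at runtime
```

Result

![](https://i.imgur.com/HC0GhdB.png)

## Starting the rework

Goals:

1. Remove locks wherever possible
2. Use a different threading model

## Removing locks

The most direct way to remove a lock is to reduce the race conditions it guards. Let's see which race conditions the code actually has:

```python
#---------------↓ locked region A ↓---------------#
lock.acquire()                   # take the lock
no = listbox.size()              # use the current listbox row count as the download number
listbox.insert(tk.END, f'{no:02d}:{name}.....downloading')
print('insert:', no, name)
lock.release()                   # release the lock
#---------------↑ locked region A ↑---------------#

yt.streams.first().download()    # download the video (must not hold the lock here)

#---------------↓ locked region B ↓---------------#
lock.acquire()                   # take the lock
print('update:', no, name)
listbox.delete(no)
listbox.insert(no, f'{no:02d}:●{name}.....done')
lock.release()                   # release the lock
#---------------↑ locked region B ↑---------------#
```

Inside the locked regions, only the numbering looks like it races... so why not just pass the index in from outside? After changing the code a bit, I found that the real race condition is on the listbox. Passing `no` by value or making it atomic solves the numbering problem, but the rows are inserted in whatever order the threads happen to run, so the output order gets scrambled....

```python
#--------↓ download every video in the playlist ↓---------#
print('Starting playlist download')
for index, u in enumerate(urls):     # create and start one thread per URL
    threading.Thread(target=m.start_dload, args=(u, listbox, index)).start()

#--------↓ download a single video ↓---------#
#---------------↓ locked region A ↓---------------#
#lock.acquire()                  # take the lock
no = index                       # the download number is now the index passed in
listbox.insert(tk.END, f'{no:02d}:{name}.....downloading')
print('insert:', no, name)
#lock.release()                  # release the lock
#---------------↑ locked region A ↑---------------#

yt.streams.first().download()    # download the video (must not hold the lock here)

#---------------↓ locked region B ↓---------------#
#lock.acquire()                  # take the lock
print('update:', no, name)
listbox.delete(no)
listbox.insert(no, f'{no:02d}:●{name}.....done')
#lock.release()                  # release the lock
#---------------↑ locked region B ↑---------------#
```

![](https://i.imgur.com/PWtezg6.png)

There is also one more big problem with doing it this way:

![](https://i.imgur.com/2IRV4J4.png)

My feeling is that spawning one thread per download performs quite badly....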
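For the reason, see [context switch](https://zh.wikipedia.org/wiki/%E4%B8%8A%E4%B8%8B%E6%96%87%E4%BA%A4%E6%8F%9B).

## Improving the threading model

The simplest, most basic fix is to introduce a thread pool. Most languages ship one built in, so there is no need to write your own.

```
pip install threadpool
```

Usage

```python
from threadpool import ThreadPool, makeRequests

pool = ThreadPool(poolsize)   # create the pool and set how many threads it holds
requests = makeRequests(some_callable, list_of_args, callback)  # one work request per argument set; callback is optional
for req in requests:          # hand every request to the pool
    pool.putRequest(req)
pool.wait()                   # block until all requests have finished
```

The library above is already deprecated, no wonder it felt so clumsy. Below is the built-in one:

```python
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
print('Starting playlist download')
for index, url in enumerate(urls):
    executor.submit(m.start_dload, url, listbox, index)
```

The pool size is usually set to the number of logical processors on the CPU, which is 8 on my machine. Actual result:

![](https://i.imgur.com/VueyiKi.png)

To take the network out of the picture, let's see how big the difference is when the work is pure number crunching:

```python
#yt.streams.first().download()   # download the video (must not hold the lock here)
i = 0
for num in range(1, 10000000):
    i = i + num
```

8 threads: 1563350881.6716301 → 1563350912.9156864 (timestamps before and after the run, about 31.2 s)

32 threads: 1563350384.8802974 → 1563350412.9731643 (about 28.1 s)

Barely any difference, heh. Reading up on what others have found, it turns out Python has this thing called the GIL (global interpreter lock):
https://zh.wikipedia.org/wiki/%E5%85%A8%E5%B1%80%E8%A7%A3%E9%87%8A%E5%99%A8%E9%94%81

Turns out chapter 7 actually covers this. Sorry, my bad QQ

If you really need parallel execution for CPU-bound work, use `ProcessPoolExecutor` to get around the GIL.

### Using a message queue to centralize UI operations

If several functions operate on the listbox from multiple threads, the most intuitive approach is to lock everywhere. Keep writing code that way, though, and sooner or later one missed lock takes everything down with it. There is a simpler and more efficient solution: a message queue.

`multiprocessing.Queue` is one of the few objects in Python's standard library that is guaranteed to be thread-safe (when everything runs as threads in one process, `queue.Queue` gives the same guarantee).

```python
# at main
q = Queue()
threading.Thread(target=m.read, args=(q, listbox)).start()

# at download
executor = concurrent.futures.ThreadPoolExecutor(max_workers=16)
print('Starting playlist download')
ticks = time.time()
print("before:", ticks)
# keep the futures so each one can be traced back to its URL
futures = {executor.submit(m.start_dload, url, listbox, q): url for url in urls}
```

While downloading, every operation on the listbox is put on the message queue as a message and handled by the dedicated thread we just started. Since only one thread ever touches the listbox, no lock is needed.

```python
q.put((name, False))             # report "download started"
#lock.release()                  # release the lock
#---------------↑ locked region A ↑---------------#

yt.streams.first().download()    # download the video (must not hold the lock here)

#---------------↓ locked region B ↓---------------#
#lock.acquire()                  # take the lock
#listbox.delete(no)
#listbox.insert(no, f'{no:02d}:●{name}.....done')
q.put((name, True))              # report "download finished"
```

The dedicated thread reads messages from the queue and updates the listbox:

```python
def read(q, listbox):
    nameMap = {}                         # video name -> listbox row
    while True:
        name, complete = q.get(True)     # block until a message arrives
        if not complete:
            no = listbox.size()
            listbox.insert(tk.END, f'{no:02d}:{name}.....downloading')
            nameMap[name] = no
        else:
            no = nameMap[name]
            listbox.delete(no)
            listbox.insert(no, f'{no:02d}:●{name}.....done')
```

Result

![](https://i.imgur.com/B2vSOEL.png)

### Heavy-duty edition: better thread control

Writing your own infinite loop to wait for threads to finish is really silly, so let's use a thread pool for this instead.

Before

```python
import re
import threading

def multi_dload(urls, listbox):
    max_thread = threading.active_count() + 20   # ← upper bound on the number of running threads
    urls.sort(key=lambda s: int(re.search(r'index=\d+', s).group()[6:]))  # ← sort the playlist (details later)
    for url in urls:     # create and start the threads
        while threading.active_count() >= max_thread:
            pass         # busy-wait until a slot frees up
        threading.Thread(target=m.start_dload, args=(url, listbox)).start()
```

After

```python
import concurrent.futures
import re

def multi_dload(urls, listbox):
    urls.sort(key=lambda s: int(re.search(r'index=\d+', s).group()[6:]))  # ← sort the playlist (details later)
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=20)
    print('Starting playlist download')
    for url in urls:
        executor.submit(m.start_dload, url, listbox)
```

## Broken title

Reading `title` from the YouTube object blows up:

![](https://i.imgur.com/2usxbK0.png)

The problem only appeared after an update two days ago, so I went to GitHub to see whether anyone was discussing it.

![](https://i.imgur.com/uiOjX03.png)

The very first issue is exactly this one.

![](https://i.imgur.com/sK8U4Ot.png)

It looks like this commit fixes it.

![](https://i.imgur.com/4Pi8G7p.png)

After changing the corresponding code, everything works again.

## Adding SQLite

### Creating the database

First create the database, using a tool (DB Browser for SQLite).

![](https://i.imgur.com/nFKdsps.png)

Remember to set up an index on the columns you will search by later, so that lookup time drops from O(N) to O(log N).

Then write the functions that operate on the db, with reference to http://www.runoob.com/sqlite/sqlite-python.html

```python
def queryVideo(name, url, db):
    c = db.cursor()
    # "?" placeholders let sqlite3 do the quoting, so titles containing ' do not break the SQL
    cursor = c.execute("SELECT F_ID FROM T_VIDEO WHERE F_Name = ? AND F_URL = ?",
                       (name, url))
    for row in cursor:
        print("ID = ", row[0], "\n")
        return True      # at least one matching row exists
    return False

def addVideo(name, url, db):
    c = db.cursor()
    c.execute("INSERT INTO T_VIDEO (F_Name, F_URL) VALUES (?, ?)", (name, url))
    db.commit()
```

Run it, and it blows up: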
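![](https://i.imgur.com/7kMHx8N.png)

Great, so it looks like I have to open a connection every time I use the db. Presumably the restriction exists because an SQLite connection is not thread-safe. Changed the code so that each thread opens its own connection to the DB:

```python
#------------↓ SQLite ↓------------#
db = sqlite3.connect('hdb.db')       # one connection per thread
yt = YouTube(url)
name = yt.title
if queryVideo(name, url, db):
    print(f'Video {name} at {url} already downloaded')
    db.close()
    return
#....
print("after:", ticks)
addVideo(name, url, db)
db.close()                           # done with this thread's connection
```

First download

![](https://i.imgur.com/0PVZca1.png)

Second download

![](https://i.imgur.com/yLxnmRW.png)

You can see that nothing gets downloaded twice anymore.

![](https://i.imgur.com/i18Mlxh.png)

Open the DB and you can see the data has been updated.

![](https://i.imgur.com/vJYm8Jf.png)

## Summary

GIL, IO/CPU, Python thread / process, thread pool, message queue, SQLite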
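
As a recap of the message queue approach from these notes, here is a minimal sketch that runs without tkinter or pytube. It is not the original code: `rows` (a plain list standing in for the listbox), `fake_download`, `run`, and the `STOP` sentinel are hypothetical names made up for illustration, and video names are assumed to be unique. It uses `queue.Queue`, which is enough when all workers are threads in the same process.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

STOP = None  # hypothetical sentinel: tells the reader thread to exit


def read(q, rows):
    """Single consumer: the only thread that ever modifies `rows`."""
    name_map = {}  # video name -> row index
    while True:
        msg = q.get()  # blocks until a message arrives
        if msg is STOP:
            break
        name, complete = msg
        if not complete:
            no = len(rows)
            rows.append(f'{no:02d}:{name}.....downloading')
            name_map[name] = no
        else:
            no = name_map[name]
            rows[no] = f'{no:02d}:●{name}.....done'


def fake_download(name, q):
    """Stand-in for start_dload: reports progress only through the queue."""
    q.put((name, False))
    # the real download would happen here
    q.put((name, True))


def run(names, max_workers=4):
    q = queue.Queue()
    rows = []  # plays the role of the listbox
    reader = threading.Thread(target=read, args=(q, rows))
    reader.start()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for name in names:
            executor.submit(fake_download, name, q)
    # leaving the with-block waits for every worker, so STOP is the last message
    q.put(STOP)
    reader.join()
    return rows
```

Calling `run(['a', 'b', 'c'])` returns three rows, each marked done and carrying its own row number. Which name ends up in which row still depends on thread scheduling, as in the original notes. The sentinel is one possible way to let the reader thread exit; the `read` loop in the notes runs forever.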