###### tags: `Python` `Web Scraping` `Proxy` `csv`

# Using Proxies for Web Scraping

#### How a proxy works:

When the client sends a request, it first connects to the proxy server. On receiving the client's request, the proxy server connects to the remote machine, fetches the data, stores it on the proxy server's disk, and then passes that copy back to the client (the main point here is to hide your own IP address).

![](https://i.imgur.com/H9hQvdD.png)

:satellite:

* The [Free Proxy List](https://free-proxy-list.net/) site offers 300 free proxies, but copying them page by page is tedious, so the script below scrapes them automatically and saves them to a CSV file for later scraping jobs (see the usage sketches after the script).

```python=
import csv

import munch
import pyquery
import requests

response = requests.get(
    url='https://free-proxy-list.net/'
)
if response.status_code != 200:  # check that the server responded normally
    print(f'response status is not 200 ({response.status_code})')

with open('free-proxy.html', 'wb') as f:  # save a local copy to confirm the page was fetched
    f.write(response.content)

proxies = []
dom = pyquery.PyQuery(response.text)
table_list = dom('table#proxylisttable')
trs = table_list('tbody > tr').items()
for tr in trs:
    tds = list(tr('td').items())
    ip = tds[0].text()
    port = tds[1].text()
    print(f'{ip}\t{port}')
    proxies.append(munch.munchify({
        'ip': ip,
        'port': port
    }))

with open('proxies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([
        'IP',
        'PORT'
    ])
    for proxy in proxies:
        writer.writerow([
            proxy.ip,
            proxy.port
        ])
```
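With `requests`, routing a request through a proxy is just a matter of passing a `proxies` mapping, which is how the principle described at the top plays out in code: the request goes to the proxy server, which fetches the page and relays it back, so the target site sees the proxy's IP rather than yours. A minimal sketch, assuming a placeholder proxy address `203.0.113.10:8080` (not a real proxy, swap in an entry from `proxies.csv`):

```python=
import requests

# Hypothetical proxy address; replace it with an IP/PORT pair from proxies.csv
proxy_url = 'http://203.0.113.10:8080'

response = requests.get(
    url='https://free-proxy-list.net/',
    proxies={'http': proxy_url, 'https': proxy_url},  # route both schemes through the proxy
    timeout=10,
)
print(response.status_code)
```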
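Free proxies go down frequently, so before using the list for real scraping it helps to read `proxies.csv` back in and keep only the entries that still respond. The sketch below assumes `https://httpbin.org/ip` as a convenience endpoint that simply echoes the caller's IP; any stable test URL would do:

```python=
import csv

import requests

working = []
with open('proxies.csv', newline='') as f:
    for row in csv.DictReader(f):  # columns are IP and PORT, as written by the script above
        proxy_url = f"http://{row['IP']}:{row['PORT']}"
        try:
            r = requests.get(
                'https://httpbin.org/ip',
                proxies={'http': proxy_url, 'https': proxy_url},
                timeout=5,
            )
            if r.status_code == 200:
                working.append(proxy_url)  # the response body should show the proxy's IP, not yours
        except requests.RequestException:
            continue  # dead or unreachable proxy, skip it

print(f'{len(working)} proxies still respond')
```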