Web Crawler 3 (selenium)

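---
tags: web crawler
---

# Web Crawler 3 (selenium)

## What is selenium

`selenium` lets your code simulate what a real user does in a browser, e.g. clicking, typing text, playing videos, and so on. It is very convenient, and once you have it you basically won't want to use `requests` anymore. So why not teach it first? <del>Because I don't know how</del> Because `requests` gives you the most basic understanding of how browsing a website works: `get`, `post`, `put`, `delete`. Of course, that is only the very basics. If you want to dig deeper there is a whole pile of protocols (`https`, `tcp`, ...) plus packets and so on (which I don't know either :poop: ).

In short, if you want to build a project in high school, `selenium` is enough, because it can also grab the page's source code (html), and it makes logging in much easier: the flow is about the same as when you log in yourself (click -> type password -> submit). Basically, whatever you do by hand in the browser, it can do too.

## Preparation

Basically, if you need to browse web pages you need a browser :poop:. For `selenium` you additionally need [chrome driver](https://chromedriver.chromium.org/downloads), the program that lets selenium control Chrome. Of course Chrome itself has to be installed on your computer, and the chrome driver version has to match your Chrome version, otherwise things will break. Download it and put it in the same directory as your code. (Recent selenium releases, 4.6 and later, can also fetch a matching driver automatically.) After that, install `selenium` from the `terminal(cmd)`:

```
In cmd/terminal:
pip install selenium
```

## Initializing the browser

The most basic version is just this:

```python=
from selenium import webdriver

driver = webdriver.Chrome()
```

Of course, if you have never run into a bug, you have probably never written a program :poop: Sometimes you may hit an error like this (the Chinese part is Windows reporting that a device attached to the system is not functioning):

```
USB: usb_device_handle_win.cc:1046 Failed to read descriptor from node connection: 連結到系統的某個裝置失去作用。 (0x1F)
```

When that happens, just search for it; basically copy and paste the error message :poop: You will find out what is probably causing it, and you will also find a fix. Here the message is only Chrome's own console logging, and the fix is to switch that logging off:

```python=
from selenium import webdriver

options = webdriver.ChromeOptions()
# Hide Chrome's console log messages such as the USB error above
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
```

Then happily copy it in :poop: and it works again.

## click

Let's start with the simplest example, [pop cat](https://popcat.click/). With selenium you will basically get banned in no time, so we only use it for practice (don't ask me how I know :poop: ). Also, ten threads in python are about the same as one thread :poop:

:::spoiler Old way, raises an error
```python=
from selenium import webdriver

# Initialize the driver
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

# Open the URL (like typing it into the address bar; it also presses enter for you)
driver.get("https://popcat.click/")

# find_element_by_class_name was removed in selenium 4.3, so this line now fails
login = driver.find_element_by_class_name("title")
while True:  # this is magic
    login.click()  # click the element
```
:::

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://popcat.click/")

# Locate the element by its class name, then click it forever
login = driver.find_element(By.CLASS_NAME, "title")
while True:  # this is magic
    login.click()
```

## send key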
Basically this is what lets you type: it simulates the keyboard for you.

```python=
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Chrome()
driver.get("https://google.com/")

# XPath copied from the browser's inspector; an absolute path like this
# breaks as soon as the page layout changes
typing_blank = driver.find_element(By.XPATH, "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/textarea")
typing_blank.send_keys("hello, world")
typing_blank.send_keys(Keys.RETURN)  # press Enter
sleep(5)

# quit() closes every window and ends the session,
# so calling close() afterwards would only raise an error
driver.quit()
```

As for how to get the XPATH: open the page source (right-click -> Inspect), find the element you want, right-click it again, choose Copy, and you can copy its XPath. Besides `By.XPATH`, locating by `id` or `class` (`By.ID`, `By.CLASS_NAME`) is also allowed.

## login

Now let's practice combining everything.

`./data.json`

```json=
{
    "contestid": [229, 226],
    "url": {
        "login": "https://dandanjudge.fdhs.tyc.edu.tw/Login",
        "Ranking": "https://dandanjudge.fdhs.tyc.edu.tw/ContestRanking",
        "Edit": "https://dandanjudge.fdhs.tyc.edu.tw//EditContestants"
    },
    "TimeLimit": 3
}
```

```python=
import json
import os
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

BaseDataFileFatherName = "Test_1"
DataFilePos = "./data.json"
CookieFilePos = f"./{BaseDataFileFatherName}/cookie.json"
HomeUrl = "https://dandanjudge.fdhs.tyc.edu.tw/"


def WriteData(filepos, Data):
    print(f"Output Data into file {filepos}...")
    # Make sure the target folder exists before writing
    folder = os.path.dirname(filepos)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(filepos, 'w', encoding="utf-8") as f:
        f.write(Data)


def readJson(filepos):
    with open(filepos, encoding="utf-8") as f:
        return json.load(f)


UnitData = readJson(DataFilePos)


def SaveCookie(driver):
    # Store the session cookies so the next run can skip the login form
    Cookie = driver.get_cookies()
    print(Cookie)
    WriteData(CookieFilePos, json.dumps(Cookie, indent=4, ensure_ascii=False))


def Login(driver):
    # Open the login page
    driver.get(UnitData['url']['login'])

    # If cookies were saved earlier, reuse them
    # (this part can be removed in the full code)
    try:
        for c in readJson(CookieFilePos):
            driver.add_cookie({'name': c['name'], 'value': c['value']})
    except (OSError, json.JSONDecodeError):
        print("No Cookie Now")
    driver.refresh()
    if driver.current_url == HomeUrl:
        SaveCookie(driver)
        return "Login Success"

    # Ask for the account, find the account box and type it in
    AccountName = input("Your account:")
    AccountEle = driver.find_element(By.CSS_SELECTOR, 'div.col-md-4:nth-child(2) > form:nth-child(1) > div:nth-child(1) > div:nth-child(2) > input:nth-child(2)')
    AccountEle.send_keys(AccountName)

    # Ask for the password, find the password box and type it in
    PassWordName = input("your password:")
    PasswordEle = driver.find_element(By.XPATH, '//*[@id="passwd"]')
    PasswordEle.send_keys(PassWordName)

    # Find the login button and click it
    driver.find_element(By.XPATH, '/html/body/div[4]/div[2]/div[2]/form/button[1]').click()
    sleep(5)

    # Observation: a successful login redirects back to the home page
    if driver.current_url == HomeUrl:
        SaveCookie(driver)
        print("Login Success")
        return "Login Success"
    print("Login Fail")
    return "Login Fail"


options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
Login(driver)
```

## Replacing selenium

`driver.page_source` gives you the page's HTML source, so from there you can hand it to `BeautifulSoup` for parsing instead of continuing to query the page through selenium.

## Result
[My code](https://github.com/william1010121/DDJ-contest-score/tree/WindowCode)
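
Going back to the `driver.page_source` idea above, here is a minimal sketch of what "take the HTML string and parse it yourself" can look like. To keep it runnable without installing anything, it swaps in Python's built-in `html.parser` in place of `BeautifulSoup`, and it uses a made-up `SAMPLE_HTML` string as a stand-in for `driver.page_source`. The names `LinkCollector` and `extract_links` are hypothetical helpers, not part of selenium or of the linked repo.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> that is currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(page_source):
    """Return every link found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(page_source)
    parser.close()
    return parser.links


# Assumption: a tiny hand-written page standing in for driver.page_source
SAMPLE_HTML = """
<html><body>
  <a href="/ContestRanking">Ranking</a>
  <p>no link here</p>
  <a href="/Login">Log <b>in</b></a>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

In a real run you would pass `driver.page_source` instead of `SAMPLE_HTML`. With `BeautifulSoup` installed, the same job would roughly be `BeautifulSoup(driver.page_source, "html.parser").find_all("a")`.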