Use Selenium to crawl a website

# table of contents

[toc]

# requirement

- Install the following requirements first.

```bash!
pip install selenium
sudo apt update
sudo apt install -y chromium-chromedriver
```

## make sure you use python3 version >= python3.11

### check version

- `python3 --version`

### Using conda to Specify Python 3 Version

1. Install conda first.
2. Add conda to the PATH:
    - ```bash
      echo 'export PATH="/opt/conda/bin:$PATH"' >> ~/.bashrc
      ```
3. Create a conda virtual environment:
    - ```bash
      conda create --name myenv python=3.11
      ```
4. Activate the virtual environment:
    - ```bash
      conda activate myenv
      ```
    - Confirm the Python 3 version in use:
        - ```bash
          python3 --version
          ```
        - ```bash
          which python3
          ```
5. Deactivate the conda virtual environment:
    - ```bash
      conda deactivate
      ```

## Activate conda

- ```bash
  conda activate myenv
  ```
- Add the activation to `.bashrc`:
    - ```bash
      echo 'conda activate myenv' >> ~/.bashrc
      ```

### (Obsolete) configure python3 to use version python3.11

- First, make sure Python 3.11 is installed in your WSL Ubuntu environment. You can do this by running:
```
sudo apt update
sudo apt install python3.11
```
- Now, use the update-alternatives command to register Python 3.11 as an alternative for `python3`:
```
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
```
- The `1` at the end is the priority. Higher numbers take precedence in automatic mode, so `1` is a low priority; the next step selects Python 3.11 explicitly.
- You can then use the following command to configure the alternatives and choose Python 3.11 as the default:
```
sudo update-alternatives --config python3
```
- After selecting Python 3.11, you can verify the change by running:
```
python3 --version
```

### Resolve the `ModuleNotFoundError: No module named 'apt_pkg'` issue

```
sudo apt remove python3-apt
sudo apt autoclean
sudo apt install python3-apt
```

# running version: GUI vs no GUI

## GUI version

### sample code

```python!
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')

wd = webdriver.Chrome(options=options)

try:
    # Navigate to the specified URL (replace 'target_url' with a real URL)
    wd.get('target_url')

    # Keep the browser window open until Ctrl-C
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    # Close the WebDriver
    wd.quit()
```

## no GUI version with firefox (run in server)

### requirements

```bash
apt-get update
apt install firefox-geckodriver
```

- Add the following lines to your code:
```python
import sys
sys.path.insert(0,'/usr/lib/firefox-geckodriver')
```
- Note that `sys.path` only controls where Python looks for modules. If Selenium cannot find geckodriver, check that its directory is on the shell `PATH`.

### sample code

```python!
from selenium import webdriver
from selenium.webdriver.common.by import By
import sys

sys.path.insert(0, '/usr/lib/firefox-geckodriver')

options = webdriver.FirefoxOptions()
options.add_argument('--headless')  # Run Firefox in headless mode (no GUI)
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Firefox(options=options)

# Specify the URL of the web page (replace "target_url" with a real URL)
url = "target_url"
driver.get(url)

# Placeholder XPath: substitute a real tag name, attribute, and value
element = driver.find_element(By.XPATH, '//tagname[@attribute="value"]')

# Close the WebDriver when done
driver.quit()
```

# some common commands

## Get element by XPATH

```python
element = driver.find_element(By.XPATH, '//tagname[@attribute="value"]/tagname[@attribute="value"]...')
```

## wait

### Explicit Waits

- Wait for a condition to become true, giving up after a timeout.

#### sample code

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
```

### Implicit Waits

- Make the driver keep polling for up to the given number of seconds whenever it looks up an element that is not immediately present.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # seconds
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")
```

## action chains

- Define a chain of actions, then perform it later with a single `perform()` call.
- sample code (not runnable on its own: `driver` and `button_element` are assumed to exist already)

```python
from selenium.webdriver.common.action_chains import ActionChains

# Click the element
actions = ActionChains(driver)
actions.move_to_element(button_element)
actions.click(button_element)
try:
    actions.perform()
    # button_element.click()
except Exception as e:
    # Handle other exceptions
    print(f"An unexpected error occurred: {e}")
```

## take screenshot

```python
driver.save_screenshot("file_name.png")
```

## with OCR

### install

```bash
sudo apt-get install tesseract-ocr
pip install pytesseract Pillow
```

### sample code

```python
from PIL import Image
import pytesseract

# Open the captured screenshot
screenshot = Image.open('captcha_area.png')

# Perform OCR to extract text
extracted_text = pytesseract.image_to_string(screenshot)

# Print the extracted text
print("Extracted Text:")
print(extracted_text)
```

# References

- https://steam.oxxostudio.tw/category/python/spider/selenium.html
- [XPath in Selenium: All You Need to Know](https://www.simplilearn.com/tutorials/selenium-tutorial/xpath-in-selenium)
- [selenium-python.readthedocs.io](https://selenium-python.readthedocs.io/waits.html)
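
# appendix: building an XPATH string

The XPath pattern shown under "Get element by XPATH" (`//tagname[@attribute="value"]/tagname[@attribute="value"]...`) can be assembled from its parts. The sketch below is only an illustration: `build_xpath` is a hypothetical helper, not part of Selenium, and it assumes attribute values contain no double quotes.

```python
def build_xpath(*steps):
    """Join (tag, attribute, value) steps into one XPath string.

    Hypothetical helper for illustration; not a Selenium API.
    Assumes values contain no double quotes.
    """
    parts = []
    for tag, attribute, value in steps:
        # Each step becomes tagname[@attribute="value"]
        parts.append(f'{tag}[@{attribute}="{value}"]')
    # A leading // matches the first step anywhere in the document
    return "//" + "/".join(parts)


# Example tag and attribute names are made up for the demonstration
xpath = build_xpath(("div", "id", "content"), ("a", "class", "next"))
print(xpath)
```

The resulting string can then be passed as the second argument of `driver.find_element(By.XPATH, ...)`.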