Python BeautifulSoup Notes

tags: `Python` `BeautifulSoup` `Notes`

Getting started

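import requests
from bs4 import BeautifulSoup as bs

url = ''  # fill in the URL of the page to scrape
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

# a Session keeps cookies and reuses the connection across requests
session = requests.Session()
page = session.get(url, headers=headers)
# parse the response body with the built-in HTML parser
page_source = bs(page.text, 'html.parser')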

General filtering methods

Using `find_all`

The filters below can be combined in a single call.

import re

# class is A
class_A_tags = page_source.find_all(class_='A')

# id contains ABC
id_ABC_tags = page_source.find_all(id=re.compile('ABC'))

# tag name is div
tag_div_tags = page_source.find_all('div')
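
As a minimal sketch of combining these filters in one call, the snippet below parses a small made-up HTML string (the markup, class names, and ids are assumptions for illustration, not taken from a real page) and passes a tag name, a class, and a regex on `id` to a single `find_all`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = """
<div class="A" id="xABCx">first</div>
<div class="A" id="other">second</div>
<p class="A" id="ABC-p">third</p>
<div class="B" id="ABC-div">fourth</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# tag name + class + id regex: a tag must satisfy all three to match
matches = soup.find_all('div', class_='A', id=re.compile('ABC'))
matched_text = [tag.get_text() for tag in matches]
print(matched_text)  # ['first']
```

Only the first `div` satisfies every condition, so combining filters narrows the result rather than widening it.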

Using `select`

`select` takes a CSS selector and can be faster than `find`; see the discussion here:

https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select

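# select() returns a list of matching tags
tags_1 = page_source.select('#um > p:nth-child(2) > strong > a')
# the list itself has no select(); call it on each tag in the list
tags_2 = [a for tag in tags_1 for a in tag.select('.title a')]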
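
The following sketch shows the return types involved, using a small made-up HTML string (the markup is an assumption for illustration). `select` always returns a list, `select_one` returns the first match or `None`, and a nested search has to be run on each tag rather than on the list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = """
<div id="um">
  <p>intro</p>
  <p><strong><a href="/one">One</a></strong></p>
</div>
<ul>
  <li class="title"><a href="/two">Two</a></li>
  <li class="title"><a href="/three">Three</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list, even when only one tag matches
links = soup.select('#um > p:nth-child(2) > strong > a')
first_href = links[0]['href']

# select_one() returns the first match, or None when nothing matches
first_title_link = soup.select_one('.title a')
no_match = soup.select_one('.does-not-exist')

# to search inside earlier results, call select() on each tag
title_items = soup.select('li.title')
nested_text = [a.get_text() for li in title_items for a in li.select('a')]
```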

Using `xpath`

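from lxml import etree

# assuming `page` is the response fetched in the setup above
tree = etree.HTML(page.content)
# xpath() returns a list of matching elements
# (browsers insert <tbody>; remove it from the path if the raw HTML has none)
tag = tree.xpath('//*[@class="Info"]/tbody/tr[2]/td[2]/a/span')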
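
A self-contained sketch of the same idea, run against a made-up table (the markup is an assumption for illustration). It uses `//tr[2]` rather than `/tbody/tr[2]` so the path works whether or not the source contains a `tbody`, and shows that a path ending in `/@href` returns strings instead of elements.

```python
from lxml import etree

# Hypothetical markup, for illustration only
html = """
<table class="Info">
  <tr><td>Name</td><td><a href="/a"><span>Alpha</span></a></td></tr>
  <tr><td>Link</td><td><a href="/b"><span>Beta</span></a></td></tr>
</table>
"""
tree = etree.HTML(html)

# xpath() returns a list of elements; second row, second cell
spans = tree.xpath('//*[@class="Info"]//tr[2]/td[2]/a/span')
span_text = [s.text for s in spans]

# a path that ends in an attribute returns plain strings
hrefs = tree.xpath('//*[@class="Info"]//a/@href')
```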

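Special filtering methods

Use `find_next_sibling` to find the first matching sibling after a given tag; `find_next_siblings` returns all of them.


# first sibling after tag_A whose class is B (None if there is none)
class_B_tag_after_tag_A = tag_A.find_next_sibling(class_='B')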
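
A minimal sketch of sibling lookup on a made-up list (the markup and class names are assumptions for illustration), comparing the singular method with `find_next_siblings`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = """
<ul>
  <li class="A">start</li>
  <li class="C">skip</li>
  <li class="B">first B</li>
  <li class="B">second B</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
tag_A = soup.find(class_='A')

# singular: the first following sibling that matches, or None
first_B = tag_A.find_next_sibling(class_='B')

# plural: every following sibling that matches, in document order
all_B = tag_A.find_next_siblings(class_='B')
all_B_text = [t.get_text() for t in all_B]
```

Both calls skip the non-matching `li` with class `C` and only look at siblings that come after `tag_A`.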

Reading content


# text
tag[0].text
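
Beyond `.text`, a tag also exposes its attributes. The sketch below uses a made-up link (the markup is an assumption for illustration) to show whitespace handling with `get_text` and the difference between indexing an attribute and calling `get`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = '<a href="/docs" class="link ext"> Read <b>more</b> </a>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find_all('a')

# .text keeps the surrounding whitespace
raw_text = tag[0].text

# get_text() can strip each piece and join the pieces with a separator
clean_text = tag[0].get_text(' ', strip=True)

# attribute access: [] raises KeyError when missing, get() returns None
href = tag[0]['href']
title = tag[0].get('title')

# class is multi-valued, so it comes back as a list
classes = tag[0]['class']
```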
