Python BeautifulSoup Notes

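{%hackmd @yun-cheng/theme %}

# Python BeautifulSoup Notes

###### tags: `Python` `BeautifulSoup` `Notes`

[TOC]

## Getting Started

```python
import requests
from bs4 import BeautifulSoup as bs

url = ''  # target page URL
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

session = requests.Session()
page = session.get(url, headers=headers)
page_source = bs(page.text, 'html.parser')
```

## General Filtering

### Using `find_all`

> The filters can be combined in a single call

```python
import re

# Tags whose class is A
class_A_tags = page_source.find_all(class_='A')
# Tags whose id contains ABC
id_ABC_tags = page_source.find_all(id=re.compile('ABC'))
# All div tags
tag_div_tags = page_source.find_all('div')
```

### Using `select`

`select` is reportedly faster than `find`, though it is worth benchmarking on your own pages:
https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select

```python
# select() returns a list of tags, so chain from a single element, not from the list
tags_1 = page_source.select('#um > p:nth-child(2) > strong > a')
tags_2 = tags_1[0].select('.title a') if tags_1 else []
```

### Using `xpath`

BeautifulSoup itself does not support XPath; this uses `lxml` instead.

```python
from lxml import etree

# page is the requests response from the Getting Started snippet
tree = etree.HTML(page.content)
tag = tree.xpath('//*[@class="Info"]/tbody/tr[2]/td[2]/a/span')
```

## Special Filtering

> Use `find_next_sibling` to find the first matching sibling after a given tag; `find_next_siblings` returns all of them

```python
# First following sibling of tag_A with class B (None if there is none)
class_B_tag_after_tag_A = tag_A.find_next_sibling(class_='B')
# All following siblings of tag_A with class B
class_B_tags_after_tag_A = tag_A.find_next_siblings(class_='B')
```

## Reading Content

```python
# Text of the first matched tag
tag[0].text
```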
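
Beyond `.text`, a matched tag also exposes its attributes. The sketch below is only an illustration: it parses a small inline HTML string invented for this example (so no network request is needed) and reads text and attributes from the results. The markup, class names, and variable names are all assumptions, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<div id="um">
  <p class="A">first</p>
  <p class="B"><a href="/x" class="title"> Link text </a></p>
  <p class="B">last</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list, so index into it before reading content
links = soup.select('#um a.title')

# get_text(strip=True) trims surrounding whitespace; .text keeps it
link_text = links[0].get_text(strip=True)

# Attributes are read like dictionary keys
href = links[0]['href']

# class is multi-valued, so bs4 returns it as a list
classes = links[0].get('class')

# .get() returns None for a missing attribute instead of raising KeyError
missing = links[0].get('data-missing')

# All following siblings of the class-A paragraph that have class B
first_a = soup.find(class_='A')
later_b = first_a.find_next_siblings(class_='B')
later_b_texts = [p.get_text(strip=True) for p in later_b]
```

Using `.get()` rather than `tag['attr']` is the safer choice when an attribute may be absent on some of the matched tags.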