
Python BeautifulSoup Notes

tags: Python BeautifulSoup Notes

Boilerplate setup

    import requests
    from bs4 import BeautifulSoup as bs

    url = ''  # target page URL
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }

    # Reuse one session so cookies and connections persist across requests
    session = requests.Session()
    page = session.get(url, headers=headers)
    page_source = bs(page.text, 'html.parser')

Common filtering methods

Using find_all

The different filters (tag name, class, id, and so on) can be combined in a single call.

    import re

    # Tags whose class is A
    class_A_tags = page_source.find_all(class_='A')
    # Tags whose id contains ABC
    id_ABC_tags = page_source.find_all(id=re.compile('ABC'))
    # All div tags
    tag_div_tags = page_source.find_all('div')
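As a rough illustration of combining filters in one call, the sketch below runs find_all against a small inline HTML string. The markup, class names, and ids are made up for this example and do not come from a real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = """
<div class="A" id="ABC-1">first</div>
<div class="A" id="XYZ">second</div>
<p class="A" id="my-ABC">third</p>
<div class="B" id="ABC-2">fourth</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name, class, and id pattern in one call: all three must match
matches = soup.find_all("div", class_="A", id=re.compile("ABC"))
match_texts = [tag.text for tag in matches]

# The same filters without the tag name also match the <p>
any_tag_texts = [tag.text for tag in soup.find_all(class_="A", id=re.compile("ABC"))]
```

The compiled pattern is applied as a search, so an id only needs to contain ABC somewhere, not start with it.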

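Using select

select takes CSS selectors. It can be faster than find, but that depends on the bs4 version and the parser, so time it on your own pages if speed matters. The thread below discusses the differences between find and select.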

https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select

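    # select always returns a list of matching tags
    tags_1 = page_source.select('#um > p:nth-child(2) > strong > a')
    # The list itself has no select method, so search inside one of its elements
    tags_2 = tags_1[0].select('.title a')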
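A minimal sketch of select and select_one on an inline HTML string, to show what each one returns. The markup is hypothetical; only the selectors mirror the ones used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = """
<div id="um">
  <p>intro</p>
  <p><strong><a href="/one">One</a></strong></p>
</div>
<ul>
  <li class="title"><a href="/a">A</a></li>
  <li class="title"><a href="/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select returns a list, even when only one tag matches
links = soup.select("#um > p:nth-child(2) > strong > a")
first_href = links[0]["href"]

# select_one returns the first match, or None when nothing matches
first_title = soup.select_one(".title a")
no_match = soup.select_one(".missing")

# Collect the text of every matching link
title_texts = [a.text for a in soup.select(".title a")]
```

Because select_one can return None, check the result before reading attributes from it.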

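Using xpath

    from lxml import etree

    # BeautifulSoup has no XPath support, so parse the response with lxml instead.
    # page is the response object from the setup above.
    tree = etree.HTML(page.content)
    # Browsers often insert <tbody>; drop it from the path if the raw HTML has none
    tag = tree.xpath('//*[@class="Info"]/tbody/tr[2]/td[2]/a/span')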
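The same XPath can be tried offline against an inline table. This is only a sketch: the table contents are invented, and the tbody element is written explicitly in the markup so that the path matches.

```python
from lxml import etree

# Hypothetical markup, used only for this illustration
html = """
<table class="Info">
  <tbody>
    <tr><td>name</td><td><a><span>skip</span></a></td></tr>
    <tr><td>price</td><td><a><span>42</span></a></td></tr>
  </tbody>
</table>
"""
tree = etree.HTML(html)

# xpath returns a list of elements; tr[2] and td[2] are 1-based positions
spans = tree.xpath('//*[@class="Info"]/tbody/tr[2]/td[2]/a/span')
span_texts = [s.text for s in spans]

# Ending the path with text() returns the strings directly
texts = tree.xpath('//*[@class="Info"]//tr[2]/td[2]/a/span/text()')
```

The second query uses // instead of /tbody/, which keeps it working whether or not the source HTML contains a tbody element.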

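Special filtering methods

Use find_next_sibling to get the first matching sibling that comes after a given tag. Use find_next_siblings to get all of them.

    class_B_tags_after_tag_A = tag_A.find_next_sibling(class_='B')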
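A small sketch of the difference between find_next_sibling and find_next_siblings, using invented markup in which tag_A is followed by several siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = """
<div>
  <h2 class="A">Heading</h2>
  <p class="C">skipped</p>
  <p class="B">first B</p>
  <p class="B">second B</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
tag_A = soup.find(class_="A")

# First later sibling with class B; non-matching siblings are skipped
first_b = tag_A.find_next_sibling(class_="B")

# Every later sibling with class B, as a list
all_b = tag_A.find_next_siblings(class_="B")

# None when no later sibling matches
missing = tag_A.find_next_sibling(class_="Z")
```

Both methods only look at tags with the same parent as tag_A, never at its children or at tags elsewhere in the document.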

Reading content

    # Text
    tag[0].text
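Beyond .text, a few other ways of reading content tend to come up. The sketch below shows them on a made-up snippet; the markup and attribute values are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration
html = '<p class="note">Hello <a href="/docs" title="Docs">the <b>docs</b></a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# .text joins all nested strings inside the tag
text = link.text

# get_text can strip each string and join the pieces with a separator
clean = soup.p.get_text(" ", strip=True)

# Attributes are read like dictionary keys; this raises KeyError if absent
href = link["href"]

# .get returns None for a missing attribute instead of raising
missing = link.get("id")

# class is multi-valued, so it comes back as a list
classes = soup.p["class"]
```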