Python Data Mining

###### tags: `data mining` `python`

## urlopen

Data read from a socket usually arrives as `bytes`, which is not the same type as a Python 3 `str`. Decode it explicitly, for example with `html = urlopen(url).read().decode('UTF-8')`.

## findAll

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.mis.kuas.edu.tw/main.php?mod=teacher&site_id=0"
# read() returns bytes; decode to get a str
html = urlopen(url).read().decode('UTF-8')
bsObj = BeautifulSoup(html, 'html.parser')

# Collect every <table class="styleTable"> on the page
tbls = bsObj.findAll("table", {'class': "styleTable"})
personList = []
for tbl in tbls:
    tbldatas = tbl.findAll('tr')
    personList.append({
        # name: row 0, third cell; mail: row 4, second cell
        'name': tbldatas[0].findAll('td')[2].get_text(),
        'mail': tbldatas[4].findAll('td')[1].get_text(),
    })
print(personList)
```

## regex

- You can match against either a hard-coded string or a regex pattern.

```python
import re

def isValid(s):
    # "aa*" matches one 'a' followed by zero or more 'a's at the start of s
    pattern = "aa*"
    return re.match(pattern, s) is not None

# re.compile() requires a pattern argument; the compiled object can be reused
compiled = re.compile("aa*")
```

## .parent.previous_sibling...

- .parent
- .previous_sibling
- .previous_siblings
- .next_sibling
- .next_siblings

## lambda func

```python
# Keep only the tags that have exactly two attributes
bsObj.findAll(lambda tag: len(tag.attrs) == 2)
```
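
Going back to the regex section: `re.match` only anchors at the start of the string, so a pattern like `"aa*"` also accepts inputs such as `"ab"`. The following is a minimal standard-library sketch of that difference. The helper names `is_valid_prefix` and `is_valid_full` and the sample strings are made up for illustration and are not part of the original notes.

```python
import re

# Compile once and reuse; "aa*" means one 'a' followed by zero or more 'a's.
PATTERN = re.compile("aa*")

def is_valid_prefix(s):
    # match() only requires the pattern to match at the start of s
    return PATTERN.match(s) is not None

def is_valid_full(s):
    # fullmatch() requires the whole string to match the pattern
    return PATTERN.fullmatch(s) is not None

# Hypothetical sample inputs, chosen to show where the two checks differ
samples = ["a", "aaa", "ab", "ba", ""]
for s in samples:
    print(repr(s), is_valid_prefix(s), is_valid_full(s))
```

If the goal is to validate a whole string rather than a prefix, `fullmatch` (or an explicit `$` at the end of the pattern) is probably the safer choice.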