# WebCrawler
## Key mindset
- The more the crawler's disguise resembles a real user, the better
- Decide the exact target to crawl first, observe the page, then write the program
---
- First: write a program that connects to the network
- Second: disguise the program by sending a fake user request header
```python=
# web connection program
import urllib.request as request  # standard-library HTTP client

url = "https://www.ptt.cc/bbs/hotboards.html"
with request.urlopen(url) as response:
    sourceCode = response.read().decode("utf-8")  # bytes -> str
    print(sourceCode)
```
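Note: some sites reject a bare urlopen() (often with HTTP 403) because the default Python user-agent makes the script obvious; disguising the request header, as in the next section, works around this.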
## urllib.request.Request(url, headers)
- Request() lets you attach custom header fields to the request, which is very useful for the disguise
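A minimal sketch of the idea (the user-agent string is just one copied from a real browser; any current browser's value works):
```python=
import urllib.request as request

url = "https://www.ptt.cc/bbs/hotboards.html"
# wrap the url in a Request object so custom headers can be attached
fakeRequest = request.Request(url, headers={
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
})
with request.urlopen(fakeRequest) as response:  # urlopen accepts a Request object too
    print(response.read().decode("utf-8"))
```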
## Crawling an over-18 board (PTT)
- use a cookie
- add the cookie field to the request header
- Didn't expect that, huh? Still trying to stop me from crawling?

- bypass the age-verification check:
```python=
superRequest = request.Request(url, headers={
    "cookie": "over18=1",  # bypass the age verification
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
})
```
- Full program:
```python=
# web connection program
import urllib.request as request  # standard-library HTTP client
import bs4                        # beautifulsoup4, for HTML parsing

url = "https://www.ptt.cc/man/sex/DB29/index.html"
# create a Request object to disguise the crawler (request header)
superRequest = request.Request(url, headers={
    "cookie": "over18=1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
})
with request.urlopen(superRequest) as response:
    sourceCode = response.read().decode("utf-8")
    print(sourceCode)

# parse the source code with beautifulsoup4 to get what you want
root = bs4.BeautifulSoup(sourceCode, "html.parser")  # declare the parser for the page
titles = root.find_all("div", class_="m-ent")  # find_all(tag, attributes)
for title in titles:
    if title.a is not None:  # keep only entries that contain an <a> tag
        print(title.a.string)
```
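Note: find_all() returns a list of every matching tag; `title.a` is the first `<a>` descendant of each match and `.string` is its text, so the `is not None` check skips entries without a link.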
[Reference link](https://ithelp.ithome.com.tw/articles/10272679)
## Final program (crawl PTT)
- getdata()
    - prints each article title
    - returns the link to the next page
Remark:
answer = a + function(answer)  # function(answer) executes first
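A tiny sketch of that evaluation order (the names here are made up for illustration):
```python=
def f(x):
    print("f runs first, got:", x)
    return x * 2

a = 1
answer = 3
answer = a + f(answer)  # the call f(answer) is evaluated before the addition and the assignment
print(answer)           # 7
```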
```python=
# Crawler (PTT)
import urllib.request as request  # standard-library HTTP client
import bs4                        # beautifulsoup4, for HTML parsing

def getdata(url):
    # create a Request object to disguise the crawler (request header)
    superRequest = request.Request(url, headers={
        "cookie": "over18=1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    })
    with request.urlopen(superRequest) as response:
        sourceCode = response.read().decode("utf-8")
        # print(sourceCode)
    # parse the source code with beautifulsoup4 to get what you want
    root = bs4.BeautifulSoup(sourceCode, "html.parser")  # HTML parsing tool
    titles = root.find_all("div", class_="title")  # find_all(tag, attributes)
    for title in titles:
        if title.a is not None:  # keep only entries that contain an <a> tag
            print(title.a.string)
    nextPage = root.find("a", string="下頁 ›")  # link to the next index page
    # print(nextPage)          # inspect
    # print(nextPage["href"])
    return nextPage["href"]

# url = "https://www.ptt.cc/bbs/sex/index1.html"
# print("https://www.ptt.cc/" + getdata(url))
''' Use a while loop instead of chaining calls by hand:
nextlink1 = "https://www.ptt.cc/" + getdata(url)
nextlink2 = "https://www.ptt.cc/" + getdata(nextlink1)
...
'''
nextlink = "https://www.ptt.cc/bbs/sex/index1.html"
counter = 0
while counter < 10:
    nextlink = "https://www.ptt.cc/" + getdata(nextlink)  # follow the 下頁 (next page) link
    counter += 1
```
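Caveat: once the crawler reaches the newest index page there is no usable 「下頁 ›」 target, so nextPage["href"] will fail. A minimal sketch of a safer loop, assuming getdata() is adjusted to return None in that case (e.g., `return nextPage.get("href") if nextPage else None`):
```python=
nextlink = "https://www.ptt.cc/bbs/sex/index1.html"
for _ in range(10):                        # still cap at 10 pages
    suffix = getdata(nextlink)             # assumed: returns None on the last page
    if suffix is None:
        break
    nextlink = "https://www.ptt.cc/" + suffix
```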