---

title: '[Information Security] Blocking Web Crawlers'
tags: [Information Security, robots]

---

# [Information Security] Blocking Web Crawlers
###### tags: `Information Security` `robots`
[TOC]
## User Agent
```
Mozilla/<version> (<system and browser info>) <platform> (<platform details>) [extensions]
# Example: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```
Source:
[User Agents list](https://developers.whatismybrowser.com/useragents/explore/software_name/baidu-spider/)
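The fail2ban filters in the next section match the User-Agent string as it appears in the nginx access log, so the log format must record `$http_user_agent`. A minimal sketch that mirrors nginx's built-in `combined` format (the format name `agentlog` and the log path are assumptions; adjust to your setup):

```nginx
# Mirrors the default "combined" log format; the final quoted field is the User-Agent
log_format agentlog '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log agentlog;
```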

## Malicious crawler filters (fail2ban)
* /etc/fail2ban/filter.d/nginx-badbots.conf
```conf
[Definition]

badbotscustom = Sogou web spider|DotBot|AhrefsBot|Baiduspider|PetalBot|WOW64|Daum|Barkrowler|360Spider|Buck|Photon|SEOkicks|magpie-crawler|SemrushBot|SeznamBot|MJ12bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|(?:Mozilla/\d+\.\d+ )?Jorgee

badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots&#44; \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu 
agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00

failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"*(?:%(badbots)s|%(badbotscustom)s).*"$

ignoreregex =

datepattern = {^LN-BEG}%%ExY(?P<_sep>[-/.])%%m(?P=_sep)%%d[T ]%%H:%%M:%%S(?:[.,]%%f)?(?:\s*%%z)?
              ^[^\[]*\[({DATE})
              {^LN-BEG}
```
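A filter file only defines patterns; nothing is banned until a jail references it. A minimal sketch for `/etc/fail2ban/jail.local` (the log path, `maxretry`, and `bantime` values are assumptions to tune for your environment):

```conf
[nginx-badbots]
# Assumed values - adjust log path and ban policy for your setup
enabled  = true
port     = http,https
filter   = nginx-badbots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```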

* Custom filter /etc/fail2ban/filter.d/nginx-custom.conf
```conf
[Definition]
failregex = ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+XDEBUG.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+GponForm.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+phpunit.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+ajax-index\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+sellers\.json .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+adminer\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+wp-configuration\.php.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+ThinkPHP.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+wp-config.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+dede\/login\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+plus\/recommend\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+e\/install\/index.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+m\/e\/install\/index\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+e_bak\/install\/index.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+\.aspx .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+\.act .+$
            ^<HOST>.*] "(GET|POST|HEAD) .*xmlrpc\.php.*
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Mozilla/5\.0 \(Windows NT 6\.1\; rv\:60\.0\) Gecko\/20100101 Firefox\/60\.0.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Photon.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Mozilla\/5\.0 zgrab\/0\.x.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.XTC.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.python-requests.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.bidswitchbot.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Google-adstxt.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Apache-HttpClient.*"$
            ^<HOST>.*] "POST .*HTTP\/1\.1.*
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Go-http-client\/1\.1.*"$
            ^<HOST> .* ".*\\x.*" .*$
            ^<HOST> -.*"(GET|POST|HEAD) .+wp-login\.php .*HTTP\/1\.1.*
            ^<HOST> -.*"(GET|POST|HEAD) .*wp-includes/wlwmanifest.xml .*HTTP\/1\.1.*
ignoreregex = ^<HOST>.*] "POST /xmlrpc\.php\?for=jetpack.*
              ^<HOST>.*] "POST /wp-cron\.php\?doing_wp_cron=.*
datepattern = {^LN-BEG}%%ExY(?P<_sep>[-/.])%%m(?P=_sep)%%d[T ]%%H:%%M:%%S(?:[.,]%%f)?(?:\s*%%z)?
              ^[^\[]*\[({DATE})
              {^LN-BEG}
```
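As with the bad-bot filter, this custom filter needs a matching jail entry before it takes effect. A sketch for `/etc/fail2ban/jail.local` (all values are assumptions):

```conf
[nginx-custom]
# Assumed values - adjust for your environment
enabled  = true
port     = http,https
filter   = nginx-custom
logpath  = /var/log/nginx/access.log
maxretry = 2
findtime = 600
bantime  = 3600
```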

* Rule to block WordPress `xmlrpc.php` attacks while still allowing Jetpack to work (XMLRPC ATTACK)
```conf
failregex = ^<HOST>.*] "(GET|POST|HEAD) .*xmlrpc\.php.*
ignoreregex = ^<HOST>.*] "POST /xmlrpc\.php\?for=jetpack.*
```

* Block requests with escaped special characters in the URL
```conf
failregex = ^<HOST> .* ".*\\x.*" .*$
# Filters out access-log entries such as:
# 185.202.1.188 - - [23/Jun/2020:08:18:33 +0800] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 157 "-" "-"
```
Source: https://szeching.com/.xmlrpc.php.fail2ban-nginx-filter-config/

## nginx.conf settings (or create agent_deny.conf and include it from nginx.conf)
```nginx
# Deny request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

# Block scraping tools such as Scrapy, curl, and generic HTTP clients
if ($http_user_agent ~* (Python|Java|Wget|Scrapy|Curl|HttpClient|Spider|PostmanRuntime)) {
    return 403;
}

# These `if` blocks deny proxy-based access and stress-testing tools used for DoS attacks
if ($http_user_agent ~* ApacheBench|WebBench|java/) {
    return 403;
}
if ($http_user_agent ~* LWP::Simple|BBBike|wget) {
    return 403;
}

# Block the listed User-Agents and requests with an empty User-Agent
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|Bytespider|Jullo|Apache-HttpAsyncClient| |BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|iaskspider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|LinkpadBot|Ezooms|^$" ){
  return 403;
}

# Block common crawlers from mainland China (enable or disable as needed)
if ($http_user_agent ~* "qihoobot|Yahoo! Slurp China|Baiduspider|Baiduspider-image|spider|Sogou spider|Sogou web spider|Sogou inst spider|Sogou spider2|Sogou blog|Sogou News Spider|Sogou Orion spider|ChinasoSpider|Sosospider|YoudaoBot|yisouspider|EasouSpider|Tomato Bot|Scooter") { 
    return 403;
}

# Block common international crawlers (enable or disable as needed)
if ($http_user_agent ~* "Googlebot|Googlebot-Mobile|AdsBot-Google|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|MSNBot|Catall Spider|ArchitextSpider|AcoiRobot|Applebot|Bingbot|Discordbot|Twitterbot|facebookexternalhit|ia_archiver|LinkedInBot|Naverbot|Pinterestbot|seznambot|Slurp|teoma|TelegramBot|Yandex|Yeti|Infoseek|Lycos|Gulliver|Fast|Grabber") { 
    return 403;
}

```
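nginx's `if` directive is evaluated per request and is notoriously error-prone; the same User-Agent blacklist can be centralized in a single `map`, which is the more idiomatic approach. A sketch (the variable name `$bad_agent` and the abbreviated pattern list are assumptions; reuse the full lists above in practice):

```nginx
# In the http{} context: $bad_agent is 1 for blacklisted or empty User-Agents
map $http_user_agent $bad_agent {
    default                                          0;
    ""                                               1;
    ~*(Scrapy|Curl|HttpClient|ApacheBench|WebBench)  1;
}

# In the server{} context
if ($bad_agent) {
    return 403;
}
```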

Include agent_deny.conf in nginx.conf
```nginx
http {
    # Include the crawler User-Agent blocking rules
    include agent_deny.conf;

    location = /robots.txt {  # advisory crawler policy for well-behaved bots; not enforcement
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }
}
```

## Creating robots.txt rules
### Common directives
#### User-agent
Specifies which crawler the rules that follow apply to. `User-agent: *` applies them to every crawler.
#### Disallow
Paths that must not be crawled. `Disallow: /` blocks the entire site.
#### Crawl-delay
Delay between successive requests, in seconds. Not every crawler honors it (Googlebot ignores it).
#### Allow
Paths that may be crawled, overriding a broader `Disallow`.
#### Sitemap
Absolute URL of the site's sitemap file.
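These directives combine into a single file served at `/robots.txt`. A sketch for a hypothetical site (all paths and the sitemap URL are placeholders):

```
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
```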


### Examples

Block every crawler from all resources
```
User-agent: *
Disallow: /
```

Block a specific User-agent from all resources
```
User-agent: Baiduspider
Disallow: /
```

Block all crawlers site-wide, except Googlebot, which is blocked only from `/private/`
```
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /private/
```

With a 30-second delay, a crawler can fetch at most about 1,000 pages in 8.3 hours (1000 × 30 s ≈ 8.3 h).
```
User-agent: *
Crawl-delay: 30
```


`$` anchors the end of the URL path. Block all crawlers from any `.png` file:
```
User-agent: *
Disallow: /*.png$
```
robots.txt has no `^` operator; every rule is already anchored to the start of the path, so a plain prefix suffices. Block bingbot from `/wp-admin/` and from any path beginning with `/test`:
```
User-agent: bingbot
Disallow: /wp-admin/
Disallow: /test
```
:::success
Open `chrome://version/` to see your own User-Agent string.
Source: [Crawler or phone? A brief look at the User-Agent "ID badge" of web access](https://progressbar.tw/posts/234)
:::
[Apache bad-bot blocker robots.txt rules](https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/tree/master/robots.txt)
[fail2ban User-Agent blacklist discussion thread](https://github.com/fail2ban/fail2ban/issues/1950)

---
References:
[How to block search engines from crawling your site with robots.txt?](https://www.newscan.com.tw/all-seo/robots-block-search-engines.htm)
[Create and submit a robots.txt file | Google Search Central](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt)
[\[Developer tools\] Setting the userAgent to googlebot](https://www.togetherhoo.com/how-to/other/2748/)
[Googlebot | Google Search Central](https://developers.google.com/search/docs/crawling-indexing/googlebot)
