# [Information Security] Blocking Crawlers
###### tags: `資訊安全` `robots`
[TOC]

## User Agent

```
Mozilla/<version> (<system and browser info>) <platform> (<platform details>) [extensions]

// Example: Googlebot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```

Source: [User Agents list](https://developers.whatismybrowser.com/useragents/explore/software_name/baidu-spider/)

## Malicious-crawler filters (fail2ban)

* /etc/fail2ban/filter.d/nginx-badbots.conf

```conf
[Definition]
badbotscustom = Sogou web spider|DotBot|AhrefsBot|Baiduspider|PetalBot|WOW64|Daum|Barkrowler|360Spider|Buck|Photon|SEOkicks|magpie-crawler|SemrushBot|SeznamBot|MJ12bot|EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider|(?:Mozilla/\d+\.\d+ )?Jorgee
badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|Guestbook Auto Submitter|Industry Program 1\.0\.x|ISC Systems iRc Search 2\.1|IUPUI Research Bot v 1\.9a|LARBIN-EXPERIMENTAL \(efp@gmx\.net\)|LetsCrawl\.com/1\.0 \+http\://letscrawl\.com/|Lincoln State Web Browser|LMQueueBot/0\.2|LWP\:\:Simple/5\.803|Mac Finder 1\.0\.xx|MFC Foundation Class Library 4\.0|Microsoft URL Control - 6\.00\.8xxx|Missauga Locate 1\.0\.0|Missigua Locator 1\.9|Missouri College Browse|Mizzu Labs 2\.2|Mo College 1\.9|MVAClient|Mozilla/2\.0 \(compatible; NEWT ActiveX; Win32\)|Mozilla/3\.0 \(compatible; Indy Library\)|Mozilla/3\.0 \(compatible; scan4mail \(advanced version\) http\://www\.peterspages\.net/?scan4mail\)|Mozilla/4\.0 \(compatible; Advanced Email Extractor v2\.xx\)|Mozilla/4\.0 \(compatible; Iplexx Spider/1\.0 http\://www\.iplexx\.at\)|Mozilla/4\.0 \(compatible; MSIE 5\.0; Windows NT; DigExt; DTS Agent|Mozilla/4\.0 efp@gmx\.net|Mozilla/5\.0 \(Version\: xxxx Type\:xx\)|NameOfAgent \(CMS Spider\)|NASA Search 1\.0|Nsauditor/1\.x|PBrowse 1\.4b|PEval 1\.4b|Poirot|Port Huron Labs|Production Bot 0116B|Production Bot 2016B|Production Bot DOT 3016B|Program Shareware 1\.0\.2|PSurf15a 11|PSurf15a 51|PSurf15a VA|psycheclone|RSurf15a 41|RSurf15a 51|RSurf15a 81|searchbot admin@google\.com|ShablastBot 1\.0|snap\.com beta crawler v0|Snapbot/1\.0|Snapbot/1\.0 \(Snap Shots, \+http\://www\.snap\.com\)|sogou develop spider|Sogou Orion spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sogou spider|Sogou web spider/3\.0\(\+http\://www\.sogou\.com/docs/help/webmasters\.htm#07\)|sohu agent|SSurf15a 11 |TSurf15a 11|Under the Rainbow 2\.2|User-Agent\: Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1\)|VadixBot|WebVulnCrawl\.unknown/1\.0 libwww-perl/5\.803|Wells Search II|WEP Search 00
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"*(?:%(badbots)s|%(badbotscustom)s).*"$
ignoreregex =
datepattern = {^LN-BEG}%%ExY(?P<_sep>[-/.])%%m(?P=_sep)%%d[T ]%%H:%%M:%%S(?:[.,]%%f)?(?:\s*%%z)?
              ^[^\[]*\[({DATE})
              {^LN-BEG}
```

* Custom filter: /etc/fail2ban/filter.d/nginx-custom.conf

```conf
[Definition]
failregex = ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+XDEBUG.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+GponForm.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+phpunit.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+ajax-index\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+sellers\.json .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+adminer\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+wp-configuration\.php.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+ThinkPHP.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+wp-config.+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+dede\/login\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+plus\/recommend\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+e\/install\/index.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+m\/e\/install\/index\.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+e_bak\/install\/index.php .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+\.aspx .+$
            ^<HOST> \- \S+ \[\] \"(GET|POST|HEAD) .+\.act .+$
            ^<HOST>.*] "(GET|POST|HEAD) .*xmlrpc\.php.*
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Mozilla/5\.0 \(Windows NT 6\.1\; rv\:60\.0\) Gecko\/20100101 Firefox\/60\.0.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Photon.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Mozilla\/5\.0 zgrab\/0\.x.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.XTC.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.python-requests.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.bidswitchbot.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Google-adstxt.*"$
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Apache-HttpClient.*"$
            ^<HOST>.*] "POST .*HTTP\/1\.1.*
            ^<HOST> -.*"(GET|POST|HEAD).*HTTP\/1\.1.*"*.Go-http-client\/1\.1.*"$
            ^<HOST> .* ".*\\x.*" .*$
            ^<HOST> -.*"(GET|POST|HEAD) .+wp-login\.php .*HTTP\/1\.1.*
            ^<HOST> -.*"(GET|POST|HEAD) .*wp-includes/wlwmanifest.xml .*HTTP\/1\.1.*
ignoreregex = ^<HOST>.*] "POST /xmlrpc\.php\?for=jetpack.*
              ^<HOST>.*] "POST /wp-cron\.php\?doing_wp_cron=.*
datepattern = {^LN-BEG}%%ExY(?P<_sep>[-/.])%%m(?P=_sep)%%d[T ]%%H:%%M:%%S(?:[.,]%%f)?(?:\s*%%z)?
              ^[^\[]*\[({DATE})
              {^LN-BEG}
```

* Rule that blocks WordPress xmlrpc attacks while letting Jetpack keep working (XMLRPC ATTACK)

```conf
failregex = ^<HOST>.*] "(GET|POST|HEAD) .*xmlrpc\.php.*
ignoreregex = ^<HOST>.*] "POST /xmlrpc\.php\?for=jetpack.*
```

* Rule that blocks attacks using special characters in the URL

```conf
failregex = ^<HOST> .* ".*\\x.*" .*$
# Filters out access-log entries such as:
# 185.202.1.188 - - [23/Jun/2020:08:18:33 +0800] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 157 "-" "-"
```

Source: https://szeching.com/.xmlrpc.php.fail2ban-nginx-filter-config/
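These filters only take effect once a jail references them. Below is a minimal jail sketch, assuming the nginx access log lives at `/var/log/nginx/access.log`; the `bantime` and `maxretry` values are illustrative, not from the source article:

```conf
# /etc/fail2ban/jail.local — minimal sketch; logpath, bantime, and maxretry are assumptions
[nginx-badbots]
enabled  = true
port     = http,https
filter   = nginx-badbots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400

[nginx-custom]
enabled  = true
port     = http,https
filter   = nginx-custom
logpath  = /var/log/nginx/access.log
maxretry = 2
bantime  = 86400
```

Before enabling a jail, a filter can be dry-run against a real log with `fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/nginx-badbots.conf` to confirm the failregex matches what you expect.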
## nginx.conf settings (or put the rules in agent_deny.conf and include it from nginx.conf)

```nginx
# Reject request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
# Block fetching by tools such as Scrapy
if ($http_user_agent ~* (Python|Java|Wget|Scrapy|Curl|HttpClient|Spider|PostmanRuntime)) {
    return 403;
}
# These ifs block access through proxy IPs or stress-testing tools used for DoS attacks
if ($http_user_agent ~* ApacheBench|WebBench|java/) {
    return 403;
}
if ($http_user_agent ~* LWP::Simple|BBBike|wget) {
    return 403;
}
# Block the listed UAs as well as empty UAs
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|Bytespider|Jullo|Apache-HttpAsyncClient| |BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|iaskspider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|LinkpadBot|Ezooms|^$") {
    return 403;
}
# Block common China-based crawlers (enable or disable to suit your needs)
if ($http_user_agent ~* "qihoobot|Yahoo! Slurp China|Baiduspider|Baiduspider-image|spider|Sogou spider|Sogou web spider|Sogou inst spider|Sogou spider2|Sogou blog|Sogou News Spider|Sogou Orion spider|ChinasoSpider|Sosospider|YoudaoBot|yisouspider|EasouSpider|Tomato Bot|Scooter") {
    return 403;
}
# Block common overseas crawlers (enable or disable to suit your needs)
if ($http_user_agent ~* "Googlebot|Googlebot-Mobile|AdsBot-Google|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|MSNBot|Catall Spider|ArchitextSpider|AcoiRobot|Applebot|Bingbot|Discordbot|Twitterbot|facebookexternalhit|ia_archiver|LinkedInBot|Naverbot|Pinterestbot|seznambot|Slurp|teoma|TelegramBot|Yandex|Yeti|Infoseek|Lycos|Gulliver|Fast|Grabber") {
    return 403;
}
```

Including agent_deny.conf from nginx.conf. Note that `if` and `location` are not allowed directly in the `http` context, so the include belongs inside a `server` block:

```nginx
http {
    server {
        # pull in the crawler-UA blocking rules
        include agent_deny.conf;

        location = /robots.txt {
            # states the crawl policy; advisory only, it enforces nothing
            default_type text/html;
            add_header Content-Type "text/plain; charset=UTF-8";
            return 200 "User-agent: *\nDisallow: /";
        }
    }
}
```
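The chained `if` blocks above work, but nginx's own documentation discourages heavy use of `if`. As an alternative, here is a minimal sketch using `map`; the variable name `$block_ua` and the shortened pattern list are illustrative, not from the source:

```nginx
http {
    # map is only valid at http{} level; flag suspicious UAs in one place
    map $http_user_agent $block_ua {
        default 0;
        ""      1;   # empty User-Agent
        ~*(scrapy|python-requests|wget|curl|zgrab|mj12bot|semrushbot) 1;
    }

    server {
        listen 80;
        # "0" counts as false in an nginx if-condition
        if ($block_ua) {
            return 403;
        }
    }
}
```

This keeps the UA list in a single place instead of repeating `return 403` per pattern, and the variable is evaluated only when referenced.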
## Building robots.txt rules

### Common directives

#### User-agent
Selects which crawler the following rule group applies to. `User-agent: *` makes the group apply to every crawler.

#### Disallow
Lists directories or files that must not be crawled. `Disallow: /` disallows the entire site.

#### Crawl-delay
Sets the delay between successive requests, in seconds.

#### Allow
Explicitly permits crawling of the specified directories or files.

#### Sitemap
Gives the location of the site's sitemap file; an absolute URL is required.

### Examples

Deny every crawler access to all resources:
```
User-agent: *
Disallow: /
```

Deny a specific User-agent access to all resources:
```
User-agent: Baiduspider
Disallow: /
```

Block all crawlers site-wide, but let Googlebot crawl everything except `/private/` (a crawler follows its most specific matching group):
```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /private/
```

A 30-second delay allows roughly 1,000 pages to be retrieved in about 8.3 hours:
```
User-agent: *
Crawl-delay: 30
```

`$` — deny all crawlers access to the site's .png images:
```
User-agent: *
Disallow: *.png$
```

`^` — deny Bingbot everything under /wp-admin/ and any filename starting with test:
```
User-agent: bingbot
Disallow: /wp-admin/
Disallow: ^test*
```

:::success
`chrome://version/` — open this URL in Chrome to look up your own User-Agent string.
Source: [是爬蟲還是手機?淺談網路訪問的識別證UserAgent](https://progressbar.tw/posts/234)
:::

[Apache bad-bot blocker](https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/tree/master/robots.txt)
[User-agent blacklist discussion thread](https://github.com/fail2ban/fail2ban/issues/1950)

---

References:
[如何使用robots.txt阻止搜尋引擎爬(抓取)你的網站?](https://www.newscan.com.tw/all-seo/robots-block-search-engines.htm)
[建立及提交 robots.txt 檔案 | Google搜尋中心](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt)
[【開發人員工具】將userAgent設為googlebot](https://www.togetherhoo.com/how-to/other/2748/)
[Googlebot | Google搜尋中心](https://developers.google.com/search/docs/crawling-indexing/googlebot)
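As a closing example, a sketch that combines the directives above (including `Allow` and `Sitemap`, which the examples do not demonstrate individually) into one robots.txt; the host `example.com` and the paths are placeholders:

```
# Placeholder sketch — example.com and the paths are illustrative
User-agent: *
Disallow: /
Crawl-delay: 30

User-agent: Googlebot
Allow: /public/
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
```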