# Web Crawling
• A web crawler is a program that automatically retrieves information from the World Wide Web according to certain rules.
## Request the specified URL to retrieve the response body.
#### Requests Module: Reading Website Files
• To systematically automate the collection of information from the internet, the first step is to be able to extract webpage
content or files from websites for further processing.
• Python provides a "requests" module that allows users to make requests to websites and obtain the response content
using a simple and readable syntax.
• You can install it using: pip install -U requests
#### Send a GET request
• When you open a browser, enter a URL, and submit it, the specified website's server receives the request and
responds with content that can be viewed in the browser.
This method of making a request is called GET (requesting a specific resource from the server).
• The requests module allows for making a GET request
without using a browser. The syntax is:
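• A minimal sketch (example.com is a placeholder URL):

```python
import requests

# Send a GET request to the specified URL (example.com is a placeholder)
response = requests.get('https://www.example.com')
print(response.text)
```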

#### Response object properties
• The Response object provides access to the different parts of the response
through the following attributes.
• text: Get the webpage source code as a string.
• When the server does not declare an encoding, Requests falls back to Latin-1 (ISO-8859-1). If the page actually uses
a different encoding, this often produces garbled characters. You can set the encoding of the Response object yourself, for example to UTF-8 or Big5:
– response.encoding = 'UTF-8'
• content: Retrieve the response body as binary (bytes) data, e.g., for images or other files.
• status_code: Retrieve HTTP status code.
– Informational responses, 100–199
– Successful responses, 200–299
– Redirects, 300–399
– Client errors, 400–499
– Server errors, 500–599
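• For instance, a short sketch exercising these attributes (the URL is a placeholder):

```python
import requests

response = requests.get('https://www.example.com')  # placeholder URL
response.encoding = 'UTF-8'       # override the encoding if the page is UTF-8
print(response.status_code)       # e.g. 200 for a successful response
print(response.text[:200])        # first 200 characters of the source code
print(len(response.content))      # size of the raw (binary) body in bytes
```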
#### get.py (Let’s practice)
#### Add URL query parameters
• In a GET request, in addition to specifying the URL, you can append query parameters so that the server-side program can receive them and return different response content.
• A '?' separates the URL from its parameters.
• Parameters and values are separated by '='.
• Multiple parameters should be connected using '&'.
– http://www.test.com/?x=value&y=value2
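• The requests module can also build the query string for you; a short sketch using the URL above, passing the parameters as a dictionary via the params argument:

```python
import requests

# requests appends '?x=value&y=value2' to the URL automatically
payload = {'x': 'value', 'y': 'value2'}
response = requests.get('http://www.test.com/', params=payload)
print(response.url)   # http://www.test.com/?x=value&y=value2
```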
#### get_params.py (Let’s practice)
• Use the httpbin.org service to test HTTP requests and responses.
• httpbin.org: A service for testing requests and responses. Users can send GET, POST, and other request actions to the site; upon receiving a request, the service responds in JSON format with the request's args (parameters), headers (header data), origin (source IP address), and url (request URL). This makes it an excellent testing tool for API developers.
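• A short sketch of a GET test against httpbin.org:

```python
import requests

response = requests.get('https://httpbin.org/get', params={'name': 'test'})
data = response.json()   # httpbin.org replies in JSON format
print(data['args'])      # {'name': 'test'}
print(data['url'])       # https://httpbin.org/get?name=test
print(data['origin'])    # the source IP address of the request
```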
#### Send POST request
• A POST request (submitting data to a specified resource so that it can be processed, or so that the resource's state can be updated) is a commonly used HTTP request, required whenever a webpage contains a form that lets users input data.
• In the requests module, the parameters to pass in a POST request are defined as a dictionary, which is then supplied as the data argument when the POST request is made.
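• A minimal sketch, again using httpbin.org as the test target (the form fields are made up):

```python
import requests

# Parameters are defined as a dictionary and passed via the data argument
payload = {'username': 'test', 'password': '1234'}
response = requests.post('https://httpbin.org/post', data=payload)
print(response.json()['form'])   # {'password': '1234', 'username': 'test'}
```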
#### post.py (Let’s practice)
## Analyze the content of the response and extract the necessary information.
#### BeautifulSoup Module: Web Scraping and Parsing
• BeautifulSoup enables rapid and precise analysis and extraction of specific targets within a web page.
• If it is not already available in your environment, you can install it with: pip install -U beautifulsoup4
#### Understanding Website Structure
• The content of a webpage is essentially pure text and is typically stored as a file with the extension .htm or .html.
• A webpage is constructed using HTML (HyperText Markup Language) syntax, which uses tags to structure content; after reading the file, the browser renders the webpage according to these tag descriptions.
• HTML provides a structured representation of a document: the Document Object Model (DOM).
• All tags are enclosed in <...>, with most having both opening and closing tags.
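• As a concrete illustration, a made-up minimal page stored as a Python string:

```python
# A made-up minimal page: tags nest inside one another to form the DOM tree
html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">A paragraph with a <a href="/link1">link</a>.</p>
  </body>
</html>
"""
```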


• The BeautifulSoup module works by parsing the raw HTML source code of a webpage into structured objects, allowing the program to quickly access its content.
#### Using BeautifulSoup
• After importing BeautifulSoup, use the requests module to obtain the webpage's source code, then parse it with 'lxml'.
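• A minimal sketch combining the two modules (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.example.com')   # placeholder URL
response.encoding = 'UTF-8'

# First parameter: the source code to parse; second parameter: the parser
soup = BeautifulSoup(response.text, 'lxml')
print(soup.title.text)
```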

• Creating BeautifulSoup requires two parameters:
– First parameter: The original code to be parsed
– Second parameter: Parser
– "html.parser" is a built-in parser in Python.
– "lxml" is a parser based on the C language, with faster execution speed
#### BeautifulSoup attributes
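• A short sketch of commonly used attributes, using a made-up sample page:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title)        # <title>Sample Page</title> (the first <title> tag)
print(soup.title.text)   # Sample Page (the text inside the tag)
print(soup.p.text)       # Hello (dot notation reaches the first matching tag)
```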

#### bs1.py (Let’s practice)
• Install lxml: pip3 install lxml
#### BeautifulSoup: find(), find_all()
• find() returns the first tag that matches the search criteria; find_all() returns a list of every matching tag.
• Include tag attributes as search criteria.
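• A short sketch, using a made-up sample page:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="intro">First</p>
<p class="intro">Second</p>
<p id="footer">Last</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find('p'))                      # the first matching <p> tag only
print(soup.find_all('p'))                  # a list of all three <p> tags
print(soup.find_all('p', class_='intro'))  # tag attributes as search criteria
print(soup.find('p', id='footer').text)    # Last
```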
#### bs2.py (Let’s practice)
#### BeautifulSoup: select()
• select() searches for tags using CSS selectors and returns a list of all matches.
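• A short sketch of select() with different CSS selectors, using a made-up sample page:

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="content">
  <p class="intro">First</p>
  <p class="intro">Second</p>
</div>
"""
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.select('p'))               # all <p> tags
print(soup.select('.intro'))          # by class name
print(soup.select('#content'))        # by id
print(soup.select('div p')[0].text)   # descendant selector: First
```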
#### bs3.py (Let’s practice)
#### BeautifulSoup: select('a')
• Extract the content within the <a> tag and retrieve the value of the tag's attributes (e.g., href).
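• A short sketch, using a made-up link:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://www.example.com">Example</a>'   # made-up link
soup = BeautifulSoup(html_doc, 'lxml')

for link in soup.select('a'):
    print(link.text)          # Example (the content within the <a> tag)
    print(link.get('href'))   # https://www.example.com (the attribute value)
```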
#### bs4.py (Let’s practice)
#### Using Chrome for assistance
• Many webpages have complex structures, making it difficult to quickly pinpoint the location of the content to be scraped within the HTML source code.
• We can use Chrome's 'Developer Tools' for assistance.

