# MLE - Week 4 Summary
-------
- [HTML, CSS, JS](#HTML-CSS-JS)
- [BeautifulSoup](#BeautifulSoup)
- [Selenium library](#Selenium)

HTML, CSS, JS
-------------

:point_right: `HTML = Structure`
:point_right: `CSS = Styles`
:point_right: `JS = Behaviour`

*The web browser normally does the rendering: it fetches the resources from the server, combines them, and produces the final web page.*

Common `HTML` tags:

- `html` `head` `body` `title` `link` `script`
- `div` `h1` `h2` `h3` `table` `thead` `tbody` `tr` `td`
- `a` `span` `ul` `ol` `li` `p`
- `img`

:penguin: `HTML5`-only tags:

| Tag | Description |
| ------------ | ---------------------------------------------------------------------------------------------------- |
| `article` | Defines an article. |
| `audio` | Embeds a sound or an audio stream in an HTML document. |
| `dialog` | Defines a dialog box or subwindow. |
| `embed` | Embeds an external application, typically multimedia content like audio or video, into an HTML document. |
| `figure` | Represents a figure illustrated as part of the document. |
| `figcaption` | Defines a caption or legend for a figure. |
| `footer` | Represents the footer of a document or a section. |
| `header` | Represents the header of a document or a section. |
| `main` | Represents the main or dominant content of the document. |
| `nav` | Defines a section of navigation links. |
| `picture` | Defines a container for multiple image sources. |
| `section` | Defines a thematic section of a document. |
| `svg` | Embeds SVG (Scalable Vector Graphics) content in an HTML document. |
| `video` | Embeds video content in an HTML document. |

BeautifulSoup
-------------

>:point_right: **Beautiful Soup** is **a library** that makes it easy to **scrape information** from web pages. It sits atop an **HTML or XML parser**, providing Pythonic idioms for **iterating, searching, and modifying** the parse tree.
[[1]]

:bangbang: Need to install before using:

```bash=
pip install bs4
```

```python=
from bs4 import BeautifulSoup

# r.text is an HTML string, so we will use html.parser
soup = BeautifulSoup(r.text, 'html.parser')

# Make the soup object look nicer
print(soup.prettify()[:500])
```

:star: Besides `html.parser`, there are other parsers as well.

```python=
print(BeautifulSoup("<a><b/></a>", "xml"))
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

BeautifulSoup("<a></p>", "html.parser")
# <a></a>
```

### Encoding

:point_right: By default, `BeautifulSoup` converts the document to Unicode, autodetecting the input encoding. You can also specify the encoding explicitly:

```python=
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
print(soup.h1)
# <h1>םולש</h1>
print(soup.original_encoding)
# iso8859-8
```

:star: The default input and output encoding is `UTF-8`.

### Kinds of Objects

|Object|Description|
|------|-----------|
|`Tag`|A `Tag` object corresponds to an XML or HTML tag in the original document.|
|`Name`|Every tag has a name, accessible as `.name`.|
|`Attributes`|A tag may have any number of attributes. Use square brackets to get a tag's attributes.|

### Searching the tree

#### find_all

The `find_all()` method can take many kinds of filters:

```python=
# A string
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# A regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# A list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# True matches every tag
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

# A function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were…bottom of a well.</p>,
#  <p class="story">...</p>]
```

#### find

`find` returns the **FIRST** occurrence of `<tag>` in the `<soup>` object.

#### CSS selectors

```python=
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
```

### Navigating the tree

`Tags` may contain *strings* and other *tags*. These elements are the *tag*'s children. Every `tag` has exactly one parent `tag`, except the top-level `<html>` tag.
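Before moving on to navigation, one practical difference between `find` and `find_all` is worth sketching: when nothing matches, `find` returns `None` while `find_all` returns an empty list. A minimal illustration (the one-tag snippet below is made up for this example):

```python
from bs4 import BeautifulSoup

# Made-up snippet with a <p> but no <a> tags
soup = BeautifulSoup("<p>hello</p>", "html.parser")

# find() gives back None when there is no match...
print(soup.find("a"))      # None

# ...while find_all() gives back an empty list
print(soup.find_all("a"))  # []
```

This matters when chaining: `soup.find("a")["href"]` raises a `TypeError` if the tag is absent, so check for `None` first.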
#### Going down

```python=
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Return the FIRST occurrence of <tag> in the <soup> object.
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Return the FIRST occurrence of <tag> whose <attributes> equal <values>.
soup.find('a', {'id': 'link2'})
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

# Return ALL occurrences of <tag> in the <soup> object.
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

:star: Bonus

|Method|Description|
|------|-----------|
|`.contents`|Returns all the children of a tag as **a list**.|
|`.children`|Same as above, but returns **a generator**.|
|`.descendants`|Recursively flattens all children, including nested ones; returns **a generator**.|

```python=
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ['The Dormouse's story']

for child in title_tag.children:
    print(child)
# The Dormouse's story

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```

#### Going up

|Method|Description|
|------|-----------|
|`.parent`|Returns an element's parent.|
|`.parents`|Returns **a generator** which yields the parent, then the grandparent, then the great-grandparent, ...|

```python=
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
```

:star: The parent of the top-level `<html>` tag is the `BeautifulSoup` object itself:

```python=
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

# Don't go too far
print(soup.parent)
# None
```

#### Going sideways

:point_right: You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:

```python=
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>

sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
```

:penguin: As with `.parents`, you can iterate over a tag's siblings with `.next_siblings` or `.previous_siblings`.

#### Going back and forth

:point_right: The `.next_element` attribute of a string or tag points to whatever was parsed **immediately afterwards**. The same goes for `.previous_element`.

### Output

```python=
# Make the soup object look nicer
print(soup.prettify()[:500])
```

### Advanced parser customization

#### Parsing only part of a document

>:point_right: The `SoupStrainer` class allows you to choose which parts of an incoming document are parsed. [[2]]

```python=
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return string is not None and len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)
```

```python=
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
```

#### Handling duplicate attributes

The default behavior is to use the last value found for the tag:

```python=
markup = '<a href="http://url1/" href="http://url2/">'
soup = BeautifulSoup(markup, 'html.parser')
soup.a['href']
# http://url2/

soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
soup.a['href']
# http://url2/
```

With `on_duplicate_attribute='ignore'`, Beautiful Soup will use the first value found and ignore the rest:

```python=
soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
soup.a['href']
# http://url1/
```

Selenium
--------

Although the full page can be fetched with a plain GET request:

```python=
import requests
r = requests.get('https://vnexpress.net/')
# print(r.text)
```

some elements are rendered dynamically by JavaScript and are not present in the plain-text response :arrow_right: we have to use *Selenium* to render the full page and get the content.

*From the 4.3a colab*

<img src="https://i.imgur.com/0MVRzha.jpg" width=300>

:bangbang: Need to install the Selenium library and the browser driver (ChromeDriver in this case):

```python=
# install selenium and the chromium browser driver to crawl data
!pip install selenium
!apt install chromium-chromedriver
```

```python=
from selenium import webdriver
```

```python=
# Set up the driver with specific options for Chrome
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
```

```python=
driver = webdriver.Chrome('chromedriver', options=chrome_options)
```

```python=
tiki_url = 'https://tiki.vn/'
driver.get(tiki_url)
html_data = driver.page_source  # after driver.get() is done, get back the HTML string with .page_source
driver.close()                  # close the browser after getting what you want
```

[1]: https://pypi.org/project/beautifulsoup4/
[2]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document
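A typical follow-up (not shown in the colab) is to feed `driver.page_source` into BeautifulSoup for extraction. A minimal sketch, using a made-up static HTML string in place of a live `page_source` so it runs without a browser; the `div.product` / `span.name` markup is invented for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source (a live Selenium session would supply this)
page_source = """
<html><body>
<div class="product"><span class="name">Item A</span></div>
<div class="product"><span class="name">Item B</span></div>
</body></html>
"""

soup = BeautifulSoup(page_source, "html.parser")

# Extract every product name, exactly as you would from a rendered page
names = [span.get_text() for span in soup.select("div.product span.name")]
print(names)  # ['Item A', 'Item B']
```

In a real crawl you would assign `page_source = driver.page_source` after `driver.get(...)` finishes, then close the driver and do all parsing offline, which keeps the browser session as short as possible.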