# MLE - Week 4
Summary
-------
- [HTML, CSS, JS](#HTML-CSS-JS)
- [BeautifulSoup](#BeautifulSoup)
- [Selenium library](#Selenium)
HTML, CSS, JS
-------------
:point_right: `HTML = Structure`
:point_right: `CSS = Styles`
:point_right: `JS = Behaviour`
*The web browser normally does the rendering: it fetches the resources from the server, combines them, and produces the page you see.*
Common `HTML` tags:
- `html` `head` `body` `title` `link` `script`
- `div` `h1` `h2` `h3` `table` `thead` `tbody` `tr` `td`
- `a` `span` `ul` `ol` `li` `p`
- `img`
:penguin: `HTML5` only tags:
| Tag | Description |
| ------------ | ---------------------------------------------------------------------------------------------------- |
| `article` | Defines an article. |
| `audio` | Embeds a sound, or an audio stream in an HTML document. |
| `dialog` | Defines a dialog box or subwindow. |
| `embed`      | Embeds an external application, typically multimedia content like audio or video, into an HTML document. |
| `figure` | Represents a figure illustrated as part of the document. |
| `figcaption` | Defines a caption or legend for a figure. |
| `footer` | Represents the footer of a document or a section. |
| `header` | Represents the header of a document or a section. |
| `main` | Represents the main or dominant content of the document. |
| `nav` | Defines a section of navigation links. |
| `picture` | Defines a container for multiple image sources. |
| `section`    | Defines a thematic section of a document, such as a chapter. |
| `svg`        | Embeds SVG (Scalable Vector Graphics) content in an HTML document. |
| `video` | Embeds video content in an HTML document. |
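A minimal page sketch (all content hypothetical) that combines several of the common and `HTML5`-only tags above:

```html
<!DOCTYPE html>
<html>
  <head><title>Demo</title></head>
  <body>
    <header><nav><a href="#">Home</a></nav></header>
    <main>
      <article>
        <h1>Title</h1>
        <figure>
          <img src="photo.jpg" alt="A photo">
          <figcaption>A caption for the figure.</figcaption>
        </figure>
      </article>
    </main>
    <footer><p>Footer text</p></footer>
  </body>
</html>
```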
BeautifulSoup
-------------
>:point_right: **Beautiful Soup** is **a library** that makes it easy to **scrape information** from web pages. It sits atop an **HTML or XML parser**, providing Pythonic idioms for **iterating, searching, and modifying** the parse tree. [[1]]
:bangbang: Needs to be installed before use (the `bs4` package on PyPI is just a shim that installs `beautifulsoup4`):
```bash=
pip install beautifulsoup4
```
```python=
from bs4 import BeautifulSoup
# r is a requests.Response (e.g. r = requests.get(url));
# r.text is an HTML document, so we use the built-in html.parser
soup = BeautifulSoup(r.text, 'html.parser')
# Make the soup object look nicer
print(soup.prettify()[:500])
```
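If there is no live `requests` response at hand, the same flow works on any HTML string — a tiny self-contained sketch (the markup is made up):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for r.text
html = "<html><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.string)
# Hello
```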
:star: Besides `html.parser`, there are other parsers as well (`lxml` and `html5lib` are third-party packages that must be installed separately).
```python=
print(BeautifulSoup("<a><b/></a>", "xml"))
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
BeautifulSoup("<a></p>", "html.parser")
# <a></a>
```
### Encoding
:point_right: By default, `BeautifulSoup` autodetects the document's encoding and converts the content to Unicode. The autodetected encoding can be overridden with `from_encoding`:
```python=
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
print(soup.h1)
# <h1>םולש</h1>
print(soup.original_encoding)
# iso8859-8
```
:star: The input encoding is autodetected (unless `from_encoding` is given); the output encoding defaults to `UTF-8`.
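A small sketch of both directions (the byte string and encoding are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical latin-1 bytes: 0xE9 is "é" in latin-1
markup = b"<h1>caf\xe9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="latin-1")
print(soup.h1.string)  # café  -- converted to Unicode internally
print(soup.encode())   # b'<h1>caf\xc3\xa9</h1>'  -- UTF-8 on output
```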
### Kinds of Objects
|Object|Description|
|------|-----------|
|`Tag`|A Tag object corresponds to an XML or HTML tag in the original document.|
|`Name`|Every tag has a name, accessible as `.name`.|
|`Attributes`|A tag may have any number of attributes; use square brackets to access them.|
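A quick sketch (hypothetical markup) of a `Tag`, its `.name`, and square-bracket attribute access — note that `class` is a multi-valued attribute, so it comes back as a list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)      # b
print(tag['class'])  # ['boldest']  -- class is multi-valued
```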
### Searching the tree
#### find_all
The `find_all()` method can take many kinds of filters:
```python=
# A string
soup.find_all('b')
# [<b>The Dormouse's story</b>]
# A regular expression
import re
for tag in soup.find_all(re.compile("^b")):
print(tag.name)
# body
# b
# A list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# True
for tag in soup.find_all(True):
print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
# A function
def has_class_but_no_id(tag):
return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were…bottom of a well.</p>,
# <p class="story">...</p>]
```
#### find
`find` returns the **FIRST** occurrence of `<tag>` in the `<soup>` object (or `None` if nothing matches).
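A self-contained sketch (hypothetical markup) of both a match and no match:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="first">one</a><a id="second">two</a></p>', 'html.parser')
print(soup.find('a'))          # <a id="first">one</a>  -- only the first match
print(soup.find('nosuchtag'))  # None when nothing matches
```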
#### CSS selectors
```python=
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
```
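A few more selector patterns, sketched on made-up markup (`select_one` returns just the first match instead of a list):

```python
from bs4 import BeautifulSoup

html = '<div id="top"><p class="story"><a id="link1">Elsie</a></p></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select("#link1"))       # select by id
print(soup.select("div p a"))      # descendant combinator
print(soup.select_one("p.story"))  # first match only, not a list
```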
### Navigating the tree
`Tags` may contain *strings* and other *tags*. These elements are the *tag*’s children.
Every `tag` has exactly one parent `tag`, except the top-level `<html>` tag.
#### Going down
```python=
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# Return the FIRST occurrence of <tag> in <soup> object.
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Return the FIRST occurrence of <tag> whose <attributes> equal <values>.
soup.find('a', {'id':'link2'})
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
# Return ALL occurrences of <tag> in <soup> object.
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
:star: Bonus
|Method|Description|
|------|-----------|
|`.contents`|Returns a tag's children as **a list**.|
|`.children`|Same as above, but returns **a generator**.|
|`.descendants`|Recursively yields all children and their nested descendants; returns **a generator**.|
```python=
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ['The Dormouse's story']
for child in title_tag.children:
print(child)
# The Dormouse's story
for child in head_tag.descendants:
print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```
#### Going up
|Method|Description|
|------|-----------|
|`.parent`|Returns an element's parent.|
|`.parents`|Returns **a generator** that yields the parent, then the grandparent, and so on.|
```python=
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
```
:star: The parent of the top-level `<html>` tag is the `BeautifulSoup` object itself:
```python=
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
# Don't go too far
print(soup.parent)
# None
```
#### Going sideways
:point_right: You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:
```python=
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
# <b>
# text1
# </b>
# <c>
# text2
# </c>
# </a>
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
```
:penguin: As with `.parents`, you can iterate over a tag's siblings with `.next_siblings` or `.previous_siblings`.
#### Going back and forth
:point_right: The `.next_element` attribute of a string or tag points to whatever was parsed **immediately afterwards**; likewise, `.previous_element` points to whatever was parsed immediately before.
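A small sketch (made-up markup) showing how `.next_element` differs from `.next_sibling`: the string inside `<b>` was parsed right after `<b>`, while the sibling is the text that follows the whole tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>one</b>two</p>', 'html.parser')
print(repr(soup.b.next_element))  # 'one' -- the string inside <b>
print(repr(soup.b.next_sibling))  # 'two' -- the node after the whole <b> tag
```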
### Output
```python=
# Make the soup object look nicer
print(soup.prettify()[:500])
```
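Besides `prettify()`, `get_text()` is handy for output: it strips the tags and keeps only the human-readable text (markup below is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
print(soup.get_text())  # Hello world
```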
### Advanced parser customization
#### Parsing only part of a document
>:point_right: The `SoupStrainer` class allows you to choose which parts of an incoming document are parsed. [[2]]
```python=
from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
return string is not None and len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
```
```python=
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
```
#### Handling duplicate attributes
The default behavior is to use the last value found for the tag:
```python=
markup = '<a href="http://url1/" href="http://url2/">'
soup = BeautifulSoup(markup, 'html.parser')
soup.a['href']
# http://url2/
soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='replace')
soup.a['href']
# http://url2/
```
With `on_duplicate_attribute='ignore'` Beautiful Soup will use the first value found and ignore the rest:
```python=
soup = BeautifulSoup(markup, 'html.parser', on_duplicate_attribute='ignore')
soup.a['href']
# http://url1/
```
Selenium
--------
Although the full page source can be fetched with a plain GET request:
```python=
import requests
r = requests.get('https://vnexpress.net/')
# print(r.text)
```
Some elements are rendered dynamically by JavaScript and never appear in the plain-text response :arrow_right: we have to use *Selenium* to render the full page in a real browser and then get the content.
*From the 4.3a colab*
<img src="https://i.imgur.com/0MVRzha.jpg" width=300>
:bangbang: Need to install the Selenium library and a browser driver (ChromeDriver in this case):
```python=
# install selenium and chromium browser driver to crawl data
!pip install selenium
!apt install chromium-chromedriver
```
```python=
from selenium import webdriver
```
```python=
# Set driver with specific options for Chrome
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
```
```python=
# Passing the driver path positionally is deprecated in Selenium 4 (use a Service object instead)
driver = webdriver.Chrome('chromedriver', options=chrome_options)
```
```python=
tiki_url = 'https://tiki.vn/'
driver.get(tiki_url)
html_data = driver.page_source # after driver.get() is done, you can get back HTML string by using .page_source
driver.close() # close the browser after getting what you want
```
[1]: https://pypi.org/project/beautifulsoup4/
[2]: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document