Scratch and Python 2017 Summer - Python Lecture 4
===
[Battle pairings](https://hackmd.io/s/BkDQD-svZ)
[Question link](https://app2.sli.do/event/v9vez7ai/ask)
## String Processing and Webpages
### `requests`: a browser (?)
+ A website can look many different ways
+ While its content is being transferred, it is just a sequence of bytes
+ A sequence of bytes? A sequence of characters? That is a string.
+ We need a different pair of eyes to read it
+ `import requests`: don't forget the `s`
+ URL: Uniform Resource Locator
+ The term is a mouthful, so let's just call it a web address
+ `result = requests.get('http://www.nctu.edu.tw/')`
+ The URL is long, so just copy and paste it
+ `result.text`
+ The content is a string: `type(result.text)`
+ String processing
+ `str`
+ `import re`: regular expression
+ `result.raise_for_status()`
+ Fetching a webpage involves many steps; if the URL is wrong, does your browser just crash on you?
+ It is OK to raise an exception, but always wrapping everything in `try-except` blocks is tedious.
+ Java is tedious: almost everything is wrapped in `try` or declared with `throws` (`raise` in Python)
+ `raise_for_status()`: if the request failed (an HTTP error status such as 404), raise the exception now.
+ Save the content
+ Open a file to save it: `the_file = open('a_name.html', 'wb')`
+ Filename: `a_name.html`
+ Mode: `wb` means "write binary"
+ `the_file.write(chunk)`: write a chunk of bytes
+ Mode: `wt` means "write text"
+ `print(some_str,file=the_file)`: write a string
+ `for chunk in result.iter_content(102400):` iterates over `result`'s content in 102400-byte chunks
+ Remember to close the file: `the_file.close()`
+ Sample code
```python3=
import requests
url = input('Input URL: ')
result = requests.get(url)
result.raise_for_status()
name = input('input filename: ')
file = open(name,'wb')
for chunk in result.iter_content(102400):
    file.write(chunk)
file.close()
```
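As a small aside on the file modes above, here is a self-contained sketch of `wb` versus `wt` and of `print(..., file=...)`; the filenames `demo.txt` and `demo.bin` are made up for illustration:

```python
data = 'hello, 網頁'

# 'wt' mode: write str objects (the text is encoded for you).
with open('demo.txt', 'wt', encoding='utf-8') as the_file:
    print(data, file=the_file)          # print() adds a trailing newline

# 'wb' mode: write bytes objects (you encode the text yourself).
with open('demo.bin', 'wb') as the_file:
    the_file.write(data.encode('utf-8'))

# Read both back to check what was written.
print(open('demo.txt', encoding='utf-8').read())
print(open('demo.bin', 'rb').read())
```

`requests` gives you bytes, which is why the download samples use `wb`.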
+ Sample code 2: `with`-block (probably everybody forgets to close the file sometimes; `with` closes it automatically)
```python3=
import requests
url = input('Input URL: ')
result = requests.get(url)
result.raise_for_status()
name = input('input filename: ')
with open(name,'wb') as file:
    for chunk in result.iter_content(102400):
        file.write(chunk)
```
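To see what `raise_for_status()` raises without touching the network, we can simulate a failed response. Building a bare `requests.Response` by hand is just a classroom trick, not something real code does:

```python
import requests

# Simulate a response that came back with an HTTP error status.
result = requests.Response()
result.status_code = 404          # 404 Not Found

try:
    result.raise_for_status()     # raises because 400 <= 404 < 500
    print('no error')
except requests.exceptions.HTTPError as err:
    print('caught:', err)
```

In the download samples we simply let the exception propagate, which stops the program with a readable error message.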
+ Try to download some images with the sample code.
```python3=
import requests
def url_to_file(url,filename):
    result = requests.get(url)
    result.raise_for_status()
    with open(filename,'wb') as file:
        for chunk in result.iter_content(102400):
            file.write(chunk)
```
### String processing
+ Escape Characters: [Table](https://automatetheboringstuff.com/chapter6/#calibre_link-40)
+ Multiline Strings: triple quotes `'''`
+ Integers, characters and strings
+ `chr`
+ try `chr(65)`
+ `ord`
+ try `ord('A')`
+ Python uses `str` values of length 1 to represent characters.
+ Indexing:
+ `print('123'[0])`
+ `print('123'[-1])`
+ `index()`
+ `print('123123123'.index('312'))`
+ Slicing:
+ `print('abcde'[2:])`
+ `print('abcde'[:2])`
+ `print('abcde'[1:3])`
+ `in` and `not in`
+ `print('abc' in 'zabcy')`
+ `print('gg' not in 'zabcy')`
+ `upper()` and `lower()`
+ `isupper()`, `islower()`, `isdecimal()`, `isalpha()`, `isalnum()`
+ try `'123.0'.isdecimal()`
+ `join(list_str)`
+ try `', '.join([str(i) for i in range(10)])`
+ `split(string)`
+ try `'<a href="https://automatetheboringstuff.com/">'.split('"')`
+ `startswith(string)` and `endswith(string)`
+ Many URLs start with `http` and end with `html`
+ `strip()`, `rstrip()`, `lstrip()`
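A few of the methods above in one self-contained sketch (the sample strings are arbitrary):

```python
s = '  Hello, World  '
print(s.strip())                      # 'Hello, World'
print(s.lower())                      # '  hello, world  '
print('abcde'[1:3])                   # 'bc'
print(chr(ord('A') + 1))              # 'B'

# split() is handy for pulling pieces out of HTML-like text
parts = '<a href="x.html">'.split('"')
print(parts)                          # ['<a href=', 'x.html', '>']

print('-'.join(['1', '2', '3']))      # '1-2-3'
print('x.html'.endswith('html'))      # True
```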
+ URL
+ Almost everything is specified: `https://en.wikipedia.org/w/api.php?action=rsd`
+ Same protocol: `//en.wikipedia.org/w/api.php?action=rsd`
+ Same address: `/w/api.php?action=rsd`
+ Relative path: `api.php?action=rsd`
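The four URL forms above can be turned back into full URLs with the string methods from this section. A sketch, assuming the page they came from is `https://en.wikipedia.org/w/index.html` (a made-up base URL; the standard library's `urllib.parse.urljoin` does this properly):

```python
def resolve(url, base='https://en.wikipedia.org/w/index.html'):
    """Turn the four URL forms into a full URL using string methods."""
    if '://' in url:                 # almost everything is specified
        return url
    if url.startswith('//'):         # same protocol
        return 'https:' + url
    if url.startswith('/'):          # same address
        return 'https://en.wikipedia.org' + url
    # relative path: replace everything after the base's last '/'
    return base[:base.rindex('/') + 1] + url

# All four forms resolve to the same full URL:
print(resolve('https://en.wikipedia.org/w/api.php?action=rsd'))
print(resolve('//en.wikipedia.org/w/api.php?action=rsd'))
print(resolve('/w/api.php?action=rsd'))
print(resolve('api.php?action=rsd'))
```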
+ Task: open all URLs that end with `html` in a Wikipedia page.
+ `import webbrowser` then use `webbrowser.open(URL)` to open the page
+ Task: Download an image from a webpage.
+ Hint: reuse [previous sample code](lec10-2.py)
+ Hint: find `<img` in `result.text`. You should discover a URL attached to a nearby `src=` attribute.
+ Bonus: Download 3 random images from a webpage
+ Bonus: Download all identifiable images from a webpage
```python3=
import requests, webbrowser, time
result = requests.get('https://en.wikipedia.org/wiki/Basketball')
tokens = result.text.split('"')
tokens = [x for x in tokens if x.startswith('http')]
tokens = [x for x in tokens if x.endswith('html')]
for url in tokens[:5]:
    print(url)
    webbrowser.open(url)
    time.sleep(2)
```
```python3=
import requests, webbrowser, time
formats = ['jpg','ico','png','gif']
result = requests.get('https://en.wikipedia.org/wiki/Basketball')
tokens = result.text.split('<img')
tokens = [x.split('src=')[1] for x in tokens[1:] if 'src=' in x]
tokens = [x.split('"')[1] for x in tokens if x.startswith('"')]
tokens = [x for x in tokens if any(x.lower().endswith(y) for y in formats)]
for url in tokens[:5]:
    print(url)
    if url.startswith('//'):
        webbrowser.open('https:'+url)
    elif url.startswith('/'):
        webbrowser.open('https://en.wikipedia.org'+url)
    elif '://' in url and url.startswith('http'):
        webbrowser.open(url)
    elif '://' not in url:
        webbrowser.open('https://en.wikipedia.org/wiki/'+url)
    time.sleep(2)
```
### Beautiful Soup
+ `from bs4 import BeautifulSoup` and using `BeautifulSoup` versus `import bs4` and using `bs4.BeautifulSoup`
+ Parsing HTML: the real webpages
+ View source!
+ [Example code 1](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-1.py)
+ [HTML](https://mzshieh.github.io/snp2016/html/1.html)
+ `find(tag)`
+ Get a `Tag` from the html
+ Try `type(div)`
+ [Example code 2](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-2.py)
+ [HTML](https://mzshieh.github.io/snp2016/html/2.html)
+ Get information from `Tag`: use `.get`
+ `.get` is safer than `[]`
+ [Example code 3](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-3.py)
+ [HTML](https://mzshieh.github.io/snp2016/html/4.html)
+ `find_all` returns a list
+ CSS selector: `select(tag)`
+ Get all `tag`s
+ Return a list
+ [Example code 4](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-4.py)
+ [HTML](https://mzshieh.github.io/snp2016/html/3.html)
+ Advanced `find` usage
+ Accessing attributes
+ keyword argument
+ dictionary
+ Try `soup.select('div.find_by_class')`
+ [Example code 5](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-5.py)
+ [HTML](https://mzshieh.github.io/snp2016/html/5.html)
+ Try to use `select` to replace `find`
+ Redo Task: open five webpages
+ Tag: `a`
+ Attribute: `href`
+ [Sample code](lec11-6.py)
+ `find_all(tag)` returns a list
+ CSS selector: `select(tag)` is similar to `find_all`: it gets all `tag`s and returns a list
+ Special attribute: `class` and `id`
+ Use `requests` to get `https://tw.news.yahoo.com/sports/` and cook `soup`
+ For `class`, use `.`: try `soup.select('div.Z(10).Pos(a).Lh(32px)')`
+ For `id`, use `#`: try `soup.select('div#mrt-node-Col1-1-Her')`
+ Redo Task: open five webpages
+ Tag: `a`
+ Attribute: `href`
+ `endswith('html')`
+ `import webbrowser`
+ Task: Open five images in browser
+ Tag: `img`
+ Attribute: `src`
+ Task: Save five images from some webpage
+ Task: Crawl the [Yahoo Sport News](https://tw.news.yahoo.com/sports/)
+ Find five news stories and pick out the first sentence of each report.
+ Hint: `<p type="text">`
+ Bonus: grab the images attached to the reports.
Reference code that prints all image sources
```python3=
import requests
from bs4 import BeautifulSoup as Soup
res = requests.get('https://en.wikipedia.org/wiki/Basketball')
soup = Soup(res.text,'lxml')
images = soup.find_all('img')
for tag in images:
    if tag.get('src') is not None:
        print(tag['src'])
```
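The `find`, `get`, `find_all`, and `select` calls above can be tried offline on a small hand-written HTML snippet. The tags and classes below are made up, and the parser is the built-in `html.parser` instead of `lxml`:

```python
from bs4 import BeautifulSoup

html = '''
<div class="story" id="first">
  <a href="one.html">one</a>
  <a href="two.html">two</a>
</div>
<div class="story"><a>no href here</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('div')                  # the first <div> Tag
print(type(tag))
print(tag.get('id'))                    # 'first'
print(tag.get('missing'))               # None; .get is safer than tag['missing']

print(len(soup.find_all('a')))          # all three <a> tags
print(len(soup.select('div.story')))    # class selector: both <div>s
print(len(soup.select('div#first a')))  # id selector: <a>s inside div#first

urls = [a.get('href') for a in soup.find_all('a') if a.get('href') is not None]
print(urls)                             # only the <a>s that have an href
```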
## Project 3
### [ballfight](https://github.com/sunset1995/ballfight/blob/master/install/ballfight-c9.md) head-to-head battle
- Requirements
- Your AI and your teammate's AI need to beat `耍廢無罪` + `魯蛇`
- If you don't have a teammate, you can play both roles yourself: in c9.io, open two copies of hero.py, change the third argument (name) of `api.play` to two different values, and run both at the same time
- Right-click hero.py and copy-paste it to produce two heroes


- Rename one of the heroes

- Run both at the same time

- You will then see two AIs in the arena

- [Discussion board](https://hackmd.io/BwYwrADGBsBMCGBaAZiAprRAWW1mICNk0xFRosJktkBOCYYIA===)