Scratch and Python 2017 Summer - Python Lecture 4
===

[Match pairings](https://hackmd.io/s/BkDQD-svZ)
[Ask questions here](https://app2.sli.do/event/v9vez7ai/ask)

## String Processing and Webpages

### `requests`: a browser (?)

+ A website has different looks
    + While its content is being transferred, it looks like a sequence of bytes
    + A sequence of bytes? A sequence of characters? A string.
    + We need a different pair of eyes
+ `import requests`: don't forget the `s`
+ URL: Uniform Resource Locator
    + That term is a mouthful, so let's just call it a web address
+ `result = requests.get('http://www.nctu.edu.tw/')`
    + The URL is long, so try copy-and-paste
+ `result.text`
    + The content really is a string: `type(result.text)`
    + String processing
        + `str`
        + `import re`: regular expressions
+ `result.raise_for_status()`
    + Fetching a webpage involves many steps; if the URL were wrong, would your browser simply crash on you?
    + It is OK to raise an exception, but always writing `try`-`except` blocks is tedious.
        + Java is tedious: almost always `try` or `throws` (`raise` in Python)
    + `raise_for_status()`: if there is any pending exception, raise it now.
+ Save the content
    + Open a file to save it: `the_file = open('a_name.html', 'wb')`
        + Filename: `a_name.html`
        + Mode: `wb` means "write binary"
            + `the_file.write(chunk)`: write a chunk of bytes
        + Mode: `wt` means "write text"
            + `print(some_str, file=the_file)`: write a string
    + `for chunk in result.iter_content(102400):` to iterate over 102400-byte chunks of `result`
    + Remember to close the file: `the_file.close()`
+ Sample code
```python3=
import requests
url = input('Input URL: ')
result = requests.get(url)
result.raise_for_status()
name = input('Input filename: ')
file = open(name, 'wb')
for chunk in result.iter_content(102400):
    file.write(chunk)
file.close()
```
+ Sample code 2: `with`-block (probably everybody forgets to close a file now and then.)
```python3=
import requests
url = input('Input URL: ')
result = requests.get(url)
result.raise_for_status()
name = input('Input filename: ')
with open(name, 'wb') as file:
    for chunk in result.iter_content(102400):
        file.write(chunk)
```
+ Try to download some images with the sample code.
```python3=
import requests

def url_to_file(url, filename):
    result = requests.get(url)
    result.raise_for_status()
    with open(filename, 'wb') as file:
        for chunk in result.iter_content(102400):
            file.write(chunk)
```

### String processing

+ Escape characters: [Table](https://automatetheboringstuff.com/chapter6/#calibre_link-40)
+ Multiline strings: triple quotes `'''`
+ Integers, characters and strings
    + `chr`
        + try `chr(65)`
    + `ord`
        + try `ord('A')`
    + Python uses `str`s of length 1 to represent characters.
+ Indexing:
    + `print('123'[0])`
    + `print('123'[-1])`
    + `index()`
        + `print('123123123'.index('312'))`
+ Slicing:
    + `print('abcde'[2:])`
    + `print('abcde'[:2])`
    + `print('abcde'[1:3])`
+ `in` and `not in`
    + `print('abc' in 'zabcy')`
    + `print('gg' not in 'zabcy')`
+ `upper()` and `lower()`
+ `isupper()`, `islower()`, `isdecimal()`, `isalpha()`, `isalnum()`
    + try `'123.0'.isdecimal()`
+ `join(list_str)`
    + try `', '.join([str(i) for i in range(10)])`
+ `split(string)`
    + try `'<a href="https://automatetheboringstuff.com/">'.split('"')`
+ `startswith(string)` and `endswith(string)`
    + Many URLs start with `http` and end with `html`
+ `strip()`, `rstrip()`, `lstrip()`
+ URL: shorter and shorter ways of writing the same address (see the sketch right after this list)
    + Almost everything is specified: `https://en.wikipedia.org/w/api.php?action=rsd`
    + Same protocol: `//en.wikipedia.org/w/api.php?action=rsd`
    + Same address: `/w/api.php?action=rsd`
    + Relative path: `api.php?action=rsd`
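+ Not in the lecture code (which handles these prefixes by hand in a later sample): the standard library's `urllib.parse.urljoin` can resolve all four forms against the page they appear on. A minimal sketch:
```python3=
# Resolve the four URL forms against a base page. urljoin handles the
# '//' (same protocol), '/' (same address), and relative cases for us.
from urllib.parse import urljoin

base = 'https://en.wikipedia.org/w/api.php'
links = ['https://en.wikipedia.org/w/api.php?action=rsd',  # fully specified
         '//en.wikipedia.org/w/api.php?action=rsd',        # same protocol
         '/w/api.php?action=rsd',                          # same address
         'api.php?action=rsd']                             # relative path
for link in links:
    print(urljoin(base, link))  # all four print the same absolute URL
```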
+ Task: open all URLs which end with `html` in a Wikipedia page.
    + `import webbrowser`, then use `webbrowser.open(URL)` to open the page
+ Task: download an image from a webpage.
    + Hint: reuse the [previous sample code](lec10-2.py)
    + Hint: find `<img` in `result.text`. You should discover a URL attached to a `src=` attribute nearby.
    + Bonus: download 3 random images from a webpage
    + Bonus: download all identifiable images from a webpage
```python3=
import requests, webbrowser, time

result = requests.get('https://en.wikipedia.org/wiki/Basketball')
tokens = result.text.split('"')
tokens = [x for x in tokens if x.startswith('http')]
tokens = [x for x in tokens if x.endswith('html')]
for url in tokens[:5]:
    print(url)
    webbrowser.open(url)
    time.sleep(2)
```
```python3=
import requests, webbrowser, time

formats = ['jpg', 'ico', 'png', 'gif']
result = requests.get('https://en.wikipedia.org/wiki/Basketball')
tokens = result.text.split('<img')
tokens = [x.split('src=')[1] for x in tokens[1:] if 'src=' in x]
tokens = [x.split('"')[1] for x in tokens if x.startswith('"')]
tokens = [x for x in tokens if any(x.lower().endswith(y) for y in formats)]
for url in tokens[:5]:
    print(url)
    if url.startswith('//'):
        webbrowser.open('https:' + url)
    elif url.startswith('/'):
        webbrowser.open('https://en.wikipedia.org' + url)
    elif '://' in url and url.startswith('http'):
        webbrowser.open(url)
    elif '://' not in url:
        webbrowser.open('https://en.wikipedia.org/wiki/' + url)
    time.sleep(2)
```

### Beautiful Soup

+ `from bs4 import BeautifulSoup` and using `BeautifulSoup`, versus `import bs4` and using `bs4.BeautifulSoup`
+ Parsing HTML: real webpages
    + View source!
+ [Example code 1](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-1.py) ([HTML](https://mzshieh.github.io/snp2016/html/1.html))
    + `find(tag)`
        + Gets a `Tag` from the HTML
        + Try `type(div)`
+ [Example code 2](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-2.py) ([HTML](https://mzshieh.github.io/snp2016/html/2.html))
    + Get information from a `Tag`: use `.get`
    + `.get` is safer than `[]` (see the self-contained sketch at the end of this section)
+ [Example code 3](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-3.py) ([HTML](https://mzshieh.github.io/snp2016/html/4.html))
    + `find_all` returns a list
    + CSS selector: `select(tag)`
        + Gets all `tag`s
        + Returns a list
+ [Example code 4](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-4.py) ([HTML](https://mzshieh.github.io/snp2016/html/3.html))
    + Advanced `find` usage
        + Accessing attributes
            + keyword argument
            + dictionary
    + Try `soup.select('div.find_by_class')`
+ [Example code 5](https://github.com/mzshieh/snp2017spring/blob/master/lec11/lec11-5.py) ([HTML](https://mzshieh.github.io/snp2016/html/5.html))
    + Try to use `select` to replace `find`
+ Redo Task: open five webpages
    + Tag: `a`
    + Attribute: `href`
    + [Sample code](lec11-6.py)
+ Recap
    + `find_all(tag)` returns a list
    + CSS selector: `select(tag)`
        + Gets all `tag`s
        + Returns a list
        + Similar to `find_all`
+ Special attributes: `class` and `id`
    + Use `requests` to get `https://tw.news.yahoo.com/sports/` and cook a `soup`
    + `class` uses `.`: try `soup.select('div.Z(10).Pos(a).Lh(32px)')`
    + `id` uses `#`: try `soup.select('div#mrt-node-Col1-1-Her')`
+ Redo Task: open five webpages
    + Tag: `a`
    + Attribute: `href`
    + `endswith('html')`
    + `import webbrowser`
+ Task: open five images in the browser
    + Tag: `img`
    + Attribute: `src`
+ Task: save five images from some webpage
+ Task: crawl [Yahoo Sports News](https://tw.news.yahoo.com/sports/)
    + Find five news stories and pick out the first sentence of each report (a possible sketch follows at the end of this section).
    + Hint: `<p type="text">`
    + Bonus: also grab the images attached to the reports.
+ Reference code that prints all image sources:
```python3=
import requests
from bs4 import BeautifulSoup as Soup

res = requests.get('https://en.wikipedia.org/wiki/Basketball')
soup = Soup(res.text, 'lxml')
images = soup.find_all('img')
for tag in images:
    if tag.get('src') is not None:
        print(tag['src'])
```
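+ Putting the pieces together: a minimal, self-contained sketch of `find`, `find_all`, `select`, and `.get`. The HTML here is made up, and it uses the built-in `html.parser` instead of `lxml` so nothing extra has to be installed.
```python3=
import bs4

# Hypothetical inline HTML, so the sketch runs without any network access.
html = '''<div class="find_by_class" id="main">
    <a href="https://example.com/a.html">first</a>
    <a href="/b.html">second</a>
</div>'''
soup = bs4.BeautifulSoup(html, 'html.parser')

div = soup.find('div')        # the first matching Tag
print(type(div))              # <class 'bs4.element.Tag'>
print(div.get('id'))          # 'main'
print(div.get('nope'))        # None, whereas div['nope'] would raise a KeyError
print(soup.find_all('a'))     # a list of every <a> Tag
print(soup.select('div.find_by_class'))  # class uses '.'
print(soup.select('div#main'))           # id uses '#'
```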
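+ And a possible starting point for the Yahoo Sports crawl task above. It assumes the hinted `<p type="text">` markup is actually present on the page; the layout may have changed, so treat this as a sketch rather than the solution.
```python3=
import requests, bs4

res = requests.get('https://tw.news.yahoo.com/sports/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# CSS attribute selector matching the hinted <p type="text"> paragraphs.
for p in soup.select('p[type="text"]')[:5]:
    text = p.get_text().strip()
    if text:
        # Keep the first sentence only; these reports end sentences with '。'.
        print(text.split('。')[0] + '。')
```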
## Project 3

### [ballfight](https://github.com/sunset1995/ballfight/blob/master/install/ballfight-c9.md) head-to-head

- Requirement
    - The AIs written by you and your teammate have to beat `耍廢無罪` + `魯蛇`
- If you have no teammate, you can play both roles yourself: in c9.io, open two copies of hero.py, change the third argument (`name`) of `api.play` so the two names differ, and run both at the same time
    - Right-click hero.py, then copy and paste to produce two heroes
    ![](https://i.imgur.com/TkOBqTV.png =250x) ![](https://i.imgur.com/M2AVEFx.png =250x)
    - Change the name of one of the heroes
    ![](https://i.imgur.com/a90UKGR.png)
    - Run both at the same time
    ![](https://i.imgur.com/6d4m982.png =250x)
    - You should then see two AIs in the arena
    ![](https://i.imgur.com/aaa47i3.png =250x)
- [Discussion board](https://hackmd.io/BwYwrADGBsBMCGBaAZiAprRAWW1mICNk0xFRosJktkBOCYYIA===)