
# Sprout C Class Stage 1 Major Assignment - Quote Crawler

[Homework 1 Slide](https://tw-csie-sprout.github.io/c2021/slides/homework1/)

Hi everyone in Sprout, and congratulations on almost making it through this stage of the course! We have prepared a major assignment for Stage 1 so that you can practice and check what you have learned so far. The topic this time is a Web Crawler! When people think of web crawlers, they usually think of Python plus a few packages, but this time we are going to write one in C! We have also written a Template for you that provides several handy functions and marks the places you need to implement with `Todo`. Without further ado, let's get started!

## Environment Setup

1. Sign up for a [Repl.it](https://replit.com) account
2. Open the [Template](https://replit.com/@jasonhsieh/Homework1-Template#main.cpp) for this assignment and click ![](https://i.imgur.com/Osykkwp.png =60x)
3. Start coding 🔥🔥

p.s. To come back and edit your code later, just find the forked project under My repls on repl.it.

## Homework Spec

The site to crawl is quotes.toscrape.com, part of toscrape.com, a site built for crawler beginners to practice on.

![](https://i.imgur.com/B4Sc0Hh.png)

The goal of this assignment is to crawl all the Quotes on this site and print them! (So you will need to work out how to move to the next page and when to stop!) The output format is as follows (using the first two quotes as an example):

```
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
--- Albert Einstein
[change, deep-thoughts, thinking, world]

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
--- J.K. Rowling
[abilities, choices]

"{quote}"
--- {author}
[{tags separated by comma}]
{an empty line after each quote}
```

The parts wrapped in curly braces are placeholders for you to fill in.

## Grading and Submission

- Before the deadline, paste the link to your fork into the major assignment submission area on Google Classroom, ex: https://replit.com/@somebody/someproject#main.cpp
- Grading is based on the last Revision in the Repl History (so do not edit anything after the deadline, or it will be penalized as a late submission! You can change whatever you like before the deadline, so feel free to submit the link on Google Classroom early!)
- Grading is mainly based on how complete the functionality is. Late submissions receive the original score * 0.8. Just do your best to complete every part!
- The graded items are split into basic and bonus. The bonus part is optional. The basic part is worth up to 100 points and the bonus part up to 40 points, which only count as extra credit. The total score is min(100, basic + bonus) * (1 or 0.8)
- The bonus part must be implemented separately in bonus.cpp
- Plagiarism is scored as 0!

**Basic Part grading items**

- 25pts: quote
- 25pts: author
- 20pts: tags
- 20pts: crawling all quotes completely (graded by completeness)
- 10pts: output in the format required by the spec (quote, author, and tags are all required)

The last 30 points will be graded by an automated test script that we write, so make sure your output follows the format in the spec!

**Bonus Part grading items**

- 20pts: creativity
- 20pts: completeness of the implementation

The Bonus Part is graded entirely by hand. Please describe your bonus (what you wanted to do and what you did) in bonus.md.

## Handy Functions

```cpp
// Crawl the page specified by url[]. The response HTML is stored in the
// global buffer[], and the HTTP status code (e.g. 200, 404) is returned.
int requestPage(const char url[]);

// Parse html into the global CDocument object doc as a DOM tree. Remember
// to call this once on a given HTML before calling stripContent() on it.
void parseHtml(const char html[]);

// Extract the content of the elements that match the given CSS selector
// from the most recently parsed HTML. The extracted content is stored in
// the global char array stripped[]. Returns 0 on success, or a nonzero
// value if no node matches.
int stripContent(const char selector[]);
```

## Schedule

| Time                 | Event                |
| -------------------- | -------------------- |
| 2021/4/10 Sat. 12:00 | Assignment announced |
| 2021/4/30 Fri. 23:59 | Deadline             |
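
To make the page-turning and stop-condition idea from the Homework Spec more concrete, here is a minimal sketch of one possible control flow, together with a helper that builds the output text. It is not the Template's code, and several things in it are assumptions: the three functions and the globals are replaced by offline stubs that pretend the site has two pages, the `/page/N/` URL pattern and the `.quote` selector are guesses about the site that you should verify yourself, `kMaxPages`, `crawlAllPages()`, and `formatQuote()` are hypothetical names, and the one-field-per-line layout is our reading of the spec. The sketch also assumes the quote text already carries its own quotation marks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// ---- Offline stubs (assumptions, NOT the Template's real implementation) ----
// They only mimic the documented signatures so the loop below can run.
char buffer[256];
char stripped[256];
static const int kFakePageCount = 2;  // hypothetical: the fake site has 2 pages
static int fakeRequests = 0;          // number of pages requested so far

int requestPage(const char url[]) {
    std::snprintf(buffer, sizeof(buffer), "<html>%s</html>", url);
    ++fakeRequests;
    return 200;  // the stub always answers 200
}

void parseHtml(const char html[]) { (void)html; }

int stripContent(const char selector[]) {
    (void)selector;
    if (fakeRequests > kFakePageCount) return 1;  // past the end: no match
    std::snprintf(stripped, sizeof(stripped), "quote from page %d", fakeRequests);
    return 0;
}

// ---- Sketch of the crawl loop ----
static const int kMaxPages = 1000;  // hypothetical safety cap against endless loops

// Requests page 1, 2, 3, ... and stops at the first page that either does not
// answer 200 or contains nothing matching the selector. Returns the number of
// pages on which something was found.
int crawlAllPages() {
    int pagesWithQuotes = 0;
    for (int page = 1; page <= kMaxPages; ++page) {
        char url[128];
        // Assumed URL pattern; check it against the real site.
        int n = std::snprintf(url, sizeof(url),
                              "http://quotes.toscrape.com/page/%d/", page);
        assert(n > 0 && n < static_cast<int>(sizeof(url)));
        (void)n;
        if (requestPage(url) != 200) break;      // request failed
        parseHtml(buffer);                       // parse before stripping
        if (stripContent(".quote") != 0) break;  // assumed selector; no match
        ++pagesWithQuotes;
    }
    return pagesWithQuotes;
}

// ---- Sketch of the output format ----
// Builds "{quote}\n--- {author}\n[{tag, tag, ...}]\n\n" for one quote.
std::string formatQuote(const std::string& quote, const std::string& author,
                        const std::vector<std::string>& tags) {
    std::string out = quote + "\n--- " + author + "\n[";
    for (std::size_t i = 0; i < tags.size(); ++i) {
        if (i > 0) out += ", ";  // separator goes between tags only
        out += tags[i];
    }
    out += "]\n\n";  // close the tag list, then leave one empty line
    return out;
}
```

The loop checks both the status code and the return value of `stripContent()`, because the spec does not say which of the two signals the end of the site. In a real solution you would drop the stubs, use the Template's functions, and print `formatQuote(...)` for every quote you extract.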