# HW1 Crawler by Postman

###### tags: `HW1`

In this assignment, you will learn to build a crawler that collects data from social media (e.g., Reddit, Facebook, Instagram, PTT, Dcard, etc.). Your data should be relevant to your final project, so first discuss the project topic with your teammates, then select the social media platform you need and crawl your own data.

Requirements:
1. Python programming language only.
2. Any library is allowed (Scrapy, BeautifulSoup, Selenium, etc.).

Grading:
1. Complete crawler source code. (50%)
2. 100 sample data records. (40%)
3. A brief description of your project topic and the connection between the data and the project. (10%)

Submission rules:
Please pack your source_code.py, sample_data.xml/json, and report.docx into teamid_hw1, and upload it to the new ee-class. (Check your team ID in the '小組專區' (group area) section, which will be published on 3/4.)

* If you have no experience writing crawlers and don't know where to start, please contact the TA directly (phoenix000.taipei@gmail.com).

## Crawler for Twitter

### Target

* Our target is to build a graph for stance detection, using a benchmark dataset for training and testing.
* Get each `tweet_id` and its target `stance` from the `acl2020-wtwt-tweets` dataset, then use `Postman` to send requests and retrieve the content of each tweet.

### Dataset

#### [acl2020-wtwt-tweets](https://github.com/cambridge-wtwt/acl2020-wtwt-tweets)

Will-They-Won't-They (WT-WT) is a large dataset of English tweets targeted at stance detection for the rumor verification task. The dataset is constructed from tweets that discuss five recent merger and acquisition (M&A) operations of US companies, mainly from the healthcare sector.

#### Preprocessing the dataset

Preprocess `wtwt_ids.json` to get the `tweet_id`s in the proper format to feed into Postman.

* preprocess.py (get 100 tweet_ids)

```python=
import json

# Load the WT-WT annotations; each entry carries a tweet_id and a stance label.
with open('wtwt_ids.json') as jsonfile:
    data = json.load(jsonfile)

tweet_id = []
stance = []
for tweet in data:
    tweet_id.append(str(tweet['tweet_id']))
    stance.append(tweet['stance'])

# Join the first 100 IDs into the comma-separated string Postman expects.
request_ids = ','.join(tweet_id[:100])
print(request_ids)
```

### Tools: Twitter API v2 + Postman

#### Twitter API

##### Sign up for a Twitter developer account and get the API keys

[Twitter developer](https://developer.twitter.com/en/docs/twitter-api)

* Copy all the keys; they will be used later.

![](https://i.imgur.com/SSfyNHu.png)

#### Postman

##### Step 1: Sign up for Postman and fork Twitter's API workspace collection into your own workspace

[Twitter's API workspace](https://www.postman.com/twitter/workspace/twitter-s-public-workspace/collection/9956214-784efcda-ed4c-4491-a4c0-a26470a67400?ctx=documentation)

![](https://i.imgur.com/x6FBquF.png)

##### Step 2: Add keys and tokens as environment variables

* Select the `twitter api v2` collection > select the `Multiple tweets` endpoint > Authorization.
* Change the authorization type to `Bearer token` and paste the Bearer token you just copied from the Twitter developer platform.

![](https://i.imgur.com/K88ULQ8.png)

##### Step 3: Add values on the Params tab

Example parameters:

* `ids`: Required. Up to 100 comma-separated tweet IDs (the string printed by preprocess.py).
* `tweet.fields`: Extra fields to return for each tweet, e.g. `author_id` (ID of the tweet's author) and `context_annotations` (contextual annotations attached to the tweet). The tweet's `text` is included in the response by default.

![](https://i.imgur.com/n2f4rdj.png)

##### Step 4: Send the request and review the response

Press the Send button and inspect the response payload shown below.

![](https://i.imgur.com/27yp3Jw.png)

### Reference

https://developer.twitter.com/en/docs/tutorials/postman-getting-started
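### Appendix: Scripting the same request in Python

Postman is convenient for exploring the endpoint, but the assignment asks for a `source_code.py`, so the request eventually has to run from Python. Below is a minimal sketch of the same `Multiple tweets` lookup using the `requests` library; it assumes your Bearer token is exported in a `BEARER_TOKEN` environment variable (a name chosen here for illustration, not mandated by Twitter) and reuses `wtwt_ids.json` from the preprocessing step.

```python
import json
import os

import requests

# GET /2/tweets is the endpoint behind Postman's "Multiple tweets" request.
ENDPOINT = 'https://api.twitter.com/2/tweets'


def fetch_tweets(ids):
    """Look up tweets by a comma-separated string of up to 100 IDs."""
    headers = {'Authorization': 'Bearer ' + os.environ['BEARER_TOKEN']}
    params = {
        'ids': ids,
        'tweet.fields': 'author_id,context_annotations',
    }
    response = requests.get(ENDPOINT, headers=headers, params=params)
    response.raise_for_status()
    return response.json()


if __name__ == '__main__':
    # Rebuild the comma-separated ID string exactly as preprocess.py does.
    with open('wtwt_ids.json') as f:
        data = json.load(f)
    request_ids = ','.join(str(t['tweet_id']) for t in data[:100])

    result = fetch_tweets(request_ids)

    # Save the raw response as the 100-sample submission file.
    with open('sample_data.json', 'w') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
```

Note that deleted or protected tweets come back under an `errors` key rather than `data`, so the response may contain fewer than 100 tweets; if so, simply feed the script (or Postman) the next batch of IDs.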
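For the "connection between data and project" part of the report, the WT-WT stance labels and the crawled text can be tied back together through the tweet ID. A possible sketch, assuming `sample_data.json` was written by the script above:

```python
import json

# Map every tweet_id in the WT-WT annotations to its stance label.
with open('wtwt_ids.json') as f:
    stance_of = {str(t['tweet_id']): t['stance'] for t in json.load(f)}

# The v2 lookup returns the tweets it found under the "data" key.
with open('sample_data.json') as f:
    tweets = json.load(f).get('data', [])

# Pair each crawled tweet with its stance label for the stance-detection graph.
labeled = [
    {'tweet_id': t['id'], 'text': t['text'], 'stance': stance_of.get(t['id'])}
    for t in tweets
]
print(len(labeled), 'labeled tweets')
```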