# HW1: Crawler
- Group: No.9
- Team leader: 王霈玄(110522127)
- Team members: 彭彥霖(109526011)、何名曜(110522110)
## Describe our project topic
### Target
- Our target is to build a graph for stance detection, using a benchmark dataset for training and testing.
- Get each `tweet_id` and its target `stance` from the `acl2020-wtwt-tweets` dataset, then combine the `twitter api` with `python` and `Postman` to send requests and retrieve the text of each tweet.
### Dataset
#### [acl2020-wtwt-tweets](https://github.com/cambridge-wtwt/acl2020-wtwt-tweets)
Will-They-Won't-They (WT-WT) is a large dataset of English tweets targeted at stance detection for the rumor verification task. The dataset is constructed based on tweets that discuss five recent merger and acquisition (M&A) operations of US companies, mainly from the healthcare sector.
We can download this dataset from [here](https://raw.githubusercontent.com/cambridge-wtwt/acl2020-wtwt-tweets/master/wtwt_ids.json).
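For a quick look at the data, here is a minimal sketch of downloading and loading the ID file (the field names below match the samples shown in the results later):
```python=
import json
import requests

# Download the tweet-ID file from the WT-WT repository
url = "https://raw.githubusercontent.com/cambridge-wtwt/acl2020-wtwt-tweets/master/wtwt_ids.json"
resp = requests.get(url)
resp.raise_for_status()
data = resp.json()

# Each record holds a tweet_id, the M&A operation (merger), and the stance label
print(len(data))
print(data[0])  # e.g. {"tweet_id": "...", "merger": "CI_ESRX", "stance": "support"}
```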
## Method
- We provide three methods to get the text of a tweet.
1. `source_code1.py`: use `python` and `pytwitter` (python tool)
2. `source_code2.py`: use `python` and `Postman` (API tool)
3. `source_code3.py`: use `python` and `requests` (python tool)
### Method 1. Use `python` and `pytwitter` (python tool)
#### Description
We read the tweet_ids from the WT-WT dataset, then pass them to `pytwitter`, a Python wrapper for the Twitter API, to get the text of each tweet.
#### Code
```python=
import pandas as pd
import json
from pandas import json_normalize
import numpy as np
from pytwitter import Api

# We cannot provide our Twitter API keys here
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
bearer_token = ''

# Load the WT-WT ids/labels and flatten them into a DataFrame
with open('wtwt_ids.json', newline='') as jsonfile:
    data = json.load(jsonfile)
df = json_normalize(data)

# Map each merger to an index
merger_id_dict = {}
for i, merger in enumerate(df.merger.unique()):
    merger_id_dict[merger] = i

# Take some samples (the first 30 tweets of each merger)
take_some_samples = []
for merger in df.merger.unique():
    take_some_samples.append(df[df['merger'] == merger].head(30))
new_df = pd.concat(take_some_samples).reset_index(drop=True)  # drop the old index

# Use the Twitter API through pytwitter.
# Two ways to initialize the client: app-only auth with the bearer token,
# or user-context auth with the consumer/access keys (the second assignment is used below)
api = Api(bearer_token=bearer_token)
api = Api(
    consumer_key=consumer_key,
    consumer_secret=consumer_secret,
    access_token=access_token,
    access_secret=access_token_secret
)

def get_tweet(tweet_id):
    try:
        query_result = api.get_tweet(str(tweet_id), return_json=False)
        # print('[!] preparing for {}. Success!'.format(tweet_id))
    except Exception as e:
        print('[!] Preparing for {}. Failed!!!'.format(tweet_id))
        return np.nan
    return query_result.data.text

# Fetch the text of every sampled tweet, then drop the rows whose lookup failed (NaN)
new_df['tweet_text'] = new_df.apply(lambda x: get_tweet(x['tweet_id']), axis=1)
new_df = new_df.dropna(axis=0)

# Pandas to JSON
new_json = new_df.to_json(orient="records")
parsed = json.loads(new_json)
with open('sample_data.json', 'w') as fp:
    json.dump(obj=parsed, fp=fp, indent=4)
```
#### Result
```
[
    {
        "tweet_id": "971761970117357568",
        "merger": "CI_ESRX",
        "stance": "support",
        "tweet_text": "Cigna and ESI set to merge. Here we go..."
    },
    ...
]
```
### Method 2. Use `python` and `Postman` (API tool)
#### Code
Preprocess `wtwt_ids.json` to get the `tweet_id`s in the proper format (a comma-separated string) to feed into Postman:
```python=
import json

with open('wtwt_ids.json', newline='') as jsonfile:
    data = json.load(jsonfile)

# Collect the tweet ids and stance labels
tweet_id = []
stance = []
for tweet in data:
    tweet_id.append(tweet['tweet_id'])
    stance.append(tweet['stance'])

# Join the first 100 ids into one comma-separated string
# (the tweets lookup endpoint accepts at most 100 ids per request)
request_id = []
for i, id in enumerate(tweet_id):
    request_id.append(id + ',')
    if i == 99:
        break
str1 = ''.join(request_id).rstrip(',')
print(str1)
```
#### Twitter API
##### Sign up as a Twitter developer and get the API keys
[Twitter developer](https://developer.twitter.com/en/docs/twitter-api)
* Copy all the keys; they will be used later (see the sketch below for loading them from environment variables)
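Since the keys cannot be committed to the repository (the key strings in `source_code1.py` are left empty), one option is to read them from environment variables. A minimal sketch, with assumed variable names:
```python=
import os

# Assumed variable names; export them in the shell before running, e.g.
#   export TWITTER_BEARER_TOKEN="AAAA..."
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
```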

#### Postman
##### Step 1: Sign up for Postman and fork Twitter's API workspace into your own workspace collection
[twitter's API workspace](https://www.postman.com/twitter/workspace/twitter-s-public-workspace/collection/9956214-784efcda-ed4c-4491-a4c0-a26470a67400?ctx=documentation)

##### Step 2: Add keys and tokens as environmental variables
* Select collection `twitter api v2` > Select endpoint `Multiple tweets` > Authorization
* Change the authorization type to `Bearer Token` and paste the Bearer token just copied from the `Twitter developer platform`

##### Step 3: Add values to the Params tab
Example parameters:
* `ids`: Required. Enter up to 100 comma-separated Tweet IDs (for example, the string printed by the preprocessing script above).
* `tweet.fields`: Optional. Tweet fields to return, such as `text` (text of the tweet), `author_id` (ID of the tweet's author), and `context_annotations` (contextual annotations for the tweet).

##### Step 4: Send request and review response
Press the Send button and review the response in the payload below.
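Once the response comes back, the payload can be saved as a file (e.g. `postman_response.json`, a name assumed here) and joined back to the stance labels from the dataset. A rough sketch:
```python=
import json

# Assumed file names: the response body saved from Postman, plus the original ID file
with open('postman_response.json') as fp:
    response = json.load(fp)
with open('wtwt_ids.json') as fp:
    wtwt = {t['tweet_id']: t for t in json.load(fp)}

# The v2 tweets lookup endpoint returns the tweets under the "data" key
samples = []
for tweet in response['data']:
    labeled = wtwt[tweet['id']]
    samples.append({
        'tweet_id': tweet['id'],
        'merger': labeled['merger'],
        'stance': labeled['stance'],
        'tweet_text': tweet['text'],
    })
print(samples[0])
```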

### Reference
https://developer.twitter.com/en/docs/tutorials/postman-getting-started
### Method 3. Use `python` and `requests` (python tool)
Use the user's key to send requests to the Twitter API; at most 100 tweets (by `tweet_id`) can be queried per request.
#### Code
```python=
import requests
import json
import pandas as pd
from pandas import json_normalize

bearer_token = "YOUR_BEARER_TOKEN"

# Load the WT-WT ids/labels into a DataFrame (same as in method 1)
with open('wtwt_ids.json', newline='') as jsonfile:
    df = json_normalize(json.load(jsonfile))

def create_url(ids):
    # Tweet fields are adjustable.
    # Options include:
    # attachments, author_id, context_annotations,
    # conversation_id, created_at, entities, geo, id,
    # in_reply_to_user_id, lang, non_public_metrics, organic_metrics,
    # possibly_sensitive, promoted_metrics, public_metrics, referenced_tweets,
    # source, text, and withheld
    tweet_fields = "tweet.fields=lang,author_id,source"
    # You can adjust ids to include a single Tweet,
    # or add up to 100 comma-separated IDs
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url

def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2TweetLookupPython"
    return r

def connect_to_endpoint(url):
    response = requests.request("GET", url, auth=bearer_oauth)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
    return response.json()

tweets_found = []
for idx in range(5):
    # Add 100 ids to the query string
    data_train = df[idx*100:(idx+1)*100]
    ids = 'ids=' + ",".join(str(id) for id in data_train['tweet_id'])
    # Query with requests
    url = create_url(ids)
    json_response = connect_to_endpoint(url)
    tweets_found = tweets_found + json_response['data']

# Join the returned tweets with the ids/labels
data_train = df.rename(columns={'tweet_id': 'id'}).merge(json_normalize(tweets_found), on='id')

# Output data
new_json = data_train.to_json(orient="records")
parsed = json.loads(new_json)
with open('sample_data.json', 'w') as fp:
    json.dump(obj=parsed, fp=fp, indent=4)
```
#### Result
```
[
    {
        "id": "971761970117357568",
        "merger": "CI_ESRX",
        "stance": "support",
        "author_id": "3118851863",
        "source": "Twitter for iPhone",
        "lang": "en",
        "text": "Cigna and ESI set to merge. Here we go..."
    },
    ...
]
```
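Since the end goal is stance detection, the crawled `sample_data.json` can then be loaded back as a DataFrame for training and testing. A minimal sketch:
```python=
import pandas as pd

# Load the crawled samples produced by method 1 or method 3
df = pd.read_json('sample_data.json', orient='records')

# Quick sanity check: stance label distribution per merger
print(df.groupby(['merger', 'stance']).size())
```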