# HW1: Crawler
- Group: No.9
- Team leader: 王霈玄(110522127)
- Team members: 彭彥霖(109526011)、何名曜(110522110)
## Describe our project topic
### Target
- Our target is to build a graph for stance detection, using a benchmark dataset for training and testing.
- Get each `tweet_id` and its target `stance` from the `acl2020-wtwt-tweets` dataset, then combine the `twitter api` with `python` and `Postman` to send requests and retrieve the text of each tweet.
### Dataset
#### [acl2020-wtwt-tweets](https://github.com/cambridge-wtwt/acl2020-wtwt-tweets)
Will-They-Won't-They (WT-WT) is a large dataset of English tweets targeted at stance detection for the rumor verification task. The dataset is constructed based on tweets that discuss five recent merger and acquisition (M&A) operations of US companies, mainly from the healthcare sector.
We can download this dataset from [here](https://raw.githubusercontent.com/cambridge-wtwt/acl2020-wtwt-tweets/master/wtwt_ids.json).
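For a quick look at the data, here is a minimal sketch of downloading and loading the ID file (the field names below match the samples shown in the results later):
```python=
import json
import requests

# Download the tweet-ID file from the WT-WT repository
url = "https://raw.githubusercontent.com/cambridge-wtwt/acl2020-wtwt-tweets/master/wtwt_ids.json"
resp = requests.get(url)
resp.raise_for_status()
data = resp.json()

# Each record holds a tweet_id, the M&A operation (merger), and the stance label
print(len(data))
print(data[0])  # e.g. {"tweet_id": "...", "merger": "CI_ESRX", "stance": "support"}
```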
## Method
- We provide three methods to get the text of a tweet.
1. `source_code1.py`: use `python` and `pytwitter` (python tool)
2. `source_code2.py`: use `python` and `Postman` (API tool)
3. `source_code3.py`: use `python` and `requests` (python tool)
### Method 1. Use `python` and `pytwitter` (python tool)
#### Description
We read the tweet_ids from the WT-WT dataset, then pass them to `pytwitter`, a Python wrapper for the Twitter API, to get the text of each tweet.
#### Code
```python=
import pandas as pd
import json
from pandas import json_normalize
import numpy as np
from pytwitter import Api

# We cannot provide our Twitter API keys here
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
bearer_token = ''

# Load the WT-WT ids/labels and flatten them into a DataFrame
with open('wtwt_ids.json', newline='') as jsonfile:
    data = json.load(jsonfile)
df = json_normalize(data)

# Map each merger to an index
merger_id_dict = {}
for i, merger in enumerate(df.merger.unique()):
    merger_id_dict[merger] = i

# Take some samples (the first 30 tweets of each merger)
take_some_samples = []
for merger in df.merger.unique():
    take_some_samples.append(df[df['merger'] == merger].head(30))
new_df = pd.concat(take_some_samples).reset_index(drop=True)  # drop the old index

# Use the Twitter API through pytwitter.
# Two ways to initialize the client: app-only auth with the bearer token,
# or user-context auth with the consumer/access keys (the second assignment is used below)
api = Api(bearer_token=bearer_token)
api = Api(
    consumer_key=consumer_key,
    consumer_secret=consumer_secret,
    access_token=access_token,
    access_secret=access_token_secret
)

def get_tweet(tweet_id):
    try:
        query_result = api.get_tweet(str(tweet_id), return_json=False)
        # print('[!] preparing for {}. Success!'.format(tweet_id))
    except Exception as e:
        print('[!] Preparing for {}. Failed!!!'.format(tweet_id))
        return np.nan
    return query_result.data.text

# Fetch the text of every sampled tweet, then drop the rows whose lookup failed (NaN)
new_df['tweet_text'] = new_df.apply(lambda x: get_tweet(x['tweet_id']), axis=1)
new_df = new_df.dropna(axis=0)

# Pandas to JSON
new_json = new_df.to_json(orient="records")
parsed = json.loads(new_json)
with open('sample_data.json', 'w') as fp:
    json.dump(obj=parsed, fp=fp, indent=4)
```
#### Result
```
[
    {
        "tweet_id": "971761970117357568",
        "merger": "CI_ESRX",
        "stance": "support",
        "tweet_text": "Cigna and ESI set to merge. Here we go..."
    },
    ...
]
```
### Method 2. Use `python` and `Postman` (API tool)
#### Code
Preprocess `wtwt_ids.json` to get the `tweet_id`s in the proper format (a comma-separated string) to feed into Postman:
```python=
import json

with open('wtwt_ids.json', newline='') as jsonfile:
    data = json.load(jsonfile)

# Collect the tweet ids and stance labels
tweet_id = []
stance = []
for tweet in data:
    tweet_id.append(tweet['tweet_id'])
    stance.append(tweet['stance'])

# Join the first 100 ids into one comma-separated string
# (the tweets lookup endpoint accepts at most 100 ids per request)
request_id = []
for i, id in enumerate(tweet_id):
    request_id.append(id + ',')
    if i == 99:
        break
str1 = ''.join(request_id).rstrip(',')
print(str1)
```
#### Twitter API
##### Sign up as a Twitter developer and get the API keys
[Twitter developer](https://developer.twitter.com/en/docs/twitter-api)
* Copy all the keys; they will be used later (see the sketch below for loading them from environment variables)
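Since the keys cannot be committed to the repository (the key strings in `source_code1.py` are left empty), one option is to read them from environment variables. A minimal sketch, with assumed variable names:
```python=
import os

# Assumed variable names; export them in the shell before running, e.g.
#   export TWITTER_BEARER_TOKEN="AAAA..."
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
```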

#### Postman
##### Step 1: Sign up for Postman and fork Twitter's API workspace into your own workspace collection
[twitter's API workspace](https://www.postman.com/twitter/workspace/twitter-s-public-workspace/collection/9956214-784efcda-ed4c-4491-a4c0-a26470a67400?ctx=documentation)

##### Step 2: Add keys and tokens as environmental variables
* Select collection `twitter api v2` > Select endpoint `Multiple tweets` > Authorization
* Change the authorization type to `Bearer Token` and paste the Bearer token just copied from the `Twitter developer platform`

##### Step 3: Add values to the Params tab
Example parameters:
* `ids`: Required. Enter up to 100 comma-separated Tweet IDs (for example, the string printed by the preprocessing script above).
* `tweet.fields`: Optional. Tweet fields to return, such as `text` (text of the tweet), `author_id` (ID of the tweet's author), and `context_annotations` (contextual annotations for the tweet).

##### Step 4: Send request and review response
Press the Send button and review the response in the payload below.
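Once the response comes back, the payload can be saved as a file (e.g. `postman_response.json`, a name assumed here) and joined back to the stance labels from the dataset. A rough sketch:
```python=
import json

# Assumed file names: the response body saved from Postman, plus the original ID file
with open('postman_response.json') as fp:
    response = json.load(fp)
with open('wtwt_ids.json') as fp:
    wtwt = {t['tweet_id']: t for t in json.load(fp)}

# The v2 tweets lookup endpoint returns the tweets under the "data" key
samples = []
for tweet in response['data']:
    labeled = wtwt[tweet['id']]
    samples.append({
        'tweet_id': tweet['id'],
        'merger': labeled['merger'],
        'stance': labeled['stance'],
        'tweet_text': tweet['text'],
    })
print(samples[0])
```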

### Reference
https://developer.twitter.com/en/docs/tutorials/postman-getting-started
### Method 3. Use `python` and `requests` (python tool)
Use the user's key to send requests to the Twitter API; at most 100 tweets (by `tweet_id`) can be queried per request.
#### Code
```python=
import requests
import json
import pandas as pd
from pandas import json_normalize

bearer_token = "YOUR_BEARER_TOKEN"

# Load the WT-WT ids/labels into a DataFrame (same as in method 1)
with open('wtwt_ids.json', newline='') as jsonfile:
    df = json_normalize(json.load(jsonfile))

def create_url(ids):
    # Tweet fields are adjustable.
    # Options include:
    # attachments, author_id, context_annotations,
    # conversation_id, created_at, entities, geo, id,
    # in_reply_to_user_id, lang, non_public_metrics, organic_metrics,
    # possibly_sensitive, promoted_metrics, public_metrics, referenced_tweets,
    # source, text, and withheld
    tweet_fields = "tweet.fields=lang,author_id,source"
    # You can adjust ids to include a single Tweet,
    # or add up to 100 comma-separated IDs
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url

def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2TweetLookupPython"
    return r

def connect_to_endpoint(url):
    response = requests.request("GET", url, auth=bearer_oauth)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
    return response.json()

tweets_found = []
for idx in range(5):
    # Add 100 ids to the query string
    data_train = df[idx*100:(idx+1)*100]
    ids = 'ids=' + ",".join(str(id) for id in data_train['tweet_id'])
    # Query with requests
    url = create_url(ids)
    json_response = connect_to_endpoint(url)
    tweets_found = tweets_found + json_response['data']

# Join the returned tweets with the ids/labels
data_train = df.rename(columns={'tweet_id': 'id'}).merge(json_normalize(tweets_found), on='id')

# Output data
new_json = data_train.to_json(orient="records")
parsed = json.loads(new_json)
with open('sample_data.json', 'w') as fp:
    json.dump(obj=parsed, fp=fp, indent=4)
```
#### Result
```
[
    {
        "id": "971761970117357568",
        "merger": "CI_ESRX",
        "stance": "support",
        "author_id": "3118851863",
        "source": "Twitter for iPhone",
        "lang": "en",
        "text": "Cigna and ESI set to merge. Here we go..."
    },
    ...
]
```
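Since the end goal is stance detection, the crawled `sample_data.json` can then be loaded back as a DataFrame for training and testing. A minimal sketch:
```python=
import pandas as pd

# Load the crawled samples produced by method 1 or method 3
df = pd.read_json('sample_data.json', orient='records')

# Quick sanity check: stance label distribution per merger
print(df.groupby(['merger', 'stance']).size())
```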