Summer Institute in Computational Social Science

# Data Sources for Computational Social Sciences

slides: https://hackmd.io/@timdennis/sicss-data
as notebook: https://hackmd.io/awB1KU6RTlqHdfbjWec23Q?view

---

## Who am I?

- Data Librarian :blue_book:
- Run the Data Science Center in the UCLA Library :bear:
- Carpentries Instructor :computer:
- BITSS Catalyst :chart_with_upwards_trend:

---

### Today:

1. Go over some sources for data
2. Some hands-on with TAGS, a Twitter-to-Google-Sheets tool

---

## Social Media

---

## Twitter Data

![](https://i.imgur.com/hGw360n.jpg)

---

### Questions:

* Are historical tweets needed? Or current tweets?
* Do you have funding to acquire Twitter data?
* Do you need to share the Twitter dataset as part of publication / reproducible research?
* What are your technical skills?

Source: https://gwu-libraries.github.io/sfm-ui/posts/2017-09-14-twitter-data

---

### 4 ways of Acquiring Twitter data:

* Retrieve from the Twitter public API (last 7 days of tweets)
* Find an existing Twitter ID dataset
* Purchase from Twitter (priced per month by number of queries for historical data)
* Use a Twitter service provider (Crimson Hexagon, now Brandwatch)

---

### Public API Tools

* [Tweepy](http://www.tweepy.org/) (Python) & [rtweet](https://github.com/mkearney/rtweet) (R package)
* [Twarc](https://github.com/docnow/twarc) (command line)
* [TAGS for Google Spreadsheet](https://tags.hawksey.info/) (We'll come back to this)

---

### Finding existing Twitter datasets

* [2016 US Presidential Election Tweet IDs](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PDI7IN)
* Deen Freelon's [40 million tweet IDs #Blacklivesmatter](http://dfreelon.org/2017/01/03/beyond-the-hashtags-twitter-data/)
* DocNow's catalog of [Tweet ID Datasets](https://www.docnow.io/catalog/)

Note: If you are sharing datasets of tweets, you can only publicly share the IDs of the tweets, not the tweets themselves. The election dataset has about 280 million tweet IDs from the 2016 campaign.
DocNow is a listing of

---

## How You Hydrate - Hydrator

![Hydrator](https://i.imgur.com/WKvuhQl.png)

Note: Use the desktop client called Hydrator. It takes a tweet ID dataset and hydrates the IDs with the tweet content. You can also use twarc to hydrate.

---

## Facebook

![](https://i.imgur.com/VuBJQA5.png)

Note: Facebook provides a Graph API for Facebook posts and Instagram. There are limits based on permissions and restrictions.
Other

---

![](https://i.imgur.com/sNhl2C2.png)

---

For others, look for an API or developer tools.

---

## Text

---

You can build corpora of text data from the previous sources

---

The library also licenses resources that allow text and data mining (TDM).

Note: This is something our licensing people are now negotiating for: research access to the assets.

---

![](https://i.imgur.com/xOpxmQX.png)

Link: http://guides.library.ucla.edu/c.php?g=755671&p=5492102

---

![](https://i.imgur.com/VlTelxn.png)

Note: The HathiTrust Digital Library contains over 16 million volumes that span the history of printed text, primarily in English, but also in German, French, Spanish, and Russian, among over 400 other languages. You can use their preprocessed data (extracted features) or build your own data container, or workset. Since the volume of data is large, they provide an infrastructure for you to do your analysis within HathiTrust.

---

![](https://i.imgur.com/Tz69Cdw.png)

Download raw data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

---

## Administrative Data

> Administrative data are collected by governments or other organizations for non-statistical reasons to provide overviews on registration, transactions, and record keeping (Wikipedia)

---

## Government

* Tax data
* Death and birth data
* Crime data
* Immigration

Note: Some of this data is available as aggregates via agencies or data archives such as ICPSR. Raw or case-level data (tax data, for example) is harder to get. The IRS has a research program where you can apply for access.

---

## Businesses

* They can sell you data or have APIs with more limited access (Zillow)
* Third-party sites can post or sell data they acquire ([Inside Airbnb](http://insideairbnb.com/))
* Some datasets are posted to open data sites ([Uber's 2014 NYC dataset](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city))
* Contact your data librarian for help finding the data

---

### Hands-On with TAGS

---

TAGS is a plugin for Google Sheets that lets you build a Twitter dataset via the public API.

---

### Follow these steps:

Link to this section: <https://tinyurl.com/sicss-tags-exercise>

1. Go to <https://tags.hawksey.info/> & click `Get TAGS`
2. Select `TAGS v6.1` - note how to get around the 'This app isn't verified' notice below
3. Select `Make a copy` - afterwards, wait a moment and a `TAGS` menu item will appear
4. Select `TAGS > Set up Twitter Access` and follow through to give TAGS access to your Twitter account; for me this took a couple of steps
5. Once authenticated, enter a hashtag in the enter term box (number 2) - include the pound sign (#)
6. Select `TAGS > Run now!`
7. Switch to the `Archive` sheet and see the results populated
8. Make the Google Sheet public on the web, and then you can use the `TagsExplorer` or `TAGS Archive` to see different aspects of your result set
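Once the `Archive` sheet is populated, one option is to download it as CSV and summarize it outside the spreadsheet. The following is a minimal sketch, not part of TAGS itself. It assumes the export has a `text` column holding the tweet body (the column names `id_str`, `from_user`, and `text` are assumptions about the archive layout, so check your own header row), and the file name `tags_archive.csv` is hypothetical. The sample rows are made up so the block runs on its own.

```python
import csv
import io
import re
from collections import Counter

# Matches hashtags such as #SICSS; a simplification of Twitter's own rules.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(rows, text_column="text"):
    """Count hashtags (case-insensitive) across an iterable of row dicts."""
    counts = Counter()
    for row in rows:
        text = row.get(text_column) or ""  # tolerate missing or empty cells
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Tiny made-up sample standing in for an exported Archive sheet.
# With a real export you would instead do (hypothetical file name):
#   with open("tags_archive.csv", newline="", encoding="utf-8") as f:
#       counts = count_hashtags(csv.DictReader(f))
SAMPLE_CSV = """id_str,from_user,text
1,alice,Learning about #SICSS data sources
2,bob,"#sicss hands-on with #TAGS, fun"
3,carol,No hashtags here
"""

counts = count_hashtags(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for tag, n in counts.most_common():
    print(f"#{tag}: {n}")
```

Lower-casing the tags treats `#SICSS` and `#sicss` as the same hashtag, which is usually what you want for a quick frequency table. Pass a different `text_column` if your export names the tweet body differently.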