# Network data analysis: Course-work 2 ideas 2019-20

## Notes

- Below are some project ideas. You are not restricted to these ideas, but they are intended to give a flavour of the scale and kind of projects expected.
- Some projects may provide datasets that you can start from, but no further support is guaranteed or provided.

:::info
**E-mail addresses for the projects**
Pushkal: "Agarwal, Pushkal" <pushkal.agarwal@kcl.ac.uk>
Abdul: "Abdullahi Abubakar" <abdullahi.abubakar@kcl.ac.uk>
:::

- All projects should be done individually, without seeking help from others or copying from the Internet. If a project includes multiple variants, each student should submit an individual report that differs significantly from the reports of the other students on that project.
- If you are choosing a project from here, please use the following Doodle link to coordinate among yourselves: https://doodle.com/poll/c2uw5kw68h9t4q6f Projects must be selected or proposed no later than the Thursday of the week of release.

:::info
**Output expected**
- 4-minute presentation of plans followed by 2 minutes of Q&A/feedback.
- 10-20 page report of your project (submit as PDF on KEATS).
- Code
- On the first page of your report, provide a link to an online repository (repo) of your source code (you can use, for example, github.com, bitbucket.org or any other version control solution). We *do not* require the data, but feel free to include it in your repo if allowed by the dataset's terms and conditions.
:::

:::warning
**Notes about submitting code:**
- If you want to keep your repos private, KCL has a GitHub Enterprise instance (https://github.kcl.ac.uk), or Bitbucket allows private repositories. Ensure that the markers will have access to the repo (e.g. provide a user name and password for a dummy user who has access to your repo).
- We need to ensure that all code was created before the deadline. Therefore, please use some mechanism to timestamp your code base and share that version of your code in your report. For example, in SVN it could be the version number of the commit you want to share; Git has git tags (https://git-scm.com/book/en/v2/Git-Basics-Tagging).
:::

## Marks will be given as follows

- Planning (20%, presentation): This will be evaluated through a 4-minute presentation followed by 2 minutes of Q&A/feedback.
- Execution (20%, report + code): This will be evaluated by examining the quality of the code you submit, and the report for the thoroughness of the questions asked and the scientific nature of the investigations done.
- Results (30%, report + code): This looks at the nature and significance of the results, and the analytical methods used to obtain them.
- Originality of ideas (15%, presentation + report): This will be evaluated from the presentation as well as from the introduction of your report.
- Ability to use and build on concepts in class (15%): This is determined from your weekly quiz performance in the lab. The quiz will be based on concepts that were discussed in class.

### Timetable

| Date        | Event               |
|-------------|---------------------|
| 03 Mar 2020 | CW2 release         |
| 10 Mar 2020 | Presentations       |
| 24 Mar 2020 | Final presentations |

### Project 1: LondonAir: Urban living choices (Pushkal)

Air quality is a key metric that decides how suitable a city or an area is for living. Densely populated big cities tend to have bad air quality; low visibility, breathing problems and many other health issues are caused by poor air quality.
To monitor air quality across the UK, the KCL Environmental Research Group has created various dashboards (http://www.londonair.org.uk/LondonAir/Default.aspx) and also maintains an open-source tool (http://davidcarslaw.github.io/openair/). In this project you will make use of [**openair**](http://davidcarslaw.github.io/openair/), a popular air quality data analysis tool and dataset in R. You are expected first to present a spatial network analysis of this data. Second, you need to add a layer of user demographics to turn this into a multiplex network (for example, variant 1: population density; variant 2: number of cars; etc.) and compute network metrics.

- Dataset: Air quality data from the link above plus a layer of user demographics. Each variant of the project must use different user demographics, taken from any available dataset; one example source is https://www.ons.gov.uk/.
- Visualise and compute: Air quality on a map of the UK. Identify clusters and define edges based on different groups of the air quality metric you computed. Then add the layer of user demographics to obtain a multiplex network. For the multiplex network, report metrics such as clustering coefficient, degree distribution, etc. by (a) converting it to its monoplex equivalent and (b) defining multiplex equivalents of the same metrics.
- Additional comments: Check whether there are any specific clusters of high or low pollution. Can you comment, based on your user demographics, on reasons that might be causing this clustering?
- Dataset link: Install and import in R via http://davidcarslaw.github.io/openair/

These analyses should be focused towards answering specific hypotheses of interest.

### Project 2: Santander Cycle: Movements within the city (Pushkal)

Cycling is good for your health and your wealth. Hop-to-hop hire cycles are offered across London by many vendors; one big sponsor is Santander, whose scheme runs under TFL. Santander cycles are becoming a popular choice among tourists, students and office-going professionals. In this project you will use the hop-to-hop trip data stored with TFL (files under usage-stats/ at https://cycling.data.tfl.gov.uk/). You are expected to investigate the data to answer questions such as (but not limited to) the following:

- Dataset: Columns: Rental Id, Duration, Bike Id, End Date, EndStation Id, EndStation Name, Start Date, StartStation Id, StartStation Name. Link: https://cycling.data.tfl.gov.uk/
- Visualise and compute: Spatial network analysis between pairs of stations: is the flow between stations balanced at any given point in time? If not, can you show example stations that have a high number of incoming trips? How about over a day, or over a week? Report metrics such as clustering coefficient, degree distribution, etc. Then define multiplex equivalents of the same data, as in the variants below, and show the corresponding metrics (one possible starting point is sketched after this project description).
- Variant 1: Divide the daily data to form a temporal network, taking time as a dimension. One such example could be taking all morning journeys as one layer and all evening journeys as a second layer.
- Variant 2: Using the bike id and ride duration in the dataset, form a journey-duration network. An example could be layers of short and long journeys between the various docking stations.
- Additional comment: Are trip timings between two given stations always the same? Can you predict the average journey time between nodes?
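A minimal sketch of one way to start Project 2 in Python, using pandas and networkx: build a directed origin-destination network of journeys and look for stations whose incoming and outgoing trips are unbalanced. The column names follow the dataset description above; the file name `journeys.csv` is a placeholder for whichever usage-stats file you download.

```python
# Sketch for Project 2: origin-destination network of cycle journeys and a
# simple in/out balance check per station. Column names follow the project
# brief; journeys.csv is a placeholder for the downloaded usage-stats file.
import pandas as pd
import networkx as nx

trips = pd.read_csv("journeys.csv")  # placeholder file name

# Count journeys per (start, end) station pair.
edges = (trips.groupby(["StartStation Name", "EndStation Name"])
              .size()
              .reset_index(name="n_trips"))

# Directed, weighted station-to-station network.
G = nx.DiGraph()
for _, row in edges.iterrows():
    G.add_edge(row["StartStation Name"], row["EndStation Name"],
               weight=row["n_trips"])

# Net flow per station: positive values mean more incoming than outgoing trips.
balance = {s: G.in_degree(s, weight="weight") - G.out_degree(s, weight="weight")
           for s in G.nodes()}
most_incoming = sorted(balance.items(), key=lambda kv: kv[1], reverse=True)[:10]
print("Stations with the largest surplus of incoming trips:", most_incoming)

# Basic metrics asked for in the brief, on the undirected projection.
print("Average clustering coefficient:", nx.average_clustering(G.to_undirected()))
print("Degree distribution (first few bins):",
      nx.degree_histogram(G.to_undirected())[:10])
```

Filtering `trips` on the Start Date column before building the graph repeats the same analysis per morning/evening, per day or per week, which is one way to populate the temporal layers in variant 1.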
### Project 3: Increasing reachability with multimodal transport networks (Pushkal)

In this project, you will get a taste of flow networks in real life. We have shared a 5% sample of TFL's Oyster swipe-ins and swipe-outs over 2 months. Each swipe record contains the swipe-in time, swipe-out time, swipe-in station, swipe-out station, and the total charge incurred or the plan the user is on (e.g. a weekly pass). As preparation for this problem, you will first need to clean the data, as it also contains many records with missing swipes and people who travel on buses. After cleaning, you should be left with some proportion of the original data containing complete in and out swipes. Your aim is to formulate a spatial network of the different modes of travel: (a) taking all journeys together as a monoplex network, and (b) defining multiplex equivalents of the same data, as in the variants below, and showing the corresponding network metrics.

- Dataset: You will be provided with the 5% journeys data. If you want the tube system data, you can refer to this [dataset](https://github.com/nicola/tubemaps/tree/master/datasets), which has CSV files for tube station connections, tube station locations and line memberships for each connection.
- Variant 1: Your multiplex layers can be peak-hour and off-peak-hour journeys.
- Variant 2: Your multiplex layers can be journeys with a travel card and journeys where the swipe-out is missing.
- Additional comments: Can you run a path-finding algorithm to discover routes between source and sink stations?
- Dataset link: pushkal.agarwal@kcl.ac.uk

### Project 4: Geotagged Tweets: Are there spatio-temporal localities in tags? (Pushkal)

Isn't it quite common to add a location to a social media post to make it more appealing? In general, users share their daily experiences while travelling, visiting places, eating, etc. These geotagged feeds can reveal how people choose places and move between them. In this project we look at geotagged tweets from Twitter (City of London area) with spatial properties.

- Dataset: Tweets with metadata, annotated with the sentiment score of the text. Contact pushkal.agarwal@kcl.ac.uk
- Visualise and compute: Spatial network analysis; check for possible clusters. Also present multiplex network layers, whose variants can be as follows, and compute various clustering metrics.
- Variant 1: What types of hashtags do people attach when posting (#Food, #River, #Shard, etc.)? Are the hashtags recurring for the same place, using the lat-long data?
- Variant 2: What types of mentions do people attach when posting (@Shard, @KCL, @wasabi, etc.)? Are the mentions recurring for the same place, using the lat-long data?
- Additional comments: What are the sentiments attached to the place/tweet? What are the sentiments per platform from which these tweets are shared? (e.g. Instagram as the tweet source in the dataset)

### Project 5: Network of networks for the multi-lingual social network Sharechat (Pushkal)

Sharechat (https://sharechat.com/) is a multi-lingual social network. Anyone can join an instance of one of its language communities. Having joined an instance, they can follow identities on that instance or post on its forum, thus creating a federated "network of networks". This is similar to how a Brazilian can become friends with a person in Australia, or with another person in Brazil. You will be provided with a massive Sharechat dataset, and you will analyse it to understand the characteristics of such language networks. You will formulate hypotheses around this dataset by making a multiplex network.

- Dataset: pushkal.agarwal@kcl.ac.uk
- Visualise and compute: Multiplex network layers of the different language communities. You are expected to present clustering metrics for (a) this network as a monoplex, taking all layers together, and (b) the multiplex, using languages, hashtags, etc. as layers (see the sketch of this monoplex-versus-per-layer comparison after this project description).
- Additional comments: Do you observe an association between users of some languages and users of others?
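Several projects (1 and 3-6) ask for the same style of comparison: metrics computed (a) on the monoplex network obtained by collapsing all layers together and (b) on each layer of the multiplex separately. Below is a minimal, hedged sketch of that comparison in Python with networkx; the layer names and edges are placeholders for whatever your own dataset defines (language communities, peak/off-peak journeys, hashtag layers, etc.).

```python
# Sketch for the recurring "(a) monoplex vs (b) multiplex layers" comparison.
# Each layer is a plain networkx graph over the same node set; the example
# layer names and edges below are placeholders for your own data.
import networkx as nx

layers = {
    "layer_a": nx.Graph([("u1", "u2"), ("u2", "u3"), ("u3", "u1")]),
    "layer_b": nx.Graph([("u1", "u4"), ("u4", "u2")]),
}

# (a) Monoplex equivalent: union of all layers' edges on the shared node set.
monoplex = nx.Graph()
for g in layers.values():
    monoplex.add_edges_from(g.edges())

def report(name, g):
    degrees = [d for _, d in g.degree()]
    print(f"{name}: nodes={g.number_of_nodes()}, edges={g.number_of_edges()}, "
          f"avg clustering={nx.average_clustering(g):.3f}, "
          f"mean degree={sum(degrees) / len(degrees):.2f}")

report("monoplex (all layers collapsed)", monoplex)

# (b) Multiplex view: the same metrics per layer, so differences between
# layers are not hidden by the aggregation.
for name, g in layers.items():
    report(name, g)
```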
### Project 6: MPs' Geolocated Wikipedia Edits (Pushkal)

All members of the UK Parliament (MPs) have their own Wikipedia pages, and these pages are free to edit. For non-logged-in users, their IP addresses are stored and displayed as the user name in the edit history. In this project, students will get the edit records of MPs' pages that come from these public IPs (~39k records). These have been geocoded with lat-long and constituency labels. Some questions to investigate:

- Dataset: pushkal.agarwal@kcl.ac.uk
- Visualise and compute: Spatial network analysis, plus multiplex network analysis with different layers as in the variants below. You are expected to present clustering metrics for (a) this network as a monoplex, taking all layers together, and (b) the multiplex layers separately.
- Variant 1: Editors to MPs. Are people editing their local MP or not? You can use the MPs' parties to add additional layers of editors editing a given party.
- Variant 2: Editors to constituency areas. Are these edits concentrated in a few small areas (e.g. staff edits from Parliament) or are they evenly spread? What can you infer from the constituency-level analysis?
- Additional comments: Are the layers of text additions and text deletions different from each other?

### Project 7: User journeys through checkins (Abdullahi)

Humans are quite predictable. The tech giants know this, and are able to map your daily journey times, your daily coffee habits and your commute preferences to put recommendations on your phone. This project gives you a slight taste of that. Here we use Foursquare checkins, which are user-generated checkins across New York and Tokyo, to understand user mobility. In this project, you are given user ids and their checkin patterns, with the checkin location and the category of business. Using this, model a user's journey as some form of process that takes into account the past N checkins and predicts the (N+1)th checkin (one such model is sketched after this project description). Leverage the knowledge from class to understand whether there is any clustering phenomenon around certain types of businesses.

* Dataset: https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset
* Two datasets are provided; for each variant use one of them.
* Variant 1: NYC dataset
* Variant 2: Tokyo dataset

Additional comments: A simple characterisation of the dataset would be very helpful in putting confidence in your observations. Comment on any interesting anomalies in the characteristics of the data.
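One simple way to read "a process that takes into account the past N checkins and predicts the (N+1)th" is a Markov-style model over venue categories. The sketch below (Python, first-order chain, i.e. N = 1) is only a hedged starting point: the file name is a placeholder, and the column names (userId, venueCategory, utcTimestamp) are taken from the Kaggle dataset's description and should be verified against the file you actually download.

```python
# Sketch for Project 7: a first-order Markov model (N = 1) over venue
# categories, predicting the next checkin's category from the current one.
# File name and column names are assumptions to check against the dataset.
from collections import Counter, defaultdict
import pandas as pd

checkins = pd.read_csv("dataset_TSMC2014_NYC.csv")   # placeholder path
# Adjust the parsing if the timestamp column uses a different format.
checkins["utcTimestamp"] = pd.to_datetime(checkins["utcTimestamp"])
checkins = checkins.sort_values(["userId", "utcTimestamp"])

# Count transitions between consecutive checkin categories of the same user.
transitions = defaultdict(Counter)
for _, user_checkins in checkins.groupby("userId"):
    cats = user_checkins["venueCategory"].tolist()
    for current_cat, next_cat in zip(cats, cats[1:]):
        transitions[current_cat][next_cat] += 1

def predict_next(category):
    """Most frequent follow-up category observed after `category`."""
    if category not in transitions:
        return None
    return transitions[category].most_common(1)[0][0]

print(predict_next("Coffee Shop"))  # example query; category name is illustrative
```

Higher-order variants (N > 1) follow by keying the transition counter on tuples of the last N categories, and the same counts can be read as a weighted category-to-category network for the clustering questions in the brief.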
### Project 8: A multilayer network understanding of Foursquare checkins (Abdullahi)

In this project you will formulate a multilayer network problem using the dataset below. You can create one layer of this network, which we can call a user co-location layer, formed by placing edges between two user nodes if they have checked into the same business at least once. The second layer could be a user taste network, created by placing edges between two users if they have checked into business establishments of a similar type (restaurants, home, gaming parlour, bars, etc.). The initial hunch is that the user taste network could be a lot denser than the co-location network, as we are collapsing multiple businesses into one type. Test this hunch (i.e. whether the network is actually denser) by evaluating network metrics of the individual layers and of the multilayer network. Using the multilayer network of users now formed (based on co-location and on user tastes), find the users with the most similar taste ego networks, i.e. those whose immediate neighbours have the highest overlap in the business-type network (one way to quantify this overlap is sketched after this project description). Can you find any application based on this match of users in the taste network? (If two users have a very high match in terms of taste ego networks, but have gaps in their co-location ego networks, you may recommend the non-covered nodes to each other as recommendations.) Make observations based on this setup.

* Dataset: https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset
* Two datasets are provided; for each variant use one of them.
* Variant 1: NYC dataset
* Variant 2: Tokyo dataset

The datasets contain checkins from users across the cities of New York and Tokyo. Each dataset has the timestamp of the checkin, the user id, the latitude and longitude of the business where the checkin happened, the type of that business (bar, restaurant, gaming, home, etc.) and some more fields.
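For the ego-network comparison above, one concrete, hedged way to quantify "overlap" is the Jaccard similarity of two users' neighbour sets in the taste layer, checked against their overlap in the co-location layer. A minimal Python/networkx sketch follows; the two layer graphs are placeholders that you would build from the checkin data.

```python
# Sketch for Project 8: find pairs of users whose ego networks in the "taste"
# layer overlap the most (Jaccard similarity of neighbour sets). The layer
# graphs below are placeholders; build them from the checkin data.
from itertools import combinations
import networkx as nx

taste_layer = nx.Graph([("u1", "u2"), ("u1", "u3"), ("u4", "u2"), ("u4", "u3")])  # placeholder
colocation_layer = nx.Graph([("u1", "u2")])                                       # placeholder

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def neighbours(g, u):
    return set(g.neighbors(u)) if u in g else set()

# Pairwise similarity of taste ego networks.
pairs = []
for u, v in combinations(taste_layer.nodes(), 2):
    sim = jaccard(neighbours(taste_layer, u), neighbours(taste_layer, v))
    pairs.append((sim, u, v))

for sim, u, v in sorted(pairs, reverse=True)[:10]:
    # Pairs with high taste overlap but low co-location overlap are the
    # candidates for the recommendation idea mentioned in the brief.
    coloc_sim = jaccard(neighbours(colocation_layer, u), neighbours(colocation_layer, v))
    print(f"{u} ~ {v}: taste overlap={sim:.2f}, co-location overlap={coloc_sim:.2f}")
```

The same `jaccard` helper can be reused for the homophily measure in Project 10 below, for example by comparing a user's neighbour set in the friendship layer with their neighbour set in the co-location layer.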
### Project 9: Visualizing a city through its businesses (Abdullahi)

Cities tend to be monocentric or polycentric (a single downtown, or many areas for leisure and shopping). Both variants have their pros and cons. Some cities have a strict bifurcation between where people live, where they work and where they spend their leisure time. London, being an ancient city, shows a [mix of polycentric and gentrified arrangement](https://www.ucl.ac.uk/news/2010/feb/ucl-researchers-reveal-polycentric-london). In this project, you will try to visualise the structure of the two cities in terms of business centres, through the Foursquare checkin behaviour of their citizens.

* Variant 1: Using the checkins from New York, try to find out whether you see spatial clustering of businesses by type, e.g. do all Chinese restaurants tend to cluster around each other (there is a Chinatown in New York too), or do all theatres fall on the same street (e.g. Broadway in NY)?
* Variant 2: Similarly, using the checkins from Tokyo, try to find out whether you see spatial clustering of businesses by type, e.g. do restaurants tend to cluster around each other, or do all theatres fall on the same street?
* Dataset: https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset

Quantify this phenomenon using some metrics and graphs to differentiate between business types and their clustering. Further, can you predict, using the timestamps of the checkins, any sort of flow between business types? For example, you are likely to see a checkin in a bar after a checkin in a restaurant. Make observations, and explore the predictive value of such analysis.

### Project 10: Birds of the same feather eat together (Abdullahi)

An interesting phenomenon that you see in social systems is homophily among agents: we tend to be friends with people who have similar tastes, or our friends tend to have tastes similar to ours. To test this, this time using a multilayer approach, we will look at the checkin data from the descendant of Foursquare called SwarmApp. Through a dataset of 60 million users, you will look at the social network and the checkins of users to calculate the amount of homophily in a friends network when it comes to visiting the same/similar restaurants. The dataset can be found [here](https://github.com/chenyang03/Swarm_dataset) ([mirror: friends network](https://www.dropbox.com/s/iwj7b9kygqhykxh/friends.tar.gz?dl=0), [mirror: checkins](https://www.dropbox.com/s/iufxpaymdsaiy2a/check-in_nums.tar.gz?dl=0)).

Formulate a multilayer abstraction of this data (2 layers), where the nodes in each layer represent users. In one layer, the edges would follow the friendship relationships between users. In the second layer, the edges would follow co-location (checkins in the same place) relationships. Using this formulation, evaluate homophily among the users. If there is some degree of homophily in one layer, is there an equal degree of homophily in the other layer? Or is there more homophily in one layer than in the other: is there more homophily among users who check in together, or in the social network of declared friendships? Explain your results using a temporal graph approach, perhaps by taking into consideration the times of checkins and user histories over time. Quantify the concept of homophily in terms of ego-network overlap or any other technique, and justify the choice. Make any interesting observations that you may find.

### Project 11: Specialisation of crops across countries over time (Abdullahi)

Different places are characterised by different weather conditions and soil types, which causes disparities in the crops scattered around the globe. Although, with the advancement of technology, some of these barriers no longer apply, the project aims to answer the following questions and more (a starting point for variant 1 is sketched at the end of this document).

* Variant 1: Which crops can be associated with which regions of the world? Is there specialisation of crops among regions/countries? Is this specialisation increasing or decreasing over time?
* Variant 2: Using the dataset provided, formulate a multilayer abstraction of the data. For example, one layer can be crops grown in Asia and a second layer can be crops grown in America. Using this formulation, can you recommend crops between regions?
* Dataset: http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_E_All_Data.zip

Additional note: To better understand the dataset, you might find this URL useful: http://www.fao.org/faostat/en/#data/QC
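For variant 1 of Project 11, one simple, hedged starting point is a specialisation score per (country, crop, year): the crop's share of that country's production divided by the crop's share of world production, where values above 1 suggest specialisation. The sketch below assumes a long-format ("normalized") FAOSTAT layout with Area, Item, Element, Year and Value columns; the file path, encoding and the example area/item names are placeholders to check against the actual download, and the bulk file may instead be in wide format and need reshaping first.

```python
# Sketch for Project 11 (variant 1): a simple specialisation score per
# (country, crop, year). Column names, file path, encoding and the example
# Area/Item values are assumptions to verify against the FAOSTAT download.
import pandas as pd

prod = pd.read_csv("Production_Crops_E_All_Data_Normalized.csv",  # placeholder path
                   encoding="latin-1")
prod = prod[prod["Element"] == "Production"].copy()

country_totals = prod.groupby(["Area", "Year"])["Value"].transform("sum")
crop_world = prod.groupby(["Item", "Year"])["Value"].transform("sum")
world_total = prod.groupby("Year")["Value"].transform("sum")

# > 1 means the crop takes a larger share of this country's production than
# it does of world production, i.e. relative specialisation.
prod["specialisation"] = (prod["Value"] / country_totals) / (crop_world / world_total)

# Track how one country's specialisation in one crop changes over time
# (area and item names are illustrative).
example = prod[(prod["Area"] == "Brazil") & (prod["Item"] == "Coffee, green")]
print(example[["Year", "specialisation"]].sort_values("Year"))
```

Tracking this score over the year range, or thresholding it to draw country-crop edges, is one way to build the regional networks and layers the project asks for.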