# Network data analysis: Course-work 1 ideas 2019-20

## Notes

- Below are some project ideas. You are not restricted to these ideas; they are intended to give a flavour of the scale and kind of projects expected.
- Some projects may provide datasets that you can start from, but no further support is guaranteed or provided.

:::info
**E-mail addresses for the projects**
Pushkal: "Agarwal, Pushkal" <pushkal.agarwal@kcl.ac.uk>
Abdul: "Abdullahi Abubakar" <abdullahi.abubakar@kcl.ac.uk>

- All projects should be done individually, without seeking help from others or copying from the Internet. If a project includes multiple variants, each student working on it should submit an individual report that differs significantly from those of the others.
- If you are choosing a project from here, please use the following doodle link to coordinate among yourselves: https://doodle.com/poll/e5mu48r9y97z33wc Projects must be selected or proposed no later than the Thursday of the week of release.

:::info
**Output expected**

- 4 minute presentation of plans followed by 2 minutes of Q&A/feedback.
- 10-20 page report of your project (submit as PDF on KEATS).
- Code
  - On the first page of your report, provide a link to an online repository (repo) of your source code (you can use, for example, github.com, bitbucket.org or any other version control solution). We *do not* require the data, but feel free to include it in your repo if allowed by the dataset's terms and conditions.

:::warning
**Notes about submitting code:**

- If you want to keep your repos private, KCL has a GitHub Enterprise instance (https://github.kcl.ac.uk), and Bitbucket allows private repositories. Ensure that the markers will have access to the repo (e.g. provide a user name and password for a dummy user who has access to your repo).
- We need to ensure that all code was created before the deadline. Therefore, please use some mechanism to timestamp your code base and share that version of your code in your report.
  For example, in SVN it could be the version number of the commit you want to share. Git has git tags (https://git-scm.com/book/en/v2/Git-Basics-Tagging)
:::

## **Marks will be given as follows**

- Planning (20%, presentation): This will be evaluated through a 4 minute presentation followed by 2 minutes of Q&A/feedback.
- Execution (20%, report + code): This will be evaluated by examining the quality of the code you submit, and the report for the thoroughness of the questions asked and the scientific nature of the investigations done.
- Results (30%, report + code): This looks at the nature and significance of the results, and the analytical methods used to obtain them.
- Originality of ideas (15%, presentation + report): This will be evaluated from the presentation as well as from the introduction of your report.
- Ability to use and build on concepts in class (15%): This is determined from your weekly quiz performance in the Lab. The quiz will be based on concepts that were discussed in class.

### Timetable

| Date         | Events              |
|--------------|---------------------|
| 11 Feb 2020  | CW1 Release         |
| 18 Feb 2020  | Presentations       |
| 03 Mar 2020  | Final Presentations |
| 03 Mar 2020  | CW2 Release         |
| 10 Mar 2020  | Presentations       |
| 24 Mar 2020  | Final Presentations |

### Project 1: L-BRAND: Leader Board for Ranking Algorithms in Network Data (Abdullahi)

There can be many different rankings of users on social media. For instance, users can be ranked by number of followers, number of retweets received, etc.

Variant 1: Should we expect someone who is ranked highly by retweets to also have a large number of followers, and vice versa? This project tests the idea using a formal statistical analysis. The idea is to take a large number of network datasets and rank their nodes and edges by different means, then check the rank correlations between the different rankings. You will then explain any differences found.
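As an illustration of the rank-correlation step in Variant 1, here is a minimal sketch of Spearman's rank correlation in plain Python. The follower and retweet counts are made-up numbers, and the function names `average_ranks` and `spearman` are hypothetical helpers, not part of any provided code. In practice you may prefer a library implementation.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-user counts (same user order in both lists).
followers = [5000, 1200, 300, 300, 50]
retweets = [40, 900, 15, 10, 5]
rho = spearman(followers, retweets)
print(round(rho, 3))
```

A value near 1 means the two rankings largely agree; a value near 0 means that being ranked highly on one measure says little about the other.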

Variant 2: Another variant of this project ranks leaders based on the topics they share. Should we expect someone who tweets about a certain topic to be more highly ranked? Is there a disproportion in the participation across topics? The project involves extracting topics by various means, ranking users, and then checking the rank correlations between the different rankings. You will then explain any differences found.

- Dataset link: Brexit Tweets Data. Mail to abdullahi.abubakar@kcl.ac.uk
- Visualise and Compute: Followers Rank, Retweet Rank, Rank Correlation, etc.
- Additional Comment:
  - Example: By Daniel Tunkelang https://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank
  - Also read the Million Followers Fallacy paper: http://twitter.mpi-sws.org/icwsm2010_fallacy.pdf

### Project 2: Homophily of hashtags (Pushkal)

Twitter is a ubiquitous social platform, and one of the best ways to make a tweet visible is to use a hashtag in it. Hashtags are extremely important in social movements too. Certain hashtags, like #MeToo or #Brexit, have had a considerable impact on the news coverage around the respective issues. This project will give you an opportunity to dive deep into analysing and visualizing Twitter data. For this project, you will be given access to a large Twitter dataset in a format that you can analyse.

Variant 1: Using the dataset on #Brexit, your aim is to dissect this data, extract hashtags, quantify which hashtags co-occur, and come up with a way to formalize this co-occurrence in some form of a network. You may also further visualize this network and characterise it along the different metrics we have learned in the module. Try to explain how hashtags co-occur with other hashtags. How are hashtags related to users?

Variant 2: Using the dataset on #MeToo, your aim is to dissect this data, extract mentions (@), quantify which mentions co-occur, and come up with a way to formalize this co-occurrence in some form of a network.
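
One possible way to formalize co-occurrence for either variant is a weighted undirected edge list, where the weight counts how many tweets contain both tokens. The sketch below assumes tweets are available as plain text strings. The regular expressions, the function name `cooccurrence_edges`, and the sample tweets are illustrative assumptions, not the format of the dataset you will receive.

```python
import re
from collections import Counter
from itertools import combinations

# Simple patterns; real tweets may need more careful tokenisation.
TOKEN_PATTERNS = {
    "hashtag": re.compile(r"#(\w+)"),
    "mention": re.compile(r"@(\w+)"),
}


def cooccurrence_edges(tweets, kind="hashtag"):
    """Count, for each unordered pair of tokens, the tweets containing both."""
    pattern = TOKEN_PATTERNS[kind]
    edges = Counter()
    for text in tweets:
        # Lower-case and de-duplicate so each pair counts once per tweet.
        tokens = sorted({t.lower() for t in pattern.findall(text)})
        for a, b in combinations(tokens, 2):
            edges[(a, b)] += 1
    return edges


# Made-up example tweets.
tweets = [
    "Voting today #Brexit #EUref",
    "#brexit means #Brexit, says @number10gov #EUref",
    "No tags here",
    "#Brexit #economy @bbcnews @number10gov",
]
hashtag_edges = cooccurrence_edges(tweets, kind="hashtag")
mention_edges = cooccurrence_edges(tweets, kind="mention")
print(hashtag_edges.most_common())
print(mention_edges.most_common())
```

The resulting edge weights can be loaded into a graph library of your choice for visualisation and metric computation.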
You may also further visualize this network and characterise it along the different metrics we have learned in the module. Try to explain how mentions co-occur with other mentions. How are some mentions related to users?

- Dataset link: Tweets data. Mail to pushkal.agarwal@kcl.ac.uk
- Visualise and Compute: Extract hashtags, create a graph (an undirected edge could connect hashtags that appear in the same tweet), visualise using D3.js, etc.

### Project 3: Will you be my friend? Link prediction in network data (Abdullahi)

Link prediction is one of the most fundamental problems in social networks. Be it the link between two friends, or the link between a person and a product on Amazon, link prediction is used in all areas of computer science and is behind billions of dollars in revenue. Here we will deal with link prediction in the context of predicting friendships. For this you will need to work with real-world social graphs, which you may download from this [website](https://snap.stanford.edu/data/index.html#socnets).

Once you download a dataset, you will need to convert it into a graph structure using tools of your choice. Then you will delete 5% of the edges between friends, and use this modified graph to test link prediction algorithms. You should test the accuracy of your algorithm by checking how many of the predicted links match the original graph. The results should be reported using popular metrics like F-1 score, precision and accuracy. You can choose any prediction algorithm. This resource is a good primer on what is available: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.link_prediction.html. I recommend testing at least 2 algorithms on this dataset. You can further repeat the process by removing progressively more edges from the original graph in increments, and plotting how the accuracy of the prediction algorithm deteriorates as more links are removed.
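
The hold-out procedure described above can be sketched as follows. This is a minimal illustration in plain Python, assuming a small, simple undirected graph with sortable node labels and no self-loops. It uses Jaccard similarity of neighbour sets as the scoring rule, which is only one of many possible predictors, and it scores every non-adjacent pair, so it will not scale to large graphs. The function names and the toy graph are hypothetical.

```python
import random
from itertools import combinations


def jaccard(adj, u, v):
    """Jaccard similarity of the neighbour sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def holdout_link_prediction(edges, fraction=0.05, seed=0):
    """Hide a fraction of edges, predict them back, and report precision/recall."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted(e)) for e in edges})
    n_hidden = max(1, round(fraction * len(edges)))
    hidden = set(rng.sample(edges, n_hidden))

    # Build the training graph without the hidden edges (all nodes are kept).
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if (u, v) not in hidden:
            adj[u].add(v)
            adj[v].add(u)

    # Score every non-adjacent pair and predict the top n_hidden pairs.
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
    ranked = sorted(candidates, key=lambda pair: jaccard(adj, *pair), reverse=True)
    predicted = set(ranked[:n_hidden])

    hits = len(predicted & hidden)
    return {
        "hidden": hidden,
        "predicted": predicted,
        "precision": hits / len(predicted),
        "recall": hits / len(hidden),
    }


# Toy graph: two 5-node cliques joined by a single bridge edge.
toy_edges = (
    list(combinations(range(5), 2))
    + list(combinations(range(5, 10), 2))
    + [(4, 5)]
)
result = holdout_link_prediction(toy_edges, fraction=0.2, seed=1)
print(result["precision"], result["recall"])
```

Because this sketch predicts exactly as many links as were hidden, precision and recall coincide. Predicting a different number of links, or thresholding the score, separates the two and makes an F-1 score informative.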
Please plot and explain your results in the report and critique any interesting observations you see.

- Dataset link:
  - Variant 1: Any dataset with friendship edges.
  - Variant 2: Any dataset with e-commerce data exploration.
- Visualise and Compute: Remove random links from existing network data. Recommend friends based on your model. Compare the result with the actual links you removed in the first place.

### Project 4: A review of graph generation (Pushkal)

Through the course of this module, we have come across graph generation models such as the Erdős-Rényi model. These models made it easy to approximate real-world networks, and allowed scientists and engineers to test their ideas in a scalable and replicable manner. This project goes further, to understand and explain other graph generation models. You will need to choose 5 graph generation models (ER graphs, Watts-Strogatz graphs, Forest fire graphs, Small world graphs, etc.). This project can have two variants, provided that each variant holder chooses a different set of 5 models. After choosing them, you will need to write code to generate and visualize these graphs through parameterised inputs. These graphs then need to be compared on various metrics that we reviewed in the module, like clustering, degree distribution, diameter, etc. The more exhaustive the comparison, the better. The report should contain a full comparison of the similarities and differences between the graph generation models, along with your critique of the utility of these models for different kinds of networks.

- Dataset link: Self Generated Model
  - Variant 1: 5 models other than in Variant 2.
  - Variant 2: 5 models other than in Variant 1.
- Visualise and Compute: Generate various models and compute various statistics (distribution of indegree and outdegree, etc.) and properties (clustering coefficient, diameter, etc.). Models could be the Forest fire model, etc.
- Additional Comment:
  - What do you get when you compare with an Erdős-Rényi random graph?

### Project 5: Anatomy of a rumour (Pushkal)

Rumours, fake news and misinformation are the talk of the day. They have been researched, measured and analysed in a spectrum of fields in order to understand and reason about this phenomenon. In the digital age, rumours now have a variety of channels through which to spread.

Variant 1: In this project, you will try to partially replicate one of these studies, which measures how a rumour about a scientific event spreads through the twitterverse. The paper under consideration is [this](https://www.nature.com/articles/srep02980). The data used in this paper is also openly available. You will analyse this data and try to replicate the first 5 figures in the paper. Please interpret the results on your own and discuss them in the report.

Variant 2: In this project, you will study the use of third-party cookies on hyper-partisan news websites. Taking two different ecosystems (Right and Left news), measure the centrality, clustering, degree distribution and other network parameters. The paper under consideration is [this](https://arxiv.org/pdf/2002.00934.pdf). The data used in this paper will be made available. Also discuss the cookies shared (common) between the two ecosystems. Please interpret the results on your own and discuss them in the report.

- Dataset link: Paper
  - V1: Data is available at https://www.nature.com/articles/srep02980
  - V2: Mail pushkal.agarwal@kcl.ac.uk https://arxiv.org/pdf/2002.00934.pdf https://arxiv.org/pdf/1803.03576.pdf
- Visualise and Compute:
  - V1: Metrics as given in the figures of the suggested paper
  - V2: Metrics such as centrality, degree distribution, etc. for nodes with left versus right news URLs.

### Project 6: Bots and Brexit (Pushkal)

Twitter is riddled with bots.
They come in all flavours, from the most benign bots that simply retweet content, to the ones with ulterior motives like spreading propaganda and faking followers to inflate popularity numbers. Sometimes these bots can have a measurable influence on the discourse happening across the twitterverse. Brexit was one such discourse, where bots were involved in large numbers. The aim of this project is to understand the reach, influence and importance of these bot accounts. We will provide you with a large Brexit dataset, with user accounts annotated with probabilities of being "Botlike".

Variant 1: Taking these annotations as ground truth, you will abstract the tweet contents from bot accounts and human accounts into two separate **hashtag** networks, in accordance with the homophily of hashtags project.

Variant 2: Taking these annotations as ground truth, you will abstract the tweet contents from bot accounts and human accounts into two separate **mentions** networks, in accordance with the homophily of hashtags project.

You will then compare the two networks on all the metrics we have reviewed in the module, using tools of your choice. In the report you should critique your observations and present the results using plots and visualizations. Are there differences between the bots' Twitter and the humans' Twitter? How intermixed are these networks?

- Topic: Bot-Brexit: Computing botness in Brexit data.
- Dataset link: Mail pushkal.agarwal@kcl.ac.uk
- Visualise and Compute: A dataset of #Brexit tweets will be shared. Each user will be labelled with a botness score. Compute metrics (popularity, induced activity, etc.) for suspected users with a high bot score versus real users.
- Additional Comment:
  - Botometer Project by Indiana University: https://botometer.iuni.iu.edu/
  - Can you compute a botness score on your own?

### Project 7: Why did Vine fail? (Pushkal)

In this project, you will take two separate online social networks and look at their social properties.
The first one is Vine, which is now closed, and the other is Instagram, which is thriving. Both used similar business logic, but one has failed. To dig deeper, we have created two separate graphs, using interactions between users and posts for Vine and Instagram. The hypothesis is that there is something different in the way people interact, which made Vine lose traction. You will help this research by doing the investigative work. You will be provided with these two graphs in suitable formats, and your task is to compare them using all the different metrics we reviewed during this module. You should plot all your analysis in the report and critique any interesting differences you find. You can also look for the rich club effect (https://en.wikipedia.org/wiki/Rich-club_coefficient) to see if that has anything to do with the failure. Explain the effect in the report along with the results that you find. Formulate hypotheses and test them using hypothesis testing methods.

- Dataset: approach pushkal.agarwal@kcl.ac.uk
- Visualise and Compute: What would you compare, and what would you find, with two or more different network datasets? E.g. Vine (micro video platform) versus Instagram. Can you check for the rich club effect? Also compute the clustering coefficient, etc.
- Additional Comment:
  - Vine: https://en.wikipedia.org/wiki/Vine_(service)
  - https://en.wikipedia.org/wiki/Rich-club_coefficient

### Project 8: Network of networks is a decentralised social network. (Pushkal)

Variant 1: Mastodon (https://joinmastodon.org/) is a decentralised social network. Anyone can set up an instance of Mastodon, forming a new social network. Others can join one or more instances, creating new identities. Having joined an instance, they can follow identities on that instance or on other instances, thus creating a federated "network of networks". This is similar to a Brazilian who can become friends with a person in Australia, or another person in Brazil.
You will be provided with a massive dataset of Mastodon, and you will analyse this to understand the characteristics of such networks. You will formulate and test 3-5 hypotheses about how such a network of networks might work (for instance, do people of a particular instance link more to identities from the same instance, or is a truly global network being formed? If it is a local network, how do we characterise its "local" vs "global" nature?).

Variant 2: Sharechat (https://sharechat.com/) is a multi-lingual social network. Anyone can join an instance of a language community. Having joined an instance, they can follow identities on that instance or post on the forum, thus creating a federated "network of networks". This is similar to a Brazilian who can become friends with a person in Australia, or another person in Brazil. You will be provided with a massive dataset of Sharechat, and you will analyse this to understand the characteristics of such language networks. You will formulate and test 3-5 hypotheses about how such a network of networks might work (for instance, do people of a particular instance link more to identities from the same instance, or is a truly global network being formed? If it is a local network, how do we characterise its "local" vs "global" nature?).

- Topic: Network of Networks
- Dataset link: Mail pushkal.agarwal@kcl.ac.uk
- Visualise and Compute: Ever thought of having your own social network? Mastodon is here for you. Things one can check are: How do users manage their own instance of an OSN (online social network)? Which instance is more real? Do they cluster naturally?

### Project 9: Who is the hero of your story? (Abdullahi)

Novels and stories are humankind's favourite pastime. But have you thought about why some stories seem familiar? You will try to answer this question using an innovative technique.
You should choose 5 of your favourite novels that can be found on Project Gutenberg (https://www.gutenberg.org/) and then extract the character and named entity network using this cool tool: https://github.com/harrisonpim/bookworm. Variants 1 and 2 should choose different sets of novels. The tool should give you a network of characters and named entities extracted from the text. Once the networks are out, you should sanitize them and check them for any errors. You should then visualize them and compare them on all the metrics we have seen in the module. Can you find any similarities between these networks and the network models you have seen in the module? Critique your observations along these lines in the report. Try to find methods to identify the most central nodes in these networks, and reason about why these nodes are central. Who are the *real* heroes of your favourite stories?

- Dataset link: Any book text: https://www.gutenberg.org/
  - Variant 1: 5 novels different from variant 2.
  - Variant 2: 5 novels different from variant 1.
- Visualise and Compute: Refer to this project on GitHub: https://github.com/harrisonpim/bookworm. Come up with characters and cool insights using the project functions.
- Additional Comment:
  - Can you think of more ideas and suggest commits to this project?

### Project 10: Understanding how hashtags and mentions trend (Abdullahi)

Online social media is full of "trends" and "trending topics". Content can start trending based on external or internal events, for instance Facebook's #10yearschallenge or news events like #Brexit. Hashtags bring together a community of people who participate in the same theme. Similarly, mentions (@) seek an individual's attention and add the additional load of replying. The aim of this project is to compute various metrics that describe the time during which a hashtag (or mention) was trending.
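
As one example of such a metric, the sketch below computes when each hashtag was first and last seen, and how many hours lay between the two. It assumes tweets are given as hypothetical `(ISO timestamp, text)` pairs; the real dataset format may differ, and "lifetime" here is only a crude proxy for trending.

```python
import re
from datetime import datetime


def hashtag_lifetimes(tweets):
    """Return first/last appearance, tweet count and lifetime in hours per hashtag.

    `tweets` is an iterable of (iso_timestamp, text) pairs (an assumed format).
    """
    spans = {}
    for timestamp, text in tweets:
        when = datetime.fromisoformat(timestamp)
        # De-duplicate within a tweet so each tweet counts once per hashtag.
        for tag in {t.lower() for t in re.findall(r"#(\w+)", text)}:
            first, last, count = spans.get(tag, (when, when, 0))
            spans[tag] = (min(first, when), max(last, when), count + 1)
    return {
        tag: {
            "first": first,
            "last": last,
            "tweets": count,
            "lifetime_hours": (last - first).total_seconds() / 3600,
        }
        for tag, (first, last, count) in spans.items()
    }


# Made-up sample data.
sample = [
    ("2020-02-11T09:00:00", "Polls open #Brexit"),
    ("2020-02-11T21:00:00", "#brexit debate tonight #EUref"),
    ("2020-02-13T09:00:00", "Still talking about #Brexit"),
]
stats = hashtag_lifetimes(sample)
for tag, info in sorted(stats.items()):
    print(tag, info["tweets"], info["lifetime_hours"])
```

Replacing the hashtag pattern with one for mentions gives the corresponding metric for Variant 2.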
Form a few (say 3) hypotheses about what kind of properties might make a hashtag (mention) trend, and then test these hypotheses. For example, one hypothesis may be that a hashtag (mention) is more likely to become popular if it is coined or retweeted by someone who is hugely popular (e.g. someone with a lot of followers). You can use statistical techniques like hypothesis testing to support or reject such hypotheses.

- Dataset link: Twitter data. Mail to abdullahi.abubakar@kcl.ac.uk
  - Variant 1: Use hashtags in the tweets
  - Variant 2: Use mentions in the tweets
- Visualise and Compute: When does a hashtag start and end? Co-occurring hashtags. Can you predict how long a hashtag will live?

### Project 11: Graph Sampling (Abdullahi)

In many cases, the graphs we observe are really large. A natural question to ask is whether we need to look at the entire graph, or whether a smaller subset would also work. That is, can we take a sample of the graph and determine its various properties? If not, why not? How can we make sure that a sample is "representative"? See here for a survey of various graph sampling techniques: https://arxiv.org/abs/1308.5865

You will code up some (3-5) of the graph sampling algorithms from the survey, and test whether they are good enough for sampling 3-5 real-world graphs and 3-5 randomly generated graphs, and which properties of these graphs are preserved during sampling and which are destroyed. Your report should analyse and explain your results.

### Project 12: Cases of corona virus in the world "add flight data approach" (Abdullahi)

We are now aware of the spreading, deadly corona virus. Things are becoming harder and harder, as there seems to be no cure except keeping infected people in isolation. The virus has also infected people in nations other than China. A recent study made data on infected people public [here](https://www.sciencemag.org/news/2020/02/scientists-are-racing-model-next-moves-coronavirus-thats-still-hard-predict).
In this project, you will seek to understand the spread of the virus using multiple network analysis techniques.

Variant 1: This variant investigates reported cases of the virus from different geographical aspects. The first question is: which countries are being affected? Then, if we remove the cases of the top country (i.e. China), what are the network statistics of the ones that remain? Can you comment on the different properties you analysed, such as travel history dates?

Variant 2: This variant investigates reported cases of the virus using the different demographic information in the dataset. The first question is: which demographics (age, gender, etc.) are being affected? What are the network statistics of those affected? Can you comment on the location, population, weather statistics and other properties you analysed that are related to the infection?

- Dataset link: Find in article https://www.sciencemag.org/news/2020/02/scientists-are-racing-model-next-moves-coronavirus-thats-still-hard-predict
- Data
  - [Reported cases of the Virus](https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0)
  - Flight Data: flight data can be obtained from [OpenFlights](https://openflights.org/data.html#route), [Flightradar24](https://www.flightradar24.com/data/airports/wuh) and other sources
- Visualise and Compute:
- Additional Comment: Can you suggest any measure to combat this virus based on your findings?