# CSCI 347: Project 2: Exploring Graph Data
Sam Behrens and Karl Molina
## Introduction
The graph data set we chose for our project is the "CollegeMsg temporal network", available at https://snap.stanford.edu/data/CollegeMsg.html. The data set records private messages sent between college students on an online social network: nodes are users and edges are messages sent from one user to another.
This data set is interesting to us because we can relate to it: as college students, we often send private messages to each other. Analyzing a graph like this one lets us compute statistics that we can relate back to the messaging we take part in every day.
To pre-process our data we first took a sample of 500 edges from the original data set. Then, we extracted the largest connected component from the new graph. This resulted in a total of 324 nodes and 377 edges. The original data set had 1899 nodes and 20296 edges.
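A minimal sketch of this preprocessing step is shown below. The file name, the random-sampling call, and the `load_preprocessed_graph` helper are illustrative assumptions rather than the exact script we ran, but the idea (sample 500 edges, then keep the largest connected component) is the same.
```python=
# Illustrative preprocessing sketch: sample 500 edges from the raw SNAP edge list
# and keep the largest connected component. File name and sampling are assumptions.
import random
import networkx as nx


def load_preprocessed_graph(path='CollegeMsg.txt', sample_size=500, seed=0):
    """Sample edges from the raw data set and return its largest connected component."""
    with open(path) as file:
        # each line of the raw file is "SRC DST UNIXTS"; the timestamp is ignored here
        all_edges = [tuple(map(int, line.split()[:2])) for line in file]
    random.seed(seed)
    sampled = random.sample(all_edges, sample_size)
    graph = nx.Graph()  # messages are treated as undirected edges in our analysis
    graph.add_edges_from(sampled)
    largest_cc = max(nx.connected_components(graph), key=len)
    return graph.subgraph(largest_cc).copy()
```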
We expect that users (vertices) with high centrality values will have sent or received a large number of messages on the platform. This could mean that these users are very popular at the university, or simply that they are connected to a lot of people.
We think the network will be scale-free; in other words, its degree distribution will follow a power law. We believe this because preferential attachment is likely at play in a social network: in this college messaging network, popular users who have already messaged many people will continue to connect to new users. From another perspective, new users of the platform might be introduced by a network moderator, who would therefore have messaged many people already.
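Formally, a scale-free network has a degree distribution of the form $f(k) \propto k^{-\gamma}$, where the exponent $\gamma$ is typically between 2 and 3; on a log-log plot this appears as a straight line with negative slope, which is what we look for in the degree distribution plot later in this report.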
We also believe that this graph will exhibit the small-world property. Social networks usually consist of subgroups or cliques; in a college setting these could be groups of friends or college clubs, and within a clique almost everyone will have messaged everyone else. It is easy to see that these cliques would be connected to each other by one or more people; we can imagine club leaders contacting each other about leadership events. Quantitatively, a small-world network should combine a high clustering coefficient with a short average shortest path length.
## Analysis
To analyze our graph, we implemented the functions below in Python. The code assumes the imports `numpy as np`, `networkx as nx`, `matplotlib.pyplot as plt`, `json`, and `from pathlib import Path`, along with two helpers defined elsewhere in our module: `get_graph_from_edges`, which builds a NetworkX graph from an edge list, and `get_college_msg_graph`, which loads the preprocessed graph described above.
A function that takes a vertex and a list of edges representing a graph, where each edge is a pair of vertices, and outputs the degree of that vertex:
```python=
def vertex_degree(vertex, edges):
    """
    Calculates the degree of a vertex.
    :param vertex: The vertex to find degree for.
    :param edges: The edges in the graph.
    """
    return len([edge for edge in edges if vertex in edge])
```
A function that takes a vertex and a list of edges representing a graph, where each edge is a pair of vertices, and outputs the clustering coefficient of that vertex:
```python=
def vertex_clustering_coefficient(vertex, edges):
    """
    Calculates the clustering coefficient of a vertex.
    :param vertex: The vertex to find the clustering coefficient for.
    :param edges: The edges in the graph.
    """
    neighbors = [edge[0] if edge[0] != vertex else edge[1]
                 for edge in edges if vertex in edge]
    edges_between_neighbors = [
        edge for edge in edges if edge[0] in neighbors and edge[1] in neighbors]
    neighbor_count = len(neighbors)
    max_edge_count = neighbor_count * (neighbor_count - 1) / 2
    return 0 if max_edge_count == 0 else len(edges_between_neighbors) / max_edge_count
```
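In other words, this function computes $C(v) = \dfrac{m_v}{k_v (k_v - 1)/2}$, where $k_v$ is the degree of $v$, $m_v$ is the number of edges between neighbors of $v$, and the denominator is the maximum possible number of such edges.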
A function that takes a list of edges representing a graph, where each edge is a pair of vertices, and outputs the clustering coefficient of the graph (the average of the per-vertex clustering coefficients):
```python=
def graph_clustering_coefficient(edges):
    """
    Calculates the clustering coefficient of a graph.
    :param edges: The edges in the graph.
    """
    vertices = set(np.array(edges).flatten())
    total_clustering_coefficients = sum(
        vertex_clustering_coefficient(vertex, edges) for vertex in vertices)
    return total_clustering_coefficients / len(vertices)
```
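As a quick sanity check (a small illustrative example of our own, not part of the data set), these three functions can be run on a toy graph consisting of a triangle plus one pendant vertex:
```python=
# Toy graph: triangle 1-2-3 plus a pendant vertex 4 attached to vertex 3.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]

print(vertex_degree(3, toy_edges))                  # 3
print(vertex_clustering_coefficient(3, toy_edges))  # 1/3: one edge among 3 possible neighbor pairs
print(graph_clustering_coefficient(toy_edges))      # (1 + 1 + 1/3 + 0) / 4 ≈ 0.583
```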
A function that takes a vertex and a list of edges representing a graph, where each edge is a pair of vertices, and outputs the closeness centrality of that vertex:
```python=
def closeness_centrality(vertex, edges):
    """
    Calculates the closeness centrality of a vertex.
    :param vertex: The vertex to find the closeness centrality for.
    :param edges: The edges in the graph.
    """
    path_lengths = nx.single_source_shortest_path_length(
        get_graph_from_edges(edges), vertex)
    return 1 / sum(path_lengths.values())
```
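This is the unnormalized form of closeness centrality, $C(v) = 1 / \sum_{u} d(v, u)$, where $d(v, u)$ is the shortest-path distance from $v$ to $u$. NetworkX's built-in `nx.closeness_centrality` additionally multiplies by $n - 1$, so its absolute values differ from ours, but the node ranking is the same on a connected graph.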
A function that takes a vertex and a list of edges representing a graph, where each edge is a pair of vertices, and outputs the betweenness centrality of that vertex, together with a memoized helper that precomputes all shortest paths in the graph:
```python=
def betweenness_centrality(vertex, edges):
    """
    Calculates the betweenness centrality of a vertex.
    :param vertex: The vertex to find betweenness centrality for.
    :param edges: The edges in the graph.
    """
    graph = get_graph_from_edges(edges)
    vertex_i = vertex
    total = 0
    nodes = list(graph.nodes)
    all_shortest_paths = find_all_shortest_paths(graph)
    for j in range(graph.number_of_nodes()):
        vertex_j = nodes[j]
        if vertex_j == vertex_i:
            continue
        for k in range(j + 1, graph.number_of_nodes()):
            vertex_k = nodes[k]
            if vertex_k == vertex_i:
                continue
            shortest_paths = all_shortest_paths[vertex_j][vertex_k]
            num_shortest_paths_j_k = 0
            num_shortest_paths_pass_through_i = 0
            try:
                for path in shortest_paths:
                    num_shortest_paths_j_k += 1
                    if vertex_i in path:
                        num_shortest_paths_pass_through_i += 1
            except nx.NetworkXNoPath:
                continue
            total += num_shortest_paths_pass_through_i / num_shortest_paths_j_k
    return total


global_shortest_paths = {}


def find_all_shortest_paths(graph):
    """
    Finds all the shortest paths in a graph and stores them in a JSON file.
    This is memoized.
    :param graph: The graph to find the shortest paths in.
    :return: A dictionary keyed by two nodes to get a list of shortest paths between them.
    """
    if Path('shortest-paths.json').exists():
        if global_shortest_paths:
            return global_shortest_paths
        with open('shortest-paths.json') as file:
            shortest_paths = json.load(file)
        shortest_paths = {int(node1): {int(node2): lst for node2, lst in d.items()}
                          for node1, d in shortest_paths.items()}
        global_shortest_paths.update(shortest_paths)
        return shortest_paths
    nodes = list(graph.nodes)
    shortest_paths = {node: {} for node in nodes}
    for j in range(graph.number_of_nodes()):
        vertex_j = nodes[j]
        for k in range(j + 1, graph.number_of_nodes()):
            vertex_k = nodes[k]
            s = nx.all_shortest_paths(graph, source=vertex_j, target=vertex_k)
            shortest_paths[vertex_j][vertex_k] = list(s)
    with open('shortest-paths.json', 'w') as file:
        file.write(json.dumps(shortest_paths, indent=2))
    return shortest_paths
```
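The quantity computed here is the unnormalized betweenness centrality

$$BC(v_i) = \sum_{j < k,\; j \ne i,\; k \ne i} \frac{\sigma_{jk}(v_i)}{\sigma_{jk}},$$

where $\sigma_{jk}$ is the number of shortest paths between $v_j$ and $v_k$ and $\sigma_{jk}(v_i)$ is the number of those paths that pass through $v_i$.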
A function that takes a list of edges representing a graph, where each edge is a pair of vertices, and outputs the average shortest path length of the graph:
```python=
def average_shortest_path(edges):
    """
    Calculates the average shortest path of a graph.
    :param edges: The edges in the graph.
    """
    graph = get_graph_from_edges(edges)
    nodes = list(graph.nodes)
    total_shortest_paths = 0
    shortest_paths_count = 0
    for i in range(len(nodes)):
        vertex_i = nodes[i]
        for j in range(i + 1, len(nodes)):
            vertex_j = nodes[j]
            total_shortest_paths += nx.shortest_path_length(graph, source=vertex_i, target=vertex_j)
            shortest_paths_count += 1
    return total_shortest_paths / shortest_paths_count
```
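This is the mean shortest-path distance over all unordered pairs of vertices in the connected graph, $\bar{d} = \frac{2}{n(n-1)} \sum_{i < j} d(v_i, v_j)$.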
### Graph Visualization After Pre-processing

```python=
def visualize_graph():
    """
    Visualizes the graph.
    :return: None.
    """
    graph = get_college_msg_graph()
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx(graph, pos, node_size=10, with_labels=False)
    plt.title('Visualization of Preprocessed Graph')
    plt.show()
```
### Top Nodes by Metric
Below are visualizations of the graph highlighting the top ten nodes according to each metric. The Python code used to create each plot follows it.

```python=
def plot_top_ten_betweenness_centrality_nodes():
    """
    Shows the top ten betweenness centrality nodes in the graph.
    :return: None.
    """
    graph = get_college_msg_graph()
    # nx.betweenness_centrality is used here for speed; our own betweenness_centrality
    # function could be used instead via:
    # top_ten_betweenness = get_top_nodes(10, graph, betweenness_centrality)
    bc = nx.betweenness_centrality(graph, normalized=False)
    top_ten_betweenness = get_top_nodes(10, graph, lambda vertex, edges: bc[vertex])
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx_edges(graph, pos)
    nx.draw_networkx_nodes(graph, pos, top_ten_betweenness, node_color='magenta', node_size=20,
                           label='top 10 betweenness centrality nodes')
    plt.legend()
    plt.title('Top 10 Betweenness Centrality Nodes')
    plt.show()


def get_top_nodes(number, graph, function):
    """
    Gets the given number of nodes with the highest values of the given function.
    :param number: The number of top nodes to get.
    :param graph: The graph to get the nodes from.
    :param function: A function that takes in vertex, edges and returns a number.
    :return: The top `number` nodes.
    """
    nodes = list(graph.nodes)
    metrics = []
    for i in range(len(nodes)):
        vertex = nodes[i]
        metrics.append(function(vertex, graph.edges))
    metrics = np.array(metrics)
    args = np.argpartition(-metrics, number)[:number]
    return np.array(nodes)[args]
```

```python=
def plot_top_ten_closeness_centrality_nodes():
    """
    Plots the ten nodes with the highest closeness centralities.
    :return: None.
    """
    graph = get_college_msg_graph()
    top_ten_closeness = get_top_nodes(10, graph, closeness_centrality)
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx_edges(graph, pos)
    nx.draw_networkx_nodes(graph, pos, top_ten_closeness, node_color='magenta', node_size=20,
                           label='top 10 closeness centrality nodes')
    plt.legend()
    plt.title('Top 10 Closeness Centrality Nodes')
    plt.show()
```

```python=
def plot_top_ten_eccentricity_nodes():
    """
    Plots the ten nodes with the highest eccentricity values.
    :return: None.
    """
    graph = get_college_msg_graph()
    eccentricities = nx.eccentricity(graph)
    top_ten_eccentricity = get_top_nodes(10, graph, lambda vertex, edges: eccentricities[vertex])
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx_edges(graph, pos)
    nx.draw_networkx_nodes(graph, pos, top_ten_eccentricity, node_color='magenta', node_size=20,
                           label='top 10 eccentricity nodes')
    plt.legend()
    plt.title('Top 10 Eccentricity Nodes')
    plt.show()
```

```python=
def plot_top_ten_eigenvector_centrality_nodes():
    """
    Plots the ten nodes with the highest eigenvector centrality values.
    :return: None.
    """
    graph = get_college_msg_graph()
    eigenvector_centralities = nx.eigenvector_centrality(graph)
    top_ten_eigenvector = get_top_nodes(10, graph, lambda vertex, edges: eigenvector_centralities[vertex])
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx_edges(graph, pos)
    nx.draw_networkx_nodes(graph, pos, top_ten_eigenvector, node_color='magenta', node_size=20,
                           label='top 10 eigenvector centrality nodes')
    plt.legend()
    plt.title('Top 10 Eigenvector Centrality Nodes')
    plt.show()
```

```python=
def plot_top_ten_pagerank_nodes():
    """
    Plots the ten nodes with the highest PageRank.
    :return: None.
    """
    graph = get_college_msg_graph()
    pagerank = nx.pagerank(graph)
    top_ten_pagerank = get_top_nodes(10, graph, lambda vertex, edges: pagerank[vertex])
    pos = nx.spring_layout(graph, seed=0)
    nx.draw_networkx_edges(graph, pos)
    nx.draw_networkx_nodes(graph, pos, top_ten_pagerank, node_color='magenta', node_size=20,
                           label='top 10 PageRank nodes')
    plt.legend()
    plt.title('Top 10 PageRank Nodes')
    plt.show()
```
### 5 Nodes with Highest Betweenness Centrality and their Clustering Coefficients
| Node | Betweenness Centrality | Clustering Coefficient |
| -------- | -------- | -------- |
| 97 | 12175.651634476619 | 0.0 |
| 9 | 11587.779856255003 | 0.01282051282051282 |
| 617 | 9317.13865023868 | 0.0 |
| 103 | 8538.471142746152 | 0.0 |
| 1624 | 8214.950396825416 | 0.047619047619047616 |
```python=
def top_five_betweenness_centrality_and_clustering_coefficients():
    """
    Finds the five nodes with the highest betweenness centralities along with their clustering coefficients.
    :return: None.
    """
    graph = get_college_msg_graph()
    top_five = get_top_nodes(5, graph, betweenness_centrality)
    for node in top_five:
        print('|', node,
              '|', betweenness_centrality(node, graph.edges),
              '|', vertex_clustering_coefficient(node, graph.edges), '|')
```
### 5 Nodes with Highest Degree and their Clustering Coefficient
| Node | Degree | Clustering Coefficient |
| -------- | -------- | -------- |
| 9 | 13 | 0.01282051282051282 |
| 103 | 11 | 0.0 |
| 617 | 10 | 0.0 |
| 12 | 10 | 0.0 |
| 97 | 9 | 0.0 |
```python=
def top_five_degrees_and_clustering_coefficients():
    """
    Finds the five nodes with the highest degrees along with their clustering coefficients.
    :return: None.
    """
    graph = get_college_msg_graph()
    top_five = get_top_nodes(5, graph, vertex_degree)
    for node in top_five:
        print('|', node,
              '|', vertex_degree(node, graph.edges),
              '|', vertex_clustering_coefficient(node, graph.edges), '|')
```
The graph clustering coefficient is 0.002354 and the average shortest path length is 6.897. The graph does not exhibit small-world behavior: a small-world network should combine a high clustering coefficient with a short average path length, whereas our graph's clustering coefficient is very low and its average shortest path is relatively long.

```python=
def plot_log_log_degree_distribution(G):
    """
    Plots the degree distribution on log-log axes.
    :param G: The graph.
    :return: None.
    """
    # degree_histogram returns counts; divide by the number of nodes to get proportions
    k = np.asarray(nx.degree_histogram(G)) / G.number_of_nodes()
    plt.loglog(k, 'b^')
    plt.title('Log-log degree distribution')
    plt.xlabel('k (degree)')
    plt.ylabel('f(k) (proportion of nodes with degree k)')
    plt.show()
```
The degree distribution does exhibit power-law behavior: the points roughly follow a straight line with negative slope on the log-log plot.

```python=
def plot_log_log_cc_vs_degree(G):
    """
    Plots the average clustering coefficient against degree on log-log axes.
    :param G: The graph.
    :return: None.
    """
    # group the nodes by degree
    nodes_by_degree = {}
    for node, degree in G.degree:
        if degree not in nodes_by_degree:
            nodes_by_degree[degree] = []
        nodes_by_degree[degree].append(node)
    # average clustering coefficient of the nodes at each degree
    degrees = []
    average_clustering_coefficients = []
    for degree, nodes in nodes_by_degree.items():
        degrees.append(degree)
        average_clustering_coefficients.append(nx.average_clustering(G, nodes))
    plt.loglog(degrees, average_clustering_coefficients, 'b^')
    plt.title('Log-log plot of average clustering coefficient vs. degree')
    plt.xlabel('Degree')
    plt.ylabel('Average clustering coefficient')
    plt.show()
```
The clustering coefficient versus degree does not follow a power law: the points do not fall along a straight line on the log-log plot and appear to be scattered randomly.