# Network Application Diagnostics [toc] :::info - Three exams. - First is from seminars. - Second is from slide of course. - Third is oral exam. - Harik's contact: (+420) 602 393 927 ::: --- ## Reference :::info - [Course Wiki -- CTU](https://cw.fel.cvut.cz/wiki/courses/be2m32dsaa/start) ::: --- ## Overview ### Prerequisities :::info - Knowledge of linear algebra, graph theory, algorithms - Knowledge of network application and protocol fundamentals ::: :::success - Introduction to complex network structures - Network characteristics identification - Recognition of both structural static and dynamic patterns - Anomaly detection - Specification methods of static and dynamic behavior - Model verification - Diagnostic process automation - Examples dealing with digital network application issues ::: --- ## Introduction to Complex Networks ### Examples :::info - Definition: A complex network is a graph (network) with attributes, and the attributes do not occur in simple networks such as lattices or random graphs but often occur in networks representing real systems. ::: - Complex network not only refers to internet. :::success - Biological networks - predator-prey relations - brain network - the food web - Social networks - network of acquaintances - phone call network - opinion formation - Technological networks - the internet - sensor network - Informational networks - World Wide Web - Facebook - peer to peer ::: ### Approaches to Complex network :::info - analysis --> prototype --> try to solve issue but not practical yet - production --> practical product ::: :::success - Volume: Suitable tools(endpoints) for measuring the fractal property of complex networks (a lot of data in real world) - Visualization: Usually we have no idea about raw data, we need to start from visualization, and it usually gives us the insight of the attributes of the whole thing. - **Do remember tools do not often scale with real data volumes.(Analysis can usually only deal with a small part of total real data volumes.)** ::: ### Netflow :::info - Condensed records of packet flows (compression of packet flows which have similar attribute) ::: :::success - Example of netflow statistics: An enterprise traffic as a netflow sample taken during 9 days - measurement 1 ![](https://hackmd.io/_uploads/BkFwbJfg6.png) - measurement 2 ![](https://hackmd.io/_uploads/rkfo-kGg6.png) - Another example ![](https://hackmd.io/_uploads/r1EvGyzea.png) - Vacancy parts means connection but without data transaction. - **Last thing wanna recall: Huge amount of netflows is hard to visualize** - for example: 600000000 netflows = 60000 samples x big sample size 10000 - sample size 10000 is huge, and it will easily miss data ::: ### Egypt Data - Family Recognition :::info - Using family designation. - husband, wife, son, etc - It is assumed as sparse data. ::: :::success - example of circular layout > a family as a connected component ![](https://hackmd.io/_uploads/ryOLLyfgp.png) - example of hierarchical layout > a family tree ![](https://hackmd.io/_uploads/Hk0v8Jfea.png) ::: ### Coding :::info - We could also visualize the code before we start the project, then we at least can know the attribute. - assembly code ![](https://hackmd.io/_uploads/BkzuDkMxa.png) - visualization ![](https://hackmd.io/_uploads/SymFvJfxa.png) ::: :::success - when it comes to big software product with over 10000000 lines of code, over 400 modules, it is not possible to analyze directly. ![](https://hackmd.io/_uploads/B1gIdJGlT.png) - so we can use multi-layer circular layout, it can make it more readable ![](https://hackmd.io/_uploads/Hybj_1fe6.png) - but for small software, hierarchical layout is still good ::: ### Graph :::info - A graph is a set of vertices and a set of lines between pairs of vertices. ![](https://hackmd.io/_uploads/BJnZikflp.png) - explanation of terms ![](https://hackmd.io/_uploads/HyzrsJzea.png) - simple graph - A simple undirected graph has no loops and no parallel edges. - A simple directed graph has no parallel arcs. ::: ### Complex network :::info - A network consists of a graph or more graphs and additional information on the vertices or the lines of the graph. ![](https://hackmd.io/_uploads/rynf31zgp.png) ::: :::success - Asymptotic Notation ![](https://hackmd.io/_uploads/S1sKnyflT.png) - and we had better make sure the algorithm is linear not exponential - NP-Completeness ![](https://hackmd.io/_uploads/B1YZTJMxT.png) - and other complexity classes harder than NP ![](https://hackmd.io/_uploads/HkorTJzgT.png) - tree search - DFS tree: go find leaf first one by one - BFS tree: go find sibling first one by one - DFS-tree Search Edge Classification - discovery time Td(v): when v is incorporated into T - the finish time Tf(v): when all the neighbors of v are found to be already in T - tree edge: the edge belongs to tree - back edge: the edge can not belong to tree, since it will destroy the structure of tree, e.g. tree can not have a loop ::: ### The shortest path :::info - Dijkstra's Single Source Shortest Paths, O(N^2) - Floyd-Warshall All Pairs Shortest Paths, O(N^3) --> more powerful but higher time complexity ![](https://hackmd.io/_uploads/rykQXlfea.png) ::: ### Test 1 :::danger **Name several examples of complex networks application domains?** - Biological network, like brain network - Social network, like network of acquaintances, phone call network - Technological network, like the internet, sensor network, World Wide Web **What are the two difficult issues linked with processing of complex networks?** - For me, it is not only two issues hard to deal with - The data we handle, is usually raw data we have no idea about it, just deal with it based on the behavior - Usually, we can only deal with a small part of total real data volumes - The computational cost is also significant **What is the range of complex network volume?** - It depends on number of nodes and edges, and also density - It can vary significantly **Name several drawing layouts used for complex network visualizations?** - hierarchical layout for family tree - circular layout for community - spring layout (Force-Directed Layout) - random layout **Define a complex network and its basic features** - is a system of interconnected nodes or entities - basic features - node - edge - connectivity(degree) - small-world properties (short paths between nodes) - etc **Define asymptotic bounds used for assessment of algorithm complexity** ![S1sKnyflT](https://hackmd.io/_uploads/r1WUng2_T.png) **Describe DFS-tree search edge classification** - tree edges: part of the DFS tree - back edges: not part of the DFS, since it will form a cycle, it might destroy tree structure **Describe depth-first search algorithm** - graph traversal algorithm that explores as far or deep as possible along each branch before backtracking - procedure 1. follow one of the unvisited edges 2. explore as deeply as possible 3. dead end is reached, backtrack to the previous node with unexplored paths, back to step 1 **Describe breath-first search algorithm** - graph traversal algorithm that explores graph level by level - move to the next level of nodes only after all nodes at the current level have been visited **Describe the Dijkstra's single source shortest paths** - finding the shortest paths from a single source node to all other nodes in a weighted graph - process 1. set the distance from the source node to itself as 0 and all other distances to infinity, and mark all nodes as unvisited 2. choose the unvisited node with the smallest known distance as the current node 3. for each neighboring node of the current node - calculate the tentative distance from the source to that node through the current node - if the tentative distance is smaller than the current known distance, update it 4. mark as visited, and repeat step 2 to 4 until all nodes are visited **Describe the Floyd-Warshall all pairs shortest paths** - finding the shortest paths between all pairs of nodes in a weighted graph (there is no negative cycle) - process 1. initialize the matrix with the direct edge weights between connected nodes and set infinity for unconnected pairs 2. iterative update - for each intermediate node k, iterate through all pairs of nodes (i, j) - If the path from i to j through node k is shorter than the current known distance, update the distance matrix 3. repeat the iterative update process for all possible intermediate nodes ::: --- ## Fundamental Characteristics of Networks & Models of Random Graphs ### Preface :::info - Complex Network Analysis (CNA) - Relations (dyads, triads) are the unit of analysis - Actions of actors are interdependent. - Static: no change when time is going on - Dynamic: change when time is going on - Social Network Analysis (SNA) - Humanities and social science - actually same as CNA, but input is different - Graph can be represented as sets or with matrices - pros: more readable - cons: not efficient if network is sparse (too many zero element) - definition of parameters - numerical: in order and the elements are real number - ordinal: in order and can be string(categorical value) like: high, medium, low - nominal: can not be order(categorical value) - size of network/graph - number of vertices N - number of lines M ::: ### How to analyze Complex networks :::info - procedure of analyzation - find the properties of network - find more important nodes - find the relation of each group - typical characteristics of complex networks - degree heterogeneity: each node has different number of neighbors - bridges: tend to be short cut and connect to other nodes/groups - Degree Heterogeneity - Gaussion: randomly choose nodes and connect ![](https://hackmd.io/_uploads/SklGlUTx6.png) - Skewed distribution: choose preferential nodes (closer or weight) ![](https://hackmd.io/_uploads/BJJZlLalT.png) - Theorem - Vertex Degree Statistics -> determine which methods to use ![](https://hackmd.io/_uploads/ryIqgL6ea.png) - Degree Variability & mean ![](https://hackmd.io/_uploads/rJSUZUpla.png) - graph sparsity - definition of dense network ![](https://hackmd.io/_uploads/B1JbzL6eT.png) - definition of sparse network ![](https://hackmd.io/_uploads/SyCZM8aep.png) - if the density is medium and the network is huge, it will be changing to analyze (no formula to approximate) - degree distribution ![](https://hackmd.io/_uploads/Byau38Tx6.png) - regular graph: each vertex has the same degree - random graph: with Gaussian graph - small world graph: skewed distribution - random graphs - basic idea: edges are added at random between a fixed number of vertices - four basic models of complex networks ![](https://hackmd.io/_uploads/SkHFbPTeT.png) - regular lattices (meshes) and trees - Erdos-Renyi Random Graphs (ER): a disconnected set of nodes that are paired with a uniform probability - Watts-Strogatz Models (WS, SW): - small world networks - connections between the nodes in a regular graph were rewired with certain prob. - Barabasi-Albert Model (BA, SF): scale-free networks characterized by a highly heterogeneous degree distribution ![](https://hackmd.io/_uploads/rJaSWPpg6.png) - basic topologies of graphs - 1 ![](https://hackmd.io/_uploads/rJSC-vTlT.png) - 2 ![](https://hackmd.io/_uploads/SJ7kfP6gT.png) - regular graph ![](https://hackmd.io/_uploads/Bk-fMDTlp.png) - all vertices have the same degree - ER model ![](https://hackmd.io/_uploads/SyfOzPag6.png) - prob. to have an edge between any pair of nodes is distributed uniformly at random ![](https://hackmd.io/_uploads/HJt3fvTx6.png) - ER-model is binomial distribution ![](https://hackmd.io/_uploads/BkXy7Paxa.png) - the procedure of ER model ![](https://hackmd.io/_uploads/B17sXDagT.png) - approaching Poisson distribution as N is infinite(big number) ![](https://hackmd.io/_uploads/H1XVmwag6.png) - characteristic of Giant compenent ![](https://hackmd.io/_uploads/BygR7vplT.png) - closer to the median, it gradually becomes a component - Watts-Strogatz Small World Model ![](https://hackmd.io/_uploads/S1WzDw6l6.png) - from "six degree of separation": can reach any node more quick (only 6 hops) - degree distribution ![](https://hackmd.io/_uploads/SJL8Dwae6.png) - the model takes a regular clustered network - rewire the endpoint of each link to a random node with prob. p - Scale-Free(BA) network (real-world networks with fat-tail distribution) ![](https://hackmd.io/_uploads/SyOAqvalp.png) - many networks in the real-world have a fat-tailed degree distribution ![](https://hackmd.io/_uploads/Bk4M_PaeT.png) - degree distribution ![](https://hackmd.io/_uploads/H1cQFv6gT.png) - prefer higher degree nodes (like we like to use Google since most of people use Google) - Rich Club ![](https://hackmd.io/_uploads/SyYYivTea.png) - hubs: like manager of each team, it's only contact window with other teams, and fully connect with other managers - teams: each group in complex network - also degree distribution - rich-club coefficient: degree of members in hubs (obviously dense on the figure) - example: collaboration of people on project ::: - Appendix -> Assortativity: - assortative mixing: similar nodes and preferentially connect to each other - disassortative mixing: dissimilar nodes but preferentially connect to each other - degree correlation: assortativity regarding to node degree - assortativity coefficient: vertex is labeled with a scalar value like weight ### Test 2 :::danger **Describe the network perspective approach to problem solutions.** - for the main stream social science, we think - society is a set of independent individuals - each individual is the unit of analysis, treated as bundles of attributes - for Complex Network Analysis (CNA) - relations (dyads, triads) are the unit of analysis, and relation does matter - actions of actors are interdependent. - static: no change when time is going on - dynamic: change when time is going on **What are the typical characteristics of complex networks?** - local view - Degree Heterogeneity: different number of degree of each node - Bridges and Small Worlds: tend to be short cut and connect to other nodes/groups (bridges between different communities) - global view - dense subgraphs - clusters **Describe the meaning of degree heterogeneity.** each node has different number of neighbors **Define graph density and sparsity.** - density: the number of edges is about quadratic in their number of vertices ![image](https://hackmd.io/_uploads/HyW8lH3_6.png) - sparse: the number of edges is about linear in their number of vertices ![image](https://hackmd.io/_uploads/rk1DeS3up.png) **Define graph degree distribution and show some its typical examples.** - prob. of degree in a graph (total number of vertices with the same degree) - examples ![image](https://hackmd.io/_uploads/HyYcbShup.png) - regular distribution - random distribution (Gaussian) - small world distribution (skewed distribution) **List the four basic models of complex networks and their characteristics.** - Regular random graph: each node has the same degree and pair with each other with uniform prob. - Erdos-Renyi random graph: nodes pair with each others with uniform prob. - Watts-Strogatz graph: small world networks - Barabasi-Albert graph: scale-free networks, characterized by a highly heterogeneous degree distribution, which follows power law ![image](https://hackmd.io/_uploads/Hy62Bu3_a.png) **List basic graph topologies.** - example 1 ![image](https://hackmd.io/_uploads/By7DzBhda.png) - example 2 ![image](https://hackmd.io/_uploads/ByvdMBhdp.png) - empty graph - path graph - star graph - tree graph - cycle graph - wheel graph - complete graph **Describe Erdos-Renyi graph model.** - edges between any pair of nodes is distributed uniformly at random - ER model is binomial distribution - approach Poisson distribution as N is infinite **Describe Watts-Strogatz graph model.** - can reach every node in few hops - the middle between regular and random ![S1WzDw6l6](https://hackmd.io/_uploads/r1W1wu2dp.png) - degree distribution ![SJL8Dwae6](https://hackmd.io/_uploads/SkfXP_nua.png) - it is regular graph - but rewire the link to a random node with prob. p **Describe Barabasi-Albert graph model and its scale-free property.** - it is similar to real-world networks, and with fat-tail distribution - preferential attachment **What is the meaning of "the rich-club phenomenon".** - there are some nodes as hubs - there are some nodes as groups - each group has a hub - the only contact window with other groups is hub - hubs have full connection with each other ::: --- ## Network Properties - matrix is used to prove network is valid, but we need to formulate the network in simpler way instead of matrix - definition of field: (CZ pole, komutativni teleso)is a set on which are defined addition, subtraction, multiplication, and division satisfying the field axioms (commutativity, associativity, a unit) - like every number in the field can do operation with each other, and the result is also in the field ### Algebra :::info - reminder of Algebra ![](https://hackmd.io/_uploads/BJiMRx4Zp.png) - reminder of Matrix ![](https://hackmd.io/_uploads/r1rhRgV-p.png) - reminder of Matrix Transposition ![](https://hackmd.io/_uploads/B1ZtlbEbT.png) - example ![](https://hackmd.io/_uploads/Sk9igW4-p.png) - reminder of Orthogonality ![](https://hackmd.io/_uploads/ByL7QQVbp.png) - reminder of Matrix Inversion ![](https://hackmd.io/_uploads/BJ0LmQVb6.png) - Matrix Eigenvalues ![](https://hackmd.io/_uploads/H1vEPXVZT.png) - example ![](https://hackmd.io/_uploads/Sk_MwQVZT.png) - Schur decomposition ![](https://hackmd.io/_uploads/Byz0umVW6.png) - Adjacency Matrix ![](https://hackmd.io/_uploads/SJhqiXEbp.png) - Cocitation matrix ![](https://hackmd.io/_uploads/HJ9ajm4-p.png) - example ![](https://hackmd.io/_uploads/rJyXAQ4Z6.png) - Bibliographic coupling ![](https://hackmd.io/_uploads/SJo_0mVWp.png) - Bi-adjacency Matrix ![](https://hackmd.io/_uploads/Hyzp07V-6.png) - Adjacency and Bi-adjacency Matrix ![](https://hackmd.io/_uploads/HyP-eV4-6.png) - Incidence matrix ![](https://hackmd.io/_uploads/SJUPlEVb6.png) - projection ![](https://hackmd.io/_uploads/HJPFeE4-T.png) - projection property 1 ![](https://hackmd.io/_uploads/rJT6gE4-a.png) - projection property 2 ![](https://hackmd.io/_uploads/SywJ-EVZp.png) - Undirected Graph - Node Degree ![](https://hackmd.io/_uploads/r12B-EEZ6.png) - Density ![](https://hackmd.io/_uploads/r1YPZ4N-a.png) - Vertex Degree ![](https://hackmd.io/_uploads/rJksWVN-a.png) - Paths in Simple Graph ![](https://hackmd.io/_uploads/S1blfEVZ6.png) - Cycles in Simple Graph ![](https://hackmd.io/_uploads/SJBffN4bT.png) - Cycles in Simple Graph and Eigenvalues ![](https://hackmd.io/_uploads/Hy4SYSN-a.png) ::: :::success - centrality measures/ranking - measuring the importance/prominence of a node within the network - Degree Centrality (Node Activity) - Betweenness Centrality (Intermediate Position) - Closeness Centrality (Distance to other nodes) - Eigenvector Centrality (Important nodes have important friends) - Power Centrality (Close to hubs) - Page Rank - evaluation of the location actors in the network - Insight into various roles and groupings in a network - Connectors, mavens, leaders, bridges, isolates, broker, hubs - Where are the clusters and who is in them? - Who is in the core of the network? Who is on the periphery? - What is a single point of failure? - Degree centrality in matrix ![](https://hackmd.io/_uploads/r1ckiB4Zp.png) - Closeness centrality in matrix ![](https://hackmd.io/_uploads/BytBsHNWT.png) - example ![](https://hackmd.io/_uploads/BkH9jSEWT.png) - Betweenness centrality - concept: how many shortest path will go through the specific node, more nodes pass through it, show it more important - how to get it? - traditional: Floyd-Warshall algorithm, computational cost too high - sometimes we can see also - Dijkstra's algorithm - Johnson's algorithm: specific for sparse networks - Newman derive faster algorithm -> use Breath-first search algorithm for each node first, from bottom to top, accumulate the weight of nodes and edges, there is an example ![](https://hackmd.io/_uploads/HJ58yOaWp.png) - important nodes have important friends - the importances are eigenvalue, and its elements should be positive(since show something is "negative" important is weird) - Eigenvalue Centrality - Iterative Approach ![](https://hackmd.io/_uploads/SJtfZdpZp.png) - we try to get Eigenvalue to know how importance of each node - even calculation of 3 by 3 matrix is challenging, but in reality we face one million by million - so we use iterative approach - assume the network is linear combination of Eigenvalue, and we can ignore the vector which has less contribution - properties - works well for undirected networks because the mathematical properties - for directed network, sometimes ignore un-important nodes too much - Katz Centrality ![](https://hackmd.io/_uploads/S1VfmOpZa.png) - to resolve the issue with zero eigenvalue centralities for directed network - Alpha Centrality - another solution for Katz centrality, but add constant value to avoid ignoring some nodes - procedure - generalize - add constants - Centrality Measures - Importance of Nodes ![](https://hackmd.io/_uploads/ry5GE_6-6.png) - different characteristics of each centrality - PageRank - sometimes some nodes are super important and lots of nodes access it - however we don't want the importance of the important node propogates to neighbor nodes which are not important - we change to calculate the degree of spreading out from the node ::: ### Test 3 :::danger **Define adjacency matrix, cocitation matrix, and bibliographic coupling.** - adjacency matrix: is a square matrix representing a graph, each element (i, j) indicates whether there is an edge between nodes i and j in the graph - cocitation matrix: number of times nodes i and j are cited by other node together - bibliographic coupling: the number of common references they share - equation of cocitation matrix and bibliographic coupling matrix ![BkPjsOpWT](https://hackmd.io/_uploads/BkWaQFnua.png) - illustration ![image](https://hackmd.io/_uploads/BkzLBcnOp.png) **Define bi-adjacency matrix, incidence matrix, edge incidence matrix.** - bi-adjacency matrix: is a rectangular matrix that represents the bipartite structure of a graph, rows correspond to one set of nodes, columns to another ![image](https://hackmd.io/_uploads/rJUwvq2uT.png) - incidence matrix: representing the relationships between nodes and edges in a graph, rows correspond to nodes, columns to edges ![image](https://hackmd.io/_uploads/HJiy_q3_p.png) - edge incidence matrix: same as incidence matrix, rows to edges, but columns correspond to nodes **Define one-mode projection and its relation to bi-adjacency matrix.** - focus on one set of nodes and create connections between them based on shared interactions in another set - example: users would create connections between users who have interacted with the same movies **Show how to compute degree of vertex, the number of edges, the mean degree, and graph density based on the adjacency matrix for undirected and directed graphs.** - undirected - degree of vertex ![image](https://hackmd.io/_uploads/Skxl5dRda.png) - the number of edges ![image](https://hackmd.io/_uploads/ry71qOAua.png) - the mean degree ![image](https://hackmd.io/_uploads/SyPG9_R_6.png) - graph density ![image](https://hackmd.io/_uploads/B1hw5uAdp.png) - directed - degree of vertex ![image](https://hackmd.io/_uploads/HyYt9_Cup.png) - the number of edges ![image](https://hackmd.io/_uploads/BJ8ccuCda.png) - the mean degree ![image](https://hackmd.io/_uploads/H1TCc_0Oa.png) - graph density ![image](https://hackmd.io/_uploads/B1hw5uAdp.png) **Show how to compute number of paths and cycles based on the adjacency matrix.** - number of paths ![image](https://hackmd.io/_uploads/H1PymKAu6.png) - number of cycles ![image](https://hackmd.io/_uploads/rJdz7YROp.png) **Define degree centrality.** the number of edges connected to that node **Define closenes centrality.** how close a node to all other nodes in the network **Define betweenness centrality.** - show how many shortest path will go through the specific node, more nodes pass through it, show it more important **Describe an algorithm for betweenness centrality computation.** - traditional: Floyd-Warshall algorithm, computational cost too high - sometimes we can see also - Dijkstra’s algorithm - Johnson’s algorithm: specific for sparse networks - Newman's betweenness algorithm - fast algorithm - use Breath-first search algorithm to find the shortest path for each node first - and accumulate the weight of nodes and edges **Define eigenvalue centrality.** - eigenvector of the adjacency matrix - important nodes have important friends which have positive gain **Define Katz centrality.** measure the influence of a node by summing the contributions from its neighbors, further nodes with less weight **Define PageRank index.** it assigns importance scores to nodes based on their incoming links ::: --- ## Network structure identification ### Node Influence - Why we use Eigenvalue -> we hope input x and output y have linear relationship :::info - defintion of nouns - influence maximization: if we inject the info. into a node, we can know how far it can spread out the info. to other nodes - link-based classification: the properties of neighborhood - a community: is defined by a clique (maximal complete subgraph) in a network - rawComm: approximate measured number of communities which a node is attached ![](https://hackmd.io/_uploads/r1ybF_T-6.png) - example of community-based node roles ![](https://hackmd.io/_uploads/BJCjYOpba.png) - community metric: the number of communities -> rawComm - relative degree: the number of edges - hubs: nodes that tell us where the best authorities are to be found - authorities: nodes that contain useful information on a topic of interest - another example - Karate club ![](https://hackmd.io/_uploads/B1K4qu6W6.png) - The centrality algorithm: called hyperlink-induced topic search or HITS ![](https://hackmd.io/_uploads/r11ZiuT-a.png) - two major parameters - authority centrality: sum of hub centrality - hub centrality: sum of authority centrality - each vertex has an authority centrality and a hub centrality - choose one of both to calculate - recall the definition of nouns ![](https://hackmd.io/_uploads/BkPjsOpWT.png) ::: ### Data Clustering :::info - data clustering - cluster is a collection of objects that are similar to each other using some attribute - supervised: a finite set of labels/tags is provided - unsupervised: based on similarities of objects - input data types ![](https://hackmd.io/_uploads/rkXsXuPfT.png) - example of anomalies ![](https://hackmd.io/_uploads/HyGrVuwzT.png) - N1, N2 are regions of "normal" behavior - normal behavior ![](https://hackmd.io/_uploads/SyWoVdwMT.png) - O1, O2, O3 are anomalies - concurrent communication detection ![](https://hackmd.io/_uploads/SkCoFuDfp.png) - it is hard to analyze, so can only use message in payload to classify what kind of packet it is - hierarchical or partitional clustering - partitional clustering: assigns a set of data points into k-clusters by using iterative processes ![](https://hackmd.io/_uploads/HJALcOwMa.png) > nested - hierarchical clustering: it is nested and can be displayed as a tree - different from classification tree - classification provides label, tag, but clustering doesn't - example - title of Viziers ![](https://hackmd.io/_uploads/SJkaoOwG6.png) - right hand area: position in a group - bottom area: different person - try to find people group which has similar combination of position in a group ![](https://hackmd.io/_uploads/BJF_hOwMa.png) - steps of cluster analysis - feature selection: selects distinguishing features from a set of candidates - selection of clustering algorithm: after proximity measure, a criterion function and an algorithm is determined - cluster validation: provide user with degree of confidence and check if clustering result makes sense or not - result interpretation - membership of the nodes in resulting clusters - disjoint cluster: is a member of exactly one cluster - overlapping cluster: is a member of more than one cluster - fuzzy cluster: assign a membership weight between 0 and 1 to each node - 1 means absolute membership - 0 means a non-member - K-Means clustering ![](https://hackmd.io/_uploads/SJnT-5vz6.png) - algorithm ![](https://hackmd.io/_uploads/HyQSz9PMp.png) - feature - a partitional clustering - it is basically based on multi-dimensional vector - if we want to use other input, need to transfer them into multi-dimensional vector - clustering, triplets and triangles - clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together (local density of each node) - triplet: open triangle, three nodes connected by two undirected ties - triangle: closed triangle, a triplet connected by three undirected ties - the trio - Anchor: This is the reference point, an item in your dataset - Positive: Similar to the anchor. The model should pull this closer to the anchor - Negative: Dissimilar to the anchor. The model's task is to push this away - definition of triplets and triangles in math ![](https://hackmd.io/_uploads/H1PNZ1df6.png) - different types of clustering coefficient ![](https://hackmd.io/_uploads/Sk0DNJuMa.png) - diffusion equation - partial differntial equation - show the distribution of increasing and decreasing - how the communication flows in the network using increasing and decreasing to express - diffusion process - network diffusion equation ![](https://hackmd.io/_uploads/BkCPr1_fp.png) - diffusion matrix form ![](https://hackmd.io/_uploads/BkgA-tJ_MT.png) - only decaying exponentials because we need convergence to be stable - graph laplacian - if there is a complex network, graph Laplacian is like the map that helps you navigate through this network - graph laplacian is a matrix derived from the graph, and one common method to derive it is using degree matrix and adjacency matrix ![](https://hackmd.io/_uploads/Skdp01dfa.png) - Degree Matrix(D): this matrix shows the degree(number of connections) of each node - Adjacency Matrix(A): this matrix shows the connections between nodes - in a well-connected cluster, the values of the Fiedler vector will be similar - in short, graph laplacian can help to analyze the graph to get hidden structure using its eigenvalues and eigenvectors ::: ### Test 4 :::danger **What are the basic roles of nodes?** - they represent entities within a network - they also represent connection points for edges **How is it possible to assess a role of a given nodes?** - we can base on these twos - influence maximization: if we inject the info. into a node, we can know how wide it can spread out the info. to other nodes - link-based classification: the properties of the node and its neighborhood **Provide definitions of authorities and hubs.** - authorities: nodes that contain useful info. on a topic of interest - hubs: nodes that tell us where the best authorities are to be found **How are the hub and authority centralities defined?** - hub centralities: proportional to the sum of the authority centralities of vertices ![image](https://hackmd.io/_uploads/HJPpBFhOp.png) - authority centralities: proportional to the sum of the hub centralities of vertices ![image](https://hackmd.io/_uploads/SJ23Sthda.png) - just choose one of them to calculate **What is the goal of clustering?** it is to divide all nodes into number of groups, so that data belonging to a group are similar to each other within the same group **What are the two fundamental approaches to data clustering?** - supervised: with a finite set of labels/tags - unsupervised: base on the similarities of objects **What are the typical steps of a cluster analysis?** 1. feature selection 2. measurement function, criterion function, and algorithm are determined 3. perform cluster validation 4. interpret the result **What ate the basic forms of node memberships in clusters?** - disjoint clusters: each node is a member of exactly one cluster - overlapping clusters: a node may be a member of more than one clusters - fuzzy cluster: assign membership weight between 0 to 1, 1 means absolute membership, 0 means non-member **Describe k-means clustering.** - a partitional clustering - there are K clusters - each cluster has one center - each node will select the closest center to itself, and join the cluster **Define a triplet and triangle.** - triplet: open triangle, three nodes connected by two undirected edges - triangle: closed triangle, a triplet connected by three undirected edges **Describe a diffusion equation.** - partial differential equation - show the distribution of increasing and decreasing - like fast it can spread info. **What is the graph Laplacian?** - a matrix derived from the adjacency matrix of a graph - if there is a complex network, graph Laplacian is like the map that helps you navigate through this network ::: --- ## Network Community Detection :::info - the elements are too many so we can shrink the matrix to make the calculation easier ![image.png](https://hackmd.io/_uploads/BkXsXAJXT.png) - concept of community - communities, also called clusters or modules, are groups of vertices which might share common properties - communities are dense(lots of edges) subgraphs of a network - global definition - a graph has community structure if it is different from a random graph - null graph refers to the graphs which are very hard to visualize. - overview of methods ![image.png](https://hackmd.io/_uploads/ByW-8RJm6.png) - basic concept of nonoverlapping communities ![image.png](https://hackmd.io/_uploads/H1IR8AJ7a.png) - Kernighan-Lin Algorithm ![image.png](https://hackmd.io/_uploads/rkSto0176.png) - try to partition into two random communities, and try to find the minimum of edges, the precision is based on the number of iterations - first number of biggest elements in vector are in a group, remaining elements are in another group - Spectral bisection ![image.png](https://hackmd.io/_uploads/HkiNi0yXT.png) - only two communities - smaller eigenvalues, less number of edges (modularity) - first number of biggest elements in vector are in a group, remaining elements are in another group - Hierarchical clustering ![image.png](https://hackmd.io/_uploads/SJEX2Ak7T.png) > modularity: the number of edges inside communities (which measures the quality of a partition) - derive the modularity from the bottom to the top - pursue large modularity as possible (more edges) - merge each node one by one - not guarantee the best choice - Newman Spectral Method - many communities - mix of spectral bisection and hierarchical clustering - try to find max eigenvalue to find max modularity - Louvain Method ![image.png](https://hackmd.io/_uploads/rJWyeJlm6.png) - it will truely merge after generate a community - kind of "heirarchical algorithm", but even betterO(it will simplify the calculation) - if the gain is not positive (be negative), they should not merge - at the end, the total modularity will achieve the highest value ![image.png](https://hackmd.io/_uploads/BkB6bye76.png) - Resolution limit of Louvain Algorithm: small group must be merged with other groups, so there is no small group ::: :::success - each node has the probabilities belonging to the communities, so there is probability one node doesn't belong to any communities ![image.png](https://hackmd.io/_uploads/SJUtb7Yma.png) - Affiliation graph model ![image.png](https://hackmd.io/_uploads/B1SA-XY7a.png) - there is prob. creating edge, and also don't create edge ::: ### Test 5 :::danger **Describe the concept of community.** communities, also called clusters or modules, are groups of vertices which might share common properties **What is null model of a graph?** is a reference or baseline model when analyzing the properties of a given graph **What types of community dection methods do you know?** - Nonoverlapping community detection: each node in a network is assigned to exactly one community - Overlapping community detection: nodes can belong to multiple communities simultaneously - Community detection in bipartite graphs: community detection involves two distinct sets (partitions) of the graph **Describe Kernighan-Lin algorithm.** - try to partition into two random communities, iteratively select pairs of nodes, one from each partition, and swap them between the partitions ![rkSto0176](https://hackmd.io/_uploads/S1RfJPAup.png) - try to find the minimum of edges between two communities, the precision is based on the number of iterations **Describe graph partitioning using the spectral bisection method.** - only two communities, smaller eigenvalues, less number of edges (modularity) ![HkiNi0yXT](https://hackmd.io/_uploads/S1GNyvRup.png) - first number of biggest elements in vector are in a group, remaining elements are in another group **What is modularity of graph proposed by Newman?** - function which is used to measure the quality of a partition - basically the number of edges within communities minus those for a null model **How can modularity be used for community detection?** - which is used to measure the quality of a partition by community detection algorithm like Louvain method **Describe principles of the Louvain algorithms.** - illustration ![BkB6bye76](https://hackmd.io/_uploads/HygsXvC_T.png) - it will merge after generate a community - similar principle as hierarchical algorithm, but much efficient - if gain is not positive, they should not merge - finally the total modularity will achieve the highest value **What is the resolution limit in community detection based on modularity?** small group will be merged with other groups, so there is no small group **Describe principles of overlapping community detection.** the nodes in graph can belong to multiple communities simultaneously ::: --- ## Alloy :::info - Alloy is a modeling language for software design - everything in Alloy is a set - there are no scalars - in this case, we just want to express normal family relationship, we need to set lots of constraints ![image.png](https://hackmd.io/_uploads/ByM5M7K7p.png) - for me, signatures are time independent ![image](https://hackmd.io/_uploads/SyeHrBMNT.png) - Alloy is mainly for relationship, so harder to create time-seried model ::: --- ## Model Checking and UPPAAL - the difference between Alloy and UPPAAL - UPPAAL: specially for state space machine, provide huge amounts of state - Alloy: relation based, but we need to specify state (like we execute for three times by default) :::info - procedures of testing - testing and simulation: sample execution, just choose one or two case, not test all possible cases (subset of formal verification) - runtime verification: for checking if it can run or not (subset of formal verification) - formal verification: test all possible cases ::: :::success - model checking - building the final system model (finite set states) - check if required property is compiled with the model - security: all reachable states are not bugs - liveness: reach every state in limited time - fairness, and so on - based on full state space search (using UPPAAL) - advantages/disadvantages of model checking - advantages - full automation: full automatic proof, dont need to come to a solution to proof - high speed - can just verify partial SPEC: part of protocol - procedures counterexample: help find special error event - disadvantages - state explosion: - sometimes it will generate too many states - we need to set the constraints properly - something without finiteness - continuous variables - continuous time - something with probability - else - temporal logic: logic with notion of time ::: :::success - description of state space - applicable to finite state spaces only - we need input parameters and statements (we describe using atomic statements) - and state space which can be formalized using atomic statements and Kripke's structure > Kripke's structure is undeterministric finite state machine - exmaple of Kripke's structure ![image](https://hackmd.io/_uploads/BJ9VtPGEa.png) - basic statement describing the system - expressions - constants - predicate symbols - atomic statements - each atomic statement is algorithmically decidable based on a given state - status is evaluation of all variables - Kripke's structure ![image](https://hackmd.io/_uploads/ryORtPME6.png) - time automation clock: use a clock set with real numbers, to sample the time passed - "path" is an infinite sequence of states - concept of time - physical time: time measurement between two states - logical time: how many steps between two states - temporal logic - LTL is used by Alloy - UPPAAL is not CTL and CTL\* - example of urgent channels ![image](https://hackmd.io/_uploads/SknNLd4HT.png) - urgent channel will keep polling for the condition - if we don't use urgent channel, it take action after random long time ::: --- ## FSM Testing and Checking Sequences :::info - decision of the tool we use - simple network: Python - complex network: like UPPAAL, there is graphic user interface - transition table ![image](https://hackmd.io/_uploads/Hy0QWF4B6.png) - greatly reduce the table size - example ![image](https://hackmd.io/_uploads/Hy1Lbt4r6.png) - hidden states problem ![image](https://hackmd.io/_uploads/HyK_bt4r6.png) - it is very dangerous if there is infinite loop, for example if the machine is for surgery - a set of input sequences is called a characterization set - is a finite set of input sequences that distinguishes any pair of different states - example ![image](https://hackmd.io/_uploads/Bkuqpa4S6.png) - partition of groups ![image](https://hackmd.io/_uploads/rk7qlJHS6.png) - we can do first partition for grouping based on the output ![image](https://hackmd.io/_uploads/HJZag1HHa.png) - do further partition using the groups of next states ![image](https://hackmd.io/_uploads/BylZ-Jrrp.png) - do it again ![image](https://hackmd.io/_uploads/B1rVbkrrT.png) - do it again ![image](https://hackmd.io/_uploads/BJNSWJBrT.png) - do it again ![image](https://hackmd.io/_uploads/ByHUZ1Hra.png) ::: --- ## FSM Sequences :::info - definition of set theory and strings - cardinality of a set = number of elements of the set - partition of a set is a set of nonempty subsets of set - notation ![image](https://hackmd.io/_uploads/ryswExAHT.png) - I/O sequence - present form: single decision is one of the entire output sequence - adaptive form: a decision tree constructed by all of the decisions - overview of FSM sequences ![image](https://hackmd.io/_uploads/B1-wvgCHp.png) - **state identification:** derive initial state from output using input sequence - distinguishing sequence: different from characteristic set, it will not duplicate the state it passes, and characteristic set is a set for distinguishing any state, and distinguishing sequence is just one sequence - distinguishing tree is all processes of distinguishing sequence - main purpose of distinguishing sequence is to do state identification - totally have no info. about where the state you are - Preset Distinguishing Sequence (PDS algorithm) - step 1: come out a tree ![image](https://hackmd.io/_uploads/SJwH4fCrp.png) - step 2: get a table of distinguishing sequences ![image](https://hackmd.io/_uploads/SkJ34GCSa.png) - **state verification:** verify where the state of current location it is - to verify the state where you are really at for double confirmation - state verifying sequence (SVS algorithm) - step 1: come out a tree to check if the state can be distinguished from other states (distinguish B from any other states), if it can be distinguished then it will be thrown out ![image](https://hackmd.io/_uploads/Hky6rzArp.png) - step 2: come out this table, when we get E then we can differentiate the state since E has a different output from others ![image](https://hackmd.io/_uploads/BJWxDG0Ha.png) - **state characterizing set** is used to distinguish from other state (similar to characteristic set, but it is used to distinguish from any other states) - state characterizing set (SCSet) ![image](https://hackmd.io/_uploads/S1p_PM0rT.png) - fill in simple sequences first then gradually fill in complicated sequences - **homing sequence example:** to guide FSM to some specific states - run it until they are fully distinguished ![image](https://hackmd.io/_uploads/HkO_OMAST.png) - come out the table ![image](https://hackmd.io/_uploads/SJxMKMRHa.png) - Synchronizing Sequence (SS): different from homing sequence, it doesn't derive it from output of process, it is easier but not always exist - try if leaves will be converged to one state or not ![image](https://hackmd.io/_uploads/rJIwqG0Sa.png) - come out a table for SS ![image](https://hackmd.io/_uploads/HyzG9GRST.png) ::: ### Test 6 :::danger **Define state identification and state verification.** - state identification: derive initial state from output using input seq. - state verification: comfirm and validate the current state of a system **Define preset distinguishing sequence and describe its construction algorithm.** - present distinguishing seq. is a sequence of inputs used to differentiate between different states of a FSM system - algorithm 1. determine the states of the system that need to be distinguished or differentiated 2. assign binary code to each state that serve as a representation for the states 3. for coming out distinguishing tree, each branch represents different input and the leaves corresponds to a set of output, a set of state will be divided into different group based on the output ![SJwH4fCrp](https://hackmd.io/_uploads/Sy0UEY6_T.png) 4. tidy up a table which including each combination of inputs and corresponding output ![SkJ34GCSa](https://hackmd.io/_uploads/ByV_Vta_6.png) **Define state verifying sequence and describe its construction algorithm.** - verify the state where you are really at for double confirmation - state verifying sequence (SVS algorithm) 1. for coming out a tree, check if state can be distinguished from other states, if it can be distinguished then it will be thrown out from the set ![Hky6rzArp](https://hackmd.io/_uploads/B13l8t6u6.png) 2. tidy up a table which including each combination of inputs and corresponding output ![BJWxDG0Ha](https://hackmd.io/_uploads/BJuXIYpd6.png) **Define state characterizing set and describe its construction algorithm.** - SCSet is used to distinguish from any other states - algorithm ![S1p_PM0rT](https://hackmd.io/_uploads/SkfjPJAuT.png) **Define homing sequence and describe its construction algorithm.** - sequence which is to guide FSM to some specific states - algorithm ![HkO_OMAST](https://hackmd.io/_uploads/rkNiOJAOp.png) - come out a table ![SJxMKMRHa](https://hackmd.io/_uploads/Sk6h_kRO6.png) **Define synchronizing sequence and describe its construction algorithm.** - sequence which is to guide FSM to some specific states, but different from homing sequence, it doesn’t derive it from output of process, it is easier but not always exist - algorithm (derive it just based on the states) ![rJIwqG0Sa](https://hackmd.io/_uploads/Hku1cy0da.png) - come out a table for SS ![HyzG9GRST](https://hackmd.io/_uploads/H1AWqyCOa.png) ::: --- ## Network Dynamics :::info - refer to networks change over time - changes in parameters or topological structures - like virus, disease spreading - multi-compartment model - each compartment is assumed to be a homogeneous entity which can be modelled - inside the same compartment, they will behave in the same way - ODEs deal with a single variable, often time, and their derivatives with respect to that variable. - PDEs involve multiple variables, considering how a function changes with respect to each of them through partial derivatives - Kermack-McKendrick theory ![image](https://hackmd.io/_uploads/rkdoOTWDT.png) - susceptible: can be infected, but not yet - infected: already infected, can come into susceptible or recovered - recovered: will not infected again - SI model - parameters - susceptible - infected - behavior ![image](https://hackmd.io/_uploads/S16EmHMDp.png) - after being infected, it will not be susceptible anymore ![image](https://hackmd.io/_uploads/SyCi7BGwT.png) - SIS model - parameters - susceptible - infected - behavior ![image](https://hackmd.io/_uploads/SklA7Sfva.png) - and it will be infinite state exchange (between susceptible and infected) ![image](https://hackmd.io/_uploads/rkl_XrGPp.png) - SIR model - parameters ![image](https://hackmd.io/_uploads/BJF1NBMPp.png) - susceptible - infected - recovered or died - behavior 1 ![image](https://hackmd.io/_uploads/HyRfEBMD6.png) - behavior 2 ![image](https://hackmd.io/_uploads/BkvBVHzwT.png) - giant component is similar to spreading processes - critical point (epidemic) is the point that pandemic will outbreak ![image](https://hackmd.io/_uploads/HJO0vSMDp.png) - epidemic threshold has the same concept - basic reproduction number: the average number of people infected by a person before his recovery ::: ### Test 7 :::danger **Define compartment model.** - usually used to describe the spread of a disease through different compartments in a system - population is divided into compartments, each representing different state, and transition uses a set of differential equations **Describe SI model.** - only two parameters: susceptible and infected - after being infected, it will not be susceptible anymore **Describe SIS model.** - only two parameters: susceptible and infected - after being infected, it will back to susceptible, and it might be re-infected ::: --- ## Network Inference & Link Prediction :::info - The expectation: the terms positive and negative refer to the classifier's prediction. ![image](https://hackmd.io/_uploads/BksBqb2dT.png) - precision and recall ![image](https://hackmd.io/_uploads/BJbfnZh_T.png) - accuracy ![image](https://hackmd.io/_uploads/rJr6oW3O6.png) - F1-Measure ![image](https://hackmd.io/_uploads/Skvth-nua.png) - the harmonic mean of precision and recall - true positive rate(TPR) & false positive rate(FPR) ![image](https://hackmd.io/_uploads/SJXx6Z2u6.png) - ROC curve ![image](https://hackmd.io/_uploads/r1dm0-2_T.png) - The one at bottom is the best. - AUC should be as high as possible. - If the AUC is below 0.5, do the exact opposite of what the model recommends. - Inference problems - link prediction: we know something about edges and vertices - association graph: we can only know something from vertices - tomographic network inference: we can only measure vertices within perimeter of the network, and we can not observer the core of network - link prediction motivation - networks are highly dynamic object by adding or removing edges - link prediction definition - link prediction: predict edges are added in the interval - link completion: given a network, complete the missing edges - link reliability: estimate the reliability of given links in the graph - and link existence, link weight, link type, and link cardinality can also be predicted ::: ### Test 8 :::danger **Define precision, recall, accuracy, and F1-measure used in classification evaluation.** - precision: show how many selected items are relevant - recall: show how many relevant items are selected - accuracy: show how many items are correctly predicted - F1-measure: the harmonic mean of precision and recall, used to combine precision and recall into a single value **How ROC curves are used in classication problems?** - is a illustration of a binary classification model's performance across different threshold - plotted by true positive rate (TPR) against the false positive rate (FPR) ![image](https://hackmd.io/_uploads/r1rCDVTda.png) - the worst AUC is 0.5 - If the AUC is below 0.5, do the exact opposite of what the model recommends **Define the network inference problem and its subproblems.** - link prediction: we know something about edges and vertices - association graph inference: we can only know something from vertices - tomographic network inference: we can only measure vertices within perimeter of the network, and we can not observer the core of network **How is it possible to detect packet sequence concurrency?** illustrate it based on the timestamp it carries and analyze it **Define the link prediction problem is its subproblems.** - link prediction: predict edges are added in the interval - link completion: given a network, try to complete the missing edges - link reliability: estimate the reliability of given links - link existence, link weight, link type, and link cardinality can also be predicted **Define typical scoring functions used in the link prediction problem.** - neighborhood based - number of common neighbors: count the number of common neighbors between two nodes - Jaccard's coefficient: measure the ratio of the intersection of neighbors to the union of neighbors for two nodes - Adamic/Adar: weights common neighbors by the inverse logarithm of their degree - Preferential attachment: the nodes with higher degrees are more likely to gain new connections - path based - shortest path: measure the length of the shortest path between two nodes - Katz score: sum of weights of all paths between two nodes with decreasing weights for longer paths - personalized Pagerank: computes the prob. of reaching one node from another in random walk, for sure there is preference on certain nodes :::