# 2021 Fall Data Mining Project 03 - Report
* Subject of this project - **Link Analysis**
* This report and source code are written by Steven Hsun-Hsiang Chen.
* Github Link: https://github.com/StevenChen8759/2021_DataMining_Project_03
## :bookmark_tabs: Table of Content
[toc]
## :cyclone: Dataset Description
* In this project, we utilize two datasets.
* HW3 graph set provided by TA.
* IBM graph data generated by IBM QSDG (Quest Syntheic Data Generator)
* HW3 Graph Visualization - Graph 1
* 6 Vertexes, 5 Edges

* HW3 Graph Visualization - Graph 2
* 5 Vertexes, 5 Edges

* HW3 Graph Visualization - Graph 3
* 4 Vertexes, 6 Edges

* HW3 Graph Visualization - Graph 4
* 7 Vertexes, 18 Edges

* HW3 Graph Visualization - Graph 5
* 469 Vertexes, 1102 Edges

* HW3 Graph Visualization - Graph 6
* 1228 Vertexes, 5220 Edges

* IBM Graph Visualization
* 836 Vertexes, 4798 Edges

## :water_buffalo: Algorithm Implementation Details
* Programming Language: **`Python`**
* Use **`Directed Graph`** with **`Adjancy Matrix`** to represent input graphs.
* Based on self-defined Python class, we implement HITS, PageRank, and SimRank Algorithm.
### :building_construction: Structure `directedGraph`
* With `numpy ndarray`, `Tuple`, and `List` operations, the directed graph can be represented by an adjancy matrix based on structure `numpy ndarray`.
* Please refer `src/utils/GraphStruct.py` for more details.
### :writing_hand: HITS Algorithm
* Cooperate with `directedGraph` structure and `numpy ndarray`.
* Flow of algorithm
1. Initialize `authority` and `hub` array
* By `np.ones()`
2. Evaluate new authority and hub
* By `np.matmul()` with old authority, hub, and adjancy matrix
* Do normalization in this step.
3. Evaluate convergence score to check whether the value is actually converged.
* Using 2-norm with `np.linalg.norm()` on old/new authority/hubs.
4. Loop until statisfy at least one condition listed below:
* The convergence score is less than specific threshold **$\epsilon = 0.000001$**.
* Iteration count hits maximum bound. $max\_iter = 100$
* The convergence score of current iteration is bigger than the score of last iteration.
### :hand: PageRank Algorithm
* Cooperate with `directedGraph` structure and `numpy ndarray`.
* Flow of algorithm
1. Initialize `pgrank` array
* By `np.ones()`
2. Evaluate new page rank
* Evaluate by traversing each vertex
* Get index of in-neighbors of specific vertex by structure operation `graph.get_in_neighbours_index()`.
* Get out-degree of specific vertex by structure operation `graph.out_degree()`.
* Default damping factor: **`0.15`**
* Do normalization in this step.
3. Evaluate convergence score to check whether the value is actually converged.
* Using 1-norm with `np.linalg.norm()` with argument `ord=1` on old/new pagerank.
4. Loop until statisfy at least one condition listed below:
* The convergence score is less than specific threshold **$\epsilon = 0.001$**.
* Iteration count hits maximum bound. $max\_iter = 100$
* The convergence score of current iteration is bigger than the score of last iteration.
### :call_me_hand: SimRank Algorithm
* Cooperate with `directedGraph` structure and `numpy ndarray`.
* Flow of algorithm
1. Initialize `simrank` matrix
* By `np.identity()`, which obtains identity matrix.
2. Evaluate new simrank
* Evaluate by traversing each vertex pair
* Time Complexity: $O(n^2)$
* Get index of in-neighbors of specific vertex by structure operation `graph.get_in_neighbours_index()`.
* Default decay factor: **`0.9`**
3. Evaluate convergence score to check whether the value is actually converged.
* Using 1-norm with `np.linalg.norm()` with argument `ord=1` on old/new simrank.
4. Loop until statisfy at least one condition listed below:
* The convergence score is less than specific threshold **$\epsilon = 0.001$**.
* Iteration count hits maximum bound. $max\_iter = 100$
* The convergence score of current iteration is bigger than the score of last iteration.
## :bulb: Result of Algorithm with Input Dataset
* Please refer output txt files of each graph in directory `./output/hw03/` and `./output/ibm_data/` in the [github repo](https://github.com/StevenChen8759/2021_DataMining_Project_03).
* Or run my script following procedure described in `Readme.md` to obtain the result on the command line interface.
* Obtain result with running `./src/graph_adjust.py`
### :movie_camera: Result Discussion - Graph 1 ~ 3
* Graph Visualization
* [1] ==>  [2] ===>  [3] ===> 
* Analysis on Authority
* Graph 1: `[0. 0.2 0.2 0.2 0.2 0.2]`
* Graph 2: `[0.2 0.2 0.2 0.2 0.2]`
* Graph 3: `[0.19098309 0.30901691 0.30901691 0.19098309]`
* The more in-degree, the higher authority value.
* Graph 2 has all equivalent authority value since in-degree of all vertices are also equivalent.
* Analysis on Hub
* Graph 1: `[0.2 0.2 0.2 0.2 0.2 0.]`
* Graph 2: `[0.2 0.2 0.2 0.2 0.2]`
* Graph 3: `[0.19098309 0.30901691 0.30901691 0.19098309]`
* The more out-degree, the higher hub value.
* Graph 2 has all equivalent hub value since out-degree of all vertices are also equivalent.
* Analysis on PageRank
* Graph 1: `[0.06071611 0.11232481 0.1561922 0.19347948 0.22517367 0.25211373]`
* Graph 2: `[0.2 0.2 0.2 0.2 0.2]`
* Graph 3: `[0.17529425 0.32494913 0.32446285 0.17529377]`
* For a chain-like graph such as graph 1, the highest page rank value locates at the vertex which out-degree is zero.
* Also called sink-vertex pagerank absorbing
* For a cycle-like graph such as graph 2, the flow of each node is permanently equivalent, so its pagerank is also permanently equivalent unless some edge is added or deleted.
### :hammer: Discussion on adjusting damping factor of PageRank algorithm.
* Review the pagerank formula
* $PR(P_i) = \frac{d}{n} + (1-d) \times \underset{l_{j,i} \in E}{\Sigma}\frac{P_j}{Outdegree(Pj)}$
* $d$: damping factor, to prevent sink-vertex pagerank absorbing.
* While d is close to 1, the pagerank between vertices will be an uniform distribution.
* While d is close to 0, it is more possible to sufer from sink-vertex pagerank absorbing.
* So, we set 0.15 for best result. (As the suggestion from TA)
* Ref: https://www.quora.com/What-is-the-function-of-the-damping-factor-in-PageRank
### :wrench: Discussion on adjusting decay factor of SimRank algorithm.
* Review the simrank formula
* $S(a,b) = \frac{C}{|I(a)I(b)|}\Sigma^{|I(a)|}_{i=1}\Sigma^{|I(b)|}_{j=1}S(I_i(a),I_j(b))$
* Using decay factor to distinguish diagonal values 1.
* Select 0.9 in our cases
* Ref: https://towardsdatascience.com/simrank-similarity-analysis-1d8d5a18766a
## :compression: Adjustment Graph to Increase Hub, Authority, and PageRank
* Strategy: add sufficient input edge and output edge for vertex 1.
* While the input edge count increases, the authority increases.
* While the output edge count increases, the hub increases.
* While the input and output edge count increases, the page rank increases due to more flow passed through.
* Complete script for adjusting graph is in `./src/graph_adjust.py`
* Graph 1
* Add edge `(1, 3)`, `(1, 4)`, `(1, 5)`, `(2, 1)`
*  -----------------------> 
* Result of Vertex 1
* Hubs: `0.2` to `0.535`
* Authority: `0.0` to `0.066`
* PageRank: `0.060` to `0.129`
* 
* Graph 2
* Add edge `(1, 3)`, `(1, 4)`, `(1, 5)`, `(2, 1)`, `(3, 1)`, `(4, 1)`
*  ------------> 
* Result of Vertex 2
* Hubs: `0.2` to `0.288`
* Authority: `0.0` to `0.288`
* PageRank: `0.2` to `0.374`
* 
* Graph 3
* Add edge `(1, 3)`, `(1, 4)`, `(3, 1)`, `(4, 1)`
*  -----------------> 
* Result of Vertex 3
* Hubs: `0.190` to `0.280`
* Authority: `0.190` to `0.280`
* PageRank: `0.175` to `0.295`
* 
## :chart_with_upwards_trend: Time Performance Analysis
* Running time analysis by timestamp generated by `loguru`.
* This is an estimated (粗估的) measurement. It may be different due to some factors such as hardware environment, software environment...etc.
* ATPI of an algorithm represents Average Time Per Iteration.
* To lookup detailed ATPI timestamps, please refer the output log in appendix.
| Graph Name | Vertex Count | Edge Count | HITS ATPI | PageRank ATPI | SimRank ATPI |
| ---------- | ------------ | ---------- | --------- | ------------- | ------------ |
| HW03 - 1 | 6 | 5 | 1ms | 1ms | 2ms |
| HW03 - 2 | 5 | 5 | 1.5ms | 3ms | 4ms |
| HW03 - 3 | 4 | 6 | 0.2ms | 0.4ms | 0.33ms |
| HW03 - 4 | 7 | 18 | 0.08ms | 0.33ms | 0.44ms |
| HW03 - 5 | 469 | 1102 | 3.59ms | 307.5ms | 2085ms |
| HW03 - 6 | 1228 | 5220 | 8.03ms |13646.25ms | 26820.5ms |
| IBM-5000 | 836 | 4798 | 5.33ms | 9076ms | 32551.3ms |
* Time Complexity Analysis for each algorith. ($v$: Vertex Count)
| Algorithm | Theorical Time Complexity | Notes |
| --------- | --------------- | --- |
| HITS | $O(v^2)$ | Operation on adjancy matrix, but it should be faster based on [numpy `matmul()` optimization](https://stackoverflow.com/questions/10442365/why-is-matrix-multiplication-faster-with-numpy-than-with-ctypes-in-python). |
| PageRank | $\Omega(v)$ | Running time may increace due to increasing average edge count of each vertex. |
| SimRank | $\Omega(v^2)$ | Vertex-pair traversal. Running time may increace due to increasing average edge count of each vertex. |
### :mag: Discussion on Complexity
* HITS always runs fastest. It seem that the numpy actually do a really well optimization on `matmul()`.
* Especially on graph with large vertices and edges.
* Using $\Omega(\cdot)$ symbol for complexity analysis on vertex count while considering edge count in a simplified manner.
* The difference of runtime between PageRank and SimRank is not fully equivalent.
* Based on the behavior of vertex traversal, the time complexity of PageRank is better than SimRank.
## :clapper: Conclusion and Future Work
* Implement HITS, PageRank, and SimRank algorithms.
* Based on numpy ndarray ops and self-defined `directedGraph` structure.
* Apply these dataset on 6 graphs provided by TA and 1 graph from IBM data generator.
* Do analysis on outcome, graph adjustment, and time efficiency.
* The test on divergence can be improved in future.
* Score increasing tolerance count.
* Currently 1 time.
* Cost on evaluating score, i.e. 1-norm, 2-norm...etc.
* Memory cost for last iteration outcome.
* Currently keep outcomes for 2 iteration.
* In SimRank algorithm, we also utilize 1-norm to check divergence. It can be discussed if 1-norm is suitable in this case or not.
## :file_folder: Appendix
* Running Time Table - Timestamp Sampling Reference Output
```shell
$ pipenv run python .\src\main.py
2022-01-11 03:12:14.539 | INFO | __main__:<module>:16 - Initialize...
2022-01-11 03:12:14.542 | INFO | __main__:<module>:23 - Read Graphs
2022-01-11 03:12:14.757 | INFO | __main__:<module>:32 - Run hw03 dataset
2022-01-11 03:12:14.759 | INFO | __main__:<module>:34 - Graph 1 - HITS
2022-01-11 03:12:14.761 | DEBUG | LinkAnalysis.HITS:directedGraph_HITS:65 - Converged after 2 iterations, score: 0.00
2022-01-11 03:12:14.764 | DEBUG | __main__:<module>:36 -
Authority:
[0. 0.2 0.2 0.2 0.2 0.2]
Hubs:
[0.2 0.2 0.2 0.2 0.2 0. ]
2022-01-11 03:12:14.768 | INFO | __main__:<module>:40 - Graph 1 - PageRank, damping factor: 0.15
2022-01-11 03:12:14.770 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 2 iterations, score: 0.00
2022-01-11 03:12:14.772 | DEBUG | __main__:<module>:43 -
PageRank:
[0.06071611 0.11232481 0.1561922 0.19347948 0.22517367 0.25211373]
2022-01-11 03:12:14.775 | INFO | __main__:<module>:45 - Graph 1 - SimRank, decay factor: 0.9
2022-01-11 03:12:14.777 | DEBUG | LinkAnalysis.SimRank:directedGraph_SimRank:82 - Converged after 1 iterations, score: 0.00
2022-01-11 03:12:14.780 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 1.]]
2022-01-11 03:12:14.783 | INFO | __main__:<module>:34 - Graph 2 - HITS
2022-01-11 03:12:14.786 | DEBUG | LinkAnalysis.HITS:directedGraph_HITS:65 - Converged after 2 iterations, score: 0.00
2022-01-11 03:12:14.789 | DEBUG | __main__:<module>:36 -
Authority:
[0.2 0.2 0.2 0.2 0.2]
Hubs:
[0.2 0.2 0.2 0.2 0.2]
2022-01-11 03:12:14.794 | INFO | __main__:<module>:40 - Graph 2 - PageRank, damping factor: 0.15
2022-01-11 03:12:14.797 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 1 iterations, score: 0.00
2022-01-11 03:12:14.801 | DEBUG | __main__:<module>:43 -
PageRank:
[0.2 0.2 0.2 0.2 0.2]
2022-01-11 03:12:14.806 | INFO | __main__:<module>:45 - Graph 2 - SimRank, decay factor: 0.9
2022-01-11 03:12:14.810 | DEBUG | LinkAnalysis.SimRank:directedGraph_SimRank:82 - Converged after 1 iterations, score: 0.00
2022-01-11 03:12:14.812 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
2022-01-11 03:12:14.815 | INFO | __main__:<module>:34 - Graph 3 - HITS
2022-01-11 03:12:14.818 | DEBUG | LinkAnalysis.HITS:directedGraph_HITS:65 - Converged after 15 iterations, score: 0.00
2022-01-11 03:12:14.820 | DEBUG | __main__:<module>:36 -
Authority:
[0.19098309 0.30901691 0.30901691 0.19098309]
Hubs:
[0.19098309 0.30901691 0.30901691 0.19098309]
2022-01-11 03:12:14.824 | INFO | __main__:<module>:40 - Graph 3 - PageRank, damping factor: 0.15
2022-01-11 03:12:14.826 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 5 iterations, score: 0.00
2022-01-11 03:12:14.829 | DEBUG | __main__:<module>:43 -
PageRank:
[0.17529425 0.32494913 0.32446285 0.17529377]
2022-01-11 03:12:14.830 | INFO | __main__:<module>:45 - Graph 3 - SimRank, decay factor: 0.9
2022-01-11 03:12:14.833 | DEBUG | LinkAnalysis.SimRank:directedGraph_SimRank:82 - Converged after 9 iterations, score: 0.00
2022-01-11 03:12:14.834 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0. 0.81680604 0. ]
[0. 1. 0. 0.81680604]
[0.81680604 0. 1. 0. ]
[0. 0.81680604 0. 1. ]]
2022-01-11 03:12:14.836 | INFO | __main__:<module>:34 - Graph 4 - HITS
2022-01-11 03:12:14.838 | DEBUG | LinkAnalysis.HITS:directedGraph_HITS:65 - Converged after 25 iterations, score: 0.00
2022-01-11 03:12:14.839 | DEBUG | __main__:<module>:36 -
Authority:
[0.13948439 0.17791184 0.20082311 0.14017779 0.20142508 0.05608945
0.08408834]
Hubs:
[0.27545244 0.04776281 0.10868354 0.19865896 0.18373521 0.11673493
0.06897212]
2022-01-11 03:12:14.842 | INFO | __main__:<module>:40 - Graph 4 - PageRank, damping factor: 0.15
2022-01-11 03:12:14.844 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 6 iterations, score: 0.00
2022-01-11 03:12:14.846 | DEBUG | __main__:<module>:43 -
PageRank:
[0.28040101 0.15859165 0.13870053 0.10806754 0.18441137 0.06067357
0.06915433]
2022-01-11 03:12:14.848 | INFO | __main__:<module>:45 - Graph 4 - SimRank, decay factor: 0.9
2022-01-11 03:12:14.860 | DEBUG | LinkAnalysis.SimRank:directedGraph_SimRank:82 - Converged after 27 iterations, score: 0.00
2022-01-11 03:12:14.862 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0.5624587 0.55275161 0.55537485 0.54338325 0.60223215
0.50851755]
[0.5624587 1. 0.59622329 0.56689323 0.60339041 0.5014645
0.63232195]
[0.55275161 0.59622329 1. 0.62791811 0.58396937 0.62632672
0.6295095 ]
[0.55537485 0.56689323 0.62791811 1. 0.54477508 0.69443253
0.69443253]
[0.54338325 0.60339041 0.58396937 0.54477508 1. 0.48980258
0.59974758]
[0.60223215 0.5014645 0.62632672 0.69443253 0.48980258 1.
0.48886506]
[0.50851755 0.63232195 0.6295095 0.69443253 0.59974758 0.48886506
1. ]]
2022-01-11 03:12:14.867 | INFO | __main__:<module>:34 - Graph 5 - HITS
2022-01-11 03:12:14.946 | DEBUG | LinkAnalysis.HITS:directedGraph_HITS:65 - Converged after 22 iterations, score: 0.00
2022-01-11 03:12:14.956 | DEBUG | __main__:<module>:36 -
Authority:
[0.00000000e+00 3.29762416e-20 4.28175672e-13 ... 3.37676714e-17 2.56004331e-10 3.37676714e-17]
Hubs:
[6.52120933e-12 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00 0.00000000e+00]
2022-01-11 03:12:14.980 | INFO | __main__:<module>:40 - Graph 5 - PageRank, damping factor: 0.15
2022-01-11 03:12:20.516 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 18 iterations, score: 0.00
2022-01-11 03:12:20.522 | DEBUG | __main__:<module>:43 -
PageRank:
[0.00090049 0.00143941 0.0011405 ... 0.00105472 0.00094814 0.00103352]
2022-01-11 03:12:20.531 | INFO | __main__:<module>:45 - Graph 5 - SimRank, decay factor: 0.9
2022-01-11 03:12:24.701 | WARNING | LinkAnalysis.SimRank:directedGraph_SimRank:86 - SimRank Algorithm has stopped at 2-th iterations due to expected divergence.
2022-01-11 03:12:24.705 | WARNING | LinkAnalysis.SimRank:directedGraph_SimRank:87 - Will return old sequence, Score: from 34.20 to 35.58
2022-01-11 03:12:24.708 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
...
[0. 0. 0. ... 1. 0. 0.]
[0. 0. 0. ... 0. 1. 0.]
[0. 0. 0. ... 0. 0. 1.]]
2022-01-11 03:12:24.744 | INFO | __main__:<module>:34 - Graph 6 - HITS
2022-01-11 03:12:25.547 | DEBUG | __main__:<module>:36 -
Authority:
[0.00000000e+00 5.60427863e-04 6.13124816e-07 ... 3.12230818e-05
3.42598934e-05 1.76820999e-04]
Hubs:
[0.0026931 0. 0. ... 0. 0. 0. ]
2022-01-11 03:12:25.553 | INFO | __main__:<module>:40 - Graph 6 - PageRank, damping factor: 0.15
2022-01-11 03:15:09.308 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 12 iterations, score: 0.00
2022-01-11 03:15:09.314 | DEBUG | __main__:<module>:43 -
PageRank:
[0.00040809 0.00071446 0.00052214 ... 0.00046348 0.00042836 0.00059307]
2022-01-11 03:15:09.317 | INFO | __main__:<module>:45 - Graph 6 - SimRank, decay factor: 0.9
2022-01-11 03:16:02.958 | WARNING | LinkAnalysis.SimRank:directedGraph_SimRank:86 - SimRank Algorithm has stopped at 2-th iterations due to expected divergence.
2022-01-11 03:16:02.960 | WARNING | LinkAnalysis.SimRank:directedGraph_SimRank:87 - Will return old sequence, Score: from 30.19 to 124.52
2022-01-11 03:16:02.964 | DEBUG | __main__:<module>:47 -
SimRank:
[[1. 0. 0. ... 0. 0. 0. ]
[0. 1. 0. ... 0. 0. 0. ]
[0. 0. 1. ... 0. 0. 0. ]
...
[0. 0. 0. ... 1. 0. 0. ]
[0. 0. 0. ... 0. 1. 0.18]
[0. 0. 0. ... 0. 0.18 1. ]]
2022-01-11 03:16:03.197 | INFO | __main__:<module>:53 - Run IBM dataset - HITS
2022-01-11 03:16:03.213 | WARNING | LinkAnalysis.HITS:directedGraph_HITS:69 - HITS Algorithm has stopped at 3-th iterations due to expected divergence.
2022-01-11 03:16:03.214 | WARNING | LinkAnalysis.HITS:directedGraph_HITS:70 - Will return old sequence, Score: from 0.03 to 0.08
2022-01-11 03:16:03.227 | DEBUG | __main__:<module>:55 -
Authority:
[0. 0. 0. ... 0. 0. 0.0186942 ]
Hubs:
[7.15240052e-04 2.26783431e-03 2.18574076e-04 ... 2.07183596e-03 9.11238402e-04 0.00000000e+00]
2022-01-11 03:16:03.261 | INFO | __main__:<module>:59 - Run IBM dataset - PageRank, damping factor: 0.15
2022-01-11 03:16:57.717 | DEBUG | LinkAnalysis.PageRank:directedGraph_PageRank:73 - Converged after 6 iterations, score: 0.00
2022-01-11 03:16:57.741 | DEBUG | __main__:<module>:61 -
PageRank:
[0.00019481 0.00019481 0.00019481 ... 0.00019481 0.00019481 0.00241488]
2022-01-11 03:16:57.795 | INFO | __main__:<module>:64 - Run IBM dataset - SimRank, decay factor: 0.9
2022-01-11 03:18:35.449 | DEBUG | LinkAnalysis.SimRank:directedGraph_SimRank:82 - Converged after 3 iterations, score: 0.00
2022-01-11 03:18:35.453 | DEBUG | __main__:<module>:66 -
SimRank:
[[1. 0. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
...
[0. 0. 0. ... 1. 0. 0.]
[0. 0. 0. ... 0. 1. 0.]
[0. 0. 0. ... 0. 0. 1.]]
2022-01-11 03:18:35.540 | SUCCESS | __main__:<module>:69 - End of main script...
```