CS170 Project: notsoclever
===
Sumer Kohli, Kevin Lin, Neha Hudait
# Initial Thoughts
We recognized that the problem we are solving is equivalent to solving the *Minimum Routing Cost Tree* (MRCT) problem on some subset of vertices.
We decided to research and implement multiple different solvers, run each solver in parallel, and output the best result across all of them. In particular, we had to decide on an algorithm to select a valid set of vertices, followed by an algorithm to solve the MRCT problem on that set.
# Explored Vertex Algorithms
In this section, we discuss the algorithms we explored for choosing the vertex subset.
## Min-Weighted Dominating Set + Add Minimum Edges
In `connect_dom_set_minimal_edge.py`, we started off by creating a minimum weighted dominating set. This subset of vertices is not necessarily connected. To get a valid connected graph, we sorted the edges and looped through them, popping the minimum edge off at each iteration. We checked whether adding the edge would create a cycle and, if not, added it to the graph. We repeated this process until the graph became connected and then returned the resulting subset, as sketched below.
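
A minimal sketch of this pipeline, assuming networkx (its `min_weighted_dominating_set` approximation stands in for our dominating-set routine, and a Kruskal-style union-find pass is one way to realize the cycle check; `connect_dominating_set` is an illustrative name, not the actual code):
```python
import networkx as nx
from networkx.algorithms.approximation import min_weighted_dominating_set

def connect_dominating_set(G):
    # approximate minimum weighted dominating set; not necessarily connected
    dom = min_weighted_dominating_set(G)
    uf = nx.utils.UnionFind(G.nodes)
    T = nx.Graph()
    T.add_nodes_from(dom)
    # pop edges off in increasing weight order, skipping any that form a cycle
    for u, v, w in sorted(G.edges(data="weight"), key=lambda e: e[2]):
        if uf[u] != uf[v]:
            uf.union(u, v)
            T.add_edge(u, v, weight=w)
        # stop as soon as every dominating-set vertex is in one component
        if len({uf[d] for d in dom}) == 1:
            break
    # return only the component that contains the dominating set
    keep = nx.node_connected_component(T, next(iter(dom)))
    return T.subgraph(keep).copy()
```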
## Epsilon Min-Weighted Dominating Sampling
In `eps_mwd_sample.py`, we first found the minimum weighted dominating set. We then used a variable epsilon to calculate a sample size and, on each iteration, drew a random subset of vertices of that size. We would then randomly remove a vertex from this subset and check whether the removal decreased our cost; if so, that subset became our best tree so far (see the sketch after the source below).
>Source: [Weighted Connected Dominating Set](https://link.springer.com/referenceworkentry/10.1007%2F978-1-4939-2864-4_476)
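
A rough sketch of the sampling loop, where `build_tree(G, S)` (turning a vertex subset into a valid connected subtree) and `cost(T)` are hypothetical helpers standing in for our actual routines:
```python
import random

def eps_mwd_sample(G, dom_set, build_tree, cost, eps=0.1, iters=1000):
    # epsilon controls how large each random sample is
    sample_size = max(1, int(eps * G.number_of_nodes()))
    best = build_tree(G, set(dom_set))
    for _ in range(iters):
        # draw a random subset of the chosen size around the dominators
        S = set(dom_set) | set(random.sample(list(G.nodes), sample_size))
        extras = list(S - set(dom_set))
        if not extras:
            continue
        # drop one random non-dominator; keep the subset if it scores better
        candidate = build_tree(G, S - {random.choice(extras)})
        if candidate is not None and cost(candidate) < cost(best):
            best = candidate
    return best
```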
## Greedy Vertex Removal
In `greedy_vertex_removal.py`, we found an MST of G and iterated over E(G) - E(T), adding each edge and then removing whichever edge in the created cycle reduced the cost the most (the same edge-swap primitive GER H1 uses below).
# Explored Minimum Routing Cost Tree Algorithms
In this section, we will discuss the algorithms which we explored while coming up with a solution for the MRCT problem described in the spec.
We first created `mst.py`, which outputs a baseline MST of the given graph. We then explored different algorithms based on available research.
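
For reference, a minimal sketch of the objective these solvers compare against, taking the classical MRCT routing cost (the sum of pairwise shortest-path distances within the tree; the spec's scoring may normalize this differently, but the comparisons work the same way):
```python
import itertools
import networkx as nx

def routing_cost(T):
    # sum of shortest-path distances over all unordered vertex pairs in T
    dist = dict(nx.all_pairs_dijkstra_path_length(T, weight="weight"))
    return sum(dist[u][v] for u, v in itertools.combinations(T.nodes, 2))
```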
## Gradual Edge Replacement
Gradual Edge Replacement uses Kruskal's or Prim's algorithm to find a minimum spanning tree of the graph, then gradually replaces each edge of the spanning tree with a better edge. We built two separate implementations.
First implementation (`ger_h1.py`):
1. Find a minimum spanning tree T in G.
2. Insert each edge e from E - T into T. This forms a cycle. Within this cycle, we find the best edge e' such that T + e - e' has a lower cost than T. If such an edge e' exists, we replace T with T + e - e'.
We repeat the second step until a full pass makes no edge replacement in the spanning tree (a sketch follows below).
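
A condensed sketch of the GER H1 loop, assuming networkx and a `tree_cost` helper such as the `routing_cost` sketch above (this illustrates the edge-swap idea rather than reproducing `ger_h1.py` exactly):
```python
import networkx as nx

def ger_h1(G, tree_cost):
    T = nx.minimum_spanning_tree(G, weight="weight")
    best = tree_cost(T)
    improved = True
    while improved:
        improved = False
        for u, v, w in G.edges(data="weight"):
            if T.has_edge(u, v):
                continue
            # the tree path from u to v plus (u, v) forms the unique cycle
            path = nx.shortest_path(T, u, v)
            T.add_edge(u, v, weight=w)
            swap = None
            for a, b in zip(path, path[1:]):
                T.remove_edge(a, b)          # try evicting this cycle edge
                c = tree_cost(T)
                if c < best:
                    best, swap = c, (a, b)
                T.add_edge(a, b, weight=G[a][b]["weight"])
            if swap is None:
                T.remove_edge(u, v)          # no profitable swap; undo
            else:
                T.remove_edge(*swap)         # commit the best replacement
                improved = True
    return T
```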
Second Implementation (`ger_h2.py.disabled`):
1. Find the best shortest-path tree among all possible shortest-path trees in graph G using Wong's algorithm (cited in the paper).
2. Gradually remove each edge e of T. For each e, find the best replacement e' within E - {e}. Let T' = T - e + e'. If T' is better than T, replace T with T'.
We repeat the second step until a full pass makes no edge removal in the spanning tree. We quickly realized that this second implementation, GER H2, was slower than GER H1 while producing the same results.
> Source: [A Heuristic Approach for Solving Minimum Routing Cost Spanning Tree Problem](https://www.ijmlc.org/papers/154-C00896-004.pdf)
# High Performance Computing, Testing, and Hillclimbing with C++
We utilized the Open Computing Facility's High Performance Computing (HPC) cluster, which is free to all Berkeley students, faculty, and staff.
We rewrote and reimplemented the core GER H1 algorithm, along with epsilon min-weighted sampling, in C++ using Boost (see the implementation inside the `boosted` directory).
How does hillclimbing work?
1. Download the leaderboard from Firebase: `curl "https://cs-170-project-sp20.firebaseio.com/leaderboard.json" > cur_leaderboard.json`.
2. Run the `read_scores.py` script to generate hillclimbing output (for example, `python read_scores.py hill 5 medium` targets all medium graphs on which we were rank #5 or worse). This produces a hillclimbing target array in both Python and C++ syntax.
3. Repeatedly run epsilon sampling, pruning each sample with Greedy Vertex Removal to obtain a valid MRCT, until our score improves to something closer to the "target" score (i.e., the #1 team's score). This loop is sketched below.
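
Schematically, the driver loop looks something like this; `eps_sample_and_prune` (one epsilon sample pruned via Greedy Vertex Removal) and `score` are hypothetical stand-ins for our actual C++ routines, and `target` comes from the array `read_scores.py` emits:
```python
def hillclimb(G, current_best, target, eps_sample_and_prune, score,
              budget=100_000):
    best_score = score(current_best)
    for _ in range(budget):
        if best_score <= target:        # matched or beat the #1 team's score
            break
        T = eps_sample_and_prune(G)     # sample, then prune to a valid tree
        if score(T) < best_score:
            current_best, best_score = T, score(T)
    return current_best
```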
This hillclimbing approach took us from roughly rank #20-30 on small graphs and worse than #50 on medium and large graphs to #1 on small graphs, #27 on medium graphs, and #44 on large graphs.
We used about 50-60 OCF HPC CPU-hours in total and were able to heavily parallelize, so wall-clock time was closer to 8 hours across many cores. We deliberately computed all our graphs almost 1.5 weeks before the project due date so that we would not use too much compute.
# Why C++?
Way faster. Way way way faster. We used the Boost Graph Library, which was a breath of fresh air when it came to speed, but an absolute nightmare to code with (just see `boosted/Main.cpp`; it's uglier than Java and dumber than PHP).
# Results
Using the downloaded leaderboard from Firebase, our rankings across graph types were as follows (from the `read_scores.py` script):
```
Final average rank: 27.754473161033797
Final average rank for 303 small graphs: 1.0495049504950495
Final average rank for 303 medium graphs: 32.240924092409244
Final average rank for 400 large graphs: 44.585
```
TL;DR: incredibly good on small graphs due to the smaller sample space for epsilon sampling, and progressively worse on larger graphs.