# CS410 Final Project: Go Engine
| Milestones | Date | Time |
| ------------------------- | ----- | ---------- |
| Partner Form Due | | 11:59am ET |
| Part 1 (Search) | 12/4 | 11:59pm ET |
| Part 2 (Learned Heuristics) | 12/11 | 11:59pm ET |
| Final Bot & Writeup Due | 12/17 | 11:59pm ET |
| Tournament Ends | 12/18 | 11:59pm ET |
**Need help?** Remember to check out Edstem and our website for TA assistance.
## Assignment Overview
The board game Go (a.k.a. Weiqi, Igo, or Baduk) is a two-player strategy game popular throughout the world. Players alternate moves, placing stones (pieces) on the board and attempting to surround more territory than the opponent. Game-playing agents have always been a large focus of the AI community. Up until very recently, it was thought that human-level Go agents were far out of reach. When AlphaGo defeated the world champion Lee Sedol in a five-game match in 2016, it shocked much of the AI and Go communities. In this final project, you will implement game-playing agents for the game of Go. We provide the structure for creating an initial set of agents using algorithms from assignment 2 (Minimax and $\alpha\beta$-search), as well as new algorithms (Monte Carlo Tree Search). Your final agent's strategy is up to you. Your agent will play against staff-made agents of varying levels of competence and your peers' agents in a class-wide tournament.
### Learning Objectives
## The Game of Go
### History of Computer Go
### The Basics
Go is played on a grid. Competitive Go is usually played on a 19x19 board, but we will be playing on a smaller 9x9 board. Players alternate placing colored stones on the board. The player with the black stones plays first, followed by the player with the white stones. If one player completely encircles a group of her opponent's stones, those stones are removed from the board. The player with the most [territory](https://www.pandanet.co.jp/English/learning_go/learning_go_4.html) surrounded at the end of the game is the winner.
We highly encourage you to try out an interactive tutorial, such as [this one](https://gomagic.org/how-to-play-go-rules/), to better understand the rules of Go before moving on.
At any point, a player may pass (no stone is placed). If both players pass consecutively, the game ends. At the end of the game, the player whose stones plus surrounded territory total the most points wins.
:::spoiler A note on scoring
There are multiple scoring methods for Go. The method described above (stones on the board plus surrounded territory) is called Tromp-Taylor scoring. Other methods also take into account the number of pieces captured.
:::
The first player (black) typically has a significant advantage (similar to Tic-Tac-Toe, Connect 4, Chess, etc.). To account for this advantage, additional points, called komi, are added to white's score at the end of the game. We will use a komi of 7.5 in this project. This means that black has to score at least 8 points more than white on the board to win. Because the komi is fractional, the game cannot end in a draw.
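As a quick worked example of the arithmetic, here is a minimal sketch (the provided game engine scores games for you; this is only for intuition):

```python
def winner(black_score, white_score, komi=7.5):
    """Area scores before komi; komi is added to white's total.

    Example: black 45, white 38 -> white's total is 45.5, so white wins
    even though black surrounded more points on the board.
    """
    return "black" if black_score > white_score + komi else "white"
```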
### Time Controls
Most competitive board games are played with per-move or per-game time limits; without such limits, games could take forever! These time limits are often referred to as *time controls*. A clock is used for each player and it counts down when it is that player's turn. After a player makes a move, their clock stops counting down and their opponent's begins. If a player's time hits 0, they forfeit the game.
Our games will use an *incremental* time control. In addition to an initial amount of time for each player, incremental time controls give players additional time whenever a move is played. Incremental time controls are typically written as *Initial Time + Increment*. We will use a time control of 15s + 1s: each player starts with 15 seconds on their clock, and after a player plays their move, one second is added to their clock. Time management, or how a player uses their available time, is key to a successful Go agent.
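For intuition, here is a toy sketch of the clock bookkeeping (the tournament referee manages the real clocks; this is not part of the support code):

```python
import time

class IncrementalClock:
    """Toy model of a 15s + 1s incremental time control."""

    def __init__(self, initial=15.0, increment=1.0):
        self.remaining = initial
        self.increment = increment

    def time_move(self, make_move):
        """Run make_move(), charge the elapsed time, then add the increment.

        Returns False if the player ran out of time (forfeiting the game).
        """
        start = time.time()
        make_move()
        self.remaining -= time.time() - start
        if self.remaining <= 0:
            return False
        self.remaining += self.increment
        return True
```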
## Data Structures & Algorithms
### GoSearchProblem
### Board Representation
### Agents
### Go Data
### MCTS
#### Pseudo-code
```
Monte-Carlo-Tree-Search(state, π_S) → action
    tree ← MCTSNode(state)
    while time-remaining do
        leaf ← SELECT(tree, π_S)
        children ← EXPAND(leaf)
        results ← SIMULATE(children)
        BACKPROPAGATE(results, children)
    end while
    return the action leading to the child of tree with the highest number of visits
```
Select:
```
SELECT(node, π_S):
    currNode ← node
    while !isLeaf(currNode) do
        currNode ← π_S(currNode.children)
        currNode.visits += 1
    end while
    return currNode
```
Expand:
```
EXPAND(leaf):
    children ← []
    state ← leaf.state
    actions ← state.legalActions()
    for action in actions do
        children.append(MCTSNode(transition(state, action), parent ← leaf))
    end for
    leaf.children ← children
    return children
```
Simulate:
```
SIMULATE(children):
    results ← []
    for child in children do
        result ← rollout(child)
        results.append(result)
    end for
    return results
```
Backpropagate:
```
BACKPROPAGATE(results, children):
    for child in children, result in results do
        currNode ← child
        while currNode ≠ null do
            if result == 1 AND currNode.player == 0 then
                currNode.value += 1
            else if result == 0 AND currNode.player == 1 then
                currNode.value += 1
            end if
            currNode ← currNode.parent
        end while
    end for
```
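The selection policy $\pi_S$ is what balances exploration and exploitation. A common choice is UCB1. Below is a minimal sketch, not the stencil's API; it assumes `MCTSNode` exposes the `value`, `visits`, and `parent` fields used in the pseudocode above:

```python
import math

def ucb1(node, c=math.sqrt(2)):
    """UCB1 score: average value plus an exploration bonus that shrinks
    as the node is visited more often relative to its parent."""
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = node.value / node.visits
    # +1 inside the log guards against an unvisited parent
    explore = c * math.sqrt(math.log(node.parent.visits + 1) / node.visits)
    return exploit + explore

def uct_select(children):
    """A possible π_S: pick the child with the highest UCB1 score."""
    return max(children, key=ucb1)
```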
### Learned Heuristics
## Part 1: Tree Search
In this first part of the final project, you will revisit three previously implemented algorithms (`minimax`, `alpha_beta_search`, and `iterative_deepening`) and implement a new tree search algorithm: Monte Carlo Tree Search.
:::info
**Task 1.1:** Minimax and $\alpha\beta$-Search Agents
:::
`GoSearchProblem` extends the same `AdversarialSearchProblem` interface used in assignment 2. This means that we can directly use your implementations of minimax and $\alpha\beta$-search from assignment 2. Recall that both algorithms can take a cutoff depth, which limits the search depth. When a cutoff depth is used, a `HeuristicAdversarialSearchProblem` must be used. We have provided `GoProblemSimpleHeuristic`, which implements, as the name would suggest, a simple heuristic: the difference in the number of pieces for each player. This heuristic incentivizes capturing your opponent's pieces, which increases the difference between the number of your own pieces and the number of your opponent's pieces.
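Conceptually, the provided heuristic is just a piece-count difference. A minimal sketch, using a hypothetical `state.get_pieces(player)` accessor rather than the stencil's actual API (the real `GoProblemSimpleHeuristic` is already implemented for you):

```python
def simple_heuristic(state):
    """Piece-count difference: positive favors black (player 0),
    negative favors white (player 1). get_pieces() is hypothetical."""
    return len(state.get_pieces(player=0)) - len(state.get_pieces(player=1))
```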
Implement the `get_move` functions of `MinimaxAgent` and `AlphaBetaAgent` in `agents.py`. Note that `get_move` takes a `time_limit` as input; you do not need to account for it in your minimax and $\alpha\beta$-search implementations. We have provided implementations of `RandomAgent`, which randomly selects a legal action, and `GreedyAgent`, which selects the best action with 1-ply lookahead. Your agents should call your existing implementations from assignment 2; you should not need to rewrite anything.
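The intended structure is roughly the following sketch; `alpha_beta` stands in for your assignment 2 implementation (its actual signature and return value may differ), and the constructor arguments are assumptions about the stencil:

```python
class AlphaBetaAgent:
    def __init__(self, search_problem, depth=2):
        self.search_problem = search_problem
        self.depth = depth

    def get_move(self, game_state, time_limit):
        # time_limit can be ignored here; the fixed depth bounds the search.
        # `alpha_beta` is a stand-in for your assignment 2 implementation.
        return alpha_beta(self.search_problem, game_state, self.depth)
```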
**Running instructions**
**Find a suitable search depth**
:::info
**Task 1.2:** Iterative Deepening Agent (with cutoff time)
:::
Manually finding a suitable search depth is frustrating and inefficient. Even worse, the depth you found on your local computer may not work on another (slower) computer. It would be much better to have an *anytime algorithm*: an algorithm that can be terminated at... any time. Minimax and $\alpha\beta$-search are not anytime, because they explore actions from the root one at a time. If either algorithm is stopped before the full tree is searched, it will not have looked at some of the actions from the root (which might have been good actions to take). Iterative Deepening (from homework 1) *is* an anytime algorithm. Recall that Iterative Deepening runs DFS at increasing depths. If we run out of time, say while searching at depth 4, we can always return the best action we found at depth 3.
Implement the `get_move` method for `IterativeDeepeningAgent` in `agents.py`. Note that, unlike minimax and $\alpha\beta$-search, your previous implementation of Iterative Deepening ran on `SearchProblem` (not `AdversarialSearchProblem`). You will have to modify your existing Iterative Deepening implementation to work in the `HeuristicAdversarialSearchProblem` setting, incorporating concepts from minimax search (there is a minimizing player and a maximizing player) to account for the fact that it is an adversarial setting.
Add a time cutoff to the `get_move` method of your agent so that it always returns within `self.cutoff_time` seconds.
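One possible shape for this is sketched below; `minimax_at_depth` is a hypothetical stand-in for your depth-limited adversarial search, and the constructor is an assumption about the stencil:

```python
import time

class IterativeDeepeningAgent:
    def __init__(self, search_problem, cutoff_time=1.0):
        self.search_problem = search_problem
        self.cutoff_time = cutoff_time

    def get_move(self, game_state, time_limit):
        start = time.time()
        best_action = None
        depth = 1
        while time.time() - start < self.cutoff_time:
            # Hypothetical helper: depth-limited adversarial search returning the
            # best action at exactly this depth. In practice it should also check
            # the deadline internally so a deep search cannot overrun the cutoff.
            best_action = minimax_at_depth(self.search_problem, game_state, depth)
            depth += 1
        return best_action
```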
:::info
**Task 1.3:** Monte Carlo Tree Search
:::
Our previous agents are hindered by the weakness of our heuristic. We could try to fix this by finding a better heuristic (stay tuned for Part 2), but Go boards are generally difficult to evaluate. Go differs from games like chess or checkers, where counting up the pieces each player has already gives a good heuristic. In Go, captures are relatively rare and the game focuses on territorial control, a concept that is much harder to capture in a mathematical formula. This is part of the reason many people thought Go engines wouldn't surpass human play until far into the future: it's hard to encode the intuition for what makes a Go position favorable or not.
You will now implement Monte Carlo Tree Search (MCTS). MCTS does not rely on heuristics. Instead, MCTS performs game simulations (rollouts/playouts) to determine the value of states. Therefore, MCTS depends only on the rules of the game (by running simulations) and not on any heuristic function. Additionally, MCTS is an anytime algorithm. The longer MCTS runs, the more simulations it performs and the more confident it becomes in its decision, but it can be terminated at any time.
We have provided the `MCTSNode` class, which you will use to build your MCTS search tree. Implement each of the four methods described above (`select`, `expand`, `simulate`, and `backpropagate`) in `MCTSAgent`. Then implement the `get_move` method, which combines these four methods and returns the best move found after `self.cutoff_time` seconds.
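Putting the pieces together, `get_move` might follow the pseudocode roughly as in this sketch (field names like `children`, `visits`, and `action` on `MCTSNode` are assumptions, not the stencil's documented API):

```python
import time

def get_move(self, game_state, time_limit):
    start = time.time()
    root = MCTSNode(game_state)
    while time.time() - start < self.cutoff_time:
        leaf = self.select(root)
        children = self.expand(leaf)
        results = self.simulate(children)
        self.backpropagate(results, children)
    # Pick the most-visited child: visit count is a more robust criterion
    # than raw value when simulation counts are small.
    best_child = max(root.children, key=lambda child: child.visits)
    return best_child.action
```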
**Running Instructions**
**Handin Instructions for Part 1**
## Part 2: Learned Heuristic Functions
In this section, you will revisit heuristics. You will learn a series of heuristics using supervised learning and reinforcement learning and compare their performance to `MCTSAgent` and the other agents produced in part 1.
:::info
**Task 2.1:** Learn a value function with supervised learning
:::
It's hard for humans to determine what makes a Go position favorable for a player; if it were easy, Go itself would be easy, and Go is definitely not easy. Instead of trying to hand-craft a complicated set of rules for a heuristic, we will try to learn a neural network model for a heuristic.
You will work in `supervised_learning_go.ipynb` to train a neural network to produce a value estimate for a given `GoState`. Follow the instructions in the notebook and save a model file.
:::info
**Task 2.2:** Implement GoLearnedHeuristic
:::
We will now use the model trained in Task 2.1 to create a heuristic that can be used in our existing algorithms (other than MCTS for now). In `heuristic_go_problems.py` implement `GoProblemLearnedHeuristic`, which loads your trained model, encodes a state into the desired input format for your model, and evaluates your model.
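A minimal sketch of the shape this class might take, assuming a PyTorch model saved with `torch.save(model, path)`, a hypothetical `state.get_board()` accessor, and a flattened-board encoding; the method name `heuristic` and its signature are also assumptions, so match the stencil's actual interface and your notebook's actual encoding:

```python
import numpy as np
import torch

class GoProblemLearnedHeuristic:  # the real class should extend the stencil's heuristic problem class
    def __init__(self, model_path):
        # Assumes the notebook saved the full model with torch.save(model, path).
        self.model = torch.load(model_path)
        self.model.eval()

    def encode(self, state):
        # Hypothetical encoding: flatten the board into the feature vector
        # your model was trained on; adjust to match your notebook exactly.
        board = np.asarray(state.get_board(), dtype=np.float32)
        return torch.from_numpy(board.flatten()).unsqueeze(0)

    def heuristic(self, state, player_index):
        # Evaluate the network without tracking gradients (faster inference).
        with torch.no_grad():
            value = self.model(self.encode(state))
        return value.item()
```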
Test your problem with `GreedyAgent`: create two GreedyAgents, one using the simple heuristic and one using the learned heuristic, and compare their performance over some number of games. Does the agent with the learned heuristic outperform the one with the simple heuristic?
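A rough harness for such a comparison (hypothetical: `play_game` stands in for however the support code runs a single game and reports the winner):

```python
def compare(agent_a, agent_b, num_games=20):
    """Return agent_a's win rate, alternating colors to cancel black's advantage."""
    wins = 0
    for i in range(num_games):
        if i % 2 == 0:
            winner = play_game(black=agent_a, white=agent_b)  # hypothetical runner
            wins += winner == "black"
        else:
            winner = play_game(black=agent_b, white=agent_a)
            wins += winner == "white"
    return wins / num_games
```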
Now try using $\alpha\beta$-search with the learned heuristic. You may find that you can't run the learned heuristic at the same depth as the simple heuristic: neural networks take longer to evaluate than simply counting the number of pieces for each player. Is a better heuristic worth it if we can't search as deep in the tree? It depends! You may want to try smaller neural networks (fewer layers, smaller layers) and see if you can get the same prediction accuracy with a faster evaluation time.
:::info
**Task 2.3** Train Neural Network Policy with Reinforcement Learning
:::
A model trained with supervised learning is limited by the availability of data. It might learn an accurate value function for the distribution of data it has observed, but may fail if an *out of distribution* state is entered. What if your opponent plays a move a human would never play (e.g., starting in a corner)? In general, we can *hope* that supervised learning can generalize to out of distribution data, but we have few guarantees.
Instead of relying on expert performance, we can use reinforcement learning and simulated games to gather more data.
## Part 3: Final Agent
:::info
**Task 3.1** Create the strongest agent you can
:::
:::info
**Task 3.2** Provide a writeup of your strategy in your Readme
:::
### Possible Directions
1. Time Management: Up until this point, we have only spent a constant amount of time on each move (or searched to a constant depth). Go masters, however, do not spend a constant amount of time on each move. They play the opening quickly, spend more time thinking in the early middlegame, and then speed up again towards the end of the game. They do not waste time in positions with only one good move and instead spend their thinking time on the most complex positions.
2. Transposition Table: We tend to treat `AdversarialSearchProblem`s as search over a tree of states. However, it is often possible to reach the same board state through multiple move orders. Since states can be reached from multiple predecessors, the search actually takes place on a Directed Acyclic Graph (DAG)! If the same board state appears in the search tree twice, we should only compute its value once (see the sketch after this list).
3. Opening Book: How long should the first player spend thinking on the first move? It has the highest branching factor (82 possible actions for a 9x9 board), so a naive minimax will spend the most time thinking on the first move. Chess and Go agents will often use what's called an [Opening Book](https://www.chessprogramming.org/Opening_Book). The opening book contains a set of positions and actions to take in those positions. If a position is in the book, the agent plays the book move. Once a position is no longer found in the opening book, the agent starts to run other search algorithms. Avoiding unnecessary computation and making moves quickly will allow agents to spend more time thinking later in the game. A sample 9x9 set of openings can be found [here](https://online-go.com/review/291143). Note: the data can be downloaded in [sgf](https://mjw.woodcraft.me.uk/sgfmill/doc/1.1.1/) format, which you used during the supervised learning portion of Part 2.
4. Rapid Action Value Estimation (RAVE): Achieves master-level performance in 9x9 Go. Details of this algorithm can be found in this [paper](https://www.cs.utexas.edu/~pstone/Courses/394Rspring13/resources/mcrave.pdf). The paper also contains a literature review of other techniques for state-of-the-art Go play, which you might find useful.
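For the transposition-table direction above, the core caching pattern might look like the following sketch (`minimax_value` is a hypothetical stand-in for your search's evaluation of a state; it assumes states can be hashed into a key):

```python
transposition_table = {}

def evaluate_with_cache(problem, state, depth):
    """Memoize search results so a position reached via different move orders
    is only evaluated once."""
    key = (hash(state), depth)
    if key not in transposition_table:
        transposition_table[key] = minimax_value(problem, state, depth)
    return transposition_table[key]
```

Note that with $\alpha\beta$ pruning, a cached value also depends on the bounds it was computed under, so real transposition tables usually store a flag indicating whether an entry is exact, a lower bound, or an upper bound.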
## Tips and Hints
1. **Version control is your friend only if you use it well.** Don't rely on ctrl+z to go backwards if you introduce bugs or your agent's performance decreases. Commit when you have an agent you are happy with, and write a useful commit message! If you have to revert later, you'll be glad you gave yourself an informative commit message.
2. **Test early and often.** This is a large project with many components. You can't expect to write every component and put them together without any issues. Test everything you write **independently** as soon as possible.
3. **Premature optimization is the root of all evil (or at least most of it) in programming** - [Donald Knuth](https://www-cs-faculty.stanford.edu/~knuth/taocp.html). The faster your code runs, the more search nodes your agent can explore, and you can get significant improvements in play from optimizing your code's performance. However, as Donald Knuth put it:
"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."
Only spend time optimizing your code once you know it works as intended and is bug free.
4. TAs are here to help, but keep in mind the difficulty of what they have been tasked with. This is an open-ended project and everyone will be implementing different approaches. If you have a bug specific to your implementation, it may be harder for them to help than during a normal homework assignment. You should view TA hours as an opportunity to talk through your planned approach and get a second opinion about your plan. If you want help debugging, be prepared with specific descriptions of what is wrong, how you've determined what is wrong, and what you've done to try to fix it.
## Rules
- No *pondering*: Pondering refers to the practice of thinking during your opponent's turn. While it is interesting to think about how you might do this, we want to avoid any nefarious or sabotaging behavior (spooling up useless work during your opponent's turn to starve them of computational resources). If you use multiple threads, you must terminate them before `get_move` returns.
- No catching of `support.TimeoutException` (or generic exceptions). If your agent takes longer than the time cutoff, we will terminate it and your agent will forfeit the game. If your agent attempts to evade this termination (by catching the exception we will interrupt with), you will be disqualified from the entire tournament.
- No use of the system `signal` library. Again, we wish to avoid nefarious behavior. Anything deemed an attempt to sabotage your opponent will result in instant disqualification from the entire tournament.
- Your agent must be written in Python. Search algorithms work better the more states they can explore, and Python is relatively slow compared to compiled languages like C and C++. Yes, your agents would be more efficient if written in a faster language, but that is not in the spirit of the class: the focus should be on the AI techniques used, not the programming language used.
## Downloads
Please click [here]() to access the assignment code.
### Support Code
### Stencil Code
### Testing Code
- `game_dag.py`: Implements an `AdversarialSearchProblem` as a game DAG (directed, acyclic graph). You can use the `GameDAG` class to create simple, abstract adversarial search problems for testing purposes. Read the `GameDAG` class's docstrings to learn how to create a `GameDAG`.
- `unit_tests.py`: Includes basic test cases and correctness checks for simple game trees. Below you can see a visualization of the game tree created in the `_get_test_dag` function in `unit_tests.py`. You can test your various implementations by running them on the game trees generated by `_get_test_dag` and `_get_test_dag_2`. These tests should take no more than 1 second.
## Submission
### Handin
Your handin should contain the following:
- all modified files, including comments describing the logic of your implementations, and your tests
- the graphs you generated
- and, as always, a README containing:
- a brief paragraph summarizing your implementation and highlighting the outcomes of all tests
- your solutions to any conceptual questions
- known problems in the code
- anyone you worked with
  - any outside resources used (e.g., Stack Overflow, ChatGPT)
### Gradescope
Submit your assignment via Gradescope.
To submit through GitHub, follow these commands:
1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`
Now, you are ready to upload your repo to Gradescope.
:::danger
**⚠️WARNING⚠️** Make sure you have the correct assignment repo and branch selected before submitting.
:::
*Tip*: If you are having difficulties submitting through GitHub, you may submit by zipping up your hw folder.
### Rubric
:::success
Congrats on submitting your homework; Steve is proud of you!!

:::