# CSCI 0410/1411 Spring 2026 Final Project Part 1: Go Engine
==**Due Date: 05/04/2026 at 11:59pm**==
**Need help?** Feel free to post on [Edstem](https://edstem.org/us/courses/93617) for TA assistance.
:::danger
**⚠️ Battle Alert 🏆 ⚠️** You must activate the CS410 virtual environment every time you work on a CS410 assignment! You can find the activation instructions [here](https://hackmd.io/@cs410/BJvhqHXuR). If you forget to activate this virtual environment, you will almost certainly encounter import errors, and your code will not run.
:::

## Introduction
Welcome to the end of trophy road! You've officially made it past the mega knights, past rage quits, and to the final battle. Along the way, you've learned new strategies, mastered key skills, and absolutely demolished all the towers in your way. Just one battle left, good luck! ⚔️
Go (*a.k.a.* Weiqi, Igo, or Baduk) is a two-player zero-sum game of strategy popular throughout the world. Game-playing agents have long been a focus of the AI community. Indeed, the term “machine learning” was coined by Arthur Samuel in 1959, while building a Checkers-playing agent that learned by playing against itself. Since then, AI researchers have capitalized on Samuel’s idea of self-play to build successful game-playing agents for many parlor games, including Backgammon (TD-Gammon, 1992) and Chess (Deep Blue, 1997). Nonetheless, until very recently, it was thought that Go-playing agents that could perform as well as humans were out of reach. When AlphaGo defeated the world champion Lee Sedol in 2016, it shocked the AI and Go communities alike. In this project, you will implement game-playing agents for the game of Go. You, too, may be shocked by how well your agent learns to play!
## Learning Objectives
The goal of the final project is to help you synthesize the material you’ve learned throughout this course. We will provide structure for Parts 1 and 2, but the techniques you choose to use for Part 3, your final agent, will be up to you.
What you will be able to do:
* Combine techniques from different subfields of AI to produce a strong Go-playing agent
* Design experiments to evaluate the performance of different game-playing agents, and summarize the results in easy-to-interpret tables or plots
What you will understand:
* How novel methods in deep learning can be combined with classical methods in AI so that each improves upon the other.
## Setup
There are two possible *backends* you can use in this project. The *backend* refers to the game engine that runs the game of Go (i.e., checks valid moves, scores the game, and keeps track of related info). The first option for a backend is [Open-Spiel](https://github.com/google-deepmind/open_spiel), a library developed and maintained by Google DeepMind for research on learning and games. It contains implementations of many popular games, including Go. You can install it with the following command:
`pip install open_spiel`
Unfortunately, Open-Spiel does not cooperate well with the Windows operating system. It is possible to install, but it requires some additional [work](https://openspiel.readthedocs.io/en/latest/windows.html). The alternative is to use the implementation of Go provided in the `pygo` folder of your repository. If you do not have `open_spiel` installed, the `pygo` backend will be used automatically (i.e., you don’t have to do anything special). So what’s the difference, and why have two options (and why is Open-Spiel hard to install on a Windows computer)? Open-Spiel uses an implementation of Go written in C++ and compiled to execute very quickly. When running a large number of games, every optimization helps. When we evaluate your code (i.e., on Gradescope and during the final tournament), the `open_spiel` backend will be used to produce results faster.
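If you're curious how the fallback works, it amounts to a try/except around the import. The snippet below is a minimal sketch of that pattern (not the project's actual selection code), which you can also use to check which backend is available on your machine:
```python
# Minimal sketch of an import-with-fallback check (not the project's actual
# backend-selection code): prefer Open-Spiel's compiled engine if it is
# installed, otherwise fall back to the pure-Python pygo implementation.
try:
    import pyspiel  # Open-Spiel's Python bindings
    BACKEND = "open_spiel"
except ImportError:
    BACKEND = "pygo"

print(f"Go backend in use: {BACKEND}")
```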
## Go

Like Checkers and Chess, Go is a two-player, zero-sum, deterministic, alternating-move game. Also like Checkers and Chess, Go is played on a grid. Two players alternate placing colored stones on the grid’s intersection points. The first player to move places a black stone, while the second player places a white stone. The two then alternate until the game ends. If one player completely encircles any of the other’s stones, the encircled stones are captured and removed from the board. At any point, a player may pass. The game ends when no legal moves are available, or if both players pass consecutively. When the game ends, the players tally their total [territory](https://www.pandanet.co.jp/English/learning_go/learning_go_4.html), the empty intersections each has surrounded. The player who has surrounded more territory wins.
:::success
There are many other variations of scoring for Go, such as counting captured pieces as part of the score.
:::
Go can be played on a variety of board sizes, ranging from as small as 5x5 (beginner play) up to 19x19 (professional play). In this project, your agents will play on 5x5 and 9x9 boards.
:::success
If you want to play against your agent, the GUI is much faster on 5x5 boards. RL is also much faster on 5x5 boards.
:::
The player who moves first (Black, in Go) typically has a significant advantage, as in Tic-Tac-Toe, Connect 4, Chess, etc. To account for this advantage, additional points, called komi, are added to White’s score at the end of the game. We will be using a komi of 5.5 in this project for 9x9 games. This means that Black has to score 6 or more points better than White to win. For games on a 5x5 board, a komi of 0.5 is used. If a fractional value is used for komi, the game cannot end in a draw.
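To make the arithmetic concrete, here is a tiny, self-contained example of how komi decides a 9x9 game (the territory totals below are made up; the backend computes them for you):
```python
# Toy komi example (made-up territory totals): White's komi is added to
# White's score, so Black must finish at least 6 points ahead to win at 5.5 komi.
KOMI = 5.5

black_territory = 40
white_territory = 35

white_total = white_territory + KOMI          # 40.5
winner = "Black" if black_territory > white_total else "White"
print(winner)  # White wins: Black is only 5 points ahead, which is not enough
```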
We highly encourage you to try out an interactive Go tutorial, such as [this one](https://gomagic.org/how-to-play-go-rules/), to better understand the rules of Go. There are also many [video](https://www.youtube.com/watch?v=5PTXdR8hLlQ) introductions to Go available online.
### Time Controls
Many competitive board games are played with per-move or per-game time limits; without such limits, games could take forever! These time limits are often referred to as *time controls*. At the start of each player’s turn, a clock starts counting down. After the player makes a move, the opponent’s clock begins. If a player’s remaining time runs out before a move is made, that player forfeits the game. Our games will use **incremental** time control, in which each player is allocated an initial block of time (in this project, 15 seconds), plus an additional increment of time after every move (in this project, 1 second). For example, if a player makes 200 moves over the course of a game, that player has a total of 15 + 200 × 1 = 215 seconds of search time. Time management, meaning how a player uses the available time, is key to a successful Go agent.
:::success
And to a successful Brown student!
:::
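One simple (and entirely optional) way to manage an incremental clock is to budget each move as the increment plus a small slice of the remaining bank. The sketch below is one possible policy under the 15-second/1-second settings above; the constants are illustrative, not required:
```python
# Sketch of a possible per-move time budget under incremental time control.
# The bank fraction and safety margin are illustrative guesses; tune them.
INCREMENT = 1.0        # seconds gained after every move in this project
SAFETY_MARGIN = 0.05   # reserve a little time for move/bookkeeping overhead

def move_budget(time_remaining: float, bank_fraction: float = 0.05) -> float:
    """How many seconds to spend on the current move."""
    budget = INCREMENT + bank_fraction * max(time_remaining - INCREMENT, 0.0)
    return max(budget - SAFETY_MARGIN, 0.01)

# With the full 15-second bank: 1 + 0.05 * 14 - 0.05 = 1.65 seconds
print(move_budget(15.0))
```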
## Part 1 Phase 1: Heuristic Tree Search
Your first task in this project is to create a set of simple benchmarking agents to have on hand for evaluating the more sophisticated agents you will build later.
Our Go engine implements the same abstract `AdversarialSearchProblem` interface we used in Homework 3, when you built minimax and αβ-pruning agents to play Tic-Tac-Toe and Connect 4. As a result, you can import your implementations of minimax and αβ-pruning into your final project code base.
As the state space in Go is vast (e.g., each of the 81 intersections on a 9 × 9 board can be empty, black, or white, making for roughly 3^81 board configurations), it will not be feasible to run minimax and αβ-pruning without limiting the depth of the search. Recall that the depth-limited versions of these algorithms require a heuristic, i.e., a means of evaluating the quality of an arbitrary state. We have provided a very simple heuristic, `GoProblemSimpleHeuristic`, to jump start your use of your existing adversarial search implementations. This heuristic evaluates a state by computing the difference between the number of black and white stones on the board. We also include an implementation of `GreedyAgent`, which uses `GoProblemSimpleHeuristic` to greedily select an action.
In Homework 3, your agents could take as long as you were willing to wait for them to make decisions. In contrast, in this project, time is not unlimited. Since you are allocated an additional one second (beyond the initial 15) for every move, the easiest way to make sure your agent does not run out of time is to make sure it uses less than one second per move. For minimax and αβ-pruning, one way to do this is to find a search depth that always terminates within one second.
However, fixing a search depth throughout a game is very limiting. At the start of the game, Black, the first player, has 81 legal moves.
:::success
Ignoring pass for simplicity.
:::
The second player then has 80 legal moves, which yields a total of (81)(80) = 6480 possible states after the first two moves. Later in the game, when there are, for example, 20 available moves, searching to a depth of 3 plies explores roughly the same number of states as just 2 plies at the start of the game ((20)(19)(18) = 6840).
:::success
Assuming no captures took place and no player passed.
:::
### Phase 1 Tasks
1. Run 100 games of `GreedyAgent` vs `RandomAgent` using the command:
```
python game_runner.py --agent1 greedy --agent2 random --mode tournament --num-games 100
```
A tournament outputs the following statistics:
* Score for each agent, i.e., total wins − total losses
* Score as Black (first mover) for each agent
* Score as White (second mover) for each agent
:::info
**Task 1.1**
What are the results? Do the scores surprise you at all?
Report your findings in your README.
:::
2. Play a few games against the `GreedyAgent` using the command:
```
python game_runner.py --agent1 greedy --mode gui
```
:::success
Use the arrow keys to select a cell to play and use the enter key to take an action.
:::
Or, if you are using GitHub Codespaces, Windows WSL, or some other system without graphics capabilities, you can also run:
```
python game_runner.py --agent1 greedy --mode text
```
:::info
**Task 1.2**
What are the results? Do the scores surprise you at all?
Describe your findings in your README.
:::
:::info
**Task 1.3**
`GreedyAgent` currently has a deterministic strategy. Try shuffling the order of the actions before evaluating them. Does this alter its performance against `RandomAgent`? If so, why are the results so different? Explain your results in your README.
:::
:::success
**Tip**
In the `get_available_actions` method of `GoSearchProblem`, shuffle the actions before they are returned. Almost every algorithm you implement will benefit from some amount of randomness.
:::
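A minimal sketch of that tip (illustrative only; the real change goes inside `GoSearchProblem.get_available_actions` in the stencil):
```python
import random

# Sketch only: shuffle the legal actions before returning them so that ties
# between equally-valued moves are broken randomly rather than deterministically.
def shuffled_actions(actions):
    actions = list(actions)   # copy so the caller's list is not mutated
    random.shuffle(actions)   # in-place shuffle
    return actions

print(shuffled_actions([0, 1, 2, 3, 4]))
```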
:::info
**Task 1.4**
Implement `MinimaxAgent` and `AlphaBetaAgent` in `agents.py`. As these agents implement the `GameAgent` interface, all that is required is that you implement the `get_move` method for each agent. This method should simply call your implementations of `MiniMax` and `AlphaBetaSearch` from Homework 3.
* We have included command line arguments for `AlphaBetaAgent` in `game_runner.py`. If you’d like to use `game_runner.py` for new agents, you can add them to the create_agent method of `game_runner.py`.
* **Tip**: In this project, the terminal values of the game are either +1 or −1. In contrast, in Connect 4 in Homework 3, the values were +∞ or −∞. You may have to adapt your Homework 3 implementations slightly to handle this change (by multiplying terminal values by some large number in your search algorithm, for example).
:::
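To make the terminal-value tip concrete, here is a small, self-contained sketch of scaling terminal outcomes so that a guaranteed win or loss always dominates any heuristic estimate (the function name and scale factor are illustrative, not part of the stencil):
```python
# Sketch of the terminal-value tip: scale the +1/-1 terminal outcomes by a large
# constant so they always outrank heuristic values of non-terminal states.
TERMINAL_SCALE = 1_000_000

def backed_up_value(is_terminal, terminal_value, heuristic_value):
    """Value to back up for a node in depth-limited minimax / alpha-beta."""
    if is_terminal:
        return TERMINAL_SCALE * terminal_value   # certain win/loss dominates
    return heuristic_value                       # otherwise trust the heuristic

print(backed_up_value(True, -1, 0.0))   # -1000000: a certain loss
print(backed_up_value(False, 0, 7.5))   # 7.5: heuristic estimate
```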
:::info
**Task 1.5**
Experiment with both of your agents to discover a search depth that obeys Go’s time constraints. What is this search depth for each agent? Include your answers in your README.
:::
Well done! Your agent will be able to inflict some serious damage ⚔️

## Part 1 Phase 2: Learning a Heuristic
In this phase of Part 1, you will be working in the `supervised_learning.ipynb` notebook. Working through the notebook will cover most of the tasks described in the following section.
### The Data
Supervised learning is the task of building a function approximator (i.e., a model) from labeled data. We provide the dataset for you to learn from. The dataset consists of a list of tuples of the form (state, action, outcome). We have provided two copies of this dataset: in `dataset_5x5.pkl`, the states are stored as Py-Spiel states, while in `dataset_5x5_pygo.pkl`, they are represented as PyGo states. Our PyGo board states take significantly more space to store, so we have reduced the PyGo dataset to 1/4 the size of the original Py-Spiel dataset. This should not significantly impact the quality or performance of your models.
The 5×5 version of Go is used to teach people to play the game, but most people who fancy the game are quick to move on to larger boards. As a result, there are very few high-level games available to learn from. We therefore provide a dataset of games played by our `MCTSAgent` (which you will implement in Part 2), which outperforms the agents you have built so far in Part 1 of the project.
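The notebook handles loading for you, but if you want to poke at the data yourself, a minimal sketch (assuming the file unpickles to the list of (state, action, outcome) tuples described above) looks like this:
```python
import pickle

# Sketch only: inspect the dataset's assumed (state, action, outcome) layout.
with open("dataset_5x5.pkl", "rb") as f:
    dataset = pickle.load(f)

print(f"{len(dataset)} examples")
state, action, outcome = dataset[0]
print(action, outcome)   # outcome indicates which player eventually won
```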
### Task 2.1: Feature Encoding
Recall that supervised learning, or function approximation, is a means of building a model from inputs to outputs. The inputs are generally described in terms of **features**. The Go dataset, however, does not contain features. It contains `GameState` objects. As a result, your first task is to write an **encoder**, meaning a function that converts `GameState`s into feature vectors.
What makes for a good encoding, or a good choice of features? We contend that an encoding should be both *expressive* and *informative*.
An expressive encoding is one that is capable of expressing many, if not all, of the possible inputs to the model, in our case the `GameState`s. An informative encoding encodes information about the input that is relevant to the task at hand, in our case, playing the game of Go. An expressive feature encoding for the game of Go should encode the positions of all the stones on the board. As a first attempt, you might try representing this information using two lists of coordinates, one for Black and another for White. A difficulty with this approach, however, is that these lists vary in size, and MLPs cannot handle variable-sized inputs.
:::success
There are techniques to handle variable input sizes. Large Language Models, for instance, handle different context sizes (i.e., number of words in the prompts).
:::
How can we instead represent all possible `GameState`s using features that are constant in size? The board size is constant, so perhaps we could use one feature per cell—25 features for a 5 × 5 board—each of which can take on one of three possible values. For example, we could represent an empty cell by 0, a cell with a white stone on it by 1, and a cell with a black stone on it by 2. Encoding categorical variables as continuous values is *not* generally a good idea, however, because a model can ascribe meaning to the continuous values where there is none: e.g., it might surmise that black stones are worth twice as much as white stones.
The preferred way to represent categorical features is to use a **one-hot encoding**. To build a one-hot encoding of the aforementioned representation of the positions of the stones on a 5 × 5 Go board, i.e., 25 features, each of which can take on three possible values, we create 25×3 = 75 binary features. The first 25 are on or off depending on whether the cell is empty or not; the next 25 are on or off depending on whether the cell is occupied by a white stone or not; and the final 25 are on or off depending on whether the cell is occupied by a black stone or not. Note that there is redundancy in this feature representation: if a cell is occupied, it is not empty, and vice versa. Indeed, the 50 features indicative of the black or white stones’ positions are alone sufficient to represent all the possible positions of the stones on a 5 × 5 Go board.
One additional feature is necessary to fully capture `GameState`, namely the player-to-move. These 51 binary features are sufficient to represent all the `GameState`s in 5 × 5 Go. In other words, this encoding is fully expressive. But expressivity alone is not enough; your encoding must also be informative! If you do not also encode enough relevant information about your inputs in your features, then no matter how big your dataset is, and how fancy your neural network is, it will never be able to learn effectively. In the case of games (and single-agent sequential decision making), an informative feature set encodes information about the state that facilitates choosing good actions (or making good decisions).
| | Not Expressive | Expressive |
|----------------|------------------|-----------------------------------------------------------------------------|
| Not Informative| – | A single integer feature between 0 and 2⁵¹, where each game state is mapped to a unique value |
| Informative | Number of pieces per player | **Your Goal** |
Feature engineering/encoding is an often overlooked aspect of machine learning, but it can be key to the practical success of models. The neural network built for AlphaGo used 17 features per grid cell on a 19 × 19 board for a grand total of 6137 features! You do not need to use anywhere near that many features in this project, but you should dream up a few informative features, and then use them together with the expressive encoding we described above to improve your models.
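As a starting point for `get_features`, here is a minimal sketch of the 51-feature expressive encoding described above. The input format (a 5x5 array with 0 = empty, 1 = black, 2 = white, plus a player-to-move flag) is an assumption for illustration; adapt it to however your `GameState` actually exposes the board:
```python
import numpy as np

def encode_board(board, player_to_move):
    """Sketch of the 25 + 25 + 1 feature encoding described above.

    Assumes `board` is a 5x5 array with 0 = empty, 1 = black, 2 = white, and
    `player_to_move` is 0 for Black, 1 for White. Adapt to your GameState API.
    """
    board = np.asarray(board)
    black_plane = (board == 1).astype(np.float32).flatten()   # 25 features
    white_plane = (board == 2).astype(np.float32).flatten()   # 25 features
    return np.concatenate([black_plane, white_plane, [float(player_to_move)]])

toy_board = np.zeros((5, 5), dtype=int)
toy_board[2, 2] = 1   # a single black stone in the center
print(encode_board(toy_board, player_to_move=1).shape)   # (51,)
```
Any additional informative features you dream up can simply be concatenated onto the end of this vector.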
:::info
**Task 2.1.1**
Write `get_features`, which encodes `GameState`s as feature vectors. Test its correctness on a few sample inputs and outputs.
:::
:::success
You cannot evaluate the efficacy of your encoding until you train a model to fit your dataset (Task 2.2) and build agents based on these models (Task 2.3).
:::
### Task 2.2: Learning a Value Function
Adversarial games like Go are sequential decision problems, albeit for two players. By fixing the strategy of the opponent (e.g., the dealer in Blackjack), these games can be conceptualized as (single-agent) MDPs. When a zero-sum game with rewards +1 and −1 is viewed as an MDP, the value of a state—by definition, the expected sum of future rewards—is indicative of how likely the agent is to win or lose the game from that state. Learning a value function can therefore be framed as a binary classification problem, where the goal is to classify each state as a future win for one player or the other, or more generally, to generate a prediction in the range [−1,+1] that is indicative of which player will win the game.
:::info
**Task 2.2**
Your task is to implement a neural network that represents a value function and to train it using the data provided to predict the outcome of a game given a(n encoded) state. Be sure that your training and test error decrease as your model learns.
:::
:::success
**Tip 1**
Pytorch provides a number of built in [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions). We covered some of these in class (e.g., mean squared error, log-loss/cross entropy). Which one is applicable to predicting the outcome of a game?
:::
:::success
**Tip 2**
In previous assignments, you implemented gradient descent from scratch (i.e., computing $\theta - \alpha \nabla_\theta f$ explicitly). Pytorch provides a powerful set of optimizers that implement variations of **stochastic gradient descent (SGD)**. In the stencil code, we provide an example of how to use one of these optimizers, namely Adam. Adam (Adaptive Moment Estimation) is a variation of SGD that uses momentum to try to escape local minima quickly. We provide you with an outline of the training loop, but leave it to you to fill in the details.
:::
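Putting the two tips together, here is a minimal, self-contained sketch of a value network and its Adam training loop. The 51-dimensional input, layer widths, Tanh output, MSE loss, and hyperparameters are illustrative choices, and the notebook's stencil may organize things differently:
```python
import torch
import torch.nn as nn

# Sketch only: a small MLP value network trained with Adam on +1/-1 outcomes.
class ValueNetwork(nn.Module):
    def __init__(self, input_size=51, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Tanh(),   # squashes the prediction into [-1, +1]
        )

    def forward(self, x):
        return self.net(x)

model = ValueNetwork()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()   # regress toward the +1 / -1 game outcome

# Fake batch just to show the loop; replace with your encoded dataset.
features = torch.randn(32, 51)
outcomes = torch.randint(0, 2, (32, 1)).float() * 2 - 1   # random +/-1 labels

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), outcomes)
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```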
### Task 2.3: Let's Play!
Your next task is to incorporate your supervised learning models into a Go-playing agent. There are myriad ways to accomplish this task.
:::success
The AlphaGo algorithm itself is perhaps the most well-known among them. In Part 3 of the final project, which is open ended, you may choose to implement AlphaGo.
:::
:::info
**Task 2.3**
Implement an agent that utilizes your learned value function as a heuristic. You can first implement this agent within `supervised_learning.ipynb`, but **you need to copy the relevant agent code to `agents.py` and `heuristic_go_search_problems.py` before moving on so that you can use this agent later**. Copy the necessary methods and classes from the notebook over to `agents.py` so that the method `create_value_agent_from_model` can run and return an agent. You will need to include the agent definition in `agents.py`, the learned heuristic search problem definition, and the feature encoding function.
:::
:::warning
Make sure to copy the relevant agent code to `agents.py` and `heuristic_go_search_problems.py`
:::
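One common pattern (a sketch, not a required design) is to wrap the trained network in a heuristic function and hand it to the same depth-limited search you used in Phase 1. Here, `get_features` is your Task 2.1 encoder and `model` is your trained value network; both are assumed to exist already:
```python
import torch

# Sketch only: turn a trained value network into a heuristic for search.
# `model` and `get_features` are your own Task 2.1/2.2 objects, passed in here.
def learned_heuristic(state, model, get_features):
    features = torch.tensor(get_features(state), dtype=torch.float32)
    with torch.no_grad():                       # no gradients needed at play time
        value = model(features.unsqueeze(0)).item()
    return value   # in [-1, +1]; higher is better for the maximizing player
```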
You can test that your agent runs by running:
```
python game_runner.py --agent1 learned --agent2 greedy --mode tournament --games 20
```
Congrats! You've taken down the first tower-- only 2 more to go until you win the final battle and join The Hall of Eternal Glory 👑
## Downloads
<!-- MAKE SURE TO ADD LINK -->
You can access the support and stencil code for this assignment through GitHub Classroom.
## <font color="#f00">REPLACE THIS STENCIL CODE LINK.</font>
### Support Code
* This project uses `GameAgent`, `GameState`, `AdversarialSearchProblem`, and `GameUI` from Homework 3. These classes can be found in `adversarial_search_problem.py`
* `GoSearchProblem.py`: Provides implementations of `GoState` and `GoSearchProblem`, which implement `GameState` and `AdversarialSearchProblem`, respectively.
* `heuristic_go_problems.py`: Contains an implementation of `GoProblem` and a very simple heuristic `GoProblemSimpleHeuristic`, which you can pair with your `MinimaxAgent` and `AlphaBetaAgent`.
* `game_runner.py`: Contains methods for running games between two agents, and tournaments among multiple agents.
* `go_gui.py`: Provides an implementation of `GameUI`, for visualization and user interaction.
* `go_utils.py`: Contains utility functions for setting up adversarial search problems.
### Stencil Code
* `agents.py`: Where you will implement your agents. `GameAgent` defines an interface that your agents must implement. The core of the agent behavior is the `get_move(state, time_limit)` method, which returns an `Action` given the current game state and the remaining time on the clock.
* `supervised_learning.ipynb`: Where you will develop code to learn a value function.
## Submission
### Gradescope
Submit your assignment via Gradescope.
To submit through GitHub, follow this sequence of commands:
1. `git add -A`
2. `git commit -m "commit message"`
3. `git push`
Now, you are ready to upload your repo to Gradescope.
Tip: If you are having difficulties submitting through GitHub, you may submit by zipping up your hw folder.
:::danger
**⚠️WARNING⚠️** Make sure you have the correct assignment repo and branch selected before submitting.
:::
### Rubric
| Task | Points | Details |
|-------------------------------------------|--------|-------------------------------------------------------------------------|
| Heuristic Search Agents | 20 | Points awarded for implementing the `get_move` function in `MinimaxAgent` and `AlphaBetaAgent` |
| Feature Encoding | 10 | Points awarded for expressive and informative feature encodings |
| Learning a Value Function | 25 | Points awarded for learning a value function that achieves high accuracy on the train and test set |
| Building an Agent with a learned heuristic| 25 | Points awarded based on whether the agent works (i.e., returns a move when a state is provided) and its performance |
| README Questions | 20 | Points awarded for answering each README question |