# Gaming Marathon with AI
#### Description: In a gaming marathon, we investigate how Large Language Models (LLMs) enhance gameplay through strategic planning, behavioral analysis, and adaptive learning. By incorporating human evaluation and crowdsourced benchmarks, we aim to improve LLM performance and gaming experiences.
## Why Evaluate LLMs in Games?
The need to evaluate LLM agents in dynamic and interactive settings is paramount as AI continues to integrate into various facets of society. Games provide a unique platform for this evaluation due to their complex, rule-based environments that require strategic thinking, real-time decision-making, and adaptive learning. The objective of this experiment is to test the practical applications of LLMs in enhancing gameplay and to understand how AI can contribute to developing more engaging and intelligent gaming experiences.
SmartPlay is a benchmark introduced to evaluate LLMs' performance as intelligent agents through various games, each challenging different cognitive abilities such as planning, reasoning, and learning from interactions. This benchmark is designed to provide a comprehensive assessment of LLMs' capabilities in a controlled yet dynamic environment <d-cite key="wu2023smartplay"></d-cite>.
## The Gaming Marathon Experiment
To test the capabilities of LLMs in a practical, engaging manner, we organized a gaming marathon featuring four games, each highlighting a different aspect of LLM strategic thinking, planning, and adaptability.
#### Incorporating Human Evaluation
To truly evaluate the capabilities of LLMs in gaming, it is crucial to incorporate human evaluation during the gaming process. Human oversight, interactive feedback, and expert insight help guide the LLMs, ensuring their actions align with the game’s objectives and adapt effectively in real-time.
### Game 1: Among Us

**Game Description**:
Among Us is a multiplayer game in which players are assigned the role of "impostor" or "crewmate". Impostors win by eliminating crewmates until the impostors equal or outnumber the remaining crew. Crewmates, on the other hand, secure victory either by completing all their tasks before being wiped out or by rooting out the impostor during meetings.
**Abstract**
Through this experiment, we use the game Among Us to analyze ChatGPT's problem-solving skills, deductive reasoning, and ability to produce effective lies from pre-existing knowledge. To preface, ChatGPT is an AI designed to respond to human inquiries based on the vast amounts of information it is trained on. By applying it to Among Us, a strategy game in which crewmates try to unmask the impostor before the impostor eliminates them, we can evaluate how well the LLM performs as an analytical crewmate or a deceptive impostor against the criteria stated above.
**Perspective:**
Among Us brought a different kind of challenge, one where social dynamics and psychology play crucial roles. The interplay of trust and suspicion highlighted the LLM's ability to support both analytical and creative thinking. Because the game requires planning and reasoning from knowledge acquired in previous rounds, the LLM's performance shows which skills it excels at and where it falls short.
**Experiment:**
Given the two roles in Among Us, ChatGPT follows instructions as either Team Alpha (representing the crewmates) or Team Omega (representing the impostor).
**Team Alpha’s Strategy**

Visual of Steps 1-3
1.) ChatGPT is asked to help the user play Among Us. It is also informed of the player's role and the map (including the tasks, vents, sabotages, and other important details on that map).
2.) The human user constructs the other players' alibis from information gathered in emergency meetings and in-game observation.
3.) The details from Step 2 are fed into ChatGPT, and the LLM is asked to point out any inconsistencies, based on its background knowledge, spatial reasoning, and logic, that may indicate a suspicious player (a sketch of this pipeline follows below).
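A minimal sketch of this pipeline in Python, assuming the `openai` client library; the model name, prompt wording, and helper names are illustrative rather than the exact text used in our runs:
```python
# Sketch of the Team Alpha pipeline (Steps 1-3). Prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SETUP = (
    "You are helping me play Among Us as a crewmate on The Skeld. "
    "You know the map's tasks, vents, and sabotage locations."
)

def find_inconsistencies(alibis: list[str]) -> str:
    """Step 3: feed the human-collected alibis to the LLM and ask for suspects."""
    alibi_text = "\n".join(f"- {a}" for a in alibis)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SETUP},
            {"role": "user", "content": (
                "Here are the alibis gathered during the emergency meeting:\n"
                f"{alibi_text}\n"
                "Point out any spatial or logical inconsistencies that may "
                "indicate a suspicious player."
            )},
        ],
    )
    return response.choices[0].message.content

# Step 2 happens in-game: the human records alibis like these.
print(find_inconsistencies([
    "Red says they were doing wires in Electrical the whole round.",
    "Blue claims they saw Red in Navigation right before the body was found.",
]))
```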
**Team Omega’s Strategy**

Visual of Steps 1-4
1.) ChatGPT is prompted to help the user play Among Us. It is also informed of the player's role and the map (including the tasks, vents, sabotages, and other important details on that map).
2.) ChatGPT is prompted to create a fake persona: a name, the tasks it will claim to be doing, and the strategies it will use to deceive and sabotage crewmates, based on the layout of The Skeld map (see the sketch after this list).
3.) The user navigates the game through this fake persona and defends themselves using ChatGPT-generated excuses.
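Step 2 can be scripted the same way; a short sketch, again with illustrative prompt wording:
```python
# Sketch of Step 2 of the Team Omega pipeline: generate the deceptive
# persona up front. Model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

def build_impostor_persona() -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are helping me play Among Us as the impostor on The Skeld."
            )},
            {"role": "user", "content": (
                "Create a fake persona for me: a name, the tasks I will claim "
                "to be doing, and the strategies I will use to deceive and "
                "sabotage crewmates, based on the layout of The Skeld."
            )},
        ],
    )
    return response.choices[0].message.content

print(build_impostor_persona())
```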
**Experiment Results/Analysis**
The experimental layout was repeated twice, providing insight into both scenarios: a crewmate win and an impostor win.
Based on the data from both trials, we found that ChatGPT depends on specific prompts and a broad base of details to produce conclusions accurate enough for crewmates to use against potential impostors. To compensate, our experimenters leaned on ChatGPT's chain-of-thought prompting, following up on each course of action it suggested so the LLM could give its most reliable responses. For the impostor role, ChatGPT needed nothing more than a single prompt to produce a fake scenario and alibi. The key takeaway is that ChatGPT specializes in complex, multi-step reasoning that lays the groundwork for deeper analysis, but this same tendency keeps it from reaching conclusions efficiently.
### Game 2: Wordle

**Game Description:** Wordle is an online guessing game in which the user has six attempts to guess a five-letter word each day. After each guess, every letter tile changes color to indicate whether the letter is correct and in the right position, in the word but misplaced, or absent entirely.
Wordle was created by software engineer Josh Wardle and initially released in October 2021. Wardle built the game because he and his partner played many word games during the COVID-19 pandemic, which inspired him to create one just for her.
**Objective**
The objective is to assess the accuracy of the LLM when predicting the word from the given clues. We first have a person complete the Wordle of the day, then ask ChatGPT to guess the same word with human guidance, and compare the two: differences in approach, decision-making, and success rate.
**Perspective**
Wordle is a popular online word puzzle in which players have six attempts to guess a five-letter word, with the game providing feedback after each guess in the form of colored tiles. This provides a straightforward yet challenging environment for testing an LLM's problem-solving and language-processing capabilities, since it requires applying letter patterns within a constrained, dynamic environment.
**Experiment/Results**
This experiment used the Wordle word of July 17, 2024.
We provided the LLM with instructions on how to play Wordle, what each color-coded tile means, the positions of correctly guessed letters, and a list of letters that are not in the word.
First, we started the game for the LLM with the word "ADIEU". Based on Wordle's feedback, we structured a sentence to input into the LLM so it could better interpret that feedback.
>[color=#40e0d0] **human:** I started with the word "ADIEU", here is what Wordle gave me from the result of this word:
Green/Correct letter in the Correct position: letter E
Yellow/Letter is in the word but in the incorrect position: letters A and D
Grey/Not in the word: letters I and U
E is in the 4th position
A is not supposed to be in the first position but is in the word
D is not in the 2nd position but is in the word.
Here is a list of the letters NOT in the word:
I
U
With this feedback, please give me another valid five-letter word.
In this prompt, we gave the LLM detailed feedback: the positions where letters are, the positions where letters do not belong, and a list of letters not in the word.
>[color=#eb55fc] **chatGPT-4o**: Based on the criteria:
"I" in the third position.
"E" not in the fourth position.
"I" not in the fifth position.
Letters "A" and "D" are not in the word.
A valid word could be UNITE.
>[color=#40e0d0] **human:** Results on "UNITE"
Green/Correct letter in the Correct position: letter I, T, E
Yellow/Letter is in the word but in the incorrect position: letter U
Grey/Not in the word: letters A and D
Here is a list of the letters that DO NOT belong in the word:
A
D
N
With this feedback, please give me another valid five-letter word.
>[color=#eb55fc] **chatGPT-4o**: Based on the feedback:
Green: I, T, E (in correct positions).
Yellow: U (in incorrect position).
Grey: A, D, N (not in the word).
Let's try the word QUITE.
Green: I, T, E (correct positions).
Yellow: U (incorrect position).
Grey: A, D, N (not in the word).
"QUITE" fits the criteria provided.
**Results:**
The LLM was able to guess the Wordle of the day. However, it needs very specific details and descriptions of the feedback. Providing a list of incorrect/unused letters also helps the LLM narrow its candidate vocabulary and find the correct word.
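Writing that structured feedback by hand is tedious and error-prone. Below is a small helper, sketched under the assumption that the operator already knows the day's answer, that emits the same prompt format we used; note it handles duplicate letters naively rather than with Wordle's exact counting rules:
```python
# Sketch: turn Wordle feedback into the structured prompt text we fed the LLM.
# Naive duplicate-letter handling; names are illustrative.
def wordle_feedback(guess: str, answer: str) -> str:
    guess, answer = guess.upper(), answer.upper()
    green, yellow, grey = [], [], []
    for i, letter in enumerate(guess):
        if answer[i] == letter:
            green.append(f"{letter} is in position {i + 1}")
        elif letter in answer:
            yellow.append(f"{letter} is in the word but not in position {i + 1}")
        else:
            grey.append(letter)
    return (
        "Green/Correct letter in the correct position: " + "; ".join(green or ["none"]) + "\n"
        "Yellow/Letter is in the word but in the incorrect position: " + "; ".join(yellow or ["none"]) + "\n"
        "Grey/Not in the word: " + ", ".join(grey or ["none"]) + "\n"
        "With this feedback, please give me another valid five-letter word."
    )

# Feedback for guessing UNITE against the 7/17 answer QUITE,
# matching the transcript above.
print(wordle_feedback("UNITE", "QUITE"))
```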

**Problem**
This experiment used the Wordle word of July 16, 2024.
Before the experiment above, we tested a method of sending the LLM screenshots of the actual feedback directly from Wordle.
**7/16 "DECOY"**
>[color=#40e0d0] **human:** 
Using the word "DEPOT" here are the results. Give me another word based on the feedback
>[color=#eb55fc] **chatGPT-4o**: Based on the feedback:
"D" is in the correct position (green)
"E" is in the word but in a different position (yellow)
"P," "O," and "T" are not in the word (grey)
Let's try the word "DECKS" next, as it keeps "D" and "E" while placing "E" in a different position, and uses new letters for the remaining spots.
>[color=#40e0d0] **human:** The letters D,E, and O are in the word and in the correct positions with green tiles. P and T are not in the word.
However, the LLM was unable to differentiate the color-coded feedback from the screenshots. It would also keep answering based on its own assumptions, regardless of how specifically we told it where certain letters belong.
Based on the LLM's responses, we tried many times to make our explanations as specific as possible. Ultimately, three different failure scenarios occurred (WORD = DECOY):
Scenario 1: It gave a word that merely shuffled the positions of the correct letters, e.g. "DONER".

Scenario 2: It gave a string that Wordle did not accept as a valid word.

Scenario 3: It gave a completely different word that ignored the feedback.

**Hypothesis**
The LLM struggles to differentiate the color-coded feedback from Wordle because it lacks the ability to visually perceive colors and relies solely on textual input. Without a visual understanding of color cues, the LLM cannot accurately interpret the feedback intended to indicate correct letters in the correct positions (green), correct letters in the wrong positions (yellow), and incorrect letters (gray). This limitation makes it challenging for the LLM to provide responses that consider the visual aspect of Wordle's gameplay feedback.
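For contrast, the same color-coded constraints are trivial for a deterministic program to enforce. A minimal filter over a word list (the tiny `WORDS` list here is a stand-in for a real dictionary) shows the baseline the LLM failed to match:
```python
# Sketch of a deterministic baseline: filter candidate words by Wordle
# constraints. WORDS is an assumed stand-in for a full dictionary.
WORDS = ["decoy", "doner", "depot", "dicey", "decor"]

def matches(word: str,
            greens: dict[int, str],        # position -> required letter
            yellows: dict[str, set[int]],  # letter -> positions it must avoid
            greys: set[str]) -> bool:
    if any(word[i] != letter for i, letter in greens.items()):
        return False
    for letter, banned in yellows.items():
        if letter not in word or any(word[i] == letter for i in banned):
            return False
    return not any(g in word for g in greys)

# Feedback after guessing "DEPOT" against the 7/16 word "DECOY":
# D, E, O green in positions 1, 2, and 4; P and T grey.
candidates = [w for w in WORDS
              if matches(w, greens={0: "d", 1: "e", 3: "o"},
                         yellows={}, greys={"p", "t"})]
print(candidates)  # ['decoy', 'decor']
```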
### Game 3: Minesweeper

**Game Description**: Minesweeper is a single-player puzzle game in which the user must uncover the whole board without uncovering any bombs. Uncovering a bomb loses the game, and the player must restart on a new, randomly generated board. The Minesweeper we know and love today was popularized by Microsoft, which shipped the game preinstalled on its computers, leading to a surge in its popularity. A popular strategy is to use the "flag" feature to mark suspected bombs on the grid. Difficulty is set by the size of the board and the number of bombs on it, with higher difficulties having bigger boards and therefore more bombs. Skill at Minesweeper is typically measured by the time a player takes to complete a board.
**Objective**
The objective of this game is to explore an LLM's skills in long-term planning and proper analysis of a situation. We test how quickly the LLM can correctly complete a board without mistakes compared to a human user, and also how readily it can decide the next best step in a position started by a user.
**Abstract**
Large Language Models are trained on extremely large datasets and exhibit their own reasoning and problem-solving abilities. In this scenario, we test how an LLM adapts to complex situations, focusing on its understanding of odds and its reasoning skills. Current LLMs show exceptional reasoning and problem-solving when working over sentences or bodies of text; our premise is to test those same skills without a body of text, to better grasp the LLMs' limits. For this, we used the popular game Minesweeper, a puzzle game that revolves around uncovering tiles while avoiding bombs.
**Perspective**
Minesweeper provides a great platform for probing the LLM's understanding of odds and its reasoning skills, since it must take in the current information on the board to decide the next best step.
**Experiment**
We input the details of an in-progress game that a human user is trying to complete and ask the LLM for the next best step given the current board. This requires the LLM to understand the risk behind each move and to choose the option most likely to be safe.
We first start a game of Minesweeper ourselves, and once we get stuck, we ask the LLM for the next best step to take on the board. In this experiment, we used ChatGPT-4o to generate an analysis and a next best step for a board a user had already started.
To properly assess the LLM's abilities, we split the Minesweeper game into smaller sections, starting with extremely obvious bomb placements and gradually increasing the difficulty.
The first trial showed a single bomb within the image, with all the surrounding tiles already revealed, making its location obvious to the user.

>[color=#40e0d0] **human**: there is one bomb hidden within this section of the minesweeper game. where is it?
We used this first portion as a gauge of the LLM's analysis and probability abilities, and it proved successful, responding with the correct placement of the bomb:
>[color=#eb55fc] **chatGPT-4o**:
>2 [ ] 1
2 1 1
Our second trial consisted of a larger grid with more bombs, though their locations remained obvious.

>[color=#40e0d0] **human**: where are the bombs located, and how many are in this picture?
However, after receiving the response from the LLM (ChatGPT-4o), we noticed some inconsistencies in its analysis.
>[color=#eb55fc] **chatGPT-4o**:
>1 1 1 B
1 B 2 B
B 2 2 1
2 B 2 B
2 2 2 2
2 B 2
Comparing the picture sent to the LLM with its response, we see that it overwrote some of the already revealed tiles, replacing them with potential bomb locations even though they had already been cleared.
Since the idea of LLMs correctly applying complex analysis and probability skills is still in its infancy, we can focus on resolving this hallucination by recursive prompting or by altering the model's chain of thought (CoT), particularly by prompting it with situations where the bomb locations are obvious. We can also use guiding inputs to point the LLM toward correctly identifying the bombs.
**Other Paths**
Since sending direct pictures of an actual game board to ChatGPT is not viable due to these hallucinations, we attempted to work around the problem by breaking the board into smaller sections and translating it into a different format. We tested two text-based formats: a prose description versus a direct conversion of the picture into a grid.
++Body of Text++
For this format, we began with a basic problem to gauge whether it could serve as a sufficient workaround:

We then translated this section into a body of text the LLM would have an easier time digesting:
>[color=#40e0d0] **human**:I'm playing minesweeper. There are currently 9 tiles within the game, with tiles being uncovered and having ones on them except for the square directly in the middle. There is one bomb in this game. Where is the bomb?
>[color=#eb55fc] **chatGPT-4o**: Given the description of the Minesweeper game, we can deduce the location of the bomb. Let's consider the 3x3 grid where the middle square is still covered, and all the uncovered squares around it have the number 1.
The grid looks like this (with `1` representing the number of adjacent bombs and `?` representing the covered tile):
1 1 1
1 ? 1
1 1 1
Since every revealed tile touches exactly one bomb and the middle tile is the only covered one, the bomb must be under the middle tile.
The results proved successful, showing that translating the board into text does work. However, after further discussion, we decided that for larger sections it would be difficult to describe the board accurately in prose, so we stopped pursuing this method and explored direct translation instead.
++Translation Method++
In this pathway, we translate the picture of the board directly into a text-based grid format.
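The translation step itself is mechanical. Here is a sketch, assuming the board has already been transcribed from the screenshot into a 2D list, that emits the same prompt format used in the trials below:
```python
# Sketch: serialize a Minesweeper board section into the text prompt we sent.
# The board is assumed to be transcribed from the screenshot by hand.
Board = list[list[str]]  # each cell: a digit, "?" (covered), or "X" (blank)

def board_to_prompt(board: Board, num_bombs: int | None = None) -> str:
    grid = "\n".join(" ".join(row) for row in board)
    count = (f"There are currently {num_bombs} bombs within this minesweeper game. "
             if num_bombs is not None else "")
    return (f"I'm playing minesweeper. {count}The current grid looks like this:\n"
            f"{grid}\nWhere are the bombs located?")

print(board_to_prompt([["?", "1", "1", "1"],
                       ["2", "2", "1", "?"],
                       ["?", "1", "1", "1"]], num_bombs=3))
```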
**Trial 1**
We began with a slightly larger board to see how much we could stress the model from the start:

>[color=#40e0d0] **human**:
>Im playing minesweeper. There are currently 3 bombs within this minesweeper game. The current grid looks like this:
? 1 1 1
2 2 1 ?
? 1 1 1
Where are the bombs located?
>[color=#eb55fc] **chatGPT-4o**:
>B 1 1 1
2 2 1 B
B 1 1 1
ChatGPT correctly identified the bomb locations, proving this an effective way forward.
**Trial 2**
As the first trial proved successful, we gathered further information on the LLM's analysis skills by requiring it to find the number of bombs itself. Using the same board as the previous trial, we wiped the chat history to avoid carryover between trials and sent a new prompt:
>[color=#40e0d0] **human**:
>Im playing minesweeper. The current grid looks like this:
? 1 1 1
2 2 1 ?
? 1 1 1
How many bombs are there, and where are they located?
This resulted in a correct answer to both questions:
>[color=#eb55fc] **chatGPT-4o**:
B 1 1 1
2 2 1 B
B 1 1 1
**Trial 3**
We probed the limits of ChatGPT by sending in larger sections and adding a new variable. This board is where we found the limits of the LLM:

As seen in the picture, this trial introduced a new variable ChatGPT-4o had not yet seen: blank squares that contain no number and no bomb.
>[color=#40e0d0] **human**:
im currently playing minesweeper. this is the grid of the game:
X X 1 2 2
1 1 2 ? 1
2 ? 4 2 1
2 ? ? 1 X
1 2 2 1 X
The "X" marks grids that have no number or bombs on them. where are the bombs within this picture?
>[color=#eb55fc] **chatGPT-4o**:
X B 1 2 2
1 1 2 B 1
2 B 4 2 1
2 B B 1 X
1 2 2 1 X
**Analysis:**
This trial resulted in a hallucination from the LLM: it replaced one of the safe tiles with a bomb, and its answer does not satisfy the board that was originally sent. Our current hypothesis is that ChatGPT-4o cannot account for multiple variables within the game and misunderstands what the X stands for. Nonetheless, this gave us a better grasp of ChatGPT's current capabilities in analysis and probability.
**Solutions/ Directions Forward:**
Although this method proved insufficient, we can work around the recurring hallucination problem by giving the LLM more information. For example, reverting to stating the number of bombs on the current board helps the LLM avoid hallucinations, but it simultaneously weakens the experiment, since less stress is placed on the LLM's analysis and reasoning capabilities.
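Another direction forward is to verify the LLM's answers automatically rather than by eye. Below is a sketch of such a consistency check, using our own board encoding; it accepts Trial 2's answer and flags Trial 3's hallucination:
```python
# Sketch: verify a proposed bomb placement against the revealed numbers.
# Cells: digits, "?" (covered), "X" (blank, zero adjacent bombs), "B" (bomb).
Board = list[list[str]]

def is_consistent(original: Board, proposed: Board) -> bool:
    rows, cols = len(original), len(original[0])
    for r in range(rows):
        for c in range(cols):
            cell = original[r][c]
            # Bombs may only be proposed where tiles were still covered.
            if proposed[r][c] == "B" and cell != "?":
                return False
            if cell.isdigit():
                target = int(cell)
            elif cell == "X":
                target = 0  # blank tiles touch no bombs
            else:
                continue  # covered tile: nothing to check
            adjacent = sum(
                proposed[r + dr][c + dc] == "B"
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows and 0 <= c + dc < cols
            )
            if adjacent != target:
                return False
    return True

trial2 = [["?", "1", "1", "1"], ["2", "2", "1", "?"], ["?", "1", "1", "1"]]
answer2 = [["B", "1", "1", "1"], ["2", "2", "1", "B"], ["B", "1", "1", "1"]]
print(is_consistent(trial2, answer2))  # True: Trial 2's answer checks out

trial3 = [["X", "X", "1", "2", "2"], ["1", "1", "2", "?", "1"],
          ["2", "?", "4", "2", "1"], ["2", "?", "?", "1", "X"],
          ["1", "2", "2", "1", "X"]]
answer3 = [["X", "B", "1", "2", "2"], ["1", "1", "2", "B", "1"],
           ["2", "B", "4", "2", "1"], ["2", "B", "B", "1", "X"],
           ["1", "2", "2", "1", "X"]]
print(is_consistent(trial3, answer3))  # False: the hallucinated bomb is caught
```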
### Game 4: Tic Tac Toe

We selected Tic Tac Toe to evaluate how predictable the LLM is and how smart it can be in trying to beat a human. The game is structured like this: there are 9 tiles in a 3x3 grid and 2 players; here, the human plays against the LLM via ChatGPT. The objective is for one player to complete a straight line with their chosen shape before the other player does. The difficulty depends largely on the opponent's knowledge and how predictable they are; for me it rates about a 6/10. We want to test the LLM's predictability in this game.
**Game Description:**
Tic Tac Toe is a game in which two players face each other on a 3x3 grid. The players take turns marking a place on the grid with their corresponding letter, either "X" or "O". The object of the game is to complete a straight line of 3 tiles in any direction with one's own letter. If all 9 tiles are occupied and no line is completed, the game is a draw and a rematch is played.
**Perspective:**
The LLM must perform pattern recognition. Tic Tac Toe is a game of studying the opponent's strategies and patterns. The LLM must come up with its own patterns to block the player and strategize how to win against a human.
**Experiment:**
We challenge ChatGPT to a match of Tic Tac Toe. Each side is assigned its letter and we play. ChatGPT asks the human to make the first move, then makes its own, and we take turns until the game ends. In a rematch, ChatGPT may adapt, recognize the player's pattern, and try a different strategy to make the player lose.
**Below, we use images as input to ChatGPT-4o. We use the Google Tic-Tac-Toe game, with ChatGPT playing against the computer.**
>[color=#40e0d0] **human/google-computer**:
Please try to beat me in tic-tac-toe! I will provide you with an update of the tic-tac-toe board after each iteration. Please make the first move:
>[color=#eb55fc] **chatGPT-4o**:
Updated board:
```
| |
-----------
| X |
-----------
| |
```
>[color=#40e0d0] **human/google-computer**:
```
| O |
-----------
| X |
-----------
| |
```
>[color=#eb55fc] **chatGPT-4o**:
Updated board:
```
O | |
-----------
| X |
-----------
| | X
```
As you can see above, ChatGPT failed to orient the image correctly in space, shifting the computer's "O" from the top-middle square to the top-left while making its own move.
**Here is the text-input experiment**
[link to experiment](https://chatgpt.com/share/74e851be-c70c-4d77-a2b2-98349f0feb96)
...
>[color=#eb55fc] **chatGPT-4o**:
Updated board:
```
X | X |
-----------
| X | O
-----------
| | O
```
>[color=#40e0d0] **human/google-computer**:
```
X | X | O
-----------
| X | O
-----------
| | O
```
>[color=#eb55fc] **chatGPT-4o**:
Updated board:
```
X | X | O
-----------
| X | O
-----------
| X | O
```
Your move!
>[color=#40e0d0] **human/google-computer**:
>I won! You did not detect that!
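A simple referee script would have caught this missed win. Here is a minimal sketch (the board encoding is our own convention):
```python
# Sketch of a referee check for the win ChatGPT missed above.
# Boards are rows of "X", "O", or " " (our own encoding).
Board = list[list[str]]

def winner(board: Board) -> str | None:
    lines = [[(r, c) for c in range(3)] for r in range(3)]           # rows
    lines += [[(r, c) for r in range(3)] for c in range(3)]          # columns
    lines += [[(i, i) for i in range(3)],
              [(i, 2 - i) for i in range(3)]]                        # diagonals
    for line in lines:
        marks = {board[r][c] for r, c in line}
        if len(marks) == 1 and marks != {" "}:
            return marks.pop()
    return None

# The board the human sent just before the final exchange:
# O had already completed the right-hand column.
final = [["X", "X", "O"],
         [" ", "X", "O"],
         [" ", " ", "O"]]
print(winner(final))  # "O" -- the win ChatGPT failed to detect
```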
## Summary of Experimental Results
1. **Strategic Planning:** LLMs can analyze game state to suggest strategies, whether narrowing candidate words in Wordle or weighing safe moves in Minesweeper.
2. **Adaptive Learning:** By learning from game progress and player behavior, LLMs can adapt strategies in real-time, improving their effectiveness in dynamic environments.
3. **Behavioral Analysis:** In social deduction games like Among Us, LLMs can help players analyze behavior patterns, making more informed decisions about trust and deception.
**Impact Statement:**
Including human evaluation in game simulations enhances the learning and adaptability of LLMs, providing richer insights and more accurate benchmarks. Crowdsourcing benchmarks through diverse player interactions can significantly advance the development and assessment of AI capabilities.
## Conclusion
This gaming marathon was a testament to the synergy between human ingenuity and artificial intelligence, inspired by the SmartPlay benchmark. Team Alpha and Team Omega demonstrated how LLMs can be leveraged for varied strategic approaches, from deductive reasoning and deception in Among Us to constrained word search in Wordle, probabilistic analysis in Minesweeper, and adversarial play in Tic Tac Toe. While the SmartPlay benchmark provides a solid foundation for evaluating LLMs, our experiment suggests areas for further enhancement to fully capture the potential of intelligent agents. Stay tuned for more exciting experiments as we continue to explore the boundaries of AI and gaming. Game on!