# Research Problem
<!-- A clear statement of the research problem you investigated. -->
*Who Wants to Be a Millionaire* (WWTBM) is a popular, originally British TV game show with adaptations and variations in many countries. A game consists of 15 rounds, each posing a general-knowledge question of increasing difficulty and increasing reward. The contestant has to choose between four given answers, only one of which is correct.
Furthermore, there are up to four jokers (depending on the local variation of the game), each of which the player can use once per game to make choosing an answer easier.
This game poses a difficult task for an artificial agent. It needs to extract the important information from the question and use some form of information retrieval (IR) to find the correct answer. The questions used in the game show come from diverse fields, making specialization impractical. The subjects of the questions are also often ambiguous, so that more context from the question or from the possible answers has to be taken into account.
In this project report, we present an agent for a *Who Wants to Be a Millionaire* game. The agent uses natural language processing techniques to extract the important parts of a question and uses the World Wide Web (WWW) to choose an answer. Our answer searching methods focus on Wikipedia and the [Google Custom Search engine](https://developers.google.com/custom-search/v1/overview).
# Datasets
<!-- A discussion of the data sources you used and how you obtained and processed them -->
We used two datasets during our project. The first one was provided by the Reddit user *BrianBoyko*\cite{dataset1}, who extracted it from a question-and-answer list on the [GameFAQs](https://gamefaqs.gamespot.com/) website. It consists of 546 questions along with the four possible answers and the correct answer for each question.
The second dataset was used in the publication of Molino et al. \cite{molino2013virtual}. The dataset itself was not published, but we contacted the authors and received a copy. Unfortunately, due to copyright restrictions we are not allowed to publish it either.
It includes 1945 manually transcribed questions from the English version of the *Who Wants to Be a Millionaire?* board game. For each question, the correct answer is annotated, together with the difficulty level and a manual annotation of whether the question is a negation.
An example for a negated question is "In which of these sports do horses not take part?". This is relevant for the answering system, because our approach works by estimating the relevancy of an answer, which has to be inverted if the question is a negation.
This is why we developed a negation detection component, which is described in the _Approach_ section.
Furthermore, due to the design of the game, the questions are not distributed equally over the levels.
Up to level 10, every level includes 159 questions; above that, the number decreases quickly, leaving only 29 questions at level 15.
The table below shows the distribution.
| Level | 1 - 10 | 11 | 12 | 13 | 14 | 15 |
| ------------- | ------ | --- | --- | --- | --- | --- |
| __Questions__ | 159 | 119 | 88 | 69 | 49 | 29 |
As this second dataset was acquired during the development process, we converted it to match the format of the first one.
This included adding a column that specifies the correct answer for each question, since in the WWTBM dataset the correct answer was implicitly always the first one. Because of this, we also had to work around an issue that caused our system to "cheat" by always taking the first option when two answers had the same score, which is discussed in the _Answer Scoring_ section.
# Approach
<!--
• A clear and complete explanation of the method you used to investigate the problem, including
mention of third-party software you used.
• A high-level description of software you implemented yourselves, including a class/component diagram
• A discussion of the things you tried to improve the system’s performance, and how these worked
out.
-->
This section describes the methods we implemented and tested to answer WWTBM questions. First, we present three approaches for extracting information from a question; that information is used to find relevant pages on Wikipedia, whose full texts are scanned to choose an answer. Second, we present a method using a web search engine. Finally, the WWTBM game and an artificial agent that plays it were implemented as a web service for people to play. The agent uses our answering system to play the game alongside the human player so that their performances can be compared.
## Wikipedia
In the first subsection we show how we retrieve Wikipedia articles for a given query. The following subsections describe the NLP methods we use to extract information from a question that can serve as Wikipedia queries, including a reversed variant that searches for the answers instead, and how the confidences of these approaches are combined. How the retrieved pages are processed to predict answer confidences and finally choose an answer is described in the _Answer Scoring_ section.
### Querying Wikipedia
There are essentially two ways to gather information from Wikipedia: querying the website (via its API or by direct interaction) or downloading a Wikipedia dump and building a local search index. We chose the first option because we deemed it more feasible to implement.
For the implementation, we chose the _wikipedia_ Python package, which provides a simple API for retrieving articles.
To avoid being blocked by Wikipedia's spam protection, we built a locally stored cache of every article that was queried at some point. This also sped up running the system by a large factor. Populating the cache initially still took a lot of time, as we had around 2000 questions, each yielding multiple search terms, and even more terms when searching for articles matching the answers instead (the reversed approach described below).
For some search terms we got a _DisambiguationError_ from the API, which means that multiple articles match the term.
On the web version of Wikipedia these are displayed as disambiguation pages, offering links to the several pages referred to by the query term.
In these cases we opted to take only the first option (usually the most relevant one), because retrieving additional articles would have increased both our risk of being blocked by Wikipedia and the time spent running the experiments.
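A minimal sketch of this cached lookup with disambiguation handling, using the _wikipedia_ package (the cache file name and the helper function are illustrative, not taken verbatim from our implementation):

```python
import os
import pickle

import wikipedia

CACHE_PATH = "article_cache.pkl"  # illustrative location of the local article cache

if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        _cache = pickle.load(f)
else:
    _cache = {}

def fetch_article_text(term):
    """Return the full text of the best-matching Wikipedia article, using the local cache."""
    if term in _cache:
        return _cache[term]
    try:
        text = wikipedia.page(term).content
    except wikipedia.exceptions.DisambiguationError as err:
        # Take the first option offered by the disambiguation page, as described above.
        try:
            text = wikipedia.page(err.options[0], auto_suggest=False).content
        except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
            text = ""
    except wikipedia.exceptions.PageError:
        text = ""
    _cache[term] = text
    with open(CACHE_PATH, "wb") as f:  # persist the cache after every newly fetched article
        pickle.dump(_cache, f)
    return text
```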
### Quoted Text Extraction
<!-- this should be first because its worse than named entity -->
Our first, naive approach is to extract the quoted text from a question and use all quoted sections as input for a Wikipedia search. This idea arose from the wording of the questions in the GameFAQs dataset, which contain a lot of quoted text.
This approach worked reasonably well, considering its simplicity.
We further improved it by stemming both the retrieved Wikipedia articles and the answers, which makes a match more likely when searching for those expressions.
For this we use the SnowballStemmer from NLTK; this slightly improved the performance of our system.
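A minimal sketch of the quote extraction and stemming steps (the regex assumes double-quoted phrases and the helper names are illustrative):

```python
import re

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def extract_quoted_phrases(question):
    """Return all double-quoted phrases from the question text."""
    return re.findall(r'"([^"]+)"', question)

def stem_text(text):
    """Stem every whitespace-separated token, so that e.g. 'inventions' also matches 'invention'."""
    return " ".join(stemmer.stem(token) for token in text.split())
```

The quoted phrases are used as Wikipedia search queries, and both the retrieved article texts and the answers are passed through `stem_text` before matching.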
However, the WWTBM dataset contains no quoted sections, either because the authors of that dataset removed them when transcribing the questions or because the official game never had any. Once we acquired the WWTBM dataset and saw that this was the case, we stopped pursuing this approach.
### Named Entity Extraction
As a more sophisticated approach, we use the NLTK toolchain for named entity extraction. In detail, we use NLTK's word tokenizer and part-of-speech tagger to compute POS-tagged tokens, which are fed into NLTK's _ne_chunk_ function. The resulting parse tree is traversed to obtain a list of named entities as strings. Each named entity is queried separately on Wikipedia to retrieve one page.
When running this approach on our datasets, we noticed that NLTK's named entity extraction does not work sufficiently well. For example, it does not recognize multi-word entities like "british army" but returns only "british". We therefore substituted NLTK with TextBlob, which is built on top of NLTK, extends its functionality (we use its noun phrase extraction), and proved to work better on our datasets.
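The following condensed sketch contrasts the two extraction variants; it assumes the required NLTK and TextBlob corpora (e.g. _punkt_, the POS tagger model, and _maxent_ne_chunker_) have already been downloaded, and the function names are illustrative:

```python
import nltk
from textblob import TextBlob

def extract_entities_nltk(question):
    """Named entities via NLTK: tokenize, POS-tag, chunk, then flatten the named entity subtrees."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(question)))
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree if hasattr(subtree, "label")]

def extract_phrases_textblob(question):
    """Noun phrases via TextBlob, which tend to keep multi-word expressions intact."""
    return list(TextBlob(question).noun_phrases)
```

Each extracted string is then used as a Wikipedia query as described in the previous subsections.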
### Reversed Search
Another approach we implemented reverses the search process: we search for information on the answers first and then count matches with the question.
More precisely, for each answer we retrieve relevant articles from Wikipedia and then score each of those texts by the number of matches with the question. Because extracting more phrases from the question does not cause additional requests in this approach, we are not restricted in the number of phrases we extract. We therefore use an n-gram method for extracting phrases from the question.
This approach achieved results similar to the named entity extraction approach, albeit performing well on a different set of questions.
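A simplified sketch of the n-gram generation and matching; the maximum n-gram length of 3 is an illustrative choice:

```python
def question_ngrams(question, max_n=3):
    """Generate all 1- to max_n-grams of the question as candidate phrases."""
    tokens = question.lower().split()
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def reverse_match_count(question, answer_article_text):
    """Count how many question n-grams occur in the article text retrieved for one answer."""
    text = answer_article_text.lower()
    return sum(1 for gram in question_ngrams(question) if gram in text)
```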
### Combining Approaches
Each of the aforementioned approaches produces a confidence vector of four numbers between 0 and 1, denoting the confidence that the corresponding answer is correct.
Because these confidences are relatively sparse for each individual approach, we obtain better overall results by summing the confidences of multiple approaches.
Doing so greatly improved our scores, as shown in the results section.
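Conceptually, the combination is just an element-wise sum of the four-entry confidence vectors, as sketched below:

```python
def combine_confidences(per_approach_confidences):
    """Element-wise sum of the confidence vectors produced by several approaches."""
    return [sum(scores) for scores in zip(*per_approach_confidences)]

# Example: approach 1 only finds evidence for answers B and C,
# approach 2 is split between A and B; the sum [0.5, 1.2, 0.3, 0.0]
# makes answer B the clear winner.
combined = combine_confidences([[0.0, 0.7, 0.3, 0.0], [0.5, 0.5, 0.0, 0.0]])
```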
## Answer Scoring
After a number of articles has been retrieved, the answers are ranked by relevancy with respect to those articles.
We implement this by counting, for each possible answer, the number of matches when searching the text of all articles.
Likewise, for the reverse approach, we count how often the n-grams generated from the question text occur in the articles retrieved for an answer.
These counts are then divided by the total number of matches over all four answers, yielding a normalized distribution between 0 and 1 that sums to 1.
The final answer is then simply the one with the highest score.
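A minimal sketch of this scoring step, using plain substring counting (the helper name is illustrative):

```python
def score_answers(article_texts, answers):
    """Count occurrences of each answer in the retrieved articles and normalize to confidences."""
    text = " ".join(article_texts).lower()
    counts = [text.count(answer.lower()) for answer in answers]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(answers)  # no evidence found for any answer
    return [count / total for count in counts]
```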
### Negation Detection
As briefly discussed in the _Datasets_ section, some questions are negated, which is annotated manually in the WWTBM dataset. This matters because all of our approaches work by scoring the relevancy of an answer with respect to the question, meaning highly relevant answers get higher scores.
For a negated question, this means a wrong answer is chosen whenever our system finds any information on any of the answers.
Since 4.32% of the questions in the WWTBM dataset are negated, we deemed it worthwhile to handle these cases.
We therefore built a negation detection component that reliably determines whether a question is a negation.
Surprisingly, this was a simple task, due to the limited grammatical diversity of the questions:
a simple regex with six patterns achieves almost perfect results.
The patterns are: "is not", "does not", "do not", "did not", "was not", "has not".
When verifying the component against the negation annotations of the dataset, we get no false positives (__precision = 1.0__) while covering almost all negated questions (__recall = 0.87__).
Based on this, we invert the confidences produced by our algorithms whenever the negation detection component flags a question as a negation.
This improved our overall rate of correctly answered questions.
However, as mentioned in the _Datasets_ section, we encountered an issue where our system cheated by taking the first answer whenever its confidence was 0 and the question was detected as a negation. Since answer A is always the correct one in the WWTBM dataset, and since several scores of 0 are inverted to 1.0 on a negation, the correct answer always came first when sorting by score. We resolved this by handling this tie case explicitly.
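The following sketch shows the detection and the inversion; inverting a confidence $c$ as $1 - c$ and breaking ties randomly are illustrative choices that capture the idea, not necessarily the exact fix in our code:

```python
import random
import re

# The six patterns listed above; contractions such as "isn't" could be added analogously.
NEGATION_RE = re.compile(r"\b(is|does|do|did|was|has) not\b", re.IGNORECASE)

def is_negation(question):
    """Heuristic negation detection based on the six regex patterns."""
    return NEGATION_RE.search(question) is not None

def choose_answer(question, confidences):
    """Invert confidences for negated questions and break ties randomly,
    so the system cannot 'cheat' by always picking the first answer."""
    if is_negation(question):
        confidences = [1.0 - c for c in confidences]
    best = max(confidences)
    return random.choice([i for i, c in enumerate(confidences) if c == best])
```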
## Web Search Engine Queries
A completely different approach to answering WWTBM questions is to use a generic web search engine. We opted for the Google Custom Search Engine (CSE), because it offers a free quota of API requests. A CSE and an API key have to be created first. With the API key and the Google API Client for Python, we can send queries to Google's search engine and receive a JSON result in return. The result consists of a paginated list of retrieved websites and some metadata, including the estimated result count.
For each question in our dataset, we compose queries for the CSE by concatenating the question with each of the possible answers in turn. The answer whose query yields the highest estimated result count is then deemed the most likely correct one.
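A minimal sketch of this query loop with the Google API Client (the API key and CSE identifier are placeholders):

```python
from googleapiclient.discovery import build

API_KEY = "..."  # Google API key
CSE_ID = "..."   # identifier of the Custom Search Engine

service = build("customsearch", "v1", developerKey=API_KEY)

def estimated_result_count(question, answer):
    """Query the CSE with question + answer and return Google's estimated result count."""
    response = service.cse().list(q=f"{question} {answer}", cx=CSE_ID).execute()
    return int(response["searchInformation"]["totalResults"])

def answer_by_result_count(question, answers):
    """Pick the answer whose query yields the highest estimated result count."""
    return max(answers, key=lambda answer: estimated_result_count(question, answer))
```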
We did not put much effort into this approach because of the limit of 100 API calls per day. As a consequence, we only tested it on the GameFAQs dataset, which is roughly a quarter of the size of the WWTBM dataset. On this dataset, the search engine approach achieved an accuracy of $33\%$.
## Game Server
In order to demonstrate the agent, we built a web-based game server where people can play the WWTBM game and see which questions the agent answers correctly. The backend is realized with the popular Python framework *Flask*, while most of the frontend interaction is achieved via the JavaScript library *jQuery*. A live version can be found at https://nlp-project-uhh.herokuapp.com/wwtbam/.
### Database Structure
The following entity-relationship diagram shows how the game's database is structured. The upper row of entities shows tables, mapping *game_id*s to *answer_id*s/*question_id*s in an *n:m* relationship.
The table *Game* represents the game state of a game identifiable by its *game_id*. Each game consists of a number of questions stored in the *Question* table. In addition each question has a number of associated answers in a *1:n* relationship, meaning that each question has *n* answers and each answer has *1* corresponding question. A game state also stores whether a player or an agent used one of the possible jokers, as well as information about whether a game has ended or not.
As already mentioned, the table *Answer* stores its corresponding *question_id*. It also stores whether it is the correct answer to that question and how many times it has been chosen by players and agents. This is used for the history joker, which lets players see which answers have been picked in the past.
The "Question" table further stores the difficulty for each question, so the difficulty level increases for each successful question.
All tables prefixed with "agent" are counterparts to their normal tables which are only used by agents for e.g. learning a policy to answer questions and when to use jokers.
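As an illustration of this structure, the core tables could be declared roughly as follows with Flask-SQLAlchemy; the ORM, the column names, and the joker flags shown here are a simplified sketch, not a verbatim copy of our schema:

```python
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class Game(db.Model):
    id = db.Column(db.Integer, primary_key=True)           # the game_id
    finished = db.Column(db.Boolean, default=False)        # whether the game has ended
    history_joker_used = db.Column(db.Boolean, default=False)
    fifty_fifty_joker_used = db.Column(db.Boolean, default=False)

class Question(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.Text, nullable=False)
    difficulty = db.Column(db.Integer, nullable=False)      # used to pick the next question
    answers = db.relationship("Answer", backref="question")  # 1:n relationship

class Answer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    question_id = db.Column(db.Integer, db.ForeignKey("question.id"))
    text = db.Column(db.Text, nullable=False)
    is_correct = db.Column(db.Boolean, default=False)
    times_chosen = db.Column(db.Integer, default=0)         # feeds the history joker

# n:m mapping of games to the questions they contain
game_questions = db.Table(
    "game_questions",
    db.Column("game_id", db.Integer, db.ForeignKey("game.id")),
    db.Column("question_id", db.Integer, db.ForeignKey("question.id")),
)
```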

### Routes
The following routes can be accessed while the server is running. All routes have to be prefixed with a base URL, e.g. *"localhost:5000"* when starting the server locally.
***"/"***:
Accepted request types: **GET**
Description: Redirects to ***"/wwtbam"***.
***"/wwtbam"***:
Accepted request types: **GET**
Description: Renders a page with a button to start a new game.
***"/wwtbam/new-game"***:
Accepted request types: **GET**
Description: Server picks one question for each difficulty level and starts a new game with a corresponding *game_id*. The client is redirected to ***"/wwtbam/game/<game_id>"***.
***"/wwtbam/game/<game_id>"***:
Accepted request types: **GET**
Description: The main view of a game. Here all questions are rendered in succession. If a game is not finished you can come back to it via its game_id, otherwise the client is redirected to ***"/wwtbam/game/<game_id>/finished"***. If the game never existed, the client is redirected to ***"/wwtbam/game/<game_id>/not-found"***.
***"/wwtbam/game/<game_id>/answer"***:
Accepted request types: **POST**
Description: The server accepts a POST request containing the chosen answer. The server will determine whether the given answer is correct or not and respond with a matching status.
***"/wwtbam/game/<game_id>/next-question"***:
Accepted request types: **GET**
Description: The server accepts a GET request and responds with the question of the next higher difficulty level.
***"/wwtbam/game/<game_id>/history-joker"***:
Accepted request types: **GET**
Description: Either responds with a history of chosen answers for the current question or with a status indicating that the joker has already been used in the current game.
***"/wwtbam/game/<game_id>/fifty-fifty-joker"***:
Accepted request types: **GET**
Description: Either responds with a list of indices of wrong answers for the current question or with a status indicating that the joker has already been used in the current game.
***"/wwtbam/game/<game_id>/finished"***:
Accepted request types: **GET**
Description: Renders a page informing the user that the requested game has already ended.
***"/wwtbam/game/<game_id>/not-found"***:
Accepted request types: **GET**
Description: Renders a page informing the user that the requested game never existed.
***"/wwtbam/new-question"***:
Accepted request types: **POST**
Description: Accepts a JSON-request with parameters that define a question and its answers. The submitted question is persisted in the database, as long as it is not already in there.
All "\*/game/\*" routes have a counterpart as "\*/agent-game/\*" that can be accessed by agents to learn certain policies. Instead of rendering templates, all of these routes return a JSON-response.
If the ***"/wwtbam/game/<game_id>"*** route is changed to respond with a JSON-object, the server could be entirely decoupled from the frontend. This would enable people to program multiple frontend representations for the WWTBM game.
# Results & Discussion
<!--
• Clear final results from running the system on real-world data. Don’t just present raw data; use
tables, graphs, diagrams, etc. as appropriate. Visualizations can be useful and interesting!
• A discussion and evaluation of the results. Were they the ones you expected? If not, why not?
-->
In this section we present the results of the experiments that we conducted using our developed approaches on the two datasets.
We split the results by dataset rather than by approach, because, for example, the Stemmed Quotes approach does not work on the WWTBM dataset, whose questions contain no quotes.
We start with the GameFAQs dataset, which is smaller and less suited to our research problem than the WWTBM dataset, which was built from actual game data.
## GameFAQs dataset
For the GameFAQs dataset, we conducted three individual runs, one for each of our main approaches: entity extraction via TextBlob, reverse search using n-grams, and stemmed quoted text extraction, as well as a final experiment combining the three approaches.
| Approach | TextBlob | Reverse N-gram | Stemmed Quotes | Combined |
| -------- | -------- | -------------- | -------------- | -------- |
| Accuracy | 29.1% | 29.1% | 12.2% | 47.0% |
Surprisingly, both the TextBlob and the Reverse N-gram approach yield the same accuracy of 29.1%. However, this is a coincidence, because they produce different results for the same question. The Stemmed Quotes approach performs the worst, which we expected, as a lot of questions do not have quoted text.
Finally, the combination of all three approaches produces 47.0% accuracy.
This is mainly because our approaches have relatively high precision but low recall, so they complement each other when combined.

## WWTBM dataset
For the WWTBM dataset, we are able to receive more detailed results, as the dataset has different difficulty levels annotated. Therefore we evaluated the accuracy of all of our different approaches per difficulty level.
As mentioned in the previous subsection, the Stemmed Quotes approach was not applicable to this dataset, because it had no quoted text.
In the figure below, we show the accuracy over all levels.

The NLTK approach (blue line) performs worst, mainly because NLTK's named entity extraction does not capture the relevant phrases from the question; for example, it produces 'british' from '... british army ...'.
As a consequence, we do not retrieve the Wikipedia article on the British Army, which is the topic of the question.
The TextBlob approach (orange dashed line), which extracts noun phrases, performs better in this regard, which results in better scores.
When adding the negation detection component, the result improves a little bit (green dotted line).
For the Reverse N-gram approach (red dotted line), the scores are similar to those of the TextBlob approach, but they fluctuate differently across the levels.
This is what we expected and aimed for, because it means the combined approach can improve substantially over the individual ones.
For the Combined approach (purple dashed line) this is the case, as the accuracy averages around 40% over all levels, which is significantly higher than all of the single approaches.
Interestingly, there is a dip in accuracy in level 12 for all approaches, which is most likely due to the design of the game.
We also plotted the precision-recall curves for our two main approaches that are applicable to this dataset, together with the curve of the combined approach. As expected, the combined approach covers a larger area.
The initial drop around 0.05 recall is explained by our method of inverting the confidences on a negation, which results in some answers having a confidence of 1.0, even though the system is not actually that confident in them.

Overall, since recall is low, we can fall back to a random answer whenever the system is not confident enough, as a last resort for still answering the question correctly.
This fact also makes the intelligent usage of jokers possible.
# Summary & Future Work
We have explored multiple approaches that yield unsatisfactory results when used on their own, but perform better when combined. As shown in the previous section, our system manages to answer WWTBM-style questions reasonably well, considering how diverse they are.
The performance of our system was at the level we expected: significantly better than random guessing, but still worse than similar published systems.
For example, Molino et al. \cite{molino2013virtual} achieved an accuracy of 76.41% on the WWTBM dataset. This is mainly explained by their use of a full-text search over Wikipedia and the inclusion of DBpedia information, as well as their combination of far more scoring approaches than we could implement in our limited time.
Further improvements could include searching a Wikipedia dump directly instead of using the restrictive API. This would remove the limit on API calls per timeframe and thus speed up data accumulation and parsing.
It would also greatly increase the number of results per query, as we could then run full-text searches over all articles instead of only finding articles whose titles match.
An additional extension would be to detect the type of a question before processing it further. This could ensure that the right model is used to answer that question.
To extend the Google search approach, one could parse the search results for Wikipedia pages that contain more information on the topic.
# Work Responsibilities
<!--
• A statement indicating which group members did what.
• You should also submit to us the full source code for your system, along with the raw output data (if it’s not too large).
-->
* Wikipedia Naive Implementation - F
* Google API Approach - F
* Wikipedia Quoted Entity Extraction Approach - F
* Wikipedia Named Entity Extraction Approach - F
* WWTBM Dataset acquisition, importing and processing - J,R
* Wikipedia Textblob Nounphrase Extraction Approach - J
* Wikipedia Reverse Scoring Approach - J
* Evaluation of results - J,R
* Flask server, Game Implementation - R
* Writing of report - F,J,R
{"metaMigratedAt":"2023-06-15T03:41:51.493Z","metaMigratedFrom":"Content","title":"Research Problem","breaks":true,"contributors":"[{\"id\":null,\"add\":20348,\"del\":1484},{\"id\":\"08d643d8-7216-4506-9cde-921a6d7b7443\",\"add\":7031,\"del\":1731}]"}