# Project 2: Search

**Due: 5:00 PM, March 20th, 2020**

:::danger
The CIT labs will close at 5pm on Friday, March 20th due to break. You will not be able to access any CIT labs after this time. If you plan to use a late day on this project, you'll have to be able to work outside of CIT.
:::

## The Mystery Continues...

Ian's mysterious comments about the MSlab Fire of 2018 have been running through your mind for the last week. It definitely sounds like something the culprit who drew the Compass Rose would be involved with --- maybe this was the beginning of their mischief? All records of the scene of the fire have been erased; however, when you get back to your office, you find an anonymous tip to the resident CIT detective that there could be evidence in an encyclopedia-like database of information that's maintained by the computer science department. The problem is that the database is huge, and we need to build something that can search through it to find the information we need.

## Introduction

You are probably familiar with a few search engines: [Google](http://www.google.com/), [Bing](http://www.bing.com/), [Duck Duck Go](http://duckduckgo.com/). For this project, you're going to be writing your own search engine! Specifically, your search engine will allow users to query a corpus (a large collection of texts) of Wikipedia pages.

Your search engine will work a lot like a typical search engine. A user will type in what is called a *free text query* (i.e., some words), and your program will return what it judges to be the most relevant pages for that query.

One naive way to build a search engine would be to ask the user to enter a query, and then scan through the corpus looking for documents that match that query. This approach would require many traversals of the corpus -- one per query. Doing this for every single web query would *not* be fast.

Instead, most modern search engines consist of two programs, an *indexer* and a *querier*. The indexer preprocesses the corpus before answering any queries, saving files with information about which documents are relevant to which words. (Yes, this takes a long time; but remember, this is a preprocessing step. So, if it takes one hour, or even one day, it does not negatively impact the user experience, because it does not affect query-response time. This is an example of a time vs. space trade-off: you'll use extra space to store the preprocessed information, but that enables your program to answer queries quickly.) When the querier receives a query from a user, it uses the information contained in the file(s) to *quickly* respond to that query.

Once built, an example run of _your_ search engine might look like this:

```
$ scala search.sol.Index MedWiki.xml titles.txt docs.txt words.txt
$ scala search.sol.Query titles.txt docs.txt words.txt
search> cats
1 Kiritimati
2 Kattegat
3 Lynx
4 Morphology (linguistics)
5 Kylie Minogue
6 Fable
7 Northern Mariana Islands
8 Mercalli intensity scale
9 W. Heath Robinson
10 Nebula
search> ruler
1 Mohism
2 Michael
3 Manasseh
4 Monarch
5 Islamabad Capital Territory
6 Jadwiga of Poland
7 Isabella d'Este
8 Empress Suiko
9 Henry the Fowler
10 Government
search> :quit
```

The following diagram summarizes the components of the search engine. The orange blocks are the inputs (the web pages and the query). The blue boxes are different components of the solution (which you will write). The white boxes are files of intermediate data that get produced and read by the blue boxes.
The components split into two main parts (the *indexer* and the *querier*) which you can develop and test independently.

![](http://cs.brown.edu/courses/csci0180/content/projects/search-architecture.png)

The rest of this handout gives details of the parts of this diagram, plus logistics information. Here are links so you can jump to individual sections:

- [The format of the input documents](#The-Input-Documents-and-Intro-to-XML)
- [Processing documents into their essential terms (parsing, etc)](#Processing-Documents-Into-Essential-Terms)
- [Determining relevance of documents to terms](#Determining-Relevance-of-Documents-to-Terms-TF-and-IDF)
- [Determining authority of documents (the PageRank algorithm)](#Determining-Authority-of-Documents-PageRank)
- [Files to save index information](#Saving-Indexing-Information)
- [Running queries](#Handling-Queries)
- Expectations for [efficiency](#Efficiency-Considerations), [testing](#Testing-Expectations), [design checks](#Checks-and-Handin), and [grading](#Grading)
- [Written Questions](#Written-Questions)

## Learning Objectives

From this project, we hope you will:

- learn how a search engine works
- practice choosing data structures
- practice thinking about space usage
- learn about XML and tools that can help you process it
- learn about PageRank, which revolutionized searching

:::danger
Avoid the temptation to split the work, having one partner do the indexer and one do the querier. You need the combination to hit the learning objectives, some of which could arise on the final exam.
:::

## The Input Documents (and Intro to XML)

The collection of Wikipedia pages you'll be searching over is located in `/course/cs0180/src/search/src`. The three main Wikis are called `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml`. If you list the contents of `/course/cs0180/src/search/src` using `ls -l`, you can see the sizes of these files:

```
$ ls -l
-rw-rw-r--+ 1 kfisler cs0180ta 216466779 Feb 22  2019 BigWiki.xml
-rw-rw-r--+ 1 kfisler cs0180ta  24770652 Feb 22  2019 MedWiki.xml
-rw-rw-r--+ 1 kfisler cs0180ta   1667422 Feb 22  2019 SmallWiki.xml
```

To avoid copying these large files into your own directories, you should instead use symbolic links to access them. To do this, navigate to whichever directory you would like to put them in and use the following command to create a (symbolic) link to a file:

```
ln -s /course/cs0180/src/search/src/<filename>
```

For example, with `BigWiki.xml` as the filename, this command will create a (symbolic) link called `BigWiki.xml` in the current directory, and that link will point to the file `/course/cs0180/src/search/src/BigWiki.xml`. (You'll also see a `PageRankWiki.xml` file, which we will explain later.)

### The Document Format: XML

Programs often need to pass information to one another, and the most convenient way to do so is to use a data-interchange format, meaning a standardized text-based format. *XML*, which stands for eXtensible Markup Language, is one such format. Invented by the World Wide Web Consortium (W3C), XML uses an HTML-style angle bracket specification. Each XML *element* is encompassed by matching opening and closing *tags*. Opening tags are delimited by angle brackets, like this: `<tag>`. Closing tags include a slash, like this: `</tag>`.
Text appears between the opening and closing tags of an element:

```*.html
<email>kfisler@cs.brown.edu</email>
```

An XML element can have any number of *attributes*, which are specified within the opening tag, as follows:

```*.html
<university name="Brown">
  <concentration type="Computer Science"></concentration>
  <concentration type="Slavic Studies"></concentration>
</university>
```

The `university` element has the attribute `name` with value `Brown`. The `concentration` elements have the attribute `type` with values `Computer Science` and `Slavic Studies`. The value of an attribute is always enclosed in double quotes.

Note that the two `concentration` elements are nested inside the `university` element. As this nesting reflects the tree structure of XML elements, the `concentration` elements are called *children* of the *parent* element, `university`.

**Pages**

Each Wiki file consists of a single XML element tagged `xml`, within which there are a number of pages (*a.k.a.* documents), each of which is enclosed by opening and closing `page` tags. Each page then contains a `title`, a unique, non-negative integer `id`, and some `text`, all of which are delimited by tags with those names.

**Links**

Links appear in the Wiki files in one of three forms:

* `[[Hammer]]`
* `[[Presidents|Washington]]`
* `[[Category:Computer Science]]`

If a link consists of two parts separated by a pipe, then the text before the pipe is the page to link to, while the text after the pipe is the text to appear on the page. So, for example, the page on which `[[Presidents|Washington]]` appears would include the text `Washington` with a link to the `Presidents` page.

If a link appears without a pipe or colon, then its text is both the page to link to and the text that appears on the page.

If a colon appears as part of the link, as in `Category:Computer Science`, then the link is to the *meta-page* for that *namespace*. A meta-page is one that consists of links only (in this case, to the pages in the category `Computer Science`). A *namespace* is a collection of pages with a common purpose. Other common namespaces include `Help`, `File`, and `User`.
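Putting these pieces together, a single page inside a Wiki file looks roughly like the following sketch. The title, id, and text here are made up for illustration, and the exact spacing in the real Wikis may differ; open `SmallWiki.xml` to see the genuine format:

```*.html
<page>
  <title>Hammer</title>
  <id>42</id>
  <text>
    A hammer is a [[Tool|tool]] often used to drive [[Nail]]s.
    See also [[Category:Hand Tools]].
  </text>
</page>
```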
## Processing Documents Into Essential Terms

An XML document has content that isn't relevant to searching, such as the tags (like `<page>`) and little words like "the". The first step in building a search engine is to reduce the document to its essential interesting words. This involves several steps:

- Extracting the text of each document from the XML (*Parsing*)
- Splitting the text into separate words and numbers, removing punctuation and whitespace (*Tokenizing*)
- Removing little *stop words* (like "the") that don't contribute information
- Reducing words to their *stems*, which are the base roots of words (e.g., "sleep", "sleeps", and "sleeping" have "sleep" as a base root)

After the document is converted to a sequence of stem terms, the indexer computes which documents are relevant to which words and stores this information in the index file. We describe each of these steps in turn.

### Parsing the XML

Scala has an XML library which reads an XML string into a tree of XML Nodes. To use this library, import the XML `Node` and `NodeSeq` classes at the top of the file in which you plan to parse the Wikis:

```scala
import scala.xml.{Node, NodeSeq}
```

Then, to access the main XML node, use `loadFile` like this:

```scala
val mainNode: Node = xml.XML.loadFile("SmallWiki.xml")
```

Next, to extract specific children, use the selector `\`. For example, to select all the children with the tag `page`, use:

```scala
val pageSeq: NodeSeq = mainNode \ "page"
```

You can also further select all the children with the tag `id` like this:

```scala
val idSeq: NodeSeq = mainNode \ "page" \ "id"
```

Alternatively, you can do this same deep search (i.e., search through a node's children, grandchildren, etc.) using the selector `\\`:

```scala
val idSeq: NodeSeq = mainNode \\ "id"
```

Observe that these selectors return an object of type `NodeSeq`. If there is only one node in this sequence, then `pageSeq.text` or `idSeq.text` returns that node's contents. If there is more than one node in the sequence, then calling `.text` returns the concatenation of the text of all the nodes in the sequence.

You can find a short example of a Scala program that uses this library here: `/course/cs0180/src/search/src/NodeSample.scala`. In this example, you will observe that XML syntax is valid Scala syntax, and that it is automatically converted into an XML node. Moreover, a pair of opening and closing curly brackets "escapes" the XML, so that you can write Scala code between the brackets (e.g., `{ otherNode }`).

**Note:** When parsing text, you should consider the link text as words, but not the titles of the documents to link to. In the examples presented in the Links section above, `Hammer` should be used to compute relevance; so should `Washington`, although `Presidents` should not be; and both `Category` and `Computer Science` should be used.

#### Potential Errors

:::spoiler Click if you get the error "Object XML is not a member of package Scala"
If you get an error similar to "Object XML is not a member of package Scala", you need to install a Maven dependency in your IntelliJ project. To do this, follow the same steps from Lab 5 for installing the CSV parser, but search for the dependency `org.scala-lang.modules:scala-xml_2.11:1.0.5` and install it. You will not have these dependencies installed if you try to run your code in the terminal on your local machine; however, they are already installed for running in the terminal on the department machines.
:::
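As a concrete (if minimal) sketch of what parsing might look like, the following program loads a wiki and prints the title, id, and text length of every page. The structure it assumes (`page` elements containing `title`, `id`, and `text`) is exactly the one described above; the `.trim` calls are there because the text inside a tag can include surrounding whitespace:

```scala
import scala.xml.Node

object ParseSketch {
  def main(args: Array[String]): Unit = {
    val mainNode: Node = xml.XML.loadFile("SmallWiki.xml")
    // Select every page element, then pull out its children by tag name
    for (page <- mainNode \ "page") {
      val title: String = (page \ "title").text.trim
      val id: Int = (page \ "id").text.trim.toInt
      val body: String = (page \ "text").text
      println(s"$id: $title (${body.length} characters of text)")
    }
  }
}
```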
### Tokenizing

Once you have a parsed string, you need to break it into words/numbers and drop punctuation and whitespace. This is a good use for *regular expressions*. Scala has a built-in regular-expression class called `Regex` that you can use to search through a string for constituent patterns. In the following example, we will use a somewhat messy looking regular expression that matches links, words with apostrophes, and words without apostrophes (we explain it shortly).

First, create an instance of a Scala `Regex`:

```scala
val regex = new Regex("""\[\[[^\[]+?\]\]|[^\W_]+'[^\W_]+|[^\W_]+""")
```

Note that the regular expression is wrapped in triple quotes.

:::spoiler Click for details of what that messy regular expression means
Let's break it down. There are three parts, `\[\[[^\[]+?\]\]`, `[^\W_]+'[^\W_]+`, and `[^\W_]+`, separated by pipes (`|`), which mean "or." So we are actually matching three possible alternatives:

1. `\[\[[^\[]+?\]\]`
   * The meaning: Match two left brackets (`\[\[`) and two right brackets (`\]\]`), making sure there's something in the middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another link was starting inside this link (`[^\[]+?`).
   * Returns links (e.g., "[[Some Wiki Page]]")
2. `[^\W_]+'[^\W_]+`
   * The meaning: Match at least one alphanumeric character (`[^\W_]+`), then an apostrophe (`'`), and then at least one alphanumeric character (`[^\W_]+`).
   * Returns words with apostrophes (e.g., "don't")
3. `[^\W_]+`
   * The meaning: Match at least one alphanumeric character.
   * Returns words (e.g., "dog")
:::

Next, use the method `findAllMatchIn`, which takes as input a `String`, to find all the matches within that `String`. This method actually returns an iterator over `Match` objects, which contain the substrings matched by the regular expression, as well as some metadata we can discard here (groups, start index, etc.). We then traverse these `Match`es to extract the matched substrings:

```scala
// Call findAllMatchIn to get an iterator of Matches
val matchesIterator = regex.findAllMatchIn(someString)

// Convert the Iterator to a List and extract the matched substrings
val matchesList = matchesIterator.toList.map { aMatch => aMatch.matched }
```

**Note:** You are welcome to use a different regular expression if you wish. For example, if you don't want to match numbers, replace each instance of `[^\W_]` in the regular expression with `[^\W_\d]`. You can also split this regular expression up into its parts, so that you can search for only links (using only the first part), or only words (using the last two parts joined with a pipe).

### Stop Words

Some words appear quite often in a language, but contribute little to meaning: words like "the", "in", "an", or "or". Such words are known as *stop words*. For example, in the phrase "the cat in the hat", the relevant words are "cat" and "hat". To help you identify stop words, we've provided you with a `StopWords` object, which exposes a single method, `isStopWord`, that takes as input a `String` (i.e., a single word), and returns `true` if the word is a stop word, and `false` otherwise.

### Stemming

Words have variations that aren't relevant to searching. Consider the words "sleep", "sleeps", and "sleeping" -- we'd like all of these to match the query term "sleep". The process of reducing a word to its base is called *stemming*. We have provided a `PorterStemmer` class with two methods:

- `stem` takes as input a string of words and returns that same string, but with the words stemmed
- `stemArray` takes as input an array of strings, each consisting of a single word, and returns an array of those same words, again with the words stemmed

The outputs of the stemming process are called *terms*. The indexer will determine which documents are relevant for which terms.
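Here is a sketch of the whole text-processing pipeline, from raw text to terms. It assumes `StopWords.isStopWord` and `PorterStemmer.stem` can be called as shown; check the provided `search.src` files for how those classes are actually exposed, and adjust accordingly:

```scala
import scala.util.matching.Regex

object PipelineSketch {
  // Words-only version of the regular expression from above (no links)
  val wordRegex = new Regex("""[^\W_]+'[^\W_]+|[^\W_]+""")

  def textToTerms(text: String): List[String] = {
    val words = wordRegex.findAllMatchIn(text).map(_.matched).toList
    words
      .map(_.toLowerCase)              // normalize case
      .filterNot(StopWords.isStopWord) // drop stop words
      .map(PorterStemmer.stem)         // reduce each word to its stem
  }
}
```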
## Determining Relevance of Documents to Terms (TF and IDF)

Processing documents as described (parsing, tokenizing, and stemming) converts documents from text into a sequence of terms. Likewise, a query (after parsing, tokenizing, and stemming) is a sequence of terms. Consequently, to measure the relevance of a document to a query -- that is, to score the document -- it suffices to compare these two sequences of terms. There are two key ideas that the similarity metrics used by most practical search engines capture:

* A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times. This notion is called the **Term Frequency**.
* A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance when it occurs than a term that appears in many documents. This notion is called **Inverse Document Frequency**.

We discuss how to compute each in turn.

### Term Frequency, $\pmb{tf}$

Given a list of all the terms that occur across all of the documents, one can count how many times each term appears in each document. These raw counts are only so useful for comparing documents, however: the term "cat" might appear less often in one document simply because the document is very short. We therefore want to normalize the frequency counts across the documents. We can do this by dividing the count of a term by the count of the most frequently used term in the same document.

Formally, let $c_{ij}$ be the number of times that term $i$ appears in document $j$. Then, let $a_j = \max_i c_{ij}$ be the number of occurrences of the most frequently occurring term in document $j$. We then define the *term frequency* as ${tf}_{ij} = c_{ij} / a_j$.

### Inverse Document Frequency, $\pmb{idf}$

To capture the second idea, we want to scale up the term frequency of words that occur in fewer documents in the corpus. One useful scaling factor, called *inverse document frequency* ($idf$), is computed as follows. Let $i$ be a term in the corpus, $n$ be the total number of documents, and $n_i$ be the number of documents that contain term $i$. Then,

$$
\pmb{idf}_i = \log \frac{n}{n_i}
$$

(The use of log makes this factor less sensitive to large values of $n / n_i$.)

### Scoring Word/Document Relevance

Given a term $i$ (from anywhere in the corpus) and a document $j$, we multiply the term frequency and the inverse document frequency to yield the *relevance score* of $j$ to $i$. Rather than precompute these scores for every single word in the corpus, we just compute the information needed to calculate them, so that the querier only has to calculate scores for the words that appear in a query. The values needed are the frequency of a word in each document it appears in, as well as the frequency of the most common word in each document. These values get recorded in the *words.txt* file that the indexer saves for the querier to use later. The section on the querier explains how to use these scores to decide which documents to report from the search.

Here is a [YouTube video on how the TF and IDF algorithms work](https://www.youtube.com/watch?v=vrjAIBgxm_w) that one of the HTAs found helpful when doing this project.
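To make the definitions concrete, here is a sketch of computing the $tf$ and $idf$ tables for an already-processed corpus. The input shape (a map from document id to its list of terms) is hypothetical -- your indexer will likely build this information incrementally as it walks the XML, rather than materializing all the term lists at once:

```scala
import scala.math.log

object TfIdfSketch {
  // docs maps each (nonempty) document id to its processed list of terms
  def relevance(docs: Map[Int, List[String]])
      : (Map[Int, Map[String, Double]], Map[String, Double]) = {
    // c_ij: raw count of term i in document j
    val counts: Map[Int, Map[String, Int]] =
      docs.mapValues(terms => terms.groupBy(identity).mapValues(_.size).toMap)

    // tf_ij = c_ij / a_j, where a_j is the count of j's most frequent term
    val tf: Map[Int, Map[String, Double]] = counts.map { case (j, cs) =>
      val aj = cs.values.max.toDouble
      (j, cs.mapValues(_ / aj).toMap)
    }

    // idf_i = log(n / n_i), where n_i counts the documents containing term i
    val n = docs.size.toDouble
    val idf: Map[String, Double] = docs.values.flatten.toSet.map {
      (i: String) => (i, log(n / docs.values.count(_.contains(i))))
    }.toMap

    (tf, idf)
  }
}
```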
:::spoiler Click for a worked example of computing the frequencies and scores
Consider a corpus consisting of the following three documents:

1. the dog bit the man
2. the dog ate the cheese
3. the cheese bit the cheese

Our corpus then contains the words {"the", "dog", "bit", "man", "ate", "cheese"}. For each document, we create a mapping of these words to their counts:

* {2, 1, 1, 1, 0, 0} (document 1 has "the" 2 times, "dog" once, etc.)
* {2, 1, 0, 0, 1, 1}
* {2, 0, 1, 0, 0, 2}

With this information, we can compute the term frequencies ${tf}_{ij} = c_{ij} / a_j$:

| | $i_0$ (the) | $i_1$ (dog) | $i_2$ (bit) | $i_3$ (man) | $i_4$ (ate) | $i_5$ (cheese) |
|---|---|---|---|---|---|---|
| $j_0$ | 1 | 1/2 | 1/2 | 1/2 | 0 | 0 |
| $j_1$ | 1 | 1/2 | 0 | 0 | 1/2 | 1/2 |
| $j_2$ | 1 | 0 | 1/2 | 0 | 0 | 1 |

Next, we compute the inverse document frequencies for each word $i$ using ${idf}_i = \log \frac{n}{n_i}$:

| $\pmb{idf}$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
|---|---|---|---|---|---|---|
| | $\log(3/3)$ | $\log(3/2)$ | $\log(3/2)$ | $\log(3/1)$ | $\log(3/1)$ | $\log(3/2)$ |
| | 0 | 0.405 | 0.405 | 1.099 | 1.099 | 0.405 |
:::

<a name="pagerank"></a>

## Determining Authority of Documents (PageRank)

One could determine which documents are relevant to a query just by comparing the terms in the query to those in the documents. This is how search engines used to work. Then, in the 1990s, Google was founded using a different approach based on the PageRank algorithm (designed by its founders). The algorithm ranks pages based on the links among them (without considering their content). Pages with high scores (i.e., page ranks) are then thought of as *authoritative*. Given a query, the most authoritative documents related to the words in the query are returned.

Here are the key principles underlying the design of PageRank:

1. **The more pages that link to a page $j$, the more authoritative $j$ should be.** For example, if "Blog Brown" is linked to by 5 web pages, and the Wikipedia article on "Providence, Rhode Island" is linked to by 500 pages, then it makes sense to consider the Wikipedia article more authoritative.
2. **The more authoritative those pages are, the still more authoritative $j$ should be.** Now, if "Providence Place Mall" is linked to only by "Providence, Rhode Island", and "Bluno" is linked to only by "Blog Brown", it might not be enough to measure a page's authority simply by a count of the number of pages that link to that page. Indeed, it makes sense to consider "Providence Place Mall" more authoritative than "Bluno", since it is linked to by a more important page.
3. **The fewer links those pages have to pages other than $j$, the more authoritative $j$ should be.** Assume "Research at Brown" is linked to only by a "NYTimes" article which links to only 10 other pages, while "Why Brown?" is linked to only by "Blog Brown", which links to 200 other pages. Because the "NYTimes" page has only a few links, and "Blog Brown" has many links, a link from the "NYTimes" page can be considered more important than a link from "Blog Brown", leading us to attribute more authority to "Research at Brown" than to "Why Brown?"
4. **The closer $j$ is to another page $k$ (measured by the number of links that must be followed to get from $j$ to $k$), the more $k$'s score should influence $j$'s.** For example, if the average path from "Brown University" to "Boeing" is 100 links, and the average path from "Brown University" to "Amy's Personal Website" is 5 links, then, all other things equal, Amy's page should be considered more authoritative than Boeing's.

PageRank uses these principles to compute the authority of each document. The algorithm works roughly as follows: Each page's authority is a number between 0 and 1, where the total authority across all documents always equals 1.
The algorithm starts by assuming that all documents have the same authority. Then, following the principles, the algorithm updates the authorities based on the links between pages. Since the authority of one page can affect that of another, this process of refining authorities repeats until the authorities stabilize across all documents. (The underlying mathematics guarantees eventual stabilization.)

Here is a [YouTube video on how PageRank works](https://www.youtube.com/watch?v=v7n7wZhHJj8) that one of the HTAs found useful when doing this project.

We use the term *weight* to capture the impact of links from one page to another. Let $k$ and $j$ be pages, where $k$ has a link to $j$. We denote the *weight that $k$ gives to $j$* as $w_{jk}$. The weights in turn are used to compute a page's rank.

**Computing rank:** For a page $j$, the rank $r_j$ is defined as follows:

$$
r_j = \sum_k w_{jk} \, r_k \hspace{4em} [1]
$$

**Computing weights:** The weight is defined by the following formula, where $n$ is the total number of pages and $n_k$ is the number of (unique) pages that $k$ links to. $\epsilon$ is a small number (e.g., $\epsilon = 0.15$), the use of which helps achieve the 4th principle.

$$
w_{jk} = \begin{cases}
\frac{\epsilon}{n} + (1 - \epsilon) \frac{1}{n_k} & \text{if $k$ links to $j$} \\
\frac{\epsilon}{n} & \text{otherwise}
\end{cases} \hspace{4em} [2]
$$

**Special cases:** There are a few special cases to handle when defining the weights $w_{jk}$ (before adjusting those weights by $\epsilon$):

* Links from a page to itself are ignored.
* Links to pages outside the corpus are ignored.
* Pages that link to nothing can be considered to link (once) to everywhere, except to themselves.
* Multiple links from one page to another should be treated as a single link.

One idea of this weighting scheme is to attribute more authority to page $j$ if it is one of only a few pages linked to by page $k$ than if it were one of many.
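Equation 2 translates almost directly into code. The sketch below assumes a hypothetical `links` map from each page id to the set of page ids it links to, already cleaned up according to the special cases above (self-links dropped, out-of-corpus links dropped, duplicates collapsed, and pages that link nowhere treated as linking once to every other page):

```scala
object WeightSketch {
  val epsilon = 0.15

  // w_jk: the weight that page k gives to page j (Equation 2)
  def weight(j: Int, k: Int, links: Map[Int, Set[Int]], n: Int): Double =
    if (links(k).contains(j))
      epsilon / n + (1 - epsilon) / links(k).size // n_k = links(k).size
    else
      epsilon / n
}
```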
:::spoiler Click if you want to understand the underlying mathematics
Starting from some initial ranking $\pmb{r}^0$, Equations 1 and 2 begin unfolding like this:

$$
\begin{eqnarray}
r_j^1 &=& \sum_k w_{jk} \, r_k^0 & [3]\\
r_j^2 &=& \sum_k w_{jk} \, r_k^1 & [4]\\
&=& \sum_k w_{jk} \left( \sum_{l} w_{kl} \, r_{l}^0 \right) \\
&=& \sum_{k,l} w_{jk} \, w_{kl} \, r_{l}^0
\end{eqnarray}
$$

Looking at the first equation, we can see that the first update to the ranking of page $j$ depends on the weights $w_{jk}$, which encode the values that each page $k$ ascribes to page $j$. Likewise, from the last equation, we can see that the second update to the ranking of page $j$ depends on the multiplicative term $w_{jk} w_{kl}$. This term encodes not just the value that each page $k$ ascribes to page $j$, but rather the value that each page $k$ ascribes to each page $l$ multiplied by the value that each page $l$ ascribes to each page $j$. Embedded in this multiplicative term is the factor $\epsilon^2$. Likewise, embedded in the calculations of $r_j^3$ is the factor $\epsilon^3$. The $\epsilon$ in these equations can be interpreted as a dampening factor, because pages that are only one hop (i.e., link) away are adjusted only by $\epsilon$, but pages that are two hops away are adjusted by $\epsilon^2$, and so on. Indeed, the fewer hops there are from $j$ to $k$, the more $k$'s score influences $j$'s.

If you continue the iterative calculations spelled out in Equations 3 and 4, you will find that the rankings eventually converge. Indeed, the rankings to which this process converges are the unique rankings that satisfy Equation 1.

**Note:** We sort of swept things under the rug when we first wrote down Equation 1. What we should have written, to be more precise, was: at the $i$th iteration,

$$
r_j^{i+1} = \sum_k w_{jk} r_k^i
$$

But now that we know that this iterative process converges, Equation 1 is justified.
:::

### Pseudocode for PageRank

Here is some pseudocode for PageRank (it's an image that uses the notation $w(j)(k)$ instead of $w_{jk}$). The variables $r$ and $r'$ represent page rankings (authorities), which consist of a number between 0 and 1 for each page. $r'$ is the current ranking, which gets updated based on a copy $r$ of that ranking in each round. The ranking has stabilized when $r'$ and $r$ are close to each other, as defined below the pseudocode.

![](https://i.imgur.com/i7ZFa9Q.png)

**Distance between rankings:** The difference between two numeric rankings like these is commonly measured by computing the Euclidean distance, which has the following formula (where $i$ ranges over the pages):

$$
\sqrt{\sum_i (r'_i - r_i)^2}
$$

Once this distance is sufficiently small (say, less than $0.001$), you can declare the ranking stable, and output $r'$ (or $r$, since they are so close) as your final ranking.
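In Scala, the pseudocode boils down to a loop like the following sketch. It assumes a hypothetical `weight(j, k)` function implementing Equation 2 (like the one sketched earlier) and a sequence `ids` of all page ids. Note that recomputing every $(j, k)$ pair this way is fine for small wikis, but for `BigWiki` you will want to avoid building or storing a full $n \times n$ weight table:

```scala
import scala.math.{pow, sqrt}

object PageRankSketch {
  def pageRank(ids: Seq[Int], weight: (Int, Int) => Double): Map[Int, Double] = {
    val n = ids.size
    var r: Map[Int, Double] = ids.map(j => (j, 0.0)).toMap          // previous ranking
    var rPrime: Map[Int, Double] = ids.map(j => (j, 1.0 / n)).toMap // start uniform

    // Euclidean distance between the previous and current rankings
    def distance: Double = sqrt(ids.map(i => pow(rPrime(i) - r(i), 2)).sum)

    // Repeat the update r'_j = sum over k of w_jk * r_k until r and r' are close
    while (distance > 0.001) {
      r = rPrime
      rPrime = ids.map { j => (j, ids.map(k => weight(j, k) * r(k)).sum) }.toMap
    }
    rPrime
  }
}
```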
### PageRank Example

Let's consider a concrete example of PageRank on three pages connected as follows:

![](https://i.imgur.com/tsFCJ7b.png)

Computing weights according to Equation 2, we get:

$$
\begin{align}
w_{AB} = w_{BA} = w_{CA} = w_{CB} &= \frac{0.15}{3} + \frac{0.85}{2} = 0.475\\
w_{AC} &= \frac{0.15}{3} + 0.85 = 0.9\\
w_{AA} = w_{BC} = w_{BB} = w_{CC} &= \frac{0.15}{3} = 0.05
\end{align}
$$

Then, we can use those weights to iterate over the page ranks for A, B, and C until they stabilize.

| iteration | rank(A) | rank(B) | rank(C) |
| --------- | ------- | ------- | ------- |
| 0 | 0.3333 | 0.3333 | 0.3333 |
| 1 | 0.4750 | 0.1916 | 0.3333 |
| 2 | 0.4148 | 0.2519 | 0.3333 |
| 3 | 0.4404 | 0.2263 | 0.3333 |
| 4 | 0.4295 | 0.2372 | 0.3333 |
| 5 | 0.4341 | 0.2325 | 0.3333 |
| 6 | 0.4322 | 0.2345 | 0.3333 |
| 7 | 0.4330 | 0.2337 | 0.3333 |
| 8 | 0.4326 | 0.2340 | 0.3333 |

> Table 1: Convergence of ranks for three pages A, B, and C.

Here's another, less worked-through example:

![](https://i.imgur.com/fEjJLEJ.png)

| rank(A) | rank(B) | rank(C) | rank(D) |
| ------- | ------- | ------- | ------- |
| 0.2018 | 0.0375 | 0.3740 | 0.3867 |

### Testing PageRank

We have provided you with a sample wiki file, `PageRankWiki`, made to test the results of your PageRank implementation (separately from the rest of your code). The wiki includes several pages where one page (**Page 100**) has an overwhelmingly larger number of pages linking to it compared to the rest of the pages in the wiki. We recommend checking the values your PageRank implementation outputs for each of the documents and confirming that **Page 100** has a much higher score than the others.

## Saving Indexing Information

Your Indexer will save three files:

- `titles.txt`, which maps document IDs to document titles
- `docs.txt`, which stores the rankings computed by PageRank
- `words.txt`, which stores the relevance of documents to words

It will be the job of your indexer to write these files, and the job of your querier to read them. After design checks, we will be providing code to write these files from the indexer as well as code to read them into the querier.

:::warning
NEW -- [Here is the code for reading/writing index files](http://cs.brown.edu/courses/csci0180/content/projects/search/FileIO.scala) (it is also in the project src directory on the dept machines).
:::

The `FileIO.scala` file that we have provided contains six methods: `printTitleFile()`, `readTitles()`, `printDocumentFile()`, `readDocuments()`, `printWordsFile()`, and `readWords()`. These methods are for writing to and reading from the three files specified above.

:::spoiler Click if you want to know more about the arguments to the methods in the `FileIO.scala` file.

* `printTitleFile()` and `readTitles()` take in a string (the path to the `titles.txt` file) and a `HashMap[Int, String]` representing the map of document IDs to document titles. The print method takes in the hashmap and writes its contents to the specified file; the read method takes in an empty hashmap and fills it with the information in the input file.
* `printDocumentFile()` and `readDocuments()` are similar to the titles methods, but they take a `HashMap[Int, Double]` of doc IDs to pageranks and store/read that information into/from the specified file.
* `printWordsFile()` and `readWords()` take a `HashMap[String, HashMap[Int, Double]]` of words to a hashmap of doc IDs to the frequency of the word in that document. These read/write in the same way as the previous methods.
:::
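As a rough usage sketch, an indexer might populate three mutable maps and hand them to the print methods, while a querier passes empty maps to the read methods to be filled. The argument order shown here follows the descriptions above, but you should check `FileIO.scala` for the exact signatures:

```scala
import scala.collection.mutable.HashMap

// In the indexer: build these up while processing the wiki, then write them
val idsToTitles = new HashMap[Int, String]
val idsToPageRanks = new HashMap[Int, Double]
val wordsToDocFreqs = new HashMap[String, HashMap[Int, Double]]
// ... populate the maps ...
FileIO.printTitleFile("titles.txt", idsToTitles)
FileIO.printDocumentFile("docs.txt", idsToPageRanks)
FileIO.printWordsFile("words.txt", wordsToDocFreqs)

// In the querier: pass in empty maps, which the read methods fill in
FileIO.readTitles("titles.txt", idsToTitles)
FileIO.readDocuments("docs.txt", idsToPageRanks)
FileIO.readWords("words.txt", wordsToDocFreqs)
```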
<a name="queries"></a>

## Handling Queries

Your querier should do the following:

1. Read the given files from the Indexer into a data structure of your choosing
2. Prompt the user to enter a free-text query
3. Answer the query by scoring its terms (after removing stop words and stemming) against every document and returning the titles of the documents with the top 10 scores. The first document printed should be the one with the highest score, and the last document printed should be the one with the tenth-highest score. If the terms in the query appear in fewer than 10 documents, return only the documents in which they appear. If there are no search results at all, print an informative message.
4. Repeat steps 2 and 3 until the user types `:quit`.

To help you explore the impact of PageRank, your querier will take an argument that indicates whether to consider PageRank scores when ranking documents.

:::warning
NEW -- [Here is starter code for the querier](http://cs.brown.edu/courses/csci0180/content/projects/search/QueryStencil.scala). You only need to fill in the body of the `query` method (the file gives you steps 1, 2, and 4 from the above list, including the REPL). This is a change from original plans now that the class will finish remotely.
:::

### Scoring Documents against Queries Without PageRank

Given a query, you can use the term-document scores that you wrote to the *words.txt* file to locate the most relevant documents. The computation is straightforward: for every document $j$, compute the relevance of $j$ to query $Q$ by summing scores across all terms $i$ in the query (after stemming and removing stop words):

$$
s_{Q,j} = \sum_{i \in Q} {tf}_{ij} \, {idf}_i
$$

Sort the documents by these scores, and return the 10 highest-scored documents as the result of the query. Break ties however you wish.
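In code, that summation might look like the following sketch. It assumes the querier has read *words.txt* into a hypothetical `wordsToDocFreqs` map (word to a map from doc ID to $tf$) and has an `idf` map covering the query's terms; only documents containing at least one query term ever receive a score:

```scala
object ScoreSketch {
  // s_Qj = sum over query terms i of tf_ij * idf_i; returns the top 10
  def scoreDocs(queryTerms: List[String],
                wordsToDocFreqs: Map[String, Map[Int, Double]],
                idf: Map[String, Double]): List[(Int, Double)] = {
    val scores = scala.collection.mutable.HashMap[Int, Double]()
    for (i <- queryTerms if wordsToDocFreqs.contains(i);
         (j, tfIJ) <- wordsToDocFreqs(i)) {
      scores(j) = scores.getOrElse(j, 0.0) + tfIJ * idf(i)
    }
    scores.toList.sortBy { case (_, score) => -score }.take(10)
  }
}
```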
### Scoring Documents Including PageRank

Recall that PageRank computes a score for each document (which you wrote to the file *docs.txt*). These scores can be combined with the term-relevance scores ($s_{Q,j}$) before sorting the documents and returning results. You can experiment with different combinations and see how they affect the results (e.g., adding the PageRank and term-relevance scores, multiplying them, other ideas).

### Users decide whether to search with/without PageRank

The main method for your Querier (which takes the index file names) should take an optional argument `--pagerank`. If this argument is provided, you will combine PageRank results and term-relevance results when scoring documents. Otherwise, you will base scoring only on term relevance. Here are examples of running your program without and with PageRank:

```
$ scala search.sol.Query titles.txt docs.txt words.txt
$ scala search.sol.Query --pagerank titles.txt docs.txt words.txt
```

### REPL (Read-Eval-Print Loop)

A REPL is a method that prompts the user for input, processes the input, prints the result, and repeats. This is the method that does all of the communication with a user. You have seen code that follows this cycle a few times this semester: the `loginScreen` in the Banking example from lecture, the I/O lab, and the RegExp lab give samples.

For the search engine, you should continue processing queries until the user types `:quit` at the prompt. If the user types `quit` without the colon, your search engine should treat it as a query.

<a name="efficiency"></a>

## Efficiency Considerations

Given the scale of data that you will work with in BigWiki, you will have to pay attention to both time and space efficiency on this project. You don't have to maximize time efficiency, but we want to see you making smart high-level choices about how often your code traverses data. For example, if you take two consecutive passes over the same large data structure when one would suffice, we'd expect you to combine the two passes.

**Indexer Efficiency:** Don't worry if your indexer is relatively slow at indexing a very large corpus. We'd rather see your indexer take 15 minutes to build an index, with querying taking a mere 0.5 seconds, than your indexer take 5 minutes, and querying 1.5 seconds.

**Space Complexity:** This project is the first time you'll really confront space complexity. You can end up creating a lot of intermediate data structures in this project. Pay attention to when you need to create new data (it's fine if you need it, but don't assume you have to use immutable data just because we've talked about it). Also think about when data structure choices need to take space usage into account (meaning, you don't *always* have to use the smallest-space structure, but you should if a particular piece of data could take considerable space).

Also pay attention to not holding onto data longer than necessary. If you have a large data structure in the heap, but no way to reach it from the environment, Java will reclaim that space (garbage collection). One common way to hold onto data too long is to make your methods do too much work (parameters stay in the environment for the duration of a method call). Creating methods that process data in slightly smaller chunks can help control space usage. Lastly, be sure to close any Readers and Writers as soon as you are done using them; and similarly, do not open them prematurely.

**Out of Memory Errors:** It is possible (likely, even) that you will encounter a Java `OutOfMemoryError` while running your indexer. If you do, try re-running your program with the argument `-J-Xmx512m`:

```
scala -J-Xmx512m search.sol.Index <arguments>
```

The `-J` means it's an argument for your JVM, the `-Xmx` tells the JVM how much memory it can use, and `512m` means 512 megabytes. The maximum amount of memory you are permitted to use on a department machine is `-Xmx2g`, meaning up to 2 gigabytes. But be forewarned: allowing the JVM to use this much memory can drastically slow down your machine's other processes.

<a name="testing"></a>

## Testing Expectations

You can test the calculations computed by your search engine and other important pieces of functionality using check-expects, although we are not requiring you to write check-expects. You should try various queries, with different types of pages. Document the queries you tried (along with the inputs used to start the programs) and the outputs that you expected in your `README` file. You will **only** receive credit for testing which is clearly documented in your `README` file. (More details on the README appear later.) You are welcome to create your own Wiki files for testing purposes.

<a name="written"></a>

## Written Questions

Virtually all Internet users rely on search engines to access information. Now that you've built your own search engine, it's important to consider the ways in which search engine designs can be exploited. Data voids are one example of how search engines enable the spread of misinformation. Read the following introduction to data voids and answer the questions below.

*The following explanation of data voids is excerpted from [a Data & Society report about data voids](https://datasociety.net/wp-content/uploads/2019/11/Data-Voids-2.0-Final.pdf) by researchers danah boyd and Michael Golebiewski.*

Data voids are one way that search users can be led into disinformation or manipulated content. These voids occur when obscure search queries have few results associated with them, making them ripe for exploitation by media manipulators with ideological, economic, or political agendas. Search engines aren't simply grappling with media manipulators using search engine optimization techniques to get their website ranked highly or to get their videos recommended; they're also struggling with conspiracy theorists, white nationalists, and a range of other extremist groups who see search algorithms as a tool for exposing people to problematic content.

Data voids exist because of an assumption baked into the design of search engines: that for any given query, there exists some relevant content. But this is simply not true. When search engines have little available content to return for a particular query, the "most relevant" content is likely to be low quality or problematic or both. With low-data queries, not only are search engines less likely to statistically predict what the user is looking for, but there might simply be no high-quality content to return. Data voids can easily be manipulated, offering problematic content in a range of situations that can cause serious harm. Below are the main five types of data voids currently being manipulated by those spreading conspiracies or hate:

1. Breaking News: The production of problematic content can be optimized to terms that are suddenly spiking due to a breaking news situation; these voids will eventually be filled by legitimate news content, but are abused before such content exists.
2. Strategic New Terms: Manipulators create new terms and build a strategically optimized information ecosystem around them before amplifying those terms into the mainstream, often through news media, in order to introduce newcomers to problematic content and frames.
3. Outdated Terms: When terms go out of date, content creators stop producing content associated with these terms long before searchers stop seeking out content. This creates an opening for manipulators to produce content that exploits search engines' dependence on freshness.
4. Fragmented Concepts: By breaking connections between related ideas, and creating distinct clusters of information that refer to different political frames, manipulators can segment searchers into different information worlds.
5. Problematic Queries: Search results for disturbing or fraught terms that have historically returned problematic results continue to do so unless high-quality content is introduced to contextualize or outrank such problematic content.

If you're interested in learning more, read this [report about data voids](https://datasociety.net/wp-content/uploads/2018/05/Data_Society_Data_Voids_Final_3.pdf) or [this talk by danah boyd](https://points.datasociety.net/the-fragmentation-of-truth-3c766ebb74cf) covering broader societal issues surrounding data voids.

Answer the following questions in your README file with a few sentences per question.

1. Read about the five types of data voids listed above and answer the following questions:
   - Choose one type of data void and propose a modification (large or small) to your search engine design that addresses this vulnerability (you do not need to know how to implement this modification).
   - Choose another type of data void and propose a strategy to address this vulnerability that lies outside the role of the engineer. Consider strategies at any scale and the role of journalists, policymakers, search engine users, etc.
1. Golebiewski and boyd write that on the search engine Bing, "the overwhelming majority of users only engage with the first page of results. Less than 3% of queries result in someone going to the next page. Thus, what matters most is what is prioritized on the first page." Given this information and your knowledge about data voids, consider your design for returning search results.
   - Identify a vulnerability with your current design for returning search results. Explain how it could lead to an unexpected social or ethical outcome.
   - Propose a new design for what to do when there is little content in the corpus to return for a query. Be creative -- you do not need to know how to implement this design!
1. Answer the following questions about the PageRank algorithm:
   - With the scoring mechanism used in the PageRank algorithm in mind, what could an organization do to ensure their own pages get promoted to the top of the search results?
   - Say an inaccurate news article has gone viral and individuals have been linking to it heavily in their online feeds (social media, blogs, pages, etc.). How could this affect the results of someone who is attempting to research the story through a search engine that uses PageRank?

<a name="checks"></a>

## Checks and Handin

### Design Check

Design checks will be held on Wednesday, March 11th through Friday, March 13th. Refer to the Search Release email for how to sign up for design checks.

**Reminder:** You are required to pair program at least the design check portion of all CS18 projects.
We recommend finding a partner as soon as possible, as you will not be able to sign up for a design check without one.

For the design check, you must do the following:

* Explain how you will parse, tokenize, and stem documents; that is, how you will split them up into individual, lower case, stemmed terms that are not stop words.
* Explain how you will score documents.
* Explain what data structures you will write to file from your Indexer and read from file in your Querier, and what information they will be storing.
* Walk the TAs through the steps that your program will take to answer a free text query.
* Explain how you plan to implement PageRank.

Feel free to ask your friendly TAs as many clarification questions as necessary about PageRank in order to ensure you understand it fully *before* your design check.

### Final Handin

The final handin is due by 5:00 PM on Friday, March 20th.

For the final handin, your `scalaproject` should contain the packages `search.src` (in the `src` directory) and `search.sol` (in the `sol` directory). Your code should be part of the `search.sol` package. That package should also contain a `README.txt` file. Your README file should include:

* Answers to your written questions.
* Instructions for use, describing how a user would interact with your program.
* A brief overview of your design, including how the pieces of your program fit together.
* A description of features you failed to implement, as well as any extra features you implemented.
* A description of any known bugs in your program.
* A description of how you tested your program.
* A list of the people with whom you collaborated.

You should hand in all of the `.scala` files in your `search.src` and `search.sol` directories, as well as any wikis you generated for testing. Please **do not** hand in the small, medium, or big wiki that we give you.

**Note:** Only one of you or your partner needs to hand in the project, but feel free to add both partners to the submission on Gradescope.

<a name="grading"></a>

### Grading

As with all CS 17/18 projects, a good design will make coding this project significantly easier, so you should spend a fair amount of time working on your program's design before you begin writing any code.

The design check counts for 15% of your grade, including:

* An explanation of your plan for tokenizing.
* An explanation of your plan for scoring documents.
* Details about the data structures your querier will use to store the information read from the index files.
* A walkthrough of the design of your querier.
* A demonstration of your understanding of PageRank.

Functionality counts for 70% of your grade, including:

* Correct implementation of Indexer: 24 points
* Correct implementation of Querier: 23 points
* Correct implementation of PageRank: 10 points
* Efficiency: 13 points
  - 4 points for indexing `BigWiki` in under an hour with 1 GB of memory on the department machine.
  - 5 points for indexing `MedWiki` in under two minutes on the department machine with 1 GB of memory.
  - 4 points for querying efficiency on `MedWiki`.

As always, partial functionality merits partial credit.

The final 15% of your grade will be reserved for comments, testing, and style. You should include documentation comments and test cases for all non-trivial methods. You should also perform system testing, to test interactions among methods. Additionally, comment any code which would otherwise be unclear.
:::warning
**Added 4/2 (10:15pm)** Grading levels for Search:

- For a passing/C grade, get the indexer and querier to produce the correct answers; you don't have to worry about achieving a good runtime, good space efficiency, or implementing PageRank.
- For a B grade, get the indexer and querier to produce the correct answers, and get good runtime and space efficiency. PageRank is not required.
- For an A grade, get the indexer and querier to produce the correct answers, get good runtime and space efficiency, and implement PageRank.

All grade levels should still turn in the written portion.
:::

___

Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS18 document by filling out the anonymous feedback form: [https://cs.brown.edu/courses/cs018/feedback](https://cs.brown.edu/courses/cs018/feedback).