---
tags: Projects-Summer2021
---

# Project 2: Search

### Due Date: Friday, July 16, Anywhere-On-Earth time

## The Hunt Continues... (optional)

:::spoiler Look here for the next clue!
Somewhere, out there, in the great big world, there is a bear. [Blueno](https://www.browndailyherald.com/wp-content/uploads/2020/08/DSCF9718-e1597688000493.jpeg). The greatest treasure of all time, a joy unspeakable. And the remarkable CS18 Team of Theme is searching for it.

The grandiose, all-powerful Team of Theme has found a super-snazzy kind of top-secret database with information about the beloved bear and its prized treasure. Only, this set of data also happens to have some info about other things as well, like cats. If they ever want to find happiness again, they'll have to parse this database bit by bit, developing a good ole' search engine to warm the soul. To find Blueno.

*In `MedWiki.xml`, there lie some clues
Searchable, they are, but first you must choose
Special words in this list,
Paired with numbers you can't miss,
To find the treasure of a Bear who lived the most Blue.*

Search these words in the [Search demo](https://cs18-search.herokuapp.com/), and find the number of the result that matches the paired word. The resulting string of integers will be the end of this link (tinyurl.com/<yournumberhere>). For example, ["car", "Open-wheel car"] means that you'd search "car" and find that "Open-wheel car" is result #2. The sequence of the clues produces results in the correct order for the tinyurl.

["bear", "Harappa"]
["brown", "Fonni"]
["treasure", "Mohism"]
["hunt", "Mesolithic"]
["frog", "Cuisine of the Midwestern United States"]
["kathi", "Houston"]
:::

# Overview

You are probably familiar with a few search engines: [Google](http://www.google.com/), [Bing](http://www.bing.com/), [Duck Duck Go](http://duckduckgo.com/). In this project, you're going to be building your own search engine! Specifically, your search engine will allow users to query a corpus (a large collection of texts) of Wikipedia pages.

In this handout we will:

* outline the learning objectives of the project
* outline the indexer-querier architecture of your search engine
* describe the format of the input corpus data
* discuss in-depth the parsing process
* describe handling user queries
* discuss in-depth how to calculate the relevance of a document to a query
* describe the PageRank algorithm and its role in your search engine
* talk about how to test your project effectively
* describe the evaluation criteria for the project

## Learning Objectives

During this project, you will:

- choose specific data structures, citing alignment with problem needs and constraints
- optimize your program to address space and time constraints
- practice strategies to test and debug complex programs
- implement the PageRank algorithm, which revolutionized web search

# Setup

## Stencil Code

:::warning
**Important**: Read [this guide](https://hackmd.io/ucedaAekROycRiN2j1LRaw) detailing how to use GitHub Classroom for projects in CS18 first!
:::

You can set up your group and clone your group's project repository [**here**](https://classroom.github.com/g/P_qyZIVW).

## Installing XML Dependency

In this project you will use a Scala XML library that you need to install as a dependency for your project. IntelliJ makes this easy!

* Open **File -> Project Structure -> Modules**
* Click the **+** button at the bottom and select **Library**
* Select **New Library... -> From Maven...**
* Search for the dependency `org.scala-lang.modules:scala-xml_2.13:2.0.0-M5` and install it
* You should be back in the **Choose Libraries** window; select the newly installed library and choose **Add Selected**.
* The Scala XML library should now work for you!
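If your group happens to build with sbt rather than through the IntelliJ GUI (an alternative setup, not something this project requires), the equivalent declaration of the same Maven dependency in `build.sbt` would be:

```scala
// build.sbt -- only if your project uses sbt; the IntelliJ steps above are the supported path
libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "2.0.0-M5"
```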
# Implementation Architecture

## Introduction

Your search engine will work a lot like a typical search engine. A user will type in what is called a *free text query* (i.e., some words), and your program will return what it judges to be the most relevant pages for that query.

One naive way to build a search engine would be to ask the user to enter a query, and then scan through the corpus looking for documents that match that query. This approach would require many, many traversals of the corpus -- one per query. Doing this for every single web query would *not* be fast.

Instead, most modern search engines consist of two programs, an **indexer** and a **querier**. The indexer preprocesses the corpus before answering any queries, saving files with information about which documents are relevant to which words. In this project, you will implement your indexer in `Indexer.scala` and your querier in `Querier.scala`.

Preprocessing will be discussed later in the handout and will involve parsing, cleaning, and analyzing the corpus texts, as well as executing the PageRank algorithm to calculate the authority of each page.

:::info
Yes, this takes a long time; but remember, this is a preprocessing step. So, if it takes one hour, or even one day, it does not negatively impact the user experience, because it does not affect query-response time. This is an example of a **time vs. space trade-off**. You'll use extra space to store the preprocessed information, but that enables your program to answer queries quickly.
:::

When the querier receives a query from a user, it uses the information contained in the file(s) to *quickly* respond to that query.

Once built, an example run of _your_ search engine might look like this:

```
$ scala sol.Indexer BigWiki.xml titles.txt docs.txt words.txt
... indexing for several minutes ...
$ scala sol.Querier titles.txt docs.txt words.txt
... loading indices for several seconds ...
search> tropical
1 India
2 Indonesia
3 Pacific Ocean
4 International Tropical Timber Agreement, 1983
5 Mediterranean Sea
6 El Niño-Southern Oscillation
7 Philippines
8 Nicaragua
9 Hawaii
10 Oligocene
search> pope
1 Pope
2 Middle Ages
3 Holy Roman Empire
4 Pope Julius II
5 Pope John Paul II
6 Liber Pontificalis
7 Pope Clement V
8 Pope Alexander III
9 Pope Stephen VI
10 Pope Clement IV
search> :quit

Process finished with exit code 0
```

It is important to note the difference in order of magnitude of execution time at each step:

* parsing the corpus and writing indices for BigWiki will take a while (**minutes**)
* reading in the indices and initializing the querier will take **several seconds**
* responding to user queries should be near **instantaneous**

## Architecture Diagram

The following diagram summarizes the components of the search engine. The *orange blocks* are the inputs (the web pages and the query). The blue boxes are different components of the solution (which you will write). The white boxes are files of intermediate data that get produced and read by the blue boxes. The components split into two main parts (the *indexer* and the *querier*) which you can develop and test independently.
![Search Architecture](https://i.imgur.com/YS3pxIs.png)

# Data Format

The collections (called Wikis) of Wikipedia pages you'll be searching over are located in the `wikis` folder in your repository. The two larger wikis, however, are not included because they are too large for GitHub, so you can download them from [**this**](https://drive.google.com/drive/u/0/folders/10MNlAyvjtnqADblvwzbCeLt9XRhgEdz4) Google Drive folder. The three main Wikis are called `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml`; there are also some smaller examples to help you debug your PageRank.

## The Document Format: XML

Programs often need to pass information to one another, and the most convenient way to do so is to use a data-interchange format, meaning a standardized text-based format. *XML*, or eXtensible Markup Language, is one such format. Invented by the World Wide Web Consortium (W3C), XML uses an HTML-style angle bracket specification.

Each XML *element* is encompassed by matching opening and closing *tags*. Opening tags are delimited by angle brackets, like this: `<tag>`. Closing tags include a slash, like this: `</tag>`. Text appears between the opening and closing tags of an element:

```*.html
<email>kfisler@cs.brown.edu</email>
```

An XML element can have any number of *attributes*, which are specified within the opening tag, as follows:

```*.html
<university name="Brown">
    <concentration type="Computer Science"></concentration>
    <concentration type="Slavic Studies"></concentration>
</university>
```

The `university` element has the attribute `name` with value `Brown`. The `concentration` elements have the attribute `type` with values `Computer Science` and `Slavic Studies`. The value of an attribute is always enclosed in double quotes.

Note that the two `concentration` elements are nested inside the `university` element. As this nesting reflects the tree structure of XML elements, the `concentration` elements are called *children* of the *parent* element, `university`.

### Pages

Each Wiki file consists of a single XML element tagged *xml*, within which there are a number of pages (*a.k.a.* documents), each of which is enclosed by opening and closing `page` tags. Each page then contains a `title`, a unique, non-negative integer `id`, and some `text`, all of which are delimited by tags with those names.

### Links

Links appear in the Wiki files in one of three forms:

1) `[[Hammer]]`
    * "Hammer" is the title of the page being linked to **and** is the text displayed on the page.
2) `[[Presidents|Washington]]`
    * "Presidents" is the title of the page being linked to, but "Washington" is the text displayed on the page.
3) `[[Category:Computer Science]]`
    * These are meta-page links, which can be treated like the `[[Hammer]]` link above. A meta-page is one that consists of links only (in this case, to the pages in the category `Computer Science`). A *namespace* is a collection of pages with a common purpose. Other common namespaces include `Help`, `File`, and `User`.

You'll need to keep track of which pages link to which other pages for PageRank!

# Indexer

## Main method usage

The main method of your `Indexer` should support the following usage:

```shell
$ scala sol.Indexer <wikiFile> <titleIndex> <documentIndex> <wordIndex>
```

where `wikiFile` is the corpus XML file, and the other three arguments are the names of the index files to write. Running the program should process the provided corpus and write the indices to the files.
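To make the expected usage concrete, here is a minimal sketch of how `Indexer`'s `main` might validate and unpack these arguments. Nothing here is required structure; the body comment stands in for the real work:

```scala
package sol

object Indexer {
  def main(args: Array[String]): Unit = {
    // Expect exactly four arguments: the wiki file and the three index files
    if (args.length != 4) {
      println("Usage: Indexer <wikiFile> <titleIndex> <documentIndex> <wordIndex>")
      return
    }
    val Array(wikiFile, titleIndex, documentIndex, wordIndex) = args
    // ... parse the corpus, run PageRank, and write the three index files ...
  }
}
```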
## Text Processing

An XML document has content that isn't relevant to searching, such as the tags (like `<page>`) and little words like "the". The first step in building a search engine is to reduce the document to its essential interesting words. This involves several steps:

- **Parsing:** extracting the text of each document from the XML
- **Tokenizing:** splitting the text into separate words and numbers, removing punctuation and whitespace
- **Removing stop words:** ignoring words like "a" and "the" that don't contribute information
- **Stemming:** reducing words to their *stems*, which are the base roots of words. e.g., "sleep", "sleeps", and "sleeping" have "sleep" as a base root.

After the document is converted to a sequence of stem terms, the indexer computes which documents are relevant to which words and stores this information in the index file. We describe each of these steps in turn.

### Parsing the XML

Scala has an XML library which reads an XML file into a tree of XML Nodes. To use this library, import the XML Node class at the top of the file in which you plan to parse the XML:

```scala
import scala.xml.Node
```

To access the root XML node, use `loadFile`:

```scala
val root: Node = xml.XML.loadFile("SmallWiki.xml")
```

Next, to extract specific children, use the `\` selector. For example, to select all the children with the tag "page", use:

```scala
val pageSeq: NodeSeq = root \ "page"
```

You can also further select all the children with the tag "id" like this:

```scala
val idSeq: NodeSeq = root \ "page" \ "id"
```

To retrieve the String content of the "title" child of a page Node, you could do:

```scala
// pageNode is a single Node
val titleText: String = (pageNode \ "title").text
```

Observe that these selectors return an object of type `NodeSeq`. If there is only one node in this sequence, then `pageSeq.text` or `idSeq.text` returns that node's contents. If there is more than one node in the sequence, then calling `.text` returns the concatenation of the text of all the nodes in the sequence.

You can find a short example of a Scala program that uses this library in the `NodeSample.scala` file in the `src` package. Try running and playing around with it!

:::warning
**Note:** When parsing text, you should consider the link text as words, but **not** the titles of the documents to link to. In the examples presented in the Links section above, `Hammer` should be used to compute relevance; so should `Washington`, although `Presidents` should not be; and both `Category` and `Computer Science` should be used.
:::
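Putting these selectors together, here is a short sketch that walks every page and pulls out its `id`, `title`, and `text` (the `.trim` calls are there because the text inside a tag often carries surrounding whitespace; the object name is illustrative):

```scala
import scala.xml.Node

object ParseSketch {
  def main(args: Array[String]): Unit = {
    val root: Node = xml.XML.loadFile("SmallWiki.xml")
    // Visit each page element and extract its three children
    for (page <- root \ "page") {
      val id: Int = (page \ "id").text.trim.toInt
      val title: String = (page \ "title").text.trim
      val body: String = (page \ "text").text
      println(s"$id: $title (${body.length} characters of text)")
    }
  }
}
```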
### Tokenizing

Once you have a parsed string, you need to break it into words/numbers and drop punctuation and whitespace. This is a good use for *regular expressions*. Scala has a built-in regular-expression class called `Regex` that you can use to search through a string for matches to patterns. In the following example, we will use a somewhat messy-looking regular expression that matches links, words with apostrophes, and words without apostrophes (we explain it shortly).

First, create an instance of a Scala `Regex`:

```scala
import scala.util.matching.Regex

val regex = new Regex("""\[\[[^\[]+?\]\]|[^\W_]+'[^\W_]+|[^\W_]+""")
```

Note that the regular expression is wrapped in triple quotes.

:::spoiler Click for details of what that messy regular expression means
Let's break it down. There are three parts, `\[\[[^\[]+?\]\]`, `[^\W_]+'[^\W_]+`, and `[^\W_]+`, separated by pipes (`|`), which mean "or." So we are actually matching three possible alternatives:

1. `\[\[[^\[]+?\]\]`
    * The meaning: Match two left brackets (`\[\[`) and two right brackets (`\]\]`), making sure there's something in the middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another link was starting inside this link (`[^\[]+?`).
    * Returns links (e.g., "[[Some Wiki Page]]")
2. `[^\W_]+'[^\W_]+`
    * The meaning: Match at least one alphanumeric character (`[^\W_]+`), then an apostrophe (`'`), and then at least one alphanumeric character (`[^\W_]+`).
    * Returns words with apostrophes (e.g., "don't")
3. `[^\W_]+`
    * The meaning: Match at least one alphanumeric character.
    * Returns words (e.g., "dog")
:::

<br>

Next, use the method `findAllMatchIn`, which takes as input a `String`, to find all the matches within that `String`. This method actually returns an iterator over `Match` objects, which contain the substrings matched by the regular expression, as well as some metadata that we can discard here (groups, start index, etc.). We then traverse these `Match`es to extract the matched substrings.

```scala
// Call findAllMatchIn to get an iterator of Matches
val matchesIterator = regex.findAllMatchIn(someString)

// Convert the Iterator to a List and extract the matched substrings
val matchesList = matchesIterator.toList.map { aMatch => aMatch.matched }
```

:::info
**Note**: You are welcome to use the above regular expression in your `Indexer` to tokenize the text, but you can also use a different expression if you wish. For example, if you don't want to match numbers, replace each instance of `[^\W_]` in the regular expression with `[^\W_\d]`. You can also split this regular expression up into its parts, so that you can search for only links (using only the first part), or only words (using the last two parts joined with a pipe).
:::

### Stop Words

Some words appear quite often in a language but contribute little to meaning: words like "the", "in", "an", or "or". Such words are known as *stop words*. For example, in the phrase "the cat in the hat", the relevant words are "cat" and "hat".

To help you identify stop words, we've provided you with a file in the `src` package: `StopWords.scala`. This file contains a `StopWords` object, which exposes a single method, `isStopWord`, that takes as input a `String` (i.e., a single word) and returns `true` if the word is a stop word, and `false` otherwise. You should use this helper in your `Indexer`.

### Stemming

Words have variations that aren't relevant to searching. Consider the words "sleep", "sleeps", and "sleeping" -- we'd like all of these to match the query term "sleep". The process of reducing a word to its base is called *stemming*. In `PorterStemmer.scala` in the `src` package, we have provided a `PorterStemmer` class with two methods:

- `stem` takes as input a string of words and returns that same string, but with the words stemmed
- `stemArray` takes as input an array of strings, each consisting of a single word, and returns an array of those same words, again with the words stemmed

The outputs of the stemming process are called *terms*. The indexer will determine which documents are relevant for which terms.

:::warning
**Note:** As always, you should **not** modify code we give you in the `src` package (and you shouldn't need to). Doing so could prevent your code from compiling with the autograder.
:::
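To see how the pieces fit together, here is a sketch of the full text-processing pipeline: tokenize with the regex, unwrap link syntax, drop stop words, and stem. It refers to the provided `StopWords` and `PorterStemmer` described above, but exactly how `PorterStemmer` is constructed, along with the helper and object names, is an assumption for illustration:

```scala
import scala.util.matching.Regex

object TokenizeSketch {
  val regex = new Regex("""\[\[[^\[]+?\]\]|[^\W_]+'[^\W_]+|[^\W_]+""")

  // Turn one matched token into the words it contributes, stripping link syntax
  def toWords(token: String): Seq[String] = {
    if (token.startsWith("[[")) {
      val inner = token.stripPrefix("[[").stripSuffix("]]")
      // For [[Presidents|Washington]], only the display text counts as words;
      // for [[Category:Computer Science]], re-tokenizing keeps both parts
      val text = if (inner.contains("|")) inner.split("\\|").last else inner
      regex.findAllMatchIn(text).map(_.matched).toSeq
    } else {
      Seq(token)
    }
  }

  def main(args: Array[String]): Unit = {
    val sample = "The [[Presidents|Washington]] memorial doesn't [[Hammer]] time"
    val terms = regex.findAllMatchIn(sample)
      .flatMap(m => toWords(m.matched))
      .map(_.toLowerCase)
      .filter(word => !StopWords.isStopWord(word)) // drop stop words
      .toArray
    // stemArray is described in the handout; the no-argument constructor here
    // is an assumption
    val stemmed = new PorterStemmer().stemArray(terms)
    println(stemmed.mkString(" "))
  }
}
```

Remember that for PageRank you would also record the link *target* ("Presidents") at this point, even though it does not contribute words.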
:::

## Determining Relevance of Documents to Terms (TF-IDF)

Processing documents as described (parsing, tokenizing, and stemming) converts documents from text into a sequence of terms. Likewise, a query (after parsing, tokenizing, and stemming) is a sequence of terms. Consequently, to measure the relevance of a document to a query -- that is, to score the document -- it suffices to compare these two sequences of terms.

There are two key ideas that the similarity metrics used by most practical search engines capture:

* A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times. This notion is called the **Term Frequency**.
* A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance when it occurs than a term that appears in many documents. This notion is called **Inverse Document Frequency**.

### Term Frequency, $\pmb{tf}$

Given a list of all the terms that occur across all of the documents, one can count how many times each term appears in each document. These raw counts are only so useful for comparing documents, however: the term "cat" might appear less often in one document simply because the document is very short. We therefore must normalize the frequency counts across the documents. We can do this by dividing the count of a term by the count of the most frequently used term in the same document.

Formally, let $c_{ij}$ be the number of times that term $i$ appears in document $j$. Then, let $a_j = \max_i c_{ij}$ be the number of occurrences of the most frequently occurring term in document $j$. We then define the *term frequency* as ${tf}_{ij} = c_{ij} / a_j$.

### Inverse Document Frequency, $\pmb{idf}$

To capture the second idea, we want to scale up the relevance of words that occur in fewer documents in the corpus. One useful scaling factor, called *inverse document frequency* ($idf$), is computed as follows. Let $i$ be a term in the corpus, $n$ be the total number of documents, and $n_i$ be the number of documents that contain term $i$. Then,

$$
idf_i = \log {n \over n_i}
$$

(The use of log makes this factor less sensitive to large values of $n / n_i$.)

### Computing Document-Word Relevance (TF-IDF)

Given a term $i$ (from anywhere in the corpus) and a document $j$, we calculate the relevance score of $j$ to $i$ to be the normalized term frequency of $i$ in $j$ multiplied by the inverse document frequency of $i$:

$$
relevance_{ij} = tf_{ij} * idf_i
$$

Here is a [YouTube video on how the TF and IDF algorithms work](https://www.youtube.com/watch?v=vrjAIBgxm_w) that you may find helpful when working on this project.
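To make the formulas concrete, here is a small sketch that computes $tf$, $idf$, and the relevance scores from raw term counts. The counts are those of the worked example that follows; all names are illustrative:

```scala
import scala.collection.mutable

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    // term -> count for each document (these would come from your tokenizer)
    val counts: Seq[Map[String, Int]] = Seq(
      Map("the" -> 2, "dog" -> 1, "bit" -> 1, "man" -> 1),
      Map("the" -> 2, "dog" -> 1, "ate" -> 1, "cheese" -> 1),
      Map("the" -> 2, "bit" -> 1, "cheese" -> 2)
    )
    val n = counts.length

    // tf_ij = c_ij / a_j, where a_j is the max count in document j
    val tf: Seq[Map[String, Double]] = counts.map { doc =>
      val aj = doc.values.max.toDouble
      doc.map { case (term, c) => term -> c / aj }
    }

    // idf_i = log(n / n_i), where n_i is the number of docs containing term i
    val docFreq = mutable.Map[String, Int]().withDefaultValue(0)
    for (doc <- counts; term <- doc.keys) docFreq(term) += 1
    val idf: Map[String, Double] =
      docFreq.map { case (term, ni) => term -> math.log(n.toDouble / ni) }.toMap

    // relevance_ij = tf_ij * idf_i
    for ((doc, j) <- tf.zipWithIndex; (term, t) <- doc)
      println(f"doc $j%d, '$term%s': ${t * idf(term)}%.3f")
  }
}
```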
:::spoiler Click for a worked example of computing the frequencies and scores
Consider a corpus consisting of the following three documents:

1. the dog bit the man
2. the dog ate the cheese
3. the cheese bit the cheese

Our corpus then contains the words {"the", "dog", "bit", "man", "ate", "cheese"}. For each document, we create a mapping of these words to their counts:

* {2, 1, 1, 1, 0, 0} (document 1 has "the" 2 times, "dog" once, etc.)
* {2, 1, 0, 0, 1, 1}
* {2, 0, 1, 0, 0, 2}

With this information, we can compute the term frequencies $tf_{ij} = c_{ij} / a_j$:

| | $i_0$ (the) | $i_1$ (dog) | $i_2$ (bit) | $i_3$ (man) | $i_4$ (ate) | $i_5$ (cheese) |
|---|---|---|---|---|---|---|
| $j_0$ | 1 | 1/2 | 1/2 | 1/2 | 0 | 0 |
| $j_1$ | 1 | 1/2 | 0 | 0 | 1/2 | 1/2 |
| $j_2$ | 1 | 0 | 1/2 | 0 | 0 | 1 |

Next, we compute the inverse document frequencies for each word $i$ using $idf_i = \log \frac{n}{n_i}$:

| $\pmb{idf}$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
| --- | --- | --- | --- | --- | --- | --- |
| | $\log(3/3)$ | $\log(3/2)$ | $\log(3/2)$ | $\log(3/1)$ | $\log(3/1)$ | $\log(3/2)$ |
| | 0 | 0.405 | 0.405 | 1.099 | 1.099 | 0.405 |
:::

## PageRank: Determining Document Authority

One could determine which documents are relevant to a query just by comparing the terms in the query to those in the documents. This is how search engines used to work. Then in the 1990s, Google was founded using a different approach based on the PageRank algorithm (designed by the founders). The algorithm ranks pages based on the links among them (without considering the content). Pages with high scores (i.e., page ranks) are then thought of as *authoritative*. Given a query, the most authoritative documents related to the words in the query are returned.

Here are the key principles underlying the design of PageRank:

1. **The more pages that link to a page $j$, the more authoritative $j$ should be.** For example, if "Blog Brown" is linked to by 5 web pages, and the Wikipedia article on "Providence, Rhode Island" is linked to by 500 pages, then it makes sense to consider the Wikipedia article more authoritative.

2. **The more authoritative those pages are, the still more authoritative $j$ should be.** Now, if "Providence Place Mall" is linked to only by "Providence, Rhode Island", and "Blueno" is linked to only by "Blog Brown", it might not be enough to measure a page's authority simply by a count of the number of pages that link to that page. Indeed, it makes sense to consider "Providence Place Mall" more authoritative than "Blueno", since it is linked to by a more important page.

3. **The fewer links those pages have to pages other than $j$, the more authoritative $j$ should be.** Assume "Research at Brown" is linked to only by a "NYTimes" article, which links to only 10 other pages, while "Why Brown?" is linked to only by "Blog Brown", which links to 200 other pages. Because the "NYTimes" page has only a few links, and "Blog Brown" has many links, a link from the "NYTimes" page can be considered more important than a link from "Blog Brown", leading us to attribute more authority to "Research at Brown" than to "Why Brown?"

4. **The closer $j$ is to another page $k$ (measured by the number of links that must be followed to get from $j$ to $k$), the more $k$'s score should influence $j$'s.** For example, if the average path from "Brown University" to "Boeing" is 100 links, and the average path from "Brown University" to "Amy's Personal Website" is 5 links, then, all other things equal, Amy's page should be considered more authoritative than Boeing's.

PageRank uses these principles to compute the authority of each document.
The algorithm works roughly as follows: **each page's authority is a number between 0 and 1, and the total authority across all documents always equals 1.** The algorithm starts by assuming that all documents have the same authority. Then, following the principles, the algorithm updates the authorities based on the links between pages. Since the authority of one page can affect that of another, this process of refining authorities repeats until the authorities stabilize across all documents (the underlying mathematics guarantees eventual stabilization). It is important that the total authority in the system is conserved, so at any point in the converging sequence of page ranks, the total authority in the system will always sum to 1.

Here is a [YouTube video on how PageRank works](https://www.youtube.com/watch?v=v7n7wZhHJj8) that you might find useful when implementing the project.

:::info
We use the term ***weight*** to capture the impact of links from one page to another. Let $k$ and $j$ be pages, where $k$ has a link to $j$. We denote the *weight that $k$ gives to $j$* as $w_{kj}$. The weights in turn are used to compute a page's rank.
:::

**Computing rank:** For a page $j$, the rank $r_j$ is defined as follows:

$$
\begin{equation} r_j = \sum_k w_{kj} r_k \end{equation} \hspace{4em} [1]
$$

In other words, the rank of a page $j$ is the sum of the ranks of each page $k$, each multiplied by the weight that $k$ gives to $j$.

**Computing weights:** The weight is defined by the following formula, where $n$ is the total number of pages and $n_k$ is the number of (unique) pages that $k$ links to. $\epsilon$ is a small number (you should use $\epsilon = 0.15$), the use of which helps achieve the 4th principle.

$$
\begin{equation} w_{kj} = \left\{ \begin{array}{ll} \frac{\epsilon}{n} + (1 - \epsilon) \frac{1}{n_k} & \mbox{if $k$ links to $j$} \\ \frac{\epsilon}{n} & \mbox{otherwise} \end{array} \right. \end{equation} \hspace{4em} [2]
$$

**Special cases:** There are a few special cases to handle when defining the weights $w_{kj}$ (before adjusting those weights by $\epsilon$):

* Links from a page to itself are ignored.
* Links to pages outside the corpus are ignored.
* **Pages that link to nothing can be considered to link (once) to everywhere except to themselves.**
* Multiple links from one page to another should be treated as a single link.

One idea of this weighting scheme is to attribute more authority to page $j$ if it is one of only a few pages linked to by page $k$ than if it were one of many.
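Equation 2 translates almost directly into code. As a sketch, assume pages are identified by integer ids and that `links` maps each page to the set of unique in-corpus pages it links to, with the special cases above already applied:

```scala
// A sketch of Equation 2; `links` is a hypothetical precomputed link map
def weight(links: Map[Int, Set[Int]], n: Int, epsilon: Double = 0.15)
          (k: Int, j: Int): Double = {
  val targets = links(k)
  if (targets.contains(j))
    epsilon / n + (1 - epsilon) / targets.size // k links to j
  else
    epsilon / n                                // k does not link to j
}
```

Whether you compute these weights on the fly like this or precompute and store them is one of the design trade-offs mentioned below.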
:::spoiler Click if you want to understand the underlying mathematics
Starting from some initial ranking $\pmb{r}^0$, Equations 1 and 2 begin unfolding like this:

$$
\begin{eqnarray}
r_j^1 &=& \sum_k w_{kj} \, r_k^0 & [3]\\
r_j^2 &=& \sum_k w_{kj} \, r_k^1 & [4]\\
&=& \sum_k w_{kj} \left( \sum_{l} w_{lk} r_{l}^0 \right) \\
&=& \sum_{k,l} w_{kj} \, w_{lk} \, r_{l}^0
\end{eqnarray}
$$

Looking at the first equation, we can see that the first update to the ranking of page $j$ depends on the weights $w_{kj}$, which encode the weight that each page $k$ gives to page $j$. Likewise, from the last equation, we can see that the second update to the ranking of page $j$ depends on the multiplicative term $w_{kj} w_{lk}$. This term encodes not just the value that each page $k$ ascribes to page $j$, but rather the value that each page $k$ ascribes to page $j$ multiplied by the value that each page $l$ ascribes to page $k$. Embedded in this multiplicative term is the factor $\epsilon^2$. Likewise, embedded in the calculations of $r_j^3$ is the factor $\epsilon^3$.

The $\epsilon$ in these equations can be interpreted as a dampening factor, because pages that are only one hop (i.e., link) away are adjusted only by $\epsilon$, but pages that are two hops away are adjusted by $\epsilon^2$, and so on. Indeed, the fewer hops there are from $k$ to $j$, the more $j$'s score is influenced by $k$'s.

If you continue the iterative calculations spelled out in Equations 3 and 4, you will find that the rankings eventually converge. Indeed, the rankings to which this process converges are the unique rankings that satisfy Equation 1.

**Note:** We sort of swept things under the rug when we first wrote down Equation 1. What we should have written, to be more precise, was: at the $i$th iteration,

$$
r_j^{i+1} = \sum_k w_{kj} r_k^i
$$

But now that we know that this iterative process converges, Equation 1 is justified.
:::

### Pseudocode for PageRank

Here is some pseudocode for PageRank. The variables $r$ and $r'$ store the page ranking (a value between 0 and 1) for each page. $r'$ stores the current rankings, while $r$ is the ranking from the previous iteration. We compute the rankings of the current round based on the rankings of the previous round. The ranking has stabilized when $r'$ and $r$ are sufficiently close to each other, as defined below the pseudocode. The pseudocode uses `weight(k, j)` to mean $w_{kj}$, or the weight that $k$ gives to $j$.

```
pageRank(pages):
    initialize every rank in r to be 0
    initialize every rank in r' to be 1/n
    while distance(r, r') > delta:
        r <- r'
        for j in pages:
            r'(j) = 0
            for k in pages:
                r'(j) = r'(j) + weight(k, j) * r(k)
    return r'
```

:::info
You can decide which data structure to use to store the rankings ($r$ and $r'$) and also whether you want to calculate the weights $w_{kj}$ on the fly or precompute and store them at the beginning. As always, you should think about the tradeoffs (space/time/clarity) for any design choice!
:::

### Distance between rankings

In the pseudocode, you will see that we iterate until the distance between the current rankings and the previous iteration's rankings is smaller than a given threshold. The distance between two numeric rankings like these is measured by computing the **Euclidean distance** between the two vectors, using the following formula (where $i$ ranges over the pages):

$$
\sqrt{\sum_i (r'_i - r_i)^2}
$$

Once this distance is sufficiently small (say, less than $\delta = 0.001$), you can declare the rankings stable, and output $r'$ as your final ranking.
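Here is a minimal Scala transcription of the pseudocode and the distance check, assuming pages are numbered `0 until n` and that a `weight` function (such as the sketch above) is passed in; your own data-structure choices may well differ:

```scala
// Euclidean distance between the previous and current rankings
def euclideanDistance(r: Array[Double], rPrime: Array[Double]): Double =
  math.sqrt(r.zip(rPrime).map { case (a, b) => (b - a) * (b - a) }.sum)

// A direct transcription of the pseudocode; weight(k, j) is w_kj
def pageRank(n: Int, weight: (Int, Int) => Double,
             delta: Double = 0.001): Array[Double] = {
  var r = Array.fill(n)(0.0)          // previous iteration's ranks
  val rPrime = Array.fill(n)(1.0 / n) // current ranks
  while (euclideanDistance(r, rPrime) > delta) {
    r = rPrime.clone()
    for (j <- 0 until n) {
      rPrime(j) = 0.0
      for (k <- 0 until n)
        rPrime(j) += weight(k, j) * r(k)
    }
  }
  rPrime
}
```

Note that calling `weight(k, j)` inside the inner loop recomputes each weight every iteration; precomputing them trades space for time, as discussed in the note above.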
### PageRank Examples

#### Example 1:

Let's consider a concrete example of PageRank on three pages connected as follows:

![](https://i.imgur.com/QBIZwCf.png)

Computing weights according to Equation (2), we get:

$$
\begin{align} w_{AB} = w_{AC} = w_{BA} = w_{BC} &= \frac{0.15}{3} + \frac{0.85}{2} = 0.475\\ w_{CA} &= \frac{0.15}{3} + \frac{0.85}{1} = 0.9\\ w_{AA} = w_{BB} = w_{CC} = w_{CB} &= \frac{0.15}{3} = 0.05 \end{align}
$$

Then, we can use those weights to iterate over page ranks for A, B, and C until they stabilize.

| iteration | rank(A) | rank(B) | rank(C) |
| --------- | ------- | ------- | ------- |
| 0 | 0.3333 | 0.3333 | 0.3333 |
| 1 | 0.4750 | 0.1916 | 0.3333 |
| 2 | 0.4148 | 0.2519 | 0.3333 |
| 3 | 0.4404 | 0.2263 | 0.3333 |
| 4 | 0.4295 | 0.2372 | 0.3333 |
| 5 | 0.4341 | 0.2325 | 0.3333 |
| 6 | 0.4322 | 0.2345 | 0.3333 |
| 7 | 0.4330 | 0.2337 | 0.3333 |
| 8 | 0.4326 | 0.2340 | 0.3333 |

> Table 1: Convergence of ranks for three pages A, B, and C.

#### Example 2:

Here's another, less worked-through example:

![](https://i.imgur.com/fEjJLEJ.png)

| rank(A) | rank(B) | rank(C) | rank(D) |
| ------- | ------- | ------- | ------- |
| 0.2018 | 0.0375 | 0.3740 | 0.3867 |

### Testing PageRank

We've provided you with several sample wiki files to help test and debug the results of your PageRank implementation:

* `PageRankExample1.xml` captures the first worked example above
* `PageRankExample2.xml` captures the second worked example above
* `PageRankWiki.xml` is a wiki where every page links only to a single page (page 100)

The best way to test your PageRank is to verify that it outputs the exact values ($\pm\epsilon$) shown for the two worked examples. *It will be very helpful to step through the debugger and verify that the weights are being calculated correctly and that the sequence is converging.* Also remember that **the sum of the ranks across all pages should be 1.** `PageRankWiki` can also be useful as a sanity check to show that your program gives a much higher weight to page 100 than to any other page.

:::warning
**Important:** If you are having trouble debugging your PageRank, you should attempt to test and debug your program on the worked examples (`PageRankExample1.xml` and `PageRankExample2.xml`) *before* coming to hours or posting on Ed.
:::

## Saving Indexing Information

As discussed earlier, your Indexer must save the information computed during preprocessing. Your Indexer will save three files:

- `titles.txt`, which stores the **id** and **title** of each document
- `docs.txt`, which stores the **id** and **ranking** of each document as computed by PageRank
- `words.txt`, which stores the relevance of documents to words

It will be the job of your *indexer* to write these files, and the job of your *querier* to read them.

:::info
**Important:** After design checks, we will be providing code to write these files in the indexer, as well as code to read them into the querier.
:::

# Querier

## Main method usage

The main method of your `Querier` should support the following usage:

```shell
$ scala sol.Querier [--pagerank] <titleIndex> <documentIndex> <wordIndex>
```

where, if the user passes the `--pagerank` argument, the querier will take page ranks into account when ranking results. The other arguments are the index files created by `Indexer`.

## REPL (Read-Eval-Print Loop)

To allow a user to interactively use your querier, you will create a REPL. A REPL is a program that:

* prompts the user for input
* processes the input
* prints the result
* repeats

You will practice building a REPL in **Lab 8**, so feel free to refer to that when writing your querier REPL.
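As a preview of that structure, here is a bare-bones REPL skeleton using `scala.io.StdIn`; `getResults` and `idToTitle` below are stubs standing in for your querier's real logic and data (their stub bodies are placeholders, not suggested implementations):

```scala
import scala.io.StdIn

object ReplSketch {
  // Hypothetical stand-ins for your querier's actual logic and index data
  def getResults(query: String): List[Int] = List()
  val idToTitle: Map[Int, String] = Map()

  def runRepl(): Unit = {
    var done = false
    while (!done) {
      print("search> ")
      val line = StdIn.readLine()
      if (line == null || line.trim == ":quit") {
        done = true
      } else {
        val results = getResults(line) // ranked document ids, best first
        if (results.isEmpty) println("Sorry, no results were found.")
        else for ((id, rank) <- results.take(10).zipWithIndex)
          println(s"${rank + 1} ${idToTitle(id)}")
      }
    }
  }

  def main(args: Array[String]): Unit = runRepl()
}
```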
## Behavior

The main way users will use your querier is by running the `main` method of the `Querier` object. Your `main` method should do the following:

1. Create a new `Querier` object, passing in the given index files and whether or not to consider PageRank.
2. Run the querier REPL:
    **a.** Prompt for and read in a user query.
    **b.** Answer the query by scoring its terms (after removing stop words and stemming) against every document and returning the titles of the documents with the top 10 scores. The first document printed should be the one with the highest score, and the last document printed should be the one with the tenth-highest score. If the terms in the query appear in fewer than 10 documents, return only the documents in which they appear. If there are no search results at all, print an informative message.
    **c.** Repeat steps a and b until the user types `:quit`.

## Specifications

Your `Querier` class is defined as follows:

```scala
class Querier(titleIndex: String, documentIndex: String,
              wordIndex: String, usePageRank: Boolean) extends IQuerier {
  ...
}
```

In the constructor of `Querier`, you should read in the index files, initializing appropriate data structures. It is important that you do **all initialization** in the constructor, so that all a caller (someone using your code) needs to do is instantiate a new `Querier` object, passing in index files, and then use it to find results to queries immediately. It is also important that you do **not** begin your REPL in the constructor. The caller can do so by calling the `runRepl` method.

You will notice that your `Querier` class extends a trait (interface) called `IQuerier` that is located in the `src` package. `IQuerier` contains two methods that you must implement in your `Querier`:

```scala
// returns the ranked list of document id results for a given query
def getResults(query: String): List[Int]

// runs a REPL to allow interactive user queries
def runRepl(): Unit
```

This means that your querier should provide two different ways for users to interact with it: by calling `getResults` programmatically or by running the REPL (described above) and inputting queries when prompted. **This should not increase the amount of code you write:** think about how you can organize your code so that you use `getResults` in your REPL.

:::info
This design goes back to the **model-view-controller** concept Kathi has discussed in lecture. You should always work to separate your program logic (controller) from a specific user interface (view). In this case, the view/interface is the REPL you will build to allow users to interact with your program.
:::

## Scoring Documents Against Queries

### Without PageRank

Given a query, you can use the term-document scores that you wrote to the `words.txt` file to locate the most relevant documents. The computation is straightforward: for every document $j$, compute the relevance of $j$ to query $Q$ by summing scores across all terms $i$ in the query (after stemming and removing stop words):

$$
s_{Q,j} = \sum_{i \in Q} {tf}_{ij} \, {idf}_i
$$

Sort the documents by these scores, and return the 10 highest-scored documents as the result of the query. Break ties however you wish.

:::info
**Note:** If you already computed $tf * idf$ in `Indexer` and stored that in `words.txt` rather than simply storing term frequency, then you already have the relevance scores for each document to each term, which makes this scoring step quite simple!
:::
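Here is a sketch of this scoring step. It assumes a hypothetical in-memory index that maps each term to its per-document relevance scores, and a `processQuery` helper that tokenizes, removes stop words, and stems; both are illustrative, not required structure:

```scala
// Score every document against a free text query and return the top 10 ids
def getResults(query: String,
               relevance: Map[String, Map[Int, Double]], // term -> (doc -> score)
               processQuery: String => List[String]): List[Int] = {
  val terms = processQuery(query) // tokenize, remove stop words, stem
  val scores = scala.collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
  // s_Qj = sum over query terms i of tf_ij * idf_i
  for (term <- terms; (doc, s) <- relevance.getOrElse(term, Map.empty))
    scores(doc) += s
  scores.toList
    .sortBy { case (_, score) => -score } // highest score first
    .take(10)
    .map { case (doc, _) => doc }
}
```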
### With PageRank

Recall that PageRank computes a score for each document (which you wrote to the file `docs.txt`). These scores can be combined with the term-relevance scores ($s_{Q,j}$) before sorting the documents and returning results. You can experiment with different combinations and see how that affects the results: **multiplying the PageRank and term-relevance scores, adding them, etc.**

### Allow both with/without PageRank

The user should decide whether or not to use PageRank when querying. The `main` method for your Querier (which takes the index file names) should accept an optional argument, `--pagerank`. If this argument is provided, you will combine PageRank results and term-relevance results when scoring documents. Otherwise, you will base scoring only on term relevance. Here are examples of running your program with and without PageRank:

```
$ scala sol.Querier titles.txt docs.txt words.txt
$ scala sol.Querier --pagerank titles.txt docs.txt words.txt
```

# Demo

For this project, we will be providing you with a web-based demo to show you what your queries may look like. These results are all based on the content of `MedWiki.xml`; you can go to [this link](https://cs18-search.herokuapp.com/pagerank) for a demo that uses PageRank, and [here](https://cs18-search.herokuapp.com) for a demo that does not.

While the results from your querier do not have to be exactly the same as our demo's results, they should be somewhat similar. Especially for PageRank, depending on how you choose to combine relevancy and PageRank scores, your results can end up looking somewhat different.

If you ever have a hard time loading the demo (i.e., it takes a really long time to load), please try reloading the page. Afterwards it should work.

<a name="efficiency"></a>

# Efficiency Considerations

Given the scale of data that you will work with in BigWiki, you will have to pay attention to both time and space efficiency on this project. You don't have to maximize time efficiency, but we want to see you making smart high-level choices about how often your code traverses data. For example, if you take two consecutive passes over the same large data structure when one would suffice, we'd expect you to combine the two passes.

**Indexer Efficiency:** Don't worry if your indexer is relatively slow at indexing a very large corpus. We'd rather see your indexer take 15 minutes to build an index, with querying taking a mere 0.5 seconds, than your indexer take 5 minutes, and querying 1.5 seconds.

**Space Complexity:** This project is the first time you'll really confront space complexity. If your program does not handle space efficiently, you will probably run out of space in the heap, and the program will crash.

You can end up creating a lot of intermediate data structures in this project. Pay attention to when you need to create new data (it's fine if you need it, but don't assume you have to use immutable data just because we've talked about it). Also think about when data-structure choices need to take space usage into account (meaning, you don't *always* have to use the smallest-space structure, but you should if a particular piece of data could take considerable space).

Consider the Big-O space complexity of the data structures you create. How does space grow as the number of pages in the corpus increases? Is it linear? Quadratic?

Also pay attention to not holding onto data longer than necessary. If you have a large data structure in the heap, but no way to reach it from the environment, Java will reclaim that space (garbage collection). One common way to hold onto data too long is to make your methods do too much work (their parameters stay in the environment for as long as they run).
Creating methods that process data in slightly smaller chunks can help control space usage.

**Out of Memory Errors:** It is possible (likely, even) that you will encounter a Java `OutOfMemoryError` exception while running your indexer. When testing your Indexer, we recommend that you increase the size of the heap to 1024 MB in IntelliJ by following [these instructions](https://www.jetbrains.com/help/idea/increasing-memory-heap.html). This is the amount of memory the autograder will allocate your program.

:::info
See the [**Grading**](#Grading) section for more information on how your program's efficiency will be graded.
:::

See the FAQ at the bottom for more optimization tips.

# Testing

You should test the calculations computed by your search engine and other important pieces of functionality using check-expects. You should try various queries, with different types of pages. Document the queries you tried (along with the inputs used to start the programs) and the outputs that you expected in your `README` file. You will **only** receive credit for testing which is clearly documented in your `README` file. (More details on the README appear later.) You are welcome to create your own Wiki files for testing purposes.

# Design Check

## Dates / Sign-up

Design checks will be held from **Tuesday, July 6th** to **Friday, July 9th**. On Saturday, July 3rd, we will send out design check groups, assign each group a TA, and link a Google Appointments calendar for your group to sign up for a design check later in the week.

**Reminder:** You are required to work with a group for at least the design check portion of this project.

## Tasks

At your design check, you must:

* explain how you will parse, tokenize, and stem documents; that is, how you will split them up into individual, lower-case, stemmed terms without stop words
* explain how you will score the relevance of documents to queries
* explain your plan for the data structures you will use to store the information gathered and computed in `Indexer` (term frequency, page ranks, etc.)
* prepare a written specification of the format of your index files: `titles.txt`, `words.txt`, `docs.txt`
* explain which data structures your querier will store this information in when it reads in these files
* walk the TAs through the steps that your program will take to get the results for a free text query
* explain how you plan to implement PageRank. Feel free to ask your friendly TAs as many clarification questions as necessary about PageRank in order to ensure you understand it fully *before* your design check.
* complete the [Design Check SRC portion](#Design-Check-SRC-Searching-for-Authority) (with your partner) and submit it to the **Search: Design Check SRC** Gradescope assignment before your design check

Please submit the written specification to Gradescope. Only one group member needs to submit.

## Summary of the Collaboration Policy

The following table summarizes the collaboration policy for projects.

| | Code Together | Conceptual questions | Debugging help |
| -------- | -------- | -------- | --------- |
| Same group | yes | yes | yes |
| Classmates outside of group | no | yes | yes |
| Course Staff (TAs) | ---- | yes | yes |
| People outside Summer 2021 CS0180 | no | no | no |
# SRC: Search Between the Lines

:::danger
**Note**: Please complete both the design check and final SRC assignments with your partner, if you have one. Do not start the final SRC section until you have written your indexer.
:::

## Design Check SRC: Searching for Authority

:::warning
These questions are also available in this [Google doc](https://docs.google.com/document/d/15rKV6QwAPG0gvw_IcPH9bax1kiMgpZ6gem0oxCIeas4/edit?usp=sharing), which you can copy and use if you'd like.
:::

### Task 1:

List stakeholders for your search engine program, imagining it would be deployed as the new Google search engine. (1-2 sentences)

### Task 2:

In [Lecture 19, Designing for SRC](https://cs18-summer-2021.github.io/static/classes/19/19slidedeck.pdf), we learned to think about the consequences of our design choices on stakeholders before we start to code. With that principle in mind, answer the following questions. When you answer these questions, **tie in your knowledge of what these terms are. Be as specific as possible** -- an ideal answer would demonstrate knowledge of what each term means and how it is deployed in a search algorithm.

1. What is one way that *term frequency* could be exploited to benefit some stakeholder(s) at the expense of others? (3-4 sentences)
2. What is one way that *inverse document frequency* could be exploited to benefit some stakeholder(s) at the expense of others? (3-4 sentences)
3. What is one way that the *PageRank algorithm* could be exploited to benefit some stakeholder(s) at the expense of others? (3-4 sentences)

### How to Hand In/Grading:

These questions should be answered in a PDF. Please submit this to the **Search: Design Check SRC** assignment on Gradescope. **Do not write your name or any other identifiers on the PDF.** Please have one partner submit and add the other partner to the submission using the "add group member" link in the top right corner of the Gradescope submission.

We grade your responses according to [**this rubric**](https://docs.google.com/spreadsheets/d/1kyjing8hVDCSVyHuiy0MB_9LjyCA9HWjKu_s85E6Iig/edit?usp=sharing). Please note that the course collaboration policy also applies to your answers to the Socially Responsible Computing segment.

***Make sure to hand this in before your design check!!!!***

## Final Submission SRC: Search Engines for Whom?

:::warning
This question is also available in this [Google doc](https://docs.google.com/document/d/1Wr0jD_t_Txuv3hXq8zlJKbyUNgII1w8ytyX6XyC-bck/edit?usp=sharing), which you can copy and use if you'd like.
:::

### Task 3:

**Setup** (doesn't need a direct answer): Compare the XML files you used for this project (`SmallWiki`, `MedWiki`, `BigWiki`) to [this file](https://drive.google.com/file/d/1HmSeW70dnQEM3oPdtV7LR7xf3eZCsh4_/view?usp=sharing) (which you can refer to as `SRCWiki`). Consider whether you would be able to apply your strategy for parsing, tokenizing, and stemming to `SRCWiki`. We recommend running your indexer on `SRCWiki` and examining the index files produced to help you answer the following question, but you are not required to do so.

**Question**: With your ideas from the setup in mind, describe TWO assumptions made in the Indexer (either our src files or your code) that make it more difficult to search through a corpus written in a language whose linguistic rules differ from those of English. (4-5 sentences)

:::spoiler **(Optional, not a bonus)** Open if you're curious about Wikipedia and cultural differences!
Given its open access and sheer size, it's common for computer scientists to use Wikipedia as a standard for analyzing language, training machine learning models, and, more generally, as a snapshot of human culture. Wikipedia is where our given XML files are from. Yet Wikipedia represents humanity unevenly. To demonstrate this, we can examine the Oxford Internet Institute's [visualizations of global Wikipedia coverage](https://geography.oii.ox.ac.uk/the-uneven-geography-of-wikipedia/). Click on each of the four maps showing global Wikipedia coverage measured against different properties and read the accompanying text. Let us know your thoughts, if you'd like!
:::

### How to Hand In/Grading:

These questions should be answered in a PDF. Please submit this to the **Search: Final SRC** assignment on Gradescope. **Do not write your name or any other identifiers on the PDF.** Please have one partner submit and add the other partner to the submission using the "add group member" link in the top right corner of the Gradescope submission. If you worked alone on the code, please complete your own SRC assignment.

We grade your responses according to [**this rubric**](https://docs.google.com/spreadsheets/d/1kyjing8hVDCSVyHuiy0MB_9LjyCA9HWjKu_s85E6Iig/edit?usp=sharing). Please note that the course collaboration policy also applies to your answers to the Socially Responsible Computing segment.

# Final Submission

The final submission is **due on Friday, July 16, Anywhere-On-Earth time**. You should submit all your files on Gradescope.

## Code

You should hand in all files in the `sol` package to the Gradescope assignment titled **Search**. You should be submitting the following:

* a completed `Indexer.scala`
* a completed `Querier.scala`
* any other Scala files (if any) you created to help your indexer and querier
* any test files (if any) you created
* a `README.txt` file with:
    * the names of your group members
    * instructions describing how a user would interact with your program
    * a *brief* overview of your design, including how the pieces of your program fit together
    * a description of any known bugs in your program
    * a detailed description of how you tested your program (what queries returned certain results, test files you created, unit tests you wrote, etc.)

You should also submit the Final Submission component of the **SRC** section to the **Search: Final SRC** assignment on Gradescope.

:::danger
**Note:** Do not submit classes from the `src` package. You should only be submitting the files that you edited, which should only be in the `sol` package. Do not move files from `src` to `sol`.
:::

# Grading

As with all CS18 projects, a good design will make coding this project significantly easier, so you should spend a fair amount of time planning your program's design before you begin writing any code.

The design check counts for 15% of your grade, including:

* an explanation of your plan for tokenizing
* an explanation of your plan for scoring documents
* details about the data structures your indexer and querier will use to store the information written to and read from the index files
* a walkthrough of the design of your querier
* a demonstration of your understanding of PageRank

Functionality counts for 70% of your grade, including:

* Correct implementation of Indexer: 23 points
* Correct implementation of Querier: 23 points
* Correct implementation of PageRank: 10 points
* Efficiency: 14 points

As always, partial functionality merits partial credit.
The final 15% of your grade will be reserved for the SRC component, comments, testing, and style. You should include documentation comments and test cases for all non-trivial methods. You should also perform system testing, to test interactions among methods. Additionally, comment any code which would otherwise be unclear.

## Functionality Grading

### Correctness

Correctness will be measured through a combination of autograded system tests and manual grading. We will be checking to see that your results are reasonable and similar to the solution code. While every Search solution will produce slightly different results for a given query depending on design choices, the results of your PageRank will be graded for exact correctness. Making sure your PageRank works on the worked examples from the handout is a good way to ensure you receive full credit!

### Efficiency

We will be running your `Indexer` and `Querier` on both `MedWiki` and `BigWiki`. The 14 efficiency points break down as follows:

* 5 points for indexing efficiency on `MedWiki`
* 5 points for querying efficiency on `MedWiki`
* 4 points for indexing efficiency on `BigWiki`

The autograder for the **Search** assignment on Gradescope will let you know if you hit the required thresholds when you submit, and you can resubmit as many times as you would like.

:::info
**Important:** Indexing `BigWiki` with the given time and space constraints can be tough! We don't want people to spend undue amounts of time on this, so it is **only worth 4 points** out of 100! It is totally possible for your code not to run on `BigWiki` and for you to still get a great grade on the project, so keep that in mind when deciding how to allocate your time!
:::

# Getting Started

Search is a big project, and it is important not to tackle it all at once. The key to starting Search is to start thinking and planning. Always keep in mind the end goals: what information do I need to gather as I go through the text in the Indexer? Given certain information, how can I find the top 10 articles in the Querier?

With the end goals in mind, you can start planning what fields are necessary in the Indexer, what methods you need, and the general flow of what is going to happen (when are we stemming? How many times will we read through the entire text? When should we call helper methods?). Then, when you have a good understanding of these, it will allow you to take a step back and ask, "Is this the best way to do it? Could I make this better/faster?" If you have a good understanding of your plan before you start programming, the project becomes much easier!

Search is very open-ended, so it is important to make your own roadmap. One suggestion is to scope out a smaller version. Get that working, then extend the working version to handle more and more each time. For example, iterating on your indexer might look like:

* start by getting basic XML parsing working: retrieving the id, title, and text for every document
* work on tokenizing the text you parsed and figure out how to store that information
* implement writing that information to index files
* figure out how to handle and store link information
* implement and debug your PageRank
* ...

At each step, you get a core piece of the code working. You test it (using the debugger), then you add a bit more functionality. If you try to tackle the entire thing at once, you're likely to get overwhelmed. Similar incremental iteration will be helpful when implementing your querier as well!
As you build your program, test it on small corpuses, not `BigWiki`, as this will allow you to debug and iterate faster.

# FAQ

**Does the big scary regex we were provided encompass all cases we need to match against?**

Yes!

**How does stopping and stemming work with links?**

When you run into a word that is a link, you'll need to record where that link is pointing, then remove stop words from, and stem, the text that appears in the document. (You will need to remove the double brackets yourself. The regex we give you for extracting words from the documents leaves the links in place, so when you iterate through the words that are extracted using that regex, you will need to identify whether a word is a link as you go and handle it accordingly.)

**Do we check titles for stop words and stem them, or leave them as is?**

Titles count towards term frequencies, and term frequencies are for stopped and stemmed words, so at some point you should process the titles. However, you probably want to keep the unprocessed version wherever you are storing your titles.

**How can I make my indexer more efficient?**

For runtime efficiency, consider how many times you're iterating over information. Think in terms of Big-O. Iterations will be necessary, but is there a way to reduce nested iterations? Could intermediate data structures be helpful?

For space efficiency, here are a few things to keep in mind:

* *Complexity:* consider the Big-O space complexity of different data structures in your program. Do you have any structures that are **quadratic**? Those could be taking up a lot of space!
* *Local environments:* local variables will be inaccessible outside the loop or method that contains them, and will thus be cleared from memory afterward.
* *Shortening collections:* if you're iterating over some temporary collection using a for-loop, consider throwing away elements you **no longer need** as you go along. For example, you might take the head off a list, and immediately reassign the list to be the tail of the original list. One place where this can really come in handy is with your `NodeSeq` of pages: you can turn it into a mutable data structure, using `toArray` or `toBuffer` for example, and continually delete (or set to null) the nodes that you are done using. That is, don't store the `NodeSeq`; convert it into a `ListBuffer` and store that instead. This will allow you to get rid of used pages.
* *Be thoughtful with data structures:* HashMaps are great data structures and will be invaluable in Search, but they also take up a lot of space. Consider trying to reduce the number of HashMaps you store. Similarly, it may be helpful to limit the number of variables you store when parsing. In general, think about the information that is able to be referenced (i.e., bound to a variable you have defined). If you have data that is accessible in a scope where it is no longer needed, this will cause space issues, because Scala will be using memory space to hold unneeded data.

:::info
Space efficiency can also be tied to runtime efficiency. If your data is structured in a way that makes it more difficult to run garbage collection, Scala will spend runtime trying to create space (during garbage collection) instead of doing what you want.
:::

___

Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS18 document by posting on Ed.