---
tags: Projects
---

# Project 2: Search

### Due Date: Friday, March 26th at Anywhere-On-Earth time.

## The Hunt Continues... (optional)

:::spoiler Look here for the next clue!
Somewhere, out there, in the great big world, there is a bear. [Blueno](https://www.browndailyherald.com/wp-content/uploads/2020/08/DSCF9718-e1597688000493.jpeg). The greatest treasure of all time, a joy unspeakable. And the remarkable CS18 Team of Theme is searching for it.

The grandiose, all-powerful Team of Theme has found a super-snazzy kind of top-secret database with information about the beloved bear and its prized treasure. Only, this set of data also happens to have some info about other things as well, like cats. If they ever want to find happiness again, they'll have to parse this database bit by bit, developing a good ole' search engine to warm the soul. To find Blueno.

In `MedWiki.xml`, there lie some clues
Searchable, they are, but first you must choose
Special words in this list,
Paired with numbers you can't miss,
To find the treasure of a Bear who lived the most Blue.

Search these words in the [Search demo](https://cs18-search.herokuapp.com/), and find the number of the result that matches the paired word. The resulting string of integers will be the end of this link (tinyurl.com/<yournumberhere>). For example, ["car", "Open-wheel car"] means that you'd search "car", and find that "Open-wheel car" is result #2. The sequence of the clues produces results in the correct order for the tinyurl.

["bear", "Harappa"]
["brown", "Fonni"]
["treasure", "Mohism"]
["hunt", "Mesolithic"]
["frog", "Cuisine of the Midwestern United States"]
["kathi", "Houston"]
:::

## Introduction

You are probably familiar with a few search engines: [Google](http://www.google.com/), [Bing](http://www.bing.com/), [Duck Duck Go](http://duckduckgo.com/). For this project, you're going to be writing your own search engine! Specifically, your search engine will allow users to query a corpus (a large collection of texts) of Wikipedia pages.

Your search engine will work a lot like a typical search engine. A user will type in what is called a *free text query* (i.e., some words), and your program will return what it judges to be the most relevant pages for that query.

One naive way to build a search engine would be to ask the user to enter a query, and then scan through the corpus looking for documents that match that query. This approach would require many, many traversals of the corpus -- one per query. Doing this for every single web query would *not* be fast.

Instead, most modern search engines consist of two programs, an *indexer* and a *querier*. The indexer preprocesses the corpus before answering any queries, saving files with information about which documents are relevant to which words. (Yes, this takes a long time; but remember, this is a preprocessing step. So, if it takes one hour, or even one day, it does not negatively impact the user experience, because it does not affect query-response time. This is an example of a time vs. space trade-off. You'll use extra space to store the preprocessed information, but that enables your program to answer queries quickly.) When the querier receives a query from a user, it uses the information contained in the file(s) to *quickly* respond to that query.
Once built, an example run of _your_ search engine might look like this:

```
$ scala search.sol.Index MedWiki.xml titles.txt docs.txt words.txt
$ scala search.sol.Query titles.txt docs.txt words.txt
search> cats
1 Kattegat
2 Kiritimati
3 Politics of Lithuania
4 Nirvana (UK band)
5 Autosomal dominant polycystic kidney
6 John Ambrose Fleming
7 Mercalli intensity scale
8 W. Heath Robinson
9 Men at Work
10 Morphology (linguistics)
search> ruler
1 Mohism
2 Manasseh
3 Empress Suiko
4 Gasparo Contarini
5 Islamabad Capital Territory
6 Monarch
7 Jadwiga of Poland
8 Imperialism in Asia
9 Henry the Fowler
10 Michael
search> :quit
```

The following diagram summarizes the components of the search engine. The orange blocks are the inputs (the web pages and the query). The blue boxes are different components of the solution (which you will write). The white boxes are files of intermediate data that get produced and read by the blue boxes. The components split into two main parts (the *indexer* and the *querier*) which you can develop and test independently.

![Search Architecture](https://i.imgur.com/YS3pxIs.png)

The rest of this handout gives details of the parts of this diagram, plus logistics information. Here are links so you can jump to individual sections:

- [The format of the input documents](#The-Input-Documents-and-Intro-to-XML)
- [Processing documents into their essential terms (parsing, etc)](#Processing-Documents-Into-Essential-Terms)
- [Determining relevance of documents to terms](#Determining-Relevance-of-Documents-to-Terms-TF-and-IDF)
- [Determining authority of documents (the PageRank algorithm)](#Determining-Authority-of-Documents-PageRank)
- [Files to save index information](#Saving-Indexing-Information)
- [Running queries](#Handling-Queries)
- Expectations for [efficiency](#Efficiency-Considerations), [testing](#Testing-Expectations), [design checks](#Checks-and-Handin), and [grading](#Grading)
- [Socially Responsible Computing](#Search-between-the-lines:-Socially-Responsible-Computing)
- [Getting Started Guide](#Getting-Started)

## Learning Objectives

From this project, we hope you will:

- Learn how a search engine works
- Practice choosing data structures
- Practice thinking about space usage
- Learn about XML and tools that can help you process it
- Learn about PageRank, which revolutionized searching

---

### These are the evaluation criteria for this project:

- Select and justify a specific data structure for use in a specific problem, citing alignment with problem needs or resource constraints
- Develop classes that protect how their data is used through access modifiers and appropriate choice of provided methods
- Develop a collection of examples and tests that exercise key behavioral requirements of a problem
- Write programs that can compile and run against a provided interface
- Write programs that behave as expected based on a provided description or collection of test cases
- Correctly determine the space complexity of an algorithm

:::danger
Avoid the temptation to split the work, having one partner do the indexer and one do the querier. You need the combination to hit the learning objectives, some of which could arise on the final exam.
:::

## The Input Documents (and Intro to XML)

The collection of Wikipedia pages you'll be searching over is located in `/course/cs0180/src/search/src`. The three main Wikis are called `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml`.
If you list the contents of `/course/cs0180/src/search/src` using `ls -l`, you can see the sizes of these files:

```
$ ls -l
-rw-rw-r--+ 1 kfisler cs0180ta 216466779 Feb 22  2019 BigWiki.xml
-rw-rw-r--+ 1 kfisler cs0180ta  24770652 Feb 22  2019 MedWiki.xml
-rw-rw-r--+ 1 kfisler cs0180ta   1667422 Feb 22  2019 SmallWiki.xml
```

To avoid copying these large files into your own directories, you should instead use symbolic links to access them. To do this, navigate to whichever directory you would like to put them in and use the following command to create a (symbolic) link to the files:

```
ln -s /course/cs0180/src/search/src/<filename>
```

For example, with `BigWiki.xml` as the filename, the above command will create a (symbolic) link called `BigWiki.xml` in the current directory, and that link will link to the file `/course/cs0180/src/search/src/BigWiki.xml`. (You'll also see a `PageRankWiki.xml` file, which we will explain later.)

If it is helpful for you, you can download these XML files from [**here**](https://drive.google.com/drive/folders/1xl--02EKcw3BHd9trIW_1WDiZN5zYfhV?usp=sharing). **NOTE:** BigWiki.xml is a *very* large file (> 200MB), so we highly encourage following the above for testing on BigWiki.xml instead of downloading it locally.

### The Document Format: XML

Programs often need to pass information to one another, and the most convenient way to do so is to use a data-interchange format, meaning a standardized text-based format. *XML*, which stands for eXtensible Markup Language, is one such format. Invented by the World Wide Web Consortium (W3C), XML uses an HTML-style angle bracket specification. Each XML *element* is encompassed by matching opening and closing *tags*. Opening tags are delimited by angle brackets, like this: `<tag>`. Closing tags include a slash, like this: `</tag>`. Text appears between the opening and closing tags of an element:

```*.html
<email>kfisler@cs.brown.edu</email>
```

An XML element can have any number of *attributes*, which are specified within the opening tag, as follows:

```*.html
<university name="Brown">
  <concentration type="Computer Science"></concentration>
  <concentration type="Slavic Studies"></concentration>
</university>
```

The `university` element has the attribute `name` with value `Brown`. The `concentration` elements have the attribute `type` with values `Computer Science` and `Slavic Studies`. The value of an attribute is always enclosed in double quotes.

Note that the two `concentration` elements are nested inside the `university` element. As this nesting reflects the tree structure of XML elements, the `concentration` elements are called *children* of the *parent* element, `university`.

**Pages**

Each Wiki file consists of a single XML element tagged *xml*, within which there are a number of pages (*a.k.a.* documents), each of which is enclosed by opening and closing `page` tags. Each page then contains a `title`, a unique, non-negative integer `id`, and some `text`, all of which are delimited by tags with those names.

**Links**

Links appear in the Wiki files in one of three forms:

* `[[Hammer]]`
* `[[Presidents|Washington]]`
* `[[Category:Computer Science]]`

If a link consists of two parts separated by a pipe, then the text before the pipe is the page to link to, while the text after the pipe is the text to appear on the page. So, for example, the page on which `[[Presidents|Washington]]` appears would include the text `Washington` with a link to the `Presidents` page. If a link appears without a pipe or colon, then its text is both the page to link to and the text that appears on the page.

If a colon appears as part of the link, as in `Category:Computer Science`, then the link is to the *meta-page* for that *namespace*. A meta-page is one that consists of links only (in this case, to the pages in the category `Computer Science`). A *namespace* is a collection of pages with a common purpose. Other common namespaces include `Help`, `File`, and `User`.
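To make the three link forms concrete, here is a hedged sketch of a helper that separates a link's target page from the text that should count toward the page's words. The name `splitLink` and its returned pair are our own inventions, not provided code; the note in the parsing section below spells out which parts count as words.

```scala
// A sketch (our own, not provided code) of handling the three link forms.
// Assumes the surrounding [[ and ]] have already been stripped off.
// Returns (page to link to, text that counts as words on this page).
def splitLink(link: String): (String, String) = {
  if (link.contains("|")) {
    // [[Presidents|Washington]]: link to "Presidents"; only "Washington" counts as words
    val parts = link.split("\\|", 2)
    (parts(0), parts(1))
  } else if (link.contains(":")) {
    // [[Category:Computer Science]]: link to the meta-page; all of the text counts as words
    (link, link.replace(":", " "))
  } else {
    // [[Hammer]]: the same text is both the target and the words
    (link, link)
  }
}
```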
## Processing Documents Into Essential Terms

An XML document has content that isn't relevant to searching, such as the tags (like `<page>`) and little words like "the". The first step in building a search engine is to reduce the document to its essential interesting words. This involves several steps:

- Extracting the text of each document from the XML (*Parsing*)
- Splitting the text into separate words and numbers, removing punctuation and whitespace (*Tokenizing*)
- Removing little *stop words* (like "the") that don't contribute information.
- Reducing words to their *stems*, which are the base roots of words. e.g., "sleep", "sleeps", and "sleeping" have "sleep" as a base root.

After the document is converted to a sequence of stem terms, the indexer computes which documents are relevant to which words and stores this information in the index file. We describe each of these steps in turn.

### Parsing the XML

Scala has an XML library which reads an XML string into a tree of XML Nodes. To use this library, import the XML `Node` and `NodeSeq` classes at the top of the file in which you plan to parse the Wikis:

```scala
import scala.xml.{Node, NodeSeq}
```

Then, to access the main XML node, you use `loadFile` like this:

```scala
val mainNode: Node = xml.XML.loadFile("SmallWiki.xml")
```

Next, to extract specific children, use the selector `\`. For example, to select all the children with the tag "page", use:

```scala
val pageSeq: NodeSeq = mainNode \ "page"
```

You can also further select all the children with the tag "id" like this:

```scala
val idSeq: NodeSeq = mainNode \ "page" \ "id"
```

Alternatively, you can do this same deep search (i.e., search through a node's children, grandchildren, etc.) using the selector `\\`:

```scala
val idSeq: NodeSeq = mainNode \\ "id"
```

Observe that these selectors return an object of type `NodeSeq`. If there is only one node in this sequence, then `pageSeq.text` or `idSeq.text` returns that node's contents. If there is more than one node in this sequence, then calling `.text` returns the concatenation of the text of all the nodes in the sequence.

You can find a short example of a Scala program that uses this library here: `/course/cs0180/src/search/src/NodeSample.scala`. In this example, you will observe that XML syntax is valid Scala syntax, and that it is automatically converted into an XML node. Moreover, a pair of opening and closing curly brackets "escapes" the XML, so that you can write Scala code between the brackets (e.g., `{ otherNode }`).

**Note:** When parsing text, you should consider the link text as words, but not the titles of the documents to link to. In the examples presented in the Links section above, `Hammer` should be used to compute relevance; so should `Washington`, although `Presidents` should not be; and both `Category` and `Computer Science` should be used.
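Putting these selectors together, here is a minimal sketch of how an indexer might visit every page and pull out its title, id, and text. The variable names are our own, and error handling is omitted.

```scala
import scala.xml.{Node, NodeSeq}

// A minimal sketch (not provided code) of walking the pages in a Wiki.
val mainNode: Node = xml.XML.loadFile("SmallWiki.xml")
val pageSeq: NodeSeq = mainNode \ "page"

for (page <- pageSeq) {
  val title: String = (page \ "title").text.trim
  val id: Int = (page \ "id").text.trim.toInt
  val body: String = (page \ "text").text
  // ...tokenize, stop, and stem `body` (see the following sections)...
}
```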
#### Potential Errors

:::spoiler Click if you get the error "Object XML is not a member of package Scala"
If you get an error similar to this, you need to add the XML jar to your dependencies. You can do this by installing a Maven dependency in your IntelliJ project. To do this, follow the same steps from Lab 4 for installing the CSV parser, but:

- If you have scala version 2.12 or lower, search for the dependency `org.scala-lang.modules:scala-xml_2.11:1.0.5` and install it. Or,
- If you have scala version 2.13, search for the dependency `org.scala-lang.modules:scala-xml_2.13:2.0.0-M5` and install it.

You will not have these dependencies installed if you try to run your code in the terminal on your local machine. However, they are installed to run in the terminal on the department machines.
:::

### Tokenizing

Once you have a parsed string, you need to break it into words/numbers and drop punctuation and whitespace. This is a good use for *regular expressions*. Scala has a built-in regular-expression class called `Regex` (in `scala.util.matching`) that you can use to search through a string for constituent patterns. In the following example, we will use a somewhat messy looking regular expression that matches links, words with apostrophes, and words without apostrophes (we explain it shortly).

First, create an instance of a Scala `Regex`:

```scala
import scala.util.matching.Regex

val regex = new Regex("""\[\[[^\[]+?\]\]|[^\W_]+'[^\W_]+|[^\W_]+""")
```

Note that the regular expression is wrapped in triple quotes.

:::spoiler Click for details of what that messy regular expression means
Let's break it down. There are three parts, `\[\[[^\[]+?\]\]`, `[^\W_]+'[^\W_]+`, and `[^\W_]+`, separated by pipes (`|`), which mean "or." So we are actually matching three possible alternatives:

1. `\[\[[^\[]+?\]\]`
    * The meaning: Match two left brackets (`\[\[`) and two right brackets (`\]\]`), making sure there's something in the middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another link was starting inside this link (`[^\[]+?`).
    * Returns links (e.g., "[[Some Wiki Page]]")
2. `[^\W_]+'[^\W_]+`
    * The meaning: Match at least one alphanumeric character (`[^\W_]+`), then an apostrophe (`'`), and then at least one alphanumeric character (`[^\W_]+`).
    * Returns words with apostrophes (e.g., "don't")
3. `[^\W_]+`
    * The meaning: Match at least one alphanumeric character.
    * Returns words (e.g., "dog")
:::

<br>

Next, use the method `findAllMatchIn`, which takes as input a `String`, to find all the matches within that `String`. This method actually returns an iterator over `Match` objects, which contain the substrings matched by the regular expression, as well as some metadata (groups, start index, etc.) that we can discard here. We then traverse these `Match`es to extract the matched substrings.

```scala
// Call findAllMatchIn to get an iterator of Matches
val matchesIterator = regex.findAllMatchIn(someString)

// Convert the Iterator to a List and extract the matched substrings
val matchesList = matchesIterator.toList.map { aMatch => aMatch.matched }
```

**Note**: You are welcome to use a different regular expression if you wish. For example, if you don't want to match numbers, replace each instance of `[^\W_]` in the regular expression with `[^\W_\d]`. You can also split this regular expression up into its parts, so that you can search for only links (using only the first part), or only words (using the last two parts joined with a pipe).
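For reference, here is what those pieces might look like folded into one reusable helper. This is a sketch: the name `tokenize` is ours, and you may want to handle links and lowercasing separately.

```scala
import scala.util.matching.Regex

// A hedged sketch of a tokenizing helper built from the pieces above.
val tokenRegex = new Regex("""\[\[[^\[]+?\]\]|[^\W_]+'[^\W_]+|[^\W_]+""")

def tokenize(text: String): List[String] =
  tokenRegex.findAllMatchIn(text).toList.map(aMatch => aMatch.matched)
```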
### Stop Words

Some words appear quite often in a language, but contribute little to meaning: words like "the", "in", "an", or "or". Such words are known as *stop words*. For example, in the phrase "the cat in the hat", the relevant words are "cat" and "hat".

To help you identify stop words, we've provided you with a `StopWords` object, which exposes a single method, `isStopWord`, that takes as input a `String` (i.e., a single word), and returns `true` if the word is a stop word, and `false` otherwise.

### Stemming

Words have variations that aren't relevant to searching. Consider the words "sleep", "sleeps", and "sleeping" -- we'd like all of these to match the query term "sleep". The process of reducing a word to its base is called *stemming*. We have provided a `PorterStemmer` class with two methods:

- `stem` takes as input a string of words and returns that same string, but with the words stemmed
- `stemArray` takes as input an array of strings, each consisting of a single word, and returns an array of those same words, again with the words stemmed.

The outputs of the stemming process are called *terms*. The indexer will determine which documents are relevant for which terms.
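Combining tokenizing, stop-word removal, and stemming, the document-processing pipeline might look roughly like this. This is a sketch built on the hypothetical `tokenize` helper above; it assumes `PorterStemmer` can be instantiated with `new` (check the provided code), and it glosses over link handling (links need their brackets stripped and their targets recorded before this step).

```scala
// A hedged sketch of turning raw page text into terms.
val stemmer = new PorterStemmer()

def toTerms(text: String): List[String] =
  tokenize(text)
    .map(word => word.toLowerCase)                // normalize case
    .filter(word => !StopWords.isStopWord(word))  // drop stop words
    .map(word => stemmer.stem(word))              // reduce to stems
```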
## Determining Relevance of Documents to Terms (TF and IDF)

Processing documents as described (parsing, tokenizing, and stemming) converts documents from text into a sequence of terms. Likewise, a query (after parsing, tokenizing, and stemming) is a sequence of terms. Consequently, to measure the relevance of a document to a query -- that is, to score the document -- it suffices to compare these two sequences of terms.

There are two key ideas that the similarity metrics used by most practical search engines capture:

* A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times. This notion is called the **Term Frequency**.
* A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance when it occurs than a term that appears in many documents. This notion is called **Inverse Document Frequency**.

We discuss how to compute each in turn.

### Term Frequency, $\pmb{tf}$

Given a list of all the terms that occur across all of the documents, one can count how many times each term appears in each document. These raw counts are only so useful for comparing documents, however: the term "cat" might appear less often in one document simply because the document is very short. We therefore want to normalize the counts across the documents. We can do this by dividing the count of a term by the count of the most frequently used term in the same document.

Formally, let $c_{ij}$ be the number of times that term $i$ appears in document $j$. Then, let $a_j = \max_i c_{ij}$ be the number of occurrences of the most frequently occurring term in document $j$. We then define the *term frequency* as ${tf}_{ij} = c_{ij} / a_j$.

### Inverse Document Frequency, $\pmb{idf}$

To capture the second idea, we want to scale up the term frequency of words that occur in fewer documents in the corpus. One useful scaling factor, called *inverse document frequency* ($idf$), is computed as follows. Let $i$ be a term in the corpus, $n$ be the total number of documents, and $n_i$ be the number of documents that contain term $i$. Then,

$$
{idf}_i = \log \frac{n}{n_i}
$$

(The use of log makes this factor less sensitive to large values of $n / n_i$.)

### Scoring Word/Document Relevance

Given a term $i$ (from anywhere in the corpus) and a document $j$, we multiply the term frequency and the inverse document frequency to yield the *relevance score* of $j$ to $i$. These values get recorded in the *words.txt* file that the indexer saves for the querier to use later. The section on the querier explains how to use these scores to decide which documents to report from the search.

Here is a [YouTube video on how the TF and IDF algorithms work](https://www.youtube.com/watch?v=vrjAIBgxm_w) that one of the HTAs found helpful when doing this project.

:::spoiler Click for a worked example of computing the frequencies and scores
Consider a corpus consisting of the following three documents:

1. the dog bit the man
2. the dog ate the cheese
3. the cheese bit the cheese

Our corpus then contains the words {"the", "dog", "bit", "man", "ate", "cheese"}. For each document, we create a mapping of these words to their counts:

* {2, 1, 1, 1, 0, 0} (document 1 has "the" 2 times, "dog" once, etc)
* {2, 1, 0, 0, 1, 1}
* {2, 0, 1, 0, 0, 2}

With this information, we can compute the term frequencies $\pmb{tf}_{ij} = c_{ij} / a_j$:

|       | $i_0$ (the) | $i_1$ (dog) | $i_2$ (bit) | $i_3$ (man) | $i_4$ (ate) | $i_5$ (cheese) |
|-------|---|---|---|---|---|---|
| $j_0$ | 1 | 1/2 | 1/2 | 1/2 | 0 | 0 |
| $j_1$ | 1 | 1/2 | 0 | 0 | 1/2 | 1/2 |
| $j_2$ | 1 | 0 | 1/2 | 0 | 0 | 1 |

Next, we compute the inverse document frequencies for each word $i$ using $\pmb{idf}_i = \log \frac{n}{n_i}$:

| $\pmb{idf}$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
|---|---|---|---|---|---|---|
| | $\log(3/3)$ | $\log(3/2)$ | $\log(3/2)$ | $\log(3/1)$ | $\log(3/1)$ | $\log(3/2)$ |
| | 0 | 0.405 | 0.405 | 1.099 | 1.099 | 0.405 |
:::
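To make the arithmetic concrete, here is a hedged sketch of the score computation. The maps are hypothetical stand-ins for whatever structures your indexer actually builds, and the helper name is our own.

```scala
// A sketch (not provided code) of the tf * idf relevance score.
//   counts(term)(docId)  -- times `term` appears in document docId (c_ij)
//   maxCounts(docId)     -- count of the most frequent term in docId (a_j)
//   docsContaining(term) -- number of documents containing `term` (n_i)
def relevance(term: String, docId: Int, totalDocs: Int,
              counts: Map[String, Map[Int, Double]],
              maxCounts: Map[Int, Double],
              docsContaining: Map[String, Int]): Double = {
  val tf = counts(term)(docId) / maxCounts(docId)                // tf_ij = c_ij / a_j
  val idf = math.log(totalDocs.toDouble / docsContaining(term))  // idf_i = log(n / n_i)
  tf * idf
}
```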
<a name="pagerank"></a>

## Determining Authority of Documents (PageRank)

One could determine which documents are relevant to a query just by comparing the terms in the query to those in the documents. This is how search engines used to work. Then, in the 1990s, Google was founded using a different approach based on the PageRank algorithm (designed by the founders). The algorithm ranks pages based on the links among them (without considering the content). Pages with high scores (i.e., page ranks) are then thought of as *authoritative*. Given a query, the most authoritative documents related to the words in the query are returned.

Here are the key principles underlying the design of PageRank:

1. **The more pages that link to a page $j$, the more authoritative $j$ should be.** For example, if "Brown Blog" is linked to by 5 web pages, and the Wikipedia article on "Providence, Rhode Island" is linked to by 500 pages, then it makes sense to consider the Wikipedia article more authoritative.
2. **The more authoritative those pages are, the still more authoritative $j$ should be.** Now, if "Providence Place Mall" is linked to only by "Providence, Rhode Island", and "Blueno" is linked to only by "Brown Blog", it might not be enough to measure a page's authority simply by a count of the number of pages that link to that page. Indeed, it makes sense to consider "Providence Place Mall" more authoritative than "Blueno", since it is linked to by a more important page.
3. **The fewer links those pages have to pages other than $j$, the more authoritative $j$ should be.** Assume "Research at Brown" is linked to only by a "NYTimes" article which links to only 10 other pages, while "Why Brown?" is linked to only by "Brown Blog", which links to 200 other pages. Because the "NYTimes" page has only a few links, and "Brown Blog" has many links, a link from the "NYTimes" page can be considered to be more important than a link from the "Brown Blog", leading us to attribute more authority to "Research at Brown" than "Why Brown?"
4. **The closer $j$ is to another page $k$ (measured by the number of links that must be followed to get from $j$ to $k$), the more $k$'s score should influence $j$'s.** For example, if the average path from "Brown University" to "Boeing" is 100 links, and the average path from "Brown University" to "Amy's Personal Website" is 5 links, then, all other things being equal, Amy's page should be considered more authoritative than Boeing's.

PageRank uses these principles to compute the authority of each document. The algorithm works roughly as follows: Each page's authority is a number between 0 and 1, where the total authority across all documents always equals 1. The algorithm starts by assuming that all documents have the same authority. Then, following the principles, the algorithm updates the authorities based on the links between pages. Since the authority of one page can affect that of another, this process of refining authorities repeats until the authorities stabilize across all documents. (The underlying mathematics guarantees eventual stabilization.)

Here is a [YouTube video on how PageRank works](https://www.youtube.com/watch?v=v7n7wZhHJj8) that one of the HTAs found useful when doing this project.

We use the term *weight* to capture the impact of links from one page to another. Let $k$ and $j$ be pages, where $k$ has a link to $j$. We denote the *weight that $k$ gives to $j$* as $w_{jk}$. The weights in turn are used to compute a page's rank.

**Computing rank:** For a page $j$, the rank $r_j$ is defined as follows:

$$
\begin{equation} r_j = \sum_k w_{jk} r_k \end{equation} \hspace{4em} [1]
$$

**Computing weights:** The weight is defined by the following formula, where $n$ is the total number of pages and $n_k$ is the number of (unique) pages that $k$ links to. $\epsilon$ is a small number (e.g., $\epsilon = 0.15$), the use of which helps achieve the 4th principle.

$$
\begin{equation} w_{jk} = \left\{ \begin{array}{ll} \frac{\epsilon}{n} + (1 - \epsilon) \frac{1}{n_k} & \mbox{if $k$ links to $j$} \\ \frac{\epsilon}{n} & \mbox{otherwise} \end{array} \right. \end{equation} \hspace{4em} [2]
$$

**Special cases:** There are a few special cases to handle when defining the weights $w_{jk}$ (before adjusting those weights by $\epsilon$):

* Links from a page to itself are ignored.
* Links to pages outside the corpus are ignored.
* Pages that link to nothing can be considered to link (once) to everywhere, except to themselves.
* Multiple links from one page to another should be treated as a single link.

One idea of this weighting scheme is to attribute more authority to page $j$ if it is one of only a few pages linked to by page $k$ than if it were one of many.

:::spoiler Click if you want to understand the underlying mathematics
Starting from some initial ranking $\pmb{r}^0$, Equations 1 and 2 begin unfolding like this:

$$
\begin{eqnarray}
r_j^1 &=& \sum_k w_{jk} \, r_k^0 & [3]\\
r_j^2 &=& \sum_k w_{jk} \, r_k^1 & [4]\\
&=& \sum_k w_{jk} \left( \sum_{l} w_{kl} r_{l}^0 \right) \\
&=& \sum_{k,l} w_{jk} \, w_{kl} \, r_{l}^0
\end{eqnarray}
$$

Looking at the first equation, we can see that the first update to the ranking of page $j$ depends on the weights $w_{jk}$, which encode the values that each page $k$ ascribes to page $j$. Likewise, from the last equation, we can see that the second update to the ranking of page $j$ depends on the multiplicative term $w_{jk} \, w_{kl}$.
This term encodes not just the value that each page $k$ ascribes to page $j$, but rather the value that each page $k$ ascribes to page $j$ multiplied by the value that each page $l$ ascribes to page $k$. Embedded in this multiplicative term is the factor $\epsilon^2$. Likewise, embedded in the calculations of $r_j^3$ is the factor $\epsilon^3$.

The $\epsilon$ in these equations can be interpreted as a dampening factor, because pages that are only one hop (i.e., link) away are adjusted only by $\epsilon$, but pages that are two hops away are adjusted by $\epsilon^2$, and so on. Indeed, the fewer hops there are from $j$ to $k$, the more $j$'s score is influenced by $k$'s.

If you continue the iterative calculations spelled out in Equations 3 and 4, you will find that the rankings eventually converge. Indeed, the rankings to which this process converges are the unique rankings that satisfy Equation 1.

**Note:** We sort of swept things under the rug when we first wrote down Equation 1. What we should have written, to be more precise, was: at the $i$th iteration,

$$
r_j^{i+1} = \sum_k w_{jk} r_k^i
$$

But now that we know that this iterative process converges, Equation 1 is justified.
:::

### Pseudocode for PageRank

Here is some pseudocode for PageRank (it's an image that uses the notation $w(j)(k)$ instead of $w_{jk}$). The variables $r$ and $r'$ represent page rankings (authorities), which consist of a number between 0 and 1 for each page. $r'$ is the current ranking, which gets updated based on a copy $r$ of that ranking in each round. The ranking has stabilized when $r'$ and $r$ are close to each other, as defined below the pseudocode.

![](https://i.imgur.com/i7ZFa9Q.png)

**Distance between rankings:** The difference between two numeric rankings like these is commonly measured by computing Euclidean distance, which has the following formula (where $i$ ranges over the pages):

$$
\sqrt{\sum_i (r'_i - r_i)^2}
$$

Once this distance is sufficiently small (say, less than $0.001$), you can declare the ranking stable, and output $r'$ (or $r$, since they are so close) as your final ranking.

### PageRank Example

Let's consider a concrete example of PageRank on three pages connected as follows:

![](https://i.imgur.com/tsFCJ7b.png)

Computing weights according to Equation 2, we get:

$$
\begin{align}
w_{AB} = w_{BA} = w_{CA} = w_{CB} &= \frac{0.15}{3} + \frac{0.85}{2} = 0.475\\
w_{AC} &= \frac{0.15}{3} + 0.85 = 0.9\\
w_{AA} = w_{BC} = w_{BB} = w_{CC} &= \frac{0.15}{3} = 0.05
\end{align}
$$

Then, we can use those weights to iterate over page ranks for A, B, and C until they stabilize.

| iteration | rank(A) | rank(B) | rank( C) |
| --------- | ------- | ------- | ------- |
| 0 | 0.3333 | 0.3333 | 0.3333 |
| 1 | 0.4750 | 0.1916 | 0.3333 |
| 2 | 0.4148 | 0.2519 | 0.3333 |
| 3 | 0.4404 | 0.2263 | 0.3333 |
| 4 | 0.4295 | 0.2372 | 0.3333 |
| 5 | 0.4341 | 0.2325 | 0.3333 |
| 6 | 0.4322 | 0.2345 | 0.3333 |
| 7 | 0.4330 | 0.2337 | 0.3333 |
| 8 | 0.4326 | 0.2340 | 0.3333 |

> Table 1: Convergence of ranks for three pages A, B, and C.

Here's another, less worked through example:

![](https://i.imgur.com/fEjJLEJ.png)

| rank(A) | rank(B) | rank(C ) | rank(D) |
| ------- | ------- | ------- | ------- |
| 0.2018 | 0.0375 | 0.3740 | 0.3867 |
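Putting the pseudocode and Equation 2 together, here is a hedged sketch of the iterative update in Scala. The `pages` list and the `linksTo` map are hypothetical stand-ins for whatever your indexer builds, and the map is assumed to already reflect the special cases listed above.

```scala
// A sketch (not provided code) of the PageRank loop from the pseudocode.
// linksTo(k) is the set of pages that k links to, after the special cases.
def pageRank(pages: List[Int], linksTo: Map[Int, Set[Int]]): Map[Int, Double] = {
  val epsilon = 0.15
  val n = pages.size

  // Equation 2: the weight that page k gives to page j
  def weight(k: Int, j: Int): Double =
    if (linksTo(k).contains(j)) epsilon / n + (1 - epsilon) / linksTo(k).size
    else epsilon / n

  var ranks: Map[Int, Double] = pages.map(p => p -> 1.0 / n).toMap // start uniform
  var distance = Double.MaxValue
  while (distance > 0.001) {                                       // until stable
    val previous = ranks
    // Equation 1: r_j = sum over k of w_jk * r_k
    ranks = pages.map(j => j -> pages.map(k => weight(k, j) * previous(k)).sum).toMap
    // Euclidean distance between the old and new rankings
    distance = math.sqrt(pages.map(i => math.pow(ranks(i) - previous(i), 2)).sum)
  }
  ranks
}
```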
### Testing PageRank

We have provided you with a sample wiki file, `PageRankWiki`, made to test the results of your PageRank implementation (separately from the rest of your code). The wiki includes several pages where one page (**Page 100**) has an overwhelmingly larger number of pages linking to it compared to the rest of the pages in the wiki. We recommend checking the values your PageRank implementation outputs for each of the documents and confirming that **Page 100** has a much higher score than the others.

## Saving Indexing Information

Your Indexer will save three files:

- `titles.txt`, which maps document IDs to document titles
- `docs.txt`, which stores the rankings computed by PageRank
- `words.txt`, which stores the relevance of documents to words

It will be the job of your indexer to write these files, and the job of your querier to read them. After design checks, we will be providing code to write these files from the indexer, as well as code to read them into the querier.

<a name="queries"></a>

## Handling Queries

Your querier should do the following:

1. Read the given files from the Indexer into a data structure of your choosing
2. Prompt the user to enter a free-text query
3. Answer the query by scoring its terms (after removing stop words and stemming) against every document and returning titles of the documents with the top 10 scores. The first document printed should be the one with the highest score, and the last document printed should be the one with the tenth-highest score. If the terms in the query appear in fewer than 10 documents, return only the documents on which they appear. If there are no search results at all, print an informative message.
4. Repeat steps 2 and 3 until the user types `:quit`.

To help you explore the impact of PageRank, your querier will take an argument that indicates whether to consider PageRank scores when ranking documents.

### Scoring Documents against Queries Without PageRank

Given a query, you can use the term-document scores that you wrote to the *words.txt* file to locate the most relevant documents. The computation is straightforward: for every document $j$, compute the relevance of $j$ to query $Q$ by summing scores across all terms $i$ in the query (after stemming and removing stop words):

$$
s_{Q,j} = \sum_{i \in Q} {tf}_{ij} \, {idf}_i
$$

Sort the documents by these scores, and return the 10 highest-scored documents as the result of the query. Break ties however you wish.

### Scoring Documents Including PageRank

Recall that PageRank computes a score for each document (which you wrote to the file *docs.txt*). These scores can be combined with the term-relevance scores ($s_{Q,j}$) before sorting the documents and returning results. You can experiment with different combinations and see how that affects the results (e.g., adding the PageRank and term-relevance scores, multiplying them, other ideas).

### Users decide whether to search with/without PageRank

The main method for your Querier (which takes the index file names) should take an optional argument `--pagerank`. If this argument is provided, you will combine pagerank results and term relevance results when scoring documents. Otherwise, you will base scoring only on term relevance. Here are examples of running your program with and without PageRank:

```
$ scala search.sol.Query titles.txt docs.txt words.txt
$ scala search.sol.Query --pagerank titles.txt docs.txt words.txt
```

### REPL (Read-Eval-Print Loop)

A REPL is a method that prompts the user for input, processes the input, prints the result, and repeats. This is the method that does all of the communication with a user.
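A minimal sketch of such a loop, reading from standard input; `answerQuery` is a hypothetical stand-in for your scoring-and-printing logic, and the names are our own.

```scala
import scala.io.StdIn

// A hedged sketch of the query REPL (not provided code).
def answerQuery(query: String): Unit =
  println(s"(stand-in) results for: $query") // your top-10 printing goes here

def repl(): Unit = {
  var input = StdIn.readLine("search> ")
  while (input != ":quit") {
    answerQuery(input) // anything other than :quit -- even "quit" -- is a query
    input = StdIn.readLine("search> ")
  }
}
```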
You have seen code that follows this cycle a few times this semester: the `loginScreen` in the Banking example from lecture, the I/O lab, and the RegExp lab give samples. For the search engine, you should continue processing queries until the user types `:quit` at the prompt. If the user types `quit` without the colon, your search engine should treat it as a query.

## Demo

For this project, we will be providing you with a web-based demo to show you what your queries may look like. These results are all based on the content of MedWiki.xml, and you can go to [this link](https://cs18-search.herokuapp.com/pagerank) for a demo that uses pagerank, and [here](https://cs18-search.herokuapp.com) for a demo that does not.

While the results from your querier do not have to be exactly the same as our demo's results, they should be somewhat similar. Especially for pagerank, depending on how you choose to combine relevancy and pagerank scores, your results can end up looking somewhat different.

If you ever have a hard time loading the demo (like it takes a really long time to load), please try reloading the page. Afterwards, it should work.

<a name="efficiency"></a>

## Efficiency Considerations

Given the scale of data that you will work with in BigWiki, you will have to pay attention to both time and space efficiency on this project. You don't have to maximize time efficiency, but we want to see you making smart high-level choices about how often your code traverses data. For example, if you take two consecutive passes over the same large data structure when one would suffice, we'd expect you to combine the two passes.

**Indexer Efficiency:** Don't worry if your indexer is relatively slow at indexing a very large corpus. We'd rather see your indexer take 15 minutes to build an index, with querying taking a mere 0.5 seconds, than your indexer take 5 minutes, and querying 1.5 seconds.

**Space Complexity:** This project is the first time you'll really confront space complexity. You can end up creating a lot of intermediate data structures in this project. Pay attention to when you need to create new data (it's fine if you need it, but don't assume you have to use immutable data just because we've talked about it). Also think about when data structure choices need to take space usage into account (meaning, you don't *always* have to use the smallest-space structure, but you should if a particular piece of data could take considerable space).

Also pay attention to not holding onto data longer than necessary. If you have a large data structure in the heap, but no way to reach it from the environment, Java will reclaim that space (garbage collection). One common way to hold onto data too long is to make your methods do too much work (parameters go into the environment). Creating methods that process data in slightly smaller chunks can help control space usage. Lastly, be sure to close any Readers and Writers as soon as you are done using them; and similarly, do not open them prematurely.

**Out of Memory Errors:** It is possible (likely, even) that you will encounter a Java `OutOfMemoryError` while running your indexer. If you do, try re-running your program with the argument `-J-Xmx512m`:

```
scala -J-Xmx512m search.sol.Index <arguments>
```

The `-J` means it's an argument for your JVM, the `-Xmx` tells the JVM how much memory it can use, and `512m` means 512 megabytes. The maximum amount of memory you are permitted to use on a department machine is `-Xmx2g`, meaning up to 2 gigabytes of memory.
But be forewarned: allowing the JVM to use this much memory can drastically slow down your machine's other processes. See the FAQ at the bottom for optimization tips.

## Running On Department Machines

Please take a look at [**this guide**](https://docs.google.com/document/d/129pYrPL3pSUSC9RYc7bW3JEn6bs6wVNpfpATQfyOKbE/edit?usp=sharing).

<a name="testing"></a>

## Testing Expectations

You should test the calculations computed by your search engine and other important pieces of functionality using check-expects. You should try various queries, with different types of pages. Document the queries you tried (along with the inputs used to start the programs) and the outputs that you expected in your `README` file. You will **only** receive credit for testing which is clearly documented in your `README` file. (More details on the README appear later.) You are welcome to create your own Wiki files for testing purposes.

<a name=written></a>

## Search between the lines: Socially Responsible Computing

:::danger
**Note**: this SRC assignment must be completed individually, not as a group. Start this section *after* you have written your indexer and querier.
:::

### SRC Part 1: The Coverage of Wikipedia

“*Arguably, the largest, most used, and most influential single web platform on which people are creating layers of information about our planet is Wikipedia. In August 2018, the text of the English Wikipedia is equivalent to 2,700 volumes of the Encyclopedia Britannica. On any given day, 15 percent of all Internet users access it. It exists in over 300 languages; 60 of those language versions have over 100,000 articles, and the English one alone contains close to six million.*” --Oxford Internet Institute

Given its open access and sheer size, it’s common for computer scientists to use Wikipedia as a standard for analyzing language, training machine learning models, and, more generally, as a snapshot of human culture. It’s where we got the `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml` that we have provided for your search engine, and it’s also what Google uses to [inform every search you do](https://www.blog.google/products/search/search-language-understanding-bert/) on its search engine by training its [machine learning models](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) on a collection of the most popular languages’ Wikipedia sites. Yet, Wikipedia represents humanity unevenly.

**Task 1.1**: To understand the depth of this unevenness, we can examine the Oxford Internet Institute’s visualizations of global Wikipedia coverage. Click on each of the four [maps showing global Wikipedia coverage](https://geography.oii.ox.ac.uk/the-uneven-geography-of-wikipedia/) measured against different properties and read the accompanying text.

**Task 1.2**: In 3-5 sentences, describe 3 aspects of these findings that you found interesting or surprising.

### SRC Part 2: Linguistic Assumptions

:::danger
You do **NOT** need to alter your existing indexer or querier code for the following tasks.
:::

**Task 2.1**: Download the SRC.xml file linked [here](https://drive.google.com/file/d/1HmSeW70dnQEM3oPdtV7LR7xf3eZCsh4_/view?usp=sharing). Examine its contents, then run your search indexer and querier on it.

**Task 2.2**: What worked and what broke? With reference to the components of your Indexer, describe 3 assumptions in your code or in the provided `src` code that make it difficult to search through a corpus written in a language whose linguistic rules differ from those of English. [1 paragraph]
**Task 2.3**: Suppose that your Search project will be used in Google’s next search engine update. Fill in the first row *(Fairness/Inclusivity)* of the social threats framework table from [Lecture 18: Identifying Social Threats](https://brown-cs18-master.github.io/content/lectures/18socialthreats/18socialthreats.pdf) with regards to your search engine. Don’t worry about making your answers in each cell perfect; this task (only Task 2.3) will be graded for completion, as we mainly want to see what your thoughts are so far.

### SRC Part 3: Search engines for whom?

**Task 3.1**: Watch [34:41-42:13] of UCLA professor Ramesh Srinivasan’s [lecture](https://www.youtube.com/watch?v=QI-6zmS9wa0), *Whose Global Village?*, with Talks at Google (or, if you have a Quartz account, read his [article](https://qz.com/919715/googles-algorithms-and-search-results-reinforce-western-world-views-not-disrupt-them/)).

**Task 3.2**: Listen to [27:26 to 32:26] of this [Data and Society podcast episode](https://datasociety.simplecast.com/episodes/algorithms-of-oppression) with Dr. Safiya Umoja Noble, associate professor at UCLA for Information Studies, about *Algorithms of Oppression*.

**Task 3.3**: With reference to Noble, Srinivasan, and what you have learned from Part 1 and Part 2, answer the following: can search algorithms be fair? If so, what should they encompass and why? If not, why? [2-3 paragraphs]

**Task 3.4**: Copy your table from Task 2.3 and, using what you have learned and the guidance from the [questions table](https://docs.google.com/document/d/1PDYOcwgQmfSLS7YAAzoUlwoytvRr7_gDwM-BRkPa548/edit), update your answers for the *Fairness/Inclusivity* row for the *Data*, *Agency* and *Algorithm* columns.

## Getting Started

Search is a big project, and it is important not to tackle it all at once. The key to starting Search is to start thinking and planning. Always keep in mind the end goals: what information do I need to gather as I go through the text in the Indexer? Given certain information, how can I find the top 10 articles in the Querier?

With the end goals in mind, you can start planning what fields are necessary in the Indexer, what methods you need, and the general flow of what is going to happen (when are we stemming? How many times will we read through the entire text? When should we call helper methods?). Then, when you have a good understanding of these, it will allow you to take a step back and ask "is this the best way to do it? Could I make this better/faster?" If you have a good understanding of your plan before you start programming, the project becomes much easier!

Search is very open-ended, so it is important to make your own roadmap. One suggestion is to scope out a smaller version. Get that working, then extend the working version to handle more and more each time. For example, here you have a querier and an indexer, with support files to give you the data files that connect them. You could:

* just start with writing the querier methods, ignoring the REPL
* start with queries that match terms exactly, without stemming, stop words, etc
* once you can handle simple queries (like "cat"), try queries with stop words
* then extend the queries again to need stemming
* then hook up the REPL
* ...

At each step, you get a core piece of the code working. You test it, then you add a bit more functionality. If you try to tackle the entire thing at once, you're likely to get overwhelmed. Work out a step-by-step plan like this. If you want us to review yours, post on Piazza!
We're happy to help.

*edited from a Piazza answer by Isabel Lai & Kathi Fisler*

## FAQ

**Does the big scary regex we were provided encompass all cases we need to match against?**

Yes!

**How does stopping and stemming work with links?**

When you run into a word that is a link, you'll need to record where that link is pointing, then remove stop words and stem the text that appears in the document. (You will need to remove the double brackets yourself. The regex we give you for extracting words from the documents leaves the links in place, so when you iterate through the words that are extracted using that regex, you will need to identify if a word is a link as you go and handle that.)

**Do we check titles for stop words and stem them or leave them as is?**

Since titles count towards term frequencies, and term frequencies are for stopped and stemmed words, at some point you should process the titles. However, you probably want to keep the unprocessed version wherever you are storing your titles.

**Why are TF and IDF in the Querier?**

We calculate them in the Querier because we would only need to make the computations for the words pertaining to the query. Calculating them in the Indexer is doing more computation than what is necessary. However, you should calculate page ranks in the Indexer.

**How can I get my indexer to be more efficient?**

For runtime efficiency, consider how many times you're iterating over information. Think in terms of Big-O. Iterations will be necessary, but is there a way to reduce nested iterations? Could intermediate data structures be helpful?

For space efficiency, here are a few things to keep in mind:

1) *Local environments*: local variables will be inaccessible outside the loop or method that contains them, and thus effectively be "cleared" from memory afterward. For this reason, we encourage you to break your code up into smaller processing phases which run one after another.

2) *Reassigning*: if you have a large object and you want to get rid of it, all you have to do is make it inaccessible. The garbage collector will come along to clear it:

```
// define as a variable
var bigList: List[String] = List[String]("big", ... "strings")
.
.
.
// reassign so that the original big list is now inaccessible
bigList = List[String]()
```

Reassignment is also useful if you're declaring many variables, for example, within a loop. Instead of clearing out a space in memory for each iteration, you can mutate the same slot in memory. For example:

```
var x: String = ""
for (i <- list) {
  x = i
}
```

is more space efficient than:

```
for (i <- list) {
  val x = i
}
```

3) *Shortening collections*: if you're iterating over some temporary collection using a for-loop, consider throwing away elements you **no longer need** as you go along. For example, you might take the head off a list, and immediately reassign the list to be the tail of the original list. One place where this can really come in handy is with your `NodeSeq` of pages: you can turn it into a mutable data structure, using `toArray` or `toBuffer` for example, and continually delete/set to null the nodes that you are done using. That is, don't store the NodeSeq, but convert it into a ListBuffer and store this instead. This'll allow you to get rid of used pages.

4) *Reconsider data structures*: `HashMaps` take up a lot of space. Consider cutting down on these, especially if you have more than the minimum required `HashMaps`. Along with this, it could be helpful to limit the number of variables you store.
*edited from Piazza answers by Isabel Lai & an anonymous student*

In general, think about the information that is able to be referenced (i.e., bound to a variable you have defined). If you have data that is accessible in a scope where it is no longer needed, this will cause space issues, because Scala will be using memory to hold unneeded data.

Space efficiency can also be tied to runtime efficiency. If your data is structured in a way that makes it more difficult to run garbage collection, Scala will spend runtime trying to create space (during garbage collection) instead of doing what you want.

<a name="checks"></a>

# Project Tasks

Forming Groups
-------

Please keep in mind the [project logistics](https://hackmd.io/@cs18-spring-2021/r1WrOcky_) when forming groups.

:::warning
Once you have a group, please fill out [**this form**](https://docs.google.com/forms/d/e/1FAIpQLSf4eVpLTZnTt0Q9khth9ORkaGW1TjU4V-GGcchP7Vk26A-nbQ/viewform?usp=sf_link) indicating your group. **You will not be able to sign up for a design check until you submit this form**.
:::

If you need help finding a project group, please take a look at [this Piazza post](https://piazza.com/class/kjqgmueegkh6j2?cid=5).

### Summary of the Collaboration Policy

The following table summarizes the collaboration policy for projects.

| | Code Together | Conceptual questions | Debugging help |
| -------- | -------- | -------- | --------- |
| Same Subgroup | yes | yes | yes |
| Same Group, Different Subgroup | no | yes | yes |
| Classmates beyond group | no | yes | no |
| Course Staff (TAs) | | yes | yes |
| People outside Spring 2021 CS0180 | no | no | no |

___

Design Check
------------

Design checks will be held from Thursday, March 18th through Saturday, March 20th. You can only sign up for design checks once your group has filled out the form in the yellow box above. **Reminder:** You are required to work with a group for at least the design check portion of this project.

For the design check, you must do the following:

* Explain how you will parse, tokenize, and stem documents; that is, how you will split them up into individual, lower case, stemmed terms that are not stop words.
* Explain how you will score documents.
* Prepare a written specification of the format of your index file(s). Related: what data structure will your querier store this information in when it reads these files?
* Walk the TAs through the steps that your program will take to answer a free text query.
* Explain how you plan to implement PageRank.

Feel free to ask your friendly TAs as many clarification questions as necessary about PageRank in order to ensure you understand it fully *before* your design check.

Please submit the written specification to Gradescope. Only one group member needs to submit.

## Source Code

:::warning
IMPORTANT: Read [this guide](https://docs.google.com/document/d/13AV_3eXwZo5VLAD45MQGvYuL-VPNtG-k8U3ivWjvw9c/edit?usp=sharing) first!
:::

Here is the link to the source code: https://classroom.github.com/g/ykCSrN9W.

Final Handin
-----------------

The final handin is due by Anywhere-On-Earth time on Friday, March 26th.

1. For the final handin, your `scala` directory should contain the packages `search.src` (in the `src` directory) and `search.sol` (in the `sol` directory). Your code should be part of the `search.sol` package. That package should also contain a `README.txt` file. Your README file should include:
    * Instructions for use, describing how a user would interact with your program.
    * A brief overview of your design, including how the pieces of your program fit.
    * A description of features you failed to implement, as well as any extra features you implemented.
    * A description of any known bugs in your program.
    * A description of how you tested your program.
    * A list of the people with whom you collaborated.

    You must hand in all of the `.scala` files in your `search.src` and `search.sol` directories, as well as any wikis you generated for testing. Please **do not** hand in the small, medium, or big wiki that we give you.

    **Note:** Only one of you or your partner needs to hand in the project, but each partner should be added to the Gradescope submission.

2. Each subgroup (including 1-member subgroups) must [**fill in this form**](https://docs.google.com/forms/d/e/1FAIpQLSfFMRDwaJIVkxMedLMPily6jgSnoI_-g1uYWfJyH2Xjmqtutg/viewform?usp=sf_link). We will not know to assign a grader to your project unless you fill this out.

3. If you worked in a subgroup with two members, each member must fill in [**this peer evaluation form**](https://docs.google.com/forms/d/e/1FAIpQLSc31Z9pNjjm4heEkIfWfvU_McGQrsfDgGHsD5GOEHd08a7YNQ/viewform?usp=sf_link).

4. **Each member** of your group should also upload their individual responses for the Socially Responsible Computing tasks to the separate `Search: SRC` submission portal on Gradescope.

<a name="grading"></a>

Grading
------

As with all CS 17/18 projects, a good design will make coding this project significantly easier, so you should spend a fair amount of time working on your program's design before you begin writing any code.

The design check counts for 15% of your grade, including:

* An explanation of your plan for tokenizing.
* An explanation of your plan for scoring documents.
* Details about the data structures your querier will use to store the information it reads from the index files.
* A walkthrough of the design of your querier.
* A demonstration of your understanding of PageRank.

Functionality counts for 70% of your grade, including:

* Correct implementation of Indexer: 23 points
* Correct implementation of Querier: 23 points
* Correct implementation of PageRank: 10 points
* Efficiency: 14 points

As always, partial functionality merits partial credit.

The final 15% of your grade will be reserved for the SRC component, comments, testing, and style. You should include documentation comments and test cases for all non-trivial methods. You should also perform system testing, to test interactions among methods. Additionally, comment any code which would otherwise be unclear.

___

Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CS18 document by filling out the anonymous feedback form: [https://forms.gle/JZdUu5qomK8m8seX8](https://docs.google.com/forms/d/e/1FAIpQLSc4SvLLwSVodGaGQgpRDbAiet80FkzM--ErKwwuTKoILaROHA/viewform?usp=sf_link).