---
tags: projects, spr22
---

# Project 3: Search

### Due Date: Tuesday, May 3, 2022 at 11:59pm ET

**Design Checks:** April 19-22, 2022
**Gear Up:** April 19th, 6-7 PM, [Zoom Link](https://brown.zoom.us/j/91510826448)

:::info
This handout is mighty long, but it contains information vital to your success on this project. Please take the time to read it in its entirety before coming to TA hours or posting on Ed.
:::

# Overview

In this project, you will be making your own search engine. This will consist of two steps: indexing and querying. You will also be implementing the PageRank algorithm, which ranks search results based on the authority of each page. This handout will take you through the basics of each of those steps and give you an overview of the Python libraries you will need to use in order to perform some of the required operations. Because you are potentially processing very large data, you will have to consider time and space efficiency when you design your approach.

This handout will consist of the following sections:
- [Background of a Search Engine](#Background)
- [The Indexer](#Indexing)
- [The Querier](#Querying)
- [Efficiency](#Efficiency)
- [Testing the Search Engine](#Testing)
- [Design Check](#Design-Check)
- [Final Submission](#Final-Submission)
- [Relevant Vocabulary](#Vocabulary)

## Learning Objectives

During this project, you will:
- choose specific data structures, citing alignment with problem needs and constraints
- practice strategies to test and debug complex programs
- implement the PageRank algorithm, which revolutionized web search
- apply and combine multiple concepts you've learned from the course by setting up the project from scratch

## Stencil Code

:::warning
You'll receive invitations to join your repository with the stencil code shortly after project pairings are out.
:::

The collections (called Wikis) of Wikipedia pages you'll be searching over are located in the `wikis` folder, which you can download from [**this**](https://drive.google.com/drive/u/0/folders/1ZWlYG3UN3gavZjkna2sn7gUkDkP3naJE) Google Drive folder. The folder contains the three main Wikis, `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml`, as well as some smaller examples to help you debug your PageRank. You should add the `wikis` folder to your project. However, the Big and Med wikis are too large to store on GitHub. To avoid trying to push them to GitHub, you may want to create a `.gitignore` file, [explained here](https://edstem.org/us/courses/16807/discussion/1414621).

# Background

You are probably familiar with a few search engines: Google, Bing, DuckDuckGo. For this project, you're going to be writing your own search engine! Your search engine will work a lot like a typical search engine. A user will type in what is called a free text query (i.e., some words), and your program will return what it judges to be the most relevant pages for that query.

One naive way to build a search engine would be to ask the user to enter a query, and then scan through the corpus looking for documents that match that query. This approach would require many, many traversals of the corpus – one per query. Doing this for every single web query would not be fast. Instead, most modern search engines consist of two programs, an *indexer* and a *querier*. The indexer preprocesses the corpus before answering any queries, saving files with information about which documents are relevant to which words. (Yes, this takes a long time; but remember, this is a preprocessing step.
So, if it takes one hour, or even one day, it does not negatively impact the user experience, because it does not affect query-response time. This is an example of a time vs. space trade-off. You'll use extra space to store the preprocessed information, but that enables your program to answer queries quickly.) When the querier receives a query from a user, it uses the information contained in the file(s) to quickly respond to that query.

Once built, an example run of your search engine might look like this:

:::info
Examples updated 04/29
:::

```
$ python3 index.py MedWiki.xml titles.txt docs.txt words.txt
$ python3 query.py titles.txt docs.txt words.txt
search> computer science
1 LEO (computer)
2 PCP
3 Junk science
4 Hacker (term)
5 Malware
6 Gary Kildall
7 Motherboard
8 Foonly
9 PVC (disambiguation)
10 Graphical user interface
search> pope
1 Pope Alexander IV
2 Pope Benedict III
3 Pope Clement III
4 Pope Gregory V
5 Pope Gregory VIII
6 Pope Gregory XIV
7 Pope Formosus
8 Pope Eugene II
9 Pope Alexander VIII
10 Pope
search> :quit
```

More such examples can be [seen in this document](https://docs.google.com/document/d/16dPyUNcBDiFD02_MyVW6w0jPwhP0CosHh8mXU10r4u8/edit?usp=sharing).

## Search component diagram

The following diagram summarizes the components of the search engine. The orange blocks are the inputs (the web pages and the query). The blue boxes are different components of the solution (which you will write). The white boxes are files of intermediate data that get produced and read by the blue boxes. The components split into two main parts (the indexer and the querier) which you can develop and test independently.

:::success
**We have organized the document so that each of the components has its own subsection, and you can make use of the navigation pane on the left side of the page to navigate to the specific information for each part.**
:::

![Search Architecture](https://i.imgur.com/YS3pxIs.png)

## Getting Started/Reference Section

Search is a big project, and it is important to tackle it one piece at a time. The key to starting Search is thoughtful planning. Always keep in mind the end goals: what information do I need to gather as I go through the text in the Indexer? Given certain information, how can I find the top 10 articles in the Querier?

With the end goals in mind, you can start planning what fields are necessary in the Indexer, what methods you need, and the general flow of what is going to happen (when are we removing words that don't add information (stop words)? How many times will we read through the entire text? When should we call helper methods?). Then, when you have a good understanding of these, it will allow you to take a step back and ask "is this the best way to do it? Could I make this better/faster?" If you have a good understanding of your plan before you start programming, the project becomes much easier!

Search is very open-ended, so it is important to make your own roadmap. One suggestion is to scope out smaller pieces. **Write tests for that piece**, get it working, then extend your implementation with another piece so that your project has more and more functionality as you iterate. For example, here you have a querier and an indexer, with support files to give you the data files that connect them.
You could:
* just start with writing the querier methods
* create example files that can be used as inputs to the querier
* unit test the querier methods thoroughly
* start with queries that match terms exactly, without stemming, stop words, etc.
* once you can handle simple queries (like "cat"), try queries with stop words and update/extend your tests
* then extend the queries and unit tests again to handle stemming
* …

At each step, you get a core piece of the code working. You test it, then you add a bit more functionality. If you try to tackle the entire thing at once, you're likely to get overwhelmed.

# Indexing

The Indexer will:
1. Process an XML document into a list of terms,
2. Determine the relevance between terms and documents, and
3. Determine the authority of each document.

This section will go over the format of the inputs and outputs to the Indexer as well as each of these parts.

:::info
**Note**: You can and should implement and test each of these Indexer components separately. You should:

1. Read through this entire document before you start coding, so that you can come up with a plan for how to approach writing code for each component and what you need to make sure of so the components can fit together
2. Work on one piece at a time and test it **as you go** before you put the pieces together
3. Test how all of the pieces fit together
:::

## Command inputs and outputs

The Indexer will take the following as input to the program:

`<XML filepath> <titles filepath> <docs filepath> <words filepath>`

Keep in mind that your Indexer should take in these inputs **in this order** (if, for nothing else, to pass the autograder!). This means you must check how many arguments are passed into the Indexer, and print a message if there are fewer or more than 4 arguments; a sketch of such a check appears after the note below.

`<XML filepath>` is the name of the input file that the indexer will read in and parse. The remaining three inputs are the names of the output documents you will write to:
* `<titles filepath>`, which maps document IDs to document titles
* `<docs filepath>`, which stores the rankings computed by PageRank
* `<words filepath>`, which stores the relevance of documents to words

Then, it will be the job of your querier to read them. We will be providing code to write these files from the indexer as well as code to read them into the querier; your job is to fill in the data structures that the functions we give you expect! The design check walks you through this.

:::info
**Note:** It will be your job to write the code to read these command-line arguments in. To get the arguments, you can use the `sys` library:

```python=
import sys

sys.argv[0]  # the name of the script (e.g. "index.py")... can usually ignore
sys.argv[1]  # the first argument
sys.argv[2]  # the second argument
len(sys.argv) - 1  # the number of arguments (not including the script)
```
:::
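As a minimal sketch of how that argument check might look (the exact message and exit behavior are our own choices here, not requirements):

```python
import sys

if __name__ == "__main__":
    # sys.argv includes the script name itself, so 4 arguments means len(sys.argv) == 5
    if len(sys.argv) - 1 != 4:
        print("Usage: index.py <XML filepath> <titles filepath> <docs filepath> <words filepath>")
        sys.exit(1)
    xml_file, titles_file, docs_file, words_file = sys.argv[1:5]
```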
### The Document Format: XML

The wiki files that your search engine will be working on are going to be in XML format, which is further discussed in this section. Programs often need to pass information to one another, and the most convenient way to do so is to use a data-interchange format, meaning a standardized text-based format. XML, which stands for eXtensible Markup Language, is one such format. You can learn more about XML [here](https://www.tutorialspoint.com/xml/xml_documents.htm#:~:text=An%20XML%20document%20is%20a,structure%20or%20a%20mathematical%20equation). Invented by the World Wide Web Consortium (W3C), XML uses an HTML-style angle bracket specification.

Each XML element is encompassed by matching opening and closing tags. Opening tags are delimited by angle brackets, like this: `<tag>`. Closing tags include a slash, like this: `</tag>`. Text appears between the opening and closing tags of an element:

```*.html
<email>kfisler@cs.brown.edu</email>
```

An XML element can have any number of *attributes*, which are specified within the opening tag, as follows:

```*.html
<university name="Brown">
    <concentration type="Computer Science"></concentration>
    <concentration type="Slavic Studies"></concentration>
</university>
```

The `university` element has the attribute `name` with value `Brown`. The `concentration` tags have the attribute `type` with values `Computer Science` and `Slavic Studies`. The value of an attribute is always enclosed in double quotes.

Note that the two `concentration` elements are nested inside the `university` element. As this nesting reflects the tree structure of XML elements, the `concentration` elements are called *children* of the *parent* element, `university`.

**Pages**

Each Wiki file consists of a single XML element tagged *xml*, within which there are a number of pages (*a.k.a.* documents), each of which is enclosed by opening and closing `page` tags. Each page then contains a `title`, a unique, non-negative integer `id`, and some `text`, all of which are delimited by tags with those names. For example, look at the following two-page wiki:

```htmlembedded=
<xml>
    <page>
        <title>A</title>
        <id>1</id>
        <text>[[C]]</text>
    </page>
    <page>
        <title>B</title>
        <id>2</id>
        <text>[[D]]</text>
    </page>
</xml>
```

**Links**

Page text may include links, which follow a special format. Understanding the link format is important for parsing the text into individual words, as well as for the PageRank part of the project! Links have two parts: the text that is shown and the address of the page that it links to. In the wikis, the address is *the title of the page it links to*. There are two types of links:

If the text of a link is **the same as the page it links to,** it will look like:

```
[[Hammer]]
```

In this case "Hammer" is a word in the text of the wiki, and the page contains a link to the page titled `Hammer`. The link text should be parsed in the same way as any other text. Other links in this category include meta-pages (or pages that only contain links). So if a page contains the link `[[Category:Computer Science]]`, then the page would link to the `Category: Computer Science` meta-page, and contain the words "Category", "Computer", and "Science".

If the text of a link is **different than the page it links to**, the two parts will be separated by a pipe (`|`), where the text to the left of the pipe is the address of the linked page and the text to the right of the pipe is the text that is displayed. This looks like:

```
[[Presidents|Washington]]
```

If a page contained this link, there would be a link to the page named `Presidents`. In this case, the main text's data structures would count the word "Washington", but not this instance of "Presidents". So if the file contains "presidents" in plaintext elsewhere in the document it should be counted when calculating relevance, but this "presidents" page name should not be counted (refer to the Note at the end of the Parsing the XML section).

You'll need to keep track of which pages link to which other pages for PageRank!

## Processing Documents Into Essential Terms

An XML document has content that isn't relevant to searching, such as the tags (like `<page>`) and little words like "the".
The first step in building a search engine is to reduce the document to its essential interesting words. This involves several steps:

- **Parsing:** extracting the text of each document from the XML
- **Tokenizing:** splitting the text into separate words and numbers, removing punctuation and whitespace
- **Removing stop words:** ignoring words like "a" and "the" that don't contribute information.
- **Stemming:** reducing words to their *stems*, which are the base roots of words. e.g., "sleep", "sleeps", and "sleeping" have "sleep" as a base root.

:::warning
**Note**: Remember that your Indexer will also have to produce a titles file, which maps document IDs to page titles.
:::

:::warning
**Note**: As you read through this section, consider the way that you want to process and store this data so that it can be used by the other components of the Indexer.
:::

### Parsing the XML

We will be using the Python ElementTree XML Library to store our articles as XML trees. To use this library, we must add this statement to the top of the Indexer file:

```python=
import xml.etree.ElementTree as et
```

To load the XML file into a tree and save the root, we can use this line:

```python
root: et.Element = et.parse(<xml filepath>).getroot()
```

To iterate over the root's child nodes, we can use a for loop:

```python
for child in root:
    # do something
```

or we can iterate through all of the pages:

```python
all_pages: list[et.Element] = root.findall("page")
for page in all_pages:
    # do something
```

We can also extract the text of a child with a certain tag, as follows:

```python
title: str = page.find('title').text
```

This would retrieve the title as a string. You may have to use `.strip()` to process the strings, or `int(<string>)` when processing IDs. You can access further documentation for the ElementTree XML library [here](https://docs.python.org/3/library/xml.etree.elementtree.html).

:::warning
**Note:** When parsing text, you should consider the link text as words, but **not** the titles of the documents to link to. In the examples presented in the Links section above, `Hammer` should be used to compute relevance; so should `Washington`, although `Presidents` should not be; and `Category`, `Computer`, and `Science` should all be used.
:::

### Tokenizing the text

Once you have extracted a string from the XML, you need to break it into words/numbers and drop punctuation and whitespace. This is a good use for *regular expressions*. Python has a regular expressions library called `re` that you can use to search through a string for matches to patterns. In the following example, we will use a somewhat messy looking regular expression that matches links, words with apostrophes, and words without apostrophes (we explain it shortly):

First, define the regex as a string; note the triple single-quotes:

```python=
n_regex = '''\[\[[^\[]+?\]\]|[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+'''
```

:::spoiler Click for details of what that messy regular expression means
Let's break it down. There are three parts, `\[\[[^\[]+?\]\]`, `[a-zA-Z0-9]+'[a-zA-Z0-9]+`, and `[a-zA-Z0-9]+`, separated by pipes (`|`), which mean "or." So we are actually matching three possible alternatives:

1. `\[\[[^\[]+?\]\]`
    * The meaning: Match two left brackets (`\[\[`) and two right brackets (`\]\]`), making sure there's something in the middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another link was starting inside this link (`[^\[]+?`).
    * Returns links (e.g., "[[Some Wiki Page]]" or "[[Universities|Brown]]")
2. `[a-zA-Z0-9]+'[a-zA-Z0-9]+`
    * The meaning: Match at least one alphanumeric character (`[a-zA-Z0-9]+`), then an apostrophe (`'`), and then at least one alphanumeric character (`[a-zA-Z0-9]+`).
    * Returns words with apostrophes (e.g., "don't")
3. `[a-zA-Z0-9]+`
    * The meaning: Match at least one alphanumeric character.
    * Returns words (e.g., "dog")
:::

<br>

Next, we can use the Python `re` module to find matches to our regex. There is a method, `re.findall(pattern, string)`, which finds the matches to the pattern (our regex) inside the string input to it. This method returns a list that we can then iterate over!

```python=
cool_tokens = re.findall(n_regex, "this is a cool string")
for word in cool_tokens:
    # "this", then "is", then "a"....
    ...
```

:::info
**Note**: You are welcome to use the above regular expression in your `Indexer` to tokenize the text, but you can also use a different expression if you wish. For example, if you don't want to match numbers, replace each instance of `[a-zA-Z0-9]` in the regular expression with `[a-zA-Z]`. You can also split this regular expression up into its parts, so that you can search for only links (using only the first part), or only words (using the second two parts joined with a pipe).
:::

:::warning
**Stop and think:** is the sample code above everything you need for tokenizing? How will you determine what links go where? How will you handle capitalized words? The Python [string operations](https://docs.python.org/3/library/string.html) can be your friend!
:::

### Removing stop words

Some words appear quite often in a language, but contribute little to meaning: words like "the", "in", "an", or "or". Such words are known as stop words. For example, in the phrase "the cat in the hat," the relevant words are "cat" and "hat".

To help you identify stop words, you will use the `nltk` Python library. You will first need to download the `nltk` stopwords corpus. This can be done by running a Python interpreter and running the following code:

:::info
To run the Python interpreter, right-click on the open code and select `Run Current File in Interactive Window`.
:::

```python=
import nltk
nltk.download('stopwords')
```

If this works correctly, you should then be able to add the following lines in your file:

```python=
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))

# examples:
'the' in STOP_WORDS    # True
'milda' in STOP_WORDS  # False
```

This will allow you to then use `STOP_WORDS` as a set to check words against!

**Note:** if you get an error that says `ModuleNotFoundError: No module named 'nltk'` or `Import "nltk.corpus" could not be resolved`, try running `pip install nltk` in your terminal first. Close and reopen VSCode and try again.

### Stemming

Words have variations that aren't relevant to searching. Consider the words "sleep", "sleeps", and "sleeping" -- we'd like all of these to match the query term "sleep". The process of reducing a word to its base is called *stemming*. `nltk` also supports stemming, which can be done simply with the following code:

```python=
from nltk.stem import PorterStemmer

nltk_test = PorterStemmer()
nltk_test.stem("Stemming")  # outputs "stem"
```

The last line here will stem the word "Stemming".

:::spoiler **How does stopping and stemming work with links?**
When you run into a word that is a link, you'll need to record where that link is pointing, then remove stop words and stem the text that appears in the document. (You will need to remove the double brackets yourself. The regex we give you for extracting words from the documents leaves the links in place, so when you iterate through the words that are extracted using that regex, you will need to identify if a word is a link as you go and handle that.)
:::
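Putting the pieces of this section together, a tokenizing helper might be shaped roughly like the sketch below. This is only one possible design, and every name in it (`tokenize`, `link_regex`, `word_regex`) is our own. Note that it does **not** yet record link targets, which you will still need for PageRank:

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

# the link and word alternatives of the regex given above, split apart
link_regex = r'''\[\[[^\[]+?\]\]'''
word_regex = r'''[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+'''

def tokenize(text: str) -> list[str]:
    """Turn raw page text into stemmed, lower-cased, non-stop-word terms."""
    terms = []
    for token in re.findall(link_regex + "|" + word_regex, text):
        if token.startswith("[["):
            # a link: the part after a pipe (if any) is the displayed text;
            # the part before it is the link target, which you would record
            # separately for PageRank
            inner = token[2:-2]
            displayed = inner.split("|", 1)[1] if "|" in inner else inner
            # the displayed text may itself contain several words
            terms.extend(tokenize(displayed))
        elif token.lower() not in STOP_WORDS:
            terms.append(stemmer.stem(token.lower()))
    return terms
```

For example, `tokenize("The [[Presidents|Washington]] page")` would record "washington" and "page" as terms while skipping the stop word "the" and the link target "Presidents".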
## Determining Relevance of Documents to Terms (TF and IDF)

Processing documents as described (parsing, tokenizing, and stemming) converts documents from text into a sequence of terms. Likewise, a query (after parsing, tokenizing, and stemming) is a sequence of terms. Consequently, to score the relevance of a document to a query, we compare these two sequences of terms.

Similarity metrics used by most practical search engines capture two key ideas:

* A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times. This notion is called the **Term Frequency**.
* A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance when it occurs than a term that appears in many documents. This notion is called **Inverse Document Frequency**.

:::info
*To explain these terms, we will use the following small example, which will appear in these blue boxes.*

Consider a corpus consisting of the following three documents (assume we are leaving stop words in for now and ignoring page titles/stemming\*):

1. the dog bit the man
2. the dog ate cheese
3. the cheese bit the cheese

Our corpus therefore contains the words {"the", "dog", "bit", "man", "ate", "cheese"}. The next sections will compute Term Frequency and Inverse Document Frequency for this corpus.

\* *If you are using the `test_tf_idf.xml` wiki to test, keep in mind that the numerical results from these examples will change when you remove stop words and include the page titles as terms in the corpus.*
:::

:::warning
**Note**: as with the other components of the Indexer, keep in mind the final information that this component produces (a file that relates documents and terms by relevance) and how it fits together with the other components of the Indexer.
:::

### Term Frequency, $\pmb{tf}$

Given a list of all the terms that occur across all of the documents, one can count how many times each term appears in each document. These raw counts are only so useful for comparing documents, however: the term "cat" might appear less often in one document simply because the document is very short. We therefore normalize the counts across the documents. We can do this by dividing the count of a term by the count of the most frequently used term in the same document.

Formally, let $c_{ij}$ be the number of times that term $i$ appears in document $j$. Then, let $a_j = \max_i c_{ij}$ be the number of occurrences of the most frequently occurring term in document $j$. We then define the *term frequency* for a given term as ${tf}_{ij} = c_{ij} / a_j$.

:::info
For each document, we create a mapping of these words to their counts:

$j_0$: the dog bit the man
$j_1$: the dog ate cheese
$j_2$: the cheese bit the cheese

| | the | dog | bit | man | ate | cheese |
| ----- | --- | --- | --- | --- | --- | ------ |
| $j_0$ | 2 | 1 | 1 | 1 | 0 | 0 |
| $j_1$ | 1 | 1 | 0 | 0 | 1 | 1 |
| $j_2$ | 2 | 0 | 1 | 0 | 0 | 2 |

Notice that $a_0 = a_2 = 2$ and $a_1 = 1$, because those are the maximum occurrences of any term for each of the pages. With this information, we can compute the term frequencies $\pmb{tf}_{ij} = c_{ij} / a_j$:

| | $i_0$ (the) | $i_1$ (dog) | $i_2$ (bit) | $i_3$ (man) | $i_4$ (ate) | $i_5$ (cheese) |
|---|---|---|---|---|---|---|
| $j_0$ | 1 | 1/2 | 1/2 | 1/2 | 0 | 0 |
| $j_1$ | 1 | 1 | 0 | 0 | 1 | 1 |
| $j_2$ | 1 | 0 | 1/2 | 0 | 0 | 1 |
:::

:::warning
**Hint for debugging your code**: What are the maximum and minimum possible values that $tf$ can ever take on?
:::
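In code, the term-frequency computation might look like the following sketch. The `counts` dictionary here is hypothetical scaffolding, hard-coded to the worked example, standing in for whatever structure your parser builds:

```python
# a hypothetical structure: for each page id, a dict from term to raw count
counts: dict[int, dict[str, int]] = {
    0: {"the": 2, "dog": 1, "bit": 1, "man": 1},
    1: {"the": 1, "dog": 1, "ate": 1, "cheese": 1},
    2: {"the": 2, "bit": 1, "cheese": 2},
}

# tf_ij = c_ij / a_j, where a_j is the largest count in document j
tf: dict[int, dict[str, float]] = {}
for page_id, term_counts in counts.items():
    a_j = max(term_counts.values())
    tf[page_id] = {term: c / a_j for term, c in term_counts.items()}

print(tf[0]["the"])  # 1.0
print(tf[0]["dog"])  # 0.5
```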
### Inverse Document Frequency, $\pmb{idf}$

To capture the second idea, we want to scale up the relevance of words that occur in fewer documents in the corpus. One useful scaling factor, called *inverse document frequency* ($idf$), is computed as follows. Let $i$ be a term in the corpus, $n$ be the total number of documents, and $n_i$ be the number of documents that contain term $i$. Then,

$$
\pmb{idf}_i = \log \frac{n}{n_i}
$$

(The use of log makes this factor less sensitive to large values of $n / n_i$.)

:::info
We first count the number of documents that contain each term, to get $n_i$:

| $n_i$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| | 3 | 2 | 2 | 1 | 1 | 2 |

Next, we compute the inverse document frequencies for each word $i$ using $\pmb{idf}_i = \log \frac{n}{n_i}$:

| $\pmb{idf}$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
| --- | --- | --- | --- | --- | --- | --- |
| | $\log(3/3)$ | $\log(3/2)$ | $\log(3/2)$ | $\log(3/1)$ | $\log(3/1)$ | $\log(3/2)$ |
| | 0 | 0.405 | 0.405 | 1.099 | 1.099 | 0.405 |
:::

### Scoring Relevance of Documents to Terms

Given a term $i$ (from anywhere in the corpus) and a document $j$, we calculate the relevance score of $j$ to $i$ to be the normalized term frequency of $i$ in $j$ multiplied by the inverse document frequency of $i$. So

$$
relevance_{ij} = tf_{ij} * idf_i
$$

:::info
Putting everything together,

| $relevance_{ij}$ | $i_0$ | $i_1$ | $i_2$ | $i_3$ | $i_4$ | $i_5$ |
| ---------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| $j_0$ | 0 | 0.203 | 0.203 | 0.549 | 0 | 0 |
| $j_1$ | 0 | 0.405 | 0 | 0 | 1.099 | 0.405 |
| $j_2$ | 0 | 0 | 0.203 | 0 | 0 | 0.405 |
:::

Here is a [YouTube video on how the TF and IDF algorithms work](https://www.youtube.com/watch?v=vrjAIBgxm_w) that you may find helpful when working on this project.

:::spoiler **Should page titles count toward the word counts for that document?**
Yes!
:::
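Continuing the sketch from the $tf$ section above (again, all names here are illustrative, not required):

```python
import math

n = len(counts)  # total number of documents, using the hypothetical counts dict from above

# n_i = the number of documents containing term i
doc_freq: dict[str, int] = {}
for term_counts in counts.values():
    for term in term_counts:
        doc_freq[term] = doc_freq.get(term, 0) + 1

# idf_i = log(n / n_i)
idf = {term: math.log(n / n_i) for term, n_i in doc_freq.items()}

# relevance_ij = tf_ij * idf_i
relevance = {
    page_id: {term: tf_val * idf[term] for term, tf_val in term_tfs.items()}
    for page_id, term_tfs in tf.items()
}

print(round(relevance[0]["man"], 3))  # 0.549, matching the table above
```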
## PageRank: Determining Document Authority

One could determine which documents are relevant to a query just by comparing the terms in the query to those in the documents. This is how search engines used to work. Then in the 1990s, Google was founded using a different approach, based on the PageRank algorithm (designed by Google's founders). The algorithm ranks pages based on the links among them (without considering the content). Pages with high scores (i.e., page ranks) are then thought of as *authoritative*. Given a query, the most authoritative documents related to the words in the query are returned.

Before we dive into the math, here are the key principles underlying the design of PageRank:

1. **The more pages that link to a page $j$, the more authoritative $j$ should be.** For example, if "Blog Brown" is linked to by 5 web pages, and the Wikipedia article on "Providence, Rhode Island" is linked to by 500 pages, then it makes sense to consider the Wikipedia article more authoritative.
2. **The more authoritative those pages are, the still more authoritative $j$ should be.** Now, if "Providence Place Mall" is linked to only by "Providence, Rhode Island", and "Blueno" is linked to only by "Blog Brown," it might not be enough to measure a page's authority simply by a count of the number of pages that link to that page. Indeed, it makes sense to consider "Providence Place Mall" more authoritative than "Blueno" since it is linked to by a more important page.
3. **The fewer links those pages have to pages other than $j$, the more authoritative $j$ should be.** Assume "Research at Brown" is linked to only by a "NYTimes" article which links to only 10 other pages, while "Why Brown?" is linked to only by "Blog Brown", which links to 200 other pages. Because the "NYTimes" page has only a few links, and "Blog Brown" has many links, a link from the "NYTimes" page can be considered to be more important than a link from "Blog Brown", leading us to attribute more authority to "Research at Brown" than to "Why Brown?"
4. **The closer $j$ is to another page $k$ (measured by the number of links that must be followed to get from $j$ to $k$), the more $k$'s score should influence $j$'s.** For example, if the average path from "Brown University" to "Boeing" is 100 links, and the average path from "Brown University" to "Amy's Personal Website" is 5 links, then all other things equal, Amy's page should be considered more authoritative than Boeing's.

PageRank uses these principles to compute the authority of each document. The algorithm works roughly as follows: **Each page's authority is a number between 0 and 1, where the total authority across all documents always equals 1.** The algorithm starts by assuming that all documents have the same authority. Then, following the principles, the algorithm updates the authorities based on the links between pages. Since the authority of one page can affect that of another, this process of refining authorities repeats until the authorities stabilize across all documents (the underlying mathematics guarantee eventual stabilization). It is important that the total authority in the system is conserved, so at any point in the converging sequence of page ranks, the total authority in the system will always sum to 1.

Here is a [YouTube video on how PageRank works](https://www.youtube.com/watch?v=P8Kt6Abq_rM) that you might find useful when implementing the project.

:::info
*To explain the PageRank math, we will use the following small example, which will appear in these blue boxes.*

Say you have three websites: A, B, and C, where:
* A links to B and C
* B links to no pages
* C links to A

Then, we can represent the linking relationships between A, B, and C pictorially:

![](https://i.imgur.com/2zv24bo.png =250x)
:::

:::warning
**Note**: remember that the Indexer writes the results of PageRank to an output file, so, just like for the other components of the Indexer, you should consider how you are storing and dealing with the relevant data as it relates to the other components.
:::

### Weighting

In order to determine the rank of a page, we need to first numerically represent the linking relationship between pages as a set of ***weights***. This will help achieve the third principle, and, when combined with the remaining math in this section, the fourth principle.

Weight captures the impact of links from one page to another. Let $k$ and $j$ be pages, where $k$ has a link to $j$. We denote the *weight that $k$ gives to $j$* as $w_{kj}$.
The weight is defined by the following formula, where $n$ is the total number of pages and $n_k$ is the number of (unique) pages that $k$ links to. $\epsilon$ is a small number (you should use $\epsilon = 0.15$).

$$
\begin{equation}
w_{kj} = \left\{
\begin{array}{ll}
\frac{\epsilon}{n} + (1 - \epsilon) \frac{1}{n_k} & \mbox{if $k$ links to $j$} \\
\frac{\epsilon}{n} & \mbox{otherwise}
\end{array}
\right.
\end{equation}
\hspace{4em} [1]
$$

**Special cases:** There are a few special cases to handle when determining the link structure used to define the weights $w_{kj}$ (before the $\epsilon$ adjustment is applied):

* Links from a page to itself are ignored.
* Links to pages outside the corpus are ignored.
* **Pages that link to nothing can be considered to link (once) to everywhere except to themselves.**
* Multiple links from one page to another should be treated as a single link.

One idea of this weighting scheme is to attribute more authority to page $j$ if it is one of only a few pages linked to by page $k$ than if it were one of many.

:::info
In our example, we first account for the special case of B linking to nothing by considering that it links to both A and C:

![](https://i.imgur.com/QBIZwCf.png)

Computing weights according to equation (1), we get:

$$
\begin{align}
w_{AB} = w_{AC} = w_{BA} = w_{BC} &= \frac{0.15}{3} + \frac{0.85}{2} = 0.475\\
w_{CA} &= \frac{0.15}{3} + \frac{0.85}{1} = 0.9\\
w_{AA} = w_{BB} = w_{CC} = w_{CB} &= \frac{0.15}{3} = 0.05
\end{align}
$$

These weights are inherent to the linking relationship between the websites and will not change during the PageRank algorithm.
:::

### Computing rank

As we will see, we will compute the ranks for the pages as one step of the PageRank algorithm, and then **use those values when computing the next step of the algorithm, until we converge to some stable ranking.**

We first define the math for a ***single iteration*** (step) of this algorithm. For a page $j$, the rank $r_j$ on the $i$-th iteration is defined as follows (**here, superscripts with angle-brackets ($\langle\rangle$) denote the iteration number, *not* an exponent**):

$$
\begin{equation}
r_j^{\langle i+1 \rangle} = \sum_k w_{kj} r_k^{\langle i \rangle}
\end{equation}
\hspace{4em} [2]
$$

In other words, the rank of a page $j$ is the sum of the ranks of each page $k$ from the *previous iteration* multiplied by the weights that each $k$ gives to $j$. Remember that $w_{kj}$ does not change between iterations.

In the base case, we initialize the ranks as:

$$
\begin{equation}
r_j^{\langle 0 \rangle} = \frac{1}{n}
\end{equation}
\hspace{4em} [3]
$$

:::warning
**Check your understanding**: remember that the sum of all of the ranks of the pages is 1 for a given iteration. Why is this true for $r_j^{\langle 0 \rangle}$? You can also use this property to help you debug your code.
:::

:::info
For our example, the number of websites $n$ is 3, so according to equation (3), $r_A^{\langle 0 \rangle} = r_B^{\langle 0 \rangle} = r_C^{\langle 0 \rangle} = \frac{1}{3}.$

Then, according to equation (2),

$$
\begin{equation}
\begin{split}
r_A^{\langle 1 \rangle} & = \sum_k w_{kA}\ast r_k^{\langle 0 \rangle} \\
& = w_{AA} \ast r_A^{\langle 0 \rangle} + w_{BA} \ast r_B^{\langle 0 \rangle} + w_{CA} \ast r_C^{\langle 0 \rangle} \\
& = (0.05)\frac{1}{3} + (0.475)\frac{1}{3} + (0.9)\frac{1}{3} \\
& = 0.4750.
\end{split}
\end{equation}
$$

Similarly,

$$
r_B^{\langle 1 \rangle} = (0.475)\frac{1}{3} + (0.05)\frac{1}{3} + (0.05)\frac{1}{3} = 0.1916
$$

and

$$
r_C^{\langle 1 \rangle} = (0.475)\frac{1}{3} + (0.475)\frac{1}{3} + (0.05)\frac{1}{3} = 0.3333.
$$
The next iterations for this example are computed in the following section.
:::

### Converging on ranks

We repeat the computation in equation (2) until the computation *converges*, that is, until the *distance* between two iterations is a sufficiently small amount. For our distance metric, we use the ***Euclidean distance*** between two vectors (note that the superscripts on the $r$ terms inside of the angle brackets indicate iteration number, while the $2$ is an exponent):

$$
\sqrt{\sum_j (r^{\langle i+1 \rangle}_j - r^{\langle i \rangle}_j)^2}
$$

Once this distance is sufficiently small (say less than $\delta = 0.001$), you can declare the rankings stable, and output $r^{\langle i+1 \rangle}$ as your final ranking.

:::warning
**Note**: In summary, the code you write should:
* Compute the weights between pages, taking into account special cases,
* Initialize the rankings $r_j^{\langle 0 \rangle}$,
* Compute $r_j^{\langle 1 \rangle}$ based on $r_j^{\langle 0 \rangle},$ then $r_j^{\langle 2 \rangle}$ based on $r_j^{\langle 1 \rangle}$, and so on ($r_j^{\langle i+1 \rangle}$ based on $r_j^{\langle i \rangle}$ in the general case)
* Stop when the Euclidean distance between $\mathbf{r}^{\langle i+1\rangle}$ and $\mathbf{r}^{\langle i \rangle}$ (representing vectors of all of the page rankings of a current iteration) is sufficiently small
:::

:::info
The distance between the first two iterations of our example (based on the computations done in the previous step) is:

$$
\sqrt{\sum_j (r^{\langle 1 \rangle}_j - r^{\langle 0 \rangle}_j)^2}\\
= \sqrt{(0.4750-0.3333)^2 + (0.1916 - 0.3333)^2 + (0.3333 -0.3333)^2}\\
= 0.2003
$$

Because this distance is not yet small enough, we iterate on the computation until it converges:

| iteration | rank(A) | rank(B) | rank(C) | distance |
| --------- | ------- | ------- | ------- | -------- |
| 0 | 0.3333 | 0.3333 | 0.3333 | - |
| 1 | 0.4750 | 0.1916 | 0.3333 | 0.2003 |
| 2 | 0.4148 | 0.2519 | 0.3333 | 0.0851 |
| 3 | 0.4404 | 0.2263 | 0.3333 | 0.0362 |
| 4 | 0.4295 | 0.2372 | 0.3333 | 0.0154 |
| 5 | 0.4341 | 0.2325 | 0.3333 | 0.0065 |
| 6 | 0.4322 | 0.2345 | 0.3333 | 0.0028 |
| 7 | 0.4330 | 0.2337 | 0.3333 | 0.0012 |
| 8 | 0.4326 | 0.2340 | 0.3333 | 0.0005 |
:::

### Pseudocode for PageRank

Here is some pseudocode for PageRank. The variables $r$ and $r'$ store the page ranking (a value between 0 and 1) for each page. $r'$ stores the current rankings, while $r$ is the ranking from the previous iteration. We compute the rankings of the current round based on the rankings of the previous round. The ranking has stabilized when $r'$ and $r$ are sufficiently close to each other, as defined below the pseudocode. The pseudocode uses `weight(k, j)` to mean $w_{kj}$, or the weight that $k$ gives to $j$.

```
pageRank(pages):
    initialize every rank in r to be 0
    initialize every rank in r' to be 1/n
    while distance(r, r') > delta:
        r <- r'
        for j in pages:
            r'(j) = 0
            for k in pages:
                r'(j) = r'(j) + weight(k, j) * r(k)
```

:::info
You can decide which data structure to use to store the rankings ($r$ and $r'$) and also whether you want to calculate the weights $w_{kj}$ on the fly or precompute and store them at the beginning. As always, you should think about the tradeoffs (space/time/clarity) for any design choice!
:::
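To make the pseudocode concrete, here is one possible (unoptimized) translation into Python. It assumes the link structure has already been parsed into a dictionary from each page's id to the set of ids it links to (within the corpus, deduplicated); the names `weight`, `page_rank`, and `links_to` are our own:

```python
import math

EPSILON = 0.15
DELTA = 0.001

def weight(k: int, j: int, links_to: dict[int, set[int]]) -> float:
    """w_kj: the weight that page k gives to page j, per equation [1]."""
    n = len(links_to)
    targets = links_to[k] - {k}  # links from a page to itself are ignored
    if not targets:              # a page that links to nothing links once to all others
        targets = set(links_to) - {k}
    if j in targets:
        return EPSILON / n + (1 - EPSILON) / len(targets)
    return EPSILON / n

def page_rank(links_to: dict[int, set[int]]) -> dict[int, float]:
    n = len(links_to)
    r = {j: 0.0 for j in links_to}        # previous iteration's ranks
    r_new = {j: 1 / n for j in links_to}  # current iteration's ranks
    # iterate until the Euclidean distance between iterations is at most delta
    while math.sqrt(sum((r_new[j] - r[j]) ** 2 for j in links_to)) > DELTA:
        r = r_new
        r_new = {j: sum(weight(k, j, links_to) * r[k] for k in links_to)
                 for j in links_to}
    return r_new

# the worked example: A (id 1) links to B and C, B (id 2) links to nothing, C (id 3) links to A
print(page_rank({1: {2, 3}, 2: set(), 3: {1}}))
# roughly {1: 0.4326, 2: 0.2340, 3: 0.3333}, matching the table above
```

Note that this sketch recomputes every weight on every iteration, trading time for space; per the note above, precomputing and storing the weights is an equally valid design, and on large wikis the choice matters.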
### Testing PageRank

We've provided you with several sample wiki files to help test and debug the results of your PageRank implementation:
* `PageRankWiki.xml` is a wiki where every page links only to a single page (page 100)
* `PageRankExample1.xml` captures Example 1 from above
* `PageRankExample2.xml` captures Example 2, illustrated here:

:::info
Example 2:

![](https://i.imgur.com/fEjJLEJ.png)

| rank(A) | rank(B) | rank(C) | rank(D) |
| ------- | ------- | ------- | ------- |
| 0.2018 | 0.0375 | 0.3740 | 0.3867 |
:::

:::success
The best way to test your PageRank is to verify that it outputs the exact values (within some small floating-point error) shown for the two worked examples. *It will be very helpful to step through the debugger and verify that the weights are being calculated correctly and that the sequence is converging.* Also remember that **the sum of the ranks across all pages should be 1.** `PageRankWiki` can also be useful as a sanity check to show that your program gives a much higher rank to page 100 than to any other page.
:::

:::warning
**Important:** If you are having trouble debugging your PageRank, you should attempt to test and debug your program on the worked examples (`PageRankExample1.xml` and `PageRankExample2.xml`) *before* coming to hours or posting on Ed.
:::

# Querying

The second big piece of the search engine is the Querier. Your Querier should do the following:
1. Parse in arguments for the index files and an optional argument that says to use PageRank
1. Run a REPL (Read-Eval-Print Loop) that takes in and processes search queries
1. Score documents against queries based on the term-relevance and PageRank (if specified) index files

The following sections explain each requirement in detail.

## Main method usage

The main method of your Querier should support the following usage:

```shell
$ python3 query.py [--pagerank] <titleIndex> <documentIndex> <wordIndex>
```

where if the user passes the optional `--pagerank` argument, the querier will take page ranks into account when ranking results. The other arguments are the index files created by `Indexer`, **in the order specified above**.

:::info
You will also be responsible for processing these command-line arguments. Refer back to the section on the indexer for an example of using `sys.argv` to read in arguments.
:::

Take a look at how the given `file_io.py` reads in the files. For example, here is the header for one of the functions:

```
read_title_file(titles, ids_to_titles):
```

This function expects the name of a titles file and a dictionary to populate with the information. This suggests that, in order to use this function in your code, you will have to initialize and then pass in an empty dictionary that the function can then fill with information.
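For instance, reading the titles file back in might look like the following sketch (assuming `file_io.py` sits alongside your `query.py`; the filename here is just an example):

```python
import file_io

ids_to_titles: dict[int, str] = {}
file_io.read_title_file("titles.txt", ids_to_titles)
# ids_to_titles has now been filled in by the function, e.g. {1: "A", 2: "B"}
```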
## REPL (Read-Eval-Print Loop)

To allow a user to interactively use your querier, you will create a REPL. A REPL is a program that:
* prompts the user for input
* processes the input
* prints the result
* repeats

### Behavior

Your REPL should do the following:
1. Prompt for and read in a user query
1. Answer the query by scoring its terms (**after removing stop words and stemming**) against every document and returning the titles of the documents with the top 10 scores. The first document printed should be the one with the highest score, and the last document printed should be the one with the tenth-highest score. If the terms in the query appear in fewer than 10 documents, return only the documents in which they appear. If there are no search results at all, print an informative message.
1. Repeat steps 1 and 2 until the user types `:quit`.

## Scoring Documents Against Queries

### Based on term relevance

Given a query, you can use the term-document scores that you read from the `<wordIndex>` file to locate the most relevant documents. The computation is straightforward: for every document $j$, compute the relevance of $j$ to query $Q$ by summing scores across all terms $i$ in the query (after stemming and removing stop words):

$$s_{Q,j} = \sum_{i \in Q} relevance_{ij}$$

Sort the documents by these scores, and return the 10 highest-scored documents as the result of the query. Break ties however you wish.

:::info
**Note:** If you already computed the relevance ($tf*idf$) in `Indexer` and stored that in `<wordIndex>`, then you already have the relevance scores for each document to each term, which makes this scoring step quite simple!
:::
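A sketch of this scoring step, assuming `words_to_doc_relevance` is the nested dictionary you populated from `<wordIndex>` (term, then document id, then relevance score) and that the query has already been tokenized, stopped, and stemmed; the names are illustrative:

```python
def score_documents(query_terms: list[str],
                    words_to_doc_relevance: dict[str, dict[int, float]]) -> list[int]:
    """Return up to 10 document ids, ordered from highest to lowest score."""
    scores: dict[int, float] = {}
    for term in query_terms:
        # terms that appear in no document simply contribute nothing
        for doc_id, rel in words_to_doc_relevance.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + rel
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:10]
```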
### With PageRank

Recall that PageRank computes a score for each document (which you wrote to `<docIndex>`). These scores can be combined with the term-relevance scores ($s_{Q,j}$) before sorting the documents and returning results. You can experiment with different combinations and see how that affects the results: **multiplying the PageRank and term-relevance scores, adding them, etc.**

:::warning
**Note**: remember that use of PageRank is determined by the `--pagerank` argument to your main method. If the user does not provide this flag, your results should not depend on PageRank.
:::

# Efficiency

There are three XML files that you should be running your code on to test for efficiency: `SmallWiki.xml`, `MedWiki.xml`, and `BigWiki.xml`. As their names imply, they range in size and will consume memory resources on your computer as you attempt to index them. It is your job to make the indexing process on these three files as efficient as possible, while still maintaining reasonable search results for the querier.

You will have to pay attention to both time and space complexity. We want to see you making smart high-level choices about how often your code traverses data. For example, if you take two consecutive passes over the same large data structure when one would suffice, we'd expect you to combine the two passes. We discussed managing space in the March 18th [lecture on garbage collection](https://brown.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=73883bfb-cc32-4970-9e11-ae3c00e187ea). <a name="efficiency"></a>

You should keep **time and space efficiency** in mind when selecting your data structures and iterating over them. If your solution runs out of memory on Gradescope when you submit it, then you will have to make your solution more efficient.

**Indexer Efficiency:** Don't worry if your indexer is relatively slow at indexing a very large corpus. We'd rather see your indexer take 15 minutes to build an index, with querying taking a mere 0.5 seconds, than your indexer take 5 minutes, and querying 1.5 seconds.

**Space Complexity:** This project is the first time you'll really confront space complexity. If your program does not use space efficiently, you will probably run out of space in the heap, and the program will crash. You can end up creating a lot of intermediate data structures in this project. Pay attention to when you need to create new data (it's fine if you need it, but don't assume you have to use immutable data just because we've talked about it). Also think about when data structure choices need to take space usage into account (meaning, you don't *always* have to use the smallest-space structure, but you should if a particular piece of data could take considerable space). Consider the Big-O space complexity of the data structures you create. How does space grow as the number of pages in the corpus increases? Is it linear? Quadratic?

Also pay attention to not holding onto data longer than necessary. If you have a large data structure in the heap, but no way to reach it from the environment, Python will reclaim that space (garbage collection). One common way to hold onto data too long is to make your methods do too much work (parameters go into the environment). Creating methods that process data in slightly smaller chunks can help control space usage.

**Out of Memory Errors:** It is possible that you will encounter a Python `MemoryError` exception while running your indexer. Our solution did not encounter this error when we tested it on `BigWiki.xml`, so if you are encountering this error, your code may be performing unnecessary operations.

# Testing

This project has many parts, so testing each piece of your code as you write it will be very helpful. **You can (and should) create your own Wiki files for testing purposes.** These tests will not be run on wheats and chaffs; only TAs will be manually grading them.

:::danger
As such, it is crucial that you submit all documents you used to test, as TAs will otherwise not know your testing strategy and intent.
:::

This is your chance to demonstrate your testing skills at the end of the course.
- We want to see your tests and small wiki files show that you are thinking about variations that might arise in the data (both different common cases and edge cases).
- We want to see you testing both easier situations (like base cases) and ones that require iterations or comparisons within data (non-base cases).
- We want to see attention to testing critical behavior, particularly handling exceptions. In the case of this project, the exceptions you should be catching relate to how your project interacts with files.

## Unit Testing (Pytest): Calculations, Parsing

You should test the calculations computed by your search engine using pytest `assert`ions. This includes the Indexer's TF and IDF calculations as well as its PageRank calculations. You should also test that you are parsing, tokenizing, etc., the documents and queries in the expected manner.

:::warning
**Note:** This means that you should also write assertions to test the data structures that you build up while reading in the wiki! For example, before you test your PageRank algorithm, it is going to make a lot of sense to test that you are storing the linking relationships between pages correctly (with respect to the data structure(s) you chose).
:::
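For example, a pytest unit test for the worked PageRank example might look like this sketch (the `page_rank` function and its input format are the hypothetical ones from the PageRank sketch above; adapt the import and names to your own code):

```python
import pytest

# hypothetical import -- match your own module and function names
from index import page_rank

def test_page_rank_example_1():
    # A (id 1) links to B and C, B (id 2) links to nothing, C (id 3) links to A
    ranks = page_rank({1: {2, 3}, 2: set(), 3: {1}})
    assert ranks[1] == pytest.approx(0.4326, abs=0.001)
    assert ranks[2] == pytest.approx(0.2340, abs=0.001)
    assert ranks[3] == pytest.approx(0.3333, abs=0.001)
    # the ranks should always sum to 1
    assert sum(ranks.values()) == pytest.approx(1.0)
```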
## System Testing: Queries

You should try various queries, with different XMLs, and with/without PageRank. Document the queries you tried (along with the inputs used to start the programs) and the outputs that you expected in your `README` file. You will only receive credit for testing which is clearly documented in your `README` file. (More details on the `README` appear later.)

# Design Check

**Design Checks:** April 19-21, 2022

The goals of this design check are fourfold:
1. Give you practice with the math of the $tf$, $idf$, and PageRank computations
2. Guide you through making a high-level plan for how the code fits together
3. Create a development and testing plan for your code
4. Have you consider search engines within a societal and cultural context

## $tf$, $idf$, and PageRank computation practice

**Design check task 1:** Work through some example computations in the [conceptual check Gradescope assignment](https://www.gradescope.com/courses/344946/assignments/1999124/). Just like in Travel Planner, as you submit each question, Gradescope will tell you whether you are right or wrong, so that you can check your understanding as you go. You can and should resubmit as many times as you need to. You may talk to your partner as you go through this task, but we expect both partners to submit (and get full points on) the assignment.

## Plan for tracking data through the code

These tasks are meant to guide you through making a plan for how to go from the XML files, to the relevance computations, to PageRank, to the querier.

### Thinking about the files the indexer writes

**Design check task 2-a:** The indexer will write [three files](#Command-inputs-and-outputs) (refer also to the [diagram](#Search-component-diagram)). Take a look at the `file_io.py` file we give you. Complete the following table, which will help you keep track of the information each file expects. Do not just put the input name; write a description of the input (for example, if the input is a dictionary, write down the types and short descriptions of what the keys and values are). The first row is done for you. **Edited at 10:01 pm on 4/19 to fix the dict that the title file takes in.**

| `write_[]_file` function | Input 1 | Input 2 |
| -------- | -------- | -------- |
| `write_title_file` | The filename of the titles file (to be written to by this function) | A `dict` from page ids (`int`s) to page titles (`str`s) |
| `write_words_file` | | |
| `write_document_file` | | |

The next few tasks have you think about constructing/computing these inputs.

**Design check task 2-b:** The first row of the table suggests that, besides reading in the text from each page of the wiki, you will need to process page IDs and titles. Based on the XML structure of the [wiki documents](#The-Document-Format-XML) (see the example under the **Pages** heading) and the [example for getting the title of a page](#Parsing-the-XML), write two 1-line Python commands: one that gets the text of a page (as a `str`), and one that gets the ID of a page (as an `int`).

**Design check task 2-c:** Briefly (2-3 sentences), describe how you will go from reading in the XML file to populating the dictionary from page IDs to page titles that `write_title_file` takes in.
**Design check task 2-d:** Revisit the $tf$/$idf$ example (the blue boxes of [this section](#Determining-Relevance-of-Documents-to-Terms-TF-and-IDF)). Write down every data structure you think you might need to do the steps of the computation, and how they connect. Also write down how they are used to construct one of the inputs that `write_words_file` takes in. *Try to be as detailed as possible with what each structure represents, so that you can refer to this when you are coding later!*

:::warning
**Note:** Because your code will be expected to deal with large input data, think deeply about the time/space efficiency of the data structures you are proposing. When does it make sense to use a `list` vs. a `dict` vs. a `set` vs. something else?
:::

:::info
**Example to get you started:**

1. We need to have a way of keeping track of the **corpus** (all of the words over all pages of the wiki). We can use *[some data structure]*.
2. We also need some way of keeping track of *[some other thing]*, which we will store in *[some data structure]*. To compute this *[other thing]*, we need to go through all of the words in all of the documents and *[perform some computation]*, so we will write a loop that goes through `word_corpus` and *[details of the computation and the way we will store it in the data structure]*. We will do this computation in *[proposed helper function]*, which we will call when *[step in the code]*, and we will call our *[resulting data structure]* *[name we want to call it]*.

... (X). We will need to create a *[data structure that `write_words_file` takes in]*. To do this, we will.....

*This is only a template, feel free to adapt the wording to your group's style! In particular, it may help to draw some diagrams of how these data structures share data.*
:::

**Design check task 2-e:** Copy and annotate the [PageRank pseudocode](#Pseudocode-for-PageRank) with comments about how you will get/store the data relevant for the computation. Start with the information you get from the XML file, think about any intermediate data structures you might need to define, and remember that the result of this computation should be stored in a format that `write_document_file` understands. We are not looking for any specific format here, but again, *being as detailed as possible will likely help you get organized when you start writing code.*

:::info
**Example to get you started:** the pseudocode refers to `weight(k, j)`. What data structure should we use to store all of the weights? When should they be computed? What other data do we need to compute first before we can compute the weights? *(This last question is non-trivial, and you should make sure that you and your partner are on solid footing in terms of the computation and data structure use here.)* What other data does the pseudocode use? How should this be stored/initialized/computed?
:::

**Design check task 2-f:** Make note of any other things you need to do for the indexer, based on the handout. For example, design check tasks 2-(a-e) guide you through going from the XML file to using our `file_io` code to write to the output files, but how will you read in the command-line arguments to `index.py`? Again, being detailed and thorough here can help you later.

### Thinking about the querier

**Design check task 2-g:** The querier takes in the three files that the indexer produces and reads them into data structures using the `file_io.py` functions `read_title_file`, `read_words_file`, and `read_docs_file`.
Given a user query (a phrase made up of words), how can you use these data structures to produce an answer, for now without PageRank? Write your answer in prose (some expressions in code or diagrams might be helpful here too).

:::info
**Example to get you started:** The formula

$$s_{Q,j} = \sum_{i \in Q} relevance_{ij}$$

means that, for each document, we will have to go through each term in the query. We will do this by splitting the query into a list of words, and looping through that list, updating the sum in the loop. To get each $relevance_{ij}$, we will *[do something, using data from somewhere]*. We will store the resulting $s_{Q,j}$ in *[some data structure]* and then we will *[use this data structure somehow to get our answer]*.

*Again, adapt this template to your group's needs.*
:::

**Design check task 2-h:** Now incorporate PageRank. What data structures will you use to make sure PageRank factors into your search results? How will you combine term relevance and PageRank (addition, multiplication, weighted addition, something else)?

**Design check task 2-i:** Write a practice REPL. To refresh yourself on the concept of a REPL, it may help to refer back to [Lab 5](https://hackmd.io/@csci0200/Byitnwoiu#Fun-With-Words) or to the TravelPlanner starter code. Your REPL should:
1. Prompt the user for input
2. Convert the input to all caps (or some other transformation of your choice) and print it out on the next line
3. Repeat 1-2 until the user types ":quit" (or some other string of your choosing), and then exit the program.

This code should be in a Python file with only a `main` method.

:::spoiler Hints
1. Remember that you can define a `main` method in Python using this special line: `if __name__ == "__main__":` and then indenting everything after that line by one level
2. You can read in user input in Python using [`input`](https://docs.python.org/3/library/functions.html#input)
3. You can convert a string to uppercase in Python by calling [`.upper()`](https://docs.python.org/3/library/stdtypes.html#string-methods) on it.
4. You can exit your loop by using `break` with nothing after it
5. Our solution has 6 lines. Yours may have more, but if you are finding that you have many more lines or are spending significant time on this task, you may be overcomplicating things. Write down where you are confused and have your TA help you during the design check.
:::

<br>

To help you prepare for the coding of the querier, think about what you will have to change in your example to make it work with the rest of the querier code. Our solution defines a helper function that we call inside of the REPL -- what should it take as input(s) and do? Where in your example REPL would it make sense to call such a helper function (perhaps replacing an existing line)? Annotate your REPL code with comments that have your answer.

**Design check task 2-j:** Make note of any other things you need to do for the querier, based on the handout. Again, an example may be parsing in the command-line arguments -- how will you check for the `--pagerank` argument so that PageRank is only factored into your result if that argument is provided? Again, thoroughness and detail may help you out in the future.

## Testing and development plan

Because you are asked to create the `index.py` and `query.py` files yourself from scratch, this project gives you a lot of freedom in how you do development.
## Testing and development plan

Because you are asked to create the `index.py` and `query.py` files yourself from scratch, this project gives you a lot of freedom in how you do development. We want you to practice a version of *test-driven development*, meaning that you should develop an individual piece or aspect of the project (for example, the REPL or the $tf$ calculation) with its specification in mind, test it, and only then move on to develop another aspect of the project. When you integrate all of these pieces, you should also be testing the system as a whole.

:::warning
**Warning:** One issue that we have seen crop up in this course is students writing code encapsulating all of the functionality of an assignment in one go, leaving testing for the end. This results in really difficult debugging problems. Because Search has so many pieces to it, if you take this monolithic development approach, you will likely spend orders of magnitude more time debugging than if you take an incremental approach.
:::

**Design check task 3-a:** Create a small example wiki XML file that you can use to test your $tf$, $idf$, and PageRank calculations. What will the (tokenized, stemmed, stop-words-removed) corpus look like for this wiki? What about the linking relationship between pages?

**Design check task 3-b:** Make a testing and development plan that breaks the project into testable chunks and explains how you will test each chunk *before* moving on and integrating that chunk into the rest of your code.

:::info
As an example, some chunks of the indexer part of the project are:
* Parsing the command-line arguments
* Reading in the XML into:
  * Tokenized text
  * Titles
  * IDs
  * Linking relationships
* Computing $tf$
* ...

We want you to explain your testing approaches for each of these chunks. It will help to refer back to specific variable names (e.g., from design check task 2-d). For example:

*To test that we read the XML and tokenized the text correctly, we will need to check that the `word_corpus` set contains all of the stemmed, non-stop words in the wiki. To do this, we will use our test wiki from design check task 3-a and check that the `word_corpus` variable contains all of those words. We also need to...*
:::
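To make that concrete, here is a hedged sketch of what one chunk-level test might look like with `pytest`. Every name in it (the `Indexer` class, `test_wiki.xml`, the expected stems) is a hypothetical placeholder for whatever you design in tasks 2-d and 3-a:

```python
# test_index.py -- a sketch only; Indexer, test_wiki.xml, and the expected
# stems are hypothetical placeholders for your own names and test wiki.
from index import Indexer

def test_corpus_from_test_wiki():
    # Build the index over the tiny wiki from design check task 3-a.
    indexer = Indexer("test_wiki.xml", "titles.txt", "docs.txt", "words.txt")
    # The corpus should contain exactly the stemmed, non-stop words we
    # worked out by hand for that wiki.
    assert indexer.word_corpus == {"comput", "scienc", "search"}
```

Running a test file like this (e.g., `pytest test_index.py`) after each chunk is written keeps the debugging surface small before you move on to the next chunk.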
## Search Engines and Social Responsibility

As you begin thinking about how to build and implement a search engine, it’s worth reflecting on the history and development of information access and search on the internet. Today, most of you probably use Google as your primary search engine, but it wasn’t always that way! There used to be a plethora of different search engines, some of which you’ve probably never heard of, and even an internet without search engines at all. If you’re interested in a look at the old web, check out this emulator: [Oldweb.today](https://oldweb.today). (Note: if you do try OldWeb, beware that it is a bit glitchy and prone to crashing if you try to search; we recommend just clicking around the GeoCities site a bit.)

### Black Software

As you know, the internet and information on the internet existed before search engines. People had other methods of accessing information, akin to a bulletin board or a community page like Reddit. If you’re familiar with the “/fandom.com” page for your favorite video game or TV show, you’ve seen an example of this type of information access. One compelling example of early information sharing on the internet is Black software, the phenomenon where Black engineers and internet users gathered to create community spaces online.

**Design check task 4-a:** Listen to or read the transcript of [this short podcast segment](https://www.sciencefriday.com/segments/black-software/) of an interview with Charlton McIlwain, author of *Black Software: The Internet and Racial Justice, from the AfroNet to Black Lives Matter*, on Black engineers in the early days of the internet.

**Design check task 4-b:** Consider these examples of early internet community building and information sharing (AfroLink, NetNoir, Universal Black Pages). Which type of information distribution (search engines vs. community pages) performs better according to different values, such as scalability, diversity, accessibility, security, and power? Pick two of these values and, for each one you choose, argue which distribution method better achieves it and why. (2-4 sentences)

Now consider this quote that McIlwain gives in [another interview](https://www.futurity.org/internet-history-black-pioneers-2224462/):

> *“In the AOL days, everything was curated. With something like Net Noir, you had Black content created for Black folks, generally speaking. Things were very directional, and you had to know what you were looking for. Even early navigation systems like Yahoo! were much more akin to a card catalog than they were to a modern search engine like Google.*
>
> *…*
>
> *Google makes some assumptions about what the user wants, based on popularity—which is a highly prized commodity with supreme value online. This is a phenomenon that we talk about as a kind of power law: the more visibility I have, the more visibility I get, and the less visibility I have, the less visibility I get. You can imagine how that can create a wide-scale gap, based on race and other factors, between sites authored by different types of people.”*

**Design check task 4-c:** McIlwain argues that Google (or more broadly, search) “makes assumptions about what the user wants,” which “can create a wide-scale gap.” Based on what you’ve read and learned so far about the inner workings of search: What assumptions does the PageRank algorithm you are implementing make about the relative importance of web pages and what the user is searching for? Do you think McIlwain’s statement is fair and accurate, or do you think he’s missing something? Why? Recall what you’ve learned earlier in the course about bias, stakeholders, power dynamics, filter bubbles, and comparators to inform your answer to the two questions above. (6-8 sentences)

### Neutrality, Bias, and Monopolization

At the heart of McIlwain’s statement was a concern about the potential of search engines to bias results toward or away from certain types of information and certain groups of people. Others share his concerns and have elaborated on and investigated them extensively. For this part, you will learn about some of the shortcomings of search engines that engineers, researchers, and users are wrestling with today.

> Note: This may look like a lot of reading, but a substantial amount of it is pictures!

**Design check task 5-a:** Read [Algorithms of Oppression: How Search Engines Reinforce Racism](https://www.dropbox.com/s/gxg3j8psneot7t3/Noble_Algorithms%20of%20Oppression.pdf?dl=0) by Safiya Umoja Noble (**pages 15-26**, up to the paragraph that begins with “There is an important and growing movement of scholars.”)

**Design check task 5-b:** Read [How does Google’s monopoly hurt you?](https://www.seattletimes.com/business/technology/how-does-googles-monopoly-hurt-you-try-these-searches/) by Geoffrey Fowler.
**Design check task 5-c:** Navigate to google.com and try out these searches!
- Search for “jeans.”
- Search for “climate change is fake.”
- Search for “buy zofran.” (Note: Zofran is an anti-nausea/vomiting medication.)

**Design check task 5-d:** Reflect on the search results. Does anything surprise you? Which of Noble’s or Fowler’s arguments are supported or complicated by these results? (2-4 sentences)

In his paper [Speech Engines](https://scholarship.law.cornell.edu/cgi/viewcontent.cgi?article=2622&context=facpub), James Grimmelmann outlines three competing theories of the role of search engines:

> **Conduit theory**
> The conduit theory stipulates that the law should prevent Google from using its “dominant position” to “censor” websites. It draws on the tradition in free-speech theory that speakers should have an affirmative right of access to the mass media. Since search engines have the same power as traditional mass media to shape public discourse, goes the argument, they should be subject to the same scrutiny and regulations. Search should be passive and neutral, connecting users to information and nothing more.

> **Editor theory**
> The editor theory argues that search engines are “media companies” that make “editorial choices” about what to publish. Advocates believe that there is no such thing as “neutrality” in search because any ranking is innately biased, and that attempts to achieve neutrality or fairness through regulation are thus futile. Search should be “an active and opinionated editor” that harnesses its power to find the most important and interesting material.

> **Advisor theory**
> The advisor theory presents the analogy that a search engine should be a helpful, trustworthy advisor. An ideal advisor would adopt the user’s goals and preferences, rather than having an agenda of its own. Search should be listener-oriented: instead of attempting to amplify good speech, the engine should empower users to identify for themselves the speech they wish to hear.

**Design check task 5-e:** For each of the three theories, do you think the theory prioritizes the user’s perspective, the website’s perspective, or the search engine’s perspective? (3 short sentences, one per theory)

**Design check task 5-f:** Which theory do you think best describes the aims and execution of search engines today? Which theories do you think Noble and Fowler would most support? Which theory do *you* think search engines should aim for? (4-6 sentences)

<br>

# Final Submission

The deadline for the final submission is **Tuesday, May 3, 2022 at 11:59pm ET**. You should submit all your files on Gradescope. For your final submission, you will implement a search engine as described in this handout and answer the questions from the Socially Responsible Computing portion, along with a brief description of your code structure and any known bugs. Thus, your final submission will have three components: code, SRC, and `README`. You should also submit your own test wiki files if you have made any. In addition, please include your entire test suite.

## Code

**Grading**

Rather than the single implementation submission you are accustomed to, there will be both a correctness and an efficiency component on Gradescope. For correctness, we are checking whether your code produces correct results on some small examples. For efficiency, we are checking whether your code runs correctly in a reasonable amount of time on larger examples.
We've included some details on grading requirements for this project:

:::info
| Minimal | Full |
| --- | --- |
| Index w/o PageRank (words and titles files pass our tests) | Index w/ PageRank (docs file also passes) |
| Tests pass on smaller test wiki | Tests pass on large wiki (data structure and design) |
| REPL w/o taking into account stop words/stemming | Full REPL functionality |
| Querier ignores `--pagerank` flag | Querier parses `--pagerank` flag |
:::

To ensure that code that is correct but inefficient still gets a fair amount of credit, there will be **3 different Gradescope submissions**, so pay close attention to where and what you are submitting.

:::danger
For **Project 3 Efficiency** and **Project 3 Correctness** on Gradescope, please submit:
* a completed `index.py`
* a completed `query.py`
* any other Python files (if any) you created to help your indexer and querier

In addition, submit **everything** to **Project 3 Main**. This includes:
* a completed `index.py`
* a completed `query.py`
* any other Python files (if any) you created to help your indexer and querier
* any test files you created
* a `README.txt` file
:::

**Note:** Only one of you or your partner needs to hand in the project, but each partner should be added to the Gradescope submission.

## README

Your `README` should consist of the following:
* the names of your group members
* a description of any known bugs with your program
* instructions for use, describing how a user would interact with your program
* a description of how the pieces of your program fit together
* a description of features you failed to implement, as well as any extra features you implemented
* a description of how you tested your program, and **ALL** of your system tests
  * Examples of system tests include testing various queries and pasting the results.

# Vocabulary

**Wiki:** A set of pages that contain text about various topics, and links to other pages in the wiki.

**XML:** eXtensible Markup Language, a standardized, text-based format for representing the structure and content of information.

**Corpus:** The entire set of documents used as an input to search ("words in the corpus" refers to unique words in this corpus).

**Indexing/indexer:** The act of pre-processing the corpus into information that can be stored and used later for quickly returning search results (Indexer: the code that does the indexing).

**Parsing:** Extracting the text of each document from the XML.

**Tokenizing:** Splitting the text into separate words and numbers, removing punctuation and whitespace.

**Stemming:** Reducing words to their stems, which are the base roots of words.

**Stop words:** Common words such as "a" and "the" that don't contribute information about the contents of a document.

**Regular expression:** An expression (in a standardized syntax) that is used to search over text.

**Term frequency ($tf$):** A measure of how frequently a word in the corpus appears in each document of the corpus.

**Inverse document frequency ($idf$):** A measure of how many documents in the corpus a word appears in.

**Relevance:** A combination of $tf$ and $idf$ that scores how relevant a term is to a document.

**Linking relation:** A representation of which pages in a corpus link to which other pages. Can be represented using a directed graph.

**PageRank:** An algorithm developed to evaluate how important a page is based on the linking relation between all of the pages in a corpus.
**Querying/querier:** The act of looking up and returning search results for an input phrase, based on pre-computed indexer files (Querier: the code that does the querying).

**REPL:** Read-Evaluate-Print-Loop; an interactive computer program that reads in user input, evaluates a computation based on that input, prints some results, and does this repeatedly until prompted to quit.

**Unit test:** A test that exercises one unit (component, function, etc.) of code, usually by running the unit and comparing the result to some expected result.

**System test:** A test that exercises the system as a whole, by using an external-facing interface (such as a user interface) and checking the final answer against some expected final answer.

**Test-driven development:** A software engineering practice where each unit is written, tested thoroughly, and debugged before integrating it with the system.

______________________________

*Please let us know if you find any mistakes, inconsistencies, or confusing language in this or any other CSCI0200 document by filling out the [anonymous feedback form](https://docs.google.com/forms/d/e/1FAIpQLSdFzM6mpDD_1tj-vS0SMYAohPAtBZ-oZZH0-TbMKv-_Bw5HeA/viewform?usp=sf_link).*