# CS 121: Information Retrieval

A compilation of notes from COMPSCI 121 at UC Irvine on web-based search. Topics include web crawlers, search engine optimization, and semantic graph traversal.

## Table of contents

1. [Introduction](#introduction)
2. [Search Engine Architecture](#search-engine-architecture)
3. [Search Engine Optimization](#search-engine-optimization)
4. [Web Crawling](#web-crawling)
5. [Tokenization](#tokenization)
6. [Retrieval](#retrieval)
7. [Query Scoring and Quality](#query-scoring-and-quality)
8. [Query Optimization](#query-optimization)

---

## Introduction

Information retrieval (IR) is the process of extracting information (raw or semantically meaningful) from the web. **Web engines** are systems built over a graph representation of information that we use to algorithmically access and traverse it. We will first set up a basic understanding of the web.

Nature of the web:
- the corpus is not centralized (rather, it's distributed)
- constant updating
- significant duplication (30-40% of the web)
- high linkage (many hyperlinks between pages)

![Screenshot 2025-04-08 at 12.03.41 PM](https://hackmd.io/_uploads/H19M7emAJg.png)

The reason we need web engines is primarily a **social one**:
- incentive for content creation
- aggregation of "niche" groups
- best way to monetize/advertise the web

The core difference between a database and a search engine is **relevance**. Databases only take into account **textual similarity**. Search engines extend to **contextual cues** from the user, the document, and the web. IR is the core process that enables this.

There are several types of IR:
- search (what we learn here)
- filtering (finding objects of interest)
- classification (labeling information)
- question --> answer (ChatGPT)

Classic assumptions of IR include:
- Corpus: there is a large, fixed amount of information
- Relevance: $R(Q, D)$ scores an object based on the relevance between the query (Q) and a document's contents (D)

A good **retrieval model design** matches a query from the user to one or more documents.

### Types of search

There are two primary types of search: 1) **searching through a relational database** based on structured queries and 2) **web engines** that search based on unstructured queries.

Web engines crawl the bytes of a page ahead of time so that results can be returned in under a second. Search engine design focuses on processing websites as quickly as possible.

The primary design issues are:
- Performance (latency, throughput)
- Coverage
- Freshness
- Spam/inaccurate information

---

## Search Engine Architecture

Web engines search the web through a structured process: parsing HTML and data on the web, transforming the data, building an index, and ranking the results.

![Screenshot 2025-04-03 at 11.35.52 AM](https://hackmd.io/_uploads/ryIfHI3pyl.png)

**Inputs:** User query (keywords and operators (ex: AND))
**Outputs:** Resulting summaries of pages and highlights of important words

Here is an overview of the process:
1. User inputs a query
2. Web crawlers acquire text
3. Text is processed, transformed, and an index is created
4. Information is ranked and evaluated
5. Information/relevant websites are displayed on the UI

### Step 2: Text acquisition

There are multiple ways to extract web data (data dumps, document stores, etc). The most popular is a **web/spider crawler**, which crawls the web, returns the content of relevant pages, and follows the next hrefs on each page until all documents are scanned.
- Scan the HTML of the current page to find additional links (hrefs) and dump them in a queue
- The spider takes the next URL from the queue and scans that page in turn
- Returns un-indexed pages
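Below is a minimal sketch of the queue-based acquisition loop described above (the Web Crawling section later covers the full architecture, politeness, and robots.txt handling). The `fetch_html` and `extract_hrefs` helpers are hypothetical placeholders for an HTTP fetcher and an HTML link parser.

```python
from collections import deque

def acquire_text(seed_urls, fetch_html, extract_hrefs, limit=1000):
    """Sketch of Step 2: scan a page, queue newly discovered hrefs, repeat."""
    frontier = deque(seed_urls)   # URLs discovered but not yet visited
    seen = set(seed_urls)         # avoid re-scanning the same URL
    pages = {}                    # url -> raw (un-indexed) HTML

    while frontier and len(pages) < limit:
        url = frontier.popleft()
        html = fetch_html(url)            # hypothetical helper: download the page
        pages[url] = html                 # keep the un-indexed page content
        for href in extract_hrefs(html):  # hypothetical helper: pull hrefs out of the HTML
            if href not in seen:
                seen.add(href)
                frontier.append(href)
    return pages
```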
### Step 3: Text processing and transformation

First, we must transform the data so we can easily process it. The process is as follows:
- Parse and tokenize words
- Group semantically similar words (ex: fish and fishes)
- Remove semantically irrelevant words (ex: the, and, there)
- Analyze the popularity of links
- Extract the structure of the text (ex: title, italics, header)
- A classifier categorizes the document (ex: spam, website, etc)

**Tokenizing** words is the process of splitting text into meaningful linguistic units to make it easier for the machine to parse.
- Given query: `APPles can't hat.<.>`
- Tokens: {'apple', 's', 'can', 't', 'hat'}

Tokens follow the properties below:
- sequences of 3 or more alphanumeric characters
- semantically meaningful
- all characters converted to lowercase
- **Byte Pair Encoding (BPE)** is a type of tokenization that breaks words into byte-sized components (ex: "a", "b", "c") and combines them into semantically meaningful subunits (ex: "ab", "ph") that can be re-combined later

The **inverted index data structure** is built from the document collection: a list-based data structure that maps each word to the documents it is contained in (i.e. which documents contain the target word). Indices are created through the following process:
- terms in the document are weighted based on importance/relevancy (based on the search query)
- invert the index: from $doc \to text$ to $text \to doc$
- distribute texts widely to make them easier to search (the same principle behind efficient hash maps)

![inverted matrix](https://hackmd.io/_uploads/rJopWLn6Jg.jpg)

### Step 4: Evaluation

The **evaluation** process analyzes the data based on the indices. It makes the search process faster and more efficient.

There are two major parts of evaluation:
1. **Precision**: how relevant the first few hits are
    a. most important part of web search, generally
2. **Recall**: the number of relevant hits retrieved
    a. most important when completeness is required

Workflow:
- Logging: monitors and records activity on the webpage (ex: searches, timestamps, etc)
- Ranking analysis: rank indices based on relevance. Performed by different processing brokers
- Performance analysis: performed by a human

---

## Search Engine Optimization

SEO helps search engines better understand a page's information. Search results depend on the following:
- the geographic location of the data center that receives the query

### User behavior

Popular searches (i.e. trending searches) are kept in the cache and used to "autocomplete" results.

Most search queries are **high variance and ill-defined**. The engine uses the following process to extract the actual user intention:

![Screenshot 2025-04-10 at 11.26.50 AM](https://hackmd.io/_uploads/ByOdpFBAJg.png)

You can safeguard the relevance of the page using **relevance scores** ($R(Q, d)$). See step 4.

### Business model

Ad engines are the main profit system on the web: ads are promoted. Advertisers use a **pay-per-click (PPC) system** and bid on keywords entered in the search query. The engine runs an **auction system** based on parameters like the price paid for the ad, relevance, and expected click-through rate to determine which ad to display.

There are two primary types of search engine ad metrics:
- Cost-per-click (CPC): cost per user click
- Cost-per-mille (CPM): cost per 1,000 user impressions
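As an illustration of how these pieces could fit together, here is a small hypothetical sketch that ranks ads by bid weighted by expected click-through rate and then reports spend under the CPC and CPM metrics above. The ranking formula and the numbers are illustrative assumptions, not the exact formula any engine uses.

```python
def rank_ads(ads):
    """Rank ads by bid * expected CTR (a common simplification of ad auctions)."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["expected_ctr"], reverse=True)

def cpc_cost(cost_per_click, clicks):
    """Total spend under a cost-per-click model."""
    return cost_per_click * clicks

def cpm_cost(cost_per_mille, impressions):
    """Total spend under a cost-per-mille (per 1,000 impressions) model."""
    return cost_per_mille * (impressions / 1000)

ads = [
    {"name": "A", "bid": 2.00, "expected_ctr": 0.01},
    {"name": "B", "bid": 1.50, "expected_ctr": 0.03},  # lower bid, but more clickable
]
print(rank_ads(ads)[0]["name"])           # "B" wins the auction
print(cpc_cost(1.50, clicks=200))         # 300.0 spent under CPC
print(cpm_cost(5.00, impressions=40000))  # 200.0 spent under CPM
```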
### Techniques

---

## Web Crawling

Search engines first need data from the web in order to operate and determine relevancy to a search query. This process of extracting information from the web is called web crawling.

There are several components to the crawling architecture:
- The crawler has access to a local data structure called the **"frontier"**, a queue that stores URLs that have been discovered but not yet visited.
- The **DNS resolver** translates URLs into IP addresses, from which a fetcher retrieves information (raw HTML content).
- The **parser** parses the HTML content and converts it into a structured database of URLs and page content.

![crawlerProcess](https://hackmd.io/_uploads/By8Da50yxl.png)

Basic algorithm:
1. Initialize a queue of URLs (seeds)
2. For each URL:
    - check that the current URL can be crawled
    - add all links in the URL to the queue

```python
def crawler_thread(frontier):
    while not frontier.done():
        website = frontier.next_site()
        url = website.next_url()
        if website.permits_crawl(url):   # permissions based on robots.txt
            text = retrieve_url(url)
            store_document(url, text)    # temporary storage of the document
            for url in parse(text):
                frontier.add_url(url)    # stores the next URLs to crawl
        frontier.release_site(website)
```

**robots.txt** is used to declare web crawler permissions on a webpage:
- `Disallow`: denies the crawler access to these pages
- `Sitemap`: points to what the crawler can crawl
- `User-agent`: determines which bot the rules apply to

This basic algorithm is not used in practice because it does not respect the privacy of the website owner, takes a long time, and returns irrelevant, noisy data.

### Designing effective web crawlers

**Politeness**: placing delays throughout the crawling process to prevent overwhelming servers

**Performance**:

**Crawler traps**: structural issues with webpages that lead to infinite pagination of links

**Duplicate detection**: there are lots of near-duplicate (but not identical) websites on the internet, so their hashes differ. Detectors can somewhat detect semantic similarity, but these methods are not very effective

**Data noise**: tags and ads cause a lot of noise in the retrieved data that can compromise IR. The **cumulative distribution of tags** can help the engine select blocks of text based on where the graph plateaus (i.e. where the number of tags stops increasing)

![Screenshot 2025-04-17 at 12.08.20 PM](https://hackmd.io/_uploads/SJ-nW0CR1x.png)

**Multithreading** is the process of running multiple worker threads that parse different categories of webpages simultaneously and bundle the results at the end into one unit.
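To make the multithreading idea concrete, here is a minimal sketch using Python's standard `concurrent.futures` thread pool to fetch a batch of URLs in parallel and bundle the results into one dictionary. The `fetch` helper, the example URLs, and the four-worker pool size are illustrative assumptions; a real crawler would also apply the politeness delays and robots.txt checks described above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Illustrative fetcher: download one page's raw HTML."""
    with urlopen(url, timeout=10) as response:
        return response.read()

def fetch_all(urls, workers=4):
    """Fetch many pages concurrently and bundle the results into one unit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, urls))  # each worker handles a share of the URLs
    return dict(zip(urls, results))            # url -> raw HTML

pages = fetch_all(["https://example.com", "https://www.example.org"])
```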
---

## Tokenization

The goal of tokenization is to extract meaning from a series of words. Rules:

- **Case folding** is the process of transforming all characters into the same case (ex: Case Folding -> case folding). This speeds up comparisons.
- **Hyphens** are treated as a single unit during processing
- **Numbers** can cause issues, e.g. turning the decimal 9.2 into the number 92
- **Stemming/conflation** is the process of extracting the root of a piece of text and only storing that. For example, "swam" and "swimming" will fall under the same root "swim"
    - Algorithmic stemming (Porter stemmer): uses rule-based suffix removal (ex: removing -ing and -s endings) to extract the root form of the word. It may cause issues by forming non-existent words
    - Dictionary-based stemming: looks the word up in a dictionary to find the root
- **Phrase recognition** attempts to separate words into meaningful units using the following methods:
    - **Part-of-Speech (POS) tagging** divides words into grammatical units (verb, noun, etc), which helps machines understand syntactic structure. It's highly accurate at the expense of speed.
    - **N-grams** organize words into sequences. "N" corresponds to the number of words in a single unit. This helps search engines understand words that are better understood together (ex: New York)
        - Unigram: one-word chunks (ex: ["I", "love", "pepperoni", "pizza"])
        - Bigram: two-word chunks (ex: ["I love", "love pepperoni", "pepperoni pizza"])
        - Trigram: three-word chunks (ex: ["I love pepperoni", "love pepperoni pizza"])
        - Algorithm: slide a window of size n across the sentence and store the words it finds

        ![n-grams](https://hackmd.io/_uploads/HJwHVW-Wxg.png)

        ```
        Count the number of trigrams in: "the quick brown fox jumps over the lazy dog"
        Answer: 7
        9 words - 3 + 1 = 7 overlapping trigrams
        ```
- **Positional indexes**: store the indexed occurrences of specific terms in the document. Used for proximity searches, i.e. how tightly coupled words are to each other

---

## Retrieval

### Retrieval Process

**Boolean retrieval** retrieves information using boolean operators (AND, OR, NOT).
- It uses a **term-document binary matrix**, which uses 0 and 1 to indicate which documents a specific term appears in. Documents are retrieved by applying bitwise operators to the rows.

An **inverted index** is a data structure that maps terms to the documents they appear in. Instead of storing whether or not a term appears in EVERY document, it only stores the documents the term appears in and nothing else. This increases efficiency since it reduces the search space.
- A **postings list** is a detailed list attached to each word in the inverted index that follows this format:

```python
# example posting
# (document, indexed positions in the document, frequency of word)
pizza → [
    (Doc 1, positions [3], frequency = 1),
    (Doc 2, positions [0], frequency = 1),
    (Doc 3, positions [1, 4], frequency = 2)
]
```

At a bare minimum, a posting must contain the document ID.

![Screenshot 2025-05-13 at 11.16.00 AM](https://hackmd.io/_uploads/ByRDh-W-xl.png)

- The list of words is stored in a dictionary for O(1) lookup
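Here is a minimal sketch of how an inverted index with positional postings like the one above could be built from a few in-memory documents. The layout (term mapped to a list of (doc_id, positions, frequency) tuples) mirrors the posting format shown; the toy whitespace tokenizer and the sample documents are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, positions, frequency), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for i, token in enumerate(text.lower().split()):  # toy tokenizer: lowercase + split
            positions[token].append(i)
        for term, pos in positions.items():
            index[term].append((doc_id, pos, len(pos)))   # one posting per (term, doc)
    return index

docs = {1: "pepperoni pizza", 2: "pizza pizza pizza", 3: "I love pepperoni"}
index = build_inverted_index(docs)
print(index["pizza"])  # [(1, [1], 1), (2, [0, 1, 2], 3)]
```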
### Retrieval Models

**Boolean search** is the primitive form of information retrieval that requires exact queries and expert understanding. The **feast or famine problem** means that a query will either return many results or none at all.

![Screenshot 2025-05-27 at 1.41.12 PM](https://hackmd.io/_uploads/rkQu7imfgg.png)

A **ranked retrieval model** returns the top-scored docs for the query. **Free text queries** search information based on natural language terms instead of indices.

---

## Query Scoring and Quality

### Query representation

Queries and documents are represented as vectors in space. Documents are ranked based on their cosine similarity to the query in that space.

**Vector space scoring** finds the K docs in the collection that are nearest to the query vector (smallest angle between the query vector and the document vector).

![query](https://hackmd.io/_uploads/ByOiR5Xfee.png)

### Calculating Relevance

The **term-document count matrix** represents the number of occurrences of a term (row) in a doc (column). Documents store a count vector of the number of occurrences.

![Screenshot 2025-06-02 at 5.13.55 PM](https://hackmd.io/_uploads/By-8R2iMgg.png)

To identify the relevance of a document, we measure how much of the query's terms appear inside the document using query/document pair scores. Higher scores indicate that the document is more relevant to the query.

- The **Jaccard coefficient** is commonly used for this. It finds the fraction of vocabulary shared between the query and the document by comparing the intersection and union of the two sets:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

- $|A \cap B|$ is the number of shared terms
- $|A \cup B|$ is the total number of distinct terms across both sets

- The **tf weight** (term frequency) counts the number of times term $t$ occurs in document $d$:

$$
\mathrm{tf}_{t,d}
$$

- **Log frequency weighting** rates the relevance of a term in a document based on how frequently it shows up in the doc. A term that shows up 10 times matters more than one that shows up once, but not 10 times more.

![Screenshot 2025-06-02 at 5.42.42 PM](https://hackmd.io/_uploads/rk2WrTjGeg.png)

- The underlying assumption is that **rare terms are more informative than frequent terms**
- To align with this assumption, we use the **idf weight** (inverse document frequency), where rare terms are assigned more value

![Screenshot 2025-06-02 at 5.45.15 PM](https://hackmd.io/_uploads/Hy8oBTozee.png)

**tf-idf weighting** computes the product of the tf and the idf. This system balances term frequency with the assumption that rare terms matter more.

![Screenshot 2025-06-02 at 5.45.56 PM](https://hackmd.io/_uploads/H1k0S6oMgl.png)

Thus, the formula for the **relevance score** is as follows:

![Screenshot 2025-06-02 at 5.47.46 PM](https://hackmd.io/_uploads/rJpVI6iMgl.png)
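Here is a small sketch tying these pieces together, assuming the common log-frequency tf and idf definitions ($w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \cdot \log_{10}(N/\mathrm{df}_t)$) and cosine similarity over the resulting vectors; the exact formulas in the slides above may differ slightly, and the toy collection statistics are assumptions.

```python
import math
from collections import Counter

def jaccard(query_terms, doc_terms):
    """|A ∩ B| / |A ∪ B| over the two vocabularies."""
    a, b = set(query_terms), set(doc_terms)
    return len(a & b) / len(a | b)

def tf_idf_vector(tokens, df, n_docs):
    """Map each term to (1 + log10 tf) * log10(N / df) (assumed weighting)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(tf[t])) * math.log10(n_docs / df[t])
            for t in tf if df.get(t)}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# toy collection statistics: document frequency of each term across N = 3 docs
df = {"pepperoni": 2, "pizza": 2, "love": 1}
query_vec = tf_idf_vector(["pepperoni", "pizza"], df, n_docs=3)
doc_vec = tf_idf_vector(["i", "love", "pepperoni", "pizza"], df, n_docs=3)

print(jaccard(["pepperoni", "pizza"], ["i", "love", "pepperoni", "pizza"]))  # 0.5
print(cosine(query_vec, doc_vec))  # cosine score used to rank this document
```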
### Relevance Models

**Relevance models** are used to make assumptions about a search problem so that it is easier to find high-quality documents. Some of these include topical, user, binary, continuous, and multi-valued relevance.
- The **bag of words model** disregards the order of words and focuses only on the frequency of words in a document

---

## Query Optimization

Here we focus on the most **efficient** processing algorithms and methodologies instead of the highest-quality rankings.

### Distributed Query Evaluation

Distributed query evaluation (DQE) processes queries on multiple machines to speed up processing. There are two machine components:
- The **director machine** takes in queries, sends them to child machines, and organizes the results to return to the user
- **Index servers** are child machines that process queries

![distributed_query_architecture](https://hackmd.io/_uploads/SyTrrumfle.jpg)

![Screenshot 2025-05-27 at 12.14.53 PM](https://hackmd.io/_uploads/SkaE1qmMle.png)

There are two main methods of distribution.

**Document distribution** is when index servers act as search engines for a fraction of the documents. Each server receives a copy of the query and returns its top-k results, which are merged into a ranked list by the director.

**Term distribution** is when a single index is split across multiple index servers. Usually the server holding the longest inverted list processes the query, and the other servers send their information to that server.

### Query Caching

The goal of query caching is to divide documents efficiently to process and store. **Zipf's law** states that most terms (and queries) are rare while only a few are very popular: the rank-1 most common word may show up about 1/10th of the time, with frequency falling off roughly in inverse proportion to rank (the rank-2 word appears about half as often). This follows the statistical patterns of human speech.

![Screenshot 2025-05-27 at 12.18.58 PM](https://hackmd.io/_uploads/ByaXeq7Mgg.png)

Caches are designed around the frequencies described by this law. More frequent words are stored in the cache in RAM; less frequent words are stored in long-term storage.

### Query Processing

There are two ways to process queries.

**Document-at-a-time (DAAT)** calculates the scores of the term lists for *one document at a time*. Docs are processed sequentially. Below, each column represents a new document; columns are added one at a time after the scores are calculated for each document.

![Screenshot 2025-05-27 at 12.45.52 PM](https://hackmd.io/_uploads/HJh_8c7zlg.png)

**Term-at-a-time (TAAT)** accumulates the scores for each document by computing the scores of the *term lists one at a time*. Docs are processed concurrently. Below, we accumulate a partial score for each document. Documents (again, represented as columns) are processed at the same time, and each document's partial score grows as each term list is processed.

![Screenshot 2025-05-27 at 12.46.59 PM](https://hackmd.io/_uploads/ry1685XMgx.png)

### Optimization

TAAT uses more memory for accumulating the partial scores but reads the inverted lists from disk more efficiently.

There are two classes of optimization:
1. For simple features: read less data from the inverted lists with skip pointers
2. For complex features: use **conjunctive processing** to calculate scores for fewer documents. It requires that a returned document contains all query terms.

**Conjunctive processing** requires that documents contain all query words. This is the default mode of search engines.
- **Optimization:** begin by selecting the documents containing the rarest terms to reduce the search space (ex: start with "Penguin AND Plushie", which are rare words, then combine with "The", which is a more common word)
- **Conjunctive TAAT**: checks the accumulator for the next document containing all query terms (the target doc), then uses skip pointers to skip over postings up to the target doc
- **Conjunctive DAAT**: finds the largest document an inverted list points to and skips to that document

**Threshold methods** extract a number (k) of top-ranked documents. The threshold (τ) is the *minimum score* a document must reach to be shown to the user; documents that do not reach τ are ignored. We use an estimate $τʹ$ since the actual value of τ is usually unknown.
- In DAAT, τʹ = the score of the lowest-ranked document retrieved so far.
- **MaxScore** determines the maximum possible score of a document, compares it to τʹ, and skips all documents that cannot meet the threshold. This is like the pre-screening process for resumes: applicants with a low GPA, no professional experience, and no references are rejected because their cumulative score fails to meet the threshold.

$$
\text{if } \mathrm{MaxScore}(d) < \tau' \text{, skip } d
$$
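Here is a minimal sketch of term-at-a-time scoring with accumulators, followed by a simple τʹ cut on the final scores. The inverted-list layout (`term -> {doc_id: weight}`) and the weights are illustrative assumptions; a real MaxScore-style optimization would skip documents during list processing rather than filtering at the end.

```python
from collections import defaultdict

def taat_score(query_terms, inverted_lists, tau_prime=0.0, k=10):
    """Term-at-a-time: process one term list at a time, accumulating partial scores."""
    accumulators = defaultdict(float)           # doc_id -> partial score
    for term in query_terms:
        for doc_id, weight in inverted_lists.get(term, {}).items():
            accumulators[doc_id] += weight      # add this term's contribution
    ranked = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, s) for d, s in ranked[:k] if s >= tau_prime]  # drop docs below the threshold

# toy inverted lists with precomputed term weights per document
inverted_lists = {
    "penguin": {1: 2.0, 3: 1.25},
    "plushie": {1: 1.5, 2: 0.75, 3: 0.25},
}
print(taat_score(["penguin", "plushie"], inverted_lists, tau_prime=1.0))
# [(1, 3.5), (3, 1.5)] and doc 2 (score 0.75) falls below τʹ
```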
### Safety

Safety describes whether we successfully retrieve the most relevant / highest-scoring documents from the corpus.
- Threshold methods are safer since they process all documents first and only prune those that have no hope
- Early termination may ignore high-frequency term lists in TAAT and ignore documents at the end of the list in DAAT
- List ordering (ordering inverted lists by a quality metric) will favor early entries and ignore entries at the end of the list

---

## misc: things to explore

- query disambiguation
- knowledge graph traversal
- data ingestion and fusion
- ranking systems
- efficient text indexing
- vector space retrieval models
- evaluations and interface issues
- doc clustering and classification
- ml-based ranking approaches

Lecture counter
- [x] Lec 1
- [x] Lec 2
- [x] Lec 3
- [x] Lec 4
- [x] Lec 5
- [ ] Lec 6
- [ ] Lec 7
- [ ] Lec 8
- [ ] Lec 9
- [ ] Lec 10
- [ ] Lec 11
- [ ] Lec 12
- [ ] Lec 13
- [ ] Lec 14
- [ ] Lec 15
- [ ] Lec 16
- [ ] Lec 17
- [ ] Lec 18
- [x] Lec 19
- [x] Lec 20
- [ ] Lec 21
- [ ] Lec 22
- [ ] Lec 23
- [ ] Lec 24
- [ ] Language Modeling