
CS 121: Information Retrieval

A compilation of notes from COMPSCI 121 at UC Irvine on web-based search. Topics include web crawlers, search engine optimization, and semantic graph traversal.

Table of contents

  1. Introduction
  2. Search Engine Architecture
  3. Search Engine Optimization
  4. Web Crawling
  5. Tokenization
  6. Retrieval
  7. Query Scoring and Quality
  8. Query Optimization

Introduction

Information retrieval (IR) is the process of extracting information (raw or semantically meaningful) from the web. Web engines are graph-based representations of information that we use to algorithmically access and traverse it. We will first establish a basic understanding of the web.

Nature of the web:

  • the corpus is not centralized (rather, it is distributed)
  • constant updating
  • significant duplication (30-40% of the web)
  • high linkage (many hyperlinks between pages)


The reason we need web engines is primarily a social one:

  • incentive for content creation
  • aggregation of "niche" groups
  • best way to monetize/advertise the web

The core difference between a database and a search engine is relevance. Databases only take textual similarity into account; search engines extend to contextual cues from the user, the document, and the web. IR is the core process that enables this.

There are several types of IR:

  • search (what we learn here)
  • filtering (finding objects of interest)
  • classification (labeling information)
  • question answering (e.g., ChatGPT)

Classic assumptions of IR include:

  • Corpus: there is a large, fixed amount of information
  • Relevance: R(Q, D) scores an object based on the relevance between the query Q and a document's contents D

A good retrieval model is designed to match a user's query to the relevant document(s).

There are two primary types of search: 1) searching a relational database with structured queries, and 2) web engines that search based on unstructured queries.
Web engines crawl the bytes of a page to provide sub-second results. Search engine design focuses on optimizing how websites are stored and indexed so they can be processed more quickly.

The primary design issues are:

  • Performance (latency, throughput)
  • Coverage
  • Freshness
  • Spam/Inaccurate information

Search Engine Architecture

Web engines search the web through a structured process: parsing HTML and data on the web, transforming the data, indexing it, and ranking the results.


Inputs: user query (keywords and operators, e.g., AND)
Outputs: resulting summaries of pages and highlights of important words

Here is an overview of the process:

  1. User inputs a query
  2. Web crawlers acquire text
  3. Text is processed, transformed, and an index is created
  4. Information is ranked, evaluated
  5. Information/relevant websites are displayed on the UI

Step 2: Text acquisition

There are multiple ways to acquire web data (data dumps, document stores, etc.). The most popular is a web crawler (spider), which crawls the web, returns the content of relevant pages, and follows the next hrefs on each page until all documents are scanned.

  • Scan the HTML of the current page to find additional links (hrefs) and add them to a queue
  • The spider takes the next URL from the queue and scans that page
  • Returns the crawled, not-yet-indexed pages

Step 3: Text Processing and transformation

First, we must transform the data so we can easily process it.

The process is as follows:

  • Parse and tokenize words
  • Group semantically similar words (ex: fish and fishes)
  • Remove semantically irrelevant words (ex: the, and, there)
  • Analyze the popularity of links
  • Extract the structure of the text (ex: title, italics, header)
  • Classifier categorizes the document (ex: spam, website, etc)

Tokenizing words is the process of breaking text into meaningful linguistic units to make it easier for the machine to parse.

  • Given query: APPles can't hat.
  • Tokens: {'apple', 's', 'can', 't', 'hat'}

It follows the properties below:

  • sequence of 3 or more alphanumeric characters
  • semantically meaningful
  • all characters converted to lowercase
  • Byte Pair Encoding (BPE) is a type of tokenization that breaks words into byte sized components (ex: "a", "b", "c") and combines them into semantically meaningful subunits (ex: "ab", "ph") that can be re-combined later
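
Below is a minimal sketch of a tokenizer that follows the first three properties (the regex, function name, and sample output are my own; this is the simple alphanumeric-run rule, not BPE):

import re

def tokenize(text: str) -> list[str]:
    # case folding: lowercase everything before matching
    text = text.lower()
    # keep runs of 3 or more alphanumeric characters as tokens
    return re.findall(r"[a-z0-9]{3,}", text)

print(tokenize("APPles can't hat."))  # ['apples', 'can', 'hat']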

The inverted index data structure is built from the document collection: a list-based data structure that maps each word to the documents it is contained in (i.e., which documents contain the target word). Indices are created through the following process (a small sketch follows the list):

  • terms in the document are weighted based on importance/relevance to the search query
  • invert the mapping: from document → terms to term → documents
  • distribute terms widely to make them easier to search (the same principle behind efficient hash maps)
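
A minimal sketch of building such an index (the document IDs, whitespace tokenization, and sample documents are assumptions for illustration):

from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "pizza with pepperoni", 2: "pepperoni pizza pizza", 3: "plain cheese"}
print(build_inverted_index(docs)["pizza"])  # [1, 2]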


Step 4: Evaluation

The evaluation process analyzes the data based on the indices. It makes the search process faster and more efficient.

There are two major parts of evaluation:

  1. Precision: how relevant the first few hits are
    a. generally the most important measure for web search
  2. Recall: the number of relevant hits retrieved
    a. most important when completeness is required
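
A small worked sketch of both metrics over a ranked result list (the result list and relevance judgments are made up):

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall(ranked: list[str], relevant: set[str]) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)

ranked = ["d1", "d7", "d3", "d9"]   # hypothetical engine output
relevant = {"d1", "d3", "d5"}       # hypothetical ground truth
print(precision_at_k(ranked, relevant, 2))  # 0.5 -> only d1 is relevant in the top 2
print(recall(ranked, relevant))             # 0.666... -> d1 and d3 found, d5 missed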

Workflow:

  • Logging: monitors and records activity on the webpage (ex: searching, timestamps, etc)
  • Ranking analysis: rank indices based on relevance. Performed by different processing brokers
  • Performance Analysis: performed by a human.

Search Engine Optimization

SEO helps users better understand information. Search results depend on the following:

  • the geographic location the data center receives the query from

User behavior

Popular searches (i.e., trending searches) are kept in the cache and used to "autocomplete" results.

Most search queries are high variance and ill-defined; the engine must infer the actual user intent from them.


You can safeguard the relevance of the page using relevance scores R(Q, D) (see Step 4: Evaluation).

Business model

Ad engines are the main profit system on the web. Promoted ads use a pay-per-click (PPC) system that bids on keywords entered in the search query. The engine runs an auction based on parameters like the price paid for the ad, relevance, and expected click-through rate to determine which ad to display (a rough sketch follows the list below).

There are two primary types of search engine ad metrics:

  • Cost-per-click: cost per user click
  • Cost-per-mille: cost per 1,000 user impressions
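
As a hedged sketch of how an auction might combine these parameters, a common textbook simplification ranks ads by bid × expected click-through rate (the exact production formula is proprietary; the ads below are made up):

def rank_ads(ads: list[dict]) -> list[dict]:
    """Order ads by expected revenue per impression: bid * expected CTR."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)

ads = [
    {"name": "A", "bid": 2.00, "ctr": 0.01},  # expected value 0.020
    {"name": "B", "bid": 0.50, "ctr": 0.05},  # expected value 0.025 -> shown first
]
print([ad["name"] for ad in rank_ads(ads)])   # ['B', 'A']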

Techniques


Web Crawling

Search engines first need data from the web to operate and to determine relevance to a search query. This process of extracting information from the web is called web crawling. There are several components to the crawling architecture:

  • The crawler has access to a local data structure called a "frontier", a queue that stores URLs it has discovered but not yet visited.
  • A DNS resolver translates URLs into IP addresses, from which a fetcher retrieves the raw HTML content.
  • A parser parses the HTML content and converts it into a structured database of URLs and page content.


Basic algorithm:

  1. Initialize queue of URLs (seeds)
  2. For each URL:
    • check that the current URL is allowed to be crawled
    • add all links on the page to the queue
def crawler_thread(frontier):
    while not frontier.done():
        website = frontier.next_site()
        url = website.next_url()

        if website.permits_crawl(url):  # permissions based on robots.txt
            text = retrieve_url(url)
            store_document(url, text)   # temporary storage of fetched pages

            for link in parse(text):    # extract hrefs from the fetched page
                frontier.add_url(link)  # queue the next URLs to crawl

        frontier.release_site(website)

robots.txt is used to determine web crawler permissions on a webpage (a small example of checking these rules follows the list):

  • Disallow: denies the crawler from accessing these pages
  • Sitemap: determines what the crawler can crawl
  • User-agent: determines the bot that can crawl the page
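
Python's standard library can read and apply these rules; a minimal sketch using urllib.robotparser (the site URL and user agent are hypothetical):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()                                     # fetch and parse the rules

# ask whether our crawler's user agent may fetch a given path
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))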

This basic algorithm is not used in practice because it does not respect the privacy of the website owner, takes a long time, and returns irrelevant, noisy data.

Designing effective web crawlers

Politeness: placing delays throughout the crawling process to prevent overwhelming servers.

Performance:

Crawler traps: Structural issues with webpages that lead to infinite pagination of links.

Duplicate detection: there are lots of near-duplicate (but not identical, and therefore differently hashed) websites on the internet. Detectors can somewhat detect semantic similarity, but the methods are imperfect.
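
One common approach (a sketch under my own assumptions, not the only method) is to compare word n-gram "shingles" of two pages with Jaccard similarity:

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """All overlapping n-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets; values near 1.0 suggest near-duplicates."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))  # 0.4 -> 4 of 10 distinct shingles shared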

Data noise: tags and ads cause a lot of noise in the retrieved data that can compromise IR. The cumulative distribution of tags can help the engine select blocks of main text based on where the curve plateaus (i.e., where the number of tags decreases).


Multithreading is the practice of running multiple worker threads that parse different categories of webpages simultaneously and bundle the results into one unit at the end.


Tokenization

The goal of tokenization is to extract meaning from a series of words

rules:

  • case folding is the process of transforming all characters into the same case (ex: Case Folding -> case folding); it speeds up comparisons.
  • hyphenated terms are treated as a single unit during processing
  • numbers can cause issues, e.g., turning the decimal 9.2 into the number 92
  • stemming/conflation is the process of extracting the root of a series of text and only storing that. For example, swam and swimming will fall under the same root "swim"
    • Algorithmic stemming (Porter stemmer): uses rule-based suffix removal (ex: removing -ing and -s endings) to extract the root form of the word. It may cause issues by forming non-existent words
    • Dictionary-based stemming: looks up the word in a dictionary to find the root
  • Phrase recognition attempts to separate words into meaningful units using the following methods:
    • Part-of-Speech (POS) tagging divides words into grammatical units (verb, noun, etc) which helps machines understand its syntactic structures. It's highly accurate at the expense of speed.
    • N-grams organize words into sequences; "N" corresponds to the number of words in a single unit. This helps search engines understand words that are better understood together (ex: New York)
      • Unigram: one word chunks (ex: ["I", "love", "pepperoni", "pizza"])
      • Bigram: two word chunks (ex: ["I love", "love pepperoni", "pepperoni pizza"])
      • Trigram: three word chunks (ex: ["I love pepperoni", "love pepperoni pizza"])
      • Algorithm: slide a window of size n across the sentence and store the chunks it finds (see the sketch after this list)
        Example: how many trigrams are in "the quick brown fox jumps over the lazy dog"?
        Answer: 9 words − 3 + 1 = 7 overlapping trigrams
    • Positional indexes: store the indexed occurrences of specific terms in the document. Used for proximity searches that measure how tightly coupled words are to each other
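
A minimal sketch of the sliding-window n-gram algorithm described above (the function name and examples are mine):

def ngrams(sentence: str, n: int) -> list[str]:
    """Slide a window of size n across the words and collect each chunk."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I love pepperoni pizza", 2))
# ['I love', 'love pepperoni', 'pepperoni pizza']
print(len(ngrams("the quick brown fox jumps over the lazy dog", 3)))  # 7 (9 - 3 + 1)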

Retrieval

Retrieval Process

Boolean retrieval retrieves information using boolean operators (AND, OR, NOT).
  • It uses a term-document binary matrix, in which 0 and 1 indicate which documents a specific term appears in. All matching documents are retrieved by applying bitwise operators to the rows.

An inverted index is a data structure that maps terms to the documents they appear in. Instead of storing whether or not a term appears in EVERY document, it only stores the documents the term appears in and nothing else. This increases efficiency since it reduces the search space.

  • A postings list is a detailed list attached to each word in the inverted index that follows this format:

    # example posting: (document, indexed positions in the document, frequency of the word)
    pizza → [
      (Doc 1, positions [3], frequency = 1),
      (Doc 2, positions [0], frequency = 1),
      (Doc 3, positions [1, 4], frequency = 2)
    ]

    The postings list must contain the document ID at the bare minimum.


  • The list of words is stored in a dictionary for O(1) lookup
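
To answer a boolean AND query over such an index, the engine intersects the (sorted) postings lists of the two terms; a minimal sketch:

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 3], [2, 3, 5]))  # [2, 3] -> docs containing both terms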

Retrieval Models

Boolean search is the primitive form of information retrieval; it requires exact queries and expert understanding. The feast-or-famine problem means that a query will either return far too many results or none at all.


Ranked retrieval model returns the top scored docs for the query.


Query Scoring and Quality

Query representation

Queries and Documents are represented as vectors in space. Documents are ranked based on their cosine similarity to the query in space.

Vector space scoring finds the K docs in the collection that are nearest to the query vector (smallest angle between the query vector and the document vector).
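
A minimal sketch of cosine scoring between a query vector and document vectors (the toy term-weight vectors are made up):

import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1.0, 1.0, 0.0]
docs = {"d1": [0.9, 0.8, 0.1], "d2": [0.0, 0.2, 1.0]}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2'] -> d1 points in nearly the same direction as the query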


Query Quality

Relevance models are used to make assumptions about a search problem to make it easier to find high quality documents. Some of these include topical, user, binary, continuous, and multi-valued relevance.


Query Optimization

Here we focus on the most efficient processing algorithms and methodologies rather than the highest-quality rankings.

Distributed Query Evaluation

Processing queries on multiple machines speeds up processing. There are two machine components of distributed query evaluation:

  • the director machine takes in queries, sends them to child machines, and organizes the results to return to the user
  • index servers are child machines that process queries


There are two main methods of distribution.
Document distribution is when each index server acts as a search engine for a fraction of the documents. Each receives a copy of the query and returns its top-k results, which are merged into a single ranked list by the director (a small sketch of the merge follows).
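
A rough sketch of the director's merge step under document distribution (the per-shard scores and doc IDs are made up):

import heapq

def merge_topk(shard_results, k):
    """Combine per-shard (score, doc_id) lists into one global top-k ranking."""
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)  # tuples sort by score first

shard_results = [
    [(0.9, "d3"), (0.4, "d8")],  # top hits from index server 1
    [(0.7, "d2"), (0.6, "d5")],  # top hits from index server 2
]
print(merge_topk(shard_results, 3))  # [(0.9, 'd3'), (0.7, 'd2'), (0.6, 'd5')]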

Term distribution is when a single index is processed by multiple index servers. Usually the server with the longest inverted list will process the query and other servers will send information to that server.

Query Caching

The goal of query caching is to store the results of frequent queries so they can be served efficiently. Zipf's law states that a few words (and queries) are very popular while most are rare: the rank-1 most common word may show up roughly 1/10th of the time, the rank-2 word about half as often, the rank-3 word about a third as often, and so on (frequency roughly proportional to 1/rank). This follows the statistical patterns of human language.

Caches are designed based on the frequency of words described by this law. More frequent words are stored in the cache in RAM; less frequent words are stored in long-term storage.

Query Processing

There are two ways to process queries.
Document at a time (DAAT) calculates the scores of all query term lists for one document at a time; documents are processed sequentially.
Conceptually, each document is a column of the score table, and columns are completed one at a time after the scores are calculated for that document.


Term at a time (TAAT) accumulates the scores for each document by computing the scores of the term lists one at a time; documents are processed concurrently.
We keep a partial score (accumulator) for each document: as each term list is processed, its partial scores are added to the running totals of the documents it mentions (a small sketch follows).
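
A minimal sketch of TAAT accumulation (the per-term postings scores are made up; a real engine would use weights such as tf-idf):

from collections import defaultdict

def taat_score(query_terms, index):
    """Process one term list at a time, adding partial scores per document."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, partial in index.get(term, []):
            accumulators[doc_id] += partial
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

index = {
    "pepperoni": [("d1", 0.5), ("d3", 0.2)],
    "pizza":     [("d1", 0.4), ("d2", 0.7)],
}
print(taat_score(["pepperoni", "pizza"], index))
# [('d1', 0.9), ('d2', 0.7), ('d3', 0.2)]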


Optimization

TAAT uses more memory for accumulating the partial scores, but it accesses the disk more efficiently since each inverted list is read sequentially.

There are two classes of optimization:

  1. For simple features: Read less data from inverted lists with skip pointers
  2. For complex features: Use Conjunctive processing to calculate scores for fewer documents. Requires that the returned document contains all query terms.

Conjunctive processing requires that documents contain all query words. This is the default mode of search engines.
  • Optimization: begin with the documents containing the rarest terms to reduce the search space (ex: for "Penguin AND Plushie AND The", intersect the rare terms "Penguin" and "Plushie" first, then combine with the much more common "The")
  • Conjunctive TAAT: check the accumulators for the next document containing all query terms (the target doc), then use skip pointers to jump over postings up to the target doc.
  • Conjunctive DAAT: find the largest document that an inverted list points to and skip to that document.

Threshold methods extract the top k ranked documents. The threshold τ is a minimum score a document must reach to be shown to the user; documents that do not reach τ are ignored. We use an estimate τʹ since the actual value of τ is usually unknown.

  • In DAAT, τʹ = the score of the lowest-ranked document retrieved so far.
  • MaxScore determines the maximum possible score a document can achieve, compares it to τʹ, and skips all documents that cannot meet the threshold. This is like pre-screening resumes: applicants with a low GPA, no professional experience, and no references are rejected because their cumulative score fails to meet the threshold.
    if MaxScore(d) < τʹ, skip d

Safety

Safety describes whether we successfully retrieve the most relevant / highest-scoring documents from the corpus.

  • Threshold methods are safer since they process all documents first and prune only those that have no hope of reaching the threshold
  • Early termination could ignore high-frequency word lists in TAAT and ignore documents at the end of the list in DAAT
  • List ordering (ordering inverted lists by quality metric) will favor early entries and ignore entries at the end of the list

misc: things to explore

  • query disambiguation
  • knowledge graph traversal
  • data ingestion and fusion
  • ranking systems
  • efficient text indexing
  • vector space retrieval models
  • evaluations and interface issues
  • doc clustering and classification
  • ml-based ranking approaches

Lecture counter

  • [ x ] Lec 1
  • [ x ] Lec 2
  • [ x ] Lec 3
  • [ x ] Lec 4
  • [ x ] Lec 5
  • [ ] Lec 6
  • [ ] Lec 7
  • [ ] Lec 8
  • [ ] Lec 9
  • [ ] Lec 10
  • [ ] Lec 11
  • [ ] Lec 12
  • [ ] Lec 13
  • [ ] Lec 14
  • [ ] Lec 15
  • [ ] Lec 16
  • [ ] Lec 17
  • [ ] Lec 18
  • [ x ] Lec 19
  • [ in progress ] Lec 20
  • [ ] Lec 21
  • [ ] Lec 22
  • [ ] Lec 23
  • [ ] Lec 24
  • [ ] Language Modeling