
CS 121: Information Retrieval

A compilation of notes from COMPSCI 121 at UC Irvine on web-based search. Topics include web crawlers, search engine optimization, and semantic graph traversal.

Intro

Information retrieval (IR) is the process of extracting information (raw or semantically meaningful) from the web. Web engines represent this information as a graph that we can algorithmically access and traverse. We will first set up a basic understanding of the web.

Nature of the web:

  • the corpus is not centralized (rather, it's distributed)
  • constant updating
  • significant duplication (30-40% of the web)
  • high linkage (many hyperlinks between pages)


The reason we need web engines is primarily a social one:

  • incentive for content creation
  • aggregation of "niche" groups
  • best way to monetize/advertise the web

The core difference between a database and a search engine is relevance. Databases only take into account textual similarity, while search engines extend to contextual cues from the user, the document, and the web. IR is a core process that enables this.

There are several types of IR:

  • search (what we learn here)
  • filtering (finding objects of interest)
  • classification (labeling information)
  • question answering (ex: ChatGPT)

Classic assumptions of IR include:

  • Corpus: there is a large, fixed amount of information
  • Relevance: a scoring function R(Q, D) scores each object based on the relevance between the query Q and the document's contents D

A good retrieval model is designed to match a user's query to the relevant document(s).
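As a toy illustration only (not a model taught in the course), a relevance function R(Q, D) could be as simple as the fraction of query terms that appear in the document; real retrieval models (TF-IDF, BM25, learned rankers) use much richer signals.

def relevance(query, document):
    # Toy R(Q, D): fraction of query terms that also appear in the document
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & d_terms) / len(q_terms)

# rank a (made-up) collection of documents against a query
docs = ["UC Irvine information retrieval course notes",
        "recipes for baking bread at home"]
query = "information retrieval notes"
ranked = sorted(docs, key=lambda d: relevance(query, d), reverse=True)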

There are two primary types of search: 1) searching through a relational database based on structured queries, and 2) web engines that search based on unstructured queries.
Web engines crawl the bytes of a page ahead of time so they can provide sub-second results; search engine design focuses on optimizing how websites are processed so queries can be answered quickly.

The primary design issues are:

  • Performance (latency, throughput)
  • Coverage
  • Freshness
  • Spam/Inaccurate information

Search Engine Architecture

Web engines search the web through a structured process: acquiring and parsing HTML and data from the web, transforming the data, building an index, and ranking the results.


Inputs: User query (keywords and operators (ex: AND))
Outputs: Resulting summaries of pages with important words highlighted

Here is an overview of the process:

  1. User inputs a query
  2. Web crawlers acquire text
  3. Text is processed, transformed, and an index is created
  4. Information is ranked, evaluated
  5. Information/relevant websites are displayed on the UI

Step 2: Text acquisition

There are multiple ways to acquire web data (data dumps, document stores, etc.). The most popular is a web crawler (spider), which crawls the web, returns the content of relevant pages, and follows the hrefs on each page until all documents are scanned.

  • Scan the HTML of the current page to find additional links (hrefs) and add them to a queue
  • The spider takes the next URL from the queue and scans that page
  • This repeats, returning the not-yet-indexed pages of the corpus C

Step 3: Text Processing and transformation

First, we must transform the data so we can easily process it.

The process is as follows:

  • Parse and tokenize words
  • Group semantically similar words (ex: fish and fishes)
  • Remove semantically irrelevant words (ex: the, and, there)
  • Analyze the popularity of links
  • Extract the structure of the text (ex: title, italics, header)
  • Classifier categorizes the document (ex: spam, website, etc)

Tokenizing words is the process of splitting text into meaningful linguistic units to make it easier for the machine to parse.

  • Given text: APPles can't hat.
  • Tokens: {'apple', 's', 'can', 't', 'hat'}

It follows the properties below:

  • sequence of 3 or more alphanumeric characters
  • semantically meaningful
  • all characters converted to lowercase
  • Byte Pair Encoding (BPE) is a type of tokenization that breaks words into byte sized components (ex: "a", "b", "c") and combines them into semantically meaningful subunits (ex: "ab", "ph") that can be re-combined later
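A minimal sketch of the first and third properties (runs of 3 or more alphanumeric characters, lowercased); grouping similar words (stemming) and BPE are left out, and the regex here is an assumption rather than the course's reference tokenizer.

import re

def tokenize(text):
    # keep runs of 3 or more alphanumeric characters, lowercased
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]{3,}", text)]

tokenize("APPles can't hat.")   # ['apples', 'can', 'hat']

A separate stemming/grouping step would further map 'apples' to 'apple', as in the example above.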

The inverted index is built from the document collection: a list-based data structure that maps each word to the documents it is contained in (i.e. which documents contain the target word). Indices are created through the following process:

  • terms in the document are weighted based on importance/relevance (relative to the search query)
  • invert the index: instead of mapping each document to its terms (doc > text), map each term to the documents that contain it (text > doc)
  • distribute terms widely to make them easier to search (the same principle behind creating efficient hash maps)
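A minimal sketch of building an inverted index from a tiny, made-up collection, mapping each term to the set of document IDs that contain it:

from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict of doc_id -> text; returns term -> set of doc_ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "fish and fishes swim", 2: "fish recipes", 3: "graph traversal"}
index = build_inverted_index(docs)
index["fish"]   # {1, 2} -- the documents that contain the term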


Step 4: Evaluation

The evaluation process analyzes how well results retrieved from the index answer the query; it is used to make the search process faster and more effective.

There are two major parts of evaluation:

  1. Precision: the fraction of retrieved hits that are relevant (especially the first few)
    a. generally the most important metric for web search
  2. Recall: the fraction of relevant documents that are retrieved
    a. most important when completeness is required
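A small sketch of computing both metrics for a single query; the retrieved and relevant ID sets below are made up.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# engine returned docs {1, 2, 3, 4}; judges marked {2, 4, 7} as relevant
precision_recall({1, 2, 3, 4}, {2, 4, 7})   # (0.5, ~0.67)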

Workflow:

  • Logging: monitors and records activity on the webpage (ex: searches, timestamps, etc.)
  • Ranking analysis: ranks documents from the index based on relevance; performed by different processing brokers
  • Performance Analysis: performed by a human.

Search Engine Optimization

Search engine optimization (SEO) helps the engine better understand and serve users' information needs.

User behavior

Popular searches (i.e. trending searches) are kept in a cache and used to "autocomplete" queries.
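A rough sketch of prefix-based autocomplete over a cache of popular queries; the cache contents are invented, and real systems use tries, personalization, and freshness signals on top of this.

def autocomplete(prefix, popular_queries, k=3):
    # popular_queries: dict of query -> search count (the trending cache)
    prefix = prefix.lower()
    matches = [q for q in popular_queries if q.startswith(prefix)]
    return sorted(matches, key=popular_queries.get, reverse=True)[:k]

cache = {"weather irvine": 900, "web crawler": 750, "weather today": 1200}
autocomplete("wea", cache)   # ['weather today', 'weather irvine']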

Most search queries are high-variance and ill-defined. The engine uses a multi-step process to extract the user's actual intention.


You can safeguard the relevance of the page using the relevance score R(Q, D). See Step 4.

Business model

Ad engines are the main profit system on the web: ads are promoted alongside search results. Ads use a pay-per-click (PPC) system in which advertisers bid on keywords entered in the search query. The engine runs an auction based on parameters like the price paid for the ad, its relevance, and its expected click-through rate to determine which ad to display.
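As a simplified sketch (not the actual auction any particular engine uses), an ad can be chosen by ranking bids by expected revenue per impression, i.e. bid × predicted click-through rate:

def pick_ad(bids):
    # bids: list of (advertiser, bid_per_click, predicted_ctr)
    # expected revenue per impression = bid * CTR
    return max(bids, key=lambda b: b[1] * b[2])

bids = [("shoe_store", 2.00, 0.01),    # $2.00 bid, 1% predicted CTR
        ("running_blog", 0.50, 0.05)]  # $0.50 bid, 5% predicted CTR
pick_ad(bids)   # ('running_blog', 0.5, 0.05) -- higher expected value

Real auctions (e.g. generalized second-price) add quality scores and pricing rules on top of this ranking.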

There are two primary search advertising metrics:

  • Cost-per-click (CPC)
  • Cost-per-mille (CPM): cost per 1,000 user impressions

Techniques

Web Crawling

Search engines first need data from the web in order to operate and determine relevance to a search query. This process of extracting information from the web is called web crawling. There are several components to the crawling architecture:

  • The crawler has access to a local data structure called the "frontier", a queue that stores URLs it has discovered but not yet visited.
  • A DNS resolver translates URLs into IP addresses, from which a fetcher retrieves the raw HTML content.
  • A parser parses the HTML content and converts it into a structured store of URLs and page content.


Basic algorithm:

  1. Initialize queue of URLs (seeds)
  2. For each URL:
    • check that the current URL can be crawled (robots.txt permissions)
    • fetch the page and add all the links it contains to the queue
def crawler_thread(frontier):
    while not frontier.done():
        website = frontier.next_site()
        url = website.next_url()

        if website.permits_crawl(url): # permissions based on robots.txt
            text = retrieve_url(url)
            store_document(url, text) # temporary storage of fetched pages

            for link in parse(text): # extract the hrefs on the page
                frontier.add_url(link) # store the next URLs to crawl

        frontier.release_site(website)

robots.txt is used to specify web crawler permissions for a website:

  • Disallow: paths the crawler is not allowed to access
  • Sitemap: points the crawler to a list of the site's URLs that can be crawled
  • User-agent: specifies which bot the following rules apply to
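A minimal sketch using Python's standard urllib.robotparser to check these permissions; the robots.txt content and URLs below are made up.

from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

rp.can_fetch("MyCrawler", "https://example.com/private/page")  # False
rp.can_fetch("MyCrawler", "https://example.com/public/page")   # True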

This basic algorithm is not used in practice because it does not respect the privacy and rate limits of the website owner, takes a long time, and returns irrelevant, noisy data.

Designing effective web crawlers

Politeness: respect robots.txt and rate-limit requests so the crawler does not overwhelm any single site.

Performance: fetch from many sites concurrently (multithreading or distributed crawling) to keep latency and throughput acceptable.

Crawler traps: structural issues with webpages that lead to infinite pagination of links. They can be avoided with heuristics such as limiting URL length and crawl depth and detecting repeating URL patterns.
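A rough sketch of such a heuristic filter; the length/depth thresholds and the repeated-segment rule are assumptions for illustration, not prescribed values.

from urllib.parse import urlparse

def looks_like_trap(url, max_length=200, max_depth=10):
    # very long URLs or very deep paths often come from generated pages
    if len(url) > max_length:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # heavily repeated path segments, e.g. /cal/2025/2025/2025/...
    return bool(segments) and len(set(segments)) < len(segments) / 2

looks_like_trap("https://site.com/a/b/c")                    # False
looks_like_trap("https://site.com/cal/2025/2025/2025/2025")  # True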

Duplicate detection: the web contains many near-duplicate pages that are not bit-for-bit identical (and so have different hashes). Detectors can capture some of this similarity, but the methods are imperfect.
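A sketch of one common approach (word shingling plus Jaccard similarity); the shingle size and threshold are arbitrary choices here, and large-scale systems use fingerprinting schemes such as MinHash/SimHash instead of exact set comparison.

def shingles(text, k=3):
    # overlapping k-word windows of the document
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc1), shingles(doc2))   # 0.4 for this pair
is_near_duplicate = similarity > 0.3                   # threshold is arbitrary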

Data noise: tags and ads cause a lot of noise in the retrieved data that can compromise IR. The cumulative distribution of tags can help the engine select blocks of text based on where the curve plateaus (i.e. where the number of tags decreases).
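A rough sketch of this idea: score tag tokens negatively and text tokens positively, then keep the contiguous span with the highest score (i.e. where the cumulative tag curve flattens out). The tokenizer and scoring here are simplifications assumed for illustration.

import re

def main_text_block(html):
    # split into tag tokens (<...>) and text tokens
    tokens = re.findall(r"<[^>]+>|[^<\s]+", html)
    scores = [-1 if t.startswith("<") else 1 for t in tokens]

    # Kadane-style scan for the span maximizing (text tokens - tag tokens)
    best, best_range, total, start = 0, (0, 0), 0, 0
    for i, s in enumerate(scores):
        total += s
        if total <= 0:
            total, start = 0, i + 1
        elif total > best:
            best, best_range = total, (start, i + 1)

    i, j = best_range
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))

html = "<div><a>nav</a></div><p>This is the main article text body.</p><div><a>ads</a></div>"
main_text_block(html)   # 'This is the main article text body.'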


Misc: things to explore

  • query disambiguation
  • knowledge graph traversal
  • data ingestion and fusion
  • ranking systems