A compilation of notes from COMPSCI 121 at UC Irvine on web-based search. Topics include web crawlers, search engine optimization, and semantic graph traversal.
Information retrieval (IR) is the process of extracting information (raw or semantically meaningful) from the web. Web engines use graph-based representations of information to algorithmically access and traverse the web. We will first set up a basic understanding of the web.
Nature of the web:
The reason we need web engines is primarily a social one:
The core difference between a database and a search engine is relevance. Databases only take into account textual similarity. Search engines extend to contextual cues from the user, the document, and the web. IR is a core process that enables this.
There are several types of IR:
Classic assumptions of IR include:
Good retrieval model designs are used to match a query from the user to one or more documents.
There are two primary types of search: 1) searching through a relational database based on structured queries, and 2) web engines that search based on unstructured queries
Web engines crawl the bytes of a page ahead of time to provide sub-second results. Search engine design focuses on optimizing how websites are processed so they can be handled more quickly
The primary design issues are:
Web engines search the web through a structured process: parsing HTML and data on the web, transforming the data, and indexing it for retrieval.
Inputs: User query (keywords and operators (ex: AND))
Outputs: Result summaries of pages with important words highlighted
Here is an overview of the process:
There are multiple ways to extract web data (data dumps, document stores, etc.). The most popular is the web crawler (spider), which crawls the web, returns the content of relevant pages, and follows the hrefs on each page to find the next pages to scan, until all documents are scanned.
First, we must transform the data so we can easily process it.
The process is as follows:
Tokenizing words is the process of splitting text into meaningful linguistic units (tokens) to make it easier for the machine to parse.
Example: "APPles can't hat."
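A minimal sketch of how such a sentence might be tokenized (this regex-based splitting is an assumption, not the course's exact rules):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase so "APPles" and "apples" become the same token,
    # then split on any run of non-alphanumeric characters.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(tokenize("APPles can't hat."))  # ['apples', 'can', 't', 'hat']
```

Note how naive splitting breaks the contraction "can't" into two tokens; handling cases like this is part of what makes tokenization non-trivial.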
Tokenization follows the properties below:
The inverted index data structure is consulted for each query: a list-based structure that maps each word to the documents that contain it (i.e. which documents contain the target word). Indices are created through the following process:
The evaluation process analyzes the data using the indices, making the search process faster and more efficient.
There are two major parts of evaluation:
Workflow:
SEO helps users better find and understand information. Search results depend on the following:
Popular searches (i.e. trending searches) are kept in the cache and used to "autocomplete" results
Most search queries are high-variance and ill-defined. The engine uses the following process to extract the actual user intention:
You can safeguard the relevance of the page using relevance scores (see step 4).
Ad engines are the main profit system on the web. Promoted ads use a pay-per-click (PPC) system in which advertisers bid on keywords entered in the search query. The engine runs an auction based on parameters like the price paid for the ad, relevance, and expected click-through rate to determine which ad to display.
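The exact auction formula is not given here, but one common illustration is an "ad rank" score that multiplies the bid by the expected click-through rate and relevance, with the highest-ranked ad winning the slot. A toy sketch (the names and numbers are made up):

```python
def ad_rank(bid: float, expected_ctr: float, relevance: float) -> float:
    # Hypothetical scoring: weigh how much the advertiser pays per click
    # against how likely the ad is to be clicked and how relevant it is.
    return bid * expected_ctr * relevance

ads = [
    {"name": "ad_a", "bid": 2.50, "ctr": 0.04, "rel": 0.9},
    {"name": "ad_b", "bid": 1.00, "ctr": 0.10, "rel": 0.8},
]
winner = max(ads, key=lambda a: ad_rank(a["bid"], a["ctr"], a["rel"]))
print(winner["name"])  # ad_a (0.09 vs 0.08) in this toy example
```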
There are two primary types of search engine metrics:
Search engines first need data from the web to operate and determine relevancy to a search query. This process of extracting information from the web is called web crawling. There are several components to the crawling architecture:
Basic algorithm:
robots.txt is used to determine web crawler permissions on a webpage:
- Disallow: denies the crawler access to these pages
- Sitemap: lists what the crawler can crawl
- User-agent: specifies which bot the rules apply to

This basic algorithm is not used in practice because it does not respect the privacy of the website owner, takes a long time, and returns irrelevant noisy data.
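Still, as a rough sketch of that basic loop, here is a minimal crawler that checks robots.txt (via Python's standard urllib.robotparser) and adds a politeness delay; the seed handling and href regex are simplified placeholders:

```python
import re
import time
import urllib.request
import urllib.robotparser

def crawl(seed: str, delay: float = 1.0, limit: int = 100) -> dict[str, str]:
    # Assumes the seed is a site root; a real crawler checks robots.txt per host.
    robots = urllib.robotparser.RobotFileParser(seed.rstrip("/") + "/robots.txt")
    robots.read()
    pages: dict[str, str] = {}   # url -> raw HTML
    frontier = [seed]            # URLs waiting to be fetched
    while frontier and len(pages) < limit:
        url = frontier.pop(0)
        if url in pages or not robots.can_fetch("*", url):
            continue             # skip already-seen or disallowed pages
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        pages[url] = html
        # Naive href extraction; a real crawler would use an HTML parser.
        frontier.extend(re.findall(r'href="(http[^"]+)"', html))
        time.sleep(delay)        # politeness: do not overwhelm the server
    return pages
```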
Politeness: placing delays throughout the crawling process to prevent overwhelming servers
Performance:
Crawler traps: Structural issues with webpages that lead to infinite pagination of links.
Duplicate detection: The internet contains many near-duplicate websites that are not identical (different hashes). Detectors can approximate semantic similarity, but the methods are largely ineffective.
Data noise: Tags and ads cause a lot of noise in the retrieved data, which can compromise IR. The cumulative distribution of tags can help the engine select blocks of text based on where the curve plateaus (i.e. where the number of tags decreases).
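One way to sketch this idea (assuming the page has already been tokenized and each token is flagged as tag or text): pick the span that keeps the most text tokens inside while leaving the most tag tokens outside, which corresponds to the flat region of the cumulative tag curve. The brute-force version below is illustrative only:

```python
def main_text_block(is_tag: list[int]) -> tuple[int, int]:
    """is_tag[n] is 1 if token n is an HTML tag, 0 if it is text.
    Returns (i, j) so that tokens i..j-1 form the main content block."""
    n = len(is_tag)
    prefix = [0] * (n + 1)              # prefix[k] = tag tokens before position k
    for k, bit in enumerate(is_tag):
        prefix[k + 1] = prefix[k] + bit
    best, best_span = -1, (0, n)
    for i in range(n + 1):
        for j in range(i, n + 1):
            tags_outside = prefix[i] + (prefix[n] - prefix[j])
            text_inside = (j - i) - (prefix[j] - prefix[i])
            if tags_outside + text_inside > best:
                best, best_span = tags_outside + text_inside, (i, j)
    return best_span
```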
Multithreading runs multiple worker threads that parse different categories of webpages simultaneously and bundles the results into one unit at the end.
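A minimal sketch using Python's ThreadPoolExecutor: several workers fetch pages at once and the results are bundled into a single dictionary (the fetch function is a stand-in for real downloading and parsing):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url: str) -> str:
    # Stand-in worker: download one page; real code would also parse it.
    return urllib.request.urlopen(url).read().decode("utf-8", "ignore")

def fetch_all(urls: list[str], workers: int = 8) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map runs fetch() across worker threads and preserves input order,
        # so the results can be bundled into one dict at the end.
        return dict(zip(urls, pool.map(fetch, urls)))
```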
The goal of tokenization is to extract meaning from a series of words
Rules:
Boolean retrieval retrieves information using boolean operators (AND, OR, NOT).
- Uses a term-document binary matrix, where 0 and 1 indicate which documents a specific term appears in. All matching documents are retrieved by applying bitwise operators to the rows.
An inverted index is a data structure that maps terms to the documents they appear in. Instead of storing whether or not a term appears in EVERY document, it only stores the documents the term appears in and nothing else. This increases efficiency since it reduces the search space.
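A toy sketch of both ideas: build an inverted index from a few documents, then answer an AND query by intersecting postings lists instead of scanning every document:

```python
from collections import defaultdict

docs = {
    1: "penguin plushie for sale",
    2: "the penguin waddles",
    3: "plushie penguin the best",
}

# Inverted index: term -> list of doc ids that contain it.
index: dict[str, list[int]] = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):
        index[term].append(doc_id)

def and_query(*terms: str) -> set[int]:
    # Intersect the postings lists; only docs containing every term survive.
    postings = [set(index.get(t, [])) for t in terms]
    return set.intersection(*postings) if postings else set()

print(and_query("penguin", "plushie"))  # {1, 3}
```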
Boolean search is a primitive form of information retrieval that requires exact queries and expert understanding. The feast-or-famine problem means that a query will either return far too many results or none at all.
Ranked retrieval model returns the top scored docs for the query.
Queries and Documents are represented as vectors in space. Documents are ranked based on their cosine similarity to the query in space.
Vector space scoring finds the K docs in the collection that are nearest to the query vector (smallest angle between the query vector and the document vector).
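Concretely, the score of document $d$ for query $q$ is the cosine of the angle between their vectors:

$$\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}$$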
Relevance models are used to make assumptions about a search problem to make it easier to find high quality documents. Some of these include topical, user, binary, continuous, and multi-valued relevance.
Here we focus on the most efficient processing algorithms and methodologies rather than the highest-quality rankings.
DEQ processes queries on multiple machines to speed up processing. There are two machine components of a DEQ:
There are two main methods of distribution.
Document distribution is when each index server acts as a search engine for a fraction of the documents. Each server receives a copy of the query and returns its top-k results, which the director merges into a single ranked list.
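A small sketch of the director's merge step for document distribution (assuming each index server returns its local top-k as (score, doc_id) pairs sorted by descending score):

```python
import heapq

def merge_topk(per_server: list[list[tuple[float, int]]], k: int) -> list[tuple[float, int]]:
    # Each server's list is already its own top-k in descending score order;
    # the director merges the lists and keeps the global top-k.
    merged = heapq.merge(*per_server, key=lambda pair: -pair[0])
    return list(merged)[:k]

print(merge_topk([[(0.9, 12), (0.4, 7)], [(0.8, 31), (0.7, 2)]], k=3))
# [(0.9, 12), (0.8, 31), (0.7, 2)]
```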
Term distribution is when a single index is split across multiple index servers, each holding the inverted lists for a subset of terms. Usually the server with the longest inverted list processes the query, and the other servers send their information to that server.
The goal of query caching is to decide what to keep in fast storage so frequent requests can be processed and served efficiently. Zipf's law states that most words (and queries) are rare and only a few are very popular: the rank-1 most common word may account for about 1/10 of all occurrences, the rank-2 word about half of that, and so on down the scale. This follows the statistical patterns of human language.
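In formula form (the constant below is the commonly cited value for English text, not a number from these notes), the frequency of the word with rank $r$ is roughly

$$f(r) \approx \frac{c}{r}, \qquad c \approx 0.1,$$

so the rank-2 word appears about half as often as the rank-1 word, the rank-10 word about a tenth as often, and so on.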
Caches are designed around the frequencies described by this law: more frequent items are kept in the cache in RAM, while less frequent items are stored in long-term storage on disk.
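A very simplified sketch of a frequency-based cache (the capacity, eviction rule, and run_query callback are all assumptions for illustration):

```python
class QueryCache:
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.counts: dict[str, int] = {}       # how often each query was seen
        self.cache: dict[str, list[int]] = {}  # query -> cached result doc ids

    def get(self, query: str, run_query) -> list[int]:
        self.counts[query] = self.counts.get(query, 0) + 1
        if query in self.cache:
            return self.cache[query]            # fast path: served from RAM
        results = run_query(query)              # slow path: hit the full index
        if len(self.cache) >= self.capacity:
            # Evict the cached query that has been asked least often,
            # but only if the new query is at least as popular.
            coldest = min(self.cache, key=lambda q: self.counts[q])
            if self.counts[coldest] >= self.counts[query]:
                return results
            del self.cache[coldest]
        self.cache[query] = results
        return results
```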
There are two ways to process queries.
Document at a time (DAAT) calculates scores for the term lists in one document at a time. Docs are processed sequentially.
Below, each column represents a new document. Columns are added one at a time after the scores are calculated for each document.
Term at a time (TAAT) accumulates the scores for each document by computing the scores of the term lists one at a time. Docs are processed concurrently.
Below, we accumulate the partial score for each term. Documents (again, represented as columns) are processed at the same time. The partial scores of each document are added to the scores for each term list.
TAAT uses more memory for accumulating the partial scores, but accesses disk more efficiently since each inverted list is read sequentially.
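A toy sketch of both strategies over the same postings lists (weights are made-up integers standing in for real term weights):

```python
from collections import defaultdict

# Toy postings lists: term -> list of (doc_id, weight), sorted by doc_id.
postings = {
    "penguin": [(1, 5), (3, 2), (4, 7)],
    "plushie": [(1, 4), (4, 1)],
}

def taat(query_terms):
    # Term-at-a-time: walk one whole postings list at a time,
    # accumulating partial scores for every document in memory.
    acc = defaultdict(int)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id] += weight
    return sorted(acc.items(), key=lambda x: -x[1])

def daat(query_terms):
    # Document-at-a-time: visit documents in doc_id order and finish
    # each document's full score before moving on to the next one.
    lists = [postings.get(t, []) for t in query_terms]
    doc_ids = sorted({d for lst in lists for d, _ in lst})
    scores = [(d, sum(w for lst in lists for doc, w in lst if doc == d)) for d in doc_ids]
    return sorted(scores, key=lambda x: -x[1])

print(taat(["penguin", "plushie"]))  # [(1, 9), (4, 8), (3, 2)]
print(daat(["penguin", "plushie"]))  # same scores, different traversal order
```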
There are two classes of optimization:
Conjunction processing requires that documents contain all query words. This is the default mode of search engines
- Optimization: Begin by selecting the documents containing the rarest terms to reduce the search space (e.g. for "The Penguin Plushie", start with the rare words "Penguin" and "Plushie", then combine with the more common word "The").
- Conjunctive TAAT: Checks the accumulator for the next document containing all query terms (the target doc). Uses skip pointers to jump over postings up to the target doc.
- Conjunctive DAAT: Finds the largest document that any inverted list currently points to and skips the other lists forward to that document.
Threshold methods extract a number (k) of top-ranked documents. The threshold $τ$ is the minimum score a document must reach to be shown to the user; documents that do not reach $τ$ are ignored. We use an estimate $τʹ$ since the actual value of $τ$ is usually unknown.
Safety describes whether we successfully retrieve the most relevant / highest-scoring documents from the corpus.