A compilation of notes from COMPSCI 121 at UC Irvine on web-based search. Topics include web crawlers, search engine optimization, and semantic graph traversal.
Information retrieval (IR) is the process of extracting information (raw or semantically meaningful) from the web. Web engines are graph-based representations of that information which we use to algorithmically access and traverse it. We will first set up a basic understanding of the web.
Nature of the web:
The reason we need web engines is primarily a social one:
The core difference between a database and a search engine is relevance. Databases only take textual similarity into account, while search engines extend to contextual cues from the user, the document, and the web. IR is the core process that enables this.
There are several types of IR:
Classic assumptions of IR include:
A good retrieval model is designed to match a user's query to the relevant document(s).
There are two primary types of search: 1) searching through a relational database with structured queries, and 2) web engines that search based on unstructured queries.
Web engines crawl and process the bytes of a page ahead of time so they can provide sub-second results. Search engine design therefore focuses on optimizing how websites are processed so results can be returned more quickly.
The primary design issues are:
Web engines search the web through a structured process that begins with parsing HTML and other data on the web and transforming that data; indexing and evaluation (covered below) complete the pipeline.
Inputs: user query (keywords and operators, e.g., AND)
Outputs: result summaries of pages with important words highlighted
Here is an overview of the process:
There are multiple ways to extract web data (data dumps, document stores, etc.). The most popular is a web crawler (spider), which crawls the web, returns the content of relevant pages, and extracts the next hrefs on each page to scan until all reachable documents have been scanned.
First, we must transform the data so we can easily process it.
The process is as follows:
Tokenization is the process of splitting text into meaningful linguistic units (tokens) so that it is easier for the machine to parse.
Example input: "APPles can't hat.<.>"
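A minimal tokenizer sketch applied to the example text (the normalization rules here, lowercasing and splitting on non-alphanumeric characters, are an assumption for illustration, not the course's exact rules):

```python
import re

def tokenize(text: str) -> list[str]:
    """Split raw text into lowercase alphanumeric tokens.

    Deliberately simple rule set (lowercase + split on non-alphanumeric
    characters); real tokenizers also handle contractions, hyphenation,
    numbers, etc.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("APPles can't hat."))   # ['apples', 'can', 't', 'hat']
```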
Tokenization follows the properties below:
The inverted index is the data structure the query is evaluated against: a list-based structure that maps each word to the documents it appears in (conceptually a term-document matrix, i.e., which documents contain the target word). It is built from the document collection, not from the query. Indices are created through the following process:
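As a concrete picture of the finished structure (a minimal sketch of my own, not the indexing process referenced above), the index can be built from a toy document collection and then used to answer an AND query by intersecting posting lists:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # a real system would reuse the tokenizer above
            index[token].add(doc_id)
    return index

def and_query(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Evaluate 'term1 AND term2 AND ...' by intersecting posting lists."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "apples are red", 2: "apples and hats", 3: "red hats"}
index = build_inverted_index(docs)
print(and_query(index, ["apples", "red"]))    # {1}
```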
The evaluation process analyzes the data using the indices; it makes the search process faster and more efficient.
There are two major parts of evaluation:
Workflow:
SEO (search engine optimization) structures a page so that search engines, and ultimately users, can better find and understand its information.
Popular searches (i.e., trending searches) are kept in the cache and used to "autocomplete" queries as the user types.
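A minimal sketch of how a cache of popular queries could drive autocomplete (the query log and frequency-based ranking are assumptions for illustration):

```python
from collections import Counter

# Hypothetical cache of popular queries with their counts.
popular_queries = Counter({
    "weather irvine": 950,
    "weather today": 720,
    "web crawler tutorial": 310,
})

def autocomplete(prefix: str, k: int = 3) -> list[str]:
    """Return up to k cached queries starting with the prefix, most popular first."""
    matches = [(q, c) for q, c in popular_queries.items() if q.startswith(prefix)]
    matches.sort(key=lambda qc: qc[1], reverse=True)
    return [q for q, _ in matches[:k]]

print(autocomplete("wea"))   # ['weather irvine', 'weather today']
```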
Most search queries are ambiguous and ill-defined, so results have high variance. The engine uses the following process to extract the actual user intention:
You can safeguard the relevance of a page using relevance scores (see step 4).
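The notes do not spell the score out; as one common example (not necessarily the formula intended above), tf-idf rewards terms that are frequent in the document but rare across the collection:

```python
import math

def tf_idf(term: str, doc: list[str], collection: list[list[str]]) -> float:
    """One common relevance score: frequent in this document (tf),
    rare across the collection (idf)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in collection if term in d)       # how many docs contain the term
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

docs = [["apples", "are", "red"],
        ["apples", "and", "hats"],
        ["red", "hats", "red"]]
print(round(tf_idf("red", docs[2], docs), 2))   # ≈ 0.27 with this toy collection
```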
Ad engines are the main profit system on the web. Ads are promoted through a pay-per-click (PPC) system in which advertisers bid on keywords entered in the search query. The engine runs an auction based on parameters like the price paid for the ad, relevance, and expected click-through rate (CTR) to determine which ad to display.
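A toy sketch of the auction described above (the scoring formula and all numbers are illustrative assumptions, not any engine's real formula):

```python
# Each candidate ad that bid on the query's keywords.
ads = [
    {"name": "ad_a", "bid": 2.50, "relevance": 0.6, "expected_ctr": 0.04},
    {"name": "ad_b", "bid": 1.75, "relevance": 0.9, "expected_ctr": 0.07},
    {"name": "ad_c", "bid": 3.00, "relevance": 0.3, "expected_ctr": 0.02},
]

def ad_rank(ad: dict) -> float:
    """Toy auction score: bid weighted by relevance and expected CTR.
    Real engines use more elaborate (and proprietary) formulas."""
    return ad["bid"] * ad["relevance"] * ad["expected_ctr"]

winner = max(ads, key=ad_rank)
print(winner["name"])   # ad_b wins despite the lower bid
```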
There are two primary types of search engine metrics:
Search engines first need data from the web in order to operate and determine relevance to a search query. This process of extracting information from the web is called web crawling. There are several components to the crawling architecture:
Basic algorithm:
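The pseudocode is not reproduced here, but the standard frontier-based loop looks roughly like the sketch below (assumes the third-party requests library and a naive regex link extractor; not the course's reference crawler):

```python
import re
from collections import deque
import requests

def crawl(seed_urls: list[str], max_pages: int = 100) -> dict[str, str]:
    """Breadth-first crawl: pop a URL from the frontier, fetch it,
    store its content, and push any newly discovered links."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    pages: dict[str, str] = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                        # skip unreachable pages
        pages[url] = resp.text

        # Naive href extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href="(https?://[^"]+)"', resp.text):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```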
robots.txt is used to determine web crawler permissions on a webpage:
Disallow: denies the crawler access to these pages
Sitemap: determines what the crawler can crawl
User-agent: determines which bot can crawl the page

This basic algorithm is not used in practice because it does not respect the privacy of the website owner, takes a long time, and returns irrelevant, noisy data.
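Respecting robots.txt is straightforward in practice; Python's standard library includes a parser, so a crawler can check permissions and any declared crawl delay before fetching (the site URL and user-agent below are placeholders):

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCourseCrawler"                      # hypothetical bot name

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")        # placeholder site
rp.read()                                           # fetch and parse the file

if rp.can_fetch(USER_AGENT, "https://example.com/private/page.html"):
    print("allowed to crawl this page")
else:
    print("a Disallow rule applies; skip this page")

delay = rp.crawl_delay(USER_AGENT)                  # politeness hint, may be None
print("crawl delay:", delay)
```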
Politeness:
Performance:
Crawler traps: Structural issues with webpages that lead to infinite pagination of links. Can be avoided with
Duplicate detection: Many websites on the internet are near-duplicates, similar but not identical (so their hashes differ). Detectors can somewhat detect semantic similarity, but these methods are often ineffective.
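One standard way to catch near-duplicates (an illustrative sketch, not necessarily the exact method from lecture) is to compare overlapping word n-grams, called shingles, with the Jaccard coefficient:

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Break the text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity = size of intersection / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page1 = "the quick brown fox jumps over the lazy dog"
page2 = "the quick brown fox leaps over the lazy dog"
print(round(jaccard(shingles(page1), shingles(page2)), 2))   # 0.4: similar but not identical
```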
Data noise: Tags and ads introduce a lot of noise into the retrieved data, which can compromise IR. The cumulative distribution of tags can help the engine select blocks of main text based on where the graph plateaus (i.e., where the cumulative tag count stops increasing).
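A simplified brute-force sketch of that idea (my own illustration based on the description above, not the exact algorithm from lecture): treat the page as a sequence of tag/word tokens and pick the span that keeps the most words in and the most tags out, which corresponds to the plateau in the cumulative tag count.

```python
import re

def main_text_span(html: str) -> str:
    """Select the token span with many text words and few tags: maximize
    (tags before span) + (words inside span) + (tags after span)."""
    tokens = re.findall(r"<[^>]+>|[^<\s]+", html)
    is_tag = [1 if t.startswith("<") else 0 for t in tokens]
    n = len(tokens)

    # prefix[k] = number of tag tokens among the first k tokens
    prefix = [0]
    for b in is_tag:
        prefix.append(prefix[-1] + b)
    total_tags = prefix[n]

    best_score, best = -1, (0, 0)
    for i in range(n + 1):
        for j in range(i, n + 1):
            tags_outside = prefix[i] + (total_tags - prefix[j])
            words_inside = (j - i) - (prefix[j] - prefix[i])
            score = tags_outside + words_inside
            if score > best_score:
                best_score, best = score, (i, j)

    i, j = best
    return " ".join(t for t in tokens[i:j] if not t.startswith("<"))

html = ("<html><div><a>menu</a><a>ads</a></div>"
        "<p>This is the main article text of the page.</p>"
        "<div><a>footer</a></div></html>")
print(main_text_span(html))   # This is the main article text of the page.
```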