# Information Retrieval & Web Search
###### tags: `UCR`
## Introduction
* Difference between databases & documents
    * Database: made up of well-defined fields, has a schema
    * Documents vs. records
        * The query must be compared to the text of entire documents
* IR tasks
    * Ad-hoc search
        * Find relevant documents for an arbitrary text query
    * Filtering
        * Identify relevant user profiles for a new document
    * Classification
        * Identify relevant labels for documents
    * Question answering
        * Give a specific answer to a question
### Issues
* Relevance
* Evaluation
* User & information needs
* Spam
    * Affects the efficiency of search engines
### Components
Web => Spider -> (Document Corpus) -> IR System -> (Ranked Documents)
* Indexing process
    * Text acquisition: identifies and stores documents for indexing
        * Feeds: real-time streams of documents
        * Conversion: converts the variety of document formats into a consistent text-plus-metadata format
    * Text transformation: transforms documents into index terms
        * Stopping: remove common words
        * Stemming: group words derived from a common stem
        * Link analysis: makes use of links and anchor text
    * Index creation
        * Inversion: core of the indexing process
        * Index distribution: spreads indexes across multiple machines or sites
* Query process
    * Provides the interface and parsers for the query language (queries change a lot because people type as little as possible)
## Retrieval model
### Older models
* Boolean retrieval
    * True or false: the simplest form of ranking
    * Queries are specified using Boolean operators: AND, OR, NOT
    * Advantages
        * Predictable results
        * Many features can be incorporated
    * Disadvantages
        * Complex queries are difficult to express
        * No ranking, so not suitable for web search
        * Search by numbers
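Boolean retrieval can be sketched with plain set operations over an inverted index (the toy index and doc ids below are made up); every match is equally "true", so there is no ranking.

```python
# Toy inverted index: term -> set of doc ids (illustrative data)
index = {
    "ucr": {1, 2, 3},
    "riverside": {1, 3},
    "football": {2},
}

def AND(a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    """Documents containing either term."""
    return index.get(a, set()) | index.get(b, set())

def NOT(a, universe):
    """Documents in the collection that do not contain the term."""
    return universe - index.get(a, set())
```

`AND("ucr", "riverside")` returns `{1, 3}`: a document either satisfies the query or it doesn't.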
### Vector Space Model
* Documents & queries are represented by vectors of term weights
* A collection is represented by a matrix of term weights
* Use cosine correlation: rank by the cosine of the angle between vectors (the angle itself is harder to compute and unnecessary)
* The larger the cosine value, the smaller the angle (more similar)
::: info
For query "UCR Riverside", "UCR" and "Riverside" have same similarity, but "UCR" is more specific
:::
* Documents can be represented as bags of words: word order is ignored
* Advantages
    * Simple computational framework for ranking
* Disadvantages
    * Assumption of term independence
::: info
If Q = "UCR Riverside Football", D1 = "UCR Riverside", D2 = "UCR football":
the vector space model gives both documents the same score, but since `P(Riverside|UCR) > P(football|UCR)`, matching "football" carries more information, so D2 is the better result.
Conversely, for Q = "UCR play" with D1 = "UCR game" and D2 = "UCR potato", the vector space model fails to see that "play" and "game" are related.
:::
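A minimal bag-of-words cosine sketch (toy strings, raw term counts) showing the term-independence problem from the example above: D1 and D2 get identical scores.

```python
import math
from collections import Counter

def cosine(q: str, d: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors."""
    qv, dv = Counter(q.lower().split()), Counter(d.lower().split())
    dot = sum(qv[t] * dv[t] for t in qv)
    qn = math.sqrt(sum(v * v for v in qv.values()))
    dn = math.sqrt(sum(v * v for v in dv.values()))
    return dot / (qn * dn) if qn and dn else 0.0
```

`cosine("UCR Riverside Football", "UCR Riverside")` and `cosine("UCR Riverside Football", "UCR football")` come out identical, even though D2 is the better match.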
#### Term weights
* Tf-idf
* Inverse document frequency measures importance in collection
::: info
In the previous example, "UCR" has a higher inverse document frequency than "Riverside"
:::
* TF-IDF weights are usually length-normalized
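A small tf-idf sketch with length normalization (toy three-document collection; the idf formula shown is one of several variants):

```python
import math
from collections import Counter

# Toy collection: "riverside" occurs in every document, "ucr" in only one,
# so "ucr" gets the higher idf (it is the more specific term)
docs = [
    "riverside county news",
    "riverside weather report",
    "ucr riverside campus",
]

N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))   # document frequency
idf = {t: math.log(N / df[t]) for t in df}              # simple idf variant

def tfidf(doc: str) -> dict:
    """tf-idf weights for one document, normalized to a unit-length vector."""
    tf = Counter(doc.split())
    w = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w
```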
#### Relevance feedback
* Pseudo-relevance feedback
    * Don't ask the user anything; automatically assume the top-ranked results are relevant and rerun the query
* Optimal feedback
    * Maximize the difference between the average vector of the relevant documents and the average vector of the non-relevant documents
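The feedback idea above can be sketched with the Rocchio update (the alpha/beta/gamma values are conventional defaults, not from these notes):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of the relevant documents
    and away from the non-relevant centroid. Vectors are term -> weight dicts."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:  # negative weights are usually dropped
            new_q[t] = w
    return new_q
```

Feeding back a relevant document adds its terms to the query; non-relevant terms are pushed out.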
### Probability Ranking Principles
* Bayes Classifier
* A document D is relevant if `P(Relevant|D) > P(Non-relevant|D)`
    * Rank by the likelihood ratio `P(D|Relevant) / P(D|Non-relevant)`
    * Assume terms are independent, so the document likelihood factors into per-word probabilities
### Contingency table
### BM25
* Effective ranking algorithm based on the binary independence model
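A BM25 sketch over token lists (k1 and b are set to common defaults; the idf variant shown avoids negative values):

```python
import math

def bm25_score(query_terms, doc_terms, docs, k1=1.2, b=0.75):
    """BM25 score of one document for a query. `docs` is the whole
    collection (lists of terms), used for idf and average doc length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, non-negative idf
        tf = doc_terms.count(t)
        # tf saturates via k1; b controls document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```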
## Language model
* Unigram
* Bigram
    * Predicts the next word based on the previous word
    * More accurate, but more complicated
* Example
    * Unigram: `P(I) = 0.9`, `P(play) = 0.1`; bigram: `P(play|I) = 0.9`, `P(I|play) = 0.2`
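A maximum-likelihood bigram estimator sketch (toy corpus, no smoothing):

```python
from collections import Counter

def bigram_model(corpus):
    """MLE bigram estimates P(next | prev) from a list of token lists."""
    pair_counts, prev_counts = Counter(), Counter()
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            pair_counts[prev, nxt] += 1
            prev_counts[prev] += 1
    def p(prev, nxt):
        # count(prev, nxt) / count(prev); 0 for unseen contexts
        return pair_counts[prev, nxt] / prev_counts[prev] if prev_counts[prev] else 0.0
    return p
```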
### LMs for retrieval
* Probability of generating the query text from a document language model
* `D -> LM(D)`, compute `P(Q|LM(D))`
* Smoothing
* Prevent 0 score
    * Less smoothing for long documents, more smoothing for short documents
* Jelinek-Mercer Smoothing
* Dirichlet Smoothing
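A query-likelihood sketch with both smoothing methods (the lambda and mu values are illustrative defaults):

```python
import math

def query_likelihood(query, doc, collection, lam=0.5, mu=None):
    """log P(Q | LM(D)). Jelinek-Mercer by default; Dirichlet if mu is given.
    `doc` and the collection documents are token lists."""
    coll = [t for d in collection for t in d]
    logp = 0.0
    for t in query:
        p_c = coll.count(t) / len(coll)  # collection (background) probability
        if mu is not None:
            # Dirichlet: longer documents get less smoothing automatically
            p = (doc.count(t) + mu * p_c) / (len(doc) + mu)
        else:
            # Jelinek-Mercer: fixed interpolation with the collection model
            p = lam * doc.count(t) / len(doc) + (1 - lam) * p_c
        logp += math.log(p) if p > 0 else float("-inf")
    return logp
```

Without smoothing, "football" below would zero out the whole score; with either method the score stays finite.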
## Machine learning models
* Learn from clicks: if the user chooses D2, then `S(D2, Q) > S(D1, Q)` and `S(D2, Q) > S(D3, Q)` (pairwise preferences)
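A toy pairwise update sketch for learning from such click preferences (the feature names and learning rate are made up):

```python
def pairwise_update(w, clicked, skipped, lr=0.1):
    """One pairwise learning step: if the clicked document does not already
    outscore the skipped one, nudge the weights toward the clicked features.
    Documents are feature -> value dicts."""
    score = lambda d: sum(w.get(f, 0.0) * v for f, v in d.items())
    if score(clicked) <= score(skipped):  # preference S(clicked) > S(skipped) violated
        for f in set(clicked) | set(skipped):
            w[f] = w.get(f, 0.0) + lr * (clicked.get(f, 0.0) - skipped.get(f, 0.0))
    return w
```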
## Web Crawler
* Start from a set of seed URLs
* The frontier is implemented as a queue: push the hyperlinks found in a page onto the end of the queue
* A robots.txt file at a site's root can be used to control crawlers
* Disallow and allow
* Sitemaps
    * Contains URLs and relevant metadata, generated by the web server administrator
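The queue-based crawler with a robots-style disallow list can be sketched over a toy in-memory "web" (no real HTTP; URLs and links below are made up):

```python
from collections import deque

# Toy "web": url -> outgoing links (stand-in for real fetching/parsing)
WEB = {
    "a": ["b", "c"],
    "b": ["c", "private"],
    "c": ["a"],
    "private": [],
}
DISALLOWED = {"private"}  # what a robots.txt Disallow rule would exclude

def crawl(seeds):
    """FIFO frontier: pop a URL, 'fetch' it, push its unseen links."""
    frontier, seen, fetched = deque(seeds), set(seeds), []
    while frontier:
        url = frontier.popleft()
        if url in DISALLOWED:
            continue  # respect the disallow rule
        fetched.append(url)
        for link in WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched
```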
### Freshness
* Web pages are constantly being added, deleted, and modified
* A web crawler must continuously revisit pages it has already crawled
* Stale copies no longer reflect the real content
* HTTP HEAD requests return only the response headers (e.g. last-modified), which helps check for page changes cheaply
* Page updates can be modeled as Poisson events, which lets the crawler estimate the expected age of its copy
### Focused Crawling
* Attempts to download only those pages that are about a particular topic
* Use a text classifier
### Crawling Deep Web
* Sites that are difficult to crawl:
* Private sites
* No incoming links
* Form results
        * Data can only be obtained after submitting a form
* Scripted pages
* Generated by scripts
        * Some content may be hidden behind them, e.g. forms
### Distributed Crawling
* Every machine has its own queue
* May contain duplicates
* Solution: Hashing
* Threads exchange hashes of URLs
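Partitioning URLs by a hash of the host is a way to avoid duplicates: all URLs of a site map to the same machine's queue (the machine count below is arbitrary):

```python
import hashlib
from urllib.parse import urlparse

def owner(url: str, n_machines: int) -> int:
    """Assign a URL to one crawler machine by hashing its host, so every
    URL of a given site (including duplicates) lands on the same queue."""
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % n_machines
```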
### Desktop crawling
* Search your own computer
* Not popular
    * Most files are now stored in the cloud
* Provided by OS
* Documents Feed
### Conversion
* Since text is stored in hundreds of incompatible file formats, a conversion tool is needed
* Character Encoding
* Encoding: UTF-8, ASCII, etc.
    * A major source of incompatibility
    * Many other languages contain more glyphs than ASCII can represent
        * Solution: Unicode
### Storing webpages
* File system
* Databases
* Don't need SQL
* Don't need to join
* NoSQL
* BigTable
* No query language
        * Only row-level transactions
        * Stored in a replicated file system accessible by all BigTable servers
        * Transactions are stored in a shared file system
### Noise Removal
* Pages contain outlying content blocks (ads, navigation) in addition to the main content
* Finding the main content block
    * Find indexes i and j that maximize the number of tag tokens before i and after j, plus the number of text tokens between them
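A brute-force sketch of that i/j optimization, treating tokens that start with `<` as tags (real implementations tokenize HTML properly):

```python
def content_block(tokens):
    """Brute-force search for i, j maximizing: tag tokens before i
    + text tokens between i and j + tag tokens after j."""
    b = [1 if t.startswith("<") else 0 for t in tokens]  # 1 = tag token
    n = len(b)
    best, best_ij = -1, (0, n)
    for i in range(n + 1):
        for j in range(i, n + 1):
            score = sum(b[:i]) + sum(1 - x for x in b[i:j]) + sum(b[j:])
            if score > best:
                best, best_ij = score, (i, j)
    i, j = best_ij
    return tokens[i:j]  # the main content span
```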
## Indexing
Given a keyword, return which documents contain the keyword
* Access to full documents is slow
* Inverted index
    * Stores, for each keyword, the documents that contain it (and how often)
* Compression
* Reason: The list can be long
* Document-at-a-time
* Need more random access
* But save memory
* Term-at-a-time
* No random access
* Need large memory space
* Need to keep track of partial score
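A minimal inverted index plus term-at-a-time scoring sketch (scores are plain tf sums, just to show the partial-score accumulation):

```python
from collections import defaultdict, Counter

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index

def term_at_a_time(index, query):
    """Process each posting list fully before the next, keeping
    partial scores for every candidate document in memory."""
    scores = Counter()
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # accumulate partial score
    return scores.most_common()
```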
### Optimization Techniques
* WAND algorithm
    * Keeps a pointer into each query term's posting list
    * Skipping: jump over documents that cannot enter the top results
* Threshold method
* Variant of **term-at-a-time**
* Merging
    * Merging partial indexes from memory to disk addresses the limited-memory problem
## MapReduce
* Distributed Indexing
    * A paradigm for working across multiple machines
    * Ensures entries with the same key aggregate on the same machine
    * A mapper and a reducer may run on the same machine
* Hadoop on top of HDFS
* A distributed file system
* Good at
* Very large read-only or append-only files
        * Sequential access
* Disadvantages
* Multiple small files
* files are stored as blocks and replicated
* Reducers wait until all mappers finish
* RecordReader (RR): lets you define how input is parsed into records for your own workers
* There may be **combiners** between mappers and reducers
* Goal: Minimize the pairs need to be shuffled
* Do some grouping
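A word-count sketch of the map/combine/shuffle/reduce pipeline (single process, not a real cluster; the combiner shrinks what would otherwise be shuffled):

```python
from collections import defaultdict

def mapper(text):
    """Emit (word, 1) for every word occurrence."""
    return [(w, 1) for w in text.split()]

def combiner(pairs):
    """Local aggregation on the mapper's machine: many (w, 1)
    pairs collapse into one (w, n) pair before shuffling."""
    local = defaultdict(int)
    for w, n in pairs:
        local[w] += n
    return list(local.items())

def shuffle(all_pairs):
    """Group pairs by key so each key lands at one reducer."""
    groups = defaultdict(list)
    for w, n in all_pairs:
        groups[w].append(n)
    return groups

def reducer(word, counts):
    return (word, sum(counts))

docs = ["ucr riverside ucr", "riverside football"]
shuffled = shuffle(p for d in docs for p in combiner(mapper(d)))
counts = dict(reducer(w, ns) for w, ns in shuffled.items())
```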
## Quizzes
* Twitter crawler
```
Insert some seed users into Q
While Q is not empty:
    Get user u from Q
    If u was already visited, skip it
    Add getTweets(u) to Tweets
    Insert getFollowers(u) into Q
    Insert getFollowees(u) into Q
```
* Map Reduce
* Mapper
```
For each distinct word w in the document (eliminate duplicates):
    emit(w, 1)
```
* Reducer
```
For each word w, sum its counts:
    emit(w, count)
```
* Combiner: (Same as reducer)
## PageRank
* Traditional IR Ranking
* Pages are not sufficiently self-descriptive
* Link analysis
* Assumptions
    * If the pages pointing to p are good pages, p is also a good page
* Google paper: PageRank
* Hub: a page that points to good authorities (HITS)
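A power-iteration PageRank sketch on a toy link graph (damping d = 0.85; dangling pages spread their rank uniformly; assumes every link target appears as a key):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank on a dict node -> list of out-links."""
    n = len(links)
    pr = {u: 1.0 / n for u in links}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in links}   # teleport term
        for u, outs in links.items():
            if outs:
                for v in outs:                  # u shares its rank among out-links
                    new[v] += d * pr[u] / len(outs)
            else:                               # dangling page: spread everywhere
                for v in new:
                    new[v] += d * pr[u] / n
        pr = new
    return pr
```

On `{"a": ["b"], "b": ["c"], "c": ["a", "b"]}`, "b" ends up with the most rank: it is pointed to by both "a" and "c".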
## Evaluation
* Relevance judgments are expensive to obtain
* Pooling
* Top K results obtained by various algorithms or models are merged into a pool
* Query Log
    * Used for both tuning and evaluating search engines
    * Not always accurate: a click followed by an immediate return to the results may mean the user disliked the page
    * Tricky to share query logs (privacy concerns)
## QA
#### 1. What are the differences between IR and DB search? What are the challenges of IR search?
Ans: A database is made up of well-defined fields and has a schema; text, conversely, is unstructured, so the query must be compared to the text of entire documents.
#### 2. List 4 types of search tasks in IR applications
Ans:
* Ad-hoc search
    * Find relevant documents for an arbitrary text query
* Filtering
    * Identify relevant user profiles for a new document
* Classification
    * Identify relevant labels for documents
* Question answering
    * Give a specific answer to a question
#### 3. List and define 5 types of applications of IR
Ans:
* Link analysis
* Ranking
* Web crawl
#### 4. List 2 examples of query refinement techniques
Ans:
* Relevance feedback (expand the query using known-relevant or top-ranked documents)
* Smoothing methods
#### 5. List 4 issues of search engines
Ans:
* Noise of the page (or spam)
* Relevance
* Evaluation
* User & information needs