# Information Retrieval & Web Search

###### tags: `UCR`

## Introduction

* Difference between databases & documents
    * Database: made up of well-defined fields, has a schema
    * Documents vs. records
        * The query must be compared against the text of entire documents
* IR tasks
    * Ad-hoc search
        * Find relevant documents for an arbitrary text query
    * Filtering
        * Identify relevant user profiles for a new document
    * Classification
        * Identify relevant labels for documents
    * Question answering
        * Give a specific answer to a question

### Issues

* Relevance
* Evaluation
* User & information needs
* Spam
    * Affects the effectiveness of the search engine

### Components

Web => Spider -> (Document Corpus) -> IR System -> (Ranked Documents)

* Indexing process
    * Inversion: core of the indexing process
    * Index distribution
* Query process
    * Provides the interface and parser for the query language (queries keep changing because people are lazy to type)
* Indexing process
    * Text acquisition: identifies and stores documents for indexing
        * Feeds: real-time streams of documents
        * Conversion: converts the variety of document formats into a consistent "text plus metadata" format
    * Text transformation: transforms documents into index terms
        * Stopping: remove common words
        * Stemming: group words derived from a common stem
        * Link analysis: makes use of links and anchor text

## Retrieval Models

### Older models

* Boolean retrieval
    * True or false; simplest form of ranking
    * Queries are specified using Boolean operators: AND, OR, NOT
    * Advantages
        * Predictable results
        * Features can be incorporated into the query
    * Disadvantages
        * Complex queries are difficult to write
        * No ranking, so not suitable for web search
    * Also called "searching by numbers"

### Vector Space Model

* Documents & queries are represented as vectors of term weights
* A collection is represented as a matrix of term weights
* Use the cosine correlation to measure similarity (computing the angle itself is more difficult)
    * The angle is smaller when the cosine value is larger

::: info
For the query "UCR Riverside", "UCR" and "Riverside" contribute the same similarity, but "UCR" is more specific
:::

* Documents can be represented as bags of words: there is no ordering of the words
* Advantages
    * Simple framework for ranking
* Disadvantages
    * Assumes term independence

::: info
If Q = "UCR Riverside football", D1 = "UCR Riverside", D2 = "UCR football":
the vector space model gives both documents the same score, but `P(Riverside|UCR) > P(football|UCR)`, so D2 is the better match.
On the other hand, for Q = "UCR play", D1 = "UCR game", D2 = "UCR potato", the vector space model fails to find the similarity between "play" and "game"
:::

#### Term weights

* TF-IDF
    * Inverse document frequency measures a term's importance in the collection

::: info
In the previous example, "UCR" has a higher inverse document frequency than "Riverside"
:::

* TF-IDF weights: normalization

#### Relevance feedback

* Pseudo-relevance feedback
    * Don't ask the user anything; automatically assume the top-ranked results are relevant and refine the query
* Optimal feedback
    * Maximize the difference between the average vector of the relevant documents and the average vector of the non-relevant documents

### Probability Ranking Principle

* Bayes classifier
    * A document D is relevant if `P(Relevant|D) > P(Non-relevant|D)`
    * Use the likelihood ratio
    * Split the document into its words (treat terms independently)

### Contingency table

### BM25

* Effective ranking algorithm based on the binary independence model

## Language models

* Unigram
* Bigram
    * Decide the next word based on the previous word
    * More complicated than unigram
    * Example: `P(I) = 0.9`, `P(play|I) = 0.2`; `P(play) = 0.1`, `P(I|play) = 0.9`

### LMs for retrieval

* Probability of generating the query text from a document language model
    * `D -> LM(D)`, compute `P(Q|LM(D))`
* Smoothing
    * Prevents zero scores
    * Less smoothing for long documents, more smoothing for short documents
    * Jelinek-Mercer smoothing
    * Dirichlet smoothing

## Machine learning models

* If the user chooses D2, then `S(D2, Q) > S(D1, Q)` and `S(D2, Q) > S(D3, Q)`

## Web Crawler

* Starts from a set of seed URLs
* Implemented as a queue: push the hyperlinks found in a page onto the end of the queue
* A robots.txt file at the site root can be used to control crawlers
    * Disallow and Allow rules
* Sitemaps
    * Contain URLs and relevant metadata, generated by the web server administrator

### Freshness
* Web pages are constantly being added, deleted, and modified
* A web crawler must continuously revisit pages it has already crawled
    * Stale copies no longer reflect the real content
* HTTP has a HEAD request to help check for page changes
* Page updates can be modeled as Poisson events, which lets the crawler estimate a page's age

### Focused Crawling

* Attempts to download only those pages that are about a particular topic
* Uses a text classifier

### Crawling the Deep Web

* Sites that are difficult to crawl:
    * Private sites
        * No incoming links
    * Form results
        * Data can only be obtained after entering some data into a form
    * Scripted pages
        * Generated by scripts
* Some content may be hidden behind interactive elements, e.g. forms

### Distributed Crawling

* Every machine has its own queue
* The queues may contain duplicates
    * Solution: hashing; threads exchange hashes of URLs

### Desktop crawling

* Search your own computer
* Not popular anymore: most files are now in the cloud
* Provided by the OS
* Documents feed

### Conversion

* Since text is stored in hundreds of incompatible file formats, a conversion tool is needed
* Character encoding
    * Encodings: UTF-8, ASCII, etc.
    * A major source of incompatibility
    * Many other languages contain more glyphs than ASCII can hold
    * Solution: Unicode

### Storing web pages

* File system
* Databases
    * Don't need SQL
    * Don't need joins
* NoSQL
    * BigTable
        * No query language
        * Only row-level transactions
        * Stored in a replicated file system accessible by all BigTable servers
        * Transactions are stored in a shared file system

### Noise Removal

* Pages contain outlying content blocks (navigation, ads) around the main content
* Finding the main content block
    * Find indices i and j that maximize both the number of tags below i and the number of tags above j

## Indexing

Given a keyword, return which documents contain the keyword.

* Access to full documents is slow
* Inverted index
    * Stores, for each keyword, which documents contain it and how often
* Compression
    * Reason: the posting lists can be long
* Document-at-a-time
    * Needs more random access
    * But saves memory
* Term-at-a-time
    * No random access
    * Needs a large memory space
    * Needs to keep track of partial scores

### Optimization Techniques

* WAND algorithm
    * Subroutine: keep a pointer into each posting list
    * Skipping
    * Threshold method
    * A variant of **document-at-a-time** retrieval
* Merging
    * Merging addresses the limited-memory problem

## MapReduce

* Distributed indexing
* A paradigm for working across multiple machines
* Ensures that entries with the same key aggregate on the same machine
* A mapper and a reducer may run on the same machine
* Hadoop runs on top of HDFS
    * A distributed file system
    * Good at
        * Very large read-only or append-only files
        * Sequential access
    * Disadvantages
        * Many small files
    * Files are stored as blocks and replicated
* Reducers wait until all mappers finish
* RR: used to define your own worker
* There may be **combiners** between mappers and reducers
    * Goal: minimize the number of pairs that need to be shuffled
    * Do some local grouping

## Quizzes

* Twitter crawler

```
Insert seed users into Q
While Q is not empty:
    Get user u from Q
    Add getTweets(u) to Tweets
    Insert getFollowers(u) into Q
    Insert getFollowees(u) into Q
```

* MapReduce (count, for each word, the number of documents containing it)
    * Mapper

    ```
    For each distinct word w in the document (eliminate duplicates):
        emit(w, 1)
    ```

    * Reducer

    ```
    Count the pairs for each word w:
        emit(w, count)
    ```

    * Combiner: same as the reducer
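The mapper/reducer/combiner pattern from the quiz can be simulated on a single machine. A minimal sketch in Python (the shuffle step that routes equal keys to the same machine is simulated here by sorting and grouping; the function names are illustrative, not from the notes):

```python
from itertools import groupby

def mapper(doc_id, text):
    # Emit (word, 1) once per distinct word in the document,
    # as in the quiz: duplicates within a document are eliminated.
    for w in set(text.lower().split()):
        yield (w, 1)

def reducer(word, counts):
    # Sum all the 1s for a word: the number of documents containing it.
    yield (word, sum(counts))

def run_mapreduce(docs):
    # Map phase: run the mapper over every document.
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(mapper(doc_id, text))
    # Shuffle phase: group pairs by key (simulates sending
    # all pairs with the same key to the same machine).
    pairs.sort(key=lambda kv: kv[0])
    result = {}
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        for w, count in reducer(word, (c for _, c in group)):
            result[w] = count
    return result

docs = {1: "UCR Riverside UCR", 2: "UCR football"}
print(run_mapreduce(docs))  # {'football': 1, 'riverside': 1, 'ucr': 2}
```

A combiner would apply the same `reducer` function to each mapper's local output before the shuffle, reducing the number of pairs moved between machines.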
## PageRank

* Traditional IR ranking
    * Pages are not sufficiently self-descriptive
* Link analysis
    * Assumption: if the pages that point to p are good, p is also a good page
    * Introduced in the Google paper
    * Hub: a page that points to good authorities

## Evaluation

* Relevance judgments are difficult and expensive to obtain
* Pooling
    * The top k results obtained by various algorithms or models are merged into a pool
* Query logs
    * Used for both tuning and evaluating search engines
    * Not always accurate: a user may click a result, dislike the page, and go back
    * Tricky to share query workloads

## QA

#### 1. What are the differences between IR and DB search? What are the challenges of IR search?

Ans: A database is made up of well-defined fields and has a schema; IR must compare the query against the free text of entire documents.

#### 2. List 4 types of tasks in IR applications

Ans:

* Ad-hoc search
    * Find relevant documents for an arbitrary text query
* Filtering
    * Identify relevant user profiles for a new document
* Classification
    * Identify relevant labels for documents
* Question answering
    * Give a specific answer to a question

#### 3. List and define 5 types of applications of IR

Ans:

* Link analysis
* Ranking
* Web crawling

#### 4. List 2 examples of query refinement techniques

Ans:

* Smoothing methods

#### 5. List 4 issues of search engines

Ans:

* Noise on the page (or spam)
* Relevance
* Evaluation
* User & information needs
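The link-analysis assumption behind PageRank can be made concrete with power iteration. A minimal sketch on a hypothetical 3-page graph (the graph, the damping factor 0.85, and the iteration count are illustrative assumptions, not from the notes):

```python
# Hypothetical link graph: links[p] = pages that p links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85  # damping factor, the standard choice from the PageRank paper
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

for _ in range(50):  # iterate until the ranks converge
    new = {p: (1 - d) / n for p in pages}  # random-jump contribution
    for p, outs in links.items():
        share = rank[p] / len(outs)  # each page splits its rank among its out-links
        for q in outs:
            new[q] += d * share
    rank = new

print(rank)
```

Page C ends up with the highest rank: it receives links from both A and B, i.e. "good pages pointing to p make p a good page".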