# CSCI 572 Midterm Review

## Week 1

### 1B - Search Engine Basics

#### Search Engine History and Basics

#### Search Engine Basic Behavior
- Web search
  - Web search engine definition
  - enterprise search engines
  - personal search engines
- Web search engine elements
  - spider / crawler / robot
  - the indexer
  - query processor
- Query processing
  - semantic analysis

### Web Trends and Measurements
- Web trends
  - major cloud providers
  - Google voice search queries
- Measuring the web
  - web facts (page 32)
- Manual Hierarchical Web Taxonomies
  - Yahoo
  - Open Directory Project - DMOZ
  - Internet Archive

## Week 2

### Videos
- **Precision and Recall, Jurafsky & Manning, Stanford U. (16 min)**
  - 2-by-2 contingency table
  - Precision and recall
    - precision: % of selected items that are correct
    - recall: % of correct items that are selected
  - A combined measure: F
    - a combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean)
    - the harmonic mean is a very conservative average
    - people usually use the balanced F1 measure
      - beta = 1 (alpha = 1/2)
      - $F_1 = 2PR/(P+R)$
    - (see the sketch below)
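A minimal sketch of these metrics in Python; the function name and the toy counts are my own, for illustration:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and the F measure from the counts of a
    2-by-2 contingency table (true positives, false positives, false negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # Weighted harmonic mean; beta = 1 gives the balanced F1 = 2PR/(P+R).
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 40 relevant retrieved, 10 irrelevant retrieved, 20 relevant missed.
print(precision_recall_f(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```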
- **WDM116: URL Frontier Using the Mercator Scheme (29 min)**
  - Mercator URL frontier
    - URLs flow in from the top into the frontier
    - front queues manage prioritization
    - back queues enforce politeness
    - each queue is FIFO
  - Front queues
    - the prioritizer assigns each URL an integer priority between 1 and K
      - appends the URL to the corresponding queue
    - heuristics for assigning priority
      - refresh rate sampled from previous crawls
      - application-specific (e.g. crawl news sites more often)
  - Biased front queue selector
    - when a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
    - this choice can be round robin biased toward queues of higher priority, or some more sophisticated variant
    - can be randomized
  - URL frontier: two main considerations
    - politeness: do not hit a web server too frequently
    - freshness: crawl some pages more often than others
      - e.g. news pages
    - these goals may conflict with each other
      - e.g. a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site
  - Back queue invariants
    - each back queue is kept non-empty while the crawl is in progress
    - each back queue only contains URLs from a single host
      - maintain a table from hosts to back queues (host name -> back queue)
  - Back queue heap (see the sketch after this section)
    - one entry for each back queue
    - the entry is the earliest time $t_e$ at which the host corresponding to that back queue can be hit again
    - this earliest time is determined from
      - the last access to that host
      - any time-buffer heuristic we choose
  - Politeness challenges
    - even if we restrict only one thread to fetch from a host, it can hit it repeatedly
    - common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host
  - Back queue processing (at 20 min)
    - a crawler thread seeking a URL to crawl:
      - extract the root of the heap
      - fetch the URL at the head of the corresponding back queue q (look up the table)
      - check if queue q is now empty
        - if so, pull a URL v from a front queue
      - when q is non-empty, create a heap entry for it
  - Number of back queues B
    - keep all threads busy while respecting politeness
    - Mercator recommendation: three times as many back queues as crawler threads
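A toy sketch of the back-queue heap idea, assuming a fixed per-host politeness delay (the class name and the 3-second gap are my own illustration, not Mercator's actual parameters; real Mercator refills an emptied back queue from the front queues):

```python
import heapq
import time
from collections import deque

class BackQueues:
    """Mercator-style back queues: one FIFO queue per host, plus a heap
    keyed on the earliest time each host may be hit again."""

    def __init__(self, delay=3.0):
        self.delay = delay   # politeness gap per host (seconds); an assumption
        self.queues = {}     # host -> deque of URLs
        self.heap = []       # entries: (earliest_next_hit_time, host)

    def add(self, host, url):
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.heap, (time.time(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Wait until some host may be hit again, then return one of its URLs."""
        t_e, host = heapq.heappop(self.heap)       # extract the root of the heap
        time.sleep(max(0.0, t_e - time.time()))    # respect the earliest-hit time
        url = self.queues[host].popleft()
        if self.queues[host]:
            # Re-insert the host with its new earliest-hit time t_e.
            heapq.heappush(self.heap, (time.time() + self.delay, host))
        else:
            del self.queues[host]  # real Mercator would refill from a front queue
        return url

bq = BackQueues()
bq.add("example.com", "http://example.com/a")
bq.add("example.com", "http://example.com/b")
print(bq.next_url())  # returned immediately
print(bq.next_url())  # returned only after the politeness gap
```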
### Search Engine Evaluation
- Defining precision/recall
  - precision / recall
  - formalizing precision/recall
    - $Precision = tp / (tp + fp)$
    - $Recall = tp / (tp + fn)$
- Harmonic Mean and F Measure
  - harmonic mean
  - F-score / F-measure
- Calculating Recall/Precision at Fixed Positions (page 10)
- Mean Average Precision
  - average precision of the relevant documents (page 11)
  - average across queries
  - MAP: mean average precision (example: page 14; see the sketch after this section)
- Difficulties in using Precision/Recall
- Discounted Cumulative Gain: DCG
  - a final evaluation measure
- Search Engine Evaluation Metrics / Elements of Good Search Results
  - test collections of queries and hand-ranked results
  - recall
  - precision at top $k$ positions
- Search engines also use non-relevance-based measures
  - click-through on first result
  - studies of user behavior in the lab
  - A/B testing
- Google's Search Quality
  - 4-step process for changing their search algorithm
    1. precision evaluation
    2. side-by-side experiment
    3. live traffic experiment
    4. full launch
- A/B Testing
  - user clicks
  - using log files for evaluation
- Google's Enhancements of Search Results
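A minimal sketch of MAP and DCG, assuming binary relevance for MAP and graded gains with the common $\log_2(rank+1)$ discount for DCG (function names and toy judgments are mine; DCG has several variants):

```python
import math

def average_precision(ranked_relevance):
    """Average precision for one query: mean of precision@k taken at each
    rank k where a relevant document appears."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: average the per-query average precision across queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

def dcg(gains):
    """Discounted cumulative gain: graded gains discounted by log2(rank + 1)."""
    return sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))

# Binary relevance of the top-5 results for two queries (1 = relevant).
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 0, 1]]))
print(dcg([3, 2, 3, 0, 1]))  # graded relevance judgments for one ranking
```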
### Crawlers and Crawling
- Web crawler
- Web crawler issues
  - How to crawl?
    - quality, efficiency, etiquette
  - How much to crawl?
    - coverage, relative coverage
  - How often?
    - freshness; how much has really changed?
- Simplest Crawler Operation
  - initialize with "seed" pages
  - loop: fetch and parse a page
- Crawling picture
- Challenges
  - handling/avoiding malicious pages
  - latency / bandwidth
  - robots.txt
  - maintaining politeness
- Robots.txt
- Basic search strategy (page 11)
  - breadth-first search
  - depth-first search
- Queueing strategy (page 15)
  - FIFO - BFS
  - LIFO - DFS
  - heuristic ordering
    - focused crawler
    - re-order URLs
- Avoiding page duplication
  - a crawler must efficiently index
    - URLs
      - standard format / already seen
    - already-visited pages
      - identical & near-identical
- Link extraction
  - find the HTML elements to extract links from
  - URLs must be completed (relative -> absolute)
  - anomalies
    - anchors might not have links
    - dynamic pages
- Representing URLs
  - to determine if a new URL has already been seen
    - URLs are sorted lexicographically and then stored as a delta-encoded text file
  - **trie** for URL exact matching
- Why normalize URLs? (page 20)
- **Normalizing URLs** - 4 rules (page 21; a normalization sketch follows this section)
  - unreserved characters
    - ALPHA (%41-%5A and %61-%7A)
    - DIGIT (%30-%39)
    - hyphen (%2D)
    - period (%2E)
    - underscore (%5F)
    - tilde (%7E)
- Avoiding spider traps
  - session IDs
- Handling spam web pages
  - repeated terms
  - cloaking
  - doorway pages
- The Mercator web crawler
  - DNS
  - UDP
  - parallel threads
- Measuring and Tuning a Crawler
  - DNS caching, pre-fetching, and resolution
    - DNS
    - UDP
    - customization
  - multi-threaded crawling
- Crawling approaches
  - centralized crawler
  - distributed crawling approaches
    - distributed model
  - Benefits and issues of distributed crawling
    - benefits: scalability, costs, network-load dispersion and reduction
    - issues: overlap, quality, communication bandwidth
  - Coordination of distributed crawling
    - independent
    - dynamic assignment
      - central coordinator
    - static assignment
      - firewall mode
      - cross-over mode
      - exchange mode
  - Classification of parallel crawlers
- Freshness (page 33)
  - selection policy
  - re-visit policy
  - politeness policy
  - parallelization policy
- Keeping Spidered Pages Up to Date
  - implications for a web crawler
    - running multiple types of crawlers is best
    - updating in-place keeps the index current
  - Cho and Garcia-Molina, 2000
    - re-visit policy
      - uniform policy
      - proportional policy
- Helping the Search Engine Crawler by Creating a Sitemap
  - sitemap
  - XML
  - general sitemap guidelines
- Google crawlers
  - Googlebot
  - To verify that the visitor is Googlebot
    - reverse DNS lookup
    - match the crawler's IP address
  - Controlling how often Googlebot visits your site
    - crawl rate
  - Googlebot is really Chrome re-packaged
  - Googlebot processes web pages with JavaScript in 3 phases
    1. Crawling - processing all links
    2. Rendering - executing JS and then looping back to 1
    3. Indexing
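A sketch of the four normalization rules as I read them from the slide (lowercase the scheme and host, uppercase remaining percent-escapes, decode percent-escapes of unreserved characters, drop the default port); the function name and examples are mine:

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz0123456789-._~")

def _decode_unreserved(match):
    char = chr(int(match.group(1), 16))
    # Decode %-escapes of unreserved characters (%41-%5A, %61-%7A, %30-%39,
    # %2D, %2E, %5F, %7E); uppercase the hex digits of any escape we keep.
    return char if char in UNRESERVED else "%" + match.group(1).upper()

def normalize_url(url):
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.lower()                    # rule: lowercase scheme...
    netloc = netloc.lower()                    # ...and host
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                   # rule: drop the default port
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]
    path = re.sub(r"%([0-9A-Fa-f]{2})", _decode_unreserved, path) or "/"
    return urlunsplit((scheme, netloc, path, query, fragment))

print(normalize_url("HTTP://WWW.Example.com:80/%7Euser/a%2db"))
# -> http://www.example.com/~user/a-b
```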
## Week 3

### 5 - De-Duplication
- De-duplication definition
  - reduced storage
- Similar content with different URLs
- Totally distinct URLs may deliver the same page
- Near duplicates
  - spotting near duplicates (page 6)
- Web Duplicates - Mirroring
  - the single largest cause
  - Host1/a and Host2/b are mirrors iff ...
- Why detect **exact duplicates** (page 11)
  - smarter crawling
  - better connectivity analysis
  - less redundancy in result listings
  - reduced crawl time
  - ideally
- Why detect **near duplicates** (page 12)
  - clustering
  - data extraction
  - plagiarism
  - spam detection
  - duplicates within a domain
- Solving the Duplicate/Near-Duplicate Detection Problems (page 13)
  1. Duplicate problem: exact match
     - solution: compute fingerprints using cryptographic hashing
  2. Near-duplicate problem: approximate match
     - solution: compute the syntactic similarity with an edit-distance measure, and use a similarity threshold
- Using a Cryptographic Hash Function to Convert a Web Page to a Number
  - cryptographic hash function
  - hash value
    - message digest, digital fingerprint, digest, or checksum
  - 4 main properties
    - easy / fast to compute
    - infeasible to invert (recover the message from its hash value)
    - a small change to the message changes the hash extensively
    - it is extremely unlikely that two slightly different messages will have the same hash
- Cryptographic Hash Functions
  - MD5
  - SHA-1, SHA-2
  - SHA-3
  - RIPEMD-160
- Identifying Identical Web Pages - 4 Approaches (page 15)
  - compare character by character
  - hash the first few characters
  - use a hash function
  - better approach
  - even better approach: cryptographic hash
- Identifying Near-Identical Web Pages - 2 approaches (page 16)
  - fingerprints
    - Simhash
    - Hamming distance
  - shingles
  - general paradigm
- General Properties of Distance Measures (page 18)
  - use a **distance measure** to compute similarity
  - 4 properties of a distance measure
    1. no negative distances
    2. $D(x,y) = 0$ iff $x = y$
    3. $D(x,y) = D(y,x)$ (symmetric)
    4. $D(x,y) \le D(x,z) + D(z,y)$ (triangle inequality)
  - Distance measures
    - Euclidean distance
    - Jaccard distance
    - cosine distance
    - edit distance
    - Hamming distance
      - boolean vectors
- Set Distance and Set Similarity are Related
  - notion of distance: $d(A,B)$
  - notion of similarity: $s(A,B)$
  - $d(A, B) = 1 - s(A, B)$
- **Jaccard similarity** (page 20)
- Computing Jaccard Similarity from Sets Containing Shingles
  - ==shingle==
  - $S(D,w)$ is the set of shingles of a document D of width w
- Similarity measures
  - resemblance (similarity): Jaccard(A, B)
  - containment
- ==Shingling Modeling Choices==
  - keep white space? -> yes
  - keep capitalization? -> yes
  - keep punctuation?
  - $k = $ ?
  - count replicas? -> no
  - omit stop words? -> yes
- Mapping Shingles to Numbers
  - fingerprints
  - high Jaccard similarity (e.g. greater than 0.9) implies the pages are near duplicates; i.e. if $J(fingerprint(A), fingerprint(B)) > k$, then the pages are similar
- Solution to Speed Up the Use of Shingles for Near-Document Similarity (page 25)
  - a probabilistic approach
  - for each document D compute the $sketch_D[i]$
  - computing $sketch_D[i]$ (pages 26-28)
- Computing Near Duplicates Using Jaccard Similarity and Probability
- Using ==SimHash== to Detect **Near Duplicates** (page 30)
  - documents D1 and D2 are near duplicates iff Hamming-Distance(Simhash(D1), Simhash(D2)) <= k
    - typically f = 64 and k = 3
  - Simhash is useful because if the Simhash bitwise Hamming distance of two phrases is low, then their Jaccard coefficient is high
  - Simhash (pages 32-36)
    - rotate
  - (see the shingling/SimHash sketch after this section)
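A compact sketch tying shingling, Jaccard similarity, and SimHash together; word-level shingles and MD5-truncated 64-bit fingerprints are my own choices for illustration (the slides use f = 64 bits and k = 3):

```python
import hashlib
import re

def shingles(text, w=3):
    """Set of word w-shingles of a document (lowercased, whitespace-split)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def fingerprint(shingle, f=64):
    """Map a shingle to an f-bit number (here via truncated MD5)."""
    return int.from_bytes(hashlib.md5(shingle.encode()).digest()[:f // 8], "big")

def simhash(text, w=3, f=64):
    """f-bit SimHash: per bit position, sum +1/-1 over all shingle
    fingerprints; the sign of each sum gives the output bit."""
    v = [0] * f
    for s in shingles(text, w):
        h = fingerprint(s, f)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming(x, y):
    return bin(x ^ y).count("1")

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(jaccard(shingles(d1), shingles(d2)))
print(hamming(simhash(d1), simhash(d2)))  # near duplicates iff <= k (e.g. k = 3)
```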
## Week 4

### Videos
- Video: Inverse Document Frequency Weighting, Jurafsky & Manning, Stanford U. (10 min)
  - document frequency
  - idf weight
    - $idf_t = \log_{10}(N/df_t)$
  - effect of idf on ranking
  - collection frequency vs. document frequency
- Video: TF-IDF Weighting, Jurafsky & Manning, Stanford U. (3 min)
  - tf-idf weights
    - best-known weighting scheme in information retrieval
    - increases with the number of occurrences of the term within a document
    - increases with the rarity of the term in the collection
  - final ranking of documents for a query
    - $Score(q,d) = \sum_{t \in q \cap d} tf{\cdot}idf_{t,d}$
  - binary -> count -> weight matrix
- Video: Calculating TF-IDF Cosine Scores, Jurafsky & Manning, Stanford U. (12 min)
  - tf-idf weighting has many variants
    - term frequency
    - document frequency
    - normalization
  - weighting may differ in queries vs. documents
    - SMART notation: ddd.qqq
    - query: logarithmic tf, idf, cosine normalization
    - tf-idf example: lnc.ltc
  - Summary - vector space ranking
    - represent the query as a weighted tf-idf vector
    - represent each document as a weighted tf-idf vector
    - compute the cosine similarity score for the query vector and each document vector
    - rank documents with respect to the query by score
    - return the top K (e.g., K = 10) to the user
- Video: Word Tokenization, Jurafsky & Manning, Stanford U. (14 min)
  - Text normalization
    - segmenting/tokenizing words in running text
    - normalizing word formats
    - segmenting sentences in running text
  - How many words?
    - fragments, filled pauses
    - lemma
    - wordform
    - type & token
      - N = number of tokens
      - V = vocabulary = set of types
      - Church and Gale (1990): $|V| > O(N^{1/2})$
  - terminal
    - less shakes.txt
    - tr -sc 'A-Za-z' '\n' < shakes.txt | less
    - tr -sc 'A-Za-z' '\n' < shakes.txt | sort | less
    - tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
    - tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c | sort -n -r | less
    - tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less
  - Issues in tokenization
  - Tokenization: language issues
  - Word tokenization in Chinese
    - word segmentation
    - standard baseline segmentation algorithm: maximum matching (see the sketch below)
    - max-match segmentation illustration
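A sketch of the greedy maximum-matching baseline; the tiny dictionary and test string are mine (with a real English dictionary, max-match famously mis-segments, which is why it is only a strong baseline for languages like Chinese):

```python
def max_match(text, dictionary):
    """Greedy maximum matching segmentation: at each position take the
    longest dictionary word that matches; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest span first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

d = {"the", "table", "down", "there", "them"}
print(max_match("themtabledownthere", d))  # ['them', 'table', 'down', 'there']
```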
### 7 - Introduction to Information Retrieval
- Information retrieval (IR)
  - index (a corpus)
  - efficiently retrieve relevant documents
- The Traditional IR System
- History of IR (pages 4-8)
- Areas Related To, But Different Than, Information Retrieval (pages 9-15)
- Basic Information Retrieval Begins with **Keyword Matching**
  - simplest relevance
  - synonymous terms
  - ambiguous terms
- Intelligent Information Retrieval
  - meaning, order, feedback, authority
- A More Detailed IR Architecture (page 17)
- Defining terms (page 19)
  - **stopword**
  - **stemming**
  - **indexing** (inverted index)
  - **searching**
  - **ranking**
- Retrieval Models (page 20)
  - a retrieval model specifies the details of
    - document representation
    - query representation
    - retrieval function
  - notion of relevance
    - binary
    - continuous
  - major information retrieval models
    - boolean models
    - vector space models
    - probabilistic models
- Common Pre-Processing Steps (page 22)
- Boolean model (page 23)
  - Boolean retrieval model (page 24)
  - problems (page 25)
  - problem examples (page 26)
- Vector-space model (page 27)
  - Graphic Representation Example (page 28)
  - Representing a Collection of Documents (page 29)
  - Term Weights: Term Frequency (page 30)
  - Term Weights: Inverse Document Frequency (page 31)
    - $df_i$
    - $idf_i$
    - discrimination (page 33)
  - Computing TF.IDF -- An Example (page 34)
  - Cosine Similarity (page 35)
  - Similarity measure (page 36)
    - degree of similarity
  - Normalized Vectors (page 37)
  - Similarity Measure for Document and Query (page 38)
  - Limitations of the Inner Product (page 39)
  - Cosine Similarity Using Inner Product -- Examples (page 40)
  - Cosine Similarity Measure Normalized (page 41)
  - Naïve Implementation (page 42)
  - Efficient Cosine Ranking (page 43)
  - Computing Cosine Scores (page 44)
  - Summary of Algorithm for Vector Space Ranking (pages 45-46; see the sketch after this section)
  - Bottleneck (page 47)
  - A Pre-Processing Strategy (page 48)
    - pre-compute
    - search
  - Comments on Vector Space Model (page 49)
  - Problems with Vector Space Model (page 50)
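A minimal sketch of the vector-space ranking summary above; the toy corpus is mine, and for brevity I apply the same ltc-style weighting (log tf, idf, cosine normalization) to both documents and query, rather than the lnc.ltc split used in the slides:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Sparse tf-idf vector: (1 + log10 tf) * log10(N/df), cosine-normalized."""
    tf = Counter(tokens)
    w = {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    """Dot product of two already length-normalized sparse vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

docs = [["best", "car", "insurance"], ["car", "car", "auto"], ["auto", "insurance"]]
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))   # document frequency
vecs = [tfidf_vector(doc, df, N) for doc in docs]
query = tfidf_vector(["car", "insurance"], df, N)
ranked = sorted(range(N), key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranked)  # document indices ranked by cosine score; return the top K
```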
### 8 - Lexicon & Text Normalization
- Overview
  - the issue of how a search engine creates its database of documents
  - lexicon
    - vocabulary
  - operations for refining the vocabulary
    - stemming
    - stop words
    - tokenization (parsing)
    - capitalization
    - case folding (uppercase -> lowercase)
    - synonyms
    - similar-sounding words
- Power Laws and Zipf's Law (page 3)
  - power laws: $y = kx^c$
  - Zipf's law is a power law with $c = -1$
  - Power Law Examples (pages 4-5)
- Statistical Properties of Text
  - how the frequency of different words is distributed
  - how fast the vocabulary size grows with the size of the collection
  - Word frequency (page 7)
    - common words
    - hapax legomena
  - Vocabulary growth (page 10)
  - Heaps' Law (page 11)
  - conclusion (page 13)
- **Lexicon Construction** (page 14)
  - definition
  - linguistic information
    - morphology
    - syntactic patterns
    - semantic information
      - is-a hierarchy
      - synonyms
  - numbers in normal form
  - dates in normal form
  - alternative names
- Determining the Characters in a Document (pages 16-17)
  - Unicode, ASCII, UTF-8
  - DOC, zip
  - XML
  - PostScript (PS) / PDF
  - Tika
- More Complications: Format/Language
- ==Tokenization== (page 19)
  - definition
  - tokenizer
    - lexer
  - simple tokenization algorithm
- Token Normalization (page 20)
  - standard
  - alternative
- Handling Unusual Specific Tokens (page 21)
- Stop words can be critical to meaning (page 22)
  - removal
  - exceptions: phrase searches
  - exceptions: sentiment determination
- Capitalization/Case-Folding (page 23)
  - retain capitalization?
- Stemming and Lemmatization (page 24)
  - stemming
  - lemmatization
  - morphology
- JavaScript Porter Stemmer Online (page 25)
- The Porter Stemming Algorithm (page 26)
- Typical Rules in Porter's Algorithm (page 27)
- Error Metrics (page 28)
  - over-stemming: false positive
  - under-stemming: false negative
- Soundex Algorithm (pages 29-33)
  - a phonetic algorithm for indexing names by their sound when pronounced in English
  - 4-character code
  - Soundex code (page 31)
  - Soundex algorithm (pages 32-33; a sketch follows this section)
  - Soundex example (page 34)
- Postscript: Lucene and Solr (page 35)
  - Lucene
  - Main Lucene modules (page 36)
  - Lucene in a search system
  - Lucene Tokenizers Specify How the Text in a Field is to be Indexed (page 38)
  - Example (pages 39-41)
  - Solr
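A sketch of the classic American Soundex, including the rule that h and w do not break a run of equal digit codes (the function name is mine; the code table is the standard one):

```python
def soundex(name):
    """Classic 4-character Soundex code: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()          # keep the first letter
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")      # vowels and h/w/y map to no digit
        if code and code != prev:     # collapse adjacent identical codes
            result += code
        if ch not in "hw":            # h and w do not break a run of codes
            prev = code
    return (result + "000")[:4]       # pad with zeros / truncate to 4 chars

# Similar-sounding names map to the same code:
print(soundex("Robert"), soundex("Rupert"))      # R163 R163
print(soundex("Ashcraft"), soundex("Ashcroft"))  # A261 A261
```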
## Week 5

### Videos
- Video: Word Similarity and Thesaurus Methods, Jurafsky & Manning (16 min)
  - Word similarity
    - synonymy
      - two words are either synonymous or not
    - similarity / distance
      - two words are more similar if they share more features of meaning
    - similarity is properly a relation between senses
    - we'll compute similarity over both words and senses
  - Why word similarity?
  - Word similarity & word relatedness
    - focus on similarity in this video
    - similar words
    - related words
  - Two classes of similarity algorithms
    - thesaurus-based algorithms
      - Q: are words nearby in the hypernym hierarchy?
      - Q: do words have similar glosses (definitions)?
    - distributional algorithms
      - Q: do words have similar distributional contexts?
  - Path-based similarity
    - two concepts are similar if they are near each other in the thesaurus hierarchy
      - i.e. have a short path between them
      - a concept has a path of length 1 to itself
  - Refinements to path-based similarity
    - $pathlen(c_1, c_2) = 1 + $ the number of edges in the shortest path between $c_1$ and $c_2$
    - $sim_{path}(c_1, c_2) = 1 / pathlen(c_1, c_2)$
    - $wordsim(w_1, w_2)$
  - Problems with basic path-based similarity
    - assumes each link represents a uniform distance
    - nodes high in the hierarchy are very abstract
    - we instead want a metric that
      - represents the cost of each edge independently
      - makes words connected only through abstract nodes less similar
  - Information-content similarity metrics (6:02)
    - $P(c)$: the probability that a randomly selected word in a corpus is an instance of concept c
    - $P(c)$, $1 - P(c)$
    - $P(root) = 1$
  - Information-content similarity
    - train by counting in a corpus (7:18)
    - WordNet hierarchy augmented with probabilities P(c) (8:32)
  - Information content: definitions (9:14)
    - information content: $IC(c) = -\log P(c)$
    - lowest common subsumer: LCS
  - Resnik method (9:38)
    - Resnik: measure common information as $sim_{resnik}(c_1, c_2) = -\log P(LCS(c_1, c_2))$
  - Dekang Lin method
    - difference between A and B
      - commonality = $IC(common(A, B))$
      - difference = $IC(description(A, B)) - IC(common(A, B))$
    - Lin similarity functions
    - Dekang Lin similarity theorem (11:14): $sim_{lin}(A, B)$
    - Lin similarity function (11:41)
  - The (extended) Lesk algorithm (12:59)
    - a thesaurus-based measure that looks at **glosses**
    - two concepts are similar if their glosses contain similar words
      - n-word phrases in both glosses
  - Summary
    - $sim_{path}$
    - $sim_{resnik}$
    - $sim_{jiang-conrath}$
    - $sim_{eLesk}$
  - Libraries
    - NLTK
    - WordNet::Similarity
  - Evaluating similarity
    - intrinsic evaluation
      - correlation between algorithm and human word-similarity ratings
    - extrinsic (task-based, end-to-end) evaluation
      - malapropism (spelling error) detection
      - WSD
      - essay grading
      - TOEFL
  - Thesaurus methods
- YouTube videos
  - unique 11-character string
    - a random number in base 64
  - ~~How YouTube makes money~~
  - How YouTube works
    - a whole video is chopped into many pieces
    - copies are kept in different resolutions and formats
    - DNS, ISPs
    - TCP/IP

### 9 - Inverted Indexing
- Outline
  - definition of an inverted index
  - examples of inverted indices
  - representing an inverted index
  - processing a query on a linked inverted index
  - skip pointers to improve merging
  - phrase queries
    - biwords
  - grammatical tagging
  - n-grams
  - distributed indexing
- Creating an ==Inverted Index== (page 3)
  - a **vector** containing all distinct words of the text collection in lexicographical order (the vocabulary)
  - terms in the inverted file index may be refined:
    - case folding
    - stemming/lemmatization
    - stop words
- Processing a Query: An Example (pages 4-6)
  - grep with a regular expression
  - Term-Document Incidence Matrix (page 5)
  - Incidence Vectors (page 6)
    - 0/1 vector for each term
  - Actual Answers to the Query (page 7)
- Term-Document Incidence Matrices are Naturally **Sparse** (page 8)
  - 0.1% of the elements are 1
- Inverted Index Example (pages 9-14)
  - Inverted Index Stored in Two Parts (page 10)
    - dictionary (index file)
    - postings list
  - Parsing Documents to Create an Inverted Index (page 11)
- Query Processing Across the Postings List (page 15)
  - basic merge (page 16): $O(m+n)$
  - Query Optimization (pages 18-20)
- ==Skip pointers== (page 21)
  - The Technique of Skip Pointers (page 22)
  - Query Processing with Skip Pointers (page 23)
  - Facts on Skip Pointers (page 24)
    - skip pointers are added at indexing time
    - argument
    - solution: for a postings list of length P, use ==$\sqrt{P}$== evenly-spaced **skip pointers**
- Phrase Queries (page 26)
  - two approaches to solving this problem (see the positional-index sketch at the end of this review)
    1. bi-word indexes (also called 2-grams or 2-shingles)
    2. positional indexes
  - Using Biword Indexes for Phrase Searching
    - bi-word / 2-gram
  - Handling longer phrase queries (page 29)
    - queries longer than 2 words will have to be broken into bi-word segments
      - e.g. stanford university palo alto
        - stanford university AND university palo AND palo alto
      - might have false positives
    - **phrase index**
  - Part-of-Speech Tagging (page 30)
    - embedded stop words
    - part-of-speech tagging
  - Alternate Solution Using Positional Indexes (page 31)
    - **positional index**
    - Positional Index Example (page 32)
  - Processing a Phrase Query
    - algorithm for matching a phrase query
      - extract
      - merge
      - match (find adjacent positions)
  - Algorithm for Proximity Queries within k Words (page 34)
  - Some High-Frequency Noun Phrases from TREC and Patent Datasets (page 35)
- Google Relies Upon N-Gram Indexes (page 36)
  - Zipf distribution (power law)
  - Google's N-Gram Database Facts
  - Examples of 3-Gram, 4-Gram Data (page 38)
  - The Value of N-Grams (page 39)
    - deciding which n-grams can be chunked together
    - helping make next-word predictions
    - making spelling-error corrections
  - An N-Gram Example (page 40)
  - N-Grams Used to Find Best and Worst Search Queries (page 41)
  - Google's N-Gram Viewer (page 42)
  - Comparing N-Grams Across Languages

### 10 - Video Search Engines: YouTube et al.
- Video search engines - quick summary (page 2)
  - video search engines
  - indexing
  - ranking
- Video Search Engines That Crawl for Content
- Video Search Engines That Host
- Video Search Engines That Stream Entertainment
- Some Technologies Supporting Video Content (page 6)
  - subtitles
  - open captions
  - closed captions
  - SDH (Subtitles for the Deaf and Hard of Hearing)
  - speech recognition
- YouTube Background
  - YouTube as a search engine
    - 2nd largest search engine
  - YouTube traffic
  - some facts
- YouTube Search Engine Issues to Consider (page 10)
  - video formats
  - display platforms
  - distribution: content distribution network (CDN)
  - monetizing the website: ContentID system
  - keeping users watching: recommendation system
- YouTube Channels
- YouTube Gathers Information When Videos are Uploaded
  - Uploading to YouTube: Second Input Screen
  - Uploading to YouTube: Third Input Screen
- Business Model: Ads, Ads, Ads
- Sample YouTube Search Results for Katy Perry
  - Ranking: Ads, Views, Age - YouTube Search Results
- YouTube Advanced Search Ranking Filters
- YouTube Ranking Factors (page 18)
  - metadata
  - video quality
  - numbers of views, likes, shares and links
  - subtitles and closed captions
- YouTube Upload Characteristics (page 19)
  - 8 video formats for uploading
    - MOV, MP4 (MPEG-4), AVI, WMV, FLV, 3GP, MPEG-PS, WebM
  - aspect ratio: 4:3 or 16:9
  - maximum file size: 128 GB
  - 15 minutes long
  - short life cycle
- YouTube Videos Run on Multiple Platforms
- YouTube Makes Recommendations to Retain Viewers
  - YouTube recommendation algorithm
  - YouTube recommendation system uses graph properties
    - association rule mining
- Media Sites (including YouTube) Move Away from False Information
- Content Delivery Networks (page 26)
  - Content Delivery Network (CDN)
  - Akamai, Limelight and Level 3 Communications
- YouTube Video Delivery System (page 27)
  - identification: unique 11-character base-64 string
  - efficient delivery: CDN
- YouTube (Google's) Content Delivery Datacenters (page 28)
  - Uploading a YouTube Video
  - YouTube's Content Distribution Network: Downloading a YouTube Video
  - YouTube Delivery System
  - References to YouTube's CDN
- Monetizing YouTube (page 34)
  - challenge
  - **ContentID** (page 35)
    - creating an acoustic fingerprint
    - How good is Content ID?
- Will YouTube Ever Run Out of Storage?
  - total capacity of YouTube storage
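A minimal sketch of the positional-index approach to phrase queries referenced in the inverted-indexing section above; the toy corpus and function names are mine:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Match a phrase: intersect the terms' postings (AND merge), then
    keep documents where the terms appear at adjacent positions."""
    terms = phrase.lower().split()
    postings = [index[t] for t in terms]
    common = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in sorted(common):
        # The phrase matches if term i occurs at offset +i from some start p.
        if any(all(p + i in postings[i][doc_id] for i in range(1, len(terms)))
               for p in postings[0][doc_id]):
            hits.append(doc_id)
    return hits

docs = ["to be or not to be", "to do or not to do", "be not afraid"]
idx = build_positional_index(docs)
print(phrase_query(idx, "to be"))   # [0]
print(phrase_query(idx, "or not"))  # [0, 1]
```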