# CSCI - 572 Final Review

## Week 7

### 13 - PageRank

- Bibliometrics
    - citation analysis
    - Bibliographic coupling - p5
    - impact factor - p6
- citations vs web links - p9
- PageRank - p10
    - Larry Page & Sergey Brin
    - probability distribution
- Initial PageRank Formulation - p12
- Simplified Algorithm - p13
- Complete PageRank Algorithm - p18-19
    - 0.85
    - 0.15
- principal eigenvector - p20
- recurrence relation - p20
- PageRank Convergence - p27
    - damping factor
    - sink
- Matrix Formulation for Computing PageRank - p28
    - column stochastic matrix
    - rank vector
- Flow Equation in Matrix Form - p29
- Eigenvector Formulation - p30
- Eigenvalue and Eigenvector - p31
- Power Iteration method - p32
    - relaxation
- Examples - p33-43
    - fully meshed - p40-41
    - a technique used by some disreputable sites - p43
        - link farm
- Suggestions for improving your page rank - p44
    - site maps
- Importance of PageRank - p45
- Random Walk Interpretation - p46
    - random web surfer
- Spider traps - p47
- Random teleports - p48
- HITS by Kleinberg - p50-52
    - another link analysis algorithm
    - hubs - p52
    - authorities - p51
    - in-degree
- bipartite graph - p53
- HITS algorithm - p55
    - normal query
- limitations - p57
- final observations - p61

## Week 8

### 15 - MapReduce

#### MapReduce

- Introduction
    - MapReduce is a methodology for exploiting parallelism in computing clouds - p3
- parallelization
    - Multithreaded = Unpredictability
    - Synchronization - p11
- MapReduce provides - p12
    - Automatic parallelization
    - Fault tolerance -> p26
    - I/O scheduling
    - Monitoring & Status updates
- Lisp - p13
- programming model - p15
    - map
    - shuffle
    - reduce
- paradigm - p16
    - map
    - group (shuffle)
    - reduce
- map task - p23
    - master controller
    - worker

#### Google File System - GFS

- Characteristics of a Google DataCenter - p29
- fixed size chunks of 64 MB - p35
- master server & chunk server - p36
- Characteristics of a Google DataCenter - p39

#### BigTable

## Week 9

### 17 - SearchEngineAdvertisingOverview

- Types of Online Advertising - p2, p7
- Advertisers Designate Keyword Matching Rules - p13
    - broad match - p14
    - exact match - p15
    - phrase match - p16
    - negative keyword - p17
- Capabilities of Search Engine Ad Servers - p19
- Google Ads Auction Rules - p21
- AdWords - p22
    - CPC: cost per click
    - CPA: cost per action - p23
    - first-price auction - p25
    - second-price auction - p25
    - Ad Rank = Bid × Click Probability - p27
- AdSense - p28-35
    - WordNet - p33
- Google AdSense, AdMob Moves to a First-Price Auction Model - p36
- Ad Exchanges and DoubleClick - p37
    - ad exchange
    - ad network
    - how DoubleClick works - p49

### 18 - WikiMaster

- Outline
    - Basic definitions: Taxonomy, Ontology, Knowledgebase
    - Knowledgebase Internals and Examples
        - WordNet
        - Wikipedia
        - Google’s Knowledge Graph
- Properties of a Good Knowledge Representation System - p11
- Inheritance - p12-14
    - advantage - p14
- notations for a knowledgebase - p15
- RDF data model - p16
    - labeled multigraphs - p18
- inferencing on knowledgebases - p25
    - forward chaining - p26
        - modus ponens
    - backward chaining - p26
- Binary Relations and Instances - p27
- Semantic Network - p29-32

#### ==WordNet== - p33-37

- lexical database with classes, subclasses, and superclasses

#### Wikipedia - p39

- Transformation from database to knowledgebase
- five pillars - 5P - p41
- Combining Wikipedia Named Entities to WordNet Synsets - p46
- WikiData - p48

#### Google’s Knowledge Graph - p51

- Google Search Combines Document Index with Knowledge Graph - last page

## Week 10

### 19 - QueryProcessingModifiedY

- Outline
    1. Restructuring the inverted index to speed up processing
    2. Reverse engineering Google’s query processing algorithm
    3. A close-up look at Google’s internal architecture

#### Speeding up indexed retrieval - p3

- strategies - p4-6, 8, 10, 12
- Static Quality Scores Heuristic - p7
    - relevance
    - authority

#### Query Processing Algorithm

- searchmetrics - p15
- moz.com - p15
- ranking factors - p17
- nofollow links - p18
- SERP: search engine results page
- correlation is not necessarily causation
- MozRank - p28

#### Google Architecture - p32

- Modern Query Processing Methodology - p38
- term-based vs entity-based meaning - p39
- Google’s Query Processing Elements - p40
- Using the KnowledgeGraph to Identify Entities - p41
- Entity Recognition in the Knowledge Graph - p42
- Describing semantic clustering - p43
- RankBrain: An entity-based processor - p44-51

## Week 11

### 21 - SpellingCorrection

- 2 main spelling tasks - p6
    - spelling error detection
    - spelling error correction
- 3 types of spelling error - p7
    - non-word errors
    - typographical errors
    - cognitive errors
- causes of misspelling - p9
- challenge for identifying spelling errors - p11-12
- The Noisy Channel Model - p13
- Bayesian Inference Implies We Can Use Previous Combinations to Predict the Correct Word - p14-16
- Use Edit Distance To Produce Candidate Corrections - p17
- correction & autocomplete - p19
    - prefix tree - p20
    - n-gram - p22

#### A Complete Spelling Correction Program - p29

- Natural Language Corpus Data

#### Edit Distance & Levenshtein Algorithm - p32

- Computing alignments - p41
- weighted edit distance - p42
    - confusion matrix

### 22 - RichTextSnippets

- meta description
- automatic summarization
    - extraction
    - abstraction
- People Also Ask (PAA)
- schema.org
    - object hierarchy
- TLDR - too long, didn't read
- formalism
    - microdata - for rich snippets
    - RDFa
    - microformat encoding

## Week 12

### 23 - Clustering

- Outline
    - Document clustering
        - Motivations
        - Document representations
        - Success criteria
    - Clustering algorithms
        - Partitional
        - Hierarchical
- Clustering & classification - p4
    - clustering: unsupervised
    - classification: supervised
- Cluster hypothesis - p6
    - ergo
- yippy.com - p8
    - clusty
- origin of Yahoo's name - p9
- Google News
    - Google RSS feeds
- good cluster - p12
- classification vs clustering - p14
- similarity / distance - p21
    - cosine similarity
    - Euclidean distance
- clustering algorithms - p23
    - partitioning based
    - hierarchical
        - agglomerative
        - divisive
- k-means clustering - p24-38
    - NP-hard
- hierarchical - p39
    - agglomerative
    - divisive
- distance matrix - p40
- centroid - p41
- dendrogram - p42
- Divisive Clustering Algorithm - p54

### 24 - QuestionAnswering (QA)

- Information Retrieval vs Question Answering - p2
- semantic difficulties - p9
- NLP challenges - p11
    - predicate-argument structure
- knowledge-based approach - p15
- IBM’s WATSON System - p17
- Question type - p19
- phases of QA - p20
- QA 3-phase block architecture - p21
    - WordNet
    - POS parser
    - NER - named entity recognition
- Question Taxonomy - p22
- factoid questions - p23
    - learning question classifiers
- Capabilities for QA system - p25
- Question Processing Tool
    - part-of-speech recognizer - p26
    - named entity recognizer - p27
- NLP extraction to build knowledge graph - p28
- Jeopardy query - p29
- Expanding the Keyword Set Using Variants - p32
    - morphological variants
    - lexical variants
    - semantic variants
- Incorporate Lexical Variants Using Hypernyms and Hyponyms - p33
    - WordNet
- Semantic Similarity - p37
- passage retrieval - p38
- passage scoring method - p39
    - passage ordering
- Ranking Candidate Answer
    - Local alignment example - p42-48
    - Ranking Scheme - p49
- BERT - p50
    - Bidirectional Encoder Representations from Transformers
- Why BERT? - p51
    - context-free models
    - contextual models
- How does BERT work? - p52
    - bidirectionally
    - pre-trained model
- AskMSR - p54
    - rewrite query

## Week 13

### 25 - Classification

- Outline
    - Examples of Relevance Feedback
    - Query Expansion examples and techniques for producing relevance feedback
    - Rocchio Algorithm for Relevance Feedback (under ideal conditions, i.e. relevant and non-relevant documents are known)
    - Rocchio Algorithm for Classification including an online version
    - K-Nearest Neighbor Algorithm for Classification
- Relevance Feedback
    - query
    - recall
    - improve recall - p5
        - local methods
        - global methods
- Query expansion - p6
    - 6 techniques
        - word stemming
        - acronyms
        - misspellings
        - synonyms
        - translations
        - ignored words
- WordNet to implement Query Expansion - p9
- WordNet synset relationships - p10
- Query Expansion
    - using WordNet - p11
    - using Thesaurus - p12
- Rocchio algorithm - p14
- kNN - k Nearest Neighbor Method - p35
- classifiers - p38
- contiguity hypothesis - p39
- measure distance - p40
- K-NN has also been called - p44
    - Case-based learning
    - Memory-based learning
    - Lazy learning
- Voronoi diagram - p46
- Algorithm comparison between K-means & KNN

### 26 - RecommendationSystem

- Scarcity versus Abundance - p6
- 2 types of Recommendation System - p7
    - content-based filtering
    - collaborative filtering
    - hybrid system
- Example
    - last.fm - user similarity
    - Pandora - item similarity
- utility matrix - p14
    - boolean utility matrix - p18
    - star ratings normalized - p19
- content-based approach
    - pros - p21
    - cons - p22
- Collaborative Filtering - p23
    - Centered cosine - p27
        - Pearson correlation
    - Making rating predictions for a user
    - item-item vs user-user - p31
- Evaluation Metrics for Recommendation Engines - p36

## Week 14

### 27 - ImageSearch

- Image Search - p2
- Video Search Engine - p2
- How search engines do image indexing - p11-13
    - first user tags - p11
    - use surrounding text - p12
    - use feature extraction - p13
- feature extraction: 3 types of image features - p14
    - primitive features
    - semantic features
    - domain specific features
- color histograms
    - similar color content
- color layout
- ImageNet: A Large-Scale Hierarchical Image Database - p23
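
For the color histogram item in 27 - ImageSearch, a small worked sketch may help when reviewing how "similar color content" can be scored. This is only an illustration and is not taken from the slides: the helper names `color_histogram` and `histogram_intersection`, the coarse 2-bins-per-channel quantization, and the use of histogram intersection as the similarity measure are all assumptions made for the example.

```python
from collections import Counter

# A pixel is an (R, G, B) tuple with each channel in 0..255.
# All names below are hypothetical helpers for this review sketch.


def color_histogram(pixels, bins_per_channel=2):
    """Quantize each channel into coarse bins and return the share of
    pixels falling into every occupied (r_bin, g_bin, b_bin) cell."""
    width = 256 // bins_per_channel   # channel values covered by one bin
    top = bins_per_channel - 1        # clamp so value 255 stays in the last bin
    counts = Counter(
        tuple(min(channel // width, top) for channel in pixel)
        for pixel in pixels
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell: n / total for cell, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of per-cell minima of two normalized histograms:
    1.0 for identical histograms, 0.0 when no cell is shared."""
    return sum(min(share, h2.get(cell, 0.0)) for cell, share in h1.items())


# Tiny made-up "images" given as flat pixel lists.
sunset = [(250, 120, 30)] * 3 + [(20, 20, 60)]
beach = [(240, 100, 10)] * 2 + [(30, 10, 90)] * 2
forest = [(10, 200, 20)] * 4

print(histogram_intersection(color_histogram(sunset), color_histogram(beach)))   # 0.75
print(histogram_intersection(color_histogram(sunset), color_histogram(forest)))  # 0.0
```

A histogram of this kind only records how much of each color is present, not where it appears in the image, which is presumably why color layout is listed as a separate feature.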