# CSCI - 572 Midterm Review
## Week 1
### 1B - SearchEngineBasic
#### Search Engine History and Basics
#### Search Engine Basic Behavior
- Web search
- Web search engine definition
- enterprise search engines
- personal search engines
- Web search engine elements
- spider / crawler / robot
- The indexer
- Query processor
- Query processing
- semantic analysis
### Web Trends and Measurements
- Web trends
- Major Cloud Providers
- Google voice search query
- Measuring the web
- Web facts - p32
- Manual Hierarchical Web Taxonomies
- Yahoo
- Open Directory Project
- DMoZ
- Internet Archive
## Week 2
### Video
- **Precision and Recall, Jurafsky & Manning, Stanford U. (16 min)**
- 2-by-2 contingency table
- Precision and recall
- precision: % of selected items that are correct
- recall: % of correct items that are selected
- A combined measure: F
		- A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean)
		- The harmonic mean is a very conservative average
		- People usually use the balanced F1 measure
			- beta = 1 (alpha = 1/2)
			- F = 2PR/(P+R)
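
A minimal Python sketch of these formulas (the counts in the example are made up):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F from 2x2 contingency-table counts."""
    precision = tp / (tp + fp)   # % of selected items that are correct
    recall = tp / (tp + fn)      # % of correct items that are selected
    b2 = beta ** 2
    # Weighted harmonic mean; beta = 1 gives the balanced F1 = 2PR/(P+R).
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# 40 relevant retrieved, 10 irrelevant retrieved, 20 relevant missed:
print(precision_recall_f(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```
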
- **WDM116: URL Frontier Using Mercator Scheme (29 min)**
- Mercator URL frontier
- URLs flow in from the top into the frontier
- Front queues manage prioritization
		- Back queues enforce politeness
- Each queue is FIFO
- Front queues
		- Prioritizer assigns each URL an integer priority between 1 and K
- Appends URL to corresponding queue
- Heuristics for assigning priority
- Refresh rate sampled from previous crawls
- Application-specific (e.g. crawl news sites more often)
- Biased front queue selector
- When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
		- This choice can be round robin, biased toward queues of higher priority, or some more sophisticated variant
- Can be randomized
- URL frontier: two main considerations
- politeness: do not hit a web server too frequently
		- freshness: crawl some pages more often than others
- e.g. news page
		- These goals may conflict with each other
			- e.g. a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site
- Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue only contains URLs from a single host
- maintain a table from hosts to back queues
- host name / back queue
- back queue heap
- one entry for each back queue
		- the entry is the earliest time $t_e$ at which the host corresponding to the back queue can be hit again
		- This earliest time is determined from
- last access to that host
- any time buffer heuristic we choose
- politeness - challenges
		- even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
- common heuristic: insert time gap between successive requests to a host that is >> time for most recent fetch from the host
	- Back queue processing (20 min)
- a crawler thread seeking a URL to crawl
- Extract the root of the heap
		- Fetch the URL at the head of the corresponding back queue q (via the host-to-queue table)
- check if queue q is now empty
			- if so, pull a URL v from a front queue
			- once q is non-empty again, create a new heap entry for it
	- Number of back queues B
- keep all threads busy while respecting politeness
		- Mercator recommendation: three times as many back queues as crawler threads
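
A single-threaded Python sketch of the frontier described above; the queue counts, politeness gap, and refill policy are simplified stand-ins for Mercator's real heuristics (e.g., a real frontier refills an emptied back queue from the front queues rather than deleting it):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    def __init__(self, num_front_queues=3, gap_seconds=2.0):
        self.front = [deque() for _ in range(num_front_queues)]  # index 0 = highest priority
        self.back = {}    # host -> FIFO of URLs for that single host
        self.heap = []    # (earliest time this host may be hit again, host)
        self.gap = gap_seconds

    def add(self, url, priority=0):
        """Prioritizer: append the URL to the front queue for its priority."""
        self.front[priority].append(url)

    def _refill(self):
        """Biased front-queue selector: pull from the highest-priority nonempty queue."""
        for q in self.front:
            if q:
                url = q.popleft()
                host = urlparse(url).netloc
                if host not in self.back:            # invariant: one host per back queue
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.time(), host))
                self.back[host].append(url)
                return True
        return False

    def next_url(self):
        """A crawler thread seeking a URL: extract the heap root, respect the gap."""
        while not self.heap and self._refill():
            pass
        if not self.heap:
            return None
        t, host = heapq.heappop(self.heap)
        time.sleep(max(0.0, t - time.time()))        # politeness: wait out the time gap
        url = self.back[host].popleft()
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + self.gap, host))
        else:
            del self.back[host]   # real Mercator refills this queue from the front queues
        return url
```
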
### Search Engine Evaluation
- Defining precision/recall
- precision / recall
	- formalizing precision/recall
- $Precision = tp / (tp + fp)$
- $Recall = tp / (tp + fn)$
- Harmonic Mean and F Measure
- Harmonic Mean
- F-score / F-measure
- Calculating Recall/Precision at Fixed Positions - page 10
- Mean Average Precision
- Average Precision of the Relevant Documents - page 11
- Average across Queries
- MAP: Mean average precision
- example: page14
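
A short sketch of AP/MAP; this variant averages precision over the relevant documents that were actually retrieved:

```python
def average_precision(rels):
    """rels: 1/0 relevance of each ranked result, best first."""
    hits, precs = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precs.append(hits / k)      # precision at each relevant position
    return sum(precs) / len(precs) if precs else 0.0

runs = [[1, 0, 1, 0, 0], [0, 1, 1]]     # one relevance list per query
print(sum(average_precision(r) for r in runs) / len(runs))  # MAP ~ 0.708
```
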
- Difficulties in Using Precision/Recall
- Discounted Cumulative Gain: DCG
- A final evaluation measure
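
A sketch of DCG with the common log2 discount (formulations differ in the discount base and in whether gains are exponentiated):

```python
import math

def dcg(gains):
    """rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

print(dcg([3, 2, 3, 0, 1, 2]))  # graded relevance of the top 6 results -> ~8.10
```
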
- Search Engine Evaluation Metrics / Elements of Good Search Results
	- test collections of queries and hand-ranked results
- recall
- precision at top $k$ positions
- Search engines also use non-relevance-based measures
- Click-through on first result
- Studies of user behavior in the lab
- A/B testing
- Google's Search Quality
- 4-step process for changing their search algorithm
- precision evaluation
- side-by-side experiment
- live traffic experiment
		- full launch
- A/B Testing
- User clicks
- Using log files for evaluation
- Google's Enhancements of Search Results
### Crawlers and Crawling
- Web crawler
- Web crawler issues
- How to crawl?
- Quality, efficiency, etiquette.
- How much to crawl?
- Coverage, relative coverage
- How often?
- Freshness, how much has really changed?
- Simplest Crawler Operation
- Initialize with “seed” page
- Loop: Fetch and parse a page
- Crawling picture
	- challenges
		- Handling/avoiding malicious pages
- Latency / bandwidth
- Robots.txt
- Maintain politeness
- Robots.txt
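
Python's standard library can check URLs against a site's robots.txt; a minimal sketch (the site and user-agent name are illustrative, and the calls need network access):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                          # fetch and parse robots.txt
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))
print(rp.crawl_delay("MyCrawler"))                 # politeness hint, if the site sets one
```
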
- Basic search strategy - page 11
- Breadth-first search
- Depth-first search
- Queueing strategy - page 15
- FIFO - BFS
- LIFO - DFS
	- Heuristic ordering - focused crawler
- re-order URLs
- Avoiding page duplication
	- A crawler must efficiently index
- URLs
			- standard format / already seen
- Already visited page
- identical & near-identical
- Link extraction
- find html element to extract links
	- relative URLs must be completed (made absolute)
- Anomalies
- anchors might not have links
- dynamic pages
- Representing URLs
- To determine if a new URL has already been seen
- URLs are sorted lexicographically and then stored as a delta-encoded text file
- **Trie** for URL Exact Matching
- Why normalize URLs? - page 20
- **Normalizing URLs** - 4 rules - page 21
- Unreserved characters
- ALPHA (%41–%5A and %61–%7A)
- DIGIT (%30–%39)
- hyphen (%2D)
- period (%2E)
- underscore (%5F)
- tilde (%7E)
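
A sketch of a few of these normalization rules (lowercase the scheme and host, drop default ports, percent-decode only unreserved characters); a production normalizer applies more rules than shown:

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")

def _decode_unreserved(m):
    ch = chr(int(m.group(1), 16))
    # Decode %XX only when it encodes an unreserved character; else normalize case.
    return ch if ch in UNRESERVED else m.group(0).upper()

def normalize(url):
    p = urlsplit(url)
    scheme = p.scheme.lower()                         # lowercase the scheme
    netloc = p.netloc.lower()                         # lowercase the host
    host, _, port = netloc.rpartition(":")
    if (scheme, port) in (("http", "80"), ("https", "443")):
        netloc = host                                 # drop the default port
    path = re.sub(r"%([0-9A-Fa-f]{2})", _decode_unreserved, p.path)
    return urlunsplit((scheme, netloc, path or "/", p.query, p.fragment))

print(normalize("HTTP://Example.COM:80/%7Euser/a%2Db/"))  # http://example.com/~user/a-b/
```
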
- Avoiding spider traps
- session id
- Handling spam web pages
- repeated terms
- cloaking
- doorway pages
- The Mercator web crawler
- DNS
- UDP
- parallel threads
- Measuring and Tuning a Crawler (improve)
- DNS caching, pre-fetching, and resolution
- DNS
- UDP
- customize
- Multi-threaded crawling
- Crawling approaches
- Centralized crawler
- Distributed crawling approaches
- Distributed model
	- Benefits and issues of distributed crawling
- benefits: scalability, costs, network-load dispersion and reduction
- issues: overlap, quality, communication bandwidth
- Coordination of distributed crawling
- independent
- dynamic assignment
- central coordinator
- static assignment
- firewall mode
- cross-over mode
- exchange mode
- Classification of parallel crawlers
- Freshness - page 33
- selection policy
- re-visit policy
- politeness policy
- parallelization policy
- Keeping Spidered Pages Up to Date
- Implications for a Web Crawler
- running multiple types of crawlers is best
- Updating in-place keeps the index current
- Cho and Garcia-Molina, 2000
- re-visit policy
- uniform policy
- proportional policy
- Help the Search Engine Crawler by Creating a Sitemap
- sitemap
- XML
- General Sitemap Guidelines
- Google crawlers
- Googlebot
- To Verify That the Visitor is Googlebot
- reverse DNS lookup
- match the crawler's IP address
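
A minimal sketch of that two-step check (the IP is illustrative; `socket.gethostbyaddr` raises if there is no reverse record):

```python
import socket

def is_googlebot(ip):
    """Reverse DNS, verify the domain, then forward-confirm the IP matches."""
    host = socket.gethostbyaddr(ip)[0]            # e.g. crawl-66-249-66-1.googlebot.com
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    return ip in socket.gethostbyname_ex(host)[2] # forward lookup must return the same IP

print(is_googlebot("66.249.66.1"))                # needs network access
```
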
- Controlling How Often Googlebot Visits Your Site
- crawl rate
- Googlebot is Really Chrome Re-packaged
- Googlebot processes web pages with JavaScript in 3 phases
1. Crawling – processing all links
2. Rendering – executing JS and then looping back to 1
3. Indexing
## Week 3
### 5 - De-Duplication
- De-Duplication definition
- locker storage
- Similar content with different URLs
- Totally Distinct URLs May Deliver the Same Page
- Near duplicate
- spotting near duplicate - page 6
- Web Duplicates-Mirroring
- single largest cause
- Host1/a and Host2/b are mirrors iff
- Why detect **exact duplicates** - page 11
- smarter crawling
- better connectivity analysis
	- avoid redundancy in result listings
- reduce crawl time
- ideally
- Why detect **near duplicates** - page 12
- clustering
- data extraction
- plagiarism
- spam detection
	- duplicates within a domain
- Solving the Duplicate/Near-Duplicate Detection Problems - page 13
1. Duplicate Problem: Exact match;
- Solution: compute fingerprints using cryptographic hashing
2. Near-Duplicate Problem: Approximate match
- Solution: compute the syntactic similarity with an edit-distance measure, and use a similarity threshold
- Using a Cryptographic Hash Function to Convert a Web Page to a Number
- cryptographic hash function
- hash value
- message digest, digital fingerprint, digest, or checksum
- 4 main properties
- easy / fast
		- one-way: it is infeasible to generate a message from its hash value
		- small change: a small change to the message changes the hash extensively
		- it is infeasible to find two different messages with the same hash
- Cryptographic Hash Functions
- MD5
- SHA-1, SHA-2
- SHA-3
- RIPEMD-160
- Identifying Identical Web Pages – 4 Approaches - page 15
	- compare character by character
- hash first few characters
- use a hash function
- better approach
	- even better approach: cryptographic hash
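
A tiny illustration of the cryptographic-hash approach using SHA-256 (the page contents are made up):

```python
import hashlib

page_a = b"<html><body>Hello</body></html>"
page_b = b"<html><body>Hello!</body></html>"

# Equal digests imply identical content (collisions are astronomically unlikely).
print(hashlib.sha256(page_a).hexdigest() == hashlib.sha256(page_b).hexdigest())  # False
```
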
- Identifying Near Identical Web Pages - 2 approaches - page 16
- fingerprints
- Simhash
- Hamming Distance
- shingles
- General paradigm
- General Properties of Distance Measures - page 18
- Use **distance measure** to compute similarity
	- 4 properties of a distance measure
1. No negative distances
2. D(x,y) = 0 iff x=y
3. D(x,y) = D(y,x) symmetric
4. D(x,y) <= D(x,z) + D(z,y) triangle inequality
	- Distance measures
- Euclidean distance
- Jaccard distance
		- Cosine distance
- Edit distance
- Hamming distance
- boolean vector
- Set Distance and Set Similarity are Related
- notion of distance: $d(A,B)$
- notion of similarity: $s(A,B)$
	- $d(A, B) = 1 - s(A, B)$
- **Jaccard similarity** - page 20
- Computing Jaccard Similarity from Sets Containing Shingles
- ==shingle==
- $S(D,w)$ is the set of shingles of a document D of width w
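
A minimal sketch of $S(D,w)$ and Jaccard similarity over word shingles (w and the sample texts are illustrative):

```python
def shingles(text, w=3):
    """S(D, w): the set of w-word shingles of document D."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a flower which is a rose"
print(jaccard(shingles(d1), shingles(d2)))  # 3 shared shingles / 7 total = 0.428...
```
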
- Similarity Measures
	- Resemblance (similarity): Jaccard(A,B)
	- Containment
- ==Shingling Modeling Choices==
- white space ? -> yes
- capitalization ? -> yes
- punctuation
- $k = $ ?
- count replicas ? -> no
	- omit stop words? -> yes
- Mapping Shingles to Numbers
- fingerprints
	- High Jaccard similarity (e.g. greater than 0.9) implies the pages are near duplicates; i.e., if J(fingerprint(A), fingerprint(B)) > k, then the pages are similar
- Solution to Speed Up the Use of Shingles for Near Document Similarity - page 25
- a probabilistic approach
- For each document D compute the $sketch_D[i]$
- Compute $sketch_D[i]$ page 26 - 28
- Computing Near Duplicates Using Jaccard Similarity and Probability
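
A sketch of the min-hash idea behind these sketches: the fraction of positions where two documents' min-hashes agree estimates their Jaccard similarity. The hash family below is an illustrative choice:

```python
import random

PRIME = (1 << 61) - 1  # modulus for the illustrative universal hash family

def sketch(fingerprints, hash_params):
    """sketch_D[i] = min over the document's shingle fingerprints under hash i."""
    return [min((a * f + b) % PRIME for f in fingerprints) for a, b in hash_params]

def estimate_jaccard(s1, s2):
    # P(min-hashes agree) = J(A, B), so the agreement rate estimates J.
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

random.seed(0)
params = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(200)]
doc_a = {hash(s) for s in ["a rose", "rose is", "is a"]}
doc_b = {hash(s) for s in ["a rose", "rose is", "is red"]}
print(estimate_jaccard(sketch(doc_a, params), sketch(doc_b, params)))  # ~0.5 (true J = 0.5)
```
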
- Using ==SimHash== to Detect **Near Duplicates** - page 30
	- Documents D1 and D2 are near duplicates iff Hamming-Distance(Simhash(D1), Simhash(D2)) <= k; typically f = 64 and k = 3
- Simhash is useful because if the Simhash bitwise Hamming distance of two phrases is low then their Jaccard coefficient is high
- Simhash - page 32 - 36
- rotate
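
A minimal SimHash sketch with unweighted tokens (real implementations weight each feature, e.g. by tf):

```python
import hashlib

def simhash(tokens, f=64):
    """Sum +1/-1 per bit position over token hashes; keep the sign pattern."""
    v = [0] * f
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming(x, y):
    return bin(x ^ y).count("1")

a = simhash("the quick brown fox jumps over the lazy dog".split())
b = simhash("the quick brown fox jumped over the lazy dog".split())
print(hamming(a, b))  # low distance suggests near duplicates (k = 3 is typical)
```
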
## Week 4
### Videos
- Video: Inverse Document Frequency Weighting, Jurafsky & Manning, Stanford U. (10 min)
- Document frequency
- idf weight
		- $idf_t = \log_{10}(N/df_t)$
- Effect of idf on ranking
	- Collection frequency vs. Document frequency
- Video: TF-IDF Weighting, Jurafsky & Manning, Stanford U. (3 min)
	- tf-idf weights
		- Best-known weighting scheme in information retrieval
		- Increases with the rarity of the term in the collection
- Final ranking of documents for a query
		- $Score(q,d) = \sum_{t \in q \cap d} tf.idf_{t,d}$
- Binary -> count -> weight matrix
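
A small sketch of a single tf-idf weight using the logarithmic tf variant from the lecture:

```python
import math

def tf_idf(tf, df, N):
    """Log-tf times idf: (1 + log10 tf) * log10(N / df)."""
    w_tf = 1 + math.log10(tf) if tf > 0 else 0.0
    return w_tf * math.log10(N / df)

# A term appearing 3 times in the doc, in 1,000 of 1,000,000 documents:
print(tf_idf(tf=3, df=1_000, N=1_000_000))  # (1 + log10 3) * 3 ~ 4.43
```
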
- Video: Calculating TF-IDF Cosine Scores, Jurafsky & Manning, Stanford U. (12 min)
- tf-idf weighting has many variants
- term frequency
- document frequency
- normalization
- weighting may differ in queries vs. documents
- SMART notation: ddd, qqq
- Query: logarithmic tf, idf, cosine normalization
	- tf-idf example: lnc.ltc
- Summary - vector space ranking
		- Represent the query as a weighted tf-idf vector
		- Represent each document as a weighted tf-idf vector
		- Compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g., K=10) to the user
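
A compact sketch of that ranking loop; raw term counts stand in for tf-idf weights to keep it short:

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["new york times", "new york post", "los angeles times"]
vecs = [Counter(d.split()) for d in docs]          # one term-weight vector per doc
query = Counter("new new times".split())
ranked = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
print([(docs[i], round(cosine(query, vecs[i]), 3)) for i in ranked])
```
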
- Video: Word Tokenization, Jurafsky & Manning, Stanford U. (14 min)
	- Text Normalization
		- segmenting/tokenizing words in running text
		- normalizing word formats
		- segmenting sentences in running text
	- How many words?
- Fragments, filled pauses
- Lemma
- Wordform
- Type & Token
		- N = number of tokens
- V = vocabulary = set of types
- Church and Gale (1990): $|V| > O(N^{1/2})$
- terminal
- less shakes.txt | less
- tr -sc 'A-Za-z' '\n' < shakes.txt | less
- tr -sc 'A-Za-z' '\n' < shakes.txt | sort | less
		- tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
		- tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c | sort -n -r | less
- tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less
	- Issues in Tokenization
- Tokenization: language issues
- Word Tokenization in Chinese
- Word Segmentation
	- Standard baseline segmentation algorithm: Maximum Matching
- Max-match segmentation illustration
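
A minimal max-match sketch (the dictionary is illustrative); greedy longest-prefix matching works tolerably for Chinese but fails on English, as the example shows:

```python
def max_match(text, dictionary):
    """Greedy segmentation: repeatedly take the longest dictionary prefix."""
    words = []
    while text:
        for end in range(len(text), 0, -1):           # longest candidate first
            if text[:end] in dictionary or end == 1:  # fall back to a single char
                words.append(text[:end])
                text = text[end:]
                break
    return words

print(max_match("themendinehere", {"the", "them", "theme", "men", "dine", "here"}))
# -> ['theme', 'n', 'dine', 'here']: greedy takes 'theme' and ruins the rest
```
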
### 7 - Introduction to Information Retrieval
- Information retrieval (IR)
- index (a corpus)
	- efficiently retrieve relevant documents
- The Traditional IR System
- History of IR (pages 4-8)
- Areas Related To, But Different Than, Information Retrieval (pages 9-15)
- Basic Information Retrieval Begins with **Keyword Matching**
- simplest relevance
	- synonymous terms
- ambiguous terms
- Intelligent Information Retrieval
- meaning, order, feedback, authority
- A More Detailed IR Architecture (page 17)
- Defining terms (page 19)
- **stopword**
- **stemming**
- **indexing** (inverted index)
- **searching**
- **ranking**
- Retrieval Models (page 20)
- A retrieval model specifies the details of
- Document representation
- Query representation
- Retrieval function
- Notion of relevance
- binary
- continuous
	- major information retrieval models
- boolean models
- vector space models
- probabilistic models
- Common Pre-Processing Step (page 22)
- Boolean model (page 23)
- Boolean retrieval model (page 24)
- problems (page 25)
- problem examples (page 26)
- Vector-space model (page 27)
- Graphic Representation Example (page 28)
- Representing a Collection of Documents (page 29)
- Term Weights: Term Frequency (page 30)
- Term Weights: Inverse Document Frequency (page 31)
- $df_i$
- $idf_i$
- discrimination (page 33)
- Computing TF.IDF -- An Example (page 34)
- Cosine Similarity (page 35)
- Similarity measure (page 36)
- degree of similarity
- Normalized Vectors (page 37)
- Similarity Measure for Document and Query (page 38)
- Limitations of the Inner Product (page 39)
- Cosine Similarity Using Inner Product -- Examples (page 40)
- Cosine Similarity Measure Normalized (page 41)
- Naïve Implementation (page 42)
- Efficient Cosine Ranking (page 43)
- Computing Cosine Scores (page 44)
- Summary of Algorithm for Vector Space Ranking (page 45)
- Summary of Algorithm for Vector Space Ranking (page 46)
- Bottleneck (page 47)
- A Pre-Processing Strategy (page 48)
- pre-compute
- search
- Comments on Vector Space Model (page 49)
- Problems with Vector Space Model (page 50)
### 8 - Lexicon & Text Normalization
- Overview
- the issue of how a search engine creates its database of documents.
- lexicon
- vocabulary
- operations for refining the vocabulary
- stemming
- stop words
- tokenization (parsing)
- capitalization
- case folding (uppercase -> lowercase)
- synonyms
- similar sounding words
- Power Laws and Zipf’s Law (page 3)
- power laws: $y = kx^c$
	- Zipf’s law is a power law with $c = -1$
- Power Law Examples (page 4-5)
- Statistical Properties of Text
	- how are the frequencies of different words distributed?
	- how fast does vocabulary size grow with the size of the document collection?
- Word frequency (page 7)
- common words
- hapax legomena
- Vocabulary growth (page 10)
- Heaps' Law (page 11)
- conclusion (page 13)
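
A quick illustration of Heaps' law, $|V| = K N^\beta$; K = 44 and beta = 0.49 are the values often cited for the Reuters-RCV1 collection:

```python
K, beta = 44.0, 0.49  # illustrative fitted constants; they vary by collection

for n in (10_000, 1_000_000, 100_000_000):
    print(f"N = {n:>11,}   predicted |V| ~ {K * n ** beta:,.0f}")
```
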
- **Lexicon Construction** (page 14)
- definition
- linguistic information
- morphology
- syntactic patterns
- semantic information
- is-a hierarchy
- synonyms
- number in normal form
- dates in normal form
- alternative names
- Determining the Characters in a Document (page 16 - 17)
- unicode, ASCII, UTF-8
- DOC, zip
- XML
- postscript (PS) / PDF
- Tika
- More Complications: Format/Language
- ==Tokenization== (page 19)
- definition
- tokenizer
- lexer
- simple tokenization algorithm
- Token Normalization (page 20)
	- standard
- alternative
- Handling Unusual Specific Tokens (page 21)
- Stop words can be critical to meaning (page 22)
- removal
	- exceptions: phrase searches
- exceptions: sentiment determination
- Capitalization/Case-Folding (page 23)
	- retain capitalization
- Stemming and Lemmatization (page 24)
	- stemming
	- lemmatization
- morphology
- JavaScript Porter Stemmer Online (page 25)
- The Porter Stemming Algorithm (page 26)
- Typical Rules in Porter’s algorithm (page 27)
- Error Metrics (page 28)
- over-stemming: false positive
- under-stemming: false negative
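
A quick Porter stemmer demonstration, assuming NLTK is installed (`pip install nltk`):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "hopping", "operational"]:
    print(word, "->", stemmer.stem(word))
# e.g. relational -> relat, operational -> oper (stems need not be real words)
```
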
- Soundex Algorithm (page 29-33)
- a phonetic algorithm for indexing names by their sound when pronounced in English.
	- 4-character code
- Soundex code (page 31)
- Soundex algorithm (page 32-33)
	- Soundex example (page 34)
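
A sketch of the classic algorithm (one common treatment of the H/W rule; published variants differ slightly):

```python
def soundex(name):
    """Classic 4-character code: first letter + 3 digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:       # vowels/H/W/Y are dropped; repeats collapse
            out += code
        if ch not in "HW":              # H and W do not separate a run of repeats
            prev = code
    return (out + "000")[:4]            # pad with zeros, keep 4 characters

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))  # R163 R163 A261
```
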
- Postscript: Lucene and Solr (page 35)
- Lucene
	- Main Lucene modules (page 36)
- Lucene in a search system
- Lucene Tokenizers Specify How the Text in a Field is to be Indexed (page 38)
- Example (page 39 - 41)
- Solr
## Week 5
### Videos
- Video: Word Similarity and Thesaurus Methods, Jurafsky & Manning (16 min)
- Word similarity
- synonymy
- two words are either synonymous or not
- similarity / distance
			- two words are more similar if they share more features of meaning
- similarity is properly a relation between senses
		- We'll compute similarity over both words and senses
- Why word similarity
	- Word similarity & word relatedness
- Focus on similarity in this video
- similar words
- related words
- Two classes of similarity Algorithms
- Thesaurus-based algorithm
- Q:
- are words nearby in hypernym hierarchy?
				- do words have similar glosses (definitions)?
- Distributional algorithm
- Q: do words have similar distributional context?
- Path based similarity
		- Two concepts are similar if they are near each other in the thesaurus hierarchy
- = have a short path between them
		- a concept has a path of length 1 to itself
- Refinements to path-based similarity
		- $pathlen(c_1, c_2) = 1 +$ number of edges in the shortest path between $c_1$ and $c_2$
- $simpath(c1, c2) = 1 / pathlen(c1, c2)$
- $wordsim(w1, w2)$
- Problems with basic path-based similarity
		- assumes each link represents a uniform distance
- nodes high in the hierarchy are very abstract
		- we instead want a metric that
			- represents the cost of each edge independently
- words connected only through abstract nodes
- are less similar
	- Information Content similarity metrics - 6'02
- P(c):
- the probability that a randomly selected word in a corpus is an instance of concept c
- P(c)
- 1-P(c)
		- P(root) = 1
- Information Content similarity
- Train by counting in a corpus 7'18
- WordNet hierarchy augmented with probabilities P(c) 8'32
- Information content: definitions - 9'14
- information content
- IC(c) = - log P(c)
- Lowest common subsumer
- LCS
- Resnik method - 9'38
		- Resnik: measure common information as
			- $sim_{resnik}(c_1, c_2) = -\log P(LCS(c_1, c_2))$
- Dekang Lin method
		- difference between A and B
- Commonality = $IC(common(A, B))$
- Difference = $IC(description(A, B)) - IC(common(A, B))$
- Lin similarity functions
- Dekang Lin similarity theorem - 11'14
- sim lin (A,B)
- Lin similarity function - 11'41
- The (extended) Lesk Algorithm - 12'59
		- a thesaurus-based measure that looks at **glosses**
- two concepts are similar if their glosses contain similar words
- n-word phrase in both glosses
- summary
- sim path
- sim resnik
- sim jiangconrath
		- sim eLesk
- Libraries
- NLTK
		- WordNet::Similarity
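
A minimal NLTK sketch of the thesaurus-based measures above, assuming the `wordnet` and `wordnet_ic` data have been downloaded:

```python
# Setup: pip install nltk; nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn, wordnet_ic

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")     # P(c) estimated from the Brown corpus

print(dog.path_similarity(cat))              # simpath = 1 / pathlen(c1, c2)
print(dog.res_similarity(cat, brown_ic))     # Resnik: -log P(LCS(c1, c2))
print(dog.lin_similarity(cat, brown_ic))     # Lin: 2 log P(LCS) / (log P(c1) + log P(c2))
```
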
- Evaluating similarity
- Intrinsic evaluation
			- correlation between algorithm scores and human word similarity ratings
- Extrinsic (task-based, end-to-end) evaluation
			- malapropism (spelling error) detection
- WSD
			- essay grading
- TOEFL
- Thesaurus methods
- YouTube videos
	- Unique 11-character ID string
	- a random number in base 64 (see the sketch at the end of this section)
- ~~How Youtube makes money~~
- How Youtube works
- A whole video is chopped into many pieces.
		- Copies are kept at different resolutions and in different formats.
- DNS, IPS.
- TCP/IP.
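
The sketch referenced above: an illustrative generator for an 11-character ID over a base64url alphabet, which gives 64^11 ~ 7.4e19 possible values (this is not YouTube's actual code):

```python
import secrets

# base64url alphabet: 64 symbols, so 64**11 distinct 11-character IDs.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def video_id():
    return "".join(secrets.choice(ALPHABET) for _ in range(11))

print(video_id(), f"{64**11:.2e}")  # e.g. dQw4w9WgXcQ-style string, 7.38e+19
```
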
### 9 - Inverted Indexing
- outline
- Definition of an Inverted index
- Examples of Inverted Indices
- Representing an Inverted Index
- Processing a Query on a Linked Inverted Index
- Skip Pointers to Improve Merging
- Phrase Queries
- biwords
- Grammatical Tagging
- N-Grams
- Distributed Indexing
- Creating an ==Inverted Index== (page 3)
- A **vector** containing all distinct words of the text collection in lexicographical order (vocabulary)
- Terms in the inverted file index may be refined:
- case folding
- stemming/lemmatization
- stop words
- Processing a Query: An Example (pages 4-6)
- grep with a regular expression
- Term-Document Incidence Matrix (page 5)
- Incidence Vectors (page 6)
- 0/1 vector for each term
- Actual Answers to the Query (page 7)
- Term-Document Incident Matrices are Naturally **Sparse** (page 8)
- 0.1% of the elements are 1
- Inverted Index Example (page 9-14)
- Inverted Index Stored In Two Parts (page 10)
- dictionary (index file)
- posting list
- Parsing Documents To Create an Inverted Index (page 11)
- Query Processing Across the Postings List (page 15)
- Basic merge (page 16)
- $O(m+n)$
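
A sketch of the O(m+n) merge (the docIDs echo the textbook's Brutus/Caesar example):

```python
def intersect(p1, p2):
    """O(m+n) merge of two sorted docID postings lists (AND query)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                      # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]
```
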
- Query Optimization (page 18 - 20)
- ==Skip pointers== (page 21)
- The Technique of Skip Pointers (page 22)
- Query Processing With Skip Pointers (page 23)
- Facts on Skip Pointers (page 24)
- Skip pointers are added at indexing time
- argument
	- solution: for a postings list of length P, use ==$\sqrt{P}$== evenly-spaced **skip pointers**
- Phrase Queries (page 26)
- Phrase queries
- two approaches to solving this problem
1. Bi-word indexes (also called 2-grams or 2 shingles)
	2. Positional indexes
- Using Biword Indexes for Phrase Searching
- bi-word / 2-gram
- Handling longer phrase queries (page 29)
- Queries longer than 2 words will have to be broken into bi-word segments
		- e.g. stanford university palo alto
			- stanford university AND university palo AND palo alto
		- might produce false positives
	- **phrase index**
- Part-of-Speech Tagging (page 30)
- embedded stop words
- part-of-speech tagging
- Alternate Solution Using Positional Indexes (page 31)
- **Positional Index**
- Positional Index Example (page 32)
- Processing a Phrase Query
- Algorithm for matching a phrase query
- extract
- merge
- match (find adjacent)
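
A sketch of the extract/merge/match steps over a toy positional index (the index contents are made up):

```python
def phrase_match(index, terms):
    """index: term -> {docID: sorted positions}. Returns docs containing the phrase."""
    postings = [index[t] for t in terms]                 # extract
    docs = set(postings[0]).intersection(*postings[1:])  # merge: docs with all terms
    hits = []
    for doc in docs:                                     # match: consecutive positions
        if any(all(start + k in postings[k][doc] for k in range(1, len(terms)))
               for start in postings[0][doc]):
            hits.append(doc)
    return sorted(hits)

index = {"to": {1: [4, 8], 2: [1, 5]}, "be": {1: [5, 9], 2: [2]},
         "or": {1: [6]}, "not": {1: [7]}}
print(phrase_match(index, ["to", "be", "or", "not", "to", "be"]))  # [1]
```
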
- Algorithm for Proximity Queries with k words (page 34)
- Some High-Frequency Noun Phrases from TREC and Patent Datasets (page 35)
- Google Relies Upon n-gram Indexes (page 36)
- Zipf distribution (power law)
- Google’s N-Gram Database Facts
- Examples of 3-Gram, 4-Gram Data (page 38)
- The Value of N-Grams (page 39)
- deciding which n-grams can be chunked together
- help make next word prediction
- make spelling error corrections
- An N-Gram Example (page 40)
- N-Grams Used To Find Best and Worst Search Queries (page 41)
- Google's N-Gram Viewer (page 42)
- Comparing N-Grams Across Languages
### 10 - Video Search Engines YouTube et al
- Video search engines - quick summary (page 2)
- video search engines
- indexing
- ranking
- Video Search Engines That Crawl for Content
- Video Search Engines That Host
- Video Search Engines That Stream Entertainment
- Some Technologies Supporting Video Content (page 6)
- subtitle
- open caption
	- closed caption
- SDH (Subtitles for the Deaf and Hard of Hearing):
- Speech recognition
- YouTube Background
- YouTube as a search engine
- 2nd largest search engine
- YouTube Traffic - Some Facts
- YouTube Search Engine Issues to Consider (page 10)
- video format
- display platform
- distribute: content distribution network (CDN)
- monetize website: ContentID system
- keep users watching: recommendation system
- YouTube Channels
- YouTube Gathers Information When Videos are Uploaded
- Uploading to YouTube Second Input Screen
- Uploading to YouTube Third Input Screen
- Business Model: Ads, Ads, Ads
- Sample YouTube Search Results for Katy Perry
- Ranking: Ads, Views, Age YouTube Search Results
- YouTube Advanced Search Ranking Filters
- YouTube Ranking Factors (page 18)
- metadata
- video quality
- numbers of views, likes, shares and links
- subtitles and closed captions
- YouTube Upload Characteristics (page 19)
	- 8 video formats for uploading
- MOV, MP4 (MPEG4), AVI, WMV, FLV, 3GP, MPEGPS, WebM
- aspect ratio
- 4:3 or 16:9
	- maximum file size: 128 GB
- 15 minutes long
- short life cycle
- YouTube Videos Run On Multiple Platforms
- YouTube Makes Recommendations to Retain Viewers
- YouTube Recommendation Algorithm
- YouTube Recommendation System Uses Graph Properties
	- association rule mining
- Media Sites (including YouTube) Move Away from False Information
- Content Delivery Networks (page 26)
- Content Delivery Network (CDN)
- Akamai, Limelight and Level 3 Communications
- YouTube Video Delivery System (page 27)
	- identification: an 11-character base-64 string
- efficient delivery: CDN
- YouTube (Google’s) Content Delivery Datacenters (page 28)
- Uploading a YouTube Video
- YouTube's Content Distribution Network: Downloading a YouTube Video
- YouTube Delivery System: References to YouTube's CDN
- Monetizing YouTube (page 34)
- challenge
- **ContentID** (page 35)
- Creating an Acoustic Fingerprint
- How Good is Content ID
- Will YouTube Ever Run Out of Storage
- Total Capacity of YouTube Storage