# Algorithms for Massive Data
## Introduction
The recent explosion of "Big Data" has revolutionized countless fields, from genomics and astronomy to finance and social media. However, this abundance of data presents a significant challenge: how do we extract meaningful insights from datasets that are too large to fit in a computer's main memory? Traditional algorithms, designed for smaller datasets that can be processed in RAM, often become inefficient or even infeasible when confronted with these massive datasets. This necessitates the development of specialized algorithms that can efficiently process, analyze, and store such data.
### The Real-World Challenge
This challenge is not merely academic; it has real-world implications. Consider the following:
* **Scientific Discovery:** Analyzing massive datasets from telescopes, particle accelerators, and genome sequencing projects can lead to breakthroughs in our understanding of the universe and ourselves.
* **Business Intelligence:** Extracting patterns from customer transaction data, social media trends, and market fluctuations can give businesses a competitive edge.
* **Personalized Medicine:** Tailoring medical treatments to individual patients requires analyzing vast amounts of genomic and clinical data.
These are just a few examples where the ability to efficiently process massive data is crucial.
### Algorithmic Techniques
In these notes, we will explore algorithmic techniques specifically designed for handling massive datasets. We will focus on two key paradigms:
* **Streaming Model:** Data arrives sequentially, like a continuous stream from sensors, social media feeds, or financial markets. We need algorithms that can process this data in real-time with limited memory.
* **Matrix Algorithms using Sampling:** Many massive datasets can be represented as matrices. We will explore techniques that utilize clever sampling of rows and columns to efficiently analyze and compress these matrices.
### Goals
* Approximate frequency moments of data streams.
* Identify frequent items in data streams.
* Sketch massive matrices for efficient storage and computation.
## Frequency Moments of Data Streams
In this section, we delve into the concept of frequency moments, a crucial tool for analyzing data streams. Imagine a continuous flow of data—like a sensor feeding temperature readings, a social media platform generating posts, or a stock ticker streaming price updates. This is a data stream, and we need ways to understand its properties without storing the entire stream.
### What are Frequency Moments?
Let's formalize this. Consider a data stream $a_1, a_2, \dots, a_n$ of length $n$, where each symbol $a_i$ belongs to an alphabet of $m$ possible symbols, denoted as $\{1,2,...,m\}$.

The frequency $f_s$ of a symbol $s$ represents the number of times it appears in the stream. For any non-negative integer $p$, the $p$-th frequency moment of the stream is defined as:
$$
\sum_{s=1}^m f_s^p
$$
This seemingly simple formula holds valuable information about the stream. Let's look at some special cases:
* $p=0$: Corresponds to the number of distinct symbols in the stream (using the convention $0^0=0$). Think of identifying the unique visitors to a website or the different species in a biological sample.
* $p=1$: Simply equals the length of the stream, $n$.
* $p=2$: Useful for computing the variance of the stream, which measures how spread out the data is. A high variance indicates a large spread in frequencies.
* As $p$ approaches infinity: $(\sum_{s=1}^m f_s^p)^{\frac1p}$ gives the frequency of the most frequent element(s). This helps identify dominant elements, like the best-selling product or the most popular hashtag.
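To make these definitions concrete, here is a small Python snippet (illustrative only; the toy stream is arbitrary) that computes the moments exactly by storing all frequencies, something the streaming algorithms in the rest of this section deliberately avoid:

```python
from collections import Counter

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # a toy stream over the alphabet {1, ..., 9}
freq = Counter(stream)                        # exact frequencies f_s (feasible only for small streams)

f0 = sum(1 for f in freq.values() if f > 0)   # number of distinct symbols (p = 0, with 0^0 = 0)
f1 = sum(freq.values())                       # length of the stream n (p = 1)
f2 = sum(f ** 2 for f in freq.values())       # second frequency moment (p = 2)
f_inf = max(freq.values())                    # frequency of the most frequent symbol (p -> infinity)

print(f0, f1, f2, f_inf)                      # 7 11 21 3
```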
#### Why are frequency moments useful?
Frequency moments have diverse applications across various domains:
* **Network Traffic Analysis:** Identifying heavy bandwidth users by analyzing the frequency of IP addresses.
* **Sales Data Analysis:** Determining best-selling items from purchase records.
* **Database Management:** Estimating the number of unique visitors or customers.
* **Networking:** Calculating the variance of network log data for source-destination pairs to understand traffic patterns.
In the following subsections, we'll explore specific algorithms for estimating these frequency moments efficiently, even when the data stream is massive and cannot be stored in its entirety. We'll start by tackling the problem of counting distinct elements.
### Number of Distinct Elements in a Data Stream
Imagine you're monitoring a high-speed network, and you want to count the number of unique IP addresses that access a server. Or perhaps you're analyzing a massive dataset of customer transactions and need to determine how many distinct customers made purchases. These scenarios involve counting the distinct elements in a data stream, a task that becomes challenging when the stream is too large to store in memory.
Formally, consider a sequence $a_1, a_2, \dots, a_n$ of $n$ elements, each an integer in the range $1$ to $m$, where both $n$ and $m$ can be very large. Our goal is to determine the number of distinct $a_i$ in the sequence using limited memory.
#### The Challenge of Limited Memory
Ideally, we want an algorithm that uses space logarithmic in both $m$ and $n$. While we could easily count distinct elements using $O(m)$ space (with a bit-vector) or $O(n \log m)$ space (by storing a list of distinct elements), these approaches become impractical for massive data streams.
#### Why Exact Deterministic Algorithms Won't Work
Unfortunately, it's impossible to determine the exact number of distinct elements using a deterministic algorithm with space logarithmic in $m$ and $n$. Here's why:
Feed the algorithm the elements of a subset of $\{1,2,\dots,m\}$, one at a time. A deterministic algorithm using fewer than $m$ bits of memory has fewer than $2^m$ possible memory states, so it must end up in the same state for two different subsets $S_1, S_2 \subseteq \{1,2,\dots,m\}$. Now consider appending one more element, giving sequences of length at most $m+1$:
* **Different-Sized Subsets:** If these subsets have different numbers of elements, the algorithm will be incorrect for at least one of them.
* **Same-Sized Subsets:** If the subsets have the same size but are different, append an element that belongs to one subset and not the other. The true numbers of distinct elements now differ, yet the algorithm is in the same memory state on both inputs and must produce the same answer, so it is incorrect for at least one of them.
Therefore, any exact deterministic algorithm must use at least $m$ bits of memory on some input sequence of length $m+1$.
#### Embracing Approximation with Randomization
To overcome this limitation, we turn to approximation and randomization. Instead of aiming for an exact count, we'll design an algorithm that provides a good estimate of the number of distinct elements with high probability.
##### Intuition: The Minimum Hash Value
Imagine the set $S$ of distinct elements was chosen uniformly at random from $\{1,\dots,m \}$. Let min be the minimum element in $S$.
* With one distinct element, $\mathbb{E}(\min)$ is roughly $\frac{m}{2}$.
* With two distinct elements, $\mathbb{E}(\min)$ is roughly $\frac{m}{3}$.
* In general, $\mathbb{E}(\min) \approx \frac{m}{|S|+1}$
Solving $\min = \frac{m}{|S|+1}$ for $|S|$ gives us:
$$
|S| \approx \frac{m}{\min} - 1
$$
This suggests that tracking the minimum element can help us estimate the number of distinct elements, $|S|$.

#### Introducing Hashing for Practicality
In real-world data, the set $S$ might not be uniformly random. To make our approach robust, we introduce hashing. Instead of tracking the minimum element directly, we use a hash function $h$ to map elements to a smaller range:
$$
h: \{1,2,\dots,m\} \rightarrow \{0,1,\dots,M-1\}
$$
We then track the minimum hash value $h(a_i)$. This effectively "randomizes" the input, making our estimation more reliable.
#### Ensuring Randomness with 2-Universal Hash Functions
To guarantee the effectiveness of our approach, we need a hash function with specific properties. A hash function drawn at random from a 2-universal (pairwise independent) family guarantees that the hash values of any two distinct elements are uniformly distributed and independent of each other. This randomness is crucial for our estimation.
##### Example
Let $M$ be a prime number greater than $m$. The family of hash functions $h_{ab}(x) = ax+b \ (\text{mod } M)$, where $a$ and $b$ range over the integers in $[0,M-1]$, is 2-universal.
<font color=red>Bonus explanation</font>
#### Analysis and Beyond
By combining the minimum hash value technique with 2-universal hashing, we can achieve a good approximation of the number of distinct elements in a data stream using limited memory.
Let $b_1, b_2, \dots, b_d$ be the distinct values in the input. We select a hash function $h$ from our 2-universal family $H$. The set $S=\{h(b_1), h(b_2), \dots, h(b_d)\}$ then consists of $d$ random and pairwise independent values from $\{0,1,2, \dots,M−1\}$.
**Lemma** With probability at least $\frac23 - \frac{d}{M}$, the estimate $\frac{M}{\min}$ satisfies $\frac{d}{6} \leq \frac{M}{\min} \leq 6d$, where $\min=\min(S)$ is the minimum value in the hashed set. Equivalently, $\frac{M}{6d} \leq \min \leq \frac{6M}{d}$.
This method provides a good approximation for the number of distinct elements in the data stream using limited memory.
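As an illustration, here is a minimal Python sketch of the estimator, drawing $(a,b)$ from the 2-universal family above and returning $\frac{M}{\min}$. The prime $M$ and the synthetic stream are arbitrary choices, and a single hash function only guarantees a constant-factor estimate with constant probability, as the lemma above states.

```python
import random

M = 2_147_483_647                       # a prime larger than the alphabet size m

def estimate_distinct(stream):
    """Estimate the number of distinct elements of `stream` from the minimum hash value."""
    a = random.randrange(0, M)          # (a, b) drawn uniformly: the hash is h(x) = (a*x + b) mod M
    b = random.randrange(0, M)
    min_hash = M                        # larger than any possible hash value
    for x in stream:                    # one pass; only a, b and the current minimum are stored
        min_hash = min(min_hash, (a * x + b) % M)
    return M / max(min_hash, 1)         # the estimate M / min (guarding against a zero minimum)

random.seed(3)
stream = [random.randrange(1, 1001) for _ in range(100_000)]   # ~1000 distinct values, many repeats
print(len(set(stream)), round(estimate_distinct(stream)))
```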
<font color=red>Bonus proof</font>
### Counting the Occurrences of a Given Element
In this section, we shift our focus from counting distinct elements to counting the occurrences of a specific element within a data stream. While this might seem like a simpler task (and it is, with enough memory!), we'll explore how to approximate the count using minimal space.
#### The Problem
Imagine you're tracking user activity on a website, and you want to count how many times a particular user has logged in. Or consider analyzing a stream of sensor readings and counting the occurrences of a specific value. These scenarios involve counting the occurrences of a given element in a data stream.
For simplicity, let's consider a stream of 0's and 1's of length $n$, where we want to count the occurrences of 1's. Let $m$ be the actual number of 1's in the stream.
#### Exact Counting vs. Approximation
We could easily count the 1's exactly using $\log n$ bits of space (a counter). However, our goal is to use significantly less space - only $\log \log n$ bits. To achieve this, we'll employ an approximation technique.
#### The Algorithm
1. **Initialization:** Maintain a value $k$ such that $2^k$ approximates $m$. Start with $k=0$.
2. **Update:** For each occurrence of '1' in the stream, increment $k$ by one with probability $\frac{1}{2^k}$.
3. **Estimation:** At the end of the stream, the estimate for $m$ is $2^k-1$.
#### Implementation Trick
To simulate a coin flip with probability $\frac{1}{2^k}$, we can flip a fair coin $k$ times and increment $k$ only if all flips result in heads.
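A minimal Python sketch of this approximate counter follows; instead of flipping $k$ separate coins it draws a single uniform random number, which yields the same probability $\frac{1}{2^k}$.

```python
import random

def approximate_count(stream):
    """Approximately count the 1's in a 0/1 stream, storing only the value k."""
    k = 0
    for bit in stream:
        if bit == 1 and random.random() < 2.0 ** (-k):   # increment k with probability 1/2^k
            k += 1
    return 2 ** k - 1                                     # the estimate of the number of 1's

random.seed(1)
stream = [1] * 100_000 + [0] * 5_000
print(approximate_count(stream))   # an unbiased but noisy estimate of 100000; averaging
                                   # several independent counters reduces the error
```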
#### Intuition
The key idea is that given $k$, it takes on average $2^k$ occurrences of '1' before $k$ is incremented. Thus, the expected number of 1's to reach the current value of $k$ is:
$$
1 + 2 + 4 + \dots + 2^{k-1} = 2^k-1
$$
#### Space Efficiency
This algorithm requires only $\log \log n$ bits to store $k$, providing a significant space reduction compared to the exact counting method. This makes it suitable for scenarios where memory is highly constrained.
#### Trade-offs and Applications
While this algorithm uses minimal space, it provides an approximation of the count. This trade-off is often acceptable in applications where an exact count is not critical, and a good estimate suffices. Examples include:
* **Monitoring user activity:** Estimating the number of times a user performs a specific action.
* **Analyzing sensor data:** Approximating the frequency of a particular sensor reading.
* **Tracking events in a distributed system:** Estimating the number of occurrences of a specific event across multiple nodes.
This technique demonstrates the power of approximation and randomization in designing space-efficient algorithms for massive data streams.
### Finding Frequent Elements
In the vast ocean of data streams, some elements appear more often than others. Identifying these frequent items is crucial for various applications, from trend analysis in social media to anomaly detection in network traffic. In this section, we'll explore algorithms that can pinpoint frequent elements in a data stream without storing the entire stream in memory.
#### The Majority Algorithm
Let's start with a specific case: finding the majority element. Given a stream of integers $a_1, a_2, \dots, a_n$, where each $a_i$ belongs to $\{1,2,\dots,m\}$, the majority element is an element that appears more than $\frac{n}{2}$ times. The Majority Algorithm elegantly solves this problem with minimal memory:
1. Store $a_1$ and initialize a counter to one.
2. For each subsequent $a_i$:
* If $a_i$ is the same as the stored element, increment the counter.
* If $a_i$ is different, decrement the counter (if non-zero). If the counter becomes zero, store $a_i$ and set the counter to one.
##### Why does this work?
If a majority element exists, it will be the stored element at the end. Every cancellation pairs an occurrence of the majority element with an occurrence of a different element, or involves no occurrence of the majority element at all. Since fewer than $\frac{n}{2}$ occurrences belong to other elements, the more than $\frac{n}{2}$ occurrences of the majority element cannot all be cancelled, so it survives as the stored element.
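Here is a minimal Python sketch of the Majority Algorithm as described above. When no majority element exists, the stored element carries no guarantee, so a second verification pass would be needed if that case matters.

```python
def majority_candidate(stream):
    """Return the only element that can possibly be a majority element of `stream`."""
    it = iter(stream)
    stored = next(it)               # store a_1
    counter = 1
    for a in it:
        if a == stored:             # same as the stored element: increment
            counter += 1
        else:
            counter -= 1            # different element: cancel one occurrence of each
            if counter == 0:        # stored element fully cancelled: switch to the new element
                stored, counter = a, 1
    return stored

print(majority_candidate([2, 1, 2, 3, 2, 2, 4, 2, 2]))   # 2, which appears 6 times out of 9
```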
#### Beyond the Majority: The Frequent Algorithm
The Majority Algorithm provides a foundation for identifying elements with frequencies above a certain threshold. The Frequent Algorithm extends this idea to approximate the frequency $f_s$ of each element $s \in \{1,2,\dots,m\}$ within an additive term of $\frac{n}{k+1}$, where $k$ is a parameter.
##### The Algorithm
1. Maintain a list of items being counted (initially empty).
2. For each item:
* If it's on the list, increment its counter.
* If it's not on the list and the list has less than $k$ items, add it with a counter set to one.
* If the list already has $k$ items, decrement each counter by one. Delete any element with a count of zero.
**Theorem** At the end of the Frequent Algorithm, for each $s \in \{1,2,\dots,m\}$, its counter $\tilde{f}_s$ satisfies $\tilde{f}_s \in [f_s - \frac{n}{k+1}, f_s]$ (with $\tilde{f}_s = 0$ for items not on the list).
<font color=red>Bonus proof</font>
##### Analysis
This algorithm uses $O(k \log n + k \log m)$ space and provides approximate frequency counts for all elements in the stream. The parameter $k$ controls the trade-off between accuracy and space usage.
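A minimal Python sketch of the Frequent Algorithm, with a dictionary playing the role of the list of counters (the toy stream is arbitrary):

```python
def frequent(stream, k):
    """Approximate the frequencies of stream items while keeping at most k counters."""
    counters = {}
    for a in stream:
        if a in counters:                  # item already on the list: increment its counter
            counters[a] += 1
        elif len(counters) < k:            # room on the list: add the item with count one
            counters[a] = 1
        else:                              # list full: decrement every counter, dropping zeros
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = [1] * 50 + [2] * 30 + [3] * 15 + list(range(4, 9))   # n = 100
print(frequent(stream, k=3))   # counters undercount each f_s by at most n/(k+1) = 25
```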
#### Applications
Identifying frequent elements has numerous applications:
* **Trend analysis:** Identifying trending topics on social media or popular products in sales data.
* **Anomaly detection:** Detecting unusual activity in network traffic or fraudulent transactions.
* **Database management:** Finding frequently accessed data items for caching or optimization.
These algorithms provide efficient and scalable solutions for finding frequent elements in massive data streams, enabling us to extract valuable insights from the data without excessive memory usage.
### Computing the Second Moment
We've explored how to count distinct elements and specific elements in a data stream. Now, let's delve into a more complex problem: computing the second moment.
Recall that the second moment of a data stream with symbols from $\{1,2,\dots,m\}$ is given by $\sum_{s=1}^m f_s^2$, where $f_s$ is the frequency of symbol $s$. This value provides valuable insights into the data distribution, particularly its variance.
#### The Challenge
Computing the second moment exactly requires storing the frequency of each symbol, which can be impractical for massive data streams. Our goal is to estimate the second moment efficiently using limited memory.
#### The Algorithm
1. **Hash Function:** For each symbol $s$, independently set a random variable $x_s$ to $\pm 1$ with probability $\frac12$. This can be achieved using a random hash function $h(s)$ with range $\{−1,1\}$.
2. **Sum Maintenance:** Maintain a sum by adding $x_s$ to it each time the symbol $s$ appears in the stream.
3. **Estimator:** At the end of the stream, the sum will equal $\sum_{s=1}^m x_s f_s$. The square of this sum, $a = (\sum_{s=1}^m x_s f_s)^2$, is an unbiased estimator of $\sum_{s=1}^m f_s^2$.
#### Why does this work?
* The expected value of the sum is zero, since each $\mathbb{E}(x_s) = 0$: $\mathbb{E}\left(\sum_{s=1}^m x_s f_s\right) = 0$.
* The expected value of the square of the sum is the second moment: $$\mathbb{E}\left[\left(\sum_{s=1}^m x_s f_s\right)^2\right] = \mathbb{E}\left(\sum_{s=1}^m x^2_s f^2_s\right) + 2\,\mathbb{E}\left(\sum_{s < t} x_s x_t f_s f_t \right) = \sum_{s=1}^m f_s^2,$$ since $x_s^2 = 1$ and $\mathbb{E}(x_s x_t) = \mathbb{E}(x_s)\,\mathbb{E}(x_t) = 0$ for $s \neq t$ by pairwise independence.
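A minimal Python sketch of this estimator follows. For readability it uses fully independent random signs, one per symbol, rather than the space-efficient 4-wise independent construction discussed below, so it illustrates the estimator but not the memory savings.

```python
import random
from collections import Counter

def second_moment_estimate(stream, m):
    """One estimate of sum_s f_s^2 for a stream over the alphabet {1, ..., m}."""
    x = {s: random.choice((-1, 1)) for s in range(1, m + 1)}   # fully independent signs (O(m) space)
    total = 0
    for a in stream:
        total += x[a]                                          # maintain the sum of x_s over the stream
    return total ** 2                                          # an unbiased estimator of sum_s f_s^2

random.seed(0)
stream = [random.randrange(1, 11) for _ in range(10_000)]
exact = sum(f ** 2 for f in Counter(stream).values())
estimates = [second_moment_estimate(stream, m=10) for _ in range(200)]
print(exact, sum(estimates) / len(estimates))   # the average of many estimates is close to the exact value
```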
#### Improving Accuracy
To get a tighter guarantee, we need to bound the variance of $a$, which requires the random variables $x_s$ to be 4-wise independent rather than merely pairwise independent. By repeating the process with several independent hash functions and averaging the results, we can obtain high accuracy with high probability.
**Theorem** The average $x$ of $r=\frac{2}{\epsilon^2\delta}$ estimates $a_1, \dots, a_r$, each computed with an independent set of 4-wise independent random variables, satisfies
$$
\text{Prob}\left(|x - \mathbb{E}(x)| > \epsilon\, \mathbb{E}(x) \right) \leq \frac{\text{Var}(x)}{\epsilon^2\, \mathbb{E}^2(x)} \leq \delta.
$$
The first inequality is Chebyshev's; the second follows because 4-wise independence gives $\text{Var}(a_i) \leq 2\,\mathbb{E}(a_i)^2$, so $\text{Var}(x) \leq \frac{2}{r}\,\mathbb{E}^2(x)$.
<font color=red>Bonus proof</font>
#### Implementation
* 4-wise independent random variables can be generated using only $O(\log m)$ space, for example from random polynomials of degree 3 over a finite field.
* The algorithm can be implemented with low space by storing only the "seeds" (coefficients of the polynomials) and computing the $x_s$ values on the fly.
<font color=red>Bonus explanation</font>
#### Applications
Estimating the second moment has applications in:
* **Network monitoring:** Analyzing network traffic to identify unusual patterns or anomalies.
* **Database optimization:** Understanding the distribution of data access to optimize query performance.
* **Data mining:** Discovering patterns and relationships in large datasets.
## Matrix Algorithms Using Sampling
Massive datasets often take the form of matrices, capturing relationships between different entities. Think of user-movie ratings on Netflix, word frequencies in documents, or connections in a social network. These matrices can be enormous, exceeding the capacity of RAM and making traditional matrix operations computationally expensive.
In this section, we explore techniques that leverage sampling to efficiently analyze and compress these massive matrices. Instead of examining every single entry, we strategically sample rows and columns to create smaller, manageable representations that still capture the essence of the original matrix.
### Why Sampling?
* **Memory Efficiency:** Dealing with matrices that are too large to fit in memory.
* **Computational Speedup:** Performing approximate computations in less time.
* **Dimensionality Reduction:** Finding a lower-dimensional representation that preserves essential information.
### Matrix Multiplication Using Sampling
Let's revisit the fascinating world of matrix multiplication, but this time with a twist. How can we efficiently multiply massive matrices that exceed our computer's memory capacity? The answer lies in clever sampling techniques.
#### The Challenge of Massive Matrices
Imagine multiplying two matrices, $A$ (of size $m \times n$) and $B$ (of size $n \times p$). Traditional matrix multiplication requires computing the dot product of each row of $A$ with each column of $B$, resulting in $m \times p$ entries in the output matrix. This becomes computationally expensive when $m$, $n$, and $p$ are large.
#### Sampling to the Rescue
Instead of computing all those dot products, what if we strategically sample a subset of columns from $A$ and the corresponding rows from $B$? This is the core idea behind matrix multiplication using sampling.
#### The Procedure: A Step-by-Step Guide
1. **Column Sampling from A:**
* We don't just sample uniformly. Instead, we employ length squared sampling, where the probability of choosing a column from $A$ is proportional to its squared length (sum of squares of its entries). This prioritizes influential columns that contribute more to the final product.
* Let's say we sample $s$ columns. We form a new matrix $C$ (of size $m \times s$) using these sampled columns.
* Crucially, we scale each sampled column of $C$ by $1/\sqrt{s p_k}$, where $p_k$ is the probability of sampling that column. This scaling ensures that the expected value of $CC^T$ equals $AA^T$, a property that's vital for the accuracy of our approximation.
2. **Row Sampling from $B$:**
* Now, we turn our attention to $B$. We select the rows from $B$ that correspond to the columns we sampled from $A$. This gives us a matrix $R$ (of size $s \times p$).
* Similar to the column sampling, we scale each row of $R$ by $1/\sqrt{s p_k}$. This ensures that $\mathbb{E}(R^TR)=B^T B$.
3. **Approximating the Product:**
* Finally, we multiply our sampled and scaled matrices, $C$ and $R$. The resulting matrix $CR$ serves as an approximation of the actual product $AB$.
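Here is a minimal NumPy sketch of the procedure above; the low-rank test matrix and the choice $B = A^T$ are arbitrary illustrations.

```python
import numpy as np

def sampled_matmul(A, B, s, rng=None):
    """Approximate A @ B from s length squared sampled columns of A and the matching rows of B."""
    rng = np.random.default_rng() if rng is None else rng
    col_norms = np.sum(A ** 2, axis=0)             # squared length of each column of A
    p = col_norms / col_norms.sum()                # length squared sampling probabilities p_k
    idx = rng.choice(A.shape[1], size=s, p=p)      # sample s column indices, with replacement
    scale = 1.0 / np.sqrt(s * p[idx])              # the 1/sqrt(s p_k) scaling, so that E[C @ R] = A @ B
    C = A[:, idx] * scale                          # m x s matrix of scaled sampled columns
    R = B[idx, :] * scale[:, None]                 # s x p matrix of correspondingly scaled rows
    return C @ R

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 2000))   # a matrix with low effective rank
B = A.T
approx = sampled_matmul(A, B, s=200, rng=rng)
print(np.linalg.norm(A @ B - approx) / np.linalg.norm(A @ B))        # relative Frobenius error; shrinks as s grows
```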

#### Why Does This Work?
Length squared sampling is key. By prioritizing columns with larger lengths, we capture the most important directions of variance in the data, leading to a better approximation of the true product.
<font color=red>Bonus explanation</font>
#### How Good is the Approximation?
The expected error of this approximation is bounded by:
**Theorem**
$$
\mathbb{E}\left(||AB - CR||^2_F \right) \leq \frac{||A||_F^2 ||B||_F^2}{s}
$$
Thus, to ensure $\mathbb{E}\left(||AB - CR||^2_F \right) \leq \epsilon^2 ||A||_F^2 ||B||_F^2$, it suffices to take $s \geq 1/\epsilon^2$. In particular, if $\epsilon \in \Omega(1)$, then $s \in O(1)$ and the product $CR$ can be computed in time $O(mp)$.
This means that the expected error decreases as we increase the number of samples $s$.
#### When is This Effective?
This technique shines when the matrix $A$ has a low effective rank, meaning that a small number of singular values capture most of its variance. In such cases, we can achieve a good approximation with a relatively small number of samples.
#### Beyond the Theory
This sampling-based approach not only provides a theoretical framework but also offers practical benefits. It allows us to handle massive matrices that wouldn't fit in memory, and it speeds up computations significantly. This has implications for various applications, such as:
* **Recommendation systems:** Efficiently multiplying user-item matrices to generate recommendations.
* **Natural language processing:** Analyzing large document-term matrices for topic modeling or document similarity.
* **Machine learning:** Performing computations on large datasets for training machine learning models.
Matrix multiplication using sampling demonstrates the power of randomized algorithms in tackling large-scale problems, providing a balance between efficiency and accuracy.
### Implementing Length Squared Sampling
We've seen the power of length squared sampling for matrix operations. But how do we actually implement it when dealing with massive matrices that exceed our computer's memory? This is where the "two-pass" approach comes into play.
#### The Challenge of External Memory
Imagine your matrix is so large that it's stored on an external disk rather than in RAM. Accessing individual entries randomly becomes expensive. We need a way to perform length squared sampling efficiently, even with these constraints.
#### The Two-Pass Algorithm
The solution lies in making two passes through the data:
1. **Pass 1: Computing Column Lengths**
* We sequentially read the matrix entries from the external storage.
* For each column, we maintain a running sum of squares of its entries. This gives us the squared length of each column.
* These squared lengths are small enough to be stored in RAM.
2. **Pass 2: Sampling Columns**
* Now that we have the squared lengths in RAM, we can use a random number generator to determine which columns to sample, based on their length squared probabilities.
* We make a second pass through the matrix in external storage, this time only reading and storing the selected columns.
##### Why Two Passes?
The two-pass approach allows us to perform length squared sampling even when random access to the matrix is expensive. The first pass gathers the necessary information (squared lengths) in a compact form, and the second pass efficiently retrieves the sampled columns.
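A minimal Python sketch of the two-pass approach follows. The `read_rows` argument is a hypothetical stand-in for whatever sequential, row-by-row access the external storage provides; here it simply re-reads an in-memory array.

```python
import numpy as np

def two_pass_column_sample(read_rows, n_cols, s, rng=None):
    """Length squared sampling of s columns using two sequential passes over the rows."""
    rng = np.random.default_rng() if rng is None else rng

    # Pass 1: accumulate the squared length of every column (only n_cols numbers are kept in RAM).
    sq_lengths = np.zeros(n_cols)
    for row in read_rows():
        sq_lengths += np.asarray(row, dtype=float) ** 2

    p = sq_lengths / sq_lengths.sum()
    idx = rng.choice(n_cols, size=s, p=p)          # decide which columns to keep

    # Pass 2: re-read the matrix, keeping only the entries of the selected columns.
    sampled_rows = [np.asarray(row, dtype=float)[idx] for row in read_rows()]
    return np.array(sampled_rows), idx, p          # the sampled (unscaled) columns, their indices and probabilities

A = np.arange(12, dtype=float).reshape(3, 4)       # a toy stand-in for an external matrix
C, idx, p = two_pass_column_sample(lambda: iter(A), n_cols=4, s=2)
print(idx, C, sep="\n")
```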
#### One-Pass Algorithm for Column-Ordered Matrices
If the matrix is stored in column-major order (where entries of each column are stored contiguously), we can optimize further with a one-pass algorithm:
1. **Initialization:**
* Maintain a sum $s$ of the squared lengths of the columns seen so far.
* Maintain an index $i$ of a selected column.
2. **Update:**
* For each new column $j$:
* Update the sum $s$ by adding the squared length of column $j$.
* With probability $\frac{|A(:,j)|^2}{s}$ (using the just-updated sum $s$), change the selected index to $j$.

At the end of the pass, each column has been selected with probability proportional to its squared length; running independent copies of this procedure yields multiple sampled columns, as in the sketch below.
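The following minimal Python code illustrates this; each call selects one column index, and independent calls give independent length squared samples.

```python
import random

def one_pass_column_sample(columns):
    """Select one column index with probability proportional to its squared length, in a single pass."""
    total, selected = 0.0, None
    for j, col in enumerate(columns):
        sq_len = sum(v * v for v in col)           # squared length of column j
        total += sq_len                            # running sum s of squared lengths
        if random.random() < sq_len / total:       # replace the selection with probability |A(:,j)|^2 / s
            selected = j
    return selected

# Columns of a small matrix in column-major order; column 2 is by far the "heaviest".
cols = [[1, 0], [0, 2], [10, 10], [1, 1]]          # squared lengths 1, 4, 200, 2
random.seed(0)
picks = [one_pass_column_sample(cols) for _ in range(10_000)]
print([picks.count(j) / len(picks) for j in range(4)])   # close to [1, 4, 200, 2] / 207
```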
#### Benefits and Applications
These algorithms demonstrate how to efficiently implement length squared sampling even for massive matrices stored in external memory. This has significant implications for various applications:
* **Large-scale machine learning:** Handling massive datasets for training machine learning models.
* **Recommendation systems:** Analyzing user-item matrices that are too large to fit in RAM.
* **Bioinformatics:** Processing large genomic datasets for analysis and research.
By enabling efficient sampling, these techniques pave the way for analyzing and extracting insights from truly massive datasets, pushing the boundaries of what's computationally possible.
### Sketching a Large Matrix
Imagine trying to create a miniature portrait that captures the likeness of a person. You wouldn't try to recreate every detail, but rather focus on the essential features. Similarly, in the realm of massive matrices, we can create "sketches" that capture the essence of the original matrix in a much smaller form.
#### Why Sketch?
* **Compact Representation:** Storing a massive matrix in a compressed form, saving valuable memory.
* **Efficient Computation:** Performing approximate computations using the sketch instead of the full matrix, saving time and resources.
* **Data Understanding:** Gaining insights into the matrix's structure and key features.
#### The Art of Sketching a Matrix
Given an $m \times n$ matrix $A$, we can create a sketch using a combination of sampling and clever reconstruction:
1. **Column Sampling:**
* Just as we did for matrix multiplication, we sample $s$ columns of $A$ using length squared sampling, prioritizing influential columns.
* We form an $m \times s$ matrix $C$ with these sampled columns, scaling each column appropriately.
2. **Row Sampling:**
* Similarly, we sample $r$ rows of $A$ using length squared sampling, creating an $r \times n$ matrix $R$.
3. **Finding the Missing Link $U$:**
* Now, the magic happens. We find an $s \times r$ matrix $U$ such that the product $CUR$ approximates the original matrix $A$.

#### The Intuition
The sketch $CUR$ is like a miniature portrait of the original matrix $A$. It uses a subset of the actual rows and columns to represent the matrix, much like an artist might use a few key strokes to capture the essence of a subject.
#### How to Find $U$
Finding the right $U$ is crucial. It involves constructing an identity-like matrix that acts as the identity on the row space of $R$. This allows us to approximate $A$ and then use the sampled columns and rows to obtain $CUR$.
<font color=red>Bonus explanation and code implementation</font>
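As a hedged illustration, the NumPy sketch below uses one common CUR variant in which $U$ is the Moore–Penrose pseudoinverse of the intersection submatrix $W$ (the entries of $A$ at the sampled rows and columns). This is not necessarily the exact construction analyzed in the reference, but it follows the same $C$, $U$, $R$ pattern.

```python
import numpy as np

def length_squared_indices(sq_lengths, count, rng):
    """Sample `count` indices with probability proportional to the given squared lengths."""
    p = sq_lengths / sq_lengths.sum()
    return rng.choice(len(p), size=count, p=p)

def cur_sketch(A, s, r, rng=None):
    """Sketch A as C (sampled columns), R (sampled rows) and U = pinv(W), so that A ~ C @ U @ R."""
    rng = np.random.default_rng() if rng is None else rng
    col_idx = length_squared_indices(np.sum(A ** 2, axis=0), s, rng)   # length squared column sample
    row_idx = length_squared_indices(np.sum(A ** 2, axis=1), r, rng)   # length squared row sample
    C = A[:, col_idx]                        # m x s
    R = A[row_idx, :]                        # r x n
    W = A[np.ix_(row_idx, col_idx)]          # r x s intersection of the sampled rows and columns
    U = np.linalg.pinv(W)                    # s x r "link" matrix
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 400))   # a matrix with low effective rank
C, U, R = cur_sketch(A, s=60, r=40, rng=rng)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))   # essentially zero here, since A has exact low rank
```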
#### Error Bound
The accuracy of our sketch is measured by the following error bound:
**Theorem**
$$
\mathbb{E}(||A−CUR||^2_2) \leq ||A||^2_F \left(\frac2{\sqrt{r}} + \frac{2r}{s} \right)
$$
<font color=red>Bonus proof</font>
This bound tells us that the sketch improves as we sample more rows and columns: the error decreases as $r$ grows, provided $s$ grows sufficiently faster than $r$ so that the $\frac{2r}{s}$ term also becomes small.
#### When is Sketching Effective?
Sketching is particularly effective when the matrix $A$ has a low effective rank, meaning a small number of dominant singular values. This is often the case in real-world datasets, where a few underlying factors explain most of the variation.
#### Applications of Matrix Sketching
* **Document analysis:** Representing large document-term matrices to efficiently estimate similarities between documents.
* **Recommendation systems:** Creating compact representations of user-item matrices to speed up recommendations.
* **Image processing:** Compressing images while preserving essential features.
* **Bioinformatics:** Analyzing gene expression data to identify patterns and relationships.
Matrix sketching provides a powerful tool for handling massive matrices, enabling efficient storage, computation, and analysis. It's a testament to the ingenuity of algorithms in extracting valuable information from even the largest datasets.
## Sketches of Documents
Imagine having to compare thousands of documents to find near-duplicates or to cluster them based on similarity. Comparing each document word-by-word would be incredibly time-consuming. Instead, we can use "sketches" – compact representations that capture the essence of a document.
### Why Sketch Documents?
* **Efficient Storage:** Storing a massive corpus of documents in a compressed, space-efficient manner.
* **Fast Similarity Estimation:** Estimating the similarity between documents using their sketches, rather than comparing the full text.
* **Scalability:** Handling a massive number of documents with ease.
### Creating Document Sketches: A Step-by-Step Guide
1. **Shingling:**
* First, we convert each document into a set of its substrings of a fixed length $k$. These substrings are called "shingles."
* For example, with $k=2$, the string "`The quick brown fox`" becomes `{"Th", "he", "e ", " q", "qu", "ui", "ic", "ck", "k ", " b", "br", "ro", "ow", "wn", "n ", " f", "fo", "ox"}`.
2. **Sampling:**
* Storing all shingles can still be expensive. So, we sample a small subset of them to represent the document.
* Different sampling methods can be used:
* **Uniform Sampling:** Select $s$ shingles uniformly at random.
* **Min-Hashing:** Select the $s$ shingles with the smallest hash values (or, in a common variant, keep the minimum hash value under each of $s$ independent hash functions). This technique is particularly effective for estimating resemblance (similarity).
* **Mod-$m$ Sampling:** Select the shingles whose hash value is $0$ modulo $m$.
3. **The Sketch:**
* The selected shingles form the sketch of the document. This compact representation is used for comparison and analysis.
### Measuring Similarity with Sketches
We can estimate the similarity between documents using their sketches and measures like:
* **Resemblance:** The overlap between two sets of shingles: $$ \text{resemblance}(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
* **Containment:** The extent to which one set of shingles is contained in another: $$ \text{containment}(A,B) = \frac{|A \cap B|}{|A|} $$
**Example:** With the min-hashing variant that keeps, for each of $s$ independent hash functions, the minimum hash value over a document's shingles, the resemblance of two documents is estimated as the fraction of the $s$ positions in which their sketches agree.
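A minimal Python sketch of shingling plus this min-hashing variant follows; the prime `M`, the shingle length and the example documents are arbitrary choices.

```python
import random

M = 2_147_483_647                                     # a large prime used by the illustrative hash functions

def shingles(text, k=4):
    """The set of all length-k substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_sketch(shingle_set, hash_params):
    """For each pair (a, b), keep the minimum of (a * hash(shingle) + b) mod M over the shingle set."""
    return [min((a * hash(sh) + b) % M for sh in shingle_set) for a, b in hash_params]

def estimated_resemblance(sketch1, sketch2):
    """Fraction of hash functions whose minimum agrees: an estimate of |A intersect B| / |A union B|."""
    return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)

random.seed(0)
params = [(random.randrange(1, M), random.randrange(0, M)) for _ in range(200)]   # s = 200 hash functions

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over a lazy dog"
A, B = shingles(doc1), shingles(doc2)
print(len(A & B) / len(A | B),                                                    # exact resemblance
      estimated_resemblance(minhash_sketch(A, params), minhash_sketch(B, params)))
```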
### Advantages of Document Sketches
* **Compactness:** Sketches are significantly smaller than the original documents, saving storage space.
* **Efficiency:** Similarity estimation using sketches is much faster than comparing entire documents.
* **Scalability:** This technique scales well to handle massive collections of documents.
### Applications
* **Web Search:** Detecting near-duplicate web pages to avoid redundancy in search results.
* **Document Clustering:** Grouping similar documents together for organization and analysis.
* **Plagiarism Detection:** Identifying plagiarism in academic or literary works.
* **Recommendation Systems:** Recommending similar documents or articles to users based on their reading history.
Document sketching provides a powerful technique for managing and analyzing massive text data, enabling efficient comparison, clustering, and information retrieval. It's a testament to the power of approximation and clever sampling in handling the challenges of Big Data.
## Reference
* Chapter 6 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020.
## Further Reading
### General
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2020). Mining of massive datasets (3rd revised ed.). Cambridge University Press. [link](http://www.mmds.org/)
### Hashing
* Chapter 5 of Introduction to Algorithms (3rd edition) by Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). MIT Press.
### Tokenization
* Chapter 2 of Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. [link](https://nlp.stanford.edu/IR-book/html/htmledition/the-term-vocabulary-and-postings-lists-1.html)