# Unsupervised Learning

In the realm of machine learning, algorithms fall into three broad categories:

* **Supervised Learning:** Models learn from labeled data, where input-output pairs guide the training process.
* **Semi-Supervised Learning:** Models utilize a combination of labeled and unlabeled data, particularly useful when labeled data is scarce.
* **Unsupervised Learning:** Models analyze unlabeled data to uncover hidden patterns, relationships, and structures without explicit targets.

## Understanding Unsupervised Learning

Unsupervised learning algorithms function like detectives, analyzing *unlabeled data* to extract meaningful insights. They excel in scenarios where:

* **Labeled data is unavailable or expensive:** Unsupervised learning eliminates the need for time-consuming and potentially costly data labeling.
* **Exploratory analysis is the goal:** These models help reveal underlying patterns and relationships within the data itself.

### Key Objectives of Unsupervised Learning

* **Data Visualization:** Transforming complex data into visual representations that enhance our understanding of patterns and trends.
* **Subgroup Discovery:** Identifying natural clusters or groupings within the data based on shared characteristics, without needing predefined labels.

### Common Unsupervised Learning Methods

* **Dimensionality Reduction:** Simplifying high-dimensional data by identifying the most important variables or creating informative new features. Key benefits:
    * **Improved visualization** of complex data.
    * **Enhanced performance** of supervised learning models.
    * **Reduced computational cost.**
    * **Example:** Principal Component Analysis (PCA)
* **Clustering:** Partitioning data points into groups based on their similarity. Key characteristics:
    * **Data points within a cluster** are more similar to each other than to those in other clusters.
    * **Wide applications:** Customer segmentation, image analysis, anomaly detection.
    * **Examples:** K-means, Hierarchical Clustering, DBSCAN

## Dimensionality Reduction

High-dimensional data often presents challenges in machine learning. Dimensionality reduction addresses this by transforming data into a lower-dimensional space, streamlining analysis while preserving essential information.

### Principal Component Analysis (PCA)

PCA is a cornerstone of dimensionality reduction. Let's break it down:

**Core Idea:** PCA finds new directions (linear combinations of the original features) that capture the maximum amount of variation within your dataset.

:::success
<font color=blue>Algorithm (Principal Component Analysis)</font>

Input: Data matrix X of size N x p, where N is the number of samples and p is the number of features <br/>
Output: Principal components Z <br/>

1. Standardize the data: <br/>
for j = 1 to p: <br/>
mu[j] = mean(X[:, j])  # mean of the j-th feature <br/>
sigma[j] = std(X[:, j])  # standard deviation of the j-th feature <br/>
X[:, j] = (X[:, j] - mu[j]) / sigma[j]  # standardize the j-th feature
2. Compute the covariance matrix: <br/>
Sigma = (1 / (N - 1)) * X.T @ X
3. Compute the eigenvalues and eigenvectors of the covariance matrix: <br/>
eigenvalues, eigenvectors = eigen(Sigma)
4. Sort the eigenvalues and corresponding eigenvectors in descending order.
5. Select the top k eigenvectors (principal components) based on the desired variance threshold.
6. Project the standardized data onto the principal components: <br/>
Z = X @ eigenvectors[:, :k]
7. Return the principal components Z.
:::
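To make the steps concrete, here is a minimal NumPy sketch of the algorithm above. The helper name `pca_top_k` and the use of `np.linalg.eigh` are my own choices for illustration, not part of the original notes:

```python
import numpy as np

def pca_top_k(X, k):
    """Minimal PCA sketch following the algorithm above (illustrative helper, not a library API)."""
    # 1. Standardize each feature
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    N = X.shape[0]
    # 2. Covariance matrix of the standardized data
    Sigma = (X.T @ X) / (N - 1)
    # 3. Eigen-decomposition (eigh is suited to the symmetric covariance matrix)
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
    # 4. Sort eigenvalues/eigenvectors in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5.-6. Keep the top k directions and project the data onto them
    Z = X @ eigenvectors[:, :k]
    return Z, eigenvalues

# Example: reduce 5-dimensional random data to 2 components
rng = np.random.default_rng(0)
Z, eigvals = pca_top_k(rng.normal(size=(100, 5)), k=2)
print(Z.shape)  # (100, 2)
```

Using `np.linalg.eigh` rather than a generic eigen-solver exploits the fact that the covariance matrix is symmetric.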
### Choosing the Number of Principal Components

The **Cumulative Explained Variance Ratio** helps us determine the ideal number of components:

* Each principal component explains a percentage of the dataset's total variance.
* The cumulative explained variance plot shows how this percentage increases as we include more components.
* Our goal is to select enough components to explain a high percentage of the variance (e.g., 90%, 95%) while achieving significant dimensionality reduction.

### Example

The following Python code illustrates PCA in practice, reducing a high-dimensional dataset to two dimensions for visualization:

```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the 64-dimensional data from the digits dataset
digits_data = load_digits().data

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(digits_data)

# Project the 64-dimensional data to 2 dimensions
pca_2d = PCA(n_components=2)
reduced_data = pca_2d.fit_transform(scaled_data)

# Plot the reduced data
plt.figure(figsize=(8, 6))
plt.title('Projection of the 64-dimensional digits data onto 2 dimensions', fontsize=16)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=10, alpha=0.5)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)
plt.show()

# Plot the cumulative explained variance ratio
pca = PCA().fit(scaled_data)
plt.figure(figsize=(8, 6))
plt.title('Cumulative Explained Variance Ratio', fontsize=16)
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Cumulative Variance Ratio (%)', fontsize=14)
plt.ylim(80, 101)
plt.xlim(18, 66)
plt.grid(axis='x')
cumsum = np.cumsum(pca.explained_variance_ratio_ * 100)
plt.plot(np.arange(1, len(cumsum) + 1), cumsum,
         marker='.', linestyle='-', color='b')
variance_thresholds = [85, 90, 95, 99]
for threshold in variance_thresholds:
    plt.axhline(y=threshold, color='r', linestyle='--', alpha=0.5)
    num_components = np.argmax(cumsum >= threshold) + 1
    plt.text(num_components + 1, threshold - 2,
             f'{threshold}% ({num_components} components)', color='r', fontsize=12)
plt.show()
```

![PCA](https://hackmd.io/_uploads/BkXwCDHWA.png)

## Missing Values and Matrix Completion

Missing data is a pervasive challenge in data analysis. Understanding how to handle missing values is essential for effective modeling. Matrix completion provides a powerful technique for imputing (filling in) missing entries. Let's explore these concepts:

### Missing Values

* **Why they occur:** Missing values arise due to issues like equipment malfunctions, uncollected data, or deliberate removal during data preprocessing.
* **Impact on modeling:** Many machine learning models cannot directly handle datasets with missing entries. Preprocessing to address missing values becomes critical.

### Handling Missing Values

* **Simple Strategies:**
    * **Deletion:** Removing rows or columns with missing data (use with caution, as it can lead to significant information loss).
    * **Mean/Median Imputation:** Replacing missing values with the mean or median of the feature's observed values (see the short sketch below).
* **Matrix Completion:** A more sophisticated approach that exploits a low-rank assumption to impute missing values.
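As a point of comparison with the matrix-completion approach developed next, here is a minimal sketch of the two simple strategies on a toy array (the array values are made up for illustration):

```python
import numpy as np

# Toy data with missing entries (values are made up for illustration)
X = np.array([[1.0, 2.0,    np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0,    9.0],
              [4.0, 8.0,    12.0]])

# Deletion: keep only the rows with no missing values
complete_rows = X[~np.isnan(X).any(axis=1)]
print(complete_rows.shape)  # (2, 3) -- half of the rows are lost

# Mean imputation: replace each missing entry with its column mean
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
print(X_imputed)
```

Mean imputation ignores correlations between features; the matrix-completion algorithm below exploits exactly those correlations through a low-rank approximation.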
### Matrix Completion

* **Objective:** Recover the missing entries of a partially observed matrix, a task analogous to statistical data imputation.
* **Low-Rank Assumption:** Matrix completion often assumes that the underlying data matrix has a low-rank structure, meaning the patterns within the data can be represented by a small number of latent factors.

:::success
<font color=blue>Algorithm (Matrix Completion via PCA)</font>

1. Create a complete data matrix $\tilde{X}$ of dimension $N \times p$ whose $(i, j)$ element equals
$$\tilde{x}_{ij} = \begin{cases} x_{ij} &\text{if } (i, j) \in \mathcal{O} \\ \bar{x}_j &\text{if } (i, j) \not\in \mathcal{O}, \end{cases}$$
where $\bar{x}_j$ is the average of the observed values for the $j$th variable in the incomplete data matrix $X$. Here, $\mathcal{O}$ indexes the observations that are observed in $X$.
2. Repeat steps (a)–(c) until the objective in (c) fails to decrease: <br/>
(a) Solve
$$\min_{A \in \Bbb{R}^{N \times M},\, B \in \Bbb{R}^{p \times M}} \sum_{j=1}^p \sum_{i=1}^N \left( \tilde{x}_{ij}-\sum_{m=1}^M a_{im}b_{jm} \right)^2$$
by computing the principal components of $\tilde{X}$. <br/>
(b) For each element $(i, j) \notin \mathcal{O}$, set $\tilde{x}_{ij}=\sum_{m=1}^M \hat{a}_{im}\hat{b}_{jm}$. <br/>
(c) Compute the objective
$$\sum_{(i, j) \in \mathcal{O}} \left( \tilde{x}_{ij}-\sum_{m=1}^M \hat{a}_{im}\hat{b}_{jm} \right)^2.$$
3. Return the estimated missing entries $\tilde{x}_{ij}$, $(i, j) \not\in \mathcal{O}$.
:::

#### Choosing the Number of Principal Components $M$

We can choose the number $M$ with a validation-set approach, applied here in its $K$-fold form:

:::success
<font color=blue>Algorithm (Choosing $M$ by Validation-Set Approach)</font>

1. Divide the data into $K$ subsets of equal size. <br/>
2. In turn, each of the $K$ subsets is used as the validation set, and the remaining $K-1$ subsets are combined to form the training set.
3. For each candidate $M$:
    * For $k=1, \ldots, K$:
        * Perform PCA on the training set, reducing the data to $M$ dimensions.
        * Use the validation set to calculate the loss (error).
    * Compute the mean loss across all $K$ folds for the current value of $M$.
4. Choose the number of components $M$ that gives the lowest mean validation loss.
:::

#### Calculating the Loss

To evaluate how well our matrix completion model performs, we need a way to measure the difference between the true values in the validation set and their estimated counterparts. Here's how we calculate the PCA loss:

1. **Principal Directions $Z$:** Obtain the top $M$ principal directions from the training set. They form a matrix $Z$ of dimension $p \times M$, where $p$ is the number of features.
2. **Orthonormality (Optional):** If the columns of $Z$ are not already orthonormal (meaning their dot products equal zero for distinct columns and one for the same column), we can apply QR factorization to make them so. This simplifies calculations.
3. **PCA Loss Function:** For each data point $x$ in the validation set, the PCA loss is
$$\text{Loss}_Z(x)= \lVert x \rVert_2^2-\lVert Z^T x\rVert_2^2, \quad x \in \text{validation set}.$$
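A minimal sketch of this selection procedure, assuming a single train/validation split rather than $K$ folds for brevity (the function name `pca_validation_loss` and the simulated data are illustrative, not from the original notes):

```python
import numpy as np

def pca_validation_loss(X_train, X_val, M):
    """Mean PCA loss ||x||^2 - ||Z^T x||^2 over the validation set, for the top-M directions."""
    # Principal directions of the training set (rows of Vt are the loadings)
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    Z = Vt[:M].T                     # p x M, columns are already orthonormal
    proj = X_val @ Z                 # coordinates of validation points in the subspace
    losses = (X_val**2).sum(axis=1) - (proj**2).sum(axis=1)
    return losses.mean()

# Simulated data of (approximate) rank 3, standardized and split into train/validation
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)
X_train, X_val = X[:150], X[150:]

for M in range(1, 6):
    print(M, round(pca_validation_loss(X_train, X_val, M), 3))
# The loss should drop sharply up to M = 3 (the true rank) and flatten afterwards.
```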
### Netflix Recommender Systems

One specific early example, the Netflix movie rating matrix, had $N = 480{,}189$ customers and $p = 17{,}770$ movies. On average, each customer had rated only around $200$ movies, so about $99\%$ of the matrix entries were missing. To suggest a movie that a particular customer might like, Netflix needed to impute the missing values of this data matrix.

### Example

```python=
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.datasets import get_rdataset  # provides the USArrests data

USArrests = get_rdataset('USArrests').data
scaler = StandardScaler(with_std=True, with_mean=True)
USArrests_scaled = scaler.fit_transform(USArrests)
```

|            | Murder    | Assault     | UrbanPop   | Rape      |
| ---------- | --------- | ----------- | ---------- | --------- |
| Alabama    | 13.2      | 236         | 58         | 21.2      |
| Arizona    | 8.1       | 294         | 80         | 31.0      |
| ...        | ...       | ...         | ...        | ...       |
| Wisconsin  | 2.6       | 53          | 66         | 10.8      |
| Wyoming    | 6.8       | 161         | 60         | 15.6      |
| $\mu$      | 7.788     | 170.760     | 65.540     | 21.232    |
| $\sigma^2$ | 18.970465 | 6945.165714 | 209.518776 | 87.729159 |

```python=+
X = USArrests_scaled
U, D, V = np.linalg.svd(X, full_matrices=False)
# full_matrices=False ensures that for a tall matrix the shape of U is the same as the shape of X.
```

U.shape: (50, 4), D.shape: (4,), V.shape: (4, 4)

Note that `np.linalg.svd` returns $V^T$ (stored here as `V`), so its rows are the principal component loadings:

| -0.53589947 | -0.58318363 | -0.27819087 | -0.54343209 |
| ----------- | ----------- | ----------- | ----------- |
| 0.41818087  | 0.1879856   | -0.87280619 | -0.16731864 |
| -0.34123273 | -0.26814843 | -0.37801579 | 0.81777791  |
| 0.6492278   | -0.74340748 | 0.13387773  | 0.08902432  |

```python=+
n_omit = 20
np.random.seed(15)
r_idx = np.random.choice(np.arange(X.shape[0]), n_omit, replace=False)
c_idx = np.random.choice(np.arange(X.shape[1]), n_omit, replace=True)
Xna = X.copy()
Xna[r_idx, c_idx] = np.nan

def low_rank(X, M=1):
    U, D, V = np.linalg.svd(X)
    L = U[:,:M] * D[None,:M]
    return L.dot(V[:M])
```

Replace the missing values with the column means of the non-missing entries, then iterate steps 2(a)–(c) of the algorithm:

```python=+
Xhat = Xna.copy()
Xbar = np.nanmean(Xhat, axis=0)
Xhat[r_idx, c_idx] = Xbar[c_idx]

thresh = 1e-7
rel_err = 1
count = 0
ismiss = np.isnan(Xna)
mssold = np.mean(Xhat[~ismiss]**2)
mss0 = np.mean(Xna[~ismiss]**2)

while rel_err > thresh:
    count += 1
    # Step 2(a)
    Xapp = low_rank(Xhat, M=1)
    # Step 2(b)
    Xhat[ismiss] = Xapp[ismiss]
    # Step 2(c)
    mss = np.mean(((Xna - Xapp)[~ismiss])**2)
    rel_err = (mssold - mss) / mss0
    mssold = mss
    print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
          .format(count, mss, rel_err))
```

```
Iteration: 1, MSS:0.395, Rel.Err 5.99e-01
Iteration: 2, MSS:0.382, Rel.Err 1.33e-02
Iteration: 3, MSS:0.381, Rel.Err 1.44e-03
Iteration: 4, MSS:0.381, Rel.Err 1.79e-04
Iteration: 5, MSS:0.381, Rel.Err 2.58e-05
Iteration: 6, MSS:0.381, Rel.Err 4.22e-06
Iteration: 7, MSS:0.381, Rel.Err 7.65e-07
Iteration: 8, MSS:0.381, Rel.Err 1.48e-07
Iteration: 9, MSS:0.381, Rel.Err 2.95e-08
```

Compute the correlation between the 20 imputed values and the actual values:

```python=+
np.corrcoef(Xapp[ismiss], X[ismiss])[0,1]
```

0.711

## Clustering Methods

Clustering is a powerful unsupervised machine learning technique for discovering patterns in unlabeled data. It has wide-ranging applications in customer segmentation, image analysis, fraud detection, and more.

### K-means Clustering

K-means aims to divide data into $K$ groups while minimizing the total *Within-Cluster Variation (WCV)*. Here's the core idea:

* **Dataset:** A collection $D$ of $N$ data points (e.g., represented as vectors).
* **Clusters:** We have $K$ clusters denoted by sets $C_1, C_2, \ldots, C_K$.
    * If a data point $x$ belongs to the $i$-th cluster, we write $x \in C_i$.
* **Coverage:** Each data point belongs to at least one cluster.
$$\#(C_1 \cup C_2 \cup \ldots \cup C_K)=\#D=N$$
* **Non-Overlap:** No data point belongs to more than one cluster.
$$C_i \cap C_j=\emptyset \quad \forall i \neq j$$
* **Within-Cluster Variation (WCV) for a cluster:**
$$\text{WCV}(C_i) = \frac{1}{\#C_i} \sum_{x, y \, \in C_i} d(x, y)^2$$
where $\#C_i$ is the number of points in $C_i$, and $d(x,y)$ is the distance (e.g., Euclidean distance) between points $x$ and $y$.
* **Optimization Goal:** Minimize the total WCV (see the identity below for why centroids enter the picture):
$$\min_{C_1, C_2, \ldots, C_K} \left( \sum_{i=1}^K \text{WCV}(C_i) \right)$$
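When $d$ is the Euclidean distance, the pairwise form of WCV is equivalent (up to a factor of 2) to the sum of squared distances to the cluster centroid $\mu_{C_i}$; this standard identity is what makes the centroid-update step of the algorithm below decrease the objective:

$$\frac{1}{\#C_i} \sum_{x, y \, \in C_i} \lVert x - y \rVert_2^2 \;=\; 2 \sum_{x \in C_i} \lVert x - \mu_{C_i} \rVert_2^2, \qquad \mu_{C_i} = \frac{1}{\#C_i}\sum_{x \in C_i} x.$$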
:::success
<font color=blue>Algorithm (K-Means Clustering)</font>

1. Randomly assign a number, from $1$ to $K$, to each of the observations. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing: <br/>
(a) For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster. <br/>
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
:::

#### Choosing the Number of Clusters $K$

* **Elbow Method:**
    * Calculate the total WCV for different values of $K$.
    * Plot WCV vs. $K$.
    * Look for the "elbow" where adding more clusters yields diminishing returns in reducing WCV; mathematically, it is the point of maximum [curvature](https://en.wikipedia.org/wiki/Curvature). This point often suggests a suitable $K$.
    * Beyond the elbow point, increasing $K$ does not lead to a significant reduction in WCV.
    * ![Elbow method](https://hackmd.io/_uploads/BkopQes1R.png)

Sometimes the elbow method does not show a distinct inflection point from which to read off $K$. In such cases, the Silhouette Score method is an effective alternative (a code sketch for both methods follows below).

* **Silhouette Score Method:**
    * The [Silhouette Score](https://en.wikipedia.org/wiki/Silhouette_(clustering)) measures how well a point fits within its assigned cluster versus neighboring clusters.
    * Calculate the Silhouette Score for different values of $K$.
    * Choose the $K$ that maximizes the average Silhouette Score.
    * ![Silhouette Score](https://hackmd.io/_uploads/SyV3rxoJA.png)
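A minimal sketch of both heuristics using scikit-learn, run on simulated blob data (the data-generation settings are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Simulated data with 4 well-separated clusters (illustrative settings)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

for K in range(2, 9):
    km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)
    wcv = km.inertia_                      # total within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette score over all points
    print(f"K={K}  WCV={wcv:8.1f}  silhouette={sil:.3f}")

# Elbow method: look for the K where the WCV curve bends ("elbow").
# Silhouette method: pick the K with the highest mean silhouette score.
```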
#### Strengths

* Simple to understand and implement.
* Computationally efficient.
* Works well with relatively spherical clusters.

#### Considerations

* Requires specifying the number of clusters $K$ in advance.
* Can be sensitive to the initial choice of centroids.
* Might not perform optimally for non-spherical clusters.

### Hierarchical Clustering

Hierarchical clustering creates a tree-like structure (dendrogram) representing the relationships between data points. A key advantage is that it doesn't require you to know the number of clusters in advance.

:::success
<font color=blue>Algorithm (Hierarchical Clustering)</font>

1. Begin with $N$ observations and a measure (such as Euclidean distance) of all the ${N \choose 2}$ pairwise dissimilarities. Treat each observation as its own cluster. <br/>
2. For $i = N, N-1, \dots, 2$: <br/>
(i) Examine all pairwise inter-cluster dissimilarities among the $i$ clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed. <br/>
(ii) Compute the new pairwise inter-cluster dissimilarities among the $i - 1$ remaining clusters.
:::

#### Dendrogram Visualization

* Dendrograms visually depict the cluster hierarchy.
* The height at which two clusters merge in the dendrogram reflects their dissimilarity: the higher the link, the less similar the two clusters it joins.

![005-visualizing-dendrograms-cutree-1](https://hackmd.io/_uploads/rkMGe7eZC.png)

#### Key Choices in Hierarchical Clustering

1. **Dissimilarity Measure:**
    * Euclidean distance is common, but consider correlation-based distance if relationships between features are important.
2. **Linkage Criterion:** Determines the distance between two clusters based on the pairwise distances between points in each cluster.
    * **Complete linkage:** $\max_{x \in C_i,\,y \in C_j}d(x, y)$
    * **Single linkage:** $\min_{x \in C_i,\,y \in C_j}d(x, y)$
    * **Average linkage:** $\frac{1}{\#C_i \cdot \#C_j}\sum_{x \in C_i,\,y \in C_j}d(x, y)$
    * **Centroid linkage:** $d(\mu_{C_i}, \mu_{C_j})$
    * **Ward's method:** $\sum_{x \in C_i \cup C_j}d(x, \mu_{C_i \cup C_j})^2$

#### Strengths

* No need to pre-specify the number of clusters.
* Handles various cluster shapes and sizes.
* The dendrogram provides insights into data relationships.

#### Considerations

* Computationally more expensive than K-means, especially for large datasets.

### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a powerful clustering algorithm designed to handle datasets with noise and discover clusters of various shapes.

#### Key Concepts

* **Neighborhood Radius $\epsilon$:** Maximum distance between points for them to be considered neighbors.
* **MinPts:** Minimum number of neighboring points needed for a point to be a core point.
* **Core Point:** A point with at least MinPts neighbors within its $\epsilon$-neighborhood.
* **Border Point:** A point in a core point's $\epsilon$-neighborhood that does not have enough neighbors to be a core point itself.
* **Noise:** Points that are neither core nor border points.

:::success
<font color=blue>Algorithm (DBSCAN)</font>

1. For each point, find all its neighbors within distance $\epsilon$ and identify the core points (those with at least MinPts neighbors). <br/>
2. Repeat the following steps until every core point has been assigned to a cluster: <br/>
(i) Pick a core point; if it is not already assigned to a cluster, create a new cluster. <br/>
(ii) Recursively find all its density-connected points and assign them to the same cluster as the core point. <br/>
3. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise. <br/>
:::

#### Parameter Tuning: How to Choose $\epsilon$ and MinPts

* **$\epsilon$:** Examine a k-distance plot (distances to the k-th nearest neighbor, often with `k = MinPts`). The "knee" in the plot, i.e. the point of maximum [curvature](https://en.wikipedia.org/wiki/Curvature), suggests a good $\epsilon$ value (see the sketch at the end of this section).
* **MinPts:** A common heuristic is to set `MinPts = 2 * dim_data`.

#### Strengths

* Effective for non-spherical clusters.
* Robust to outliers in the data.
* Doesn't need the number of clusters specified upfront.

#### Considerations

* Performance depends heavily on the choice of $\epsilon$ and `MinPts`.
* May struggle with datasets where clusters have very different densities.
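A minimal sketch of the k-distance heuristic for choosing $\epsilon$, using scikit-learn's `NearestNeighbors` (the simulated data and `MinPts = 4` are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Simulated 2-D data (illustrative); MinPts = 4 follows the 2 * dim_data heuristic
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
min_pts = 4

# Distance from each point to its MinPts-th nearest neighbor
# (+1 because each point is returned as its own nearest neighbor)
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# k-distance plot: the "knee" of this curve suggests a reasonable epsilon
plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_pts}-th nearest neighbor')
plt.title('k-distance plot for choosing epsilon')
plt.show()
```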
### Example

```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import (KMeans, AgglomerativeClustering)
from scipy.cluster.hierarchy import (dendrogram, cut_tree)
from ISLP.cluster import compute_linkage
```

A simple simulated example:

```python=
np.random.seed(0)
X = np.random.standard_normal((50,2))
X[:25,0] += 3
X[:25,1] -= 4
```

#### K-means

Perform K-means clustering with K = 2:

```python=
kmeans = KMeans(n_clusters=2, random_state=2, n_init=20).fit(X)
fig, ax = plt.subplots(1, 1, figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=2")
```

Perform K-means clustering with K = 3:

```python=+
kmeans = KMeans(n_clusters=3, random_state=3, n_init=20).fit(X)
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=kmeans.labels_)
ax.set_title("K-Means Clustering Results with K=3");
```

![K-Means](https://hackmd.io/_uploads/rkyE4RH-0.png)

The n_init argument runs K-means with n_init different initial cluster assignments (the default is 10):

```python=+
kmeans1 = KMeans(n_clusters=3, random_state=3, n_init=1).fit(X)
kmeans20 = KMeans(n_clusters=3, random_state=3, n_init=20).fit(X)
```

kmeans1.inertia_ = 78.06 and kmeans20.inertia_ = 75.04, where `inertia_` is the total within-cluster sum of squares.

> We strongly recommend always running K-means clustering with a large value of n_init, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

#### Hierarchical Clustering

```python=
HClust = AgglomerativeClustering
hc_comp = HClust(distance_threshold=0, n_clusters=None, linkage='complete')
hc_comp.fit(X)

hc_avg = HClust(distance_threshold=0, n_clusters=None, linkage='average')
hc_avg.fit(X)

hc_sing = HClust(distance_threshold=0, n_clusters=None, linkage='single')
hc_sing.fit(X)

# Single linkage with a precomputed distance matrix
D = np.zeros((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    x_ = np.multiply.outer(np.ones(X.shape[0]), X[i])
    D[i] = np.sqrt(np.sum((X - x_)**2, 1))
hc_sing_pre = HClust(distance_threshold=0, n_clusters=None,
                     metric='precomputed', linkage='single')
hc_sing_pre.fit(D)
```

Plot the dendrograms:

```python=+
cargs = {'color_threshold':-np.inf, 'above_threshold_color':'black'}
linkage_comp = compute_linkage(hc_comp)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp, ax=ax, **cargs)
```

Color the branches of the tree above and below a cut threshold differently:

```python=+
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
dendrogram(linkage_comp, ax=ax,
           color_threshold=4, above_threshold_color='black')
```

![dendrogram (1)](https://hackmd.io/_uploads/Hy7RmNFzC.png)

The cluster labels for each observation associated with a given cut of the dendrogram:

```python=+
cut_tree(linkage_comp, n_clusters=4).T
```

```
array([[0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2,
        0, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3,
        3, 3, 3, 3, 3, 3]])
```

```python=+
cut_tree(linkage_comp, height=5)
```

```
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2,
        2, 2, 2, 2, 2, 2]])
```

#### DBSCAN

```python=
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import numpy as np

def generator_data(seed=0):
    np.random.seed(seed)
    X1, y1 = datasets.make_circles(n_samples=1234, factor=0.5, noise=0.05)
    X2, y2 = datasets.make_blobs(n_samples=234, n_features=2, centers=[[0,0]],
                                 cluster_std=[[0.1]], random_state=6)
    X = np.concatenate((X1, X2))
    return X

X = generator_data(0)
dbscan = DBSCAN(eps=0.08, min_samples=4).fit(X)

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(X[:, 0], X[:, 1], c=dbscan.labels_)
ax.set_title("DBSCAN Clustering Results")
```
![DS (1)](https://hackmd.io/_uploads/SkoTG4YGR.png)

## Reference

* [CS 480/680 - Introduction to Machine Learning - Spring 2021](https://youtube.com/playlist?list=PLmd_zeMNzSvSzRRc4Q29qEcpxbhdwjMOx&si=lf0u2qv1cMS6DY50)
* [EE104/CME107: Introduction to Machine Learning](https://ee104.stanford.edu/schedule.html)
* [Principal Component Analysis, Sanjay Lall and Stephen Boyd, EE104, Stanford University](https://ee104.stanford.edu/lectures/pca.pdf)
* [Unsupervised Learning, Sanjay Lall and Stephen Boyd, EE104, Stanford University](https://ee104.stanford.edu/lectures/unsupervised.pdf)
* [Stanford EE104: Introduction to Machine Learning Full Course](https://youtube.com/playlist?list=PLoROMvodv4rN_Uy7_wmS051_q1d6akXmK&si=HX4PcD4wva-HT3I8)
* [Statistical Learning with Python](https://www.youtube.com/playlist?list=PLoROMvodv4rPP6braWoRt5UCXYZ71GZIQ)
* [Therese D. Pigott, A Review of Methods for Missing Data, pp. 353-383](https://doi.org/10.1076/edre.7.4.353.8937)
* [DBSCAN: Density-Based Clustering Essentials](https://www.datanovia.com/en/lessons/dbscan-density-based-clustering-essentials/)
* [Determining The Optimal Number Of Clusters: 3 Must Know Methods](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/#average-silhouette-method)
* [Silhouettes: A graphical aid to the interpretation and validation of cluster analysis](https://doi.org/10.1016/0377-0427(87)90125-7)
* [Robert Tibshirani, Guenther Walther, Trevor Hastie, Estimating the Number of Clusters in a Data Set Via the Gap Statistic, Journal of the Royal Statistical Society Series B: Statistical Methodology, 63(2), 2001, pp. 411-423](https://doi.org/10.1111/1467-9868.00293)
* [Python implementation of the Gap Statistic](https://github.com/milesgranger/gap_statistic)
* [Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1-27.](https://doi.org/10.1080/03610927408827101)
* [Netflix Movie Rating Dataset](https://www.kaggle.com/datasets/rishitjavia/netflix-movie-rating-dataset/)