# Machine Learning Homework 1 PCA

Department: Forestry
Student ID: M10912004
Name: De-Kai Kao (高得愷)

## Generate the data set

1. ### Produce 200 samples of 2D random data X(x1, x2), where
    - x1 contains samples from the normal distribution with mean 3 and standard deviation 2.5,
    - x2 contains samples from the normal distribution with mean 50 and standard deviation 14.

```
import numpy as np
import matplotlib.pyplot as plt

# Set up the parameters for the normal distributions
mean1 = 3
std1 = 2.5
mean2 = 50
std2 = 14

# Generate 200 random samples of x1 and x2
x1 = np.random.normal(mean1, std1, 200)
x2 = np.random.normal(mean2, std2, 200)
```

2. ### Plot the histogram diagrams of x1 and x2, respectively.

```
# Plot the histograms of x1 and x2 side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

ax1.hist(x1, bins=20)
ax1.set_xlabel('x1')
ax1.set_ylabel('Frequency')
ax1.set_title('Histogram of x1')

ax2.hist(x2, bins=20)
ax2.set_xlabel('x2')
ax2.set_ylabel('Frequency')
ax2.set_title('Histogram of x2')

plt.show()
```

![](https://i.imgur.com/l3kD65c.png)

3. ### Plot the 2D dot plot of the data set (x1 vs. x2)

```
# Plot the scatter diagram of x1 vs. x2
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x1, x2, alpha=0.5)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Scatter Diagram of x1 and x2')
plt.show()
```

![](https://i.imgur.com/b4iTnPq.png)

## Perform PCA on your data by applying the following three methods:

1. ### the PCA procedure presented in the lecture

Calculate the covariance matrix

```
# Combine the samples into a 200 x 2 data matrix (one row per sample)
X = np.vstack((x1, x2)).T
```

Handwork

```
# Mean of each feature, computed by hand
meanx1 = sum(x1) / 200
meanx2 = sum(x2) / 200
meanX = np.array([meanx1, meanx2])

def covariance(X):
    n_samples = len(X)
    n_features = len(X[0])
    # Accumulate the population covariance (divide by N, not N - 1)
    covariance = [[0] * n_features for _ in range(n_features)]
    for sample in X:
        for i in range(n_features):
            for j in range(n_features):
                S = (sample[i] - meanX[i]) * (sample[j] - meanX[j]) / n_samples
                covariance[i][j] += S
    return covariance

covariance(X)
```

- Result
[[5.791835929969865, -4.080428803211322], [-4.080428803211322, 178.48208258622697]]

---

Lagrange Multiplier

```
A = np.array(covariance(X))   # convert the nested list to an ndarray

# Step 1: Define lambda
lambda_ = 0
# Step 2: Define matrix B
B = A - lambda_ * np.eye(A.shape[0])
# Step 3: Compute determinant of B
det_B = np.linalg.det(B)
# Step 4: Set det(B) = 0
eqn = det_B
# Step 5: Solve for lambda
# NOTE: det_B was evaluated at the fixed value lambda_ = 0, so eqn is a single
# number rather than the characteristic polynomial in lambda; np.roots therefore
# cannot recover the true eigenvalues here.
lambdas = np.roots([eqn] + [0] * (A.shape[0]))
# Step 6: For each lambda, find eigenvector
eigenvecs = []
for lambd in lambdas:
    B = A - lambd * np.eye(A.shape[0])
    _, _, v = np.linalg.svd(B)
    nullvec = v[np.abs(np.linalg.det(v)) < 1e-9]
    eigenvecs.append(nullvec)

# Print the eigenvalues and eigenvectors
print("Eigenvalue: ", lambdas)
print("Eigenvector: ", eigenvecs)
```

- Result
Eigenvalue: 0.0
Eigenvector: []

I'm sorry, Professor Zeng. I tried to convert the Lagrange multiplier method into Python code, but did not get any results. I may not understand it thoroughly enough. (A rough sketch of how the 2×2 characteristic equation could be solved directly is included after the NumPy results below.)

---

NumPy

```
# Center the data and let NumPy compute the (sample) covariance matrix
mean = np.mean(X, axis=0)
data_centered = X - mean
covariance1 = np.cov(data_centered, rowvar=False)
print(covariance1)
```

- Result
[[ 5.82094063 -4.10093347]
 [ -4.10093347 179.37897747]]

```
# Eigen-decomposition of the covariance matrix
lam, vet = np.linalg.eig(covariance1)
print(lam)
print(vet)
```

- Result
[ 5.72409536 179.47582275]
[[-0.99972127 0.02360884]
 [-0.02360884 -0.99972127]]
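For reference, here is a minimal sketch of how the Lagrange-multiplier result, i.e. the characteristic equation det(C − λI) = 0, could be solved explicitly for a 2×2 covariance matrix. It is only an illustration, not part of the graded attempt above: the matrix `C` is simply the handwork covariance reported earlier, and the closed-form eigenvector (b, λ − a) assumes the off-diagonal entry b is non-zero.

```
# Sketch: solve det(C - lambda*I) = 0 directly for a 2x2 matrix C = [[a, b], [b, d]]
import numpy as np

# Handwork (population) covariance matrix reported above
C = np.array([[5.791835929969865, -4.080428803211322],
              [-4.080428803211322, 178.48208258622697]])
a, b, d = C[0, 0], C[0, 1], C[1, 1]

# Characteristic polynomial: lambda^2 - (a + d)*lambda + (a*d - b^2) = 0
lambdas = np.roots([1, -(a + d), a * d - b * b])

# For each eigenvalue l, (b, l - a) solves (C - l*I) v = 0 when b != 0
eigvecs = []
for l in lambdas:
    v = np.array([b, l - a])
    eigvecs.append(v / np.linalg.norm(v))   # normalise to unit length

print("Eigenvalues:", lambdas)
print("Eigenvectors (one per row):", np.vstack(eigvecs))
```

The eigenvectors from this sketch should match the `np.linalg.eig` result above up to sign; the eigenvalues differ from it only by the N versus N − 1 factor discussed in the conclusion, since the handwork covariance divides by N.

---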
2. ### the SVD method

```
# Step 1: Calculate the singular value decomposition of the covariance matrix
U, s, V = np.linalg.svd(covariance1)
# Step 2: The singular values of the symmetric covariance matrix are its eigenvalues
eigenvalues = s
# Step 3: The rows of V are the corresponding eigenvectors
eigenvectors = V

print("Eigenvalues: ", eigenvalues)
print("Eigenvectors: ", eigenvectors)
```

- Result
Eigenvalues: [179.47582275 5.72409536]
Eigenvectors: [[-0.02360884 0.99972127]
 [ 0.99972127 0.02360884]]

3. ### the Python Scikit-learn

```
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
print(pca.components_)
```

- Result
Explained variance ratio: [0.96909234 0.03090766]
Singular values: [188.98594849 33.75048112]
Components (one eigenvector per row): [[-0.02360884 0.99972127]
 [-0.99972127 -0.02360884]]

Note that `pca.singular_values_` are the singular values of the mean-centered data matrix, not the eigenvalues of the covariance matrix; dividing their squares by N − 1 recovers the eigenvalues (e.g. 188.986² / 199 ≈ 179.48).

Plot:

```
import matplotlib.pyplot as plt

# Project the data onto the principal components and compare with the original data
new_X = pca.fit_transform(X)

plt.figure(figsize=(12, 6), dpi=60)
plt.subplot(121)
plt.axis([-5, 10, -20, 100])
plt.scatter(X[:, 0], X[:, 1], s=50, c='r', marker='o', alpha=0.5)
plt.subplot(122)
plt.axis([-5, 10, -20, 100])
plt.scatter(new_X[:, 0], new_X[:, 1], s=50, c='b', marker='o', alpha=0.5)
plt.show()
```

![](https://i.imgur.com/4hNqDNY.png)

## Discuss the results of your work.

I tried to convert the **Lagrange multiplier** method into Python code, but did not get any results. I may not understand it thoroughly enough.

The difference between the handwork algorithm and the NumPy algorithm for the **covariance** matrix is whether the sum is divided by N or by N − 1: the handwork version uses the population formula (divide by N), while `np.cov` uses the unbiased sample estimator (divide by N − 1), which is why the two matrices differ slightly.

Result of **np.linalg.svd**
```
Eigenvalues: [179.47582275 5.72409536]
Eigenvectors: [[-0.02360884 0.99972127]
 [ 0.99972127 0.02360884]]
```
Result of **np.linalg.eig**
```
Eigenvalues: [ 5.72409536 179.47582275]
Eigenvectors: [[-0.99972127 0.02360884]
 [-0.02360884 -0.99972127]]
```
The two functions look slightly different, and I found the answer on Stack Overflow: **np.linalg.svd** returns the eigenvectors as the rows of V, whereas **np.linalg.eig** returns them as the columns of its second output. The eigenvalues also come out in a different order (descending for SVD), and individual eigenvectors may differ by a sign flip, which does not change the direction of the principal axis.
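A quick way to check this row-versus-column convention is the minimal sketch below; it re-enters the covariance matrix reported above only so that the snippet is self-contained, sorts the `eig` output, and compares it with the `svd` output up to sign.

```
import numpy as np

# Covariance matrix reported above (repeated here so the snippet runs on its own)
covariance1 = np.array([[5.82094063, -4.10093347],
                        [-4.10093347, 179.37897747]])

lam, vecs = np.linalg.eig(covariance1)   # eigenvectors are the COLUMNS of vecs
U, s, V = np.linalg.svd(covariance1)     # eigenvectors are the ROWS of V

# Sort the eig output by descending eigenvalue so the ordering matches svd
order = np.argsort(lam)[::-1]

# Eigenvalues of the symmetric covariance matrix equal its singular values
print(np.allclose(lam[order], s))                        # expected: True
# Columns of vecs match rows of V up to a sign flip per eigenvector
print(np.allclose(np.abs(vecs[:, order].T), np.abs(V)))  # expected: True
```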