# Machine Learning Homework 1: PCA
Department : Forestry
Student ID : M10912004
Name : De-Kai Kao (高得愷)
## Generate the data set
1. ### Produce 200 samples of 2D random data X(x1, x2), where
- x1 consists of samples drawn from a normal distribution with mean 3 and standard deviation 2.5,
- x2 consists of samples drawn from a normal distribution with mean 50 and standard deviation 14.
```
import numpy as np
import matplotlib.pyplot as plt
# Set up the parameters for the normal distributions
mean1 = 3
std1 = 2.5
mean2 = 50
std2 = 14
# Generate 200 random samples of x1 and x2
x1 = np.random.normal(mean1, std1, 200)
x2 = np.random.normal(mean2, std2, 200)
```
2. ### Plot the histogram diagrams of x1 and x2 , respectively.
```
# Plot the histograms of x1 and x2
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.hist(x1, bins=20)
ax1.set_xlabel('x1')
ax1.set_ylabel('Frequency')
ax1.set_title('Histogram of x1')
ax2.hist(x2, bins=20)
ax2.set_xlabel('x2')
ax2.set_ylabel('Frequency')
ax2.set_title('Histogram of x2')
plt.show()
```

3. ### Plot the 2D dot plot of the data set (x1 vs. x2)
```
# Plot the scatter diagram of x1 and x2
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x1, x2, alpha=0.5)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Scatter Diagram of x1 and x2')
plt.show()
```

## Perform PCA on your data by applying the following three methods:
1. ### the PCA procedure presented in Lecture
Calculate covariance
```
# Combine the samples into a 2D array
X = np.vstack((x1, x2)).T
```
Handwork
```
meanx1 = sum(x1) / 200
meanx2 = sum(x2) / 200
meanX = np.array([meanx1, meanx2])

def covariance(X):
    n_samples = len(X)
    n_features = len(X[0])
    covariance = [[0] * n_features for _ in range(n_features)]
    for sample in X:
        for i in range(n_features):
            for j in range(n_features):
                # population covariance: divide by n_samples, not (n_samples - 1)
                S = (sample[i] - meanX[i]) * (sample[j] - meanX[j]) / n_samples
                covariance[i][j] += S
    return covariance

covariance(X)
```
- Result
[[5.791835929969865, -4.080428803211322],
[-4.080428803211322, 178.48208258622697]]
---
Lagrange Multiplier
```
A = covariance(X)
# Step 1: Define lambda
lambda_ = 0
# Step 2: Define matrix B
B = A - lambda_ * np.eye(A.shape[0])
# Step 3: Compute determinant of B
det_B = np.linalg.det(B)
# Step 4: Set det(B) = 0
eqn = det_B
# Step 5: Solve for lambda
lambdas = np.roots([eqn] + [0]*(A.shape[0]))
# Step 6: For each lambda, find eigenvector
eigenvecs = []
for lambd in lambdas:
    B = A - lambd * np.eye(A.shape[0])
    _, _, v = np.linalg.svd(B)
    nullvec = v[np.abs(np.linalg.det(v)) < 1e-9]
    eigenvecs.append(nullvec)
# Print the eigenvalues and eigenvectors
print("Eigenvalue: ", lambdas)
print("Eigenvector: ", eigenvecs)
```
- Result
Eigenvalue: 0.0
Eigenvector: []
I'm sorry, Professor Zeng. I tried to convert the Lagrange multiplier method into Python code, but I did not get any results. I may not understand it thoroughly enough.
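For reference, below is a minimal sketch of how I think the characteristic-equation step could be coded for the 2×2 case (this is only a reconstruction, reusing `X` and the `covariance()` helper from above; it is not the code that produced the result above):
```
import numpy as np

A = np.array(covariance(X))  # 2x2 covariance matrix from the handwork function
a, b = A[0, 0], A[0, 1]
c, d = A[1, 0], A[1, 1]

# det(A - lambda*I) = 0  =>  lambda^2 - (a + d)*lambda + (a*d - b*c) = 0
char_poly = [1, -(a + d), a * d - b * c]
lambdas = np.roots(char_poly)  # the two eigenvalues

# For each eigenvalue, (A - lambda*I) v = 0; take the null-space direction from the SVD
eigenvecs = []
for lam in lambdas:
    B = A - lam * np.eye(2)
    _, _, vh = np.linalg.svd(B)
    eigenvecs.append(vh[-1])  # right-singular vector of the smallest singular value

print("Eigenvalues:", lambdas)
print("Eigenvectors:", eigenvecs)
```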
---
Numpy
```
mean = np.mean(X, axis=0)
data_centered = X - mean
covariance1 = np.cov(data_centered, rowvar=False)
print(covariance1)
```
- Result
[[ 5.82094063 -4.10093347]
[ -4.10093347 179.37897747]]
```
lam, vet = np.linalg.eig(covariance1)
print(lam)
print(vet)
```
- Result
[ 5.72409536 179.47582275]
[[-0.99972127 0.02360884]
[-0.02360884 -0.99972127]]
---
2. ### the SVD method
```
# Step 1: Calculate the singular value decomposition
U, s, V = np.linalg.svd(covariance1)
# Step 2: Calculate the eigenvalues
eigenvalues = s
# Step 3: Calculate the eigenvectors
eigenvectors = V
print("Eigenvalues: ", eigenvalues)
print("Eigenvectors: ", eigenvectors)
```
- Result
Eigenvalues: [179.47582275 5.72409536]
Eigenvectors: [[-0.02360884 0.99972127]
[ 0.99972127 0.02360884]]
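As a quick sanity check (my own addition, reusing `U`, `s`, `V`, and `covariance1` from above), the SVD factors should reconstruct the covariance matrix, and because the covariance matrix is symmetric its left and right singular vectors should agree up to sign:
```
# U @ diag(s) @ V should give back the covariance matrix
print(np.allclose(U @ np.diag(s) @ V, covariance1))  # expected: True

# For a symmetric matrix, the columns of U match the rows of V up to sign
print(np.allclose(np.abs(U), np.abs(V.T)))           # expected: True
```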
3. ### the Python Scikit-learn
```
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
print(pca.components_)
```
- Result
Explained variance ratio: [0.96909234 0.03090766]
Singular values: [188.98594849 33.75048112]
Components (eigenvectors as rows): [[-0.02360884 0.99972127]
[-0.99972127 -0.02360884]]
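Note that `pca.singular_values_` are the singular values of the centered data matrix, not eigenvalues of the covariance matrix. A small check (my own sketch, reusing the fitted `pca` object): `pca.explained_variance_` holds the covariance eigenvalues, and the singular values relate to them by squaring and dividing by N - 1.
```
# Eigenvalues of the (sample) covariance matrix
print(pca.explained_variance_)                    # should be close to [179.48, 5.72]

# singular_value**2 / (N - 1) gives the same eigenvalues
print(pca.singular_values_ ** 2 / (len(X) - 1))   # should be close to [179.48, 5.72]
```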
Plot:
```
import matplotlib.pyplot as plt
# Project the data onto the principal components
new_X = pca.fit_transform(X)
plt.figure(figsize=(12, 6), dpi=60)
# Left panel: the original data
plt.subplot(121)
plt.axis([-5, 10, -20, 100])
plt.scatter(X[:, 0], X[:, 1], s=50, c='r', marker='o', alpha=0.5)
# Right panel: the data in principal-component coordinates
plt.subplot(122)
plt.axis([-5, 10, -20, 100])
plt.scatter(new_X[:, 0], new_X[:, 1], s=50, c='b', marker='o', alpha=0.5)
plt.show()
```

## Discuss the results of your works.
I tried to convert the **Lagrange multiplier** method into Python code, but did not get any results. I may not understand it thoroughly enough.
The difference between the handwork algorithm and the NumPy algorithm in calculating the **covariance** lies in whether the divisor is N or N - 1: dividing by N gives the population covariance, while dividing by N - 1 gives the sample covariance.
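This can be checked directly (my own sketch, reusing `X`): `np.cov` divides by N - 1 by default, and `bias=True` switches it to the N divisor used in the handwork function.
```
# Default: sample covariance, divisor N - 1 (matches the NumPy result above)
print(np.cov(X, rowvar=False))

# bias=True: population covariance, divisor N (matches the handwork result above)
print(np.cov(X, rowvar=False, bias=True))
```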
Result of **np.linalg.svd**
```
Eigenvalues: [179.47582275 5.72409536]
Eigenvectors: [[-0.02360884 0.99972127]
[ 0.99972127 0.02360884]]
```
Result of **np.linalg.eig**
```
Eigenvalues: [ 5.72409536 179.47582275]
Eigenvectors: [[-0.99972127 0.02360884]
[-0.02360884 -0.99972127]]
```
It seems there is a small difference between the two functions, and I found the answer on Stack Overflow: the rows of the matrix returned by **np.linalg.svd** contain the eigenvectors, whereas **np.linalg.eig** returns them as columns.
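This can be verified with a short check (my own sketch, reusing `eigenvectors` from the SVD step and `vet` from `np.linalg.eig`): after transposing the SVD output and matching the ordering, the two sets of eigenvectors agree up to sign.
```
# Rows of the SVD output are eigenvectors; transpose so they become columns like eig's output
svd_vecs = eigenvectors.T

# Here eig returned the eigenvalues in ascending order and svd in descending order,
# so reverse the columns before comparing (signs may still differ, hence the abs)
print(np.allclose(np.abs(svd_vecs[:, ::-1]), np.abs(vet)))  # expected: True
```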