# Homework 4: CSCI 347: Data Mining
Names: Sam Behrens and Karl Molina
Show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem corresponds to which code.
1. [2 points] Consider the following matrix $A$ and vector $v$. Compute the matrix-vector product $Av$.
$$
A=\left(\begin{array}{ll}{2} & {1} \\ {1} & {3}\end{array}\right), v=\left(\begin{array}{c}{1} \\ {-1}\end{array}\right)
$$
```python
>>> import numpy as np
>>> A = np.array([[2, 1], [1, 3]])
>>> v = np.array([1, -1])
>>> A.dot(v)
array([ 1, -2])
```
$$
Av=\left(\begin{array}{c}{1} \\ {-2}\end{array}\right)
$$
Use the matrix $A$ and the data set $D$ below in all the remaining problems:
$$
A=\left(\begin{array}{cc}{\frac{\sqrt{3}}{2}} & {-\frac{1}{2}} \\ {\frac{1}{2}} & {\frac{\sqrt{3}}{2}}\end{array}\right), \quad D=\left(\begin{array}{cc}{1} & {1} \\ {1} & {2} \\ {3} & {4} \\ {-1} & {1} \\ {-1} & {1} \\ {1} & {-2} \\ {2} & {2} \\ {2} & {3}\end{array}\right)
$$
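Note (a sanity check, not required by the assignment): with $\cos 30^\circ = \frac{\sqrt{3}}{2}$ and $\sin 30^\circ = \frac{1}{2}$, this $A$ is exactly the 2-D rotation matrix for a $30^\circ$ counter-clockwise rotation, which explains the behavior seen in the plots below:

```python
import numpy as np

# A matches the 2-D rotation matrix for theta = 30 degrees (pi / 6),
# since cos(30) = sqrt(3)/2 and sin(30) = 1/2.
A = np.array([[np.sqrt(3) / 2, -1 / 2],
              [1 / 2, np.sqrt(3) / 2]])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
print(np.allclose(A, R))                # True: A is a 30-degree rotation
print(np.allclose(A.T @ A, np.eye(2)))  # True: rotations are orthogonal
```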
2. [2 points] Use Python to create a scatter plot of the data, where the x-axis is $X_1$ and the y-axis is $X_2$, and $X_1$ and $X_2$ are the first and second attributes of the data.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
plt.title('D')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.show()
```

3. [4 points] Treating each row as a 2-dimensional vector, apply the linear transformation $A$ to each row. In other words, find the matrix-vector product $Ax_i$ for each $x_i$, where $x_i$ is one row $i$ of $D$. We represent $x_i$ as a vector with two rows and one column when multiplying it by $A$. So, for example,
$$
x_{2}=\left(\begin{array}{c}
{1} \\
{2}
\end{array}\right) \text { and } A x_{2}=\left(\begin{array}{cc}
{\frac{\sqrt{3}}{2}} & {-\frac{1}{2}} \\
{\frac{1}{2}} & {\frac{\sqrt{3}}{2}}
\end{array}\right)\left(\begin{array}{l}
{1} \\
{2}
\end{array}\right)
$$
```python
>>> A = np.array(
...     [[np.sqrt(3) / 2, -1 / 2],
...      [1 / 2, np.sqrt(3) / 2]])
>>> A.dot(D.T).T
array([[ 0.3660254 ,  1.3660254 ],
       [-0.1339746 ,  2.23205081],
       [ 0.59807621,  4.96410162],
       [-1.3660254 ,  0.3660254 ],
       [-1.3660254 ,  0.3660254 ],
       [ 1.8660254 , -1.23205081],
       [ 0.73205081,  2.73205081],
       [ 0.23205081,  3.59807621]])
```
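Because $A$ is a rotation, each transformed point keeps its distance from the origin. A short sanity check (not part of the assignment) confirms this for every row:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
A = np.array([[np.sqrt(3) / 2, -1 / 2],
              [1 / 2, np.sqrt(3) / 2]])
transformed_D = A.dot(D.T).T  # one transformed point per row

# A rotation preserves each point's distance from the origin.
print(np.allclose(np.linalg.norm(D, axis=1),
                  np.linalg.norm(transformed_D, axis=1)))  # True
```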
4. [3 points] Use Python to create a plot showing both the original data and the transformed data, with the x-axis still corresponding to $X_1$ and the y-axis corresponding to $X_2$. Use different colors and markers to differentiate between the original and transformed data. That is, each transformed data point in the plot should be one matrix-vector product $Ax_i$, which is a 2-dimensional vector. Each original point in the plot should have the same coordinates as it did in Problem 2.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
A = np.array(
    [[np.sqrt(3) / 2, -1 / 2],
     [1 / 2, np.sqrt(3) / 2]])
transformed_D = A.dot(D.T).T
plt.scatter(transformed_D[:, 0], transformed_D[:, 1],
            c='red', marker='x', label='transformed_D')
plt.title('D transformed')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.legend(loc='lower right')
plt.show()
```

5. [1 point] Write down the multi-variate mean of the data. (Remember that this should be a 2-dimensional vector)
```python
>>> D.mean(axis=0)
array([1. , 1.5])
```
6. [2 points] Mean-center the data. Write down the mean-centered data matrix.
```python
>>> D - D.mean(axis=0)
array([[ 0. , -0.5],
       [ 0. ,  0.5],
       [ 2. ,  2.5],
       [-2. , -0.5],
       [-2. , -0.5],
       [ 0. , -3.5],
       [ 1. ,  0.5],
       [ 1. ,  1.5]])
```
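As a quick check, mean-centering moves the centroid of the data to the origin, so each column of the centered matrix should average to zero:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)

# After centering, every attribute has mean zero.
print(np.allclose(Z.mean(axis=0), 0))  # True
```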
7. [2 points] Use Python to create a scatter plot showing both the original data and the mean-centered data, where the x-axis is $X_1$ and the y-axis is $X_2$, and $X_1$ and $X_2$ are the first and second attributes of the data. Use different colors and markers to differentiate between the original and mean-centered data.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
mean_centered_D = D - D.mean(axis=0)
plt.scatter(mean_centered_D[:, 0], mean_centered_D[:, 1],
            c='red', marker='x', label='mean_centered_D')
plt.title('Mean Centered D')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.legend(loc='lower right')
plt.show()
```

8. [3 points] Write down the covariance matrix of the data matrix $D$. Use sample covariance.
```python
>>> np.cov(D.T)
array([[2.        , 1.28571429],
       [1.28571429, 3.14285714]])
```
9. [3 points] Write down the covariance matrix of the centered data matrix $Z$. Use sample covariance.
```python
>>> Z = D - D.mean(axis=0)
>>> np.cov(Z.T)
array([[2.        , 1.28571429],
       [1.28571429, 3.14285714]])
```
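The covariance matrices in Problems 8 and 9 are identical because covariance is shift-invariant: subtracting a constant (here, the mean) from each column changes none of the deviations around the mean. A quick check:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)

# Covariance is unaffected by mean-centering: cov(D) == cov(Z).
print(np.allclose(np.cov(D.T), np.cov(Z.T)))  # True
```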
10. [4 points] Write down the covariance matrix of the data after applying standard normalization. Use sample covariance.
```python
>>> normalized_data = Z / D.std(axis=0, ddof=1)
>>> np.cov(normalized_data.T)
array([[1.        , 0.51282259],
       [0.51282259, 1.        ]])
```
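After standard (z-score) normalization every attribute has unit sample variance, so the covariance matrix of the normalized data is exactly the correlation matrix of $D$; `np.corrcoef` confirms this:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
normalized = Z / D.std(axis=0, ddof=1)

# Covariance of z-scored data equals the correlation matrix of D.
print(np.allclose(np.cov(normalized.T), np.corrcoef(D.T)))  # True
```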
11. EXTRA CREDIT [5 points]
A. [3 points] Find the eigenvectors and eigenvalues of the matrix $C$, where $C$ is defined as follows:
$$
C=\frac{1}{n-1} Z^{T} Z
$$
where $Z$ is the mean-centered data matrix from Problem 6.
```python
>>> import numpy.linalg as LA
>>> n = len(Z)
>>> C = 1 / (n - 1) * Z.T.dot(Z)
>>> evalues, evectors = LA.eig(C)
>>> evalues
array([1.16444889, 3.97840826])
>>> evectors
array([[-0.83849224, -0.54491354],
       [ 0.54491354, -0.83849224]])
```
What is the sum of the eigenvalues?
```python
>>> evalues.sum()
5.142857142857142
```
How does it compare to the total variance in the data (smaller, larger, how close are the values)?
The total variance of $D$ is found by:
```python
>>> np.var(D, axis=0, ddof=1).sum()
5.142857142857142
```
These values are exactly the same.
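This is not a coincidence: the sum of the eigenvalues of any square matrix equals its trace, and the trace of $C$ is the sum of the per-attribute sample variances, i.e. the total variance of the data:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
C = Z.T @ Z / (len(Z) - 1)
evalues = np.linalg.eigvals(C)

# Sum of eigenvalues == trace(C) == total sample variance of D.
print(np.isclose(evalues.sum(), np.trace(C)))                    # True
print(np.isclose(np.trace(C), np.var(D, axis=0, ddof=1).sum()))  # True
```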
B. [2 points] Let $u_1$ be the 2x1 eigenvector corresponding to the larger eigenvalue. For each row $x_i$ in the data set D, find the dot product $u^T_1x_i$. Let $p$ be the vector obtained by stacking these dot products into a vector:
$$
p=\left(\begin{array}{c}
{u_{1}^{T} x_{1}} \\
{u_{1}^{T} x_{2}} \\
{u_{1}^{T} x_{3}} \\
{u_{1}^{T} x_{4}} \\
{u_{1}^{T} x_{5}} \\
{u_{1}^{T} x_{6}} \\
{u_{1}^{T} x_{7}} \\
{u_{1}^{T} x_{8}}
\end{array}\right)
$$
What is the sample variance of the data in vector $p$?
```python
>>> u1 = evectors[:, 1]  # column 1 pairs with the larger eigenvalue
>>> p = u1.dot(D.T)
>>> p_variance = np.cov(p)
>>> p_variance
array(3.97840826)
```
What fraction of the total variance of the data is the variance in $p$?
```python
>>> p_variance / np.var(D, axis=0, ddof=1).sum()
0.7735793833832251
```
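Equivalently, this fraction is the larger eigenvalue divided by the sum of the eigenvalues, which is the usual "proportion of variance explained" by the first principal component:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
C = Z.T @ Z / (len(Z) - 1)
evalues = np.linalg.eigvals(C)

# Fraction of total variance captured by the top eigenvector.
frac = evalues.max() / evalues.sum()
print(np.isclose(frac, 0.7735793833832251))  # True
```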