# Homework 4: CSCI 347: Data Mining
Names: Sam Behrens and Karl Molina
Show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem corresponds to which code.
1. [2 points] Consider the following matrix $A$ and vector $v$. Compute the matrix-vector product $Av$.
$$
A=\left(\begin{array}{ll}{2} & {1} \\ {1} & {3}\end{array}\right), v=\left(\begin{array}{c}{1} \\ {-1}\end{array}\right)
$$
```python
>>> import numpy as np
>>> A = np.array([[2, 1], [1, 3]])
>>> v = np.array([1, -1])
>>> A.dot(v)
array([ 1, -2])
```
$$
Av=\left(\begin{array}{c}{1} \\ {-2}\end{array}\right)
$$
Use the matrix $A$ and the data set $D$ below in all the remaining problems:
$$
A=\left(\begin{array}{cc}{\frac{\sqrt{3}}{2}} & {-\frac{1}{2}} \\ {\frac{1}{2}} & {\frac{\sqrt{3}}{2}}\end{array}\right), \quad D=\left(\begin{array}{cc}{1} & {1} \\ {1} & {2} \\ {3} & {4} \\ {-1} & {1} \\ {-1} & {1} \\ {1} & {-2} \\ {2} & {2} \\ {2} & {3}\end{array}\right)
$$
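Note (a sanity check, not required by the assignment): with $\cos 30^\circ = \frac{\sqrt{3}}{2}$ and $\sin 30^\circ = \frac{1}{2}$, this $A$ is exactly the 2-D rotation matrix for a $30^\circ$ counter-clockwise rotation, which explains the behavior seen in the plots below:

```python
import numpy as np

# A matches the 2-D rotation matrix for theta = 30 degrees (pi / 6),
# since cos(30) = sqrt(3)/2 and sin(30) = 1/2.
A = np.array([[np.sqrt(3) / 2, -1 / 2],
              [1 / 2, np.sqrt(3) / 2]])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
print(np.allclose(A, R))                # True: A is a 30-degree rotation
print(np.allclose(A.T @ A, np.eye(2)))  # True: rotations are orthogonal
```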
2. [2 points] Use Python to create a scatter plot of the data, where the x-axis is $X_1$ and the y-axis is $X_2$, and $X_1$ and $X_2$ are the first and second attributes of the data.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
plt.title('D')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.show()
```

3. [4 points] Treating each row as a 2-dimensional vector, apply the linear transformation $A$ to each row. In other words, find the matrix-vector product $Ax_i$ for each $x_i$, where $x_i$ is one row $i$ of $D$. We represent $x_i$ as a vector with two rows and one column when multiplying it by $A$. So, for example,
$$
x_{2}=\left(\begin{array}{c}
{1} \\
{2}
\end{array}\right) \text { and } A x_{2}=\left(\begin{array}{cc}
{\frac{\sqrt{3}}{2}} & {-\frac{1}{2}} \\
{\frac{1}{2}} & {\frac{\sqrt{3}}{2}}
\end{array}\right)\left(\begin{array}{l}
{1} \\
{2}
\end{array}\right)
$$
```python
>>> A = np.array(
...     [[np.sqrt(3) / 2, -1 / 2],
...      [1 / 2, np.sqrt(3) / 2]])
>>> A.dot(D.T).T
array([[ 0.3660254 ,  1.3660254 ],
       [-0.1339746 ,  2.23205081],
       [ 0.59807621,  4.96410162],
       [-1.3660254 ,  0.3660254 ],
       [-1.3660254 ,  0.3660254 ],
       [ 1.8660254 , -1.23205081],
       [ 0.73205081,  2.73205081],
       [ 0.23205081,  3.59807621]])
```
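Because $A$ is a rotation, each transformed point keeps its distance from the origin. A short sanity check (not part of the assignment) confirms this for every row:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
A = np.array([[np.sqrt(3) / 2, -1 / 2],
              [1 / 2, np.sqrt(3) / 2]])
transformed_D = A.dot(D.T).T  # one transformed point per row

# A rotation preserves each point's distance from the origin.
print(np.allclose(np.linalg.norm(D, axis=1),
                  np.linalg.norm(transformed_D, axis=1)))  # True
```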
4. [3 points] Use Python to create a plot showing both the original data and the transformed data, with the x-axis still corresponding to $X_1$ and the y-axis corresponding to $X_2$. Use different colors and markers to differentiate between the original and transformed data. That is, each transformed data point in the plot should be one matrix-vector product $Ax_i$, which is a 2-dimensional vector. Each original point in the plot should have the same coordinates as it did in Problem 2.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
A = np.array(
    [[np.sqrt(3) / 2, -1 / 2],
     [1 / 2, np.sqrt(3) / 2]])
transformed_D = A.dot(D.T).T
plt.scatter(transformed_D[:, 0], transformed_D[:, 1],
            c='red', marker='x', label='transformed_D')
plt.title('D transformed')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.legend(loc='lower right')
plt.show()
```

5. [1 point] Write down the multi-variate mean of the data. (Remember that this should be a 2-dimensional vector)
```python
>>> D.mean(axis=0)
array([1. , 1.5])
```
6. [2 points] Mean-center the data. Write down the mean-centered data matrix.
```python
>>> D - D.mean(axis=0)
array([[ 0. , -0.5],
       [ 0. ,  0.5],
       [ 2. ,  2.5],
       [-2. , -0.5],
       [-2. , -0.5],
       [ 0. , -3.5],
       [ 1. ,  0.5],
       [ 1. ,  1.5]])
```
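As a quick check, mean-centering moves the centroid of the data to the origin, so each column of the centered matrix should average to zero:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)

# After centering, every attribute has mean zero.
print(np.allclose(Z.mean(axis=0), 0))  # True
```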
7. [2 points] Use Python to create a scatter plot showing both the original data and the mean-centered data, where the x-axis is $X_1$ and the y-axis is $X_2$, and $X_1$ and $X_2$ are the first and second attributes of the data. Use different colors and markers to differentiate between the original and mean-centered data.
```python
import matplotlib.pyplot as plt
import numpy as np
D = np.array(
    [[1, 1], [1, 2], [3, 4], [-1, 1],
     [-1, 1], [1, -2], [2, 2], [2, 3]])
plt.scatter(D[:, 0], D[:, 1], c='blue', label='D')
mean_centered_D = D - D.mean(axis=0)
plt.scatter(mean_centered_D[:, 0], mean_centered_D[:, 1],
            c='red', marker='x', label='mean_centered_D')
plt.title('Mean Centered D')
plt.xlabel('X1')
plt.ylabel('X2')
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.axis('equal')
plt.legend(loc='lower right')
plt.show()
```

8. [3 points] Write down the covariance matrix of the data matrix $D$. Use sample covariance.
```python
>>> np.cov(D.T)
array([[2.        , 1.28571429],
       [1.28571429, 3.14285714]])
```
9. [3 points] Write down the covariance matrix of the centered data matrix $Z$. Use sample covariance.
```python
>>> Z = D - D.mean(axis=0)
>>> np.cov(Z.T)
array([[2.        , 1.28571429],
       [1.28571429, 3.14285714]])
```
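The covariance matrices in Problems 8 and 9 are identical because covariance is shift-invariant: subtracting a constant (here, the mean) from each column changes none of the deviations around the mean. A quick check:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)

# Covariance is unaffected by mean-centering: cov(D) == cov(Z).
print(np.allclose(np.cov(D.T), np.cov(Z.T)))  # True
```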
10. [4 points] Write down the covariance matrix of the data after applying standard normalization. Use sample covariance.
```python
>>> normalized_data = Z / D.std(axis=0, ddof=1)
>>> np.cov(normalized_data.T)
array([[1.        , 0.51282259],
       [0.51282259, 1.        ]])
```
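After standard (z-score) normalization every attribute has unit sample variance, so the covariance matrix of the normalized data is exactly the correlation matrix of $D$; `np.corrcoef` confirms this:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
normalized = Z / D.std(axis=0, ddof=1)

# Covariance of z-scored data equals the correlation matrix of D.
print(np.allclose(np.cov(normalized.T), np.corrcoef(D.T)))  # True
```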
11. EXTRA CREDIT [5 points]
A. [3 points] Find the eigenvectors and eigenvalues of the matrix $C$, where $C$ is defined as follows:
$$
C=\frac{1}{n-1} Z^{T} Z
$$
where $Z$ is the mean-centered data matrix from Problem 6.
```python
>>> import numpy.linalg as LA
>>> n = len(Z)
>>> C = 1 / (n - 1) * Z.T.dot(Z)
>>> evalues, evectors = LA.eig(C)
>>> evalues
array([1.16444889, 3.97840826])
>>> evectors
array([[-0.83849224, -0.54491354],
       [ 0.54491354, -0.83849224]])
```
What is the sum of the eigenvalues?
```python
>>> evalues.sum()
5.142857142857142
```
How does it compare to the total variance in the data (smaller, larger, how close are the values)?
The total variance of $D$ is found by:
```python
>>> np.var(D, axis=0, ddof=1).sum()
5.142857142857142
```
These values are exactly the same.
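This is not a coincidence: the sum of the eigenvalues of any square matrix equals its trace, and the trace of $C$ is the sum of the per-attribute sample variances, i.e. the total variance of the data:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
C = Z.T @ Z / (len(Z) - 1)
evalues = np.linalg.eigvals(C)

# Sum of eigenvalues == trace(C) == total sample variance of D.
print(np.isclose(evalues.sum(), np.trace(C)))                    # True
print(np.isclose(np.trace(C), np.var(D, axis=0, ddof=1).sum()))  # True
```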
B. [2 points] Let $u_1$ be the 2x1 eigenvector corresponding to the larger eigenvalue. For each row $x_i$ in the data set D, find the dot product $u^T_1x_i$. Let $p$ be the vector obtained by stacking these dot products into a vector:
$$
p=\left(\begin{array}{c}
{u_{1}^{T} x_{1}} \\
{u_{1}^{T} x_{2}} \\
{u_{1}^{T} x_{3}} \\
{u_{1}^{T} x_{4}} \\
{u_{1}^{T} x_{5}} \\
{u_{1}^{T} x_{6}} \\
{u_{1}^{T} x_{7}} \\
{u_{1}^{T} x_{8}}
\end{array}\right)
$$
What is the sample variance of the data in vector $p$?
```python
>>> u1 = evectors[:, 1]  # column 1 pairs with the larger eigenvalue
>>> p = u1.dot(D.T)
>>> p_variance = np.cov(p)
>>> p_variance
array(3.97840826)
```
What fraction of the total variance of the data is the variance in $p$?
```python
>>> p_variance / np.var(D, axis=0, ddof=1).sum()
0.7735793833832251
```
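Equivalently, this fraction is the larger eigenvalue divided by the sum of the eigenvalues, which is the usual "proportion of variance explained" by the first principal component:

```python
import numpy as np

D = np.array([[1, 1], [1, 2], [3, 4], [-1, 1],
              [-1, 1], [1, -2], [2, 2], [2, 3]])
Z = D - D.mean(axis=0)
C = Z.T @ Z / (len(Z) - 1)
evalues = np.linalg.eigvals(C)

# Fraction of total variance captured by the top eigenvector.
frac = evalues.max() / evalues.sum()
print(np.isclose(frac, 0.7735793833832251))  # True
```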