# Educational Data Mining and Applications HW1
## 110590064, Computer Science and Information Engineering, Year 3, 劉韶軒
### Problem 1
> 2.2: Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
>
> (e) Give the five-number summary of the data.
> (f) Show a boxplot of the data.
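### Answer (e)
Five-number summary: Min, Q1, median (Q2), Q3, Max
Min = 13
Q1 = 20.5 (linear interpolation between the 7th and 8th values, 20 and 21, which is what `np.percentile` does by default)
Q2 = 25
Q3 = 35
Max = 70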
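The summary above can be double-checked with the standard library alone. This is a minimal sketch, not the only valid convention: it assumes the "inclusive" quantile method, which interpolates linearly between data points, and the variable names are chosen here for illustration.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# "inclusive" interpolates linearly between neighbouring data points.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number = (min(ages), q1, q2, q3, max(ages))

# "exclusive" is another common convention; it can give a different Q1.
q1_exclusive = statistics.quantiles(ages, n=4, method="exclusive")[0]

print("five-number summary:", five_number)
print("Q1 with the exclusive method:", q1_exclusive)
```

Different quartile conventions can disagree slightly on Q1 for this data set, so it is worth stating which one was used.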
### Answer (f)
```python
import matplotlib.pyplot as plt
import numpy as np
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
# Calculate quartiles and other statistics
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # Median
q3 = np.percentile(data, 75)
min_value = min(data)
max_value = max(data)
# Create a boxplot
plt.boxplot(data)
# Set labels for the plot
plt.title("Boxplot of Age Data")
plt.ylabel("Age")
# Annotate the plot with labels
plt.annotate(f'Q1 = {q1}', xy=(1, q1), xytext=(1.2, q1), arrowprops=dict(arrowstyle='->'))
plt.annotate(f'Q2 (Median) = {q2}', xy=(1, q2), xytext=(1.2, q2), arrowprops=dict(arrowstyle='->'))
plt.annotate(f'Q3 = {q3}', xy=(1, q3), xytext=(1.2, q3), arrowprops=dict(arrowstyle='->'))
plt.annotate(f'Min = {min_value}', xy=(1, min_value), xytext=(1.2, min_value), arrowprops=dict(arrowstyle='->'))
plt.annotate(f'Max = {max_value}', xy=(1, max_value), xytext=(1.2, max_value), arrowprops=dict(arrowstyle='->'))
# Show the plot
plt.show()
```

### Problem 2
> Suppose we have the following 2-D data set:
>
> X1 = (1.5, 1.7), X2 = (2.0, 1.9), X3 = (1.6, 1.8), X4 = (1.2, 1.5), X5 = (1.5, 1.0)
>
> (a) Consider the data as 2-D data points. Given a new data point, x=(1.4,1.6) as a query, rank the database points based on similarity with the query using Euclidean distance, Manhattan distance, supremum distance, and cosine similarity.
> (b) Normalize the data set to make the (Euclidean) norm of each data point equal to 1. Use Euclidean distance on the transformed data to rank the data points.
### Answer (a)
| Point | Euclidean distance | Manhattan distance |
| --- | --- | --- |
| X1(1.5,1.7) | $\sqrt{(1.4-1.5)^2+(1.6-1.7)^2}=\sqrt{0.02}\approx 0.1414$ | $\vert 1.4-1.5\vert+\vert 1.6-1.7\vert=0.2$ |
| X2(2.0,1.9) | $\sqrt{(1.4-2.0)^2+(1.6-1.9)^2}=\sqrt{0.45}\approx 0.6708$ | $\vert 1.4-2.0\vert+\vert 1.6-1.9\vert=0.9$ |
| X3(1.6,1.8) | $\sqrt{(1.4-1.6)^2+(1.6-1.8)^2}=\sqrt{0.08}\approx 0.2828$ | $\vert 1.4-1.6\vert+\vert 1.6-1.8\vert=0.4$ |
| X4(1.2,1.5) | $\sqrt{(1.4-1.2)^2+(1.6-1.5)^2}=\sqrt{0.05}\approx 0.2236$ | $\vert 1.4-1.2\vert+\vert 1.6-1.5\vert=0.3$ |
| X5(1.5,1.0) | $\sqrt{(1.4-1.5)^2+(1.6-1.0)^2}=\sqrt{0.37}\approx 0.6083$ | $\vert 1.4-1.5\vert+\vert 1.6-1.0\vert=0.7$ |

| Point | Supremum distance | Cosine similarity |
| --- | --- | --- |
| X1(1.5,1.7) | $\max(\vert 1.4-1.5\vert,\vert 1.6-1.7\vert)=0.1$ | $\frac{1.4\cdot 1.5+1.6\cdot 1.7}{\sqrt{1.4^2+1.6^2}\cdot\sqrt{1.5^2+1.7^2}}=0.999991$ |
| X2(2.0,1.9) | $\max(\vert 1.4-2.0\vert,\vert 1.6-1.9\vert)=0.6$ | $\frac{1.4\cdot 2.0+1.6\cdot 1.9}{\sqrt{1.4^2+1.6^2}\cdot\sqrt{2.0^2+1.9^2}}=0.995752$ |
| X3(1.6,1.8) | $\max(\vert 1.4-1.6\vert,\vert 1.6-1.8\vert)=0.2$ | $\frac{1.4\cdot 1.6+1.6\cdot 1.8}{\sqrt{1.4^2+1.6^2}\cdot\sqrt{1.6^2+1.8^2}}=0.999969$ |
| X4(1.2,1.5) | $\max(\vert 1.4-1.2\vert,\vert 1.6-1.5\vert)=0.2$ | $\frac{1.4\cdot 1.2+1.6\cdot 1.5}{\sqrt{1.4^2+1.6^2}\cdot\sqrt{1.2^2+1.5^2}}=0.999028$ |
| X5(1.5,1.0) | $\max(\vert 1.4-1.5\vert,\vert 1.6-1.0\vert)=0.6$ | $\frac{1.4\cdot 1.5+1.6\cdot 1.0}{\sqrt{1.4^2+1.6^2}\cdot\sqrt{1.5^2+1.0^2}}=0.965363$ |

Ranking (most similar first; smaller distance or larger cosine similarity means more similar):
Euclidean distance: X1, X4, X3, X5, X2
Manhattan distance: X1, X4, X3, X5, X2
Supremum distance: X1, then X3 and X4 (tied at 0.2), then X2 and X5 (tied at 0.6)
Cosine similarity: X1, X3, X4, X2, X5
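The four measures in the tables above can be recomputed and turned into rankings with a short script. This is a sketch using only the standard library; the helper names (`euclidean`, `manhattan`, `supremum`, `cosine`, `rank`) are illustrative choices, and scores are rounded before sorting so that floating-point noise does not reorder exact ties.

```python
import math

query = (1.4, 1.6)
points = {
    "X1": (1.5, 1.7),
    "X2": (2.0, 1.9),
    "X3": (1.6, 1.8),
    "X4": (1.2, 1.5),
    "X5": (1.5, 1.0),
}

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def supremum(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank(measure, larger_is_closer=False):
    # Round so that ties such as 0.2 vs 0.2000000000000002 stay ties;
    # sorted() is stable, so tied points keep their X1..X5 order.
    def score(name):
        return round(measure(query, points[name]), 10)
    return sorted(points, key=score, reverse=larger_is_closer)

for label, measure, flip in [
    ("Euclidean", euclidean, False),
    ("Manhattan", manhattan, False),
    ("Supremum", supremum, False),
    ("Cosine", cosine, True),
]:
    print(label, rank(measure, larger_is_closer=flip))
```

Cosine similarity is the only measure here where a larger value means a closer match, which is why it is sorted in descending order.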
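### Answer (b)
| Original point | Euclidean norm | Normalized point | Euclidean distance to normalized x |
|-------------|--------------|------------------|------------------|
| x(1.4,1.6) | $\sqrt{1.4^2+1.6^2}=2.12603$ | x(0.6585,0.7526) | 0 |
| X1(1.5,1.7) | $\sqrt{1.5^2+1.7^2}=2.26716$ | x1(0.6616,0.7498) | 0.0041 |
| X2(2.0,1.9) | $\sqrt{2.0^2+1.9^2}=2.75862$ | x2(0.7250,0.6887) | 0.0922 |
| X3(1.6,1.8) | $\sqrt{1.6^2+1.8^2}=2.40832$ | x3(0.6644,0.7474) | 0.0078 |
| X4(1.2,1.5) | $\sqrt{1.2^2+1.5^2}=1.92094$ | x4(0.6247,0.7809) | 0.0441 |
| X5(1.5,1.0) | $\sqrt{1.5^2+1.0^2}=1.80278$ | x5(0.8321,0.5547) | 0.2632 |

Ranking by Euclidean distance on the normalized data (most similar first): X1, X3, X4, X2, X5. This is the same order as the cosine-similarity ranking, because for unit vectors $d^2 = 2 - 2\cos\theta$.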
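The normalization step and the resulting ranking can be reproduced as follows. This is a small sketch under the same data as above; `unit` is a hypothetical helper name introduced here, not something from the textbook.

```python
import math

query = (1.4, 1.6)
points = {
    "X1": (1.5, 1.7),
    "X2": (2.0, 1.9),
    "X3": (1.6, 1.8),
    "X4": (1.2, 1.5),
    "X5": (1.5, 1.0),
}

def unit(p):
    # Divide each coordinate by the Euclidean norm so the result has norm 1.
    norm = math.hypot(*p)
    return tuple(c / norm for c in p)

unit_query = unit(query)
distances = {name: math.dist(unit_query, unit(p)) for name, p in points.items()}
ranking = sorted(distances, key=distances.get)

for name in ranking:
    print(name, round(distances[name], 4))
```

After normalization only the direction of each point matters, so the ranking depends on the angle to the query rather than on magnitude.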
### Problem 3
> 3.3: Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
> (a) Use smoothing by bin means to smooth these data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
> (b) How might you determine outliers in the data?
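#### Answer (a)
Step 1: the data are already sorted, so partition them into equal-depth bins of 3 values each (27 values give 9 bins).
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70
Step 2: compute the mean of each bin.
Bin means: 14.67, 18.33, 21.0, 24.0, 26.67, 33.67, 35.0, 40.33, 56.0
Step 3: replace every value with the mean of its bin.
Smoothed data: 14.67, 14.67, 14.67, 18.33, 18.33, 18.33, 21.0, 21.0, 21.0, 24.0, 24.0, 24.0, 26.67, 26.67, 26.67, 33.67, 33.67, 33.67, 35.0, 35.0, 35.0, 40.33, 40.33, 40.33, 56.0, 56.0, 56.0
Effect: small local fluctuations are removed, but with a depth of only 3 the smoothed values stay close to the original data. The last bin mean (56.0) is pulled upward by the extreme value 70.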
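The binning steps can be written as a few lines of Python. This is a sketch that assumes the data are already sorted and that the number of values is a multiple of the bin depth, which holds here (27 values, depth 3).

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
depth = 3

# Equal-depth partitioning: consecutive, non-overlapping slices of the sorted data.
bins = [ages[i:i + depth] for i in range(0, len(ages), depth)]

# Smoothing by bin means: every value is replaced by the mean of its bin.
bin_means = [round(sum(b) / len(b), 2) for b in bins]
smoothed = [mean for mean, b in zip(bin_means, bins) for _ in b]

for number, (b, mean) in enumerate(zip(bins, bin_means), start=1):
    print(f"Bin {number}: {b} -> mean {mean}")
print(smoothed)
```

Each value belongs to exactly one bin, so the smoothed list has the same length as the input.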
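#### Answer (b)
Two common options are the z-score method and the IQR (interquartile range) method.
IQR method: flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. With Q1 = 20.5 and Q3 = 35, IQR = 14.5 and the fences are -1.25 and 56.75, so 70 is flagged.
Z-score method: flag values whose |z| exceeds a threshold (commonly 3). For 70, z = (70 - 29.96) / 12.94 ≈ 3.09, so it is flagged as well.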
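Both methods can be tried directly on the age data. The sketch below makes two assumptions that are conventions rather than part of the exercise: the 1.5 × IQR fences with quartiles from the "inclusive" method, and a z-score threshold of 3 with the sample standard deviation.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# Z-score method: values more than 3 sample standard deviations from the mean.
mean = statistics.mean(ages)
std = statistics.stdev(ages)
z_outliers = [a for a in ages if abs((a - mean) / std) > 3]

print("IQR fences:", lower_fence, upper_fence, "outliers:", iqr_outliers)
print("z-score outliers:", z_outliers)
```

The threshold values are tunable; a stricter or looser cutoff changes which points are reported.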
### Problem 4
> Using the data for age given in Exercise 3.3, answer the following:
> (a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
> (b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
#### Answer (a)
Min-max normalization onto [0.0, 1.0]: $v'=\frac{v-\min}{\max-\min}$
$v'=\frac{35-13}{70-13}=\frac{22}{57}\approx 0.38596$
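The same calculation as a small function. This is a sketch; `min_max` and its `new_lo`/`new_hi` parameters are names introduced here to show the general form for an arbitrary target range.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # Map v from [lo, hi] linearly onto [new_lo, new_hi].
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

scaled_35 = min_max(35, min(ages), max(ages))
print(round(scaled_35, 5))
```

The minimum maps to 0.0 and the maximum to 1.0, so every other value lands strictly in between.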
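#### Answer (b)
Mean: $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i=\frac{809}{27}\approx 29.96$
Standard deviation (given): $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}=12.94$
Z-score: $z=\frac{v-\bar{x}}{s}$
$z=\frac{35-29.96}{12.94}\approx 0.39$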
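As a check, the mean and the standard deviation can be computed from the data instead of being typed in. The sketch below assumes the sample standard deviation (dividing by n - 1), which reproduces the 12.94 quoted in the exercise.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)   # 809 / 27
std = statistics.stdev(ages)   # sample standard deviation (n - 1)
z_35 = (35 - mean) / std

print(f"mean = {mean:.2f}, std = {std:.2f}, z(35) = {z_35:.2f}")
```

A positive z-score means the value 35 lies above the mean, here by a bit less than half a standard deviation.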
### Problem 5
> 3.8: Using the data for age and body fat given in Exercise 2.4, answer the following:
>
> (b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two attributes positively or negatively correlated? Compute their covariance.
#### Answer
```python
age_data = [23,23,27,27,39,41,47,49,50,52,54,54,56,57,58,58,60,61]
body_fat_data = [9.5,26.5,7.8,17.8,31.4,25.9,27.4,
27.2,31.2,34.6,42.5,28.8,33.4,30.2,34.1,32.9,41.2,35.7]
```
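$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} age_i = 46.44$
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} bodyfat_i = 28.78$
$cov=\frac{1}{n-1}\sum_{i=1}^{n}{(age_i- \bar x)(bodyfat_i-\bar y)}=100.0196$
$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (age_i - \bar{x})^2}=13.219$
$s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (bodyfat_i - \bar{y})^2}=9.254$
$r=\frac{cov}{s_x \cdot s_y}=0.817618$
The covariance above is the sample covariance (dividing by $n-1$), consistent with $s_x$ and $s_y$. Dividing by $n$ instead gives about 94.46, and $r$ is unchanged as long as the same convention is used throughout.
Since $r > 0$, age and body fat are positively correlated.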
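The numbers above can be reproduced with plain Python. This sketch uses the sample (n - 1) convention for both the covariance and the standard deviations; the variable names are chosen here for illustration.

```python
import math

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
body_fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
            34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(body_fat) / n

# Sample covariance and sample standard deviations (all divide by n - 1).
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, body_fat)) / (n - 1)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in age) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in body_fat) / (n - 1))

r = cov / (s_x * s_y)
print(f"cov = {cov:.4f}, s_x = {s_x:.3f}, s_y = {s_y:.3f}, r = {r:.4f}")
```

The sign of `r` answers the correlation question directly: positive means the two attributes tend to increase together.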
### Problem 6
> Plot an equal-width histogram of width 10.
#### Answer
```python
import matplotlib.pyplot as plt
# Original data in ascending order
age_data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
# Set the bin width
bin_width = 10
# Create a histogram with specified bin width
plt.hist(age_data, bins=range(min(age_data), max(age_data) + bin_width, bin_width), edgecolor='black')
# Add labels and title
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Equal-Width Histogram (Bin Width = 10)")
# Show the histogram
plt.show()
```
