# Chapter 2: Getting to Know Your Data
###### tags: `Data Mining Notes`
## Attributes
$def:$ a data field representing a characteristic or feature of a data object
* **Nominal Attributes**(categorical): Each value represents some kind of category, code, or state.
* **Binary Attributes**(Boolean): with only two categories or states: 0 or 1
    - symmetric: no preference on which outcome should be coded as 0 or 1 (e.g., male, female)
    - asymmetric: the two states are not equally important (e.g., the *positive* and *negative* outcomes of a medical test)
* **Ordinal Attributes**: have a meaningful order or ranking
- useful for registering subjective assessments of qualities that cannot be measured
* **Numeric Attributes**: a measurable quantity, represented in integer or real values
* Interval-scaled attributes: a scale of equal-size units(e.g. temperature)
* Ratio-scaled attributes: have an inherent zero point, so we can speak of a value as being a multiple (or ratio) of another value (e.g., the Kelvin scale)
* **Discrete** versus **Continuous** Attributes:
A discrete attribute has a finite or countably infinite set of values. In practice, real values are written with a finite number of digits, and continuous attributes are typically represented as *floating-point* variables.
## Measuring the Central Tendency: Mean, Median, and Mode
### Mean
1. **Weighted** arithmetic mean: $$\bar x= \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}$$
- Con: sensitive to extreme (e.g., outlier) values.
2. **Trimmed** mean: the mean obtained after chopping off values at the high and low extremes
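Both means can be sketched in pure Python (a minimal sketch; the data values and the 10% trim fraction below are made up for illustration):

```python
# Weighted arithmetic mean and trimmed mean, pure Python.

def weighted_mean(values, weights):
    # x̄ = Σ w_i x_i / Σ w_i
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, trim=0.1):
    # Chop off the lowest and highest `trim` fraction, then average the rest.
    xs = sorted(values)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]      # 100 is an outlier
print(weighted_mean(data, [1] * len(data)))  # plain mean: 14.5, pulled up by 100
print(trimmed_mean(data, trim=0.1))          # 5.5, the outlier is chopped off
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean, which makes the outlier sensitivity easy to see side by side.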
### Median
* Pro: For skewed (asymmetric) data, median is a better measure.
* Con: expensive to compute for a large number of observations (the data must be sorted first)
* Approximate the median:
    * Assume the data are grouped in intervals and the frequency (i.e., number of data values) of each interval is known. The interval that contains the median (i.e., the interval in which the $N/2$-th value falls) is the **median interval**.
* $$median = L_1 + (\frac{N/2-(\Sigma freq)_l}{freq_{median}})width$$
* $L_1$: lower boundary of the median interval
* $N$: is the number of values
* $(\Sigma freq)_l$ : the sum of the frequencies of all of the intervals that are lower than the median interval
* $width$: width of the median interval.
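The interpolation formula above can be sketched directly (a hypothetical set of intervals and frequencies; `grouped_median` is an illustrative name, not a library function):

```python
# Approximate median from grouped frequency data:
# median = L1 + (N/2 - cum_freq_below) / freq_median * width
def grouped_median(intervals, freqs):
    # intervals: list of (lower, upper) boundaries; freqs: count per interval
    N = sum(freqs)
    cum = 0                                   # (Σ freq)_l so far
    for (L1, U1), f in zip(intervals, freqs):
        if cum + f >= N / 2:                  # this interval contains N/2
            width = U1 - L1
            return L1 + (N / 2 - cum) / f * width
        cum += f

intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs     = [5, 20, 40, 15]                   # N = 80, so N/2 = 40
print(grouped_median(intervals, freqs))       # 20 + (40 - 25)/40 * 10 = 23.75
```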
### Mode
1. $def:$ value that occurs most frequently in the set
2. A data set with two or more modes is **multimodal**.
3. If each data value occurs only once, then there is no mode.
* Property
For unimodal numeric data that are moderately skewed (asymmetric), the following empirical relation holds:
* $mean - mode \approx 3 \times (mean - median)$
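A minimal sketch of mode-finding with `collections.Counter`, covering the multimodal and no-mode cases above (toy values):

```python
# Mode(s) of a data set: the value(s) occurring most frequently.
from collections import Counter

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []              # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 4]))  # [2, 3] -> two modes, i.e. bimodal
print(modes([1, 2, 3]))           # []     -> no mode
print(modes([5, 5, 5, 1]))        # [5]    -> unimodal
```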
### Midrange ($\frac{max+min}{2}$)
---
## Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
### Range
* $Def$: difference between the largest (max()) and smallest (min()) values
### k-Quantiles
* $Def$: Points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
* quartiles: $k = 4$; percentiles: $k = 100$
### Interquartile Range (IQR)
* $Def$: $Q_3-Q_1$ in **quartiles**
* A common rule of thumb for identifying suspected **outliers** is to single out values falling at least $1.5 × IQR$ above $Q_3$ or below $Q_1$
### Five-Number Summary
* $Minimum,\ Q1,\ Median(Q2),\ Q3,\ Maximum$
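The five-number summary and the $1.5 \times IQR$ outlier rule can be computed as below (a sketch; `quartile` uses linear interpolation between ranks, which is one of several quartile conventions, and the data are made up):

```python
# Five-number summary and 1.5*IQR outlier fences.
def quartile(xs, q):
    xs = sorted(xs)
    idx = (len(xs) - 1) * q                   # fractional rank
    lo, hi = int(idx), min(int(idx) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (idx - int(idx))

data = [7, 15, 36, 39, 40, 41]
q1, med, q3 = (quartile(data, q) for q in (0.25, 0.5, 0.75))
iqr = q3 - q1
print(min(data), q1, med, q3, max(data))      # 7 20.25 37.5 39.75 41
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)                               # [] -> nothing beyond the fences
```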
### Boxplots
* Visualize the five-number summary: the box spans $Q_1$ to $Q_3$ (its length is the IQR), the line inside the box marks the median, and the whiskers extend to the extreme values within $1.5 \times IQR$; points beyond the whiskers are plotted individually as suspected outliers.

### Variance and Standard Deviation
* Measure data dispersion.
* low standard deviation: data observations close to the mean
* high standard deviation: spread out over a large range of values
* Formula:
* $$ \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \bar x)^2 = \left(\frac{1}{N} \sum_{i=1}^N x_i^2\right) - \bar x^2 $$
* Variance: $\sigma^2$
* Standard deviation: $\sigma$
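Both forms of the variance formula give the same result, which is a quick way to sanity-check the equation (toy data):

```python
# Population variance, computed two equivalent ways.
import math

def variance(xs):
    # Definition form: mean of squared deviations from the mean.
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def variance_shortcut(xs):
    # Shortcut form: mean of squares minus square of the mean.
    mean = sum(xs) / len(xs)
    return sum(x * x for x in xs) / len(xs) - mean ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data), variance_shortcut(data))  # both 4.0
print(math.sqrt(variance(data)))                # standard deviation: 2.0
```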
## Data Visualization (Skipped)
* **Pixel-oriented visualization**: Create m windows on the screen, one for each dimension.

* **Geometric Projection Visualization**: Find interesting projections of multidimensional data sets.


* **Icon-Based Visualization**: Use small icons to represent multidimensional data values.
* Chernoff faces: help reveal trends in the data

## Measuring Data Similarity and Dissimilarity
### Data Matrix vs. Dissimilarity Matrix
* **Data matrix** (or object-by-attribute structure):
* $n$-by-$p$ matrix ($n$ objects, $p$ attributes)

* **Dissimilarity matrix** (or object-by-object structure):
* n-by-n table

* $d(i, j)$ is the measured dissimilarity or “difference” between objects $i$ and $j$
    * For nominal attributes: $d(i, j) = \frac{p-m}{p}$
* m: number of matches
* p: total number of attributes
* similarity: $sim(i, j) = 1 - d(i, j)$
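A sketch of the nominal-attribute dissimilarity $d(i,j) = (p-m)/p$ on made-up objects (`nominal_dissim` is an illustrative name, not a library function):

```python
# Dissimilarity between two objects described by p nominal attributes:
# count the matches m, then d = (p - m) / p.
def nominal_dissim(a, b):
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

obj1 = ["red", "circle", "small"]
obj2 = ["red", "square", "small"]
print(nominal_dissim(obj1, obj2))      # 2 of 3 attributes match -> 1/3
print(1 - nominal_dissim(obj1, obj2))  # similarity sim(i, j) = 2/3
```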
### Proximity Measures for Binary Attributes

* Contingency counts over the $p$ binary attributes: $q$ = number of attributes equal to 1 for both objects, $r$ = 1 for $i$ but 0 for $j$, $s$ = 0 for $i$ but 1 for $j$, $t$ = 0 for both ($q + r + s + t = p$)
* Symmetric binary dissimilarity
    * $d(i, j) = \frac{r+s}{q+r+s+t}$
* Asymmetric binary dissimilarity (negative matches $t$ are considered unimportant and ignored)
    * $d(i, j) = \frac{r+s}{q+r+s}$
* Asymmetric binary similarity: $sim(i, j) = \frac{q}{q+r+s} = 1 - d(i, j)$, known as the **Jaccard coefficient**
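The contingency counts $q, r, s, t$ and both dissimilarities can be sketched as follows (toy binary vectors; in the asymmetric case the negative matches $t$ drop out of the denominator):

```python
# Binary dissimilarity from the 2x2 contingency counts:
# q (both 1), r (i=1, j=0), s (i=0, j=1), t (both 0).
def binary_counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 0, 0, 0, 1]
q, r, s, t = binary_counts(i, j)           # q=1, r=1, s=1, t=3
print((r + s) / (q + r + s + t))           # symmetric:  2/6 ~ 0.333
print((r + s) / (q + r + s))               # asymmetric: 2/3 ~ 0.667
```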
### Dissimilarity of Numeric Data
* **Euclidean distance**:
$$d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + ... + (x_{ip}-x_{jp})^2}$$
* **Manhattan distance**
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + ... + |x_{ip}-x_{jp}|$$
* Some properties:
* Non-negativity: $d(i,j)\geq0$
* Identity of indiscernibles: $d(i,i)=0$
* Symmetry: $d(i,j) = d(j,i)$
    * Triangle inequality: $d(i,j) \leq d(i,k)+d(k,j)$
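A minimal sketch of both distances, plus a spot-check of the triangle inequality $d(i,j) \leq d(i,k) + d(k,j)$ on made-up points:

```python
# Euclidean (L2) and Manhattan (L1) distances between numeric vectors.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

x1, x2, x3 = (1, 2), (3, 5), (2, 0)
print(euclidean(x1, x2))   # sqrt(4 + 9) ~ 3.61
print(manhattan(x1, x2))   # 2 + 3 = 5
# Triangle inequality: going through x3 is never shorter than going direct.
assert euclidean(x1, x2) <= euclidean(x1, x3) + euclidean(x3, x2)
assert manhattan(x1, x2) <= manhattan(x1, x3) + manhattan(x3, x2)
```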
* **Cosine Similarity**
$$sim(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
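Cosine similarity in pure Python (toy term-frequency vectors; note the result depends only on the vectors' directions, not their lengths):

```python
# Cosine similarity: cosine of the angle between two vectors.
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine_sim([5, 0, 3, 0, 2], [3, 0, 2, 0, 1]))  # ~0.997, nearly parallel
print(cosine_sim([1, 1], [2, 2]))                    # ~1.0, same direction
```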