# Chp 2 Getting to Know Your Data

###### tags: `Data Mining 心得`

## Attributes

$def:$ a data field representing a characteristic or feature of a data object

* **Nominal Attributes** (categorical): Each value represents some kind of category, code, or state.
* **Binary Attributes** (Boolean): only two categories or states: 0 or 1
    - symmetric: no preference on which outcome should be coded as 0 or 1 (e.g., male, female)
    - asymmetric: the two outcomes are not equally important (e.g., *positive* and *negative* outcomes of a medical test)
* **Ordinal Attributes**: have a meaningful order or ranking
    - useful for registering subjective assessments of qualities that cannot be measured objectively
* **Numeric Attributes**: a measurable quantity, represented in integer or real values
    * Interval-scaled attributes: measured on a scale of equal-size units (e.g., temperature in °C or °F)
    * Ratio-scaled attributes: a value can be regarded as a multiple (or ratio) of another value (e.g., temperature in Kelvin)
* **Discrete** versus **Continuous** Attributes: A discrete attribute has a finite or countably infinite set of values. In practice, real values are represented using a finite number of digits; continuous attributes are typically represented as *floating-point* variables.

## Measuring the Central Tendency: Mean, Median, and Mode

### Mean 平均數

1. **Weighted** arithmetic mean:
    $$\bar x= \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}$$
    - Con: sensitive to extreme (e.g., outlier) values.
2. **Trimmed** mean: the mean obtained after chopping off values at the high and low extremes

### Median 中位數

* Pro: for skewed (asymmetric) data, the median is a better measure of center.
* Con: expensive to compute for a large number of observations (the data must be sorted first).
* Approximating the median:
    * Assume that the data are grouped in intervals and that the frequency (i.e., number of data values) of each interval is known. Let the interval that contains the median frequency be the **median interval**.
    * $$median = L_1 + \left(\frac{N/2-(\Sigma freq)_l}{freq_{median}}\right) \times width$$
        * $L_1$: lower boundary of the median interval
        * $N$: number of values in the entire data set
        * $(\Sigma freq)_l$: sum of the frequencies of all intervals lower than the median interval
        * $freq_{median}$: frequency of the median interval
        * $width$: width of the median interval

### Mode 眾數

1. $def:$ the value that occurs most frequently in the set
2. A data set with two or more modes is **multimodal**.
3. If each data value occurs only once, then there is no mode.
* Property: for unimodal numeric data that are moderately skewed (asymmetrical), we have the empirical relation
    * $mean - mode \approx 3 \times (mean - median)$

### Midrange ($\frac{max+min}{2}$)

---

![](https://i.imgur.com/9dv0JEs.png)

---

## Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range

### Range

* $Def$: difference between the largest (max()) and smallest (min()) values

### k-Quantiles

* $Def$: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
* quartiles: k=4; percentiles: k=100

### Interquartile Range (IQR)

* $Def$: $Q_3-Q_1$ in **quartiles**
* A common rule of thumb for identifying suspected **outliers** is to single out values falling at least $1.5 \times IQR$ above $Q_3$ or below $Q_1$ (see the sketch after the boxplot figure below).

### Five-Number Summary

* $Minimum,\ Q_1,\ Median\ (Q_2),\ Q_3,\ Maximum$

### Boxplots

![](https://i.imgur.com/j49QjUK.png =70%x)
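To make the five-number summary and the $1.5 \times IQR$ outlier rule concrete, here is a minimal Python/NumPy sketch. The helper names (`five_number_summary`, `iqr_outliers`) and the sample data are made up for this note, and `np.percentile` uses linear interpolation by default, so its quartiles can differ slightly from hand-computed textbook conventions.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a 1-D sample."""
    x = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, med, q3, x.max()

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))  # min, Q1, median, Q3, max
print(iqr_outliers(data))         # 110 falls above Q3 + 1.5 * IQR
```

These five numbers are exactly what a boxplot draws: the box spans $Q_1$ to $Q_3$, the line inside is the median, and points beyond the $1.5 \times IQR$ whiskers are plotted individually as suspected outliers.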
### Variance and Standard Deviation

* Measure data dispersion.
    * low standard deviation: data observations tend to be close to the mean
    * high standard deviation: data are spread out over a large range of values
* Formula:
    * $$\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i- \bar x)^2 = \left(\frac{1}{N} \sum_{i=1}^N x_i^2\right) - \bar x^2$$
    * Variance: $\sigma^2$
    * Standard deviation: $\sigma$

## Data Visualization (Skipped)

* **Pixel-oriented visualization**: Create $m$ windows on the screen, one for each dimension.
![](https://i.imgur.com/eBLdKrv.png =80%x)
* **Geometric Projection Visualization**: Find interesting projections of multidimensional data sets.
![](https://i.imgur.com/mzJXeSq.png =60%x)
![](https://i.imgur.com/rqt8kFv.png =60%x)
* **Icon-Based Visualization**: Use small icons to represent multidimensional data values.
    * Chernoff faces: help reveal trends in the data
![](https://i.imgur.com/v92sPra.png =60%x)

## Measuring Data Similarity and Dissimilarity

### Data Matrix vs. Dissimilarity Matrix

* **Data matrix** (or object-by-attribute structure):
    * n-by-p matrix (n objects, p attributes)
![](https://i.imgur.com/K6zjbWi.png =40%x)
* **Dissimilarity matrix** (or object-by-object structure):
    * n-by-n table
![](https://i.imgur.com/lgw2y0K.png =40%x)
    * $d(i, j)$ is the measured dissimilarity or “difference” between objects $i$ and $j$
    * For nominal attributes: $d(i, j)= \frac{p-m}{p}$
        * $m$: number of attributes on which $i$ and $j$ match
        * $p$: total number of attributes
    * similarity: $sim(i, j) = 1 - d(i, j)$

### Proximity Measures for Binary Attributes

![](https://i.imgur.com/5qtsriT.png =70%x)

* In the contingency table above, $q$ counts attributes equal to 1 for both objects, $r$ counts those equal to 1 for $i$ but 0 for $j$, $s$ counts those equal to 0 for $i$ but 1 for $j$, and $t$ counts those equal to 0 for both.
* Symmetric binary dissimilarity
    * $d(i, j) = \frac{r+s}{q+r+s+t}$
* Asymmetric binary dissimilarity (the number of negative matches $t$ is considered unimportant and is ignored)
    * $d(i, j) = \frac{r+s}{q+r+s}$

### Dissimilarity of Numeric Data

* **Euclidean distance**:
    $$d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + ... + (x_{ip}-x_{jp})^2}$$
* **Manhattan distance**:
    $$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + ... + |x_{ip}-x_{jp}|$$
* Some properties:
    * Non-negativity: $d(i,j)\geq 0$
    * Identity of indiscernibles: $d(i,i)=0$
    * Symmetry: $d(i,j) = d(j,i)$
    * Triangle inequality: $d(i,j)\leq d(i,k)+d(k,j)$
* **Cosine Similarity** (a small sketch of these three measures follows below)
    $$sim(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
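As a quick check on the three numeric measures above, here is a small NumPy sketch; the two example vectors and the function names are arbitrary, chosen just for this note.

```python
import numpy as np

def euclidean(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x1, x2 = [1, 2, 0, 3], [4, 0, 1, 3]
print(euclidean(x1, x2))          # sqrt(14) ≈ 3.742
print(manhattan(x1, x2))          # 6.0
print(cosine_similarity(x1, x2))  # ≈ 0.681
```

Note that cosine similarity depends only on the angle between the two vectors, not on their lengths, which is why it is commonly used for sparse, high-dimensional data such as term-frequency vectors.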