# Chp 3 Data Preprocessing
###### tags: `Data Mining 心得`
**data quality**: **accuracy**, **completeness**, **consistency**.
**Believability**: how much the data are trusted by users.
**Interpretability**: how easily the data are understood.
## Tasks in Data Preprocessing
* **Data cleaning**:
- filling in missing values
- smoothing noisy data
- identifying or removing outliers
- resolving inconsistencies
* **Data integration**: avoiding redundancies when merging data from multiple sources
* **Data reduction**:
- **Dimensionality reduction**:
data encoding schemes to obtain a reduced or “compressed” representation of the original data
e.g. PCA, attribute subset selection, attribute construction
- **Numerosity reduction**:
Data are replaced by smaller representations:
+ parametric models:
+ regression
+ log-linear models
+ nonparametric models:
+ histograms
+ clusters
+ sampling
+ data aggregation
* **Data transformation**:
- Normalization: scaled to [0, 1]
- Data discretization
- Concept hierarchy generation
## 3.2 Data Cleaning
* **Missing Values**
1. Ignore the tuple (not good)
2. Fill in the missing value manually (not good)
3. Use a global constant to fill in (e.g. 'Unknown') (not good)
4. Use the attribute *mean* (for symmetric data distributions) or *median* (for skewed data distributions) to fill in
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. **Use the most probable value** (e.g. inferred by decision tree or regression); see the sketch at the end of this section
* **Noisy Data**
* **Binning**: Smooth a sorted data value by consulting the values around it (see the sketch at the end of this section)
- smoothing by bin means
- smoothing by bin median
- smoothing by bin boundaries (min or max)
* **Regression**: Find the “best” line to fit two attributes
* **Outlier analysis**: detect outliers by clustering; values that fall outside of the clusters may be considered outliers
* **discrepancy detection**:
* **unique rule**: Each value of the given attribute must be different from all other values for that attribute
* **consecutive rule**: There can be no missing values between the lowest and highest values for the attribute.
* **null rule**: Use of blanks, question marks, special characters, or other strings that may indicate the null condition
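
A minimal sketch of the filling and smoothing ideas above, using pandas (the `income` and `class` columns are made-up for illustration):

```python
import pandas as pd

# Toy data with a missing value in a numeric attribute.
df = pd.DataFrame({
    "income": [30_000, 45_000, None, 52_000, 61_000, 250_000],
    "class":  ["A", "A", "A", "B", "B", "B"],
})

# Strategy 4: fill with the global mean (symmetric data) or median (skewed data).
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# Strategy 5: fill with the median of the tuples belonging to the same class.
df["income_by_class"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Binning: sort the values, partition them into equal-frequency bins,
# and smooth each value by replacing it with its bin mean.
values = df["income_median"].sort_values()
bins = pd.qcut(values, q=3)                        # equal-frequency (equal-depth) bins
smoothed = values.groupby(bins).transform("mean")  # smoothing by bin means
print(smoothed)
```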
## 3.3 Data Integration
* Happens when merging data from multiple databases.
* **Redundancy and correlation analysis**: $\chi^2$ test (for nominal data)
$$ \chi^2 = \sum_{i=1}^c \sum_{j=1}^r \frac{(o_{ij}-e_{ij})^2}{e_{ij}} $$
$o_{ij}$: observed frequency of the joint event $(A_i, B_j)$
$e_{ij}$: expected frequency $e_{ij}= \frac{count(A=a_i)\times count(B=b_j)}{n}$, where $n$ is the number of data tuples
* **Correlation Coefficient for Numeric Data**:
* $$r_{A,B} = \frac{ \sum_{i=1}^n (a_i-\bar A)(b_i-\bar B)}{n \sigma_A \sigma_B}$$
* **Covariance for Numeric Data**:
* $$Cov(A,B) = \frac{ \sum_{i=1}^n (a_i-\bar A)(b_i-\bar B)}{n}$$
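
A quick sketch of the $\chi^2$ test and the correlation/covariance computations with SciPy and NumPy (the contingency table and the numeric values are made-up examples):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Chi-square test on a made-up contingency table for two nominal attributes
# (rows = values of A, columns = values of B); entries are observed counts o_ij.
observed = np.array([[250,  200],
                     [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # a very small p-value suggests A and B are correlated

# Correlation coefficient and covariance for two numeric attributes.
a = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
b = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
r = np.corrcoef(a, b)[0, 1]            # Pearson correlation r_{A,B}
cov = np.cov(a, b, bias=True)[0, 1]    # covariance with the 1/n convention above
print(r, cov)
```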
## 3.4 Data Reduction
* **Dimensionality reduction**:
* **Discrete wavelet transform (DWT)**:
* Works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well
* **Principal Components Analysis (PCA)**:
* Searches for $k$ $n$-dimensional orthogonal vectors, with $k \le n$, that can best be used to represent the data (see the sketch at the end of this section)
* **Attribute Subset Selection**: greedy method
* Stepwise forward selection
* Stepwise backward elimination
* Decision tree induction
* **Numerosity reduction**:
* Parametric Data Reduction
* linear regression $y = ax + b$
* Log-linear models: Consider each tuple as a point in an n-dimensional space. Estimate the probability of each point in a multidimensional space for a set of discretized attributes.
* **Histograms**:
* Equal-width
* Equal-frequency
* **Clustering**:
* The cluster representations of the data are used to replace the actual data.
* **Sampling**
* Simple random sample *without* replacement
* Simple random sample *with* replacement: The drawn tuple is placed back, so it may be drawn again.
* Cluster sample: Tuples in D are grouped into M mutually disjoint clusters, then an SRS of s clusters (s < M) is obtained.
* Stratified sample:
* If D is divided into mutually disjoint parts called *strata*, a stratified sample of D is generated by obtaining an SRS from each stratum (see the sketch at the end of this section).
* Advantage of sampling:
* The cost of obtaining a sample is proportional to the sample size rather than the full data size, so sampling complexity is potentially sublinear in the size of the data.
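
A brief sketch of PCA-based dimensionality reduction and of stratified sampling, using scikit-learn and pandas (the random data set, the choice of k = 3, and the `stratum` column are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# PCA: project n-dimensional data onto k orthogonal components (k <= n).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # 500 tuples, n = 10 attributes
pca = PCA(n_components=3)           # keep k = 3 principal components
X_reduced = pca.fit_transform(X)    # reduced representation of the data
print(X_reduced.shape, pca.explained_variance_ratio_)

# Stratified sample: draw an SRS from each stratum (here, each group label).
df = pd.DataFrame({
    "value":   rng.normal(size=500),
    "stratum": rng.choice(["youth", "adult", "senior"], size=500),
})
sample = df.groupby("stratum").sample(frac=0.1, random_state=0)
print(sample["stratum"].value_counts())
```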
## 3.5 Data Transformation and Data Discretization
1. Smoothing: Remove noise from the data
2. Attribute construction: New attributes are constructed from the given set of attributes
3. Aggregation: Constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization: Scale attribute values to a small range such as [-1, 1] or [0, 1]
5. Discretization: Raw values of a numeric attribute are replaced by interval labels or conceptual labels (see the sketch after this list)
e.g. age --> 0-10, 11-20,... --> youth, adult, senior
6. Concept hierarchy generation: attributes can be generalized to higher-level concepts (e.g. street to city or country)
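
As a concrete illustration of steps 5 and 6, a small pandas sketch that first replaces raw ages by interval labels and then generalizes them to higher-level concept labels (the cut points are only an example):

```python
import pandas as pd

ages = pd.Series([4, 15, 21, 33, 47, 68, 75])

# Discretization: replace raw values of the numeric attribute by interval labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 120])

# Concept hierarchy: generalize the intervals to conceptual labels.
concepts = pd.cut(ages, bins=[0, 20, 60, 120],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```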
* **Discretization**:
* The raw data are replaced by a smaller number of interval or concept labels.
* Top-down discretization (splitting):
* Finds one or a few split-points to split the entire attribute range, then repeats this recursively on the resulting intervals
* Bottom-up discretization (merging):
* Starts by considering all of the continuous values as potential split-points, removes some by merging neighboring values to form intervals, and then applies this recursively to the resulting intervals
* Data Transformation by **Normalization**:
* Gives all attributes an equal weight, so that no single attribute dominates because of its measurement scale (see the sketch at the end of this section).
* **Min-max** normalization: range [new minA,new maxA]
* $$v_i^{'} = \frac{v_i - min_A}{max_A-min_A}(new\text{_}max_A-new\text{_}min_A)+new\text{_}min_A$$
* **z-score** normalization:
* $$v_i^{'} = \frac{v_i-\bar A}{\sigma_A}$$
* **Decimal scaling** normalization: move the decimal point of values of attribute A.
* $$v_i^{'} = \frac{v_i}{10^j}$$
* $j$ is the smallest integer s.t. $max(|v_i^{'}|)<1$
* Discretization by **Binning**
* Top-down splitting technique
* Can be applied recursively to the resulting partitions to generate concept hierarchies.
* Discretization by **Histogram Analysis**
* A histogram partitions the values of an attribute into disjoint ranges called buckets or bins.
* Discretization by Cluster, Decision Tree, and Correlation Analyses
* **Clustering**:
* Discretize a numeric attribute by partitioning the values into clusters or groups.
* **Decision tree**:
* To discretize a numeric attribute, A, the method selects the value of A that has the minimum *entropy* as a split-point.
* Recursively partitions the resulting intervals to arrive at a hierarchical discretization.
* **ChiMerge**:
* A bottom-up approach that finds the best neighboring intervals (those with the most similar class distributions, according to $\chi^2$ tests) and merges them to form larger intervals.
* Concept Hierarchy Generation for Nominal Data
* Nominal attributes have a finite number of distinct values, with no ordering among the values (e.g. job_category).
* Partial ordering:
e.g. street < city < province or state < country
* A portion of a hierarchy
Specify explicit groupings for a small portion of intermediate-level data.
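
A short sketch of the three normalization methods above, in NumPy (the sample values are arbitrary):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min_A, new_max_A] = [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization (population standard deviation, matching the formula above).
v_zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10^j, where j is the smallest
# integer such that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal, sep="\n")
```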