# Chp 3 Data Preprocessing
###### tags: `Data Mining 心得`
**data quality**: **accuracy**, **completeness**, **consistency**.
**Believability**: how much the data are trusted by users.
**Interpretability**: how easily the data are understood.
## Tasks in Data Preprocessing
* **Data cleaning**:
- filling in missing values
- smoothing noisy data
- identifying or removing outliers
- resolving inconsistencies
* **Data integration**: avoiding redundancies when merging data from multiple sources
* **Data reduction**:
- **Dimensionality reduction**:
data encoding schemes to obtain a reduced or “compressed” representation of the original data
e.g. PCA, attribute subset selection, attribute construction
- **Numerosity reduction**:
Data are replaced by smaller representations:
+ parametric models:
+ regression
+ log-linear models
+ nonparametric models:
+ histograms
+ clusters
+ sampling
+ data aggregation
* **Data transformation**:
- Normalization: scaled to [0, 1]
- Data discretization
- Concept hierarchy generation
## 3.2 Data Cleaning
* **Missing Values**
1. Ignore the tuple (not good)
2. Fill in the missing value manually (not good)
3. Use a global constant to fill in (e.g. 'Unknown') (not good)
4. Use the attribute *mean* (for symmetric data distributions) or *median* (for skewed data distributions) to fill in
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. **Use the most probable value** (e.g. inferred by decision tree or regression); see the sketch at the end of this section
* **Noisy Data**
* **Binning**: Smooth a sorted data value by consulting the values around it (see the sketch at the end of this section)
- smoothing by bin means
- smoothing by bin median
- smoothing by bin boundaries (min or max)
* **Regression**: Find the “best” line to fit two attributes
* **Outlier analysis**: detect outliers by clustering; values that fall outside of the clusters may be considered outliers
* **discrepancy detection**:
* **unique rule**: Each value of the given attribute must be different from all other values for that attribute
* **consecutive rule**: There can be no missing values between the lowest and highest values for the attribute.
* **null rule**: Use of blanks, question marks, special characters, or other strings that may indicate the null condition
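
A minimal sketch of the filling and smoothing ideas above, using pandas (the `income` and `class` columns are made-up for illustration):

```python
import pandas as pd

# Toy data with a missing value in a numeric attribute.
df = pd.DataFrame({
    "income": [30_000, 45_000, None, 52_000, 61_000, 250_000],
    "class":  ["A", "A", "A", "B", "B", "B"],
})

# Strategy 4: fill with the global mean (symmetric data) or median (skewed data).
df["income_mean"]   = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# Strategy 5: fill with the median of the tuples belonging to the same class.
df["income_by_class"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Binning: sort the values, partition them into equal-frequency bins,
# and smooth each value by replacing it with its bin mean.
values = df["income_median"].sort_values()
bins = pd.qcut(values, q=3)                        # equal-frequency (equal-depth) bins
smoothed = values.groupby(bins).transform("mean")  # smoothing by bin means
print(smoothed)
```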
## 3.3 Data Integration
* Happens when merging data from multiple databases.
* **Redundancy and correlation analysis**: $\chi^2$ test (for nominal data)
$$ \chi^2 = \sum_{i=1}^c \sum_{j=1}^r \frac{(o_{ij}-e_{ij})^2}{e_{ij}} $$
$o_{ij}$: observed frequency of the joint event $(A_i, B_j)$
$e_{ij}$: expected frequency $e_{ij}= \frac{count(A=a_i)\times count(B=b_j)}{n}$, where $n$ is the number of data tuples
* **Correlation Coefficient for Numeric Data**:
* $$r_{A,B} = \frac{ \sum_{i=1}^n (a_i-\bar A)(b_i-\bar B)}{n \sigma_A \sigma_B}$$
* **Covariance for Numeric Data**:
* $$Cov(A,B) = \frac{ \sum_{i=1}^n (a_i-\bar A)(b_i-\bar B)}{n}$$
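
A quick sketch of the $\chi^2$ test and the correlation/covariance computations with SciPy and NumPy (the contingency table and the numeric values are made-up examples):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Chi-square test on a made-up contingency table for two nominal attributes
# (rows = values of A, columns = values of B); entries are observed counts o_ij.
observed = np.array([[250,  200],
                     [ 50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # a very small p-value suggests A and B are correlated

# Correlation coefficient and covariance for two numeric attributes.
a = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
b = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
r = np.corrcoef(a, b)[0, 1]            # Pearson correlation r_{A,B}
cov = np.cov(a, b, bias=True)[0, 1]    # covariance with the 1/n convention above
print(r, cov)
```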
## 3.4 Data Reduction
* **Dimensionality reduction**:
* **Discrete wavelet transform (DWT)**:
* Works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well
* **Principal Components Analysis (PCA)**:
* Searches for $k$ $n$-dimensional orthogonal vectors, with $k \le n$, that can best be used to represent the data (see the sketch at the end of this section)
* **Attribute Subset Selection**: greedy method
* Stepwise forward selection
* Stepwise backward elimination
* Decision tree induction
* **Numerosity reduction**:
* Parametric Data Reduction
* linear regression $y = ax + b$
* Log-linear models: Consider each tuple as a point in an n-dimensional space. Estimate the probability of each point in a multidimensional space for a set of discretized attributes.
* **Histograms**:
* Equal-width
* Equal-frequency
* **Clustering**:
* The cluster representations of the data are used to replace the actual data.
* **Sampling**
* Simple random sample *without* replacement
* Simple random sample *with* replacement: The drawn tuple is placed back, so it may be drawn again.
* Cluster sample: Tuples in D are grouped into M mutually disjoint clusters, then an SRS of s clusters (s < M) is obtained.
* Stratified sample:
* If D is divided into mutually disjoint parts called *strata*, a stratified sample of D is generated by obtaining an SRS from each stratum (see the sketch at the end of this section).
* Advantage of sampling:
* The cost of obtaining a sample is proportional to the sample size rather than the full data size, so sampling complexity is potentially sublinear in the size of the data.
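
A brief sketch of PCA-based dimensionality reduction and of stratified sampling, using scikit-learn and pandas (the random data set, the choice of k = 3, and the `stratum` column are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# PCA: project n-dimensional data onto k orthogonal components (k <= n).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # 500 tuples, n = 10 attributes
pca = PCA(n_components=3)           # keep k = 3 principal components
X_reduced = pca.fit_transform(X)    # reduced representation of the data
print(X_reduced.shape, pca.explained_variance_ratio_)

# Stratified sample: draw an SRS from each stratum (here, each group label).
df = pd.DataFrame({
    "value":   rng.normal(size=500),
    "stratum": rng.choice(["youth", "adult", "senior"], size=500),
})
sample = df.groupby("stratum").sample(frac=0.1, random_state=0)
print(sample["stratum"].value_counts())
```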
## 3.5 Data Transformation and Data Discretization
1. Smoothing: Remove noise from the data
2. Attribute construction: New attributes are constructed from the given set of attributes
3. Aggregation: Constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization: Scale attribute values to a small range such as [-1, 1] or [0, 1]
5. Discretization: Raw values of a numeric attribute are replaced by interval labels or conceptual labels (see the sketch after this list)
e.g. age --> 0-10, 11-20,... --> youth, adult, senior
6. Concept hierarchy generation: attributes can be generalized to higher-level concepts (e.g. street to city or country)
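
As a concrete illustration of steps 5 and 6, a small pandas sketch that first replaces raw ages by interval labels and then generalizes them to higher-level concept labels (the cut points are only an example):

```python
import pandas as pd

ages = pd.Series([4, 15, 21, 33, 47, 68, 75])

# Discretization: replace raw values of the numeric attribute by interval labels.
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 120])

# Concept hierarchy: generalize the intervals to conceptual labels.
concepts = pd.cut(ages, bins=[0, 20, 60, 120],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```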
* **Discretization**:
* The raw data are replaced by a smaller number of interval or concept labels.
* Top-down discretization (splitting):
* Finds one or a few split-points to split the entire attribute range, then repeats this recursively on the resulting intervals
* Bottom-up discretization (merging):
* Starts by considering all of the continuous values as potential split-points, removes some by merging neighboring values to form intervals, and then applies this recursively to the resulting intervals
* Data Transformation by **Normalization**:
* Gives all attributes an equal weight, so that no single attribute dominates because of its measurement scale (see the sketch at the end of this section).
* **Min-max** normalization: range [new minA,new maxA]
* $$v_i^{'} = \frac{v_i - min_A}{max_A-min_A}(new\text{_}max_A-new\text{_}min_A)+new\text{_}min_A$$
* **z-score** normalization:
* $$v_i^{'} = \frac{v_i-\bar A}{\sigma_A}$$
* **Decimal scaling** normalization: move the decimal point of values of attribute A.
* $$v_i^{'} = \frac{v_i}{10^j}$$
* $j$ is the smallest integer s.t. $max(|v_i^{'}|)<1$
* Discretization by **Binning**
* Top-down splitting technique
* Can be applied recursively to the resulting partitions to generate concept hierarchies.
* Discretization by **Histogram Analysis**
* A histogram partitions the values of an attribute into disjoint ranges called buckets or bins.
* Discretization by Cluster, Decision Tree, and Correlation Analyses
* **Clustering**:
* Discretize a numeric attribute by partitioning the values into clusters or groups.
* **Decision tree**:
* To discretize a numeric attribute, A, the method selects the value of A that has the minimum *entropy* as a split-point.
* Recursively partitions the resulting intervals to arrive at a hierarchical discretization.
* **ChiMerge**:
* A bottom-up approach that finds the best neighboring intervals (those with the most similar class distributions, according to $\chi^2$ tests) and merges them to form larger intervals.
* Concept Hierarchy Generation for Nominal Data
* Nominal attributes have a finite number of distinct values, with no ordering among the values (e.g. job_category).
* Partial ordering:
e.g. street < city < province or state < country
* A portion of a hierarchy
Specify explicit groupings for a small portion of intermediate-level data.
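
A short sketch of the three normalization methods above, in NumPy (the sample values are arbitrary):

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min_A, new_max_A] = [0, 1].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization (population standard deviation, matching the formula above).
v_zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10^j, where j is the smallest
# integer such that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal, sep="\n")
```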