# Chp 1 Introduction
###### tags: `Data Mining 心得`
* **knowledge discovery process**:
1. Data cleaning (remove noise and inconsistent data)
2. Data integration (multiple data sources can be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated)
5. Data mining (intelligent methods are applied to extract data patterns)
6. Pattern evaluation (identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (visualization and knowledge representation)
>[color=#00A000]
>
>Steps 1–4 are the data **preprocessing** stage.
>Steps 1 and 2 can be viewed as preprocessing for "storing the data".
>[name=Omnom]
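The seven steps above can be sketched as a minimal pipeline on toy data. This is only an illustration; all record names, values, and the "pattern" mined are made up:

```python
# Minimal sketch of the KDD process on toy data (all names/values hypothetical).

# Raw records from two hypothetical sources, with noise and a missing value.
source_a = [{"id": 1, "age": 25, "income": 30000},
            {"id": 2, "age": -1, "income": 52000}]   # age -1 is noise
source_b = [{"id": 3, "age": 40, "income": None}]    # missing income

# Steps 1-2. Data cleaning + integration: merge sources, drop noisy/incomplete rows.
data = [r for r in source_a + source_b
        if r["age"] is not None and r["age"] > 0 and r["income"] is not None]

# Step 3. Data selection: keep only the attributes relevant to the task.
selected = [{"age": r["age"], "income": r["income"]} for r in data]

# Step 4. Data transformation: e.g., normalize income to [0, 1].
max_income = max(r["income"] for r in selected)
transformed = [{"age": r["age"], "income": r["income"] / max_income}
               for r in selected]

# Step 5. Data mining: here, a trivially simple "pattern" --
# the mean normalized income of the under-30 group.
young = [r["income"] for r in transformed if r["age"] < 30]
pattern = sum(young) / len(young) if young else None

# Steps 6-7 (pattern evaluation, knowledge presentation) are judgment
# and visualization steps, not shown here.
print(transformed)
print(pattern)
```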
* **Data mining** is the process of discovering interesting patterns and knowledge from large amounts of data.
* **data warehouse**
a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
* **Transactional Data**
each transaction record consists of an identity number (trans ID) and a list of the items making up the transaction
* **mining tasks**
* Descriptive: characterize properties of the data in a target data set.
* Predictive: perform induction on the current data in order to make predictions.
# What Kinds of Patterns Can Be Mined?
* **Class/Concept Description**
* **Data characterization**: a summarization of the general characteristics or features of a target class of data.
* **Data discrimination**: a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
* **Predictive analysis**
* **Classification**: finding a model (or function) that describes and distinguishes data classes or concepts. The model is used to predict the class label of objects.
- If-then rules
- decision tree
- neural network
- naive Bayesian
- support vector machine
- k-nearest-neighbor
- ...
* **Regression**: often used for numeric prediction and also encompasses the identification of distribution trends based on the available data.
>[color=#00a000]
>
>**correlation does not imply causation**
>[name=Omnom]
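One classifier from the list above, k-nearest-neighbor, is simple enough to sketch in a few lines. The training points, the query points, and the choice of k are all made up for illustration:

```python
import math

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    labels = [label for _, label in by_dist[:k]]
    return max(set(labels), key=labels.count)

# Hypothetical labeled training data: (features, class label).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((4.8, 5.2), "B")]

print(knn_predict(train, (1.1, 0.9)))  # nearest neighbors are mostly "A"
print(knn_predict(train, (5.1, 5.0)))  # nearest neighbors are mostly "B"
```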
* **Cluster Analysis**
The objects are clustered or grouped based on the principle of **maximizing the intraclass similarity** and **minimizing the interclass similarity**.
* **Outlier Analysis**
Outliers can be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects remote from any cluster are considered outliers.
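Both ideas can be illustrated together in a tiny distance-based sketch. The data, the hand-picked centroids, and the distance threshold are all assumptions made for this example:

```python
# Hypothetical 1-D data with two natural groups and one remote point.
points = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8, 50.0]
centroids = [1.0, 10.0]          # hand-picked centroids (an assumption)

def nearest(c_list, p):
    """Return the centroid closest to point p."""
    return min(c_list, key=lambda c: abs(c - p))

# Cluster assignment: group each point with its nearest centroid, so points
# within a cluster are similar (high intraclass similarity) and clusters
# are far apart (low interclass similarity).
clusters = {c: [p for p in points if nearest(centroids, p) == c]
            for c in centroids}

# Distance-based outlier rule (threshold t is an assumption): a point farther
# than t from every centroid is flagged as an outlier.
t = 5.0
outliers = [p for p in points if min(abs(p - c) for c in centroids) > t]

print(clusters)
print(outliers)  # 50.0 is remote from both groups
```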
* Association rules of the form $X \Rightarrow Y$:
+ **support**: the percentage of transactions in the database for which the rule holds (i.e., that contain both $X$ and $Y$)
+ **confidence**: assesses the degree of certainty of the detected association
+ support $(X \Rightarrow Y)$ = $P(X \cup Y)$, where $P(X \cup Y)$ denotes the probability that a transaction contains both $X$ and $Y$
+ confidence $(X \Rightarrow Y)$ = $P(Y|X)$
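Both measures can be computed directly from a toy transaction database (the transactions below are made up):

```python
# Toy transaction database (made-up data): each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(X, Y):
    """support(X => Y) = P(X ∪ Y): fraction of transactions containing
    every item of X and of Y."""
    both = X | Y
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """confidence(X => Y) = P(Y | X): among transactions containing X,
    the fraction that also contain Y."""
    n_x = sum(X <= t for t in transactions)
    return sum((X | Y) <= t for t in transactions) / n_x

print(support({"bread"}, {"milk"}))     # 3 of 5 transactions contain both -> 0.6
print(confidence({"bread"}, {"milk"}))  # 3 of the 4 bread transactions -> 0.75
```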
# Which Technologies Are Used?
* **Statistical model** : a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.
* **Machine Learning**:
* **Supervised learning** (classification): The supervision in the learning comes from the **labeled examples** in the training data set.
* **Unsupervised learning**(clustering): The input examples are not class labeled. Typically, we may use clustering to **discover classes within the data**.
* **Semi-supervised learning**: Make use of both labeled and unlabeled examples. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
* **Active learning**: The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
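The semi-supervised idea can be seen in a minimal self-training sketch. The 1-D data and the nearest-neighbor labeling rule are assumptions chosen only to make the mechanism visible: the two labeled examples seed the class models, and each unlabeled point is absorbed into the nearest class, refining the boundary step by step:

```python
# Minimal self-training sketch (1-D data and 1-NN rule are assumptions).
labeled = [(0.0, "A"), (10.0, "B")]   # the supervision: labeled examples
unlabeled = [1.0, 2.0, 9.0, 8.5]      # examples without class labels

# Repeatedly take the unlabeled point closest to any already-labeled point,
# give it the label of its nearest labeled neighbor, and treat it as labeled
# from then on -- so the class boundary is refined by the unlabeled data.
while unlabeled:
    p = min(unlabeled, key=lambda u: min(abs(u - x) for x, _ in labeled))
    _, label = min(labeled, key=lambda xl: abs(xl[0] - p))
    labeled.append((p, label))
    unlabeled.remove(p)

print(sorted(labeled))  # points near 0 end up "A", points near 10 end up "B"
```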
# Major Issues in Data Mining
## Mining Methodology
* Mining various and new kinds of knowledge
* Mining knowledge in multidimensional space
* Data mining—an interdisciplinary effort:
* enhanced by integrating new methods
* Boosting the power of discovery in a networked environment
* Handling uncertainty, noise, or incompleteness of data
* Pattern evaluation and pattern- or constraint-guided mining
## User Interaction
* Interactive mining:
* it is important to build flexible user interfaces and an exploratory mining environment
* Incorporation of background knowledge
* Ad hoc data mining and data mining query languages
* Presentation and visualization of data mining results
## Efficiency and Scalability
* Efficiency and scalability of data mining algorithms:
* the running time of a data mining algorithm must be predictable, short, and acceptable by applications.
* **Efficiency**, **scalability**, **performance**, **optimization**, and the **ability to execute in real time**
* Parallel, distributed, and incremental mining algorithms:
* Such algorithms first partition the data into “pieces.”
* The patterns from each partition are eventually merged.
* Cloud computing and cluster computing
* **incremental** data mining: incorporates new data updates without having to mine the entire data “from scratch.”
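The partition-and-merge and incremental ideas above can be sketched with a simple item-counting task (the transactions, the two-way split, and the new batch are all assumptions):

```python
from collections import Counter

# Full transaction data, split into "pieces" as a parallel miner would.
transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"],
                ["a"], ["b", "c"]]
partitions = [transactions[:3], transactions[3:]]   # hypothetical split

# Mine each partition independently (each could run on a separate machine)...
partial = [Counter(item for t in part for item in t) for part in partitions]

# ...then merge the per-partition results into global counts.
merged = sum(partial, Counter())
print(merged)

# Incremental mining: fold a new batch of data into the existing result
# without recounting the entire database "from scratch".
new_batch = [["a", "c"]]
merged += Counter(item for t in new_batch for item in t)
print(merged)
```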