
# Chp 13 Data Mining Trends and Research Frontiers

###### tags: `Data Mining Notes`

## Mining Complex Data Types

![](https://i.imgur.com/AK2399E.png)

### Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences

#### Similarity Search in Time-Series Data
* Often requires subsequence matching
* Dimensionality reduction:
    1. Discrete Fourier transform (DFT)
    2. Discrete wavelet transform (DWT)
    3. Singular value decomposition (SVD)
    4. Principal component analysis (PCA)

#### Regression and Trend Analysis in Time-Series Data
* Trend analysis
    * ![](https://i.imgur.com/RhjYXwU.png =70%x)
    1. Trend or long-term movements: use the **weighted moving average** and **least squares** methods to find trend curves.
    2. Cyclic movements: long-term oscillations about the trend line
    3. Seasonal variations: e.g. holiday shopping seasons
    4. Random movements

#### Sequential Pattern Mining in Symbolic Sequences
* Mining symbolic sequences.
* **Constraint-based sequential pattern mining**: user-specified constraints reduce the search space and yield only the patterns that are of interest to the user.
* Relaxing constraints:
    * Folding events into proper-size windows and finding recurring subsequences in these windows
    * **Partial order patterns**: relaxing the requirement of strict sequential ordering

#### Sequence Classification
* Three categories:
    1. Feature-based classification: transforms a sequence into a feature vector and then applies conventional classification methods.
    2. Sequence distance-based classification: relies on a distance function that measures the similarity between sequences.
    3. Model-based classification: e.g. hidden Markov model

#### Alignment of Biological Sequences
* **Sequence alignment**:
    - Lining up sequences to achieve a maximal identity level
    - Local alignments and global alignments
* Substitution matrices:
    - Represent the probabilities of substitutions of nucleotides or amino acids and the probabilities of insertions and deletions.
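
To make the idea of global alignment concrete, here is a minimal sketch of the classic Needleman-Wunsch dynamic program. The function name `global_alignment_score` and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration; a toy match/mismatch score stands in for a real substitution matrix, and only the best score is computed, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not taken from the notes.
    """
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of `b` costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b (a[i-1] left unmatched)
                curr[j - 1] + gap,  # gap in a (b[j-1] left unmatched)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

Replacing the `match`/`mismatch` pair with a lookup into a substitution table would turn this toy scheme into the matrix-based scoring described above; a local alignment variant would additionally clamp each cell at zero and take the maximum over the whole table.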
### Mining Graphs and Networks
1. Graph Pattern Mining:
    - Mining frequent subgraphs
    - Structure similarity search
2. Statistical Modeling of Networks
    - Homogeneous: the nodes and links are of the same type.
    - Heterogeneous: the nodes and links are of different types.
    - Scale-free model: node degrees follow a power law distribution
3. Data Cleaning, Integration, and Validation by Information Network Analysis
4. Clustering and Classification of Graphs and Homogeneous Networks
    - Discover hidden communities, hubs, and outliers
5. Clustering, Ranking, and Classification of Heterogeneous Networks
6. Role Discovery and Link Prediction in Information Networks
    - Link prediction: assess expected relationships among the candidate nodes/links.
7. Similarity Search and OLAP in Information Networks
    - OLAP: online analytical processing
    - Path-based similarity
8. Evolution of Social and Information Networks

### Mining Other Kinds of Data

#### Mining Spatial Data
* Discovers patterns and knowledge from spatial data, such as geospace-related data
* Popular topics:
    - Mining spatial associations and co-location patterns
    - Spatial clustering
    - Spatial classification
    - Spatial modeling
    - Spatial trend and outlier analysis

#### Mining Spatiotemporal Data and Moving Objects
* Spatiotemporal data: relates to both space and time, such as the evolutionary history of cities and lands, or global warming trends
* Moving-object data (important): mining movement patterns of multiple moving objects

#### Mining Cyber-Physical System Data
* e.g. a transportation system linked to a transportation monitoring network
* Requires real-time computation and prompt responses

#### Mining Multimedia Data
* Includes image data, video data, audio data, as well as sequence data and hypertext data

#### Mining Text Data
* Discovery of patterns and trends using statistical pattern learning, topic modeling, statistical language modeling, etc.
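
As a small illustration of statistical language modeling, the sketch below estimates bigram probabilities by maximum likelihood from a toy corpus. The helper names `train_bigram_model` and `bigram_prob`, the whitespace tokenization, and the `<s>`/`</s>` boundary markers are all assumptions made for this example; practical models add smoothing for unseen word pairs.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams


def bigram_prob(model, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 was never seen."""
    contexts, bigrams = model
    if contexts[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / contexts[w1]


if __name__ == "__main__":
    corpus = [
        "data mining finds patterns",
        "data mining finds trends",
        "data streams arrive fast",
    ]
    model = train_bigram_model(corpus)
    print(bigram_prob(model, "data", "mining"))
```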
#### Mining Web Data
* Web content mining: text, multimedia data, and structured data
* Web structure mining: hyperlinks
    - Uses graph and network mining methods to analyze the nodes and connection structures on the Web.
* Web usage mining: server logs
    - Understands users' search patterns, trends, and associations
    - Predicts what users are looking for on the Internet

#### Mining Data Streams
* The stream can be read only once, in sequential order

## Other Methodologies of Data Mining

![](https://i.imgur.com/JBkcgZm.png)

### Statistical Data Mining
* **Regression**: predicts the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric.
* **Generalized linear models**: include logistic regression and Poisson regression
* **Analysis of variance**: analyzes experimental data for two or more populations described by a numeric response variable
* **Mixed-effect models**: analyze grouped data that can be classified according to one or more grouping variables.
* **Factor analysis**: determines which variables are combined to generate a given factor.
* **Discriminant analysis**: determines several discriminant functions that discriminate among the groups defined by the response variable
* **Survival analysis**: predicts the probability that a patient undergoing a medical treatment will survive at least to time $t$.
* **Quality control**: Shewhart charts and CUSUM charts

### Views on Data Mining Foundations
* **Data reduction**: includes singular value decomposition, wavelets, regression, log-linear models, histograms, clustering, sampling, and the construction of index trees
* **Data compression**: compresses the given data by encoding it in terms of bits, association rules, decision trees, or clusters.
* **Probability and statistical theory**: discovers joint probability distributions of random variables.
* **Microeconomic view**: patterns are interesting only to the extent that they can be used in the decision-making process of some enterprise
* **Pattern discovery and inductive databases**: discovers patterns occurring in the data, such as associations, classification models, and sequential patterns

### Visual and Audio Data Mining
* **Visual data mining**:
    - Data visualization
    - Data mining
* **Audio data mining**: uses audio signals to indicate the patterns of data

## Data Mining Applications

![](https://i.imgur.com/0YhmhZR.png)

* Financial Data Analysis
* Retail and Telecommunication Industries
* Science and Engineering
* Intrusion Detection and Prevention
* Recommender Systems

## Data Mining and Society
* Customer relationship management (CRM): provides more customized, personal service addressing individual customers' needs.

### Privacy-preserving data mining
* Obtaining valid data mining results without disclosing the underlying sensitive data values
* **Randomization methods**: add noise to the data to mask some attribute values of records.
* **The k-anonymity and l-diversity methods**:
    - **k-anonymity**: the granularity of data representation is reduced sufficiently so that any given record maps onto at least k other records in the data.
    - **l-diversity**: enforces intragroup diversity of sensitive values to ensure anonymization.
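
The two anonymity notions can be checked mechanically on a small table. The sketch below is an illustration under stated assumptions: the function name `anonymity_levels`, the column names, and the sample records are hypothetical, and it uses the common convention that a table is k-anonymous when every group of records sharing the same quasi-identifier values has at least k members, and l-diverse when every such group contains at least l distinct sensitive values.

```python
from collections import defaultdict


def anonymity_levels(records, quasi_identifiers, sensitive):
    """Return (k, l): the smallest group size and the smallest number of
    distinct sensitive values over all quasi-identifier groups."""
    if not records:
        raise ValueError("records must be non-empty")
    groups = defaultdict(list)
    for record in records:
        # Records with identical quasi-identifier values fall in one group.
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l


if __name__ == "__main__":
    # Hypothetical, already-generalized records (age ranges, masked ZIP codes).
    table = [
        {"age": "20-29", "zip": "130**", "disease": "flu"},
        {"age": "20-29", "zip": "130**", "disease": "cold"},
        {"age": "30-39", "zip": "148**", "disease": "flu"},
        {"age": "30-39", "zip": "148**", "disease": "flu"},
        {"age": "30-39", "zip": "148**", "disease": "asthma"},
    ]
    print(anonymity_levels(table, ["age", "zip"], "disease"))
```

A group whose members all share one sensitive value would give `l = 1` even when `k` is large, which is the weakness of k-anonymity that l-diversity is meant to address.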