---
# System prepended metadata

title: Chp 13 Data Mining Trends and Research Frontiers
tags: [Data Mining Notes]

---

# Chp 13 Data Mining Trends and Research Frontiers


###### tags: `Data Mining Notes`

## Mining Complex Data Types

![](https://i.imgur.com/AK2399E.png)


### Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences

#### Similarity search in Time-Series Data
* Often requires subsequence matching
* Dimensionality reduction:
    1. Discrete Fourier transform (DFT)
    2. Discrete wavelet transforms (DWT)
    3. Singular value decomposition (SVD)
    4. Principal component analysis (PCA)
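
As a rough illustration of DFT-based reduction, the sketch below keeps only the first few Fourier coefficients of each series and compares the series in that reduced space. The helper names (`dft`, `reduce_dft`, `euclidean`) and the two sample series are made up for this note. The DFT is scaled by 1/sqrt(n), so the distance in the reduced space never exceeds the true Euclidean distance, which is what makes it usable as a filter before exact matching.

```python
import cmath
import math

def dft(series):
    """Naive O(n^2) discrete Fourier transform, scaled by 1/sqrt(n) so it preserves distances."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)) / math.sqrt(n)
        for k in range(n)
    ]

def reduce_dft(series, k):
    """Keep only the first k Fourier coefficients as a compact representation."""
    return dft(series)[:k]

def euclidean(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical sample series of equal length.
s1 = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
s2 = [1.5, 2.5, 2.5, 4.5, 2.5, 2.0, 0.5, 0.5]

true_dist = euclidean(s1, s2)
reduced_dist = euclidean(reduce_dft(s1, 3), reduce_dft(s2, 3))
print(reduced_dist <= true_dist)  # the reduced distance is a lower bound
```

If the reduced distance already exceeds a search threshold, the pair can be discarded without computing the full distance.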

#### Regression and Trend Analysis in Time-Series Data
* Trend analysis
* ![](https://i.imgur.com/RhjYXwU.png =70%x)
    1. Trend or long-term movements: use the **weighted moving average** and **least squares** methods to find trend curves.
    2. Cyclic movements: long-term oscillations
    3. Seasonal variations: e.g. holiday shopping seasons
    4. Random movements
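
A minimal sketch of the two trend-curve methods named above, in plain Python. The function names and the sample series are illustrative assumptions, and the least squares fit is restricted to a straight line over time steps 0, 1, 2, ...

```python
def weighted_moving_average(values, weights):
    """Slide a window of len(weights) over values and return the weighted mean of each window."""
    total = sum(weights)
    w = len(weights)
    return [
        sum(v * wt for v, wt in zip(values[i:i + w], weights)) / total
        for i in range(len(values) - w + 1)
    ]

def least_squares_line(values):
    """Fit y = slope * t + intercept for t = 0, 1, ..., n - 1 by ordinary least squares."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

series = [3, 7, 2, 0, 4, 5, 9, 7, 2]  # hypothetical observations
smoothed = weighted_moving_average(series, [1, 4, 1])
print(smoothed)  # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]

slope, intercept = least_squares_line(series)
print(slope, intercept)
```

The moving average smooths out short-term fluctuation at the cost of losing points at both ends; the fitted line summarizes the overall direction in two numbers.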

#### Sequential Pattern Mining in Symbolic Sequences
* Mining symbolic sequences.
* **Constraint-based sequential pattern mining**: user-specified constraints can be used to reduce the search space in sequential pattern mining and derive only the patterns that are of interest to the user.
* Relax constraints:
    * Folding events into proper-size windows and finding recurring subsequences in these windows
    * **Partial order patterns**: relaxing the requirement of strict sequential ordering

#### Sequence Classification

* Three categories:
    1. Feature-based classification: transforms a sequence into a feature vector and then applies conventional classification methods.
    2. Sequence distance-based classification: measures the similarity between sequences.
    3. Model-based classification: e.g., a hidden Markov model (HMM).
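
To make the feature-based category concrete, here is a small sketch that turns a symbolic sequence into a fixed-length vector of k-gram counts. K-grams are only one possible choice of features; the function name `kgram_features` and the vocabulary are assumptions made for illustration. The resulting vector can then be passed to any conventional classifier.

```python
from collections import Counter

def kgram_features(sequence, k, vocabulary):
    """Count each k-gram of the sequence and report the counts in vocabulary order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[gram] for gram in vocabulary]

# Hypothetical vocabulary of 2-grams over the alphabet {A, C}.
vocab = ["AA", "AC", "CA", "CC"]
features = kgram_features("AAACA", 2, vocab)
print(features)  # [2, 1, 1, 0]
```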


#### Alignment of Biological Sequences
* **Sequence alignment**: 
    - Lining up sequences to achieve a maximal identity level
    - Local alignment and global alignment
* Substitution matrices:
    - Represent the probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions.
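
A global alignment score can be sketched with the classic Needleman-Wunsch dynamic program. The flat scores used here (match +1, mismatch -1, gap -1) are a simplifying assumption that stands in for a real substitution matrix, and the function name is chosen for this note.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("AC", "AGC"))  # 1: two matches and one gap
```

A local alignment (Smith-Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table rather than the last cell.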

### Mining Graphs and Networks

1. Graph Pattern Mining: 
    - Mining frequent subgraphs
    - Structure similarity search
2. Statistical Modeling of Networks
    - Homogeneous: The nodes and links are of the same type.
    - Heterogeneous: The nodes and links are of different types.
    - Scale-free model: power law distribution

3. Data Cleaning, Integration, and Validation by Information Network Analysis


4. Clustering and Classification of Graphs and Homogeneous Networks
    - Discover hidden communities, hubs, and outliers

5. Clustering, Ranking, and Classification of Heterogeneous Networks

6. Role Discovery and Link Prediction in Information Networks
    - Link prediction: Assess expected relationships among the candidate nodes/links.

7. Similarity Search and OLAP in Information Networks
    - OLAP: Online analytical processing
    - Path-based similarity
8. Evolution of Social and Information Networks
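
For the link prediction item in the list above, one simple baseline is to score each unconnected pair of nodes by the Jaccard similarity of their neighbor sets. This is only one of many possible scoring heuristics; the toy graph and function names below are hypothetical.

```python
from itertools import combinations

def jaccard_score(graph, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def rank_candidate_links(graph):
    """Score every pair of nodes that is not yet linked, highest score first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
    ]
    scored = [((u, v), jaccard_score(graph, u, v)) for u, v in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical undirected graph stored as adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranked = rank_candidate_links(graph)
print(ranked[0])  # the pair ("a", "d") shares two of three neighbors
```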


### Mining Other Kinds of Data

#### Mining Spatial Data
* Discovers patterns and knowledge from spatial data, like geospace-related data
* Popular topics:
    - Mining spatial associations and co-location patterns
    - Spatial clustering
    - Spatial classification
    - Spatial modeling
    - Spatial trend and outlier analysis

#### Mining Spatiotemporal Data and Moving Objects
* Spatiotemporal data: data that relate to both space and time, such as the evolutionary history of cities and lands or global warming trends
* Moving-object data (important): Mining movement patterns of multiple moving objects
    
#### Mining Cyber-Physical System Data
* e.g., a transportation system that links a transportation monitoring network with a traffic information and control system
* Requires real-time calculation and prompt responses

#### Mining Multimedia Data
* Includes image, video, and audio data, as well as sequence and hypertext data

#### Mining Text Data
* Discovery of patterns and trends using statistical pattern learning, topic modeling, and statistical language modeling, etc.

#### Mining Web Data
* Web content mining: text, multimedia data, and structured data
* Web structure mining: hyperlinks
    - Using graph and network mining methods to analyze the nodes and connection structures on the Web.

* Web usage mining: Server logs
    - Understands users’ search patterns, trends, and associations
    - Predicts what users are looking for on the Internet

#### Mining Data Streams
* A stream can usually be read only once, in sequential order
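
Because a stream is read once and cannot be rewound, summaries must be built in a single pass. A common example is reservoir sampling (Algorithm R), sketched below; it keeps a uniform random sample of k items without knowing the stream length in advance. The function name and the `rng` parameter are choices made for this note.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a one-pass stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(iter(range(1000)), 5, random.Random(0))
print(len(sample))  # 5
```

Memory use stays at k items however long the stream runs.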

## Other Methodologies of Data Mining

![](https://i.imgur.com/JBkcgZm.png)

### Statistical Data Mining
* **Regression**: Predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric.

* **Generalized linear models**: Includes logistic regression and Poisson regression

* **Analysis of variance**: Analyze experimental data for two or more populations described by a numeric response variable

* **Mixed-effect models**: Analyze grouped data that can be classified according to one or more grouping variables.

* **Factor analysis**: Determine which variables are combined to generate a given factor.

* **Discriminant analysis**: Determine several discriminant functions that discriminate among the groups defined by the response variable

* **Survival analysis**: Predict the probability that a patient undergoing a medical treatment would survive at least to time $t$.

* **Quality control**: Shewhart charts and CUSUM charts


### Views on Data Mining Foundations

* **Data reduction**: Includes singular value decomposition, wavelets, regression, log-linear models, histograms, clustering, sampling, and the construction of index trees

* **Data compression**: Compress the given data by encoding it in terms of bits, association rules, decision trees, clusters, and so on.

* **Probability and statistical theory**: Discover joint probability distributions of random variables.

* **Microeconomic view**: Patterns are interesting only to the extent that they can be used in the decision-making process of some enterprise

* **Pattern discovery and inductive databases**: Discover patterns occurring in the data such as associations, classification models, sequential patterns

### Visual and Audio Data Mining

* **Visual data mining**: integrates two disciplines
    - Data visualization
    - Data mining

* **Audio data mining**: Uses audio signals to indicate the patterns of data


## Data Mining Applications
![](https://i.imgur.com/0YhmhZR.png)

* Financial Data Analysis
* Retail and Telecommunication Industries
* Science and Engineering
* Intrusion Detection and Prevention
* Recommender Systems


## Data Mining and Society

* Customer relationship management (CRM): Provides more customized, personal service that addresses individual customers' needs.

### Privacy-preserving data mining
* Obtaining valid data mining results without disclosing the underlying sensitive data values

* **Randomization methods**: Add noise to the data to mask some attribute values of records.

* **The k-anonymity and l-diversity methods**: 
    - **k-anonymity**: The granularity of data representation is reduced sufficiently so that any given record maps onto at least k other records in the data.
    - **l-diversity**: Enforcing intragroup diversity of sensitive values to ensure anonymization.
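
Both properties can be checked on an already generalized table, as in the sketch below. It assumes the common formulation in which records are grouped by their quasi-identifier values, every group must hold at least k records, and every group must contain at least l distinct sensitive values. The column names and records are made up.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same values on all quasi-identifier columns."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_ids)].append(record)
    return groups

def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier group contains at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(group) >= k for group in groups.values())

def is_l_diverse(records, quasi_ids, sensitive, l):
    """True if every group contains at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len({r[sensitive] for r in group}) >= l for group in groups.values())

# Hypothetical table after generalizing age and ZIP code.
records = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cold"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]
print(is_k_anonymous(records, ["age", "zip"], 2))           # True
print(is_l_diverse(records, ["age", "zip"], "disease", 2))  # False
```

The second group is 2-anonymous but reveals that everyone in it has the same disease, which is the gap l-diversity is meant to close.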