# Chp 13 Data Mining Trends and Research Frontiers
###### tags: `Data Mining Notes`
## Mining Complex Data Types

### Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences
#### Similarity search in Time-Series Data
* Often require subsequence matching
* Dimensionality reduction:
1. Discrete Fourier transform (DFT)
2. Discrete wavelet transforms (DWT)
3. Singular value decomposition (SVD)
4. Principal component analysis (PCA)
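A minimal sketch of how DFT-based reduction can support similarity search: keep only the first `k` Fourier coefficients of each series and compare those instead of the raw values. With an orthonormal DFT, Parseval's theorem implies the reduced distance never exceeds the true Euclidean distance, so filtering on it cannot dismiss a true match. The function names and the choice of `k` are illustrative assumptions, not part of the chapter.

```python
import cmath
import math

def dft_features(series, k):
    """Return the first k coefficients of the orthonormal DFT of `series`."""
    n = len(series)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(series))
        for f in range(k)
    ]

def reduced_distance(a, b, k):
    """Euclidean distance between the k-coefficient DFT features of a and b."""
    fa, fb = dft_features(a, k), dft_features(b, k)
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(fa, fb)))

def euclidean(a, b):
    """Plain Euclidean distance in the original time domain."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2, 2, 4, 4, 6, 6, 8, 8]
# The reduced distance is a lower bound on the full distance.
assert reduced_distance(a, b, 3) <= euclidean(a, b) + 1e-9
```

A typical use is to index the `k`-coefficient features, retrieve candidates whose reduced distance is within the query radius, and then verify the candidates against the full series.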
#### Regression and Trend Analysis in Time-Series Data
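* Trend analysis decomposes a time series into four components:
1. Trend or long-term movements: the general direction of the series over a long period. Trend curves can be found with the **weighted moving average** and **least squares** methods.
2. Cyclic movements: long-term oscillations about a trend line or curve
3. Seasonal variations: e.g. holiday shopping seasons
4. Random movements: sporadic changes due to chance events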
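The two trend-finding methods named above can be sketched as follows. This is a minimal illustration, assuming evenly spaced observations indexed by `t = 0, 1, 2, ...`; the function names and the sample weights are hypothetical.

```python
def weighted_moving_average(series, weights):
    """Smooth `series` with a sliding window of `weights`, normalized by their sum."""
    w = len(weights)
    total = sum(weights)
    return [
        sum(x * wt for x, wt in zip(series[i:i + w], weights)) / total
        for i in range(len(series) - w + 1)
    ]

def least_squares_line(series):
    """Fit y = slope * t + intercept to the points (t, series[t]); needs len >= 2."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

smoothed = weighted_moving_average([3, 7, 2, 0, 4, 5, 9, 7, 2], [1, 4, 1])
# smoothed == [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
slope, intercept = least_squares_line([1, 3, 5, 7])
# slope == 2.0, intercept == 1.0
```

Weighting the center of the window more heavily reduces the influence of extreme neighbors, at the cost of losing points at both ends of the series.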
#### Sequential Pattern Mining in Symbolic Sequences
* Finds frequent subsequences in symbolic sequences, e.g. customer shopping sequences and web click streams.
* **Constraint-based sequential pattern mining**: user-specified constraints can be used to reduce the search space in sequential pattern mining and derive only the patterns that are of interest to the user.
* Relaxing constraints:
- Folding events into proper-size windows and finding recurring subsequences within these windows
- **Partial order patterns**: relaxing the requirement of strict sequential ordering
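To make the idea of a user-specified constraint concrete, here is a small sketch that counts the support of a candidate pattern under a maximum-gap constraint, which is only one of many possible constraint types. Sequences are assumed to be strings of single-symbol events; `occurs`, `support`, and `max_gap` are hypothetical names.

```python
def occurs(sequence, pattern, max_gap=None):
    """True if `pattern` is a subsequence of `sequence` whose consecutive
    matched positions are at most `max_gap` apart (None = no constraint)."""
    def match(start, p_idx, prev):
        if p_idx == len(pattern):
            return True
        for pos in range(start, len(sequence)):
            if max_gap is not None and prev is not None and pos - prev > max_gap:
                break  # every later position violates the gap constraint too
            if sequence[pos] == pattern[p_idx] and match(pos + 1, p_idx + 1, pos):
                return True
        return False
    return match(0, 0, None)

def support(database, pattern, max_gap=None):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(occurs(seq, pattern, max_gap) for seq in database)

db = ["abcab", "acb", "axxb", "ba"]
# support(db, "ab") == 3, but support(db, "ab", max_gap=1) == 1
```

Tightening the constraint shrinks the set of matching sequences, which is how constraints prune the search space during mining.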
#### Sequence Classification
* Three categories:
1. Feature-based classification: transforms a sequence into a feature vector and then applies conventional classification methods
2. Sequence distance-based classification: classifies a sequence using a distance function that measures the similarity between sequences
3. Model-based classification: uses a model of the sequences, e.g. a hidden Markov model (HMM)
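As an illustration of the distance-based category, the sketch below pairs edit (Levenshtein) distance with a 1-nearest-neighbor rule. Both choices are assumptions made for the example; the chapter does not prescribe a particular distance or classifier.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def classify_1nn(query, labeled):
    """Return the label of the training sequence closest to `query`.
    `labeled` is a list of (sequence, label) pairs."""
    return min(labeled, key=lambda item: edit_distance(query, item[0]))[1]

training = [("ACGTA", "gene"), ("TTTTT", "repeat")]
# classify_1nn("ACGTT", training) == "gene"
```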
#### Alignment of Biological Sequences
* **Sequence alignment**:
- Lining up sequences to achieve a maximal identity level
    - Local alignment and global alignment
* Substitution matrices:
- Represent the probabilities of substitutions of nucleotides or amino acids and probabilities of insertions and deletions.
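A global alignment score can be computed with dynamic programming (the Needleman-Wunsch recurrence). The sketch below assumes a flat scoring scheme (`match=1`, `mismatch=-1`, `gap=-2`) in place of a real substitution matrix; those values are illustrative only.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a flat scoring scheme."""
    # prev[j] = best score aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag,               # align ca with cb
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]

# global_alignment_score("ACGT", "ACT") == 1  (three matches, one gap)
```

A local alignment (Smith-Waterman) uses the same recurrence but clamps every cell at zero and reports the maximum cell instead of the last one.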
### Mining Graphs and Networks
1. Graph Pattern Mining:
- Mining frequent subgraphs
- Structure similarity search
2. Statistical Modeling of Networks
- Homogeneous: The nodes and links are of the same type.
- Heterogeneous: The nodes and links are of different types.
- Scale-free model: power law distribution
3. Data Cleaning, Integration, and Validation by Information Network Analysis
4. Clustering and Classification of Graphs and Homogeneous Networks
- Discover hidden communities, hubs, and outliers
5. Clustering, Ranking, and Classification of Heterogeneous Networks
6. Role Discovery and Link Prediction in Information Networks
- Link prediction: Assess expected relationships among the candidate nodes/links.
7. Similarity Search and OLAP in Information Networks
- OLAP: Online analytical processing
- Path-based similarity
8. Evolution of Social and Information Networks
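For the link prediction task in item 6, one common heuristic (an assumption here, not the chapter's method) is to score each unlinked node pair by the Jaccard overlap of their neighbor sets. The graph is assumed to be undirected and stored as a dict mapping each node to its set of neighbors.

```python
def jaccard_link_scores(adjacency):
    """Score every non-adjacent node pair by the Jaccard overlap of neighbors."""
    nodes = sorted(adjacency)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in adjacency[u]:
                continue  # already linked, nothing to predict
            union = adjacency[u] | adjacency[v]
            if union:
                scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores

graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
# jaccard_link_scores(graph) == {("a", "d"): 0.5, ("c", "d"): 0.5}
```

Pairs with higher scores share a larger fraction of their neighbors and are treated as more likely future links.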
### Mining Other Kinds of Data
#### Mining Spatial Data
* Discovers patterns and knowledge from spatial data, like geospace-related data
* Popular topics:
- Mining spatial associations and co-location patterns
- Spatial clustering
- Spatial classification
- Spatial modeling
- Spatial trend and outlier analysis
#### Mining Spatiotemporal Data and Moving Objects
* Spatiotemporal data: relates to both space and time, e.g. the evolutionary history of cities and lands, or global warming trends
* Moving-object data (important): Mining movement patterns of multiple moving objects
#### Mining Cyber-Physical System Data
* e.g. A transportation system that links a transportation monitoring network
* Requires real-time computation and prompt responses
#### Mining Multimedia Data
* Including image data, video data, audio data, as well as sequence data and hypertext data
#### Mining Text Data
* Discovery of patterns and trends using statistical pattern learning, topic modeling, and statistical language modeling, etc.
#### Mining Web Data
* Web content mining: text, multimedia data, and structured data
* Web structure mining: hyperlinks
- Using graph and network mining methods to analyze the nodes and connection structures on the Web.
* Web usage mining: Server logs
- Understands users’ search patterns, trends, and associations
- Predicts what users are looking for on the Internet
#### Mining Data Streams
* A stream can be read only once, in sequential order
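The single-pass restriction means summaries must be updated incrementally. As a minimal sketch, Welford's method maintains a running mean and variance while storing no past values; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and sample variance over a stream, in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator); 0.0 for fewer than 2 items."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in iter([2, 4, 4, 4, 5, 5, 7, 9]):  # each item is seen exactly once
    stats.update(value)
```

Memory use stays constant no matter how long the stream runs, which is the property stream algorithms generally aim for.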
## Other Methodologies of Data Mining

### Statistical Data Mining
* **Regression**: Predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric.
* **Generalized linear models**: Include logistic regression and Poisson regression
* **Analysis of variance**: Analyze experimental data for two or more populations described by a numeric response variable
* **Mixed-effect models**: Analyze grouped data that can be classified according to one or more grouping variables.
* **Factor analysis**: Determine which variables are combined to generate a given factor.
* **Discriminant analysis**: Determine several discriminant functions that discriminate among the groups defined by the response variable
* **Survival analysis**: Predict the probability that a patient undergoing a medical treatment would survive at least to time $t$.
* **Quality control**: Shewhart charts and CUSUM charts
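To show what a CUSUM chart tracks, the sketch below implements an upper one-sided CUSUM: it accumulates deviations above a target (minus a slack allowance) and signals when the sum exceeds a threshold. The parameter names and the restart-after-alarm behavior are assumptions made for this example.

```python
def cusum_alarms(values, target, slack, threshold):
    """Return the indices where the upper one-sided CUSUM exceeds `threshold`."""
    s = 0.0
    alarms = []
    for i, x in enumerate(values):
        # Accumulate only upward drift beyond the slack; never go below zero.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            alarms.append(i)
            s = 0.0  # restart the statistic after signaling
    return alarms

readings = [10, 10, 10, 12, 12, 12, 12]
# cusum_alarms(readings, target=10, slack=0.5, threshold=4) == [5]
```

Because small deviations accumulate, a CUSUM chart tends to detect a sustained shift in the mean sooner than a Shewhart chart, which judges each point on its own.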
### Views on Data Mining Foundations
* **Data reduction**: Include singular value decomposition, wavelets, regression, log-linear models, histograms, clustering, sampling, and the construction of index trees
* **Data compression**: Compress the given data by encoding it in terms of bits, association rules, decision trees, or clusters.
* **Probability and statistical theory**: Discover joint probability distributions of random variables.
* **Microeconomic view**: Treats a pattern as interesting only to the extent that it can be used in the decision-making process of some enterprise
* **Pattern discovery and inductive databases**: Discover patterns occurring in the data such as associations, classification models, sequential patterns
### Visual and Audio Data Mining
* **Visual data mining**: Integrates two disciplines:
- Data visualization
- Data mining
* **Audio data mining**: Uses audio signals to indicate the patterns of data
## Data Mining Applications

* Financial Data Analysis
* Retail and Telecommunication Industries
* Science and Engineering
* Intrusion Detection and Prevention
* Recommender Systems
## Data Mining and Society
* Customer relationship management (CRM): Provide more customized, personal service addressing individual customers' needs.
### Privacy-preserving data mining
* Obtaining valid data mining results without disclosing the underlying sensitive data values
* **Randomization methods**: Add noise to the data to mask some attribute values of records.
* **The k-anonymity and l-diversity methods**:
    - **k-anonymity**: The granularity of data representation is reduced sufficiently so that any given record maps onto at least k other records in the data.
- **l-diversity**: Enforcing intragroup diversity of sensitive values to ensure anonymization.
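Both properties can be checked on an already-generalized table. The sketch below assumes the common formulation in which every group of records sharing the same quasi-identifier values must contain at least `k` records (k-anonymity) and at least `l` distinct sensitive values (l-diversity). The column names and sample records are made up for illustration.

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_ids)].append(rec)
    return groups

def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier group holds at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(g) >= k for g in groups.values())

def is_l_diverse(records, quasi_ids, sensitive, l):
    """True if every group holds at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len({r[sensitive] for r in g}) >= l for g in groups.values())

# Hypothetical generalized table: age and zip are the quasi-identifiers.
table = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]
# The table is 2-anonymous but not 2-diverse: the second group has only "flu".
```

The example shows why k-anonymity alone is not enough: anyone known to be in the second group is revealed to have the flu even though the group has two records.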