# NOTE: Data Mining: Concepts and Techniques
* database
* data warehouse - integrates data originating from multiple sources and various timeframes; organized so as to facilitate management decision making
* multidimensional data mining
## Data Scale

### Basic Statistical Descriptions of Data
* Central tendency: mean, median, and mode
* Dispersion of data: range, quartiles, variance, standard deviation, and interquartile range (IQR)
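These descriptions can be computed directly with Python's standard `statistics` module. The data values below are illustrative (a small salary-like attribute), not from any real dataset:

```python
# Basic statistical descriptions of one numeric attribute.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# Dispersion measures
data_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1                                # interquartile range
variance = statistics.pvariance(data)        # population variance
stdev = statistics.pstdev(data)              # population standard deviation
```

Note that the mean (58) sits above the median (54) here: a few large values pull the mean up, which is why the median is preferred for skewed data.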
### Visualization
pixel-oriented, geometric-based, icon-based, or hierarchical techniques
### Data Quality
accuracy, completeness, consistency, timeliness, believability, and interpretability
# Ch3 Data Preprocessing

## Major Tasks
## Data cleaning
filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
### Missing Values
* Ignore the tuple
* Fill in the missing value manually
* Fill in the missing value automatically:
  * Use a global constant
  * Use a measure of central tendency for the attribute (e.g., the mean or median)
  * Use the attribute mean or median for all samples belonging to the same class as the given tuple
  * Use the most probable value to fill in the missing value
### Noisy Data
* Binning - smooths a sorted data value by consulting the values (the "neighborhood") around it
* Regression - smooths by fitting the data to a function
* Outlier analysis (clustering) - values falling outside the clusters may be treated as outliers
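Smoothing by bin means, one variant of the binning approach above, can be sketched as follows. The nine `prices` values are a common illustrative example, partitioned into equal-frequency bins of size 3:

```python
# Smoothing by bin means: sort the values, partition them into
# equal-frequency bins, and replace each value by its bin's mean.
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) divides evenly
    smoothed = []
    for i in range(n_bins):
        bin_vals = data[i * size:(i + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Other variants replace each value by its bin's median or by the nearest bin boundary instead of the bin mean.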
## Data Integration
**combines data from multiple sources into a coherent data store, as in data warehousing**
* Entity Identification Problem
* Redundancy and Correlation Analysis
nominal data - use χ2 (chi-square) test
numeric attributes - use the correlation coefficient and covariance
* Tuple Duplication (duplicate tuples)
there are two or more identical tuples for a given unique data entry case
* Data Value Conflict Detection and Resolution
(EX) Hotel Chain:
Price: different currencies / different services (e.g., free breakfast) / taxes.
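The redundancy and correlation checks listed above can be sketched in plain Python: a chi-square statistic over a contingency table for two nominal attributes, and the Pearson correlation coefficient for two numeric attributes. The counts and values below are illustrative:

```python
def chi_square(table):
    """table[i][j] = observed count; returns the chi-square statistic.

    A large value suggests the two nominal attributes are correlated
    (one may be redundant given the other).
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two numeric attributes, in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

chi2 = chi_square([[250, 200], [50, 1000]])  # far above the critical value
r = pearson_r([1, 2, 3], [2, 4, 6])          # perfectly correlated
```

In practice the resulting chi-square statistic is compared against the chi-square distribution's critical value for the table's degrees of freedom.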
## Data Reduction
**mining on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results**
* dimensionality reduction
the process of reducing the number of random variables or attributes under consideration.
**Methods:**
Wavelet Transforms - discrete wavelet transform (DWT)
Principal Components Analysis (PCA) - searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction
* numerosity reduction
replace the original data volume by alternative, smaller forms of data representation.
* data compression
transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
* Attribute Subset Selection
Reduce the data set size by removing irrelevant or redundant attributes
**Heuristic methods**
* Step-wise forward selection
* Step-wise backward elimination
* Combining forward selection and backward elimination
* Decision-tree induction
* Regression and Log-Linear Models: Parametric Data Reduction
* Clustering
* Sampling
Simple random sample without replacement (SRSWOR) of size s - every tuple has an equal probability of being drawn, and drawn tuples are not returned
Simple random sample with replacement (SRSWR) of size s - drawn tuples are placed back after each draw, so the same tuple may be drawn more than once
Cluster sample - clusters grouped by a known criterion or feature serve as the sampling units; all data within each sampled cluster are taken as the sample
Stratified sample - helps ensure a representative sample, especially when the data are skewed (simple random sampling may perform poorly on skewed data)
* Data Cube Aggregation
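Of the dimensionality-reduction methods above, PCA is easy to sketch with NumPy: center the data, take the top-k eigenvectors of the covariance matrix, and project onto them. This is a minimal sketch under the assumption that NumPy is available, not a full implementation; the six 2-D points are illustrative:

```python
# Minimal PCA: project n-dimensional data onto its k principal components.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the attributes
    vals, vecs = np.linalg.eigh(cov)   # eigenpairs, eigenvalues ascending
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # k largest-variance directions
    return Xc @ top                    # project onto the k components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Z = pca(X, 1)  # six points reduced from 2 dimensions to 1
```

Production code would normally use a library routine (e.g., scikit-learn's `PCA`), which also reports the variance explained by each component.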

## Data Transformation and Data Discretization
### Data Transformation by Normalization
To help avoid dependence on the choice of measurement units, the data should be **normalized or standardized**
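Two common normalization schemes, min-max and z-score, sketched in plain Python; the income values are illustrative:

```python
# Min-max normalization maps values linearly into [new_min, new_max];
# z-score normalization centers on the mean in units of standard deviation.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 73600, 98000]
mm = min_max(incomes)   # 73600 maps to about 0.716 on [0, 1]
zs = z_score(incomes)
```

Min-max preserves the shape of the original distribution but is sensitive to outliers in `min`/`max`; z-score is preferred when the actual minimum and maximum are unknown or unstable.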
### Discretization
* Binning
* Histogram Analysis
* Cluster, Decision Tree, and Correlation Analyses
### Concept Hierarchy Generation for Nominal Data
Nominal attributes - have a finite number of distinct values, with **no ordering** among the values.
(Ex) geographic location, job category, and item type.
Concept Hierarchy - transform the data into multiple levels of granularity
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
(EX) street < city < province or state < country
2. Specification of a portion of a hierarchy by explicit data grouping
(EX) “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”
3. Specification of a set of attributes, but not of their partial ordering
higher-level concepts generally cover several subordinate lower-level concepts
(EX) an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street)
4. Specification of only a partial set of attributes
Automatic generation of a schema concept hierarchy based on the number of distinct
attribute values.
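The automatic-generation heuristic in item 4 can be sketched as follows: order the attributes so that the one with the fewest distinct values sits at the top of the hierarchy. The attribute names and values below are illustrative:

```python
# Order a schema concept hierarchy by distinct value counts:
# fewer distinct values -> higher concept level.
def order_hierarchy(attr_values):
    """attr_values: {attribute_name: list_of_values} -> top-down order."""
    return sorted(attr_values, key=lambda a: len(set(attr_values[a])))

attrs = {
    "street":  ["1st Ave", "2nd Ave", "Main St", "Oak St"],
    "city":    ["Vancouver", "Victoria", "Vancouver", "Victoria"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
}
order = order_hierarchy(attrs)  # country (1) < city (2) < street (4)
```

This is only a heuristic: an attribute like `weekday` has few distinct values yet belongs near the bottom of a time hierarchy, so the generated ordering may need manual adjustment.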

# Ch4 Data Warehousing and Online Analytical Processing
## Data Warehouse: Basic Concepts
data warehouse - refers to a data repository that is maintained separately from an organization’s operational databases.
* subject-oriented - focuses on the modeling and analysis of data for decision makers, excluding data that are not useful in the decision support process; emphasizes grouping data into subject areas according to their meaning
* integrated
* time-variant
* nonvolatile - once written, data are not replaced or deleted, even if erroneous
### Differences between Operational Database Systems and Data Warehouses
### Data Warehousing: A Multitiered Architecture

* Bottom - warehouse database server
* Middle - OLAP server
relational OLAP (ROLAP) (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations)
multidimensional OLAP (MOLAP) (i.e., a special-purpose server that directly implements multidimensional data and operations)
* Top - front-end client layer (e.g., trend analysis, prediction...)
### Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

### Extraction, Transformation, and Loading
* Data extraction - gathers data from multiple, heterogeneous, and external sources.
* **Data cleaning** - detects errors in the data and rectifies them when possible.
* **Data transformation** - converts data from legacy or host format to warehouse format.
* Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
* Refresh - propagates the updates from the data sources to the warehouse.
### Metadata Repository
## Data Warehouse Modeling: Data Cube and OLAP
**Data warehouses and OLAP tools are based on a multidimensional data model.**
consists of **dimensions & measures**
### Data Cube: A Multidimensional Data Model
data cube - allows data to be modeled and viewed in multiple dimensions
2D / 3D / 4D data cube
### Stars/ Snowflakes/ Fact Constellations: Schemas for Multidimensional Data Models
### Dimensions: The Role of Concept Hierarchies
defines a sequence of mappings from a set of **low-level concepts to higher-level**
### Typical OLAP Operations
* **Roll-up** - aggregation - climbing up a concept hierarchy for a dimension or by dimension reduction
* **Drill-down** - the reverse of roll-up - stepping down a concept hierarchy for a dimension or introducing additional dimensions
* **Slice and dice**
* **Pivot (rotate)**
* Other OLAP operations
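Roll-up along a concept hierarchy can be sketched in plain Python: aggregate city-level sales up to the country level. The city-to-country mapping and amounts below are illustrative:

```python
# Roll-up on the location dimension: climb the hierarchy city -> country
# and re-aggregate the measure (here, a sum of sales).
from collections import defaultdict

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

sales = [("Vancouver", 100), ("Toronto", 150),
         ("Chicago", 200), ("New York", 250)]

rolled_up = defaultdict(int)
for city, amount in sales:
    rolled_up[CITY_TO_COUNTRY[city]] += amount
```

Drill-down is the reverse direction and cannot be computed from the rolled-up result alone; it requires the finer-grained data (or a lower-level cuboid).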

### OLAP Systems versus Statistical Databases
* SDB - focus on socioeconomic applications - low-level data
* OLAP - business applications - efficiently handling huge amounts of data
### A Starnet Query Model for Querying Multidimensional Databases

## Data Warehouse Design and Usage
### A Business Analysis Framework for Data Warehouse Design
4 views
* top-down view
* data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE
* data warehouse view
* business query view - from the end-user’s viewpoint
### Data Warehouse Design Process
* top-down approach - starts with overall design and planning
technology is mature and well known / the business problems that must be solved are clear and well understood
* bottom-up approach - experiments and prototypes
* combination
1. Choose a *business process* to model
(orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
2. Choose the business process *grain*, which is the fundamental, atomic level of data to be represented in the fact table for this process
(e.g., individual transactions, individual daily snapshots...)
3. Choose the *dimensions* that will apply to each fact table record
(time, item, customer, supplier, warehouse, transaction type, status)
4. Choose the *measures* that will populate each fact table record
(numeric additive quantities - dollars sold, units sold)
### Data Warehouse Usage for Information Processing
* Information processing - querying, basic statistical analysis, and reporting (web-based)
* Analytical processing - basic OLAP operations (slice-and-dice, drill-down, roll-up, and pivoting)
* Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, presenting)
## Data Warehouse Implementation
### Efficient Data Cube Computation
A data cube can be viewed as a lattice of cuboids.

```
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
```

Each cuboid represents a different group-by.
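Since each cuboid is one group-by, a cube with n dimensions (and no concept hierarchies) has 2^n cuboids. A short sketch enumerating the lattice for the sales cube's three dimensions:

```python
# Enumerate every cuboid (group-by) of the [item, city, year] cube.
from itertools import combinations

dims = ["item", "city", "year"]
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]
# () is the apex cuboid (total sum over everything);
# ("item", "city", "year") is the base cuboid.
```

With concept hierarchies, the count grows to the product of (L_i + 1) over all dimensions, where L_i is the number of hierarchy levels of dimension i, which is why full materialization quickly becomes infeasible.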
### Partial Materialization: Selected Computation of Cuboids
1. No materialization
2. Full materialization
3. Partial materialization - Materialize some cuboids
Issues to consider:
(1) identify the subset of cuboids to materialize; (2) exploit the materialized cuboids during query processing; (3) efficiently update them during load and refresh.
### Indexing OLAP Data: Bitmap Index and Join Index
* bitmap indexing - allows quick searching; not suitable for high-cardinality domains

* join indexing
join indexing registers the joinable rows of two relations from a relational database.
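The bitmap indexing idea above can be sketched as one bit vector per distinct attribute value, where bit i is set when row i takes that value. The `city` column is illustrative:

```python
# Build a bitmap index for one low-cardinality column.
def build_bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        # one bit vector per distinct value, length = number of rows
        index.setdefault(value, [0] * len(column))[i] = 1
    return index

city = ["Vancouver", "Toronto", "Vancouver", "Chicago"]
idx = build_bitmap_index(city)
```

Selections and joins on such columns reduce to fast bitwise AND/OR over the vectors, but each distinct value costs one full-length vector, which is why high-cardinality domains are a poor fit.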

### Efficient Processing of OLAP Queries
1. Determine which **operations** should be performed on the available cuboids
2. Determine to which **materialized cuboid(s)** the relevant operations should be applied
### OLAP Server Architectures: ROLAP / MOLAP / HOLAP
* Relational OLAP (ROLAP) servers - use a relational or extended-relational DBMS; greater scalability (answers computed on the fly: lower efficiency but low storage requirement)
* Multidimensional OLAP (MOLAP) servers - Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data
* Hybrid OLAP (HOLAP) servers - flexibility: combine ROLAP scalability with MOLAP query speed
* Specialized SQL servers - specialized support for SQL queries over star/snowflake schemas
# Ch5 Data Cube Technology
## Data Cube Computation: Preliminary Concepts
### Cube Materialization
* Full Cube
* Iceberg Cube
# Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
**Frequent pattern** mining searches for **recurring relationships** in a given data set
(EX) Market Basket Analysis
* Frequent Itemsets - let I = {I1, I2, ..., Im} be the set of all items; an itemset is a subset of I
* Closed Itemsets
* Association Rules


The occurrence frequency of an itemset is the number of transactions that contain the itemset.
* Association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup. (Most important)
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
## Frequent Itemset Mining Methods
### Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
* Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

**Algorithm: Apriori**. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
D, a database of transactions; min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
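A compact sketch of the level-wise Apriori loop (join frequent (k-1)-itemsets into k-candidates, prune with the Apriori property, then count support). The nine-transaction database is a common illustrative example; `min_sup` is an absolute support count:

```python
# Apriori: level-wise frequent itemset mining with candidate generation.
from itertools import combinations

def apriori(transactions, min_sup):
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup}
    frequent = set(level)
    k = 2
    while level:
        # join step: size-k candidates from frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # count support of the surviving candidates
        level = {c for c in candidates
                 if sum(c <= set(t) for t in transactions) >= min_sup}
        frequent |= level
        k += 1
    return frequent

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
L = apriori(db, min_sup=2)
```

The prune step is where the Apriori property pays off: a candidate with any infrequent subset is discarded before the database is scanned to count it.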

### Generating Association Rules from Frequent Itemsets

### Improving the Efficiency of Apriori
* Hash-based technique
* Transaction reduction
* Partitioning

* Sampling
* Dynamic itemset counting
### A Pattern-Growth Approach for Mining Frequent Itemsets
Candidate generate-and-test methods such as Apriori can suffer from two nontrivial costs:
1. It may still need to generate a huge number of candidate sets.
2. It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
-->To solve these problems: frequent pattern growth
FP-growth (finding frequent itemsets without candidate generation)
* An FP-tree registers compressed, frequent pattern information.

### Mining Closed and Max Patterns
* Item merging
* Sub-itemset pruning
* Item skipping
## Pattern Evaluation Methods
Which Patterns Are Interesting?
Strong Rules Are Not Necessarily Interesting - they can sometimes be misleading
From Association Analysis to Correlation Analysis