# NOTE: Data Mining: Concepts and Techniques

* database
* data warehouse - integrates data originating from multiple sources and various timeframes, organized so as to facilitate management decision making
* multidimensional data mining

## Data Scale

![](https://i.imgur.com/yec7ag4.jpg)

## Basic Statistical Descriptions of Data

* Central tendency: mean, median, and mode
* Dispersion of data: range, quartiles, variance, standard deviation, and interquartile range

## Visualization

pixel-oriented, geometric-based, icon-based, or hierarchical

## Data Quality

accuracy, completeness, consistency, timeliness, believability, and interpretability

# Ch3 Data Preprocessing

![](https://i.imgur.com/LLTGYoB.jpg)

## Major Tasks

Data cleaning, data integration, data reduction, and data transformation (including discretization).

## Data Cleaning

Filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

### Missing Values

* Ignore the tuple
* Fill in the missing value manually
* Use a global constant to fill in the missing value
* Use a measure of central tendency for the attribute (e.g., the mean or median)
* Use the attribute mean or median for all samples belonging to the same class as the given tuple
* Use the most probable value to fill in the missing value

### Noisy Data

* Binning - smooths a sorted data value by consulting the values around it
  ![](https://i.imgur.com/cGIeXws.jpg)
* Regression
* Outlier analysis (clustering)
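To make these cleaning steps concrete, here is a minimal sketch in Python with pandas (written for these notes, not from the book; the toy `df` and its columns are invented): it fills a missing value with the attribute median and smooths a noisy attribute by equal-depth bin means.

```python
import pandas as pd

# Toy data with a missing value and a noisy attribute (invented for illustration).
df = pd.DataFrame({"age": [23, 25, None, 31, 45, 52, 61, 70],
                   "income": [12, 15, 14, 90, 33, 35, 80, 85]})

# Missing values: fill with a measure of central tendency (here, the median).
df["age"] = df["age"].fillna(df["age"].median())

# Noisy data: partition into equal-depth (equal-frequency) bins,
# then smooth each value by its bin mean.
bins = pd.qcut(df["income"], q=4)
df["income_smoothed"] = df.groupby(bins, observed=True)["income"].transform("mean")
print(df)
```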
## Data Integration

**Combines data from multiple sources into a coherent data store, as in data warehousing.**

* Entity identification problem
* Redundancy and correlation analysis
    * nominal data - use the χ2 (chi-square) test
    * numeric attributes - use the correlation coefficient and covariance
* Tuple duplication (duplicate records) - there are two or more identical tuples for a given unique data entry case
* Data value conflict detection and resolution
  (EX) Hotel chain: price may be quoted in different currencies, cover different services (e.g., free breakfast), or include different taxes.

## Data Reduction

**Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.**

* Dimensionality reduction - the process of reducing the number of random variables or attributes under consideration.
  **Methods:**
    * Wavelet transforms - discrete wavelet transform (DWT)
    * Principal components analysis (PCA) - searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. (See the first sketch after this section.)
* Numerosity reduction - reduce the data volume by choosing alternative, smaller forms of data representation.
* Data compression - transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
* Attribute subset selection - reduce the data set size by removing irrelevant or redundant attributes.
  **Heuristic methods:**
    * step-wise forward selection
    * step-wise backward elimination
    * combining forward selection and backward elimination
    * decision-tree induction

  ![](https://i.imgur.com/a4Vs0Wt.jpg)
  ![](https://i.imgur.com/ytbX2bA.jpg)
* Regression and log-linear models: parametric data reduction
* Clustering
* Sampling (see the second sketch after this section)
    * Simple random sample without replacement (SRSWOR) of size s - every tuple has the same probability of being drawn, and a drawn tuple is not returned
    * Simple random sample with replacement (SRSWR) of size s - each drawn tuple is placed back before the next draw, so it may be drawn again
    * Cluster sample - the tuples are grouped into clusters according to a known criterion or feature; a random sample of clusters is drawn, and all tuples in the selected clusters form the sample
    * Stratified sample - a sample is drawn from each stratum; simple random sampling may perform poorly on skewed data, so stratified sampling is used in conjunction with skewed data
* Data cube aggregation
  ![](https://i.imgur.com/HV4ayE7.jpg)
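The PCA item above references this first sketch: a minimal illustration with scikit-learn (the synthetic data and the choice k = 2 are arbitrary, not from the book), projecting n = 5 attributes onto the k orthogonal directions that capture the most variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 tuples, n = 5 attributes
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # a redundant, correlated attribute

pca = PCA(n_components=2)         # search for k = 2 orthogonal vectors, k <= n
X_reduced = pca.fit_transform(X)  # original data projected onto the smaller space

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```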
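The sampling item references this second sketch: SRSWOR, SRSWR, and a stratified sample expressed with pandas (the `region` stratum column and the sample size s are invented for illustration).

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "N", "S", "S", "E"] * 10,
                   "sales": range(60)})
s = 10

srswor = df.sample(n=s, replace=False, random_state=1)  # SRSWOR: no tuple repeats
srswr  = df.sample(n=s, replace=True,  random_state=1)  # SRSWR: tuples may repeat

# Stratified sample: draw proportionally from each stratum ('region'),
# so skewed strata stay represented.
strat = df.groupby("region").sample(frac=s / len(df), random_state=1)

print(len(srswor), len(srswr), len(strat))
```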
## Data Transformation and Data Discretization

### Data Transformation by Normalization

To help avoid dependence on the choice of measurement units, the data should be **normalized or standardized** (see the sketch after the discretization list below).

### Discretization

* Binning
* Histogram analysis
* Cluster, decision tree, and correlation analyses
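The sketch referenced above covers the three standard normalization formulas from the book (min-max, z-score, and decimal scaling); the sample values in `v` are invented.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # attribute values (toy data)

# Min-max normalization to the new range [0, 1]:
# v' = (v - min) / (max - min)
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization (standardization): v' = (v - mean) / std
v_zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: v' = v / 10^j, where j is the smallest
# integer such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10**j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```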
### Concept Hierarchy Generation for Nominal Data

Nominal attributes have a finite number of distinct values, with **no ordering** among the values.
(EX) geographic location, job category, and item type.

A concept hierarchy transforms the data into multiple levels of granularity.

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
   (EX) street < city < province or state < country
2. Specification of a portion of a hierarchy by explicit data grouping
   (EX) “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada”
3. Specification of a set of attributes, but not of their partial ordering - higher-level concepts generally cover several subordinate lower-level concepts
   (EX) an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street)
4. Specification of only a partial set of attributes - automatic generation of a schema concept hierarchy based on the number of distinct attribute values

![](https://i.imgur.com/qANnyNB.jpg)

# Ch4 Data Warehousing and Online Analytical Processing

## Data Warehouse: Basic Concepts

A data warehouse is a data repository that is maintained separately from an organization’s operational databases.

* subject-oriented - focuses on the modeling and analysis of data for decision makers, excluding data that are not useful in the decision support process; emphasizes grouping data into subject areas according to their meaning
* integrated
* time-variant
* nonvolatile - once written, data are not replaced or deleted, even if erroneous

### Differences between Operational Database Systems and Data Warehouses

### Data Warehousing: A Multitiered Architecture

![](https://i.imgur.com/hkLUUn1.jpg)

* Bottom - warehouse database server
* Middle - OLAP server
    * relational OLAP (ROLAP), i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations
    * multidimensional OLAP (MOLAP), i.e., a special-purpose server that directly implements multidimensional data and operations
* Top - front-end client layer (e.g., trend analysis, prediction, ...)

### Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

![](https://i.imgur.com/5D4pdd4.png)

### Extraction, Transformation, and Loading

* Data extraction - gathers data from multiple, heterogeneous, and external sources.
* **Data cleaning** - detects errors in the data and rectifies them when possible.
* **Data transformation** - converts data from legacy or host format to warehouse format.
* Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
* Refresh - propagates the updates from the data sources to the warehouse.

### Metadata Repository

## Data Warehouse Modeling: Data Cube and OLAP

**Data warehouses and OLAP tools are based on a multidimensional data model**, which consists of **dimensions & measures**.

### Data Cube: A Multidimensional Data Model

A data cube allows data to be modeled and viewed in multiple dimensions (2-D / 3-D / 4-D data cubes).

### Stars / Snowflakes / Fact Constellations: Schemas for Multidimensional Data Models

### Dimensions: The Role of Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of **low-level concepts to higher-level** concepts.

### Typical OLAP Operations

* **Roll-up** - aggregation, by climbing up a concept hierarchy for a dimension or by dimension reduction
* **Drill-down** - the reverse of roll-up; stepping down a concept hierarchy for a dimension or introducing additional dimensions
* **Slice and dice**
* **Pivot (rotate)**
* Other OLAP operations

![](https://i.imgur.com/sBLcJ3J.png)
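Roll-up and drill-down map naturally onto group-by aggregation. A minimal sketch with pandas (the fact table, its dimensions, and the measure are all invented for illustration): rolling up along the location hierarchy and by dimension reduction; drill-down is simply the reverse move back to the finer group-by.

```python
import pandas as pd

# Toy fact table: dimensions (country, city, quarter) and measure (dollars_sold).
sales = pd.DataFrame({
    "country": ["Canada"] * 4 + ["USA"] * 4,
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver",
                "Chicago", "Chicago", "New York", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [100, 150, 80, 120, 200, 210, 300, 330],
})

# Base cuboid: group by (city, quarter).
by_city_q = sales.groupby(["city", "quarter"])["dollars_sold"].sum()

# Roll-up along the location hierarchy (city -> country): fewer, coarser cells.
by_country_q = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Roll-up by dimension reduction: drop the time dimension entirely.
by_country = sales.groupby("country")["dollars_sold"].sum()

print(by_city_q, by_country_q, by_country, sep="\n\n")
```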
### OLAP Systems versus Statistical Databases

* SDB - focuses on socioeconomic applications; low-level data
* OLAP - business applications; efficiently handles huge amounts of data

### A Starnet Query Model for Querying Multidimensional Databases

![](https://i.imgur.com/kjikSlZ.png)

## Data Warehouse Design and Usage

### A Business Analysis Framework for Data Warehouse Design

Four views:

* top-down view
* data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE tools
* data warehouse view
* business query view - from the end user’s viewpoint

### Data Warehouse Design Process

* top-down approach - starts with overall design and planning; suitable when the technology is mature and well known and the business problems that must be solved are clear and well understood
* bottom-up approach - starts with experiments and prototypes
* combination of both

1. Choose a *business process* to model (orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
2. Choose the business process *grain*, which is the fundamental, atomic level of data to be represented in the fact table for this process (e.g., individual transactions, individual daily snapshots, ...)
3. Choose the *dimensions* that will apply to each fact table record (time, item, customer, supplier, warehouse, transaction type, status)
4. Choose the *measures* that will populate each fact table record (numeric additive quantities such as dollars sold and units sold)

### Data Warehouse Usage for Information Processing

* Information processing - querying, basic statistical analysis, and reporting (web based)
* Analytical processing - basic OLAP operations (slice-and-dice, drill-down, roll-up, and pivoting)
* Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the results)

## Data Warehouse Implementation

### Efficient Data Cube Computation

A data cube can be viewed as a lattice of cuboids.

![](https://i.imgur.com/NXHudku.png)
![](https://i.imgur.com/jbHR84Q.png)

`define cube sales [item, city, year]: sum (sales_in_dollars)`
`compute cube sales`

Each cuboid represents a different group-by.

### Partial Materialization: Selected Computation of Cuboids

1. No materialization
2. Full materialization
3. Partial materialization - materialize some cuboids. Should consider how to (1) identify the subset of cuboids to materialize, (2) exploit the materialized cuboids during query processing, and (3) efficiently update them during load and refresh.

### Indexing OLAP Data: Bitmap Index and Join Index

* bitmap indexing - fast, but not suitable for high-cardinality domains
  ![](https://i.imgur.com/TgPMZa4.png)
* join indexing - registers the joinable rows of two relations from a relational database
  ![](https://i.imgur.com/yWOnbRx.png)

### Efficient Processing of OLAP Queries

1. Determine which **operations** should be performed on the available cuboids
2. Determine to which **materialized cuboid(s)** the relevant operations should be applied

### OLAP Server Architectures: ROLAP / MOLAP / HOLAP

* Relational OLAP (ROLAP) servers - relational or extended-relational DBMS; greater scalability; queries are computed on the fly, so query efficiency is lower but storage requirements are low
* Multidimensional OLAP (MOLAP) servers - sparse array-based multidimensional storage engine; fast indexing to pre-computed summarized data
* Hybrid OLAP (HOLAP) servers - flexibility, combining ROLAP scalability with MOLAP speed
* Specialized SQL servers - specialized support for SQL queries over star/snowflake schemas

# Ch5 Data Cube Technology

## Data Cube Computation: Preliminary Concepts

### Cube Materialization

* Full cube
* Iceberg cube

# Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

**Frequent pattern** mining searches for **recurring relationships** in a given data set.
(EX) Market basket analysis

* Frequent itemsets - let I = {I1, I2, ..., Im} be the set of items; an itemset is a subset of I
* Closed itemsets
* Association rules

![](https://i.imgur.com/flEhVUM.png)
![](https://i.imgur.com/VYc3OP2.png)

The occurrence frequency of an itemset is the number of transactions that contain the itemset.

Association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. (Most important step.)
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

## Frequent Itemset Mining Methods

### Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

* Apriori property: all nonempty subsets of a frequent itemset must also be frequent.

![](https://i.imgur.com/LJq6j3t.png)

**Algorithm: Apriori.** Find frequent itemsets using an iterative level-wise approach based on candidate generation. (A runnable sketch appears at the end of these notes.)
Input: D, a database of transactions; min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

![](https://i.imgur.com/iaaNJmn.png)

### Generating Association Rules from Frequent Itemsets

(Also sketched at the end of these notes.)

![](https://i.imgur.com/RYzinPn.png)

### Improving the Efficiency of Apriori

* Hash-based technique
* Transaction reduction
* Partitioning
  ![](https://i.imgur.com/fQ1Mx4l.png)
* Sampling
* Dynamic itemset counting

### A Pattern-Growth Approach for Mining Frequent Itemsets

Candidate generate-and-test methods such as Apriori suffer from two nontrivial costs:

1. They may still need to generate a huge number of candidate sets.
2. They may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.

To solve these problems: frequent pattern growth, FP-growth (finding frequent itemsets without candidate generation).

* An FP-tree registers compressed, frequent pattern information.

![](https://i.imgur.com/7OrvLu7.png)

### Mining Closed and Max Patterns

* Item merging
* Sub-itemset pruning
* Item skipping

## Pattern Evaluation Methods: Which Patterns Are Interesting?

### Strong Rules Are Not Necessarily Interesting

A strong rule (one with high support and confidence) can sometimes be misleading.

### From Association Analysis to Correlation Analysis
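The Apriori sketch referenced earlier: a compact, self-contained level-wise implementation in plain Python (written for these notes, not the book’s pseudocode). The transaction database is the book’s nine-transaction AllElectronics example with min_sup = 2.

```python
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
min_sup = 2  # minimum support count threshold

def apriori(transactions, min_sup):
    # L1: frequent 1-itemsets and their support counts.
    items = {i for t in transactions for i in t}
    L = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {s: c for s, c in L.items() if c >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        prev = list(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database to count candidate supports.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

freq = apriori(transactions, min_sup)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```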
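Continuing the same sketch, step 2 generates strong rules from those frequent itemsets by testing confidence(A => B) = support_count(A ∪ B) / support_count(A) against a minimum confidence threshold (min_conf = 0.7 is an arbitrary toy value); it reuses `freq` and `combinations` from the block above.

```python
min_conf = 0.7

def strong_rules(frequent, min_conf):
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        # Every nonempty proper subset A yields a candidate rule A => itemset - A.
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[a]  # subsets of a frequent itemset are frequent
                if conf >= min_conf:
                    yield sorted(a), sorted(itemset - a), conf

for antecedent, consequent, conf in strong_rules(freq, min_conf):
    print(f"{antecedent} => {consequent}  (confidence = {conf:.2f})")
```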
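One standard correlation measure here is lift: lift(A, B) = P(A ∪ B) / (P(A) P(B)), where a value of 1 indicates independence, < 1 negative correlation, and > 1 positive correlation. A tiny worked example using the book’s computer-games/videos numbers, where a seemingly strong rule turns out to be negatively correlated:

```python
n_total = 10_000   # transactions analyzed
n_game  = 6_000    # transactions containing computer games
n_video = 7_500    # transactions containing videos
n_both  = 4_000    # transactions containing both

p_game, p_video, p_both = (n / n_total for n in (n_game, n_video, n_both))

confidence = p_both / p_game        # 0.66: the rule game => video looks strong
lift = p_both / (p_game * p_video)  # 0.89 < 1: games and videos are negatively
                                    # correlated, so the strong rule is misleading
print(f"confidence = {confidence:.2f}, lift = {lift:.2f}")
```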