NOTE : Data Mining: Concepts and Techniques
database
data warehouse - integrates data originating from multiple sources and various timeframes
multidimensional data mining
organized so as to facilitate management decision making
Data Scale
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Basic Statistical Descriptions of Data
Mean, Median, and Mode
Dispersion of Data: Range, Quartiles, Variance,
Standard Deviation, and Interquartile Range
Visualization
pixel-oriented, geometric-based, icon-based, or
hierarchical.
Data Quality
accuracy, completeness, consistency, timeliness,
believability, and interpretability
Ch3 Data Preprocessing
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Major Tasks
Data cleaning
filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
Missing Values
- Ignore the tuple
- Fill in the missing value manually
- Fill in the missing value
Use a global constant
Use a measure of central tendency for the attribute (e.g., the mean or median)
- Use the attribute mean or median for all samples belonging to the same class as
- the given tuple
- Use the most probable value to fill in the missing value
Noisy Data
- Binning - sorted data value by consulting the values around it
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- Regression
- Outlier analysis (Clustering)
Data Integration
combines data from multiple sources into a coherent data store, as in data warehousing
- Entity Identification Problem
- Redundancy and Correlation Analysis
nominal data - use χ2 (chi-square) test
numeric attributes - use the correlation coefficient and covariance
- Tuple Duplication (重複變數)
there are two or more identical tuples for a given
unique data entry case
- Data Value Conflict Detection and Resolution
(EX) Hotel Chain:
Price: different currencies / different services (e.g., free breakfast) / taxes.
Data Reduction
mining on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results
- dimensionality reduction
the process of reducing the number of random variables or attributes under consideration.
Methods:
Wavelet Transforms - discrete wavelet transform (DWT)
Principal Components Analysis(PCA) - searches for k n-dimensional orthogonal
vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction
- numerosity reduction
replace the original data volume by alternative, smaller forms of data representation.
Reduce data volume by choosing alternative, smaller
forms of data representation
- data compression
transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
- Attribute Subset Selection
Reduce the data set size by removing irrelevant or redundant attributes
Heuristic methods
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
–Step-wise forward selection
– Step-wise backward elimination
– Combining forward selection and backward elimination
– Decision-tree induction
-
Regression and Log-Linear Models: Parametric ata Reduction
-
Clustering
-
Sampling
Simple random sample without replacement (SRSWOR) of size s-機率相等
Simple random sample with replacement (SRSWR) of size s-抽完放回
Cluster sample - 依照已知標準霍特徵所排列的Cluster作為抽樣單位,再選取抽樣Cluster中的所有資料作為樣本
Stratified sample - skew may have poor performance, used in conjunction with skewed data
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
To help avoid dependence on the choice of measurement units, the data should be normalized or standardized
Discretization
- Binning
- Histogram Analysis
- Cluster, Decision Tree, and Correlation Analyses
Concept Hierarchy Generation for Nominal Data
Nominal attributes - have a finite number of distinct values, with no ordering among the values.
(Ex) geographic location, job category, and item type.
Concept Hierarchy - transform the data into multiple levels of granularity
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
(EX) street < city < province or state < country
-
Specification of a portion of a hierarchy by explicit data grouping
(EX) “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”
-
Specification of a set of attributes, but not of their partial ordering
higher-level concepts generally cover several subordinate lower-level concepts
(EX) high concept level (e.g., country) will usually contain a smaller
number of distinct values than an attribute defining a lower concept level (e.g.,
street)
-
Specification of only a partial set of attributes
Automatic generation of a schema concept hierarchy based on the number of distinct
attribute values.
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Ch4 Data Warehousing and Online Analytical Processing
Data Warehouse: Basic Concepts
data warehouse - refers to a data repository that is maintained separately from an organization’s operational databases.
- subject-oriented - focuses on the modeling and analysis of data for decision makers. excluding data that are not useful in the decision support process
著重將資料按其意義歸類至相同的主題區
- integrated
- time-variant
- nonvolatile - 資料只要被寫入就不會被取代或刪除,就算是錯的
Differences between Operational Database Systems and Data Warehouses
Data Warehousing: A Multitiered Architecture
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- Bottom - warehouse database server
- Middle - OLAP server
relational OLAP (ROLAP) (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations)
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
- implements multidimensional data and operations)
Top - front-end client layer (e.g., trend analysis, prediction…)
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

- Data extraction - gathers data from multiple, heterogeneous, and external sources.
- Data cleaning - detects errors in the data and rectifies them when possible.
- Data transformation - converts data from legacy or host format to warehouse format.
- Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
- Refresh - propagates the updates from the data sources to the warehouse.
Data Warehouse Modeling: Data Cube and OLAP
Data warehouses and OLAP tools are based on a multidimensional data model.
consists of dimensions & measures
Data Cube: A Multidimensional Data Model
data cube - allows data to be modeled and viewed in multiple dimensions
2D / 3D / 4D data cube
Stars/ Snowflakes/ Fact Constellations: Schemas for Multidimensional Data Models
Dimensions: The Role of Concept Hierarchies
defines a sequence of mappings from a set of low-level concepts to higher-level
Typical OLAP Operations
- Roll-up - aggregation - climbing up a concept hierarchy for a dimension or by dimension reduction
- Drill-down - the reverse of roll-up - stepping down a concept hierarchy for a dimension orintroducing additional dimensions
- Slice and dice
- Pivot (rotate)
- Other OLAP operations

OLAP Systems versus Statistical Databases
- SDB - focus on socioeconomic applications - low-level data
- OLAP - business applications - efficiently handling huge amounts of data
A Starnet Query Model for Querying Multidimensional Databases

Data Warehouse Design and Usage
A Business Analysis Framework for Data Warehouse Design
4 views
- top-down view
- data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE
- data warehouse view
- business query view - from the end-user’s viewpoint
Data Warehouse Design Process
- top-down approach - starts with overall design and planning
technology is mature and well known / the business problems that must be solved are clear and well understood
- bottom-up approach - experiments and prototypes
- combination
- Choose a business process to model
(orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
- Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process
(e.g., individual transactions, individual daily snapshots…)
- Choose the dimensions that will apply to each fact table record.
(time, item, customer, supplier, warehouse, transaction type, status)
- Choose the measures that will populate each fact table record.
(numeric additive quantities - dollars sold, units sold)
- Information processing - querying, basic statistical analysis, and reporting(web-based)
- Analytical processing - basic OLAP operations(slice-and-dice, drill-down, roll-up, and pivoting)
- Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, presenting)
Data Warehouse Implementation
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids


define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
Each cuboid represents a different group-by
Partial Materialization: Selected Computation of Cuboids
- No materialization
- Full materialization
- Partial materialization - Materialize some cuboids
–- Should consider:
(1) identify the subset (2) exploit during query processing
(3) efficiently update during load and refresh.
Indexing OLAP Data: Bitmap Index and Join Index
Efficient Processing of OLAP Queries
- Determine which operations should be performed on the available cuboids
- Determine to which materialized cuboid(s) the relevant operations should be applied
OLAP Server Architectures: ROLAP / MOLAP / HOLAP
- Relational OLAP (ROLAP) servers - relational or extended-relational DBMS, Greater scalability
即時、效率差、空間需求低
- Multidimensional OLAP (MOLAP) servers - Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP) servers - Flexibility. Specialized support for SQL queries over star
- Specialized SQL servers
Ch5 Data Cube Technology
Data Cube Computation: Preliminary Concepts
Cube Materialization
Ch6 Mining Frequent Patterns,Associations, and Correlations: Basic Concepts and Methods
Frequent pattern mining searches for recurring relationships in a given data set
(EX) Market Basket Analysis
- Frequent Itemsets - Let I = {I1, I2,…, Im} be an itemset.
- Closed Itemsets
- Association Rules


The occurrence frequency of an itemset is the number of transactions that contain the itemset.
- Association rule mining can be viewed as a two-step process:
-
Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup. (Most important)
-
Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Frequent Itemset Mining Methods
Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
- Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
D, a database of transactions; min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

Generating Association Rules from Frequent Itemsets

Improving the Efficiency of Apriori
- Hash-based technique
- Transaction reduction
- Partitioning

- Sampling
- Dynamic itemset counting
A Pattern-Growth Approach for Mining Frequent Itemsets
suffer from two nontrivial costs:
- It may still need to generate a huge number of candidate sets.
- It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
–>To solve these problems: frequent pattern growth
FP-growth (finding frequent itemsets without candidate generation)
- An FP-tree registers compressed, frequent pattern information.

Mining Closed and Max Patterns
Item merging
Sub-itemset pruning:
Item skipping
Pattern Evaluation Methods
Which Patterns Are Interesting?
Strong Rules Are Not Necessarily Interesting - sometime misleading
From Association Analysis to Correlation Analysis