NOTE : Data Mining: Concepts and Techniques
database
data warehouse - integrates data originating from multiple sources and various timeframes
multidimensional data mining
organized so as to facilitate management decision making
Data Scale
Basic Statistical Descriptions of Data
Mean, Median, and Mode
Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
Visualization
pixel-oriented, geometric-based, icon-based, or hierarchical
Data Quality
accuracy, completeness, consistency, timeliness, believability, and interpretability
Ch3 Data Preprocessing

Major Tasks
Data cleaning
filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies
Missing Values
- Ignore the tuple
- Fill in the missing value manually
- Fill in the missing value automatically:
Use a global constant
Use a measure of central tendency for the attribute (e.g., the mean or median)
- Use the attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value to fill in the missing value
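The class-conditional option above can be sketched in a few lines; the toy data and the helper name `fill_with_class_median` are hypothetical:

```python
from statistics import median

# Hypothetical toy data: 'income' has missing values (None);
# 'class' is the tuple's class label.
rows = [
    {"class": "A", "income": 30.0},
    {"class": "A", "income": None},
    {"class": "B", "income": 50.0},
    {"class": "B", "income": 70.0},
    {"class": "B", "income": None},
]

def fill_with_class_median(rows, attr, label):
    """Fill missing values of `attr` with the median of the
    non-missing values among tuples of the same class."""
    filled = []
    for r in rows:
        if r[attr] is None:
            same = [s[attr] for s in rows
                    if s[label] == r[label] and s[attr] is not None]
            r = {**r, attr: median(same)}
        filled.append(r)
    return filled

out = fill_with_class_median(rows, "income", "class")
# class A gap filled with 30.0, class B gap with 60.0
```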
Noisy Data
- Binning - smooths a sorted data value by consulting the values around it (its "neighborhood"), e.g., smoothing by bin means, medians, or boundaries

- Regression
- Outlier analysis (Clustering)
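Smoothing by bin means, one binning variant, can be sketched as follows (the sorted prices are a typical toy example):

```python
# Sorted prices partitioned into equal-frequency bins of size 3,
# then each value replaced by its bin's mean.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

def smooth_by_bin_means(values, bin_size):
    out = []
    for i in range(0, len(values), bin_size):
        bin_ = values[i:i + bin_size]
        m = sum(bin_) / len(bin_)
        out.extend([m] * len(bin_))
    return out

smoothed = smooth_by_bin_means(prices, 3)
# bins [4,8,15] -> 9.0, [21,21,24] -> 22.0, [25,28,34] -> 29.0
```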
Data Integration
combines data from multiple sources into a coherent data store, as in data warehousing
- Entity Identification Problem
- Redundancy and Correlation Analysis
nominal data - use χ2 (chi-square) test
numeric attributes - use the correlation coefficient and covariance
- Tuple Duplication (duplicate tuples)
there are two or more identical tuples for a given unique data entry case
- Data Value Conflict Detection and Resolution
(EX) Hotel Chain:
Price: different currencies / different services (e.g., free breakfast) / taxes.
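For nominal attributes, the χ2 statistic can be computed directly from a contingency table of observed versus expected counts; the 2×2 counts below are hypothetical:

```python
# Manual chi-square statistic for a 2x2 contingency table
# (hypothetical counts: two nominal attributes, two values each).
observed = [[250, 200],
            [50, 1000]]

def chi_square(obs):
    n = sum(sum(row) for row in obs)
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(obs[i][j] for i in range(len(obs)))
               for j in range(len(obs[0]))]
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

stat = chi_square(observed)  # large value -> attributes are correlated
```

A large statistic relative to the χ2 critical value (here with 1 degree of freedom) means the two attributes are correlated, so one of them may be redundant.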
Data Reduction
mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results
- dimensionality reduction
the process of reducing the number of random variables or attributes under consideration.
Methods:
Wavelet Transforms - discrete wavelet transform (DWT)
Principal Components Analysis (PCA) - searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction
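A minimal PCA sketch via SVD, assuming NumPy; the data are synthetic and one dimension is made nearly redundant so the projection loses little information:

```python
import numpy as np

# Project n-dimensional data onto the k orthogonal directions of
# greatest variance (k <= n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # hypothetical data, n = 3
X[:, 2] = 2 * X[:, 0] + 0.01 * X[:, 2]  # make one dimension redundant

def pca_project(X, k):
    Xc = X - X.mean(axis=0)             # center the data
    # Rows of Vt are the orthonormal principal directions,
    # ordered by decreasing variance explained.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                # project onto top-k directions

Z = pca_project(X, 2)                   # reduced from 3 to 2 dimensions
```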
- numerosity reduction
replace the original data volume by alternative, smaller forms of data representation.
- data compression
transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
- Attribute Subset Selection
Reduce the data set size by removing irrelevant or redundant attributes
Heuristic methods:
- Step-wise forward selection
- Step-wise backward elimination
- Combining forward selection and backward elimination
- Decision-tree induction
- Regression and Log-Linear Models: Parametric Data Reduction
- Clustering
- Sampling
Simple random sample without replacement (SRSWOR) of size s - every tuple has an equal probability of being drawn
Simple random sample with replacement (SRSWR) of size s - each drawn tuple is placed back, so it may be drawn again
Cluster sample - clusters formed according to known criteria or features serve as the sampling units; all data in each sampled cluster are taken as the sample
Stratified sample - draws from every stratum, so it stays representative even for skewed data, where simple random sampling may perform poorly
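The four sampling schemes can be sketched with the standard library; the data set and strata boundaries are hypothetical:

```python
import random

random.seed(42)
data = list(range(1, 101))   # hypothetical data set of 100 tuples
s = 10                       # sample size

srswor = random.sample(data, s)                  # without replacement
srswr = [random.choice(data) for _ in range(s)]  # with replacement

# Stratified sample: draw proportionally from each stratum so even
# small strata are represented.
strata = {"young": data[:20], "middle": data[20:90], "senior": data[90:]}
stratified = []
for name, group in strata.items():
    k = max(1, round(s * len(group) / len(data)))
    stratified.extend(random.sample(group, k))
```

A cluster sample would instead pick whole strata-like groups at random and keep every tuple inside the chosen groups.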

To help avoid dependence on the choice of measurement units, the data should be normalized or standardized
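Min-max normalization and z-score standardization are the two usual ways to do this; a sketch with hypothetical values:

```python
# Min-max normalization to [0, 1] and z-score standardization.
values = [200.0, 300.0, 400.0, 600.0, 1000.0]

mn, mx = min(values), max(values)
min_max = [(v - mn) / (mx - mn) for v in values]   # unit-free, in [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]      # mean 0, std 1
```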
Discretization
- Binning
- Histogram Analysis
- Cluster, Decision Tree, and Correlation Analyses
Concept Hierarchy Generation for Nominal Data
Nominal attributes - have a finite number of distinct values, with no ordering among the values.
(Ex) geographic location, job category, and item type.
Concept Hierarchy - transform the data into multiple levels of granularity
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
(EX) street < city < province or state < country
- Specification of a portion of a hierarchy by explicit data grouping
(EX) “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”
- Specification of a set of attributes, but not of their partial ordering
higher-level concepts generally cover several subordinate lower-level concepts
(EX) an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street)
- Specification of only a partial set of attributes
Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
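The automatic-generation heuristic is just a sort by distinct-value count, fewest values at the top (most general) level; the counts below are hypothetical:

```python
# Order attributes so the one with the fewest distinct values becomes
# the most general (top) level of the hierarchy.
distinct_counts = {            # hypothetical distinct-value counts
    "street": 674339,
    "city": 3567,
    "province_or_state": 365,
    "country": 15,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)
# top level first: country, province_or_state, city, street
```

Note the heuristic can fail, e.g., `weekday` has only 7 distinct values but is not more general than `month`.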

Ch4 Data Warehousing and Online Analytical Processing
Data Warehouse: Basic Concepts
data warehouse - refers to a data repository that is maintained separately from an organization’s operational databases.
- subject-oriented - focuses on the modeling and analysis of data for decision makers, excluding data that are not useful in the decision support process
(emphasizes grouping data into common subject areas according to their meaning)
- integrated
- time-variant
- nonvolatile - once data are written they are not replaced or deleted, even if they are wrong
Differences between Operational Database Systems and Data Warehouses
Data Warehousing: A Multitiered Architecture

- Bottom - warehouse database server
- Middle - OLAP server
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations)
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations)
- Top - front-end client layer (e.g., trend analysis, prediction…)
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

- Data extraction - gathers data from multiple, heterogeneous, and external sources.
- Data cleaning - detects errors in the data and rectifies them when possible.
- Data transformation - converts data from legacy or host format to warehouse format.
- Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
- Refresh - propagates the updates from the data sources to the warehouse.
Data Warehouse Modeling: Data Cube and OLAP
Data warehouses and OLAP tools are based on a multidimensional data model.
consists of dimensions & measures
Data Cube: A Multidimensional Data Model
data cube - allows data to be modeled and viewed in multiple dimensions
2D / 3D / 4D data cube
Stars/ Snowflakes/ Fact Constellations: Schemas for Multidimensional Data Models
Dimensions: The Role of Concept Hierarchies
defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
Typical OLAP Operations
- Roll-up - aggregation - climbing up a concept hierarchy for a dimension or by dimension reduction
- Drill-down - the reverse of roll-up - stepping down a concept hierarchy for a dimension or introducing additional dimensions
- Slice and dice
- Pivot (rotate)
- Other OLAP operations

OLAP Systems versus Statistical Databases
- SDB - focus on socioeconomic applications - low-level data
- OLAP - business applications - efficiently handling huge amounts of data
A Starnet Query Model for Querying Multidimensional Databases

Data Warehouse Design and Usage
A Business Analysis Framework for Data Warehouse Design
4 views
- top-down view
- data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE
- data warehouse view
- business query view - from the end-user’s viewpoint
Data Warehouse Design Process
- top-down approach - starts with overall design and planning
technology is mature and well known / the business problems that must be solved are clear and well understood
- bottom-up approach - experiments and prototypes
- combination
- Choose a business process to model
(orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
- Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process
(e.g., individual transactions, individual daily snapshots…)
- Choose the dimensions that will apply to each fact table record.
(time, item, customer, supplier, warehouse, transaction type, status)
- Choose the measures that will populate each fact table record.
(numeric additive quantities - dollars sold, units sold)
- Information processing - querying, basic statistical analysis, and reporting(web-based)
- Analytical processing - basic OLAP operations(slice-and-dice, drill-down, roll-up, and pivoting)
- Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, presenting)
Data Warehouse Implementation
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids


define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
Each cuboid represents a different group-by
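With n dimensions and no concept hierarchies there are 2^n cuboids (in general, the product of (Li + 1) over dimensions, where Li is the number of hierarchy levels of dimension i). Enumerating the lattice for the sales cube above:

```python
from itertools import combinations

# Every subset of the dimensions is one group-by (one cuboid):
# () is the apex cuboid, the full tuple is the base cuboid.
dims = ["item", "city", "year"]
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]
# 2^3 = 8 cuboids for 3 dimensions without hierarchies
```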
Partial Materialization: Selected Computation of Cuboids
- No materialization
- Full materialization
- Partial materialization - Materialize some cuboids
Should consider: (1) identify the subset of cuboids to materialize; (2) exploit the materialized cuboids during query processing; (3) efficiently update the materialized cuboids during load and refresh.
Indexing OLAP Data: Bitmap Index and Join Index
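A minimal bitmap-index sketch: one bit vector per distinct attribute value, with bit i set when record i has that value (the records are hypothetical):

```python
# Build bitmaps as Python integers: bit i of bitmaps[v] is 1 when
# record i has value v.
records = ["H", "C", "P", "P", "C", "H"]   # hypothetical item types
bitmaps = {}
for i, v in enumerate(records):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

def lookup(value):
    # Record ids holding `value`, recovered from its bitmap.
    bm = bitmaps.get(value, 0)
    return [i for i in range(len(records)) if bm >> i & 1]
```

Equality queries become bit operations, e.g., the AND of two bitmaps answers a conjunctive condition over two attributes.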
Efficient Processing of OLAP Queries
- Determine which operations should be performed on the available cuboids
- Determine to which materialized cuboid(s) the relevant operations should be applied
OLAP Server Architectures: ROLAP / MOLAP / HOLAP
- Relational OLAP (ROLAP) servers - relational or extended-relational DBMS; greater scalability
(computed on the fly: lower efficiency, but low space requirements)
- Multidimensional OLAP (MOLAP) servers - sparse array-based multidimensional storage engine; fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP) servers - flexibility (e.g., low-level data kept relational, high-level aggregations in arrays)
- Specialized SQL servers - specialized support for SQL queries over star/snowflake schemas
Ch5 Data Cube Technology
Data Cube Computation: Preliminary Concepts
Cube Materialization
Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Frequent pattern mining searches for recurring relationships in a given data set
(EX) Market Basket Analysis
- Frequent Itemsets - Let I = {I1, I2,…, Im} be an itemset.
- Closed Itemsets
- Association Rules


The occurrence frequency of an itemset is the number of transactions that contain the itemset.
- Association rule mining can be viewed as a two-step process:
- Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. (Most important)
- Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Frequent Itemset Mining Methods
Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
- Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
D, a database of transactions; min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
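A compact, unoptimized sketch of the level-wise algorithm (join step, then prune step using the Apriori property); the 9 transactions are a typical toy database:

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2  # minimum support count

def apriori(db, min_sup):
    # L1: frequent 1-itemsets
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: size-k candidates from pairs of (k-1)-itemsets
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate with an
        # infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan the database to count support
        Lk = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

L = apriori(transactions, min_sup)
```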

Generating Association Rules from Frequent Itemsets
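For a frequent itemset l, every nonempty proper subset s yields a candidate rule s ⇒ (l − s), kept when confidence = support(l) / support(s) ≥ min_conf. A sketch with hypothetical support counts:

```python
from itertools import combinations

support = {  # hypothetical support counts from a 9-transaction DB
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(l, support, min_conf):
    out = []
    for r in range(1, len(l)):                 # nonempty proper subsets
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = support[l] / support[s]     # conf(s => l - s)
            if conf >= min_conf:
                out.append((s, l - s, conf))
    return out

strong = rules_from(frozenset({"I1", "I2", "I5"}), support, min_conf=0.7)
```

No extra database scans are needed: all the support counts were already gathered while mining the frequent itemsets.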

Improving the Efficiency of Apriori
- Hash-based technique
- Transaction reduction
- Partitioning

- Sampling
- Dynamic itemset counting
A Pattern-Growth Approach for Mining Frequent Itemsets
Apriori-like methods may suffer from two nontrivial costs:
- It may still need to generate a huge number of candidate sets.
- It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
→ To solve these problems: frequent pattern growth
FP-growth (finds frequent itemsets without candidate generation)
- An FP-tree registers compressed, frequent pattern information.

Mining Closed and Max Patterns
- Item merging
- Sub-itemset pruning
- Item skipping
Pattern Evaluation Methods
Which Patterns Are Interesting?
Strong Rules Are Not Necessarily Interesting - sometimes misleading
From Association Analysis to Correlation Analysis
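Lift is the simplest correlation measure: lift(A, B) = P(A ∪ B) / (P(A) P(B)), with lift > 1 meaning positive correlation, < 1 negative, and = 1 independence. A sketch with hypothetical transaction counts:

```python
# Hypothetical counts: 10,000 transactions; 6,000 contain item A,
# 7,500 contain item B, 4,000 contain both.
n, a, b, ab = 10000, 6000, 7500, 4000

lift = (ab / n) / ((a / n) * (b / n))
# ~0.89 < 1: A and B are negatively correlated, even though the rule
# A => B looks "strong" (support 40%, confidence ~66%)
```

This is why a strong rule can be misleading: here buying A actually decreases the likelihood of buying B.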