
NOTE: Data Mining: Concepts and Techniques

database
data warehouse - integrates data originating from multiple sources and various timeframes; organized so as to facilitate management decision making
multidimensional data mining

Data Scale


Basic Statistical Descriptions of Data
Mean, Median, and Mode
Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range

Visualization
pixel-oriented, geometric-based, icon-based, or hierarchical.

Data Quality
accuracy, completeness, consistency, timeliness,
believability, and interpretability

Ch3 Data Preprocessing

Major Tasks

Data cleaning

filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies

Missing Values

  • Ignore the tuple
  • Fill in the missing value manually
  • Fill in the missing value
    Use a global constant
    Use a measure of central tendency for the attribute (e.g., the mean or median)
  • Use the attribute mean or median for all samples belonging to the same class as the given tuple
  • Use the most probable value to fill in the missing value (see the sketch after this list)
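A minimal pandas sketch of the fill strategies above, assuming a toy DataFrame with a hypothetical numeric attribute `income` and class label `class`:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)

# Central tendency of the whole attribute (median here)
df["income_median"] = df["income"].fillna(df["income"].median())

# Mean of the samples belonging to the same class as the given tuple
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```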

Noisy Data

  • Binning - smooths a sorted data value by consulting its "neighborhood," that is, the values around it (see the sketch after this list)

  • Regression
  • Outlier analysis (Clustering)
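A minimal numpy sketch of binning: sort the values, partition them into equal-frequency bins, and smooth each value by its bin mean (toy prices):

```python
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Equal-frequency (equal-depth) binning into 3 bins
bins = np.array_split(prices, 3)

# Smoothing by bin means: every value is replaced by the mean of its bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [9. 9. 9. 22. 22. 22. 29. 29. 29.]
```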

Data Integration

combines data from multiple sources into a coherent data store, as in data warehousing

  • Entity Identification Problem
  • Redundancy and Correlation Analysis (see the sketch after this list)
    nominal data - use χ2 (chi-square) test
    numeric attributes - use the correlation coefficient and covariance
  • Tuple Duplication (duplicate tuples) - there are two or more identical tuples for a given unique data entry case
  • Data Value Conflict Detection and Resolution
    (EX) Hotel Chain:
    Price: different currencies / different services (e.g., free breakfast) / taxes.
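A minimal sketch of both redundancy checks, assuming scipy is available; the contingency counts and numeric values are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Nominal data: chi-square test on a 2x2 contingency table
observed = np.array([[250, 200],
                     [ 50, 1000]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a small p-value suggests the two attributes are correlated

# Numeric attributes: correlation coefficient and covariance
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(np.corrcoef(x, y)[0, 1])  # Pearson correlation coefficient
print(np.cov(x, y)[0, 1])       # covariance
```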

Data Reduction

mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

  • dimensionality reduction
    the process of reducing the number of random variables or attributes under consideration.

Methods:
Wavelet Transforms - discrete wavelet transform (DWT)
Principal Components Analysis (PCA) - searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction (see the sketch below).
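A minimal numpy sketch of PCA as defined above: eigendecomposition of the covariance matrix, keeping the k orthogonal directions of largest variance (toy random data):

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional data onto the k orthogonal vectors (k <= n)
    that best represent it, i.e., the top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # n x n covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # k largest-variance directions
    return Xc @ top                            # data projected onto a smaller space

X = np.random.rand(100, 5)
print(pca(X, 2).shape)  # (100, 2)
```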

  • numerosity reduction
    replace the original data volume by alternative, smaller forms of data representation
  • data compression
    transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
  • Attribute Subset Selection
    Reduce the data set size by removing irrelevant or redundant attributes

Heuristic methods

  • Step-wise forward selection (see the sketch after this list)
  • Step-wise backward elimination
  • Combining forward selection and backward elimination
  • Decision-tree induction
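A minimal sketch of step-wise forward selection, assuming a caller-supplied `score(subset)` function (hypothetical) that rates an attribute subset, e.g., by cross-validated accuracy:

```python
def forward_selection(attributes, score, max_attrs=None):
    """Greedy forward selection: start from the empty set and repeatedly
    add the attribute that most improves score(subset); stop when no
    attribute helps (or max_attrs is reached)."""
    selected = []
    best = score([])
    remaining = list(attributes)
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        cand, cand_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda t: t[1],
        )
        if cand_score <= best:  # no remaining attribute improves the score
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy scorer that rewards "a" and "b" only
print(forward_selection(["a", "b", "c"],
                        score=lambda s: sum(x in ("a", "b") for x in s)))
# -> ['a', 'b']
```

Backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal improves (or least hurts) the score.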

  • Regression and Log-Linear Models: Parametric Data Reduction

  • Clustering

  • Sampling
    Simple random sample without replacement (SRSWOR) of size s - every tuple has an equal probability of being drawn
    Simple random sample with replacement (SRSWR) of size s - each drawn tuple is placed back before the next draw
    Cluster sample - clusters formed according to known criteria or features serve as the sampling units; all data in the sampled clusters are taken as the sample
    Stratified sample - divide the data into strata and sample from each stratum; simple random sampling may perform poorly on skewed data, so stratified sampling is used in conjunction with skewed data (see the sketch after this list)
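A minimal pandas sketch of SRSWOR, SRSWR, and stratified sampling; `age_group` is a hypothetical stratum label:

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["young"] * 8 + ["senior"] * 2,
                   "value": range(10)})
s = 4

srswor = df.sample(n=s, replace=False)  # SRSWOR: equal probability, no replacement
srswr  = df.sample(n=s, replace=True)   # SRSWR: each drawn tuple is placed back

# Stratified sample: sample within each stratum so that the skewed
# "senior" group is still represented
stratified = df.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.5)
)
print(stratified)
```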

  • Data Cube Aggregation

Data Transformation and Data Discretization

Data Transformation by Normalization

To help avoid dependence on the choice of measurement units, the data should be normalized or standardized
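A minimal numpy sketch of three common normalizations (min-max, z-score, and decimal scaling) on toy values:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization (standardization)
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```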

Discretization

  • Binning (see the sketch after this list)
  • Histogram Analysis
  • Cluster, Decision Tree, and Correlation Analyses
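A minimal pandas sketch contrasting equal-width binning with equal-frequency (histogram-style) intervals on toy ages:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 40, 52])

# Binning into 3 equal-width intervals
print(pd.cut(ages, bins=3))

# 3 equal-frequency (equal-depth) intervals
print(pd.qcut(ages, q=3))
```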

Concept Hierarchy Generation for Nominal Data

Nominal attributes - have a finite number of distinct values, with no ordering among the values.
(Ex) geographic location, job category, and item type.

Concept Hierarchy - transform the data into multiple levels of granularity

  1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
     (EX) street < city < province or state < country

  2. Specification of a portion of a hierarchy by explicit data grouping
     (EX) "{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada" and "{British Columbia, prairies Canada} ⊂ Western Canada."

  3. Specification of a set of attributes, but not of their partial ordering
     higher-level concepts generally cover several subordinate lower-level concepts
     (EX) an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street)

  4. Specification of only a partial set of attributes

Automatic generation of a schema concept hierarchy based on the number of distinct attribute values (see the sketch below).
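A minimal pandas sketch of this heuristic: rank the attributes by their number of distinct values, with the fewest distinct values at the top of the hierarchy (toy location data):

```python
import pandas as pd

df = pd.DataFrame({
    "country":  ["Canada"] * 6,
    "province": ["BC", "BC", "AB", "AB", "MB", "MB"],
    "city":     ["Vancouver", "Victoria", "Calgary", "Edmonton",
                 "Winnipeg", "Brandon"],
})

# Fewest distinct values -> highest concept level
order = sorted(df.columns, key=lambda col: df[col].nunique())
print(" < ".join(reversed(order)))  # city < province < country
```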

Ch4 Data Warehousing and Online Analytical Processing

Data Warehouse: Basic Concepts

data warehouse - refers to a data repository that is maintained separately from an organization’s operational databases.

  • subject-oriented - focuses on the modeling and analysis of data for decision makers, excluding data that are not useful in the decision support process; data are grouped by meaning into the same subject areas
  • integrated
  • time-variant
  • nonvolatile - once data are written, they are not replaced or deleted, even if they are wrong

Differences between Operational Database Systems and Data Warehouses

Data Warehousing: A Multitiered Architecture

  • Bottom - warehouse database server
  • Middle - OLAP server
    relational OLAP (ROLAP) (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations)
    multidimensional OLAP (MOLAP) (i.e., a special-purpose server that directly implements multidimensional data and operations)
  • Top - front-end client layer (e.g., trend analysis, prediction)

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

Extraction, Transformation, and Loading

  • Data extraction - gathers data from multiple, heterogeneous, and external sources.
  • Data cleaning - detects errors in the data and rectifies them when possible.
  • Data transformation - converts data from legacy or host format to warehouse format.
  • Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
  • Refresh - propagates the updates from the data sources to the warehouse.

Metadata Repository

Data Warehouse Modeling: Data Cube and OLAP

Data warehouses and OLAP tools are based on a multidimensional data model.
consists of dimensions & measures

Data Cube: A Multidimensional Data Model

data cube - allows data to be modeled and viewed in multiple dimensions

2D / 3D / 4D data cube

Stars/ Snowflakes/ Fact Constellations: Schemas for Multidimensional Data Models

Dimensions: The Role of Concept Hierarchies

defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

Typical OLAP Operations

  • Roll-up - aggregation - climbing up a concept hierarchy for a dimension or by dimension reduction (see the sketch after this list)
  • Drill-down - the reverse of roll-up - stepping down a concept hierarchy for a dimension or introducing additional dimensions
  • Slice and dice
  • Pivot (rotate)
  • Other OLAP operations
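OLAP servers execute these operations natively; a minimal pandas sketch that mimics roll-up, drill-down, slice, and pivot on a toy sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "dollars": [100, 120, 90, 110],
})

# Roll-up: climb the time hierarchy from quarter to year
rollup = sales.groupby("year")["dollars"].sum()

# Drill-down: step back down by re-introducing the quarter level
drilldown = sales.groupby(["year", "quarter"])["dollars"].sum()

# Slice: fix one dimension to a single value
slice_ = sales[sales["city"] == "Vancouver"]

# Pivot (rotate): an alternative 2-D view of the same data
pivot = sales.pivot_table(index="city", columns="year",
                          values="dollars", aggfunc="sum")
print(pivot)
```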

OLAP Systems versus Statistical Databases

  • SDB - focus on socioeconomic applications - low-level data
  • OLAP - business applications - efficiently handling huge amounts of data

A Starnet Query Model for Querying Multidimensional Databases

Data Warehouse Design and Usage

A Business Analysis Framework for Data Warehouse Design

4 views

  • top-down view
  • data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE
  • data warehouse view
  • business query view - from the end-user’s viewpoint

Data Warehouse Design Process

  • top-down approach - starts with overall design and planning
    technology is mature and well known / the business problems that must be solved are clear and well understood
  • bottom-up approach - experiments and prototypes
  • combination
  1. Choose a business process to model
    (orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
  2. Choose the business process grain, which is the fundamental, atomic level of data to be represented in the fact table for this process
    (e.g., individual transactions, individual daily snapshots)
  3. Choose the dimensions that will apply to each fact table record.
    (time, item, customer, supplier, warehouse, transaction type, status)
  4. Choose the measures that will populate each fact table record.
    (numeric additive quantities - dollars sold, units sold)

Data Warehouse Usage for Information Processing

  • Information processing - querying, basic statistical analysis, and reporting (web-based)
  • Analytical processing - basic OLAP operations (slice-and-dice, drill-down, roll-up, and pivoting)
  • Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, presenting)

Data Warehouse Implementation

Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids


```
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
```
Each cuboid represents a different group-by.
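A minimal pandas sketch of the lattice: one group-by per cuboid, 2^3 = 8 cuboids for the three dimensions in the cube definition above (toy data):

```python
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "phone", "phone"],
    "city": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "year": [2023, 2024, 2023, 2024],
    "sales_in_dollars": [100, 120, 90, 110],
})

dims = ["item", "city", "year"]
for k in range(len(dims) + 1):          # from the apex (0-D) to the base (3-D)
    for cuboid in combinations(dims, k):
        if cuboid:
            print(sales.groupby(list(cuboid))["sales_in_dollars"].sum(), "\n")
        else:
            print("apex cuboid:", sales["sales_in_dollars"].sum(), "\n")
```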

Partial Materialization: Selected Computation of Cuboids

  1. No materialization
  2. Full materialization
  3. Partial materialization - materialize some cuboids
     Should consider: (1) identifying the subset of cuboids to materialize, (2) exploiting the materialized cuboids during query processing, (3) efficiently updating them during load and refresh

Indexing OLAP Data: Bitmap Index and Join Index

  • bitmap indexing - fast for low-cardinality domains
    not suitable for high-cardinality domains (see the sketch after this list)

  • join indexing
    join indexing registers the joinable rows of two relations from a relational database.
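A minimal numpy sketch of a bitmap index: one bit vector per distinct value, so a selection becomes bitwise arithmetic. It also shows why high-cardinality domains are costly, since each distinct value needs its own vector (the column is hypothetical):

```python
import numpy as np

city = np.array(["V", "T", "V", "M", "T", "V"])  # low-cardinality attribute

# Build one bit vector per distinct value; bit i is set
# if record i carries that value
bitmap = {v: (city == v) for v in np.unique(city)}

# The query  city = 'V' OR city = 'T'  is a fast bitwise OR
mask = bitmap["V"] | bitmap["T"]
print(np.nonzero(mask)[0])  # ids of the matching records
```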

Efficient Processing of OLAP Queries

  1. Determine which operations should be performed on the available cuboids
  2. Determine to which materialized cuboid(s) the relevant operations should be applied

OLAP Server Architectures: ROLAP / MOLAP / HOLAP

  • Relational OLAP (ROLAP) servers - relational or extended-relational DBMS; greater scalability
    computed on the fly: lower query efficiency but low storage requirements
  • Multidimensional OLAP (MOLAP) servers - sparse array-based multidimensional storage engine; fast indexing to pre-computed summarized data
  • Hybrid OLAP (HOLAP) servers - flexibility; combine ROLAP scalability with MOLAP speed
  • Specialized SQL servers - specialized support for SQL queries over star/snowflake schemas

Ch5 Data Cube Technology

Data Cube Computation: Preliminary Concepts

Cube Materialization

  • Full Cube - precompute every cuboid in the lattice
  • Iceberg Cube - materialize only the cells whose aggregate value (e.g., count) satisfies a minimum support threshold

Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

Frequent pattern mining searches for recurring relationships in a given data set

(EX) Market Basket Analysis

  • Frequent Itemsets - Let I = {I1, I2, ..., Im} be an itemset.
  • Closed Itemsets
  • Association Rules


The occurrence frequency of an itemset is the number of transactions that contain the itemset.

  • Association rule mining can be viewed as a two-step process:
  1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup. (Most important)

  2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

Frequent Itemset Mining Methods

Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

  • Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:
D, a database of transactions; min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
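A minimal Python sketch of the level-wise algorithm, with the join and prune (Apriori property) steps marked; the transactions and min_sup are toy inputs:

```python
from itertools import combinations

def apriori(D, min_sup):
    """D is a list of transactions (sets of items)."""
    def support(itemset):
        return sum(itemset <= t for t in D)

    items = {i for t in D for i in t}
    L = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # Join step: form size-k candidates from frequent (k-1)-itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        L.append({c for c in cands if support(c) >= min_sup})
        k += 1
    return [s for level in L for s in level]

D = [{"beer", "nuts", "diaper"}, {"beer", "diaper"},
     {"beer", "coffee"}, {"nuts", "diaper"}]
print(apriori(D, min_sup=2))
```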

Generating Association Rules from Frequent Itemsets
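A minimal sketch of the rule-generation step, assuming a dict of support counts for all frequent itemsets (every nonempty subset of a frequent itemset is itself frequent, so its support is known from the Apriori pass):

```python
from itertools import combinations

def gen_rules(freq_itemsets, support, min_conf):
    """For each frequent itemset l and nonempty proper subset s, output
    s => (l - s) if confidence = sup(l) / sup(s) >= min_conf."""
    rules = []
    for l in freq_itemsets:
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = support[l] / support[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

sup = {frozenset({"beer"}): 3, frozenset({"diaper"}): 3,
       frozenset({"beer", "diaper"}): 2}
print(gen_rules([frozenset({"beer", "diaper"})], sup, min_conf=0.6))
# beer => diaper and diaper => beer, each with confidence 2/3
```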

Improving the Efficiency of Apriori

  • Hash-based technique
  • Transaction reduction
  • Partitioning
  • Sampling
  • Dynamic itemset counting

A Pattern-Growth Approach for Mining Frequent Itemsets

Apriori may suffer from two nontrivial costs:

  1. It may still need to generate a huge number of candidate sets.
  2. It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.

> To solve these problems: frequent pattern growth, or FP-growth (finding frequent itemsets without candidate generation)

  • An FP-tree registers compressed, frequent pattern information.

Mining Closed and Max Patterns

  • Item merging
  • Sub-itemset pruning
  • Item skipping

Pattern Evaluation Methods

Which Patterns Are Interesting?

Strong Rules Are Not Necessarily Interesting - sometimes misleading
From Association Analysis to Correlation Analysis
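A minimal sketch of the lift measure that carries association analysis over to correlation analysis; the counts are toy numbers:

```python
def lift(sup_ab, sup_a, sup_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); > 1 means positively
    correlated, = 1 independent, < 1 negatively correlated."""
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

# 10,000 transactions: 6,000 contain A, 7,500 contain B, 4,000 contain both
print(lift(4000, 6000, 7500, 10000))  # ~0.89 < 1: negatively correlated
```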