# NOTE: Data Mining: Concepts and Techniques

* database
* data warehouse - integrates data originating from multiple sources and various timeframes, organized so as to facilitate management decision making
* multidimensional data mining

## Data Scale

![](https://i.imgur.com/yec7ag4.jpg)

## Basic Statistical Descriptions of Data

* Central tendency: mean, median, and mode
* Dispersion of data: range, quartiles, variance, standard deviation, and interquartile range

## Visualization

pixel-oriented, geometric-based, icon-based, or hierarchical

## Data Quality

accuracy, completeness, consistency, timeliness, believability, and interpretability

# Ch3 Data Preprocessing

![](https://i.imgur.com/LLTGYoB.jpg)

## Major Tasks

Data cleaning, data integration, data reduction, and data transformation (including discretization).

## Data Cleaning

Filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

### Missing Values

* Ignore the tuple
* Fill in the missing value manually
* Use a global constant to fill in the missing value
* Use a measure of central tendency for the attribute (e.g., the mean or median)
* Use the attribute mean or median for all samples belonging to the same class as the given tuple
* Use the most probable value to fill in the missing value

### Noisy Data

* Binning - smooths a sorted data value by consulting the values around it
  ![](https://i.imgur.com/cGIeXws.jpg)
* Regression
* Outlier analysis (clustering)
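To make these cleaning steps concrete, here is a minimal sketch in Python with pandas (written for these notes, not from the book; the toy `df` and its columns are invented): it fills a missing value with the attribute median and smooths a noisy attribute by equal-depth bin means.

```python
import pandas as pd

# Toy data with a missing value and a noisy attribute (invented for illustration).
df = pd.DataFrame({"age": [23, 25, None, 31, 45, 52, 61, 70],
                   "income": [12, 15, 14, 90, 33, 35, 80, 85]})

# Missing values: fill with a measure of central tendency (here, the median).
df["age"] = df["age"].fillna(df["age"].median())

# Noisy data: partition into equal-depth (equal-frequency) bins,
# then smooth each value by its bin mean.
bins = pd.qcut(df["income"], q=4)
df["income_smoothed"] = df.groupby(bins, observed=True)["income"].transform("mean")
print(df)
```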
## Data Integration

**Combines data from multiple sources into a coherent data store, as in data warehousing.**

* Entity identification problem
* Redundancy and correlation analysis
    * nominal data - use the χ2 (chi-square) test
    * numeric attributes - use the correlation coefficient and covariance
* Tuple duplication (duplicate records) - there are two or more identical tuples for a given unique data entry case
* Data value conflict detection and resolution
  (EX) Hotel chain: price may be quoted in different currencies, cover different services (e.g., free breakfast), or include different taxes.

## Data Reduction

**Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.**

* Dimensionality reduction - the process of reducing the number of random variables or attributes under consideration.
  **Methods:**
    * Wavelet transforms - discrete wavelet transform (DWT)
    * Principal components analysis (PCA) - searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. (See the first sketch after this section.)
* Numerosity reduction - reduce the data volume by choosing alternative, smaller forms of data representation.
* Data compression - transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
* Attribute subset selection - reduce the data set size by removing irrelevant or redundant attributes.
  **Heuristic methods:**
    * step-wise forward selection
    * step-wise backward elimination
    * combining forward selection and backward elimination
    * decision-tree induction

  ![](https://i.imgur.com/a4Vs0Wt.jpg)
  ![](https://i.imgur.com/ytbX2bA.jpg)
* Regression and log-linear models: parametric data reduction
* Clustering
* Sampling (see the second sketch after this section)
    * Simple random sample without replacement (SRSWOR) of size s - every tuple has the same probability of being drawn, and a drawn tuple is not returned
    * Simple random sample with replacement (SRSWR) of size s - each drawn tuple is placed back before the next draw, so it may be drawn again
    * Cluster sample - the tuples are grouped into clusters according to a known criterion or feature; a random sample of clusters is drawn, and all tuples in the selected clusters form the sample
    * Stratified sample - a sample is drawn from each stratum; simple random sampling may perform poorly on skewed data, so stratified sampling is used in conjunction with skewed data
* Data cube aggregation
  ![](https://i.imgur.com/HV4ayE7.jpg)
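The PCA item above references this first sketch: a minimal illustration with scikit-learn (the synthetic data and the choice k = 2 are arbitrary, not from the book), projecting n = 5 attributes onto the k orthogonal directions that capture the most variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 tuples, n = 5 attributes
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # a redundant, correlated attribute

pca = PCA(n_components=2)         # search for k = 2 orthogonal vectors, k <= n
X_reduced = pca.fit_transform(X)  # original data projected onto the smaller space

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```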
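The sampling item references this second sketch: SRSWOR, SRSWR, and a stratified sample expressed with pandas (the `region` stratum column and the sample size s are invented for illustration).

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "N", "S", "S", "E"] * 10,
                   "sales": range(60)})
s = 10

srswor = df.sample(n=s, replace=False, random_state=1)  # SRSWOR: no tuple repeats
srswr  = df.sample(n=s, replace=True,  random_state=1)  # SRSWR: tuples may repeat

# Stratified sample: draw proportionally from each stratum ('region'),
# so skewed strata stay represented.
strat = df.groupby("region").sample(frac=s / len(df), random_state=1)

print(len(srswor), len(srswr), len(strat))
```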
## Data Transformation and Data Discretization

### Data Transformation by Normalization

To help avoid dependence on the choice of measurement units, the data should be **normalized or standardized** (see the sketch after the discretization list below).

### Discretization

* Binning
* Histogram analysis
* Cluster, decision tree, and correlation analyses
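The sketch referenced above covers the three standard normalization formulas from the book (min-max, z-score, and decimal scaling); the sample values in `v` are invented.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # attribute values (toy data)

# Min-max normalization to the new range [0, 1]:
# v' = (v - min) / (max - min)
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization (standardization): v' = (v - mean) / std
v_zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: v' = v / 10^j, where j is the smallest
# integer such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10**j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```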
### Concept Hierarchy Generation for Nominal Data

Nominal attributes have a finite number of distinct values, with **no ordering** among the values.
(EX) geographic location, job category, and item type.

A concept hierarchy transforms the data into multiple levels of granularity.

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts
   (EX) street < city < province or state < country
2. Specification of a portion of a hierarchy by explicit data grouping
   (EX) “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada”
3. Specification of a set of attributes, but not of their partial ordering - higher-level concepts generally cover several subordinate lower-level concepts
   (EX) an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street)
4. Specification of only a partial set of attributes - automatic generation of a schema concept hierarchy based on the number of distinct attribute values

![](https://i.imgur.com/qANnyNB.jpg)

# Ch4 Data Warehousing and Online Analytical Processing

## Data Warehouse: Basic Concepts

A data warehouse is a data repository that is maintained separately from an organization’s operational databases.

* subject-oriented - focuses on the modeling and analysis of data for decision makers, excluding data that are not useful in the decision support process; emphasizes grouping data into subject areas according to their meaning
* integrated
* time-variant
* nonvolatile - once written, data are not replaced or deleted, even if erroneous

### Differences between Operational Database Systems and Data Warehouses

### Data Warehousing: A Multitiered Architecture

![](https://i.imgur.com/hkLUUn1.jpg)

* Bottom - warehouse database server
* Middle - OLAP server
    * relational OLAP (ROLAP), i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations
    * multidimensional OLAP (MOLAP), i.e., a special-purpose server that directly implements multidimensional data and operations
* Top - front-end client layer (e.g., trend analysis, prediction, ...)

### Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse

![](https://i.imgur.com/5D4pdd4.png)

### Extraction, Transformation, and Loading

* Data extraction - gathers data from multiple, heterogeneous, and external sources.
* **Data cleaning** - detects errors in the data and rectifies them when possible.
* **Data transformation** - converts data from legacy or host format to warehouse format.
* Load - sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
* Refresh - propagates the updates from the data sources to the warehouse.

### Metadata Repository

## Data Warehouse Modeling: Data Cube and OLAP

**Data warehouses and OLAP tools are based on a multidimensional data model**, which consists of **dimensions & measures**.

### Data Cube: A Multidimensional Data Model

A data cube allows data to be modeled and viewed in multiple dimensions (2-D / 3-D / 4-D data cubes).

### Stars / Snowflakes / Fact Constellations: Schemas for Multidimensional Data Models

### Dimensions: The Role of Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of **low-level concepts to higher-level** concepts.

### Typical OLAP Operations

* **Roll-up** - aggregation, by climbing up a concept hierarchy for a dimension or by dimension reduction
* **Drill-down** - the reverse of roll-up; stepping down a concept hierarchy for a dimension or introducing additional dimensions
* **Slice and dice**
* **Pivot (rotate)**
* Other OLAP operations

![](https://i.imgur.com/sBLcJ3J.png)
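Roll-up and drill-down map naturally onto group-by aggregation. A minimal sketch with pandas (the fact table, its dimensions, and the measure are all invented for illustration): rolling up along the location hierarchy and by dimension reduction; drill-down is simply the reverse move back to the finer group-by.

```python
import pandas as pd

# Toy fact table: dimensions (country, city, quarter) and measure (dollars_sold).
sales = pd.DataFrame({
    "country": ["Canada"] * 4 + ["USA"] * 4,
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver",
                "Chicago", "Chicago", "New York", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "dollars_sold": [100, 150, 80, 120, 200, 210, 300, 330],
})

# Base cuboid: group by (city, quarter).
by_city_q = sales.groupby(["city", "quarter"])["dollars_sold"].sum()

# Roll-up along the location hierarchy (city -> country): fewer, coarser cells.
by_country_q = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Roll-up by dimension reduction: drop the time dimension entirely.
by_country = sales.groupby("country")["dollars_sold"].sum()

print(by_city_q, by_country_q, by_country, sep="\n\n")
```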
### OLAP Systems versus Statistical Databases

* SDB - focuses on socioeconomic applications; low-level data
* OLAP - business applications; efficiently handles huge amounts of data

### A Starnet Query Model for Querying Multidimensional Databases

![](https://i.imgur.com/kjikSlZ.png)

## Data Warehouse Design and Usage

### A Business Analysis Framework for Data Warehouse Design

Four views:

* top-down view
* data source view - often modeled by traditional data modeling techniques, such as the entity-relationship model or CASE tools
* data warehouse view
* business query view - from the end user’s viewpoint

### Data Warehouse Design Process

* top-down approach - starts with overall design and planning; suitable when the technology is mature and well known and the business problems that must be solved are clear and well understood
* bottom-up approach - starts with experiments and prototypes
* combination of both

1. Choose a *business process* to model (orders, invoices, shipments, inventory, account administration, sales, or the general ledger)
2. Choose the business process *grain*, which is the fundamental, atomic level of data to be represented in the fact table for this process (e.g., individual transactions, individual daily snapshots, ...)
3. Choose the *dimensions* that will apply to each fact table record (time, item, customer, supplier, warehouse, transaction type, status)
4. Choose the *measures* that will populate each fact table record (numeric additive quantities such as dollars sold and units sold)

### Data Warehouse Usage for Information Processing

* Information processing - querying, basic statistical analysis, and reporting (web based)
* Analytical processing - basic OLAP operations (slice-and-dice, drill-down, roll-up, and pivoting)
* Data mining - knowledge discovery (finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the results)

## Data Warehouse Implementation

### Efficient Data Cube Computation

A data cube can be viewed as a lattice of cuboids.

![](https://i.imgur.com/NXHudku.png)
![](https://i.imgur.com/jbHR84Q.png)

`define cube sales [item, city, year]: sum (sales_in_dollars)`
`compute cube sales`

Each cuboid represents a different group-by.

### Partial Materialization: Selected Computation of Cuboids

1. No materialization
2. Full materialization
3. Partial materialization - materialize some cuboids. Should consider how to (1) identify the subset of cuboids to materialize, (2) exploit the materialized cuboids during query processing, and (3) efficiently update them during load and refresh.

### Indexing OLAP Data: Bitmap Index and Join Index

* bitmap indexing - fast, but not suitable for high-cardinality domains
  ![](https://i.imgur.com/TgPMZa4.png)
* join indexing - registers the joinable rows of two relations from a relational database
  ![](https://i.imgur.com/yWOnbRx.png)

### Efficient Processing of OLAP Queries

1. Determine which **operations** should be performed on the available cuboids
2. Determine to which **materialized cuboid(s)** the relevant operations should be applied

### OLAP Server Architectures: ROLAP / MOLAP / HOLAP

* Relational OLAP (ROLAP) servers - relational or extended-relational DBMS; greater scalability; queries are computed on the fly, so query efficiency is lower but storage requirements are low
* Multidimensional OLAP (MOLAP) servers - sparse array-based multidimensional storage engine; fast indexing to pre-computed summarized data
* Hybrid OLAP (HOLAP) servers - flexibility, combining ROLAP scalability with MOLAP speed
* Specialized SQL servers - specialized support for SQL queries over star/snowflake schemas

# Ch5 Data Cube Technology

## Data Cube Computation: Preliminary Concepts

### Cube Materialization

* Full cube
* Iceberg cube

# Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

**Frequent pattern** mining searches for **recurring relationships** in a given data set.
(EX) Market basket analysis

* Frequent itemsets - let I = {I1, I2, ..., Im} be the set of items; an itemset is a subset of I
* Closed itemsets
* Association rules

![](https://i.imgur.com/flEhVUM.png)
![](https://i.imgur.com/VYc3OP2.png)

The occurrence frequency of an itemset is the number of transactions that contain the itemset.

Association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. (Most important step.)
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

## Frequent Itemset Mining Methods

### Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

* Apriori property: all nonempty subsets of a frequent itemset must also be frequent.

![](https://i.imgur.com/LJq6j3t.png)

**Algorithm: Apriori.** Find frequent itemsets using an iterative level-wise approach based on candidate generation. (A runnable sketch appears at the end of these notes.)
Input: D, a database of transactions; min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

![](https://i.imgur.com/iaaNJmn.png)

### Generating Association Rules from Frequent Itemsets

(Also sketched at the end of these notes.)

![](https://i.imgur.com/RYzinPn.png)

### Improving the Efficiency of Apriori

* Hash-based technique
* Transaction reduction
* Partitioning
  ![](https://i.imgur.com/fQ1Mx4l.png)
* Sampling
* Dynamic itemset counting

### A Pattern-Growth Approach for Mining Frequent Itemsets

Candidate generate-and-test methods such as Apriori suffer from two nontrivial costs:

1. They may still need to generate a huge number of candidate sets.
2. They may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.

To solve these problems: frequent pattern growth, FP-growth (finding frequent itemsets without candidate generation).

* An FP-tree registers compressed, frequent pattern information.

![](https://i.imgur.com/7OrvLu7.png)

### Mining Closed and Max Patterns

* Item merging
* Sub-itemset pruning
* Item skipping

## Pattern Evaluation Methods: Which Patterns Are Interesting?

### Strong Rules Are Not Necessarily Interesting

A strong rule (one with high support and confidence) can sometimes be misleading.

### From Association Analysis to Correlation Analysis
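The Apriori sketch referenced earlier: a compact, self-contained level-wise implementation in plain Python (written for these notes, not the book’s pseudocode). The transaction database is the book’s nine-transaction AllElectronics example with min_sup = 2.

```python
from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
min_sup = 2  # minimum support count threshold

def apriori(transactions, min_sup):
    # L1: frequent 1-itemsets and their support counts.
    items = {i for t in transactions for i in t}
    L = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {s: c for s, c in L.items() if c >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        prev = list(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database to count candidate supports.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

freq = apriori(transactions, min_sup)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```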
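Continuing the same sketch, step 2 generates strong rules from those frequent itemsets by testing confidence(A => B) = support_count(A ∪ B) / support_count(A) against a minimum confidence threshold (min_conf = 0.7 is an arbitrary toy value); it reuses `freq` and `combinations` from the block above.

```python
min_conf = 0.7

def strong_rules(frequent, min_conf):
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        # Every nonempty proper subset A yields a candidate rule A => itemset - A.
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[a]  # subsets of a frequent itemset are frequent
                if conf >= min_conf:
                    yield sorted(a), sorted(itemset - a), conf

for antecedent, consequent, conf in strong_rules(freq, min_conf):
    print(f"{antecedent} => {consequent}  (confidence = {conf:.2f})")
```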
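One standard correlation measure here is lift: lift(A, B) = P(A ∪ B) / (P(A) P(B)), where a value of 1 indicates independence, < 1 negative correlation, and > 1 positive correlation. A tiny worked example using the book’s computer-games/videos numbers, where a seemingly strong rule turns out to be negatively correlated:

```python
n_total = 10_000   # transactions analyzed
n_game  = 6_000    # transactions containing computer games
n_video = 7_500    # transactions containing videos
n_both  = 4_000    # transactions containing both

p_game, p_video, p_both = (n / n_total for n in (n_game, n_video, n_both))

confidence = p_both / p_game        # 0.66: the rule game => video looks strong
lift = p_both / (p_game * p_video)  # 0.89 < 1: games and videos are negatively
                                    # correlated, so the strong rule is misleading
print(f"confidence = {confidence:.2f}, lift = {lift:.2f}")
```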