---
tags: Master
---
# Data Mining
[TOC]
## Course Info
- [Moodle link](https://didatticaonline.unitn.it/dol/course/view.php?id=23156): updated slides + announcements
- [Whiteboard](https://bit.ly/2E4sOp6)
- Reference Book: [Mining of Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/book0n.pdf)
- (the professor expects us to read the related chapters of the book)
- Zoom lectures
- `ID:` 929 5761 3136
- `PWD:` 1821
- Written exam (30%) $+$ project (70%), **groups of 2** (maybe 3)
## Introduction
**Data Mining:** subset of data science; the process to `extract knowledge from data`. The data needs to be:
- Stored
- Managed
- **ANALYZED** $\leftarrow$ this class
**Big Data:** ability to handle extremely large amounts of data
**Predictive Analytics:** use of data to predict future outcomes and trends
**Data Science:** broad field that aims at gaining useful insights from data (statistics, data visualization, NLP, data mining, ...)
### Data Mining
**Discover patterns** and models that are:
- `Valid:` hold on new data with some certainty
- `Useful:` one can act on the finding
- `Unexpected:` non-obvious to the system
- `Understandable:` by humans
**Algorithmic focus** of the course:
- `summarization`
- `feature extraction`
#### Random Concepts
**Bonferroni's principle:** the more you look in your data, the more crap you will find; test enough patterns and some will hold by pure chance. (Toy simulation below.)
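A quick simulation of the idea, with invented numbers (loosely in the spirit of the book's hotel-visits example): scan purely random data for "suspicious" pairs and plenty show up, simply because so many combinations are tested.
```python
import itertools
import random

# 1000 people each visit 10 of 100 hotels, completely at random.
# Call a pair of people "suspicious" if they met in at least 2 hotels.
random.seed(0)
people = [set(random.sample(range(100), 10)) for _ in range(1000)]

suspicious = sum(
    1 for a, b in itertools.combinations(people, 2) if len(a & b) >= 2
)
print(suspicious)  # ~130k of the 499,500 pairs get flagged -- all pure chance
```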
**TF-IDF:** a weighting scheme that scores how related a document is to some concept by analyzing term frequencies (sketch below the formulas).
- $\text{TF}_{ij}\cdot \text{IDF}_i$
- `Term Frequency` of $i$ in document $j$ : $\text{TF}_{ij}=\dfrac{f_{ij}}{\text{max}_{k}f_{kj}}$
- Idea: a document related to some concept should contain some pertaining term multiple times
- `Inverse Document Frequency`: $\text{IDF}_i=\log_2\Big(\dfrac{N}{n_i}\Big)$
- $n_i$ is the number of documents, among all the $N$ documents, that contain word $i$
- Idea: the fewer the documents mentioning a term, the more likely those documents are specifically about the wanted concept.
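A minimal sketch of the two formulas above (the function and the example documents are mine):
```python
import math

def tf_idf(docs):
    """docs: list of documents, each a list of terms.
    Returns one {term: score} dict per document."""
    n = len(docs)
    # n_i: number of documents containing term i
    doc_freq = {}
    for doc in docs:
        for t in set(doc):
            doc_freq[t] = doc_freq.get(t, 0) + 1

    scores = []
    for doc in docs:
        counts = {t: doc.count(t) for t in set(doc)}
        max_f = max(counts.values())  # frequency of the most frequent term in this doc
        scores.append({
            t: (f / max_f) * math.log2(n / doc_freq[t])  # TF_ij * IDF_i
            for t, f in counts.items()
        })
    return scores

docs = [["data", "mining", "data"], ["big", "data"], ["cats", "purr"]]
print(tf_idf(docs)[0])  # "mining" outscores "data": "data" also appears in doc 1, so lower IDF
```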
**Hash Index:** allows storing and retrieving data records efficiently (toy sketch after this list).
- `clustered index:` referenced data is in the same record or in contiguous ones
- :x:slower insertions
- :heavy_check_mark:faster readings
- `unclustered index:` referenced data is spread on the disk
- :heavy_check_mark:faster insertions
- :x:slower readings
The disk reads a fixed-size block per access, so data stored close together can be retrieved with fewer disk accesses.
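A toy in-memory sketch of the idea (the key-to-page-id record layout is invented for illustration):
```python
class HashIndex:
    """Maps a key to the disk pages holding matching records."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key, page_id):
        # Append (key, page_id) to the bucket chosen by hashing the key.
        self.buckets[hash(key) % len(self.buckets)].append((key, page_id))

    def lookup(self, key):
        # Only one bucket is scanned, never the whole index.
        bucket = self.buckets[hash(key) % len(self.buckets)]
        return [page for k, page in bucket if k == key]

idx = HashIndex()
idx.insert("alice", 3)
idx.insert("alice", 7)      # unclustered case: matches live on scattered pages
print(idx.lookup("alice"))  # [3, 7] -> two disk reads if pages 3 and 7 aren't contiguous
```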
Data mining overlaps with:
- **Databases:** Large-scale data, simple queries
- `analytic processing`
- **Machine learning:** Small data, Complex models
- `inference of models`
- **CS Theory:** (Randomized) Algorithms

But with more **stress on performance**:
- `Scalability` (big data)
- `Algorithms`
- `Computing architectures`
- `Automation` for handling large data
Overview of the book's contents (the ML chapters are not covered in the course).

## Data Integration Systems
Two ways of building a centralized system:
- **Data Warehouse Architecture:** one big database collects cleaned data from many sources.
- :heavy_check_mark: data is easily `accessible` (because it is all in one place)
- :x: `infrequent updates`: the warehouse polls the sources only in certain periods, to alleviate the burden on the source databases
- tolerable when the latest data is not critical
- **Virtual Integration Architecture** (`On-Demand Integration`): the warehouse is only virtual, it does not physically exist.
- data is left in the sources
- :heavy_check_mark: `fresh data`
- :x: burden on the source databases
- :x: slower query execution, because views are prepared at runtime by the `mediator`
**Heterogeneity problems** of the data sources:
- `Schema:` source schema must be mapped to the global schema
- name differences
- attribute grouping
- coverage of DBs
- granularity and format of attributes
- `Access protocol:` source data may be excel files, HTML pages, SQL databases, ...
- need `query translation` mechanisms (e.g. from SQL to an HTTP request); see the sketch below
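A hypothetical sketch of query translation: the same selection rendered for an SQL source and for an HTTP one (the table name and endpoint are invented):
```python
def translate(attr, value, protocol):
    # Render one selection condition in the access protocol of the source.
    if protocol == "sql":
        return f"SELECT * FROM people WHERE {attr} = '{value}'"
    if protocol == "http":
        return f"GET /people?{attr}={value}"
    raise ValueError(f"unknown protocol: {protocol}")

print(translate("name", "Alice", "sql"))   # SELECT * FROM people WHERE name = 'Alice'
print(translate("name", "Alice", "http"))  # GET /people?name=Alice
```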
**Mediator:** software that rewrites queries posed on the global schema into queries on the sources in order to get results, using mappings defined at design time
- `view:` most popular way to define schema mappings

1. `reformulation:` rewrite the query into compatible queries for the schema of each data source
2. `optimization:` choose an efficient execution plan for the sub-queries
3. `execution:` run the sub-queries at the sources and combine their results
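A minimal end-to-end sketch of the three steps (the sources, mappings, and toy cost model are all invented):
```python
def mediate(query_attrs, sources):
    # 1. reformulation: rename global-schema attributes per each source's mapping
    sub_queries = {s["name"]: [s["mapping"][a] for a in query_attrs] for s in sources}
    # 2. optimization: toy plan -- query the cheapest sources first
    plan = sorted(sources, key=lambda s: s["cost"])
    # 3. execution: run each sub-query and merge rows under the global schema
    results = []
    for s in plan:
        for row in s["fetch"](sub_queries[s["name"]]):
            results.append(dict(zip(query_attrs, row)))
    return results

sources = [
    {"name": "hr_db", "mapping": {"name": "full_name"}, "cost": 1,
     "fetch": lambda attrs: [("Alice",), ("Bob",)]},   # stand-in for a real SQL call
    {"name": "crm",   "mapping": {"name": "customer"}, "cost": 5,
     "fetch": lambda attrs: [("Carol",)]},             # stand-in for an HTTP call
]
print(mediate(["name"], sources))  # [{'name': 'Alice'}, {'name': 'Bob'}, {'name': 'Carol'}]
```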