---
tags: Master
---

# Data Mining

[TOC]

## Course Info

- [Moodle link](https://didatticaonline.unitn.it/dol/course/view.php?id=23156): updated slides + announcements
- [Whiteboard](https://bit.ly/2E4sOp6)
- Reference book: [Mining of Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/book0n.pdf)
    - (he expects us to read the related chapters in the book 😂)
- Zoom lectures
    - `ID:` 929 5761 3136
    - `PWD:` 1821
- Written exam (30%) $+$ project (70%), **groups of 2** (maybe 3)

## Introduction

**Data Mining:** subset of data science, the process of `extracting knowledge from data`. The data needs to be:
- Stored
- Managed
- **ANALYZED** $\leftarrow$ this class

**Big Data:** ability to handle extremely large amounts of data

**Predictive Analytics:** using data to predict future outcomes

**Data Science:** broad field that aims at gaining useful insights from data (statistics, data visualization, NLP, data mining, ...)

### Data Mining

**Discover patterns** and models that are:
- `Valid:` hold on new data with some certainty
- `Useful:` we can act on them
- `Unexpected:` non-obvious to the system
- `Understandable:` by humans

**Algorithmic focus of the course:**
- `summarization`
- `feature extraction`

#### Random Concepts

💥**Bonferroni’s principle:** the more you look in your data, the more crap you will find.

**TF-IDF:** a method that tells whether a document is related to some concept by analyzing term frequencies.
- $\text{TF}_{ij}\cdot \text{IDF}_i$
- `Term Frequency` of $i$ in document $j$: $\text{TF}_{ij}=\dfrac{f_{ij}}{\max_{k}f_{kj}}$
    - Idea: a document related to some concept should contain a pertaining term multiple times
- `Inverse Document Frequency`: $\text{IDF}_i=\log_2\Big(\dfrac{N}{n_i}\Big)$
    - $n_i$ is the number of documents (out of $N$ in total) in which word $i$ appears
    - Idea: the fewer the documents mentioning a term, the more likely it is that those documents are related to the wanted concept.

**Hash Index:** allows storing and retrieving data records efficiently.
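The TF-IDF score described above ($\text{TF}_{ij}\cdot \text{IDF}_i$) can be sketched in a few lines; the toy corpus and the whitespace tokenization below are my own illustrative choices, not from the lecture:

```python
import math

# Toy corpus: three hypothetical example documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tf(term, doc_tokens):
    """TF_ij = f_ij / max_k f_kj : raw count normalized by the most frequent term."""
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts.get(term, 0) / max(counts.values())

def idf(term, tokenized_docs):
    """IDF_i = log2(N / n_i), where n_i = number of documents containing the term."""
    n_i = sum(1 for d in tokenized_docs if term in d)
    return math.log2(len(tokenized_docs) / n_i) if n_i else 0.0

tokenized = [d.split() for d in docs]

# "cat" appears once in doc 0 (whose most frequent term, "the", appears twice)
# and in 2 of the 3 documents, so the score is 0.5 * log2(3/2).
score = tf("cat", tokenized[0]) * idf("cat", tokenized)
```

Note that a term occurring in *every* document gets $\text{IDF}=\log_2(1)=0$, which captures the idea that ubiquitous words carry no concept information.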
- `clustered index:` referenced data is in the same record or in contiguous ones
    - :x: slower insertions
    - :heavy_check_mark: faster reads
- `unclustered index:` referenced data is spread across the disk
    - :heavy_check_mark: faster insertions
    - :x: slower reads

👁‍🗨 The disk has a fixed amount of data you can read at once; closer data can be read with fewer disk accesses.

Data mining overlaps with:
- **Databases:** large-scale data, simple queries
    - `analytic processing`
- **Machine learning:** small data, complex models
    - `inference of models`
- **CS Theory:** (randomized) algorithms

![](https://i.imgur.com/x15IujO.png)

But with more **stress on performance**:
- `Scalability` (big data)
- `Algorithms`
- `Computing architectures`
- `Automation` for handling large data

Overview of the book (ML is not covered in the course):

![](https://i.imgur.com/u72nbxs.png)

## Data Integration Systems

Two ways of building a centralized system:
- **Data Warehouse Architecture:** one big database collects cleaned data from many sources.
    - 👍 data is easily `accessible` (because it is all in one place)
    - 👎 `infrequent updates` of the warehouse (the warehouse polls the sources only periodically, to alleviate the burden on the source databases)
        - tolerable when the latest data is not critical
- **Virtual Integration Architecture** (`On-Demand Integration`)**:** the warehouse is only virtual, it does not really exist.
    - data is left in the sources
    - 👍 `fresh data`
    - 👎 burden on the source databases
    - 👎 slower query execution, because views are prepared at runtime by the `mediator`

👀**Heterogeneity problems** of the data sources:
- `Schema:` each source schema must be mapped to the global schema
    - name differences
    - attribute grouping
    - coverage of DBs
    - granularity and format of attributes
- `Access protocol:` source data may be Excel files, HTML pages, SQL databases, ...
    - need `query translation` mechanisms (e.g.: from SQL to an HTTP request)

**Mediator:** software that reprocesses queries in order to get results, using mappings defined at design time
- `view:` most popular way to define schema mappings

![](https://i.imgur.com/HQCXqDy.png)

1. `reformulation:` convert the query into compatible queries for the schema of each data source
2. `optimization:` choose an efficient plan for executing the per-source queries
3. `execution:` run the queries on the sources and combine the results
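The reformulation step can be illustrated with a minimal sketch. All names here (the source names, the attribute mappings, the dictionary-based "query") are hypothetical; a real mediator rewrites full queries against declared views, not just attribute names:

```python
# One mapping ("view") per source: global attribute -> source attribute.
# Both sources and mappings are invented for illustration.
SOURCE_MAPPINGS = {
    "sql_db":  {"title": "book_title", "price": "cost_eur"},
    "web_api": {"title": "name",       "price": "amount"},
}

def reformulate(global_query, source):
    """Rewrite a global-schema query (attribute -> value) into the
    vocabulary of one source, using the mapping fixed at design time."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[attr]: value for attr, value in global_query.items()}

# The mediator would then optimize and execute one query per source.
query = {"title": "Mining of Massive Datasets"}
per_source = {s: reformulate(query, s) for s in SOURCE_MAPPINGS}
```

After reformulation, `per_source` holds one source-specific query each (e.g. the `sql_db` one uses `book_title`), which the optimization and execution steps would then run and merge.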