PA164 - Strojove učení a přirozeny jazyk

# PA164 - Strojove učení a přirozeny jazyk ## Odpovedníky ### 10-12, Overview of ML (1) 1. Kdy je naučená hypotéza přeučená (vyberte nejvhodnější odpověď): - Pokud popisuje dobře testovací data a špatně data učicí - Pokud popisuje špatně jak učící tak testovací data - **Pokud popisuje dobře učící data a špatně data testovací** 2. Data obsahují atributy *Barva:{red, blue, green}, Tvar {circle,triangle,rectangle}, Velikost:{small, large} a Obsah (reálné kladné číslo)*. Pak jednou z generalizací hypotéz *H1: red & circle (< red, circle, ? >)*, *H2: (red & large) (< red, ?, large >)* je: - \<?, circle, ?\> - **\<red, ?, ? \>** - \<?, ?, large\> 3. V učení s učitelem (supervised learning) je-li N počet atributů, pak počet všech možných hypotéz - je lineární v N - je polynomiální v N - **je exponenciální v N** 4. V které situaci je celková správnost (accuracy) nejméně vhodná? - **Počet příkladů v jednotlivých třídách se výrazně liší** - Datová sada se skládá z mnoha tříd s přibližně stejným počtem příkladů v každé třídě - Datová sada se skládá ze dvou tříd. 5. Pro dvě hypotézy h1 a h2 platí, že h1 je obecnější nebo stejně obecná jako h2 (h1 >= h2) - právě když existuje instance, která splňuje h2 a nesplňuje h1. - právě když každá instance, která splňuje h1 také splňuje h2. - **právě když každá instance, která splňuje h2 také splňuje h1.** ### 10-19, Text representation (2) 1. Klasifikace dokumentů: Pro úpravu textového dokumentu jsme použili stemming a vytvořili jsme dokument-term matici s hodnotami TF (term frequency). Bez stemmingu - nemůžeme získat lepší výsledek (např. accuracy) - získáme vždy horší výsledek (např. accuracy) - **můžeme získat lepší výsledek (např. accuracy)** 2. Semi-supervised anomaly detection: pro testování těchto metod potřebujeme příklady, u nichž víme, zda jsou anomální nebo normální. Můžeme použít - časté vzory - náhodně vygenerovaná data i třídy - **dvoutřídní data např. z kaggle** 3. Mějme čtyři dokumenty a převrácenou hodnotu eukleidovské vzdálenosti jako míru podobnosti. Po převedení na bag-of-words (reprezentace TF, term frequency) bude větě "She sat down." nejpodobnější 1. She sat down. 2. She drank coffee. 3. She spent much time in learning text mining. 4. She invested significant efforts in learning text mining. - She spent much time in learning text mining. - She invested significant efforts in learning text mining. - **She drank coffee.** 4. Po převedení na bag-of-words (reprezentace TF, term frequency) bude větě "She drank coffee" nejpodobnější - **She sat down** 5. Klasifikace dokumentů: Z textového dokumentu jsme odstranili stop-slova a vytvořili jsme dokument-term matici s hodnotami TF-IDF. Bez odstraněni stop-slov - **můžeme získat lepší výsledek (např. accuracy)** ### 10-26, Text fragments (3) 1. Mějme syntaktický strom vytvořený mělkým syntaktickým analyzátorem. Je-li v kořeni stromu (usel hloubky 0) věta, která neobsahuje předložkové fráze, potom mělký syntaktický strom má hloubku - 0 - **1** - 2 2. Pro daný dataset a daný support>0 a confidence>0 asociační pravidla pokrývají (tj. pro příklad z datasetu existuje pravidlo, které pro daný příklad platí) - všechny příklady z datasetu - N/support příkladů z datasetu, kde N je počet příkladů v datasetu - **jen část datasetu** 3. Který způsob prohledávání prostoru hypotéz používají algoritmy pro učení rozhodovacích stromů, např. C4.5 nebo CART? - depth-first search (prohledávání do hloubky) - beam search - **hill-climbing (greedy search, heuristické prohledávání)** 4. Neplatí, že latentní sémantická analýza se dá použít - **pro redukci příkladů** - pro redukci atributů - pro přidání nových atributů 5. Vyberte variantu, která je nejblíž pravdě. Častý vzor (frequent pattern, large itemset) - **se vždycky dá použít pro klasifikaci.** - se dá použít pro klasifikaci, pokud pravá strana má určitý tvar. - se nikdy nedá použít pro klasifikaci. 6. Vyberte variantu, která je nejblíž pravdě. Asociační pravidlo - se vždycky dá použít pro klasifikaci. - **se dá použít pro klasifikaci, pokud pravá strana má určitý tvar**. - se nikdy nedá použít pro klasifikaci. 7. Zipfův zákon říká, že pokud nejčastější slovo má rank 1, frekvence F libovolného slova je - **1/F** 8. Při učení rozhodovacích stromů, např. algoritmem C4.5 nebo CART, učič najde - **jen lokálně optimální řešení** ### 11-2, Text categorization (4) 1. Tokenizujte (převeďte na tokeny) následující větu, *after reading for 3 h, he decided to read for another two.*, Předpokládejte, že členy, osobní zájmena, číslovky a předložky jsou stop slova a odstraňte je. Proveďte stemming a napište výsledek. - ~~**after read 3 h decide read another**~~ (SKONTROLUJTE) - **read h decide read two** - takto imho podla screenshotu co mam ulozeny z toho, ako prezentoval spravne odpovede 2. Předpokládejte kolekci 1000 dokumentů, kde se každý tvar slovesa a slova after a another vyskytují vždy ve 100 dokumentech. Všechna ostatní slova se vyskytují v 10 dokumentech. Napište tf-idf reprezentaci vaší předchozí odpovědi. Logaritmus nemusíte počítat. Chcete-li však, použijte se základem 10. ``` idf_read = log(1000/100) tf_read = 2 idf_h = log(1000/10) tf_h = 1 idf_decide = log(1000/100) tf_decide = 1 idf_two = log(1000/10) tf_two = 1 ``` 3. Napište větu, bez ohledu na sémantiku, avšak syntakticky správně utvořenou: 1. která bude obsahovat jiná slovesa a předložky než věta z první otázky 2. a bude mít stejnou kombinaci (tj. nezáleží na pořadí) nenulových hodnot tf-idf jako vaše odpověď na Otazku 2 - TODO ### 11-9, Opinion & Sentiment mining (5) 1. Pro redukci sloupců v matici document-term se používají metody feature selection, které používají pro ohodnocení atributů rankovací funkci. Uveďte jednu takovou (stručně, názvem nebo jednou větou) - **Information gain** 2. Pro výběr rozumného počtu sloupců v matici document-term se hodí učící křivka, kde na ose X je počet atributů a na ose Y accuracy. Která z odpovědí nejlépe popisuje její tvar? Pro redukci atributů jsme použili odstranění stop-slov. - **křivka je rostoucí, 1. derivace klesající** - křivka je rostoucí, 1. derivace rostoucí - křivka je klesající 3. Po převodu kolekce dokumentů na document-term matrix a po odstranění stop-slov bývá počet sloupců nejčastěji - **v řádu tisíců** - v řádu stovek - v řádu desítek ### 11-16, Desambiguace textu (5) 1. Uveďte aspoň dva příklady desambiguačních úloh ve zpracování přirozeného jazyka. - hladanie spravneho vyznamu slova (homonyma - napr. koruna) - spravne urcenie vazieb medzi slovami (napr. veta "Matka stretla dceru a bola opita." - kto bol opity?) 2. Vyberte odpověď, která se nejvíc blíží pravdě. Při desambiguačních úlohách generujeme učící příklad: - z celého dokumentu - z levého kontexu desambiguavaného slova - **z pravého a levého kontextu desambiguavaného slova** 3. Při analýze sentimentu definujeme sentiment/opinion jako pětici. - ![](https://i.imgur.com/lZ6waVm.png) ### 11-23, Name Entity Recognition (6) 1. Extrakce informace z textu. Uveďte dva příklady obvyklých a možných extrahovaných relací (Relation extraction) a ke každé příklad, např. *Relace = býti králem(), příklad = býti králem(Lávra)*. - The currect wife of Bill Gates is Melinda Gates. - 1. Relace = WifeOf, priklad = WifeOf(Bill Gates, Melinda Gates) - The Olympc Games 2020 are going to be located at Tokyo city. - 2. Relace = LocatedAt, priklad = LocatedAt(Olympic Games 2020, Tokyo) 2. Učící příklady v úloze extrakce pojmenovaných entit (Name Entity Recognition) z textu mají nejčastěji tvar - matice dokument-term - **levý kontext, pojmenovaná entita, pravý kontext, typ pojmenované entity** - pojmenovaná entita, typ pojmenované entity 3. Extrakce informace z textu tak, jak jsme o ní mluvili, je z pohledu strojového učení - **supervised learning** - unsupervised learning - first order clustering 4. Extrakce informace z textu: Uveďte dva typy obvyklých pojmenovaných entit a pro každý jednu konkrétní entitu (příklad), např. Entita=král Příklad Král Lávra. - typ=ludia Entita=Donald Trump Príklad=Donald Trump vyhlásil ... - typ=miesto Entita=USA Príklad=nepokoje v USA ### 12-7, Anomaly detection (8) 1. Popište jednu z metod detekce anomálií použitých v našem projektu, ale ne OneClassSVM. - napr. Isolation Forest - k-Disagreeing Neighbours - ide o percento príkladov z k najbližších susedov, ktorí sú zaradení do inej triedy 2. Unsupervised anomaly detection: pro testování těchto metod potřebujeme příklady, u nichž víme, zda jsou anomální nebo normální. Můžeme použít (vyberte nejlepší variantu) - náhodně vygenerovaná data i třídy - **generátor umělých dat (tzv. ground truth)** - asociační pravidla 3. LOF (Local Outlier Factor) je metoda - **unsupervised** - semi-supervised - supervised 4. Unsupervised anomaly detection: pro testování těchto metod potřebujeme příklady, u nichž víme, zda jsou anomální nebo normální. Můžeme použít (vyberte nejlepší variantu) - **dvoutřídní data např. z kaggle** - časté vzory - náhodně vygenerovaná data i třídy ### 12-14, Anomaly detection (9) 1. Pro detekci anomálních textů se používá řada metrik, jedna z následujících však ne. Označte ji. - Obscurity of Vocabulary Features (e.g. Top 1000 words in Gigawords) - **Complexity measures (e.g. Sum of the Error Distance by Linear Programming)** - Rank Features (e.g. Distribution of Pronouns list) 2. Pro detekci anomálních textů se používá řada metrik, jedna z následujících však ne. Označte ji. - **Landmarking features (e.g. Naive Bayes)** - Readability Measures - Part of Speech and Syntax Features (e.g. Percentage of words that are verbs) 3. Jedna z klasických metod detekce anomálií - kNN - počítá outlier faktor jako vzdálenost bodu k jeho k nejbližších sousedů. Jiná - KDN - počítá z k nejbližších sousedů odlehost příkladu jako relativní počet sousedů, kteří nesdílí s příkladem stejnou třídu. Z obou vychází CODB. Jak se od nich liší? - **CODB kombinuje tieto postupy dohromady, cize pouziva pravdepodobnost ze dany zaznam patri do danej triedy vzhladom na K-najblizsich susedov + sumu vzdialenosti od ostatnych elementov z rovnakej triedy k danemu zaznamu + vzdialenost medzi danym zaznamom a jeho K-najblizsimi susedami. Kde tieto hodnoty vahuje pomocou 2 parametrov Alpha a Beta. Tieto dve postupy kNN a KDN sa pozeraju skor na lokalnu strukturu kde CODB sa pozera viac na globalny pohlad.** 4. Pro detekci anomálních textů se používá řada metrik, jedna z následujících však ne. Označte ji. - Simple Surface Features (e.g. Average sentence length) - **Statistical features (e.g. Correlation or Geometric mean)** - Obscurity of Vocabulary Features (e.g. Top 1000 words in Gigawords) 5. Pro detekci anomálních textů se používá řada metrik, jedna z následujících však ne. Označte ji. - Part of Speech and Syntax Features (e.g. Percentage of words that are adjectives) - Rank Features (e.g. Distribution of POS tri-grams list) - **Information-theoretic meta-features (e.g. Attribute Entropy)** ### 1-4, Keyness (10) 1. Klíčová slova dokumentu charakterizují jednak aboutness tohoto dokumentu, jednak jeho styl. Která z variant častěji vypovídá o stylu než o aboutness? - substantiva - **spojky** - adjektiva 2. Negativní klíčové slovo (negative keyword) dokumentu D je podle Mika Scotta takové slovo, - které se v D vyskytuje méně často než v referenční korpusu. - **které se v D vyskytuje významně méně často než v referenční korpusu.** - které se nevyskytuje v D a vyskytuje se v referenčním korpusu. 3. Pro detekci anomálního segmentu v textu se používají specifické míry vzdálenosti. Vyberte tu nejsmysluplnější. - Rozdíl počtu určitých a neurčitých členů v textu - Počet klíčových slov delších než tři znaky - **Vzdálenost segmentu od průměru ostatních segmentů** 4. Vyberte odpověď, která je podle vás pravdě nejblíž. Klíčová slova dokumentu charakterizují jednak aboutness tohoto dokumentu, jednak jeho styl. Která z variant nejčastěji vypovídá o aboutness? - **substantiva** - předložky - číslovky 5. Pro detekci anomálního segmentu v textu se používají specifické míry vzdálenosti. Vyberte tu nejsmysluplnější. - Počet klíčových slov delších než tři znaky - **Vzdálenost segmentu textu od jeho doplňku (tj. zbytku textu)** - Rozdíl počtu určitých a neurčitých členů v textu 6. Vyberte variantu, která je nejblíž pravdě - Negativní klíčové slovo (negative keyword) dokumentu D je podle Mika Scotta takové slovo, - **které se v D vyskytuje významně méně často než v referenční korpusu.** 7. Vyberte odpověď, která je podle vás pravdě nejblíž. - Klíčová slova dokumentu charakterizují jednak aboutness tohoto dokumentu, jednak jeho styl. Která z variant častěji vypovídá o stylu než o aboutness? - **spojky** # Poznámky ## 1.Lecture - Overview of ML ### Unsupervised learning (Clustering) - Herarchical clustering - ![](https://i.imgur.com/gNQL6T8.png) - Aglommerative (bottom-up) - Devisise (partitional, top-down) - Direct clustering (k-means) - Cluster similarity - ![](https://i.imgur.com/qv94Dnz.png) - Buckshot algorithm - Soft clustering - probability for instance that belongs to certain class ### Supervised learning (classification, regression, prediction) - Inductive classification - ![](https://i.imgur.com/4wDrlP9.png) - ![](https://i.imgur.com/kQUxI3p.png) - Evaluation - confusion matrix - precision & recall - f-measures - learning curve - ROC curve, AUC - Generalization - ![](https://i.imgur.com/4O0biYT.png) - ![](https://i.imgur.com/cOHeCZp.png) - Decision tree learning - ![](https://i.imgur.com/Sl1KyYd.png) ### Instance based learning - unlike other learning algorithms, does not involve construction of an explicit abstract generalization but classifies new instances based on direct comparison and similarity to known training isntances. - it assume a function fordetermining the similarity or distance between any two instances ## 2.Lecture - Text representation ### Text representation of documents - String of characters -- too general for text mining - Words - very simply, yet sufficent for a lot of applications - tranform to lower case - remove stopwords (e.g. any, is, her, as the, on, further, an, being, had, you, he'll, then, very, ...) - remove punctuation - remove numbers - stemming - POS tags - more complex - Synstactic structures - more complex - Relations (between entities / sentences / documents) - more complex - TF, TF-IDF - ![](https://i.imgur.com/sJ3GFAW.png) - ![](https://i.imgur.com/xcMZaJI.png) ### Dimensionality reduction - Zipf's law - Frequency of any word is inversely proportional to its rank in the fequency table (the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.) - To decrease dimension of a term-document matrix - Feature selection - remove infrequent terms (words that occur one or two times) - remove frequent terms (stopwords) - Feature construction (feature extrancion / transformation) - rank and then select K with the highest rank - *information gain*, *chi^2* - n-grams construction - Latent semantic analysis - new feature = linear combination fo the original features - (sampling to decrease the number of documents without loss of accuracy) ## 3-4.Lecture - Text Categorization & Text Mining - Unsupervised learning - words: Latent Semantic Analysis, Key Phrase Extraction - documents: Document clustering, Topic Detection - Anomaly detection - Supervised learning - words: Part-of-Speech Tagging, Word Sense Disambiguation - documents: Text categorization/classification/filtering - Information Extraction - Other applications - Context-sensitive spelling checking - Named entity recognition - coreference resolution - Sentiment classification - Text summarisation - Using Relevance feedback (Rocchio) - For each category, compute a prototype vector by summing the vectors fo the training documents in the category. - Assign test documents to the category with the closest prototype vector based on cosine similarity - K nearest-neighbor - Naive Bayes - smoothing - Text categorization algorithm - Select terms (words, bi-, trigrams, named entities) - (Perform morphological analysis (lemmatization, tagging)) - (Term clustering) - Remove nonsignificant terms (stoplist, infrequent words) - Select terms with highest discriminative power - Learn classifier ### Fragments and Text Categorization - Two novel methods - *skip-tail* - only the first X sentences of a document are used - *fragments* - splits the documents into fragments which are classified independently of each others ## 5.Lecture - Disambiguation & Opinion mining ### Disambiguation - Zjednoznacnenie - odstranovanie viacznacnosti - vyber klasifikace textoveho objektu z niekolkych moznosti - podla kontextu textoveho objektu - Priklad: - *Jel kolem statku.* vs. *Urazil kolem vetev.* - *Nasadil mu korunu.* vs. *Nasadil tomu korunu.* - Zakladny algoritmus 1. Pridat kazdemu slovu (niektore/vsetky mozne) znacky. Pomocou slovniku, korpusu, morfologickeho analyzatoru 2. Pomocou pravidiel (vytvorenych clovekom/naucenych) zrus nespravne znacky. 3. Odstran rucne (niektore/vsetky zostavajuce) dvojznacnosti - ![](https://i.imgur.com/AdKsvDK.png) - In English, a sentence ending in *n* prepositional phrases has over 2^n syntactic interpretations - Many jokes rely on the ambiguity od language ### Opinion Mining & Sentiment Analysis - Two main types of textual information - **Facts and Opinions** - Sentiment analysis or opinion mining - computational study of opinions, sentiments and emotions expressed in text - ![](https://i.imgur.com/ZFkBYOG.png) - ![](https://i.imgur.com/lZ6waVm.png) - ![](https://i.imgur.com/iojBxbr.png) ## 6.Lecture - Information extraction, Named Entity Recognition - MUC (Message Understanding Conference) - established standard evaluation methodology using development (training), test data and metrics: precision, recall, F-measure - **Information extraction (IE)** - identify specific pieces of information (data) in a unstructured or semi-structured textual document - transform unstructured information in a corpus of documents or web pages into a structured database - Main tasks 1. **Named entity recognition** - the tokens in the text may refer to named entities, e.g. locations, people, organizations, price, date - "Bill Clinton" is_a person - "New York" is_a location 2. **Relationship extraction** - generally follows named entity recognition lokks for the relationships among different named entities - LocatedIn(Bill Clinton, New York) - Machine learning for IE 1. the types of entities and the realtionships among them to be learned are pre-defined, and tagged 2. training data (i.e., text segments) are available containing examples of such entities and/or relationships - **Web extraction** - An extractor for a semi-structured web site is sometimes referred to as a *wrapper+ - Process of extracting from such pages is soemtimes referred to as *screen scraping* - Template types - Regular Expression (regex) - Web extraction using DOM Trees - Active learning - annotating training documents for each application is difficult and expensive - best to focus human effort on annotating the most informative documents - Active learning is method that pick only the most informative exampels for training ## 7.Lecture - The Deep Learning Revolution - Perceptron - Deep learning revolution - Convolutional neural nets - Recurrent neural nets - Deep reinforcement learning - Independent Word Vectors - represent word meanings as vectors based on words with which they co-occur - Word2Vec ## 8-9.Lecture - Anomaly Detection - Outlier - an outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism - Application - Fraud detection - Medicine - Detecting measurement errors - Instrusion detection - Language learning - Type of outliers - **Point outliers** - cases that either individually or in small groups are very different from the others - **Contextual outliers** - cases that can only be regarded as outliers when taking the context where they occur into account - **Collective outliers** - cases that individually cannot be considered strange, but together with other associated cases are clearly outliers - **Outlier detection methods** - Statistical methods - normal data objects are generated by statistical (stochastic) model and data not following the model are outliers - Proximity-Based methods - Distance-based detection - radius *r*, k-nearest neighbors - Density-based detection - Clustering-based methods - Other types of methods - supervised methods - normal vs anomaly data - Decision tree - semi-supervised methods - training data has labeled isntances only for the normal class - One-class SVM - Clustering - normal data instances lie close to their closest cluster centroid while anomalies are far away from their clsoest cluster centroid - Unsupervised methods - no labels, most widely used - assumption: normal instaces are far more frequent than anomalies in th etest data and they make clusters - Proximity-based methods, clustering - LOF (Local Outlier Factor) - Evaluation of anomaly detection methods - supervised - easy, precision/recall - semi-supervised, unsupervised methods 1. Two class data, e.g. from UCI - 1st class normal, 2nd source of anomalies 2. Artificial data generator - more flexible - Mining with rarity - Multi-class outlier detection ### ABOD - angle-based outlier degree - object *o* is an outlier if most other objects are located in similar directions ### CODB - combination of distance-based, density-based approach w.r.t class attribute ### RF-OEX - class outlier detection with Random Forests - Random Forest is an ensemble classification and regression approach - After each tree is built, all of the data are run down the tree, and proximities re computed for each pair of cases - if two cases occupy the same treminal node, their proximity is increased by one - at the end of the run, the proximities are normalized by dividing by the number of trees - Outlier factor - sum of three different measures of proximity or outliemess - *Proximity* - to the members of the same class - *Misclassication* - proximity to the members of other classes - *Ambiguity* - measure a percentage of ambiguous classification ### ILP - nejaky bullshit na konci prednasky ### Unsupervised Detection of Anomalous Test, by David Guthrie - Process 1. Represent texts as numerical vectors 2. Measure distance of vectors from rest of set 3. Set threshold for anomaly - Text representation - numerical vector of 166 features - Simple Surface Features (19) - Readability Measures (7) - Obscurity of Vocabulary Features (7) - Part of Speech and Syntax Features (11) - Rank Features (8) - Emotional Tone Features (114) - Measuring the distances - **ClustDist** - A distance based on average linkage clustering - **SDEDist** - The Stahel-Donoho Estimator distance - **Pcout** - The weights calculated by the PCout algorithm - **MeanComp** - Distance from the mean of all other segments in the data - **TxtCompDist** - Method developed by authors that uses the distance from the textual complement - **Feature selection** - best features are the same over all experiments 1. Gunning-Fog Index 2. Number of passive sentences 3. Flesch-Kincaid Reading Ease 4. Percenteage of sentences over 15 words - worst features differ, but mostly sentiment based ## 10.Lecture - Keyness - Positive KWs - A word which is positively key occurs more often than would be expected by chance in comparison with the reference corpus - Negative KWs - A word which is negatively key occurs less often than would be expected by chance in comparison with the reference corpus - Keyness gives robust indications of the text ABOUTNESS together with indicators of STYLE - learning allows to express not only keyness of a keyword but also realtions between keywords and between keywords and the document itself - keyness of a term WORD, MULTIWORD EXPRESSION, A CONCEPT FROM AN ONTOLOGY for a text document is - the degree with which the term is frequent in the document with respect to the whole colelction of documents - Frequency - information gain, sattistical log-likelihood test - Key terms - first N terms from a ranked list of terms - Mutli-relational learning - ![](https://i.imgur.com/HfM9TTs.png) - Frequent patterns - **frequent pattern** - a formula in a form of conjuction of predicates from a given set of predicates, characterized by a level fo significance caleld *support* - **support** - a number of instances, e.g. sentences for which this formula holds - Examples of frequent patterns - ![](https://i.imgur.com/NBF8nRP.png) - **Key** - Informally - a cloud of words and groups of words (that we call keywords), their attributes, e.g. morphologival, syntactical, or obtained from an ontology, and relations between them that characterizes a given document - Formally - a set of logic formulas in first-order predicate logic with modal operators, *before(), follows(), preceedes(), after(), always_after()* that are frequent in a corpus