# Cheat Sheet

## 1. ***Self introduction***

> *Hello, my name is Siwei and I come from China. I am a second-year master's student in Computer Science at TU Delft, on the Artificial Intelligence track. I have taken several courses related to the NLP field, such as Information Retrieval, Deep Learning, and Multimedia Search and Recommendation. This motivated me to do a thesis project on natural language processing. Besides, I have experience with downstream natural language processing tasks. Last but not least, the internship at Philips is closely related to what I did in Jie Yang's group.*

## 2. Why are you suitable for this project?

First, I know the frequently used models such as RNNs and BERT. Besides, I have experience with downstream NLP tasks such as stance classification and passage ranking, and I have read a paper that may be related to this project: "Improving Spare Part Search for Maintenance Services using Topic Modelling".

![](https://i.imgur.com/JbTq0Dv.png)

Key idea: improve the efficiency of **spare part retrieval** by using topic modelling to categorize past cases. So this time, a knowledge base could perhaps be used to improve the accuracy. The topic I worked on in Jie Yang's group was bridging the gap between knowledge bases and language models, which is related to improving this direction. In addition, I previously wrote a survey on stance classification and knowledge graphs.

## 3. Related projects

### 3.1 Multimedia

The traditional way to build a movie recommendation system is to use user data and interaction history and then apply collaborative filtering. We implemented a convolutional neural network based on user features, movie features, and rating features. For a more robust recommendation, the user and movie features are obtained with a GCN, which gives more accurate recommendations than the original approach.

![](https://i.imgur.com/EsVqlmR.png)

The rating features are gathered from a website, and our method can adjust its recommendations based on user feedback. For the cold-start problem, the five most popular movies are shown. For evaluation: MSE loss and an open questionnaire comparing the A/B versions.

### 3.2 NLP project

In the Information Retrieval course, the second project is somewhat related to the thesis topic. It tackles the OffensEval 2019 challenge, which has three subtasks: offensive language identification, categorization of offense types, and offense target identification. For each task, we fine-tuned a BERT-based classifier and built an RNN model with GloVe embeddings. Experiments showed that we achieved an accuracy of 0.8558 for task A, 0.8875 for task B, and 0.7377 for task C; these results would have placed our model 2nd in the competition. Besides, I recently did a literature review studying the effect of knowledge graphs on the stance classification task.

Task A aims to distinguish offensive tweets from normal tweets; task B categorizes whether an insult is targeted or untargeted; task C classifies the targeted offensive tweets as individual, group, or other. Each task builds on the previous one; for instance, the data used for task B is the data labeled as "offensive" in task A.
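For task A (offensive vs. not offensive), here is a minimal sketch of the kind of fine-tuning setup described above, assuming the HuggingFace `transformers` library; the model name, hyperparameters, and toy data are placeholders rather than the project's exact configuration.

```python
# Hypothetical sketch: fine-tune BERT for OffensEval task A (offensive vs. not).
# Assumes the HuggingFace transformers library; hyperparameters are illustrative.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

tweets = ["you are a great person", "some offensive tweet here"]   # placeholder data
labels = torch.tensor([0, 1])                                      # 0 = not offensive, 1 = offensive

enc = tokenizer(tweets, padding=True, truncation=True, max_length=64, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                       # a few epochs are usually enough for fine-tuning
    optimizer.zero_grad()
    out = model(**enc, labels=labels)        # cross-entropy loss is computed internally
    out.loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1)   # 0/1 prediction for each tweet
```

The same skeleton applies to tasks B and C: only the label set and the filtered subset of the data change.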
The data comes from a crowdsourced source, Twitter, and we chose GloVe because it has a Twitter-trained version.

**Data preparation**: remove emojis and stop words, correct spelling.

**Class imbalance**: oversampling, undersampling, and threshold adjusting.

A confusion matrix is a summary of prediction results on a classification problem.

**Problem case**: one tweet contains the word "criminal" but merely describes a situation; BERT still treated it as offensive.

Losses: BERT uses cross-entropy loss; the RNN uses softmax cross-entropy loss.

![](https://i.imgur.com/031HHaW.png)

DRMM steps: calculate the cosine similarity between the embeddings of every pair of terms, then convert each row into a matching histogram.

KNRM steps: calculate the cosine similarity between the embeddings of every pair of terms, then use k kernels to summarize every query word's row of the interaction matrix.

The F1 score is 2 * ((precision * recall) / (precision + recall)). It is also called the F score or the F measure. Put another way, the F1 score conveys the balance between precision and recall.

Finally, the advantages of AUC: AUC also measures the classifier's ability to separate positive and negative examples, and it still gives a reasonable evaluation of the classifier when the classes are imbalanced.

### 3.3 DL reproducibility project

The paper presents a novel method for multi-source domain adaptation named WAMDA, which uses multiple sources to train a predictor based on their internal relevance and their relevance score with respect to the target. Our work reproduces the proposed approach on a single dataset, OfficeHome, together with two other methods (ResNet and MFSAN) used to evaluate the effectiveness of the proposed method.

WAMDA performs effective multi-source domain adaptation based on source-target and source-source similarities. The algorithm has two stages. The first stage is pre-adaptation training, from which we obtain the relevance scores, the feature extractor, the source classifiers, and the domain classifiers. The second stage is multi-source adaptation training: the weighted alignment of domains is performed, and classifiers and a target encoder are learned in this weighted aligned space.

Dataset: OfficeHome; experiments were run on Google Colab. The difficult part:

![](https://i.imgur.com/haexXjk.png)

Losses involved: binary cross-entropy loss, CORAL loss, QT loss, align loss, distill loss, entropy loss, DE loss, and T->W loss.

## 4. Behaviour questions

### 4.1 How would people describe you?

I would say that people would describe me as a good communicator, because when we do group work, we always have a good working atmosphere.

### 4.2 What is your most significant accomplishment?

In the first assignment of the Information Retrieval course, we were asked to implement the BM25 algorithm for passage ranking and then apply learning-to-rank techniques for reranking. After that, we were asked to make two improvements. However, the dataset was so large that we had to use external toolkits, and there were no instructions for them, so it was a really hard time. I finished the project and got an 8. I learned that there is no project I should give up on.

### 4.3 What is your greatest strength?

I am very good at planning and I stick to my plans. I planned to finish 300 LeetCode problems and made a schedule going from arrays and stacks to dynamic programming. Now the summer is over, and I have done the problems as planned.
### 4.4 What is your greatest weakness?

I would say my biggest weakness is that I don't have a lot of work experience, but I'm a fast learner and highly adaptable. I'm up to speed with the latest programming trends and have a fresh perspective.

### What are your preferred programming languages and why?

Python: it is very easy to understand, and there are so many libraries that can be used quickly.

## 5. NLP introduction

> ***<1> What is natural language processing?***
> It is a technique used in artificial intelligence and machine learning. The task is to use computational methods to allow computers to understand human language. More specifically, NLP techniques allow computer systems to process and interpret data in the form of human languages.

> ***<2> Real-life applications of natural language processing***
> There are some easier tasks, such as spell checking, summarization, information extraction, and stance detection. Besides, there are some harder tasks:

* **Machine translation**
> It converts written or spoken sentences into another language. We can also find the correct pronunciation and meaning of a word by using Google Translate.
* **Question answering**
> To provide better customer support, companies have started using chatbots for 24/7 service. Chatbots help resolve the basic queries of customers. If a chatbot cannot resolve a query, it forwards it to the support team.
* **Sentiment analysis**
> (or opinion mining) uses NLP to determine whether data is positive, negative, or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.

> **<3> Why is NLP hard?**
> 1. Non-standard English.
> 2. Ambiguity and uncertainty exist in language.
> 3. Lack of knowledge.

> ***<4> NLP components***
> 1. **Phonology**: speech analysis and synthesis.
> 2. **Morphology**: normalization (find all forms of a word), case folding (upper to lower case), stemming (reduce words to their stem), lemmatization (reduce all word forms to the basic form, e.g. am/is/are -> be), tokenization.
> 3. **Syntax**: part-of-speech tagging (marking up each word in a text), parsing (transforming the text into a linguistic structure).
> 4. **Semantics**: (advanced) similarity, ontologies, dialog analysis.
> 5. **Reasoning**: domain and application knowledge.

**Stop words**: high-frequency words that may not carry much information.

![](https://i.imgur.com/TCUrkm7.png)

## 6. ***Word embedding***

### 6.1 Why not use WordNet?

> * It lacks context; for example, 'proficient' is only sometimes similar to 'good'.
> * It misses new meanings of words and is impossible to keep up to date.
> * It can't compute accurate word similarity.

Why not use one-hot encoding? (One-hot encoding splits a sentence into individual words; a word that occurs is marked with 1 and a word that does not is marked with 0.)

> * Not efficient to use.
> * Very high dimensional.

***Instead, use a dense vector to represent each word, chosen so that it is similar to the vectors of words that appear in similar contexts.***

### ***6.2 word2vec***

> Word embeddings are obtained with word2vec.
> Represent each word by a vector such that it can best predict the surrounding words.
> Skip-gram: use one word as the input to predict its context.
> CBOW: use the context of a word to predict the word.
> word2vec uses a single neural network layer to map one-hot word vectors to distributed word vectors.
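A minimal sketch of training both variants, assuming the `gensim` library; the toy corpus and hyperparameters are placeholders.

```python
# Hypothetical sketch: train skip-gram and CBOW embeddings with gensim.
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 -> skip-gram (center word predicts context); sg=0 -> CBOW (context predicts center word).
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, negative=5)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

vec = skipgram.wv["cat"]                        # dense 100-dim vector for "cat"
print(skipgram.wv.most_similar("cat", topn=3))  # nearest neighbours in embedding space
```

The `negative=5` argument enables negative sampling, which is what the objective-function discussion below refers to.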
> The word2vec objective function: maximum likelihood estimation.
> Problem: gradient descent on the full objective is very slow, so stochastic gradient descent is used instead. Negative sampling (mix some real pairs with some fake pairs): train binary logistic regressions for a true pair (center word and a word in its context window) versus several noise pairs (the center word paired with a random word).
> There are also traditional count-based methods, e.g. co-occurrence windows or full documents.
> Windows have some problems: the vectors grow with the vocabulary and are very high dimensional; SVD can be used to reduce the dimension.

### ***6.3 GloVe***

**Key idea: it uses ratios of co-occurrence probabilities rather than the co-occurrence probabilities themselves.**

* Fast training
* Scalable to huge corpora
* Good performance even with a small corpus and small vectors

### 6.4 fastText

> fastText represents a word at the character level using n-grams.

Advantages:
> 1. The word vectors generated for low-frequency words are better, because their n-grams are shared with other words.
> 2. For words outside the training lexicon, word vectors can still be constructed by summing their character-level n-gram vectors.

Disadvantage:
> Spans longer than the context window cannot capture order information.

Why pretraining helps: the vectors are trained for more words on much more data.

## **Classification**

A softmax classifier gives a linear decision boundary, so a neural network can learn much more complex functions with nonlinear decision boundaries.

## 7. **RNN**

An RNN can predict the next word based on the previous words.

![](https://i.imgur.com/eAehRJj.png)

The hidden-state value s of the recurrent neural network depends not only on the current input x but also on the hidden-state value s from the previous step.

Problem with RNNs: short-term memory has a greater impact, while long-term memory has less impact.

Applications: text generation, speech recognition, machine translation, image captioning, video tagging.

![](https://i.imgur.com/z9APqyY.png)

## 8. LSTM

LSTM works like the idea of selective focus: it can retain the "important information" in a longer data sequence and ignore the unimportant information. This solves the short-term-memory problem of RNNs.

![](https://i.imgur.com/Awdb9IS.png)

GRU mainly simplifies and adjusts the LSTM model, which saves a lot of time when the training dataset is relatively large.
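A minimal sketch of an LSTM next-word predictor of the kind described above, assuming PyTorch; the vocabulary size, dimensions, and the random token batch are placeholders.

```python
# Hypothetical sketch: a tiny LSTM language model that predicts the next token.
# Assumes PyTorch; vocab size, dimensions, and the input batch are placeholders.
import torch
import torch.nn as nn

class TinyLSTMLM(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                # hidden state at every time step
        return self.out(h)                 # logits over the vocabulary per position

model = TinyLSTMLM()
tokens = torch.randint(0, 1000, (2, 10))    # fake batch: 2 sequences of 10 token ids
logits = model(tokens)                      # (2, 10, 1000)

# Train with softmax cross-entropy: each position predicts the next token.
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
```

Swapping `nn.LSTM` for `nn.GRU` gives the simplified variant mentioned above with the same interface.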
## 9. ELMo

Existing embeddings like GloVe have a problem: a word gets the same representation no matter which context it appears in. This is why contextualized embeddings (ELMo, BERT) appeared. Compared with TagLM, instead of using context windows, ELMo uses long contexts.

Static word embeddings cannot handle polysemous words. After ELMo introduces context to dynamically adjust a word's embedding, is the polysemy problem solved? Yes.

![](https://i.imgur.com/HKQMHxH.png)

During pre-training, the language model learns an embedding for each word (polysemy is not yet resolved); at usage time, each word occurs in a specific context, and its embedding can be adjusted according to the semantics of that context (which does resolve polysemy). Understanding: during pre-training, the LSTMs in ELMo learn the context information of each word and store it in the network. During fine-tuning, downstream tasks can fine-tune the network so that it learns new features. ELMo itself is the idea of dynamically adjusting word embeddings according to the current context.

![](https://i.imgur.com/kRspt9f.png)
![](https://i.imgur.com/pM6fUyj.png)

ULMFiT (a notable application of transfer learning): first train on a large corpus, then fine-tune on the specific task. GPT-3: the model size increased.

## 10. Attention

For tasks where the input length is known but the output length is not, such as machine translation, the two models mentioned above are not sufficient. This led to the sequence-to-sequence model, Seq2Seq for short. Seq2Seq has a bottleneck problem, because a single vector has to capture all of the information.

![](https://i.imgur.com/RJx8JKy.png)

The attention mechanism is a problem-solving approach modeled on human attention: simply put, it quickly filters high-value information out of a large amount of information. It is mainly used to solve the problem that LSTM/RNN models have difficulty producing a reasonable final vector representation when the input sequence is long. The approach is to keep the intermediate LSTM states, learn over them with a new module, and relate them to the output, thereby filtering the information.

The essence of the attention function can be described as a mapping from a query to a set of key-value pairs.

Difference between attention and self-attention: attention happens between the query elements of the target and all elements of the source. Self-attention gives you a matrix that tells you the degree of relevance between entity1 and entity2, entity3, ..., between entity2 and entity1, entity3, ..., and so on (source with source, entity with entity).

The advantage of **multi-head attention** is that it allows the model to learn in different feature subspaces of the data.

## 11. Transformer

Transformer: a shortcut (residual) mechanism addresses the problem of slowly vanishing gradients, and positional embeddings are used; specifically, the positional encoding adds each word's position information to its word vector.

Before the Transformer appeared, the mainstream algorithm for sequence data was the RNN. An RNN processes one element of the sequence at a time and carries a hidden vector that "remembers" the previously processed elements, updating this hidden vector while processing the current element. RNNs are naturally suited to sequence data, but their limitations are obvious: first, because an RNN processes the sequence element by element, it is hard to parallelize; second, it is difficult to reach back to information from elements far away from the current one, the so-called long-path-dependency problem. The Transformer solves both problems, and its secret weapon is the attention mechanism. The Transformer structure is shown below.

![](https://i.imgur.com/8ZQGKQz.png)

1. self-attention

![](https://i.imgur.com/qGvizea.png)

2. multi-head attention

![](https://i.imgur.com/GCZWGtf.png)

## 12. **BERT**

![](https://i.imgur.com/3sGcK0X.png)

BERT can replace word2vec and brings several improvements: first, it uses the Transformer [2] as the main framework of the algorithm, which captures the bidirectional relationships in a sentence more thoroughly; second, it uses the multi-task training objectives of Masked Language Modeling (MLM) [3] and Next Sentence Prediction (NSP); third, using more powerful machines to train on larger-scale data took BERT's results to a new height, and Google open-sourced the model, so users can directly use BERT in place of word2vec representations and efficiently apply it to their own tasks.

BERT's network architecture uses the multi-layer Transformer structure proposed in "Attention Is All You Need". Its biggest feature is that it discards the traditional RNN and CNN and, through the attention mechanism, turns the distance between any two words into 1, which effectively solves the thorny long-term dependency problem in NLP.

Problem: words can see themselves (the future), which makes the network non-generative. Solution: mask out k% of the input words, and then predict the masked words.

BERT performs well on many NLP tasks and has strong generalization ability; for a specific task, you only need to add an output layer and fine-tune.

Neural ranking models come in two kinds: interaction-based models and models that learn better embedding representations (see the sketch after this subsection).

DRMM steps: calculate the cosine similarity between the embeddings of every pair of terms, then convert each row into a matching histogram.

KNRM steps: calculate the cosine similarity between the embeddings of every pair of terms, then use k kernels to summarize every query word's row of the interaction matrix.

The F1 score is 2 * ((precision * recall) / (precision + recall)). It is also called the F score or the F measure. Put another way, the F1 score conveys the balance between precision and recall.

Finally, the advantages of AUC: AUC also measures the classifier's ability to separate positive and negative examples, and even with imbalanced samples it still gives a reasonable evaluation of the classifier.
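For the interaction-based DRMM/KNRM steps listed above, here is a minimal sketch of the interaction matrix plus kernel pooling, assuming PyTorch; the embeddings, kernel centers, and the `kernel_pooling` helper are illustrative rather than the exact KNRM implementation.

```python
# Hypothetical sketch of the interaction-based idea behind KNRM:
# build a cosine-similarity interaction matrix, then summarize each query
# word's row with a set of Gaussian kernels. Dimensions and embeddings are fake.
import torch

def kernel_pooling(query_emb, doc_emb, mus, sigma=0.1):
    """query_emb: (q_len, dim), doc_emb: (d_len, dim), mus: kernel centers."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_emb, dim=-1)
    sim = q @ d.T                                     # (q_len, d_len) cosine similarities
    # Each kernel counts soft matches around its center mu.
    kernels = torch.exp(-((sim.unsqueeze(-1) - mus) ** 2) / (2 * sigma ** 2))
    soft_tf = kernels.sum(dim=1)                      # (q_len, n_kernels): per query word
    return torch.log1p(soft_tf).sum(dim=0)            # (n_kernels,): pooled ranking features

# Fake embeddings: 3 query terms and 8 document terms, 50-dim each.
query_emb = torch.randn(3, 50)
doc_emb = torch.randn(8, 50)
mus = torch.linspace(-0.9, 1.0, steps=11)             # kernel centers spread over [-1, 1]
features = kernel_pooling(query_emb, doc_emb, mus)    # fed to a small scoring layer in KNRM
```

DRMM follows the same first step but replaces the kernels with a fixed matching histogram per query term.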
**What is a language model?** We use it to predict what will appear next. Using attention but no RNN is what leads to the Transformer.

## 13. Knowledge in IR and NLP

![](https://i.imgur.com/PrcxHyM.png)

Entity-Oriented Search: helps the search engine understand what to search for.
Entity Salience:

## 14. TF-IDF

### 1. One-hot

A very long vector is used to represent a word. The length of the vector is the size of the dictionary N. Each vector has only one dimension set to 1, and the remaining dimensions are all 0. The position of the 1 indicates the position of the word in the dictionary.

### 2. TF-IDF

> TF-IDF is a statistical method used to evaluate how important a certain word (character) in a sentence is to the entire document.

![](https://i.imgur.com/CHqNvnY.png)
![](https://i.imgur.com/U65BgkS.png)

Advantages: easy to understand and easy to implement.

**Disadvantage**:
> Its simple structure does not consider the semantic information of words, and it cannot handle polysemy or synonymy.

## 15. Evaluation (precision, recall, F1)

![](https://i.imgur.com/UisVZKT.png)
![](https://i.imgur.com/TmcFYQE.png)

### 15.1 **Area Under the Curve (AUC)**

The AUC quantifies the model's ability to separate the classes by comparing the count of correct positive predictions against the count of incorrect positive predictions at different thresholds.

### 15.2 **Mean Reciprocal Rank (MRR)**

The Mean Reciprocal Rank (MRR) evaluates the responses retrieved for a query according to their probability of correctness. This evaluation metric is typically used in information retrieval tasks.

### 15.3 **Mean Average Precision (MAP)**

Similar to MRR, the Mean Average Precision (MAP) calculates the mean precision across the retrieved results. It is also used heavily in information retrieval tasks with ranked retrieval results.

## 16. Naive Bayes

Overall, the essence of the Naive Bayes model is a statistical probability model over sample attributes. For Naive Bayes to work well, feature engineering and data cleaning beforehand are very important. In early machine-learning classification models, feature selection was crucial and directly determined model quality, which is very different from today's deep learning models: in neural networks, feature extraction and learning usually happen inside the model, which greatly reduces the feature-engineering work.

## 17. KT-NET

![](https://i.imgur.com/BcyIHSy.png)
![](https://i.imgur.com/bOfW7g3.png)

## 18. ERNIE

First, the T-Encoder extracts semantic features from the input text, producing a vector representation for each word; these vectors are the first input to the K-Encoder. Next, assuming some of these words correspond to entities in the knowledge graph, ERNIE uses entity features obtained by training TransE on those entities as the second input. The two inputs each go through their own multi-head self-attention to extract features; in the formulas below, the superscript denotes the index of the K-Encoder layer.

![](https://i.imgur.com/t8WPW7y.png)

## 19. KAR

![](https://i.imgur.com/5053mtT.png)

## 20. CoLAKE

Because the number of entities in a knowledge graph is usually very large, implementing CoLAKE raises two problems: 1) at the output, it is impossible to compute a softmax over all entities; this can simply be solved with negative sampling; 2) at the input, it is almost impossible to keep an entity embedding matrix on the GPU. The solution is to keep the entity embeddings on the CPU and the rest of the model on multiple GPUs: the training processes on the GPUs read entity embeddings from the CPU and write the corresponding gradients back, while the CPU process updates the entity embeddings asynchronously based on the collected gradients.

## 21. Feature-based vs. fine-tuning

### Feature-based methods

The classic feature-based method is word2vec. Feature-based methods aim to use feature-extraction techniques to produce distributed representations of the characters or words of natural language. Usually, these representations are placed at the model's input layer as the initialization of the word vectors. However, once such a language model has finished training, the word vectors are fixed, which cannot solve the polysemy problem that is pervasive in NLP tasks. ELMo, which generates the vector representation of a center word from the surrounding words in the text, solves the polysemy problem to some extent and improves the model's expressive power.

### Fine-tuning methods

Feature-based methods only use the word-vector representations as the input layer that flows into the model. Fine-tuning methods plug the trained model structure and parameters into the downstream task and fine-tune them during downstream training. This approach started with Dai and Le (2015), who trained an autoencoder and then combined it with downstream tasks. Since then, a large body of work has grown around this idea, with classics such as BERT, ULMFiT, and GPT all achieving good results.
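A minimal sketch contrasting the two approaches with a BERT encoder, assuming the HuggingFace `transformers` library; the linear head, flag, and dummy inputs are illustrative, not code from any of the works cited above.

```python
# Hypothetical sketch: feature-based vs. fine-tuning with a BERT encoder.
# Assumes the HuggingFace transformers library; the head and inputs are placeholders.
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
head = nn.Linear(bert.config.hidden_size, 2)        # small task-specific output layer

FEATURE_BASED = False                               # flip to switch between the two approaches

if FEATURE_BASED:
    # Feature-based: freeze the pretrained weights; they only produce input features.
    for p in bert.parameters():
        p.requires_grad = False
    trainable = list(head.parameters())
else:
    # Fine-tuning: pretrained weights stay trainable and are updated on the downstream task.
    trainable = list(bert.parameters()) + list(head.parameters())

optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# The forward pass is the same in both cases: encode, take the [CLS] vector, classify.
dummy_ids = torch.randint(0, bert.config.vocab_size, (2, 16))   # fake batch of token ids
cls_vec = bert(input_ids=dummy_ids).last_hidden_state[:, 0]     # (batch, hidden_size)
logits = head(cls_vec)                                          # task logits
```

The only difference between the two regimes is which parameters the optimizer is allowed to update.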