---
title: NLP - 1. Language Processing and Python
tags: self-learning, NLP
description: Notes on NLTK Book, Chapter 1 - Language Processing and Python
---

{%hackmd BkVfcTxlQ %}

# **_NLP - 1. Language Processing and Python_**

:::warning
Author: 黃丰嘉
Written: 2019-10-10 (Thu)

---

**_Reference:_**

* Wikipedia (e.g., collocations, the Turing Test, the type-token distinction).
* [Indurkhya, Nitin and Fred Damerau (eds., 2010) Handbook of Natural Language Processing (Second Edition). Chapman & Hall/CRC. (Indurkhya & Damerau, 2010) (Dale, Moisl, & Somers, 2000)](https://epdf.pub/handbook-of-natural-language-processing.html)
* [Jurafsky, Daniel and James Martin (2008) Speech and Language Processing (Second Edition). Prentice Hall. (Jurafsky & Martin, 2008)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf)
* [Mitkov, Ruslan (ed., 2003) The Oxford Handbook of Computational Linguistics. Oxford University Press. (Second edition expected in 2010.) (Mitkov, 2002)](https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199573691.001.0001/oxfordhb-9780199573691)
* [Association for Computational Linguistics (ACL website)](http://www.aclweb.org/)
    * information about international and regional conferences and workshops;
    * the ACL Wiki, with links to hundreds of useful resources;
    * the ACL Anthology, which contains most of the NLP research literature from the past 50+ years, fully indexed and freely downloadable.
* Excellent introductory linguistics textbooks:
    * (Finegan, 2007)
    * (O'Grady et al., 2004)
    * (OSU, 2007)
* You might also like to consult **_Language Log_**, a popular linguistics blog with occasional posts that use the techniques described in this book.
:::

# **Course Outline**
[TOC]

---

:::info
Goals of this chapter:
* What can we achieve by combining simple programming techniques with large quantities of text?
* How can we automatically extract key words and phrases that summarize the style and content of a text?
* What tools does Python provide for NLP?
* What are the interesting challenges of NLP?
:::

## **_1 Computing with Language: Texts and Words_**

> In this section we carry out a few linguistically motivated programming tasks without explaining how they work under the hood.

* A text is treated as "raw data".

### 1.1 Getting Started with Python

### 1.2 Getting Started with NLTK

* Install NLTK 3.0
    * downloadable for free from the [NLTK 3.4.5 documentation](http://nltk.org/)

```
$ pip install nltk
```

```python=
>>> import nltk
>>> nltk.download()   # Browse the available packages
```

![](https://i.imgur.com/pHi2g1v.png)

* Downloading the NLTK **Book** Collection: it consists of about 30 compressed files requiring about 100 MB of disk space.
* Work in the Python IDLE (Interactive DeveLopment Environment).

```python=
>>> from nltk.book import *   # from NLTK's book module, load all items
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
```

* Enter their names to find out about these texts:

```python=
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
```

:::info
Terminology:
* Corpus
    * A collection of written or spoken material stored on a computer and used to find out how language is used.
:::

### 1.3 Searching Text

1. ==concordance(): shows the contexts in which the `search word` occurs==
    > Shows us every occurrence of a given word, together with some context.
    >
    > Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language.
    >
    > In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

```python=
>>> text1.concordance("monstrous")   # concordance: every occurrence of the word, with context
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
```

2. ==similar(): finds words that appear in the same contexts as the `search word` and treats them as similar words==
    > Observe that we get different results for different texts.
    >
    > Austen uses this word quite differently from Melville; for her, `monstrous` has positive connotations, and sometimes functions as an intensifier like the word `very`.

```python=
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
```

3. ==common_contexts([]): prints the contexts shared by `all of the search words`==
    * e.g., `a_pretty` means that both `a monstrous pretty` and `a very pretty` occur in the text.

```python=
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty am_glad be_glad a_lucky
```

:::info
**An example that clarifies the difference between the two methods: https://stackoverflow.com/questions/43438008/difference-between-similar-and-concordance-in-nltk**

* Find the words similar to `monstrous` in text1 (note that `doleful` is among them)
    * These words share the same syntactic contexts as the search word.

```python=
>>> text1.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
```

* Show the contexts of `monstrous` and `doleful` in text1 separately

```python=
>>> text1.concordance("monstrous")
```
> that has survived the flood ; ==most monstrous and== most mountainous ! That Himmal

```python=
>>> text1.concordance("doleful")
```
> ite perspectives . There's a ==most doleful and== most mocking funeral ! The sea

* Finally, print the contexts that `monstrous` and `doleful` share in text1
    * This recovers the usage pattern the two words have in common.

```python=
>>> text1.common_contexts(["monstrous", "doleful"])
most_and
```
:::

4. ==dispersion_plot([]): lexical dispersion plot==

```
$ pip install matplotlib
$ pip install numpy
```

* Shows where in the text each word occurs (its location) and how often it appears.
    > You can also plot the frequency of word usage through time using https://books.google.com/ngrams
* Each stripe represents an instance of a word.
* Each row represents the entire text.

> Example: [checking how often words related to shaping American democracy appear across the whole corpus](https://medium.com/pyladies-taiwan/nltk-%E5%88%9D%E5%AD%B8%E6%8C%87%E5%8D%97-%E4%B8%80-%E7%B0%A1%E5%96%AE%E6%98%93%E4%B8%8A%E6%89%8B%E7%9A%84%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E5%B7%A5%E5%85%B7%E7%AE%B1-%E6%8E%A2%E7%B4%A2%E7%AF%87-2010fd7c7540)
> ```python=
> >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
> ```
> * Lexical Dispersion Plot for Words in U.S. Presidential Inaugural Addresses: this can be used to investigate changes in language use over time.
> ![](https://i.imgur.com/spApIPi.png)

5. ==generate(): automatically generates text==
    > Let's try generating some random text in the various styles we have just seen.
```python=
>>> text3.generate()
```

### 1.4 Counting Vocabulary

* Count the total length of a text, punctuation included:
    * Finding out the length of a text from start to finish.

```python=
>>> len(text3)
44764
# So Genesis has 44,764 words and punctuation symbols, or "tokens."
# A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group.
```

> The token count includes repeats: the phrase `to be or not to be` contains six tokens but only four distinct word types (`to`, `be`, `or`, `not`).
>
> How many distinct words does the book of Genesis contain?
> Ans: remove the duplicate words with `set()`.

* Count the text's length after removing duplicates (i.e., the number of unique words and punctuation symbols):
    > Remember that capitalized words appear before lowercase words in sorted lists.

```python=
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
```

* Lexical diversity (the lexical richness of the text)

```python=
>>> len(set(text3)) / len(text3)
0.06230453042623537
# the number of distinct words is just 6% of the total number of words,
# or equivalently, each word is used 16 times on average.
```

* Count how many times a particular word occurs

```python=
>>> text3.count("smoke")
2
>>> 100 * text4.count("a") / len(text4)
1.457973123627309
```

:::info
Rewriting the code as functions:

```python=
def lexical_diversity(text):
    return len(set(text)) / len(text)

>>> lexical_diversity(text3)
0.06230453042623537
```

```python=
def percentage(count, total):
    return 100 * count / total

>>> percentage(4, 5)
80.0
>>> percentage(text4.count('a'), len(text4))
1.457973123627309
```
:::

---

## **_2 A Closer Look at Python: Texts as Lists of Words_**

> In this section we systematically review key programming concepts.

### 2.1 Lists

* For our purposes, we will think of a `text` as nothing more than `a sequence of words and punctuation`.

```python=
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
4
>>> lexical_diversity(sent1)
1.0
```

```python=
>>> from nltk.book import *
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
```

* concatenation

```python=
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.', '.']
```

* appending

```python=
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
```

### 2.2 Indexing Lists

* index
    > Notice that our indexes start from zero.
    > By convention, m:n means elements m…n-1.
```python=
>>> text4[173]
'awaken'
>>> text4.index('awaken')
173
```

```python=
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[0]
'word1'
>>> sent[9]
'word10'
>>> sent[5:8]
['word6', 'word7', 'word8']
>>> sent[:3]
['word1', 'word2', 'word3']
# Omit the first number -> begins at the start of the list
>>> sent[8:]
['word9', 'word10']
# Omit the second number -> goes to the end
```

```python=
>>> sent[0] = 'First'
# ['First', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']
>>> sent[9] = 'Last'
# ['First', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'Last']
>>> len(sent)
10
>>> sent[1:9] = ['Second', 'Third']
>>> len(sent)
4
# ['First', 'Second', 'Third', 'Last']
```

* slicing

```python=
>>> text6[1600:1615]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to']
```

### 2.3 Variables

### 2.4 Strings

* Assign a string to a variable, index a string, slice a string

```python=
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
```

* Multiplication and addition with strings

```python=
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
```

* Join the words of a list to make a single string; split a string into a list

```python=
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> '/'.join(['Monty', 'Python'])
'Monty/Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
```

---

## **_3 Computing with Language: Simple Statistics_**

> In this section we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and expressions of a text.

### 3.1 Frequency Distributions

> How can we automatically identify the words of a text that are most informative about the topic and genre of the text?
>
> ![](https://i.imgur.com/zBcAbks.png =300x200)

* NLTK provides built-in support for frequency distributions.
* FreqDist

```python=
>>> from nltk.book import *
>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
# We can inspect the total number of words ("outcomes") that have been counted up — 260,819 in the case of Moby Dick.
```

* ==`most_common`: the most frequently occurring words==

```python=
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), ..., ('whale', 906), ('one', 889), ('you', 841), ...]
# The expression most_common(50) gives us a list of the 50 most frequently occurring types in the text.
>>> fdist1['whale']
906
```

* Of the 50 most frequent words listed above, only "whale" tells us anything about the content of the text.
    > Do any words produced in the last example help us grasp the topic or genre of this text?
    > What proportion of the text is taken up with such words?
* Generate a cumulative frequency plot for these words

```python=
>>> fdist1.plot(50, cumulative=True)
# These 50 words account for nearly half the book!
```

![](https://i.imgur.com/jkb6N0r.png)

* ==`hapaxes`: words that occur only once==

```python=
>>> fdist1.hapaxes()
['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', ...]
```

### 3.2 Fine-grained Selection of Words

> Let's look at the ==long words (more than 15 characters)== of a text.
> Perhaps these will be more characteristic and informative.
* Set-builder notation and the corresponding Python expression

```
{ w | w ∈ V & P(w) }     # Mathematical set notation
[w for w in V if p(w)]   # Python expression
```

```python=
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', ..., 'uninterpenetratingly']
```

> Have we succeeded in automatically extracting words that typify a text?
> Well, these very long words are often hapaxes (i.e., unique) and perhaps it would be better to find frequently occurring long words.

* Automatically identify the frequently occurring, content-bearing words of a text:
    * `len(w) > 7`: the words are longer than seven letters.
    * `fdist5[w] > 7`: the words occur more than seven times.

```python=
>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
```

### 3.3 Collocations and Bigrams

> A **collocation** is a sequence of words that occur together unusually often.
> Thus `red wine` is a collocation, whereas `the wine` is not.

* Working with bigrams (pairs of adjacent words)
    * `bigrams()`

```python=
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

* `collocation_list()`

```python=
>>> text4.collocation_list()
['United States', 'fellow citizens', 'four years', 'years ago', 'Federal Government', 'General Government', 'American people', 'Vice President', 'God bless', 'Chief Justice', 'Old World', 'Almighty God', 'Fellow citizens', 'Chief Magistrate', 'every citizen', 'one another', 'fellow Americans', 'Indian tribes', 'public debt', 'foreign nations']
>>> text8.collocation_list()
['would like', 'medium build', 'social drinker', 'quiet nights', 'non smoker', 'long term', 'age open', 'Would like', 'easy going', 'financially secure', 'fun times', 'similar interests', 'Age open', 'weekends away', 'poss rship', 'well presented', 'never married', 'single mum', 'permanent relationship', 'slim build']
```

### 3.4 Counting Other Things

* The distribution of word lengths in a text.

```python=
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})
```

```python=
>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
# the most frequent word length is 3
>>> fdist[3]
50223
>>> fdist.freq(3)
0.19255882431878046
# Words of length 3 account for roughly 50,000 (or 20%) of the words making up the book
```

> Although we will not pursue it here, further analysis of word length might help us understand differences between authors, genres, or languages. A small sketch of such a comparison follows below.
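To make that remark a little more concrete, here is a minimal sketch (not from the book) comparing the average word length of two of the loaded texts as a crude stylistic signal; the variable names are just for illustration:

```python=
# A rough sketch: compare the word-length profile of two texts.
# Assumes the NLTK book texts have been downloaded and can be imported as above.
from nltk import FreqDist
from nltk.book import text1, text2

for name, text in [("Moby Dick", text1), ("Sense and Sensibility", text2)]:
    fdist = FreqDist(len(w) for w in text)   # distribution of word lengths
    avg_len = sum(n * count for n, count in fdist.items()) / fdist.N()
    print(name, "average word length:", round(avg_len, 2))
```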
Functions defined for NLTK's frequency distributions:

| Example | Description |
| --- | --- |
| fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
| fdist[sample] += 1 | increment the count for this sample |
| fdist['monstrous'] | count of the number of times a given sample occurred |
| fdist.freq('monstrous') | frequency of a given sample |
| fdist.N() | total number of samples |
| fdist.most_common(n) | the n most common samples and their frequencies |
| for sample in fdist: | iterate over the samples |
| fdist.max() | sample with the greatest count |
| fdist.tabulate() | tabulate the frequency distribution |
| fdist.plot() | graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
| `fdist1 \|= fdist2` | update fdist1 with counts from fdist2 |
| fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |

---

## **_4 Back to Python: Making Decisions and Taking Control_**

### 4.1 Conditionals

* `[w for w in text if condition]`
    > Ex: `[w for w in sent7 if len(w) != 4]`

```python=
>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
```

```python=
>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
['Stock-index', 'index-arbitrage', 'index-fund', 'index-options', ...]
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
['Abelmizraim', 'Allonbachuth', 'Beerlahairoi', 'Canaanitish', ...]
>>> sorted(w for w in set(sent7) if not w.islower())
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)
['ancient', 'ceiling', 'conceit', 'conceited', ...]
```

| Function | Meaning |
| --- | --- |
| s.startswith(t) | test if s starts with t |
| s.endswith(t) | test if s ends with t |
| t in s | test if t is a substring of s |
| s.islower() | test if s contains cased characters and all are lowercase |
| s.isupper() | test if s contains cased characters and all are uppercase |
| s.isalpha() | test if s is non-empty and all characters in s are alphabetic |
| s.isalnum() | test if s is non-empty and all characters in s are alphanumeric |
| s.isdigit() | test if s is non-empty and all characters in s are digits |
| s.istitle() | test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals) |

### 4.2 Operating on Every Element

```python=
>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
```

```python=
>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set(word.lower() for word in text1))
17231
# No double-counting of words like This and this, which differ only in capitalization.
# We've wiped 2,000 off the vocabulary count!
>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
# Eliminate numbers and punctuation from the vocabulary count
# by filtering out any non-alphabetic items.
```

### 4.3 Nested Code Blocks

### 4.4 Looping with Conditions

```python=
>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky:
...     print(word, end=' ')
ancient ceiling conceit conceited conceive conscience ...
```
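In the same spirit, here is a short sketch combining a loop with an `if`/`elif`/`else` chain to classify each token of a sentence (shown with the original four-token `sent1`, before the `append` from Section 2.1):

```python=
>>> for token in sent1:
...     if token.islower():
...         print(token, 'is a lowercase word')
...     elif token.istitle():
...         print(token, 'is a titlecase word')
...     else:
...         print(token, 'is punctuation')
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
```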
---

## **_5 Automatic Natural Language Understanding_**

> It takes skill, knowledge, and some luck, to extract answers to such questions as:
> What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget?
>
> Getting a computer to answer them automatically involves a range of language processing tasks, including information extraction, inference, and summarization.
> In this section we describe some language understanding technologies, to give you a sense of the interesting challenges that are waiting for you.

### 5.1 Word Sense Disambiguation

* Use the surrounding context to automatically resolve which sense a word carries.
    > In other words, we automatically disambiguate words using context, exploiting the simple fact that nearby words have closely related meanings.

:::warning
* Ex: What does `by` mean in each sentence?
    a. The lost children were found **by** the **_searchers_** (agentive)
    b. The lost children were found **by** the **_mountain_** (locative)
    c. The lost children were found **by** the **_afternoon_** (temporal)
:::

### 5.2 Pronoun Resolution

* We must work out what a pronoun refers to.
    > A deeper kind of language understanding is to work out "who did what to whom" — i.e., to detect the subjects and objects of verbs.
* Computational techniques for tackling this problem:
    * Anaphora Resolution
        * identifying what a pronoun or noun phrase refers to
    * Semantic Role Labeling
        * identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on).

:::warning
* Ex: Try to determine what was sold, caught, and found in each case:
    a. The thieves stole the paintings. They were subsequently **_sold_**.
    b. The thieves stole the paintings. They were subsequently **_caught_**.
    c. The thieves stole the paintings. They were subsequently **_found_**.
:::

### 5.3 Generating Language Output

> If we can automatically solve such problems of language understanding, we will be able to move on to tasks that involve generating language output, such as **question answering** and **machine translation**.
> Try this yourself using http://translationparty.com/

* A correct translation actually depends on a correct understanding of the pronoun.

> Working out the sense of a word, the subject of a verb, and the antecedent of a pronoun are steps in establishing the meaning of a sentence, things we would expect a language understanding system to be able to do.

### 5.4 Machine Translation

> Its roots go back to the early days of the Cold War, when the promise of automatic translation led to substantial government sponsorship, and with it, the genesis of NLP itself.

* Machine translation systems still have serious shortcomings.
    > These can be exposed starkly by translating a sentence back and forth between a pair of languages until equilibrium is reached.
* Machine translation is difficult, partly because a given word can have several possible translations (depending on its meaning), and partly because word order must be changed to match the grammatical structure of the target language.
    > Today these difficulties are being tackled by collecting massive quantities of parallel text from news sites and government websites that publish documents in two or more languages.
    > Given a document in German and English, and possibly a bilingual dictionary, we can automatically pair up the sentences, a process called text alignment. Once we have a million or more sentence pairs, we can detect corresponding words and phrases and build a model that can be used for translating new text.

### 5.5 Spoken Dialog Systems

* The Turing Test
* A dialogue system must understand the user's goals.

> For an example of a primitive dialogue system, try having a conversation with an NLTK chatbot.

```python=
import nltk
nltk.chat.chatbots()
```

:::warning
**Simple Pipeline Architecture for a Spoken Dialogue System:**
Spoken input (top left) is analyzed, words are recognized, sentences are parsed and interpreted in context, application-specific actions take place (top right); a response is planned, realized as a syntactic structure, then to suitably inflected words, and finally to spoken output; different types of linguistic knowledge inform each stage of the process.
![](https://i.imgur.com/4z4pjSc.png)
:::
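Beyond the interactive `nltk.chat.chatbots()` menu above, a built-in chatbot can also be driven from code. The sketch below is a minimal, non-interactive example and rests on an assumption about the module layout (an `eliza_chatbot` object in `nltk.chat.eliza` with a `respond()` method); it is not covered in this chapter:

```python=
# A minimal sketch (assumes nltk.chat.eliza exposes an eliza_chatbot
# object whose respond() method returns a reply string).
from nltk.chat.eliza import eliza_chatbot

for line in ["Hello", "I feel a little tired today"]:
    print("You:  ", line)
    print("Eliza:", eliza_chatbot.respond(line))
```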
### 5.6 Textual Entailment

* Recognizing Textual Entailment (RTE)

:::warning
* Ex: consider the following text-hypothesis pair:
    a. Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books
    b. Hypothesis: Golinkin has written eighteen books
* In order to determine whether the hypothesis is supported by the text, the system needs the following background knowledge:
    1. if someone is an author of a book, then he/she has written that book;
    2. if someone is an editor of a book, then he/she has not written (all of) that book;
    3. if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.
:::

### 5.7 Limitations of NLP

* Despite research-led progress on tasks such as RTE, natural language systems that have been deployed in real-world applications still cannot perform common-sense reasoning or draw on world knowledge in a general and reliable way.

---

## **_6 Summary_**

* A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using `len(text)` and word types using `len(set(text))`.
* To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write `set(w.lower() for w in text if w.isalpha())`.
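Putting the summary points together, a minimal sketch (assuming the book texts are loaded as in Section 1.2; the counts shown are the ones reported for `text1` in Section 4.2):

```python=
from nltk.book import text1

n_tokens = len(text1)                                  # word tokens, punctuation included
n_types = len(set(text1))                              # distinct word types
vocab = set(w.lower() for w in text1 if w.isalpha())   # case-folded, alphabetic vocabulary

print(n_tokens, n_types, len(vocab))                   # 260819 19317 16948
```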