--- title: NLP - 2. Accessing Text Corpora and Lexical Resources tags: self-learning, NLP description: 1. title 請改為 [課程主題] 2. 加上"{%hackmd BkVfcTxlQ %}"意為套用黑色模板 --- {%hackmd BkVfcTxlQ %} # **_NLP - 2. Accessing Text Corpora and Lexical Resources_** :::warning 撰寫者:黃丰嘉 撰寫時間:2019-10-21 (ㄧ) --- **_Reference:_** * [NLTK 初學指南(二):由外而內,從語料庫到字詞拆解 — 上手篇](https://medium.com/pyladies-taiwan/nltk-初學指南-二-由外而內-從語料庫到字詞拆解-上手篇-e9c632d2b16a) * [第二章 存取語料庫與詞彙資源 ](http://qheroq.blogspot.com/2010/06/python21.html) ::: # **課程大綱** [TOC] --- > 本章目標: > 1. 好用的corpora和lexical resources(辭彙資源)有哪些?如何透過Python來取得? > 2. 哪些Python構造對NLP最有幫助? > 3. 編寫Python程式碼時,如何避免功能重複撰寫? * NLP 會使用大量的 linguistic data(語言資料) 或 corpora(語料庫)。 * 【大】語料庫:corpora (複數) vs. corpus (單數) * 【小】文本/文章:texts ![](https://i.imgur.com/GvMMCoH.png) --- ## **_1 Accessing Text Corpora_** > 本節目標: > * 研究各種文本語料庫(text corpora),以便處理文本(texts)。 > * 如何選擇及使用單個文本(texts)。 * 使用nltk內建的文本(texts):`from nltk.book import *` * 在語料庫的設計、語料的蒐集上 * 應盡量做到平衡分配在不同的主題和語式上(平衡各種風格體裁)。 > 某些語料庫可提供更豐富的語言內容,如:詞性標注(part-of-speech tags),對話標籤(dialogue tags),句法樹(syntactic trees)...等。 * `US Presidential Inaugural Addresses`語料庫: * 此語料庫實際上包含數十個單獨的文本。 * 但為了方便起見,將其首尾相接並視為單個文本(single text)。 ### 1.1 Gutenberg Corpus (文學作品) * [Project Gutenberg electronic text archive](http://www.gutenberg.org/) > NLTK包含`古騰堡計劃電子文本檔案庫`中的少量文本,該檔案庫擁有約25,000本免費電子書。 * `gutenberg`語料庫內,含有哪些文本? * ==查找語料庫中的文本id:`corpus.fileids()`== ```python= ### 方法一 >>> import nltk >>> nltk.corpus.gutenberg.fileids() ### 方法二 (解決 鍵入的名稱很長之麻煩問題) >>> import nltk >>> from nltk.corpus import gutenberg >>> gutenberg.fileids() ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] ``` * `gutenberg`語料庫內的`austen-emma.txt`,含有多少單詞(word)? * ==單詞列表:`.words()`== * `len()`:包括單詞之間的空格。 ```python= ### 方法一 >>> emma = nltk.corpus.gutenberg.words('austen-emma.txt') >>> len(emma) 192427 ### 方法二 (解決 鍵入的名稱很長之麻煩問題) >>> emma = gutenberg.words('austen-emma.txt') >>> len(emma) 192427 >>> emma ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...] ``` * 列出關於每個文本(text)的訊息 1. Average word length(平均單詞長度) * ==原始內容:`corpus.raw(fileids)` (不做任何處理)== 2. Average sentence length(平均句子長度) * ==單詞列表:`corpus.words(fileids)` (依據有意義的單字做斷詞)== 3. The number of times each vocabulary item appears in the text on average (our lexical diversity score) 每個詞彙平均出現在文本中的次數(詞彙多樣性的分數) * ==句子列表:`corpus.sents(fileids)`== <br /> ```python= >>> for fileid in gutenberg.fileids(): ... num_chars = len(gutenberg.raw(fileid)) ... num_words = len(gutenberg.words(fileid)) ... num_sents = len(gutenberg.sents(fileid)) ... num_vocab = len(set(w.lower() for w in gutenberg.words(fileid))) ... print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid) 1. 每個單詞的平均字元數:若數值為4,則實際上應為3。(因為要扣掉空白字元) 2. 每個句子的平均單詞數:依據不同的作者,而有不同的特徵。 3. 每個單詞集合內單詞的平均出現次數:依據不同的作者,而有不同的特徵。 4. 文本名稱 5 25 26 austen-emma.txt 5 26 17 austen-persuasion.txt 5 28 22 austen-sense.txt 4 34 79 bible-kjv.txt ... #round():四捨五入到最接近的整數。 ``` ```python= >>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt') >>> macbeth_sentences [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...] 
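# 依索引取出特定句子,並找出整個語料庫中最長的句子: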
>>> macbeth_sentences[1116] ['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble'] >>> longest_len = max(len(s) for s in macbeth_sentences) >>> [s for s in macbeth_sentences if len(s) == longest_len] [['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', ..., 'Battlements']] ``` * Note: * 配合第一章`.concordance()` ```python= >>> import nltk >>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt')) >>> emma.concordance("surprize") Displaying 25 of 37 matches: er father , was sometimes taken by surprize at his being still able to pity ` hem do the other any good ." " You surprize me ! Emma must do Harriet good : a Knightley actually looked red with surprize and displeasure , as he stood up , r . Elton , and found to his great surprize , that Mr . Elton was actually on ... ``` ### 1.2 Web and Chat Text (非正式文本) * 網路上的文本,包括Firefox論壇內容、口語對話內容、電影劇本、個人廣告和葡萄酒評論。 ```python= >>> from nltk.corpus import webtext >>> for fileid in webtext.fileids(): ... print(fileid, webtext.raw(fileid)[:65], '...') ... firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ... grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop ... overheard.txt White guy: So, do you have any plans for this evening? Asian girl ... pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ... singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ... wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ... ``` * 即時通訊的聊天會話文本,該語料庫包含超過10,000個帖子,已做過匿名處理`"UserNNN"`及手動刪除個人資訊。 > 最初由Naval Postgraduate School收集,用於研究自動偵測網絡掠食者(Internet predators)。 > 該語料庫分為15個資料夾,每個資料夾包含給定日期內收集的數百個帖子,包含特定年齡的聊天室(青少年、20歲、30歲、40歲以及普通成人)。 > > 資料夾名稱包含日期、聊天室和帖子數。 > 如:`10-19-20s_706posts.xml`:2006年10月19日,從20歲聊天室收集的706個帖子。 ```python= >>> from nltk.corpus import nps_chat >>> chatroom = nps_chat.posts('10-19-20s_706posts.xml') >>> chatroom[123] ['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.'] ``` ### 1.3 Brown Corpus * 布朗語料庫是第一個百萬字的英語電子語料庫,於1961年在布朗大學創建。 > 該語料庫內含500個文本,並且[按類型進行分類(如:新聞、社論...)](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM) > 用於研究體裁之間的差異,這是一種稱為文體學(**stylistics**)的語言探究。 > * Example Document for Each Section of the Brown Corpus |ID|File|Genre|Description| |---|---|---|---| |A16|ca16|news|Chicago Tribune: Society Reportage| |B02|cb02|editorial|Christian Science Monitor: Editorials| |C17|cc17|reviews|Time Magazine: Reviews| |D12|cd12|religion|Underwood: Probing the Ethics of Realtors| |E36|ce36|hobbies|Norling: Renting a Car in Europe| |F25|cf25|lore|Boroff: Jewish Teenage Culture| |G22|cg22|belles_lettres|Reiner: Coping with Runaway Technology| |H15|ch15|government|US Office of Civil and Defence Mobilization: The Family Fallout Shelter| |J17|cj19|learned|Mosteller: Probability with Statistical Applications| |K04|ck04|fiction|W.E.B. 
Du Bois: Worlds of Color| |L13|cl13|mystery|Hitchens: Footsteps in the Night| |M01|cm01|science_fiction|Heinlein: Stranger in a Strange Land| |N14|cn15|adventure|Field: Rattlesnake Ridge| |P12|cp12|romance|Callaghan: A Passion in Rome| |R06|cr06|humor|Thurber: The Future, If Any, of Comedy| * 使用語料庫的方式: * 單詞列表`corpus.words(fileids)` * 句子列表(每個句子本身為單詞列表)`corpus.sents(fileids)` * 指定讀取特定類別或資料夾 ```python= >>> from nltk.corpus import brown >>> brown.categories() #列出所有類別 ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] >>> brown.words(categories='news') #指定讀取特定類別 ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] >>> brown.words(fileids=['cg22']) ['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...] # file or directory: 'C:\\Users\\pcsh1\\AppData\\Roaming\\nltk_data\\corpora\\brown\\cg22' >>> brown.sents(categories=['news', 'editorial', 'reviews']) #指定讀取多個特定類別 [['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...] ``` * 比較不同體裁之情態動詞(modal verbs)的用法 * 第一步驟:對特定體裁(genre)產生計數。 ```python= >>> from nltk.corpus import brown >>> news_text = brown.words(categories='news') >>> fdist = nltk.FreqDist(w.lower() for w in news_text) >>> modals = ['can', 'could', 'may', 'might', 'must', 'will'] >>> for m in modals: ... print(m + ':', fdist[m], end=' ') # end ='' => 以便輸出印在同一行上。 can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 ``` * 第二步驟:獲取每種感興趣字詞在特定類裁的計數。(conditional frequency distributions) > 在news體裁中,出現頻率最高的是`will`; > 在romance體裁中,出現頻率最高的是`could`。 ```python= >>> cfd = nltk.ConditionalFreqDist( ... (genre, word) ... for genre in brown.categories() ... for word in brown.words(categories=genre)) >>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor'] >>> modals = ['can', 'could', 'may', 'might', 'must', 'will'] >>> cfd.tabulate(conditions=genres, samples=modals) can could may might must will news 93 86 66 38 50 389 religion 82 59 78 12 54 71 hobbies 268 58 131 22 83 264 science_fiction 16 49 4 12 8 16 romance 74 193 11 51 45 43 humor 16 30 8 8 9 13 ``` ### 1.4 Reuters Corpus * 路透社語料庫,包含10,788個新聞文檔,總計130萬字。 > 這些文檔已分為90個主題,並分為兩組"training"和"test"。 > 文本名稱為'test/14826',表示從測試集(test set)中提取的文檔。 > 之所以拆分,其目的為訓練和測試演算法,這些演算法會自動偵測文檔的主題。 ```python= >>> from nltk.corpus import reuters >>> reuters.fileids() ['test/14826', 'test/14828', 'test/14829', 'test/14832', ...] >>> reuters.categories() ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', ...] ``` * 與布朗語料庫不同,路透社語料庫中的類別彼此重疊,僅是因為一篇新聞通常涵蓋多個主題。 > 我們可以要求一份或多份文檔涵蓋的多個主題,或要求一個或多個類別中包含的多個文檔。 > 為了方便起見,語料庫的方法接受單個fileid或fileid列表。 ```python= >>> reuters.categories('training/9865') ['barley', 'corn', 'grain', 'wheat'] >>> reuters.categories(['training/9865', 'training/9880']) ['barley', 'corn', 'grain', 'money-fx', 'wheat'] >>> reuters.fileids('barley') ['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...] >>> reuters.fileids(['barley', 'corn']) ['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', ...] ``` * 我們可根據資料夾或類別指定所需的單詞或句子。 ```python= >>> reuters.words('training/9865')[:14] ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export'] >>> reuters.words(['training/9865', 'training/9880']) ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...] 
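# 也可以改用 categories= 依類別取出單詞: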
>>> reuters.words(categories='barley') ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...] >>> reuters.words(categories=['barley', 'corn']) ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...] ``` ### 1.5 Inaugural Address Corpus (就職演說) * 就職演說語料庫,實際上包含55個文本,每個總統其致辭各一個,其較為特別的屬性是`時間維度(年份)`。 > 在第一章時,我們將其視為單個文本。 > `Lexical Dispersion Plot` for Words in U.S. Presidential Inaugural Addresses: > X軸的"word offset"`單詞偏移量`,代表以數值作為該文本單詞的索引(從開頭的第一個單詞開始計算) > ![](https://i.imgur.com/ilu7Mfx.png) ```python= >>> from nltk.corpus import inaugural >>> inaugural.fileids() ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...] >>> [fileid[:4] for fileid in inaugural.fileids()] #取出文本名稱的前四個字元 ['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...] ``` * 目標:得知隨時間流逝,`america`和`citizen`一詞的使用趨勢。 * `w.lower()`:將字母變為小寫。 * `startswith()`:檢查是否以`america`或`citizen`的目標字串為開頭。因此,`American's`和`Citizens`將會被一併算入。 ```python= >>> import nltk >>> cfd = nltk.ConditionalFreqDist( ... (target, fileid[:4]) ... for fileid in inaugural.fileids() ... for w in inaugural.words(fileid) ... for target in ['america', 'citizen'] ... if w.lower().startswith(target)) >>> cfd.plot() ``` * Plot of a Conditional Frequency Distribution: * Note: `counts` -> 其各別的文本長度尚未被標準化。 ![](https://i.imgur.com/ZCK3wYc.png) ### 1.6 Annotated Text Corpora (已標註的文本語料庫) * 許多文本語料庫包含語言標註,以表示POS標籤(POS tags)、命名實體(named entities)、句法結構(syntactic structures)、語義角色(semantic roles)...等。 * NLTK提供訪問這些語料庫的便捷方法,並且擁有語料庫和語料庫樣本的數據包,可免費下載並用於教學和研究。 * [更多資訊關於下載語料庫的資料](http://www.nltk.org/data)。 * [關於如何訪問NLTK語料庫的範例](http://www.nltk.org/howto/) * 語料庫列表 |Corpus|Compiler|Contents| |---|---|---| |Brown Corpus|Francis, Kucera|15 genres, 1.15M words, tagged, categorized| |CESS Treebanks|CLiC-UB|1M words, tagged and parsed (Catalan, Spanish)| |Chat-80 Data Files|Pereira & Warren|World Geographic Database| |CMU Pronouncing Dictionary|CMU|127k entries| |CoNLL 2000 Chunking Data|CoNLL|270k words, tagged and chunked| |CoNLL 2002 Named Entity|CoNLL|700k words, pos- and named-entity-tagged (Dutch, Spanish)| |CoNLL 2007 Dependency Treebanks (sel)|CoNLL|150k words, dependency parsed (Basque, Catalan)| |Dependency Treebank|Narad|Dependency parsed version of Penn Treebank sample| |FrameNet|Fillmore, Baker et al|10k word senses, 170k manually annotated sentences| |Floresta Treebank|Diana Santos et al|9k sentences, tagged and parsed (Portuguese)| |Gazetteer Lists|Various|Lists of cities and countries| |Genesis Corpus|Misc web sources|6 texts, 200k words, 6 languages| |Gutenberg (selections)|Hart, Newby, et al|18 texts, 2M words| |Inaugural Address Corpus|CSpan|US Presidential Inaugural Addresses (1789-present)| |Indian POS-Tagged Corpus|Kumaran et al|60k words, tagged (Bangla, Hindi, Marathi, Telugu)| |MacMorpho Corpus|NILC, USP, Brazil|1M words, tagged (Brazilian Portuguese)| |Movie Reviews|Pang, Lee|2k movie reviews with sentiment polarity classification| |Names Corpus|Kantrowitz, Ross|8k male and female names| |NIST 1999 Info Extr (selections)|Garofolo|63k words, newswire and named-entity SGML markup| |Nombank|Meyers|115k propositions, 1400 noun frames| |NPS Chat Corpus|Forsyth, Martell|10k IM chat posts, POS-tagged and dialogue-act tagged| |Open Multilingual WordNet|Bond et al|15 languages, aligned to English WordNet| |PP Attachment Corpus|Ratnaparkhi|28k prepositional phrases, tagged as noun or verb modifiers| |Proposition Bank|Palmer|113k propositions, 3300 verb frames| |Question Classification|Li, Roth|6k questions, categorized| |Reuters Corpus|Reuters|1.3M 
words, 10k news documents, categorized| |Roget's Thesaurus|Project Gutenberg|200k words, formatted text| |RTE Textual Entailment|Dagan et al|8k sentence pairs, categorized| |SEMCOR|Rus, Mihalcea|880k words, part-of-speech and sense tagged| |Senseval 2 Corpus|Pedersen|600k words, part-of-speech and sense tagged| |SentiWordNet|Esuli, Sebastiani|sentiment scores for 145k WordNet synonym sets| |Shakespeare texts (selections)|Bosak|8 books in XML format| |State of the Union Corpus|CSPAN|485k words, formatted text| |Stopwords Corpus|Porter et al|2,400 stopwords for 11 languages| |Swadesh Corpus|Wiktionary|comparative wordlists in 24 languages| |Switchboard Corpus (selections)|LDC|36 phonecalls, transcribed, parsed| |Univ Decl of Human Rights|United Nations|480k words, 300+ languages| |Penn Treebank (selections)|LDC|40k words, tagged and parsed| |TIMIT Corpus (selections)|NIST/LDC|audio files and transcripts for 16 speakers| |VerbNet 2.1|Palmer et al|5k verbs, hierarchically organized, linked to WordNet| |Wordlist Corpus|OpenOffice.org et al|960k words and 20k affixes for 8 languages| |WordNet 3.0 (English)|Miller, Fellbaum|145k synonym sets| ### 1.7 Corpora in Other Languages * NLTK隨附多種語言的語料庫。 ```python= >>> import nltk >>> nltk.download('cess_esp') >>> nltk.corpus.cess_esp.words() ['El', 'grupo', 'estatal', 'Electricité_de_France', ...] >>> nltk.corpus.indian.words('hindi.pos') ['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...] >>> nltk.corpus.udhr.fileids() #《世界人權宣言》的譯本及字元編碼列表 ['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', ...] >>> nltk.corpus.udhr.words('Javanese-Latin1')[11:] ['Saben', 'umat', 'manungsa', 'lair', 'kanthi', 'hak', ...] ``` > `udhr語料庫`,包含超過300種語言的《世界人權宣言》(Universal Declaration of Human Rights)。 > 該語料庫的文檔名稱包含有關使用的字元編碼的訊息,例如`UTF8`或`Latin1`。 * 檢視`udhr語料庫`中,所選擇語言(6種譯本)的單詞長度差異。 * Cumulative Word Length Distributions: * 下圖表示少於5個字元長度的單詞佔`Ibibio文本`的80%;佔`German文本`的60%;佔`Inuktitut文本`的25%。 ```python= >>> from nltk.corpus import udhr >>> languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik'] >>> cfd = nltk.ConditionalFreqDist( ... (lang, len(word)) ... for lang in languages ... for word in udhr.words(lang + '-Latin1')) >>> cfd.plot(cumulative=True) ``` ![](https://i.imgur.com/TTnZPx7.png) * **Your Turn:** Plot a frequency distribution of the letters of the text ```python= >>> nltk.corpus.udhr.fileids() >>> raw_text = udhr.raw('Russian-UTF8') >>> nltk.FreqDist(raw_text).plot() ``` ![](https://i.imgur.com/JkjiwQ5.png) > Note: > 不幸的是,對於許多語言,還沒有大量的語料庫。 > 通常,政府或產業界對開發語言資源的支持不足,雖然有個人努力但零散、難以發現或重複使用。某些語言沒有建立的書寫系統,或者受到威脅。 > (有關如何查找語言資源的建議,請參見7) > ### 1.8 Text Corpus Structure * 到目前為止,我們已經看了各種語料庫結構(參見下圖)。 1. `isolated`沒有任何結構:它只是文本的集合,是最簡單的一種。 2. `categorized`依據類別區分:通常文本依據體裁(genre)、來源(source)、作者、語言等類別進行分組。 3. `overlapping`有時類別會重疊:特別是在主題類別的情況下,因為文本可能與多個主題相關。 4. 
`temporal`時間性:有時,文本集合具有時間結構。如:新聞。 ![](https://i.imgur.com/ENjwA8x.png) * Basic Corpus Functionality defined in NLTK: > [Online Corpus HOWTO](http://nltk.org/howto) > Python Command :`help(nltk.corpus.reader)` |Example|Description| |---|---| |fileids()|the files of the corpus| |fileids([categories])|the files of the corpus corresponding to these categories| |categories()|the categories of the corpus| |categories([fileids])|the categories of the corpus corresponding to these files| |raw()|the raw content of the corpus| |raw(fileids=[f1,f2,f3])|the raw content of the specified files| |raw(categories=[c1,c2])|the raw content of the specified categories| |words()|the words of the whole corpus| |words(fileids=[f1,f2,f3])|the words of the specified fileids| |words(categories=[c1,c2])|the words of the specified categories| |sents()|the sentences of the whole corpus| |sents(fileids=[f1,f2,f3])|the sentences of the specified fileids| |sents(categories=[c1,c2])|the sentences of the specified categories| |abspath(fileid)|the location of the given file on disk| |encoding(fileid)|the encoding of the file (if known)| |open(fileid)|open a stream for reading the given corpus file| |root|if the path to the root of locally installed corpus| |readme()|the contents of the README file of the corpus| * NLTK的語料庫閱讀器支持對各種語料庫的有效訪問,並可與新語料庫一起使用。 ```python= >>> from nltk.corpus import gutenberg >>> raw = gutenberg.raw("burgess-busterbrown.txt") >>> raw[1:20] 'The Adventures of B' >>> words = gutenberg.words("burgess-busterbrown.txt") >>> words[1:20] ['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.', 'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster', 'Bear'] >>> sents = gutenberg.sents("burgess-busterbrown.txt") >>> sents[1:20] [['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as', 'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched', 'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...] ``` ### 1.9 Loading your own Corpus * 使用NLTK的PlaintextCorpusReader讀取自己的文本。 * 步驟一:取得並設定`該文本資料夾`在電腦上的位置 * 步驟二:初始化`PlaintextCorpusReader`。參數的設定,可以是文檔名稱列表`['a.txt', 'test/b.txt']`或`'[abc]/.*\.txt'`(使用正則表達方式) ```python= >>> from nltk.corpus import PlaintextCorpusReader >>> corpus_root = 'C:\\Users\\pcsh1\\OneDrive\\桌面' >>> wordlists = PlaintextCorpusReader(corpus_root, '.*') >>> wordlists.fileids() ['1071300643_2_鼓勵學士班成績優異學生就讀碩士班獎學金申請表.doc', 'for_testing.txt', '108年8月合法旅館清單.xlsx', ...] >>> wordlists.words('for_testing.txt') ['Build', 'responsive', ',', 'mobile', '-', 'first', ...] ``` * 使用Penn Treebank語料庫(the Penn Treebank Corpus)(release 3) > Penn Treebank搜集《華爾街日報》的文章(Wall Street Journal)。 > Use the `BracketParseCorpusReader` to access this corpus. 
> NLTK上已有許多手動詞性標注的文集了。 > [Day 3: 親手讓電腦幫你標動詞和名詞吧!](https://ithelp.ithome.com.tw/articles/10213383) ```python= ### 下載 Penn Treebank語料庫 ### from nltk.corpus import treebank nltk.download('treebank') print(treebank.tagged_sents()[0]) [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')] # 測試這個語料庫中的第一個句子 # ".tagged_sents()"提取了詞性標註過的句子 >>> treebank <BracketParseCorpusReader in 'C:\\Users\\pcsh1\\AppData\\Roaming\\nltk_data\\corpora\\treebank\\combined'> # --- 查看 語料庫內容 --- # >>> from nltk.corpus import BracketParseCorpusReader >>> corpus_root = r"C:\Users\pcsh1\AppData\Roaming\nltk_data\corpora\treebank\combined" >>> file_pattern = r".*\.mrg" >>> ptb = BracketParseCorpusReader(corpus_root, file_pattern) >>> ptb.fileids() ['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...] >>> len(ptb.sents()) 3914 >>> >>> ptb.sents(fileids='wsj_0150.mrg')[4] ['Primerica', 'closed', 'at', '$', '28.25', '*U*', ',', 'down', '50', 'cents', '.'] ``` --- ## **_2 Conditional Frequency Distributions_** * 將語料庫中的文本按體裁、主題、作者等劃分類別時,可以計算每個類別的頻率分佈,以研究各類別之間的差異。 * **Conditional Frequency Distributions** 是頻率分佈的集合,每個頻率分佈對應一個不同的「condition」,condition通常是文本的類別。 > Counting Words Appearing in a Text Collection (a conditional frequency distribution) > 2 conditions:一個是新聞文本,另一個是浪漫文本。 > ![](https://i.imgur.com/c3G5UxN.png) ### 2.1 Conditions and Events * 可觀察的事件才能計算頻率分佈,如:文本中出現的單詞。 * 需要處理單詞序列(a sequence of words)和序列對(a sequence of pairs)。 ``` ### 單詞序列 ### text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] ### 序列對 ### pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...] ``` > Each pair has the form `(condition, event)`. > If we were processing the entire Brown Corpus by genre there would be 15 `conditions` (one per `genre`), and 1,161,192 `events` (one per `word`). ### 2.2 Counting Words by Genre * ==FreqDist()== takes a simple list as input. * ==ConditionalFreqDist()== takes a list of pairs. * Ex: * 對布朗語料庫中的`所有類型`作圖。 ```python= >>> import nltk >>> from nltk.corpus import brown >>> cfd = nltk.ConditionalFreqDist( ... (genre, word) ... for genre in brown.categories() ... for word in brown.words(categories=genre)) >>> cfd.plot(cumulative=True) ``` * 對布朗語料庫中的`兩種類型(news and romance)`作圖。 ```python= >>> import nltk >>> from nltk.corpus import brown >>> cfd = nltk.ConditionalFreqDist( ... (genre, word) ... for genre in ['news', 'romance'] ... for word in brown.words(categories=genre)) >>> cfd.plot(cumulative=True) ``` * 建立布朗語料庫中`兩種類型(news and romance)`的`(類型, 單詞)對`,以利後續`ConditionalFreqDist`的作圖。 ```python= >>> import nltk >>> from nltk.corpus import brown >>> genre_word = [(genre, word) ... for genre in ['news', 'romance'] ... 
for word in brown.words(categories=genre)] >>> len(genre_word) 170576 ## (類型, 單詞)對 ## >>> genre_word[:4] # [_start-genre] [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] >>> genre_word[-4:] # [_end-genre] [('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] ## 驗證是否具有兩個條件 ## >>> cfd = nltk.ConditionalFreqDist(genre_word) >>> cfd <ConditionalFreqDist with 2 conditions> >>> cfd.conditions() #conditions ['news', 'romance'] ``` > ConditionalFreqDist()中的`condition`,相當於語料庫所劃分的`類型` ```python= >>> print(cfd['news']) <FreqDist with 14394 samples and 100554 outcomes> >>> print(cfd['romance']) <FreqDist with 8452 samples and 70022 outcomes> ## 出現頻率最高的前20個 (單詞, 次數) ## >>> cfd['romance'].most_common(20) [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)] >>> cfd['romance']['could'] 193 ``` ### 2.3 Plotting and Tabulating Distributions * ConditionalFreqDist功能 * 合併並計算多個類型的頻率分布,且易於初始化。 * 提供實用的製表和繪圖方法。 * 繪圖範例 ```python= >>> from nltk.corpus import inaugural >>> cfd = nltk.ConditionalFreqDist( ... (target, fileid[:4]) #單詞, 年份 ... for fileid in inaugural.fileids() ... for w in inaugural.words(fileid) ... for target in ['america', 'citizen'] ... if w.lower().startswith(target)) >>> cfd.plot() ``` ![](https://i.imgur.com/FxDDYdj.png) ```python= >>> from nltk.corpus import udhr >>> languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik'] >>> cfd = nltk.ConditionalFreqDist( ... (lang, len(word)) #含'-Latin1'字符編碼的語言名稱, 單詞長度 ... for lang in languages ... for word in udhr.words(lang + '-Latin1')) >>> cfd.plot() ``` ![](https://i.imgur.com/3Y8DOPi.png) * 製表範例 * `plot()`和`tabulate()`方法中 * 使用`conditions=`來顯示特定的條件。若沒有特別設定,則視為顯示所有條件。 * 使用`samples=`限制樣本的顯示。 * 還可以完全控制任何條件和樣本的顯示順序。 * Ex: 累積頻率數據製成表格。 ```python= >>> cfd.tabulate(conditions=['English', 'German_Deutsch'], #兩種語言 ... samples=range(10), cumulative=True) #少於10個字元長度的單詞 0 1 2 3 4 5 6 7 8 9 English 0 185 525 883 997 1166 1283 1440 1558 '1638' German_Deutsch 0 171 263 614 717 894 1013 1110 1213 1275 # 1,638 words of the English text have 9 or fewer letters. ``` * Your Turn: ```python= >>> import nltk >>> from nltk.corpus import brown >>> genre_word = [(genre, word) ... for genre in ['news', 'romance'] ... for word in brown.words(categories=genre)] >>> cfd = nltk.ConditionalFreqDist(genre_word) >>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] >>> cfd.tabulate(samples=days) Monday Tuesday Wednesday Thursday Friday Saturday Sunday news 54 43 22 20 41 33 51 romance 2 3 3 1 3 4 5 ``` ### 2.4 Generating Random Text with Bigrams > Conditional frequency distributions are a useful data structure for many NLP tasks. 
* 使用conditional frequency distribution來創建一個bigrams表(word pairs)。 * `bigrams()`函數獲取單詞列表並構建一個連續單詞對列表。 * Note: 使用`list()`函數 ```python= >>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.'] >>> list(nltk.bigrams(sent)) [('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'), ('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'), ('the', 'earth'), ('earth', '.')] ``` * 把**每一個字**都當成一種condition,並針對每個字都能有效地`計算在其後面的字詞出現次數`。 * 應用:找出該字詞的後面最有可能的用詞習慣(出現次數最高的字詞),來自動生成文本。 * 缺點:這種簡單的文本生成方法,往往會陷入循環中。另一種方法,是從可用單詞中隨機選擇下一個單詞。 > [NLTK中的条件概率分布](https://www.jianshu.com/p/7427f9354344) > [NLTK 筆記:條件次數分配(Conditional Frequency Distributions)](https://medium.com/@kiki1223/nlp-%E7%AD%86%E8%A8%98-%E6%A2%9D%E4%BB%B6%E6%AC%A1%E6%95%B8%E5%88%86%E9%85%8D-conditional-frequency-distributions-aaeb04062617) * Ex: Generating Random Text ```python= ## 用於生成文本的簡單循環 ## >>> def generate_model(cfdist, word, num=15): ... for i in range(num): ... print(word, end=' ') ... word = cfdist[word].max() ... >>> text = nltk.corpus.genesis.words('english-kjv.txt') >>> bigrams = nltk.bigrams(text) >>> cfd = nltk.ConditionalFreqDist(bigrams) >>> cfd['living'] FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1}) >>> generate_model(cfd, 'living') living creature that he said , and the land of the land of the land ``` * NLTK's Conditional Frequency Distributions的常用方法: |Example|Description| |---|---| |cfdist = ConditionalFreqDist(pairs)|create a conditional frequency distribution from a list of pairs| |cfdist.conditions()|the conditions| |cfdist[condition]|the frequency distribution for this condition| |cfdist[condition][sample]|frequency for the given sample for this condition| |cfdist.tabulate()|tabulate the conditional frequency distribution| |cfdist.tabulate(samples, conditions)|tabulation limited to the specified samples and conditions| |cfdist.plot()|graphical plot of the conditional frequency distribution| |cfdist.plot(samples, conditions)|graphical plot limited to the specified samples and conditions| |cfdist1 < cfdist2|test if samples in cfdist1 occur less frequently than in cfdist2| --- ## **_3 More Python: Reusing Code_** ### 3.1 Creating Programs with a Text Editor ### 3.2 Functions ```python= >>> def lexical_diversity(my_text_data): ... word_count = len(my_text_data) ... vocab_size = len(set(my_text_data)) ... diversity_score = vocab_size / word_count ... return diversity_score >>> from nltk.corpus import genesis >>> kjv = genesis.words('english-kjv.txt') >>> lexical_diversity(kjv) 0.06230453042623537 ``` * 單詞轉換為複數型式: ```python= >>> def plural(word): ... if word.endswith('y'): ... return word[:-1] + 'ies' ... elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']: ... return word + 'es' ... elif word.endswith('an'): ... return word[:-2] + 'en' ... else: ... return word + 's' ... >>> plural('fairy') 'fairies' >>> plural('woman') 'women' ``` ### 3.3 Modules * Module內,包含多個Fuctions。 > A collection of `variable` and `function` definitions in a file is called a Python ==module==. > A collection of related `modules` is called a ==package==. > NLTK itself is a set of `packages`, sometimes called a ==library==. * NLTK's code for processing the Brown Corpus is an example of a `module`, and its collection of code for **processing all the different corpora is an example of a `package`.** > Do not name your file `nltk.py`! > It may get imported in place of the "real" NLTK package. 
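* 一個最小示意:假設把 3.2 節定義的`plural`函數存成一個模組檔(檔名沿用下方範例的`code_plural.py`),檔案內容大致如下,之後便能如下方所示引入使用。

```python=
# code_plural.py —— 自定義模組的最小示意(函數內容同 3.2 節的 plural)
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
```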
* 將Functions保存在一個名為`code_plural.py`的文件中。當需要使用時,只要引入即可使用自定義的Fuctions。 ```python= $ cd C:\Users\pcsh1\Downloads >>> from code_plural import plural >>> plural('wish') wishes >>> plural('fan') fen ``` --- ## **_4 Lexical Resources_** * lexicon(詞典),又稱lexical resource(詞彙資源) * 是words和/或phrases,以及相關信息(如:詞性和感官定義)的集合。 * 詞彙資源僅次於文本,通常在文本的幫助下創建和豐富。 * 在第1章中的`concordance`用法,提供有關的單詞用法,有助於字典的建置。 * 詞彙術語(Lexicon Terminology) * 詞彙條目 Lexical entries * consists of a headword (also known as a `lemma`) along with additional information such as the `part of speech` and the `sense definition`. * 同音異義詞 homonyms * `two lemmas` having the same spelling. * 詞性 part of speech、感官定義 sense definitions ![](https://i.imgur.com/OnM44o3.png) * The simplest kind of lexicon is nothing more than a sorted list of words. (詞表) * Sophisticated lexicons include complex structure within and across the individual entries. ### 4.1 Wordlist Corpora (詞表) * NLTK includes some corpora that are nothing more than wordlists. * 過濾不常見或拼字錯誤的字詞 ```python= >>> import nltk >>> def unusual_words(text): ... text_vocab = set(w.lower() for w in text if w.isalpha()) ... english_vocab = set(w.lower() for w in nltk.corpus.words.words()) ... unusual = text_vocab - english_vocab ... return sorted(unusual) >>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')) ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', ...] >>> unusual_words(nltk.corpus.nps_chat.words()) ['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', ...] ``` * 停用詞語料庫 * 停用詞(stopwords)特性: * 高頻率且無意義的字詞,如:`the`, `to` and `also`。 * 停用詞的出現,無法將之與其他文本區分。 ```python= >>> from nltk.corpus import stopwords >>> stopwords.words('english') ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", ...] ``` * 計算文本中哪些字詞不在停用詞列表內 * 借助停用詞,我們可以過濾掉文本中超過四分之一的詞。 > Note: 這裡結合兩種不同的語料庫,使用`詞法資源`來過濾`文本語料庫`的內容。 ```python= >>> def content_fraction(text): ... stopwords = nltk.corpus.stopwords.words('english') ... content = [w for w in text if w.lower() not in stopwords] ... return len(content) / len(text) ... >>> content_fraction(nltk.corpus.reuters.words()) 0.735240435097661 ``` ```python= >>> import nltk >>> puzzle_letters = nltk.FreqDist('egivrvonl') >>> obligatory = 'r' >>> wordlist = nltk.corpus.words.words() >>> [w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters] ['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole'] ``` * wordlist corpus: `Names corpus` * 收錄 8000 個按性別分類的名字。男性和女性姓名分別儲存在單獨的文件中。 > 若有一個名稱,其在男性或女性中皆有出現,則代表此名稱屬於中性(性別不明確)。 ```python= >>> import nltk >>> names = nltk.corpus.names >>> names.fileids() ['female.txt', 'male.txt'] >>> male_names = names.words('male.txt') >>> female_names = names.words('female.txt') >>> [w for w in male_names if w in female_names] ['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', ... 
] ``` * 由此可知,以`a`結尾的姓名,大多為女性。 > `name[-1]`: the last letter of name。 > ``` $ pip install matplotlib ``` ```python= >>> import nltk >>> import matplotlib >>> names = nltk.corpus.names >>> male_names = names.words('male.txt') >>> female_names = names.words('female.txt') >>> cfd = nltk.ConditionalFreqDist((fileid, name[-1]) for fileid in names.fileids() for name in names.words(fileid)) >>> cfd.plot() ``` ![](https://i.imgur.com/kS4TpCn.png) > 條件頻率分佈:此圖顯示以字母表中的每個字母結尾的女性和男性姓名的數量;大多數以a,e或i結尾的名字是女性;以h和l結尾的名字同樣可能是男性或女性;以k,o,r,s和t結尾的名稱可能是男性。 ### 4.2 A Pronouncing Dictionary * [發音字典](http://qheroq.blogspot.com/2010/08/24-lexical-resources_11.html) 是一種較為豐富的詞彙資源,NLTK提供卡內基美隆大學的CMU Pronouncing Dictionary,這是設計給語音合成器的工具。 > 每個字都提供一組發音編碼的清單,清楚地標示每段的發聲。你可以注意到「fire」有兩個讀音,有一音節(phone)的念法「F AY1 R」或兩音節的「F AY1 ER0」,這些符號是由[美國國防部高研院制訂的音素符號表「Arpabet」](http://en.wikipedia.org/wiki/Arpabet)而來。 ```python= >>> import nltk >>> entries = nltk.corpus.cmudict.entries() >>> len(entries) 133737 >>> for entry in entries[42371:42379]: ... print(entry) ... ('fir', ['F', 'ER1']) #(word, [phonetic codes 語音代碼]) ('fire', ['F', 'AY1', 'ER0']) ('fire', ['F', 'AY1', 'R']) ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M']) ('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M']) ('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z']) ('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z']) ('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L']) ``` ```python= >>> for word, pron in entries: ... if len(pron) == 3: ... ph1, ph2, ph3 = pron ... if ph1 == 'P' and ph3 == 'T': ... print(word, ph2, end=' ') ... pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1 ``` * 查找所有單詞的發音以`nicks`為結尾的所有單詞。 > 您可以使用此方法查找`押韻單詞`。 ```python= >>> syllable = ['N', 'IH0', 'K', 'S'] >>> [word for word, pron in entries if pron[-4:] == syllable] ["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', ... ] ``` ```python= >>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n'] ['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn'] >>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n')) ['gn', 'kn', 'mn', 'pn'] ``` * 定義一個函數來提取`重音`,然後掃描詞典以查找具有特定重音模式的單詞。 > 這些音節會包括一些數字在裡面,它們代表重音的程度,主要重音(1)、次要(2)與無重音(0)。 > ```python= >>> def stress(pron): ... return [char for phone in pron for char in phone if char.isdigit()] ... >>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']] ['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator', 'accelerators', 'accentuated', 'accentuating', 'accommodated', 'accommodating', 'accommodative', 'accumulated', ...] >>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']] ['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients', 'academicians', 'accommodation', 'accommodations', ...] ``` * 找尋最小比對(minimally-contrasting)的單詞組 > 找尋以P開頭的三音節字,根據它們的首尾音來加以分群。 > ```python= >>> import nltk >>> entries = nltk.corpus.cmudict.entries() >>> p3 = [(pron[0]+'-'+pron[2], word) for (word, pron) in entries if pron[0] == 'P' and len(pron) == 3] >>> cfd = nltk.ConditionalFreqDist(p3) >>> for template in sorted(cfd.conditions()): ... if len(cfd[template]) > 10: ... words = sorted(cfd[template]) ... wordstring = ' '.join(words) ... 
print(template, wordstring[:70] + "...") ... P-CH patch pautsch peach perch petsch petsche piche piech pietsch pitch pit... P-K pac pack paek paik pak pake paque peak peake pech peck peek perc perk ... P-L pahl pail paille pal pale pall paul paule paull peal peale pearl pearl... P-N paign pain paine pan pane pawn payne peine pen penh penn pin pine pinn... P-P paap paape pap pape papp paup peep pep pip pipe pipp poop pop pope pop... P-R paar pair par pare parr pear peer pier poor poore por pore porr pour... P-S pace pass pasts peace pearse pease perce pers perse pesce piece piss p... P-T pait pat pate patt peart peat peet peete pert pet pete pett piet piett... P-UW1 peru peugh pew plew plue prew pru prue prugh pshew pugh... P-Z p's p.'s p.s pais paiz pao's pas pause paws pays paz peas pease pei's ... ``` * 除了遍歷整個字典之外,還可以通過查找特定單詞來訪問它。 > 在方括號內輸入關鍵字(例如 "fire" 一詞),以查找字典。 > 如果嘗試查找不存在的鍵,則會得到KeyError。這對NLTK語料集沒有影響;下次訪問`blog`時,`blog`仍然不存在。 > ```python= >>> prondict = nltk.corpus.cmudict.dict() >>> prondict['fire'] [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']] >>> prondict['blog'] Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'blog' >>> prondict['blog'] = [['B', 'L', 'AA1', 'G']] >>> prondict['blog'] [['B', 'L', 'AA1', 'G']] ``` * 我們可以使用任何詞法資源來處理文本。例如:過濾出具有詞法屬性的詞(例如名詞)或映射文本的每個詞。 > 以下,得知每個單詞在發音詞典中對應的發音。 > ```python= >>> text = ['natural', 'language', 'processing'] >>> [ph for w in text for ph in prondict[w][0]] ['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH', 'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG'] ``` ### 4.3 Comparative Wordlists * 比較型文字列表(Comparative Wordlists) * NLTK提供了所謂的「Swadesh wordlists」,收錄各種語言的200個常用字,語言識別碼是依照 ISO 639的兩碼制。 ```python= >>> import nltk >>> from nltk.corpus import swadesh >>> nltk.download('swadesh') >>> swadesh.fileids() ['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk'] >>> swadesh.words('en') ['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', ...] ``` * 我們可以透過`entries()`來製作一個配對組的串列來存取不同語言的同源字(cognate words,就是相同意思啦),甚至更進一步,我們可以把整個字典都做轉換(對應)。 ```python= >>> fr2en = swadesh.entries(['fr', 'en']) >>> fr2en [('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ...] >>> translate = dict(fr2en) >>> translate['chien'] 'dog' >>> translate['jeter'] 'throw' ``` ```python= >>> de2en = swadesh.entries(['de', 'en']) >>> es2en = swadesh.entries(['es', 'en']) >>> translate.update(dict(de2en)) >>> translate.update(dict(es2en)) >>> de2en [('ich', 'I'), ('du, Sie', 'you (singular), thou'), ('er', 'he'), ('wir', 'we'), ...] >>> es2en [('yo', 'I'), ('tú, usted', 'you (singular), thou'), ('él', 'he'), ('nosotros', 'we'), ...] >>> translate['yo'] 'I' >>> translate['ich'] 'I' ``` * 一次比較多種語言 > 比較日耳曼語系與拉丁語系在一些字上的差異 ```python= >>> languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la'] >>> for i in [139, 140, 141, 142]: ... print(swadesh.entries(languages)[i]) ... 
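# 每個tuple依languages串列的順序('en','de','nl','es','fr','pt','la'),列出同一個意思在各語言的對應字: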
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere') ('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere') ('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere') ('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare') ``` ### 4.4 Shoebox and Toolbox Lexicons * Toolbox(它的前身為「Shoebox」),常被語言學家用來管理資料並可以自由地下載來使用。 * Toolbox由大量的字詞款目組成,每個款目都包括一個或一個以上的欄位值,大部分的欄位值是選擇性或可重複的,這表示這種詞彙資源不能轉換為表格形式儲存。 > 以下是Rotokas語的字典,我們就來看看它的第一條款目「kaa」意思是英文的「to gag」。 ```python= >>> from nltk.corpus import toolbox >>> nltk.download('toolbox') [nltk_data] Downloading package toolbox to /home/user/nltk_data... [nltk_data] Package toolbox is already up-to-date! True >>> toolbox.entries('rotokas.dic') [('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'), ('vx', '1'), ('sc', '???'), ('dt', '29/Oct/2005'), ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'), ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'), ...] ``` > 可以清楚看到它的款目是由一系列的屬性/值的配對所組成的,如('ps', 'V')就表示詞性(part-of-speech)為「'V' (verb)動詞」; ('ge', 'gag')表示對應到英文(gloss-into-English)為「'gag'」這個字。最後三組配對是一個例句,分別是Rotokas語、Tok Pisin語以及英文。 * 這種Toolbox的鬆散結構讓我們在應用上困難許多,不過我們將會在11章學會利用強大的XML來處理。 --- ## **_5 [WordNet](http://qheroq.blogspot.com/2010/10/python25.html#more)_** > WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network. > WordNet是語義導向的英語詞典,由155,287個單詞和117,659個同義詞集組成。 > ### 5.1 Senses and Synonyms > 同義詞、如何在WordNet中訪問同義詞。 > * 同義詞:`motorcar`與`automobile`代換後,句子的含義幾乎相同。 a. Benz is credited with the invention of the `motorcar`. b. Benz is credited with the invention of the `automobile`. ```python= >>> import nltk >>> nltk.download('wordnet') >>> from nltk.corpus import wordnet as wn >>> wn.synsets('motorcar') #代表'motorcar'只有一個可能的含義,被標識為car.n.01 #'car.n.01'同義詞集 [Synset('car.n.01')] >>> wn.synset('car.n.01').lemma_names() #同義詞集的每個詞,其各自具有多種含義。 #但是,我們只對上述同義詞集的所有單詞共有的單一含義感興趣。 ['car', 'auto', 'automobile', 'machine', 'motorcar'] #同義詞集,還附帶定義和例句 >>> wn.synset('car.n.01').definition() 'a motor vehicle with four wheels; usually propelled by an internal combustion engine' >>> wn.synset('car.n.01').examples() ['he needs a car to get to work'] ``` * 同義詞集的作用 * 定義 - 幫助人們理解同義詞集的預期含義 * 單詞 - 有助於消除歧義 ```python= >>> wn.synset('car.n.01').lemmas() [Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')] >>> wn.lemma('car.n.01.automobile') Lemma('car.n.01.automobile') >>> wn.lemma('car.n.01.automobile').synset() Synset('car.n.01') >>> wn.lemma('car.n.01.automobile').name() 'automobile' ``` ```python= >>> wn.synsets('car') [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] >>> for synset in wn.synsets('car'): ... print(synset.lemma_names()) ... 
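# 'car'共有五個同義詞集,各同義詞集所包含的lemma名稱如下: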
['car', 'auto', 'automobile', 'machine', 'motorcar'] ['car', 'railcar', 'railway_car', 'railroad_car'] ['car', 'gondola'] ['car', 'elevator_car'] ['cable_car', 'car'] >>> wn.lemmas('car') [Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')] ``` ### 5.2 The WordNet Hierarchy * 以抽象概念做階層分類,並呈現上下位關係的層次結構。(詞彙關係) * 下位詞:更具體 * 上位詞:更抽象 * 它們將一個同義詞集與另一個同義詞集聯繫起來。 * 每個同義詞集,內含多個單詞。 * 若一個單詞,有多種意思,則該單詞可被歸類於多個不同的同義詞集內。 ![](https://i.imgur.com/Dlhil6V.png) > WordNet概念層次結構的片段: > 節點 - 對應於同義詞集。 > 線 - 表示上位詞/同義詞關係,即上位和下位概念之間的關係。 > ```python= >>> motorcar = wn.synset('car.n.01') >>> types_of_motorcar = motorcar.hyponyms() >>> types_of_motorcar[0] Synset('ambulance.n.01') >>> sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()) ['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', ...] ``` * 我們也可以存取上位詞(hypernyms),有一些字有多種路徑,因為它們可以被歸屬到一個以上的概念下面。 > 例子:「motorcar」的路徑會有兩種, > 因為在「wheeled_vehicle.n.01」這裡時可能被歸屬到車輛(vehicle)、或容器(container)。 > ```python= >>> motorcar.hypernyms() [Synset('motor_vehicle.n.01')] >>> paths = motorcar.hypernym_paths() >>> len(paths) 2 >>> [synset.name() for synset in paths[0]] ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01'] >>> [synset.name() for synset in paths[1]] ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01'] #獲得同義詞集 最通用的上位詞 >>> motorcar.root_hypernyms() [Synset('entity.n.01')] ``` ### 5.3 More Lexical Relations * 詞彙關係(lexical relations) * 上位詞(hypernyms)與下位詞( hyponyms)能把一個同義詞組(synset)與另一個做連結。 * 關連 1. 在階層關係中的向上與向下的從屬關連。 2. 部分(meronyms)與全體(holonyms)的關連。 > 透過`part_meronyms()`可發現一棵樹是由樹幹(trunk)、樹冠(crown)等等所組成的。 > 透過`substance_meronyms()`可發現一棵樹會包含樹心材(heartwood)與邊材(sapwood)等物質組成。 > 利用`member_holonyms()`找出一群樹可以產生什麼? 森林(forest) > ```python= >>> wn.synset('tree.n.01').part_meronyms() [Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')] >>> wn.synset('tree.n.01').substance_meronyms() [Synset('heartwood.n.01'), Synset('sapwood.n.01')] >>> wn.synset('tree.n.01').member_holonyms() [Synset('forest.n.01')] ``` ```python= >>> for synset in wn.synsets('mint', wn.NOUN): ... print(synset.name() + ':', synset.definition()) ... 
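# 'mint'作為名詞的各個同義詞集及其定義: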
batch.n.02: (often followed by `of') a large number or amount or extent mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers mint.n.03: any member of the mint family of plants mint.n.04: the leaves of a mint plant used fresh or candied mint.n.05: a candy that is flavored with a mint oil mint.n.06: a plant where money is coined by authority of the government >>> wn.synset('mint.n.04').part_holonyms() [Synset('mint.n.02')] >>> wn.synset('mint.n.04').substance_holonyms() [Synset('mint.n.05')] ``` * 動詞之間也有關係。 > 例如,步行動作涉及踩踏動作,因此步行需要踏步。 > ```python= >>> wn.synset('walk.v.01').entailments() [Synset('step.v.01')] >>> wn.synset('eat.v.01').entailments() [Synset('chew.v.01'), Synset('swallow.v.01')] >>> wn.synset('tease.v.03').entailments() [Synset('arouse.v.07'), Synset('disappoint.v.01')] ``` * 反義詞(antonymy) ```python= >>> wn.lemma('supply.n.02.supply').antonyms() [Lemma('demand.n.02.demand')] >>> wn.lemma('rush.v.01.rush').antonyms() [Lemma('linger.v.04.linger')] >>> wn.lemma('horizontal.a.01.horizontal').antonyms() [Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')] >>> wn.lemma('staccato.r.01.staccato').antonyms() [Lemma('legato.r.01.legato')] ``` * 使用`dir()`查看詞法關係、在同義詞集上定義的其他方法 ```python= >>> dir(wn.synset('harmony.n.02')) ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity'] ``` ### 5.4 Semantic Similarity 語義相似度 * 同義詞集,通過複雜的詞彙關係網絡鏈接在一起。 * 給定特定的同義詞集,我們可以遍歷WordNet網絡,以查找具有相關含義的同義詞集。 * 知道哪些詞在語義上相關,這對索引文本集合很有用。 ```python= >>> right = wn.synset('right_whale.n.01') #right_whale露脊鯨 >>> orca = wn.synset('orca.n.01') >>> minke = wn.synset('minke_whale.n.01') >>> tortoise = wn.synset('tortoise.n.01') >>> novel = wn.synset('novel.n.01') >>> right.lowest_common_hypernyms(minke) [Synset('baleen_whale.n.01')] >>> right.lowest_common_hypernyms(orca) [Synset('whale.n.02')] >>> right.lowest_common_hypernyms(tortoise) [Synset('vertebrate.n.01')] >>> right.lowest_common_hypernyms(novel) [Synset('entity.n.01')] >>> wn.synset('baleen_whale.n.01').min_depth() #baleen whale鬚鯨小目 14 >>> wn.synset('whale.n.02').min_depth() #whale鯨 13 >>> 
wn.synset('vertebrate.n.01').min_depth()   #vertebrate脊椎動物
8
>>> wn.synset('entity.n.01').min_depth()
0
```

* 相似性測量(Similarity measure)
    * `path_similarity`
        > 根據連接上位詞層次結構中概念的最短路徑,在0–1範圍內分配分數
        > 若找不到路徑,返回 -1

```python=
>>> right.path_similarity(minke)       #minke小鬚鯨
0.25
>>> right.path_similarity(orca)        #orca虎鯨
0.16666666666666666
>>> right.path_similarity(tortoise)    #tortoise陸龜
0.07692307692307693
>>> right.path_similarity(novel)       #novel小說
0.043478260869565216
```
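* 小示意(延伸練習,非原文範例):利用上面已定義的同義詞集,依`path_similarity`由高到低排序;分數越高,代表與`right_whale`的語義越接近。

```python=
>>> sorted([minke, orca, tortoise, novel],
...        key=lambda s: right.path_similarity(s), reverse=True)
[Synset('minke_whale.n.01'), Synset('orca.n.01'), Synset('tortoise.n.01'), Synset('novel.n.01')]
```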