http://www.speech.sri.com/projects/srilm/manpages/ngram-class.1.html
ngram-class - induce word classes from N-gram statistics
ngram-class [ -help ] option …
ngram-class induces word classes from distributional statistics, so as to minimize perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used, i.e., the induced classes are best suited for a class-bigram language model.
The program generates the class N-gram counts and class expansions needed by ngram-count(1) and ngram(1), respectively to train and to apply the class N-gram model.
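As a rough illustration of that workflow, the following Python sketch assembles the three command lines without running them. The helper build_pipeline, the file names (corpus.txt, class.counts, class.classes, class.lm, test.txt) and the class count of 100 are placeholders chosen for this example. The use of -order 2 is an assumption that follows from the bigram-only statistics mentioned above.

```python
from typing import List


def build_pipeline(text: str, num_classes: int, test: str) -> List[List[str]]:
    """Return argv lists for inducing classes, training, and applying the model."""
    # Placeholder output names; any writable paths would do.
    counts, classes, lm = "class.counts", "class.classes", "class.lm"
    induce = ["ngram-class", "-text", text, "-numclasses", str(num_classes),
              "-class-counts", counts, "-classes", classes]
    # Assumption: a bigram model, since classes are induced from bigram statistics.
    train = ["ngram-count", "-order", "2", "-read", counts, "-lm", lm]
    evaluate = ["ngram", "-order", "2", "-lm", lm, "-classes", classes, "-ppl", test]
    return [induce, train, evaluate]


if __name__ == "__main__":
    for argv in build_pipeline("corpus.txt", 100, "test.txt"):
        print(" ".join(argv))
```

The lists can be passed to subprocess.run once SRILM is installed and the input files exist.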
vocabulary file.
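Words not listed in this file are treated as out-of-vocabulary (OOV). If no file is given, all words in the text/counts file are assumed to be in-vocabulary (IV).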
Map the vocabulary to lowercase.
N-gram counts file.
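Builds the class N-gram counts file from an existing counts file, which can be generated with ngram-count(1).
A counts file and a text file can be used together, but note that the counts from both are accumulated.
Builds the class N-gram counts file from a text file.
A text file and a counts file can be used together, but note that the counts from both are accumulated.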
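The accumulation of counts from both inputs can be pictured with a small Python sketch. This is not SRILM code: the helper bigrams_from_text and the toy data are made up for illustration, and the <s>/</s> padding is an assumption about how sentences are counted.

```python
from collections import Counter
from typing import Iterable


def bigrams_from_text(lines: Iterable[str]) -> Counter:
    """Count word bigrams per sentence, padding with <s> and </s>."""
    counts: Counter = Counter()
    for line in lines:
        words = ["<s>"] + line.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


text_counts = bigrams_from_text(["a b", "a b c"])
# Stands in for counts read from an existing counts file.
file_counts = Counter({("a", "b"): 5, ("b", "d"): 1})

# Supplying both sources adds their counts together.
combined = text_counts + file_counts
print(combined[("a", "b")])  # 2 from the text + 5 from the counts = 7
```

If the counts file was itself produced from the same text, every bigram is therefore counted twice.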
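The maximum number of classes; the program automatically groups words into classes until this number is reached.
If numclasses = 0, each word forms its own class, so the number of classes equals the number of words. Combined with -read, this can be used to produce a class N-gram counts file from user-defined classes.
If numclasses = 1, or numclasses is not specified, all words end up in a single class.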
A zero argument suppresses automatic class merging altogether (e.g., for use with -interact).
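When merging is suppressed and the classes come from a user-supplied mapping, the class N-gram counts are conceptually just the word counts summed per class. A minimal Python sketch of that idea follows; the function class_bigram_counts and the toy data are hypothetical, and the treatment of unmapped words (they stand for themselves) is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, Tuple


def class_bigram_counts(word_counts: Dict[Tuple[str, str], int],
                        word_to_class: Dict[str, str]) -> Counter:
    """Sum word bigram counts over the classes of both words."""
    out: Counter = Counter()
    for (w1, w2), n in word_counts.items():
        # Assumption: words without a class (e.g. <s>) stand for themselves.
        c1 = word_to_class.get(w1, w1)
        c2 = word_to_class.get(w2, w2)
        out[(c1, c2)] += n
    return out


word_counts = {("<s>", "monday"): 2, ("<s>", "tuesday"): 1,
               ("monday", "</s>"): 2, ("tuesday", "</s>"): 1}
word_to_class = {"monday": "DAY", "tuesday": "DAY"}

class_counts = class_bigram_counts(word_counts, word_to_class)
print(class_counts[("<s>", "DAY")])  # 2 + 1 = 3
```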
Limits the number of words in a class to M in incremental merging.
By default there is no such limit.
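This setting is ignored with full merging (-full); it applies only to incremental merging.
maxwordsperclass = 1 is not a valid setting: the output is actually the same as with maxwordsperclass = 2 (a bug). Use numclasses = 0 to get the intended effect.
When maxwordsperclass and numclasses conflict, the output is hard to predict, and both limits may be exceeded. Avoid using the two together.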
Specifies words that should not be assigned to classes; the words in this file will not be placed in any class.
These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>) from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
A custom noclass-vocab file overrides the default, so add <s> and </s> to the file yourself if they should remain excluded.
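A small Python sketch of preparing such a file is shown below. The helper write_noclass_vocab, the file name noclass.vocab and the extra words are hypothetical, and the one-word-per-line layout is an assumption about the expected vocabulary file format.

```python
import os
import tempfile
from typing import Iterable, List


def write_noclass_vocab(path: str, extra_words: Iterable[str]) -> List[str]:
    """Write a word list that keeps <s> and </s> excluded alongside extra words."""
    words = ["<s>", "</s>"]  # re-add the defaults that a custom file would override
    for w in extra_words:
        if w not in words:
            words.append(w)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")  # assumption: one word per line
    return words


with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "noclass.vocab")  # hypothetical file name
    write_noclass_vocab(vocab_path, ["<unk>", "."])
    with open(vocab_path, encoding="utf-8") as f:
        contents = f.read().split()
print(contents)
```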
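Reads an initial classes file; see classes-format(5) for the format. Note, however, that the class-to-word probability is required on every line, and each line must map a class to exactly one word.
If a word in the file does not occur in the text, its final probability is 0 and it does not appear in the output classes file.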
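To illustrate the one-word-per-line layout with explicit probabilities, here is a Python sketch that formats an initial classes file. The function format_classes and the toy data are made up for this example. Setting each probability to the word's relative frequency within its class is an assumption; the notes above only state that a probability must be present.

```python
from collections import Counter
from typing import Dict, List


def format_classes(word_to_class: Dict[str, str],
                   unigrams: Dict[str, int]) -> List[str]:
    """Return lines of the form 'CLASS probability word', one word per line."""
    totals: Counter = Counter()
    for w, c in word_to_class.items():
        totals[c] += unigrams.get(w, 0)
    lines = []
    for w, c in sorted(word_to_class.items(), key=lambda kv: (kv[1], kv[0])):
        # Assumption: probability = relative frequency of the word in its class.
        p = unigrams.get(w, 0) / totals[c] if totals[c] else 0.0
        lines.append(f"{c} {p:.6g} {w}")
    return lines


lines = format_classes(
    {"monday": "DAY", "tuesday": "DAY", "paris": "CITY"},
    {"monday": 3, "tuesday": 1, "paris": 2},
)
print("\n".join(lines))
```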
Advanced algorithm
Perform full greedy merging over all classes starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).
Default algorithm
Perform incremental greedy merging, starting with one class each for the C most frequent words, and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992); it is the default.
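The full greedy merging strategy can be sketched in a few lines of Python. This is a naive illustration, not SRILM's implementation: it recomputes the objective from scratch for every candidate pair, so it is far slower than the algorithms described above. It uses the average mutual information between adjacent classes as the objective, following Brown et al. (1992); the function names and toy counts are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Tuple

Bigram = Tuple[str, str]


def avg_mutual_information(counts: Dict[Bigram, int], cls: Dict[str, str]) -> float:
    """Average mutual information between adjacent classes under mapping cls."""
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in counts.items():
        c1, c2 = cls[w1], cls[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    total = sum(joint.values())
    return sum(n / total * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in joint.items())


def full_greedy_merge(counts: Dict[Bigram, int], num_classes: int) -> Dict[str, str]:
    """Start with one class per word; repeatedly apply the merge that keeps
    the average mutual information highest until num_classes remain."""
    num_classes = max(1, num_classes)
    cls = {w: w for bigram in counts for w in bigram}
    while len(set(cls.values())) > num_classes:
        best = None
        for a, b in combinations(sorted(set(cls.values())), 2):
            # Tentatively fold class b into class a and score the result.
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = avg_mutual_information(counts, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        cls = best[1]
    return cls


# Toy data: "a" and "b" occur in identical contexts, so they merge first.
toy = {("x", "a"): 2, ("x", "b"): 2, ("a", "y"): 2, ("b", "y"): 2, ("y", "x"): 3}
result = full_greedy_merge(toy, 3)
print(result)
```

An incremental variant would instead seed the classes with the C most frequent words and add the remaining words one at a time, merging back down to C classes after each addition.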
(Annotator's note: I have not figured out how to use this option.)
Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.
Write class N-gram counts to file when done.
The format is the same as for word N-gram counts, and can be read by ngram-count(1) to estimate a class-N-gram model.
Write class definitions (member words and their probabilities) to file when done.
The output format is the same as required by the -classes option of ngram(1).
Save the class counts and/or class definitions every S iterations during induction. The filenames are obtained from the -class-counts and -classes options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities during the same run. The saved class memberships can also be used with the -read option to restart class merging at a later time. S=0 (the default) suppresses the saving actions.
Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.)
ngram-count(1), ngram(1), classes-format(5).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer, "Class-Based n-gram Models of Natural Language," Computational Linguistics 18(4), 467-479, 1992.
Classes are optimized only for bigram models at present.
AUTHOR
Andreas Stolcke stolcke@icsi.berkeley.edu, Seppo Enarvi seppo.enarvi@aalto.fi
Copyright © 1999-2010 SRI International
Copyright © 2013-2014 Seppo Enarvi
Copyright © 2011-2014 Andreas Stolcke
Copyright © 2012-2014 Microsoft Corp.