SRILM Class N-gram Note

http://www.speech.sri.com/projects/srilm/manpages/ngram-class.1.html

NAME

ngram-class - induce word classes from N-gram statistics

SYNOPSIS

ngram-class [ -help ] option …

DESCRIPTION

ngram-class induces word classes from distributional statistics, so as to minimize perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used, i.e., the induced classes are best suited for a class-bigram language model.
The program generates the class N-gram counts and class expansions needed by ngram-count(1) and ngram(1), respectively to train and to apply the class N-gram model.
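
For reference, a class-bigram model of the kind described above is usually written in the form below (after Brown et al., 1992). The notation c(w), meaning the single class that word w belongs to, is an assumption made here for illustration and does not come from the manpage.

```latex
P(w_i \mid w_{i-1}) = P\bigl(c(w_i) \mid c(w_{i-1})\bigr)\, P\bigl(w_i \mid c(w_i)\bigr)
```

The first factor is what the class N-gram counts are used to estimate; the second corresponds to the class expansions (member words and their probabilities).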

Input Options

-vocab file

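Vocabulary file.
Words not listed in the file are treated as out-of-vocabulary (OOV). If no file is given, every word in the text/counts file is assumed to be in-vocabulary (IV).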

-tolower

Map the vocabulary to lowercase.

-counts file

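N-gram counts file.
Builds the class N-gram counts file from an existing counts file, which can be generated with ngram-count(1).
A counts file and a text file can be given together, but note that the counts from the two sources are added up.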

-text textfile

Builds the class N-gram counts file from a text file.
A text file and a counts file can be given together, but note that the counts from the two sources are added up.
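
A minimal sketch of combining both inputs, assuming SRILM is on the PATH. The function name induce_classes and the output file names classes.out and class-counts.out are placeholders invented for this example; the options themselves are the ones described in this note.

```shell
# Sketch: feed a counts file and a text file to ngram-class in one run.
# The counts from both inputs are accumulated, so avoid passing the
# same data twice (once as counts and once as text).
induce_classes() {
    counts_file=$1
    text_file=$2
    num_classes=$3
    ngram-class -counts "$counts_file" -text "$text_file" \
        -numclasses "$num_classes" \
        -classes classes.out -class-counts class-counts.out
}

# Example call (file names are placeholders):
#   ngram-count -text train.txt -order 2 -write train.counts
#   induce_classes train.counts extra.txt 50
```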

Class Merging

-numclasses C

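Maximum number of classes; the program automatically groups the words into at most this many classes.
With numclasses = 0, each word forms its own class, so the number of classes equals the number of words. Combined with -read, this produces a class N-gram counts file from user-defined classes.
With numclasses = 1, or when numclasses is not specified, all words end up in a single class.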
A zero argument suppresses automatic class merging altogether (e.g., for use with -interact).

-maxwordsperclass M

Limits the number of words in a class to M in incremental merging.
By default there is no such limit.
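This setting is ignored with full merging (-full); it only applies to incremental merging.
maxwordsperclass = 1 does not work as intended: the output is in practice identical to that of maxwordsperclass = 2 (a bug). Use numclasses = 0 to get the intended effect.
When maxwordsperclass and numclasses conflict, the output is hard to predict, and both limits may end up being exceeded. Avoid using the two options together.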

-noclass-vocab file

Lists words that should not be assigned to any class; the words in this file are left out of all classes.
These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>) from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
A custom noclass-vocab file overrides this default, so add <s> and </s> to the file yourself if they should still be excluded.
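
As a small illustration, a custom exclusion list can be written with one word per line. The file name noclass.vocab and the choice of words are placeholders for this example.

```shell
# Write a custom exclusion list, one word per line.
# <s> and </s> are listed explicitly because a custom file replaces
# the default exclusion of the sentence begin/end tags.
printf '%s\n' '<s>' '</s>' 'the' 'a' 'of' > noclass.vocab
```

The file would then be passed as -noclass-vocab noclass.vocab.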

-read file

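Reads an initial classes file. The file format is described in classes-format(5), but here the class-to-word probability on each line is mandatory, and each line must map a class to exactly one word.
If a word in the file does not occur in the text, its final probability is 0 and it does not appear in the output classes file.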
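
The two constraints noted above (a probability on every line, and exactly one word per line) can be checked before running ngram-class. The helper below is a hypothetical sketch and not part of SRILM; it only tests the line shape "CLASS PROB WORD" and does not validate anything else about the file.

```shell
# Hypothetical helper: report lines that are not of the form "CLASS PROB WORD".
# Exits with status 0 if every non-blank line passes, 1 otherwise.
check_classes_file() {
    awk '
        NF == 0 { next }                              # skip blank lines
        NF != 3 || $2 !~ /^[0-9.eE+-]+$/ {            # need 3 fields, numeric 2nd
            printf "line %d: expected CLASS PROB WORD, got: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example call (file name is a placeholder):
#   check_classes_file initial.classes && echo "looks fine"
```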

-full

Advanced algorithm.
Perform full greedy merging over all classes starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).

-incremental

Default algorithm.
Perform incremental greedy merging, starting with one class each for the C most frequent words, and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992); it is the default.

-interact

Note: I have not worked out how to use this option.
Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.

Output Options

-class-counts file

Write class N-gram counts to file when done.
The format is the same as for word N-gram counts, and can be read by ngram-count(1) to estimate a class-N-gram model.

-classes file

Write class definitions (member words and their probabilities) to file when done.
The output format is the same as required by the -classes option of ngram(1).

-save S

Save the class counts and/or class definitions every S iterations during induction. The filenames are obtained from the -class-counts and -classes options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities during the same run. The saved class memberships can also be used with the -read option to restart class merging at a later time. S=0 (the default) suppresses the saving actions.

-save-maxclasses K

Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.)
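
A sketch of how the saving options might be combined, assuming SRILM is on the PATH. The function name, the numeric settings, and the output file names are placeholders; the names of the saved snapshot files are not shown because the manpage only says that the iteration number is appended.

```shell
# Sketch: induce 50 classes and save intermediate classes/counts
# every 10 iterations, starting once the class count reaches 200.
induce_with_snapshots() {
    text_file=$1
    ngram-class -text "$text_file" -numclasses 50 \
        -save 10 -save-maxclasses 200 \
        -classes classes.out -class-counts class-counts.out
}

# Example call (file name is a placeholder):
#   induce_with_snapshots train.txt
```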

BUGS

Classes are optimized only for bigram models at present.

AUTHOR

Andreas Stolcke stolcke@icsi.berkeley.edu, Seppo Enarvi seppo.enarvi@aalto.fi
Copyright © 1999-2010 SRI International
Copyright © 2013-2014 Seppo Enarvi
Copyright © 2011-2014 Andreas Stolcke
Copyright © 2012-2014 Microsoft Corp.

Example

script

# Placeholders: set initial_class, noclass_vocab, text, lm_counts, lm_classes,
# class_lm, and word_lm to file paths before running.
# build LM
ngram-class -tolower -read "${initial_class}" -noclass-vocab "${noclass_vocab}" \
    -text "${text}" -numclasses 0 \
    -class-counts "${lm_counts}" -classes "${lm_classes}"
ngram-count -read "${lm_counts}" -lm "${class_lm}" -order 2
# convert class ngram to word ngram
ngram -lm "${class_lm}" -classes "${lm_classes}" \
    -expand-classes 2 -write-lm "${word_lm}" -tolower
# test class ngram
ngram -lm "${class_lm}" -classes "${lm_classes}" \
    -ppl "${text}" -debug 2 -tolower

text

She was already at the bus station
the car warmed quickly and she fell asleep again
I go to school by bus

initial class

class-transport 0.5 bus
class-transport 0.5 car

noclasses-vocab

<s>
</s>
in
of
a
an
the
at
on
to
for
from
and
or

sample output - classes


class-00001 1 again
class-00002 1 already
class-00003 1 asleep
class-00004 1 by
class-00005 1 fell
class-00006 1 go
class-00007 1 i
class-00008 1 quickly
class-00009 1 school
class-00010 1 she
class-00011 1 station
class-00012 1 warmed
class-00013 1 was
class-transport 0.6666666666666666 bus
class-transport 0.3333333333333333 car
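
The class-transport probabilities above are consistent with relative word frequencies inside the class: bus occurs twice in the sample text and car once. The helper below is a hypothetical sketch that reproduces this arithmetic with awk. It is not how ngram-class is implemented, and the sentences are adapted from the sample text.

```shell
# Hypothetical helper: relative frequency of each listed word within a class.
# Usage: membership_probs CLASS WORD... < text
membership_probs() {
    class=$1; shift
    awk -v class="$class" -v words="$*" '
        BEGIN { n = split(words, w, " "); for (i = 1; i <= n; i++) member[w[i]] = 1 }
        {   # count lowercased tokens that belong to the class
            for (i = 1; i <= NF; i++) {
                t = tolower($i)
                if (t in member) { count[t]++; total++ }
            }
        }
        END {
            if (total > 0)
                for (i = 1; i <= n; i++)
                    printf "%s %.4f %s\n", class, count[w[i]] / total, w[i]
        }
    '
}

membership_probs class-transport bus car <<'EOF'
She was already at the bus station
the car warmed quickly and she fell asleep again
I go to school by bus
EOF
# prints:
# class-transport 0.6667 bus
# class-transport 0.3333 car
```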

sample output - class-counts


</s> 3
<s> 3
<s> class-00007 1
<s> class-00010 1
<s> the 1
and 1
and class-00010 1
at 1
at the 1
class-00001 1
class-00001 </s> 1
class-00002 1
class-00002 at 1
class-00003 1
class-00003 class-00001 1
class-00004 1
class-00004 class-transport 1
class-00005 1
class-00005 class-00003 1
class-00006 1
class-00006 to 1
class-00007 1
class-00007 class-00006 1
class-00008 1
class-00008 and 1
class-00009 1
class-00009 class-00004 1
class-00010 2
class-00010 class-00005 1
class-00010 class-00013 1 
class-00011 1
class-00011 </s> 1
class-00012 1
class-00012 class-00008 1
class-00013 1
class-00013 class-00002 1
class-transport 3
class-transport </s> 1
class-transport class-00011 1
class-transport class-00012 1
the 2
the class-transport 2
to 1
to class-00009 1
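
One property of a listing like the one above can be checked mechanically: for counts taken from whole sentences, each token's unigram count should equal the sum of the bigram counts that start with that token, except for </s>, which never starts a bigram. The helper below is a hypothetical sketch, assuming a bigram counts file with whitespace-separated fields; it is not part of SRILM.

```shell
# Hypothetical helper: compare unigram counts with outgoing bigram counts.
# Exits with status 0 if all tokens are consistent, 1 otherwise.
check_counts() {
    awk '
        NF == 2 { uni[$1] = $2 }        # unigram line: TOKEN COUNT
        NF == 3 { out[$1] += $3 }       # bigram line:  TOKEN1 TOKEN2 COUNT
        END {
            for (t in uni)
                if (t != "</s>" && uni[t] + 0 != out[t] + 0) {
                    printf "mismatch for %s: unigram %d, bigram sum %d\n", t, uni[t], out[t]
                    bad = 1
                }
            exit bad + 0
        }
    ' "$1"
}

# Example call (file name is a placeholder):
#   check_counts class-counts.out && echo "counts are consistent"
```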