The 2.3 billion sequences generated from predicted open reading frames were clustered into three categories, each defined by a different approach and threshold.
Unigene
A unigene, or unigene cluster, contains sequences that represent a unique gene. Unigenes are obtained by clustering the 2.3 billion nucleotide sequences at 95% identity, which yields approximately 300 million unigenes.
Protein cluster
A protein cluster is produced by translating the 300 million unigenes into amino-acid sequences and then clustering them at 90% identity.
Protein family
Protein families capture distant homology relationships by clustering the translated 300 million unigenes at 20% identity, with a minimum sequence overlap of 50% also required.
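The three thresholds above can be illustrated with a minimal sketch. It is not the pipeline used to build the catalogue: the text does not name the clustering tool, so the greedy, longest-first strategy and the ungapped identity measure below are assumptions chosen for brevity, and the names Tier, TIERS, identity_and_overlap and greedy_cluster are hypothetical. The overlap requirement of 0.0 for the first two tiers is also an assumption, since the text states an overlap only for protein families.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_identity: float       # fraction of matching positions required
    min_overlap: float = 0.0  # shorter length / longer length required


# Identity and overlap values taken from the text; 0.0 overlap is assumed
# where the text gives none.
TIERS = [
    Tier("unigene", 0.95),
    Tier("protein_cluster", 0.90),
    Tier("protein_family", 0.20, 0.50),
]


def identity_and_overlap(a: str, b: str) -> tuple[float, float]:
    """Toy measure: ungapped comparison anchored at the first position.

    Real tools compute identity from a proper alignment; this stand-in
    only counts equal characters at equal positions.
    """
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0, 0.0
    matches = sum(x == y for x, y in zip(shorter, longer))
    return matches / len(shorter), len(shorter) / len(longer)


def greedy_cluster(seqs: list[str], tier: Tier) -> list[list[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first, longest member) satisfies the tier's thresholds."""
    clusters: list[list[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            ident, overlap = identity_and_overlap(seq, cluster[0])
            if ident >= tier.min_identity and overlap >= tier.min_overlap:
                cluster.append(seq)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters.append([seq])
    return clusters
```

In this sketch, two 20-residue sequences that differ at one position (90% threshold, 95% toy identity) fall into the same protein cluster, whereas a 4-residue fragment is kept out of a 20-residue protein family because its overlap of 20% is below the 50% minimum.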