RepeatModeler2 / RepeatMasker

tags: `program commands`

RepeatModeler2

RepeatModeler2 identifies repeats in a genome assembly and assembles consensus sequences for them. It attempts a basic classification based on the Dfam database.

genome.fa (genome sequence) -> RM2 -> genome-families.fa (TE consensus sequences)

Example for a given genome called "genome"

1. Build database for RM2

<RepeatModelerPath>/BuildDatabase -name genome genome.fa

2. Run RM2

nohup <RepeatModelerPath>/RepeatModeler -database genome -pa 20 -LTRStruct >& run.out &

-LTRStruct enables the LTR structural discovery module of RM2 (LTRharvest and LTR_retriever)

3. Output file to keep is `genome-families.fa`
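
A quick sanity check on the library before moving on to RepeatMasker can be sketched as below. This is a minimal sketch, assuming the headers in `genome-families.fa` follow the usual `>name#Class/Family ...` pattern; the function names `count_families` and `classification_summary` are illustrative and not part of RepeatModeler.

```shell
# Sketch: summarize a RepeatModeler2 library.
# Assumption: headers look like ">rnd-1_family-1#LTR/Gypsy ( ... )".

# Print the number of consensus sequences (FASTA header lines) in a file.
count_families() {
    grep -c '^>' "$1"
}

# Print "count classification" pairs, most frequent first.
# Headers without a "#" are reported unchanged (minus any trailing text).
classification_summary() {
    grep '^>' "$1" \
        | sed -e 's/^>[^#]*#//' -e 's/[[:space:]].*$//' \
        | sort | uniq -c | sort -k1,1nr -k2,2 \
        | awk '{print $1, $2}'
}
```

Usage: `count_families genome-families.fa` and `classification_summary genome-families.fa`. A large share of `Unknown` entries is common, since the classification is only a basic one.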

RepeatMasker

full documentation

RepeatMasker identifies repeats in the genome using the library built and classified by RepeatModeler2 (`genome-families.fa`). The default search engine is rmblastn (a version of blastn modified for RepeatMasker).

nohup <RepeatMaskerPath>/RepeatMasker -pa 15 -a -s -gccalc -gff -cutoff 200 -no_is -lib genome-families.fa genome.fa

-pa: number of parallel jobs. WARNING: with rmblastn each job uses 4 cores, so -pa 15 can occupy up to 60 cores!
-a: .align file (needed for TE landscapes)
-s: "slow"-search mode (recommended)
-gccalc: computes the GC content of the input sequence
-gff: also writes the annotation as a GFF track
-cutoff 200: minimum alignment score to keep a hit when using -lib (default 225)
-no_is: don't look for insertion sequences (prokaryotic TE)
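
Because of the 4x multiplier on -pa, it can help to derive the value from a core budget rather than guess it. The helper below is a minimal sketch under that assumption (4 cores per job with rmblastn); the name `pa_for_cores` is hypothetical and not part of RepeatMasker.

```shell
# Sketch: pick a -pa value that stays within a given number of cores,
# assuming each RepeatMasker job uses 4 cores with rmblastn.
pa_for_cores() {
    pa=$(( $1 / 4 ))        # integer division, rounds down
    if [ "$pa" -lt 1 ]; then
        pa=1                # always run at least one job
    fi
    echo "$pa"
}
```

For example, with a budget of 60 cores, `pa_for_cores 60` prints 15, which matches the `-pa 15` used in the command above.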