GraffiTE - HackMD

# GraffiTE https://github.com/cgroza/GraffiTE ### Installation Make sure that you have conda initialized first as well. ``` module load gcc/10.1.0 openjdk/11.0.2 conda create -n nextflow conda activate nextflow conda install nextflow #Clone this in your work directory cd /lustre/work/daray/software/GraffiTE git clone https://github.com/cgroza/GraffiTE.git singularity remote use SylabsCloud #Pull this in your home directory singularity pull --disable-cache /lustre/work/daray/software/GraffiTE/graffite_latest.sif library://cgroza/collection/graffite:latest #Go into this and replace every "library://cgroza/collection/graffite:latest" with "~/graffite_latest.sif" in the nextflow.config file. sed -i "s|library://cgroza/collection/graffite:latest|~/graffite_latest.sif|g" ~/GraffiTE/nextflow.config #Also correct a problem with using tmp directory as described way down on this page. sed -i "s|--contain --bind \$(pwd):/tmp|--contain -B /lustre/scratch/daray/bat1k_TE_analyses/graffiTE:/tmp|g" ~/GraffiTE/nextflow.config ``` Test run the software: ``` cd /lustre/work/daray/software/GraffiTE/GraffiTE/test tar xzvf human_test_set.tar.gz cd GraffiTE_testset nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta ``` ``` (nextflow) cpu-23-20:$ cd ~/GraffiTE/test (nextflow) cpu-23-20:/GraffiTE/test$ tar xzvf human_test_set.tar.gz GraffiTE_testset/assemblies.csv GraffiTE_testset/HG002.100k.set1.Illumina.fastq.gz GraffiTE_testset/HG002.mat.cur.20211005_chr22.fasta.gz GraffiTE_testset/hs37d5.chr22.fa GraffiTE_testset/human_DFAM3.6.fasta GraffiTE_testset/out/ GraffiTE_testset/out/1_SV_search/ GraffiTE_testset/out/1_SV_search/svim-asm_individual_VCFs/ GraffiTE_testset/out/1_SV_search/svim-asm_individual_VCFs/HG002.mat.vcf GraffiTE_testset/out/1_SV_search/vcfs.txt GraffiTE_testset/out/1_SV_search/svim-asm_variants.vcf GraffiTE_testset/out/3_TSD_search/ GraffiTE_testset/out/3_TSD_search/TSD_summary.txt GraffiTE_testset/out/3_TSD_search/TSD_full_log.txt GraffiTE_testset/out/3_TSD_search/pangenome.vcf GraffiTE_testset/out/2_Repeat_Filtering/ GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/ GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/OneCode_LTR.dic GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.out GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.out.log.txt GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.out.length GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.tbl GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.cat GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.masked GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/indels.fa.onecode.out GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/ALL.onecode.elem_sorted.bak GraffiTE_testset/out/2_Repeat_Filtering/repeatmasker_dir/onecode.log GraffiTE_testset/out/2_Repeat_Filtering/genotypes_repmasked_filtered.vcf GraffiTE_testset/out/4_Genotyping/ GraffiTE_testset/out/4_Genotyping/short_test1_genotyping.vcf.gz.tbi GraffiTE_testset/out/4_Genotyping/GraffiTE.merged.genotypes.vcf GraffiTE_testset/out/4_Genotyping/short_test1_genotyping.vcf.gz GraffiTE_testset/reads.csv (nextflow) cpu-23-20:/GraffiTE/test$ cd GraffiTE_testset (nextflow) cpu-23-20:/GraffiTE/test/GraffiTE_testset$ nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta N E X T F L O W ~ version 23.04.3 Pulling cgroza/GraffiTE ... downloaded from https://github.com/cgroza/GraffiTE.git Launching `https://github.com/cgroza/GraffiTE` [dreamy_mahavira] DSL2 - revision: b8cff71a8f [main] ▄████ ██▀███ ▄▄▄ █████▒ █████▒██▓▄▄▄█████▓▓█████ ██▒ ▀█▒▓██ ▒ ██▒▒████▄ ▓██ ▒▓██ ██▒ ▓▒▓█ ▀ ▒██░▄▄▄░▓██ ░▄█ ▒▒██ ▀█▄ ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███ ░▓█ ██▓▒██▀▀█▄ ░██▄▄▄▄██ ░▓█▒ ░░▓█▒ ░░██░░ ▓██▓ ░ ▒▓█ ▄ ░▒▓███▀▒░██▓ ▒██▒ █ ▓██▒░▒█░ ░▒█░ ░██░ ▒██▒ ░ ░▒████▒ ░▒ ▒ ░ ▒▓ ░▒▓░ ▒▒ ▓▒█░ ▒ ░ ▒ ░ ░▓ ▒ ░░ ░░ ▒░ ░ ░ ░ ░▒ ░ ▒░ ▒ ▒▒ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░░ ░ ░ ▒ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ V . 0.2.5 beta (09-11-2023) Find and Genotype Transposable Elements Insertion Polymorphisms in Genome Assemblies using a Pangenomic Approach Authors: Cristian Groza and Clément Goubert Bug/issues: https://github.com/cgroza/GraffiTE/issues executor > local (3) [c3/28522e] process > svim_asm (1) [100%] 1 of 1 ✔ [c3/f39b9c] process > survivor_merge [100%] 1 of 1 ✔ [b2/da7450] process > repeatmask_VCF (1) [100%] 1 of 1 ✔ [1b/49a8a7] process > tsd_prep (1) [100%] 1 of 1, failed: 1 [- ] process > tsd_search - [- ] process > tsd_report - [- ] process > pangenie - [- ] process > merge_VCFs - ERROR ~ Error executing process > 'tsd_prep (1)' Caused by: Process requirement exceeds available memory -- req: 10 GB; avail: 3.9 GB Command executed: cp repeatmasker_dir/repeatmasker_dir/* . prepTSD.sh hs37d5.chr22.fa 30 #wc -l indels.txt > indel_len Command exit status: - Command output: (empty) Work dir: /home/daray/GraffiTE/test/GraffiTE_testset/work/1b/49a8a7cf935b29c3708efd1a274f8e Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out` -- Check '.nextflow.log' file for details ``` ### Note the memory error. I tried again by getting an interactive session with 12 processors. ``` login-20-26:$ i12 [CPUs=12 NNodes=1 Name=INTERACTIVE Account=default Partition=nocona X11=YES] salloc: Pending job allocation 12199856 salloc: job 12199856 queued and waiting for resources salloc: job 12199856 has been allocated resources salloc: Granted job allocation 12199856 salloc: Waiting for resource configuration salloc: Nodes cpu-23-38 are ready for job ivy cpu-23-38:$ actconda conda activate nextflo(base) cpu-23-38:$ conda activate nextflow (nextflow) cpu-23-38:$ cd ~/GraffiTE/test (nextflow) cpu-23-38:/GraffiTE/test$ cd GraffiTE_testset (nextflow) cpu-23-38:/GraffiTE/test/GraffiTE_testset$ nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta N E X T F L O W ~ version 23.04.3 Launching `https://github.com/cgroza/GraffiTE` [focused_bell] DSL2 - revision: b8cff71a8f [main] ▄████ ██▀███ ▄▄▄ █████▒ █████▒██▓▄▄▄█████▓▓█████ ██▒ ▀█▒▓██ ▒ ██▒▒████▄ ▓██ ▒▓██ ██▒ ▓▒▓█ ▀ ▒██░▄▄▄░▓██ ░▄█ ▒▒██ ▀█▄ ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███ ░▓█ ██▓▒██▀▀█▄ ░██▄▄▄▄██ ░▓█▒ ░░▓█▒ ░░██░░ ▓██▓ ░ ▒▓█ ▄ ░▒▓███▀▒░██▓ ▒██▒ █ ▓██▒░▒█░ ░▒█░ ░██░ ▒██▒ ░ ░▒████▒ ░▒ ▒ ░ ▒▓ ░▒▓░ ▒▒ ▓▒█░ ▒ ░ ▒ ░ ░▓ ▒ ░░ ░░ ▒░ ░ ░ ░ ░▒ ░ ▒░ ▒ ▒▒ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░░ ░ ░ ▒ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ V . 0.2.5 beta (09-11-2023) Find and Genotype Transposable Elements Insertion Polymorphisms in Genome Assemblies using a Pangenomic Approach Authors: Cristian Groza and Clément Goubert Bug/issues: https://github.com/cgroza/GraffiTE/issues executor > local (8) [c7/9c2552] process > svim_asm (1) [100%] 1 of 1 ✔ [37/46f6fb] process > survivor_merge [100%] 1 of 1 ✔ [3f/dfce61] process > repeatmask_VCF (1) [100%] 1 of 1 ✔ [87/94ffcf] process > tsd_prep (1) [100%] 1 of 1 ✔ [c1/d6cf7c] process > tsd_search (1) [100%] 1 of 1 ✔ [a4/f87258] process > tsd_report (1) [100%] 1 of 1 ✔ [09/c11298] process > pangenie (1) [100%] 1 of 1 ✔ [74/a6cf50] process > merge_VCFs (1) [100%] 1 of 1 ✔ Completed at: 20-Sep-2023 14:05:37 Duration : 2m 35s CPU hours : (a few seconds) Succeeded : 8 ``` ### Test run on Craseonycteris haplotypes ``` login-20-25:$ scratch login-20-25:/lustre/scratch/daray$ mkdir graffiTE login-20-25:/lustre/scratch/daray$ cd graffiTE/ login-20-25:/lustre/scratch/daray/graffiTE$ mkdir assemblies login-20-25:/lustre/scratch/daray/graffiTE$ cd assemblies/ login-20-25:/lustre/scratch/daray/graffiTE/assemblies$ gunzip -c /lustre/work/daray/bat_genomes/bat1k/HLcraTho2A.fa.gz >cTho_A.fa login-20-25:/lustre/scratch/daray/graffiTE/assemblies$ gunzip -c /lustre/work/daray/bat_genomes/bat1k/HLcraTho2B.fa.gz >cTho_B.fa login-20-25:/lustre/scratch/daray/graffiTE/assemblies$ gunzip -c /lustre/work/daray/bat_genomes/zoonomia/CraTho.fa.gz >CraTho.fa login-20-25:/lustre/scratch/daray/graffiTE/assemblies$ cd .. login-20-25:/lustre/scratch/daray/graffiTE$ mkdir cTho_test login-20-25:/lustre/scratch/daray/graffiTE$ cd cTho_test login-20-25:/lustre/scratch/daray/graffiTE/cTho_test$ cp /lustre/scratch/daray/saliogen/ram_analyses/DNA/final/aAng/mammals.plus.covid_bats2.14072022.fa . ``` Generate a run script - cTho_graffiTE.sh ``` #!/bin/bash #SBATCH --job-name=graffiTE #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=12 cd /lustre/scratch/daray/graffiTE/cTho_test . ~/conda/etc/profile.d/conda.sh conda activate nextflow nextflow run https://github.com/cgroza/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype true \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence 5 \ --svim_asm_time 2h \ ``` Generate the assemblies file - cTho_assemblies.csv ``` path,sample ../assemblies/cTho_B.fa,cTho_B ../assemblies/CraTho.fa,CraTho ``` Run ``` (nextflow) cpu-23-38:/lustre/scratch/daray/graffiTE$ sbatch cTho_graffiTE.sh ``` Did not work ``` ERROR ~ No such file or directory: /lustre/scratch/daray/graffiTE/cTho_test/reads.csv -- Check '.nextflow.log' file for details ``` Change --genotype to 'false' and retry ``` #!/bin/bash #SBATCH --job-name=graffiTE #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=12 cd /lustre/scratch/daray/graffiTE/cTho_test . ~/conda/etc/profile.d/conda.sh conda activate nextflow nextflow run https://github.com/cgroza/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype false \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence 5 \ --svim_asm_time 2h \ ``` Running on the command line this time to save time. Failed again. ``` (nextflow) cpu-23-38:/lustre/scratch/daray/graffiTE/cTho_test$ nextflow run https://github.com/cgroza/GraffiTE \ > --assemblies cTho_assemblies.csv \ > --TE_library mammals.plus.covid_bats2.14072022.fa \ > --reference ../assemblies/cTho_A.fa \ > --graph_method pangenie \ --asm_divergence 5 \ > --genotype false \ > --cores 12 \ > --mammal \ > --svim_asm_threads 12 \ > --asm_divergence 5 \ > --svim_asm_time 2h \ > N E X T F L O W ~ version 23.04.3 Launching `https://github.com/cgroza/GraffiTE` [gloomy_knuth] DSL2 - revision: b8cff71a8f [main] ▄████ ██▀███ ▄▄▄ █████▒ █████▒██▓▄▄▄█████▓▓█████ ██▒ ▀█▒▓██ ▒ ██▒▒████▄ ▓██ ▒▓██ ██▒ ▓▒▓█ ▀ ▒██░▄▄▄░▓██ ░▄█ ▒▒██ ▀█▄ ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███ ░▓█ ██▓▒██▀▀█▄ ░██▄▄▄▄██ ░▓█▒ ░░▓█▒ ░░██░░ ▓██▓ ░ ▒▓█ ▄ ░▒▓███▀▒░██▓ ▒██▒ █ ▓██▒░▒█░ ░▒█░ ░██░ ▒██▒ ░ ░▒████▒ ░▒ ▒ ░ ▒▓ ░▒▓░ ▒▒ ▓▒█░ ▒ ░ ▒ ░ ░▓ ▒ ░░ ░░ ▒░ ░ ░ ░ ░▒ ░ ▒░ ▒ ▒▒ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░░ ░ ░ ▒ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ V . 0.2.5 beta (09-11-2023) Find and Genotype Transposable Elements Insertion Polymorphisms in Genome Assemblies using a Pangenomic Approach Authors: Cristian Groza and Clément Goubert Bug/issues: https://github.com/cgroza/GraffiTE/issues executor > local (2) executor > local (2) [42/4cd1ee] process > svim_asm (2) [100%] 1 of 1, failed: 1 [- ] process > survivor_merge - [- ] process > repeatmask_VCF - [- ] process > tsd_prep - [- ] process > tsd_search - [- ] process > tsd_report - ERROR ~ Error executing process > 'svim_asm (1)' Caused by: Process `svim_asm (1)` terminated with an error exit status (1) Command executed: mkdir asm minimap2 -a -x 5 --cs -r2k -t 12 -K 500M cTho_A.fa cTho_B.fa | samtools sort -m4G -@4 -o asm/asm.sorted.bam - samtools index asm/asm.sorted.bam svim-asm haploid --min_sv_size 100 --types INS,DEL --sample cTho_B asm/ asm/asm.sorted.bam cTho_A.fa sed 's/svim_asm\./cTho_B\.svim_asm\./g' asm/variants.vcf > cTho_B.vcf Command exit status: 1 Command output: (empty) Command error: [ERROR] unknown preset '5' [W::hts_set_opt] Cannot change block size for this format samtools sort: failed to read header from "-" Work dir: /lustre/scratch/daray/graffiTE/cTho_test/work/5c/961b690a48f34e56afb8461c094513 Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line -- Check '.nextflow.log' file for details ``` Something is wrong with that asm_divergence line maybe? Try this: ``` nextflow run https://github.com/cgroza/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype false \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence 5 \ --svim_asm_time 2h ``` Nope: ``` N E X T F L O W ~ version 23.04.3 Launching `https://github.com/cgroza/GraffiTE` [friendly_mcnulty] DSL2 - revision: b8cff71a8f [main] ▄████ ██▀███ ▄▄▄ █████▒ █████▒██▓▄▄▄█████▓▓█████ ██▒ ▀█▒▓██ ▒ ██▒▒████▄ ▓██ ▒▓██ ██▒ ▓▒▓█ ▀ ▒██░▄▄▄░▓██ ░▄█ ▒▒██ ▀█▄ ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███ ░▓█ ██▓▒██▀▀█▄ ░██▄▄▄▄██ ░▓█▒ ░░▓█▒ ░░██░░ ▓██▓ ░ ▒▓█ ▄ ░▒▓███▀▒░██▓ ▒██▒ █ ▓██▒░▒█░ ░▒█░ ░██░ ▒██▒ ░ ░▒████▒ ░▒ ▒ ░ ▒▓ ░▒▓░ ▒▒ ▓▒█░ ▒ ░ ▒ ░ ░▓ ▒ ░░ ░░ ▒░ ░ ░ ░ ░▒ ░ ▒░ ▒ ▒▒ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░░ ░ ░ ▒ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ V . 0.2.5 beta (09-11-2023) Find and Genotype Transposable Elements Insertion Polymorphisms in Genome Assemblies using a Pangenomic Approach Authors: Cristian Groza and Clément Goubert Bug/issues: https://github.com/cgroza/GraffiTE/issues executor > local (2) executor > local (2) [8d/b7222f] process > svim_asm (1) [100%] 1 of 1, failed: 1 [- ] process > survivor_merge - [- ] process > repeatmask_VCF - [- ] process > tsd_prep - [- ] process > tsd_search - [- ] process > tsd_report - ERROR ~ Error executing process > 'svim_asm (2)' Caused by: Process `svim_asm (2)` terminated with an error exit status (1) Command executed: mkdir asm minimap2 -a -x 5 --cs -r2k -t 12 -K 500M cTho_A.fa CraTho.fa | samtools sort -m4G -@4 -o asm/asm.sorted.bam - samtools index asm/asm.sorted.bam svim-asm haploid --min_sv_size 100 --types INS,DEL --sample CraTho asm/ asm/asm.sorted.bam cTho_A.fa sed 's/svim_asm\./CraTho\.svim_asm\./g' asm/variants.vcf > CraTho.vcf Command exit status: 1 Command output: (empty) Command error: [ERROR] unknown preset '5' [W::hts_set_opt] Cannot change block size for this format samtools sort: failed to read header from "-" Work dir: /lustre/scratch/daray/graffiTE/cTho_test/work/f1/beeaae756b578b40fbb9d177862274 Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out` -- Check '.nextflow.log' file for details ``` Re-read the documentation. It should be asm5. ``` nextflow run https://github.com/cgroza/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype false \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence asm5 --svim_asm_time 2h ``` That seems to have done it. Re-run via the submission script. ``` (nextflow) cpu-23-38:/lustre/scratch/daray/graffiTE$ sbatch cTho_graffiTE.sh ``` Got a new error that, according to google, is related to CIGAR string lengths of reads. I'm not using reads. Correspondence on github is below: ======================= > Managed to get the software to run but it only ran for four minutes before hitting this CIGAR error. > > A quick google search suggests that the read lengths are too long to handle (samtools/samtools#1667). > > However, I'm not dealing with reads, this is a job where I'm analyzing whole assemblies. I'm sure this is a mistake on my part somewhere given that the documentation specifically says graffiTE can be run using whole assemblies. > > My command line: > > nextflow run https://github.com/cgroza/GraffiTE \ > --assemblies cTho_assemblies.csv \ > --TE_library mammals.plus.covid_bats2.14072022.fa \ > --reference ../assemblies/cTho_A.fa \ > --graph_method pangenie \ > --genotype false \ > --cores 12 \ > --mammal \ > --svim_asm_threads 12 \ > --asm_divergence asm5 > --svim_asm_time 2h > The error: > > > [- ] process > svim_asm - > [- ] process > survivor_merge - > [- ] process > repeatmask_VCF - > [- ] process > tsd_prep - > [- ] process > tsd_search - > [- ] process > tsd_report - > > executor > local (1) > [11/563cf2] process > svim_asm (1) [ 0%] 0 of 2 > [- ] process > survivor_merge - > [- ] process > repeatmask_VCF - > [- ] process > tsd_prep - > [- ] process > tsd_search - > [- ] process > tsd_report - > > executor > local (2) > [12/90fee5] process > svim_asm (2) [ 0%] 0 of 2 > [- ] process > survivor_merge - > [- ] process > repeatmask_VCF - > [- ] process > tsd_prep - > [- ] process > tsd_search - > [- ] process > tsd_report - > ERROR ~ Error executing process > 'svim_asm (1)' > > Caused by: > Process `svim_asm (1)` terminated with an error exit status (1) > > Command executed: > > mkdir asm > minimap2 -a -x asm5 --cs -r2k -t 12 -K 500M cTho_A.fa cTho_B.fa | samtools sort -m4G -@4 -o asm/asm.sorted.bam - > samtools index asm/asm.sorted.bam > svim-asm haploid --min_sv_size 100 --types INS,DEL --sample cTho_B asm/ asm/asm.sorted.bam cTho_A.fa > sed 's/svim_asm\./cTho_B\.svim_asm\./g' asm/variants.vcf > cTho_B.vcf > > Command exit status: > 1 > > Command output: > (empty) > > Command error: > [M::mm_idx_gen::31.011*1.36] collected minimizers > [M::mm_idx_gen::37.874*1.71] sorted minimizers > [M::main::37.874*1.71] loaded/built the index for 812 target sequence(s) > [M::mm_mapopt_update::40.498*1.66] mid_occ = 168 > [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 812 > [M::mm_idx_stat::42.456*1.63] distinct minimizers: 161244371 (94.15% are singletons); average occurrences: 1.421; average spacing: 9.923; total length: 2273669687 > [E::parse_cigar] CIGAR length too long at position 1 (274808464H) > [E::parse_cigar] CIGAR length too long at position 877 (272289946H) > [E::parse_cigar] CIGAR length too long at position 4012 (275636627H) > samtools sort: truncated file. Aborting > > ================ > > It seems that crash is in minimap2, due to the CIGAR exceeding a certain length. > We will try passing the -L flag to minimap2 and push a new commit, to see if it fixes the issue. Re-cloned the repository and re-ran the job using an interactive session. It's been running for several minutes now with no issues thus far. Nope. Same error. Corresponding with Clement and he suggested using the local installation rather than calling from github. ``` nextflow run ~/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype false \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence asm5 \ --svim_asm_time 2h ``` Suggestion by Cristian on github. > Indeed, the problem could be samtools sort. > > You could try splitting the largest contigs in half to avoid the basepair limit. ``` conda activate samtools samtools faidx CraTho.fa && awk {'print $1,"",$2'} CraTho.fa.fai > CraTho_genomeFile.txt samtools faidx cTho_B && awk {'print $1,"",$2'} cTho_B.fa.fai > cTho_B_genomeFile.txt samtools faidx cTho_B.fa && awk {'print $1,"",$2'} cTho_B.fa.fai > cTho_B_genomeFile.txt 55111 samtools faidx cTho_A.fa && awk {'print $1,"",$2'} cTho_A.fa.fai > cTho_A_genomeFile.txt conda activate kmcp kmcp utils split-genomes -m 1 -k 1 --split-number 2 --split-overlap 0 cTho_B.id_h2_SUPER_1.fa -O cTho_B.id_h2_SUPER_1.split kmcp utils split-genomes -m 1 -k 1 --split-number 2 --split-overlap 0 cTho_B.id_h2_SUPER_2.fa -O cTho_B.id_h2_SUPER_2.split kmcp utils split-genomes -m 1 -k 1 --split-number 2 --split-overlap 0 cTho_B.id_h2_SUPER_3.fa -O cTho_B.id_h2_SUPER_3.split gunzip cTho_B.id_h2_SUPER_1.split/* gunzip cTho_B.id_h2_SUPER_2.split/* gunzip cTho_B.id_h2_SUPER_3.split/* ``` Renamed split scaffolds using nano and text files. ``` nano cTho_B.id_h2_SUPER_1.split/chunk001.fa nano cTho_B.id_h2_SUPER_1.split/chunk002.fa cd /lustre/scratch/daray/graffiTE/assemblies/cTho_B.fa.split/cTho_B.id_h2_SUPER_2.split nano chunk001.fa nano chunk002.fa cd /lustre/scratch/daray/graffiTE/assemblies/cTho_B.fa.split/cTho_B.id_h2_SUPER_3.split nano chunk001.fa nano chunk002.fa ``` Move original scaffolds to new storage directory and move the split files to the split directory for concatenation. ``` mkdir large_scaffolds mv cTho_B.id_h2_SUPER_1.fa cTho_B.id_h2_SUPER_2.fa cTho_B.id_h2_SUPER_3.fa large_scaffolds mv cTho_B.id_h2_SUPER_3.split/* cTho_B.id_h2_SUPER_2.split/* cTho_B.id_h2_SUPER_1.split/* . cat *.fa >../cTho_B_split.fa ``` Modify the cTho_assemblies.csv as needed and try again. ``` cd /lustre/scratch/daray/graffiTE/cTho_test . ~/conda/etc/profile.d/conda.sh conda activate nextflow nextflow run ~/GraffiTE \ --assemblies cTho_assemblies.csv \ --TE_library mammals.plus.covid_bats2.14072022.fa \ --reference ../assemblies/cTho_A.fa \ --graph_method pangenie \ --genotype false \ --cores 12 \ --mammal \ --svim_asm_threads 12 \ --asm_divergence asm5 \ --svim_asm_time 2h ``` According to the nextflow logs, I've made it past the previous crash. Fingers crossed. That worked but there's a new problem. Only the first two folders of four are being generated. Just 1 and 2. OUTPUT_FOLDER/ ├── 1_SV_search │ ├── HG002_mat.vcf │ └── HG002_pat.vcf ├── 2_Repeat_Filtering │ ├── genotypes_repmasked_filtered.vcf │ └── repeatmasker_dir │ ├── ALL.onecode.elem_sorted.bak │ ├── indels.fa.cat.gz │ ├── indels.fa.masked │ ├── indels.fa.onecode.out │ ├── indels.fa.out │ ├── indels.fa.out.length │ ├── indels.fa.out.log.txt │ ├── indels.fa.tbl │ ├── onecode.log │ └── OneCode_LTR.dic ├── 3_TSD_search │ ├── pangenome.vcf │ ├── TSD_full_log.txt │ └── TSD_summary.txt └── 4_Genotyping ├── GraffiTE.merged.genotypes.vcf ├── HG002_s1_10X_genotyping.vcf.gz ├── HG002_s1_10X_genotyping.vcf.gz.tbi ├── HG002_s2_10X_genotyping.vcf.gz └── HG002_s2_10X_genotyping.vcf.gz.tbi Clement suggests that the run is failing on the TSD search and evidence is that the genotypes_repmasked_filtered.vcf has no actual variant calls, only the header. Had me read two issues. https://github.com/cgroza/GraffiTE/issues/12 https://github.com/cgroza/GraffiTE/issues/8 A simple possible solution in both cases is the inability to generate a tmp directory due to permission problems. I implemented a potential solution in the nextflow config file, changing ``` singularity.runOptions = '--contain' ``` to ``` singularity.runOptions = '--contain -B /lustre/scratch/daray/graffiTE:/tmp ``` Another potential solution: > if you still have the Nextflow output, it should give you an alpha numeric code for each process. it correspond to a folder in work/ that is created by Nextflow in the directory you started the program. > For example, in your previous post: > [12/90fee5] process > svim_asm (2) [ 0%] 0 of 2 > then, you can find the process-specific log and error in work/12/90fee5[something]/.command.log or `work/12/90fee5[something]/.command.err` > Now, you should also have one created for the RepeatMasker step, and you can look if and error related to mktmp or /tmp is reported. > For some reason when this error occurs, Netflow doesn't complain, but the TSD step does not run. And it is always associated with no variants in the VCF of "2_..." After much back and forth, the solution was as follows. Change the nextflow.config file in the installation directory: > In your case, you want to do this: > singularity.runOptions = '--contain -B /lustre/scratch/daray/graffiTE/cTho_test:/tmp' > it should mount the directory /lustre/scratch/daray/graffiTE/cTho_test as the /tmp dir of the singularity container, instead of the default. > David Ray > 6:48 PM > We got it. That run option was the solution. Thank you. ## Conversations about interpreting the results: #### From me on Slack: I now have a table like this that contains the SV name in column A and the presence/absence data from the support vectors is in columns E and F. I want to make sure my interpretation is correct. As I recall, all of the 1's and 0's are relative to the reference, Maya. The genome order from the vcfs.txt is iSca followed by Murphy. 12:33 So, for "iSca.svim_asm.DEL.127", the support vector is "SUPP_VEC=10". Thus, this LINE is present in iSca (1) but not in Maya (the reference) or Murphy? Could it also be present in Maya? There are only two numbers in the support vector and I don't see any entries where the support vector is "SUPP_VEC=00", which would indicate presence in Maya but absence in the other two. There have to be at least some. #### Clément Goubert Hi David. You can't actually have 00, because it would mean that both iSca and Murphy have the same haplotype than the reference, so there is no variant; i.e. it would be a fixed TE presence or absence. "1" means "variant" from the reference. So to know if the TE is present or absent in the ref, you need to look at SVTYPE=. If it is DEL, it means the TE is present in the ref, and 1 means Deletion in iSca or Murphy. If SVTYPE=INS, then the TE is absent from Maya, and 1 means TE present in other samples. This use the SV conventions or reporting variants; to translate it into TE presence/absence, you can basically keep the support vector for INS, giving a 0 to the ref genome, and reverse it for DEL: you give a 1 to the ref genome and reverse the support vector, which will give you 0 for a given ind: SVTYPE=INS; SUPP_VEC=01 --> TE presence matrix Maya|iSCa|Murphy = 001 SVTYPE=DEL; SUPP_VEC=01 --> TE presence matrix Maya|iSCa|Murphy = 110 Separate conversation. Order of species in 1/0 is taken from the vcfs.txt document. Top taxon is first, second taxon is second. For example: ``` iSca.vcf murphy_PDC_2.3.1.vcf ``` Per Clement - PASS in pangenome.vcf is meaningless. PASS in TSD_summary.txt is useful. It means the TSD was verified. I plan to use only PASSes from TSD_summary.txt file. iSca is first number, murphy is second number. ### Getting the data into an interpretable form Only interested in data from the pangenome.vcf and TSD_summary.txt in 3_TSD_search. ``` cd /lustre/scratch/daray/ixodes/final_library_work/graffiTE/3taxa_v1/3_TSD_search ``` 1. Reduce TSD summary to only PASS. ``` grep "PASS" TSD_summary.txt >TSD_summary_pass.txt ``` 2. Get support vectors from pangenome.vcf ``` sed "s/;/ /g" pangenome.vcf | awk '{print $3 "\t" $9}' | sed "s/=01/=0 1/g" | sed "s/=11/=1 1/g" | sed "s/=10/=1 0/g" | sed "s/=00/=0 0/g" | sed "s/SUPP_VEC=//g" | grep -E 'INS|DEL' >support_vectors.txt ``` 3. support_vectors.txt will contain ALL potential indels. Need to reduce to only the ones that PASSed from TSD_summary_pass.txt ``` awk '{print $1}' TSD_summary_pass.txt > pass_list.txt cat pass_list.txt | while read i; do grep -w "$i" support_vectors.txt >>support_vectors_pass.txt; done ``` Output looks like this: ``` iSca.svim_asm.INS.48 1 0 iSca.svim_asm.INS.511 1 0 iSca.svim_asm.INS.512 1 0 iSca.svim_asm.INS.515 1 0 iSca.svim_asm.INS.61 1 0 murphy_PDC_2.3.1.svim_asm.DEL.10001 0 1 murphy_PDC_2.3.1.svim_asm.DEL.10007 0 1 murphy_PDC_2.3.1.svim_asm.DEL.10008 0 1 murphy_PDC_2.3.1.svim_asm.DEL.10010 0 1 ``` 4. Now, reverse order of the columns for iSca and Murphy ``` awk '{print $1 "\t" $3 "\t" $2}' support_vectors_pass.txt >support_vectors_pass_switched.txt ``` ``` iSca.svim_asm.INS.512 0 1 iSca.svim_asm.INS.515 0 1 iSca.svim_asm.INS.61 0 1 murphy_PDC_2.3.1.svim_asm.DEL.10001 1 0 murphy_PDC_2.3.1.svim_asm.DEL.10007 1 0 murphy_PDC_2.3.1.svim_asm.DEL.10008 1 0 murphy_PDC_2.3.1.svim_asm.DEL.10010 1 0 ``` 5. Need to add a column for Maya, the reference. It should be a "0" for lines with INS and a "1" for lines with DEL. ``` grep "INS" support_vectors_pass_switched.txt >support_vectors_pass_switched_INS.txt grep "DEL" support_vectors_pass_switched.txt >support_vectors_pass_switched_DEL.txt awk '{print $1 "\t" 0 "\t" $2 "\t" $3}' support_vectors_pass_switched_INS.txt >support_vectors_new.txt awk '{print $1 "\t" 1 "\t" $2 "\t" $3}' support_vectors_pass_switched_DEL.txt >>support_vectors_new.txt ``` 6. For all entries where Maya has a '1', reverse the value for Murphy and iSca. Based on "This use the SV conventions or reporting variants; to translate it into TE presence/absence, you can basically keep the support vector for INS, giving a 0 to the ref genome, and reverse it for DEL: you give a 1 to the ref genome and reverse the support vector, which will give you 0 for a given ind: SVTYPE=INS; SUPP_VEC=01 --> TE presence matrix Maya|iSCa|Murphy = 001 SVTYPE=DEL; SUPP_VEC=01 --> TE presence matrix Maya|iSCa|Murphy = 110" Can't figure out how to do this. Will take care of it when I get to the spreadsheet. 7. For all of these, need the TE name, family and class, and orientation from the TSD_summary_pass.txt ``` awk '{print $1}' support_vectors_new.txt >support_vector_list.txt rm summary_pass_data.txt awk '{print $1 "\t" $2}' te_list.txt | sed 's/\//\'$'\t/g' >te_list_final.txt cat support_vector_list.txt | while read i; do grep -w "$i" TSD_summary_pass.txt | awk '{print $2 "\t" $3 "\t" $12 "\t" $13 }' >> summary_pass_data.txt; done paste support_vectors_new.txt summary_pass_data.txt >support_vectors_new_data.txt awk '{print $5}' support_vectors_new_data.txt >support_vector_te_list.txt cat support_vector_te_list.txt | while read i; do grep -w -m 1 "$i" te_list_final.txt | awk '{print $2 "\t" $3}' >> summary_pass_te_data.txt; done paste support_vectors_new_data.txt summary_pass_te_data.txt >support_vectors_new_data_te.txt awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $9 "\t" $10 "\t" $6 "\t" $7 "\t" $8}' support_vectors_new_data_te.txt > support_vectors_new_data_arranged.txt ```