Hybpiper - HackMD

### Analysing Sequence Reads ``` interactive -p nocona -c 5``` To unzip ```tar.gz``` files ```tar -zxf filename.tar.gz``` ```control K``` delete an entire line in an ```.sh or .txt``` on the terminal. Use ```unzip``` to unzip files with ```.zip``` files The raw reads come in zipped or compressed fastq files(.fastq.gz). Unzip these files using ```gunzip``` (.gz files) Then run a fastqc analysis to give a statistical summary of the quality of reads. Using either of the following, *. is a wild card to analyse multiple files. -o command specifies the output directory. ``` conda activate fastqc``` ``` fastqc filename.fastq``` ``` fastqc *.fastq``` ```fastqc -o /path/to/output -t 4 *.fastq``` OR ``` fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN``` ```--noextract``` will leave the output file zipped(.zip) ```-f``` specify the format of the file (e.g fastq) ```-c``` specify the path to the directory that contains contaminat file ```-o``` specify the path to the output directory. FastP is used to trim reads before pairing them. This is could be to remove adapter sequence or low quality, short sections sections of reads. ```fastp -i [untrimmed.forward].fq -I [untrimmed.reverse].fq -o [trimmedoutput.forward].fq -O [trimmedoutput.reverse].fq``` ### HYBPIPER ```conda activate hypiper``` ``` !/bin/bash #SBATCH --job-name=Hybpiper_batch #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 cd /lustre/scratch/oshodipo/Haplopappus/Hybpiper while read name; do hybpiper assemble -t_dna ../mega353.fasta -r ../Haplopappus/$name/*.fastq --prefix $name --bwa --cpu 5 --kvals 21,31,41 --compress_sample_folder done < samplesName.txt OR !/bin/bash #SBATCH --job-name=Hybpiper #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 cd /lustre/scratch/oshodipo/Xanthisma/Hybpiperoutput while read name; do hybpiper assemble -t_dna ../mega353.fasta -r ../Shodipo/$name/*.trimmed.*.fastq.gz --prefix $name --bwa --cpu 5 --kvals 21,31,41 --compress_sample_folder done < SampleNames.txt ``` ``` sbatch .sh, squeue -u oshodipo``` Because Outgroups do not have trimmed in their fastq file name I ran them separately. ``` #!/bin/bash #SBATCH --job-name=OutHybpiper #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 cd /lustre/scratch/oshodipo/Xanthisma/Hybpiperoutput_Outgroups while read name; do hybpiper assemble -t_dna ../mega353.fasta -r ../Outgroups/$name/*.fastq.gz --prefix $name --bwa --cpu 5 --kvals 21,31,41 --compress_sample_folder done < Outgroup_names.txt ``` Ensure the mega353 file is just in one directory up if using this code e.g if in your code you have cd L/S/O/XANTHISMA/HYBPIPEROUTPUT, then your mega353 should be in L/S/O/XANTHISMA Note that the samplesName.txt file is in output directory. OR the below command to run on each file cd to the directory that contains the reads you want to run hybpiper on. ``` hybpiper assemble -t_dna mega353.fasta -r *.fastq --prefix HAPLOseq009 --bwa --cpu 5 --kvals 21,31,41 --compress_sample_folder ``` For trimmed fastq files ``` hybpiper assemble -t_dna mega353.fasta -r *.trimmed.*.fastq.gz --prefix XANTHseq062 --bwa --cpu 5 --kvals 21,31,41 --compress_sample_folder ``` move the samplesName.txt to the working directory that you are running this command for me it is Hybpiper when in /l/s/o/H/H: while read name; do cd $name; ls; cd ..; done < samplesName.txt ``` ; is to separate command Do ```ls``` before the above command while in /lustre/scratch/oshodipo/Haplopappus/Haplopappus$ run the command below; ``` while read name do echo $name done < samplesName.txt ``` To easily copy the sample names without typing ``` ls``` in the directory that contains the sample e.g /lustre/scratch/oshodipo/Haplopappus/Haplopappus then ```ls >namelist.txt``` you can go ahed and remove the name of the files written that ypu dont want. ``` ### Hybpiper stats ``` hybpiper stats --targetfile_dna mega353.fasta gene namelist.txt --stats_filename hybpiper_summary_stats --seq_lengths_filename hybpiper_seq_lengths``` ``` For the new 93 samples(--cpu has to be specified cos the samples are large if not specified the analysis will get killed) ``` hybpiper stats -t_dna mega353.fasta gene SampleNames.txt --stats_filename hybpiper_summary_stats --seq_lengths_filename hybpiper_seq_lengths --cpu 3 ``` ### hybpiper heatmap ``` hybpiper recovery_heatmap hybpiper_seq_lengths.tsv --figure_length 11 --figure_height 8 ``` Running export ```mplbackend=agg``` first means you dont require display of the heatmap ``` export MPLBACKEND=Agg hybpiper recovery_heatmap hybpiper_seq_lengths.tsv --heatmap_filetype pdf ``` ### hybpiper retrieve sequences ``` hybpiper retrieve_sequences -t_dna mega353.fasta --sample_names namelist.txt supercontig ``` ### Gene Trees MAFFT ```mamba activate phylo``` ``` mkdir MAFFT``` Be in Hybpiper directory, Mafft will run on the output files(```.fasta```) from retrieve_sequences i.e the supercontigs retrieved. ``` parallel --eta "mafft --preservecase --maxiterate 1000 --localpair {}.fasta > MAFFT/{}.mafft.fasta" :::: genelist.txt ``` I was having issues getting the mafft file of all the genes so aia decided to run it as a shell script' ``` #!/bin/bash #SBATCH --job-name=MAFFT #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 . ~/conda/etc/profile.d/conda.sh conda activate phylo cd /lustre/scratch/oshodipo/Xanthisma/Hybpiperoutput parallel --eta "mafft --preservecase --maxiterate 1000 --localpair {}.fasta > MAFFT/{}.mafft.fasta" :::: genelist.txt ``` To pick the genes with 20 and above sequences so that the gene tree can make sense. Ideally for a gene tree to make sense, there need to be at least four gene present. ``` parallel --tag "grep '>' {} | wc -l" ::: *.fasta | sort -k2 -n > gene_seq_length.txt ``` ### Trimal ```mkdir Trimal``` To run the trimal command be in the Hybpiper directory not the MAFFT directory and create Trimal Directory in Hybpiper directory just like MAFFT. Also the ```.txt``` file you will give to the trimal arugument should only have the genes name without any extention and be in the hybpiper directory (Here, the trimal was run on the genes only with 20 and above sequences) ``` parallel --eta "trimal -in MAFFT/{}.mafft.fasta -out Trimal/{}.trimal.fasta -gt .5" :::: gene_20_cleaned.txt ``` ``` #!/bin/bash #SBATCH --job-name=Trimal #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 . ~/conda/etc/profile.d/conda.sh conda activate phylo cd /lustre/scratch/oshodipo/Xanthisma/Hybpiperoutput parallel --eta "trimal -in MAFFT/{}.mafft.fasta -out Trimal/{}.trimal.fasta -gt .5" :::: gene20.txt ``` make sure the gene20.txt does not have blank spaces Remove trailing spaces and blank lines ``` sed -i 's/[[:space:]]*$//' gene20.txt``` Ensure there are no carriage return characters (for Unix compatibility) ```sed -i 's/\r//' gene20.txt``` IQTREE ```mkdir IQTREE``` in Hybpiper directory ``` parallel --eta "iqtree -s Trimal/{}.trimal.fasta -pre Iqtree/{}" :::: gene_20_cleaned.txt ``` ``` #!/bin/bash #SBATCH --job-name=Iqtree #SBATCH --output=%x.%j.out #SBATCH --error=%x.%j.err #SBATCH --partition=nocona #SBATCH --nodes=1 #SBATCH --ntasks=36 . ~/conda/etc/profile.d/conda.sh conda activate phylo cd /lustre/scratch/oshodipo/Xanthisma/Hybpiperoutput parallel --eta "iqtree -s Trimal/{}.trimal.fasta -pre IqTree/{}" :::: gene20.txt ``` ### Determing what model is best IQtree supports a wide range of models for DNA, protein, codons, binary and nonbinary alignments. If you are not sure which model to use for your file you can run ModelFinder to figure out which is best. ```-m``` MFP is the option to specify the model name to use during the analysis. The special MFP key word stands for ModelFinder Plus, which tells IQ-TREE to perform ModelFinder and the remaining analysis using the selected model. Assessing branch support with bootstrap estimate IQtree presents an ultrafast bootstrap estimate that is faster than the standard procedure and provides relatively unbiased bootstrap support estimates. To run UF boot ```iqtree -s geneName.trimal.fasta -m MFP -B 1000 --redo``` ```-B``` specifies the number of bootstrap replicates where 1000 is the minimum number recommended. ```--redo``` tells IQTree that it's ok to rerun the analysis and overwrite the previous files To run in parallel ``` parallel --eta "iqtree -s Trimal/{}.trimal.fasta -m MFP -B 1000 --redo -pre Iqtree_bootstrap/{}" :::: gene_20_cleaned.txt ``` ### Species tree ``` mamba create -n species_trees mamba activate species_trees mamba install astral-tree newick_utils msaconverter ``` After installing the packages, you just need to activate the environment. ``` mamba activate species_trees``` Then create a file containing all the gene tree files using the command below ``` cat *.treefile > genetrees.tre``` To collapse all branches below 50% bootstrap support: ```nw_ed genetrees.tre 'i & b<50' o > gene_50.tre``` ```nw_ed genetrees.tre 'i & b<33' o > gene_33.tre``` ASTRAL ```astral -i gene_50.tre -o astral_spp.tre``` ```astral -i gene_33.tre -o astral_33spp.tre``` To change the node names first activate phylo and be in the folder containing the astral.tre To confirm if ete3 is installed and working ``` python -c "from ete3 import Tree; print('ETE3 is installed')" python -c "from ete3 import Tree; print('ETE3 is working')" ``` Then run this script below after creating a .csv file conntaining both old and new names ``` python /home/joh97948/scripts/replace_taxon_names.py astral_spp.tre Change_nodesrecent.csv > astral_spp_renamed.tre python /home/joh97948/scripts/replace_taxon_names.py astral_spp33.tre Change_nodesrecent.csv > astral_spp33renamed.tre python /home/joh97948/scripts/replace_taxon_names.py gene_50.tre Change_nodesrecent.csv > gene_50renamed.tre ``` To get my outgroup reads ``` wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR761/006/ERR7618476/ERR7618476_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR761/006/ERR7618476/ERR7618476_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR761/004/ERR7618474/ERR7618474_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR761/004/ERR7618474/ERR7618474_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR503/006/ERR5033766/ERR5033766_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR503/006/ERR5033766/ERR5033766_1.fastq.gz ```