---
Title: Cycad Hybridization
tags: HybSeq
---

## Trimmomatic

> Most of this section taken from Johnathan's [Hybpiper 2 Workflow page](https://hackmd.io/zLFynXJvQQ-y0-rj6nqolQ).

[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)

Most files coming straight from the sequencing facility have a few extra characters in their names. Clean up file names that look like samplename_S79_R1_001.fastq.gz:

```bash=
#to remove the S79:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done

#to remove the _001:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```

Trimmomatic filters out low-quality reads, removes adapter sequences, etc. The Trimmomatic conda environment needs to be activated before running.

```bash=
conda activate trimmomatic
```

:::spoiler Trimmomatic Arguments
- `PE -phred33` refers to the quality scoring scheme used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that Trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files
- The remaining arguments inform Trimmomatic's decisions about keeping/removing sequences
:::

```bash=
#Run this for loop in the folder of samples; it writes one trimmomatic command per sample into trimmomatic.sh
for _R1_ in *_R1*; do
  R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
  R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
  R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
  R2p=${R2//.fastq.gz/_paired.fastq.gz}
  R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
  echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```

This should be run through Slurm. Create a Slurm submission script using nano and name it `slurm_trimmomatic.sh`:

```
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=15:00:00
#SBATCH --mem=1GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=slurm_trimmomatic
#SBATCH --output=slurm_outlog.log
#SBATCH --error=slurm_errors.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

bash ./trimmomatic.sh
```

Trimmomatic cannot use more than 1GB of memory, so increasing the allocation does not make it run faster. I had ~250 files and ~168GB of data, which took a little over 14 hours to run.

To run, use the sbatch command:

```bash=
sbatch slurm_trimmomatic.sh
```

After Trimmomatic runs you will have the original files, R1 and R2 paired files, and R1 and R2 unpaired files. Examples:

HE1_R1.fastq.gz
HE1_R2.fastq.gz
HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_R1_unpaired.fastq.gz
HE1_R2_unpaired.fastq.gz

For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file. To concatenate the two unpaired files into one:

```bash=
for file in *R1_unpaired*
do
  file2=${file//_R1_/_R2_}
  file3=${file//R1_unpaired/unpaired}
  cat $file $file2 > $file3
done
```

Take a moment to organize your files. Make a new directory for your ./raw_reads and, within that, a new directory for your ./unconcatenated_unpaired_reads.
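A minimal sketch of that cleanup, assuming the naming scheme above and that you run it from the folder holding the Trimmomatic outputs:

```bash=
# move the original and per-read unpaired files out of the way,
# leaving only the paired files and the concatenated unpaired file
mkdir -p raw_reads/unconcatenated_unpaired_reads
mv *_R1.fastq.gz *_R2.fastq.gz raw_reads/
mv *_R1_unpaired.fastq.gz *_R2_unpaired.fastq.gz raw_reads/unconcatenated_unpaired_reads/
```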
Now you should only have these three files per sample:

HE1_R1_paired.fastq.gz
HE1_R2_paired.fastq.gz
HE1_unpaired.fastq.gz

## HybPiper 2

> Most of this section taken from Jonathan's [Hybpiper 2 Workflow page](https://hackmd.io/zLFynXJvQQ-y0-rj6nqolQ).

[HybPiper GitHub](https://github.com/mossmatters/HybPiper)
[Hybpiper Wiki](https://github.com/mossmatters/HybPiper/wiki/HybPiper-Legacy-Wiki)

The HybPiper conda environment needs to be created and activated before starting:

```bash=
conda create -n hybpiper
conda activate hybpiper
conda install bioconda::hybpiper
```

You will need to create a text file with the names of all individuals as they appear in the file names. I named this `namefile.txt`.

### GoFlag bait file

> [Weston Testo's GitHub](https://github.com/wtesto/GoFlagHybPipeline) for the GoFlag pipeline

Download the `combinedTarget.fasta` file and upload it to Quest using either Cyberduck or Globus. This is the complete bait file and has all loci from all taxa that were used to create the GoFlag bait set. For some reason, the pipeline for using this file requires you to create a custom reference file to use with HybPiper. To create this custom reference file, create this script with the `nano` command and name it `extract_sequences.sh`:

```bash=
#!/bin/bash

###Insert the taxon name for your reference of interest between the first two "/ /"
### Note that for some species (e.g., Polypodium hesperium), there are two references in the reference library
###For these species, indicate the reference number in the tag to disambiguate between them (e.g., "Polypodium_hesperium_2")

awk '/Cycadales/ {print; getline; print}' combinedTarget.fasta >> ref_seq.fasta

###add additional ref if needed

sed -E "s/>L([0-9]*)_.*_.*_(.*_.*)_[1-2]__REF/>\2-\1/g" ref_seq.fasta > target.fa

rm ref_seq.fasta
```

This will create a file called target.fa that will be used as the bait file in the HybPiper pipeline.

### Assembly

HybPiper's first command, `assemble`, performs sequence assembly and extraction. In the folder containing the quality-filtered reads, run the for loop below (paste it into the terminal or save it with `nano`); it writes one `hybpiper assemble` command per sample into `assembly.sh`. This is an intensive process and will take a long time, so run the resulting script through Slurm.

:::spoiler NOTES ABOUT ASSEMBLY
- The syntax below will need to change depending on the naming scheme of your files
- It may be a good idea to back up your quality-filtered reads somewhere (if storage space is not an issue)
- Here I've redirected the output of this for loop into a shell script. That way, you can check that all the variables were correctly filled in before running this step. To run the shell script, type `bash assembly.sh`
- Arguments:
    - `-t_dna`: the bait file
    - `-r`: the paired read files
    - `--unpaired`: the concatenated unpaired read file
    - `--bwa`: use BWA, for _nucleotide_ target files
    - `--run_intronerate`: collect intron regions as well (you may not want this depending on your goals)
    - See the GitHub for additional arguments/details
:::

```bash=
for i in *R1_paired.fastq.gz
do
  f2=${i//_R1_/_R2_}
  f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
  f4=${i//_R1_paired.fastq.gz/}
  echo hybpiper assemble -t_dna target.fa -r $i $f2 --unpaired $f3 --prefix $f4 --bwa >> assembly.sh
done
```

:::danger
The argument --run_intronerate is usually used, but it caused the program to fail, so I had to exclude it. *Note:* without run_intronerate you aren't able to retrieve a supercontig at the end of the HybPiper pipeline.
A workaround *might* be possible by aligning and trimming the consensus files made during the HybPhaser pipeline.

FIX! --run_intronerate is no longer a required parameter in HybPiper v2.1.6; intronerate is run automatically without specifying it. If you don't want to run intronerate, use the parameter --no_intronerate.
:::

Create a Slurm submission script for the assembly and name it `slurm_assembly.sh`:

```
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=long
#SBATCH --time=48:00:00
#SBATCH --mem=70GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --job-name=slurm_hybpiper_assembly
#SBATCH --output=slurm_outlog.log
#SBATCH --error=slurm_errors.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

bash ./assembly.sh

#to run in slurm
sbatch slurm_assembly.sh
```

I had 116 individuals, which took ~34 hours to complete; CPU usage was 6 cores (36% of allocation) and memory usage was 101.5GB (133% of allocation). Next time I would lower `ntasks-per-node` to 8 and raise `mem` to 84GB (84GB is the max we are allowed to allocate).

### Summary Statistics

HybPiper also includes a number of commands that produce statistics about the assembled sequences. This is separate from retrieving the sequences (described below). `cd` to the parent folder containing HybPiper's outputs and type these commands to generate summary statistics.

:::spoiler NOTES ABOUT STATISTICS
- Arguments
    - `-t_dna`: the nucleotide bait file
    - `gene`: use "gene" if you want statistics about the gene sequences, "supercontig" if you want supercontig statistics (requires intronerate)
    - `namefile.txt`: a text file containing individual names, one per line
:::

```bash=
#This line generates some tsv's containing statistics about each individual's assemblies
hybpiper stats -t_dna target.fa gene namefile.txt

#This line generates a heatmap based off of the summary statistics in the previous line
hybpiper recovery_heatmap seq_lengths.tsv
```

### Retrieve Sequences

Once we've both reduced missing data and added the outgroup to our assemblies, we can retrieve the loci sequences assembled for each individual using HybPiper's retrieve_sequences command:

```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=04:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=retrieve_seqs
#SBATCH --error=retrieve_seqs.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

hybpiper retrieve_sequences dna -t_dna target.fa --sample_names namefile.txt --fasta_dir ./seqs
```

### Retrieve Supercontigs!

We need to retrieve supercontigs to make a more accurate supermatrix tree. With HybPiper v2.1.6 intronerate is automatically run during assembly, so retrieving supercontigs should just work.

:::danger
Run through the first section of HybPhaser before retrieving supercontigs. The first section of HybPhaser flags samples with poor recovery; you can then remove those individuals from the namefile when running this script so they are excluded from the tree analysis.
:::

```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=04:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=retrieve_supercontigs
#SBATCH --error=retrieve_supercontigs.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

hybpiper retrieve_sequences supercontig -t_dna /projects/p32038/Fant44_02.22.2024/test/fastq/target.fa --sample_names namefile.txt --fasta_dir ./supercontigs
```

## HybPhaser

> A lot of this is taken from [Nora's HybPhaser Protocol](https://hackmd.io/LMYO6q6tQMGeaRxVo8cOLg?both). Note that Nora was not using this to assess hybridization.
> [HybPhaser GitHub](https://github.com/LarsNauheimer/HybPhaser)

HybPhaser (version 2.0) was developed to deal with hybrids (and polyploids) in target capture datasets. It detects hybrids by measuring heterozygosity in the dataset and phases hybrid accessions by separating reads according to their similarity with selected taxa that represent parental clades. HybPhaser is built as an extension to the assembly pipeline HybPiper.

#### New notes after running HybPhaser v2 on Quest in 2023:

When running 1_generate_consensus_sequences.sh in a Slurm script, be sure to ```module load samtools``` and ```module load bwa``` in the Slurm script. Do not module load bcftools (see below).

Note that the bcftools installed on Quest is v1.18, but this script requires the older v1.9. Confusingly, the flag -I, which tells *bcftools consensus* to use IUPAC ambiguity codes, is no longer accepted in v1.18 (which uses an -i flag for that instead), while v1.9 uses -i for filtering the VCF! To solve this issue, install bcftools v1.9 in your conda environment. Do not ```module load bcftools``` in the environment or in the Slurm script because this will load the newer version.

```bash=
conda activate myEnv
conda install bcftools=1.9
```

Make sure that you are using the correct bcftools install:

```
which bcftools
# it should be in your environment bin/
~/.conda/envs/myEnv/bin/bcftools
# if it is in /software then type
module purge
# and check again
```

Before starting you also need to create a directory for HybPhaser outputs:

```bash=
mkdir ./(folder with hybpiper outputs)/hybphaser
#for me this was ./fastq/hybphaser
```

#### Step 1 - generate consensus sequences

The first step is to locate the [1_generate_consensus_sequence](https://github.com/LarsNauheimer/HybPhaser/blob/main/1_generate_consensus_sequences.sh) script in the HybPhaser GitHub. I opened `nano`, pasted the script in, and named it `generate_consensus_sequences.sh`.

:::spoiler HybPhaser's generate consensus sequences script
```bash=!
#!/bin/bash
############

set +e
set -u
set -o pipefail

CONTIGTYPE="normal"
THREADS=1
SAMPLENAME=""
ALLELE_FREQ=0.15
READ_DEPTH=10
ALLELE_COUNT=4
PIPERDIR="./"
OUTDIR="../HybPhaser"
NAMELIST="not a file"
CLEANUP="FALSE"

while getopts 'ict:s:a:f:d:p:o:n:' OPTION; do
  case "$OPTION" in
    i) CONTIGTYPE="ISC" ;;
    c) CLEANUP="TRUE" ;;
    t) THREADS=$OPTARG ;;
    s) SAMPLENAME=$OPTARG ;;
    f) ALLELE_FREQ=$OPTARG ;;
    a) ALLELE_COUNT=$OPTARG ;;
    d) READ_DEPTH=$OPTARG ;;
    p) PIPERDIR=$OPTARG ;;
    o) OUTDIR_BASE=$OPTARG ;;
    n) NAMELIST=$OPTARG ;;
    ?)
      echo "This script is used to generate consensus sequences by remapping reads to de novo contigs generated by HybPiper. It is part of the HybPhaser workflow and has to be executed first.
Usage: generate_consensus_sequences.sh [options] Options: General: -s <Name of sample> (if not providing a namelist) -n <Namelist> (txt file with sample names, one per line) -p <Path to HybPiper results folder> Default is current folder. -o <Path to output folder> (will be created, if it doesnt exist). Default is ../HybPhaser -t <Maximum number of threads used> Default is 1. (multiple threads so not speed up much) -i -intronerated: If set, intronerate_supercontigs are used in addition to normal contigs. -c -clean-up: If set, reads and mapping files are removed (.bam, .vcf.gz) Adjust consensus sequence generation: -d Minimum coverage on site to be regarded for assigning ambiguity code. If read depth on that site is lower than chosen value, the site is not used for ambiguity code but the most frequent base is returned. Default is 10. -f Minimum allele frequency regarded for assigning ambiguity code. If the alternative allele is less frequent than the chosen value, it is not included in ambiguity coding. Default is 0.15. -a Minimum count of alleles to be regarded for assigning ambiguity code. If the alternative allele ocurrs less often than the chosen value, it is not included in ambiguity coding. Default is 4. " >&2 exit 1 ;; esac done shift "$(($OPTIND -1))" if [[ -f $NAMELIST ]]; then SAMPLES=$(<$NAMELIST) elif [[ $SAMPLENAME != "" ]]; then SAMPLES=$SAMPLENAME else echo "No sample name given or namelist file does not exist!" fi for SAMPLE in $SAMPLES do SECONDS=0 #collecting reads and contigs from each sample echo Collecting read and contigs files for $SAMPLE OUTDIR=$OUTDIR_BASE"/01_data/" mkdir -p "$OUTDIR/$SAMPLE/reads" cp $PIPERDIR/$SAMPLE/*/*_unpaired.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null cp $PIPERDIR/$SAMPLE/*/*_interleaved.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null if compgen -G $OUTDIR/$SAMPLE/reads/*_interleaved.fasta > /dev/null; then for i in $OUTDIR/$SAMPLE/reads/*_interleaved.fasta do cat $i ${i/_interleaved.fasta/_unpaired.fasta} > ${i/_interleaved.fasta/_combined.fasta} 2> /dev/null rm $i ${i/_interleaved.fasta/_unpaired.fasta} 2> /dev/null done fi if [[ $CONTIGTYPE == "normal" ]]; then mkdir -p $OUTDIR/$SAMPLE/contigs cp $PIPERDIR/$SAMPLE/*/*/sequences/FNA/*.FNA $OUTDIR/$SAMPLE/contigs/ 2> /dev/null # adding gene name into the fasta files ">sample-gene" for i in $OUTDIR/$SAMPLE/contigs/*.FNA do FILE=${i/*\/contigs\//} GENE=${FILE/.FNA/} sed -i "s/\(>.*\)/\1-$GENE/" $i done # renaming .FNA to .fasta for consistency for f in $OUTDIR/$SAMPLE/contigs/*.FNA; do mv "$f" "${f/.FNA/.fasta}"; done CONTIGPATH="$OUTDIR/$SAMPLE/contigs" mkdir -p "$OUTDIR/$SAMPLE/consensus" else # when intronerate is selected mkdir -p $OUTDIR/$SAMPLE/intronerated_contigs cp $PIPERDIR/$SAMPLE/*/*/sequences/intron/*_supercontig.fasta $OUTDIR/$SAMPLE/intronerated_contigs/ 2> /dev/null for f in $OUTDIR/$SAMPLE/intronerated_contigs/*.* do mv "$f" "${f/_supercontig/_intronerated}" 2> /dev/null done CONTIGPATH="$OUTDIR/$SAMPLE/intronerated_contigs" mkdir -p "$OUTDIR/$SAMPLE/intronerated_consensus" fi mkdir -p "$OUTDIR/$SAMPLE/mapping_files" # start remapping reads to contigs for CONTIG in $CONTIGPATH/*.fasta do FILE=${CONTIG/*\/*contigs\//} GENE=${FILE/.fasta/} echo -e '\e[1A\e[K'Generating consensus sequences for $SAMPLE - $GENE if [[ $CONTIGTYPE == "normal" ]]; then CONSENSUS=$OUTDIR/$SAMPLE/consensus/$GENE".fasta" BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE".bam" VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE".vcf.gz" else GENE=${GENE/_intronerated/} 
CONSENSUS=$OUTDIR/$SAMPLE/intronerated_consensus/$GENE"_intronerated.fasta" BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.bam" VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.vcf.gz" fi # checking for paired-end reads if [[ -f $OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta" ]];then READS=$OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta" READTYPE="PE" else READS=$OUTDIR/$SAMPLE/reads/$GENE"_unpaired.fasta" READTYPE="SE" fi bwa index $CONTIG 2> /dev/null if [ "$READTYPE" = "SE" ];then bwa mem $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM else bwa mem -p $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM fi bcftools mpileup -I -Ov -f $CONTIG $BAM 2> /dev/null | bcftools call -mv -A -Oz -o $VCFZ 2> /dev/null bcftools index -f --threads $THREADS $VCFZ 2> /dev/null bcftools consensus -I -i "(DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $ALLELE_FREQ && (DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $READ_DEPTH && (DP4[2]+DP4[3]) >= $ALLELE_COUNT " -f $CONTIG $VCFZ 2> /dev/null | awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' > $CONSENSUS echo "" >> $CONSENSUS rm $CONTIG.* 2> /dev/null rm $VCFZ".csi" 2> /dev/null done # optional clean-up step: remove folders with mapping stats and reads if [[ $CLEANUP == "TRUE" ]]; then rmdir $OUTDIR/$SAMPLE/mapping_files/ --ignore-fail-on-non-empty rmdir $OUTDIR/$SAMPLE/reads --ignore-fail-on-non-empty fi DURATION_SAMPLE=$SECONDS echo -e '\e[1A\e[K'Generated consensus for $SAMPLE in $(($DURATION_SAMPLE / 60)) minutes and $(($DURATION_SAMPLE % 60)) seconds. done ``` ::: Here Nora uses a screen to run this script. This did not work well for me and/or was also slow. To use a screen with this, it works better to use a qnode and allocate yourself extra memory for it to work. To request a qnode :::spoiler Arguments #-A: is user #-p: is time partition length #-t: is time #--mem: is memory ::: ```bash= #an example salloc -A p32038 -p short -t 00:30:00--mem=10GB ``` This will come back with a qnodeXXXX Then log into qnode using ssh ``` bash= ssh qnodeXXXX ``` Then run script on on the qnode :::spoiler Arguments -n: path to file with sample names -p: path to directory with HybPiper outputs -o: path to directory you want hybphaser outputs put ::: ```bash= generate_consensus_sequences.sh -n ./namefile.txt -p ./fastq -o ./fastq/hybphaser ``` I don't like using screens so I ended up using a slurm enviornment for this and named it `slurm_geneerate_consensus_sequences.sh` :::danger I was not able to get generate_consensus_sequence.sh to work while useing path shortcuts when bash-ing or sbatch-ing. To fix this I had to write out the entire pathway. ::: ``` #!/bin/bash #SBATCH --account=p32038 #SBATCH --partition=normal #SBATCH --time=15:00:00 #SBATCH --mem=50G #SBATCH --nodes=1 #SBATCH --ntasks-per-node=4 #SBATCH --job-name=slurm_generate_consensus_sequence #SBATCH --output=slurm_outlog.log #SBATCH --error=slurm_errors.log #SBATCH --mail-type=ALL #SBATCH --mail-user=stephangirard2024@u.northwestern.edu bash generate_consensus_sequences.sh -n /projects/p32038/Fant44_02.22.2024/test/fastq/namefile.txt -p /projects/p32038/Fant44_02.22.2024/test/fastq/ -o /projects/p32038/Fant44_02.22.2024/test/fastq/hybphaser -t 4 ``` ```bash= sbatch slurm_generate_consensus_sequences.sh ``` #### Step 2 - Assessment of consensus sequences (count SNPs) ##### Preparation This step uses R within the terminal. 
You also need to download [config.txt](https://github.com/LarsNauheimer/HybPhaser/blob/main/config.txt) and [1a_count_snps.R](https://github.com/LarsNauheimer/HybPhaser/blob/main/1a_count_snps.R) from the GitHub.

To see if you have R, type:

```bash=
# Check what environments you have
conda env list

# if "R" is listed then
conda activate "name of env"

#if "R" is not listed then
conda install R

#if you need to create a new environment
conda create -n NameForNewEnv python=2.7.14 # Many Python versions are available
```

Once you have activated the environment:

```bash=
# Start R
R

# To quit
q()
```

`1a_count_snps.R` requires 3 packages; open R in the terminal and run:

```bash=
install.packages(c("ape", "seqinr", "stringr"))
```

Then you can quit R and save the environment.

:::info
The next step is to locate a file called `config.txt` in the HybPhaser folder. Open it and enter the necessary information in the `# General Settings` section.
:::

### Changes made

:::spoiler Changes made to Config
Details below.

- `path_to_output_folder`: set the path for the HybPhaser output folder ("/path/to/hybphaser_output_folder/")
- `fasta_file_with_targets`: direct to the file with target sequences (baits) ("/path/to/target_file.fasta")
- `target_file_format`: set whether the target file is in DNA ("DNA") or AA ("AA") format
- `path_to_namelist`: direct to a file that contains all sample names ("/path/to/namelist.txt")
- `intronerated_contig`: select whether HybPiper was run with intronerate to fetch the intronerated contigs ("yes"), or whether normal contigs should be used ("no")
- Here are the `# General Settings` I used:

```
path_to_output_folder = "/projects/p32038/Fant44_02.22.2024/test/fastq/hybphaser"
fasta_file_with_targets = "/projects/p32038/Fant44_02.22.2024/test/fastq/target.fa"
targets_file_format = "DNA"    # "DNA" or "AA"
path_to_namelist = "/projects/p32038/Fant44_02.22.2024/test/fastq/namefile.txt"
intronerated_contig = "no"
```
:::

Then run the R script:

```bash=
Rscript 1a_count_snps.R
```

This creates a subfolder called 00_R_objects.

#### Step 3 - Assessment of SNPs and dataset optimization

##### Preparation

This R script generates tables and graphs to assess the dataset in terms of the sequence length recovered and the proportion of SNPs per sample and gene.

### Changes to make in config.txt

Details below for parameters set in config.txt.

:::spoiler Config changes
- `name_for_dataset_optimization_subset = ""`

These values are the values _Nauheimer et al._ use in their paper:
- `remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2`
- `remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.45`
- `remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2`
- `remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered`

For the paralogs:
- `remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"`
- `file_with_putative_paralogs_to_remove_for_all_samples = ""`
- `remove_outlier_loci_for_each_sample = "no"`

These values should be carefully evaluated per dataset.
:::

### Run

To run the script type:

```bash=!
Rscript 1b_assess_dataset.R
```

#### Step 4 - Generate Sequence Lists

This R script generates a list of sequences for every locus and sample. These files are used for downstream analyses.

The script itself had quite a few occurrences of the same error. The Brighamia data did not have any loci that completely failed (I think I removed them earlier on due to poor coverage & issues with some of the scripts).
This R script expects the existence of an R object called failed loci, unsure of where/when in the script this object is created but it didn't exist for Brighamia data. An example of how I fixed this can be found in troubleshooting. Running the actual script is again as easy as ```bash= Rscript 1c_generate_sequence_lists.R ``` :::danger THE REST OF HYBPHASER IS DONE AFTER MAKING PHYLOGENIES!!! ::: ### Hybphaser 2 - Clade Association Hybrid accessions contain sequence reads from divergent parents, which can be phased, if the parents are sufficiently divergent. The clade association step of HybPhaser maps all reads onto multiple references representing divergent clades and thus identify hybrids with divergent parents. The software BBSplit (BBMap) is used to simultaneously map the reads to the multiple references and record the proportion of reads mapped unambiguously to each reference. All required variables are set in the configuration file config.txt. The R scripts will generate a command line script that executes the BBSplit analysis (2a) and summarize the resulting log files into tables (2b). :::warning I chose clade references after making supermatrix phylogeneis using aligned and trimmed consensus files. I made a tree for each genus seperately, chose clades from those trees, and selected one individuals to represent each clade. These individuals need to be equally spaced throughout the phylogeny, too closely related and it will result in a lot of ambiguous matches, too divergent and it will lead to reads not matching with either reference. Reference individuals also need to have good read covereage, low locus heterozygosity and low allele divergence. The number of references needed is seemingly arbitrary. Too many references and you likely won't catch distinct hybrid clade associations. Too many reference increases your change of catching hybrids with low allele divergence, but greately increases the risk of non-hybrids showing multiple clade associations. ::: #### Creating clade reference csv file Example of .csv file, Nora uses the sample name for both sample and abbreviation columns. | sample | abbreviation | | -------- | -------- | | MZ16 | Z_onan-reyeskii | | HE13 | E_humulis | #### 2_extract_mapped_reads.sh Read files vary in the proportion of reads that match the target sequences, which has impact on the proportion of reads mapping to the clade references. It is therefore recommended to only use reads for the clade association that mapped to the target sequences. :::spoiler 2_extract_mapped_reads.sh #!/bin/bash ############ set +e set -u set -o pipefail BASEDIR="./" OUTDIR="./mapped_reads/" DEDUPSEQ="FALSE" NAMELIST="not-a-file" while getopts 'b:o:sn:' OPTION; do case "$OPTION" in b) BASEDIR=$OPTARG ;; o) OUTDIR=$OPTARG ;; s) DEDUPSEQ="TRUE" ;; n) NAMELIST=$OPTARG ;; ?) echo "script usage: $(basename $0) [-n file with samples names (nameslist.txt)] [-b HybPiper base folder] [-o output folder]" echo " Usage: 2_extract_mapped_reads.sh [options] Options: -b <PATH> Base folder of either HybPhaser (contains '01_data/'), HybPiper (contains sample directories), or HybPiper-RBGV (contains '04_processed_gene_directories/'). Default is './' -o <PATH> Outpout folder to write read files into. Default is './mapped_reads/' -n <PATH> Namelist. optional, if not set, all folders in samples directory are used as samples. -s Select if duplicated sequences should be removed. 
" exit 1 ;; esac done shift "$(($OPTIND -1))" mkdir -p $OUTDIR if [[ -d $BASEDIR/01_data ]]; then SAMPLESDIR=$BASEDIR/01_data/ HYB="phaser" echo $HYB elif [[ -d $BASEDIR/04_processed_gene_directories/ ]]; then SAMPLESDIR=$BASEDIR/04_processed_gene_directories/ HYB="piper" else SAMPLESDIR=$BASEDIR HYB="piper" fi if [[ -f $NAMELIST ]]; then SAMPLES=$(<$NAMELIST) else SAMPLES=$(ls -d -1 $SAMPLESDIR/* | rev | cut -f 1 -d "/" | rev) fi for SAMPLE in $SAMPLES do echo Generating on-target reads files for $SAMPLE # concatenate reads (mapped per gene) per sample from either HybPhaser or HybPiper folders if [[ $HYB == "phaser" ]]; then cat $SAMPLESDIR/$SAMPLE/reads/*_unpaired.fasta $SAMPLESDIR/$SAMPLE/reads/*_combined.fasta > $OUTDIR/$SAMPLE"_mapped_reads.fasta" 2>/dev/null else cat $SAMPLESDIR/$SAMPLE/*/*_unpaired.fasta $SAMPLESDIR/$SAMPLE/*/*_interleaved.fasta > $OUTDIR/$SAMPLE"_mapped_reads.fasta" 2>/dev/null fi # The following pipline is based on Pierre Lindenbaum's script: https://www.biostars.org/p/3003/#3008 and composes of the following actions: # in every row that starts with '>' put '@' at the end of the row # replace '>' with '#' # remove the break lines, replace '#' with a breakline, replace '@' with a tab # sort the file according to the second column (sequences). the -u option keeps only one copy of each unique string. # add '>' at the begining of each line # sustitute tab (\t) with a breakline (\n) # remove the first line (it's a blank line) and save the result into $output if [[ $DEDUPSEQ == "TRUE" ]]; then echo Deduplicating reads files for $SAMPLE sed -e '/^>/s/$/@/' -e 's/^>/#/' $OUTDIR/$SAMPLE"_mapped_reads.fasta" |\ tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\ sort -u -t $'\t' -f -k 2,2 |\ sed -e 's/^/>/' -e 's/\t/\n/' |\ tail -n +2 > $OUTDIR/$SAMPLE"_mapped_reads_dedup-seq.fasta" rm $OUTDIR/$SAMPLE"_mapped_reads.fasta" 2>/dev/null else echo Keeping all reads for $SAMPLE # NOTE: if the following commented code is run on paired-end interleaved data, # which is the normal output from HybPhaser, one of the paired reads # will be dropped (they have identical names), which seems like non-ideal # behaviour (and different from what happens in removing duplicates above) #sed -e '/^>/s/$/@/' -e 's/^>/#/' $OUTDIR/$SAMPLE"_mapped_reads.fasta" |\ #tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\ #sort -u -t $'\t' -f -k 1,1 |\ #sed -e 's/^/>/' -e 's/\t/\n/' |\ #tail -n +2 > $OUTDIR/$SAMPLE"_mapped_reads_nodupnames.fasta" #mv $OUTDIR/$SAMPLE"_mapped_reads_nodupnames.fasta" $OUTDIR/$SAMPLE"_mapped_reads.fasta" fi done ::: Options: `-b` <pathway> Base folder of either HybPhaser (contains '01_data/'), HybPiper (contains sample directories), or HybPiper-RBGV (contains '04_processed_gene_directories/'). Default is './' `-o` <pathway> Output folder to write read files into. Default is './mapped_reads/' `-n` <pathway> Namelist. Optional, if not set, all folders in samples directory are used as samples. `-s` Set to remove duplicated sequences (not only sequences with duplicated names). I ran created a `mapped_reads` directory and ran ```bash= 2_extract_mapped_reads -b ~/hybphaser/01_data -o ~/hybphaser/mapped_reads ``` :::warning I tried running this as a slurm because I wasn't sure how long it would take. When I did this, the log showed no errors, but most of my mapped reads were empty, and the others were pretty small files. This didn't make sense with the assessment files showing a fair amount of target region coverage. So I ran it again without a slurm and it worked. 
Don't know if that was a coincidence or not. It took maybe 20 minutes or so to run 111 taxa.
:::

:::danger
For some reason, when trying to run this normally, it skips some files and doesn't fully map others. I found that using `chmod 777` on the script fixes this problem for me.
:::

:::success
FIX!!! You need to have the `-b` path be the entire hybphaser directory, not just the 01_data directory. For me this was /projects/p32038/combined_data/hybpiper_data/hybphaser. Previously I was trying -b ./01_data and it wouldn't work at all.
:::

#### 2a_prepare_bbsplit_script.R

**You need to install bbmap!** Install it with conda directly into your R conda environment!!! Otherwise you can't run the R script and the `bbsplit.sh` commands it creates at the same time.

There are a lot of changes that need to be made to the config file here.

:::spoiler config.txt changes
`path_to_clade_association_folder`: path to the clade association output folder (e.g. "/path/to/hybphaser_output/04_clade_association/")

`csv_file_with_clade_reference_names`: set the filename for the previously prepared text file with reference sample names (e.g. "/path/to/hybphaser_output/04_clade_association/clade_references.csv")

`path_to_reference_sequences`: set the folder that contains the sequences of samples to select the clade reference sequences from (e.g. "/path/to/hybphaser_output/03_sequence_lists/samples_consensus/")

`path_to_read_files_cladeassociation`: set the folder that contains the sequence read files (e.g. "/path/to/hybphaser_output/mapped_reads/")

`read_type_cladeassociation`: set whether reads are single or paired-end reads (e.g. "single-end", if you use the mapped-only reads)

:::warning
Here I used the mapped reads for `path_to_read_files_cladeassociation`, which are single read files for each individual, so I set this to "single-end" and left the ID_read_pair lines blank.

`no_of_threads`: set the number of threads/cores to be used for BBSplit (default is 1)

`run_clade_association_mapping_in_R`: select whether the script is run directly in R ("yes") or whether to run it manually from the command line using the generated script ("no")
:::

:::warning
I set no_of_threads to "auto" and used a Slurm script to control threads. I highly recommend using the maximum number of threads available, otherwise this takes a very long time to run. On regular Quest accounts the maximum number of threads is 28, set by the ntasks-per-node Slurm line.

Set run_clade_association_mapping_in_R to "yes": the R script creates a bash script that runs the bbsplit commands, and setting this to yes automatically runs them, which is why you need bbmap installed in your R conda environment.

*You need to create your own `04_clade_association` directory!! The scripts do not do this for you this time around.*
:::

Here is what mine looked like:

`path_to_clade_association_folder` = "~/hybphaser/04_clade_association"
`csv_file_with_clade_reference_names` = "~/hybphaser/clade_references1.csv"
`path_to_reference_sequences` = "~/hybphaser/03_sequence_lists/samples_consensus/"
`path_to_read_files_cladeassociation` = "~/hybphaser/mapped_reads"
`read_type_cladeassociation` = "single-end"
`ID_read_pair1` = ""
`ID_read_pair2` = ""
`file_with_samples_included` = ""
`path_to_bbmap` = ""
`no_of_threads_clade_association` = "auto"
`run_clade_association_mapping_in_R` = "yes"
`java_memory_usage_clade_association` = "" # e.g. 2G (for 2GB), needed when a java -Xmx error comes up; should be at most
85% of physical memory After these config changes are made you can run script 2a_prepare_bbsplit_scripts.R. Make sure that the path to the config file is in the script is correct before running. :::spoiler 2a_prepare_bbsplit_script.R ```bash! ########################################################### ### Preparation of BBSplit script for Clade association ### ########################################################### # load config if (!(exists("config_file"))) {config_file <- "./config.txt"} source(config_file) if(read_type_cladeassociation == "paired-end"){ read_files <- list.files(path_to_read_files_cladeassociation, full.names = F, pattern = paste("*",ID_read_pair1,".*", sep="") , include.dirs = F) } else if(read_type_cladeassociation == "single-end") { read_files <- list.files(path_to_read_files_cladeassociation, full.names = F, pattern = "*.*", include.dirs = F) ID_read_pair1 <- "" } else { print("ERROR READ TYPE NOT SELECTED") } if(file_with_samples_included == "" | file_with_samples_included == "none" ){ } else { samples_in <- readLines(file_with_samples_included) samples_in <- gsub("\\s*$","",samples_in) read_files_noID <- gsub(ID_read_pair1,"",read_files) read_files_noend <- gsub("[.]fast.*","",read_files_noID) read_files <- read_files[match(samples_in, read_files_noend)] } read_files <- na.exclude(read_files) folder_bbsplit_stats <- file.path(path_to_clade_association_folder,"bbsplit_stats/") dir.create(folder_bbsplit_stats, showWarnings = FALSE) sample_sequences <- list.files(path_to_reference_sequences) ref_samples <- read.csv(csv_file_with_clade_reference_names, header = T) colnames(ref_samples)[1] <- "samples" colnames(ref_samples)[2] <- "abb" ref_samples_files <- sample_sequences[match(ref_samples$samples,gsub("(_intronerated|_consensus|_contig).*","",sample_sequences))] if(length(ref_samples$abb)>0){ ref_command <- paste(paste("ref_",ref_samples[,2],"=", path_to_reference_sequences,"/",ref_samples_files, sep=""), collapse=" ") } else { ref_command <- paste("ref=",paste(path_to_reference_sequences,ref_samples_files, collapse = ",",sep=""),sep="") } # setting no of threads if(no_of_threads_clade_association == 0 || no_of_threads_clade_association == "auto") { threadtext <- "" } else { threadtext <- paste(" threads=",no_of_threads_clade_association,sep="") } # setting Java memory usage if(java_memory_usage_clade_association != ""){ caXmx <- paste(" -Xmx",java_memory_usage_clade_association, sep="") } else { caXmx <- "" } if(path_to_bbmap == ""){ bbsplit_sh <- "bbsplit.sh" } else { bbsplit_sh <- file.path(path_to_bbmap,"bbsplit.sh") } write("#!/bin/bash",file=file.path(path_to_clade_association_folder,"run_bbsplit4clade_association.sh"), append=F) for(i in 1:length(read_files)){ if(read_type_cladeassociation == "paired-end"){ bbsplit_command <- paste(bbsplit_sh," ",ref_command, " in=",file.path(path_to_read_files_cladeassociation,read_files[i]), " in2=",sub(ID_read_pair1,ID_read_pair2,file.path(path_to_read_files_cladeassociation,read_files[i])),threadtext," ambiguous=random ambiguous2=all refstats=",folder_bbsplit_stats, gsub(ID_read_pair1,"",read_files[i]),"_bbsplit-stats.txt", caXmx, sep="") } else { bbsplit_command <- paste(bbsplit_sh," ",ref_command, " in=",file.path(path_to_read_files_cladeassociation,read_files[i]),threadtext," ambiguous=random ambiguous2=all refstats=",folder_bbsplit_stats,read_files[i],"_bbsplit-stats.txt", caXmx, sep="") } write(bbsplit_command,file=file.path(path_to_clade_association_folder,"run_bbsplit4clade_association.sh"), append=T) } 
system(command = paste("chmod +x",file.path(path_to_clade_association_folder,"run_bbsplit4clade_association.sh"))) if(run_clade_association_mapping_in_R=="yes"){ system(command = file.path(path_to_clade_association_folder,"run_bbsplit4clade_association.sh")) } ``` ::: I recommend running this through a slurm. Because it is an R script I don't know how to run it as a slurm without creating a new script. ```bash! #!/bin/bash #SBATCH --account=p32038 #SBATCH --partition=normal #SBATCH --time=30:00:00 #SBATCH --mem=4G #SBATCH --nodes=1 #SBATCH --ntasks-per-node=28 #SBATCH --job-name=hyb_2a #SBATCH --error=hyb_2ad.log #SBATCH --mail-type=ALL #SBATCH --mail-user=stephangirard2024@u.northwestern.edu Rscript ./2a_prepare_bbsplit_script.R ``` This program using up to 2GB of mem, so setting it above 2 doesnt change anything. Setting threads to as high as possible makes all the difference here!!! #### 2b_collate_bbsplit_results.R This will generate summary tables based on the bbsplit files create in the last step. :::spoiler 2b_collate_bbsplit_results.R ```bash! ############################################################## ### Collection of info in log files from clade association ### ############################################################## # load config if (!(exists("config_file"))) {config_file <- "./config.txt"} source(config_file) # get names of stats files and from it get the name of samples folder_bbsplit_stats <- file.path(path_to_clade_association_folder,"bbsplit_stats") stats_files <- list.files(folder_bbsplit_stats, "*_bbsplit-stats.txt") sample <- gsub("(_mapped.*|.fasta.*|_bbsplit-stats.*)","",stats_files) # this removes everything after the first . in the name ref_samples <- read.csv(csv_file_with_clade_reference_names, header = T) colnames(ref_samples)[1] <- "sample" colnames(ref_samples)[2] <- "abbreviation" #set reference names (full names or abbreviations) if("abbreviation" %in% colnames(ref_samples)){ ref_names <- ref_samples$abbreviation } else { ref_names <- ref_samples$sample } # generate empty table (ref_names x samples) tab_clade_assoc <- matrix(nrow = length(ref_names), ncol = length(sample)) rownames(tab_clade_assoc) <- ref_names colnames(tab_clade_assoc) <- sample for(i in 1:length(stats_files)){ #i=1 stats_file <- stats_files[i] p_unamb <- read.delim(file.path(folder_bbsplit_stats, stats_file))[,1:2] tab_clade_assoc[,i] <- p_unamb$X.unambiguousReads[match(rownames(tab_clade_assoc), p_unamb$X.name)] } # The proportions of reads matching unambigously to a reference are sensitive to the genetic distance of references. # E.g. two clade references that are closer related to each other have lower proportions of reads matching to the reference because more reads match ambiguously between those references. # One way to get more comparable data is normalization of proportions, by dividing the proportion of reads matched to a reference by the results from the sample that was used as reference. # This assumes that the reads of the reference sequence matching to the same reference is the maximum possible and therefore set to 1. However, this can deviate due to differnces in sequencing quality, read coverage and other factors. 
tab_clade_assoc_norm <- tab_clade_assoc for(i in 1:length(rownames(tab_clade_assoc))){ if(length(grep(paste("\\b",ref_samples$sample[i],"\\b",sep=""),colnames(tab_clade_assoc)))>0){ tab_clade_assoc_norm[i,] <- tab_clade_assoc[i,] / tab_clade_assoc[i,grep(paste("\\b",ref_samples$sample[i],"\\b",sep=""),colnames(tab_clade_assoc))] } } match("3",c("1","2","22", "3")) ttab_clade_assoc <- as.data.frame(t(tab_clade_assoc)) ttab_clade_assoc$sample <- rownames(ttab_clade_assoc) ttab_clade_assoc_norm <- as.data.frame(t(tab_clade_assoc_norm)) ttab_clade_assoc_norm$sample <- rownames(ttab_clade_assoc_norm) # for better assessment, the summary table is added to the clade assessment results summary_het_ad_clean <- readRDS(file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset, "Summary_table.Rds")) tab_hetad_cladeasso <- merge(summary_het_ad_clean,ttab_clade_assoc, by="sample", incomparables = F) tab_hetad_cladeasso_norm <- merge(summary_het_ad_clean,ttab_clade_assoc_norm, by="sample") # save output write.csv(t(tab_clade_assoc), file=file.path(path_to_clade_association_folder,"Table_clade_association.csv")) write.csv(t(tab_clade_assoc_norm), file=file.path(path_to_clade_association_folder,"Table_clade_association_normalised.csv")) write.csv(tab_hetad_cladeasso, file=file.path(path_to_clade_association_folder,"Table_clade_association_and_summary_table.csv")) write.csv(tab_hetad_cladeasso_norm, file=file.path(path_to_clade_association_folder,"Table_clade_association_normalized_and_summary_table.csv")) ``` ::: :::danger I had a lot of trouble with this script. For me it isn't making output tables with a "samples" header, so when it tries to merge these summary files with the summary files made in 00_R_objects earlier in the pipeline it causes errors. I'm not good enough at R to fix that, so I got rid of the parts that merge the old summary table with the new summary tables and just looked at the two files seperately for further analysis. ::: Here is what the script I used looks like :::spoiler My 2b_collate_bbsplit_results.R script ```bash! ############################################################## ### Collection of info in log files from clade association ### ############################################################## # load config if (!(exists("config_file"))) {config_file <- "/projects/p32038/Fant44_02.22.2024/test/fastq/hybphaser/hybphaser2_cofig.txt"} source(config_file) # get names of stats files and from it get the name of samples folder_bbsplit_stats <- file.path(path_to_clade_association_folder,"bbsplit_stats") stats_files <- list.files(folder_bbsplit_stats, "*_bbsplit-stats.txt") sample <- gsub("(_mapped.*|.fasta.*|_bbsplit-stats.*)","",stats_files) # this removes everything after the first . 
in the name ref_samples <- read.csv(csv_file_with_clade_reference_names, header = T) colnames(ref_samples)[1] <- "sample" colnames(ref_samples)[2] <- "abbreviation" #set reference names (full names or abbreviations) if("abbreviation" %in% colnames(ref_samples)){ ref_names <- ref_samples$abbreviation } else { ref_names <- ref_samples$sample } # generate empty table (ref_names x samples) tab_clade_assoc <- matrix(nrow = length(ref_names), ncol = length(sample)) rownames(tab_clade_assoc) <- ref_names colnames(tab_clade_assoc) <- sample for(i in 1:length(stats_files)){ #i=1 stats_file <- stats_files[i] p_unamb <- read.delim(file.path(folder_bbsplit_stats, stats_file))[,1:2] tab_clade_assoc[,i] <- p_unamb$X.unambiguousReads[match(rownames(tab_clade_assoc), p_unamb$X.name)] } # The proportions of reads matching unambigously to a reference are sensitive to the genetic distance of references. # E.g. two clade references that are closer related to each other have lower proportions of reads matching to the reference because more reads match ambiguously between those references. # One way to get more comparable data is normalization of proportions, by dividing the proportion of reads matched to a reference by the results from the sample that was used as reference. # This assumes that the reads of the reference sequence matching to the same reference is the maximum possible and therefore set to 1. However, this can deviate due to differnces in sequencing quality, read coverage and other factors. tab_clade_assoc_norm <- tab_clade_assoc for(i in 1:length(rownames(tab_clade_assoc))){ if(length(grep(paste("\\b",ref_samples$sample[i],"\\b",sep=""),colnames(tab_clade_assoc)))>0){ tab_clade_assoc_norm[i,] <- tab_clade_assoc[i,] / tab_clade_assoc[i,grep(paste("\\b",ref_samples$sample[i],"\\b",sep=""),colnames(tab_clade_assoc))] } } match("3",c("1","2","22", "3")) ttab_clade_assoc <- as.data.frame(t(tab_clade_assoc)) ttab_clade_assoc$sample <- rownames(ttab_clade_assoc) ttab_clade_assoc_norm <- as.data.frame(t(tab_clade_assoc_norm)) ttab_clade_assoc_norm$sample <- rownames(ttab_clade_assoc_norm) # save output write.csv(t(tab_clade_assoc), file=file.path(path_to_clade_association_folder,"Table_clade_association.csv")) write.csv(t(tab_clade_assoc_norm), file=file.path(path_to_clade_association_folder,"Table_clade_association_normalised.csv")) ``` ::: Make sure to look at the normalized output file, which as all the numbers as percentages. The highest should be 1 at the reference individuals, but I noticed a few were higher than one, this may be some sort of ambiguity error. :::danger This is where I lose a little bit of faith in this process, it relies HEAVILY on the individuals you choose for clade references. ::: ## When to create a phylogeny Creating a phylogeny can be done straight from HybPiper outputs, or here after creating consensus sequences. According to the HybPhaser paper "We preferred to use consensus sequences over contigs, as they mask heterozygoius sites with ambiguity codes and therefore reduce conflicting phylogenetic signals from any hybhrids" To me, it makes sense to make the gene tree with the consensus files, a supermatrix with the supercontigs, and the species network with the supercontigs. Although the fern paper makes both their sets of trees and species networks with the consensus files, one pre-phasing and one post-phasing. Will need to look into this more. 
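Whichever sequence set you go with, the alignment steps in the next section expect a folder of per-locus fasta files. A quick sanity check before aligning (a sketch only, assuming the HybPhaser sequence lists were written to `03_sequence_lists/loci_consensus/` under the output folder used above; adjust the path to your project):

```bash=
# per-locus consensus sequence lists written by 1c_generate_sequence_lists.R (assumed layout)
cd /path/to/hybphaser/03_sequence_lists/loci_consensus
ls *.fasta | wc -l            # roughly one fasta per retained locus
grep -c ">" *.fasta | head    # sequences per locus; very low counts flag poorly recovered loci
```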
## Aligning and trimming consensus sequences to make phylogenies

This follows Johnathan's [Lobeliad protocol](https://hackmd.io/OK2ulfZsRXWNAQZXqXN-6A).

The length of sequence recovered at a locus is going to be different for each individual. These sequences need to be aligned before software like IQ-TREE can generate gene trees. I use [**MAFFT**](https://mafft.cbrc.jp/alignment/software/) to align each locus's fasta, using the `--auto` option. In a folder containing the retrieved sequences, use a for loop to run MAFFT on all loci. Before running, load MAFFT:

```bash!
module load mafft
```

Then run the Slurm script below. **Make sure you are in the loci_consensus folder.** This doesn't necessarily have to be run through Slurm; without Slurm it only takes a few minutes.

```bash!
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=00:30:00
#SBATCH --mem=5G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --job-name=coal_align
#SBATCH --error=align.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

###For loop to run MAFFT on unaligned loci fasta
for i in *.fasta; do
  mafft --auto --thread 12 ${i} > ${i%.*}_aligned.fasta;
done
```

NOTE: If aligning directly from the sequences retrieved by HybPiper, the files will be .FNA files, not .fasta files. To get the above script to work, change the loop to look for *.FNA files instead of *.fasta files.

When sequences are aligned, a lot of gaps are introduced, especially towards the beginnings/ends of sequences. This is because all sequences must end up the same length, which is at minimum the length of the longest sequence in the file. We can get rid of base pair positions that are mostly gaps using [**trimAl**](http://trimal.cgenomics.org/trimal). Use `-gt 0.5` to remove base pair positions that are 50% gaps. Note I installed trimAl in my conda environment instead of using modules on Quest.

Before running trimAl, you need to create a conda environment and install it. First create an environment:

```bash
conda create --name trimal
```

Then activate the environment:

```bash!
conda activate trimal
```

Then install trimAl:

```bash!
conda install trimal
```

Now you can run the script below:

```bash
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

for i in *.fasta; do
  trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```

## Generate gene tree/species tree from aligned and trimmed loci files

Once we have aligned gene sequences, we use [**IQTree**](http://www.iqtree.org/doc/) to generate gene trees that represent the evolutionary history of each gene. I've installed IQTree in my 'genomics' conda environment. In our approach, we let IQTree decide what substitution model is best and perform ultrafast bootstrapping to generate support values for branches.

This procedure takes a very long time, so we will submit a job to Quest **as an array**. Running an array of jobs allows us to run multiple jobs of the same type simultaneously. In essence, instead of running one iqtree job at a time, *we can run up to 9 simultaneously*.
The "base" IQTree command will look something like this:

```bash
iqtree -s <input file> -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
```

- `-nt AUTO` and `-ntmax 4`: decide the optimal number of threads to use, with a maximum of 4
- `-m TEST`: searches for the best substitution model before generating the gene tree
- `-bb 1000`: ultrafast bootstrap with 1000 replicates
- `-wbtl`: write branch lengths
- `-czb`: collapse near-zero branches
- see more about arguments [here](http://www.iqtree.org/doc/Command-Reference)

To generate a list of IQTree commands to run on all aligned gene files, we use a for loop again; just type this into the command line and it will create the script file:

```bash
for i in *.fasta; do
  echo iqtree -s $i -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
done > full_iqtree.sh
```

Because we want to run this job as an array, we need to split this full_iqtree.sh file into multiple files, each containing a subset of the IQTree commands. Use the split command; just type this into the command line:

```bash
split -a 1 -d -l 45 full_iqtree.sh iqtree --additional-suffix=.sh
```

:::spoiler Arguments for split
- `-a 1`: use a **one**-character suffix on the base name
- `-d`: use numeric suffixes instead of letters, starting from 0
- `-l 45`: split the original file into smaller files, each with at most **45 lines**
- `full_iqtree.sh`: the file to split
- `iqtree`: base name of the split files
- `--additional-suffix=.sh`: add a ".sh" to the end of the base name + suffix
:::

> If my original iqtree file has 403 iqtree commands, this will end up generating 9 subfiles of up to 45 lines each, called `iqtree0.sh iqtree1.sh iqtree2.sh ... iqtree8.sh`.

Next we create a simple text file called `input_args.txt` that contains the names of the **subfiles**, so that it looks like this:

```bash
iqtree0.sh
iqtree1.sh
iqtree2.sh
iqtree3.sh
iqtree4.sh
iqtree5.sh
iqtree6.sh
iqtree7.sh
iqtree8.sh
```

Before running the last script, you need to create an iqtree conda environment and install IQTree into it.

```bash
conda create --name iqtree
```

Then activate the environment:

```bash!
conda activate iqtree
```

Then install IQTree:

```bash!
conda install iqtree
```

Now we can write a job submission script for Quest. Note that we need to include a `#SBATCH --array=0-8` line to tell Quest we're running an array of 9 jobs. Here is what the submission script will look like:

```bash
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --array=0-8
#SBATCH --time=30:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics

IFS=$'\n' read -d '' -r -a input_args < input_args.txt

bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```

:::danger
I still need to figure out how to get gCF and sCF using IQtree
:::

#### IQTree Outputs

For each gene fasta, IQTree will generate many files. Ultimately, the file we are interested in is the `.treefile` file. I'd recommend generating a new folder and moving all the genetree files into this folder, so you don't have to work in a directory containing thousands of files:

```bash
mkdir ../genetrees
mv *.treefile ../genetrees
```

### Removing branches with low support

**Newick Utilities** is a collection of tools for phylogenetic trees.
It is a little convoluted to get this working because you can't install it with conda. First, go into the genetrees directory. Then use the git clone command to download the Newick Utilities files directly from their [GitHub](https://github.com/tjunier/newick_utils):

```bash!
git clone https://github.com/tjunier/newick_utils.git
```

Then go into the newly created newick_utils directory and type this command:

```bash!
autoreconf -fi
```

Then type the command:

```bash!
./configure
```

Then run `make` to build the tools (the compiled `nw_ed` ends up under `src/`). Then go into the src directory and type the command:

```bash!
./nw_ed
```

Now, in the nw_ed commands below, you need to type the entire pathway to the nw_ed command so it runs no matter which directory you are in.

We use the `nw_ed` tool to collapse branches in each gene tree that have very low support (<10%) with the following function:

```bash
/projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/genetrees/newick_utils/src/.libs/lt-nw_ed <input genetree> 'i & b<=10' o > <output genetree>
```

To run it as a loop over all of the gene tree files, use the following loop. This does not need to be submitted via Slurm on Quest because it is relatively quick:

```bash
for i in *.treefile; do
  /projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/genetrees/newick_utils/src/.libs/lt-nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```

### Using Astral to infer a species tree

[**Astral**](https://github.com/smirarab/ASTRAL) is an algorithm used to infer a species tree from gene trees. First, ASTRAL needs to be installed into a conda environment. Then all gene trees filtered through Newick Utilities are concatenated into a single file:

```bash
cat *_BS10.treefile > cycad_genetrees.tre
```

We can call Astral using the following syntax (with `-t 2` to record quartet support) in a Slurm job.

***Note*** I was unable to get this to work by calling just the astral.5.7.8.jar file; I had to find where it was located within the conda environment and write out the entire pathway to get it to work.

```bash
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_lob
#SBATCH --error=astral.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

java -jar /home/srg9103/.conda/envs/astral/ASTRAL/astral.5.7.8.jar -i cycad_genetrees.tre -t 2 -o cycad_sptre.tre 2>astral.log
```

This file, cycad_sptre.tre, is the finished species tree. It can be downloaded with Cyberduck or Globus and opened in tree-editing software such as [FigTree](http://tree.bio.ed.ac.uk/software/figtree/).

### Getting bootstrap support on Astral

[Astral GitHub](https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial-template.md)

I want to get bootstrap support on my ASTRAL species tree so I have the same type of support on both the coalescent tree and the supermatrix tree. The first thing you need to do is find all of the .ufboot files that were originally output by IQtree. You need to get the paths to each of these files into a single file. I used the `find` command:

```bash=
find <pathway to directory that has all the .ufboot files> -name '*.ufboot'
```

This will print out the paths to each of these files. Next, copy all of the paths, use nano to open an editor, and paste them into it.
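Alternatively, you can write the list straight into the file instead of copying and pasting; a small sketch, assuming you want the file named `bootstraps` as described below:

```bash=
# hypothetical path; point this at the folder(s) holding the IQ-TREE .ufboot files
find /path/to/aligned_loci_folder -name '*.ufboot' > bootstraps
wc -l bootstraps   # should roughly match the number of gene trees
```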
I saved it as `bootstraps`. DO NOT MAKE THIS A .txt FILE, KEEP IT WITHOUT A . ENDING. I made a new directory called astral_bootstraps to keep everything clean and put the `bootstraps` file in it.

Next you need the file of maximum likelihood gene trees that went into the first run of astral. Mine was called cycad_genetrees.tre. This will be the input tree file. I copied this file and moved it into my astral_bootstraps directory. These are the only two files you need.

To run this you just use a modified version of the previous astral command.
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=20:00:00
#SBATCH --mem=10G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_bootstraps
#SBATCH --error=astral_bootstraps.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

java -jar /home/srg9103/.conda/envs/astral/ASTRAL/astral.5.7.8.jar -i cycad_genetrees.tre -b bootstraps -o cycad_sptre_with_bootstraps.tre
```
Parameters:
`-i` is the input file that contains all the maximum likelihood gene trees
`-b` tells ASTRAL that bootstrapping needs to be performed. Following `-b` is the name of a file that contains the file paths of the gene tree bootstrap files, one line per gene. Thus, the file given to `-b` is not a file full of trees; it's a file full of paths of files full of trees.
`-o` is the name you want for the output file

This process takes a long time. For ~409 genes and 111 individuals it took about 11 hours.

## Getting gCF and sCF on IQ-Tree2
[Concordance Factor paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7475031/)

Gene concordance factors (gCF) and novel site concordance factors (sCF) are a way to measure genealogical concordance in phylogenetic datasets. gCF is defined as the percentage of decisive gene trees containing a given branch. sCF is defined as the percentage of decisive sites supporting a branch in the reference tree. Together these provide a fuller description of the underlying disagreement among loci and sites.

This is not necessarily something that needs to be done. Both the hybphaser paper and the fern hybrid paper support their trees with bootstraps, gCF, and sCF, so I want to stick to what appears to be the standard protocol for this type of study. [Here is the iqtree page I followed](http://www.iqtree.org/doc/Concordance-Factor).

#### Troubleshooting
I had a lot of trouble here. At first I tried to skip the first two sections, 'inferring species tree' and 'inferring gene/locus trees', because running iqtree initially gave me both a concatenated tree and a set of gene trees that, to me, seemed to be the same as the loci tree made on the concordance factor page. So I initially ran this command:
```bash=
iqtree2 -t cycad_sptre.tre --gcf cycad_genetrees.tre --prefix concord
```
This technically worked, but it gave gCF values that were insanely low, almost all 0's with a few nodes between 1-3. This does not make sense given the bootstraps show good support. So I tried starting from the top of the concordance factor page to see if the trees made in the species tree and loci tree sections make a difference.

### Inferring Species Tree
This is for an edge-linked proportional partition model. I am unsure how to make a partition nexus file, so I opted to use the command that lets you run on an entire directory of alignment files. Here the inputs are the retrieved hybpiper files that were run through BOTH mafft and trimal.
:::danger
You CANNOT have anything in the directory with the .fasta input files. Because it is scanning the entire directory, anything that is not a target input .fasta file will cause an error. ALSO, with the way this command works, doing a for loop using "for i in *.fasta" DOES NOT WORK as a workaround to having other files in the directory!! This means you HAVE TO have the slurm script in a different directory and have the entire pathway to the input directory written out inside the bash command!
:::
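A quick sanity check before submitting the job (my own suggestion, not part of the original workflow; replace the path with your own alignment directory): list anything in the directory that is not a .fasta file.
```bash
# should print nothing if the directory contains only the target .fasta alignments
ls /path/to/trimmed_fasta | grep -v '\.fasta$'
```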
The generic command for running this is:
```bash=
iqtree2 -p ALN_DIR --prefix concat -B 1000 -T AUTO
```
`-p` is the input directory that holds all your aligned and trimmed .fasta files.
`--prefix concat` makes output files in a concat.* format. If --prefix is omitted, all output files will be in a PARTITION_FILE.* format.
`-B` is the bootstrap support, here using 1000 replicates.
`-T AUTO` automatically decides how many threads to use.
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=10:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree2
#SBATCH --error=iqtree2.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

iqtree2 -p /projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/trimmed_fasta --prefix concat -B 1000 -T AUTO
```
This took about 5 hours to run and used 22.47% of the memory. Try changing memory to 2GB.

These are what the output files should be:

Analysis results written to:
* IQ-TREE report: concat.iqtree
* Maximum-likelihood tree: concat.treefile
* Likelihood distances: concat.mldist

Ultrafast bootstrap approximation results written to:
* Split support values: concat.splits.nex
* Consensus tree: concat.contree
* Screen log file: concat.log

From the outputs we want the `concat.treefile` file. This might be able to be used instead of the tree made earlier using John's old code.
:::danger
Make sure to check the logs!!!! I had warnings on both the species tree and the loci tree: "Warning: Estimated model parameters are at boundary that can cause numerical instability!" This is because K>n, which means that results may not be accurate. I am trying to fix this by adding `-m MFP` to the bash command, which runs a model finder to find a best fit model. Then it can be run again using `-m <best fit model>` to hopefully improve the scores and get rid of the warnings.
:::

### Inferring Loci Tree
This went much smoother after A LOT of troubleshooting in the previous section. Here all you do is change a few of the bash command parameters.
`-S` takes the place of -p in the previous command; it specifies the aligned .fasta input directory (AGAIN, THIS MUST HAVE NOTHING BUT TARGET .FASTA FILES!!).
`--prefix loci` gives all outputs a `loci.*` format, for ease of use.
`-T AUTO` automatically identifies threads.
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=30:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree2
#SBATCH --error=iqtree2.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

iqtree2 -S /projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/trimmed_fasta --prefix loci -T AUTO
```
This only ran for ~50 minutes, so next time change the time to 2-3 hours. Memory usage was only 14.8%, so drop this to 1-2GB.

The output files from this are:

Analysis results written to:
* IQ-TREE report: loci.iqtree
* Maximum-likelihood tree: loci.treefile
* Screen log file: loci.log

For the next step we only need the `loci.treefile` file.
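As a rough check that a tree was actually produced for every locus, you can count the trees in `loci.treefile` and compare that number to the number of alignments (my own suggestion; this assumes one newick tree per line and does not verify that every individual made it into every tree; the directory path is a placeholder):
```bash
# number of locus trees written by the -S run
wc -l < loci.treefile
# number of alignment files that went in
ls /path/to/trimmed_fasta/*.fasta | wc -l
```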
:::danger
My initial run of this seemingly worked. It worked for getting the gCF tree, but did not work when trying to get the bootstrap/gCF/sCF tree. For some reason, even though it's using the same input directory, it didn't include all individuals in the set of trees. Need to troubleshoot.
:::
:::danger
If you start this script and it makes output files, and you then want to rerun it, iqtree will flag the analysis as already done, and slurm will tell you it instantly failed to run. To rerun it you have to add an additional parameter.
Use `-redo` option if you really want to redo the analysis and overwrite all output files.
Use `--redo-tree` option if you want to restore ModelFinder and only redo tree search.
Use `--undo` option if you want to continue previous run when changing/adding options.
:::

### Gene Concordance Factor
:::danger
GO TO "Getting Bootstrap/gCF/sCF on single tree" SECTION FIRST!!!!
:::
:::danger
Try to cat the .treefiles from John's code and use that instead of loci.treefile
:::
gCF can be calculated using the concat.treefile and loci.treefile that were made in the previous two steps.
```bash=
iqtree2 -t concat.treefile --gcf loci.treefile --prefix concord
```
This only takes a moment to run, so there is no need for a slurm job. There will be three output files: `concord.cf.tree`, `concord.cf.branch`, and `concord.cf.stat`.

A note from the concordance factor paper about interpreting gCF values:

"gCF values tend to range from 0% to 100%...a gene tree can support not only the three possible relationships shown in figure 1 but also any other relationship in which one or more of clades A, B, C, or D are not monophyletic. The higher the number of gene trees in this latter group, the closer the gCF value will be to 0%. Because of this, **we should expect gCF values to be particularly low when gene trees are estimated from alignments with limited information or where branch x is extremely short; in such cases either technical or biological processes may increase the proportion of gene trees that fail to recover the monophyly of clades** A, B, C, or D found in the reference tree."
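If you want to scan the per-branch values directly on Quest rather than downloading the tree, the `concord.cf.stat` table can be viewed in the terminal (just a convenience, not part of the original workflow; this assumes the usual layout of a tab-separated table with a commented header):
```bash
# strip the comment header and pretty-print the per-branch concordance table
grep -v '^#' concord.cf.stat | column -t | less -S
```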
From [this paper](https://www.sciencedirect.com/science/article/pii/S105579032400006X):

"Low gCF and sCF scores suggest introgression via recent hybridization. This supports the idea that previously described hybrid swarms backcrossed with D. fumella in the past. Introgression can reduce differentiation between gene pools, explaining the lack of matching gene tree topologies despite strong support"

### Site Concordance Factor
sCF can be calculated from the concat.treefile made in step 1 and the directory of aligned and trimmed .fasta files. Here is the general bash command:
```bash=
iqtree2 -te concat.treefile -p ALN_DIR --scfl 100 --prefix concord
```
--scfl is only available in IQ-TREE 2.2.2 and newer; it is a "new and more accurate sCF based on likelihood, whereas the original sCF was based on parsimony".

This one needs to be run through slurm; it takes much longer than gCF.
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=10:00:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=sCF
#SBATCH --error=sCF.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

iqtree2 -te concat.treefile -p /projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/trimmed_fasta --scfl 100 --prefix concord
```
This only took about an hour to run.
The output files from this are:

Analysis results written to:
* IQ-TREE report: concord.iqtree
* Maximum-likelihood tree: concord.treefile
* Screen log file: concord.log

### Getting Bootstrap/gCF/sCF on single tree
So, doing the last few steps separately was unnecessary. Everything after step 2 (the step that makes loci.treefile) can be combined into one command. Here is the general bash command:
```bash=
iqtree2 -t concat.treefile --gcf loci.treefile -p ALN_DIR --scf 100 --prefix concord -T 10
```
I also added `-B 1000` to add the bootstrap support. Note that you have to use the older `--scf 100` parameter instead of the newer `--scfl` parameter when combining everything. I also ran this through slurm:
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=10:00:00
#SBATCH --mem=5G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=gcf_scf_iqtree2
#SBATCH --error=gcf_scf_iqtree2.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

iqtree2 -t concat.treefile --gcf loci.treefile -p /projects/p32038/Fant44_02.22.2024/test/fastq/retrieved_hybpiper_sequences/trimmed/trimmed_fasta --scf 100 --prefix final -B 1000 -T 4
```
I tried to use `-T 10` and change `--ntasks-per-node=10`, but I couldn't get this to work, so I dropped it to 4 arbitrarily. I also changed the `--prefix` parameter to 'final', otherwise it would have had the same output prefixes as a previous step.

## Supermatrix Tree
:::danger
Make sure to run through the first section of hybphaser first. This will tell you which reads are too poor for analysis, in which case, when you are retrieving contigs, you can take those individuals out of the namefile; otherwise their poor reads will negatively affect your tree.
:::
A concatenated supermatrix tree will help to give a fuller picture of phylogenetic relationships alongside a ML gene tree and a species network. Here I make a supermatrix using the supercontigs that were retrieved right after hybpiper assembly. The first step is the same as making a gene tree: the supercontig.fasta files need to be aligned and trimmed before running them through IQtree. All of the parameters for aligning with MAFFT and trimming with TrimAl are the same as for making gene trees. After the supercontig files are aligned and trimmed, they need to be concatenated.
:::danger
I ran into trouble here because I had a few loci with no reads; to fix this problem make sure that you trim all the files with TrimAl, which will also filter out any empty reads.
:::
To concatenate the files you will use the `fasta_merge.py` python script.
:::spoiler `fasta_merge.py`
#!/usr/bin/env python

helptext ='''This script will take a list of FASTA files and concatenate them for use in phylogenetic inference.
The sequence headers (up until the first space) must be identical in each individual FASTA file.
Individual gene sequences should be aligned prior to running this script!
This script requires BioPython to read/write FASTA sequences.'''

import os,sys,argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def read_sequences(fastafiles):
    '''Given a list of FASTA file names, read in each sequence to a dictionary of dictionaries, one per file'''
    return {filename:SeqIO.to_dict(SeqIO.parse(filename,'fasta')) for filename in fastafiles}

def get_unique_names(gene_dict):
    '''Given the dictionary of SeqRecord dictionaries, return a list of the unique sequence headers'''
    all_names = []
    for gene in gene_dict:
        all_names += list(gene_dict[gene].keys())
    return set(all_names)

def insert_sequences(gene_dict,unique_names):
    '''Given the dictionary of dictionaries, insert blank sequences if any are missing for a gene'''
    inserted_sequences = 0
    for gene in gene_dict:
        for name in unique_names:
            if name not in gene_dict[gene]:
                gene_length = len(next(iter(gene_dict[gene].values())))
                gene_dict[gene][name] = SeqRecord(Seq("-"*gene_length),id=name)
                inserted_sequences += 1
    sys.stderr.write("{} Empty sequences inserted across all genes.\n".format(inserted_sequences))
    return gene_dict

def concatenate_sequences(gene_dict,fastafiles,unique_names):
    '''Given a dictionary of dictionaries with complete sampling in each gene, write out concatenated sequences to stdout. Returns a list of partition lengths.'''
    new_seq_dict = {}
    partition_lengths = []
    for gene in fastafiles:
        for name in unique_names:
            try:
                new_seq_dict[name] += gene_dict[gene][name]
            except KeyError:
                new_seq_dict[name] = gene_dict[gene][name]
        partition_lengths.append(len(next(iter(gene_dict[gene].values()))))
    for final_seq in new_seq_dict:
        SeqIO.write(new_seq_dict[final_seq],sys.stdout,'fasta')
        final_seq_length = len(new_seq_dict[final_seq])
    sys.stderr.write("Final concatenated sequence length: {}\n".format(final_seq_length))
    return partition_lengths

def raxml_partition(fastafiles,partition_lengths,partition_type):
    '''Generate a raxml partition file for the given fastafiles. User specifies the partition type'''
    gene_start = 1
    partition_file = open("partition.raxml",'w')
    if partition_type == 'CODON':
        for g in range(len(fastafiles)):
            codon3_start = gene_start + 2
            codon3_end = gene_start + partition_lengths[g] - 1
            codon1_end = codon3_end - 2
            codon2_start = gene_start + 1
            codon2_end = codon3_end - 1
            partition_file.write("{},{}{}={}-{}\\3,{}-{}\\3\n".format("DNA",fastafiles[g],"12",gene_start,codon1_end,codon2_start,codon2_end))
            partition_file.write("{},{}{}={}-{}\\3\n".format("DNA",fastafiles[g],"3",codon3_start,codon3_end))
            gene_start = codon3_end + 1
    else:
        for g in range(len(fastafiles)):
            gene_end = gene_start + partition_lengths[g] - 1
            partition_file.write("{},{}={}-{}\n".format(partition_type,fastafiles[g],gene_start,gene_end))
            gene_start = gene_end + 1
    partition_file.close()

def main():
    parser = argparse.ArgumentParser(description=helptext,formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("--fastafiles",nargs='+',help="List of Fasta Files. Can use wildcard on Linux/Mac systems")
    parser.add_argument("--filelist",help="File containing list of Fasta files. Alternative to --fastalist")
    parser.add_argument("--raxml",help="Create a partition file 'partitions.raxml' intended for raxml in the current directory. For amino acid sequences, select the substitution model. To specify a separate model for 1st/2nd vs. 3rd codon positions, select CODON.", choices = ['DNA','WAG','JTT','CODON' ],default=None)
    if len(sys.argv) < 2:
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()

    if args.fastafiles:
        #print args.fastafiles
        if args.filelist:
            sys.stderr.write("Specify either a list of FASTA files or a file containing names, not both!\n")
            sys.exit(1)
        else:
            fastafiles = args.fastafiles
    elif args.filelist:
        #print args.filelist
        if os.path.isfile(args.filelist):
            fastafiles = [x.rstrip() for x in open(args.filelist)]
        else:
            sys.stderr.write("File containing list of FASTA files not found!")
            sys.exit(1)
    else:
        sys.stderr.write("You must specify the FASTA files as a list or in a file.\n")
        sys.exit(1)

    sys.stderr.write("{} FASTA files found.\n".format(len(fastafiles)))
    gene_dict = read_sequences(fastafiles)
    sys.stderr.write("All sequences read successfully.\n")
    unique_names = get_unique_names(gene_dict)
    sys.stderr.write("{} Unique names found. If you were expecting fewer sequences, check your IDs!\n".format(len(unique_names)))
    gaps_inserted = insert_sequences(gene_dict,unique_names)
    partition_lengths = concatenate_sequences(gaps_inserted,fastafiles,unique_names)

    if args.raxml:
        raxml_partition(fastafiles,partition_lengths,args.raxml)

if __name__ == "__main__":main()
:::

To run this you need to first activate your hybpiper conda environment. Then the normal way to run it is:
```bash=
python fasta_merge.py --fastafiles *.fasta > super.fasta
```
I couldn't get this to work as written. I either needed to create/upload the `fasta_merge.py` file within the directory holding the trimmed supercontig files, or I needed to write out the entire path to the `fasta_merge.py` file within the conda environment.

From here the concatenated super.fasta file will be run through IQtree.
:::danger
I still need to figure out how to get gCF and sCF using IQtree
:::
Don't run this as an array, it won't change anything. Just run the single iqtree command in a slurm job:
```bash=
#!/bin/bash
#SBATCH --account=p32038
#SBATCH --partition=normal
#SBATCH --time=30:00:00
#SBATCH --mem=10G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_supermatrix
#SBATCH --error=iqtree_supermatrix.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=stephangirard2024@u.northwestern.edu

iqtree -s super.fasta -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
```
This will give the following output files:

Analysis results written to:
* IQ-TREE report: super.fasta.iqtree
* Maximum-likelihood tree: super.fasta.treefile
* Likelihood distances: super.fasta.mldist

Ultrafast bootstrap approximation results written to:
* Split support values: super.fasta.splits.nex
* Consensus tree: super.fasta.contree
* UFBoot trees: super.fasta.ufboot
* Screen log file: super.fasta.log

The file used for the supermatrix tree is the consensus tree, super.fasta.contree. The split support values file super.fasta.splits.nex is a nexus file that will be used later with SplitsTree5 to create a species network. The supermatrix tree is not run through astral like the gene tree was; it is finished once this IQtree run completes.
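Before moving on to SplitsTree5, it can help to gather the two files you will actually use into one place for download with Cyberduck or Globus (just a housekeeping suggestion; the folder name is arbitrary):
```bash
# collect the consensus tree and the splits file that feeds the species network
mkdir ../supermatrix_results
cp super.fasta.contree super.fasta.splits.nex ../supermatrix_results/
```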