---
title: HybPhaser Protocol
tags: HybSeq
---
# HYBPHASER
HybPhaser Version 2.0 was developed to deal with hybrids (and polyploids) in target capture datasets. It detects hybrids by measuring heterozygosity in the dataset and phases hybrid accessions by separating reads according to their similarity to selected taxa that represent the parental clades.
HybPhaser is built as an extension to the assembly pipeline HybPiper.
## New notes after running HybPhaser v2 on Quest in 2023:
When running 1_generate_consensus_sequences.sh via slurm, be sure to ```module load samtools``` and ```module load bwa``` in the slurm script. Do not module load bcftools (see below).
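A minimal slurm script might look like the sketch below (account, resources, and paths are placeholders; adjust them to your allocation and data):
```bash=
#!/bin/bash
#SBATCH --account=<your_allocation>
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load samtools
module load bwa
# do NOT module load bcftools; activate the conda environment
# that has bcftools v1.9 instead (see below)

./generate_consensus_sequences.sh -n namelist.txt -p ./hybpiper_results -o ./hybphaser -t 4
```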
Note that the bcftools installed on Quest is v1.18, but this script requires the older v1.9. Confusingly, the flag -I, which tells *bcftools consensus* to use IUPAC codes, is no longer accepted in v1.18, which uses an -i flag for that instead; in v1.9, however, -i is the flag for filtering the vcf! To solve this issue, you'll have to install bcftools v1.9 in your conda environment. Do not ```module load bcftools``` in the environment or in the slurm script, because this will load the newer version.
```
conda activate myEnv
# bcftools is distributed via the bioconda channel
conda install -c bioconda bcftools=1.9
```
Make sure that you are using the correct bcftools install:
```
which bcftools
# it should be in your environment bin/
~/.conda/envs/myEnv/bin/bcftools
# if it is in /software then type
module purge
# and check again
```
Check your consensus sequences to make sure that (a) the fastas are populated and (b) there are IUPAC ambiguity codes in the sequences. This HybPhaser script suppresses error messages, so it can be hard to tell if bwa, samtools, or bcftools is not working properly.
Ambiguity codes are Y, R, S, K, etc. Check to see if there are any in the consensus sequences:
```
grep -B1 -e "Y" -e "R" -e "S" -e "K" -e "W" -e "M" ./01_data/sample_name/intronerated_consensus/*
```
There should only be a few ambiguity codes in each sequence.
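For a quick per-file count instead of eyeballing the matches, something like this works (a sketch; the path assumes the intronerated run, so adjust it for the normal consensus folder):
```bash=
# count IUPAC ambiguity sites in each consensus fasta (header lines excluded)
for f in ./01_data/sample_name/intronerated_consensus/*.fasta; do
  echo -n "$f: "
  grep -v "^>" "$f" | grep -o "[YRSKWMBDHVyrskwmbdhv]" | wc -l
done
```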
Have not yet looked at the rest of these notes...
*end note 11/29/23 Nora*
## Step 1: Run HybPiper
Information can be found [here](https://hackmd.io/@CBG-genetics-team/r1mCsb5n_)
## Step 2: Go to the wiki and download the shell script
The HybPhaser wiki can be found [here](https://github.com/LarsNauheimer/HybPhaser)
Locate the 1_generate_consensus_sequences [code](https://github.com/LarsNauheimer/HybPhaser/blob/main/1_generate_consensus_sequences.sh).
Copy the code into a text editor and save the file.
:bulb: I called it generate_consensus_sequence.sh
Move the file onto the server.
Make a HybPhaser directory:
```bash=
mkdir ./Brighamia_Paralog/hybphaser
```
## Step 3: Run the generate consensus shell script
### 1_generate_consensus_sequences.sh file
:::spoiler HybPhaser's script
``` bash=!
#!/bin/bash
############
set +e
set -u
set -o pipefail
CONTIGTYPE="normal"
THREADS=1
SAMPLENAME=""
ALLELE_FREQ=0.15
READ_DEPTH=10
ALLELE_COUNT=4
PIPERDIR="./"
OUTDIR_BASE="../HybPhaser"
NAMELIST="not a file"
CLEANUP="FALSE"
while getopts 'ict:s:a:f:d:p:o:n:' OPTION; do
case "$OPTION" in
i)
CONTIGTYPE="ISC"
;;
c)
CLEANUP="TRUE"
;;
t)
THREADS=$OPTARG
;;
s)
SAMPLENAME=$OPTARG
;;
f)
ALLELE_FREQ=$OPTARG
;;
a)
ALLELE_COUNT=$OPTARG
;;
d)
READ_DEPTH=$OPTARG
;;
p)
PIPERDIR=$OPTARG
;;
o)
OUTDIR_BASE=$OPTARG
;;
n)
NAMELIST=$OPTARG
;;
?)
echo "This script is used to generate consensus sequences by remapping reads to de novo contigs generated by HybPiper. It is part of the HybPhaser workflow and has to be executed first.
Usage: generate_consensus_sequences.sh [options]
Options:
General:
-s <Name of sample> (if not providing a namelist)
-n <Namelist> (txt file with sample names, one per line)
-p <Path to HybPiper results folder> Default is current folder.
-o <Path to output folder> (will be created if it doesn't exist). Default is ../HybPhaser
-t <Maximum number of threads used> Default is 1. (multiple threads do not speed up much)
-i -intronerated: If set, intronerate_supercontigs are used in addition to normal contigs.
-c -clean-up: If set, reads and mapping files are removed (.bam, .vcf.gz)
Adjust consensus sequence generation:
-d Minimum coverage on site to be regarded for assigning ambiguity code.
If read depth on that site is lower than chosen value, the site is not used for ambiguity code but the most frequent base is returned. Default is 10.
-f Minimum allele frequency regarded for assigning ambiguity code.
If the alternative allele is less frequent than the chosen value, it is not included in ambiguity coding. Default is 0.15.
-a Minimum count of alleles to be regarded for assigning ambiguity code.
If the alternative allele occurs less often than the chosen value, it is not included in ambiguity coding. Default is 4.
" >&2
exit 1
;;
esac
done
shift "$(($OPTIND -1))"
if [[ -f $NAMELIST ]]; then
SAMPLES=$(<$NAMELIST)
elif [[ $SAMPLENAME != "" ]]; then
SAMPLES=$SAMPLENAME
else
echo "No sample name given or namelist file does not exist!"
exit 1
fi
for SAMPLE in $SAMPLES
do
SECONDS=0
#collecting reads and contigs from each sample
echo Collecting read and contigs files for $SAMPLE
OUTDIR=$OUTDIR_BASE"/01_data/"
mkdir -p "$OUTDIR/$SAMPLE/reads"
cp $PIPERDIR/$SAMPLE/*/*_unpaired.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null
cp $PIPERDIR/$SAMPLE/*/*_interleaved.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null
if compgen -G $OUTDIR/$SAMPLE/reads/*_interleaved.fasta > /dev/null; then
for i in $OUTDIR/$SAMPLE/reads/*_interleaved.fasta
do
cat $i ${i/_interleaved.fasta/_unpaired.fasta} > ${i/_interleaved.fasta/_combined.fasta} 2> /dev/null
rm $i ${i/_interleaved.fasta/_unpaired.fasta} 2> /dev/null
done
fi
if [[ $CONTIGTYPE == "normal" ]]; then
mkdir -p $OUTDIR/$SAMPLE/contigs
cp $PIPERDIR/$SAMPLE/*/*/sequences/FNA/*.FNA $OUTDIR/$SAMPLE/contigs/ 2> /dev/null
# adding gene name into the fasta files ">sample-gene"
for i in $OUTDIR/$SAMPLE/contigs/*.FNA
do
FILE=${i/*\/contigs\//}
GENE=${FILE/.FNA/}
sed -i "s/\(>.*\)/\1-$GENE/" $i
done
# renaming .FNA to .fasta for consistency
for f in $OUTDIR/$SAMPLE/contigs/*.FNA; do mv "$f" "${f/.FNA/.fasta}"; done
CONTIGPATH="$OUTDIR/$SAMPLE/contigs"
mkdir -p "$OUTDIR/$SAMPLE/consensus"
else
# when intronerate is selected
mkdir -p $OUTDIR/$SAMPLE/intronerated_contigs
cp $PIPERDIR/$SAMPLE/*/*/sequences/intron/*_supercontig.fasta $OUTDIR/$SAMPLE/intronerated_contigs/ 2> /dev/null
for f in $OUTDIR/$SAMPLE/intronerated_contigs/*.*
do
mv "$f" "${f/_supercontig/_intronerated}" 2> /dev/null
done
CONTIGPATH="$OUTDIR/$SAMPLE/intronerated_contigs"
mkdir -p "$OUTDIR/$SAMPLE/intronerated_consensus"
fi
mkdir -p "$OUTDIR/$SAMPLE/mapping_files"
# start remapping reads to contigs
for CONTIG in $CONTIGPATH/*.fasta
do
FILE=${CONTIG/*\/*contigs\//}
GENE=${FILE/.fasta/}
echo -e '\e[1A\e[K'Generating consensus sequences for $SAMPLE - $GENE
if [[ $CONTIGTYPE == "normal" ]]; then
CONSENSUS=$OUTDIR/$SAMPLE/consensus/$GENE".fasta"
BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE".bam"
VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE".vcf.gz"
else
GENE=${GENE/_intronerated/}
CONSENSUS=$OUTDIR/$SAMPLE/intronerated_consensus/$GENE"_intronerated.fasta"
BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.bam"
VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.vcf.gz"
fi
# checking for paired-end reads
if [[ -f $OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta" ]];then
READS=$OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta"
READTYPE="PE"
else
READS=$OUTDIR/$SAMPLE/reads/$GENE"_unpaired.fasta"
READTYPE="SE"
fi
bwa index $CONTIG 2> /dev/null
if [ "$READTYPE" = "SE" ];then
bwa mem $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM
else
bwa mem -p $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM
fi
bcftools mpileup -I -Ov -f $CONTIG $BAM 2> /dev/null | bcftools call -mv -A -Oz -o $VCFZ 2> /dev/null
bcftools index -f --threads $THREADS $VCFZ 2> /dev/null
bcftools consensus -I -i "(DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $ALLELE_FREQ && (DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $READ_DEPTH && (DP4[2]+DP4[3]) >= $ALLELE_COUNT " -f $CONTIG $VCFZ 2> /dev/null | awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' > $CONSENSUS
echo "" >> $CONSENSUS
rm $CONTIG.* 2> /dev/null
rm $VCFZ".csi" 2> /dev/null
done
# optional clean-up step: remove folders with mapping stats and reads
if [[ $CLEANUP == "TRUE" ]]; then
rmdir $OUTDIR/$SAMPLE/mapping_files/ --ignore-fail-on-non-empty
rmdir $OUTDIR/$SAMPLE/reads --ignore-fail-on-non-empty
fi
DURATION_SAMPLE=$SECONDS
echo -e '\e[1A\e[K'Generated consensus for $SAMPLE in $(($DURATION_SAMPLE / 60)) minutes and $(($DURATION_SAMPLE % 60)) seconds.
done
```
:::
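Make the script executable first; running it inside a screen session keeps it alive if your connection drops:
```bash=
chmod +x generate_consensus_sequences.sh
```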
```bash=
screen -S HybPhaser
./generate_consensus_sequences.sh -n ./Brighamia_Paralog/name_file.txt -p ./Brighamia_Paralog/ -o ./Brighamia_Paralog/hybphaser -t 4
```
:::danger
Some users have had errors with this script finding the out directory. Using absolute path names, for example -o /home/ngavinsmyth/Noras_Sequences/HybPhaser_output_folder instead of ../HybPhaser_output_folder, can be a solution, as in the sketch below.
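One way is to resolve the paths up front with `realpath` (using the same example folders as above):
```bash=
./generate_consensus_sequences.sh \
  -n "$(realpath ./Brighamia_Paralog/name_file.txt)" \
  -p "$(realpath ./Brighamia_Paralog)" \
  -o "$(realpath ./Brighamia_Paralog)/hybphaser" \
  -t 4
```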
**Error:**
```
About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
       --sample (and, optionally, --haplotype) option will apply genotype
       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth
       information, such as INFO/AD or FORMAT/AD.
Usage:   bcftools consensus [OPTIONS] <file.vcf>
Options:
    -f, --fasta-ref <file>     reference sequence in fasta format
    -H, --haplotype <1|2>      apply variants for the given haplotype
    -i, --iupac-codes          output variants in the form of IUPAC ambiguity codes
    -m, --mask <file>          replace regions with N
    -o, --output <file>        write output to a file [standard output]
    -c, --chain <file>         write a chain file for liftover
    -s, --sample <name>        apply variants of the given sample
Examples:
   # Get the consensus for one region. The fasta header lines are then expected in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

Generating consensus sequences for BRIN-001 - 2084634_BI149
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
consensus: invalid option -- 'I'
```
:::
:::info
:bulb: SOLUTION: Check dependencies
```bcftools --version```
In this case the installed bcftools was too old and needed updating:
```sudo apt-get install bcftools```
(But see the 2023 notes at the top of this page: on Quest the opposite applies, and bcftools v1.9 must be installed via conda.)
You might need to update R in conda. See **[website](https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/)**
:::
## Step 4: Run the 1a_count_snps.R script
### Preparation
This next step uses R in the terminal. You will need to download the config.txt and the 1a_count_snps.R script.
To see if you have R, type:
```bash=
# Check what environments you have
conda env list
# if an environment with R is listed, then
conda activate "name of env"
# if R is not installed, install it into the active environment
conda install -c conda-forge r-base
# or, if needed, create a new environment that includes R
conda create -n myREnv -c conda-forge r-base
```
Once you have activated the environment:
```bash=
# Start R
R
# To quit
q()
```
Make sure the necessary packages are installed (the script loads ape, seqinr, and stringr):
```r=
install.packages("ape")
install.packages("seqinr")
install.packages("stringr")
```
The R script 1a_count_snps.R generates a table with the proportion of SNPs in each locus and sample, as well as a table of recovered sequence lengths.
:::info
:bulb:The first step is to locate a file called `config.txt` in the HybPhaser folder. Open it and enter the necessary information in the `# General Settings` section.
:::
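The script falls back to `./config.txt` only when the `config_file` variable is not already set, so you can point it at a config saved elsewhere without editing the script (the path is an example):
```bash=
Rscript -e 'config_file <- "/path/to/my_config.txt"; source("1a_count_snps.R")'
```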
### Changes made
:::spoiler Changes made to Config Details below
- `path_to_output_folder`: set the path for the HybPhaser output folder ("/path/to/hybphaser_output_folder/")
- `fasta_file_with_targets`: direct to the file with target sequences (baits) ("/path/to/target_file.fasta")
- `targets_file_format`: set whether the target file is in DNA ("DNA") or AA ("AA") format
- `path_to_namelist`: direct to a file that contains all sample names ("/path/to/namelist.txt")
- `intronerated_contig`: select whether HybPiper was run with intronerate to fetch the intronerated contigs ("yes"), or whether normal contigs should be used ("no")
- Here are the `# General Settings` I used:
```
path_to_output_folder = "/home/jz/Desktop/brighamia_scbaits/HybPhaserOutputs"
fasta_file_with_targets = "/home/jz/Desktop/brighamia_scbaits/supercontig_baits.fasta"
targets_file_format = "DNA"
path_to_namelist = "/home/jz/Desktop/brighamia_scbaits/name_file2.txt"
intronerated_contig = "no"
```
### Nora's example "Config.txt"
``` r= !
######################################################
### Configuration File for all HybPhaser R scripts ###
######################################################
# General settings
path_to_output_folder = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2"
fasta_file_with_targets = "./Angiosperms353_targetSequences.fasta"
targets_file_format = "DNA" # "DNA" or "AA"
path_to_namelist = "./names_list_luk.txt"
intronerated_contig = "no"
###############################
### Part 1: SNP Assessment ###
###############################
name_for_dataset_optimization_subset = ""
# missing data
remove_samples_with_less_than_this_propotion_of_loci_recovered = .2
remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = .1
remove_loci_with_less_than_this_propotion_of_samples_recovered = .2
remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered = .1
##lowered filters for target sequence length recovered because we find it less important that our sequences match the targets well, more important that we get good proportion of sequences recovered.
# Paralogs
remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers" # any number between 0 and 1, "none" or "outliers"
file_with_putative_paralogs_to_remove_for_all_samples = ""
remove_outlier_loci_for_each_sample = "yes"
##################################
### Part 2: Clade Association ###
##################################
# set variables to determine thresholds for the dataset optimization and run the script from the main script
path_to_clade_association_folder = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/04_clade_association"
csv_file_with_clade_reference_names = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/03_sequence_lists/trimmed_loci_consensus_unique_alleles/4471_ref.csv"
path_to_reference_sequences = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/03_sequence_lists/samples_consensus"
path_to_read_files_cladeassociation = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/mapped_reads"
read_type_cladeassociation = "single-end"
ID_read_pair1 = ""
ID_read_pair2 = ""
file_with_samples_included = ""
path_to_bbmap = "/opt/apps/bbmap/"
no_of_threads_clade_association = "auto"
run_clade_association_mapping_in_R = "no"
java_memory_usage_clade_association = "" # e.g. 2G (for 2GB) needed when java -Xmx error comes up. should be max. 85% of physical memory
#######################
### Part 3: Phasing ###
#######################
# set variables and direct the scripts to the right folders and files before running the single scripts (from inside this file)
path_to_phasing_folder = ""
csv_file_with_phasing_prep_info = ""
path_to_read_files_phasing = ""
read_type_4phasing = "paired-end"
ID_read_pair1 = "_R1.fastq.gz" # e.g. "_R1.fastq.gz"
ID_read_pair2 = "_R2.fastq.gz" # e.g. "_R2.fastq.gz"
reference_sequence_folder = ""
folder_for_phased_reads = ""
folder_for_phasing_stats = ""
path_to_bbmap_executables = ""
no_of_threads_phasing = "auto"
java_memory_usage_phasing = "" # e.g. 2G (for 2GB) needed when java -Xmx error comes up. should be max. 85% of physical memory
run_bash_script_in_R = "no"
################################
### Part 4: Merging Datasets ###
################################
# This script is to generate sequence lists that combine the phased sequences with the normal non-phased ones (but exclude the normal ones of the phased accessions)
# It is also possible to make subsets of samples or loci using lists of included or excluded samples/loci
path_to_sequence_lists_normal = ""
path_to_sequence_lists_phased = ""
path_of_sequence_lists_output = ""
path_to_namelist_normal = ""
path_to_namelist_phased = ""
# to generate subsets one can chose to define samples that are either included or excluded
file_with_samples_included = ""
file_with_samples_excluded = ""
# Similarly one can choose to exclude loci or use a list with only the included loci.
file_with_loci_excluded = ""
file_with_loci_included = ""
exchange_phased_with_not_phased_samples = "yes" # "yes" or "no"
include_phased_seqlists_when_non_phased_locus_absent = "no" # "yes" or "no"
```
:::
### Examples of the 1a_count_snps.R files
:::spoiler HybPhaser script
``` r= !
##################
### SNPs count ###
##################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
library(seqinr)
library(stringr)
# generate folder
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
dir.create(output_Robjects, showWarnings = F, recursive = T)
#####################################################################################
### Counting polymorphic sites (masked as ambiguity codes in consensus sequences) ###
#####################################################################################
# getting targets and sample names
if(targets_file_format == "AA"){
targets <- read.fasta(fasta_file_with_targets, seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
} else if(targets_file_format == "DNA"){
targets <- read.fasta(fasta_file_with_targets, seqtype = "DNA", as.string = TRUE, set.attributes = FALSE)
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
targets_name <- unique(gsub(".*-","",labels(targets)))
samples <- readLines(path_to_namelist)
if(intronerated_contig=="yes"){
intronerated_name <- "intronerated"
intronerated_underscore <- "_"
} else {
intronerated_name <- ""
intronerated_underscore <- ""
}
# function for counting ambiguities
seq_stats <- function(file){
fasta <- read.fasta(file, as.string=TRUE, set.attributes = FALSE )
seq <- gsub("N|[?]|-","",fasta[[1]])
c(round(str_length(seq),0),(str_count(seq,"Y|K|R|S|M|y|k|r|s|m|w") + str_count(seq,"W|D|H|B|V|w|d|h|b|v")*2) )
}
# generate empty tables for SNPS and sequence length
tab_snps <- data.frame(loci=targets_name)
tab_length <- data.frame(loci=targets_name)
# fill tables with information on snps and sequence length for each sample and locus
start_time <- Sys.time()
for(sample in samples){
#sample <- samples[1]
print(paste(sample," ", round(difftime(Sys.time(),start_time, units="secs"),0),"s", sep="") )
tab_sample <- data.frame(targets=targets_name, seq_length=NA, ambis=NA, ambi_prop=NA)
consensus_files <- list.files(file.path(path_to_output_folder,"01_data",sample, paste(intronerated_name,intronerated_underscore,"consensus", sep="")),full.names = T)
for(consensus_file in consensus_files) {
gene <- gsub("(_intronerated|).fasta","",gsub(".*/","",consensus_file))
file.path(path_to_output_folder,"01_data",sample, "consensus", paste(gene,intronerated_underscore,intronerated_name,".fasta",sep=""))
if(file.info(consensus_file)$size !=0){
stats <- seq_stats(consensus_file)
} else {
stats <- c(NA,NA)
}
tab_sample$seq_length[grep(gene,tab_sample$targets)] <- stats[1]
tab_sample$ambis[grep(gene,tab_sample$targets)] <- stats[2]
}
tab_sample$ambi_prop <- tab_sample$ambis/tab_sample$seq_length
tab_snps[match(sample,samples)] <- tab_sample$ambi_prop
tab_length[match(sample,samples)] <- tab_sample$seq_length
}
colnames(tab_snps) <- samples
rownames(tab_snps) <- targets_name
colnames(tab_length) <- samples
rownames(tab_length) <- targets_name
### Generate output tables and save data in Robjects
saveRDS(tab_snps,file=file.path(output_Robjects,"Table_SNPs.Rds"))
saveRDS(tab_length,file=file.path(output_Robjects,"Table_consensus_length.Rds"))
```
:::
:::spoiler Nora's script
Identical to the HybPhaser script above (no changes were made for this step).
:::
### Run script
In terminal, you can now run the R script.
``` bash= !
Rscript 1a_count_snps.R
```
This will produce the output folder 00_R_objects, containing R objects with the consensus sequence lengths and SNP proportions for samples and loci.
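To sanity-check the result, the saved tables can be inspected from the command line (the path assumes the default output structure with an empty subset name; adjust it to your path_to_output_folder):
```bash=
# rows are loci, columns are samples
Rscript -e 'tab <- readRDS("HybPhaserOutputs/00_R_objects/Table_SNPs.Rds"); print(dim(tab))'
```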
:::danger
**Issue with 1_generate_consensus_sequence**: (Pretty sure this is a logical error in `generate_consensus_sequence` that causes an actual error in `1a_count_snps.R`) I think if `reads_first.py` doesn't generate a contig for any bait in an individual, `1_generate_consensus_sequences` is unable to copy over files (it interprets the lack of a file as an error) and skips to the next individual. This (1) potentially leads to lost data and (2) causes downstream problems when future scripts expect folders/files to be there based on the original namelist. Current solution: update the namelist by removing individuals that have no contigs for a bait.
:::
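One way to build that filtered namelist is to keep only samples with at least one contig file, using the same glob the script itself uses (a sketch; set the paths to your own HybPiper results and namelist):
```bash=
PIPERDIR=./Brighamia_Paralog   # example path to the HybPiper results folder
> namelist_filtered.txt
while read -r SAMPLE; do
  if compgen -G "$PIPERDIR/$SAMPLE/*/*/sequences/FNA/*.FNA" > /dev/null; then
    echo "$SAMPLE" >> namelist_filtered.txt
  fi
done < name_file.txt
```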
## Step 5: 1b_assess_dataset.R
### Preparation
This R script generates tables and graphs to assess the dataset with regard to recovered sequence length and the proportion of SNPs per sample and gene.
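As with the previous step, it reads its settings from config.txt and is run from the terminal:
```bash=
Rscript 1b_assess_dataset.R
```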
### Examples of the 1b_assess_dataset.R files
:::spoiler HybPhaser script
``` r= !
#####################################################################################
### Generating files for assessment of missing data, paralogs and heterozygosity ###
#####################################################################################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
# generate folders
if(name_for_dataset_optimization_subset != ""){
folder_subset_add <- paste("_",name_for_dataset_optimization_subset, sep="")
}else {
folder_subset_add <- ""
}
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
output_assess <- file.path(path_to_output_folder,paste("02_assessment", folder_subset_add, sep=""))
dir.create(output_assess, showWarnings = F)
dir.create(output_Robjects, showWarnings = F)
# check if the SNPs count for this subset has been done; if not, use the SNPs count of the first run (subset with name "")
if(file.exists(file=file.path(output_Robjects,"Table_SNPs.Rds"))){
tab_snps <- readRDS(file=file.path(output_Robjects,"Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(output_Robjects,"Table_consensus_length.Rds"))
} else {
tab_snps <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_consensus_length.Rds"))
}
tab_snps <- as.matrix(tab_snps)
loci <- t(tab_snps)
##########################################################
### Dataset optimization step 1: Reducing missing data ###
##########################################################
nloci <- length(colnames(loci))
nsamples <- length(rownames(loci))
failed_loci <- which(colSums(is.na(loci))==nrow(loci))
failed_samples <- which(colSums(is.na(tab_snps))==nrow(tab_snps))
# per locus
seq_per_locus <- vector()
for(i in 1:nloci){
seq_per_locus[i] <- length(which(!(is.na(loci[,i]))))
}
names(seq_per_locus) <- colnames(loci)
seq_per_locus_prop <- seq_per_locus/nsamples
# per sample
seq_per_sample <- vector()
for(i in 1:length(colnames(tab_snps))){
seq_per_sample[i] <- length(which(!(is.na(tab_snps[,i]))))
}
names(seq_per_sample) <- colnames(tab_snps)
seq_per_sample_prop <- seq_per_sample/nloci
# proportion of target sequence length
if(targets_file_format == "AA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets, type = "AA"))*3
} else if(targets_file_format == "DNA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets))
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
gene_names <- unique(gsub(".*-","",gsub(" .*","",names(targets_length_all))))
max_target_length <- vector()
for(i in 1:length(gene_names)){
max_target_length[i] <- max(targets_length_all[grep(paste("\\b",gene_names[i],"\\b",sep=""),names(targets_length_all))])
}
names(max_target_length) <- gene_names
comb_target_length <- sum(max_target_length)
comb_seq_length_samples <- colSums(tab_length, na.rm = T)
prop_target_length_per_sample <- comb_seq_length_samples/comb_target_length
mean_seq_length_loci <- rowMeans(tab_length, na.rm = T)
names(mean_seq_length_loci) <- gene_names
prop_target_length_per_locus <- mean_seq_length_loci/max_target_length
# application of thresholds
outsamples_missing_loci <- seq_per_sample_prop[which(seq_per_sample_prop < remove_samples_with_less_than_this_propotion_of_loci_recovered)]
outsamples_missing_target <- prop_target_length_per_sample[which(prop_target_length_per_sample < remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outsamples_missing <- unique(names(c(outsamples_missing_loci,outsamples_missing_target)))
outloci_missing_samples <- seq_per_locus_prop[which(seq_per_locus_prop < remove_loci_with_less_than_this_propotion_of_samples_recovered)]
outloci_missing_target <- prop_target_length_per_locus[which(prop_target_length_per_locus < remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outloci_missing <- unique(names(c(outloci_missing_samples,outloci_missing_target)))
# removing bad loci and samples from the table
tab_snps_cl1 <- tab_snps
if(length(outsamples_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[,-which(colnames(tab_snps) %in% outsamples_missing)]
}
if(length(outloci_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[-which(rownames(tab_snps) %in% outloci_missing),]
}
loci_cl1 <- t(tab_snps_cl1)
# output
# graphics
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"1_Data_recovered_overview.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"1_Data_recovered_overview.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
par(mfrow=c(2,3))
boxplot(seq_per_sample_prop, main=paste("Samples: prop. of",nloci,"loci recovered"), xlab=paste("mean:",round(mean(seq_per_sample_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_sample_prop, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_loci_recovered," (",length(outsamples_missing_loci)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
boxplot(prop_target_length_per_sample, main=paste("Samples: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_sample, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_sample, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outsamples_missing_target)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_sample,seq_per_sample_prop, main = "Prop. of loci vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of loci" )
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
abline(v=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
boxplot(seq_per_locus_prop, main=paste("Loci: prop. of",nsamples,"samples recovered"), xlab=paste("mean:",round(mean(seq_per_locus_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_locus_prop, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_samples_recovered," (",length(outloci_missing_samples)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
boxplot(prop_target_length_per_locus, main=paste("Loci: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_locus, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_locus, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outloci_missing_target)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_locus,seq_per_locus_prop, main = "Prop. of samples vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of samples" )
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
abline(v=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
dev.off()
}
# tables
tab_seq_per_sample <- cbind(seq_per_sample,round(seq_per_sample_prop,3), round(prop_target_length_per_sample,3))
colnames(tab_seq_per_sample) <- c("No. loci", "Prop. of loci", "Prop. of target length")
write.csv(tab_seq_per_sample, file.path(output_assess, "1_Data_recovered_per_sample.csv"))
tab_seq_per_locus <- cbind(seq_per_locus,round(seq_per_locus_prop,3), round(prop_target_length_per_locus,3))
colnames(tab_seq_per_locus) <- c("No. samples", "Prop.of samples", "Prop. of target length")
write.csv(tab_seq_per_locus, file.path(output_assess, "1_Data_recovered_per_locus.csv"))
# summary text file
summary_file=file.path(output_assess,"1_Summary_missing_data.txt")
cat(file=summary_file, append = FALSE, "Dataset optimisation: Samples and loci removed to reduce missing data\n")
cat(file=summary_file, append = T, "\n", length(failed_samples)," samples failed completely:\n", paste(names(failed_samples)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_loci)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_loci_recovered,") for proportion of recovered loci:\n", paste(names(outsamples_missing_loci),"\t",round(outsamples_missing_loci,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_target)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered,") for recovered target sequence length\n", paste(names(outsamples_missing_target),"\t",round(outsamples_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outsamples_missing), " samples were removed:\n", paste(outsamples_missing,"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(failed_loci)," loci failed completely:\n", paste(names(failed_loci)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_samples)," loci are below the threshold (", remove_loci_with_less_than_this_propotion_of_samples_recovered,") for proportion of recovered samples:\n", paste(names(outloci_missing_samples),"\t",round(outloci_missing_samples,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_target)," loci are below the threshold (",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered,") for proportion of recovered target sequence length:\n", paste(names(outloci_missing_target),"\t",round(outloci_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outloci_missing), " loci were removed:\n", paste(outloci_missing,"\n"), sep="")
############################################################################################
### Dataset optimization step 2, removing paralogs for a) all samples and b) each sample ###
############################################################################################
### 2a) Paralogs across multiple samples (removing loci with unusually high proportions of SNPs across all samples)
###################################################################################################################
loci_cl1_colmeans <- colMeans(as.matrix(loci_cl1), na.rm = T)
nloci_cl1 <- length(colnames(loci_cl1))
nsamples_cl1 <- length(colnames(tab_snps_cl1))
loci_cl1_colmeans_mean <- round(mean(loci_cl1_colmeans),4)
loci_cl1_colmeans_median <- round(median(loci_cl1_colmeans),4)
# applying chosen threshold
if (length(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs) != 1 || remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "none"){
outloci_para_all <- vector()
threshold_value <- 1
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "outliers"){
threshold_value <- 1.5*IQR(loci_cl1_colmeans, na.rm = TRUE )+quantile(loci_cl1_colmeans, na.rm = TRUE )[4]
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "file"){
if(file.exists(file_with_putative_paralogs_to_remove_for_all_samples) == FALSE){
print("File with list of paralogs to remove for all samples does not exist.")
} else {
outloci_para_all <- readLines(file_with_putative_paralogs_to_remove_for_all_samples)
outloci_para_all_values <-loci_cl1_colmeans[which(names(loci_cl1_colmeans) %in% outloci_para_all)]
}
} else {
threshold_value <- remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
}
# color outliers red
colour_outparaall <- rep("black",nloci_cl1)
colour_outparaall[which(colnames(loci_cl1[,order(loci_cl1_colmeans)]) %in% outloci_para_all)] <- "red"
loci_cl1_order_means <- loci_cl1[,order(loci_cl1_colmeans)]
# generate bar graph
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2a_Paralogs_for_all_samples.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"2a_Paralogs_for_all_samples.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
layout(matrix(c(1,2),2,2, byrow=TRUE), widths=c(5,1))
barplot(sort(loci_cl1_colmeans), col=colour_outparaall, border = NA, las=2,
main=paste("Mean % SNPs across samples (n=",nsamples_cl1,") for each locus (n=", nloci_cl1,")", sep=""))
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
boxplot(loci_cl1_colmeans, las=2)
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
dev.off()
}
# removing marked loci from table
if(length(outloci_para_all)==0) {tab_snps_cl2a <- tab_snps_cl1
} else { tab_snps_cl2a <- tab_snps_cl1[-which(rownames(tab_snps_cl1) %in% outloci_para_all),]}
### 2b) Paralogs for each sample (removing outlier loci for each sample)
##########################################################################
tab_snps_cl2b <- tab_snps_cl2a
if(!exists("remove_outlier_loci_for_each_sample")) {remove_outlier_loci_for_each_sample <- "no"}
# generate tables without zeros to count only loci with SNPs
tab_snps_cl2a_nozero <- tab_snps_cl2a
tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero==0)] <- NA
tab_snps_cl2b_nozero <- tab_snps_cl2a_nozero
if(remove_outlier_loci_for_each_sample == "yes" ){
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
threshold_para_each <- outloci_para_each
for(i in 1:length(colnames(tab_snps_cl2a))){
threshold_i <- 1.5*IQR(tab_snps_cl2a_nozero[,i], na.rm = TRUE ) + quantile(tab_snps_cl2a_nozero[,i] , na.rm = TRUE)[[4]]
outlier_loci_i <- tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero[,i] > threshold_i),i]
outloci_para_each[i] <- list(outlier_loci_i)
threshold_para_each[i] <- threshold_i
tab_snps_cl2b[which(rownames(tab_snps_cl2a) %in% outlier_loci_i),i] <- NA
tab_snps_cl2b_nozero[which(rownames(tab_snps_cl2b_nozero) %in% names(outlier_loci_i)),i] <- NA
outliers_color <- "red"
}
} else {
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
tab_snps_cl2b <- tab_snps_cl2a
outliers_color <- "black"
}
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
### output
# generate graphic
for (i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2b_Paralogs_for_each_sample.pdf"), width = 10, h=14)
} else {
png(file=file.path(output_assess,"2b_Paralogs_for_each_sample.png"), width = 1000, h=1400)
}
par(mfrow=c(1,1))
boxplot(as.data.frame(tab_snps_cl2a_nozero[,order(colMeans(as.matrix(tab_snps_cl2a_nozero), na.rm = T))]),
horizontal=T, las=1, yaxt='n',
main="Proportions of SNPs for all loci per sample\n(only loci with any SNPs)",
ylab="Samples",
xlab="Proportion of SNPs",
col="grey", pars=list(outcol=outliers_color),
outpch=20
)
dev.off()
}
# write summary text file
cl2b_file <- file.path(output_assess,"2_Summary_Paralogs.txt")
cat(file=cl2b_file,"Removal of putative paralog loci.")
cat(file=cl2b_file,"Paralogs removed for all samples:\n", append = T)
cat(file=cl2b_file, paste("Variable 'remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs' set to: ", remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs,"\n", sep=""), append=T)
if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="file"){
cat(file=cl2b_file, paste("Loci listed in this file were removed: '", file_with_putative_paralogs_to_remove_for_all_samples,"'\n"), append=T)
} else if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="none"){
cat(file=cl2b_file,"None!\n", append = T)
} else {
cat(file=cl2b_file, paste("Resulting threshold value (mean proportion of SNPs):", round(threshold_value,5),"\n"), append=T)
}
if(length(outloci_para_all)>0){
cat(file=cl2b_file, paste(length(outloci_para_all)," loci were removed:\n",sep=""), append=T)
cat(file=cl2b_file, "locus\tmean_prop_SNPs\n", append=T)
cat(file=cl2b_file, paste(paste(outloci_para_all, round(outloci_para_all_values,4), sep="\t"), collapse="\n"), append=T)
cat(file=cl2b_file, "\n\n", append = T)
}
cat(file= file.path(output_assess,"2a_List_of_paralogs_removed_for_all_samples.txt"), paste(outloci_para_all,collapse = "\n"))
if(remove_outlier_loci_for_each_sample=="yes"){
cat(file=cl2b_file, "Paralogs removed for each sample:\n", append=T)
cat(file=cl2b_file, "Sample\tthreshold\t#removed\tnames\n", append=T)
for(i in 1:length(names(outloci_para_each))){
cat(file=cl2b_file, names(outloci_para_each)[i],"\t", append=T)
cat(file=cl2b_file, round(threshold_para_each[[i]],5), length(outloci_para_each[[i]]), paste(names(outloci_para_each[[i]]),collapse=", "), sep="\t", append=T)
cat(file=cl2b_file, "\n", append=T)
}
} else{
cat(file=cl2b_file, "The step for removing paralogs for each samples was skipped.\n", append=T)
}
# tables
write.csv(tab_snps_cl2b, file = file.path(output_assess,"0_Table_SNPs.csv"))
write.csv(tab_length_cl2b, file = file.path(output_assess,"0_Table_consensus_length.csv"))
# txt file with included samples
write(samples[which(!(samples %in% outsamples_missing))], file=file.path(output_assess,"0_namelist_included_samples.txt"))
#write(paste(rownames(tab_snps_cl2a),colMeans(t(tab_snps_cl2a), na.rm = T)), file=file.path(output_assess,"Mean_SNPs_loci.txt"))
### save Data as R objects
saveRDS(tab_snps_cl2b,file=file.path(output_Robjects,"Table_SNPs_cleaned.Rds"))
saveRDS(tab_length_cl2b,file=file.path(output_Robjects,"Table_consensus_length_cleaned.Rds"))
saveRDS(outloci_missing,file=file.path(output_Robjects,"outloci_missing.Rds"))
saveRDS(outsamples_missing,file=file.path(output_Robjects,"outsamples_missing.Rds"))
saveRDS(outloci_para_all,file=file.path(output_Robjects,"outloci_para_all.Rds"))
saveRDS(outloci_para_each,file=file.path(output_Robjects,"outloci_para_each.Rds"))
#############################################################################################################
### Generating summary table and graphs for assessment of Locus heterozygosity and allele divergence of samples ###
#############################################################################################################
tab_length <- as.matrix(tab_length)
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
targets_length_cl2b <- sum(max_target_length[which(gsub(".*-","",names(max_target_length)) %in% rownames(tab_snps_cl2b))])
########## generating summary table
nloci_cl2 <- length(tab_snps_cl2b[,1])
tab_het_ad <- data.frame("sample"=colnames(tab_snps_cl2b))
for(i in 1:length(colnames(tab_snps_cl2b))){
tab_het_ad$bp[i] <- sum(tab_length_cl2b[,i], na.rm = T)
tab_het_ad$bpoftarget[i] <- round(sum(tab_length_cl2b[,i], na.rm = T)/targets_length_cl2b,3)*100
tab_het_ad$paralogs_all[i] <- length(outloci_para_all)
tab_het_ad$paralogs_each[i] <- length(outloci_para_each[[i]])
tab_het_ad$nloci[i] <- nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))
tab_het_ad$allele_divergence[i] <- 100*round(sum(tab_length_cl2b[,i] * tab_snps_cl2b[,i], na.rm = T) / sum(tab_length_cl2b[,i], na.rm = T),5)
tab_het_ad$locus_heterozygosity[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]==0))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >0.5% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.005))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >1% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.01))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >2% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.02))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
}
# output as csv file
write.csv(tab_het_ad, file = file.path(output_assess, "4_Summary_table.csv"))
#output as R-object
saveRDS(tab_het_ad, file = file.path(output_Robjects, "Summary_table.Rds"))
### Generating graphs
text_size_mod <- 1
nrows <- length(tab_het_ad[,1])
text_size <- (15+200/nrows)*text_size_mod
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_LH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_LH_vs_AD.png"), h=1000,w=1000)
}
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity [%]", main="Locus heterozygosity vs allele divergence", las=1)
dev.off()
}
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_varLH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_varLH_vs_AD.png"), h=1000,w=1000)
}
par(mfrow=c(2,2))
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (0% SNPs) [%]", main="Locus heterozygosity (0% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >0.5% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>0.5% SNPs) [%]", main="Locus heterozygosity (>0.5% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >1% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>1% SNPs) [%]", main="Locus heterozygosity (>1% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >2% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>2% SNPs) [%]", main="Locus heterozygosity (>2% SNPs) vs allele divergence",las=1
)
par(mfrow=c(1,1))
dev.off()
}
```
:::
:::spoiler Nora's script
``` r= !
#####################################################################################
### Generating files for assessment of missing data, paralogs and heterozygosity ###
#####################################################################################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
# generate folders
if(name_for_dataset_optimization_subset != ""){
folder_subset_add <- paste("_",name_for_dataset_optimization_subset, sep="")
}else {
folder_subset_add <- ""
}
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
output_assess <- file.path(path_to_output_folder,paste("02_assessment", folder_subset_add, sep=""))
dir.create(output_assess, showWarnings = F)
dir.create(output_Robjects, showWarnings = F)
# check if the SNPs count for this subset has been done; if not, use the SNPs count of the first run (subset with name "")
if(file.exists(file=file.path(output_Robjects,"Table_SNPs.Rds"))){
tab_snps <- readRDS(file=file.path(output_Robjects,"Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(output_Robjects,"Table_consensus_length.Rds"))
} else {
tab_snps <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_consensus_length.Rds"))
}
tab_snps <- as.matrix(tab_snps)
loci <- t(tab_snps)
##########################################################
### Dataset optimization step 1: Reducing missing data ###
##########################################################
nloci <- length(colnames(loci))
nsamples <- length(rownames(loci))
failed_loci <- which(colSums(is.na(loci))==nrow(loci))
failed_samples <- which(colSums(is.na(tab_snps))==nrow(tab_snps))
# per locus
seq_per_locus <- vector()
for(i in 1:nloci){
seq_per_locus[i] <- length(which(!(is.na(loci[,i]))))
}
names(seq_per_locus) <- colnames(loci)
seq_per_locus_prop <- seq_per_locus/nsamples
# per sample
seq_per_sample <- vector()
for(i in 1:length(colnames(tab_snps))){
seq_per_sample[i] <- length(which(!(is.na(tab_snps[,i]))))
}
names(seq_per_sample) <- colnames(tab_snps)
seq_per_sample_prop <- seq_per_sample/nloci
# proportion of target sequence length
if(targets_file_format == "AA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets, type = "AA"))*3
} else if(targets_file_format == "DNA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets))
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
gene_names <- unique(gsub(".*-","",gsub(" .*","",names(targets_length_all))))
max_target_length <- vector()
for(i in 1:length(gene_names)){
max_target_length[i] <- max(targets_length_all[grep(paste("\\b",gene_names[i],"\\b",sep=""),names(targets_length_all))])
}
names(max_target_length) <- gene_names
comb_target_length <- sum(max_target_length)
comb_seq_length_samples <- colSums(tab_length, na.rm = T)
prop_target_length_per_sample <- comb_seq_length_samples/comb_target_length
mean_seq_length_loci <- rowMeans(tab_length, na.rm = T)
names(mean_seq_length_loci) <- gene_names
prop_target_length_per_locus <- mean_seq_length_loci/max_target_length
# application of thresholds
outsamples_missing_loci <- seq_per_sample_prop[which(seq_per_sample_prop < remove_samples_with_less_than_this_propotion_of_loci_recovered)]
outsamples_missing_target <- prop_target_length_per_sample[which(prop_target_length_per_sample < remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outsamples_missing <- unique(names(c(outsamples_missing_loci,outsamples_missing_target)))
outloci_missing_samples <- seq_per_locus_prop[which(seq_per_locus_prop < remove_loci_with_less_than_this_propotion_of_samples_recovered)]
outloci_missing_target <- prop_target_length_per_locus[which(prop_target_length_per_locus < remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outloci_missing <- unique(names(c(outloci_missing_samples,outloci_missing_target)))
# removing bad loci and samples from the table
tab_snps_cl1 <- tab_snps
if(length(outsamples_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[,-which(colnames(tab_snps) %in% outsamples_missing)]
}
if(length(outloci_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[-which(rownames(tab_snps) %in% outloci_missing),]
}
loci_cl1 <- t(tab_snps_cl1)
# output
# graphics
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"1_Data_recovered_overview.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"1_Data_recovered_overview.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
par(mfrow=c(2,3))
boxplot(seq_per_sample_prop, main=paste("Samples: prop. of",nloci,"loci recovered"), xlab=paste("mean:",round(mean(seq_per_sample_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_sample_prop, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_loci_recovered," (",length(outsamples_missing_loci)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
boxplot(prop_target_length_per_sample, main=paste("Samples: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_sample, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_sample, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outsamples_missing_target)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_sample,seq_per_sample_prop, main = "Prop. of loci vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of loci" )
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
abline(v=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
boxplot(seq_per_locus_prop, main=paste("Loci: prop. of",nsamples,"samples recovered"), xlab=paste("mean:",round(mean(seq_per_locus_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_locus_prop, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_samples_recovered," (",length(outloci_missing_samples)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
boxplot(prop_target_length_per_locus, main=paste("Loci: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_locus, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_locus, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outloci_missing_target)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_locus,seq_per_locus_prop, main = "Prop. of samples vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of samples" )
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
abline(v=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
dev.off()
}
# tables
tab_seq_per_sample <- cbind(seq_per_sample,round(seq_per_sample_prop,3), round(prop_target_length_per_sample,3))
colnames(tab_seq_per_sample) <- c("No. loci", "Prop. of loci", "Prop. of target length")
write.csv(tab_seq_per_sample, file.path(output_assess, "1_Data_recovered_per_sample.csv"))
tab_seq_per_locus <- cbind(seq_per_locus,round(seq_per_locus_prop,3), round(prop_target_length_per_locus,3))
colnames(tab_seq_per_locus) <- c("No. samples", "Prop.of samples", "Prop. of target length")
write.csv(tab_seq_per_locus, file.path(output_assess, "1_Data_recovered_per_locus.csv"))
# summary text file
summary_file=file.path(output_assess,"1_Summary_missing_data.txt")
cat(file=summary_file, append = FALSE, "Dataset optimisation: Samples and loci removed to reduce missing data\n")
cat(file=summary_file, append = T, "\n", length(failed_samples)," samples failed completely:\n", paste(names(failed_samples)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_loci)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_loci_recovered,") for proportion of recovered loci:\n", paste(names(outsamples_missing_loci),"\t",round(outsamples_missing_loci,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_target)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered,") for recovered target sequence length\n", paste(names(outsamples_missing_target),"\t",round(outsamples_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outsamples_missing), " samples were removed:\n", paste(outsamples_missing,"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(failed_loci)," loci failed completely:\n", paste(names(failed_loci)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_samples)," loci are below the threshold (", remove_loci_with_less_than_this_propotion_of_samples_recovered,") for proportion of recovered samples:\n", paste(names(outloci_missing_samples),"\t",round(outloci_missing_samples,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_target)," loci are below the threshold (",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered,") for proportion of recovered target sequence length:\n", paste(names(outloci_missing_target),"\t",round(outloci_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outloci_missing), " loci were removed:\n", paste(outloci_missing,"\n"), sep="")
############################################################################################
### Dataset optimization step 2, removing paralogs for a) all samples and b) each sample ###
############################################################################################
### 2a) Paralogs across multiple samples (removing loci with unusually high proportions of SNPs across all samples)
###################################################################################################################
loci_cl1_colmeans <- colMeans(as.matrix(loci_cl1), na.rm = T)
nloci_cl1 <- length(colnames(loci_cl1))
nsamples_cl1 <- length(colnames(tab_snps_cl1))
loci_cl1_colmeans_mean <- round(mean(loci_cl1_colmeans),4)
loci_cl1_colmeans_median <- round(median(loci_cl1_colmeans),4)
# applying chosen threshold
if (length(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs) != 1 || remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "none"){
outloci_para_all <- vector()
threshold_value <- 1
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "outliers"){
threshold_value <- 1.5*IQR(loci_cl1_colmeans, na.rm = TRUE )+quantile(loci_cl1_colmeans, na.rm = TRUE )[4]
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "file"){
if(file.exists(file_with_putative_paralogs_to_remove_for_all_samples) == FALSE){
print("File with list of paralogs to remove for all samples does not exist.")
} else {
outloci_para_all <- readLines(file_with_putative_paralogs_to_remove_for_all_samples)
outloci_para_all_values <-loci_cl1_colmeans[which(names(loci_cl1_colmeans) %in% outloci_para_all)]
}
} else {
threshold_value <- remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
}
# color outliers red
colour_outparaall <- rep("black",nloci_cl1)
colour_outparaall[which(colnames(loci_cl1[,order(loci_cl1_colmeans)]) %in% outloci_para_all)] <- "red"
loci_cl1_order_means <- loci_cl1[,order(loci_cl1_colmeans)]
# generate bar graph
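# (two passes: i==1 writes a PDF, i==2 writes a PNG of the same figure)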
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2a_Paralogs_for_all_samples.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"2a_Paralogs_for_all_samples.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
layout(matrix(c(1,2),2,2, byrow=TRUE), widths=c(5,1))
barplot(sort(loci_cl1_colmeans), col=colour_outparaall, border = NA, las=2,
main=paste("Mean % SNPs across samples (n=",nsamples_cl1,") for each locus (n=", nloci_cl1,")", sep=""))
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
boxplot(loci_cl1_colmeans, las=2)
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
dev.off()
}
# removing marked loci from table
if(length(outloci_para_all)==0) {tab_snps_cl2a <- tab_snps_cl1
} else { tab_snps_cl2a <- tab_snps_cl1[-which(rownames(tab_snps_cl1) %in% outloci_para_all),]}
### 2b) Paralogs for each sample (removing outlier loci for each sample)
##########################################################################
tab_snps_cl2b <- tab_snps_cl2a
if(!exists("remove_outlier_loci_for_each_sample")) {remove_outlier_loci_for_each_sample <- "no"}
# generate tables without zeros to count only loci with SNPs
tab_snps_cl2a_nozero <- tab_snps_cl2a
tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero==0)] <- NA
tab_snps_cl2b_nozero <- tab_snps_cl2a_nozero
if(remove_outlier_loci_for_each_sample == "yes" ){
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
threshold_para_each <- outloci_para_each
for(i in 1:length(colnames(tab_snps_cl2a))){
threshold_i <- 1.5*IQR(tab_snps_cl2a_nozero[,i], na.rm = TRUE ) + quantile(tab_snps_cl2a_nozero[,i] , na.rm = TRUE)[[4]]
outlier_loci_i <- tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero[,i] > threshold_i),i]
outloci_para_each[i] <- list(outlier_loci_i)
threshold_para_each[i] <- threshold_i
tab_snps_cl2b[which(rownames(tab_snps_cl2a) %in% outlier_loci_i),i] <- NA
tab_snps_cl2b_nozero[which(rownames(tab_snps_cl2b_nozero) %in% names(outlier_loci_i)),i] <- NA
outliers_color <- "red"
}
} else {
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
tab_snps_cl2b <- tab_snps_cl2a
outliers_color <- "black"
}
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
### output
# generate graphic
for (i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2b_Paralogs_for_each_sample.pdf"), width = 10, h=14)
} else {
png(file=file.path(output_assess,"2b_Paralogs_for_each_sample.png"), width = 1000, h=1400)
}
par(mfrow=c(1,1))
boxplot(as.data.frame(tab_snps_cl2a_nozero[,order(colMeans(as.matrix(tab_snps_cl2a_nozero), na.rm = T))]),
horizontal=T, las=1, yaxt='n',
main="Proportions of SNPs for all loci per sample\n(only loci with any SNPs)",
ylab="Samples",
xlab="Proportion of SNPs",
col="grey", pars=list(outcol=outliers_color),
outpch=20
)
dev.off()
}
# write summary text file
cl2b_file <- file.path(output_assess,"2_Summary_Paralogs.txt")
cat(file=cl2b_file,"Removal of putative paralog loci.")
cat(file=cl2b_file,"Paralogs removed for all samples:\n", append = T)
cat(file=cl2b_file, paste("Variable 'remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs' set to: ", remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs,"\n", sep=""), append=T)
if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="file"){
cat(file=cl2b_file, paste("Loci listed in this file were removed: '", file_with_putative_paralogs_to_remove_for_all_samples,"'\n"), append=T)
} else if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="none"){
cat(file=cl2b_file,"None!\n", append = T)
} else {
cat(file=cl2b_file, paste("Resulting threshold value (mean proportion of SNPs):", round(threshold_value,5),"\n"), append=T)
}
if(length(outloci_para_all)>0){
cat(file=cl2b_file, paste(length(outloci_para_all)," loci were removed:\n",sep=""), append=T)
cat(file=cl2b_file, "locus\tmean_prop_SNPs\n", append=T)
cat(file=cl2b_file, paste(paste(outloci_para_all, round(outloci_para_all_values,4), sep="\t"), collapse="\n"), append=T)
cat(file=cl2b_file, "\n\n", append = T)
}
cat(file= file.path(output_assess,"2a_List_of_paralogs_removed_for_all_samples.txt"), paste(outloci_para_all,collapse = "\n"))
if(remove_outlier_loci_for_each_sample=="yes"){
cat(file=cl2b_file, "Paralogs removed for each sample:\n", append=T)
cat(file=cl2b_file, "Sample\tthreshold\t#removed\tnames\n", append=T)
for(i in 1:length(names(outloci_para_each))){
cat(file=cl2b_file, names(outloci_para_each)[i],"\t", append=T)
cat(file=cl2b_file, round(threshold_para_each[[i]],5), length(outloci_para_each[[i]]), paste(names(outloci_para_each[[i]]),collapse=", "), sep="\t", append=T)
cat(file=cl2b_file, "\n", append=T)
}
} else{
cat(file=cl2b_file, "The step for removing paralogs for each samples was skipped.\n", append=T)
}
# tables
write.csv(tab_snps_cl2b, file = file.path(output_assess,"0_Table_SNPs.csv"))
write.csv(tab_length_cl2b, file = file.path(output_assess,"0_Table_consensus_length.csv"))
# txt file with included samples
write(colnames(tab_snps)[which(!(colnames(tab_snps) %in% outsamples_missing))], file=file.path(output_assess,"0_namelist_included_samples.txt"))
# write(outsamples_missing, file=file.path(output_assess, "0_namelist_excluded_samples.txt"))
#write(paste(rownames(tab_snps_cl2a),colMeans(t(tab_snps_cl2a), na.rm = T)), file=file.path(output_assess,"Mean_SNPs_loci.txt"))
#outsamples_missing <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
### save Data as R objects
saveRDS(failed_loci,file=file.path(output_Robjects,"failed_loci.Rds"))
saveRDS(failed_samples,file=file.path(output_Robjects,"failed_samples.Rds"))
saveRDS(tab_snps_cl2b,file=file.path(output_Robjects,"Table_SNPs_cleaned.Rds"))
saveRDS(tab_length_cl2b,file=file.path(output_Robjects,"Table_consensus_length_cleaned.Rds"))
saveRDS(outloci_missing,file=file.path(output_Robjects,"outloci_missing.Rds"))
saveRDS(outsamples_missing,file=file.path(output_Robjects,"outsamples_missing.Rds"))
saveRDS(outloci_para_all,file=file.path(output_Robjects,"outloci_para_all.Rds"))
saveRDS(outloci_para_each,file=file.path(output_Robjects,"outloci_para_each.Rds"))
##################################################################################################################
### Generating summary table and graphs for assessment of locus heterozygosity and allele divergence of samples ###
##################################################################################################################
tab_length <- as.matrix(tab_length)
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
targets_length_cl2b <- sum(max_target_length[which(gsub(".*-","",names(max_target_length)) %in% rownames(tab_snps_cl2b))])
########## generating summary table
nloci_cl2 <- length(tab_snps_cl2b[,1])
tab_het_ad <- data.frame("sample"=colnames(tab_snps_cl2b))
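# Per sample: total consensus length (bp), % of target length recovered,
# counts of removed paralogs, number of loci, allele divergence
# (length-weighted mean SNP proportion, in %), and locus heterozygosity
# (% of loci with at least one SNP; also at 0.5%, 1% and 2% SNP cutoffs).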
for(i in 1:length(colnames(tab_snps_cl2b))){
tab_het_ad$bp[i] <- sum(tab_length_cl2b[,i], na.rm = T)
tab_het_ad$bpoftarget[i] <- round(sum(tab_length_cl2b[,i], na.rm = T)/targets_length_cl2b,3)*100
tab_het_ad$paralogs_all[i] <- length(outloci_para_all)
tab_het_ad$paralogs_each[i] <- length(outloci_para_each[[i]])
tab_het_ad$nloci[i] <- nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))
tab_het_ad$allele_divergence[i] <- 100*round(sum(tab_length_cl2b[,i] * tab_snps_cl2b[,i], na.rm = T) / sum(tab_length_cl2b[,i], na.rm = T),5)
tab_het_ad$locus_heterozygosity[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]==0))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >0.5% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.005))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >1% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.01))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >2% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.02))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
}
# output as csv file
write.csv(tab_het_ad, file = file.path(output_assess, "4_Summary_table.csv"))
#output as R-object
saveRDS(tab_het_ad, file = file.path(output_Robjects, "Summary_table.Rds"))
### Generating graphs
text_size_mod <- 1
nrows <- length(tab_het_ad[,1])
text_size <- (15+200/nrows)*text_size_mod
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_LH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_LH_vs_AD.png"), h=1000,w=1000)
}
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity [%]", main="Locus heterozygosity vs allele divergence", las=1)
dev.off()
}
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_varLH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_varLH_vs_AD.png"), h=1000,w=1000)
}
par(mfrow=c(2,2))
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (0% SNPs) [%]", main="Locus heterozygosity (0% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >0.5% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>0.5% SNPs) [%]", main="Locus heterozygosity (>0.5% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >1% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>1% SNPs) [%]", main="Locus heterozygosity (>1% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >2% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>2% SNPs) [%]", main="Locus heterozygosity (>2% SNPs) vs allele divergence",las=1
)
par(mfrow=c(1,1))
dev.off()
}
```
:::
### Changes to make config txt
Details below for parameters set in config.txt.
:::spoiler Config changes
- `name_for_dataset_optimization_subset = ""`
These values are the ones _Nauheimer et al._ use in their paper:
- `remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2`
- `remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.45`
- `remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2`
- `remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered`
For the paralogs:
- `remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"`
- `file_with_putative_paralogs_to_remove_for_all_samples = ""`
- `remove_outlier_loci_for_each_sample = "no"`
These values should be carefully evaluated per dataset. A combined config.txt sketch follows after this box.
:::
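Putting these together, the dataset optimization part of `config.txt` might look like the sketch below. This is a minimal sketch using the values listed above; the exact layout is an assumption, and the one threshold not recorded above is marked as a placeholder:
```r=
# Dataset optimization settings (sketch; adjust per dataset)
name_for_dataset_optimization_subset = ""
remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2
remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.45
remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2
remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.2  # placeholder: value not recorded in these notes

# Paralog removal settings
remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"
file_with_putative_paralogs_to_remove_for_all_samples = ""
remove_outlier_loci_for_each_sample = "no"
```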
### Run
To run the script, type:
```bash=
Rscript 1b_assess_dataset.R
```
:::danger
**1b_assess_dataset.R**: Line 370 `write(samples[which(!(samples %in% outsamples_missing))], file=file.path(output_assess, "0_namelist_included_samples.txt"))` gave an error; the line was commented out, with no obvious fix at the time.
Nora's fix:
The problem is that the sample names are not stored in an object called `samples`. I modified the code at line 370 to avoid the `samples` object:
```r=
write(colnames(tab_snps)[which(!(colnames(tab_snps) %in%
outsamples_missing))],
file=file.path(output_assess,
"0_namelist_included_samples.txt"))
```
In the 1c_generate_sequence_lists.R script, there is an issue that the objects `failed_loci` and `failed_samples` do not exist. Nora's fix is to have 1b_assess_dataset.R save these as R objects (as shown in the script above). At "Save Data as R Objects", line 375, I added:
```r=
saveRDS(failed_loci,file=file.path(output_Robjects,"failed_loci.Rds"))
saveRDS(failed_samples,file=file.path(output_Robjects,"failed_samples.Rds"))
```
:::
## Step 6: ++1c_generate_sequence_lists.R++
This R script generates a list of sequences for every locus and sample. These files are used for downstream analyses.
The script itself had quite a few occurrences of the same error. The Brighamia data did not have any loci that completely failed (I think I removed them earlier on due to poor coverage and issues with some of the scripts). This R script expects an R object called `failed_loci` to exist; I am unsure where/when in the script this object is created, but it did not exist for the Brighamia data. An example of how I fixed this can be found in the troubleshooting box below.
Running the script is again as simple as
```bash=
Rscript 1c_generate_sequence_lists.R
```
:::danger
**1c_generate_sequence_lists.R**: The latter half of this script expects R objects to exist (like `failed_samples` and `failed_loci`), but the objects didn't exist for me and I could not find where they are supposedly created.
```r=
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all) ###ERROR: object (failed_loci) not found
```
This line returns the error `object 'failed_loci' not found`. I'm not sure if my fix still leads to the desired outcome of the script, but I added a step to check for the existence of the object; if it didn't exist, I created an empty vector for it:
```r=
if(exists("failed_loci")) {
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
} else {
loci_to_remove <- c()
failed_loci <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
```
This solution was applied everywhere the R script expected an object that did not exist for me (specifically the objects `failed_samples` and `failed_loci`).
Nora's Troubleshooting for this error:
Impatiens had several failed loci, so using an empty vector would not be advised. I fixed this by editing the 1b_assess_dataset.R script to write objects for failed_loci and failed_samples (see above).
The 1c_generate_sequence_lists.R script will also have to be edited to read these R objects. At line 40, I added:
```r=
failed_loci<-readRDS(file=file.path(output_Robjects,"failed_loci.Rds"))
failed_samples<-readRDS(file=file.path(output_Robjects,"failed_samples.Rds"))
```
Script ran without issues after that.
:::
## Step 7: Aligning and trimming consensus sequences
```bash=
##### Running MAFFT on an individual sample:
mafft --auto 2000468_BI005_consensus.fasta > 2000468_BI005_consensus_aligned.fasta

##### Loop to write MAFFT commands for every fasta in the folder
# I created a folder called loci_consensus_aligned to house all the aligned files
for file in *.fasta
do
    file2=${file//.fasta/_aligned.fasta}
    echo "mafft --auto $file > $file2" >> mafft_loop.sh
    echo "mv $file2 loci_consensus_aligned" >> mafft_loop.sh
done
bash mafft_loop.sh   # run the generated commands

##### Running trimAl on an individual sample:
# -gt 0.85 keeps columns with residues in >=85% of sequences; -st is the
# minimum column similarity score; -cons retains at least this % of columns
trimal -in 2001114_BI146_consensus_aligned.fasta -out 2001114_BI146_trimmed.fasta -gt 0.85 -st 0.001 -cons 35

##### Loop trimAl through the folder of aligned files
for file in *_consensus_aligned.fasta
do
    file2=${file//_consensus_aligned.fasta/_consensus_trimmed.fasta}
    echo "trimal -in $file -out $file2 -gt 0.85 -st 0.001 -cons 10" >> trimAl_loop.sh
    echo "mv $file2 loci_consensus_trimmed" >> trimAl_loop.sh
done
bash trimAl_loop.sh
```
# HybPhaser 2
## Extract mapped reads
We don't remember exactly what we did for this step; a hypothetical reconstruction is sketched below.
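A plausible approach (an assumption, not the recorded method) is to pull the reads that mapped during HybPiper out of each sample's BAM with `samtools fastq` and collect them in a `mapped_reads/` folder for the clade association step. A minimal R sketch, assuming HybPiper's `<sample>/<sample>.bam` layout and a `namelist.txt` of sample names:
```r=
# Hypothetical reconstruction (not the recorded method): extract the reads
# that mapped during HybPiper from each sample's BAM with samtools fastq.
samples <- readLines("namelist.txt")          # assumed HybPiper namelist
dir.create("mapped_reads", showWarnings = FALSE)
for (s in samples) {
  bam <- file.path(s, paste0(s, ".bam"))      # assumed HybPiper BAM location
  out <- file.path("mapped_reads", paste0(s, "_mapped.fastq"))
  # -F 4 excludes unmapped reads; fastq goes to stdout, redirected to 'out'
  system2("samtools", c("fastq", "-F", "4", bam), stdout = out)
}
```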
## BBsplit analysis and collation
For the R scripts `2a_prepare_bb_split.R` and `2b_collate_bbsplit_results.R`, make sure that the config file points to the correct folders:
- `path_to_clade_association_folder = "~/HybPhaser2/04_clade_association"`
- `csv_file_with_clade_reference_names = "~/HybPhaser2/03_sequence_lists/trimmed_loci_consensus_unique_alleles/4471_ref.csv"` (this is Nora's folder)
- `path_to_reference_sequences = "~/HybPhaser2/03_sequence_lists/samples_consensus"`
- `path_to_read_files_cladeassociation = "~/HybPhaser2/mapped_reads"`
- `run_clade_association_mapping_in_R = "yes"`
    - If this is set to `"no"`, you must run the bash script that `2a_prepare_bb_split.R` creates in 04_clade_association.
Then simply run `Rscript 2a_prepare_bb_split.R` and `Rscript 2b_collate_bbsplit_results.R`. An example of the clade reference CSV is sketched below.
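As we understand the format (double-check against the HybPhaser wiki), the clade reference CSV pairs each reference sample name with a clade abbreviation, one comma-separated pair per row with no header. A minimal sketch in R with made-up sample/clade names:
```r=
# Hypothetical clade reference file: sample name, clade abbreviation (no header)
clade_refs <- data.frame(sample = c("2000468_BI005", "2001114_BI146"),
                         clade  = c("cladeA", "cladeB"))
write.table(clade_refs, "4471_ref.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```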
### Restructuring HybPhaser 2
We restructured HybPhaser 2 to act on the individual/gene level as opposed to the species/genome level, using the Python script `allele_association.py`.
To hopefully get cleaner HybPhaser 2 results, we reduced the number of alleles per gene to two (previously ranging from 0 to 7), selecting the most divergent alleles, using the Python script `max_divergence.py` (a sketch of the idea follows).
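`max_divergence.py` itself is not reproduced here; the R sketch below only illustrates the idea, assuming one aligned fasta of alleles per gene and the `ape` package: compute pairwise distances and keep the two alleles in the most divergent pair.
```r=
library(ape)
# Hypothetical input: aligned fasta of all alleles for one gene
aln <- read.dna("gene1_alleles_aligned.fasta", format = "fasta")
if (nrow(aln) > 2) {
  d   <- as.matrix(dist.dna(aln, model = "raw", pairwise.deletion = TRUE))
  idx <- which(d == max(d), arr.ind = TRUE)[1, ]   # row/col of the max distance
  aln <- aln[rownames(d)[idx], ]                   # keep the two most divergent alleles
}
write.FASTA(aln, "gene1_top2_alleles.fasta")
```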
---
NOTES
Step 2: Clade association
This is the Figure 1C step in the HybPhaser paper.
Look at the consensus sequences to identify individuals with no ambiguity codes.