---
title: HybPhaser Protocol
tags: HybSeq
---
# HYBPHASER
HybPhaser Version 2.0 was developed to deal with hybrids (and polyploids) in target capture datasets. It detects hybrids by measuring heterozygosity in the dataset and phases hybrid accessions by separating reads according to their similarity to selected taxa that represent the parental clades.
HybPhaser is built as an extension to the assembly pipeline HybPiper.
## New notes after running HybPhaser v2 on Quest in 2023:
When running 1_generate_consensus_sequences.sh via slurm, be sure to ```module load samtools``` and ```module load bwa``` in the slurm script. Do not module load bcftools (see below).
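A minimal slurm script might look like the sketch below (account, resources, and paths are placeholders; adjust them to your allocation and data):
```bash=
#!/bin/bash
#SBATCH --account=<your_allocation>
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

module load samtools
module load bwa
# do NOT module load bcftools; activate the conda environment
# that has bcftools v1.9 instead (see below)

./generate_consensus_sequences.sh -n namelist.txt -p ./hybpiper_results -o ./hybphaser -t 4
```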
Note that the bcftools installed on Quest is v1.18, but this script requires the older v1.9. Confusingly, the flag -I, which tells *bcftools consensus* to use IUPAC codes, is no longer accepted in v1.18, which uses an -i flag for that instead; in v1.9, however, -i is the flag for filtering the vcf! To solve this issue, you'll have to install bcftools v1.9 in your conda environment. Do not ```module load bcftools``` in the environment or in the slurm script, because this will load the newer version.
```
conda activate myEnv
# bcftools is distributed via the bioconda channel
conda install -c bioconda bcftools=1.9
```
Make sure that you are using the correct bcftools install:
```
which bcftools
# it should be in your environment bin/
~/.conda/envs/myEnv/bin/bcftools
# if it is in /software then type
module purge
# and check again
```
Check your consensus sequences to make sure that (a) the fastas are populated and (b) there are IUPAC ambiguity codes in the sequences. This HybPhaser script suppresses error messages, so it can be hard to tell if bwa, samtools, or bcftools is not working properly.
Ambiguity codes are Y, R, S, K, etc. Check to see if there are any in the consensus sequences:
```
grep -B1 -e "Y" -e "R" -e "S" -e "K" -e "W" -e "M" ./01_data/sample_name/intronerated_consensus/*
```
There should only be a few ambiguity codes in each sequence.
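For a quick per-file count instead of eyeballing the matches, something like this works (a sketch; the path assumes the intronerated run, so adjust it for the normal consensus folder):
```bash=
# count IUPAC ambiguity sites in each consensus fasta (header lines excluded)
for f in ./01_data/sample_name/intronerated_consensus/*.fasta; do
  echo -n "$f: "
  grep -v "^>" "$f" | grep -o "[YRSKWMBDHVyrskwmbdhv]" | wc -l
done
```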
Have not yet looked at the rest of these notes...
*end note 11/29/23 Nora*
## Step 1: Run HybPiper
Information can be found [here](https://hackmd.io/@CBG-genetics-team/r1mCsb5n_)
## Step 2: Go to the wiki and download the shell script
The HybPhaser wiki can be found [here](https://github.com/LarsNauheimer/HybPhaser)
Locate the 1_generate_consensus_sequences [code](https://github.com/LarsNauheimer/HybPhaser/blob/main/1_generate_consensus_sequences.sh).
Copy the code into a text editor and save the file.
:bulb: I called it generate_consensus_sequence.sh
Move the file onto the server.
Make a HybPhaser directory:
```bash=
mkdir ./Brighamia_Paralog/hybphaser
```
## Step 3: Run the generate consensus shell script
### 1_generate_consensus_sequences.sh file
:::spoiler HybPhaser's script
``` bash=!
#!/bin/bash
############
set +e
set -u
set -o pipefail
CONTIGTYPE="normal"
THREADS=1
SAMPLENAME=""
ALLELE_FREQ=0.15
READ_DEPTH=10
ALLELE_COUNT=4
PIPERDIR="./"
OUTDIR_BASE="../HybPhaser"
NAMELIST="not a file"
CLEANUP="FALSE"
while getopts 'ict:s:a:f:d:p:o:n:' OPTION; do
case "$OPTION" in
i)
CONTIGTYPE="ISC"
;;
c)
CLEANUP="TRUE"
;;
t)
THREADS=$OPTARG
;;
s)
SAMPLENAME=$OPTARG
;;
f)
ALLELE_FREQ=$OPTARG
;;
a)
ALLELE_COUNT=$OPTARG
;;
d)
READ_DEPTH=$OPTARG
;;
p)
PIPERDIR=$OPTARG
;;
o)
OUTDIR_BASE=$OPTARG
;;
n)
NAMELIST=$OPTARG
;;
?)
echo "This script is used to generate consensus sequences by remapping reads to de novo contigs generated by HybPiper. It is part of the HybPhaser workflow and has to be executed first.
Usage: generate_consensus_sequences.sh [options]
Options:
General:
-s <Name of sample> (if not providing a namelist)
-n <Namelist> (txt file with sample names, one per line)
-p <Path to HybPiper results folder> Default is current folder.
-o <Path to output folder> (will be created if it doesn't exist). Default is ../HybPhaser
-t <Maximum number of threads used> Default is 1. (multiple threads do not speed up much)
-i -intronerated: If set, intronerate_supercontigs are used in addition to normal contigs.
-c -clean-up: If set, reads and mapping files are removed (.bam, .vcf.gz)
Adjust consensus sequence generation:
-d Minimum coverage on site to be regarded for assigning ambiguity code.
If read depth on that site is lower than chosen value, the site is not used for ambiguity code but the most frequent base is returned. Default is 10.
-f Minimum allele frequency regarded for assigning ambiguity code.
If the alternative allele is less frequent than the chosen value, it is not included in ambiguity coding. Default is 0.15.
-a Minimum count of alleles to be regarded for assigning ambiguity code.
If the alternative allele occurs less often than the chosen value, it is not included in ambiguity coding. Default is 4.
" >&2
exit 1
;;
esac
done
shift "$(($OPTIND -1))"
if [[ -f $NAMELIST ]]; then
SAMPLES=$(<$NAMELIST)
elif [[ $SAMPLENAME != "" ]]; then
SAMPLES=$SAMPLENAME
else
echo "No sample name given or namelist file does not exist!"
exit 1
fi
for SAMPLE in $SAMPLES
do
SECONDS=0
#collecting reads and contigs from each sample
echo Collecting read and contigs files for $SAMPLE
OUTDIR=$OUTDIR_BASE"/01_data/"
mkdir -p "$OUTDIR/$SAMPLE/reads"
cp $PIPERDIR/$SAMPLE/*/*_unpaired.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null
cp $PIPERDIR/$SAMPLE/*/*_interleaved.fasta $OUTDIR/$SAMPLE/reads/ 2> /dev/null
if compgen -G $OUTDIR/$SAMPLE/reads/*_interleaved.fasta > /dev/null; then
for i in $OUTDIR/$SAMPLE/reads/*_interleaved.fasta
do
cat $i ${i/_interleaved.fasta/_unpaired.fasta} > ${i/_interleaved.fasta/_combined.fasta} 2> /dev/null
rm $i ${i/_interleaved.fasta/_unpaired.fasta} 2> /dev/null
done
fi
if [[ $CONTIGTYPE == "normal" ]]; then
mkdir -p $OUTDIR/$SAMPLE/contigs
cp $PIPERDIR/$SAMPLE/*/*/sequences/FNA/*.FNA $OUTDIR/$SAMPLE/contigs/ 2> /dev/null
# adding gene name into the fasta files ">sample-gene"
for i in $OUTDIR/$SAMPLE/contigs/*.FNA
do
FILE=${i/*\/contigs\//}
GENE=${FILE/.FNA/}
sed -i "s/\(>.*\)/\1-$GENE/" $i
done
# renaming .FNA to .fasta for consistency
for f in $OUTDIR/$SAMPLE/contigs/*.FNA; do mv "$f" "${f/.FNA/.fasta}"; done
CONTIGPATH="$OUTDIR/$SAMPLE/contigs"
mkdir -p "$OUTDIR/$SAMPLE/consensus"
else
# when intronerate is selected
mkdir -p $OUTDIR/$SAMPLE/intronerated_contigs
cp $PIPERDIR/$SAMPLE/*/*/sequences/intron/*_supercontig.fasta $OUTDIR/$SAMPLE/intronerated_contigs/ 2> /dev/null
for f in $OUTDIR/$SAMPLE/intronerated_contigs/*.*
do
mv "$f" "${f/_supercontig/_intronerated}" 2> /dev/null
done
CONTIGPATH="$OUTDIR/$SAMPLE/intronerated_contigs"
mkdir -p "$OUTDIR/$SAMPLE/intronerated_consensus"
fi
mkdir -p "$OUTDIR/$SAMPLE/mapping_files"
# start remapping reads to contigs
for CONTIG in $CONTIGPATH/*.fasta
do
FILE=${CONTIG/*\/*contigs\//}
GENE=${FILE/.fasta/}
echo -e '\e[1A\e[K'Generating consensus sequences for $SAMPLE - $GENE
if [[ $CONTIGTYPE == "normal" ]]; then
CONSENSUS=$OUTDIR/$SAMPLE/consensus/$GENE".fasta"
BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE".bam"
VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE".vcf.gz"
else
GENE=${GENE/_intronerated/}
CONSENSUS=$OUTDIR/$SAMPLE/intronerated_consensus/$GENE"_intronerated.fasta"
BAM=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.bam"
VCFZ=$OUTDIR/$SAMPLE/mapping_files/$GENE"_intronerated.vcf.gz"
fi
# checking for paired-end reads
if [[ -f $OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta" ]];then
READS=$OUTDIR/$SAMPLE/reads/$GENE"_combined.fasta"
READTYPE="PE"
else
READS=$OUTDIR/$SAMPLE/reads/$GENE"_unpaired.fasta"
READTYPE="SE"
fi
bwa index $CONTIG 2> /dev/null
if [ "$READTYPE" = "SE" ];then
bwa mem $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM
else
bwa mem -p $CONTIG $READS -t $THREADS -v 1 2> /dev/null | samtools sort > $BAM
fi
bcftools mpileup -I -Ov -f $CONTIG $BAM 2> /dev/null | bcftools call -mv -A -Oz -o $VCFZ 2> /dev/null
bcftools index -f --threads $THREADS $VCFZ 2> /dev/null
bcftools consensus -I -i "(DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $ALLELE_FREQ && (DP4[0]+DP4[1]+DP4[2]+DP4[3]) >= $READ_DEPTH && (DP4[2]+DP4[3]) >= $ALLELE_COUNT " -f $CONTIG $VCFZ 2> /dev/null | awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' > $CONSENSUS
echo "" >> $CONSENSUS
rm $CONTIG.* 2> /dev/null
rm $VCFZ".csi" 2> /dev/null
done
# optional clean-up step: remove folders with mapping stats and reads
if [[ $CLEANUP == "TRUE" ]]; then
rmdir $OUTDIR/$SAMPLE/mapping_files/ --ignore-fail-on-non-empty
rmdir $OUTDIR/$SAMPLE/reads --ignore-fail-on-non-empty
fi
DURATION_SAMPLE=$SECONDS
echo -e '\e[1A\e[K'Generated consensus for $SAMPLE in $(($DURATION_SAMPLE / 60)) minutes and $(($DURATION_SAMPLE % 60)) seconds.
done
```
:::
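Make the script executable first; running it inside a screen session keeps it alive if your connection drops:
```bash=
chmod +x generate_consensus_sequences.sh
```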
```bash=
screen -S HybPhaser
./generate_consensus_sequences.sh -n ./Brighamia_Paralog/name_file.txt -p ./Brighamia_Paralog/ -o ./Brighamia_Paralog/hybphaser -t 4
```
:::danger
Some users have had errors with this script finding the out directory. Using absolute path names, for example -o /home/ngavinsmyth/Noras_Sequences/HybPhaser_output_folder instead of ../HybPhaser_output_folder, can be a solution, as in the sketch below.
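One way is to resolve the paths up front with `realpath` (using the same example folders as above):
```bash=
./generate_consensus_sequences.sh \
  -n "$(realpath ./Brighamia_Paralog/name_file.txt)" \
  -p "$(realpath ./Brighamia_Paralog)" \
  -o "$(realpath ./Brighamia_Paralog)/hybphaser" \
  -t 4
```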
**Error:**
```
About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
       --sample (and, optionally, --haplotype) option will apply genotype
       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth
       information, such as INFO/AD or FORMAT/AD.
Usage:   bcftools consensus [OPTIONS] <file.vcf>
Options:
    -f, --fasta-ref <file>     reference sequence in fasta format
    -H, --haplotype <1|2>      apply variants for the given haplotype
    -i, --iupac-codes          output variants in the form of IUPAC ambiguity codes
    -m, --mask <file>          replace regions with N
    -o, --output <file>        write output to a file [standard output]
    -c, --chain <file>         write a chain file for liftover
    -s, --sample <name>        apply variants of the given sample
Examples:
   # Get the consensus for one region. The fasta header lines are then expected in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

Generating consensus sequences for BRIN-001 - 2084634_BI149
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
consensus: invalid option -- 'I'
```
:::
:::info
:bulb: SOLUTION: Check dependencies
```bcftools --version```
In this case the installed bcftools was too old and needed updating:
```sudo apt-get install bcftools```
(But see the 2023 notes at the top of this page: on Quest the opposite applies, and bcftools v1.9 must be installed via conda.)
You might need to update R in conda. See **[website](https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/)**
:::
## Step 4: Run the 1a_count_snps.R script
### Preparation
This next step uses R in the terminal. You will need to download the config.txt and the 1a_count_snps.R script.
To see if you have R, type:
```bash=
# Check what environments you have
conda env list
# if an environment with R is listed, then
conda activate "name of env"
# if R is not installed, install it into the active environment
conda install -c conda-forge r-base
# or, if needed, create a new environment that includes R
conda create -n myREnv -c conda-forge r-base
```
Once you have activated the environment:
```bash=
# Start R
R
# To quit
q()
```
Make sure the necessary packages are installed (the script loads ape, seqinr, and stringr):
```r=
install.packages("ape")
install.packages("seqinr")
install.packages("stringr")
```
The R script 1a_count_snps.R generates a table with the proportion of SNPs in each locus and sample, as well as a table of recovered sequence lengths.
:::info
:bulb:The first step is to locate a file called `config.txt` in the HybPhaser folder. Open it and enter the necessary information in the `# General Settings` section.
:::
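The script falls back to `./config.txt` only when the `config_file` variable is not already set, so you can point it at a config saved elsewhere without editing the script (the path is an example):
```bash=
Rscript -e 'config_file <- "/path/to/my_config.txt"; source("1a_count_snps.R")'
```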
### Changes made
:::spoiler Changes made to Config Details below
- `path_to_output_folder`: set the path for the HybPhaser output folder ("/path/to/hybphaser_output_folder/")
- `fasta_file_with_targets`: direct to the file with target sequences (baits) ("/path/to/target_file.fasta")
- `targets_file_format`: set whether the target file is in DNA ("DNA") or AA ("AA") format
- `path_to_namelist`: direct to a file that contains all sample names ("/path/to/namelist.txt")
- `intronerated_contig`: select whether HybPiper was run with intronerate to fetch the intronerated contigs ("yes"), or whether normal contigs should be used ("no")
- Here are the `# General Settings` I used:
```
path_to_output_folder = "/home/jz/Desktop/brighamia_scbaits/HybPhaserOutputs"
fasta_file_with_targets = "/home/jz/Desktop/brighamia_scbaits/supercontig_baits.fasta"
targets_file_format = "DNA"
path_to_namelist = "/home/jz/Desktop/brighamia_scbaits/name_file2.txt"
intronerated_contig = "no"
```
### Nora's example "Config.txt"
``` r= !
######################################################
### Configuration File for all HybPhaser R scripts ###
######################################################
# General settings
path_to_output_folder = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2"
fasta_file_with_targets = "./Angiosperms353_targetSequences.fasta"
targets_file_format = "DNA" # "DNA" or "AA"
path_to_namelist = "./names_list_luk.txt"
intronerated_contig = "no"
###############################
### Part 1: SNP Assessment ###
###############################
name_for_dataset_optimization_subset = ""
# missing data
remove_samples_with_less_than_this_propotion_of_loci_recovered = .2
remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = .1
remove_loci_with_less_than_this_propotion_of_samples_recovered = .2
remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered = .1
##lowered filters for target sequence length recovered because we find it less important that our sequences match the targets well, more important that we get good proportion of sequences recovered.
# Paralogs
remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers" # any number between 0 and 1, "none" or "outliers"
file_with_putative_paralogs_to_remove_for_all_samples = ""
remove_outlier_loci_for_each_sample = "yes"
##################################
### Part 2: Clade Association ###
##################################
# set variables to determine thresholds for the dataset optimization and run the script from the main script
path_to_clade_association_folder = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/04_clade_association"
csv_file_with_clade_reference_names = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/03_sequence_lists/trimmed_loci_consensus_unique_alleles/4471_ref.csv"
path_to_reference_sequences = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/03_sequence_lists/samples_consensus"
path_to_read_files_cladeassociation = "/home/ngavinsmyth/Noras_Sequences/HybPhaser2/mapped_reads"
read_type_cladeassociation = "single-end"
ID_read_pair1 = ""
ID_read_pair2 = ""
file_with_samples_included = ""
path_to_bbmap = "/opt/apps/bbmap/"
no_of_threads_clade_association = "auto"
run_clade_association_mapping_in_R = "no"
java_memory_usage_clade_association = "" # e.g. 2G (for 2GB) needed when java -Xmx error comes up. should be max. 85% of physical memory
#######################
### Part 3: Phasing ###
#######################
# set variables and direct the scripts to the right folders and files before running the single scripts (from inside this file)
path_to_phasing_folder = ""
csv_file_with_phasing_prep_info = ""
path_to_read_files_phasing = ""
read_type_4phasing = "paired-end"
ID_read_pair1 = "_R1.fastq.gz" # e.g. "_R1.fastq.gz"
ID_read_pair2 = "_R2.fastq.gz" # e.g. "_R2.fastq.gz"
reference_sequence_folder = ""
folder_for_phased_reads = ""
folder_for_phasing_stats = ""
path_to_bbmap_executables = ""
no_of_threads_phasing = "auto"
java_memory_usage_phasing = "" # e.g. 2G (for 2GB) needed when java -Xmx error comes up. should be max. 85% of physical memory
run_bash_script_in_R = "no"
################################
### Part 4: Merging Datasets ###
################################
# This script is to generate sequence lists that combine the phased sequences with the normal non-phased ones (but exclude the normal ones of the phased accessions)
# It is also possible to make subsets of samples or loci using lists of included or excluded samples/loci
path_to_sequence_lists_normal = ""
path_to_sequence_lists_phased = ""
path_of_sequence_lists_output = ""
path_to_namelist_normal = ""
path_to_namelist_phased = ""
# to generate subsets one can chose to define samples that are either included or excluded
file_with_samples_included = ""
file_with_samples_excluded = ""
# Similarly one can choose to exclude loci or use a list with only the included loci.
file_with_loci_excluded = ""
file_with_loci_included = ""
exchange_phased_with_not_phased_samples = "yes" # "yes" or "no"
include_phased_seqlists_when_non_phased_locus_absent = "no" # "yes" or "no"
```
:::
### Examples of the 1a_count_snps.R files
:::spoiler HybPhaser script
``` r= !
##################
### SNPs count ###
##################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
library(seqinr)
library(stringr)
# generate folder
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
dir.create(output_Robjects, showWarnings = F, recursive = T)
#####################################################################################
### Counting polymorphic sites (masked as ambiguity codes in consensus sequences) ###
#####################################################################################
# getting targets and sample names
if(targets_file_format == "AA"){
targets <- read.fasta(fasta_file_with_targets, seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
} else if(targets_file_format == "DNA"){
targets <- read.fasta(fasta_file_with_targets, seqtype = "DNA", as.string = TRUE, set.attributes = FALSE)
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
targets_name <- unique(gsub(".*-","",labels(targets)))
samples <- readLines(path_to_namelist)
if(intronerated_contig=="yes"){
intronerated_name <- "intronerated"
intronerated_underscore <- "_"
} else {
intronerated_name <- ""
intronerated_underscore <- ""
}
# function for counting ambiguities
seq_stats <- function(file){
fasta <- read.fasta(file, as.string=TRUE, set.attributes = FALSE )
seq <- gsub("N|[?]|-","",fasta[[1]])
c(round(str_length(seq),0),(str_count(seq,"Y|K|R|S|M|y|k|r|s|m|w") + str_count(seq,"W|D|H|B|V|w|d|h|b|v")*2) )
}
# generate empty tables for SNPS and sequence length
tab_snps <- data.frame(loci=targets_name)
tab_length <- data.frame(loci=targets_name)
# fill tables with information on snps and sequence length for each sample and locus
start_time <- Sys.time()
for(sample in samples){
#sample <- samples[1]
print(paste(sample," ", round(difftime(Sys.time(),start_time, units="secs"),0),"s", sep="") )
tab_sample <- data.frame(targets=targets_name, seq_length=NA, ambis=NA, ambi_prop=NA)
consensus_files <- list.files(file.path(path_to_output_folder,"01_data",sample, paste(intronerated_name,intronerated_underscore,"consensus", sep="")),full.names = T)
for(consensus_file in consensus_files) {
gene <- gsub("(_intronerated|).fasta","",gsub(".*/","",consensus_file))
file.path(path_to_output_folder,"01_data",sample, "consensus", paste(gene,intronerated_underscore,intronerated_name,".fasta",sep=""))
if(file.info(consensus_file)$size !=0){
stats <- seq_stats(consensus_file)
} else {
stats <- c(NA,NA)
}
tab_sample$seq_length[grep(gene,tab_sample$targets)] <- stats[1]
tab_sample$ambis[grep(gene,tab_sample$targets)] <- stats[2]
}
tab_sample$ambi_prop <- tab_sample$ambis/tab_sample$seq_length
tab_snps[match(sample,samples)] <- tab_sample$ambi_prop
tab_length[match(sample,samples)] <- tab_sample$seq_length
}
colnames(tab_snps) <- samples
rownames(tab_snps) <- targets_name
colnames(tab_length) <- samples
rownames(tab_length) <- targets_name
### Generate output tables and save data in Robjects
saveRDS(tab_snps,file=file.path(output_Robjects,"Table_SNPs.Rds"))
saveRDS(tab_length,file=file.path(output_Robjects,"Table_consensus_length.Rds"))
```
:::
:::spoiler Nora's script
Identical to the HybPhaser script above (no changes were made for this step).
:::
### Run script
In terminal, you can now run the R script.
``` bash= !
Rscript 1a_count_snps.R
```
This will produce the output folder 00_R_objects, containing R objects with the consensus sequence lengths and SNP proportions for samples and loci.
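To sanity-check the result, the saved tables can be inspected from the command line (the path assumes the default output structure with an empty subset name; adjust it to your path_to_output_folder):
```bash=
# rows are loci, columns are samples
Rscript -e 'tab <- readRDS("HybPhaserOutputs/00_R_objects/Table_SNPs.Rds"); print(dim(tab))'
```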
:::danger
**Issue with 1_generate_consensus_sequence**: (Pretty sure this is a logical error in `generate_consensus_sequence` that causes an actual error in `1a_count_snps.R`) I think if `reads_first.py` doesn't generate a contig for any bait in an individual, `1_generate_consensus_sequences` is unable to copy over files (it interprets the lack of a file as an error) and skips to the next individual. This (1) potentially leads to lost data and (2) causes downstream problems when future scripts expect folders/files to be there based on the original namelist. Current solution: update the namelist by removing individuals that have no contigs for a bait.
:::
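One way to build that filtered namelist is to keep only samples with at least one contig file, using the same glob the script itself uses (a sketch; set the paths to your own HybPiper results and namelist):
```bash=
PIPERDIR=./Brighamia_Paralog   # example path to the HybPiper results folder
> namelist_filtered.txt
while read -r SAMPLE; do
  if compgen -G "$PIPERDIR/$SAMPLE/*/*/sequences/FNA/*.FNA" > /dev/null; then
    echo "$SAMPLE" >> namelist_filtered.txt
  fi
done < name_file.txt
```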
## Step 5: 1b_assess_dataset.R
### Preparation
This R script generates tables and graphs to assess the dataset with regard to recovered sequence length and the proportion of SNPs per sample and gene.
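As with the previous step, it reads its settings from config.txt and is run from the terminal:
```bash=
Rscript 1b_assess_dataset.R
```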
### Examples of the 1b_assess_dataset.R files
:::spoiler HybPhaser script
``` r= !
#####################################################################################
### Generating files for assessment of missing data, paralogs and heterozygosity ###
#####################################################################################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
# generate folders
if(name_for_dataset_optimization_subset != ""){
folder_subset_add <- paste("_",name_for_dataset_optimization_subset, sep="")
}else {
folder_subset_add <- ""
}
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
output_assess <- file.path(path_to_output_folder,paste("02_assessment", folder_subset_add, sep=""))
dir.create(output_assess, showWarnings = F)
dir.create(output_Robjects, showWarnings = F)
# check if the SNPs count for this subset has been done; if not, use the SNPs count of the first run (subset with name "")
if(file.exists(file=file.path(output_Robjects,"Table_SNPs.Rds"))){
tab_snps <- readRDS(file=file.path(output_Robjects,"Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(output_Robjects,"Table_consensus_length.Rds"))
} else {
tab_snps <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_consensus_length.Rds"))
}
tab_snps <- as.matrix(tab_snps)
loci <- t(tab_snps)
##########################################################
### Dataset optimization step 1: Reducing missing data ###
##########################################################
nloci <- length(colnames(loci))
nsamples <- length(rownames(loci))
failed_loci <- which(colSums(is.na(loci))==nrow(loci))
failed_samples <- which(colSums(is.na(tab_snps))==nrow(tab_snps))
# per locus
seq_per_locus <- vector()
for(i in 1:nloci){
seq_per_locus[i] <- length(which(!(is.na(loci[,i]))))
}
names(seq_per_locus) <- colnames(loci)
seq_per_locus_prop <- seq_per_locus/nsamples
# per sample
seq_per_sample <- vector()
for(i in 1:length(colnames(tab_snps))){
seq_per_sample[i] <- length(which(!(is.na(tab_snps[,i]))))
}
names(seq_per_sample) <- colnames(tab_snps)
seq_per_sample_prop <- seq_per_sample/nloci
# proportion of target sequence length
if(targets_file_format == "AA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets, type = "AA"))*3
} else if(targets_file_format == "DNA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets))
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
gene_names <- unique(gsub(".*-","",gsub(" .*","",names(targets_length_all))))
max_target_length <- vector()
for(i in 1:length(gene_names)){
max_target_length[i] <- max(targets_length_all[grep(paste("\\b",gene_names[i],"\\b",sep=""),names(targets_length_all))])
}
names(max_target_length) <- gene_names
comb_target_length <- sum(max_target_length)
comb_seq_length_samples <- colSums(tab_length, na.rm = T)
prop_target_length_per_sample <- comb_seq_length_samples/comb_target_length
mean_seq_length_loci <- rowMeans(tab_length, na.rm = T)
names(mean_seq_length_loci) <- gene_names
prop_target_length_per_locus <- mean_seq_length_loci/max_target_length
# application of thresholds
outsamples_missing_loci <- seq_per_sample_prop[which(seq_per_sample_prop < remove_samples_with_less_than_this_propotion_of_loci_recovered)]
outsamples_missing_target <- prop_target_length_per_sample[which(prop_target_length_per_sample < remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outsamples_missing <- unique(names(c(outsamples_missing_loci,outsamples_missing_target)))
outloci_missing_samples <- seq_per_locus_prop[which(seq_per_locus_prop < remove_loci_with_less_than_this_propotion_of_samples_recovered)]
outloci_missing_target <- prop_target_length_per_locus[which(prop_target_length_per_locus < remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outloci_missing <- unique(names(c(outloci_missing_samples,outloci_missing_target)))
# removing bad loci and samples from the table
tab_snps_cl1 <- tab_snps
if(length(outsamples_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[,-which(colnames(tab_snps) %in% outsamples_missing)]
}
if(length(outloci_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[-which(rownames(tab_snps) %in% outloci_missing),]
}
loci_cl1 <- t(tab_snps_cl1)
# output
# graphics
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"1_Data_recovered_overview.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"1_Data_recovered_overview.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
par(mfrow=c(2,3))
boxplot(seq_per_sample_prop, main=paste("Samples: prop. of",nloci,"loci recovered"), xlab=paste("mean:",round(mean(seq_per_sample_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_sample_prop, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_loci_recovered," (",length(outsamples_missing_loci)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
boxplot(prop_target_length_per_sample, main=paste("Samples: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_sample, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_sample, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outsamples_missing_target)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_sample,seq_per_sample_prop, main = "Prop. of loci vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of loci" )
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
abline(v=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
boxplot(seq_per_locus_prop, main=paste("Loci: prop. of",nsamples,"samples recovered"), xlab=paste("mean:",round(mean(seq_per_locus_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_locus_prop, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_samples_recovered," (",length(outloci_missing_samples)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
boxplot(prop_target_length_per_locus, main=paste("Loci: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_locus, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_locus, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outloci_missing_target)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_locus,seq_per_locus_prop, main = "Prop. of samples vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of samples" )
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
abline(v=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
dev.off()
}
# tables
tab_seq_per_sample <- cbind(seq_per_sample,round(seq_per_sample_prop,3), round(prop_target_length_per_sample,3))
colnames(tab_seq_per_sample) <- c("No. loci", "Prop. of loci", "Prop. of target length")
write.csv(tab_seq_per_sample, file.path(output_assess, "1_Data_recovered_per_sample.csv"))
tab_seq_per_locus <- cbind(seq_per_locus,round(seq_per_locus_prop,3), round(prop_target_length_per_locus,3))
colnames(tab_seq_per_locus) <- c("No. samples", "Prop.of samples", "Prop. of target length")
write.csv(tab_seq_per_locus, file.path(output_assess, "1_Data_recovered_per_locus.csv"))
# summary text file
summary_file=file.path(output_assess,"1_Summary_missing_data.txt")
cat(file=summary_file, append = FALSE, "Dataset optimisation: Samples and loci removed to reduce missing data\n")
cat(file=summary_file, append = T, "\n", length(failed_samples)," samples failed completely:\n", paste(names(failed_samples)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_loci)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_loci_recovered,") for proportion of recovered loci:\n", paste(names(outsamples_missing_loci),"\t",round(outsamples_missing_loci,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_target)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered,") for recovered target sequence length\n", paste(names(outsamples_missing_target),"\t",round(outsamples_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outsamples_missing), " samples were removed:\n", paste(outsamples_missing,"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(failed_loci)," loci failed completely:\n", paste(names(failed_loci)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_samples)," loci are below the threshold (", remove_loci_with_less_than_this_propotion_of_samples_recovered,") for proportion of recovered samples:\n", paste(names(outloci_missing_samples),"\t",round(outloci_missing_samples,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_target)," loci are below the threshold (",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered,") for proportion of recovered target sequence length:\n", paste(names(outloci_missing_target),"\t",round(outloci_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outloci_missing), " loci were removed:\n", paste(outloci_missing,"\n"), sep="")
############################################################################################
### Dataset optimization step 2, removing paralogs for a) all samples and b) each sample ###
############################################################################################
### 2a) Paralogs across multiple samples (removing loci with unusually high proportions of SNPs across all samples)
###################################################################################################################
loci_cl1_colmeans <- colMeans(as.matrix(loci_cl1), na.rm = T)
nloci_cl1 <- length(colnames(loci_cl1))
nsamples_cl1 <- length(colnames(tab_snps_cl1))
loci_cl1_colmeans_mean <- round(mean(loci_cl1_colmeans),4)
loci_cl1_colmeans_median <- round(median(loci_cl1_colmeans),4)
# applying chosen threshold
if (length(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs) != 1 || remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "none"){
outloci_para_all <- vector()
threshold_value <- 1
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "outliers"){
threshold_value <- 1.5*IQR(loci_cl1_colmeans, na.rm = TRUE )+quantile(loci_cl1_colmeans, na.rm = TRUE )[4]
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "file"){
if(file.exists(file_with_putative_paralogs_to_remove_for_all_samples) == FALSE){
print("File with list of paralogs to remove for all samples does not exist.")
} else {
outloci_para_all <- readLines(file_with_putative_paralogs_to_remove_for_all_samples)
outloci_para_all_values <-loci_cl1_colmeans[which(names(loci_cl1_colmeans) %in% outloci_para_all)]
}
} else {
threshold_value <- remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
}
# color outliers red
colour_outparaall <- rep("black",nloci_cl1)
colour_outparaall[which(colnames(loci_cl1[,order(loci_cl1_colmeans)]) %in% outloci_para_all)] <- "red"
loci_cl1_order_means <- loci_cl1[,order(loci_cl1_colmeans)]
# generate bar graph
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2a_Paralogs_for_all_samples.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"2a_Paralogs_for_all_samples.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
layout(matrix(c(1,2),2,2, byrow=TRUE), widths=c(5,1))
barplot(sort(loci_cl1_colmeans), col=colour_outparaall, border = NA, las=2,
main=paste("Mean % SNPs across samples (n=",nsamples_cl1,") for each locus (n=", nloci_cl1,")", sep=""))
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
boxplot(loci_cl1_colmeans, las=2)
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
dev.off()
}
# removing marked loci from table
if(length(outloci_para_all)==0) {tab_snps_cl2a <- tab_snps_cl1
} else { tab_snps_cl2a <- tab_snps_cl1[-which(rownames(tab_snps_cl1) %in% outloci_para_all),]}
### 2b) Paralogs for each sample (removing outlier loci for each sample)
##########################################################################
tab_snps_cl2b <- tab_snps_cl2a
if(!exists("remove_outlier_loci_for_each_sample")) {remove_outlier_loci_for_each_sample <- "no"}
# generate tables without zeros to count only loci with SNPs
tab_snps_cl2a_nozero <- tab_snps_cl2a
tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero==0)] <- NA
tab_snps_cl2b_nozero <- tab_snps_cl2a_nozero
if(remove_outlier_loci_for_each_sample == "yes" ){
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
threshold_para_each <- outloci_para_each
for(i in 1:length(colnames(tab_snps_cl2a))){
threshold_i <- 1.5*IQR(tab_snps_cl2a_nozero[,i], na.rm = TRUE ) + quantile(tab_snps_cl2a_nozero[,i] , na.rm = TRUE)[[4]]
outlier_loci_i <- tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero[,i] > threshold_i),i]
outloci_para_each[i] <- list(outlier_loci_i)
threshold_para_each[i] <- threshold_i
tab_snps_cl2b[which(rownames(tab_snps_cl2a) %in% outlier_loci_i),i] <- NA
tab_snps_cl2b_nozero[which(rownames(tab_snps_cl2b_nozero) %in% names(outlier_loci_i)),i] <- NA
outliers_color <- "red"
}
} else {
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
tab_snps_cl2b <- tab_snps_cl2a
outliers_color <- "black"
}
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
### output
# generate graphic
for (i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2b_Paralogs_for_each_sample.pdf"), width = 10, h=14)
} else {
png(file=file.path(output_assess,"2b_Paralogs_for_each_sample.png"), width = 1000, h=1400)
}
par(mfrow=c(1,1))
boxplot(as.data.frame(tab_snps_cl2a_nozero[,order(colMeans(as.matrix(tab_snps_cl2a_nozero), na.rm = T))]),
horizontal=T, las=1, yaxt='n',
main="Proportions of SNPs for all loci per sample\n(only loci with any SNPs)",
ylab="Samples",
xlab="Proportion of SNPs",
col="grey", pars=list(outcol=outliers_color),
outpch=20
)
dev.off()
}
# write summary text file
cl2b_file <- file.path(output_assess,"2_Summary_Paralogs.txt")
cat(file=cl2b_file,"Removal of putative paralog loci.")
cat(file=cl2b_file,"Paralogs removed for all samples:\n", append = T)
cat(file=cl2b_file, paste("Variable 'remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs' set to: ", remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs,"\n", sep=""), append=T)
if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="file"){
cat(file=cl2b_file, paste("Loci listed in this file were removed: '", file_with_putative_paralogs_to_remove_for_all_samples,"'\n"), append=T)
} else if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="none"){
cat(file=cl2b_file,"None!\n", append = T)
} else {
cat(file=cl2b_file, paste("Resulting threshold value (mean proportion of SNPs):", round(threshold_value,5),"\n"), append=T)
}
if(length(outloci_para_all)>0){
cat(file=cl2b_file, paste(length(outloci_para_all)," loci were removed:\n",sep=""), append=T)
cat(file=cl2b_file, "locus\tmean_prop_SNPs\n", append=T)
cat(file=cl2b_file, paste(paste(outloci_para_all, round(outloci_para_all_values,4), sep="\t"), collapse="\n"), append=T)
cat(file=cl2b_file, "\n\n", append = T)
}
cat(file= file.path(output_assess,"2a_List_of_paralogs_removed_for_all_samples.txt"), paste(outloci_para_all,collapse = "\n"))
if(remove_outlier_loci_for_each_sample=="yes"){
cat(file=cl2b_file, "Paralogs removed for each sample:\n", append=T)
cat(file=cl2b_file, "Sample\tthreshold\t#removed\tnames\n", append=T)
for(i in 1:length(names(outloci_para_each))){
cat(file=cl2b_file, names(outloci_para_each)[i],"\t", append=T)
cat(file=cl2b_file, round(threshold_para_each[[i]],5), length(outloci_para_each[[i]]), paste(names(outloci_para_each[[i]]),collapse=", "), sep="\t", append=T)
cat(file=cl2b_file, "\n", append=T)
}
} else{
cat(file=cl2b_file, "The step for removing paralogs for each samples was skipped.\n", append=T)
}
# tables
write.csv(tab_snps_cl2b, file = file.path(output_assess,"0_Table_SNPs.csv"))
write.csv(tab_length_cl2b, file = file.path(output_assess,"0_Table_consensus_length.csv"))
# txt file with included samples
write(samples[which(!(samples %in% outsamples_missing))], file=file.path(output_assess,"0_namelist_included_samples.txt"))
#write(paste(rownames(tab_snps_cl2a),colMeans(t(tab_snps_cl2a), na.rm = T)), file=file.path(output_assess,"Mean_SNPs_loci.txt"))
### save Data as R objects
saveRDS(tab_snps_cl2b,file=file.path(output_Robjects,"Table_SNPs_cleaned.Rds"))
saveRDS(tab_length_cl2b,file=file.path(output_Robjects,"Table_consensus_length_cleaned.Rds"))
saveRDS(outloci_missing,file=file.path(output_Robjects,"outloci_missing.Rds"))
saveRDS(outsamples_missing,file=file.path(output_Robjects,"outsamples_missing.Rds"))
saveRDS(outloci_para_all,file=file.path(output_Robjects,"outloci_para_all.Rds"))
saveRDS(outloci_para_each,file=file.path(output_Robjects,"outloci_para_each.Rds"))
#############################################################################################################
### Generating summary table and graphs for assessment of Locus heterozygosity and allele divergence of samples ###
#############################################################################################################
tab_length <- as.matrix(tab_length)
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
targets_length_cl2b <- sum(max_target_length[which(gsub(".*-","",names(max_target_length)) %in% rownames(tab_snps_cl2b))])
########## generating summary table
nloci_cl2 <- length(tab_snps_cl2b[,1])
tab_het_ad <- data.frame("sample"=colnames(tab_snps_cl2b))
for(i in 1:length(colnames(tab_snps_cl2b))){
tab_het_ad$bp[i] <- sum(tab_length_cl2b[,i], na.rm = T)
tab_het_ad$bpoftarget[i] <- round(sum(tab_length_cl2b[,i], na.rm = T)/targets_length_cl2b,3)*100
tab_het_ad$paralogs_all[i] <- length(outloci_para_all)
tab_het_ad$paralogs_each[i] <- length(outloci_para_each[[i]])
tab_het_ad$nloci[i] <- nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))
tab_het_ad$allele_divergence[i] <- 100*round(sum(tab_length_cl2b[,i] * tab_snps_cl2b[,i], na.rm = T) / sum(tab_length_cl2b[,i], na.rm = T),5)
tab_het_ad$locus_heterozygosity[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]==0))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >0.5% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.005))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >1% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.01))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >2% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.02))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
}
# output as csv file
write.csv(tab_het_ad, file = file.path(output_assess, "4_Summary_table.csv"))
#output as R-object
saveRDS(tab_het_ad, file = file.path(output_Robjects, "Summary_table.Rds"))
### Generating graphs
text_size_mod <- 1
nrows <- length(tab_het_ad[,1])
text_size <- (15+200/nrows)*text_size_mod
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_LH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_LH_vs_AD.png"), h=1000,w=1000)
}
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity [%]", main="Locus heterozygosity vs allele divergence", las=1)
dev.off()
}
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_varLH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_varLH_vs_AD.png"), h=1000,w=1000)
}
par(mfrow=c(2,2))
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (0% SNPs) [%]", main="Locus heterozygosity (0% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >0.5% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>0.5% SNPs) [%]", main="Locus heterozygosity (>0.5% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >1% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>1% SNPs) [%]", main="Locus heterozygosity (>1% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >2% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>2% SNPs) [%]", main="Locus heterozygosity (>2% SNPs) vs allele divergence",las=1
)
par(mfrow=c(1,1))
dev.off()
}
```
:::
:::spoiler Nora's script
``` r= !
#####################################################################################
### Generating files for assessment of missing data, paralogs and heterozygosity ###
#####################################################################################
# load config
if (!(exists("config_file"))) {config_file <- "./config.txt"}
source(config_file)
# load packages
library(ape)
# generate folders
if(name_for_dataset_optimization_subset != ""){
folder_subset_add <- paste("_",name_for_dataset_optimization_subset, sep="")
}else {
folder_subset_add <- ""
}
output_Robjects <- file.path(path_to_output_folder,"00_R_objects", name_for_dataset_optimization_subset)
output_assess <- file.path(path_to_output_folder,paste("02_assessment", folder_subset_add, sep=""))
dir.create(output_assess, showWarnings = F)
dir.create(output_Robjects, showWarnings = F)
# check if the SNPs count for this subset has been done; if not, use the SNPs count of the first run (subset with name "")
if(file.exists(file=file.path(output_Robjects,"Table_SNPs.Rds"))){
tab_snps <- readRDS(file=file.path(output_Robjects,"Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(output_Robjects,"Table_consensus_length.Rds"))
} else {
tab_snps <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_SNPs.Rds"))
tab_length <- readRDS(file=file.path(path_to_output_folder,"00_R_objects/Table_consensus_length.Rds"))
}
tab_snps <- as.matrix(tab_snps)
loci <- t(tab_snps)
##########################################################
### Dataset optimization step 1: Reducing missing data ###
##########################################################
nloci <- length(colnames(loci))
nsamples <- length(rownames(loci))
failed_loci <- which(colSums(is.na(loci))==nrow(loci))
failed_samples <- which(colSums(is.na(tab_snps))==nrow(tab_snps))
# per locus
seq_per_locus <- vector()
for(i in 1:nloci){
seq_per_locus[i] <- length(which(!(is.na(loci[,i]))))
}
names(seq_per_locus) <- colnames(loci)
seq_per_locus_prop <- seq_per_locus/nsamples
# per sample
seq_per_sample <- vector()
for(i in 1:length(colnames(tab_snps))){
seq_per_sample[i] <- length(which(!(is.na(tab_snps[,i]))))
}
names(seq_per_sample) <- colnames(tab_snps)
seq_per_sample_prop <- seq_per_sample/nloci
# proportion of target sequence length
if(targets_file_format == "AA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets, type = "AA"))*3
} else if(targets_file_format == "DNA"){
targets_length_all <- lengths(read.FASTA(fasta_file_with_targets))
} else {
print("Warning! Target file type not set properly. Should be 'DNA' or 'AA'!")
}
gene_names <- unique(gsub(".*-","",gsub(" .*","",names(targets_length_all))))
max_target_length <- vector()
for(i in 1:length(gene_names)){
max_target_length[i] <- max(targets_length_all[grep(paste("\\b",gene_names[i],"\\b",sep=""),names(targets_length_all))])
}
names(max_target_length) <- gene_names
comb_target_length <- sum(max_target_length)
comb_seq_length_samples <- colSums(tab_length, na.rm = T)
prop_target_length_per_sample <- comb_seq_length_samples/comb_target_length
mean_seq_length_loci <- rowMeans(tab_length, na.rm = T)
names(mean_seq_length_loci) <- gene_names
prop_target_length_per_locus <- mean_seq_length_loci/max_target_length
# application of thresholds
outsamples_missing_loci <- seq_per_sample_prop[which(seq_per_sample_prop < remove_samples_with_less_than_this_propotion_of_loci_recovered)]
outsamples_missing_target <- prop_target_length_per_sample[which(prop_target_length_per_sample < remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outsamples_missing <- unique(names(c(outsamples_missing_loci,outsamples_missing_target)))
outloci_missing_samples <- seq_per_locus_prop[which(seq_per_locus_prop < remove_loci_with_less_than_this_propotion_of_samples_recovered)]
outloci_missing_target <- prop_target_length_per_locus[which(prop_target_length_per_locus < remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered)]
outloci_missing <- unique(names(c(outloci_missing_samples,outloci_missing_target)))
# removing bad loci and samples from the table
tab_snps_cl1 <- tab_snps
if(length(outsamples_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[,-which(colnames(tab_snps) %in% outsamples_missing)]
}
if(length(outloci_missing) != 0){
tab_snps_cl1 <- tab_snps_cl1[-which(rownames(tab_snps) %in% outloci_missing),]
}
loci_cl1 <- t(tab_snps_cl1)
# output
# graphics
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"1_Data_recovered_overview.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"1_Data_recovered_overview.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
par(mfrow=c(2,3))
boxplot(seq_per_sample_prop, main=paste("Samples: prop. of",nloci,"loci recovered"), xlab=paste("mean:",round(mean(seq_per_sample_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_sample_prop, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_loci_recovered," (",length(outsamples_missing_loci)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
boxplot(prop_target_length_per_sample, main=paste("Samples: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_sample, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_sample, na.rm = TRUE),2)," | threshold:",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outsamples_missing_target)," out)",sep=""))
abline(h=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_sample,seq_per_sample_prop, main = "Prop. of loci vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of loci" )
abline(h=remove_samples_with_less_than_this_propotion_of_loci_recovered, lty=2, col="red")
abline(v=remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
boxplot(seq_per_locus_prop, main=paste("Loci: prop. of",nsamples,"samples recovered"), xlab=paste("mean:",round(mean(seq_per_locus_prop, na.rm = TRUE),2), " | median:", round(median(seq_per_locus_prop, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_samples_recovered," (",length(outloci_missing_samples)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
boxplot(prop_target_length_per_locus, main=paste("Loci: prop. of target sequence length recovered"), xlab=paste("mean:",round(mean(prop_target_length_per_locus, na.rm = TRUE),2)," | median:", round(median(prop_target_length_per_locus, na.rm = TRUE),2)," | threshold:",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered," (",length(outloci_missing_target)," out)",sep=""))
abline(h=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
plot(prop_target_length_per_locus,seq_per_locus_prop, main = "Prop. of samples vs\n prop. of target length", xlab = "Prop. of target length", ylab= "Prop. of samples" )
abline(h=remove_loci_with_less_than_this_propotion_of_samples_recovered, lty=2, col="red")
abline(v=remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered, lty=2, col="red")
dev.off()
}
# tables
tab_seq_per_sample <- cbind(seq_per_sample,round(seq_per_sample_prop,3), round(prop_target_length_per_sample,3))
colnames(tab_seq_per_sample) <- c("No. loci", "Prop. of loci", "Prop. of target length")
write.csv(tab_seq_per_sample, file.path(output_assess, "1_Data_recovered_per_sample.csv"))
tab_seq_per_locus <- cbind(seq_per_locus,round(seq_per_locus_prop,3), round(prop_target_length_per_locus,3))
colnames(tab_seq_per_locus) <- c("No. samples", "Prop.of samples", "Prop. of target length")
write.csv(tab_seq_per_locus, file.path(output_assess, "1_Data_recovered_per_locus.csv"))
# summary text file
summary_file=file.path(output_assess,"1_Summary_missing_data.txt")
cat(file=summary_file, append = FALSE, "Dataset optimisation: Samples and loci removed to reduce missing data\n")
cat(file=summary_file, append = T, "\n", length(failed_samples)," samples failed completely:\n", paste(names(failed_samples)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_loci)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_loci_recovered,") for proportion of recovered loci:\n", paste(names(outsamples_missing_loci),"\t",round(outsamples_missing_loci,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outsamples_missing_target)," samples are below the threshold (",remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered,") for recovered target sequence length\n", paste(names(outsamples_missing_target),"\t",round(outsamples_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outsamples_missing), " samples were removed:\n", paste(outsamples_missing,"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(failed_loci)," loci failed completely:\n", paste(names(failed_loci)),"\n", sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_samples)," loci are below the threshold (", remove_loci_with_less_than_this_propotion_of_samples_recovered,") for proportion of recovered samples:\n", paste(names(outloci_missing_samples),"\t",round(outloci_missing_samples,3),"\n"), sep="")
cat(file=summary_file, append = T, "\n", length(outloci_missing_target)," loci are below the threshold (",remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered,") for proportion of recovered target sequence length:\n", paste(names(outloci_missing_target),"\t",round(outloci_missing_target,3),"\n"), sep="")
cat(file=summary_file, append = T, "\nIn total ", length(outloci_missing), " loci were removed:\n", paste(outloci_missing,"\n"), sep="")
############################################################################################
### Dataset optimization step 2, removing paralogs for a) all samples and b) each sample ###
############################################################################################
### 2a) Paralogs across multiple samples (removing loci with unusually high proportions of SNPs across all samples)
###################################################################################################################
loci_cl1_colmeans <- colMeans(as.matrix(loci_cl1), na.rm = T)
nloci_cl1 <- length(colnames(loci_cl1))
nsamples_cl1 <- length(colnames(tab_snps_cl1))
loci_cl1_colmeans_mean <- round(mean(loci_cl1_colmeans),4)
loci_cl1_colmeans_median <- round(median(loci_cl1_colmeans),4)
# applying chosen threshold
if (length(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs) != 1 || remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "none"){
outloci_para_all <- vector()
threshold_value <- 1
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "outliers"){
threshold_value <- 1.5*IQR(loci_cl1_colmeans, na.rm = TRUE )+quantile(loci_cl1_colmeans, na.rm = TRUE )[4]
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
} else if (remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs == "file"){
if(file.exists(file_with_putative_paralogs_to_remove_for_all_samples) == FALSE){
print("File with list of paralogs to remove for all samples does not exist.")
} else {
outloci_para_all <- readLines(file_with_putative_paralogs_to_remove_for_all_samples)
outloci_para_all_values <-loci_cl1_colmeans[which(names(loci_cl1_colmeans) %in% outloci_para_all)]
}
} else {
threshold_value <- remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs
outloci_para_all_values <- loci_cl1_colmeans[which(loci_cl1_colmeans > threshold_value)]
outloci_para_all <- names(outloci_para_all_values)
}
# color outliers red
colour_outparaall <- rep("black",nloci_cl1)
colour_outparaall[which(colnames(loci_cl1[,order(loci_cl1_colmeans)]) %in% outloci_para_all)] <- "red"
loci_cl1_order_means <- loci_cl1[,order(loci_cl1_colmeans)]
# generate bar graph
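# (two passes: i==1 writes a PDF, i==2 writes a PNG of the same figure)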
for(i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2a_Paralogs_for_all_samples.pdf"), width = 11, height=7)
} else {
png(file=file.path(output_assess,"2a_Paralogs_for_all_samples.png"), width = 1400, height=1000)
par(cex.axis=2, cex.lab=2, cex.main=2)
}
layout(matrix(c(1,2),2,2, byrow=TRUE), widths=c(5,1))
barplot(sort(loci_cl1_colmeans), col=colour_outparaall, border = NA, las=2,
main=paste("Mean % SNPs across samples (n=",nsamples_cl1,") for each locus (n=", nloci_cl1,")", sep=""))
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
boxplot(loci_cl1_colmeans, las=2)
if(length(threshold_value)>0 && remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs != "file"){abline(h=threshold_value, col="red", lty=2)}
dev.off()
}
# removing marked loci from table
if(length(outloci_para_all)==0) {tab_snps_cl2a <- tab_snps_cl1
} else { tab_snps_cl2a <- tab_snps_cl1[-which(rownames(tab_snps_cl1) %in% outloci_para_all),]}
### 2b) Paralogs for each sample (removing outlier loci for each sample)
##########################################################################
tab_snps_cl2b <- tab_snps_cl2a
if(!exists("remove_outlier_loci_for_each_sample")) {remove_outlier_loci_for_each_sample <- "no"}
# generate tables without zeros to count only loci with SNPs
tab_snps_cl2a_nozero <- tab_snps_cl2a
tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero==0)] <- NA
tab_snps_cl2b_nozero <- tab_snps_cl2a_nozero
if(remove_outlier_loci_for_each_sample == "yes" ){
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
threshold_para_each <- outloci_para_each
for(i in 1:length(colnames(tab_snps_cl2a))){
threshold_i <- 1.5*IQR(tab_snps_cl2a_nozero[,i], na.rm = TRUE ) + quantile(tab_snps_cl2a_nozero[,i] , na.rm = TRUE)[[4]]
outlier_loci_i <- tab_snps_cl2a_nozero[which(tab_snps_cl2a_nozero[,i] > threshold_i),i]
outloci_para_each[i] <- list(outlier_loci_i)
threshold_para_each[i] <- threshold_i
tab_snps_cl2b[which(rownames(tab_snps_cl2a) %in% outlier_loci_i),i] <- NA
tab_snps_cl2b_nozero[which(rownames(tab_snps_cl2b_nozero) %in% names(outlier_loci_i)),i] <- NA
outliers_color <- "red"
}
} else {
outloci_para_each <- list()
outloci_para_each <- sapply(colnames(tab_snps_cl2a),function(x) NULL)
tab_snps_cl2b <- tab_snps_cl2a
outliers_color <- "black"
}
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
### output
# generate graphic
for (i in 1:2){
if(i==1){
pdf(file=file.path(output_assess,"2b_Paralogs_for_each_sample.pdf"), width = 10, h=14)
} else {
png(file=file.path(output_assess,"2b_Paralogs_for_each_sample.png"), width = 1000, h=1400)
}
par(mfrow=c(1,1))
boxplot(as.data.frame(tab_snps_cl2a_nozero[,order(colMeans(as.matrix(tab_snps_cl2a_nozero), na.rm = T))]),
horizontal=T, las=1, yaxt='n',
main="Proportions of SNPs for all loci per sample\n(only loci with any SNPs)",
ylab="Samples",
xlab="Proportion of SNPs",
col="grey", pars=list(outcol=outliers_color),
outpch=20
)
dev.off()
}
# write summary text file
cl2b_file <- file.path(output_assess,"2_Summary_Paralogs.txt")
cat(file=cl2b_file,"Removal of putative paralog loci.")
cat(file=cl2b_file,"Paralogs removed for all samples:\n", append = T)
cat(file=cl2b_file, paste("Variable 'remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs' set to: ", remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs,"\n", sep=""), append=T)
if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="file"){
cat(file=cl2b_file, paste("Loci listed in this file were removed: '", file_with_putative_paralogs_to_remove_for_all_samples,"'\n"), append=T)
} else if(remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs=="none"){
cat(file=cl2b_file,"None!\n", append = T)
} else {
cat(file=cl2b_file, paste("Resulting threshold value (mean proportion of SNPs):", round(threshold_value,5),"\n"), append=T)
}
if(length(outloci_para_all)>0){
cat(file=cl2b_file, paste(length(outloci_para_all)," loci were removed:\n",sep=""), append=T)
cat(file=cl2b_file, "locus\tmean_prop_SNPs\n", append=T)
cat(file=cl2b_file, paste(paste(outloci_para_all, round(outloci_para_all_values,4), sep="\t"), collapse="\n"), append=T)
cat(file=cl2b_file, "\n\n", append = T)
}
cat(file= file.path(output_assess,"2a_List_of_paralogs_removed_for_all_samples.txt"), paste(outloci_para_all,collapse = "\n"))
if(remove_outlier_loci_for_each_sample=="yes"){
cat(file=cl2b_file, "Paralogs removed for each sample:\n", append=T)
cat(file=cl2b_file, "Sample\tthreshold\t#removed\tnames\n", append=T)
for(i in 1:length(names(outloci_para_each))){
cat(file=cl2b_file, names(outloci_para_each)[i],"\t", append=T)
cat(file=cl2b_file, round(threshold_para_each[[i]],5), length(outloci_para_each[[i]]), paste(names(outloci_para_each[[i]]),collapse=", "), sep="\t", append=T)
cat(file=cl2b_file, "\n", append=T)
}
} else{
cat(file=cl2b_file, "The step for removing paralogs for each samples was skipped.\n", append=T)
}
# tables
write.csv(tab_snps_cl2b, file = file.path(output_assess,"0_Table_SNPs.csv"))
write.csv(tab_length_cl2b, file = file.path(output_assess,"0_Table_consensus_length.csv"))
# txt file with included samples
write(colnames(tab_snps)[which(!(colnames(tab_snps) %in% outsamples_missing))], file=file.path(output_assess,"0_namelist_included_samples.txt"))
# write(outsamples_missing, file=file.path(output_assess, "0_namelist_excluded_samples.txt"))
#write(paste(rownames(tab_snps_cl2a),colMeans(t(tab_snps_cl2a), na.rm = T)), file=file.path(output_assess,"Mean_SNPs_loci.txt"))
#outsamples_missing <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
### save Data as R objects
saveRDS(failed_loci,file=file.path(output_Robjects,"failed_loci.Rds"))
saveRDS(failed_samples,file=file.path(output_Robjects,"failed_samples.Rds"))
saveRDS(tab_snps_cl2b,file=file.path(output_Robjects,"Table_SNPs_cleaned.Rds"))
saveRDS(tab_length_cl2b,file=file.path(output_Robjects,"Table_consensus_length_cleaned.Rds"))
saveRDS(outloci_missing,file=file.path(output_Robjects,"outloci_missing.Rds"))
saveRDS(outsamples_missing,file=file.path(output_Robjects,"outsamples_missing.Rds"))
saveRDS(outloci_para_all,file=file.path(output_Robjects,"outloci_para_all.Rds"))
saveRDS(outloci_para_each,file=file.path(output_Robjects,"outloci_para_each.Rds"))
##################################################################################################################
### Generating summary table and graphs for assessment of locus heterozygosity and allele divergence of samples ###
##################################################################################################################
tab_length <- as.matrix(tab_length)
tab_length_cl2b <- tab_length[which(rownames(tab_length) %in% rownames(tab_snps_cl2b)),which(colnames(tab_length) %in% colnames(tab_snps_cl2b))]
targets_length_cl2b <- sum(max_target_length[which(gsub(".*-","",names(max_target_length)) %in% rownames(tab_snps_cl2b))])
########## generating summary table
nloci_cl2 <- length(tab_snps_cl2b[,1])
tab_het_ad <- data.frame("sample"=colnames(tab_snps_cl2b))
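# Per sample: total consensus length (bp), % of target length recovered,
# counts of removed paralogs, number of loci, allele divergence
# (length-weighted mean SNP proportion, in %), and locus heterozygosity
# (% of loci with at least one SNP; also at 0.5%, 1% and 2% SNP cutoffs).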
for(i in 1:length(colnames(tab_snps_cl2b))){
tab_het_ad$bp[i] <- sum(tab_length_cl2b[,i], na.rm = T)
tab_het_ad$bpoftarget[i] <- round(sum(tab_length_cl2b[,i], na.rm = T)/targets_length_cl2b,3)*100
tab_het_ad$paralogs_all[i] <- length(outloci_para_all)
tab_het_ad$paralogs_each[i] <- length(outloci_para_each[[i]])
tab_het_ad$nloci[i] <- nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))
tab_het_ad$allele_divergence[i] <- 100*round(sum(tab_length_cl2b[,i] * tab_snps_cl2b[,i], na.rm = T) / sum(tab_length_cl2b[,i], na.rm = T),5)
tab_het_ad$locus_heterozygosity[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]==0))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >0.5% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.005))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >1% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.01))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
tab_het_ad$'loci with >2% SNPs'[i] <- 100*round(1 - length(which(tab_snps_cl2b[,i]<0.02))/ (nloci_cl2-length(which(is.na(tab_snps_cl2b[,i])))),4)
}
# output as csv file
write.csv(tab_het_ad, file = file.path(output_assess, "4_Summary_table.csv"))
#output as R-object
saveRDS(tab_het_ad, file = file.path(output_Robjects, "Summary_table.Rds"))
### Generating graphs
text_size_mod <- 1
nrows <- length(tab_het_ad[,1])
text_size <- (15+200/nrows)*text_size_mod
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_LH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_LH_vs_AD.png"), h=1000,w=1000)
}
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity [%]", main="Locus heterozygosity vs allele divergence", las=1)
dev.off()
}
for(i in 1:2){
if(i==1){
pdf(file.path(output_assess,"3_varLH_vs_AD.pdf"), h=10,w=10)
} else {
png(file.path(output_assess,"3_varLH_vs_AD.png"), h=1000,w=1000)
}
par(mfrow=c(2,2))
plot(tab_het_ad$allele_divergence,tab_het_ad$locus_heterozygosity,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (0% SNPs) [%]", main="Locus heterozygosity (0% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >0.5% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>0.5% SNPs) [%]", main="Locus heterozygosity (>0.5% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >1% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>1% SNPs) [%]", main="Locus heterozygosity (>1% SNPs) vs allele divergence",las=1
)
plot(tab_het_ad$allele_divergence,tab_het_ad$`loci with >2% SNPs`,
xlab="Allele divergence [%]", ylab="Locus heterozygosity (>2% SNPs) [%]", main="Locus heterozygosity (>2% SNPs) vs allele divergence",las=1
)
par(mfrow=c(1,1))
dev.off()
}
```
:::
### Changes to make config txt
Details below for parameters set in config.txt.
:::spoiler Config changes
- `name_for_dataset_optimization_subset = ""`
These values are the ones _Nauheimer et al._ use in their paper:
- `remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2`
- `remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.45`
- `remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2`
- `remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered`
For the paralogs:
- `remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"`
- `file_with_putative_paralogs_to_remove_for_all_samples = ""`
- `remove_outlier_loci_for_each_sample = "no"`
These values should be carefully evaluated per dataset. A combined config.txt sketch follows after this box.
:::
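Putting these together, the dataset optimization part of `config.txt` might look like the sketch below. This is a minimal sketch using the values listed above; the exact layout is an assumption, and the one threshold not recorded above is marked as a placeholder:
```r=
# Dataset optimization settings (sketch; adjust per dataset)
name_for_dataset_optimization_subset = ""
remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2
remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.45
remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2
remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.2  # placeholder: value not recorded in these notes

# Paralog removal settings
remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"
file_with_putative_paralogs_to_remove_for_all_samples = ""
remove_outlier_loci_for_each_sample = "no"
```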
### Run
To run the script, type:
```bash=
Rscript 1b_assess_dataset.R
```
:::danger
**1b_assess_dataset.R**: Line 370 `write(samples[which(!(samples %in% outsamples_missing))], file=file.path(output_assess, "0_namelist_included_samples.txt"))` gave an error; the line was commented out, with no obvious fix at the time.
Nora's fix:
The problem is that the sample names are not stored in an object called `samples`. I modified the code at line 370 to avoid the `samples` object:
```r=
write(colnames(tab_snps)[which(!(colnames(tab_snps) %in%
outsamples_missing))],
file=file.path(output_assess,
"0_namelist_included_samples.txt"))
```
In the 1c_generate_sequence_lists.R script, there is an issue that the objects `failed_loci` and `failed_samples` do not exist. Nora's fix is to have 1b_assess_dataset.R save these as R objects (as shown in the script above). At "Save Data as R Objects", line 375, I added:
```r=
saveRDS(failed_loci,file=file.path(output_Robjects,"failed_loci.Rds"))
saveRDS(failed_samples,file=file.path(output_Robjects,"failed_samples.Rds"))
```
:::
## Step 6: ++1c_generate_sequence_lists.R++
This R script generates a list of sequences for every locus and sample. These files are used for downstream analyses.
The script itself had quite a few occurrences of the same error. The Brighamia data did not have any loci that completely failed (I think I removed them earlier on due to poor coverage and issues with some of the scripts). This R script expects an R object called `failed_loci` to exist; I am unsure where/when in the script this object is created, but it did not exist for the Brighamia data. An example of how I fixed this can be found in the troubleshooting box below.
Running the script is again as simple as
```bash=
Rscript 1c_generate_sequence_lists.R
```
:::danger
**1c_generate_sequence_lists.R**: The latter half of this script expects R objects to exist (like `failed_samples` and `failed_loci`), but the objects didn't exist for me and I could not find where they are supposedly created.
```r=
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all) ###ERROR: object (failed_loci) not found
```
This line returns the error `object 'failed_loci' not found`. I'm not sure if my fix still leads to the desired outcome of the script, but I added a step to check for the existence of the object; if it didn't exist, I created an empty vector for it:
```r=
if(exists("failed_loci")) {
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
} else {
loci_to_remove <- c()
failed_loci <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
```
This solution was applied everywhere the R script expected an object that did not exist for me (specifically the objects `failed_samples` and `failed_loci`).
Nora's Troubleshooting for this error:
Impatiens had several failed loci, so using an empty vector would not be advised. I fixed this by editing the 1b_assess_dataset.R script to write objects for failed_loci and failed_samples (see above).
The 1c_generate_sequence_lists.R script will also have to be edited to read these R objects. At line 40, I added:
```r=
failed_loci<-readRDS(file=file.path(output_Robjects,"failed_loci.Rds"))
failed_samples<-readRDS(file=file.path(output_Robjects,"failed_samples.Rds"))
```
Script ran without issues after that.
:::
## Step 7: Aligning and trimming consensus sequences
```bash=
##### Running MAFFT on an individual sample:
mafft --auto 2000468_BI005_consensus.fasta > 2000468_BI005_consensus_aligned.fasta

##### Loop to write MAFFT commands for every fasta in the folder
# I created a folder called loci_consensus_aligned to house all the aligned files
for file in *.fasta
do
    file2=${file//.fasta/_aligned.fasta}
    echo "mafft --auto $file > $file2" >> mafft_loop.sh
    echo "mv $file2 loci_consensus_aligned" >> mafft_loop.sh
done
bash mafft_loop.sh   # run the generated commands

##### Running trimAl on an individual sample:
# -gt 0.85 keeps columns with residues in >=85% of sequences; -st is the
# minimum column similarity score; -cons retains at least this % of columns
trimal -in 2001114_BI146_consensus_aligned.fasta -out 2001114_BI146_trimmed.fasta -gt 0.85 -st 0.001 -cons 35

##### Loop trimAl through the folder of aligned files
for file in *_consensus_aligned.fasta
do
    file2=${file//_consensus_aligned.fasta/_consensus_trimmed.fasta}
    echo "trimal -in $file -out $file2 -gt 0.85 -st 0.001 -cons 10" >> trimAl_loop.sh
    echo "mv $file2 loci_consensus_trimmed" >> trimAl_loop.sh
done
bash trimAl_loop.sh
```
# HybPhaser 2
## Extract mapped reads
We don't remember exactly what we did for this step; a hypothetical reconstruction is sketched below.
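A plausible approach (an assumption, not the recorded method) is to pull the reads that mapped during HybPiper out of each sample's BAM with `samtools fastq` and collect them in a `mapped_reads/` folder for the clade association step. A minimal R sketch, assuming HybPiper's `<sample>/<sample>.bam` layout and a `namelist.txt` of sample names:
```r=
# Hypothetical reconstruction (not the recorded method): extract the reads
# that mapped during HybPiper from each sample's BAM with samtools fastq.
samples <- readLines("namelist.txt")          # assumed HybPiper namelist
dir.create("mapped_reads", showWarnings = FALSE)
for (s in samples) {
  bam <- file.path(s, paste0(s, ".bam"))      # assumed HybPiper BAM location
  out <- file.path("mapped_reads", paste0(s, "_mapped.fastq"))
  # -F 4 excludes unmapped reads; fastq goes to stdout, redirected to 'out'
  system2("samtools", c("fastq", "-F", "4", bam), stdout = out)
}
```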
## BBsplit analysis and collation
For the R scripts `2a_prepare_bb_split.R` and `2b_collate_bbsplit_results.R`, make sure that the config file points to the correct folders:
- `path_to_clade_association_folder = "~/HybPhaser2/04_clade_association"`
- `csv_file_with_clade_reference_names = "~/HybPhaser2/03_sequence_lists/trimmed_loci_consensus_unique_alleles/4471_ref.csv"` (this is Nora's folder)
- `path_to_reference_sequences = "~/HybPhaser2/03_sequence_lists/samples_consensus"`
- `path_to_read_files_cladeassociation = "~/HybPhaser2/mapped_reads"`
- `run_clade_association_mapping_in_R = "yes"`
    - If this is set to `"no"`, you must run the bash script that `2a_prepare_bb_split.R` creates in 04_clade_association.
Then simply run `Rscript 2a_prepare_bb_split.R` and `Rscript 2b_collate_bbsplit_results.R`. An example of the clade reference CSV is sketched below.
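As we understand the format (double-check against the HybPhaser wiki), the clade reference CSV pairs each reference sample name with a clade abbreviation, one comma-separated pair per row with no header. A minimal sketch in R with made-up sample/clade names:
```r=
# Hypothetical clade reference file: sample name, clade abbreviation (no header)
clade_refs <- data.frame(sample = c("2000468_BI005", "2001114_BI146"),
                         clade  = c("cladeA", "cladeB"))
write.table(clade_refs, "4471_ref.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```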
### Restructuring HybPhaser 2
We restructured HybPhaser 2 to act on the individual/gene level as opposed to the species/genome level, using the Python script `allele_association.py`.
To hopefully get cleaner HybPhaser 2 results, we reduced the number of alleles per gene to two (previously ranging from 0 to 7), selecting the most divergent alleles, using the Python script `max_divergence.py` (a sketch of the idea follows).
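`max_divergence.py` itself is not reproduced here; the R sketch below only illustrates the idea, assuming one aligned fasta of alleles per gene and the `ape` package: compute pairwise distances and keep the two alleles in the most divergent pair.
```r=
library(ape)
# Hypothetical input: aligned fasta of all alleles for one gene
aln <- read.dna("gene1_alleles_aligned.fasta", format = "fasta")
if (nrow(aln) > 2) {
  d   <- as.matrix(dist.dna(aln, model = "raw", pairwise.deletion = TRUE))
  idx <- which(d == max(d), arr.ind = TRUE)[1, ]   # row/col of the max distance
  aln <- aln[rownames(d)[idx], ]                   # keep the two most divergent alleles
}
write.FASTA(aln, "gene1_top2_alleles.fasta")
```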
---
NOTES
Step 2: Clade association
This is the Figure 1C step in the HybPhaser paper.
Look at the consensus sequences to identify individuals with no ambiguity codes.