# Dylan's HybSeq Hacks!
###### tags: `HybSeq`
author: Dylan Cohen
---
Just some things that I found useful for HybSeq/HybPiper data collection and analyses. I still need to organize this code better.
---
# Rsync
Curie → Funk
```
rsync -avz /home/dcl9541/Oaks_mapping_vcf/split/*.txt dcl9541@10.2.0.51:/home/dcl9541/vcf/split
```
JBOD/Curie → Quest
```
rsync -avz /data/labs/Fant/Cohen/Magnolia/cut_adapt/transfer/ dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/cut_adapt/GENES/test/split_samples/xray/
```
JBOD/Curie → Desktop
```
rsync -avz dcl9541@10.2.0.53:/home/dcl9541/Magnolia_working/zalign/Mag2.align.fasta /Users/dcohen/Desktop/QUEST
```
Quest → JBOD/Curie
```
rsync -aivP dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/Liri/outs/ /home/dcl9541/Magnolia_working
```
Quest → Desktop
```
rsync -avz dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/cut_adapt/GENES/test/split_samples2/summary.txt /Users/dcohen/Desktop/QUEST
```
Desktop → Quest
```
rsync -avz /Users/dcohen/Desktop/Beast_Files_Amo/whatup2.nex dcl9541@quest.northwestern.edu:/home/dcl9541/p31913/revbayes_biogeo_biogeo_simple/
```
---
My path to treePL on Quest
```
/home/dcl9541/.conda/envs/treePL/treePL
```
Curie
```
/home/dcl9541/.conda/envs/treepl/treePL
```
---
Rename script for fixing incorrect tip names on a phylogeny. It expects `swap_list.txt` with one `old_name new_name` pair per line; the two-phase placeholder swap avoids clobbering a name that appears on both sides of the list.
```
#!/bin/bash
# Input files
tree_file="tree_file.tre"
swap_list="swap_list.txt"
output_file="swapped_tree_file.tre"
temp_file="temp_tree_file.tre"

# Make a copy of the tree file to apply the changes
cp "$tree_file" "$temp_file"

# Unique prefix for placeholders
placeholder_prefix="PLACEHOLDER_"

# Phase 1: replace old names with temporary placeholders, so a new name
# that happens to match a later old name is not swapped twice
count=1
while read -r old_name new_name; do
    placeholder="${placeholder_prefix}${count}"
    sed -i "s/\b$old_name\b/$placeholder/g" "$temp_file"
    ((count++))
done < "$swap_list"

# Phase 2: replace placeholders with the new names
count=1
while read -r old_name new_name; do
    placeholder="${placeholder_prefix}${count}"
    sed -i "s/\b$placeholder\b/$new_name/g" "$temp_file"
    ((count++))
done < "$swap_list"

# Move the final output to the desired output file
mv "$temp_file" "$output_file"
echo "Swapping complete. Updated tree saved in $output_file."
```
----
## Using AMAS.py to compute summary stats
Install it in a conda env alongside HybPiper.
```
AMAS.py summary -f fasta -d dna -i *.fasta
```
This outputs a summary.txt.
Following Pokorny et al. 2024, filter out genes whose alignment length or proportion of parsimony-informative characters is less than 1/3 of the median across all genes.
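The 1/3-of-median cutoff can be computed straight from summary.txt. A minimal sketch; the column names (`Alignment_name`, `Alignment_length`, `Proportion_parsimony_informative`) are assumed to match your AMAS version's output, so check the header of your own summary.txt first:

```python
# Flag genes below 1/3 of the median alignment length or proportion of
# parsimony-informative characters (the Pokorny et al. 2024 filter).
# ASSUMPTION: column names match AMAS's summary.txt on my runs.
import csv
from statistics import median

def genes_to_drop(summary_path):
    with open(summary_path) as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    len_cut = median(float(r["Alignment_length"]) for r in rows) / 3
    pis_cut = median(float(r["Proportion_parsimony_informative"]) for r in rows) / 3
    return [r["Alignment_name"] for r in rows
            if float(r["Alignment_length"]) < len_cut
            or float(r["Proportion_parsimony_informative"]) < pis_cut]
```

Then `print("\n".join(genes_to_drop("summary.txt")))` lists the genes to toss.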
### Use AMAS.py to concatenate gene files
```
AMAS.py concat -f fasta -d dna -i *fasta -y raxml -u phylip
```
This will generate a RAxML-style partition file that can be used with IQ-TREE, plus a concatenated alignment in phylip (.phy) format.
----
## Removing GGGGGG runs at the end of reads (Trimmomatic does not do this)
To check whether files contain unwanted poly-G runs:
```
zgrep GGGGGGG file.gz
```
To remove extra poly-G runs found at the end of reads, run cutadapt after Trimmomatic:
```
cutadapt --nextseq-trim=20 -o out.fastq input.fastq
```
## Convert a fasta to phylip
```
require(ape)
mydata <- read.dna("way2.fasta", format = "fasta")
write.dna(mydata, "way2.phy", format = "interleaved", nbcol = -1, colsep = "")
```
---
## Renaming fasta headers to first word (sample name)
```
cut -d ' ' -f1 your_file.fa > new_file.fa
```
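Note that cut truncates every line at the first space; sequence lines normally contain no spaces so this is usually fine, but here is a header-only version in Python if you want to be careful (file paths are placeholders):

```python
# Truncate only header lines (">...") at the first whitespace, leaving
# sequence lines untouched.
def truncate_headers(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                line = line.split()[0] + "\n"
            dst.write(line)
```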
### Remove samples from a fasta file in R
```
library(Biostrings)
fasta <- readDNAStringSet("path/to/your.fa")
fasta_new <- fasta[!names(fasta) %in% c("VRE32514")]
# save to disk:
writeXStringSet(fasta_new, "path/to/outputfile.fa")
```
## Code to split multiple genes from one fasta
1) Need an outgroup - check the Tree of Life at Kew or NCBI/GenBank
2) Download the outgroup, load it onto the cluster (make sure to save it as a .fasta or .fa), then run this code; it will write each gene region to its own file named ####.out.fasta
```
cat yourfasta.fasta | awk '{
if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".out.fasta")}
print $0 >> filename
close(filename)
}'
```
Taken from here:
https://gist.github.com/astatham/621901
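The same split, sketched in Python (the .out.fasta suffix and first-word naming mirror the awk above):

```python
# Split a multi-FASTA into one file per record,
# named <first word of header>.out.fasta
def split_fasta(path):
    out = None
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if out:
                    out.close()
                out = open(line[1:].split()[0] + ".out.fasta", "w")
            if out:
                out.write(line)
    if out:
        out.close()
```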
---
## To rename the first line of fasta files
```
var=">Craspidospermum_verticillatum"
sed -i "1s/.*/$var/" *out.fasta
```
This code will rename the first line in all fasta files in a directory to: >Craspidospermum_verticillatum
---
## Add fasta file to another fasta file based on unique identifier
```
for f in *.out.fasta; do
    t=${f/.out./_supercontig.}
    out=${f/.out./_ALL.}
    cat "$f" "$t" > "$out"
done
```
This works if you have one file named xxx.out.fasta and a matching xxx_supercontig.fasta, or any variation of that - at least two distinct identifiers in the file names.
---
## Concatenated "supergene" - one continuous sequence per sample (with gaps). This would be like a plastid.
The first thing you will need to do before either approach is align all of your genes using mafft.
I found this on John's old HybSeq Protocol page (https://hackmd.io/emTOFHJaQNmkjsIwJWxZ6Q):
```
for i in *.fasta; do
mafft --auto ${i} > ${i%.*}_mafft.fasta;
done
```
Note you will have to load the mafft module on Quest; I am running it in a slurm script on an entire directory.
```
#!/bin/sh
#SBATCH --account=p31913 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:10:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=20 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=5GB ## how much RAM do you need per cpu (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=Align_job ## When you run squeue -u
#SBATCH --mail-type=END ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=yourEMAIL.org
#SBATCH --output=align.out
module load mafft
for i in *.fasta; do
mafft --auto ${i} > ${i%.*}_mafft.fasta;
done
```
## How to concatenate multiple genes (from same sample) found in multiple fasta files to one fasta with all samples
Note each sample will have all of its genes
I found a working solution here:
https://github.com/ballesterus/Utensils
These scripts were written in Python 2, so make sure to module load Python 2
Next you will want to copy/paste three python scripts:
Get_fasta_from_Ref.py
(note this one is found in this repo: https://github.com/ballesterus/UPhO)
partBreaker.py
geneStitcher.py
```
geneStitcher.py -in *_mafft.fasta
```
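Conceptually, the stitching step does this (a toy sketch of the idea, not geneStitcher.py itself): concatenate each sample's aligned genes in a fixed order, padding samples that are missing from a gene with gaps of that gene's aligned length.

```python
# Toy gene stitching: genes is a list of dicts, one per aligned gene,
# mapping sample name -> aligned sequence (equal lengths within a gene).
# Samples missing from a gene get "-" padding.
def stitch(genes):
    samples = sorted({s for g in genes for s in g})
    out = {s: "" for s in samples}
    for g in genes:
        gene_len = len(next(iter(g.values())))
        for s in samples:
            out[s] += g.get(s, "-" * gene_len)
    return out
```

For example, `stitch([{"A": "AT", "B": "GC"}, {"A": "GGG"}])` gives `{"A": "ATGGG", "B": "GC---"}`.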
---
# SUPER MATRIX APPROACH
Let's concatenate all the genes and work on the supergene tree first. This will give us a good indication of the species tree.
I found the script fasta_merge.py to concatenate all genes into a supergene matrix; it was taken from here: https://github.com/mossmatters/HybPiper/blob/master/hybpiper/fasta_merge.py
*Note you will need to activate hybpiper environment to use this script*
```
#!/usr/bin/env python

helptext = '''This script will take a list of FASTA files and concatenate them for use in
phylogenetic inference. The sequence headers (up until the first space) must be identical
in each individual FASTA file.
Individual gene sequences should be aligned prior to running this script!
This script requires BioPython to read/write FASTA sequences.'''

import os, sys, argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def read_sequences(fastafiles):
    '''Given a list of FASTA file names, read in each sequence to a dictionary of dictionaries, one per file'''
    return {filename: SeqIO.to_dict(SeqIO.parse(filename, 'fasta')) for filename in fastafiles}

def get_unique_names(gene_dict):
    '''Given the dictionary of SeqRecord dictionaries, return a list of the unique sequence headers'''
    all_names = []
    for gene in gene_dict:
        all_names += list(gene_dict[gene].keys())
    return set(all_names)

def insert_sequences(gene_dict, unique_names):
    '''Given the dictionary of dictionaries, insert blank sequences if any are missing for a gene'''
    inserted_sequences = 0
    for gene in gene_dict:
        for name in unique_names:
            if name not in gene_dict[gene]:
                gene_length = len(next(iter(gene_dict[gene].values())))
                gene_dict[gene][name] = SeqRecord(Seq("-" * gene_length), id=name)
                inserted_sequences += 1
    sys.stderr.write("{} Empty sequences inserted across all genes.\n".format(inserted_sequences))
    return gene_dict

def concatenate_sequences(gene_dict, fastafiles, unique_names):
    '''Given a dictionary of dictionaries with complete sampling in each gene, write out concatenated sequences to stdout. Returns a list of partition lengths.'''
    new_seq_dict = {}
    partition_lengths = []
    for gene in fastafiles:
        for name in unique_names:
            try:
                new_seq_dict[name] += gene_dict[gene][name]
            except KeyError:
                new_seq_dict[name] = gene_dict[gene][name]
        partition_lengths.append(len(next(iter(gene_dict[gene].values()))))
    for final_seq in new_seq_dict:
        SeqIO.write(new_seq_dict[final_seq], sys.stdout, 'fasta')
        final_seq_length = len(new_seq_dict[final_seq])
    sys.stderr.write("Final concatenated sequence length: {}\n".format(final_seq_length))
    return partition_lengths

def raxml_partition(fastafiles, partition_lengths, partition_type):
    '''Generate a raxml partition file for the given fastafiles. User specifies the partition type'''
    gene_start = 1
    partition_file = open("partition.raxml", 'w')
    if partition_type == 'CODON':
        for g in range(len(fastafiles)):
            codon3_start = gene_start + 2
            codon3_end = gene_start + partition_lengths[g] - 1
            codon1_end = codon3_end - 2
            codon2_start = gene_start + 1
            codon2_end = codon3_end - 1
            partition_file.write("{},{}{}={}-{}\\3,{}-{}\\3\n".format("DNA", fastafiles[g], "12", gene_start, codon1_end, codon2_start, codon2_end))
            partition_file.write("{},{}{}={}-{}\\3\n".format("DNA", fastafiles[g], "3", codon3_start, codon3_end))
            gene_start = codon3_end + 1
    else:
        for g in range(len(fastafiles)):
            gene_end = gene_start + partition_lengths[g] - 1
            partition_file.write("{},{}={}-{}\n".format(partition_type, fastafiles[g], gene_start, gene_end))
            gene_start = gene_end + 1
    partition_file.close()

def main():
    parser = argparse.ArgumentParser(description=helptext, formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("--fastafiles", nargs='+', help="List of Fasta Files. Can use wildcard on Linux/Mac systems")
    parser.add_argument("--filelist", help="File containing list of Fasta files. Alternative to --fastafiles")
    parser.add_argument("--raxml", help="Create a partition file 'partitions.raxml' intended for raxml in the current directory. For amino acid sequences, select the substitution model. To specify a separate model for 1st/2nd vs. 3rd codon positions, select CODON.",
                        choices=['DNA', 'WAG', 'JTT', 'CODON'], default=None)
    if len(sys.argv) < 2:
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()

    if args.fastafiles:
        if args.filelist:
            sys.stderr.write("Specify either a list of FASTA files or a file containing names, not both!\n")
            sys.exit(1)
        else:
            fastafiles = args.fastafiles
    elif args.filelist:
        if os.path.isfile(args.filelist):
            fastafiles = [x.rstrip() for x in open(args.filelist)]
        else:
            sys.stderr.write("File containing list of FASTA files not found!")
            sys.exit(1)
    else:
        sys.stderr.write("You must specify the FASTA files as a list or in a file.\n")
        sys.exit(1)

    sys.stderr.write("{} FASTA files found.\n".format(len(fastafiles)))
    gene_dict = read_sequences(fastafiles)
    sys.stderr.write("All sequences read successfully.\n")
    unique_names = get_unique_names(gene_dict)
    sys.stderr.write("{} Unique names found. If you were expecting fewer sequences, check your IDs!\n".format(len(unique_names)))
    gaps_inserted = insert_sequences(gene_dict, unique_names)
    partition_lengths = concatenate_sequences(gaps_inserted, fastafiles, unique_names)
    if args.raxml:
        raxml_partition(fastafiles, partition_lengths, args.raxml)

if __name__ == "__main__":
    main()
```
To run this script:
```
python fasta_merge.py --fastafiles *.fasta > super.fasta
```
There are some other easter eggs/functions in that script; I only used this code.
### Notes for making a target fasta from GenBank - you will want to use multiple baits or else it will have low complexity lol
1 - Download data with gene features;
headers should look something like this:
>lcl|KY045484.1_gene_3 [gene=psbA] [gbkey=Gene]
2 - Use a text editor to replace (lcl|KY045484.1_) with a four-letter name + "-"
3 - The header should look like this: >LOBETA-gene_1 [gene=psbA] [gbkey=Gene]
4 - Use this one-liner to delete everything after the gene number:
```sed '/^>/ s/ .*//' SIPHKR.fasta > SIKR3.fasta ```
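Steps 2-4 can also be done in one pass in Python (the accession pattern and the LOBETA prefix here are just illustrative):

```python
# Rewrite GenBank-style headers like
#   >lcl|KY045484.1_gene_3 [gene=psbA] [gbkey=Gene]
# to >PREFIX-gene_3, dropping everything after the gene number.
import re

def rename_header(line, prefix):
    m = re.match(r">lcl\|[^_]+_(\S+)", line)
    if m:
        return ">{}-{}\n".format(prefix, m.group(1))
    return line  # sequence lines pass through unchanged
```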
# To make a hybpiper assemble file
Borrowed from John Z:
```
for i in *R1_paired.fastq.gz
do f2=${i//R1/R2}
f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
f4=${i//_R1_paired.fastq.gz/}
echo hybpiper assemble -t_dna /projects/p31913/lobeliad_rawreads/fasta/HAW.plastid_fixed.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
```
---
# Amsonia HybSeq notes
1. Transfer files from Globus to Quest
2. Remove _001 from the file names
```
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```
3. Run this loop to generate the trimmomatic commands:
```
for _R1_ in *_R1*; do
R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
Added slurm heading to the .sh script
```
#!/bin/bash
#SBATCH --account=p31913 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=3:10:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=5 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=15GB ## how much RAM do you need per cpu (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=Trimm_job ## When you run squeue -u
#SBATCH --output=trim.out
module purge
module load trimmomatic/0.39
java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 A_elliptica_1_S3_R1.fastq.gz A_elliptica_1_S3_R2.fastq.gz A_elliptica_1_S3_R1_paired.fastq.gz A_elliptica_1_S3_R1_unpaired.fastq.gz A_elliptica_1_S3_R2_paired.f$
```
Concatenate the unpaired read files
```
#To concatenate the unpaired files for each sample:
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
```
Run loop to create commands for files
```
for i in *R1_paired.fastq.gz
do f2=${i//_R1_/_R2_}
f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
f4=${i//_R1_paired.fastq.gz/}
echo hybpiper assemble -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
```
example slurm script
```
#!/bin/bash
#SBATCH --account=p31913 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=24:10:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=10 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=15GB ## how much RAM do you need per cpu (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=HYBPIP_job ## When you run squeue -u
#SBATCH --output=hyb.out
module purge
#module load java
#module load trimmomatic
module load hybpiper/2.1.2-python-3.8.4
hybpiper assemble -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta -r A_elliptica_1_S3_R1_paired.fastq.gz A_elliptica_1_S3_R2_paired.fastq.gz --unpaired A_elliptica_1_S3_unpaired.fastq.gz --prefix A_elliptica_1_S3 --bwa --run_intronerate
```
To recover sample stats:
```
hybpiper stats -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta gene namefile2.txt
```
To recover a heatmap:
```
hybpiper recovery_heatmap seq_lengths.tsv --sample_text_size 25 --figure_height 45 --heatmap_filename map3
```
Get ya sequences!
```
hybpiper retrieve_sequences dna -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta --sample_names namefile2.txt
```
SUPERCONTIG!
```
hybpiper retrieve_sequences supercontig -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta --sample_names namefile2.txt --fasta_dir /projects/p31913/Amsonia-353/Results_353/SUPER
```
1 - Before running astral, collapse branches with low support, as suggested by the astral guide - you need to git clone newick utilities from GitHub (probably other programs can do this too)
2 - I also installed the latest version of ASTRAL (ASTER) from a git pull (https://github.com/chaoszhang/ASTER?tab=readme-ov-file)
Also note: to install this you need to load a newer version of gcc; I did this:
```module load gcc/8.4.0 ```
Aligning fasta files (from John Z)
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=14:00:00
#SBATCH --mem=48G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --job-name=coal_align
#SBATCH --error=align.log
###Load module
module load mafft
###For loop to run MAFFT on unaligned loci fasta
for i in *.fasta; do
mafft --auto --thread 12 ${i} > ${i%.*}_aligned.fasta;
done
```
Note: install trimal in a conda env.
Other trimming/alignment tools to check out: ClipKIT
https://jlsteenwyk.com/ClipKIT/
```
mamba install bioconda::trimal
```
Trim to 50% gaps (from John Z)
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=short
#SBATCH --time=01:30:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log
eval "$(conda shell.bash hook)"
conda activate iqtree
for i in *.fasta; do
trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```
As an alternative to trimal, you can use ClipKIT
(https://jlsteenwyk.com/ClipKIT/advanced/index.html)
```
#!/bin/bash
#SBATCH --account=p32416
#SBATCH --partition=short
#SBATCH --time=01:30:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=clipkit
#SBATCH --error=clip.log
eval "$(conda shell.bash hook)"
conda activate jvarkit
for i in *.fasta; do
# Run clipkit
clipkit "$i"
# Extract the base name (without the _supercontig part) and append _mus.clipkit.fasta
base_name=$(echo "$i" | sed 's/_supercontig_mus.fasta/_mus.clipkit.fasta/')
# Rename the file produced by clipkit to the desired format
mv "${i}.clipkit" "$base_name"
done
```
Setting up files for gene trees (from John Z). Note: the two loops below are alternatives - both write to full_iqtree.sh, so run only one (the first also collapses polytomies and adds SH-aLRT support).
```
for i in *.fasta; do
echo iqtree2 -s $i -nt AUTO -ntmax 4 -m TEST -B 1000 -wbtl --polytomy --alrt 1000
done > full_iqtree.sh
for i in *.fasta; do
echo iqtree2 -s $i -nt AUTO -ntmax 4 -m TEST -B 1000
done > full_iqtree.sh
```
Split the file for the array job (-l 37 puts 37 commands per chunk):
```
split -a 1 -d -l 37 full_iqtree.sh iqtree --additional-suffix=.sh
```
Array job
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=10:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log
module purge
eval "$(conda shell.bash hook)"
conda activate iqtree
#IFS=$'\n' read -d '' -r -a input_args < input_args.txt
#bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```
A complete version, run as an array job on the genomics partition:
```
#!/bin/bash
#SBATCH --account=b1042
#SBATCH --partition=genomics
#SBATCH --array=0-8
#SBATCH --time=20:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log
module purge
eval "$(conda shell.bash hook)"
conda activate iqtree2
IFS=$'\n' read -d '' -r -a input_args < input_args.txt
bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```
Collapse branches with low support (< 10%) using nw_ed from newick utilities
Note: I deleted the folder, so this now needs conda activate iqtree2
```
for i in *.treefile; do
nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```
From John Z
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_lob
#SBATCH --error=astral.log
module load gcc/8.4.0
/projects/p31913/ASTER/bin/astral -i amsonia_coal25_genetrees.tre -t 2 -o amsonia_coal25_sptre.tre 2>astral.log
```
------
Getting plastid sequences using GetOrganelle
github - https://github.com/Kinggerm/GetOrganelle
Including closely related species or genera may help with plastid capture - to do.
Assembling plastomes:
```
for i in *R1_paired.fastq.gz
do f2=${i//_R1_/_R2_}
f4=${i//_R1_paired.fastq.gz/}
echo get_organelle_from_reads.py -s amo_cp.fasta -1 $i -2 $f2 -o $f4 -t 4 -F embplant_pt -R 10 >> organ.sh
done
```
Include the unpaired reads
```
for i in *R1_paired.fastq.gz
do f2=${i//_R1_/_R2_}
f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
f4=${i//_R1_paired.fastq.gz/}
echo python /projects/p31913/GetOrganelle-1.7.4.1/get_organelle_from_reads.py -s amo_cp.fasta -1 $i -2 $f2 -u $f3 -o $f4 --prefix $f4 -t 4 -F embplant_pt -R 15 --reduce-reads-for-coverage inf --max-reads inf >> reorgan.sh
done
```
### Code to help get GetOrganelle output ready for cactus
First create a directory (sequences2) to store the output fastas, then copy them over:
```
find /projects/p31913/Trim_outs/trim_cut/ -type f -name "*_sequence.fasta" -exec cp {} /projects/p31913/Trim_outs/sequences2/ \;
```
---
### Rename files name_1.fasta
Make sure you only have complete genomes ending complete.graph1.1.path_sequence.fasta
```
for file in *embplant_pt*; do
# Extract the part of the filename before 'embplant_pt'
new_name=$(echo "$file" | sed 's/\(.*\)embplant_pt.*/\1_1.fasta/')
# Rename the file to the new name
mv "$file" "$new_name"
done
```
---
## Rename file headers
```
for file in *_1.fasta; do
#Extract the part of the filename before '_1.fasta'
new_header=$(basename "$file" "_1.fasta")
# Replace the line starting with '>' with the new header
sed -i "s/^>.*/>$new_header/" "$file"
done
```
---
Sequences need to be soft-masked to run cactus.
---
### Misc helpful code
Convert .gb to fasta
```
import os
from Bio import SeqIO

directory_path = './'

def process_genbank_files_in_directory(directory_path):
    for filename in os.listdir(directory_path):
        if filename.endswith(".gb"):
            file_path = os.path.join(directory_path, filename)
            genome = next(SeqIO.parse(file_path, "genbank"))
            cds = [feature for feature in genome.features if feature.type == 'CDS']
            seqs = []
            for feature in cds:
                seq = genome[int(feature.location.start):int(feature.location.end)]
                if feature.location.strand == -1:
                    seq.seq = seq.seq.reverse_complement()
                seq.description = feature.qualifiers['gene'][0]
                seqs.append(seq)
            # one CDS fasta per input .gb file
            output_file = f"{os.path.splitext(filename)[0]}.CDS.fasta"
            with open(output_file, 'w') as output_handle:
                SeqIO.write(seqs, output_handle, 'fasta')

process_genbank_files_in_directory(directory_path)
```
## Splitting genes into their own files
```
cat yourfasta.fasta | awk '{
if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".out.fasta")}
print $0 >> filename
close(filename)
}'
```
Replace the last underscore on each line with a space:
```
for i in *.fasta; do sed 's/\(.*\)_/\1 /' $i > $i.name.fasta; done
```
Renaming things:
```
for file in *out_aligned.fasta.name.fasta
do
mv "$file" "${file/out_aligned.fasta./}"
done
```
---------------
# Using PhyParts to detect gene tree incongruence
I ran this code on Quest. First you will need to install PhyParts; here is the link to the bitbucket (https://bitbucket.org/blackrim/phyparts/src/master/). To run PhyParts you will need to load this module: ```module load apache-maven-mvn3/3.9.7```
Notes - I generated my gene trees using IQ-TREE; the program suggests RAxML, but I'm not sure there is a big difference. Before running PhyParts you will need to root all your gene trees and the species tree on the outgroup. There is a python script to reroot found here: https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts (it's at the bottom of the page); you can write a little loop to do all the gene trees. You will also need to conda install ete3 (http://etetoolkit.org/download/) - just ete3, not etetoolkit or ete_toolchain. ete3 will also be used to parse the output for phypartspiecharts.
Once your trees are rerooted and PhyParts is installed, run PhyParts in a slurm script. Mine took a while to run 320 genes, so I let it run for 10 hours.
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=10:30:00
#SBATCH --mem=80G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --job-name=phy_part
#SBATCH --error=phy.log
module load apache-maven-mvn3/3.9.7
java -jar target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d /projects/p31913/Amsonia-353/Results_353/Super3/align/cutcut/root_test -m /projects/p31913/Amsonia-353/Results_353/SUPER2/phyl/phyparts/target/rerooted.tre -o out
```
To visualize with phypartspiecharts, first find the python script here (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts).
To run this script you will need the rooted species tree and the "base" of the PhyParts output file names - check the phypartspiecharts page for this. You will also need to install FastX on your computer and log into Quest through it (instructions here: https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1511)
phypartspiecharts will generate a .svg file that you can view in an internet browser and also convert to PDF.
-------------
# Convert a fasta to a nexus (for beast)
Install seqmagick
```
pip install seqmagick
```
Run this one liner
```
seqmagick convert --alphabet dna 6119_supercontig_aligned.fasta 6119_supercontig_aligned.nex
```