# Dylan's HybSeq Hacks!

###### tags: `HybSeq`

author: Dylan Cohen

----

Just some things that I found useful for HybSeq/HybPiper data collection and analyses - this code still needs better organization.

---

# Rsync

Curie to Funk:
```
rsync -avz /home/dcl9541/Oaks_mapping_vcf/split/*.txt dcl9541@10.2.0.51:/home/dcl9541/vcf/split
```
JBOD/Curie to Quest:
```
rsync -avz /data/labs/Fant/Cohen/Magnolia/cut_adapt/transfer/ dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/cut_adapt/GENES/test/split_samples/xray/
```
JBOD/Curie to Desktop:
```
rsync -avz dcl9541@10.2.0.53:/home/dcl9541/Magnolia_working/zalign/Mag2.align.fasta /Users/dcohen/Desktop/QUEST
```
Quest to JBOD/Curie:
```
rsync -aivP dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/Liri/outs/ /home/dcl9541/Magnolia_working
```
Quest to Desktop:
```
rsync -avz dcl9541@quest.northwestern.edu:/projects/p32416/Magnolia/cut_adapt/GENES/test/split_samples2/summary.txt /Users/dcohen/Desktop/QUEST
```
Desktop to Quest:
```
rsync -avz /Users/dcohen/Desktop/Beast_Files_Amo/whatup2.nex dcl9541@quest.northwestern.edu:/home/dcl9541/p31913/revbayes_biogeo_biogeo_simple/
```

---

My path to treePL on Quest:
```
/home/dcl9541/.conda/envs/treePL/treePL
```
Curie:
```
/home/dcl9541/.conda/envs/treepl/treePL
```

---

Rename script for incorrect tip names on a phylogeny:
```
#!/bin/bash

# Input files
tree_file="tree_file.tre"
swap_list="swap_list.txt"
output_file="swapped_tree_file.tre"
temp_file="temp_tree_file.tre"

# Make a copy of the tree file to apply the changes
cp "$tree_file" "$temp_file"

# Generate a unique prefix for placeholders
placeholder_prefix="PLACEHOLDER_"

# Phase 1: Replace old names with temporary placeholders
count=1
declare -A placeholders

# Read the swap list and replace old names with placeholders
while read -r old_name new_name; do
    placeholder="${placeholder_prefix}${count}"
    placeholders["$old_name"]="$placeholder"
    sed -i "s/\b$old_name\b/$placeholder/g" "$temp_file"
    ((count++))
done < "$swap_list"

# Phase 2: Replace placeholders with new names
count=1
while read -r old_name new_name; do
    placeholder="${placeholder_prefix}${count}"
    sed -i "s/\b$placeholder\b/$new_name/g" "$temp_file"
    ((count++))
done < "$swap_list"

# Move the final output to the desired output file
mv "$temp_file" "$output_file"

echo "Swapping complete. Updated tree saved in $output_file."
```

----

## Using AMAS.py to compute summary stats

Install in a conda env with hybpiper.
```
AMAS.py summary -f fasta -d dna -i *.fasta
```
This outputs a summary.txt. Following Pokorny et al. 2024, filter out genes whose alignment length or proportion of parsimony-informative characters is less than 1/3 of the median.

### Use AMAS.py to concatenate gene files
```
AMAS.py concat -f fasta -d dna -i *fasta -y raxml -u phylip
```
This will generate a raxml-style partition file that can be used with iqtree, and also a .phy alignment.

----

## Removing GGGGGGGs at the end of reads (trimmomatic does not do this)

To check whether files contain unwanted GGGGGs:
```
zgrep GGGGGGG file.gz
```
To remove extra GGGGGGs found at the end of reads, run cutadapt after using trimmomatic.
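To see how many reads are actually affected (and to confirm the trimming worked), you can count reads whose 3' ends are poly-G runs. A minimal Python sketch - the file name and the 10-G threshold are just placeholders:

```python
import gzip

def count_polyg_reads(fastq_gz, min_g=10):
    """Count reads in a gzipped FASTQ whose sequence ends in a run of >= min_g Gs."""
    hits = total = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # FASTQ sequence lines are every 4th line, offset 1
                total += 1
                if line.rstrip("\n").endswith("G" * min_g):
                    hits += 1
    return hits, total
```

Run it on a file before and after the cutadapt step to confirm the poly-G tails are gone.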
```
cutadapt --nextseq-trim=20 -o out.fastq input.fastq
```

## Convert a fasta to phylip
```
require(ape)
mydata = read.dna("way2.fasta", format = "fasta")
write.dna(mydata, "way2.phy", format = "interleaved", nbcol = -1, colsep = "")
```

---

## Renaming fasta headers to first word (sample name)
```
cut -d ' ' -f1 your_file.fa > new_file.fa
```

### Remove samples from a fasta file in R
```
library(Biostrings)
fasta <- readDNAStringSet("path/to/your.fa")
fasta_new <- fasta[!names(fasta) %in% c("VRE32514")]
# save to disk:
writeXStringSet(fasta_new, "path/to/outputfile.fa")
```

## Code to split multiple genes from one fasta

1) Need an outgroup - check the TOL at Kew or NCBI/GenBank.
2) Download the outgroup, load it onto the cluster, make sure to save it as a .fasta or .fa, then run this code; it will write each gene region to its own file named ####.out.fasta.
```
cat yourfasta.fasta | awk '{
    if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".out.fasta")}
    print $0 >> filename
    close(filename)
}'
```
Taken from here: https://gist.github.com/astatham/621901

---

## To rename the first line of fasta files
```
var=">Craspidospermum_verticillatum"
sed -i "1s/.*./$var/" *out.fasta
```
This code will rename the first line in all fasta files in a directory to: >Craspidospermum_verticillatum

---

## Add fasta file to another fasta file based on unique identifier
```
for f in *.out.fasta ; do
    t=${f/.out./_supercontig.}
    out=${f/.out./_ALL.}
    cat "$f" "$t" > "$out"
done
```
This will work if you have one file named out.fasta and another supercontig.fasta, or any variation of that - at least two different unique identifiers.

---

## Concatenated "supergene" - one continuous gene per sample (with gaps)

This would be like a plastid. The first thing you will need to do before either approach is align all of your genes using mafft.
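Whichever approach you take, concatenation assumes every per-gene fasta is truly aligned (all sequences the same length). A dependency-free sketch to check that before stitching genes together:

```python
def fasta_lengths(path):
    """Return {header_first_word: sequence_length} for a FASTA file."""
    lengths, name, n = {}, None, 0
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if name is not None:
                    lengths[name] = n
                name, n = line[1:].split()[0], 0
            else:
                n += len(line)
        if name is not None:
            lengths[name] = n
    return lengths

def is_aligned(path):
    """True if every sequence in the file has the same length."""
    return len(set(fasta_lengths(path).values())) <= 1
```

`is_aligned()` returning False usually means the alignment step failed or was skipped for that gene.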
I found this on John's old HybSeq Protocol page (https://hackmd.io/emTOFHJaQNmkjsIwJWxZ6Q):
```
for i in *.fasta; do mafft --auto ${i} > ${i%.*}_mafft.fasta; done
```
Note you will have to load the mafft module on Quest; I am running it in a slurm script on an entire directory.
```
#!/bin/sh
#SBATCH --account=p31913        ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal      ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=4:10:00          ## Required: how long the job needs to run (partitions restrict this parameter)
#SBATCH --nodes=1               ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=20    ## how many cpus/processors per node (default 1)
#SBATCH --mem-per-cpu=5GB       ## RAM per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=Align_job    ## shown when you run squeue -u
#SBATCH --mail-type=END         ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=yourEMAIL.org
#SBATCH --output=align.out

module load mafft

for i in *.fasta; do mafft --auto ${i} > ${i%.*}_mafft.fasta; done
```

## How to concatenate multiple genes (from the same sample) found in multiple fasta files into one fasta with all samples

Note each sample will have all of its genes. I found a working solution here: https://github.com/ballesterus/Utensils

These scripts were written in python 2, so make sure to module load python 2. Next you will want to copy/paste three python scripts:

- Get_fasta_from_Ref.py (note this one is found in this repo: https://github.com/ballesterus/UPhO)
- partBreaker.py
- geneStitcher.py

```
geneStitcher.py -in *_mafft.fasta
```

---

# SUPER MATRIX APPROACH

Let's concatenate all the genes and work on the supergene tree first. This will give us a good indication of the species tree. I found this script, fasta_merge.py, to concatenate all genes into a supergene matrix.
This was taken from here: https://github.com/mossmatters/HybPiper/blob/master/hybpiper/fasta_merge.py

*Note: you will need to activate the hybpiper environment to use this script.*
```
#!/usr/bin/env python

helptext = '''This script will take a list of FASTA files and concatenate them for use in
phylogenetic inference. The sequence headers (up until the first space) must be identical
in each individual FASTA file. Individual gene sequences should be aligned prior to
running this script!

This script requires BioPython to read/write FASTA sequences.'''

import os, sys, argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def read_sequences(fastafiles):
    '''Given a list of FASTA file names, read in each sequence to a dictionary of dictionaries, one per file'''
    return {filename: SeqIO.to_dict(SeqIO.parse(filename, 'fasta')) for filename in fastafiles}

def get_unique_names(gene_dict):
    '''Given the dictionary of SeqRecord dictionaries, return a list of the unique sequence headers'''
    all_names = []
    for gene in gene_dict:
        all_names += list(gene_dict[gene].keys())
    return set(all_names)

def insert_sequences(gene_dict, unique_names):
    '''Given the dictionary of dictionaries, insert blank sequences if any are missing for a gene'''
    inserted_sequences = 0
    for gene in gene_dict:
        for name in unique_names:
            if name not in gene_dict[gene]:
                gene_length = len(next(iter(gene_dict[gene].values())))
                gene_dict[gene][name] = SeqRecord(Seq("-" * gene_length), id=name)
                inserted_sequences += 1
    sys.stderr.write("{} Empty sequences inserted across all genes.\n".format(inserted_sequences))
    return gene_dict

def concatenate_sequences(gene_dict, fastafiles, unique_names):
    '''Given a dictionary of dictionaries with complete sampling in each gene,
    write out concatenated sequences to stdout. Returns a list of partition lengths.'''
    new_seq_dict = {}
    partition_lengths = []
    for gene in fastafiles:
        for name in unique_names:
            try:
                new_seq_dict[name] += gene_dict[gene][name]
            except KeyError:
                new_seq_dict[name] = gene_dict[gene][name]
        partition_lengths.append(len(next(iter(gene_dict[gene].values()))))
    for final_seq in new_seq_dict:
        SeqIO.write(new_seq_dict[final_seq], sys.stdout, 'fasta')
        final_seq_length = len(new_seq_dict[final_seq])
    sys.stderr.write("Final concatenated sequence length: {}\n".format(final_seq_length))
    return partition_lengths

def raxml_partition(fastafiles, partition_lengths, partition_type):
    '''Generate a raxml partition file for the given fastafiles. User specifies the partition type'''
    gene_start = 1
    partition_file = open("partition.raxml", 'w')
    if partition_type == 'CODON':
        for g in range(len(fastafiles)):
            codon3_start = gene_start + 2
            codon3_end = gene_start + partition_lengths[g] - 1
            codon1_end = codon3_end - 2
            codon2_start = gene_start + 1
            codon2_end = codon3_end - 1
            partition_file.write("{},{}{}={}-{}\\3,{}-{}\\3\n".format("DNA", fastafiles[g], "12", gene_start, codon1_end, codon2_start, codon2_end))
            partition_file.write("{},{}{}={}-{}\\3\n".format("DNA", fastafiles[g], "3", codon3_start, codon3_end))
            gene_start = codon3_end + 1
    else:
        for g in range(len(fastafiles)):
            gene_end = gene_start + partition_lengths[g] - 1
            partition_file.write("{},{}={}-{}\n".format(partition_type, fastafiles[g], gene_start, gene_end))
            gene_start = gene_end + 1
    partition_file.close()

def main():
    parser = argparse.ArgumentParser(description=helptext, formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("--fastafiles", nargs='+', help="List of Fasta Files. Can use wildcard on Linux/Mac systems")
    parser.add_argument("--filelist", help="File containing list of Fasta files. Alternative to --fastalist")
    parser.add_argument("--raxml", help="Create a partition file 'partitions.raxml' intended for raxml in the current directory.\nFor amino acid sequences, select the substitution model.\nTo specify a separate model for 1st/2nd vs. 3rd codon positions, select CODON.", choices=['DNA', 'WAG', 'JTT', 'CODON'], default=None)
    if len(sys.argv) < 2:
        parser.print_help()
        sys.exit(1)
    args = parser.parse_args()

    if args.fastafiles:
        if args.filelist:
            sys.stderr.write("Specify either a list of FASTA files or a file containing names, not both!\n")
            sys.exit(1)
        else:
            fastafiles = args.fastafiles
    elif args.filelist:
        if os.path.isfile(args.filelist):
            fastafiles = [x.rstrip() for x in open(args.filelist)]
        else:
            sys.stderr.write("File containing list of FASTA files not found!")
            sys.exit(1)
    else:
        sys.stderr.write("You must specify the FASTA files as a list or in a file.\n")
        sys.exit(1)

    sys.stderr.write("{} FASTA files found.\n".format(len(fastafiles)))
    gene_dict = read_sequences(fastafiles)
    sys.stderr.write("All sequences read successfully.\n")
    unique_names = get_unique_names(gene_dict)
    sys.stderr.write("{} Unique names found. If you were expecting fewer sequences, check your IDs!\n".format(len(unique_names)))
    gaps_inserted = insert_sequences(gene_dict, unique_names)
    partition_lengths = concatenate_sequences(gaps_inserted, fastafiles, unique_names)
    if args.raxml:
        raxml_partition(fastafiles, partition_lengths, args.raxml)

if __name__ == "__main__":
    main()
```
To run this script:
```
python fasta_merge.py --fastafiles *.fasta > super.fasta
```
There are some other easter eggs/functions in that script; I only used this code.
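To make the partition output concrete: for DNA partitions the script just walks the gene lengths, assigning each gene the 1-based range immediately after the previous one. A small standalone sketch of the same arithmetic (the gene names here are made up):

```python
def dna_partition_lines(gene_lengths):
    """Build raxml-style 'DNA,name=start-end' lines from {gene_name: aligned_length}."""
    lines, start = [], 1
    for name, length in gene_lengths.items():
        end = start + length - 1
        lines.append("DNA,{}={}-{}".format(name, start, end))
        start = end + 1
    return lines

# Two aligned genes of length 300 and 450:
# dna_partition_lines({"gene1.fasta": 300, "gene2.fasta": 450})
# -> ["DNA,gene1.fasta=1-300", "DNA,gene2.fasta=301-750"]
```

With `--raxml CODON`, the script instead splits each of these ranges into 1st/2nd vs. 3rd codon positions using the `\3` stride notation.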
### Notes for making target fasta from genbank

You will want to use multiple baits or else it will have low complexity lol.

1. Download data with gene features; headers should look something like this: `>lcl|KY045484.1_gene_3 [gene=psbA] [gbkey=Gene]`
2. Use a text editor to replace (`lcl|KY045484.1_`) with a four-letter name.
3. The header should then look like this: `>LOBETA-gene_1 [gene=psbA] [gbkey=Gene]`
4. Use this one-liner to delete everything after the gene number (from the first space on):
```
sed '/^>/ s/ .*//' SIPHKR.fasta > SIKR3.fasta
```

# To make a hybpiper assemble file (borrowed from John Z)
```
for i in *R1_paired.fastq.gz
do
    f2=${i//R1/R2}
    f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
    f4=${i//_R1_paired.fastq.gz/}
    echo hybpiper assemble -t_dna /projects/p31913/lobeliad_rawreads/fasta/HAW.plastid_fixed.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
```

---

# Amsonia HybSeq notes

1. Transfer files from Globus to Quest.
2. Remove _001. from file names:
```
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```
3. Run a loop to generate the trimmomatic slurm file:
```
for _R1_ in *_R1*; do
    R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
    R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
    R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
    R2p=${R2//.fastq.gz/_paired.fastq.gz}
    R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
    echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
Then add a slurm heading to the .sh script:
```
#!/bin/bash
#SBATCH --account=p31913        ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal      ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=3:10:00          ## Required: how long the job needs to run (partitions restrict this parameter)
#SBATCH --nodes=1               ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=5     ## how many cpus/processors per node (default 1)
#SBATCH --mem-per-cpu=15GB      ## RAM per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=Trimm_job    ## shown when you run squeue -u
#SBATCH --output=trim.out

module purge
module load trimmomatic/0.39

java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 A_elliptica_1_S3_R1.fastq.gz A_elliptica_1_S3_R2.fastq.gz A_elliptica_1_S3_R1_paired.fastq.gz A_elliptica_1_S3_R1_unpaired.fastq.gz A_elliptica_1_S3_R2_paired.f$
```
Concatenate the unpaired read files:
```
# To concatenate the unpaired files for each sample:
for file in *R1_unpaired*
do
    file2=${file//_R1_/_R2_}
    file3=${file//R1_unpaired/unpaired}
    cat $file $file2 > $file3
done
```
Run a loop to create the assemble commands:
```
for i in *R1_paired.fastq.gz
do
    f2=${i//_R1_/_R2_}
    f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
    f4=${i//_R1_paired.fastq.gz/}
    echo hybpiper assemble -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
```
Example slurm script:
```
#!/bin/bash
#SBATCH --account=p31913        ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal      ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=24:10:00         ## Required: how long the job needs to run (partitions restrict this parameter)
#SBATCH --nodes=1               ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=10    ## how many cpus/processors per node (default 1)
#SBATCH --mem-per-cpu=15GB      ## RAM per cpu (this affects your FairShare score, so don't ask for more than you need)
#SBATCH --job-name=HYBPIP_job   ## shown when you run squeue -u
#SBATCH --output=hyb.out

module purge
#module load java
#module load trimmomatic
module load hybpiper/2.1.2-python-3.8.4

hybpiper assemble -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta -r A_elliptica_1_S3_R1_paired.fastq.gz A_elliptica_1_S3_R2_paired.fastq.gz --unpaired A_elliptica_1_S3_unpaired.fastq.gz --prefix A_elliptica_1_S3 --bwa --run_intronerate
```
To recover sample stats:
```
hybpiper stats -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta gene namefile2.txt
```
To recover a heatmap:
```
hybpiper recovery_heatmap seq_lengths.tsv --sample_text_size 25 --figure_height 45 --heatmap_filename map3
```
Get ya sequences!
```
hybpiper retrieve_sequences dna -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta --sample_names namefile2.txt
```
SUPERCONTIG!
```
hybpiper retrieve_sequences supercontig -t_dna /projects/p31913/Amsonia-353/hybpiper-nf/Angiosperms353_targetSequences.fasta --sample_names namefile2.txt --fasta_dir /projects/p31913/Amsonia-353/Results_353/SUPER
```

1. Before running astral, collapse branches with low support, as suggested by the ASTRAL guide - you will need to git clone newick utilities from github (there are probably other programs that can do this too).
2. I also installed the latest version of astral (ASTER) from a git pull (https://github.com/chaoszhang/ASTER?tab=readme-ov-file). Note that to install it you need to load a newer version of gcc; I did this:
```
module load gcc/8.4.0
```

Aligning fasta files (from John Z):
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=14:00:00
#SBATCH --mem=48G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --job-name=coal_align
#SBATCH --error=align.log

### Load module
module load mafft

### For loop to run MAFFT on unaligned loci fasta
for i in *.fasta; do mafft --auto --thread 12 ${i} > ${i%.*}_aligned.fasta; done
```
Note: install trimal in a conda env. Other trimming/alignment tools to check out: ClipKIT (https://jlsteenwyk.com/ClipKIT/).
```
mamba install bioconda::trimal
```
Trim to 50% gaps (from John Z):
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=short
#SBATCH --time=01:30:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log

eval "$(conda shell.bash hook)"
conda activate iqtree

for i in *.fasta; do
    trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```
As an alternative to trimal, you can use ClipKIT (https://jlsteenwyk.com/ClipKIT/advanced/index.html):
```
#!/bin/bash
#SBATCH --account=p32416
#SBATCH --partition=short
#SBATCH --time=01:30:00
#SBATCH --mem=20G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=clipkit
#SBATCH --error=clip.log

eval "$(conda shell.bash hook)"
conda activate jvarkit

for i in *.fasta; do
    # Run clipkit
    clipkit "$i"
    # Extract the base name (without the _supercontig part) and append _mus.clipkit.fasta
    base_name=$(echo "$i" | sed 's/_supercontig_mus.fasta/_mus.clipkit.fasta/')
    # Rename the file produced by clipkit to the desired format
    mv "${i}.clipkit" "$base_name"
done
```
Setting up files for gene trees (from John Z):
```
for i in *.fasta; do
    echo iqtree2 -s $i -nt AUTO -ntmax 4 -m TEST -B 1000 -wbtl --polytomy --alrt 1000
done > full_iqtree.sh

for i in *.fasta; do
    echo iqtree2 -s $i -nt AUTO -ntmax 4 -m TEST -B 1000
done > full_iqtree.sh
```
Split the files for the array job:
```
split -a 1 -d -l 37 full_iqtree.sh iqtree --additional-suffix=.sh
```
Array job:
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=10:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log

module purge
eval "$(conda shell.bash hook)"
conda activate iqtree

#IFS=$'\n' read -d '' -r -a input_args < input_args.txt
#bash ${input_args[$SLURM_ARRAY_TASK_ID]}

---------

#!/bin/bash
#SBATCH --account=b1042
#SBATCH --partition=genomics
#SBATCH --array=0-8
#SBATCH --time=20:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log

module purge
eval "$(conda shell.bash hook)"
conda activate iqtree2

IFS=$'\n' read -d '' -r -a input_args < input_args.txt
bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```
Collapse branches with support < 10% (note: I deleted the folder - need `conda activate iqtree2`):
```
for i in *.treefile; do
    nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```
From John Z:
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_lob
#SBATCH --error=astral.log

module load gcc/8.4.0

/projects/p31913/ASTER/bin/astral -i amsonia_coal25_genetrees.tre -t 2 -o amsonia_coal25_sptre.tre 2>astral.log
```

------

# Getting plastid sequences using GetOrganelle

github - https://github.com/Kinggerm/GetOrganelle

Including closely related species or genera may help with plastid capture - to do.

Assembling plastids:
```
for i in *R1_paired.fastq.gz
do
    f2=${i//_R1_/_R2_}
    f4=${i//_R1_paired.fastq.gz/}
    echo get_organelle_from_reads.py -s amo_cp.fasta -1 $i -2 $f2 -o $f4 -t 4 -F embplant_pt -R 10 >> organ.sh
done
```
Include the unpaired reads:
```
for i in *R1_paired.fastq.gz
do
    f2=${i//_R1_/_R2_}
    f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
    f4=${i//_R1_paired.fastq.gz/}
    echo python /projects/p31913/GetOrganelle-1.7.4.1/get_organelle_from_reads.py -s amo_cp.fasta -1 $i -2 $f2 -u $f3 -o $f4 --prefix $f4 -t 4 -F embplant_pt -R 15 --reduce-reads-for-coverage inf --max-reads inf >> reorgan.sh
done
```

### Code to get GetOrganelle output ready for cactus

First create a directory (sequences2) to store the output fastas:
```
find /projects/p31913/Trim_outs/trim_cut/ -type f -name "*_sequence.fasta" -exec cp {} /projects/p31913/Trim_outs/sequences2/ \;
```

---

### Rename files to name_1.fasta

Make sure you only have complete genomes (ending complete.graph1.1.path_sequence.fasta):
```
for file in *embplant_pt*; do
    # Extract the part of the filename before 'embplant_pt'
    new_name=$(echo "$file" | sed 's/\(.*\)embplant_pt.*/\1_1.fasta/')
    # Rename the file to the new name
    mv "$file" "$new_name"
done
```

---

## Rename file headers
```
for file in *_1.fasta; do
    # Extract the part of the filename before '_1.fasta'
    new_header=$(basename "$file" "_1.fasta")
    # Replace the line starting with '>' with the new header
    sed -i "s/^>.*/>$new_header/" "$file"
done
```

---

Sequences need to be soft masked to run cactus.

####################################

### Misc helpful code

Convert .gb to fasta:
```
import os
from Bio import SeqIO

directory_path = './'

def process_genbank_files_in_directory(directory_path):
    for filename in os.listdir(directory_path):
        if filename.endswith(".gb"):
            file_path = os.path.join(directory_path, filename)
            genome = next(SeqIO.parse(file_path, "genbank"))
            cds = [feature for feature in genome.features if feature.type == 'CDS']
            seqs = []
            for feature in cds:
                seq = genome[int(feature.location.start):int(feature.location.end)]
                if feature.location.strand == -1:
                    seq.seq = seq.seq.reverse_complement()
                seq.description = feature.qualifiers['gene'][0]
                seqs.append(seq)
            output_file = f"{os.path.splitext(filename)[0]}.CDS.fasta"
            with open(output_file, 'w') as output_handle:
                SeqIO.write(seqs, output_handle, 'fasta')

process_genbank_files_in_directory(directory_path)
```

## Splitting genes into their own files
```
cat yourfasta.fasta | awk '{
    if (substr($0, 1, 1)==">") {filename=(substr($1,2) ".out.fasta")}
    print $0 >> filename
    close(filename)
}'
```
```
for i in *.fasta; do sed 's/\(.*\)_/\1 /' $i > $i.name.fasta; done
```
Renaming things:
```
for file in *out_aligned.fasta.name.fasta
do
    mv "$file" "${file/out_aligned.fasta./}"
done
```

---------------

# Using PhyParts to detect gene tree incongruence

I ran this code on Quest. First you will need to install PhyParts; here is the link to the bitbucket (https://bitbucket.org/blackrim/phyparts/src/master/). To run PhyParts you will need to load this module:
```
module load apache-maven-mvn3/3.9.7
```
Notes - I generated my gene trees using IQ-TREE, but the program suggests using RAxML; not sure if there is a big difference.

Before running PhyParts you will need to root all your gene trees and the species tree on the outgroup. There is a python script to reroot found here: https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts (it's at the bottom of the page); you can write a little loop to do all the gene trees. You will also need to conda install ete3 (http://etetoolkit.org/download/) - you do not need the etetoolkit or ete_toolchain, just ete3. This program will also be used to parse the output for phypartspiecharts.

Once your trees are rerooted and PhyParts is installed, you can run PhyParts - do so in a slurm script.
Mine took a while to run 320 genes, so I let it run for 10 hours.
```
#!/bin/bash
#SBATCH --account=p31913
#SBATCH --partition=normal
#SBATCH --time=10:30:00
#SBATCH --mem=80G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --job-name=phy_part
#SBATCH --error=phy.log

module load apache-maven-mvn3/3.9.7

java -jar target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d /projects/p31913/Amsonia-353/Results_353/Super3/align/cutcut/root_test -m /projects/p31913/Amsonia-353/Results_353/SUPER2/phyl/phyparts/target/rerooted.tre -o out
```
To visualize with phypartspiecharts, first find the python script here (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts). To run this script you will need the species tree (rooted) and the "base" output from PhyParts - check the phypartspiecharts page for this. You will also need to install FastX on your computer and log into Quest through it (instructions here: https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1511).

Phypartspiecharts will generate a .svg file that you can view in an internet browser and also convert to PDF.

-------------

# Convert a fasta to a nexus (for beast)

Install seqmagick:
```
pip install seqmagick
```
Run this one-liner:
```
seqmagick convert --alphabet dna 6119_supercontig_aligned.fasta 6119_supercontig_aligned.nex
```
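If installing seqmagick isn't an option, a bare-bones converter for an aligned fasta is only a few lines of Python. This is a sketch, not a full NEXUS writer - it assumes an aligned input and simple sequence names (no spaces or quotes):

```python
def fasta_to_nexus(fasta_path, nexus_path):
    """Write a minimal NEXUS DATA block from an aligned FASTA file."""
    seqs, name = {}, None
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    seqs = {k: "".join(v) for k, v in seqs.items()}
    nchar = len(next(iter(seqs.values())))  # all sequences assumed equal length
    with open(nexus_path, "w") as out:
        out.write("#NEXUS\nBEGIN DATA;\n")
        out.write("DIMENSIONS NTAX={} NCHAR={};\n".format(len(seqs), nchar))
        out.write("FORMAT DATATYPE=DNA MISSING=? GAP=-;\nMATRIX\n")
        for k, v in seqs.items():
            out.write("{} {}\n".format(k, v))
        out.write(";\nEND;\n")
```

For anything beyond a quick conversion (interleaved blocks, quoted taxon names), stick with seqmagick.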