---
title: Nora Impatiens Notes
breaks: false
tags: HybSeq, Nora
---
## Phylogenetics notes
Align using MAFFT
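For example, a minimal loop (assuming the retrieved supercontigs are named *.FNA, as elsewhere in these notes):
```
for i in *.FNA; do mafft --auto ${i} > ${i%.FNA}_mafft.fasta; done
```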
Trim using trimAl, but look carefully at the parameters you are using. Specifically, I would not use -cons, because it conserves positions in the alignment even if they are mostly gaps. Gappy alignments make tree building difficult, time-consuming, less robust, etc.
```
#!/bin/bash
#SBATCH --account=p31984
#SBATCH --partition=short
#SBATCH --time=1:00:00
#SBATCH --mem=5G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --job-name=trimal
#SBATCH --output=trimal.log
#SBATCH --error=trimal.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nora.gavinsmyth@gmail.com
# -gt 0.5 keeps columns with at least 50% non-gap characters; -st 0.001 sets a minimum similarity score
for i in *_mafft.fasta; do
trimal -in ${i} -out ${i%.*}_trim.fasta -gt 0.5 -st 0.001;
done
```
Making trees with IQ-TREE. -m TEST asks IQ-TREE to do something like a jModelTest for the substitution model. Useful info!
```
#!/bin/bash
#SBATCH --account=p31944
#SBATCH --partition=long
#SBATCH --time=168:00:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --job-name=iqtree_super
#SBATCH --output=iqtree_super.log
#SBATCH --error=iqtree_supererror.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nora.gavinsmyth@gmail.com
for i in 44*trim.fasta; do
iqtree -s $i -nt AUTO -ntmax $SLURM_NPROCS -m TEST -bb 1000 -wbtl;
done
```
A silly way to get the average number of loci recovered per sample after having deleted low-recovery individuals, removed putative paralogs, etc.
```
#this makes a text file of all the headers in all the fastas and a 1 next to each
grep ">" *.fasta | cut -d "|" -f 2 | sort | uniq -c | awk '{sum += $1; print $2, $1}' > loci_per_sample.txt
#this gets rid of everything except for >sample_name
sed 's/^[0-9]\+_supercontig\.filt_mafft_trim\.fasta://' loci_per_sample.txt > loci_per_sample2.txt
#this again to add up all the 1's for each >sample_name
grep ">" loci_per_sample2.txt | cut -d "|" -f 2 | sort | uniq -c | awk '{sum += $1; print $2, $1}' > loci_per_sample3.txt
#and then this to get the average per >sample_name
awk '{sum += $2} END {print "Average:", sum/NR}' loci_per_sample3.txt
#the median:
sort -n loci_per_sample3.txt | awk ' { a[i++]=$2; } END { print a[int(i/2)]; }'
```
## TreeShrink
Making one directory per locus, each containing a treefile and its alignment file:
```
for i in *.treefile;
do
NM=${i//_supercontig.filt_mafft_trim.fasta.treefile/}
mkdir ../inputfiles/$NM
done
for i in *.treefile;
do
NM=${i//_supercontig.filt_mafft_trim.fasta.treefile/}
cp $i ../inputfiles/$NM && mv ../inputfiles/$NM/$i ../inputfiles/$NM/input.tree
done
for i in *.treefile;
do
NM=${i//_supercontig.filt_mafft_trim.fasta.treefile/}
fasta=${i//fasta.treefile/fasta}
cp /projects/p31984/all_seq_retrieved_October2023/aligned_fastas/balsaminaceae_only_non_redundant/trimmed/$fasta ../inputfiles/$NM && mv ../inputfiles/$NM/$fasta ../inputfiles/$NM/input.fasta
done
```
TreeShrink needs R/3.6.0 to run BMA:
```
module load R/3.6.0
run_treeshrink.py -i ../inputfiles/ -b 20 -c -t input.tree -a input.fasta
```
Grab those alignments and give them back their locus names:
```
for i in *.treefile;
do
NM=${i//_supercontig.filt_mafft_trim.fasta.treefile/}
cp ../inputfiles/$NM/output.fasta ../shrunk_alignments/$NM.FNA
done
```
## SortaDate
This is a tricky one to get set up on Quest, so Suzy has set it up on funk in /home/general/Programs/SortaDate/src and phyx is located in /home/srs57/phyx-1.3.1/src/
First you need to root all your trees and convert them to newick using pxrr from phyx. While -s will suppress any error that comes with the outgroup not being present in the treefile, this will introduce NAs, which will be a problem later on. List multiple outgroups instead, e.g. Hydrocera_triflora,omeiana,pritzelii, and use the -r flag to rank the outgroups, the first listed being the highest rank.
```
for i in *treefile; do /home/srs57/phyx-1.3.1/src/pxrr -t $i -g Pelliciera_rhizophorae,Pentamerista_neotropica,Tetramerista_sp,Marcgravia_polyantha,Norantea_guianensis,Ruyschia_phylladenia,Marcgraviastrum_subsessile,Sarcoperra_tepuiensis,Schwarzia_braziliensis,Hydrocera_triflora -r -o $i.rooted; done
```
Now you can run the first python script:
```
python /home/general/Programs/SortaDate/src/get_var_length.py ./ --flend .treefile.rooted --outf ./var_length --loc /home/srs57/phyx-1.3.1/src/
```
For the next python script you need the treefiles, and you will also need the species tree. The species tree should be in newick format, so run ASTRAL with the -t 0 option for no annotation/plain newick output.
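For example, a sketch of that ASTRAL call (the jar path and file names here are placeholders):
```
java -jar astral.jar -i genetrees.tre -t 0 -o fixed.tre
```
Then run the second script: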
```
python /home/general/Programs/SortaDate/src/get_bp_genetrees.py ./ ./fixed.tre --flend .treefile.rooted --loc /home/srs57/phyx-1.3.1/src/ --outf bp
```
```
python /home/general/Programs/SortaDate/src/combine_results.py ./var_length ./bp --outf comb
```
Finally, get an output of your good genes. --order gives the order of importance (3 = bipartition support, 2 = tree length, 1 = root-to-tip variance); --max gives the maximum number of loci to keep.
```
python /home/general/Programs/SortaDate/src/get_good_genes.py comb --max 30 --order 3,1,2 --outf gg
```
Moving the good genes to their own directory
```
awk -F_ '{print $1}' ./gg > goodgenes.txt
while IFS= read -r line; do
cp "$line"*_trim.fasta ./goodgenes/
done < goodgenes.txt
```
Making a list of all the headers in a fasta, like for an ASTRAL species map:
```
grep '^>' super50.fasta | sed 's/>\(.*\) .*/>\1/' | awk '{print $1}' | awk 'sub(/^>/, "")' > tips
paste -d ":" tips tips > speciesMap.txt
```
then edit the species map by hand in bbedit to collapse tips to one per species (good to do to reduce computation time for BEAST). At first I removed anything redundant, such as multiple tips of walleriana and sodenii, but then I realized that I do want to keep these because they are in different EAM blocs and we want to time their divergence. I only kept redundant taxa if they are in different blocs.
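For instance, with hypothetical tip names, the first line collapses two same-bloc tips into one species entry, while the last two lines keep different-bloc tips separate:
```
speciesA:speciesA_tip1,speciesA_tip2
speciesB_bloc1:speciesB_bloc1
speciesB_bloc2:speciesB_bloc2
```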
To run loci as separate partitions, but with linked trees/the same fixed tree, all the partitions will need to have the same tips. To get these, I first made a concatenated supermatrix of the files, which fills in missing tips with all-gap sequences:
```
#!/usr/bin/env python
helptext ='''This script will take a list of FASTA files and concatenate them for use in
phylogenetic inference. The sequence headers (up until the first space) must be identical
in each individual FASTA file.
Individual gene sequences should be aligned prior to running this script!
This script requires BioPython to read/write FASTA sequences.'''
import os,sys,argparse
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
def read_sequences(fastafiles):
'''Given a list of FASTA file names, read in each sequence to a dictionary of dictionaries, one per file'''
return {filename:SeqIO.to_dict(SeqIO.parse(filename,'fasta')) for filename in fastafiles}
def get_unique_names(gene_dict):
'''Given the dictionary of SeqRecord dictionaries, return a list of the unique sequence headers'''
all_names = []
for gene in gene_dict:
all_names += list(gene_dict[gene].keys())
return set(all_names)
def insert_sequences(gene_dict,unique_names):
'''Given the dictionary of dictionaries, insert blank sequences if any are missing for a gene'''
inserted_sequences = 0
for gene in gene_dict:
for name in unique_names:
if name not in gene_dict[gene]:
gene_length = len(next(iter(gene_dict[gene].values())))
gene_dict[gene][name] = SeqRecord(Seq("-"*gene_length),id=name)
inserted_sequences += 1
sys.stderr.write("{} Empty sequences inserted across all genes.\n".format(inserted_sequences))
return gene_dict
def concatenate_sequences(gene_dict,fastafiles,unique_names):
'''Given a dictionary of dictionaries with complete sampling in each gene, write out concatenated sequences to stdout. Returns a list of partition lengths.'''
new_seq_dict = {}
partition_lengths = []
for gene in fastafiles:
for name in unique_names:
try:
new_seq_dict[name] += gene_dict[gene][name]
except KeyError:
new_seq_dict[name] = gene_dict[gene][name]
partition_lengths.append(len(next(iter(gene_dict[gene].values()))))
for final_seq in new_seq_dict:
SeqIO.write(new_seq_dict[final_seq],sys.stdout,'fasta')
final_seq_length = len(new_seq_dict[final_seq])
sys.stderr.write("Final conatenated sequence length: {}\n".format(final_seq_length))
return partition_lengths
def raxml_partition(fastafiles,partition_lengths,partition_type):
'''Generate a raxml partition file for the given fastafiles. User specifies the partition type'''
gene_start = 1
partition_file = open("partition.raxml",'w')
if partition_type == 'CODON':
for g in range(len(fastafiles)):
codon3_start = gene_start + 2
codon3_end = gene_start + partition_lengths[g] - 1
codon1_end = codon3_end - 2
codon2_start = gene_start + 1
codon2_end = codon3_end - 1
partition_file.write("{},{}{}={}-{}\\3,{}-{}\\3\n".format("DNA",fastafiles[g],"12",gene_start,codon1_end,codon2_start,codon2_end))
partition_file.write("{},{}{}={}-{}\\3\n".format("DNA",fastafiles[g],"3",codon3_start,codon3_end))
gene_start = codon3_end + 1
else:
for g in range(len(fastafiles)):
gene_end = gene_start + partition_lengths[g] - 1
partition_file.write("{},{}={}-{}\n".format(partition_type,fastafiles[g],gene_start,gene_end))
gene_start = gene_end + 1
partition_file.close()
def main():
parser = argparse.ArgumentParser(description=helptext,formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--fastafiles",nargs='+',help="List of Fasta Files. Can use wildcard on Linux/Mac systems")
parser.add_argument("--filelist",help="File containing list of Fasta files. Alternative to --fastalist")
parser.add_argument("--raxml",help="Create a partition file 'partitions.raxml' intended for raxml in the current directory. For amino acid sequences, select the substitution model. To specify a separate model for 1st/2nd vs. 3rd codon positions, select CODON.",
choices = ['DNA','WAG','JTT','CODON'
],default=None)
if len(sys.argv) < 2:
parser.print_help()
sys.exit(1)
args = parser.parse_args()
if args.fastafiles:
#print args.fastafiles
if args.filelist:
sys.stderr.write("Specify either a list of FASTA files or a file containing names, not both!\n")
sys.exit(1)
else:
fastafiles = args.fastafiles
elif args.filelist:
#print args.filelist
if os.path.isfile(args.filelist):
fastafiles = [x.rstrip() for x in open(args.filelist)]
else:
sys.stderr.write("File containing list of FASTA files not found!")
sys.exit(1)
else:
sys.stderr.write("You must specify the FASTA files as a list or in a file.\n")
sys.exit(1)
sys.stderr.write("{} FASTA files found.\n".format(len(fastafiles)))
gene_dict = read_sequences(fastafiles)
sys.stderr.write("All sequences read successfully.\n")
unique_names = get_unique_names(gene_dict)
sys.stderr.write("{} Unique names found. If you were expecting fewer sequences, check your IDs!\n".format(len(unique_names)))
gaps_inserted = insert_sequences(gene_dict,unique_names)
partition_lengths = concatenate_sequences(gaps_inserted,fastafiles,unique_names)
if args.raxml:
raxml_partition(fastafiles,partition_lengths,args.raxml)
if __name__ == "__main__":main()
```
Usage:
```
python fasta_merge.py --fastafiles *fasta --raxml DNA > super.fa
```
then I un-concatenated them using a python script I found here: https://github.com/wrf/supermatrix/blob/master/split_supermatrix_to_genes.py The partition file will need to be converted into this format (it won't keep the gene names): "01:150,151:338,339:700"
```
#edit the partition file:
sed 's/DNA,[a-z0-9_]*.fasta=//g' partition.raxml | sed 's/-/:/g' | sed ':a;N;$!ba;s/\n/,/g' | sed 's/,$//' > partition.txt
python split_script.py -a super.fa -p partition.txt -d ./ -f fasta
```
Then use seqmagick to convert to nexus:
```
#purge hybpiper to run seqmagick properly:
module purge hybpiper
for i in *aln; do NM=${i//.aln/.nex}; seqmagick convert $i $NM --input-format fasta --output-format nexus --alphabet dna; done
```
Notes about issues with BEAST: trying to set a uniform or lognormal prior will not work with a fixed/starting tree, because if the starting tree does not have a root within that range, it returns a zero probability. The workaround is to run a chain of 50,000 with normal priors only (standard deviation of 0.1), and set the MCMC to write to log files every 50,000 states. When the search is done, take the last tree (STATE_50000) and use that as the starterTree for the real MCMC.
Using the fixed topology module in BEAST2. Make sure to load the fixed tree first, before the alignments; the other way around makes it freeze.
```
grep -v STATE_0 babyRun.trees > starterTreeForRealMCMC.nex
```
Make the tree readable with TreeAnnotator and/or FigTree
Unfortunately I still have issues getting BEAST to run with a uniform prior on the TMRCA of Impatiens, but I can run it with a normal prior of 69.9 +/- 10.
For BEASTv1, to run with a fixed topology, add this manually to the .xml:
```
<stateNode spec='beast.util.TreeParser' id='Tree.t:tree_1' IsLabelledNewick='true' adjustTipHeights='true'
taxa='@tree_1'
newick="(tree (in (newick format)))"/>
```
Was also having issues with the partitions and assigning GPU resources, but this is the script that finally works:
```
#!/bin/bash
#SBATCH -A b1042
#SBATCH -p genomics-gpu
#SBATCH --gres=gpu:a100:2
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 48:00:00
#SBATCH --mem=80G
#SBATCH --mail-user=nora.gavinsmyth@gmail.com
#SBATCH --mail-type=ALL
#SBATCH --output=beast.output
module purge all
module use /hpc/software/spack_v20d1/spack/share/spack/modules/linux-rhel7-x86_64/
module load beast1/1.10.4-cuda-gcc-10.4.0
beast -beagle_GPU -beagle_cuda -beagle_order 1 parts10_Rose.xml
```
Downloaded a bunch of fastas from PAFTOL, but they are multifasta by individual, headers by locus. Wanted to split them into fastas by locus with headers by sample name:
```
#Clean up the file names
for file in *a353.fasta; do new_name=$(echo "$file" | cut -d "." -f3); mv $file $new_name.fasta; done
#Split by the first item header name, the locus name
for i in [A-z]*.fasta; do awk '/^>/ { gsub(">", "", $1); split($1, nameParts, " "); out=nameParts[1]".fasta"; print ">"$0 >> out; next } { print >> out }' $i ; done
#now let's clean up the header names so we just have the sample name
sed -i 's/[a-zA-Z0-9_]* Species://g' [0-9]*.fasta
sed -i 's/[0-9_]* Gene_Name://g' [0-9]*.fasta
#using cut -d f1 to write new fastas takes a long time so instead:
sed -i 's/Repository:[A-Za-z0-9]* Sequence_ID:[A-Za-z0-9]*//g' [0-9]*.fasta
```
A tsv summary of the gappiness of alignments:
```
import os
from Bio import SeqIO
def calculate_gap_percent(sequence):
total_length = len(sequence)
gap_count = sequence.count('-')
return (gap_count / total_length) * 100
def process_fasta_files_in_directory(directory):
data = {}
for filename in os.listdir(directory):
if filename.endswith('.fasta') or filename.endswith('.fa'):
fasta_file = os.path.join(directory, filename)
for record in SeqIO.parse(fasta_file, 'fasta'):
sample_id = record.id
gap_percent = calculate_gap_percent(str(record.seq))
if sample_id not in data:
data[sample_id] = {}
data[sample_id][fasta_file] = gap_percent
return data
def write_data_to_tsv(data, output_file):
    # collect the union of fasta files across all samples so columns stay aligned
    all_files = sorted({f for d in data.values() for f in d})
    with open(output_file, 'w') as out:
        # Write header
        out.write("Sample ID\t")
        out.write("\t".join(all_files))
        out.write("\n")
        # Write data; a sample missing from an alignment gets NA
        for sample_id, fasta_data in data.items():
            out.write(sample_id)
            for fasta_file in all_files:
                gap_percent = fasta_data.get(fasta_file)
                out.write("\t" + (f"{gap_percent:.2f}" if gap_percent is not None else "NA"))
            out.write("\n")
directory = './'
output_file = 'gap_percent_summary.tsv'
data = process_fasta_files_in_directory(directory)
write_data_to_tsv(data, output_file)
```
## treePL
treePL needs a tree with ML branch lengths, which it smooths into an ultrametric, dated tree. Get the ASTRAL tree with no annotations (-t 0) and then put it into FigTree; export it in newick format with the outgroups where you want them. Then run iqtree with a fixed tree topology on a concatenated supermatrix of the loci; consider concatenating just the best loci as per SortaDate. (4/16: when doing this with all 340-some loci, the -t option still allowed the tree topology to change, as it is only a starting tree; -te fixes the topology.)
```
for i in super.fasta
do
iqtree -s $i -nt AUTO -m TEST -B 1000 -nm 5000 -t fixed.tre -wbtl;
done
```
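treePL itself takes a plain-text config file; a minimal sketch (the calibration taxa and ages here are placeholders, and numsites must match the supermatrix length):
```
treefile = super.fasta.treefile
numsites = 500000
mrca = IMPATIENS tipA tipB
min = IMPATIENS 60
max = IMPATIENS 80
smooth = 100
thorough
outfile = treePL_dated.tre
```
Then run treePL on the config file (treePL configfile).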
## getOrganelle
```
get_organelle_from_reads.py -1 zombensis_1.fastq -2 zombensis_2.fastq -t 1 -o zombensis_plastome -F embplant_pt -R 10
```
...
Once we have the assemblies, all of the fastas will have the same name, but we need them to have the sample names. The files are named differently depending on whether it is a complete/nearly complete/incomplete assembly, so do this for each of those:
```
mkdir ./nearlyCompletePlastomes
for i in *_plastome;
do
cp ./$i/embplant_pt.K115.nearly-complete.graph1.1.path_sequence.fasta ./nearlyCompletePlastomes/$i.fasta;
done
mkdir ./completePlastomes
# etc.
```
Next we want to annotate the assemblies. I used PGA to do this. https://github.com/quxiaojian/PGA
First, make sure that BLAST is loaded. In my conda environment, perl is messed up, so it is best to conda deactivate the environment and then module load blast.
```
conda install blast
git clone https://github.com/quxiaojian/PGA.git
chmod a+rwx ./PGA/PGA.pl
cd ./PGA
perl PGA.pl -r test/angiosperms/reference -t ../2023_data/plastomes/getorganelle_assemblies/nearlyCompletePlastomes/
```
PGA will overwrite anything in ./gb so once it is done, move everything out of that directory before you do a different batch.
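e.g. (the batch directory name is arbitrary):
```
mkdir -p ../gb_batch1 && mv ./gb/* ../gb_batch1/
```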
Sometimes PGA will give me an error that it can't access the reference files. I don't know why it does this, but if I remove everything from the ./gb directory it will work the second time.
Use the python script from Kinggerm to turn .gb files into .fasta: get_annotated_regions_from_gb.py
Then clean up header names and put the gene names in the headers:
```
for i in *gb_fastas/gene/*fasta; do name="${i%.fasta}"; nm="$(basename "$name")"; sed -i "s/\--* chloroplast/ ${nm}/g" "$i"; done
```
then concatenate them all into one fasta per individual:
```
for i in *gb_fastas; do name="${i%.gb_fastas}"; cat $i/gene/*.fasta > "${name}.fasta"; done
```
Then use this python script to extract all sequences for each gene:
```
import os
def process_fasta(input_filename, output_dir):
# Create output directory if it doesn't exist
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# Read the input Fasta file
with open(input_filename, 'r') as infile:
lines = infile.readlines()
# Process each line of the Fasta file
for line in lines:
if line.startswith('>'):
# Extract sample name and gene name from the header
header_parts = line.strip().split()
sample_name = header_parts[0][1:]
gene_name = header_parts[1]
# Create a new Fasta file for each gene
output_filename = os.path.join(output_dir, f"{gene_name}.fasta")
with open(output_filename, 'a') as outfile:
# Write the new header and sequence
outfile.write(f">{sample_name}\n")
else:
# Append sequence to the corresponding gene file
with open(output_filename, 'a') as outfile:
outfile.write(line)
# Specify the input directory containing Fasta files
input_directory = '/projects/p31984/2023_data/plastomes/complete_plastomes/annotatedComplete/fastas_by_individual/no_underscores'
# Specify the output directory for the new Fasta files
output_directory = '/projects/p31984/2023_data/plastomes/complete_plastomes/annotatedComplete/fastas_by_individual/no_underscores/output/'
# Process each Fasta file in the input directory
for filename in os.listdir(input_directory):
if filename.endswith('.fasta'):
input_filepath = os.path.join(input_directory, filename)
process_fasta(input_filepath, output_directory)
```
Now align each gene using MAFFT.
Trim using trimAl to remove positions with less than 50% coverage.
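A minimal sketch of those two steps, reusing the same conventions as the loops above:
```
for i in *.fasta; do mafft --auto ${i} > ${i%.fasta}_mafft.fasta; done
for i in *_mafft.fasta; do trimal -in ${i} -out ${i%.fasta}_trim.fasta -gt 0.5; done
```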
Use this python script, remove_gappy_sequences.py, to remove all sequences with less than 50% of positions present:
```
import os
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
def filter_gap_sequences(msa_file, output_file, gap_threshold=0.5):
# Read the multiple sequence alignment file
alignment = AlignIO.read(msa_file, "fasta")
# Calculate the maximum number of allowed gaps for each sequence
max_gaps = len(alignment[0]) * gap_threshold
# Filter out sequences with more than the maximum allowed gaps
filtered_alignment = MultipleSeqAlignment(
seq for seq in alignment if seq.seq.count("-") <= max_gaps
)
# Write the filtered alignment to a new file
with open(output_file, "w") as out_handle:
AlignIO.write(filtered_alignment, out_handle, "fasta")
# Directory containing the fasta alignment files
input_dir = "./"
# Output directory for filtered alignments
output_dir = "./fastas"
# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)
# Iterate over each fasta alignment file in the input directory
for filename in os.listdir(input_dir):
if filename.endswith(".fasta"):
input_file = os.path.join(input_dir, filename)
output_file = os.path.join(output_dir, filename)
# Apply the filtering function
filter_gap_sequences(input_file, output_file, gap_threshold=0.5)
print(f"Filtered {filename} and saved as {output_file}")
```
Let's see how much coverage we have per gene:
```
for i in *trim.fasta; do nm=${i%_mafft_trim.fasta}; number_headers=$(grep -c "^>" ${i}); echo -e "${nm}\t${number_headers}" >> summary_table; done
sort -k2 summary_table
```
Concatenate the trimmed aligned loci using fasta_merge.py
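As above, e.g. (the output name is arbitrary):
```
python fasta_merge.py --fastafiles *trim.fasta > plastome_super.fasta
```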
Now remove any sequences that have fewer than 50% of the positions in the concatenated alignment, using the python script remove_gappy_sequences.py again.
## snaq!
First make alignments just with the samples you're going to use in your network search.
Some useful code to extract the desired fastas (names.txt has the sample names you want to extract from a multifasta)
```
for fasta_file in *trim.fasta;
do
output_file="${fasta_file%_mafft_trim.fasta}.FNA"
awk 'NR==FNR{headers[$1]; next} /^>/{header=substr($1,2)} header in headers' "names.txt" "$fasta_file" > "$output_file"
done
```
Then make RAxML trees for those trimmed alignments. Make a tree file of all the trees, then make an ASTRAL tree.
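A sketch of that step (assumes raxmlHPC is on the PATH; the astral.jar path and seeds are placeholders):
```
for i in *.FNA; do raxmlHPC -f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -s ${i} -n ${i%.FNA}; done
cat RAxML_bipartitions.* > genetrees.tre
java -jar astral.jar -i genetrees.tre -o astral.tre
```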
Julia code to visualize the networks follows. In the command line, grab the best network from the net1 run and cat it into a new file "net1", and do the same with net2.
In Julia:
```
using PhyloNetworks
using PhyloPlots
using RCall
net1 = readTopology("net1")
net2 = readTopology("net2")
imagefilename = "waloutNet2.pdf"
R"pdf"(imagefilename, width=8, height=8);
R"par"(mar=[5,4,4,8]);
plot(net1, showgamma=true, showedgenumber=true, xlim = (20), ylim = (20), tipcex = .33, shownodenumber = true);
R"dev.off()";
rootatnode!(net1,"Bianchi118")
#cannot reroot at this outgroup node because it is downstream of a hybrid edge!
#instead root on the edge leading up to the hybrid edge:
rerooted_net1 = rootonedge!(net1, 35);
imagefilename = "waloutRerootedNet2.pdf"
R"pdf"(imagefilename, width=8, height=8);
R"par"(mar=[5,4,4,8]);
plot(rerooted_net1, showgamma=true, showedgenumber=true, xlim = (20), ylim = (20), tipcex = .33, shownodenumber = true);
R"dev.off()";
```
## ddRAD Phylogeography
Make a .phylip file using populations in Stacks. Make a new popmap where each individual is its own population, e.g.
```
210_13 210_13
210_05 210_05
Ibata7 Ibata7
```
```
populations -P ./keiliiStacks -M ./AllpopmapKeilii --phylip --write-random-snp -t 8
```
There is a junk line at the end of the file which will prevent RAxML from reading it. Simply delete the last line:
```
head -n -1 populations.fixed.phylip > temp && mv temp fixed.phylip
```
Now make a maximum likelihood tree in RAxML:
```
raxmlHPC-SSE3 -f a -V -T 12 -m ASC_GTRCAT --asc-corr lewis -p 12345 -x 12345 -# 100 -s fixed.phylip -n keiliiTree
```
## HyDe
Requires sequential phylip format, with a tab separating sequence names and sequences. First I removed all sequences of non-African taxa, i.e. those not of interest for looking at hybridization.
```
#Get the header names which are to be removed:
awk 'NR==FNR{pruned_headers[$0]; next} !($0 in pruned_headers)' pruned_headers headers
```
Then had to realign and retrim the alignments. Then I concatenated them into a supermatrix.
```
python fasta_merge.py --fastafiles *.fasta > super.fasta
```
The phylip conversion in AlignIO can write either sequential or relaxed format; I can't figure out how to make it do both. Luckily there are only two sequences that would be redundant if cut off after 10 characters, Grimshaw94614 and Grimshaw94613. So I sedited Grimshaw94614 to Grims94614.
```
sed -i 's/Grimshaw94614/Grims94614/' super.fasta
```
in python:
```
from Bio import AlignIO
AlignIO.convert("super.fasta","fasta","output.phy","phylip-sequential")
```
It doesn't insert a tab between the name and the sequence, so:
```
sed 's/^\(.\{10\}\)/\1\t/' output.phy > output_file.txt
```
and then drop the first line with n headers and n positions, which is standard in .phylip files but which HyDe does not accept.
```
sed -i 1d output_file.txt
```
Then making the map.txt file
```
cut -c-10 output_file.txt > sp
paste sp sp -d "\t"> map.txt
##and drop the first line
sed -i 1d map.txt
```
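Then HyDe itself can be run along the lines of (the outgroup name and the individual/taxon/site counts are placeholders; they must match your map.txt and alignment):
```
run_hyde.py -i output_file.txt -m map.txt -o OUTGROUP -n 50 -t 50 -s 250000
```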
Note that fasta_merge.py has to be run with the hybpiper module loaded; however, run_hyde.py won't run with this module loaded, so purge the hybpiper module first.
## Notes on Impatiens popgen HybPhaser
When getting GLIBC error, export new GLIBC:
```
export LD_LIBRARY_PATH=/opt/glibc-2.14/lib
```
Use EMBOSS to create a single consensus sequence from the aligned multifasta of all samples. The point of this is to generate a "dummy" reference that has only the most common SNPs. We will use the dummy reference in generating a VCF.
## A random dumping of code that helps us move along. Will have to organize
To convert fasta to nexus files (for instance, for MrBayes)
```
for i in *trim.fasta; do NM=${i//_trim.fasta/.nex}; seqmagick convert $i $NM --output-format nexus --alphabet dna; done
```
Use EMBOSS to generate the consensus reference for all of the multifastas in a directory. I changed the names so that they just say GENE#.fasta:
```
for filename in ./*.fasta;
do mv "./$filename" "./$(echo "$filename" | sed -e 's/_intronerated_consensus_trimmed//g')";
done
```
```
for i in *trimmed.fasta;
do cons -sequence ${i} -outseq dummy-${i%.fasta}embossed.fasta -ofdirectory2 ../embossed_fastas -name ${i%.fasta};
done
```
All the names were a mess, so I cleaned them up using the loop above.
Now concatenate all the "emboss" files into one reference:
```
cat *embossed.fasta > dummy_ref.fasta
```
Next we can generate vcf for each individual mapping against the dummy reference using the variantcall.sh script from https://github.com/lindsawi/HybSeq-SNP-Extraction/blob/master/variantcall.sh
I have added modifications, including adding mkdir $prefix before line 15, adding gatkpath=pwd to gatk, and using absolute paths for defining read1fn and read2fn and the reference. See Nora's var_workflow2.sh for reference
To bash the shell script on all sample names in a text file:
```
cat names_list.txt | while read i ; do bash var_workflow2.sh dummy_ref.fasta ${i}; done
```
Next we want to get a VCF summary for all the samples, and summary statistics.
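For example, with bcftools (a sketch; assumes the per-sample VCFs have been collected into one directory):
```
for i in *.vcf; do bcftools stats ${i} > ${i%.vcf}_stats.txt; done
```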
Loop for changing names from SRR downloads:
```
for filename in ./SRR19402956_*.fastq;
do mv "./$filename" "./$(echo "$filename" |
sed 's/SRR19402956/chinensis/')"; done
```
Other loops for cleaning up filenames using sed and regular expressions:
```
for filename in *fastq.gz;
do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
```
^^ do note that if you are trying to clean up something like .1. in a file name, in sed "." matches any character, not just a literal dot, so more may be edited than you intend. Use a backslash before any character like that (\.1\.).
To make a namelist.txt of only the subdirectories in a directory:
```
ls -d */ | cut -f1 -d'/' | cat > name_list.txt
```
To make a namelist of only the basenames of a directory full of files:
```
for i in *.fq.gz; do NM=${i//.fq.gz/}; echo $NM; done > namelist.file
```
Add a column, like for a popmap:
```
awk -F '\t' -v OFS='\t' '{ $(NF+1) = "engleri"; print}' r1r2engleri > r1R2engleri
```
Take just one column from a text file:
```
awk '{print $1}' popmap > file
#or
cut -f 1 popmap > file
```
Move all small file sizes to another folder, lowreads/
```
find . -maxdepth 1 -type f -size -100k -exec mv {} lowreads/ \;
```
Combine retrieved sequence .FNA files from separate runs. Gene names need to be identical.
```
for i in *.FNA; do cat ${i} ../secondDirectoryWithFNAfiles/${i} > ../newDirectoryForFNAfiles/${i}; done
```
To remove duplicated header names in a fasta:
```
for i in *_mafft.fasta; do seqkit rmdup < $i > $i.d.fasta; done
```
To concatenate gene alignments into a supermatrix,
use the hybpiper script fasta_merge.py
Looking for barcodes in raw reads. The second grep ignores the headers.
```
zless Fant28_S0_L001_R1_001.fastq.gz |grep "CAGCCTCG" |grep -v "^@" |less
```
### To see what command is running in a detached screen:
```
ps u -p $(ps -el | grep $(ps -el | grep SCREEN_SESSION_PID | grep bash | awk '{print $4}') | grep -v bash | awk '{print $4}')
```
### Log
```csvpreview {header="true"}
Date, Entry
Jan 9 2023, At this point we have a vcf containing all SNPs from all loci in all individuals. For GenotypestoPCA.sh we removed line 13 which filtered the VCF using GATK because the filters were too strict/some scores appeared to all be 0. Instead we use vcftools with two flags: (1) missingness of "90%?" and (2) minor allele count of "3".
Dec 6 2023, I had to remove rubromaculata vel_sp_ioides and 162-1. In order to use awk quickly to remove rubromaculata, I first changed the name of rubromaculata_subsp_rubromaculata
Jan 9 2024, I removed the following taxa from the alignments for TreeShrink: "deWilde359" "Lebrun9424" "NGS037_2" "Abeid2627" "Manning2019" "Vanheist701" "Mwasumbi11834" Tip names are now stored in oldTipsShrunk.csv and newTips_Shrunk.csv
Things to work on next: how to handle redundant tips. Do I use the species mode for Astral? Then color code the tip names by geographic region. How to handle all the non-monophyly?? I. messumbaiensis, I. gordonii.
Friday EOD, removed all non-Balsaminoid outgroups as well as huangyuanensis_dld so now realigning, then have to retrim and then redo trees. Also redid trees with the properly trimmed alignments but they still include the outgroups
```
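A sketch of the vcftools filtering described in the Jan 9 2023 entry (the input name is a placeholder; --max-missing 0.9 keeps sites genotyped in at least 90% of individuals):
```
vcftools --vcf all_samples.vcf --max-missing 0.9 --mac 3 --recode --recode-INFO-all --out filtered
```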
To change a header name:
```
for i in *supercontig.FNA; do sed 's/rubromaculata_subsp_rubromaculata/I_nana_vel.sp.aff_NGS157/' $i > $i.fa; done
```
Then to remove sequences from the fasta:
```
for i in *.trim.fasta
do
awk '!/rubromaculata|vel_sp_ioides|162-1/' RS=">" ORS=">" ${i} > ${i%_mafft.trim.fasta}.FNA
done
```
This is a python script I have used to turn .gb files into .fasta; however, it doesn't work very well for incomplete plastomes. I don't want to delete it right now, but I don't recommend using it; instead use get_annotated_regions_from_gb.py
```
import os
from Bio import SeqIO
directory_path = './'
def process_genbank_files_in_directory(directory_path):
for filename in os.listdir(directory_path):
if filename.endswith(".gb"):
file_path = os.path.join(directory_path, filename)
genome = next(SeqIO.parse(file_path, "genbank"))
cds = [feature for feature in genome.features if feature.type == 'CDS']
seqs = []
for feature in cds:
seq = genome[int(feature.location.start):int(feature.location.end)]
if feature.location.strand == -1:
seq.seq = seq.seq.reverse_complement()
seq.description = feature.qualifiers['gene'][0]
seqs.append(seq)
output_file = f"{filename}.CDS.fasta"
with open(output_file, 'w') as output_handle:
SeqIO.write(seqs, output_handle, 'fasta')
process_genbank_files_in_directory(directory_path)
```