# Jack Koskinen Lab Notes
## August 31
+ "ssh jjkosk21@bi278" allows me to log into my colby home from any device using the terminal (normal password, exit logs out)
+ "pwd" gives you the full directory you're in, "ls" lists everything in your current directory (or if you supply it another directory, it will give detailed descriptions of what is in that one)
+ "Command k" opens the file server menu once you, supply it "smb://filer.colby.edu" to get the colby fileserver
+ "ls /courses/bi278/" will also get to the course fileserver space
+ "cd . ." (no space between periods) will move you out to the parent directory
+ "cp /courses/bi278/Course_Materials/lab_01a/* . " copied the contents of lab_01a to the current directory (the home directory "." in this case)
+ "mkdir directory_name" will create a directory of that name as a child of the current directory
+ "mv file.txt directory_name" moves the specified file to the directory, and "mv *.txt" would move all text files from that directory to the new one (the asterisk is a wildcard)
+ It sounds like the fastq-dump command is going to require fine tuning depending on the dataset, so just check the usage each time (if it writes a file and doesn't show anything on the screen, "cat" will throw it all in the terminal). As an example, today we used "fastq-dump -X 100 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --outdir directory_name/ SRR7641793" to pull a file; I don't know what all of that meant, but it worked
+ "grep pattern file" will show every line in the specified file with a sequence of characters matching the pattern (> denotes a contig in .fa files, but writes to a file in linux, so be sure to use "<" when trying to find contigs) (-v shows all lines WITHOUT a matching pattern)
+ "wc file" gives lines, number of words, number of characters and the file name it just counted (-l does just lines, -c does just characters, -w does just words)
+ "grep -v ">" test.fa | tr -d -c GCgc | wc -c" will count Gs and Cs (you could give it AaCcTtGg to get all the bases)
+ My shell script processor.sh takes in a .fa or .fna file and spits out every line with a contig name, the total number of Gs and Cs, and then the total number of bases (below are the results from the organisms we were asked to tabulate results for)
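For reference, the core of processor.sh was probably something like the sketch below (a hypothetical reconstruction built from the grep/tr/wc pipeline above, not the actual script); `demo.fa` is a made-up two-contig FASTA standing in for a real .fa/.fna input:

```shell
# Hypothetical reconstruction of processor.sh's core logic on a toy FASTA.
cat > demo.fa <<'EOF'
>contig1
ATGCGC
>contig2
GGATTA
EOF

grep ">" demo.fa                                          # every contig name
gc=$(grep -v ">" demo.fa | tr -d -c GCgc | wc -c)         # total Gs and Cs
total=$(grep -v ">" demo.fa | tr -d -c AaCcGgTt | wc -c)  # total bases
echo "GC: $gc  total: $total"                             # GC: 6  total: 12
```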
# Lab Practical
### Commands entered (starting inside my Genomics directory):
```
mkdir Practical
mv processor.sh Practical
fastq-dump -X 3000000 --fasta --split-files --outdir ./ SRR11020398
rm SRR11020398_2.fasta
sh processor.sh SRR11020398_1.fasta
mv SRR11020398_1.fasta PfungorumSRA.fasta
fastq-dump -X 3000000 --fasta --split-files --outdir ./ SRR4188563
fastq-dump -X 3000000 --fasta --split-files --outdir ./ SRR2889773
rm *_2.fasta
mv SRR2889773_1.fasta PxenSRA.fasta
mv SRR4188563_1.fasta PmegSRA.fasta
sh processor.sh PxenSRA.fasta
sh processor.sh PmegSRA.fasta
##for the second part (AAF generation):
mkdir InputAlignment
cp /courses/bi278/Course_Materials/lab_02/* InputAlignment
mkdir OutputAlignment
cd OutputAlignment
```
```csvpreview {header="true"}
Organism, SRA Number, Total Bases (from processor.sh)
P. fungorum, SRR11020398, 452999591
P. megapolitana, SRR4188563, 139325285
P. xenovorans, SRR2889773, 45476390
```
#### Point of note: The downloads of some of the Fasta files took a really long time and threw errors multiple times, despite using the same command as the first one (which seemed to go fine) with only the SRA number changed, so I'm not sure if something went wrong there.
#### Second point of note: The genome sizes my shell script from last week calculates are orders of magnitude larger than either the estimates in R or last week's values. I'm not sure what to make of that, because processor.sh seemed to work fine last lab and I haven't edited it or even opened it since.
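A likely explanation: processor.sh counts every sequenced base in the reads, and the reads cover the genome many times over, so its total should exceed the true genome size by roughly the coverage fold. A quick sanity check with the P. fungorum numbers from the table above (the ~36x coverage comes from the k-mer histogram peak found later in R):

```shell
# Back-of-the-envelope: total read bases / coverage ~ genome size.
total_bases=452999591   # processor.sh total for P. fungorum (table above)
coverage=36             # peak of the 25-mer histogram (estimated later in R)
echo $((total_bases / coverage))   # 12583321, same order as the R estimate
```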
Here is the pwd followed by ls to show my file organization:

And here is my home directory showing the two folders I used for storing everything associated with tree generation:

# Rest of Lab 2 (Sept 14)
Example of jellyfish commands to create count and histogram files:
```
jellyfish count -t 2 -C -s 2G -m 25 -o Pfung.25.count Pfungorum.fasta
jellyfish histo -o Pfung.25.histo Pfung.25.count
```
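Conceptually, `jellyfish count -m 25` just tallies every 25-base window across the reads. A toy illustration of the same idea in awk with m=3 (no canonical `-C` collapsing of reverse complements here):

```shell
# Count every 3-mer in a toy sequence by sliding a window one base at a time.
echo "ATATAT" |
  awk '{ for (i = 1; i <= length($0) - 2; i++) kmer[substr($0, i, 3)]++ }
       END { for (k in kmer) print k, kmer[k] }' | sort
# ATA 2
# TAT 2
```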
Example of how to load the histogram file into R and use it to estimate the coverage and genome size from the SRA data:
```
setwd("/home/jjkosk21/colbyhome/Genomics/Lab_02")
Pfung <- read.table("Pfung.25.histo", h=F, sep=" ")
plot(Pfung[2:200,], type="l", xlim = c(0, 200), ylim = c(0, 400000), xlab = "Occurrence", ylab = "Density")
Pfung[2:50,] #Look for where the peak is, in this case it's at 36
#The lines below are just a walkthrough of finding the area under the curve
nrow(Pfung)
head(Pfung)
head(Pfung[,1]) #[rowrange,columnrange]
head(Pfung[,2])
head(Pfung[5:nrow(Pfung),])
head(Pfung[5:nrow(Pfung),1]*Pfung[5:nrow(Pfung),2])
sum(Pfung[5:nrow(Pfung),1]*Pfung[5:nrow(Pfung),2])
sum(Pfung[5:nrow(Pfung),1]*Pfung[5:nrow(Pfung),2])/36 #this line takes the area and
#divides it by coverage, estimating genome size
```
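The same area-under-the-curve arithmetic can be done straight from the shell; here on a made-up miniature histogram (real .histo files have the same two-column occurrence/count layout), skipping the first four rows of error k-mers and dividing by a peak coverage of 36, exactly as in the R walkthrough:

```shell
# Toy two-column histogram: occurrence (coverage) and number of distinct k-mers.
cat > toy.histo <<'EOF'
1 900
2 300
3 100
4 50
5 10
36 200
37 100
EOF

# Sum occurrence*count from row 5 on (dropping low-occurrence error k-mers),
# then divide by the coverage peak to estimate genome size.
awk 'NR >= 5 { area += $1 * $2 } END { print int(area / 36) }' toy.histo   # 304
```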
Genome sizes estimated in R:
```csvpreview {header="true"}
Organism, Size
P. fungorum, 10342202
P. megapolitana, 11159258
P. xenovorans, NA
```
(The data download for P. xenovorans was one of the ones throwing errors; my plot doesn't show a peak other than the erroneous reads, so I wasn't able to carry out any of the estimations in R for P. xen)
Creating the tree:
```
python /usr/local/src/AAF-20160831/BetaVersion/aaf_phylokmer.py -k 27 -n 2 -t 4 -o k27 -d ../InputAlignment/ &
python /usr/local/src/AAF-20160831/BetaVersion/aaf_distance.py -i k27.dat.gz -t 4 -o k27 -f k27.wc
```
Viewing tree in R:
```
setwd("/home/jjkosk21/OutputAlignment")
library(ape)
tree <- read.tree("k27.tre")
plot(tree)
## (if renaming branches is desired, tree$tip.label accesses the names)
```

# Lab #3 (Sept 28)
## Exercise 1, Prokka
This was the command entered to run the prokka annotation on our BaQS70 bacterial genome, using the closely related Burkholderia pseudomallei as a reference genome (written here with backslash line continuations so the whole thing can be copied and pasted safely):
```
prokka --outdir baqs70 \
  --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk \
  --locustag PAG70 --genus Paraburkholderia --species agricolaris \
  --strain BaQS70 /courses/bi278/Course_Materials/lab_03/b70_dnaa_irp_repolished.fasta
```
Q1. Here's what it output into the baqs70 folder that we told it to create:

Q2. There are 7,860 CDS (coding sequences), 2,703 of which are "hypothetical proteins" and 57 of which are ribosomal proteins; there are also 75 tRNAs
Example command to count tRNAs:
```
grep "tRNA" PROKKA_09282020.faa | wc -l
```
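Prokka also writes a .tsv with one row per feature, which would let me count all the feature types in one pass instead of one grep per type. A sketch on a fabricated three-row table (the real file's second column is the feature type; the rows here are invented):

```shell
# Fabricated miniature of a prokka .tsv (tab-separated; column 2 = ftype).
printf 'locus_tag\tftype\tlength_bp\tgene\tproduct\n' >  features.tsv
printf 'PAG70_00001\tCDS\t300\tdnaA\treplication initiator\n' >> features.tsv
printf 'PAG70_00002\ttRNA\t76\t\ttRNA-Ala\n' >> features.tsv
printf 'PAG70_00003\tCDS\t450\t\thypothetical protein\n' >> features.tsv

# Tally every feature type at once (skip the header row).
awk -F '\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t, n[t] }' features.tsv | sort
# CDS 2
# tRNA 1
```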
Q3. The FAA file seemed to have all the information I needed. I don't know what was different about the other files, since I couldn't actually open any of them to look; I was only able to use grep commands to find the things I was told to look for inside them.
## Exercise 2, command line BLAST
A useful table I found on wikipedia:

The following command finds all entries containing the phrase "hypothetical protein" and puts them into the specified file. This was for the amino acid sequences, the same thing was done with the .ffn file for the nucleotide sequences.
```
awk 'BEGIN {RS=">"} /hypothetical protein/{print ">"$0}' ORS="" PROKKA_09282020.faa > hypProAmino.txt
```
A file containing only a few hypothetical proteins was made from this larger file using the following (the same idea was used to subset a smaller .faa file):
```
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypProNuc.ffn > nucleotides.01.ffn
```
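The `RS=">"` trick in these commands is worth remembering: it makes awk treat each whole FASTA record (header plus sequence) as one "line", so a pattern match grabs the full record. A deterministic mini-demo on an invented three-record protein FASTA (no srand() subsampling):

```shell
# Toy protein FASTA; two of the three records are hypothetical proteins.
cat > toy.faa <<'EOF'
>p1 hypothetical protein
MKV
>p2 dioxygenase
MAA
>p3 hypothetical protein
MGG
EOF

# RS=">" splits on record boundaries; ORS="" avoids extra separators on output.
awk 'BEGIN { RS = ">" } /hypothetical protein/ { print ">" $0 }' ORS="" toy.faa
# prints the p1 and p3 records in full, skipping p2
```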
A BLAST search was then run on this file containing 6 hypothetical protein nucleotide sequences (Essentially the same command was used for the FAA files, just modified for that file type):
```
/usr/local/src/ncbi-blast-2.10.1+/bin/blastn -task megablast -query nucleotides.01.ffn -db nt -outfmt "6 std stitle sscinames" -max_target_seqs 5 -evalue 1e-6 > ffn_megablast_result.txt &
```
And the output file was parsed into something easier to look at using:
```
awk -F "\t" '{print $1,$14,$13,$3,$4,$11}' OFS="\t" ffn_megablast_result.txt
```
Q4. I'm pretty sure that my results from the nucleotide search and the amino acid search are not the same sequences (since I did nothing to ensure they matched, I have no idea why they would be, given that they come from completely different prokka output files), but regardless they show a lot of overlap in the species that suspected orthologs were found in, which is encouraging. The nucleotide search could only show us sequences from other organisms' genomes with a lot of sequence homology; it didn't know anything about what the proteins actually might be. The swissprot search provided very specific proteins with the most homology, but only from organisms in that database, which were mostly mammals like mice and humans. The refseq-protein search was able to find potential protein identities from species more closely related to ours (e.g. it found a dioxygenase protein from a Paraburkholderia with 94% homology to one of our hypothetical proteins), but for some of our potential proteins it could only find matches to other as-of-yet unstudied proteins in related species.
Q5. The only real modification I needed to make was to limit the number of matches that were displayed for the protein searches, at first I was seeing far too many. Other than that I just put in what was supplied in the directions.
Q6. By default a BLAST nucleotide search uses the megablast option (which is what we used, so that seems fine), writes output to "-" (stdout, which would not be helpful to find later), uses an E value of 10 as the threshold for saving matches (given that we used 1e-6, 10 seems rather high and would probably return a lot of matches that aren't meaningful), uses the MegaBLAST database index (we didn't change this, so that seems fine), and returns one-line descriptions for the first 500 matches and sequence alignments for 250 of them.
A protein BLAST search uses the blastp option by default (we used blastp-fast, and fast is good, but there may be tradeoffs I'm not aware of), also writes output to "-" (not particularly helpful), also uses an E value of 10 (still seems high), uses a default value of '2' for the -comp_based_stats parameter (this changes how composition-based statistics are computed; we didn't change it, so it's probably fine to use the default), and likewise shows 500 one-line descriptions and 250 sequence alignments.
I think the most important parameters to change when starting a search, when you aren't sure exactly what will work best, are the E value (lower it to an actually meaningful cutoff), the output file (specify one you can find later), and the number of matches shown (the max_target_seqs parameter, which we set to 5), because 500 matches are far too many to look through in a useful way (they could be useful for downstream analysis, but unprocessed they're overwhelming).
# Lab #4 (Oct 19)
### Section 1.2
First I made individual databases for each organism using these commands (the first for nucleotide sequences, the second for amino acid sequences):
```
/usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in /courses/bi278/Course_Materials/lab_04/B.pseudomallei.ncbi.cds_from_genomic.fna -dbtype nucl -title B.pseudomallei -out /home/jjkosk21/MYBLAST/B.pseudomallei -parse_seqids
```
This takes the .fna file and uses it to create a database called "B.pseudomallei" in my MYBLAST folder.
```
/usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in /courses/bi278/Course_Materials/lab_04/P.terrae.ncbi.protein.faa -dbtype prot -title P.terrae -out /home/jjkosk21/MYBLAST/P.terrae -parse_seqids
```
This command is the same thing but for a protein database.
I then created nucleotide and protein databases in the MYBLAST folder using the same commands but just changing the path name and file name to my subsections of the Baqs70 genes from last lab.
Then I set my MYBLAST folder as the database I want my blast searches to use with the following command:
```
export BLASTDB="/home/jjkosk21/MYBLAST:"
```
### Section 1.3
I performed the first BLAST search looking for a few of the Baqs70 genes in the P.terrae database using this command and piped the result into a command to extract the sequence ID of the hits:
```
/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast -num_threads 4 -query aminos.01.faa -db P.terrae -max_target_seqs 1 -outfmt "6 std stitle" | cut -f 2 > Pterrae_hits.txt
```
and it found two hits for two of the genes I supplied it.
I then searched the P.terrae database to extract the sequences of those proteins with this command:
```
/usr/local/src/ncbi-blast-2.10.1+/bin/blastdbcmd -db P.terrae -entry_batch Pterrae_hits.txt > Pterrae_seqs.faa
```
And then finally checked that they were the reciprocal best hits of the Baqs70 genes in the original BLAST by performing a search for these sequences in the Baqs70 database with this command (they did in fact find the original genes, indicating that they were reciprocal best hits):
```
/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast -num_threads 4 -query Pterrae_seqs.faa -db Baq70_prot -max_target_seqs 1 -outfmt "6 std stitle"
```
Q1. Some of my pairs of homologs were reciprocal best hits, some were not.
Q2. None of the B.pseudomallei proteins were reciprocal best hits with any of the Baqs70 proteins, which makes sense given that B.pseudomallei is the most distantly related of the organisms we are looking at.
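The reciprocal-best-hit check above was done by eye; the same logic can be scripted. A sketch over two made-up two-column query/best-hit tables (all gene and protein names here are invented, not from my real output):

```shell
# Made-up best-hit tables: forward (Baqs70 query -> P.terrae hit) and
# reverse (P.terrae query -> Baqs70 hit).
printf 'geneA\tprotX\ngeneB\tprotY\n' > fwd.tsv
printf 'protX\tgeneA\nprotY\tgeneZ\n' > rev.tsv

# A pair is a reciprocal best hit when A's best hit is B and B's best hit is A.
# NR==FNR is true only while reading the first file, so we load it into an array.
awk 'NR == FNR { fwd[$1] = $2; next }
     fwd[$2] == $1 { print $2, $1, "RBH" }' fwd.tsv rev.tsv
# geneA protX RBH
```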
After putting all the sequences into a new file, I aligned them all together with Muscle:
```
muscle -in dioxygenase.txt -out dioxygenase_muscle_aln.faa
```
and then created a tree with:
```
FastTree dioxygenase_muscle_aln.faa > dioxygenase_faa.tree
```
# Lab #5 (Nov 2)
### 1. Quality control of reads
The following command checks the quality associated with the reads and outputs a file about them:
```
fastx_quality_stats -Q 33 -i /courses/bi278/Course_Materials/lab_05/QS23_s_8_1_sanger_subset.fastq -o QS23_r1_qstats.txt
```
This command was supposed to create a boxplot of some sort from the quality stats file, but it didn't really work and Sugene said it was fine to just look at her example:
```
fastq_quality_boxplot_graph.sh -i QS23_r1_qstats.txt -o QS23_r1.quality.png -t "QS23_r1"
```
This command actually did work, and generated the distribution of nucleotides shown below:
```
fastx_nucleotide_distribution_graph.sh -i QS23_r1_qstats.txt -o QS23_r1.nucdist.png -t "QS23_r1"
```

Q1. Since the box plot didn't generate and I can't tell what I'm looking at in the raw txt file, I'm not sure what the quality of these reads is like, but if they have scores consistently in the very high 20s or 30s that don't dip towards the end of the read, I would be comfortable calling them high quality.
Q2. I'm not really sure what I'm supposed to be seeing in this distribution, but I'm guessing that the fairly consistent relative frequencies are a good sign? If they changed suddenly at certain read positions, that would probably be indicative of an error.
The following command first clips the reads and discards any left shorter than 20 bases, then filters out reads where fewer than a third of the positions have a quality score of at least Q=30:
```
cat /courses/bi278/Course_Materials/lab_05/QS23_s_8_1_sanger_subset.fastq | fastx_clipper -Q 33 -v -l 20 | fastq_quality_filter -Q 33 -v -q 30 -p 33 > QS23_s_8_1_sanger_subset.clipped.fastq
```
Q3. A quality score of Q=30 means that the error rate is 0.001, and anything above 30 is a lower error rate than 0.001.
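The conversion behind Q3 is Q = -10·log10(p), so p = 10^(-Q/10); a one-liner tabulating a few values:

```shell
# Phred quality score to error probability: p = 10^(-Q/10).
awk 'BEGIN { for (q = 10; q <= 40; q += 10) printf "Q%d -> %g\n", q, 10^(-q/10) }'
# Q10 -> 0.1
# Q20 -> 0.01
# Q30 -> 0.001
# Q40 -> 0.0001
```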
Since we may have removed poor quality reads from one file but not the same position in the other paired end file when we ran the command above, we now need to make sure that they match:
```
python /courses/bi278/Course_Materials/lab_05/fastqCombinePairedEnd.py QS23_s_8_1_sanger_subset.clipped.fastq QS23_s_8_2_sanger_subset.clipped.fastq
```
This command cleans up the name of the files:
```
mv QS23_s_8_1_sanger_subset.clipped.fastq_pairs_R1.fastq QS23_s_8_1_subset.clipped.pairs.fastq
```
These remove the unnecessary files:
```
rm *subset.clipped.fastq
rm *singles.fastq
```
And this compresses the file to save space:
```
gzip QS23_s_8_1_subset.clipped.pairs.fastq
```
### 2. Aligning reads to a reference genome
Sugene was nice enough to index a reference genome for us, so we just had to access it from the course fileserver space:
```
#bwa index /courses/bi278/Course_Materials/lab_05/dicty_chromosomal_nodup.fasta
```
Then this command aligns the cleaned paired end reads to the reference:
```
bwa mem -M -t 4 -R '@RG\tID:qs23\tSM:qs23\tLB:qs23' \
  /courses/bi278/Course_Materials/lab_05/dicty_chromosomal_nodup.fasta \
  QS23_s_8_1_subset.clipped.pairs.fastq.gz \
  QS23_s_8_2_subset.clipped.pairs.fastq.gz > QS23.bwa.sam
```
Then this command does some cleanup for the way bwa handles alignments:
```
samtools fixmate -O bam QS23.bwa.sam QS23_fixmate.bwa.bam
```
Then this command sorts the reads into a compressed alignment format (bam):
```
samtools sort -O bam -o QS23_sorted.bwa.bam -T . QS23_fixmate.bwa.bam
```
This command collects summary stats about how well alignment went:
```
picard CollectAlignmentSummaryMetrics \
  R=/courses/bi278/Course_Materials/lab_05/dicty_chromosomal_nodup.fasta \
  I=QS23_sorted.bwa.bam O=QS23_sorted.bwa.bam.stats.txt
```
To index the mapped file:
```
samtools index QS23_sorted.bwa.bam
```
And then to view a graphical representation of the alignment:
```
samtools tview QS23_sorted.bwa.bam
```
### 3. Understanding Alignment and Alignment Files
I used the following command to look at a few of the reads from the alignment file:
```
samtools view -s 0.0000001 QS23_sorted.bwa.bam
```
Q4. A summary of what fields 12+ showed for a couple of the reads from the alignment:
+ most had one or two substitutions, but a couple didn't have any (NM field)
+ One read had mismatches where the reference has a T 37 bases into the alignment and an A at position 40, with 30 more matching bases after those (MD:Z:36T2A30; the letters in an MD tag are the reference bases at mismatch positions)
+ That read had an alignment score of 60 (AS:i:60)
+ The next best alignment for that read had a score of 45 (XS:i:45)
+ That read's mate had an alignment score of 39 (MQ:i:39)
+ Its mate's CIGAR string was 75M, i.e. the mate aligned as 75 continuous matched/mismatched bases (MC:Z:75M)
+ And then finally it was in the qs23 read group like it should be since that is what we input (RG:Z:qs23)
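A way to sanity-check an MD tag from the shell: split it into its number runs (matched stretches) and letters (reference bases at mismatches), then total them. This assumes GNU grep's `-o` is available, and it doesn't handle the `^`-prefixed deletions that real MD tags can also contain:

```shell
# MD:Z:36T2A30 -> 36 matches, mismatch (ref T), 2 matches, mismatch (ref A),
# 30 matches. Numbers sum to matched bases; letters count mismatches.
echo "36T2A30" |
  grep -o '[0-9]\+\|[A-Z]' |
  awk '/^[0-9]/ { matched += $0 } /^[A-Z]/ { mm++ }
       END { print matched + mm " aligned bases, " mm " mismatches" }'
# 70 aligned bases, 2 mismatches
```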
### 4. Refine alignment and Call Variants
The following command was used to mark duplicates in the alignment:
```
picard MarkDuplicates I=QS23_sorted.bwa.bam O=QS23_marked.bwa.bam \
  METRICS_FILE=QS23_marked.bwa.bam.metrics ASSUME_SORT_ORDER=coordinate \
  TAGGING_POLICY=All VALIDATION_STRINGENCY=LENIENT
```
Sugene did the following steps for us to save time, but we would have used the next two commands to call variants and remove variants with too low of quality to use:
```
#bcftools mpileup -Ou -EI -C 50 -Q 30 -f /courses/bi278/Course_Materials/lab_05/dicty_chromosomal_nodup.fasta /courses/bi278/Course_Materials/lab_05/QS11_marked.bwa.bam QS23_marked.bwa.bam | bcftools call -vc -O z --ploidy 1 -o dicty.bwa.mpileup.vcf.gz
```
```
#bcftools filter -O v -o dicty.bwa.mpileup.filtered.vcf -s LOWQUAL -i '%QUAL>10' dicty.bwa.mpileup.vcf.gz
```
### 5. Understanding Variant Files
Q5. A random SNP description from the VCF file:
On chromosome DDB0215018, position 65256, the reference genome and qs11 have a G at that position while qs23 has a T:
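For reference, here is a fabricated VCF line modeled on that SNP (only the chromosome, position, and alleles match my real file; the quality, depth, and genotype values are invented) and an awk one-liner pulling out the useful columns:

```shell
# Fabricated single-record VCF; columns are tab-separated.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tqs11\tqs23\n' >  toy.vcf
printf 'DDB0215018\t65256\t.\tG\tT\t120\tPASS\tDP=40\tGT\t0\t1\n'            >> toy.vcf

# Skip header lines; print chrom, pos, ref->alt, and each sample's genotype
# (haploid calls: 0 = reference allele, 1 = first alternate allele).
awk -F '\t' '!/^#/ { print $1, $2, $4 "->" $5, "qs11=" $10, "qs23=" $11 }' toy.vcf
# DDB0215018 65256 G->T qs11=0 qs23=1
```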
