BI345 Final Project

# BI345 Final Project ## Sourcing Data Data and motivation for this project has been sourced from two papers: [A transcriptome software comparison for the analyses of treatments expected to give subtle gene expression responses](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08673-8#Sec10) Data Source: [SRA](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SAMN23849194&o=acc_s%3Aa) [Evaluation of Wi-Fi Radiation Effects on Antibiotic Susceptibility, Metabolic Activity and Biofilm Formation by Escherichia Coli 0157H7, Staphylococcus Aureus and Staphylococcus Epidermis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6820025/#:~:text=In%20addition%20to%20antibiotic%20resistance,studies%20that%20showed%20promoting%20E.) Data Source: [SRA](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP185923&o=acc_s%3Aa) GTF and FNA files for alignment: https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/ ### Data acquisition Data acquisition was done directly from the SRA bioproject page using the fasterq-dump method. Below is the script file for acquiring the data from the low-dose radiation paper. The RNA-seq paired-end reads data have been placed into the "rad" and "wifi" directories. ``` # For all files: #!/bin/sh for((i = 83; i <= 98;i++)) do fasterq-dump --split-files SRR171944$i done ``` ``` # For single file: fastq-dump --gzip --split-3 -X 10000000 SRR17194483 ``` The files have been renamed by library name in order to make identification easier. The naming schema is "rad" or "wifi" depending on which paper the data is from followed by the library name (i.e. wifiExp1 would correspond to the first wifi-exposed sample) | shortName | SRR ID | Description | | --------- | ----------- | -------------------------- | | wifiExp1 | SRR8580017 | WiFi exposed replicate 1 | | wifiExp2 | SRR8580018 | WiFi exposed replicate 2 | | wifiExp3 | SRR8580019 | WiFi Exposed Replicate 3 | | radP2 | SRR17194483 | Rad. Pozzolana Replicate 2 | | radP1 | SRR17194484 | Rad. Pozzolana Replicate 1 | | radM4 | SRR17194485 | Rad. Minus(control) Rep. 4 | | radM3 | SRR17194486 | Rad. Minus(control) Rep. 3 | | radM2 | SRR17194487 | Rad. Minus(control) Rep. 2 | | radM1 | SRR17194488 | Rad. Minus(control) Rep. 1 | | radK4 | SRR17194489 | Rad. KCl Replicate 4 | | radK3 | SRR17194490 | Rad. KCl Replicate 3 | | radT4 | SRR17194491 | Rad. Tufo Replicate 4 | | radT3 | SRR17194492 | Rad. Tufo Replicate 3 | | radT2 | SRR17194493 | Rad. Tufo Replicate 2 | ## Clipping raw read data We quality control the RNA-seq data in order to remove sequencing adapters and barcodes and remove any reads that do not pass through the quality filter. The filter we will apply is a minimum Phred score of 10, remove reads that are less than 20 bp long, and have 1/3 of the reads meet this quality score criteria **This step was omitted in the final version in order to save time.** ``` fastp --in1 radP1_1.fastq --out1 radP1_1.fastp.fastq --in2 radP1_2.fastq --out2 radP1_2.fastp.fastq -q 10 -u 66 -l 20 ``` ## Indexing reference genome The gtf and fna files that were used to create the reference the genome were sourced [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/). The command used to reference this genome is included below: ``` sed 's/CDS/exon/g' GCF_000005845.2_ASM584v2_genomic.gtf > GCF_000005845.2_ASM584v2_genomic.exon.gtf STAR --runMode genomeGenerate --genomeSAindexNbases 10 --genomeDir STAR.index --genomeFastaFiles GCF_000005845.2_ASM584v2_genomic.fna --sjdbGTFfile GCF_000005845.2_ASM584v2_genomic.exon.gtf ``` The sed command will substitute every instance that the string "CDS" appears on a line with "exon". This is necessary because the GTF 2.2 file-structure uses protein coding sequences (CDS) instead of exons which the htseq-count command takes by default. The difference between an exon and a CDS both will remain present in mature RNA however, CDS is much more stringent since it will only keep the protein coding sequences. The added flag, genomeSAindexNbases will change the length, N, of the suffix arrays used to generate the genome. This is because the length of the e. coli genome is too short for the default value of 14. ## Aligning reads Each clipped fastq file is aligned to the reference genome using the following command. The alignIntronMin and max arguments are necessary in order to turn off splice alignments: ``` STAR --runMode alignReads --genomeDir /courses/bi345/kyamad23/final_project/raw/rad/STAR.index --readFilesIn SRR17194483_1.fastq.gz SRR17194483_2.fastq.gz --alignIntronMin 1 --alignIntronMax 1 --outFileNamePrefix SRR17194483. --outSAMtype BAM Unsorted --outSAMattributes All --readFilesCommand zcat ``` ## Sorting and fixing read groups We output the sorted file to a bam in order to save space. We also need to use the -h flag in the samtools view command in order to include the header in the output. ``` samtools view -f 0x2 -h SRR17194483.Aligned.out.bam | samtools sort -n > SRR17194483.sorted.bam ``` ## Add and replace readgroups To add read group information, we use the table above to choose the appropriate lane ID in order to add it to the output ``` samtools addreplacerg -r "@RG\tID:kyamada\tSM:kyamada\tPL:illumina" SRR17194483.sorted.bam > SRR17194483.readGroups.bam ``` ## Count reads from readfiles The count command will only include reads that had a Phred quality score above 20. ``` samtools view SRR17194483.readGroups.bam | htseq-count -s reverse -a 20 - /courses/bi345/kyamad23/final_project/raw/rad/reference/GCF_000005845.2_ASM584v2_genomic.exon.gtf > SRR17194483.rev.count ``` ### format topGO counts: The topGO GO annotation file was downloaded from [Gene Ontology Consortium website](http://snapshot.geneontology.org/products/pages/downloads.html) We need to format this file because all we need is the GO term ID and the Brattner number corresponding to the gene. The downloaded gaf file has extraneous columns, a header, and synonym gene IDs, none of which we will use. The command below pulls any line that has a GO Ontology term ID and pipes this output into an awk command. The awk command will take the fifth column, which has the GO term, and the last five characters of the the gene IDs. This is because the brattner number that our count files label the genes by are always the in the format b#### and are the last ID in the column. The second set of grep commands searches any row without a Bratner ID (i.e. b####) and excludes them from the study. The alternative would be to translate whatever geneID existed in that slot into a Bratner ID however, they were removed in the interest of time. ``` grep GO: ecocyc.gaf | awk -F "\t" '$4!~/NOT/ && $7!="ND" {print $4,$5,$7,$11}' OFS="\t" > ecocyc_filtered.gaf cat ecocyc_filtered.gaf | grep -o '.....$' > ecocyc_filtered.2 cut -f 2 ecocyc_filtered.gaf > ecocyc_filtered.1 paste ecocyc_filtered.2 ecocyc_filtered.1 > ecocyc_filtered.topgo.txt grep ^b ecocyc_filtered.topgo.txt > temp mv temp ecocyc_filtered.topgo.txt ```