Linux tutorial answers

# Linux tutorial answers The tutorial is available [here](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ). ## Practice Problems Each of the questions applies to the file `/export/groups/drangeli/rnaseq.sample.fq` Start by copying it to your home directory. ### Problem B1 How frequent is the [*Eco*RI](https://www.neb.com/products/r0101-ecori) recognition sequence, GAATTC? #### Solution Here the best tool to use is [`grep`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#Searching-with-grep) with the count feature (`-c`). Your answer might look like this. ```bash grep 'GAATTC' -c rnaseq.sample.fq ``` It should return 23880. There is a danger here! This answer is based on a search of the entire file, which includes the ID lines and base quality values, in addition to the DNA sequence itself. It's possible that our search string "GAATTC" might appear in some of those lines. If so it would give us an inflated number. A solution to this is to filter the fastq file before using `grep`. We could do this by first using `grep` to find only the DNA lines, then apply the search itself. This can be done by asking grep to look for just the characters A,C,G or T at the beginning of the line. ```bash grep '^[ACGT][ACGT][ACGT]' rnaseq.sample.fq | grep 'GAATTC' -c ``` The answer is still 23880, but now we can be confident it isn't over-inflated! ### Problem B2 Create a file containing only the ID lines. Replace colons and space characters with tabs. #### Solution There are two steps to this solution. First, we can use `grep` to filter to the ID lines, then [`sed`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#sed) can be used to search for and replace the colon and space characters. ```bash grep '@' rnaseq.sample.fq > id.lines.txt sed 's/[: ]/\t/g' id.lines.txt > rnaseq.sample.ids.tsv ``` Alternatively, this could be done in one command, using piping. ```bash grep '@' rnaseq.sample.fq | sed 's/[: ]/\t/g' > rnaseq.sample.ids.tsv ``` ### Problem B3 In each of the [FASTQ ID lines](https://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers) the 5th number is the "tile number". How many different tile numbers are represented in this file? #### Solution By itself, this might be a challenging problem. But we've just seen how to isolate the ID lines from the larger fastq format. Once the colons and spaces are converted into a delimiting character, like a tab or comma, we can treat the file like a table. [`cut`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#cut) is a useful tool for isolating a column from any table-like file that has a delimiter. Conveniently `cut` considers a tab its default delimiter. To see if your on the right track, you could pipe the output of `cut` into [`head`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#Examining-file-contents). ```bash cut -f 5 rnaseq.sample.ids.tsv | head ``` Next, you'll need to find the unique entries in this column. [`sort -u`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#sort) will accomplish this. That's still a long list of tile numbers. To find the number of unique tile numbers, we can just pipe into the line count function `wc -l`. ```bash cut -f 5 rnaseq.sample.ids.tsv | sort -u | wc -l ``` ## Challenge Problems ### Problem A1 Using bash commands and/or [R](https://hackmd.io/@aphanotus/Rtutorial) determine the median distance between genes in *Onocpeltus fasciatus*. Use the Offical Gene Set vesion 1.2, which is available at `/research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff` #### Solution We'll need to work through several steps here. The first is just examining the [format of the `gff` file](https://en.wikipedia.org/wiki/General_feature_format). ```bash [drangeli@nscc-n28 ~/group]$ head /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff ##gff-version 3 Scaffold1 OGSv1.2 gene 30267 30861 . + . ID=OFAS000001;Name=OFAS000001;Dbxref=I5KNAL:OFAS000001;method=Maker Scaffold1 OGSv1.2 mRNA 30267 30861 . + . ID=OFAS000001-RA;Name=OFAS000001-RA;Parent=OFAS000001;method=Maker Scaffold1 OGSv1.2 exon 30267 30428 . + . ID=OFAS000001-RA-EXON01;Parent=OFAS000001-RA;method=Maker Scaffold1 OGSv1.2 exon 30763 30861 . + . ID=OFAS000001-RA-EXON02;Parent=OFAS000001-RA;method=Maker Scaffold1 OGSv1.2 polypeptide 30267 30861 . + . ID=OFAS000001-PA;Parent=OFAS000001-RA;method=Maker Scaffold1 OGSv1.2 CDS 30267 30428 . + 0 ID=OFAS000001-RA-CDS;Parent=OFAS000001-RA;method=Maker Scaffold1 OGSv1.2 CDS 30763 30861 . + 0 ID=OFAS000001-RA-CDS;Parent=OFAS000001-RA;method=Maker ### Scaffold1 OGSv1.2 gene 401906 407738 . + . ID=OFAS000002;Name=OFAS000002;Dbxref=I5KNAL:OFAS000002;method=Maker ``` The GFF file has information annotating positions in the genome with genes and other features. There are columns listing 1. the scaffold number, 2. the Official Gene Set (OGS) version, 3. the type of annotation (gene, mRNA, exon, etc.), 4. the start and 5. stop positions of the annotation, 6. a confidence value in the annotation (a dot is the null value or place-holder here), 7. the strand or direction of the annotation, 8. and the name or "ID" of the annotation. We can start by using the tools discussed above to filter just the "gene" lines and isolate the start and stop numbers. If we're interested in gene-to-gene distances, then we'll also need to know which genes are physically connected. That's the scaffold number. We can start by using `grep` to isolate lines with "gene". ```bash grep 'gene' /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff | head ``` That gives us an mRNA line where the description just happens to include the word "gene". Instead, we might use a more specific search in `grep` ```bash grep 'OGSv1.2\sgene' /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff | head ``` That looks like what we want. Next, we can use `cut` to isolate columns 1, 4 and 5, which contain the scaffold number and start and stop positions. (For this problem, we don't care about the direction of the gene.) We can even use `sed` to remove the word "Scaffold". These steps could be done one-at-a-time, or we can use piping to put it all together. ```bash grep 'OGSv1.2\sgene' /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff | cut -f 1,4,5 | sed 's/Scaffold//' > tmp.tsv ``` Let's create a proper tab-seperated value (TSV) file with a header. Here I'll use the [append redirect `>>`](https://hackmd.io/EU18COvqTJOGEHEOkWmnHQ#Redirects) and clean up the temporary file I created. ```bash echo 'scaffold\tstart\tstop' > Ofas.gene.positions.tsv cat tmp.tsv >> Ofas.gene.positions.tsv rm tmp.tsv ``` From here it makes sense to take this table of numbers into R. This could be done from **nscc** or by down-loading our `tsv` file to a personal laptop. Breifly, here's how you'd do each of those things. From the command line: ```bash $ R > of.genes <- read.delim('Ofas.gene.positions.tsv', header=TRUE) > dim(of.genes) [1] 19616 3 ``` What follows will be similar in either environment. -- Here's how you can download a file like this one, that you've created on **nscc**. Open a second terminal window, so that you have one displaying your actions on **nscc** and another on your laptop. ```bash # on nscc cp Ofas.gene.positions.tsv ~ # on your laptop scp nscc:~/Ofas.gene.positions.tsv . # then clean up nscc rm ~/Ofas.gene.positions.tsv ``` ##### Once you're in R Once your in R, import the `tsv` file and inspect the in looks like the data you're expecting. Be sure that the file is in the working directory. In R you can find and change your working directory using the functions `getwd()` and `setwd()`. ```R > of.genes <- read.delim('Ofas.gene.positions.tsv', header=TRUE) > dim(of.genes) [1] 19616 3 > head(of.genes) scaffold start stop 1 1 30267 30861 2 1 401906 407738 3 1 674114 674596 4 1 707416 709854 5 1 710078 712192 6 1 771803 783322 ``` From here there are probably many different routes to a solution. However in all of them, you'll need to find gene-to-gene distances within scaffolds. Keep that in mind. But since this is the spoiler, I'll describe my solution. Each row in this data frame represents a gene. Let's create a new column that gives the distance from one next to the next. Now for the last gene on each scaffold that value will be meaningless. (It will actually be positively misleading!) But we can remove those next. To do this, I'll use a `for` loop ```R > # Initialize the new column > of.genes$distance <- NA > for (i in 1:(length(of.genes$distance)-1)) { + of.genes$distance[i] <- of.genes$start[i+1] - of.genes$stop[i] + } > # How's it look? > head(of.genes) scaffold start stop distance 1 1 30267 30861 371045 2 1 401906 407738 266376 3 1 674114 674596 32820 4 1 707416 709854 224 5 1 710078 712192 59611 6 1 771803 783322 242536 > # Does the end of it look okay? > tail(of.genes) scaffold start stop distance 19611 11237 1109 1280 -1222 19612 11570 58 2433 -2062 19613 12130 371 787 183 19614 14556 970 3572 -3568 19615 15370 4 722 -560 19616 16305 162 430 NA ``` This is pretty darn close to our answer! Should we run `median()` now? -- We still have the problem that the last gene on each scaffold will have a meaningless gene-to-gene distance value. First, it's worth thinking about whether that's an issue that could mislead us. Maybe is not so big a deal? There are 19616 genes in this dataset. How many differnt scaffolds are there? ```R > length(unique(of.genes$scaffold)) [1] 3670 > hist(with(of.genes, by(scaffold,scaffold,length))) ``` That's actually a lot! (3670) The histogram shows us that most genes are on a scaffold with very few genes, and only a few scaffolds have a large number of genes. So this *is* a problem we want to account for. One approach might be to add a line to our `for` loop where we check to see if the next row (`i+1`) has the same scaffold value. If not, then we reset the `distance` value to `NA`. ```R > # Initialize the new column > of.genes$distance <- NA > for (i in 1:(length(of.genes$distance)-1)) { + of.genes$distance[i] <- of.genes$start[i+1] - of.genes$stop[i] + if (of.genes$scaffold[i] != of.genes$scaffold[i+1]) { of.genes$distance[i] <- NA } + } > # How's it look? > head(of.genes) scaffold start stop distance 1 1 30267 30861 371045 2 1 401906 407738 266376 3 1 674114 674596 32820 4 1 707416 709854 224 5 1 710078 712192 59611 6 1 771803 783322 242536 > # Does the end of it look okay? > tail(of.genes) scaffold start stop distance 19611 11237 1109 1280 NA 19612 11570 58 2433 NA 19613 12130 371 787 NA 19614 14556 970 3572 NA 19615 15370 4 722 NA 19616 16305 162 430 NA ``` It appears that the GFF was organized with the longest scaffolds first. So `tail` shows the scaffolds that only have one gene. Here all the distance values are `NA`. At this point, we can calculate the median gene-to-gene distance! ```R > median(of.genes$distance, na.rm = TRUE) [1] 7819 ``` So the median distance between genes in *O. fasciatus* looks like it's 7819 bp. (If you made this calculation without accounting for the scaffolds, you'd get 5585 bp.) Before we move on to the next problem though, it may be worth just looking at a summary of the values. ```R > summary(of.genes$distance) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's -363776 2579 7819 20350 19645 1087239 3854 ``` Hmm. The large values are certainly possible, and they don't look too common. But we should *not* have negative values in there. Let's look into this. The `which.min()` and `which.max()` functions are really helpful when you need to find the extreme values from a large data frame. Here, I'll define a temporay index variable `x` to have the row number with the minimum value. I'll use subsetting to look 2 rows before and after that one. ```R > x <- which.min(of.genes$distance) > of.genes[(x-2):(x+2),] scaffold start stop distance 1027 36 16862 56159 10429 1028 36 66588 75161 -332 1029 36 74829 512094 -363776 1030 36 148318 159649 57234 1031 36 216883 232034 123368 ``` It looks like some scaffolds have "genes" that appear to overlap. Again, a simple solution is to modify our `for` loop so that negative distances are changed to `NA`. ```R > # Initialize the new column > of.genes$distance <- NA > for (i in 1:(length(of.genes$distance)-1)) { + of.genes$distance[i] <- of.genes$start[i+1] - of.genes$stop[i] + if (of.genes$distance[i] < 1) { of.genes$distance[i] <- NA } + if (of.genes$scaffold[i] != of.genes$scaffold[i+1]) { of.genes$distance[i] <- NA } + } > summary(of.genes$distance) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 1 3912 9132 23278 21814 1087239 5238 ``` That's better! We now have no negative results. (In order to prevent an error when R tries to evaluate `of.genes$distance[i] < 1` when `of.genes$distance[i]` might be `NA`, I inserted that test before the test of scaffolds that can create NA's.) The median value actually appears in the `summary` function's output. But we can also calculate it again on its own... ```R > median(of.genes$distance, na.rm = TRUE) [1] 9131.5 ``` This is a pretty big large distance! ### Problem A2 [Create a plot](https://hackmd.io/@aphanotus/Rtutorial#Plots-with-base-R) of the distribution of inter-gene distances. #### Solution Compared to the last problem, this one is quite simple. R makes plotting the distribution of data really easy! ```R > hist(of.genes$distance) ``` Most genes are seperated by about 9000 bp, although the distribution is quite skewed. So looking at a log scale may be helpful. ```R > hist(log10(of.genes$distance)) ``` This means that the distances between most genes are on the order to 1000's of base pairs. But rarely some genes are seperated by much larger distances. ### Problem A3 How does this compare to some [other insects](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/50557/)? #### Solution To answer this question, you'd need to start over with the annotation GFF file from a different species. Let's try *Drosophila melanogaster*. Surprisingly, I don't have the GFF file for that species on **nscc**. It can be found though by search for the species name on [NCBI's genome database](https://www.ncbi.nlm.nih.gov/genome/). The [page for *D. melanogaster*](https://www.ncbi.nlm.nih.gov/genome/?term=Drosophila+melanogaster) has a direct link to the GFF. I copied the URL and used it to download directly to **nscc**. ```bash cd /research/drangeli/DB/genomes/Dmel/ curl -O "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz" gunzip GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz ``` After de-compressing the file, I'll move back to the group folder to work. ```bash cd /export/groups/drangeli ``` Inspecting the GFF file via `head` or `less` you can see that there are many more annotations in this genome. Rather than an OGS version, column 2 has "RefSeq" which we can also use to help filter. ```bash grep 'RefSeq\sgene' /research/drangeli/DB/genomes/Dmel/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff | cut -f 1,4,5 > tmp.tsv echo 'scaffold\tstart\tstop' > Dmel.gene.positions.tsv cat tmp.tsv >> Dmel.gene.positions.tsv rm tmp.tsv ``` Next, in R... ```bash dmel.genes <- read.delim('Dmel.gene.positions.tsv', header=TRUE) dmel.genes$distance <- NA for (i in 1:(length(dmel.genes$distance)-1)) { dmel.genes$distance[i] <- dmel.genes$start[i+1] - dmel.genes$stop[i] if (dmel.genes$distance[i] < 1) { dmel.genes$distance[i] <- NA } if (dmel.genes$scaffold[i] != dmel.genes$scaffold[i+1]) { dmel.genes$distance[i] <- NA } } summary(dmel.genes$distance) hist(of.genes$distance) hist(log10(of.genes$distance)) ``` Fly genes appear closer together, with a median inter-gene distance of 4842 bp, although there's a similar skew to nearer distances. --- *Dave Angelini*, 2019