# Bash class 2022
## while loops
A `while` loop runs its body repeatedly for as long as the condition is true.
```
#!/bin/bash
i=0
while [ $i -le 10 ];
do
echo $i
let i=i+1
done
```
```
for i in 1 2 3 4 5 6 7 8 9 10; do echo $i; done
```
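Typing out every value becomes tedious for longer ranges. As a minimal sketch, assuming Bash rather than plain `sh`, the same count can be written with brace expansion or with a C-style loop. The variable names `brace_out` and `cstyle_out` are only there for illustration.

```bash
#!/bin/bash
# Brace expansion: {1..10} expands to "1 2 3 ... 10" before the loop runs.
brace_out=$(for i in {1..10}; do echo "$i"; done)

# C-style loop: initialiser, condition and increment in arithmetic context.
cstyle_out=$(for ((i = 1; i <= 10; i++)); do echo "$i"; done)

echo "$brace_out"
```

Both forms print the numbers 1 to 10, one per line. Brace expansion needs literal bounds, while the C-style form also accepts variables.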
An `until` loop runs its body repeatedly for as long as the condition is false.
```
i=10
until [ $i -le 0 ];
do
echo $i
let i=i-1
done
```
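Put differently, `until` behaves like `while` with the test negated. The sketch below, using the illustrative function names `countdown_until` and `countdown_while`, shows the two forms producing the same countdown.

```bash
#!/bin/bash
# Count down from the first argument to 1 using until.
countdown_until() {
    local i=$1
    until [ "$i" -le 0 ]; do
        echo "$i"
        i=$((i - 1))
    done
}

# The same loop written as while with the condition negated.
countdown_while() {
    local i=$1
    while ! [ "$i" -le 0 ]; do
        echo "$i"
        i=$((i - 1))
    done
}

countdown_until 3
countdown_while 3
```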
## case statements
```
#!/bin/bash
day=$(date "+%a") # same as the older backtick form: `date "+%a"`
case $day in
Mon | Tue | Wed | Thu | Fri)
echo "Today is a weekday"
;;
Sat | Sun)
echo "Today is a weekend"
;;
*)
echo "This day is not recognised"
;;
esac
```
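Because the script above depends on the current date, it is hard to check every branch. One option, sketched here with a hypothetical helper `day_type`, is to move the `case` into a function that takes the abbreviation as an argument. This assumes the English abbreviations that `date "+%a"` prints in the C locale (`Mon`, `Tue`, `Wed`, `Thu`, `Fri`, `Sat`, `Sun`).

```bash
#!/bin/bash
# Classify a three-letter day abbreviation passed as the first argument.
day_type() {
    case "$1" in
        Mon | Tue | Wed | Thu | Fri)
            echo "weekday"
            ;;
        Sat | Sun)
            echo "weekend"
            ;;
        *)
            echo "unknown"
            ;;
    esac
}

# LC_ALL=C forces English abbreviations regardless of the system locale.
today=$(LC_ALL=C date "+%a")
echo "Today ($today) is a $(day_type "$today")"
```

Calling `day_type Thur` falls through to the `*)` branch, which is a quick way to see that the patterns must match the output of `date` exactly.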
Script with if statements
```
#!/bin/bash
# This script asks the user to enter their name, then
# writes a nice welcome message with the date.
echo -n "What is your name, my friend? "
read name # writes the input into that variable
# then we want to gather the current time in a specific format:
fulldate=$( date +%H%M%Z%B%d )
hours=${fulldate:0:2}
minutes=${fulldate:2:2}
day=${fulldate:(-2)}
text=${fulldate:4} ; text=${text%%[0-9]*}
# at this stage, 'text' contains the concatenation of the timezone code
# and the month name, which starts with an uppercase letter:
tzcode=${text%[A-Z]*}
tzcode_length=${#tzcode}
month=${text:tzcode_length}
# DEBUG:
# echo "${hours}:${minutes}${tzcode} on ${day} of ${month}"
# Now we determine whether it is evening (from 17:00 to 23:00), night (23:00 to 04:00), morning (04:00 to 10:00), or "day" (10:00 to 17:00)
# We could use an if/elif/else:
# Beware! the 'hours' variable is on two digits, it makes perfect sense here
# to perform tests based on alphabetical sort, all the more because sometimes
# numbers starting with a leading 0 can be interpreted in octal.
# Careful! >= and <= operators are not accepted for alphabetical string comparison!
if [[ "${hours}" < "23" && "${hours}" > "16" ]]
then
period=evening
elif [[ "${hours}" == "23" || "${hours}" < "04" ]] # careful! a logical OR here!
then
period=night
elif [[ "${hours}" > "03" && "${hours}" < "10" ]]
then
period=morning
else
period=day
fi
# And finally, printing the message:
echo "Good ${period}, ${name}! It is now ${hours}:${minutes}${tzcode} on this lovely day of ${month} ${day}."
```
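The comments in the script warn that a number with a leading zero can be read as octal. A short sketch of that pitfall follows, together with the `10#` prefix that forces base 10 in Bash arithmetic. The helper name `period_of` is hypothetical; it mirrors the boundaries used in the script but compares numbers instead of strings.

```bash
#!/bin/bash
# In arithmetic context, 08 and 09 are invalid octal numbers:
#   echo $(( 08 + 1 ))    # bash reports an error here
# Prefixing with 10# tells Bash to read the digits as decimal.
hours=08
next=$(( 10#$hours + 1 ))
echo "$next"

# Map a two-digit hour to a period of the day using numeric comparison.
period_of() {
    local h=$(( 10#$1 ))   # e.g. "04" becomes 4
    if (( h >= 17 && h < 23 )); then
        echo evening
    elif (( h == 23 || h < 4 )); then
        echo night
    elif (( h >= 4 && h < 10 )); then
        echo morning
    else
        echo day
    fi
}

period_of 08
period_of 23
```

With numeric comparison the `>=` and `<=` operators become available, at the cost of having to remember the `10#` prefix.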
Script with case statements
```
#!/bin/bash
#
# This script asks the user to enter their name, then
# writes a nice welcome message with the date.
echo -n "What is your name, my friend? "
read name # writes the input into that variable
# then we want to gather the current time in a specific format:
fulldate=$( date +%H%M%Z%B%d )
hours=${fulldate:0:2}
minutes=${fulldate:2:2}
day=${fulldate:(-2)}
text=${fulldate:4} ; text=${text%%[0-9]*}
# at this stage, 'text' contains the concatenation of the timezone code
# and the month name, which starts with an uppercase letter:
tzcode=${text%[A-Z]*}
tzcode_length=${#tzcode}
month=${text:tzcode_length}
# DEBUG:
# echo "${hours}:${minutes}${tzcode} on ${day} of ${month}"
# Now we determine whether it is evening (from 17:00 to 23:00), night (23:00 to 04:00), morning (04:00 to 10:00), or "day" (10:00 to 17:00)
# We can use a case statement:
case "${hours}" in
1[7-9] | 2[0-2] )
period=evening
;;
23 | 0[0-3] )
period=night
;;
0[4-9] )
period=morning
;;
* )
period=day
;;
esac
# And finally, printing the message:
echo "Good ${period}, ${name}! It is now ${hours}:${minutes}${tzcode} on this lovely day of ${month} ${day}."
```
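The substring arithmetic in both scripts is needed only because all the fields are glued together in one string. A possibly simpler alternative, assuming Bash (for process substitution), is to put spaces in the format string and let `read` split the fields. The trade-off is that it would break if any field contained a space, which `%Z` and `%B` normally do not.

```bash
#!/bin/bash
# One call to date, five space-separated fields, split by read.
read -r hours minutes tzcode month day < <(date '+%H %M %Z %B %d')

echo "${hours}:${minutes} ${tzcode} on ${month} ${day}"
```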
## Access to the file for today's class
```
wget https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/nrf1_seq.fa
# alternatively
curl https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/nrf1_seq.fa > nrf1_seq.fa
```
## grep and sed
```
# using grep and sed to search the patterns
# study the output from the commands below and refer to the regex cheatsheet
# as well as the guide reference books.
grep --color ACAAGTTTG nrf1_seq.fa
grep --color 'ACAAGTTTG' nrf1_seq.fa
grep --color 'ACAAGTTTG[TA]' nrf1_seq.fa
grep --color 'ACAAGTTTG[TAC]' nrf1_seq.fa
grep --color 'ACAAGTTTG[^TAC]' nrf1_seq.fa
grep --color 'ACAAGTTTG[^TA]' nrf1_seq.fa
grep --color 'ACAA.TTTGTT' nrf1_seq.fa
grep --color 'ACAA.TTTG' nrf1_seq.fa
grep --color 'ACAA.TTTGA' nrf1_seq.fa
grep --color 'ACAA*TTTGA' nrf1_seq.fa
grep --color 'ACATTT*GA' nrf1_seq.fa
grep --color 'ACA' nrf1_seq.fa
grep --color 'ACA' nrf1_seq.fa | wc -l
grep --color 'ACATTT*GA' nrf1_seq.fa
grep --color -o 'ACATTT*GA' nrf1_seq.fa
grep --color -o 'ACATTT*GA' nrf1_seq.fa| wc -l
grep --color 'ACATTT*GA' nrf1_seq.fa| wc -l
grep --color 'ACA' nrf1_seq.fa| wc -l
grep --color -o 'ACA' nrf1_seq.fa| wc -l
grep --color -o -n 'ACA' nrf1_seq.fa| less
less -S nrf1_seq.fa
grep --color 'TCTCCTGCCCTTGATATATTGCTTACCCCTCTCTTAAA' nrf1_seq.fa
grep --color 'TCTCCTGCCCTTGATATATTGCTT[^A]CCCCTCTCTTAAA' nrf1_seq.fa
grep '^>' nrf1_seq.fa
grep 'predicted' nrf1_seq.fa
grep -i 'predicted' nrf1_seq.fa
grep -i '[[:upper:]]' nrf1_seq.fa
grep -i '[[:lower:]]' nrf1_seq.fa
grep '[[:lower:]]' nrf1_seq.fa
grep --color '[[:lower:]]' nrf1_seq.fa
grep --color '[A-Z]' dummy_months
grep --color '[ABCDZ]' dummy_months
grep --color '[:upper:]' dummy_months
grep --color '[[:upper:]]' dummy_months
grep --color '[[:lower:]]' dummy_months
grep --color '[a-z]' dummy_months
grep --color '[[:upper:]][[:lower:]]' dummy_months
grep --color '^[[:upper:]][[:lower:]]' dummy_months
grep --color '[[:alpha:]]' dummy_months
grep --color -o '[[:alpha:]]' dummy_months
grep --color -o '[[:alpha:]]*' dummy_months
grep --color -o '[[:digit:]]*' dummy_months
grep --color -o 'T{5,8}' nrf1_seq.fa
egrep --color -o 'T{5,8}' nrf1_seq.fa # check the man of grep to understand what egrep is
grep -E --color -o 'T{5,8}' nrf1_seq.fa
grep --color -o 'T\{5,8\}' nrf1_seq.fa
grep --color 'T\{5,8\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa | less
grep --color 'T\{15,18\}' nrf1_seq.fa | more
grep --color 'T\{15,18\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa > temp
cat temp
grep --color 'T\{15,18\}' nrf1_seq.fa
grep --color 'T\{15,18\}|GGGG' nrf1_seq.fa
grep --color 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep --color -o 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep -o 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep -o 'T\{15,18\}\|GGGG' nrf1_seq.fa | sort | uniq -c
grep -E -o 'T{16}' nrf1_seq.fa
grep -E --color 'T{16}' nrf1_seq.fa
grep -E --color '[^T]T{16}[^T]' nrf1_seq.fa
grep -E --color -o '[^T]T{16}[^T]' nrf1_seq.fa | wc -l
less -S nrf1_seq.fa
less -S <(grep '^>' nrf1_seq.fa)
grep '^>' nrf1_seq.fa | less -S
grep '^>[^_]*$' nrf1_seq.fa | wc -l
grep '^>[^_]*' nrf1_seq.fa | wc -l
grep --color '^>[]*$' nrf1_seq.fa # pay special attention to the output
grep --color 'A*' dummy_months
grep --color 'A+' dummy_months
grep -E --color 'A+' dummy_months
grep -E --color '[[:lower:]]+' dummy_months
grep -E --color '[[:lower:]]+' nrf1_seq.fa | wc -l
grep -E --color '[[:lower:]]*' nrf1_seq.fa | wc -l
grep -E --color '[[:lower:]][[:lower:]]*' nrf1_seq.fa | wc -l
grep -E --color '^>[[:upper:]]+[[:digits:]]+' nrf1_seq.fa # error expected: the class is [[:digit:]], not [[:digits:]]
grep -E --color '^>[[:upper:]]+[[:digit:]]+' nrf1_seq.fa
grep -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
grep '[:alpha:]' nrf1_seq.fa
grep -E '[:alpha:]' nrf1_seq.fa
grep -E '[[:alpha:]]' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]][[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]{2,} [[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]{2,}[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]A]+' nrf1_seq.fa | tr -d '>'
# create a file called "dummy_months" just as we did in class
less dummy_months
sed 5 dummy_months
sed '' dummy_months
sed '5p' dummy_months
sed -n '' dummy_months
sed -n '5p' dummy_months
wc -l nrf1_seq.fa
sed -n '5000p' nrf1_seq.fa
sed -n '5000s/CACCC/XXXXX/' nrf1_seq.fa
sed -n '5000s/CACCC/XXXXX/;p' nrf1_seq.fa
sed 's/pril/XXXXXXX/' dummy_months
sed '3s/pril/XXXXXXX/' dummy_months
sed '3s/pril/XXXXXXX/g' dummy_months
sed 's/pril/XXXXXXX/g' dummy_months
sed '3s/pril/XXXXXXX/' dummy_months
sed '1,3 s/pril/XXXXXXX/' dummy_months
sed '/8/ s/pril/XXXXXXX/' dummy_months
sed '3a\' dummy_months
sed '3a\mynew line is this one\' dummy_months
sed '3a\this is my new line' dummy_months
sed 's/pril/XXX/' dummy_months
sed 's/pril/XXX' dummy_months
sed 's#pril#XXX#' dummy_months
sed 's_pril_XXX_' dummy_months
sed 's_pril_XXX' dummy_months
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
sed -n '/^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+/p' nrf1_seq.fa # + is a literal character in basic syntax, so this is unlikely to match
sed -n '/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+/p' nrf1_seq.fa
grep '^>' nrf1_seq.fa > identifiers.txt
sed '/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+/p' identifiers.txt
sed 's/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+//' identifiers.txt | less -S
sed 's#^\(>[[:upper:]]\+[[:digit:]]\+\)\.\([[:digit:]]\+\)#\1 version \2#' identifiers.txt | less -S
```
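To experiment with these patterns without the full dataset, a few lines of made-up FASTA are enough. The sketch below writes a hypothetical file `toy.fa` (its identifiers and sequences are invented for illustration, not taken from `nrf1_seq.fa`). It contrasts counting matching lines with counting matches, compares basic and extended interval syntax, and repeats the capture-group substitution. Unlike the command above, the `sed` expression here uses `*` instead of `\+` and ends in `.*`, so it drops the description after the version number.

```bash
#!/bin/bash
# Invented data, for illustration only.
cat > toy.fa <<'EOF'
>AB12345.1 toy sequence one
ACATTGAACAACA
>CD67890.2 toy sequence two
TTTTTTGGGG
EOF

# Lines containing ACA versus individual ACA matches.
lines_with_aca=$(grep -c 'ACA' toy.fa)
aca_matches=$(grep -o 'ACA' toy.fa | wc -l)

# Interval expressions: braces need backslashes in basic syntax, not with -E.
bre_run=$(grep -o 'T\{5,8\}' toy.fa)
ere_run=$(grep -E -o 'T{5,8}' toy.fa)

# Capture groups: \1 is the accession, \2 the version number.
versions=$(grep '^>' toy.fa |
    sed 's#^>\([[:upper:]]*[[:digit:]]*\)\.\([[:digit:]]*\).*#\1 version \2#')

echo "lines: $lines_with_aca, matches: $aca_matches"
echo "$versions"
```

The difference between the two counts is the reason `grep ... | wc -l` and `grep -o ... | wc -l` give different numbers in the session above.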
## GNU awk
```
awk 'BEGIN{printf "chromosome\tstart\tend\taccession\n"} {print}' chromosome_one.bed
```
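The `BEGIN` block runs once before any input is read, which is why it suits a header line. A minimal sketch on an invented four-column file `toy.bed` follows; the real `chromosome_one.bed` may look different.

```bash
#!/bin/bash
# Invented tab-separated data, for illustration only.
printf 'chr1\t100\t200\tacc1\nchr1\t300\t400\tacc2\n' > toy.bed

# BEGIN prints the header once; the bare {print} then echoes every input line.
with_header=$(awk 'BEGIN{printf "chromosome\tstart\tend\taccession\n"} {print}' toy.bed)
printf '%s\n' "$with_header"
```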
awk commands can also be stored in a script file and passed with `-f`.
The content of the script `print.awk` is:
```
{print}
```
The awk command then becomes:
```
awk -f print.awk chromosome_one.bed
```
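As a sketch of how a script file can grow beyond a single action, the hypothetical `header.awk` below combines a `BEGIN` block with the `{print}` action. The file `toy.bed` is invented sample data.

```bash
#!/bin/bash
# Invented tab-separated data, for illustration only.
printf 'chr1\t100\t200\tacc1\nchr2\t300\t400\tacc2\n' > toy.bed

# Write the awk program to a file; the quoted 'EOF' stops the shell
# from expanding anything inside the here-document.
cat > header.awk <<'EOF'
BEGIN { print "chromosome\tstart\tend\taccession" }
{ print }
EOF

from_file=$(awk -f header.awk toy.bed)
printf '%s\n' "$from_file"
```

Keeping the program in a file avoids shell quoting problems once the awk code spans several lines.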
Select columns and print them in the desired order:
```
awk '{print $3 "\t" $2}' new.bed
```
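Writing `"\t"` between every field gets repetitive as the number of columns grows. A common alternative, sketched here on invented data, is to set the input and output field separators (`FS` and `OFS`) once and separate the fields with commas.

```bash
#!/bin/bash
# Invented tab-separated data, for illustration only.
printf 'chr1\t100\t200\nchr2\t300\t400\n' > toy.bed

# With OFS set to a tab, each comma in print becomes a tab in the output.
swapped=$(awk 'BEGIN{FS = OFS = "\t"} {print $3, $2}' toy.bed)
printf '%s\n' "$swapped"
```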
Count the lines that match a pattern:
```
awk '/chr1/{++cnt} END{print "Number of entries =", cnt}' example.bed
```
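One caveat with `/chr1/`: it is a regular expression tested against the whole line, so it also matches `chr10`, `chr11` and so on. If only chromosome 1 is wanted, comparing the first field exactly is safer. A sketch on invented data, assuming the chromosome name is in column 1:

```bash
#!/bin/bash
# Invented tab-separated data, for illustration only.
printf 'chr1\t100\t200\nchr10\t300\t400\nchr1\t500\t600\n' > toy.bed

# Regex match anywhere in the line: counts chr1 and chr10.
regex_count=$(awk '/chr1/{++cnt} END{print cnt + 0}' toy.bed)

# Exact comparison on column 1: counts chr1 only.
# cnt + 0 prints 0 instead of an empty string when nothing matched.
exact_count=$(awk '$1 == "chr1"{++cnt} END{print cnt + 0}' toy.bed)

echo "regex: $regex_count, exact: $exact_count"
```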
## In-class exercise for 8^th^ December 2022
Download the GTF file from the address below:
```
ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
```
You can find a description of the GTF format [here](https://www.ensembl.org/info/website/upload/gff.html).
Tasks
1. Unzip the file using `gzip`.
2. How many lines start with `#`?
3. How many of each feature are in the GTF? **Hint:** Look at column 3.
4. Show all the unique chromosomes of the fungus, in ascending numeric order.
5. Extract all unique `gene_id` values found in column 9 and save them to a new file called `gene_counts.txt`. You can assume that `gene_id` is always the first field in column 9.
6. Remove the double quotes around the `gene_id` values.
**Use AWK for the rest of the exercise**
7. Are there any non-header lines with more or fewer than 9 TAB-separated columns? How many non-header lines with exactly 9 columns are there?
8. How many gene features are there on chromosome 2?
9. Extract all features that are found on the plus (+) strand.
## Exercise on regular expressions
Download the putative binding site positions obtained by a ChIP-seq experiment on the Suz12 transcription factor. The file consists of one row per feature.
You can read more about ChIP-seq experiments [here](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-8-56).
```
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41589/suppl/GSE41589_Suz12_BindingSites.txt.gz
```
1. Unzip the file.
2. How many binding sites are in the file?
3. How many chromosomes are in the file?
4. Extract all binding sites for chromosome 1 and redirect them to a new file.
5. Using a regular expression, extract the sites that correspond to chromosomes 1, 3 and 9.
6. Using a regular expression, extract the sites that correspond to chromosomes 10, 11 and 13.
7. Using a regular expression, extract the sites that correspond to chromosomes 10 to 19.