# Bash class 2022

## while loops

A `while` loop runs its body repeatedly for as long as the condition is true:

```
#!/bin/bash
i=0
while [ "$i" -le 10 ]; do
    echo "$i"
    let i=i+1
done
```

A `for` loop iterates over an explicit list of values:

```
for i in 1 2 3 4 5 6 7 8 9 10; do echo "$i"; done
```

An `until` loop runs its body repeatedly for as long as the condition is false:

```
i=10
until [ "$i" -le 0 ]; do
    echo "$i"
    let i=i-1
done
```

## case statements

```
#!/bin/bash
day=$(date "+%a")   # older equivalent syntax: `date "+%a"`
case $day in
    Mon | Tue | Wed | Thu | Fri)   # date "+%a" prints "Thu", not "Thur"
        echo "Today is a weekday"
        ;;
    Sat | Sun)
        echo "Today is a weekend"
        ;;
    *)
        echo "This day is not recognised"
        ;;
esac
```

Script with if statements

```
#!/bin/bash
# This script asks the user to enter their name, then
# writes a nice welcome message with the date.

echo -n "What is your name, my friend? "
read name   # writes the input into that variable

# then we want to gather the current time in a specific format:
fulldate=$( date +%H%M%Z%B%d )
hours=${fulldate:0:2}
minutes=${fulldate:2:2}
day=${fulldate:(-2)}
text=${fulldate:4} ; text=${text%%[0-9]*}
# at this stage, 'text' contains the concatenation of the timezone code
# and the month name, which starts with an uppercase letter:
tzcode=${text%[A-Z]*}
tzcode_length=${#tzcode}
month=${text:tzcode_length}

# DEBUG:
# echo "${hours}:${minutes}${tzcode} on ${day} of ${month}"

# Now we determine whether it is evening (from 17:00 to 23:00),
# night (23:00 to 04:00), morning (04:00 to 10:00),
# or "day" (10:00 to 17:00)
# We could use an if/elif/else:
# Beware! the 'hours' variable is on two digits, it makes perfect sense here
# to perform tests based on alphabetical sort, all the more because sometimes
# numbers starting with a leading 0 can be interpreted in octal.
# Careful! >= and <= operators are not accepted for alphabetical string comparison!
if [[ "${hours}" < "23" && "${hours}" > "16" ]]
then
    period=evening
elif [[ "${hours}" == "23" || "${hours}" < "04" ]]   # careful! a logical OR here!
then
    period=night
elif [[ "${hours}" > "03" && "${hours}" < "10" ]]
then
    period=morning
else
    period=day
fi

# And finally, printing the message:
echo "Good ${period}, ${name}! It is now ${hours}:${minutes}${tzcode} on this lovely day of ${month} ${day}."
```

Script with case statements

```
#!/bin/bash
#
# This script asks the user to enter their name, then
# writes a nice welcome message with the date.

echo -n "What is your name, my friend? "
read name   # writes the input into that variable

# then we want to gather the current time in a specific format:
fulldate=$( date +%H%M%Z%B%d )
hours=${fulldate:0:2}
minutes=${fulldate:2:2}
day=${fulldate:(-2)}
text=${fulldate:4} ; text=${text%%[0-9]*}
# at this stage, 'text' contains the concatenation of the timezone code
# and the month name, which starts with an uppercase letter:
tzcode=${text%[A-Z]*}
tzcode_length=${#tzcode}
month=${text:tzcode_length}

# DEBUG:
# echo "${hours}:${minutes}${tzcode} on ${day} of ${month}"

# Now we determine whether it is evening (from 17:00 to 23:00),
# night (23:00 to 04:00), morning (04:00 to 10:00),
# or "day" (10:00 to 17:00)
# We can use a case statement:
case "${hours}" in
    1[7-9] | 2[0-2] )
        period=evening
        ;;
    23 | 0[0-3] )
        period=night
        ;;
    0[4-9] )   # 10 belongs to "day", as in the if/elif version
        period=morning
        ;;
    * )
        period=day
        ;;
esac

# And finally, printing the message:
echo "Good ${period}, ${name}! It is now ${hours}:${minutes}${tzcode} on this lovely day of ${month} ${day}."
```

## Access to the file for today's class

```
wget https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/nrf1_seq.fa
# alternatively
curl https://raw.githubusercontent.com/eanbit-rt/IntroductoryLinux-2019/master/Data/nrf1_seq.fa > nrf1_seq.fa
```

## grep and sed

```
# using grep and sed to search the patterns
# study the output from the commands below and refer to the regex cheatsheet
# as well as the guide reference books.
grep --color ACAAGTTTG nrf1_seq.fa
grep --color 'ACAAGTTTG' nrf1_seq.fa
grep --color 'ACAAGTTTG[TA]' nrf1_seq.fa
grep --color 'ACAAGTTTG[TAC]' nrf1_seq.fa
grep --color 'ACAAGTTTG[^TAC]' nrf1_seq.fa
grep --color 'ACAAGTTTG[^TA]' nrf1_seq.fa
grep --color 'ACAA.TTTGTT' nrf1_seq.fa
grep --color 'ACAA.TTTG' nrf1_seq.fa
grep --color 'ACAA.TTTGA' nrf1_seq.fa
# * repeats the preceding character zero or more times
grep --color 'ACAA*TTTGA' nrf1_seq.fa
grep --color 'ACATTT*GA' nrf1_seq.fa
grep --color 'ACA' nrf1_seq.fa
grep --color 'ACA' nrf1_seq.fa | wc -l
grep --color 'ACATTT*GA' nrf1_seq.fa
grep --color -o 'ACATTT*GA' nrf1_seq.fa
grep --color -o 'ACATTT*GA' nrf1_seq.fa | wc -l
grep --color 'ACATTT*GA' nrf1_seq.fa | wc -l
grep --color 'ACA' nrf1_seq.fa | wc -l
grep --color -o 'ACA' nrf1_seq.fa | wc -l
grep --color -o -n 'ACA' nrf1_seq.fa | less
less -S nrf1_seq.fa
grep --color 'TCTCCTGCCCTTGATATATTGCTTACCCCTCTCTTAAA' nrf1_seq.fa
grep --color 'TCTCCTGCCCTTGATATATTGCTT[^A]CCCCTCTCTTAAA' nrf1_seq.fa
grep '^>' nrf1_seq.fa
grep 'predicted' nrf1_seq.fa
grep -i 'predicted' nrf1_seq.fa
grep -i '[[:upper:]]' nrf1_seq.fa
grep -i '[[:lower:]]' nrf1_seq.fa
grep '[[:lower:]]' nrf1_seq.fa
grep --color '[[:lower:]]' nrf1_seq.fa
grep --color '[A-Z]' dummy_months
grep --color '[ABCDZ]' dummy_months
# missing outer brackets: the class must be written [[:upper:]]
grep --color '[:upper:]' dummy_months
grep --color '[[:upper:]]' dummy_months
grep --color '[[:lower:]]' dummy_months
grep --color '[a-z]' dummy_months
grep --color '[[:upper:]][[:lower:]]' dummy_months
grep --color '^[[:upper:]][[:lower:]]' dummy_months
grep --color '[[:alpha:]]' dummy_months
grep --color -o '[[:alpha:]]' dummy_months
grep --color -o '[[:alpha:]]*' dummy_months
grep --color -o '[[:digit:]]*' dummy_months
# in basic regex (the default), {5,8} is literal text
grep --color -o 'T{5,8}' nrf1_seq.fa
egrep --color -o 'T{5,8}' nrf1_seq.fa   # check the man page of grep to understand what egrep is
grep -E --color -o 'T{5,8}' nrf1_seq.fa
grep --color -o 'T\{5,8\}' nrf1_seq.fa
grep --color 'T\{5,8\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa | less
grep --color 'T\{15,18\}' nrf1_seq.fa | more
grep --color 'T\{15,18\}' nrf1_seq.fa
grep --color 'T\{15,18\}' nrf1_seq.fa > temp
cat temp
grep --color 'T\{15,18\}' nrf1_seq.fa
# in basic regex a bare | is a literal character; GNU grep uses \| for alternation
grep --color 'T\{15,18\}|GGGG' nrf1_seq.fa
grep --color 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep --color -o 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep --o 'T\{15,18\}\|GGGG' nrf1_seq.fa
grep --o 'T\{15,18\}\|GGGG' nrf1_seq.fa | sort | uniq -c
grep -E --o 'T{16}' nrf1_seq.fa
grep -E -o 'T{16}' nrf1_seq.fa
grep -E --color 'T{16}' nrf1_seq.fa
grep -E --color '[^T]T{16}[^T]' nrf1_seq.fa
grep -E --color -o '[^T]T{16}[^T]' nrf1_seq.fa | wc -l
less -S nrf1_seq.fa
less -S <(grep '^>' nrf1_seq.fa)
grep '^>' nrf1_seq.fa | less -S
grep '^>[^_]*$' nrf1_seq.fa | wc -l
grep '^>[^_]*' nrf1_seq.fa | wc -l
grep --color '^>[]*$' nrf1_seq.fa
# pay special attention to the output
grep --color 'A*' dummy_months
# in basic regex, + is a literal character
grep --color 'A+' dummy_months
grep -E --color 'A+' dummy_months
grep -E --color '[[:lower:]]+' dummy_months
grep -E --color '[[:lower:]]+' nrf1_seq.fa | wc -l
grep -E --color '[[:lower:]]*' nrf1_seq.fa | wc -l
grep -E --color '[[:lower:]][[:lower:]]*' nrf1_seq.fa | wc -l
grep -E --color '^>[[:upper:]]+[[:digit:]]+' nrf1_seq.fa
grep -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
# missing outer brackets again: compare with [[:alpha:]] below
grep '[:alpha:]' nrf1_seq.fa
grep -E '[:alpha:]' nrf1_seq.fa
grep -E '[[:alpha:]]' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]][[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]{2,} [[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]{2,}[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]A]+' nrf1_seq.fa | tr -d '>'

# create a file called "dummy_months" just as we did in class
less dummy_months
# fails: 5 is an address with no command
sed 5 dummy_months
sed '' dummy_months
# without -n every line is printed, so line 5 appears twice
sed '5p' dummy_months
sed -n '' dummy_months
sed -n '5p' dummy_months
wc -l nrf1_seq.fa
sed -n '5000p' nrf1_seq.fa
sed -n '5000s/CACCC/XXXXX/' nrf1_seq.fa
sed -n '5000s/CACCC/XXXXX/;p' nrf1_seq.fa
sed 's/pril/XXXXXXX/' dummy_months
sed '3s/pril/XXXXXXX/' dummy_months
sed '3s/pril/XXXXXXX/g' dummy_months
sed 's/pril/XXXXXXX/g' dummy_months
sed '3s/pril/XXXXXXX/' dummy_months
sed '1,3 s/pril/XXXXXXX/' dummy_months
sed '/8/ s/pril/XXXXXXX/' dummy_months
sed '3a\' dummy_months
sed '3a\mynew line is this one\' dummy_months
sed '3a\this is my new line' dummy_months
sed 's/pril/XXX/' dummy_months
# fails: the closing delimiter is missing
sed 's/pril/XXX' dummy_months
sed 's#pril#XXX#' dummy_months
sed 's_pril_XXX_' dummy_months
sed 's_pril_XXX' dummy_months
grep -o -E --color '^>[[:upper:]]+[[:digit:]]{5}\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
grep -o -E --color '^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+' nrf1_seq.fa | tr -d '>'
# sed uses basic regex by default, so + is literal here; GNU sed accepts \+ instead
sed -n '/^>[[:upper:]]+[[:digit:]]+\.[[:digit:]]+/p' nrf1_seq.fa
sed -n '/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+/p' nrf1_seq.fa
grep '^>' nrf1_seq.fa > identifiers.txt
sed '/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+/p' identifiers.txt
sed 's/^>[[:upper:]]\+[[:digit:]]\+\.[[:digit:]]\+//' identifiers.txt | less -S
sed 's#^\(>[[:upper:]]\+[[:digit:]]\+\)\.\([[:digit:]]\+\)#\1 version \2#' identifiers.txt | less -S
```

## GNU awk

```
awk 'BEGIN{printf "chromosome\tstart\tend\taccession\n"} {print}' chromosome_one.bed
```

Providing awk commands in a script

The content of the script `print.awk` is

```
{print}
```

The awk command becomes

```
awk -f print.awk chromosome_one.bed
```

Select columns and print them in the desired order

```
awk '{print $3 "\t" $2}' new.bed
```

Counting patterns

```
awk '/chr1/{++cnt} END{print "Number of entries = ", cnt}' example.bed
```

## In-class exercise for 8^th^ December 2022

Download the GTF file from the address below

```
ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
```

You can find a description of the GTF format [here](https://www.ensembl.org/info/website/upload/gff.html)

Tasks

1. Unzip the file using `gzip`
2. How many lines start with `#`?
3. How many of each feature are in the GTF? **Hint:** Look at column 3
4. Show all the unique chromosomes of the fungus, in ascending numeric order
5. Extract all unique `gene_id` values found in column 9 and save them to a new file called `gene_counts.txt`. You can assume that `gene_id` is always the first field in column 9.
6. Remove the double quotes around the `gene_id` values

**Use AWK for the rest of the exercise**

7. Are there any non-header lines with more or fewer than 9 TAB-separated columns? How many non-header lines have exactly 9 columns?
8. How many gene features are there on chromosome 2?
9. Extract all features that are found on the plus (+) strand

## Exercise on regular expressions

Download the putative binding site positions obtained by a ChIP-seq experiment on the Suz12 transcription factor. The file consists of one row per feature. You can read more about ChIP-seq experiments [here](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-8-56)

```
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE41nnn/GSE41589/suppl/GSE41589_Suz12_BindingSites.txt.gz
```

1. Unzip the file
2. How many binding sites are in the file?
3. How many chromosomes are in the file?
4. Extract all binding sites for chromosome 1 and redirect them to a new file
5. Using a regular expression, extract sites that correspond to chromosomes 1, 3 and 9
6. Using a regular expression, extract sites that correspond to chromosomes 11, 13 and 10
7. Using a regular expression, extract sites that correspond to chromosomes 10 to 19
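
The chromosome tasks in this exercise depend on anchoring the pattern and on bracket expressions. Here is a minimal sketch on made-up data, not on the real file. It assumes that chromosome names are written as `chr<N>` in the first column and are followed by whitespace, which may not hold for the downloaded file. `sample_sites` and `count_sites` are hypothetical helpers introduced only for this illustration.

```shell
#!/bin/bash
# Hypothetical sample data: the real file's column layout is an assumption here.
sample_sites() {
    printf '%s\t%s\t%s\n' \
        chr1  100 200 \
        chr3  150 250 \
        chr9  300 400 \
        chr10 500 600 \
        chr11 700 800 \
        chr13 900 950 \
        chr19 120 220 \
        chr2  130 230
}

# Count the sample lines that match the extended regex given as $1.
count_sites() {
    sample_sites | grep -c -E "$1"
}

# Unanchored: 'chr1' also matches chr10, chr11, chr13 and chr19.
count_sites 'chr1'                      # prints 5
# Anchored at the line start and closed by whitespace: chr1 only.
count_sites '^chr1[[:space:]]'          # prints 1
# A bracket expression picks chromosomes 1, 3 and 9.
count_sites '^chr[139][[:space:]]'      # prints 3
# chr1 followed by exactly one digit covers chromosomes 10 to 19.
count_sites '^chr1[0-9][[:space:]]'     # prints 4
```

The same patterns can be tried against the unzipped file once its actual column layout has been checked with `less -S`.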