Dan Leehr
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
      • Invitee
    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Versions and GitHub Sync Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
Invitee
Publish Note

Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

Your note will be visible on your profile and discoverable by anyone.
Your note is now live.
This note is visible on your profile and discoverable online.
Everyone on the web can find and read all notes of this public team.
See published notes
Unpublish note
Please check the box to agree to the Community Guidelines.
View profile
Engagement control
Commenting
Permission
Disabled Forbidden Owners Signed-in users Everyone
Enable
Permission
  • Forbidden
  • Owners
  • Signed-in users
  • Everyone
Suggest edit
Permission
Disabled Forbidden Owners Signed-in users Everyone
Enable
Permission
  • Forbidden
  • Owners
  • Signed-in users
Emoji Reply
Enable
Import from Dropbox Google Drive Gist Clipboard
   owned this note    owned this note      
Published Linked with GitHub
Subscribed
  • Any changes
    Be notified of any changes
  • Mention me
    Be notified of mention me
  • Unsubscribe
Subscribe
# GCB Academy: Intro to Scientific Computing for Genomics. November 4-5, 2019 Course website: https://duke-gcb.github.io/SciComp-Nov-2019/ (includes links to materials) Instructors: Hilmar Lapp, Dan Leehr Stuck? Put your pink sticky note on the top of your computer. ## Monday Morning - Shell Launching the shell: **Git Bash** on Windows, **Terminal** on macOS. ### Commands: - `pwd` - print working directory, where am I? - `ls` - list files in current directory - `ls -F` - format directories by trailing `/` - `ls -lh` - show human-readable sizes - `ls -a` - include hidden files/directories - `cd` - change directories - `man ls` - manual page for a command, e.g. `ls` Note: to exit the manual page, press the <kbd>q</kbd> key on your keyboard _Exercise_: What additional information do you learn with `ls -l`? _Answer_: Date modified, file size, owner, permissions, directory or not ### Tab completion: While typing a command or file name, hit the <kbd>tab</kbd> key and the shell will finish the word (or help as much as it can!) - `cd un<tab>` becomes `cd untrimmed_fastq/` Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you. Multiple matches? Start typing and hit <kbd>tab</kbd>. The shell will sound a ding, and list possible completions. Keep typing and hit <kbd>tab</kbd> again once it can pick a single result. ### Navigation shortcuts - `cd` _(by itself, no directory name)_ - changes to your home directory (`cd ~` has the same result) - `cd Desktop/shell_data/untrimmed_fastq` - go directly into `untrimmed_fastq` with one command. Remember <kbd>tab</kbd> completion! - `cd ..` - go one level up in the directory tree (parent directory) - `cd ~/Desktop/shell_data` - from anywhere on the filesystem since we begin with `~` (home directory) - `cd -` - Go back to where I came from (the previous directory you were in) _Exercise_: Find the hidden directory ### Relative and Absolute paths **Relative Path**: Any path that doesn't begin with a slash. (e.g. `Desktop/untrimmed_fastq`) Relative paths are relative to where you are. **Absolute Path**: Any path that begins with a slash (e.g. `/Users/sam/Desktop/untrimmed_fastq`) - `/` is the root directory ### Files and Directories `cd ~/Desktop/untrimmed_fastq` - `ls *.fastq` - list all files ending with `.fastq` - `ls *977.fastq` - list all files ending with `977.fastq` - `ls /bin/*sh` - list all the commands that are shells - `echo $SHELL` - which shell am I running? - `history` - command history with number - `! {number, n}` - execute the nth command - `!!` is replaced with the contents of the previous command - `~` is an absolute path to the home directory ### Working with Files - `cat SRR098026.fastq` - outputs **ALL** contents of the file to the screen (good for small files) - `less SRR098026.fastq` - lets you navigate and scroll through the file - `space` - move forward in the file - `/` + `[search term]` + `[enter]` to search within the document - `b` - move backwards in the file - `g` - go to the beginning of the file - `G` - go to the end of the file - `q` - to exit 'less' #### Head and Tail - `head` - prints the first 10 lines of a file to the screen (default is 10 lines) - `tail` - prints the last 10 lines of a file to the screen (default is 10 lines) - add `-n 100` to instead print the first (`head`) or last (`tail`) 100 lines of the file to the screen example: `head -n 1 SRR098026.fastq` will output the first line of the file: `@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35` ### FASTQ File Format [A FASTQ file normally uses four lines per sequence.](https://en.wikipedia.org/wiki/FASTQ_format) - Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). - Line 2 is the raw sequence letters. - Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. - Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. `@SEQ_ID` `GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT` `+` `!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65` ### Creating, moving, copying, and removing files - `cp` to copy a file - `cp /original_path/file.txt /new_path/file-copy.txt` - `mkdir` to make a new directory (folder) - `mkdir backup` makes a new directory called "backup" - `mv` can both move a file and/or rename that file - `mv SRR098026-copy.fastq backup/` **moves** `SRR098026-copy.fastq` to the `backup` directory - `mv SRR098026-copy.fastq SRR098026-backup.fastq` **renames* `SRR098026-copy.fastq` to `SRR098026-backup.fastq` in the same directory - **NOTE** moving and renaming files are effectively the same thing - to move a file "down" a directory into your current directory (1 level up), you can do: `mv /backup/file.txt .` - the `.` indicates your current directory - adding `-i` to the `mv` command makes sure you don't accidentally overwrite files `mv -i file1.txt file2.txt` ### File Permissions - `r` read - `w` write - `x` execute `ls -l` shows you the permissions of your files: `-rwxr-xr--` - the first `-` character is the file type - the next three `rwx` characters are permissions for the file's owner - the next three `r-x` characters are permissions for the group - the final three `r--` characters are permissions for the current user `chmod` change mode of the file (change permission settings) - `chmod -w file.txt` removes write permissions - if you try to remove this file without having permission, it will not allow you do delete e.g. `rm file.txt` will return a "permission denied" message - `chmod +w file.txt` adds write permissions - `chmod g+w file.txt` adds group write permission - `chmod u+w file.txt` gives owner write permission More examples here: https://www.washington.edu/computing/unix/permissions.html ### Removing a file or directory -`rm` **permanently** deletes the file (i.e. does **not** go into your trash or recycle bin) - `rm -r` remove recursive - to delete a directory and everything in it **NOTE** this command should be used with care, as it can **permanently** delete a lot of data *Exercise:* Make a backup of your FASTQ files and move them to a new `backup/` directory. Remove 'write' permission for the backup files. *Answer:* - `rm -r backup` - `cp SRR098026.fastq SRR098026-backup.fastq` and `cp SRR097977.fastq SRR097977-backup.fastq` - `mkdir backup` and `mv *-backup.fastq backup` - `chmod -w backup/*-backup.fastq` ### Redirection and searching with grep `grep` is a program that allows you to search within a file without having to open it. - `grep NNNNNNNNNN SRR098026.fastq` will return every single line in the SRR098026 file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is. Search for a term within a specific context: - `grep -B 1 -A 2 NNNNNNNNNN SRR098026.fastq` - This will return the line before the search term and the two lines after the search term (in addition to the instance of 10 Ns) - `-B` before - `-A` after *Exercise* 1. Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. - *Answer:* `grep -B1 GNATNACCACTTCC SRR098026.fastq` **returns** `@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35 GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG` 2. Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. - *Answer:* `grep -B1 AAGTT *.fastq` **returns** `SRR097977.fastq-@SRR097977.11 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36 SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG -- SRR097977.fastq-@SRR097977.67 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36 SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT -- SRR097977.fastq-@SRR097977.68 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36 SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA -- SRR097977.fastq-@SRR097977.80 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36 SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT -- SRR097977.fastq-@SRR097977.92 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36 SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT -- SRR097977.fastq-@SRR097977.139 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36 SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA -- SRR097977.fastq-@SRR097977.238 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36 SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT -- SRR098026.fastq-@SRR098026.158 HWUSI-EAS1599_1:2:1:1:1505 length=35 SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG` ### Redirection to capture the output - `>` this is the command to redirect output e.g. `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt` will redirect the results of this grep search to the `bad_reads.txt` file - Note: this command also created the `bad_reads.txt` file - `wc` is short for "word count" - returns information about your file: - first number is the number of lines - second number is number of words - third number is number of characters - adding `-l` returns only the number of lines e.g. `wc -l file.txt` *Exercise:* How many sequences in SRR098026.fastq contain at least 3 consecutive Ns? *Answer:* `wc -l bad_reads.txt` returns `249` sequences in the file have at least 3 consecutive Ns - `>>` **appends** to an existing file e.g. `grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt` - Using the single `>` to redirect output to an existing file will **overwrite** contents of the file - You can also use a wildcard to redirect output of multiple files to the same output files at once e.g. `grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt` **NOTE** on file extensions: - Mac & Windows care (sometimes) about what file extension you use - Unix doesn't care at all what you use, so they're entirely for your (the user's) benefit in understanding the contents of the file The **pipe** command is `|` - `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l` will perform your grep search then "pipe" the output to the `wc` program, so your output will be the number of lines returned from this grep search. ### For Loops Performs a command over a set of parameters or conditions **To define a variable:** variable_name = information ex: `f = Hello` To access a variable: $variable_name ex: `$f` **Loop structure template:** ``` for iterable_name in list_of_things: do stuff done ``` ex1, print the names of fastq files in the present working directory: ``` for f in *fastq do echo $f done ``` ex2, make a backup of fastq files in the present working directory: `for f in $fastq; do echo cp $f $f.backup ; done` The iterable_name is a variable defined in the loop For each loop, the iterable_name will contain the next item in the list_of_things You can write the loop in one line, using a semicolon to deliminate each step (ex2) ## Testing your Cluster Account From your shell prompt, enter: ``` ssh NETID@dcc-slogin.oit.duke.edu ``` Replace NETID with your Duke NetID, and enter your NetID password when prompted. If prompted with ``` The authenticity of host 'dscr-slogin-01.oit.duke.edu (152.3.100.80)' can't be established. RSA key fingerprint is SHA256:7RrLpUJew15D1Iij0i9io6XF3RY2LEFxZi67Ep1SlbY. Are you sure you want to continue connecting (yes/no)? ``` Type `yes`. If you are greeted with a bash prompt, your account is working correctly.: ``` dcl9@dcc-slogin-03 ~ $ ``` If you are unable to login, please add your NetID below: - avd16 - sa372 - bd111 (verified login at duke.edu/online but permission denied for ssh) - sl375 ## Monday Afternoon - Programming with Python File link: https://github.com/Duke-GCB/SciComp-Nov-2017/releases/download/1.0/python-fasta.zip Running Jupyter Notebook Method 1: Use Anaconda Navigator and run jupyter Method 2: Type `jupyter notebook` into the terminal ### Printing and Strings Functions in python are used with the following syntax: `function()` The parentheses contain parameters you want to give the function To print, use the `print()` function. Ex: `print('Hello')` A string is a datatype that indicates a word or sequence. In the example above, hello is in quotes, indicating it is a string Variables are set in python in the following way `variable_name = information` To call a variable, you type the variable_name, unlike in the terminal where you add a '$' beforehand Ex, prints the information stored in the variable called name: ```python name = 'Sam' print(name) ``` You can use numbers (and other datatypes) in print as well ```python weight = 55.5 print(weight) ``` ### Indexing You can access parts of a string (or list) with indexing. General format: ``` string[start:end:countby] ``` Example, print the first letter in Python: ```python name = 'Python' print(name[0]) ``` Indicies in python start at 0, so the first letter or item in a list will be index 0 Indexing starts at the position you indicate and ends at one position ahead. The end value is exclusive, so it is up to that position rather than including that position Example, print 'ytho' from 'Python': ```python name = 'Python' print(name[1:5]) ``` ### Check if a letter is in a string General format: ```python letter in string letter not in string ``` Example, returns true: ```python 'a' in 'apple' ``` Example, returns false: ```python 'b' not in 'bee' ``` ### Working with sequences Given the sequence: `seq = 'ACCTGCATGC'` Ex1, reverse the first 3 letters ```python base1 = seq[0] base2 = seq[1] base3 = seq[2] r3 = base3 + base2 + base1 ``` For strings, you cannot edit them in the following way: `string[3] = 'a'` Strings are immutable datatypes, which cannot be directly changed. You can combine parts of a string together to make a new string (Ex1). #### Reversing a sequence You can reverse a sequence with indexing. Recall that the general format is ``` string[start:end:countby] ``` If you make the last parameter -1, it will count backwards and reverse the string Ex2, reverse a string via indexing ```python name = 'Python' print(name[::-1]) ``` #### Looping through a DNA sequence A for loop can iterate through a string, each iteration will pick the next letter in the string Ex1, print each letter in a sequence separately: ```python for letter in 'CACGTAGCA': print(letter) ``` Loop structure is a bit different than bash in the terminal. General format of a loop in python: ``` for each_item in collection_of_things: do stuff ``` A loop in python requires a colon at the end of the for statement and each line of code within the loop to be indented Ex2, reverse a string via for loop: ```python rev = '' seq = 'ACCTGCATGC' for letter in seq: rev = letter + rev print(rev) ``` #### Dictionaries Creating a dictionary: Empty dictionary: `dictionary_name = {}` Dictionary with keys and values: ```python dictionary_name = {'key1':'value1', 'key2':'value2'} ``` Adding a key:value pair to a dictionary you already made ```python dictionary_name['new_key'] = 'new_value' ``` Ex, a dictionary to give the complement letter ```python complement_letter = {'A':'T', 'T':'A', 'C':'G','G':'C'} ``` Getting the value from a dictionary using a key ```python dictionary_name['key1'] ``` Will return the value for key1, which we defined before as 'value1' For keys, if the key is a string you will have to use the same exact string, including any capitalization. #### Lists List are containers that can contain items of any datatype (even other lists) To make a list: ```python list_name = [] ``` A list can contain different datatypes Ex, a list with a string, integer (whole number), float (number with decimal points), and another list ```python list_name = ['CACGTA', 52, 48.082, ['ACTG', 5]] ``` You can loop (and index) through a list, just like you can loop through a DNA sequence or string #### Conditionals, if, elif, else statements Python has several statements to determine if something is true or false, conditional statements. Some common ones are: - `==` equals - `>` greater than - `<` less than - `in` item in a list, string or iterable sequence - `not in` item not in a list or iterable sequence You can also use `and, or` statements to evaluate multiple statements. Or means one of the statements must be true while and means that both have to be true Ex, all return true: ```python 'a' in 'apple' and 5 < 100 'a' not in 'apple' or 5 > 2 ``` If, elif, and else can be used to perform a task, given a certain condition General format: ``` if something == True: do something elif something_else == True: do something else: do something ``` Python will evaluate the if statement first. If the statement returns false, it will then evaluate the elif (else if) statement. If the elif statement returns false, it will then do the else statement If the first if statement is true, it won't evaluate the elif or else statement. #### Functions in Python There are many built in functions available in python (such as the `print()` function). See the documentation for a list: https://docs.python.org/3/library/functions.html `len` gives you the length of an object ```python print(len(seq)) ``` 10 Some commonly used functions: ```python abs() # finds the absolute value int() # converts the input to the integer type type() # tells you the type of the input (e.g. int, float, etc.) sorted() # sorts the intput ``` `sort()` will sort a list ```python x = [7, 4, 3, 2, 8] y = sorted(x) ``` output: [2, 3, 4, 7, 8] You can also reverse the way `sorted()` sorts the list: ```python r = sorted(x, reverse = True) ``` You can run `help(sorted)` to learn about what other built-in options are available for a particular function. Jupyter notebook also allows you to view this in an alternate way: - `shift` + `tab` while your cursor is in the function call - hit `shift` + `tab` again to get the pop-up to stay visible, and then hit these keys one more time (3 times total) to get the menu to appear separately at the bottom of your page. `split()` will split the contents of your list based on a specified separator (such as '.' or ' ' (space)) into multiple list items. The default is white space between words. e.g. ```python seqs = 'rna dna protein' seqs_list = seqs.split() print(seqs_list, len(seqs_list)) ``` returns: `['rna', 'dna', 'protein'] 3` (a list with 3 items instead of just 1) ```python print(type(seqs)) ``` returns: ```python <class 'str'> ``` *Whereas*: ```python print(type(seqs_list)) ``` returns: ```python <class 'list'> ``` *Exercise:* Functions bases = 'adenine cytosine guanine thymine' Write some code that will: - Make a **list** of these bases from the string - **Uppercases** the names (e.g. ['ADENINE', ...]) - **Reverses** the order (e.g. ['THYMINE', ...]) hint: use `help(str)` and `help(list)` to see what functions are available for strings and lists **Bonus** write a for loop to print the first letter of each (e.g. 'A, C, ...') *One solution:* ```python bases = 'adenine cytosine guanine thymine' # upper() can only be used on strings, so do this before converting to a list using split() upper_list = bases.upper().split() # print(upper_list) # returns: # ['ADENINE', 'CYTOSINE', 'GUANINE', 'THYMINE'] # reverse the order of the list # help(list.reverse) -> "reverse in place" upper_list.reverse() #the list is now reversed and is still stored in upper_list # print(upper_list) # will now return: # ['THYMINE', 'GUANINE', 'CYTOSINE', 'ADENINE'] ``` ### Writing your own function Start by defining the function using `def` and indenting inside the newly defined functions just like we did in for loops and if statements: ```python def double(x): result = 2 * x return result ``` Now we can use this function by passing in a value for the 'x' argument: ```python print(double(10)) ``` returns: `20` This function can also multiply floats and even strings! ```python print(double('Eggs')) ``` will return: `EggsEggs` Now if we define a new variable: ```python x = 100 ``` Note that this 'x' will not be confused by the 'x' *inside* the `double()` function. **Making a reverse function:** ```python= def reverse(seq): # 'seq' is used in a global context here rev = '' for s in seq(): rev = s + rev return rev ``` Now we can use this new function: ```python= print(reverse(seq)) ``` will give you the reverse of`seq` and ```python= print(reverse(reverse(seq))) ``` will reverse `seq` and then reverse it again back to its original state **Making a complement function:** ```python= def complement(seq): comp = '' for s in seq: comp = comp + comp[s] return comp ``` Now we can use this function: ```python print(reverse(complement(seq))) ``` will give you the reverse complement of `seq` **NOTE** the `comp()` function will only work if we have defined the dictionary earlier in the script. It is a much better to idea to include the `comps` dictionary **inside** the definition of the function: ```python def complement(seq): comp = '' comps = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'} for s in seq: comp = comp + comp[s] return comp ``` We can combine these two functions into one function that does both in one step: ```python def rc(seq): return reverse(complement(seq)) ``` ### Working with files in Python In python, you want to open a file and store it in a variable so it can be used in your program ```python f = open('ae.fa') ``` Now the data from `ae.fa` is stored in the variable f. We can loop over the data in f: ```python for line in f: print(line) f.close() # It's a good idea to close your file when you are done working with it ``` but this will just do the same thing as the `cat` function on the command line! Let's make a function to interpret the fasta file in a way that makes the different entries easier to work with. To start, let's convert our code from above into a function: Step 1: ```python def read_fasta(filename): """ Read in a file in FASTA format """ f = open(filename) for line in f: line = line.strip() print(line) f.close() read_fasta('ae.fa') ``` Step 2: add to list instead of printing and return that list ```python def read_fasta(filename): """ Read in a file in FASTA format """ seq = '' f = open(filename) for line in f: line = line.strip() seq = seq + line f.close() return seq read_fasta('ae.fa') ``` Step 3: is the line a sequence line or identifier line? ```python def read_fasta(filename): """ Read in a file in FASTA format """ seq = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: seq = seq + line #only keep the line if it's not an identifier line that begins with '>' f.close() return seq print(read_fasta('ae.fa')) ``` Once you have a function working, you can download it from Jupyter Notebooks as a python script (document with a `.py` file extension) and use it on your computer. Rename the file to something more readable if it was given an automated name: `mv download.html read_fasta.py` and you can run the script with: `python read_fasta.py` which can be piped to `wc -l` `python read_fasta.py | wc -l` or used with any other command line tools. ## Tuesday - Version Control Git Lesson Online Reference: http://swcarpentry.github.io/git-novice/ https://github.com/Duke-GCB/scicomp-python/blob/master/intro-programming-python.ipynb Python script to use for today. Save as a .py file. ```python= def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence print(read_fasta('ae.fa')) ``` General format of git commands ``` git <verb> ``` Git commands - `git init` Initialize a git repository - `git status` Status of your git repository - `git config` Sets configurations for git - `git add` Lets git know that you want to make updates to a file - `git commit` Makes a snapshot of the update - `git diff` Shows what has changed - `git log` Gives a log of the commits - `git checkout` Lets you navigate to a branch - `git reset` Undo changes to local git - `git revert` Undo changes by making a commit reversing the previous commit Within the python-fasta folder, use `git init` to initialize a repository and `git status` to check the status Next, use the following git config commands ``` git config --global user.name "user_name" git config --global user.email "email" git config --global core.editor "nano -w" ``` Nano is a terminal text editor. ``` git add python-fasta.py git commit ``` After git commit, the editor will open. Then type in the comments for the commit. Use `git status` to check that the python-fasta.py file is no longer untracked. After you make changes to the file, running `git status` should place the file under `modified: python-fasta.py` Use `git diff` to show the changes made. Then `git diff --staged` will show the change that has been staged. Use `git log`, `git log --oneline`, `git checkout 'first_7_chars_in_commit_checksum' -- python-fasta.py` About git reset, restore, and revert: https://git-scm.com/docs/git#_reset_restore_and_revert General workflow for making commits: 1. Make a change to the file you are working on, ie. python-fasta.py 2. Use `git add filename` 3. Use `git commit`, then make comments on the commit ### More with Git (Tuesday Afternoon session) **Branching** Making a new branch allows you to maintain different versions of your work silmutaneously. `git branch` shows that you are on the master branch To make a new branch AND move to the new branch in one step: `git checkout -b file-as-param` - `-b` means create a new branch - `git checkout` means to go this branch Now when you run `git status`, it will say: `On brach file-as-param` at the top of the output message. On this new branch, we will modify our python-fasta.py script to take an input file as an argument. Open python-fasta.py with nano and add `import sys` to the top of the file, then add `sys.argv[1]` in place of the file name to indicate that you want the first input from the command line arguments to be used here. ```python= import sys def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence print(read_fasta(sys.argv[1])) ``` Now to run this script, you will have to enter: ``` python python-fasta.py ae.fa ``` And the script will run like it did before (check that it works before updating in git). Save your file and exit nano, then again stage your changes: `git add python-fasta-py` and then commit using a quicker way just using the command line argument `-m` (stands for "message"): `git commit -m 'Parameterizes name of the file'` `git log` will show this new commit, and will also show that it is on the branch `file-as-param`. Next we will make one more change to our script to give a nicer error message if we forget to give a filename as an argument when running the script: ```python= import sys def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence if len(sys.argv) < 2: print("Need to provide filename as argument") exit(1) print(read_fasta(sys.argv[1])) ``` Now commmit your updates: `git add` `git commit -m 'Catches case of missing filename'` Next we want to go back to the "master" branch, which has the version of python-fasta.py that does not take the filename as an argument. Check which branch you are currently on using `git branch`. The branch with the '\*' next it is the branch you are currently on. Note: if you try to run `git checkout -b master` won't work because the `-b` argument means "make a new branch", but this branch already exists. Instead, run: `git checkout master` to get back to the master branch. We can check that this worked by again running `git log --oneline` which shows that HEAD is pointing to 'master' (HEAD -> master) in the output. What is a branch name in git? - the branch name is essentially a label for a node. Each label/branch name points to a particular commit. - HEAD is also just a label. It points to the tip (end) of whichever branch you are on. - if you change to a different branch, then the pointer for the HEAD label changes to point to this different branch. - `git log --online` shows you the commits that came before the HEAD node e.g. ``` 332eea2 (HEAD -> master) Revert "Add python as interpreter" 0cbaa3c Adds python as interpreter f8585b4 Removes cruft from Jupyter Notebook ``` ### Working with branches Make a new branch: `git checkout -b ignore-data` And double check that you successfully switched using `git branch` (you should have a '\*' next to a branch named 'ignore-data') We will make a new hidden file so git will ignore certain data files that you don't want to back up. `nano .gitignore` to open a new file. Type into your new document: ``` *.fa ``` And save and exit. Now when you run `git status` you won't see any files ending with '.fa' in your "Untracked files" section. You can open .gitignore again with nano and add: ``` *.fa *.fasta *.ipynb ``` Run `git add .gitignore` and `git commit -m 'Ignores data and notebook files'` to save your changes with git. ### Merging To merge a branch with another one, you first have to checkout the branch you want to merge *into*. We want to merge our ignore-data branch with the master branch, so first: `git checkout master` then you an merge the branch you want with the branch you are currently on with the command: `git merge ignore-data` This has created a "fast forward" in git, where the pointer for "master" gets moved "forward" to a later commit (ignore-data) node. The HEAD pointer also is fast forwarded to the ignore-data node. Now, `git log --online` will show: ``` 069564e (HEAD -> master, ignore-data) Ignores data and notebook files ``` Fast forwarding happens when you didn't update anything on the master branch so you're just "fast forwarding" it to a later commit node. What if we want to do something more complicated and merge two branches with different changes? `git merge file-as-param` creates a *merge commit* by opening a new file in nano. Save and close this file to commit the merge. Git can usually figure out how to merge two different branches, but sometimes there will be conflicts to be resolved. ### Adding your repository to the cloud Make a new repository on the Github website called **python-fasta**. You can choose to make it either public or private, but since we want to work with collaborators easily, we will make it public today. **NOTE** Do NOT click the option to initialize this repository, because it already exists on our local computer. Only do this if you don't already have a repository created on your computer. Github will take you to a helpful instructions page. We want to push an existing repository from the command line: `git remote add origin https://github.com/username/python-fasta.git` Git is what is known as a **distributed** version control system, meaning the repository exists fully on your local computer, but can also be mirrored elsewhere (like your github profile online, on a server, or on a different computer). - Naming your main remote repository *origin* is a commonly used git convention, but is not required. If you enter the command `git remote` or `git remote -v` (for more verbose options) it will tell you your two command options: To move your changes from a local computer TO online (remote): `push` To move changes from the online version TO your local machine: `fetch` (sometimes `pull`) Enter `git push -u origin master` to add your changes to your online github repository. - Note: the `-u` option here means "upstream". We will discuss this more later. Enter your Github username and password when it prompts you to. If you view this repository on the Github website, we can see it is no longer empty, however it doesn't yet have all of the commits and branches we've made today. `git push --all origin` will add the other branches and their commits to the online repository. ### Collaborating with others You can collaborate with others by looking at their code on Github and modifying it. If you'd like to submit your changes to another person's repository, you can submit a *pull request*. The other person may or may not accept your changes, but this is one avenue through with you can collaborate with others. More on pull requests here: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests To pull from someone else's repsitory, use the command: `git clone <github-url>` where `<github-url>` is just the url of the repository you want to clone. This will "clone" their repository from github (online) onto your local computer. It is important to work on your own branch of someone else's repository so you don't create conflicts. `git checkout -b add-docs` Now we will edit their copy of `python-fasta.py` by adding some documentation. ```python= import sys def read_fasta(filename): # Reads fasta file from disk and returns the sequence sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence if len(sys.argv) < 2: print("Need to provide filename as argument") exit(1) print(read_fasta(sys.argv[1])) ``` `git add python-fasta.py` to stage your changes. `git commit` to open in nano and add the message: ``` Adds docstring for function Documentation of the function will be good to have for our future self in a year from now ``` Using `git branch` still works to see how this collaborator's repository is structured (what commits have been added in the past), but you need to add a `-a` argument to see everything in the remote branch: `git branch -a` We have staged and committed our changes, but we still have to *push* to the remote repository online. `git push -u origin add-docs` - adding the `-u` sets up to track the remote branch, which is very helpful when you want to continue collaborating (i.e. if the other user has accepted your pull request) - adding `-u` also allows you to just enter `git push` in the future instead of having to add "origin" and the branch name every time. Let's say this other user accepted our pull request. Now we need to get up to date with their master branch: `git fetch origin` So now it's on our computer, but we have to merge: `git merge origin/master` - note: this is a fast forward merge We can see our pull request merged with the master branch if we use: `git log --oneline` **NOTE** If we had used `git pull`, it would have done the same thing as the *two* steps we did to update from the online repository (`git fetch` and `git merge`). It can be useful to do this in two stages to make sure it's doing the right thing, but usually `git pull` works fine. ### Running programs on HPC Login to the cluster ssh netid@dcc-slogin-duke.oit.duke.edu Enter password, then you should see a prompt to run additional commands `hostname` gives the name of the host computer `srun hostname` Connects to a remote server in the cluster and gives the hostname. `srun` submits a job, you are assigned an ID and it is put into a que. Interactive job `srun --pty bash` Give the bash shell (bash) and let you interact with it (--pty) Some commands to run in the interactive session - `pwd` present working directory - `free` available memory - `exit` ends the interactive job Exit and go to the login node: ```bash git clone https://github.com/Duke-GCB/scicomp-hpc.git ``` Then cd into the new folder created, scicomp-hpc. Check out the read_fasta.py file with `less fasta_gc.py` To run the file on the cluster: ```bash srun python fasta_gc.py data/E2f1_dna.fasta ``` To run a set of commands, use `sbatch`. Use `man sbatch` to get more information Write a script to use with sbatch with nano, `nano countgc.sh` In the editor type: ```bash #!/bin/batch echo "Starting GC Counter" python fasta_gc.py data/E2f1_dna.fasta echo "Finished GC Counter" sleep 60 echo "Done" ``` Save the file. To submit the file run `sbatch countgc.sh` By default, it will make a file with the job id in the current directory. Ex: slurm-4272224.out Results will be placed in the .out file. That file is updated as the job runs. #### Checking job status in HPC - `squeue` Inidcates the que of jobs - `sacct` When the job is done, placed on the accounting list `squeue -u netid` will give the jobs for just the net id selected sacct options - `sacct --mem` amount of memory the job used - `sacct -e` all options sacct can show By default the cluster requests 2GB of memory for each job. You can edit the script to request a different amount of memory with `#SBATCH --mem=requested_amount` Ex, requesting 100MB of memory: `nano countgc.sh` ```bash #!/bin/batch #SBATCH --mem=100M echo "Starting GC Counter" python fasta_gc.py data/E2f1_dna.fasta echo "Finished GC Counter" sleep 60 echo "Done" ``` Other options to specify in the script: <https://slurm.schedmd.com/sbatch.html> Add the following to have the HPC email you when the job is done ```bash #SBATCH --mail-type=END #SBATCH --mail-user=email_address ``` Placing `srun` in front of the commands in the bash script will indicate if specific parts of the script fail via `sacct` `#SBATCH --array=1-5%2` will have a batch of items run at a time. - 1-5 indicates to run the script 5 times - %2 indicates to run 2 at a time Adding variables to the bash script ```bash #!/bin/bash #SBATCH --mem=400M #SBATCH --array=1-5%2 echo $SLURM_ARRAY_TASK_ID FILENAME=$(ls data/*.fasta | head -n $SLURM_ARRAY_TASK_ID | tail -n 1) echo $FILENAME ``` ### Moving files in/out of the cluster (Afternoon session) **NOTE**: Github should be used primarily for small text/script files, CSV files, etc, NOT for backing up your data, as there is a ~1 GB storage limit. Reminder, you can redirect command line output with `>`: `echo "Data Staging > datafile.txt"` `scp` (stands for "secure copy") allows you to securely copy data files to the server: `scp datafile.txt username@dcc-slogin.oit.duke.edu:datafile.txt` * You should be prompted for your password (for the server, not the one for your local computer) * Note: if you don't specify a path after the `:`, the copied file will end up in your home directory Another way to download data to the server is with the command `wget` if you have a url for the data you want. `srun wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/chrI.fa.gz` Other files are available in an online databases: e.g. http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/ - main page with other organisms: https://hgdownload.soe.ucsc.edu/downloads.html It's a good idea to move data files out of your home directory on the server, as these usually have limited storage space. ### Finding software on the HPC **Modules** Check what software is available with: `module avail` To use software use the command: `module load <X>` e.g. `module load RepeatMasker/4.0.5` Check that it is working with `module list`, which should list the pieces of software you have loaded You can load multiple modules at one time. Load another program: `module load Perl/5.30.0` and run `module list` again to see that now there are 2 items. Now you can run the software you have loaded. e.g. `RepeatMasker -species "Saccharomyces cerevisiae" chrI.fa.gz` *Reminder: use command `srun --pty bash` to start an interactive session.* `module purge` will close any software you had loaded. You can double check that this worked by again entering `module list` to see that it is now empty. HARDAC is another computing cluster at Duke: https://wiki.duke.edu/x/QgSsAw **Note:** saving your scripts (and using version control) is crucial for being able to reproduce your own results and allowing others to repeat your experiments.

Import from clipboard

Paste your markdown or webpage here...

Advanced permission required

Your current role can only read. Ask the system administrator to acquire write and comment permission.

This team is disabled

Sorry, this team is disabled. You can't edit this note.

This note is locked

Sorry, only owner can edit this note.

Reach the limit

Sorry, you've reached the max length this note can be.
Please reduce the content or divide it to more notes, thank you!

Import from Gist

Import from Snippet

or

Export to Snippet

Are you sure?

Do you really want to delete this note?
All users will lose their connection.

Create a note from template

Create a note from template

Oops...
This template has been removed or transferred.
Upgrade
All
  • All
  • Team
No template.

Create a template

Upgrade

Delete template

Do you really want to delete this template?
Turn this template into a regular note and keep its content, versions, and comments.

This page need refresh

You have an incompatible client version.
Refresh to update.
New version available!
See releases notes here
Refresh to enjoy new features.
Your user state has changed.
Refresh to load new user state.

Sign in

Forgot password

or

By clicking below, you agree to our terms of service.

Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
Wallet ( )
Connect another wallet

New to HackMD? Sign up

Help

  • English
  • 中文
  • Français
  • Deutsch
  • 日本語
  • Español
  • Català
  • Ελληνικά
  • Português
  • italiano
  • Türkçe
  • Русский
  • Nederlands
  • hrvatski jezik
  • język polski
  • Українська
  • हिन्दी
  • svenska
  • Esperanto
  • dansk

Documents

Help & Tutorial

How to use Book mode

Slide Example

API Docs

Edit in VSCode

Install browser extension

Contacts

Feedback

Discord

Send us email

Resources

Releases

Pricing

Blog

Policy

Terms

Privacy

Cheatsheet

Syntax Example Reference
# Header Header 基本排版
- Unordered List
  • Unordered List
1. Ordered List
  1. Ordered List
- [ ] Todo List
  • Todo List
> Blockquote
Blockquote
**Bold font** Bold font
*Italics font* Italics font
~~Strikethrough~~ Strikethrough
19^th^ 19th
H~2~O H2O
++Inserted text++ Inserted text
==Marked text== Marked text
[link text](https:// "title") Link
![image alt](https:// "title") Image
`Code` Code 在筆記中貼入程式碼
```javascript
var i = 0;
```
var i = 0;
:smile: :smile: Emoji list
{%youtube youtube_id %} Externals
$L^aT_eX$ LaTeX
:::info
This is a alert area.
:::

This is a alert area.

Versions and GitHub Sync
Get Full History Access

  • Edit version name
  • Delete

revision author avatar     named on  

More Less

Note content is identical to the latest version.
Compare
    Choose a version
    No search result
    Version not found
Sign in to link this note to GitHub
Learn more
This note is not linked with GitHub
 

Feedback

Submission failed, please try again

Thanks for your support.

On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

Please give us some advice and help us improve HackMD.

 

Thanks for your feedback

Remove version name

Do you want to remove this version name and description?

Transfer ownership

Transfer to
    Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

      Link with GitHub

      Please authorize HackMD on GitHub
      • Please sign in to GitHub and install the HackMD app on your GitHub repo.
      • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
      Learn more  Sign in to GitHub

      Push the note to GitHub Push to GitHub Pull a file from GitHub

        Authorize again
       

      Choose which file to push to

      Select repo
      Refresh Authorize more repos
      Select branch
      Select file
      Select branch
      Choose version(s) to push
      • Save a new version and push
      • Choose from existing versions
      Include title and tags
      Available push count

      Pull from GitHub

       
      File from GitHub
      File from HackMD

      GitHub Link Settings

      File linked

      Linked by
      File path
      Last synced branch
      Available push count

      Danger Zone

      Unlink
      You will no longer receive notification when GitHub file changes after unlink.

      Syncing

      Push failed

      Push successfully