Anton Nekrutenko
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- tags: BMMB554-23 --- [![](https://imgs.xkcd.com/comics/python.png)](https://xkcd.com/353/) # Lecture 6: Python 2 - Strings and lists and FASTQ ----- Preclass prep: Chapters [8](https://greenteapress.com/thinkpython2/html/thinkpython2009.html) and [10](https://greenteapress.com/thinkpython2/html/thinkpython2011.html) from "Think Python" :::warning This material uses examples from notebooks developed by [Ben Langmead](https://langmead-lab.org/teaching-materials/). ::: ## Prep 1. Start [JupyterLab](https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/try.jupyter.org?urlpath=lab) 2. Within JupyterLab start a new Python3 notebook 3. Open [this page](http://cs1110.cs.cornell.edu/tutor/#mode=edit) in a new browser tab ## Strings in Python In Python, strings are sequences of characters enclosed in quotation marks (either single or double quotes). They can be assigned to a variable and manipulated using various string methods. For example: string1 = "Hello World!" string2 = 'Hello World!' print(string1) # Output: Hello World! print(string2) # Output: Hello World! You can also use the + operator to concatenate strings, and the * operator to repeat a string a certain number of times: ```python= string3 = "Hello " + "World!" print(string3) # Output: Hello World! string4 = "Hello " * 3 print(string4) # Output: Hello Hello Hello ``` Hello World! Hello Hello Hello There are also many built-in string methods such as `upper()`, `lower()`, `replace()`, `split()`, `find()`, `len()`, etc. These can be used to manipulate and extract information from strings: ```python= string5 = "Hello World!" print(string5.upper()) # Output: HELLO WORLD! print(string5.lower()) # Output: hello world! print(string5.replace("H", "J")) # Output: Jello World! print(string5.split(" ")) # Output: ['Hello', 'World!'] ``` HELLO WORLD! hello world! Jello World! ['Hello', 'World!'] You can also use indexing, slicing and string formatting to access or manipulate substrings or insert dynamic data into a string. ```python= name = "John" age = 30 print("My name is {} and I am {} years old.".format(name, age)) ``` My name is John and I am 30 years old. You can also use f-strings, or formatted string literals, to embed expressions inside string literals. ```python= name = "John" age = 30 print(f"My name is {name} and I am {age} years old.") ``` My name is John and I am 30 years old. ## String indexing In Python, strings are sequences of characters, and each character has a corresponding index, starting from 0. This means that we can access individual characters in a string using their index. This is called "string indexing". Here is an example: ```python= string = "Hello World!" print(string[0]) # Output: H print(string[1]) # Output: e print(string[-1]) # Output: ! ``` H e ! You can also use negative indexing to access characters from the end of the string, with -1 being the last character, -2 the second to last and so on. You can also use slicing to extract a substring from a string. The syntax is `string[start:end:step]`, where start is the starting index, end is the ending index (not included), and step is the number of characters to skip between each index. ```python= string = "Hello World!" print(string[0:5]) # Output: Hello print(string[6:11]) # Output: World print(string[::2]) # Output: HloWrd ``` Hello World HloWrd You can also use string formatting method for getting the string at the specific index. ```python= string = "Hello World!" index = 3 print(f"The character at index {index} is: {string[index]}") ``` The character at index 3 is: l ## String functions In Python, there are many built-in string methods that can be used to manipulate and extract information from strings. Here are some of the most commonly used ones: - `upper()`: Converts the string to uppercase - `lower()`: Converts the string to lowercase - `replace(old, new)`: Replaces all occurrences of the old substring with the new substring - `split(separator)`: Splits the string into a list of substrings using the specified separator - `find(substring)`: Returns the index of the first occurrence of the substring, or -1 if the substring is not found - `index(substring)`: Returns the index of the first occurrence of the substring, or raises a ValueError if the substring is not found - `count(substring)`: Returns the number of occurrences of the substring - `join(iterable)`: Concatenates the elements of an iterable (such as a list or tuple) with the string as the separator - `strip()`: Removes leading and trailing whitespaces from the string - `lstrip()`: Removes leading whitespaces from the string - `rstrip()`: Removes trailing whitespaces from the string - `startswith(substring)`: Returns True if the string starts with the specified substring, False otherwise - `endswith(substring)`: Returns True if the string ends with the specified substring, False otherwise - `isalpha()`: Returns True if the string contains only alphabetic characters, False otherwise - `isdigit()`: Returns True if the string contains only digits, False otherwise - `isalnum()`: Returns True if the string contains only alphanumeric characters, False otherwise - `format()`: Formats the string by replacing placeholders with specified values - `len()`: Returns the length of the string Here is an example of some of these methods: ```python= string = "Hello World!" print(string.upper()) # Output: HELLO WORLD! print(string.lower()) # Output: hello world! print(string.replace("H", "J")) # Output: Jello World! print(string.split(" ")) # Output: ['Hello', 'World!'] print(string.find("World")) # Output: 6 print(string.count("l")) # Output: 3 print(string.strip()) # Output: "Hello World!" print(string.startswith("Hello")) # Output: True print(string.endswith("World!")) # Output: True print(string.isalpha()) # Output: False print(string.isalnum()) # Output: False print(string.format()) # Output: Hello World! print(len(string)) # Output: 12 ``` HELLO WORLD! hello world! Jello World! ['Hello', 'World!'] 6 3 Hello World! True True False False Hello World! 12 ## Playing with strings ```python= st = 'ACGT' ``` ```python= len(st) # getting the length of a string ``` 4 ```python= '' # empty string (epsilon) ``` '' ```python= len('') ``` 0 ```python= import random random.choice('ACGT') # generating a random nucleotide ``` 'G' ```python= # now I'll make a random nucleotide string by concatenating random nucleotides st = ''.join([random.choice('ACGT') for _ in range(40)]) st ``` 'GCTATATCAATGTTATCCGTTTTCTGATGTCGCGAGGACA' ```python= st[1:3] # substring, starting at position 1 and extending up to but not including position 3 # note that the first position is numbered 0 ``` 'CT' ```python= st[0:3] # prefix of length 3 ``` 'GCT' ```python= st[:3] # another way of getting the prefix of length 3 ``` 'GCT' ```python= st[len(st)-3:len(st)] # suffix of length 3 ``` 'ACA' ```python= st[-3:] # another way of getting the suffix of length 3 ``` 'ACA' ```python= st1, st2 = 'CAT', 'ATAC' ``` ```python= st1 ``` 'CAT' ```python= st2 ``` 'ATAC' ```python= st1 + st2 # concatenation of 2 strings ``` 'CATATAC' FASTQ ===== This notebook explores [FASTQ], the most common format for storing sequencing reads. FASTA and FASTQ are rather similar, but FASTQ is almost always used for storing *sequencing reads* (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequencines (without associated quality values). ### Basic format Here's a single sequencing read in FASTQ format: @ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1 AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATGAGAGCTCAGAAGTAACAGTTGCTTTCAGTCCCATAAAAACAGTCCTACAA + BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGFEGFGFFDFFHBGDGFHGEFGHFGHGFGFFFEHGGFGGDGHGFEEHFFHGE It's spread across four lines. The four lines are: 1. "`@`" followed by a read name 2. Nucleotide sequence 3. "`+`", possibly followed by some info, but ignored by virtually all tools 4. Quality sequence (explained below) Here is a very simple Python function for parsing file of FASTQ records: ```python= def parse_fastq(fh): """ Parse reads from a FASTQ filehandle. For each read, we return a name, nucleotide-string, quality-string triple. """ reads = [] while True: first_line = fh.readline() if len(first_line) == 0: break # end of file name = first_line[1:].rstrip() seq = fh.readline().rstrip() fh.readline() # ignore line starting with + qual = fh.readline().rstrip() reads.append((name, seq, qual)) return reads fastq_string = '''@ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1 AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG + BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF @ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1 CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT + @@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF @ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1 GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT + CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH''' from io import StringIO parse_fastq(StringIO(fastq_string)) ``` [('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1', 'AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG', 'BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF'), ('ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1', 'CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT', '@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF'), ('ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1', 'GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT', 'CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH')] The nucleotide string can sometimes contain the character "`N`". `N` essentially means "no confidence." The sequencer knows there's a nucleotide there but doesn't know whether it's an A, C, G or T. :::info ### A note on `while True` In Python, the while loop is used to repeatedly execute a block of code as long as a certain condition is true. The while True statement is a special case where the loop will run indefinitely, until a break statement is encountered inside the loop. Here is an example of a while True loop: ```python= while True: user_input = input("Enter 'q' to quit: ") if user_input == 'q': break print("You entered:", user_input) print("Exited the loop") ``` ``` Enter 'q' to quit: q Exited the loop ``` In this example, the loop will keep asking for user input until the user enters the 'q' character, which triggers the break statement, and the loop is exited. It is important to be careful when using while True loops, as they will run indefinitely if a break statement is not included. This can cause the program to crash or hang, if not handled properly. Also, It is recommended to use `while True` loop with a `break` statement, in case if you want to execute the loop until some specific condition met, otherwise it's not a good practice to use `while True`. It's a good practice to include a way for the user to exit the loop, such as the break statement in the example above, or a counter variable to keep track of the number of iterations. ::: ### Read name Read names often contain information about: 1. The scientific study for which the read was sequenced. E.g. the string `ERR294379` (an [SRA accession number](http://www.ebi.ac.uk/ena/about/sra_format)) in the read names correspond to [this study](http://www.ncbi.nlm.nih.gov/sra/?term=ERR294379). 2. The sequencing instrument, and the exact *part* of the sequencing instrument, where the DNA was sequenced. See the [FASTQ format](http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers) wikipedia article for specifics on how the Illumina software encodes this information. 3. Whether the read is part of a *paired-end read* and, if so, which end it is. Paired-end reads will be discussed further below. The `/1` you see at the end of the read names above indicate the read is the first end from a paired-end read. ### Quality values Quality values are probabilities. Each nucleotide in each sequencing read has an associated quality value. A nucleotide's quality value encodes the probability that the nucleotide was *incorrectly called* by the sequencing instrument and its software. If the nucleotide is `A`, the corresponding quality value encodes the probability that the nucleotide at that position is actually *not* an `A`. Quality values encoded in two senses: first, the relevant probabilities are rescaled using the Phread scale, which is a negative log scale. In other words if *p* us the probability that the nucleotide was incorrectly called, we encode this as *Q* where *Q* = -10 \* log10(*p*). For example, if *Q* = 30, then *p* = 0.001, a 1-in-1000 chance that the nucleotide is wrong. If *Q* = 20, then *p* = 0.01, a 1-in-100 chance. If *Q* = 10, then *p* = 0.1, a 1-in-10 chance. And so on. Second, scaled quality values are *rounded* to the nearest integer and encoded using [ASCII printable characters](http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters). For example, using the Phred33 encoding (which is by far the most common), a *Q* of 30 is encoded as the ASCII character with code 33 + 30 = 63, which is "`?`". A *Q* of 20 is encoded as the ASCII character with code 33 + 20 = 53, which is "`5`". And so on. Let's define some relevant Python functions: ```python= def phred33_to_q(qual): """ Turn Phred+33 ASCII-encoded quality into Phred-scaled integer """ return ord(qual)-33 def q_to_phred33(Q): """ Turn Phred-scaled integer into Phred+33 ASCII-encoded quality """ return chr(Q + 33) def q_to_p(Q): """ Turn Phred-scaled integer into error probability """ return 10.0 ** (-0.1 * Q) def p_to_q(p): """ Turn error probability into Phred-scaled integer """ import math return int(round(-10.0 * math.log10(p))) ``` ```python= # Here are the examples I discussed above # Convert Qs into ps q_to_p(30), q_to_p(20), q_to_p(10) ``` (0.001, 0.01, 0.1) ```python= p_to_q(0.00011) # note that result is rounded ``` 40 ```python= q_to_phred33(30), q_to_phred33(20) ``` ('?', '5') To convert an entire string Phred33-encoded quality values into the corresponding *Q* or *p* values, I can do the following: ```python= # Take the first read from the small example above name, seq, qual = parse_fastq(StringIO(fastq_string))[0] q_string = list(map(phred33_to_q, qual)) p_string = list(map(q_to_p, q_string)) print(q_string) print(p_string) ``` [33, 35, 35, 36, 36, 37, 30, 37, 38, 37, 37, 37, 39, 38, 37, 37, 39, 39, 38, 39, 38, 38, 39, 34, 39, 31, 38, 39, 39, 39, 38, 37, 32, 39, 36, 38, 37, 36, 39, 38, 36, 37, 38, 39, 34, 34, 38, 38, 38, 37] [0.000501187233627272, 0.00031622776601683794, 0.00031622776601683794, 0.00025118864315095795, 0.00025118864315095795, 0.00019952623149688788, 0.001, 0.00019952623149688788, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.0001258925411794166, 0.00015848931924611126, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0001258925411794166, 0.0007943282347242813, 0.00015848931924611126, 0.0001258925411794166, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.000630957344480193, 0.0001258925411794166, 0.00025118864315095795, 0.00015848931924611126, 0.00019952623149688788, 0.00025118864315095795, 0.0001258925411794166, 0.00015848931924611126, 0.00025118864315095795, 0.00019952623149688788, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0003981071705534969, 0.00015848931924611126, 0.00015848931924611126, 0.00015848931924611126, 0.00019952623149688788] You might wonder how the sequencer and its software can *know* the probability that a nucleotide is incorrected called. It can't; this number is just an estimate. To describe exactly how it's estimated is beyond the scope of this notebook; if you're interested, search for academic papers with "base calling" in the title. Here's a helpful [video by Rafa Irizarry](http://www.youtube.com/watch?v=eXkjlopwIH4). A final note: other ways of encoding quality values were proposed and used in the past. For example, Phred64 uses an ASCII offset of 64 instead of 33, and Solexa64 uses "odds" instead of the probability *p*. But Phred33 is by far the most common today and you will likely never have to worry about this. :::info ### A note in `map()` In Python, the `map()` function is used to apply a given function to all elements of an iterable (such as a list, tuple, or string) and return an iterator (an object that can be iterated, e.g. in a `for`-loop) that yields the results. The `map()` function takes two arguments: A function that is to be applied to each element of the iterable An iterable on which the function is to be applied Here is an example: ```python= # Using a function to square each element of a list numbers = [1, 2, 3, 4, 5] squared_numbers = map(lambda x: x**2, numbers) print(list(squared_numbers)) # Output: [1, 4, 9, 16, 25] ``` [1, 4, 9, 16, 25] In the example above, the `map()` function applies the lambda function `lambda x: x**2` to each element of the numbers list, and returns an iterator of the squared numbers. The `list()` function is used to convert the iterator to a list, so that the result can be printed. Another example is, ```python= # Using the map() function to convert a list of strings to uppercase words = ["hello", "world"] uppercase_words = map(lambda word: word.upper(), words) print(list(uppercase_words)) # Output: ['HELLO', 'WORLD'] ``` ['HELLO', 'WORLD'] It's important to note that the `map()` function returns an iterator, which can be used in a for loop, but is not a list, tuple, or any other iterable. If you want to create a list, tuple, or other iterable from the result of the `map()` function, you can use the `list()`, `tuple()`, or any other built-in function that creates an iterable. In Python 3, the `map()` function returns an iterator, which can be used in a for loop, but it's not an iterable. If you want to create a list, tuple, or other iterable from the result of the `map()` function, you can use the `list()`, `tuple()`, or any other built-in function that creates an iterable. In Python 2, `map()` function returns a list, which can be used in a for loop, and it's an iterable. In python 3.x, there is an alternative way to use map() function is `list(map(...))` or `tuple(map(...))` etc. ::: ### Paired-end reads Sequencing reads can come in *pairs*. Basically instead of reporting a single snippet of nucleotides from the genome, the sequencer might report a *pair* of snippets that appear *close to each other* in the genome. To accomplish this, the sequencer sequences *both ends* of a longer *fragment* of DNA. Here is simple Python code that mimicks how the sequencer obtains one paired-end read: ```python= # Let's just make a random genome of length 1K import random random.seed(637485) genome = ''.join([random.choice('ACGT') for _ in range(1000)]) genome ``` 'AGTACGTCATACCGTTATGATCTAGGTGGGATCGCGGATTGGTCGTGCAGAATACAGCCTTGGAGAGTGGTTAACACGATAAGGCCGATAATATGTCTGGATAAGCTCAGGCTCTGCTCCGAGGCGCTAAGGTACATGTTATTGATTTGGAGCTCAAAAATTGCCATAGCATGCAATACGCCCGTTGATAGACCACTTGCCTTCAGGGGAGCGTCGCATGTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAGGCACTTCGCTAGAAACGGCTTACTGCAGGTGATAAAATCAACAGATGGCACGCTCGCAACAGAAGCATAATATGCTTCCAACCAGGACCGGCATTTAACTCAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATTGTGCAGTGGTCTTTGACAGACAGCAGTTCGCTCCTTACTGACAATACCTGATACTTATAGTATGGCAGCGAGTCGTTGTCTAGGTTAGCCACCTCAGTCTACAGCAGGTAATGAAGCATTCCCACAAAGGCTGGTCCATACACCCGACTGCTACGATTCATGCTTCGCTCGAGAACTGCCCCTGCCTTAGATTCCCCCTCGTCTCCAATGAATACCCATTTTTTTAGATTGCTGAAAACCTTTCGTAAGACGCTTTCCAGTGATTACATGCCCTAACTGGGTACAGTTTGCCCAGGAGCTTTTTGGATGGAGGAGTATTAGTAGCGACCAAAACTCTTCCTCGACTGTTACTGTGTAGAGTCCCAAACGCTAAAGCGGTCCCAGAAAAACGGAACGGCCTACAGATTAAATTGCTCCGTGTTGCAGTTAAGGCGTACAAACCCCTCTGTGTATTAGTTTAAGTCTCTGAGTCTTCTTTGCTATGACGGATTGATGGGTGCCGGTTTGTAGTTCAAGAACCGTGAGTGAACC' ```python= # The sequencer draws a fragment from the genome of length, say, 250 offset = random.randint(0, len(genome) - 250) fragment = genome[offset:offset+250] fragment ``` 'GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAGGCACTTCGCTAGAAACGGCTTACTGCAGGTGATAAAATCAACAGATGGCACGCTCGCAACAGAAGCATAATATGCTTCCAACCAGGACCGGCATTTAACTCAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATT' ```python= # Then it reads sequences from either end of the fragment end1, end2 = fragment[:75], fragment[-75:] end1, end2 ``` ('GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAG', 'CAATATATTAGCTCTCGAGGACAACGCACTACGTTTTCCAATTCAGCGGACTGGCGCCATTACAGTAAGTTGATT') ```python= # And because of how Illumina sequencing works, the # second end is always from the opposite strand from the first # (this is not the case for 454 and SOLiD data) import string # function for reverse-complementing revcomp_trans = str.maketrans("ACGTacgt", "TGCAtgca") def reverse_complement(s): return s[::-1].translate(revcomp_trans) end2 = reverse_complement(end2) end1, end2 ``` ('GTATTGATTGTGTTACATAAACCCTCCCCCCCTACACGTGCTTGTCGACGCGGCACTGGACACTGATACGAGGAG', 'AATCAACTTACTGTAATGGCGCCAGTCCGCTGAATTGGAAAACGTAGTGCGTTGTCCTCGAGAGCTAATATATTG') FASTQ can be used to store paired-end reads. Say we have 1000 paired-end reads. We should store them in a *pair* of FASTQ files. The first FASTQ file (say, `reads_1.fq`) would contain all of the first ends and the second FASTQ file (say, `reads_2.fq`) would contain all of the second ends. In both files, the ends would appear in corresponding order. That is, the first entry in `reads_1.fq` is paired with the first entry in `reads_2.fq` and so on. Here is a Python function that parses a pair of files containing paired-end reads. ```python= def parse_paired_fastq(fh1, fh2): """ Parse paired-end reads from a pair of FASTQ filehandles For each pair, we return a name, the nucleotide string for the first end, the quality string for the first end, the nucleotide string for the second end, and the quality string for the second end. """ reads = [] while True: first_line_1, first_line_2 = fh1.readline(), fh2.readline() if len(first_line_1) == 0: break # end of file name_1, name_2 = first_line_1[1:].rstrip(), first_line_2[1:].rstrip() seq_1, seq_2 = fh1.readline().rstrip(), fh2.readline().rstrip() fh1.readline() # ignore line starting with + fh2.readline() # ignore line starting with + qual_1, qual_2 = fh1.readline().rstrip(), fh2.readline().rstrip() reads.append(((name_1, seq_1, qual_1), (name_2, seq_2, qual_2))) return reads fastq_string1 = '''@509.6.64.20524.149722/1 AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA + HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF @509.4.62.19231.2763/1 GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC + HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH''' fastq_string2 = '''@509.6.64.20524.149722/2 TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT + HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH @509.4.62.19231.2763/2 CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA + HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH''' parse_paired_fastq(StringIO(fastq_string1), StringIO(fastq_string2)) ``` [(('509.6.64.20524.149722/1', 'AGCTCTGGTGACCCATGGGCAGCTGCTAGGGAGCCTTCTCTCCACCCTGA', 'HHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHIIHHIHFHHF'), ('509.6.64.20524.149722/2', 'TAAGTCAGGATACTTTCCCATATCCCAGCCCTGCTCCNTCTTTAAATAAT', 'HHHHHHHHHHHHHHHHHHHH@HHFHHHEFHHHHHHFF#FFFFFFFHHHHH')), (('509.4.62.19231.2763/1', 'GTTGATAAGCAAGCATCTCATTTTGTGCATATACCTGGTCTTTCGTATTC', 'HHHHHHHHHHHHHHEHHHHHHHHHHHHHHHHHHHHHHHDHHHHHHGHGHH'), ('509.4.62.19231.2763/2', 'CTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAANAAGAGCTTACTA', 'HHHHHHHHHHHHHHHHHHEHEHHHFHGHHHHHHHH>@#@=44465HHHHH'))] ::::info ### A note on triple quotes In Python, triple quotes (either single or double) are used to create multiline strings. They can also be used to create docstrings, which are used to document a function, class, or module. For example: ```python= string1 = """This is a multiline string""" string2 = '''This is also a multiline string''' print(string1) # Output: # This is a # multiline string print(string2) # Output: # This is also # a multiline string ``` This is a multiline string This is also a multiline string Triple quotes can also be used to create docstrings, which are used to document a function, class, or module. The first line of a docstring is a brief summary of what the function, class, or module does, and the following lines provide more detailed information. ```python= def my_function(): """ This is a docstring for the my_function. This function does not perform any operation. """ pass print(my_function.__doc__) # Output: # This is a docstring for the my_function. # This function does not perform any operation. ``` This is a docstring for the my_function. This function does not perform any operation. :::: ## How good are my reads? Let's reuse the `parse_fastq` function from above: ```python= reads = parse_fastq(StringIO(fastq_string)) ``` ```python= run_qualities = [] for read in reads: read_qualities = [] for quality in read[2]: read_qualities.append(phred33_to_q(quality)) run_qualities.append(read_qualities) ``` ```python= base_qualities = [] for i in range(len(run_qualities[0])): base = [] for read in run_qualities: base.append(read[i]) base_qualities.append(base) # same thing with numpy # base_qualities = np.array(run_qualities).T ``` ```python= import numpy as np plotting_data = { 'base':[], 'mean':[], 'median':[], 'q25':[], 'q75':[], 'min':[], 'max':[] } for i,base in enumerate(base_qualities): plotting_data['base'].append(i) plotting_data['mean'].append(np.mean(base)) plotting_data['median'].append(np.median(base)) plotting_data['q25'].append(np.quantile(base,.25)) plotting_data['q75'].append(np.quantile(base,.75)) plotting_data['min'].append(np.min(base)) plotting_data['max'].append(np.max(base)) ``` ```python= import pandas as pd plotting_data_df = pd.DataFrame(plotting_data) ``` ```python= import altair as alt base = alt.Chart(plotting_data_df).encode( alt.X('base:Q', title="Position in the read") ).properties( width=800, height=200) median = base.mark_tick(color='red',orient='horizontal').encode( alt.Y('median:Q', title="Phred quality score"), ) q = base.mark_rule(color='green',opacity=.5,strokeWidth=10).encode( alt.Y('q25:Q'), alt.Y2('q75:Q') ) min_max = base.mark_rule(color='black').encode( alt.Y('min:Q'), alt.Y2('max:Q') ) median + q + min_max ``` ```vega { "config": {"view": {"continuousWidth": 400, "continuousHeight": 300}}, "layer": [ { "mark": {"type": "tick", "color": "red", "orient": "horizontal"}, "encoding": { "x": { "field": "base", "title": "Position in the read", "type": "quantitative" }, "y": { "field": "median", "title": "Phred quality score", "type": "quantitative" } }, "height": 200, "width": 800 }, { "mark": { "type": "rule", "color": "green", "opacity": 0.5, "strokeWidth": 10 }, "encoding": { "x": { "field": "base", "title": "Position in the read", "type": "quantitative" }, "y": {"field": "q25", "type": "quantitative"}, "y2": {"field": "q75"} }, "height": 200, "width": 800 }, { "mark": {"type": "rule", "color": "black"}, "encoding": { "x": { "field": "base", "title": "Position in the read", "type": "quantitative" }, "y": {"field": "min", "type": "quantitative"}, "y2": {"field": "max"} }, "height": 200, "width": 800 } ], "data": {"name": "data-ccf1ca196f1d0e4919420b2695143650"}, "$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json", "datasets": { "data-ccf1ca196f1d0e4919420b2695143650": [ { "base": 0, "mean": 32.666666666666664, "median": 33, "q25": 32, "q75": 33.5, "min": 31, "max": 34 }, { "base": 1, "mean": 33.666666666666664, "median": 35, "q25": 33, "q75": 35, "min": 31, "max": 35 }, { "base": 2, "mean": 33, "median": 32, "q25": 32, "q75": 33.5, "min": 32, "max": 35 }, { "base": 3, "mean": 37.333333333333336, "median": 37, "q25": 36.5, "q75": 38, "min": 36, "max": 39 }, { "base": 4, "mean": 33.333333333333336, "median": 36, "q25": 31.5, "q75": 36.5, "min": 27, "max": 37 }, { "base": 5, "mean": 37, "median": 37, "q25": 37, "q75": 37, "min": 37, "max": 37 }, { "base": 6, "mean": 32.333333333333336, "median": 30, "q25": 30, "q75": 33.5, "min": 30, "max": 37 }, { "base": 7, "mean": 36, "median": 36, "q25": 35.5, "q75": 36.5, "min": 35, "max": 37 }, { "base": 8, "mean": 36.666666666666664, "median": 36, "q25": 36, "q75": 37, "min": 36, "max": 38 }, { "base": 9, "mean": 37, "median": 37, "q25": 36, "q75": 38, "min": 35, "max": 39 }, { "base": 10, "mean": 36.333333333333336, "median": 36, "q25": 36, "q75": 36.5, "min": 36, "max": 37 }, { "base": 11, "mean": 35.333333333333336, "median": 37, "q25": 34.5, "q75": 37, "min": 32, "max": 37 }, { "base": 12, "mean": 37, "median": 37, "q25": 36, "q75": 38, "min": 35, "max": 39 }, { "base": 13, "mean": 34.333333333333336, "median": 35, "q25": 32.5, "q75": 36.5, "min": 30, "max": 38 }, { "base": 14, "mean": 37, "median": 37, "q25": 37, "q75": 37, "min": 37, "max": 37 }, { "base": 15, "mean": 36.333333333333336, "median": 36, "q25": 36, "q75": 36.5, "min": 36, "max": 37 }, { "base": 16, "mean": 37, "median": 36, "q25": 36, "q75": 37.5, "min": 36, "max": 39 }, { "base": 17, "mean": 38, "median": 38, "q25": 37.5, "q75": 38.5, "min": 37, "max": 39 }, { "base": 18, "mean": 36.333333333333336, "median": 36, "q25": 35.5, "q75": 37, "min": 35, "max": 38 }, { "base": 19, "mean": 38, "median": 38, "q25": 37.5, "q75": 38.5, "min": 37, "max": 39 }, { "base": 20, "mean": 36, "median": 35, "q25": 35, "q75": 36.5, "min": 35, "max": 38 }, { "base": 21, "mean": 35, "median": 37, "q25": 33.5, "q75": 37.5, "min": 30, "max": 38 }, { "base": 22, "mean": 37, "median": 38, "q25": 36, "q75": 38.5, "min": 34, "max": 39 }, { "base": 23, "mean": 37, "median": 38, "q25": 36, "q75": 38.5, "min": 34, "max": 39 }, { "base": 24, "mean": 37, "median": 37, "q25": 36, "q75": 38, "min": 35, "max": 39 }, { "base": 25, "mean": 31.666666666666668, "median": 31, "q25": 29.5, "q75": 33.5, "min": 28, "max": 36 }, { "base": 26, "mean": 37.666666666666664, "median": 38, "q25": 37.5, "q75": 38, "min": 37, "max": 38 }, { "base": 27, "mean": 38.666666666666664, "median": 39, "q25": 38.5, "q75": 39, "min": 38, "max": 39 }, { "base": 28, "mean": 37.666666666666664, "median": 38, "q25": 37, "q75": 38.5, "min": 36, "max": 39 }, { "base": 29, "mean": 33.333333333333336, "median": 34, "q25": 30.5, "q75": 36.5, "min": 27, "max": 39 }, { "base": 30, "mean": 38, "median": 38, "q25": 38, "q75": 38, "min": 38, "max": 38 }, { "base": 31, "mean": 36.333333333333336, "median": 36, "q25": 36, "q75": 36.5, "min": 36, "max": 37 }, { "base": 32, "mean": 32.333333333333336, "median": 32, "q25": 31, "q75": 33.5, "min": 30, "max": 35 }, { "base": 33, "mean": 34, "median": 39, "q25": 31.5, "q75": 39, "min": 24, "max": 39 }, { "base": 34, "mean": 36.666666666666664, "median": 36, "q25": 35.5, "q75": 37.5, "min": 35, "max": 39 }, { "base": 35, "mean": 35.333333333333336, "median": 38, "q25": 34, "q75": 38, "min": 30, "max": 38 }, { "base": 36, "mean": 38, "median": 37, "q25": 36.5, "q75": 39, "min": 36, "max": 41 }, { "base": 37, "mean": 36.333333333333336, "median": 36, "q25": 36, "q75": 36.5, "min": 36, "max": 37 }, { "base": 38, "mean": 36.666666666666664, "median": 39, "q25": 35.5, "q75": 39, "min": 32, "max": 39 }, { "base": 39, "mean": 35.666666666666664, "median": 36, "q25": 34.5, "q75": 37, "min": 33, "max": 38 }, { "base": 40, "mean": 36.666666666666664, "median": 37, "q25": 36.5, "q75": 37, "min": 36, "max": 37 }, { "base": 41, "mean": 35.333333333333336, "median": 37, "q25": 34.5, "q75": 37, "min": 32, "max": 37 }, { "base": 42, "mean": 38, "median": 38, "q25": 38, "q75": 38, "min": 38, "max": 38 }, { "base": 43, "mean": 33, "median": 36, "q25": 30, "q75": 37.5, "min": 24, "max": 39 }, { "base": 44, "mean": 34, "median": 34, "q25": 32.5, "q75": 35.5, "min": 31, "max": 37 }, { "base": 45, "mean": 34.666666666666664, "median": 34, "q25": 34, "q75": 35, "min": 34, "max": 36 }, { "base": 46, "mean": 36.333333333333336, "median": 36, "q25": 35.5, "q75": 37, "min": 35, "max": 38 }, { "base": 47, "mean": 35.666666666666664, "median": 38, "q25": 34.5, "q75": 38, "min": 31, "max": 38 }, { "base": 48, "mean": 38, "median": 38, "q25": 38, "q75": 38, "min": 38, "max": 38 }, { "base": 49, "mean": 37.666666666666664, "median": 37, "q25": 37, "q75": 38, "min": 37, "max": 39 } ] } } ``` ### Other comments In all the examples above, the reads in the FASTQ file are all the same length. This is not necessarily the case though it is usually true for datasets generated by sequencing-by-synthesis instruments. FASTQ files can contain reads of various lengths. FASTQ files often have extension `.fastq` or `.fq`. ### Other resources * [Wikipedia page for FASTQ format](http://en.wikipedia.org/wiki/Fastq_format) * [BioPython], which has [its own ways of parsing FASTA](http://biopython.org/wiki/SeqIO) * [FASTX] toolkit * [seqtk] * [FastQC] [BioPython]: http://biopython.org/wiki/Main_Page [SeqIO]: http://biopython.org/wiki/SeqIO [SAMtools]: http://samtools.sourceforge.net/ [FASTX]: http://hannonlab.cshl.edu/fastx_toolkit/ [FASTQC]: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [seqtk]: https://github.com/lh3/seqtk

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully