## Shared URLs
- Learn more markdown: [link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
- Human genome: [link](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml)
- SNPedia: [link](https://www.snpedia.com/index.php/SNPedia)
- Project Jupyter: [link](https://jupyter.org/)
- Interesting Jupyter notebooks: [link](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
- Try Linux terminal: [link](https://cocalc.com/doc/terminal.html)
- Rapid DNA extraction protocol: [link](https://dnabarcoding101.org/lab/protocol-2.html#standard)
- mybinder.org: [link](https://mybinder.org/)
- Notebooks: https://github.com/JasonJWilliamsNY/biocoding-2022-notebooks
- Zoom link: [TBD](TBD)
- JupyterHub: [TBD](TBD)
---
## Learning more after the class
**Notebooks used in this course**
- Biocoding 2020 Notebooks [link](https://github.com/JasonJWilliamsNY/biocoding-2020-notebooks)
  - You can download these materials: [link](https://github.com/JasonJWilliamsNY/biocoding-2020-notebooks/archive/master.zip)
**General Coding**
- CodeCademy: [link](https://www.codecademy.com/)
- Hour of code (also in languages other than English): [link](https://code.org/learn)
**Software installations**
Be sure you have permission to install software
- Try Ubuntu: [link](https://tutorials.ubuntu.com/tutorial/try-ubuntu-before-you-install#0)
- Python: [link](https://www.python.org/downloads/)
- Jupyter: [link](https://jupyter.org/)
- Wing IDE: [link](https://wingware.com/)
- Atom text editor: [link](https://atom.io/)
**Bioinformatics**
- Learn bioinformatics in 100 hours: [link](https://www.biostarhandbook.com/edu/course/1/)
- Rosalind bioinformatics: [link](http://rosalind.info/about/)
- Bioinformatics coursera: [link](https://www.coursera.org/learn/bioinformatics)
- Bioinformatics careers: [link](https://www.iscb.org/bioinformatics-resources-for-high-schools/careers-in-bioinformatics)
**Help**
- General software help: [link](https://stackoverflow.com/)
- Bioinformatics-specific software help: [link](https://www.biostars.org/)
---
## Account names
zhong
navas
lee
labelson
suskin
paval
polevoy
reed
saur
dimaio
kim
mingoia_murphy
shohdy
marinescu
### Jupyter
- [Hub address](http://3.228.2.183:8000/hub/login)
### Notebook setup
git clone https://github.com/JasonJWilliamsNY/biocoding-2021-notebooks.git
### DNA Barcoding
- [silica DNA isolation](https://dnabarcoding101.org/lab/protocol-2.html#alternateb)
---
## Shared notes
**Linux commands for the Command Line/Terminal**
* [linux explainer](https://explainshell.com)
* *pwd* - print working directory (prints the name of the current folder; commands are case-sensitive, so type it in lowercase)
* *ls* - list (lists all the files in the current folder)
* *cd **foldername*** - change directory (changes the current folder to "foldername")
* *rm **filename*** - deletes the file "filename"
* *whoami* - prints your username
**Github specific**
* *git clone **github-link*** copies the github repository to your computer
**General Info**
- Logging in to Jupyter:
- username: **lastname**
- password: **lastname.123**
### DAY ONE
**Summary**
We discussed what a **computer** is, what **bioinformatics** is, and tools used for programming such as **Github**, **Jupyter**, and the **Command Line**. We logged in to Jupyter for the first time and downloaded the notebooks. We went through biocoding_2021_intro_python_01 and learned about functions, including the **print()** function, as well as **strings** and **variables**. We **isolated our plant DNA** for PCR.
**General**
**--Vocab--**
IP address - the computer's internet address
Github - a place for sharing software/code and data
**=** "assignment operator"
String - 0 or more characters enclosed in quotes
**--Concepts--**
* In Jupyter Notebooks, there is a combination of text and code
* Grey blocks/cells are code, and can be run with the **play button** on top of the screen, on the side of the cell, or by pressing **shift and enter**
* You can create a new cell with the plus at the top of the screen
--
* In python, a function is represented as **functionName()**
* A function (sometimes) takes input (in the parentheses) and then gives output
--
* A variable is something that stores data
* The value on the right of "=" is stored in the variable to the left of it
* ***variableName** = **4*** <- stores 4 in variableName
--
* anything with quotation marks around it is a **string**
* "A string is 0 or more characters enclosed in quotes"
**--Code--**
* *print("**text**")* - prints "text"
* Math
* ***a** + **b*** addition
* ***a** - **b*** subtraction
* ***a** / **b*** division
* ***a** * **b*** multiplication
* ***a** ** **b*** exponent
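
The Day One ideas above can be tried together in one cell. This is a minimal sketch; the variable names (`species`, `sample_count`) are made up for illustration and do not come from the course notebooks.

```python
# A string: zero or more characters enclosed in quotes
species = "Arabidopsis thaliana"

# The assignment operator "=" stores the value on the right
# in the variable on the left
sample_count = 4

# print() takes input in the parentheses and displays it
print(species)
print(sample_count)

# Math operators
total = sample_count + 2       # addition
remaining = sample_count - 1   # subtraction
half = sample_count / 2        # division (the result is a decimal number)
double = sample_count * 2      # multiplication
squared = sample_count ** 2    # exponent
print(total, remaining, half, double, squared)
```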
---
### DAY TWO
**Summary**
We went through biocoding_2021_pythonlab_02 and learned about **strings**, the **type()** function, how to name variables, and some **python style guidelines**.
**General**
**--Vocab--**
**--Concepts--**
* Naming
* variable_name - "snake case"
* variableName - "camel case"
* both are fine
* [Python Style Guide](https://peps.python.org/pep-0008/#naming-conventions)
* Clusivity - whether a number is included or excluded from a list
* File format - a consistent way of writing/storing data
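* FASTA format - a text format for sequences: a header line starting with ">" (the sequence name), followed by the sequence itself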
**--Code--**
* *type(**variable**)* - returns the type of variable (string, int, etc)
* "*#*" - comment (not code)
* *len(**variable**)* - returns the length of a variable
* ***string**[**beginIndex**:**endIndex**:**stepSize**]* - get a slice of a string
* the endIndex will go up to the endIndex, but not include it
* ":" is "everything"
* ***variable**.**method**()* is a method call
* ***variable**.count(**letter**)* counts how many times **letter** appears in the string
* *help(**something**)* - tells you about "something"
----
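
A short sketch of the string tools listed above (`type()`, `len()`, slicing, and `.count()`). The ID string `sample_id` is a made-up example, not one of the IDs from the notebook.

```python
# Hypothetical example ID: three initials followed by a five-digit number
sample_id = "JJW12345"

print(type(sample_id))       # the type of the variable (a string)
print(len(sample_id))        # number of characters

print(sample_id[0:3])        # indexes 0, 1, 2 (the end index 3 is not included)
print(sample_id[:3])         # same slice: a missing begin index means "from the start"
print(sample_id[3:])         # from index 3 to the end
print(sample_id[::2])        # step size 2: every second character
print(sample_id[::-1])       # step size -1: the string in reverse

print(sample_id.count("J"))  # method call: how many times "J" appears
```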
**Variable names for Average weight of a mouse group?**
* avg_mouse_mass
* avgmofm
* groupnameAM
* groupname_avg_mass
* avg_mouse_g
* avg_weight
* groupname_avg_weight
* avg_weight
* avg_mouse_weight
* avgWeight
* groupname_avgmass
* avg_mouse_mass
**Variable names for Number of mice in a group?**
* numMice
* groupname_num
* groupname_numMice
* group_num
* group_num
* groupname_num_mice
* groupname_Groupnum
* groupnameNum
* mice#
* groupname_mice#
**Challenge In the cell below, print the alpha_id character by character in reverse**
print(alpha_id[::-1])
-----
print(alpha_id[7]+alpha_id[6]+alpha_id[5]+alpha_id[4]+alpha_id[3]+alpha_id[2]+alpha_id[1]+alpha_id[0])
-----
print(alpha_id[7])
print(alpha_id[6])
print(alpha_id[5])
print(alpha_id[4])
print(alpha_id[3])
print(alpha_id[2])
print(alpha_id[1])
print(alpha_id[0])
----
**Create new variables that contain the initials of the experimenter**
print(alpha_id[0:3])
print(beta_id[0:3])
print(gamma_id[0:3])
----
print(alpha_id[:3])
print(beta_id[:3])
print(gamma_id[:3])
**Create new variables that contain the ID of the experimenter**
print(alpha_id[3:])
print(beta_id[3:])
print(gamma_id[3:])
----
print(alpha_id[3:])
print(beta_id[3:])
print(gamma_id[3:])
----
initial_alpha = alpha_id[0:3]
initial_beta = beta_id[0:3]
initial_gamma = gamma_id[0:3]
print(initial_alpha)
print(initial_beta)
print(initial_gamma)
------
alphaExp = alpha_id[0:3]
betaExp = beta_id[0:3]
gammaExp = gamma_id[0:3]
print("alpha experimenter: " + alphaExp + ' beta experimenter: ' + betaExp + " gamma experimenter: " + gammaExp)
----
##### Creating a Fasta file printer
name = 'Bob'
seq = 'GTACTAATTAGGGCTAGAC'
print(">" + name + '\n' + seq)
-----
ranSeqName = "sequence 1"
ranSeq = "ACGTACGATCGTAGCTACGTATCGTCGGCTACGAT"
print(">"+ranSeqName+"\n"+ranSeq)
-----
```
sequence_name = "sequence 1"
sequence = "TCGTAGCGGTGTACATGACCCCTGGATACGTGCGCCTGCTA"
print(f">{sequence_name}\n{sequence}")
```
-----
seq_name = "sequence_1"
sequence = "ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA"
print(">"+seq_name+"\n"+sequence)
-----
sequence_001_name = "sequence 001"
sequence_001 = "ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA"
print(">"+sequence_001_name+"\n"+sequence_001)
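
The printers above can also be wrapped in a function. This is one possible sketch, not the notebook's solution: the function name `fasta_record` and the `width` parameter (for breaking long sequences over several lines) are assumptions added for illustration.

```python
def fasta_record(name, sequence, width=60):
    """Return one FASTA record: a '>' header line, then the
    sequence split into lines of at most `width` characters."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines)

print(fasta_record("sequence_1", "GTACTAATTAGGGCTAGAC"))
```

Returning the text instead of printing it inside the function means the same record can be printed or written to a file.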
#### Determine and print the length of the HIV genome
print(len(hiv_genome))
##### Create variables for and print the sequences for the following HIV genes
- gag
- pol
- vif
- vpr
- env
-----
gag = hiv_genome[790:2292]
pol = hiv_genome[2085:5096]
vif = hiv_genome[5041:5619]
vpr = hiv_genome[5559:5850]
env = hiv_genome[6225:8795]
print("gag: "+gag +"\n \n pol: "+pol+"\n \n vif: "+vif+"\n \n vpr: "+vpr+"\n \n env:"+env)
-----
gag_seq = hiv_genome[790:2293]
pol_seq = hiv_genome[2085:5097]
vif_seq = hiv_genome[5041:5620]
vpr_seq = hiv_genome[5559:5851]
env_seq = hiv_genome[6225:8796]
-----
gag = hiv_genome[789:2292:]
pol = hiv_genome[2084:5096:]
vif = hiv_genome[5040:5619:]
vpr = hiv_genome[5558:5850:]
env = hiv_genome[6224:8795:]
----
gag = hiv_genome[789:2291]
pol = hiv_genome[2084:5095]
vif = hiv_genome[5040:5618]
vpr = hiv_genome[5558:5849]
env = hiv_genome[6224:8794]
print(gag + '\n' + '\n' + pol + '\n' + '\n' + vif + '\n' + '\n' + vpr + '\n' + '\n' + env)
----
gag_seq = hiv_genome[789:2292]
pol_seq = hiv_genome[2084:5096]
vif_seq = hiv_genome[5040:5619]
vpr_seq = hiv_genome[5558:5850]
env_seq = hiv_genome[6224:8795]
----
##### Generate the RNA sequence for each of the genes you have isolated above
gag_rna = gag.replace('t','u')
pol_rna= pol.replace('t','u')
vif_rna = vif.replace('t','u')
vpr_rna = vpr.replace('t','u')
env_rna= env.replace('t','u')
print("gag rna: "+gag_rna +"\n \n pol rna: "+pol_rna+"\n \n vif rna: "+vif_rna+"\n \n vpr rna: "+vpr_rna+"\n \n env rna:"+env_rna)
-----
-----
gag_rna = gag_seq.replace("t", "u")
pol_rna = pol_seq.replace("t", "u")
vif_rna = vif_seq.replace("t", "u")
vpr_rna = vpr_seq.replace("t", "u")
env_rna = env_seq.replace("t", "u")
-----
#### For each gene, generate a sum for each of the nucleotides in that gene (e.g., # of 'A', # of 'U', # of 'G', # of 'C')
gagT = str(gag.count('t'))
gagC = str(gag.count('c'))
gagG = str(gag.count('g'))
gagA = str(gag.count('a'))
polT = str(pol.count('t'))
polC = str(pol.count('c'))
polG = str(pol.count('g'))
polA = str(pol.count('a'))
vifT = str(vif.count('t'))
vifC = str(vif.count('c'))
vifG = str(vif.count('g'))
vifA = str(vif.count('a'))
vprT = str(vpr.count('t'))
vprC = str(vpr.count('c'))
vprG = str(vpr.count('g'))
vprA = str(vpr.count('a'))
envT = str(env.count('t'))
envC = str(env.count('c'))
envG = str(env.count('g'))
envA = str(env.count('a'))
print("gag: A - "+ gagA+ " C - "+gagC+" T - "+gagT+" G - "+gagG + "\n"+"pol: A - "+ polA+ " C - "+polC+" T - "+polT+" G - "+polG + "\n"+"vif: A - "+ vifA+ " C - "+vifC+" T - "+vifT+" G - "+vifG + "\n"+"vpr: A - "+ vprA+ " C - "+vprC+" T - "+vprT+" G - "+vprG + "\n"+"env: A - "+ envA+ " C - "+envC+" T - "+envT+" G - "+envG)
-----
print("gag: A - "+str(gag_rna.count("a"))+", U - "+str(gag_rna.count("u"))+", G - "+str(gag_rna.count("g"))+", C - "+str(gag_rna.count("c")))
print("pol: A - "+str(pol_rna.count("a"))+", U - "+str(pol_rna.count("u"))+", G - "+str(pol_rna.count("g"))+", C - "+str(pol_rna.count("c")))
print("vif: A - "+str(vif_rna.count("a"))+", U - "+str(vif_rna.count("u"))+", G - "+str(vif_rna.count("g"))+", C - "+str(vif_rna.count("c")))
print("vpr: A - "+str(vpr_rna.count("a"))+", U - "+str(vpr_rna.count("u"))+", G - "+str(vpr_rna.count("g"))+", C - "+str(vpr_rna.count("c")))
print("env: A - "+str(env_rna.count("a"))+", U - "+str(env_rna.count("u"))+", G - "+str(env_rna.count("g"))+", C - "+str(env_rna.count("c")))
-----
#### For each gene, calculate the GC content (%)
#percent GC = (sum of (G) + sum of (C)) / total number of nucleotides in a given gene * 100
gagGC = ((int(gagC) + int(gagG)) / len(gag))*100
envGC = ((int(envC) + int(envG)) / len(env))*100
vprGC = ((int(vprC) + int(vprG)) / len(vpr))*100
vifGC = ((int(vifC) + int(vifG)) / len(vif))*100
polGC = ((int(polC) + int(polG)) / len(pol))*100
print(" gag GC %: " + str(gagGC) + "% \n env GC %: " + str(envGC) +"% \n vpr GC %: "+str(vprGC)+"% \n vif GC %: "+str(vifGC)+"% \n pol GC %: "+str(polGC)+"%")
-----
-----
print("gag %: "+str(((gag_rna.count("g")+gag_rna.count("c"))/len(gag_rna))*100))
print("pol %: "+str(((pol_rna.count("g")+pol_rna.count("c"))/len(pol_rna))*100))
print("vif %: "+str(((vif_rna.count("g")+vif_rna.count("c"))/len(vif_rna))*100))
print("vpr %: "+str(((vpr_rna.count("g")+vpr_rna.count("c"))/len(vpr_rna))*100))
print("env %: "+str(((env_rna.count("g")+env_rna.count("c"))/len(env_rna))*100))
-----
sequence_data = {
    'gag sequence': gag_sequence,
    'pol sequence': pol_sequence,
    'vif sequence': vif_sequence,
    'vpr sequence': vpr_sequence,
    'env sequence': env_sequence
}
for name, data in sequence_data.items():
    # multiply by 100 to turn the fraction into a percentage
    print(name, "percent GC:", (data.count('g') + data.count('c')) / len(data) * 100)
-----
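
The GC calculation is the same for every gene, so it can be written once and reused. A minimal sketch, assuming lowercase or uppercase input; the function name `gc_percent` and the toy sequences are made up and stand in for the real gene slices.

```python
def gc_percent(sequence):
    """Return the percentage of bases in `sequence` that are G or C."""
    sequence = sequence.lower()
    if len(sequence) == 0:
        return 0.0   # avoid dividing by zero for an empty sequence
    gc = sequence.count("g") + sequence.count("c")
    return gc / len(sequence) * 100

# Hypothetical toy sequences, not the HIV genes
genes = {"toy_gene_1": "ggccaatt", "toy_gene_2": "AUGC"}
for name, seq in genes.items():
    print(name, "GC %:", round(gc_percent(seq), 1))
```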
gag = hiv_genome[789:2292]
print(gag)
pol = hiv_genome[2084:5096]
print(pol)
vif = hiv_genome[5040:5619]
print(vif)
vpr = hiv_genome[5558:5850]
print(vpr)
env = hiv_genome[6224:8795]
print(env)
RNA_gag = gag.replace('t', 'u')
-----
gag = hiv_genome[790:2292]
print(gag)
pol = hiv_genome[2085:5096]
print(pol)
vif = hiv_genome[5041:5619]
print(vif)
vpr = hiv_genome[5559:5850]
print(vpr)
env = hiv_genome[6225:8795]
print(env)
-----
#### Print the list of these HIV genes in order given the list below
The correct order is:
- gag, pol, vif, vpr, vpu, env, nef
print(hiv_gene_names[1] , hiv_gene_names[3], hiv_gene_names[2], hiv_gene_names[4], hiv_gene_names[5], hiv_gene_names[0], hiv_gene_names[6])
-----
print(hiv_gene_names[1])
print(hiv_gene_names[3])
print(hiv_gene_names[2])
print(hiv_gene_names[4])
print(hiv_gene_names[5])
print(hiv_gene_names[0])
print(hiv_gene_names[6])
-----
-----
print(hiv_gene_names[1] + ', ' + hiv_gene_names[3] + ', ' + hiv_gene_names[2] + ', ' + hiv_gene_names[4] + ', ' + hiv_gene_names[5] + ', ' + hiv_gene_names[0] + ', ' + hiv_gene_names[6])
-----
print(hiv_gene_names[1])
print(hiv_gene_names[3])
print(hiv_gene_names[2])
print(hiv_gene_names[4])
print(hiv_gene_names[5])
print(hiv_gene_names[0])
print(hiv_gene_names[6])
-----
print(hiv_gene_names[1],hiv_gene_names[3],hiv_gene_names[2],hiv_gene_names[4],hiv_gene_names[5],hiv_gene_names[0],hiv_gene_names[6])
-----
### DAY THREE
#### Use conditionals so that if the float is greater than or equal to 0.5, consider that 'Heads', otherwise 'Tails'
from numpy import random
my_random_int = random.randint(1,10)
my_random_float = random.ranf()
print('My random float is %f' % my_random_float)
# compare the float (not the int) against 0.5
if my_random_float>=.5:
    print('heads')
if my_random_float<.5:
    print('tails')
-----
import random
coin_flip = random.uniform(0.0,1.0)
print(coin_flip)
if coin_flip >= 0.5:
    print("Heads")
elif coin_flip < 0.5:
    print("Tails")
-----
ran_num = random.ranf()
if ran_num >= 0.5:
    print("Heads")
else:
    print("Tails")
-----
ranFloat = random.ranf()
# check the rare "on its side" case first; otherwise the >= 0.5 test catches values >= 0.999
if ranFloat<=0.001 or ranFloat>=0.999:
    print("on its side")
elif ranFloat>=0.5:
    print("heads")
else:
    print("tails")
-----
coinflip = int(random.ranf()*2)+1
if(coinflip==2):
    print("It's heads")
else:
    print("It's tails")
-----
from numpy import random
# numpy's randint leaves out the upper bound, so (0,2) gives 0 or 1
coin_flip=random.randint(0,2)
if coin_flip>.5:
    print ("heads")
elif coin_flip<=.5:
    print ("tails")
print(coin_flip)
-----
from numpy import random
num = random.random()
if num >= 0.5:
    print("Heads")
else:
    print("Tails")
-----
from numpy import random
my_random_coin = random.ranf()
if my_random_coin >= 0.5:
    print("The coin is heads")
else:
    print("The coin is tails")
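
One way to check a coin-flip rule is to run it many times and count the results. This sketch uses Python's built-in `random` module rather than numpy; the function name `flip_coin`, the seed value, and the 1000 flips are arbitrary choices for illustration.

```python
import random

def flip_coin(value):
    """Turn a float between 0 and 1 into 'Heads' (>= 0.5) or 'Tails' (< 0.5)."""
    if value >= 0.5:
        return "Heads"
    else:
        return "Tails"

random.seed(2021)   # arbitrary seed so the run can be repeated
flips = []
for flip in range(1000):
    flips.append(flip_coin(random.random()))

print("Heads:", flips.count("Heads"), "Tails:", flips.count("Tails"))
```

With a fair rule, the two counts should come out close to each other, though rarely exactly equal.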
#### 2. Determine how often HIV would mutate in 20 rounds of replication
-----
from numpy import random
replication_states = ['mutation','no_mutation']
hiv_mutation_probibilities = [0.000044,0.999956]
for flip in range(1,21):
    mutation_probibility = random.choice(replication_states,p = hiv_mutation_probibilities)
    print("%s" %mutation_probibility)
-----
-----
from numpy import random
mut_state = ['Mutation','No Mutation']
overall_prob = 0.44
mut_probabilities = [overall_prob,1-overall_prob]
for x in range(1,21):
    opt = random.choice(mut_state, p=mut_probabilities)
    print(opt)
-----
HIV_state = ['yes', 'no']
HIV_probability = [0.044, 0.956]
mutation_results = []
mutation_success = []
for mutation in range(1,21):
    mutation_rate = random.choice(HIV_state,p = HIV_probability)
    mutation_results.append(mutation_rate)
    print("Mutation: %s" %mutation_rate)
for result in mutation_results:
    if result == 'yes':
        mutation_success.append("success")
if len(mutation_success) == 1:
    print("The HIV mutated %d time" % len(mutation_success))
else:
    print("The HIV mutated %d times" % len(mutation_success))
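
The simulations above can be compared against simple arithmetic. Assuming each round mutates independently with the same probability `p`, the expected number of mutations in `n` rounds is `n * p`, and the chance of at least one mutation is `1 - (1 - p) ** n`. The answers above use different values of `p` (0.000044, 0.044, and 0.44), so this sketch takes `p` as an input rather than picking one; the function names are made up.

```python
def expected_mutations(p, rounds):
    """Average number of mutations over `rounds` independent rounds."""
    return p * rounds

def prob_at_least_one(p, rounds):
    """Chance that one or more of the `rounds` has a mutation."""
    return 1 - (1 - p) ** rounds

# the three probabilities that appear in the answers above
for p in [0.000044, 0.044, 0.44]:
    print("p =", p,
          "expected:", expected_mutations(p, 20),
          "at least one:", prob_at_least_one(p, 20))
```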
#### determine which nucleotide in the HIV-1 genome to mutate
mutation_pos = random.randint(0, len(hiv_genome))
old_base = hiv_genome[mutation_pos]
# In one line
base = hiv_genome[random.randint(0, len(hiv_genome))]
-----
# define the helper before it is used
def find_all_indices(text, letter):
    indices_of_letter = []
    for i, ch in enumerate(text):
        if ch == letter:
            indices_of_letter.append(i)
    return indices_of_letter
if(mutate == 'mutation'):
    substitutionState = ['ca', 'ga', 'ta', 'ac', 'gc', 'tc', 'ag', 'tg', 'at', 'ct', 'gt']
    substitutionP = [0.04320988, 0.45061693, 0.0617284, 0.00308632, 0.00617284, 0.055556, 0.08950617, 0.01851852, 0.00925926, 0.25, 0.01234568]
    whatSubstitute = random.choice(substitutionState, p = substitutionP)
    whichSubstitute = random.choice(find_all_indices(hiv_genome, whatSubstitute[0]))
-----
index = random.randint(0, len(hiv_genome))
#### Flip a coin weighted to the probabilities of mutation given in the 'Class 1: single nt substitution' chart above. In each, the number of observed mutations of a nucleotide on the y-axis changing to one on the x-axis is shown. Use the replace() function to mutate your HIV-1 genome
-----
```
from numpy import random

# helper: every index where letter occurs in text
def find_all_indices(text, letter):
    indices_of_letter = []
    for i, ch in enumerate(text):
        if ch == letter:
            indices_of_letter.append(i)
    return indices_of_letter

mutationState = ['mutation', 'no_mutation']
mutationP = [0.000044, 0.999956]
mutate = random.choice(mutationState, p = mutationP)
if(mutate == 'mutation'):
    substitutionState = ['ca', 'ga', 'ta', 'ac', 'gc', 'tc', 'ag', 'tg', 'at', 'ct', 'gt']
    substitutionP = [0.04320988, 0.45061693, 0.0617284, 0.00308632, 0.00617284, 0.055556, 0.08950617, 0.01851852, 0.00925926, 0.25, 0.01234568]
    whatSubstitute = random.choice(substitutionState, p = substitutionP)
    whichSubstitute = random.choice(find_all_indices(hiv_genome, whatSubstitute[0]))
    # strings cannot be changed in place, so edit a list copy and join it back
    temp = list(hiv_genome)
    temp[whichSubstitute] = whatSubstitute[1]
    mutatedGenome = ''.join(temp)
```
-----
print(hiv_genome[718])
from numpy import random
mut_nuc_nuc = ["G_A", "G_C", "G_T", "G"]
fair_mut_probabilities_nuc = [0.96052631578, 0.01315789473, 0.02631578947, 0.00000000002]
mut_outcome_nuc = []
for mut in range(1):
    fair_mut_nuc = random.choice(mut_nuc_nuc,p = fair_mut_probabilities_nuc)
    mut_outcome_nuc.append(fair_mut_nuc)
for result in mut_outcome_nuc:
    if result == "G_A":
        print("G changed to A")
    elif result == "G_C":
        print("G changed to C")
    elif result == "G_T":
        print("G changed to T")
    else:
        print("G did not mutate")
-----
mutation_pos = random.randint(0, len(hiv_genome))
old_base = hiv_genome[mutation_pos]
print(old_base)
if old_base == 'a':
    new_base = random.choice(['c', 'g', 't'], p=[1/33, 29/33, 3/33])
elif old_base == 'c':
    new_base = random.choice(['a', 't'], p=[14/95, 81/95])
elif old_base == 'g':
    new_base = random.choice(['a', 'c', 't'], p=[146/152, 2/152, 4/152])
else:
    new_base = random.choice(['a', 'c', 'g'], p=[20/44, 18/44, 6/44])
hiv_genome = hiv_genome[:mutation_pos] + new_base + hiv_genome[mutation_pos + 1:]
print(hiv_genome[mutation_pos])
-----
(this has the random mutation but because it uses replace the first occurrence of that letter gets changed instead of a random one)
# setting up mutation counts
# setting up mutation rates
mutation_state = ["mutation", "no_mutation"]
mutation_prob = 0.000044
mutation_rate = [mutation_prob, 1-mutation_prob]
# check for mutation
state = random.choice(mutation_state, p=mutation_rate)
# print(state)
# setting up what kind of mutation
mutation_specific_states = ["ac", "ag", "at", "ca", "cg", "ct", "ga", "gc", "gt", "ta", "tc", "tg"]
mutation_specific_rates_temp = [14, 146, 20, 1, 2, 18, 29, 0, 6, 3, 81, 4]
# transforming values into a decimal/percentage
mutation_specific_rates = []
for r in mutation_specific_rates_temp:
    mutation_specific_rates.append(r/sum(mutation_specific_rates_temp))
if state == "mutation":
    # if mutation, choose the type
    change = random.choice(mutation_specific_states, p=mutation_specific_rates)
    print(change[1]+" changed to "+change[0])
    # actually change the genome (start from hiv_genome; the count of 1 changes only the first match)
    mutated_hiv_genome = hiv_genome.replace(change[1], change[0], 1)
-----
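
As the note above points out, `replace()` with a count of 1 always changes the first matching letter, not the one at a chosen position. The sketch below shows the difference on a toy string and one way to change a specific index using slices. The function name `mutate_at` and the sequence `gattaca` are made up for illustration.

```python
def mutate_at(sequence, position, new_base):
    """Return a copy of `sequence` with the base at `position` replaced."""
    return sequence[:position] + new_base + sequence[position + 1:]

genome = "gattaca"   # toy sequence, not the HIV genome
position = 4         # the 'a' we actually want to change

# replace() changes the first 'a' it finds (index 1), not index 4
print(genome.replace("a", "g", 1))

# slicing keeps everything before and after the chosen index
print(mutate_at(genome, position, "g"))
```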
# assumes numpy's random (from numpy import random), whose randint leaves out the upper bound
random_int_hiv = random.randint(0, len(hiv_genome))
base = hiv_genome[random_int_hiv]
if base == 'a':
    random_inta = random.randint(1,34)
    if random_inta in range(1,2):
        print('adenine mutates into cytosine at ' + str(random_int_hiv))
    if random_inta in range(2,31):
        print('adenine mutates into guanine at ' + str(random_int_hiv))
    if random_inta in range(31,34):
        print('adenine mutates into thymine at ' + str(random_int_hiv))
if base == 'c':
    random_intc = random.randint(1,96)
    if random_intc in range(1,15):
        print('cytosine mutates into adenine at ' + str(random_int_hiv))
    if random_intc in range(15,96):
        print('cytosine mutates into thymine at ' + str(random_int_hiv))
if base == 'g':
    random_intg = random.randint(1,153)
    if random_intg in range(1,147):
        print('guanine mutates into adenine at ' + str(random_int_hiv))
    if random_intg in range(147,149):
        print('guanine mutates into cytosine at ' + str(random_int_hiv))
    if random_intg in range(149,153):
        print('guanine mutates into thymine at ' + str(random_int_hiv))
if base == 't':
    random_intt = random.randint(1,45)
    if random_intt in range(1,21):
        print('thymine mutates into adenine at ' + str(random_int_hiv))
    if random_intt in range(21,39):
        print('thymine mutates into cytosine at ' + str(random_int_hiv))
    if random_intt in range(39,45):
        print('thymine mutates into guanine at ' + str(random_int_hiv))
if result == 'no_mutation':
print("no_mutation")
-----
mutation_pos = random.randint(0, len(hiv_genome))
nuc = hiv_genome[mutation_pos]
if nuc=='a':
    mutation_state=['A-T','A-G','A-C']
    mutation_probs = [0.0909090909,0.87878787878,0.0303030303]
elif nuc=='t':
    mutation_state=['T-A','T-G','T-C']
    mutation_probs=[0.45454545454,0.13636363636,0.40909090909]
elif nuc=='g':
    mutation_state=['G-T','G-A','G-C']
    mutation_probs=[0.02631578947,0.9605263158,0.01315789473]
elif nuc=='c':
    mutation_state=['C-A','C-G','C-T']
    mutation_probs=[0.14736842105,0,0.85263157894]
mutation_results = random.choice(mutation_state,p=mutation_probs)
print(mutation_results,'\nin nucleotide position',mutation_pos,'of the hiv genome')
----
#### Write the appropriate code to translate an RNA string to a protein sequence:
- rna = 'AUGCAUGCGAAUGCAGCGGCUAGCAGACUGACUGUUAUGCUGGGAUCGUGCCGCUAG'
counter = 0
while counter < len(rna):
    print(amino_acids[rna[counter:counter+3]])
    counter = counter + 3
-----
proteinSeq = ''
lengthOfProtein = int(len(rna)/3)
for x in range(0, lengthOfProtein):
    proteinSeq += amino_acids[rna[x*3:x*3+3]]
print(proteinSeq)
-----
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    protein += amino_acids[codon]
print(protein)
#### Does your code work on the following RNA sequence?
rna = 'AUGCAAGACAGGGAUCUAUUUACGAUCAGGCAUCGAUCGAUCGAUGCUAGCUAGCGGGAUCGCACGAUACUAGCCCGAUGCUAGCUUUUAUGCUCGUAGCUGCCCGUACGUUAUUUAGCCUGCUGUGCGAAUGCAGCGGCUAGCAGACUGACUGUUAUGCUGGGAUCGUGCCGCUAG'
yes
-----
if it's supposed to stop:
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    # "-" marks a stop codon in the amino_acids dictionary used here
    if amino_acids[codon] != "-":
        protein += amino_acids[codon]
    else:
        break
print(protein)
otherwise it's the same
or if it's supposed to have multiple proteins:
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    if amino_acids[codon] != "-":
        protein += amino_acids[codon]
    else:
        print(protein)
        protein = ""
#### Can you translate this sequence in all 3 reading frames?
for frame in range(3):
    protein_seq = ''
    # stop 2 short of the end so every codon has all three bases
    for start_pos in range(frame, len(rna) - 2, 3):
        amino = amino_acids[rna[start_pos:start_pos + 3]]
        if amino == '_': break
        protein_seq += amino
    print(protein_seq)
-----
protien_seq = ''
len_protien = int(len(rna)/3)
for x in range(0,len_protien):
    protien_seq+=amino_acids[rna[x*3:x*3+3]]
    if(protien_seq[len(protien_seq)-1:len(protien_seq)])=='_':
        protien_seq+='\n'
print(protien_seq)
----
for frame in range(3):
    protein = ""
    for x in range(frame, len(rna), 3):
        # <= so the last complete codon is not skipped
        if x+3 <= len(rna):
            codon = rna[x:x+3]
            if amino_acids[codon] != "-":
                protein += amino_acids[codon]
            else:
                break
    print(protein)
or if there's supposed to be multiple proteins
for frame in range(3):
    protein = ""
    for x in range(frame, len(rna), 3):
        if x+3 <= len(rna):
            codon = rna[x:x+3]
            if amino_acids[codon] != "-":
                protein += amino_acids[codon]
            else:
                print(protein)
                protein = ""
    print(protein)
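
The reading-frame answers above can be collected into one reusable function. This is a sketch under a few assumptions: it builds its own table, named `codon_table` here, from the standard genetic code and uses `*` for stop codons, which differs from the `amino_acids` dictionary in the notebook. The test sequence `AUGGCCUAA` is made up.

```python
# Build the standard genetic code from its usual compact ordering
# (first, second, and third base each cycle through U, C, A, G)
bases = "UCAG"
amino_letters = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {}
position = 0
for first in bases:
    for second in bases:
        for third in bases:
            codon_table[first + second + third] = amino_letters[position]
            position += 1

def translate(rna, frame=0):
    """Translate `rna` starting at index `frame` (0, 1, or 2),
    stopping at the first stop codon."""
    protein = ""
    # len(rna) - 2 makes sure every codon has all three bases
    for start in range(frame, len(rna) - 2, 3):
        amino = codon_table[rna[start:start + 3]]
        if amino == "*":
            break
        protein += amino
    return protein

rna = "AUGGCCUAA"   # toy sequence: start codon, one alanine, stop codon
for frame in range(3):
    print("frame", frame, translate(rna, frame))
```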
#### Write a function that calculates the GC content of a DNA string
def calcGCContent(seq):
    # count G and C in either case, then divide by the length to get the GC fraction
    return (seq.count('G') + seq.count('C') + seq.count('g') + seq.count('c')) / len(seq)
-----
def GC_count(dna_string):
    gc_total=0
    for i in range(len(dna_string)):
        # use "or" to combine the two comparisons
        if dna_string[i]=='g' or dna_string[i]=='c':
            gc_total+=1
    return gc_total
-----
def gc_content(dna):
    g_count = dna.upper().count("G")
    c_count = dna.upper().count("C")
    return (g_count+c_count)/len(dna)
-----
def dna_gc_content(dna):
    dna_gc_count = (dna.count('c') + dna.count('g')) / len(dna)
    return (dna_gc_count*100)
print (dna_gc_content(dna) , '%')
#### Write a function that generates a random string of DNA of random length
from numpy import random
def dna_maker():
    length=random.randint(0,10000)
    probs=[0.25,0.25,0.25,0.25]
    chances=['a','t','c','g']
    dna = ''
    for i in range(length):
        dna+=random.choice(chances,p=probs)
    return dna
print(dna_maker())
------
from numpy import random
def ranDNA():
    sequenceState = ["A", "C", "G", "T"]
    sequenceP = [.25, .25, .25, .25]
    lengDNA = random.randint(0, 100000)
    DNAStrand = ''
    for x in range(lengDNA):
        DNAStrand += random.choice(sequenceState, p = sequenceP)
    return DNAStrand
print(ranDNA())
-----
from numpy import random
def ran_dna():
    length = random.randint(10, 500)
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"])
    return dna
print(ran_dna())
----
#### Challenge: Write a function that generates a random string of DNA of a random length: use optional arguments to set the length of the strings and the probabilities of the nucleotides.
-----
from numpy import random
def ran_dna(length=None, aProb=0.25, tProb=0.25, cProb=0.25, gProb=0.25):
    # a default of random.randint(...) would be picked only once, so choose the length inside the function
    if length is None:
        length = random.randint(10, 500)
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"], p=[aProb, tProb, cProb, gProb])
    return dna
print(ran_dna(100, 0.1, 0.1, 0.4, 0.4))
or it could just be a list (it works either way)
from numpy import random
def ran_dna(length=None, probs=[0.25, 0.25, 0.25, 0.25]):
    if length is None:
        length = random.randint(10, 500)
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"], p=probs)
    return dna
# the probabilities are passed as one list here
print(ran_dna(100, [0.1, 0.1, 0.4, 0.4]))
-----
```
from numpy import random
def ranDNA(probA = 0.25, probC = 0.25, probG = 0.25, probT= 0.25):
    sequenceState = ["A", "C", "G", "T"]
    sequenceP = [probA, probC, probG, probT]
    # indexes of the probabilities that were left at the default 0.25
    not25s = []
    if(sum(sequenceP)!=1):
        for x in range(0,len(sequenceP)):
            if(sequenceP[x] == 0.25):
                not25s.append(x)
        # spread the difference over the defaults so the total becomes 1
        differenceFrom1 = probA+probC+probG+probT -1
        changeMade = -differenceFrom1/len(not25s)
        for x in range(0, len(not25s)):
            sequenceP[not25s[x]] += changeMade
    lengDNA = random.randint(0, 1000)
    DNAStrand = ''
    for x in range(lengDNA):
        DNAStrand += random.choice(sequenceState, p = sequenceP)
    return DNAStrand
```
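
Another way to write the optional arguments is to use `None` as the default and fill in the real value inside the function. A default written as `length=random.randint(10, 500)` is evaluated only once, when the function is defined, so every call without a length would get the same number. This sketch uses Python's built-in `random` module instead of numpy; the function name `random_dna` and the parameter `weights` are made up for illustration.

```python
import random

def random_dna(length=None, weights=None):
    """Return a random DNA string.

    length  - number of bases; if None, a new random length from
              10 to 500 is picked on every call
    weights - relative weights for A, C, G, T (in that order);
              if None, all four bases are equally likely
    """
    if length is None:
        # the built-in randint includes both ends
        length = random.randint(10, 500)
    if weights is None:
        weights = [1, 1, 1, 1]
    # random.choices picks `length` letters using the weights
    picked = random.choices("ACGT", weights=weights, k=length)
    return "".join(picked)

print(random_dna(20))
print(random_dna(20, weights=[0.1, 0.4, 0.4, 0.1]))
```

Because `random.choices` treats the weights as relative, they do not have to add up to exactly 1.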
#### Write a function that generates a random protein of length L
-----
-----
```
def ranRNA():
    sequenceState = ["A", "C", "G", "U"]
    sequenceP = [.25, .25, .25, .25]
    lengRNA = random.randint(1, 100)
    RNAStrand = ''
    for x in range(lengRNA):
        RNAStrand += random.choice(sequenceState, p = sequenceP)
    return RNAStrand

urMom = True
proteinSeq = ''
while urMom == True:
    RNAStrand = ranRNA()
    if(len(RNAStrand)>=3):
        if(amino_acids[RNAStrand[:3]] == 'M'):
            # start over for each new strand so failed attempts do not pile up
            proteinSeq = ''
            lengthOfProtein = int(len(RNAStrand)/3)
            for x in range(0, lengthOfProtein):
                proteinSeq += amino_acids[RNAStrand[x*3:x*3+3]]
                if(amino_acids[RNAStrand[x*3:x*3+3]] == '_'):
                    urMom = False
                    break
print(proteinSeq)
```
-----
from numpy import random
amino_acids = {
'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
'UAC':'Y', 'UAU':'Y', 'UAA':'-', 'UAG':'-',
'UGC':'C', 'UGU':'C', 'UGA':'-', 'UGG':'W'
}
# set up the variables
def generate_protein(length = 100):
    trials = 0
    starts = 0
    stops = 0
    protein = ""
    rna = ""
    while len(protein) != length:
        protein = ""
        rna = ""
        # get the start working
        codon = random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"]) #random codon
        while amino_acids[codon] != "M": #while the codon is not a start, try again (create a new codon)
            codon = random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"]) #random codon
            trials += 1
            starts += 1
        # if it worked, add the codon and amino acid to the protein and rna
        protein += amino_acids[codon]
        rna += codon
        # while the codon is not a stop, keep generating and adding codons
        while amino_acids[codon] != "-":
            codon = random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"])+random.choice(["A", "U", "C", "G"]) #random codon
            protein += amino_acids[codon]
            rna += codon
        # once there's a stop, check if it is the right length or not
        if len(protein) != length:
            trials += 1
            stops += 1
    print("Total trials:", trials)
    print("Total restarts from start:", starts)
    print("Total restarts from end:", stops)
    print(protein)
    print("")
    print(rna)
    return protein, rna
generate_protein(50)
print("Done")
-----
#### M&M Plotting
|Sample name|Blue|Brown|Green|Orange|Red|Yellow|
|-----------|----|-----|-----|------|---|------|
tube_x = [,,,,,,]
tube_3 = [14,9,14,9,1,1]
tube_4 = [11,12,5,6,5,7]
tube_6 = [7,2,6,10,4,18]
tube_5 = [12,8,4,12,4,8]
tube_7 = [10,8,11,8,8,3]
tube_9 = [11,4,5,12,13,2]
tube_10 = [23,2,3,8,4,9]
#### How would you make a plot for all of the tubes in the class?
-----
data = {
'tube 3': [14,9,14,9,1,1],
'tube 5': [12,8,4,12,4,8],
'tube 6': [7,2,6,10,4,18],
'tube 7': [10,8,11,8,8,3],
'tube 9': [11,4,5,12,13,2],
'tube 10': [23,2,3,8,4,9]
}
for key in data:
plot_1 = plot.bar(index,
data[key],
color=colors,
tick_label=colors,
align='center')
plot.title(key)
plot.show(plot_1)
-----
# function to plot the data
def bar_plot(tube, name=""):
n = len(tube)
index = np.arange(n)
colors = ['blue',
'brown',
'green',
'orange',
'red',
'yellow']
plot_1 = plot.bar(index,
tube,
color=colors,
tick_label=colors,
align='center')
plot.title("Tube "+name)
plot.show(plot_1)
# defining tubes
tube_3 = [14,9,14,9,1,1]
tube_4 = [11,12,5,6,5,7]
tube_5 = [12,8,4,12,4,8]
tube_6 = [7,2,6,10,4,18]
tube_7 = [10,8,11,8,8,3]
tube_9 = [11,4,5,12,13,2]
tube_10 = [23,2,3,8,4,9]
# putting the tubes into a list for looping in the future
tubes = [tube_3, tube_4, tube_5, tube_6, tube_7, tube_9, tube_10]
tube_names = ["3", "4", "5", "6", "7", "9", "10"]
# loop through the tubes and plot them
for x in range(len(tubes)):
bar_plot(tubes[x], tube_names[x])
-----
counter = 0
# use < (not <=): the last valid index is len(tubes) - 1
while counter < len(tubes):
    observations = tubes[counter]
    n = len(observations)
    index = np.arange(n)
    colors = ['blue',
              'brown',
              'green',
              'orange',
              'red',
              'yellow']
    plot_1 = plot.bar(index,
                      observations,
                      color=colors,
                      tick_label=colors,
                      align='center')
    plot.show(plot_1)
    counter = counter + 1
-----
```
def graph(x):
#function to graph the graph
thing = observations[x]
n = len(thing)
index = np.arange(n)
colors = ['blue',
'brown',
'green',
'orange',
'red',
'yellow']
plot1 = plot.bar(index,
thing,
color=colors,
tick_label=colors,
align='center')
plot.show(plot1)
#list of the tubes with values
observations = [tube_0, tube_3, tube_4, tube_5, tube_6, tube_7, tube_9, tube_10]
#loop to graph the tubes
for x in range(len(observations)):
graph(x)
```
-----
tube_3 = [14,9,14,9,1,1,3]
tube_4 = [11,12,5,6,5,7,4]
tube_6 = [7,2,6,10,4,18,6]
tube_5 = [12,8,4,12,4,8,5]
tube_7 = [10,8,11,8,8,3,7]
tube_9 = [11,4,5,12,13,2,9]
tube_10 = [23,2,3,8,4,9,10]
def graph(tube):
observations = tube[0:6]
n = len(observations)
index = np.arange(n)
colors = ['blue',
'brown',
'green',
'orange',
'red',
'yellow']
plot_1 = plot.bar(index,
observations,
color=colors,
tick_label=colors,
align='center')
plot.title(tube[6])
plot.show(plot_1)
observations=[tube_3,tube_4,tube_5,tube_6,tube_7,tube_9,tube_10]
for x in range(len(observations)):
graph(observations[x])
------
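
Another option is a single plot for the whole class instead of one plot per tube. This sketch only does the adding-up step, using the tube counts recorded above; the resulting `totals` list could then be passed to the same kind of `plot.bar` call used in the answers. The names `tubes` and `totals` are made up for illustration.

```python
colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']

# counts per tube, in the same color order as `colors`
tubes = {
    'tube 3': [14, 9, 14, 9, 1, 1],
    'tube 4': [11, 12, 5, 6, 5, 7],
    'tube 5': [12, 8, 4, 12, 4, 8],
    'tube 6': [7, 2, 6, 10, 4, 18],
    'tube 7': [10, 8, 11, 8, 8, 3],
    'tube 9': [11, 4, 5, 12, 13, 2],
    'tube 10': [23, 2, 3, 8, 4, 9],
}

# add up each color across all of the tubes
totals = [0, 0, 0, 0, 0, 0]
for counts in tubes.values():
    for i in range(len(counts)):
        totals[i] += counts[i]

for i in range(len(colors)):
    print(colors[i], totals[i])
```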