BioCoding 2023

## Shared URLs

- Learn more markdown: [link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
- Human genome: [link](https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml)
- SNPedia: [link](https://www.snpedia.com/index.php/SNPedia)
- Project Jupyter: [link](https://jupyter.org/)
- Interesting Jupyter notebooks: [link](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)
- Try Linux terminal: [link](https://cocalc.com/doc/terminal.html)
- Rapid DNA extraction protocol: [link](https://dnabarcoding101.org/lab/protocol-2.html#standard)
- mybinder.org: [link](https://mybinder.org/)
- Notebooks: https://github.com/JasonJWilliamsNY/biocoding-2022-notebooks
- Zoom link: [TBD](TBD)
- JupyterHub: [TBD](TBD)

---

## Learning more after the class

**Notebooks used in this course**

- Biocoding 2020 Notebooks [link](https://github.com/JasonJWilliamsNY/biocoding-2020-notebooks)
  - You can download these materials: [link](https://github.com/JasonJWilliamsNY/biocoding-2020-notebooks/archive/master.zip)

**General Coding**

- CodeCademy: [link](https://www.codecademy.com/)
- Hour of code (also in languages other than English): [link](https://code.org/learn)

**Software installations**

Be sure you have permission to install software.

- Try Ubuntu: [link](https://tutorials.ubuntu.com/tutorial/try-ubuntu-before-you-install#0)
- Python: [link](https://www.python.org/downloads/)
- Jupyter: [link](https://jupyter.org/)
- Wing IDE: [link](https://wingware.com/)
- Atom text editor: [link](https://atom.io/)

**Bioinformatics**

- Learn bioinformatics in 100 hours: [link](https://www.biostarhandbook.com/edu/course/1/)
- Rosalind bioinformatics: [link](http://rosalind.info/about/)
- Bioinformatics coursera: [link](https://www.coursera.org/learn/bioinformatics)
- Bioinformatics careers: [link](https://www.iscb.org/bioinformatics-resources-for-high-schools/careers-in-bioinformatics)

**Help**

- General software help: [link](https://stackoverflow.com/)
- Bioinformatics-specific software help: [link](https://www.biostars.org/)

---

## Account names

zhong navas lee labelson suskin paval polevoy reed saur dimaio kim mingoia_murphy shohdy marinescu

### Jupyter

- [Hub address](http://3.228.2.183:8000/hub/login)

### Notebook setup

`git clone https://github.com/JasonJWilliamsNY/biocoding-2021-notebooks.git`

### DNA Barcoding

- [silica DNA isolation](https://dnabarcoding101.org/lab/protocol-2.html#alternateb)

---

## Shared notes

**Linux commands for the Command Line/Terminal**

* [linux explainer](https://explainshell.com)
* *pwd* - print working directory (prints the name of the current folder)
* *ls* - list (lists all the files in the current folder)
* *cd **foldername*** - change directory (changes the current folder to "foldername")
* *rm **filename*** - deletes the file "filename"
* *whoami* - prints your username

**GitHub specific**

* *git clone **github-link*** - copies the GitHub repository to your computer

**General Info**

- Logging in to Jupyter:
  - username: **lastname**
  - password: **lastname.123**

### DAY ONE

**Summary**

We discussed what a **computer** is, what **bioinformatics** is, as well as different tools that are used for programming such as **GitHub**, **Jupyter**, and the **Command Line**. We logged in to Jupyter for the first time and downloaded the notebooks. We went through biocoding_2021_intro_python_01 and learned about functions and the **print()** function as well as **strings** and **variables**. We **isolated our plant DNA** for PCR.
**General**

**--Vocab--**

* IP address - the computer's internet address
* GitHub - a place for sharing software/code and data
* **=** - "assignment operator"
* String - 0 or more characters enclosed in quotes

**--Concepts--**

* In Jupyter Notebooks, there is a combination of text and code
* Grey blocks/cells are code, and can be run with the **play button** on top of the screen, on the side of the cell, or by pressing **shift and enter**
* You can create a new cell with the plus at the top of the screen

--

* In python, a function is represented as **functionName()**
* A function (sometimes) takes input (in the parentheses) and then gives output

--

* A variable is something that stores data
* The value on the right of "=" is stored in the variable to the left of it
* ***variableName** = **4*** <- stores 4 in variableName

--

* anything with quotation marks around it is a **string**
* "A string is 0 or more characters enclosed in quotes"

**--Code--**

* *print("**text**")* - prints "text"
* Math
    * ***a** + **b*** addition
    * ***a** - **b*** subtraction
    * ***a** / **b*** division
    * ***a** * **b*** multiplication
    * ***a** ** **b*** exponent

---

### DAY TWO

**Summary**

We went through biocoding_2021_pythonlab_02, and learned about **strings**, the **type()** function, how to name variables, and some **python style guidelines**.

**General**

**--Vocab--**

**--Concepts--**

* Naming
    * variable_name - "snake case"
    * variableName - "camel case"
    * both are fine
    * [Python Style Guide](https://peps.python.org/pep-0008/#naming-conventions)
* Clusivity - whether a number is included or excluded from a list
* File format - a consistent way of writing/storing data
    * FASTA format -

**--Code--**

* *type(**variable**)* - returns the type of variable (string, int, etc)
* "*#*" - comment (not code)
* *len(**variable**)* - returns the length of a variable
* ***string**[**beginIndex**:**endIndex**:**stepSize**]* - get a slice of a string
    * the slice goes up to the endIndex, but does not include it
    * ":" is "everything"
* ***variable**.**method**()* is a method call
    * ***variable**.count(**letter**)* counts the number of times "letter" appears
* *help(**something**)* - tells you about "something"

-----

**Variable names for Average weight of a mouse group?**

* avg_mouse_mass
* avgmofm
* groupnameAM
* groupname_avg_mass
* avg_mouse_g
* avg_weight
* groupname_avg_weight
* avg_weight
* avg_mouse_weight
* avgWeight
* groupname_avgmass
* avg_mouse_mass

**Variable names for Number of mice in a group?**

* numMice
* groupname_num
* groupname_numMice
* group_num
* group_num
* groupname_num_mice
* groupname_Groupnum
* groupnameNum
* mice#
* groupname_mice#

**Challenge: In the cell below, print the alpha_id character by character in reverse**

```
alpha_id[::-1]
```

-----

```
print(alpha_id[7]+alpha_id[6]+alpha_id[5]+alpha_id[4]+alpha_id[3]+alpha_id[2]+alpha_id[1]+alpha_id[0])
```

-----

```
print(alpha_id[7])
print(alpha_id[6])
print(alpha_id[5])
print(alpha_id[4])
print(alpha_id[3])
print(alpha_id[2])
print(alpha_id[1])
print(alpha_id[0])
```

-----

**Create new variables that contain the initials of the experimenter**

```
print(alpha_id[0:3])
print(beta_id[0:3])
print(gamma_id[0:3])
```

-----

```
print(alpha_id[:3])
print(beta_id[:3])
print(gamma_id[:3])
```

**Create new variables that contain the ID of the experimenter**

```
print(alpha_id[3:])
print(beta_id[3:])
print(gamma_id[3:])
```

-----

```
initial_alpha = alpha_id[0:3]
initial_beta = beta_id[0:3]
initial_gamma = gamma_id[0:3]
print(initial_alpha)
print(initial_beta)
print(initial_gamma)
```

-----

```
alphaExp = alpha_id[0:3]
betaExp = beta_id[0:3]
gammaExp = gamma_id[0:3]
print("alpha experimenter: " + alphaExp + ' beta experimenter: ' + betaExp + " gamma experimenter: " + gammaExp)
```

-----

##### Creating a FASTA file printer

```
name = 'Bob'
seq = 'GTACTAATTAGGGCTAGAC'
print(">" + name + '\n' + seq)
```

-----

```
ranSeqName = "sequence 1"
ranSeq = "ACGTACGATCGTAGCTACGTATCGTCGGCTACGAT"
print(">"+ranSeqName+"\n"+ranSeq)
```

-----

```
sequence_name = "sequence 1"
sequence = "TCGTAGCGGTGTACATGACCCCTGGATACGTGCGCCTGCTA"
print(f">{sequence_name}\n{sequence}")
```

-----

```
seq_name = "sequence_1"
sequence = "ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA"
print(">"+seq_name+"\n"+sequence)
```

-----

```
sequence_001_name = "sequence 001"
sequence_001 = "ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA"
print(">"+sequence_001_name+"\n"+sequence_001)
```

#### Determine and print the length of the HIV genome

```
print(len(hiv_genome))
```

##### Create variables for and print the sequences for the following HIV genes

- gag
- pol
- vif
- vpr
- env

-----

```
gag = hiv_genome[790:2292]
pol = hiv_genome[2085:5096]
vif = hiv_genome[5041:5619]
vpr = hiv_genome[5559:5850]
env = hiv_genome[6225:8795]
print("gag: "+gag +"\n \n pol: "+pol+"\n \n vif: "+vif+"\n \n vpr: "+vpr+"\n \n env:"+env)
```

-----

```
gag_seq = hiv_genome[790:2293]
pol_seq = hiv_genome[2085:5097]
vif_seq = hiv_genome[5041:5620]
vpr_seq = hiv_genome[5559:5851]
env_seq = hiv_genome[6225:8796]
```

-----

```
gag = hiv_genome[789:2292:]
pol = hiv_genome[2084:5096:]
vif = hiv_genome[5040:5619:]
vpr = hiv_genome[5558:5850:]
env = hiv_genome[6044:8795:]
```

-----

```
gag = hiv_genome[789:2291]
pol = hiv_genome[2084:5095]
vif = hiv_genome[5040:5618]
vpr = hiv_genome[5558:5849]
env = hiv_genome[6224:8794]
print(gag + '\n' + '\n' + pol + '\n' + '\n' + vif + '\n' + '\n' + vpr + '\n' + '\n' + env)
```

-----

```
gag_seq = hiv_genome[789:2292]
pol_seq = hiv_genome[2084:5096]
vif_seq = hiv_genome[5040:5619]
vpr_seq = hiv_genome[5558:5850]
env_seq = hiv_genome[6224:8795]
```

-----

##### Generate the RNA sequence for each of the genes you have isolated above

```
gag_rna = gag.replace('t','u')
pol_rna = pol.replace('t','u')
vif_rna = vif.replace('t','u')
vpr_rna = vpr.replace('t','u')
env_rna = env.replace('t','u')
print("gag rna: "+gag_rna +"\n \n pol rna: "+pol_rna+"\n \n vif rna: "+vif_rna+"\n \n vpr rna: "+vpr_rna+"\n \n env rna:"+env_rna)
```

-----

```
gag_rna = gag_seq.replace("t", "u")
pol_rna = pol_seq.replace("t", "u")
vif_rna = vif_seq.replace("t", "u")
vpr_rna = vpr_seq.replace("t", "u")
env_rna = env_seq.replace("t", "u")
```

-----

#### For each gene, generate a sum for each of the nucleotides in that gene (e.g., # of 'A', # of 'U', # of 'G', # of 'C')

```
gagT = str(gag.count('t'))
gagC = str(gag.count('c'))
gagG = str(gag.count('g'))
gagA = str(gag.count('a'))
polT = str(pol.count('t'))
polC = str(pol.count('c'))
polG = str(pol.count('g'))
polA = str(pol.count('a'))
vifT = str(vif.count('t'))
vifC = str(vif.count('c'))
vifG = str(vif.count('g'))
vifA = str(vif.count('a'))
vprT = str(vpr.count('t'))
vprC = str(vpr.count('c'))
vprG = str(vpr.count('g'))
vprA = str(vpr.count('a'))
envT = str(env.count('t'))
envC = str(env.count('c'))
envG = str(env.count('g'))
envA = str(env.count('a'))
print("gag: A - "+ gagA+ " C - "+gagC+" T - "+gagT+" G - "+gagG + "\n"
      +"pol: A - "+ polA+ " C - "+polC+" T - "+polT+" G - "+polG + "\n"
      +"vif: A - "+ vifA+ " C - "+vifC+" T - "+vifT+" G - "+vifG + "\n"
      +"vpr: A - "+ vprA+ " C - "+vprC+" T - "+vprT+" G - "+vprG + "\n"
      +"env: A - "+ envA+ " C - "+envC+" T - "+envT+" G - "+envG)
```

-----

```
print("gag: A - "+str(gag_rna.count("a"))+", U - "+str(gag_rna.count("u"))+", G - "+str(gag_rna.count("g"))+", C - "+str(gag_rna.count("c")))
print("pol: A - "+str(pol_rna.count("a"))+", U - "+str(pol_rna.count("u"))+", G - "+str(pol_rna.count("g"))+", C - "+str(pol_rna.count("c")))
print("vif: A - "+str(vif_rna.count("a"))+", U - "+str(vif_rna.count("u"))+", G - "+str(vif_rna.count("g"))+", C - "+str(vif_rna.count("c")))
print("vpr: A - "+str(vpr_rna.count("a"))+", U - "+str(vpr_rna.count("u"))+", G - "+str(vpr_rna.count("g"))+", C - "+str(vpr_rna.count("c")))
print("env: A - "+str(env_rna.count("a"))+", U - "+str(env_rna.count("u"))+", G - "+str(env_rna.count("g"))+", C - "+str(env_rna.count("c")))
```

-----

#### For each gene, calculate the GC content (%)

```
# percent GC = (sum of G + sum of C) / total number of nucleotides in a given gene * 100
gagGC = ((int(gagC) + int(gagG)) / len(gag))*100
envGC = ((int(envC) + int(envG)) / len(env))*100
vprGC = ((int(vprC) + int(vprG)) / len(vpr))*100
vifGC = ((int(vifC) + int(vifG)) / len(vif))*100
polGC = ((int(polC) + int(polG)) / len(pol))*100
print(" gag GC %: " + str(gagGC) + "% \n env GC %: " + str(envGC) +"% \n vpr GC %: "+str(vprGC)+"% \n vif GC %: "+str(vifGC)+"% \n pol GC %: "+str(polGC)+"%")
```

-----

```
print("gag %: "+str(((gag_rna.count("g")+gag_rna.count("c"))/len(gag_rna))*100))
print("pol %: "+str(((pol_rna.count("g")+pol_rna.count("c"))/len(pol_rna))*100))
print("vif %: "+str(((vif_rna.count("g")+vif_rna.count("c"))/len(vif_rna))*100))
print("vpr %: "+str(((vpr_rna.count("g")+vpr_rna.count("c"))/len(vpr_rna))*100))
print("env %: "+str(((env_rna.count("g")+env_rna.count("c"))/len(env_rna))*100))
```

-----

```
sequence_data = {
    'gag sequence': gag_sequence,
    'pol sequence': pol_sequence,
    'vif sequence': vif_sequence,
    'vpr sequence': vpr_sequence,
    'env sequence': env_sequence
}
for name, data in sequence_data.items():
    # multiply by 100 so the printed value is a percentage, not a fraction
    print(name, "percent GC:", (data.count('g') + data.count('c')) / len(data) * 100)
```

-----

```
gag = hiv_genome[789:2292]
print(gag)
pol = hiv_genome[2084:5096]
print(pol)
vif = hiv_genome[5040:5619]
print(vif)
vpr = hiv_genome[5558:5850]
print(vpr)
env = hiv_genome[6224:8795]
print(env)
RNA_gag = gag.replace('t', 'u')
```

-----

```
gag = hiv_genome[790:2292]
print(gag)
pol = hiv_genome[2085:5096]
print(pol)
vif = hiv_genome[5041:5619]
print(vif)
vpr = hiv_genome[5559:5850]
print(vpr)
env = hiv_genome[6225:8795]
print(env)
```

-----

#### Print the list of these HIV genes in order given the list below

The correct order is: gag, pol, vif, vpr, vpu, env, nef

```
print(hiv_gene_names[1], hiv_gene_names[3], hiv_gene_names[2], hiv_gene_names[4], hiv_gene_names[5], hiv_gene_names[0], hiv_gene_names[6])
```

-----

```
print(hiv_gene_names[1])
print(hiv_gene_names[3])
print(hiv_gene_names[2])
print(hiv_gene_names[4])
print(hiv_gene_names[5])
print(hiv_gene_names[0])
print(hiv_gene_names[6])
```

-----

```
print(hiv_gene_names[1] + ', ' + hiv_gene_names[3] + ', ' + hiv_gene_names[2] + ', ' + hiv_gene_names[4] + ', ' + hiv_gene_names[5] + ', ' + hiv_gene_names[0] + ', ' + hiv_gene_names[6])
```

-----

```
print(hiv_gene_names[1],hiv_gene_names[3],hiv_gene_names[2],hiv_gene_names[4],hiv_gene_names[5],hiv_gene_names[0],hiv_gene_names[6])
```

-----

### DAY THREE

#### Use conditionals so that if the float is greater than or equal to 0.5, consider that 'Heads', otherwise 'Tails'

```
from numpy import random
my_random_int = random.randint(1,10)
my_random_float = random.ranf()
print('My random float is %f' % my_random_float)
# compare the float (not the int) against 0.5
if my_random_float >= .5:
    print('heads')
if my_random_float < .5:
    print('tails')
```

-----

```
import random
coin_flip = random.uniform(0.0,1.0)
print(coin_flip)
if coin_flip >= 0.5:
    print("Heads")
elif coin_flip < 0.5:
    print("Tails")
```

-----

```
ran_num = random.ranf()
if ran_num >= 0.5:
    print("Heads")
else:
    print("Tails")
```

-----

```
ranFloat = random.ranf()
if ranFloat>=0.5:
    print("heads")
elif ranFloat<=0.001 or ranFloat>=0.999:
    print("on its side")
else:
    print("tails")
```

-----

```
coinflip = int(random.ranf()*2)+1
if(coinflip==2):
    print("It's heads")
else:
    print("It's tails")
```

-----

```
from numpy import random
# numpy's randint excludes the upper bound, so (0, 2) gives 0 or 1
coin_flip = random.randint(0,2)
if coin_flip > .5:
    print("heads")
elif coin_flip <= .5:
    print("tails")
print(coin_flip)
```

-----

```
from numpy import random
num = random.random()
if num >= 0.5:
    print("Heads")
else:
    print("Tails")
```

-----

```
from numpy import random
my_random_coin = random.ranf()
if my_random_coin <= 0.5:
    print("The coin is heads")
else:
    print("The coin is tails")
```

#### 2. Determine how often HIV would mutate in 20 rounds of replication

```
from numpy import random
replication_states = ['mutation','no_mutation']
hiv_mutation_probibilities = [0.000044,0.999956]
for flip in range(1,21):
    mutation_probibility = random.choice(replication_states,p = hiv_mutation_probibilities)
    print("%s" %mutation_probibility)
```

-----

```
from numpy import random
mut_state = ['Mutation','No Mutation']
overall_prob = 0.44
mut_probabilities = [overall_prob,1-overall_prob]
for x in range(1,21):
    opt = random.choice(mut_state, p=mut_probabilities)
    print(opt)
```

-----

```
HIV_state = ['yes', 'no']
HIV_probability = [0.044, 0.956]
mutation_results = []
mutation_success = []
for mutation in range(1,21):
    mutation_rate = random.choice(HIV_state,p = HIV_probability)
    mutation_results.append(mutation_rate)
    print("Mutation: %s" %mutation_rate)
for result in mutation_results:
    if result == 'yes':
        mutation_success.append("success")
if len(mutation_success) == 1:
    print("The HIV mutated %d time" % len(mutation_success))
else:
    print("The HIV mutated %d times" % len(mutation_success))
```

#### Determine which nucleotide in the HIV-1 genome to mutate

```
mutation_pos = random.randint(0, len(hiv_genome))
old_base = hiv_genome[mutation_pos]

# In one line
base = hiv_genome[random.randint(0, len(hiv_genome))]
```

-----

```
# helper defined first so it exists before it is called
def find_all_indices(text, letter):
    indices_of_letter = []
    for i, ch in enumerate(text):
        if ch == letter:
            indices_of_letter.append(i)
    return indices_of_letter

if(mutate == 'mutation'):
    substitutionState = ['ca', 'ga', 'ta', 'ac', 'gc', 'tc', 'ag', 'tg', 'at', 'ct', 'gt']
    substitutionP = [0.04320988, 0.45061693, 0.0617284, 0.00308632, 0.00617284, 0.055556, 0.08950617, 0.01851852, 0.00925926, 0.25, 0.01234568]
    whatSubstitute = random.choice(substitutionState, p = substitutionP)
    whichSubstitute = random.choice(find_all_indices(hiv_genome, whatSubstitute[0]))
```

-----

```
index = random.randint(0, len(hiv_genome))
```

#### Flip a coin weighted to the probabilities of mutation given in the 'Class 1: single nt substitution' chart above

In each, the number of observed mutations of a nucleotide on the y-axis changing to one on the x-axis is shown. Use the replace() function to mutate your HIV-1 genome.

-----

```
from numpy import random

def find_all_indices(text, letter):
    indices_of_letter = []
    for i, ch in enumerate(text):
        if ch == letter:
            indices_of_letter.append(i)
    return indices_of_letter

mutationState = ['mutation', 'no_mutation']
mutationP = [0.000044, 0.999956]
mutate = random.choice(mutationState, p = mutationP)
if(mutate == 'mutation'):
    substitutionState = ['ca', 'ga', 'ta', 'ac', 'gc', 'tc', 'ag', 'tg', 'at', 'ct', 'gt']
    substitutionP = [0.04320988, 0.45061693, 0.0617284, 0.00308632, 0.00617284, 0.055556, 0.08950617, 0.01851852, 0.00925926, 0.25, 0.01234568]
    whatSubstitute = random.choice(substitutionState, p = substitutionP)
    # pick a random position that holds the base being replaced
    whichSubstitute = random.choice(find_all_indices(hiv_genome, whatSubstitute[0]))
    temp = list(hiv_genome)
    temp[whichSubstitute] = whatSubstitute[1]
    mutatedGenome = ''.join(temp)
```

-----

```
print(hiv_genome[718])
from numpy import random
mut_nuc_nuc = ["G_A", "G_C", "G_T", "G"]
fair_mut_probabilities_nuc = [0.96052631578, 0.01315789473, 0.02631578947, 0.00000000002]
mut_outcome_nuc = []
for mut in range(1):
    fair_mut_nuc = random.choice(mut_nuc_nuc,p = fair_mut_probabilities_nuc)
    mut_outcome_nuc.append(fair_mut_nuc)
for result in mut_outcome_nuc:
    if result == "G_A":
        print("G changed to A")
    elif result == "G_C":
        print("G changed to C")
    elif result == "G_T":
        print("G changed to T")
    else:
        print("G did not mutate")
```

-----

```
mutation_pos = random.randint(0, len(hiv_genome))
old_base = hiv_genome[mutation_pos]
print(old_base)
if old_base == 'a':
    new_base = random.choice(['c', 'g', 't'], p=[1/33, 29/33, 3/33])
elif old_base == 'c':
    new_base = random.choice(['a', 't'], p=[14/95, 81/95])
elif old_base == 'g':
    new_base = random.choice(['a', 'c', 't'], p=[146/152, 2/152, 4/152])
else:
    new_base = random.choice(['a', 'c', 'g'], p=[20/44, 18/44, 6/44])
hiv_genome = hiv_genome[:mutation_pos] + new_base + hiv_genome[mutation_pos + 1:]
print(hiv_genome[mutation_pos])
```

-----

(this has the random mutation, but because it uses replace the first occurrence of that letter gets changed instead of a random one)

```
# setting up mutation counts
mutated_hiv_genome = hiv_genome  # start from the unmutated genome
# setting up mutation rates
mutation_state = ["mutation", "no_mutation"]
mutation_prob = 0.000044
mutation_rate = [mutation_prob, 1-mutation_prob]
# check for mutation
state = random.choice(mutation_state, p=mutation_rate)
# print(state)
# setting up what kind of mutation
mutation_specific_states = ["ac", "ag", "at", "ca", "cg", "ct", "ga", "gc", "gt", "ta", "tc", "tg"]
mutation_specific_rates_temp = [14, 146, 20, 1, 2, 18, 29, 0, 6, 3, 81, 4]
# transforming values into a decimal/percentage
mutation_specific_rates = []
for r in mutation_specific_rates_temp:
    mutation_specific_rates.append(r/sum(mutation_specific_rates_temp))
if state == "mutation":
    # if mutation, choose the type
    change = random.choice(mutation_specific_states, p=mutation_specific_rates)
    print(change[1]+" change to "+change[0])
    # actually change the genome
    mutated_hiv_genome = mutated_hiv_genome.replace(change[1], change[0], 1)
```

-----

```
# numpy's randint excludes the upper bound, so each upper bound is one past the last value wanted
random_int_hiv = random.randint(0, len(hiv_genome))
if hiv_genome[random_int_hiv] == 'a':
    random_inta = random.randint(1, 34)  # 1 to 33
    if random_inta in range(1, 2):
        print('adenine mutates into cytosine at ' + str(random_int_hiv))
    if random_inta in range(2, 31):
        print('adenine mutates into guanine at ' + str(random_int_hiv))
    if random_inta in range(31, 34):
        print('adenine mutates into thymine at ' + str(random_int_hiv))
if hiv_genome[random_int_hiv] == 'c':
    random_intc = random.randint(1, 96)  # 1 to 95
    if random_intc in range(1, 15):
        print('cytosine mutates into adenine at ' + str(random_int_hiv))
    if random_intc in range(15, 96):
        print('cytosine mutates into thymine at ' + str(random_int_hiv))
if hiv_genome[random_int_hiv] == 'g':
    random_intg = random.randint(1, 153)  # 1 to 152
    if random_intg in range(1, 147):
        print('guanine mutates into adenine at ' + str(random_int_hiv))
    if random_intg in range(147, 149):
        print('guanine mutates into cytosine at ' + str(random_int_hiv))
    if random_intg in range(149, 153):
        print('guanine mutates into thymine at ' + str(random_int_hiv))
if hiv_genome[random_int_hiv] == 't':
    random_intt = random.randint(1, 45)  # 1 to 44
    if random_intt in range(1, 21):
        print('thymine mutates into adenine at ' + str(random_int_hiv))
    if random_intt in range(21, 39):
        print('thymine mutates into cytosine at ' + str(random_int_hiv))
    if random_intt in range(39, 45):
        print('thymine mutates into guanine at ' + str(random_int_hiv))
# `result` is not defined in this snippet; it is assumed to hold the mutation coin flip from an earlier cell
if result == 'no_mutation':
    print("no_mutation")
```

-----

```
mutation_pos = random.randint(0, len(hiv_genome))
nuc = hiv_genome[mutation_pos]
if nuc=='a':
    mutation_state=['A-T','A-G','A-C']
    mutation_probs = [0.0909090909,0.87878787878,0.0303030303]
elif nuc=='t':
    mutation_state=['T-A','T-G','T-C']
    mutation_probs=[0.45454545454,0.13636363636,0.40909090909]
elif nuc=='g':
    mutation_state=['G-T','G-A','G-C']
    mutation_probs=[0.02631578947,0.9605263158,0.01315789473]
elif nuc=='c':
    mutation_state=['C-A','C-G','C-T']
    mutation_probs=[0.14736842105,0,0.85263157894]
mutation_results = random.choice(mutation_state,p=mutation_probs)
print(mutation_results,'\nin nucleotide position',mutation_pos,'of the hiv genome')
```

-----

#### Write the appropriate code to translate an RNA string to a protein sequence:

```
rna = 'AUGCAUGCGAAUGCAGCGGCUAGCAGACUGACUGUUAUGCUGGGAUCGUGCCGCUAG'
```

```
counter = 0
while counter < len(rna):
    print(amino_acids[rna[counter:counter+3]])
    counter = counter + 3
```

-----

```
proteinSeq = ''
lengthOfProtein = int(len(rna)/3)
for x in range(0, lengthOfProtein):
    proteinSeq += amino_acids[rna[x*3:x*3+3]]
print(proteinSeq)
```

-----

```
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    protein += amino_acids[codon]
print(protein)
```

#### Does your code work on the following RNA sequence?

```
rna = 'AUGCAAGACAGGGAUCUAUUUACGAUCAGGCAUCGAUCGAUCGAUGCUAGCUAGCGGGAUCGCACGAUACUAGCCCGAUGCUAGCUUUUAUGCUCGUAGCUGCCCGUACGUUAUUUAGCCUGCUGUGCGAAUGCAGCGGCUAGCAGACUGACUGUUAUGCUGGGAUCGUGCCGCUAG'
```

yes

-----

if it's supposed to stop:

```
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    if amino_acids[codon] != "_":
        protein += amino_acids[codon]
    else:
        break
print(protein)
```

otherwise it's the same

or if it's supposed to have multiple proteins:

```
protein = ""
for x in range(0, len(rna), 3):
    codon = rna[x:x+3]
    if amino_acids[codon] != "-":
        protein += amino_acids[codon]
    else:
        print(protein)
        protein = ""
```

#### Can you translate this sequence in all 3 reading frames?

```
for frame in range(3):
    protein_seq = ''
    # stop 2 short of the end so a partial codon is never looked up
    for start_pos in range(frame, len(rna) - 2, 3):
        amino = amino_acids[rna[start_pos:start_pos + 3]]
        if amino == '_':
            break
        protein_seq += amino
    print(protein_seq)
```

-----

```
protien_seq = ''
len_protien = int(len(rna)/3)
for x in range(0,len_protien):
    protien_seq+=amino_acids[rna[x*3:x*3+3]]
    if(protien_seq[len(protien_seq)-1:len(protien_seq)])=='_':
        protien_seq+='\n'
print(protien_seq)
```

-----

```
for frame in range(3):
    protein = ""
    for x in range(frame, len(rna), 3):
        # <= so the last complete codon is included
        if x+3 <= len(rna):
            codon = rna[x:x+3]
            if amino_acids[codon] != "-":
                protein += amino_acids[codon]
            else:
                break
    print(protein)
```

or if there's supposed to be multiple proteins

```
for frame in range(3):
    protein = ""
    for x in range(frame, len(rna), 3):
        if x+3 <= len(rna):
            codon = rna[x:x+3]
            if amino_acids[codon] != "-":
                protein += amino_acids[codon]
            else:
                print(protein)
                protein = ""
    print(protein)
```

#### Write a function that calculates the GC content of a DNA string

```
def calcGCContent(seq):
    # divide the G/C count by the length to get the GC fraction
    return (seq.count('G') + seq.count('C') + seq.count('g') + seq.count('c')) / len(seq)
```

-----

```
def GC_count():
    GC_count=0
    for i in range(len(dna_string)):
        # `or` is the logical operator; `|` is bitwise and fails on strings
        if dna_string[i]=='g' or dna_string[i]=='c':
            GC_count+=1
    return GC_count
```

-----

```
def gc_content(dna):
    g_count = dna.upper().count("G")
    c_count = dna.upper().count("C")
    return (g_count+c_count)/len(dna)
```

-----

```
def dna_gc_content():
    dna_gc_count = (dna.count('c') + dna.count('g')) / len(dna)
    return (dna_gc_count*100)
print (dna_gc_content() , '%')
```

#### Write a function that generates a random string of DNA of random length

```
from numpy import random
def dna_maker():
    length=random.randint(0,10000)
    probs=[0.25,0.25,0.25,0.25]
    chances=['a','t','c','g']
    dna = ''
    for i in range(length):
        dna+=random.choice(chances,p=probs)
    return dna
print(dna_maker())
```

-----

```
from numpy import random
def ranDNA():
    sequenceState = ["A", "C", "G", "T"]
    sequenceP = [.25, .25, .25, .25]
    lengDNA = random.randint(0, 100000)
    DNAStrand = ''
    for x in range(lengDNA):
        DNAStrand += random.choice(sequenceState, p = sequenceP)
    return DNAStrand
print(ranDNA())
```

-----

```
from numpy import random
def ran_dna():
    length = random.randint(10, 500)
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"])
    return dna
print(ran_dna())
```

-----

#### Challenge: Write a function that generates a random string of DNA of a random length: use optional arguments to set the length of the strings and the probabilities of the nucleotides.
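
One thing to watch with optional arguments: a default such as `length=random.randint(10, 500)` is evaluated only once, when the function is defined, so every later call without a length reuses the same number. A minimal sketch of one way around this, using `None` as the default and picking the random length inside the function. It uses Python's built-in `random` module instead of `numpy.random` from the class notebooks, and the names `random_dna`, `NUCLEOTIDES`, and `max_length` are made up for this example.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "T"]

def random_dna(length=None, probs=None, max_length=500):
    """Return a random DNA string.

    length: number of bases; None means "pick a new random length on each call".
    probs: weights for A, C, G, T in that order; None means equal weights.
    """
    if length is None:
        # Chosen inside the function so every call gets a fresh random length.
        # The built-in randint includes both ends (unlike numpy's).
        length = random.randint(1, max_length)
    if probs is None:
        probs = [0.25, 0.25, 0.25, 0.25]
    if len(probs) != len(NUCLEOTIDES):
        raise ValueError("probs needs one weight per nucleotide (A, C, G, T)")
    # random.choices draws `length` letters using the given weights
    return "".join(random.choices(NUCLEOTIDES, weights=probs, k=length))

print(random_dna(20))                               # fixed length, equal weights
print(random_dna(20, probs=[0.1, 0.4, 0.4, 0.1]))   # GC-rich
print(len(random_dna()))                            # random length each call
```

Passing the probabilities as one list keeps the call short; four separate arguments (as in some of the answers below) work too.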
-----

```
from numpy import random
# note: this default length is picked once, when the function is defined
def ran_dna(length=random.randint(10, 500), aProb=0.25, tProb=0.25, cProb=0.25, gProb=0.25):
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"], p=[aProb, tProb, cProb, gProb])
    return dna
print(ran_dna(100, 0.1, 0.1, 0.4, 0.4))
```

or it could just be a list (it works either way)

```
from numpy import random
def ran_dna(length=random.randint(10, 500), probs=[0.25, 0.25, 0.25, 0.25]):
    print("Length:", length)
    dna = ""
    for x in range(length):
        dna += random.choice(["A", "T", "C", "G"], p=probs)
    return dna
# the probabilities are passed as one list here
print(ran_dna(100, [0.1, 0.1, 0.4, 0.4]))
```

-----

```
from numpy import random
def ranDNA(probA = 0.25, probC = 0.25, probG = 0.25, probT= 0.25):
    sequenceState = ["A", "C", "G", "T"]
    sequenceP = [probA, probC, probG, probT]
    not25s = []
    # if the probabilities do not add up to 1, spread the difference
    # over the ones that were left at the default of 0.25
    if(sum(sequenceP)!=1):
        for x in range(0,len(sequenceP)):
            if(sequenceP[x] == 0.25):
                not25s.append(x)
        differenceFrom1 = probA+probC+probG+probT -1
        changeMade = -differenceFrom1/len(not25s)
        for x in range(0, len(not25s)):
            sequenceP[not25s[x]] += changeMade
    lengDNA = random.randint(0, 1000)
    DNAStrand = ''
    for x in range(lengDNA):
        DNAStrand += random.choice(sequenceState, p = sequenceP)
    return DNAStrand
```

#### Write a function that generates a random protein of length L

-----

```
from numpy import random

def ranRNA():
    sequenceState = ["A", "C", "G", "U"]
    sequenceP = [.25, .25, .25, .25]
    lengRNA = random.randint(1, 100)
    RNAStrand = ''
    for x in range(lengRNA):
        RNAStrand += random.choice(sequenceState, p = sequenceP)
    return RNAStrand

searching = True
proteinSeq = ''
while searching == True:
    RNAStrand = ranRNA()
    if(len(RNAStrand)>=3):
        if(amino_acids[RNAStrand[:3]] == 'M'):
            proteinSeq = ''  # start over for each new candidate strand
            lengthOfProtein = int(len(RNAStrand)/3)
            for x in range(0, lengthOfProtein):
                proteinSeq += amino_acids[RNAStrand[x*3:x*3+3]]
                if(amino_acids[RNAStrand[x*3:x*3+3]] == '_'):
                    searching = False
                    break
print(proteinSeq)
```

-----

```
from numpy import random

amino_acids = {
    'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
    'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
    'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
    'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
    'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
    'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
    'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
    'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
    'UAC':'Y', 'UAU':'Y', 'UAA':'-', 'UAG':'-',
    'UGC':'C', 'UGU':'C', 'UGA':'-', 'UGG':'W'
}

def random_codon():
    # three random bases joined into one codon
    return "".join(random.choice(["A", "U", "C", "G"], size=3))

def generate_protein(length = 100):
    # set up the variables
    trials = 0
    starts = 0
    stops = 0
    protein = ""
    rna = ""
    while len(protein) != length:
        protein = ""
        rna = ""
        # get the start working
        codon = random_codon()
        while amino_acids[codon] != "M":
            # while the codon is not a start, try again (create a new codon)
            codon = random_codon()
            trials += 1
            starts += 1
        # if it worked, add the codon and amino acid to the protein and rna
        protein += amino_acids[codon]
        rna += codon
        # while the codon is not a stop, keep generating and adding codons
        while amino_acids[codon] != "-":
            codon = random_codon()
            protein += amino_acids[codon]
            rna += codon
        # once there's a stop, check if it is the right length or not
        if len(protein) != length:
            trials += 1
            stops += 1
    print("Total trials:", trials)
    print("Total restarts from start:", starts)
    print("Total restarts from end:", stops)
    print(protein)
    print("")
    print(rna)
    return protein, rna

generate_protein(50)
print("Done")
```

-----

#### M&M Plotting

|Sample name|Blue|Brown|Green|Orange|Red|Yellow|
|-----------|----|-----|-----|------|---|------|
|tube 3|14|9|14|9|1|1|
|tube 4|11|12|5|6|5|7|
|tube 5|12|8|4|12|4|8|
|tube 6|7|2|6|10|4|18|
|tube 7|10|8|11|8|8|3|
|tube 9|11|4|5|12|13|2|
|tube 10|23|2|3|8|4|9|

```
# template: tube_x = [blue, brown, green, orange, red, yellow]
tube_3 = [14,9,14,9,1,1]
tube_4 = [11,12,5,6,5,7]
tube_6 = [7,2,6,10,4,18]
tube_5 = [12,8,4,12,4,8]
tube_7 = [10,8,11,8,8,3]
tube_9 = [11,4,5,12,13,2]
tube_10 = [23,2,3,8,4,9]
```

#### How would you make a plot for all of the tubes in the class?

The answers below assume the notebook has already run `import numpy as np` and `import matplotlib.pyplot as plot`.

-----

```
data = {
    'tube 3': [14,9,14,9,1,1],
    'tube 5': [12,8,4,12,4,8],
    'tube 6': [7,2,6,10,4,18],
    'tube 7': [10,8,11,8,8,3],
    'tube 9': [11,4,5,12,13,2],
    'tube 10': [23,2,3,8,4,9]
}
colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']
index = np.arange(len(colors))
for key in data:
    plot_1 = plot.bar(index, data[key], color=colors, tick_label=colors, align='center')
    plot.title(key)
    plot.show()
```

-----

```
# function to plot the data
def bar_plot(tube, name=""):
    n = len(tube)
    index = np.arange(n)
    colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']
    plot_1 = plot.bar(index, tube, color=colors, tick_label=colors, align='center')
    plot.title("Tube "+name)
    plot.show()

# defining tubes
tube_3 = [14,9,14,9,1,1]
tube_4 = [11,12,5,6,5,7]
tube_5 = [12,8,4,12,4,8]
tube_6 = [7,2,6,10,4,18]
tube_7 = [10,8,11,8,8,3]
tube_9 = [11,4,5,12,13,2]
tube_10 = [23,2,3,8,4,9]

# putting the tubes into a list for looping in the future
tubes = [tube_3, tube_4, tube_5, tube_6, tube_7, tube_9, tube_10]
tube_names = ["3", "4", "5", "6", "7", "9", "10"]

# loop through the tubes and plot them
for x in range(len(tubes)):
    bar_plot(tubes[x], tube_names[x])
```

-----

```
counter = 0
# < (not <=) so the loop stops at the last tube
while counter < len(tubes):
    observations = tubes[counter]
    n = len(observations)
    index = np.arange(n)
    colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']
    plot_1 = plot.bar(index, observations, color=colors, tick_label=colors, align='center')
    plot.show()
    counter = counter + 1
```

-----

```
def graph(x):
    #function to graph the graph
    thing = observations[x]
    n = len(thing)
    index = np.arange(n)
    colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']
    plot1 = plot.bar(index, thing, color=colors, tick_label=colors, align='center')
    plot.show()

#list of the tubes with values
observations = [tube_0, tube_3, tube_4, tube_5, tube_6, tube_7, tube_9, tube_10]

#loop to graph the tubes
for x in range(len(observations)):
    graph(x)
```

-----

```
# the last number in each list is the tube number
tube_3 = [14,9,14,9,1,1,3]
tube_4 = [11,12,5,6,5,7,4]
tube_6 = [7,2,6,10,4,18,6]
tube_5 = [12,8,4,12,4,8,5]
tube_7 = [10,8,11,8,8,3,7]
tube_9 = [11,4,5,12,13,2,9]
tube_10 = [23,2,3,8,4,9,10]

def graph(tube):
    observations = tube[0:6]
    n = len(observations)
    index = np.arange(n)
    colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']
    plot_1 = plot.bar(index, observations, color=colors, tick_label=colors, align='center')
    plot.title(str(tube[6]))
    plot.show()

observations=[tube_3,tube_4,tube_5,tube_6,tube_7,tube_9,tube_10]
for x in range(len(observations)):
    graph(observations[x])
```

-----
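
Another way to look at the whole class at once is to add up each color across all of the tubes before plotting. A small sketch using only built-in Python and the tube counts from the table above. The names `COLORS`, `tubes`, and `color_totals` are made up for this example.

```python
COLORS = ["blue", "brown", "green", "orange", "red", "yellow"]

# counts per tube, in the same color order as COLORS
tubes = {
    "3": [14, 9, 14, 9, 1, 1],
    "4": [11, 12, 5, 6, 5, 7],
    "5": [12, 8, 4, 12, 4, 8],
    "6": [7, 2, 6, 10, 4, 18],
    "7": [10, 8, 11, 8, 8, 3],
    "9": [11, 4, 5, 12, 13, 2],
    "10": [23, 2, 3, 8, 4, 9],
}

def color_totals(tubes):
    """Add up each color's count across every tube."""
    totals = [0] * len(COLORS)
    for counts in tubes.values():
        for i, count in enumerate(counts):
            totals[i] += count
    return totals

totals = color_totals(tubes)
grand_total = sum(totals)
for color, total in zip(COLORS, totals):
    # show the count and its share of all the M&Ms in the class
    print(f"{color:>6}: {total:3d} ({100 * total / grand_total:.1f}%)")
```

The `totals` list has the same shape as a single tube, so it could be passed to any of the plotting functions above to get one bar chart for the whole class.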