Try   HackMD

運用Python解生物資訊問題 (1)

練習一

  1. Calculate the GC content of the following DNA sequence:
    ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGTAGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAACCCTCTATTGTGTGCATCAAAGGATAGAGATAAAAGACACCAAGGAAGCT
dna_seq = """ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGTAGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAACCCTCTATTGTGTGCATCAAAGGATAGAGATAAAAGACACCAAGGAAGCT"""
gc = 0
for nt in dna_seq.strip():
    if nt == 'G' or nt == 'C':
        gc += 1
gc_content = gc / len(dna_seq.strip())
print('The GC content of the DNA sequence is %.2f %%' % (gc_content * 100))

  1. Translate the above DNA sequence to protein sequence using the codon table:

codon_table = { 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'UAU': 'Y', 'UAC': 'Y', 'UAA': '', 'UAG': '', 'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', }
Calculate the frequency of each amino-acid, and find the most frequent one.

codon_table = { 'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L', 'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S', 'UAU': 'Y', 'UAC': 'Y', 'UAA': '', 'UAG': '', 'UGU': 'C', 'UGC': 'C','UGA': '*', 'UGG': 'W', 'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L', 'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M', 'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K', 'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V', 'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G', }

transcribe_table = {'A': 'A', 'G': 'G', 'C': 'C', 'T': 'U'}

rna_letters = []
for nt in dna_seq:
    rna_letters.append(transcribe_table[nt])
rna_seq = ''.join(rna_letters)
# print(rna_seq)

protein_seq = []
for i in range(0, len(rna_seq), 3):
    codon = rna_seq[i: i + 3]
    aa = codon_table[codon]
    if aa == '*':
        break
    else:
        protein_seq.append(aa)
protein_seq = ''.join(protein_seq)
print(protein_seq)

  1. Are there other possible start sites in the sequence? How many? For each start site, what would be the sequence of the translated protein? What would be the lengths ofthe DNA and protein sequences? What are the reading frames (0, 1 or 2)?
start_sites = [] for i in range(len(rna_seq)): if rna_seq[i: i + 3] == 'AUG': start_sites.append(i) for start_site in start_sites: protein_seq = [] for i in range(start_site, len(rna_seq), 3): codon = rna_seq[i: i + 3] if len(codon) == 3: aa = codon_table[codon] else: break if aa == '*': break else: protein_seq += aa protein_seq = ''.join(protein_seq) report = 'Start site %d: %s, DNA sequence of length %d was translated into %d amino acids; frames = %d' % (start_site, protein_seq, len(rna_seq) - start_site, len(protein_seq), start_site % 3) print(report)

練習二

A)
Two strains of bacteria are grown in a lab on a petri dish. Strain A doubles (X2) its population every day,while strain B triples it (X3). The experiment starts with a colony of 10,000 bacteria of type A and 100 of type B on the same petri dish. When the total population (of both strains together) exceeds 108 bacteria,growth is reduced to X1.5 for strain A and X1.8 for strain B. When it exceeds 1010, growth stops. What willbe the number of bacteria during each day of the experiment? How many days will it take until growthstops? What will be the proportion between the two strains at the end of the experiment?

strain_a = 10000 strain_b = 100 replicate_a = 2 replicate_b = 3 total = strain_a + strain_b day = 1 while total < 10**10: if total > 10**8: replicate_a = 1.5 replicate_b = 1.8 strain_a *= replicate_a strain_b *= replicate_b total = strain_a + strain_b print('Day%d: strain A: %d bacteria, strain B: %d bacteria [total = %d]' % (day, strain_a, strain_b, total)) day += 1 prop_a = strain_a / total prop_b = strain_b / total print('At Day%d, the total bacteria reached %d with %.2f %% of strain A and %.2f %% of strain B' % (day - 1, total, prop_a * 100, prop_b * 100))

In a second experiment, a bacteriophage is introduced into the dish. When the total population exceeds

106 bacteria, it starts spreading and kills 40% of the cells every day (of both strains). Every day after the bacteriophage becomes active, 10% of type-A bacteria will develop immunity to the bacteriophage and retain normal growth rate. Strain B has a lower mutation rate, so only 1% of its cells will develop immunity every day. Run the simulation again and determine the fraction of the 4 populations at the end of the experiment (type A/B with/without immunity).

immunity_development_rate = {'A': 0.1, 'B': 0.01} populations = {('A', False): 1e4, ('A', True): 0, ('B', False): 1e2, ('B', True): 0} day = 0 is_phage_active = False while sum(populations.values()) < 1e10: if sum(populations.values()) > 1e6: is_phage_active = True if sum(populations.values()) < 1e8: growth_rate = {'A': 2, 'B': 3} else: growth_rate = {'A': 1.5, 'B': 1.8} for strain_key in populations.keys(): strain, is_immune = strain_key populations[strain_key] *= growth_rate[strain] if is_phage_active and not is_immune: populations[strain_key] *= 0.6 if is_phage_active: for strain in ['A', 'B']: immunity_rate = immunity_development_rate[strain] num_developed_immunity = immunity_rate * populations[(strain, False)] populations[(strain, True)] += num_developed_immunity populations[(strain, False)] -= num_developed_immunity day += 1 if is_phage_active: print('After %d days [active phage]:' % day) else: print('After %d days:' % day) for strain in ['A', 'B']: for is_immune in [False, True]: if is_immune: immune_label = 'Immune' else: immune_label = 'Unimmune' population_size = populations[(strain, is_immune)] prop = population_size / sum(populations.values()) * 100 print('\t' + '%s %s: %d bacteria(%.2f%%)' % (immune_label, strain, population_size, prop)) print('\t' + 'total bacteria: %d' % sum(populations.values()))

練習三

Find the reverse complement of the following DNA sequence and translate the output DNA sequence into protein sequence: CTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAAATAACACTGAAGGAAGTGACACAATCACCCTCCCATGCAGAATAAAACAAATTATAAACATGTGGCAGAAAGTAGGAAAAGCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACAGGGCTGCTATTAACAAGAGATGGTGGTAATAGCAACAATGAGTCCGAGATCTTCAGACCTGGAGGAGGAGATATGAGGGACAATTGGAGAAGTGAATTATATAAATATAAAGTAGTAAAAATTGAACCATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGGCGCAGCCTCAATGACGCTGACGGTACAGGCCAGACAATTATTGTCTGGTATAGTGCAGCAGCAGAACAATTTGCTGAGGGCTATTGAGGCGCAACAGCATCTGTTGCAACTCACAGTCTGGGGCAT

dna_seq = """CTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAAATAACA CTGAAGGAAGTGACACAATCACCCTCCCATGCAGAATAAAACAAATTATAAACATGTGGCAGAAAGTAGGAAAA GCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACAGGGCTGCTATTAACAAGAGA TGGTGGTAATAGCAACAATGAGTCCGAGATCTTCAGACCTGGAGGAGGAGATATGAGGGACAATTGGAGAAGTG AATTATATAAATATAAAGTAGTAAAAATTGAACCATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTG CAGAGAGAAAAAAGAGCAGTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGG CGCAGCCTCAATGACGCTGACGGTACAGGCCAGACAATTATTGTCTGGTATAGTGCAGCAGCAGAACAATTTGC TGAGGGCTATTGAGGCGCAACAGCATCTGTTGCAACTCACAGTCTGGGGCAT""" dna_seq = dna_seq.replace(" ", "") reverse_dna_table = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'} rev_dna_seq = [] for nt in dna_seq[::-1]: rev_dna_seq.append(reverse_dna_table[nt]) rev_dna_seq = ''.join(rev_dna_seq) rna_seq = rev_dna_seq.replace('T', 'U') protein_seq = [] for i in range(0, len(rna_seq), 3): codon = rna_seq[i: i + 3] if len(codon) == 3: aa = codon_table[codon] else: break if aa == '*': break else: protein_seq.append(aa) protein_seq = ''.join(protein_seq) print(f'Reverse DNA sequence: \n{rev_dna_seq} \nProtein sequence: \n{protein_seq}')

What is the frequency of each codon in the obtained sequence?

codon_counts = {} for i in range(0, len(rna_seq), 3): codon = rna_seq[i: i+3] if codon in codon_counts: codon_counts[codon] += 1 else: codon_counts[codon] = 1 codon_frequencies = {} total_count = sum(codon_counts.values()) for codon, count in codon_counts.items(): codon_frequencies[codon] = count / total_count print('%s: %.2f' % (codon, codon_frequencies[codon]))

Does the distribution of codons seem uniform? Are there codons that appear dramatically more (or less) than expected (given their amino-acid frequency in that sequence)?

aa_to_codon = {} for codon, aa in codon_table.items(): if aa == '*': continue if aa in aa_to_codon: aa_to_codon[aa].add(codon) else: aa_to_codon[aa] = {codon} for aa, aa_codons in aa_to_codon.items(): aa_counts = protein_seq.count(aa) if aa_counts == 0: continue print('Amino acid %s (%d codons, %d occurrence in the protein sequence' % (aa, len(aa_codons), aa_counts)) for codon in aa_codons: codon_count = codon_counts.get(codon, 0) codon_freq = codon_count / aa_counts print('\t' + 'Codon %s: %d occurrence (%.1f%%)' % (codon, codon_count, codon_freq * 100))