# Identification of a possible regulatory site in genomic DNA
Reference - http://www.people.vcu.edu/~elhaij/IntroBioinf/Scenarios/Scenario-RegulatorySite.html

Figure: Filament of Nostoc. The green cells are photosynthetic vegetative cells. The pale cell is a heterocyst, specialized for nitrogen fixation.
## Nitrogen-fixing cyanobacteria: Eat air and prosper!
Certain cyanobacteria, Nostoc PCC 7120 among them, are among the few organisms on Earth able to survive on CO2 as a source of carbon, N2 as a source of nitrogen, water as a source of electrons, and sunlight as a source of energy. This is quite a trick, because fixing carbon with electrons from water necessarily produces O2 as a byproduct, and the process of fixing N2 is irreversibly inactivated by tiny amounts of oxygen. Nostoc protects its nitrogen-fixation machinery from inactivation by producing specialized cells, called heterocysts, that rigorously exclude oxygen.
## The cost of fixing nitrogen: How to pay only when necessary?
Heterocysts are expensive to make and maintain, however, and you are interested in studying the mechanism by which Nostoc regulates the appearance of heterocysts. When an alternative source of nitrogen is present, Nostoc makes no heterocysts. When that source is consumed or removed, vegetative cells differentiate into heterocysts within about 18 hours. How do the cells sense nitrogen deprivation and translate that perception into the induction of the genes necessary for heterocyst differentiation? At present, the answer to this question is not known.
## The discovery: starvation ==> *** NtcA-BINDING *** ==> heterocyst differentiation
You are studying the regulation of the gene hetR, whose product is known to be critical in controlling heterocyst differentiation. You're focusing on the protein HetQ, which you believe regulates the expression of hetR. Your plan is to make random mutations in hetQ (which encodes HetQ), hoping to understand from the resulting mutant protein how the regulation is achieved. In examining the sequence upstream of hetQ, you happen to notice the presence of the sequence:
atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct
The capital letters, you recognize, meet all the requirements of a binding site for the protein NtcA, known to mediate the expression of many genes sensitive to nitrogen deprivation. Maybe, just maybe, you have accidentally discovered the missing link that connects nitrogen deprivation to the regulation of heterocyst genes!
## The discovery? How do you know?
Unfortunately, you need hard evidence that NtcA actually binds to that site before anyone will believe your theory. And hard evidence means spending the better part of a year measuring the binding of NtcA to your sequence in the test tube. If it DOESN'T bind, then you've wasted a lot of time. Is there any way to assess the LIKELIHOOD that NtcA will bind to your sequence without actually having to do time-consuming experiments? How can you tell whether the sequence you found could simply have arisen by chance, without regard to function?
## Problem
Use bioinformatic tools to assess the likelihood of encountering a specific DNA sequence by chance.
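
A back-of-the-envelope estimate helps frame the problem. As a simplifying assumption (not a claim about real genomes), suppose the four bases are equally likely and independent at every position. The symbols $k$ (number of specified bases in a site) and $G$ (number of positions scanned) are introduced here only for illustration. Under that assumption:

$$
P(\text{match at one position}) = \left(\tfrac{1}{4}\right)^{k}, \qquad E[\text{number of matches}] \approx G \cdot 4^{-k}.
$$

For example, the sequence above has $k = 9$ capitalized bases, and $4^{-9} = 1/262144 \approx 3.8 \times 10^{-6}$, so chance alone would be expected to produce a few such matches per million positions when the spacing between the bases is fixed.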
## Resources
- [Molecular Biology Background](http://www.people.vcu.edu/~elhaij/IntroBioinf/Notes/RegulatoryProtein.pdf)
## Tools
### Simulation
Make up a large number of sequences. Ask in each case whether the sequence fits the criteria for an NtcA binding site. Count how many times it does, how many times it doesn't.
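
The simulation idea can be sketched in a few lines. The example below is written in Python purely for illustration (the tasks further down point to Julia, and the same steps carry over); the names `random_dna` and `fraction_matching` are hypothetical. It assumes every base is equally likely, and it uses the NtcA pattern given in the Regular Expressions section.

```python
import random
import re

# Pattern for an NtcA binding site, as given in the Regular Expressions section.
NTCA_SITE = re.compile(r"GTA.{8}TAC.{20,24}TA.{3}T")


def random_dna(length, rng):
    """Return a random DNA string in which A, C, G and T are equally likely."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def fraction_matching(n_sequences, length, seed=0):
    """Fraction of n_sequences random sequences containing at least one site."""
    rng = random.Random(seed)  # seeded so that a run can be repeated exactly
    hits = sum(
        1
        for _ in range(n_sequences)
        if NTCA_SITE.search(random_dna(length, rng))
    )
    return hits / n_sequences
```

Because a match is rare, a small number of short sequences will usually report a fraction of zero; the estimate only becomes informative once many sequences are generated.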
### Pattern recognition
Scan the genome of Nostoc PCC 7120 and count how many sequences fit the pattern of an NtcA binding site.
#### Regular Expressions
The pattern you're looking for can be expressed as a regular expression, a compact set of rules for matching patterns of characters. Our particular regular expression, [as explained here](http://www.people.vcu.edu/~elhaij/IntroBioinf/ProblemSets/PS1P.html), is
> `GTA.{8}TAC.{20,24}TA.{3}T`
> which stands for `GTA` followed by a gap of eight positions, then `TAC` followed by a gap of 20 to 24 positions, then `TA` followed by a gap of three positions and a `T`.
This can be visualized [here](https://regex101.com). Paste the regex into the top bar, and then paste the sequence `atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct` into the text box below. You can try learning more about regular expressions [here](https://regexone.com/). Once you're comfortable enough to understand the above expression, move on to the following programming tasks.
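
The same check can be done in code. The snippet below is a minimal sketch in Python, used here only for illustration; Julia's regex support, which the tasks below point to, works along the same lines. It searches the example sequence for the pattern and reports where the site starts and how long the variable gap is. The parentheses around the 20-to-24 gap are an addition for illustration and are not part of the original expression.

```python
import re

# Same pattern as above, with parentheses added around the variable gap so
# its length can be inspected. The parentheses do not change what matches.
NTCA_SITE = re.compile(r"GTA.{8}TAC(.{20,24})TA.{3}T")

# The sequence found upstream of hetQ; capitals mark the conserved bases.
upstream = "atctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctct"

match = NTCA_SITE.search(upstream)
if match is None:
    print("no NtcA-like site found")
else:
    start = match.start()             # 0-based offset of the leading G
    gap_length = len(match.group(1))  # bases between TAC and TA
    print(f"site starts at offset {start}; variable gap is {gap_length} bases")
```

Note that the pattern is case-sensitive: it matches here because the conserved bases are capitalized. For a genome stored in a single case, the text would need to be converted to upper case before matching.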
## Programming Tasks
1. Write a program that takes as input the length of a sequence $L$ and the number of sequences $N$ and outputs $N$ DNA sequences of length $L$ generated randomly. Each nucleotide is equally likely to occur in the sequence at any position.
2. Write a program that takes as input an array of size $N$ which contains sequences of length $L$ and finds the number of times the pattern `GTA.{8}TAC.{20,24}TA.{3}T` occurs in it and where it occurs. You may need to refer to [Julia RegExp Documentation](https://docs.julialang.org/en/v1/manual/strings/#man-regex-literals).
3. Download the [genome of Nostoc sp. PCC 7120](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009705.1/) and load in the FASTA file. There are two ways you can do this.
   - **Relatively easy way that is still insightful** -- Use [FASTX.jl](https://github.com/BioJulia/FASTX.jl) to read in the file and store all the sequences in an array.
   - **Significantly harder way for programming practice** -- Since it's a text file, you can open it, read it and store each sequence in the file as a string and the set of all sequences in an array. You will have to write a function to parse the file. The format can be found [here](https://en.wikipedia.org/wiki/FASTA_format#Overview).
4. Use the program from step 2 to count the number of times an NtcA-binding site appears in this genome. Let's call this number $N_0$.
5. Using the program from step 1, randomly generate a genome with the same number of sequences as the genome of Nostoc sp. PCC 7120 and of the same length. If the length varies, pick `80`. Then, using the program from step 2, find the number of occurrences of the NtcA-binding site in this randomly generated genome.
6. Repeat step 5 many times, and each time store the number of NtcA-binding sites found. Plot a histogram of these counts.
7. Using this data, find the probability that a randomly generated genome has *at least* $N_0$ NtcA-binding sites. As a scientist, what can you conclude from this?
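
One possible shape for tasks 3 to 7 is sketched below in Python, purely for illustration; it is not the intended solution, and every function name in it is hypothetical. It assumes uniformly random bases for the null genomes, counts every start position at which the pattern matches (using a lookahead so overlapping sites are not skipped), and estimates the probability in task 7 as the plain fraction of random genomes with at least $N_0$ sites.

```python
import random
import re
from collections import Counter

# Zero-width lookahead: every start position that begins a site is counted,
# even when two candidate sites overlap.
NTCA_SITE = re.compile(r"(?=GTA.{8}TAC.{20,24}TA.{3}T)")


def parse_fasta(lines):
    """Collect the sequences of a FASTA file given as an iterable of lines."""
    sequences, chunks, seen_header = [], [], False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):         # a header line starts a new record
            if seen_header:
                sequences.append("".join(chunks))
            chunks, seen_header = [], True
        elif line and seen_header:
            chunks.append(line.upper())  # the pattern is case-sensitive
    if seen_header:
        sequences.append("".join(chunks))
    return sequences


def count_sites(sequences):
    """Number of start positions, over all sequences, where the pattern matches."""
    return sum(len(NTCA_SITE.findall(seq)) for seq in sequences)


def random_genome(lengths, rng):
    """One random sequence per entry of lengths, all four bases equally likely."""
    return ["".join(rng.choices("ACGT", k=n)) for n in lengths]


def null_counts(lengths, n_trials, seed=0):
    """Site counts for n_trials random genomes with the given sequence lengths."""
    rng = random.Random(seed)
    return [count_sites(random_genome(lengths, rng)) for _ in range(n_trials)]


def histogram(counts):
    """Sorted (site count, number of genomes) pairs; a text stand-in for a plot."""
    return sorted(Counter(counts).items())


def fraction_at_least(counts, n_observed):
    """Fraction of random genomes with at least n_observed sites."""
    return sum(1 for c in counts if c >= n_observed) / len(counts)


# Usage sketch ("genome.fna" is a placeholder file name, not from the text):
# with open("genome.fna") as handle:
#     genome = parse_fasta(handle)
# n0 = count_sites(genome)
# counts = null_counts([len(s) for s in genome], n_trials=100)
# print(histogram(counts))
# print(fraction_at_least(counts, n0))
```
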
---
## Checklist
- [x] Write a program that takes as input the length of a sequence $L$ and the number of sequences $N$ and outputs $N$ DNA sequences of length $L$ generated randomly. Each nucleotide is equally likely to occur in the sequence at any position.
- [x] Write a program that takes as input an array of size $N$ which contains sequences of length $L$ and finds the number of times the pattern `GTA.{8}TAC.{20,24}TA.{3}T` occurs in it and where it occurs. You may need to refer to [Julia RegExp Documentation](https://docs.julialang.org/en/v1/manual/strings/#man-regex-literals).
- [x] Download the [genome of Nostoc sp. PCC 7120](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009705.1/) and load in the FASTA file. There are two ways you can do this.
  - **Relatively easy way that is still insightful** -- Use [FASTX.jl](https://github.com/BioJulia/FASTX.jl) to read in the file and store all the sequences in an array.
  - **Significantly harder way for programming practice** -- Since it's a text file, you can open it, read it and store each sequence in the file as a string and the set of all sequences in an array. You will have to write a function to parse the file. The format can be found [here](https://en.wikipedia.org/wiki/FASTA_format#Overview).
- [x] Use the program from step 2 to count the number of times an NtcA-binding site appears in this genome. Let's call this number $N_0$.
- [x] Using the program from step 1, randomly generate a genome with the same number of sequences as the genome of Nostoc sp. PCC 7120 and of the same length. If the length varies, pick `80`. Then, using the program from step 2, find the number of occurrences of the NtcA-binding site in this randomly generated genome.
- [ ] Repeat step 5 many times, and each time store the number of NtcA-binding sites found. Plot a histogram of these counts.
- [ ] Using this data, find the probability that a randomly generated genome has *at least* $N_0$ NtcA-binding sites. As a scientist, what can you conclude from this?