# Calculating Genomic Coverage
## Approach
My first thought when reading over this problem was to use an [interval tree](https://en.wikipedia.org/wiki/Interval_tree). Such a data structure has the advantage of fast lookups for the read depth at any given position, and it allows new reads to be inserted and new lookups to be run quickly and easily. The disadvantage is that the data structure takes time to construct.
My solution reflects the desire for fast lookups. First, I construct a self-balancing binary search tree containing all of the intervals in `reads.csv`, using the [AVL method](https://en.wikipedia.org/wiki/AVL_tree). Next, I iterate over each of the loci in `loci.csv` and do an overlap search in the tree to count the number of overlapping reads. I write my answers to a file called `ans.csv` to avoid clobbering the input loci data or needing to read it all into memory.
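As an illustration only (the real logic lives in `soln.py`, which builds its own AVL tree), the sketch below reproduces the same build-then-query pipeline using the third-party [`intervaltree`](https://pypi.org/project/intervaltree/) package; the CSV layouts it assumes (headerless `start,end` rows in `reads.csv`, one position per row in `loci.csv`) are guesses made for the sake of the example.

```python
# Minimal sketch of the build-then-query pipeline using the `intervaltree`
# package (pip install intervaltree) instead of a hand-rolled AVL tree.
# Assumes headerless CSVs: "start,end" rows in reads.csv and one position
# per row in loci.csv -- both layouts are assumptions, not taken from soln.py.
import csv
from intervaltree import IntervalTree

tree = IntervalTree()
with open("reads.csv") as f:
    for start, end in csv.reader(f):
        # intervaltree intervals are half-open [begin, end), so add 1 to
        # treat each read as inclusive of its final base.
        tree.addi(int(start), int(end) + 1)

with open("loci.csv") as f, open("ans.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for (locus,) in csv.reader(f):
        # tree[p] is the set of stored intervals overlapping position p;
        # its size is the read depth at that locus.
        writer.writerow([locus, len(tree[int(locus)])])
```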
To execute, run `./soln.py` in the directory containing `soln.py`, `reads.csv`, and `loci.csv`. The output will be stored in `ans.csv`.
## Further Work
If I had more time to work on this problem, I would consider five changes.
1. Instead of constructing the search tree in memory (which is problematic for large datasets), I would build the tree in a SQL database, having each row of the database represent a node in the tree with its associated values. This would slow down computation but allow for much larger datasets.
2. [Red-black trees](https://en.wikipedia.org/wiki/Red%E2%80%93black_tree) are slightly less strictly balanced than AVL trees, but they need fewer rebalancing rotations per insertion and are therefore faster to construct. I would implement the red-black algorithm instead of the AVL algorithm to reduce tree construction time.
3. I was struck by an idea for a slightly different data structure that takes longer to construct but would be very fast for lookups in some cases. Instead of representing the ranges as they appear in `reads.csv`, consider a data structure that is an ordered list of non-overlapping ranges, each carrying a depth value. When a new range is inserted, each older range totally consumed by the new range increments its depth value by 1; each portion of the new range not overlapped by any older range becomes a new range with a depth of 1; and when an older range partially overlaps the new range, the old range is split. A rough sketch of these three cases appears after this list. This would make lookups very fast because exactly *one* node in the data structure contains any given locus, and that node holds the read depth directly. My main concern with this data structure (and the reason I did not implement it) is that a dataset of ranges that overlap in messy, unpredictable ways could produce a huge number of small ranges that need to be stored. So it's possible that this method, in conjunction with the first suggestion of storing everything in a SQL database, could yield the fastest lookups without overloading memory.

4. Make the script more user-friendly. Allow the paths to the read data file and loci data file to be passed as command-line arguments. It might also be helpful to spin up a REPL so that additional reads can be input and further depth queries made easily.
5. While it would be difficult to multithread the tree-construction process, the queries are independent of one another and could be sped up by multithreading.
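Below is a minimal sketch of the insertion logic for the flattened data structure described in item 3. The list-based storage, the function names, and the half-open [start, end) convention are all illustrative choices, not part of `soln.py`; a production version would likely pair this logic with the SQL storage from item 1.

```python
# Sketch of item 3: coverage kept as an ordered list of non-overlapping
# (start, end, depth) segments over half-open ranges [start, end).
# All names here are illustrative; this is not part of soln.py.
import bisect

segments = []  # sorted, non-overlapping (start, end, depth) tuples

def insert_read(start, end):
    """Insert one read [start, end), handling the three cases from item 3."""
    global segments
    out, cursor = [], start
    for s, e, d in segments:
        if e <= start or s >= end:     # no overlap: keep the old segment as-is
            out.append((s, e, d))
            continue
        if s < start:                  # partial overlap: split off the left piece
            out.append((s, start, d))
            s = start
        if cursor < s:                 # stretch covered only by the new read
            out.append((cursor, s, 1))
        split = min(e, end)
        out.append((s, split, d + 1))  # totally consumed stretch: depth + 1
        if e > end:                    # partial overlap: split off the right piece
            out.append((end, e, d))
        cursor = split
    if cursor < end:                   # tail covered only by the new read
        out.append((cursor, end, 1))
    out.sort()                         # restore start order after splicing
    segments = out

def depth(pos):
    """Read depth at pos: exactly one segment (or none) can contain it."""
    i = bisect.bisect_right(segments, (pos, float("inf"), 0)) - 1
    if i >= 0 and segments[i][0] <= pos < segments[i][1]:
        return segments[i][2]
    return 0
```

For example, the three cases play out as:

```python
insert_read(10, 30)  # no older ranges: one new segment with depth 1
insert_read(15, 25)  # totally consumed by (10, 30): splits it into three pieces
insert_read(20, 40)  # partial overlap: more splits plus a new depth-1 tail
print(segments)      # [(10,15,1), (15,20,2), (20,25,3), (25,30,2), (30,40,1)]
print(depth(22))     # -> 3, from the single segment (20, 25, 3)
```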
## Runtime Analysis
Let *n* be the number of reads and *m* be the number of loci. Each read is inserted into the interval tree. Because the tree is self-balancing, the number of comparisons needed to insert each read is O(log(*n*)). Similarly, the number of rotations needed for each insertion is at most O(log(*n*)), and each rotation takes constant time. So each insertion requires O(log(*n*)) time, and with *n* insertions the runtime of tree construction is O(*n* log(*n*)).
The runtime of determining the read depth at a given position is a bit more complicated. In the worst case, a position is overlapped by every interval in the tree; determining this requires touching each interval, which takes at least O(*n*) time. However, this is not the typical case. If we assume there is a constant *C* bounding the deepest read depth, and that *C* is small compared to *n*, then each position lookup is limited to O(log(*n*) + *C*) time, which is O(log(*n*)). This means the runtime of *m* lookups is O(*m* log(*n*)).
Thus, the total runtime of constructing the tree and doing each of the lookups is:
O(*n* log(*n*) + *m* log(*n*))
Or:
O((*n* + *m*) log(*n*))
For this particular dataset, my computer can construct the tree in about 150 seconds and can iterate through each of the loci in an additional 30 seconds.