# 碩士論文 初稿
張恆霖
基於EAGLE之基因組變異統計分析評分系統延伸應用至胞嘧啶的甲基化程度評估
## 摘要
近幾年以來,在基因表關遺傳學的研究領域當中,佔據了很重要且熱門一塊主題,就是DNA的甲基化現象。DNA甲基化(DNA methylation)[1]是DNA化學修飾的一種形式,它能影響DNA,在不改變序列的情況下,改變其序列原本會產生的遺傳表現,就是所謂的表觀遺傳編碼(epigenetic code)的一部分,而這也是屬於一種外遺傳的機制。DNA甲基化的流程,就是在胞嘧啶環(cytosine)[2]的第五號碳角位上,將甲基(Methyl group)銜接上去,而這種甲基化銜接於胞嘧啶5'方向上,較常見其發生於所有脊椎動物的身上。
DNA甲基化的發生會導致基因該區段的轉錄作用率降低,意即可能抑制蛋白質的相關生成,而這種現象又稱為基因靜默(Gene silencing),進而使其失去功能,在影響發育與癌症中扮演關鍵作用。在人類的細胞中,大約有1%的鹼基是發生了甲基化作用的,在一般成熟的細胞當中,DNA甲基化通常發生於CpG雙核甘酸(CpG dinucleotide),在人類的基因當中,大約80%左右的CpG位點可能都發生甲基化,但某些特定區域則未發生甲基化,例如密集的胞嘧啶和鳥嘌呤,也就是所謂的CpG島(CpG island)。而這與包含所有廣泛表達基因在內的56%的哺乳動物基因中的啟動子有關。因此對人類來說,研究哺乳動物之甲基化程度的現象,對於探討疾病或是特定生理表徵有著龐大的關係。
---
DNA methylation [1] is a form of chemical modification of DNA that affects the genetic expression of DNA without changing the sequence, which is part of the so-called epigenetic code, and it is also part of an epigenetic mechanism. The process of DNA methylation is the articulation of the Methyl group at the 5' carbon position of the cytosine ring [2], and this methylation is articulated in the cytosine 5' direction, which is more common in all vertebrates.
The occurrence of DNA methylation leads to a decrease in the transcriptional rate of that region of the gene, which means that it may inhibit the associated production of proteins, a phenomenon also known as Gene silencing, which in turn renders it non-functional and plays a key role in development and cancer. In human cells, about 1% of the bases are methylated. In mature cells, DNA methylation usually occurs at the CpG dinucleotide, and about 80% of the CpG sites in human genes may be methylated, but some specific regions are not methylated, such as dense cytosine and guanine. We call that the CpG island. This is associated with promoters in 56% of mammalian genes, which contain all widely expressed genes. Therefore, for humans, the study of methylation in mammals is of great relevance to the study of diseases or specific physiological phenotypes.
---
<!-- EAGLE(Explicit Alternative Genome Likelihood Evaluator)[3],EAGLE是一個用來評估生物個體對於變異資料信心程度評分的專案。這個專案將使用者給定的參考基因組(reference genome)、變異資料(Variant Calling File)以及生物序列資訊(Binary Map File),以建立一個用來衡量生物體變異的生成機率模型,將每個可能發生變異的變異點做評分。如得分越高,代表該變異點擁有更多條的生物個體序列資訊支持。
亞硫酸鹽定序(Bisulfite sequenceing)[4],是一種用來測定DNA甲基化的方法,其做法是利用亞硫酸鹽處理,原理是在於亞硫酸鹽可使DNA上的胞嘧啶(C)轉變為尿嘧啶(U),同時已受甲基化的5-甲基胞嘧啶則不受影響。而後進行聚合酶連鎖反應(PCR,Polymerase chain reaction),則會將上述尿嘧啶(U)轉換成胸腺嘧啶(T),因此,以亞硫酸鹽處理後的DNA片段只會保留甲基化的胞嘧啶。
-->
在我們的研究中,我們參考了EAGLE[3]的作法,將EAGLE計算生物體變異的生成機率模型,延伸應用至探討生物體基因序列中,胞嘧啶位點發生甲基化之可能性進行評估。我們的做法是模仿EAGLE,分別給定兩種假設序列,分別為假設的參考基因組序列(Reference sequence)以及假設的替代基因組序列(Alternative sequence)。首先,我們參考變異資料,先評估該變異資料對生物體的可能性評分,將評分後變異可能性發生較高的位點用來修改參考基因組序列,因此我們能得到一組已經將變異可能發生位點替換後的參考基因組。再來,我們將所有CpG位點假設皆已發生甲基化的序列當作參考基因組,接著分別將每個位點各自假設發生甲基化的序列當作替代基因組序列各自加入生成機率模型,並藉由貪婪演算法,分別計算各個CpG位點發生甲基化之最高可能性評估分數,在計算過程中,因為我們的資料來源為經過亞硫酸鹽定序[4]之資料,故在計算過程中,須考慮到來自生物序列資訊正股反股不同之情況。在輸出的結果中,若是原生物個體序列的位點其脫氨組合的可能性越高,則代表該生物個體序列在該位點傾向發生甲基化的程度越高。在EAGLE-Meth的設計架構中,使用者可以透過輸入經過亞硫酸鹽定序以及聚合酶連鎖反應後的個體資訊,來推斷生物個體CpG位點型甲基化發生的可能性評估。
在我們的實驗中,透過我們設計的生成機率模型,經由模型的計算,挑選出各個CpG位點可能發生甲基化之最高可能性分數,與模擬的甲基化實驗數據做比較,我們幾乎可以推測大部分模擬發生甲基化的CpG位點,EAGLE-METH給予之甲基化CpG位點的可能性分數,也大部分符合實際資料。
---
In our study, we extended the EAGLE[3] model for calculating the probability of variation in an organism to evaluate the probability of methylation at the cytosine site in the gene sequence of an organism. Our approach is to emulate EAGLE by giving two hypothetical sequences, a hypothetical Reference sequence, and a hypothetical Alternative sequence. First, we evaluate the likelihood score of the variant to the organism by referring to the variant data and use the sites with a higher likelihood of variation to modify the reference genome sequence, so that we can obtain a reference genome with the replacement of the possible variant sites, call modified reference. Then, we took the sequences that were assumed to be methylated at all CpG sites as the reference genome, then added the sequences that were assumed to be methylated at each site as alternative genomic sequences to the generative probability model, and calculated the highest probability scores of methylation at each CpG site by the Greedy algorithm. In the calculation process, because our data source is the data from bisulfite sequencing [4], we have to take into account the different situations from the biological sequence information of positive and negative strands. In the output, the higher the probability of deamination combination at the site of the sequence, the higher the degree of methylation of the individual sequence at that site. In the EAGLE-Meth design framework, users can infer the likelihood of CpG site methylation by inputting the information of an individual after bisulfite sequencing and polymerase chain reaction.
In our experiments, the highest probability scores for each CpG site were selected by our designed probability model and compared with the simulated methylation experimental data. We could almost predict most of the simulated CpG sites, and the probability scores of methylated CpG sites given by EAGLE-METH were mostly consistent with the actual data.
---
## Chapter 1 : Introduction
### 1.1 Overview
DNA甲基化現象是近幾年備受討論的一項主題,如果能有效預測與判斷DNA甲基化位點,將能夠了解甲基化對生物體的影響,甚至能預防甲基化引起的疾病發生。在我們研究我們的實驗以前,我們先研究EAGLE(Explicit Alternative Genome Likelihood Evaluator),這份研究專案式開發出一款a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual’s genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis.我們將EAGLE的想法做延伸,將其計算評估變異點的方法模擬於DNA甲基化的評估上,製作出一款能評估DNA甲基化位點發生的可能性機率以及保留原本EAGLE評估變異點的可能性機率之工具,我們稱其為EAGLE-Meth。
---
The phenomenon of DNA methylation has been a topic of much discussion in recent years. If DNA methylation sites can be effectively predicted and determined, it will be possible to understand the effects of methylation on organisms and even prevent the occurrence of methylation-induced diseases. Before studying our experiments, we first studied EAGLE (Explicit Alternative Genome Likelihood Evaluator), a research project that developed a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual’s genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis.
We have extended the idea of EAGLE by simulating its method of calculating variant evaluation to DNA methylation evaluation. And created a tool that can evaluate the probability of occurrence of DNA methylation sites and retain the probability of variant evaluation, which we call EAGLE-Meth.
---
### 1.2 Motivation
隨著科技進步,分子生物學的技術突飛猛進,基因體定序的技術也出現突破性的發展,即所謂的NGS(Next Generation Sequencing)技術,此技術對於生物學來說,有著巨大的影響。不同於早期的Sanger定序技術,共花費了大約十年,由六個國家一同合作完成,但也僅解出人類30億個鹼基。如果採取現今的NGS技術,則可以在幾天的時間內完成。使用Sanger進行基因組定序時,須先將目標基因一段一段的增幅放大,每段sequence長約1000bp,定序完後再進行比對與組裝,操作繁瑣且費時費工費錢;而次世代定序的技術是將基因組切成小片段,並接上轉接子(adapter),再藉由材料科學與影像系統的協助來進行快速且高通量的定序。而其中全基因體DNA甲基化定序則是次世代定序研究中的質重要主題之一,DNA甲基化對於脊椎動物的基因表現有一定程度上的影響,對於哺乳動物來說,甲基化的發生,也許會導致不良的疾病甚至癌症的發生。因此再研究完衡量基因體變異的生成機率模型(EAGLE)的MODEL後,我們有個全新想法,我們想要延伸EAGLE的做法,開發出一種以生成機率模型(generative probabilistic model, GPM)作為基本的運算方式,用來計算甲基化定序資料中的胞嘧啶甲基化程度,讓我們能夠較有效的判斷甲基化的發生程度,來協助窺探甲基化對我們的影響,以達到疾病患癌症的預防或是治療。
---
With the advancement of technology, molecular biology is advancing rapidly and there is a breakthrough in genome sequencing technology, the Next Generation Sequencing (NGS), which has a huge impact on biology. Unlike the early Sanger sequencing technology, which took about 10 years and was done by six countries, only 3 billion human bases were solved. With today's NGS technology, this can be done in a matter of days. When using Sanger for genome sequencing, the target gene has to be amplified one by one, each sequence is about 1000bp long, and then compared and assembled after sequencing, which is a tedious process, time-consuming, and costly. The NGS technology is to cut the genome into small fragments and connect them to adapters, and then perform rapid and high-throughput sequencing with the help of material science and imaging systems.DNA methylation is one of the most important topics in NGS research. DNA methylation has a certain degree of influence on the genetic performance of vertebrates. And for mammals, the occurrence of methylation may lead to adverse diseases or even cancer. Therefore, after studying the generative probabilistic model (GPM) of genomic variation in EAGLE, we have a new idea to extend the approach of EAGLE and develop a GPM as a basic computational approach to calculate the degree of cytosine methylation in sequencing data. Thus allowing us to determine the occurrence of methylation more effectively, and help to explore the impact of methylation on us for disease prevention and treatment of cancer.
---
### 1.3 Research framework
The organizational structure of our paper is as follows: In chapter 1, we introduce the research background, the overview of EAGLE-Meth, the technology, and the issue of methylation in biology, including sequencing technology, how methylation affects human gene expression, and problems of methylation caused. Thence, we decided to research EAGLE and extend it to evaluate the sequence's methylation level.
In the chapter 2, 我們介紹這篇研究的背景知識,包含先前的研究:EAGLE,甲基化介紹,bisulfite sequencing技術以及一些資料型態.
In the chapter 3, 這個章節主要介紹,一些related work以及最重要的EAGLE-METH架構和做法。
In the chapter 4,這個章節我們介紹這個研究所做的實驗,以及實驗的環境是使用那些工具,還有不同的case是如何配置的。
In the chapter 5,也是最後一個章節,我們對這次的研究,EAGLE-METH做一個結論,並且提出有那些問題是未來還需要改進的。
---
Chapter two, we introduce the background knowledge of this study, including the previous studies: EAGLE, introduction to methylation, bisulfite sequencing technique, and some data types.
Chapter three introduces some related work and the EAGLE-METH framework and practices.
Chapter four introduces the experiments in this study, the experimental environment using those tools, and how different cases are configured.
In chapter five, which is also the last chapter, we conclude this study, EAGLE-METH, and propose those problems that still need to be improved in the future.
---
## Chapter 2 :Background
### 2.1 EAGLE
EAGLE is our previous research, which is a method to calculate genome variant likelihood.EAGLE incorporates candidate variants into explicit hypotheses about the individual’s genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. First of all, you need to give three different inputs: the reference genome sequence, the called variant candidates in a vcf format and reads information(see figure 1). In EAGLE, we will evaluate candidate genome variant amounts by considering two hypothetical genome sequences, respectively, the reference sequence, and the alternative sequence. With the given data, we could obtain the quality of candidate variants by examining whether the reads better support the hypothesis that they were sequenced from the reference genome, or rather being more likely to be sequenced from an alternative sequence consisting of the candidate variants, indicating a higher probability that the candidate variants are present.
---
EAGLE是我们之前的研究,是一种计算基因组变异可能性的方法。EAGLE将候选变体纳入关于个体基因组的明确假设中,然后计算每个假设下的观察数据(测序读数)的概率。首先,你需要提供三个不同的输入:参考基因组序列、vcf格式的候选变体和读数信息(见图1)。在EAGLE中,我们将通过考虑两个假设的基因组序列,分别是参考序列和替代序列,来评估候选基因组变异量。有了给定的数据,我们可以通过检查读数是否更好地支持从参考基因组测序的假设,或者说更有可能是从由候选变体组成的替代序列测序而获得候选变体的质量,表明候选变体存在的概率更高。
---
***figure 1 :A high level overview of the EAGLE workflow***

In brief, EAGLE will use the read information that is aligned to the specific region reference sequence position according to the variant calling file.EAGLE divides different variants into the different variant sets and combines them, considers all variant combinations, and then replaces the corresponding position of the reference, generates an alternative sequence, and performs calculations.
Therefore, we applied the eagle extension to our EAGLE-METH research. We simulated the probability model of EAGLE generation and changed its calculation mode. The bisulfite sequence considering forward strand and reverse strand shares are added. When read from the forward strand, C to T must be taken into account, and when read from the reverse strand, G to A must be taken into account. We will also give two hypothesis sequences, respectively, reference sequence and alternative sequences. The detailed system architecture will be explained in Method.
### 2.2 Methylation
DNA甲基化是DNA的一種化學修飾形式,可以改變基因表達而不改變DNA序列。作為表觀遺傳密碼的一部分,它是一種表觀遺傳機制。 DNA甲基化的過程是在DNA分子上添加甲基,如在胞嘧啶環的5'碳上(圖2)。 DNA甲基化構成了最穩定的表觀遺傳修飾類型,調節哺乳動物基因組的轉錄可塑性。
---
DNA methylation is a form of chemical modification of DNA that can alter gene expression without altering the DNA sequence. As part of the epigenetic code, it is an epigenetic mechanism. The process of DNA methylation adds methyl groups to DNA molecules, such as on the 5' carbon of the cytosine ring(Figure 2.).DNA methylation constitutes the most stable type of epigenetic modification modulating the transcriptional plasticity of mammalian genomes.
---
***figure 2 : Methylation***

In general, in mature somatic tissues, methylation usually occurs at the CpG dinucleotide site, where the base sequence appears as cytosine followed by guanine. It is worth noting that 70-80% of cytosine at CpG sites is methylated, but most of the methylation occurs outside the CpG island. CpG island refers to a region rich in CpG sites. Generally, the formal definition of a CpG island is: a fragment with a length of at least 200bp, with a CG content of more than 50% (figure 3).
Benign DNA methylation helps organisms grow, while bad DNA methylation increases gene silencing(figure 4) and the incidence of disease .therefore, in our research, EAGLE-METH, we will focus on cytosine methylation to facilitate research and prevent the consequences of poor methylation.
***figure 3 : CpG island***

***figure 4 : Gene silencing***

### 2.3 Bisulfite sequencing
Bisulfite sequencing[5] using next generation sequencers yields genome-wide measurements of DNA methylation at single nucleotide resolution. Traditional aligners are not designed for mapping bisulfite-treated reads, where the unmethylated Cs are converted to Ts.
<!-- In basic biological research, and in numerous application fields such as diagnostics, biotechnology, forensic biology, biosystematics, knowledge of DNA sequences has become indispensable knowledge. The rapid sequencing speed with modern DNA sequencing technologies has helped to achieve sequencing complete DNA sequences, or genome sequencing of many types and complete DNA of living species, including the human genome and many other animal, plant and microbial species sequence. -->
In the field of DNA methylation research in the field of biology, there is also an equally important sequencing technology, that is, bisulfite sequencing. It is of great help in the field of methylation research to be able to determine the methylation sites of DNA sequences through this chemical reaction. The main process of bisulfite sequencing is divided into bisulfite and PCR. We will introduce them in the next paragraph. Nowadays, the mainstream technique used to research DNAm is the WGBS(Whole Genome Bisulfite Sequencing)and the targeted bisulfite sequencing.WGBS allows for measuring methylation in the whole human genome, and the targeted bisulfite that allows the sequence of the specific genomic regions. Both techniques belong to Next Generation Sequencing.
#### 2.3.1 Bisulfite
Bisulfite treatment is a common chemical reaction treatment used to judge DNA methylation sites. It mainly converts the unmethylated cytosine in the sequence into uracils through bisulfite treatment cytosine, which has been methylated, is directly retained without changing, to distinguish each other[6].
#### 2.3.2 PCR
Polymerase chain reaction(PCR) is a method used to the principle of DNA double-strand replication , nucleic acid synthesis technology that replicates specific DNA fragments in vitro.Through this technology, the target gene can be amplified in large quantities in a short period of time without relying on organisms such as Escherichia coli or yeast.To used the technologies of PCR, researchers and scientists do not have to worry about the amount of DNA fragments they get is too trace.
Typical bisulfite sequencing technology include PCR.After PCR,amplification coverts the Uracils from these non-methylation Cytosines into Thymines by a deamination process.
### 2.4 Bisulfite sequencing - Example
Give an example, as shown in figure 5. This is a double-stranded DNA fragment, and it can be seen that the position 3 site of the original forward strand (OT) is non-methylated cytosine, while the position 5 site is methylated cytosine. In addition, in the original reverse strand (OB), it can be seen that the position 1 site is non-methylated cytosine, and the position 6 site is methylated cytosine.
First step, we reacted the two strands with bisulfite. After the reaction,we can observe that the position 3 site of the forward strand, the original non-methylated cytosine has been converted into Uracil, and the position 5 site of the original methylated cytosine it remains unchanged as cytosine. Similarly, the position 1 site of the forward strand is converted to Uracil, while the position 6 remains the same as cytosine, which is the result of bisulfite treament.
Second step, after the bisulfite treament, we will do PCR.The function of PCR is to amplify the signal, that is, to copy the DNA fragments, so that each DNA segment can produce a large number of the same fragments.This method is mainly to make some originally small fragments, and then in the subsequent experiments, in order not to let them be ignored, so we copy them and produce a large number of the same sequences fragments. Although this method can avoid a small number of sequences fragments being ignored, it may also cause some effects, such as enlarging the wrong sequences, etc. At the figure 5,we show the result of PCR, there are four strands, OT, CTOT, OB, CTOB. They are original top, complement of original top, original bottom and complement of original bottom,respectively.We can find the result, comparing whit original forward strand, the non-methylation cytosine that locate in position 3 site,from Uracil becomes Thymine.Comparing whith the position 1 site in original reverse strand,we can see the from Uracil to Thymine,and so on.And we can compare with original forword strand,the position 1 in CTOB becomes A from G.
***figure 5 : Bisulfite sequencing***

### 2.5 Data format : FASTQ,FASTA,BAM,SAM
In the field of genomic biology, it is necessary to store a large amount of genomic information before analysis and research. Therefore, the file format for storing information is relatively important. Common data storage formats are as follows: FASTA,FASTQ, SAM, BAM, VCF, etc
FASTA [7] is a text format used to record nucleic acid sequences or peptide sequences in which nucleic acids or amino acids are encoded in a single letter. This format also allows you to define names and write comments before sequences.A complete sequence in FASTA format, including the first single line description line and multiple lines of sequence data. The leading half of the description line is greater than the sign (">") to distinguish it from the column. The content immediately after ">" is the identification code of the sequence, and the rest of the line is the description of the sequence.
FASTQ [8] is a file format for storing biology sequences and their sequencing quality score information. The sequence format of Fastq can be thought of as the fasta format plus the respective quality values of the bases.
SAM [9] is the abbreviation of Sequence Alignment Map. It is a text-based format originally for storing biological sequences aligned to a reference sequence. Through the content of the SAM file, we can flexibly describe the status of various comparisons. In addition, we can also use the SAMTools provided by the author to grab specific regions, merge or sort the results of the alignment, and even grab the corresponding sequence data according to different alignment conditions, etc.
BAM is also a text-based format originally for storing biological sequences aligned to a reference sequence. Different from SAM, SAM is understandable by humans, BAM is compressed into binary.
VCF [10] is the abbreviation of variant calling format. It is used to record the sequenced data, which contains information about different mutations, such as SNP, indel, etc. There are 9 lines of variation information, recording CHR chromosome number, POS position, ID variation number (whether in dbSNP), REF sequence of this site on the reference gene, ALT variation sequence, QUAL sequencing quality of this sequence, whether filter passes the set screening conditions, format information, and the status of samples at this site.
## Chapter 3 : Method and Related work
### 3.1 Related work
#### 3.1.1 Burrow-Wheeler Aligner
Burrow-Wheeler Aligner[11] is a package for mapping DNA sequences against a large reference genome, such as the human genome. It has two major components, one for read shorter than 150bp and the other for longer reads. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. We used BWA command to index the reference that after executing the index function of bwa-meth ( including converting forward strand C into T and reverse strand G into A ).
#### 3.1.2 BWA - meth
We used BWA-meth[12] in our research mainly. BWA-meth is an aligner for bisulfite-treated sequences. BWA-meth is similar in function to BWA-MEM[13]. BWA-MEM is a popular alignment tool, used for genome sequence alignment. In our experiment, we used BWA-meth to align the reference and the bisulfite-treated sequences, which were generated by MethlFASTQ.
#### 3.1.3 MethlFASTQ
MethylFASTQ[14] is a Python tool to simulate bisulfite sequencing data in a highly customizable way. In the past, when we wanted to simulate the results of Bisulfite treated experiments, we used to use a tool called Sherman. Compared with Sherman, MethylFASTQ provides more parameters and is easier to operate.
First, we give a genome reference sequence as input, the parameters that can be set are sequencing mode: single-end or paired-end, directional or non-directional of reads. MethylFASTQ also allows choosing different modes like WGBS experiment or targeted-BS data. And user can define the fragment size, like the read length and the depth of coverage. Moreover, methylation can be set through three context-based probabilities, such as CG, CHG, and CHH (H= A, T, or C). The user can also set probabilities about the sequencing errors and the probability that a SingleNucleotide Polymorphism.
Two output files are returned: a FASTQ file and a methylation call file. If the user gives the case of single-end sequencing, it would produce one FASTQ file. And in the case of paired-end sequencing, you would get two FASTQ files, which contain the forward and reverse reads. Additionally, the methylation call file contains the sequenced methylated cytosines information that produces from the MethylFASTQ tool.
#### 3.1.4 SAMtools
SAMtools[15] 是一種常見的工具,用來處理巨量生物資訊格式,像是BAM file或是SAM file。基本上 ,SAMtools有三大主要功能 : 第一個是檔案格式的轉換,可以將bam file 轉換成 sam file , 也可以將 sam file 轉換成 bam file。 第二個常用功能是 reads 的篩選,可以過濾不想留著的reads或是查看alignment的depth和reads mapping status等等。第三個也是蠻常見的功能就是calling variant,SAMtools 提供很多不同的參數設定,可以快速地瞭解整體bam file的狀態,也是相當方便的功能!
在我們的實驗中,處理reads資料也使用SAMtools的轉換功能,將 alignment 完的 bam file 轉換成 sam file , 再透過排序和給定index,把產出的reads資訊整理成適合輸入EAGLE-METH的格式。
---
SAMtools [15] is a common tool for handling huge amounts of biological information formats, such as BAM files or SAM files.
Basically, SAMtools has three main functions: the first one is file format conversion, which can convert bam file to sam file, and sam file to bam file. The second common function is reads filtering, which can filter the reads you do not want to keep or check the depth of alignment and reads mapping status, etc. The third common function is calling variant, SAMtools provides many different parameter settings to quickly understand the status of the overall bam file, which is also a very convenient function.
In our experiment, we also use SAMtools' conversion function to convert the aligned bam file into sam file and then transform the generated reads information into a format suitable for inputting EAGLE-METH by sorting and giving index.
In our experiment, we also use SAMtools' conversion function to convert the aligned bam file into sam file and then transform the generated reads information into a format suitable for inputting EAGLE-METH by sorting and giving index.
---
### 3.2 Method
EAGLE-METH是一個保留原本EAGLE變異點評估功能且同時擁有DNA甲基化程度評估的工具。
在EAGLE-METH的做法中,我們input的資料與原版的EAGLE相同,需要the reference genome以及candidate variants in a vcf format和the aligned sequencing reads in a bam file format。但不同於原版EAGLE,EAGLE-METH輸入的BAM file是bisulfite sequencing data,是經過bisulfite treated的基因片段,那在我們的實驗中,是採用MethylFASTQ工具所模擬的reads資料。
接著我們將介紹EAGLE-METH的流程與架構(see figure 6):
---
EAGLE-METH is a tool that retains the original EAGLE variant evaluation function and also has the DNA methylation evaluation.
In the EAGLE-METH approach, we input the same data as the original EAGLE, the reference genome and candidate variants in a vcf format and the aligned sequencing reads in a bam file format. Unlike the original EAGLE, the BAM file entered by EAGLE-METH is bisulfite sequencing data, which are bisulfite-treated gene fragments, and in our experiments, the reads are simulated by the MethylFASTQ tool.
Then we will introduce the process and structure of EAGLE-METH in Figure 6.
---
***figure 6 : EAGLE-METH workflow***

#### 3.2.1 Genome reference
首先,根據原版EAGLE,我們知道EAGLE會輸出最符合給定之sequence的variant set情況。因此,我們藉由執行完一次原版EAGLE,將輸入的reference以及VCF,VCF則包含所有可能發生variant的情況,再以EAGLE執行完選出最符合的variant set,把這些variant sites 拿來當標準,修改我們的reference,讓它修改成包含variant後的情況。所以我們得到一條全新的modified reference sequence(figure 6. 藍色的長方形的部分).
這個做法可以幫助我們在之後判斷正股C to T以及反股G to A的情況時,不會被這些bases的轉變所影響我們的假設與結果,也就是去handle像是SNP的情況,能夠幫助我們專心於找尋CpG methylation issue。
---
First, according to the original EAGLE, we know that EAGLE will output the variant set that best matches the given sequence. Therefore, we run the original EAGLE once, select the most compatible variant set, take these variant sites as the standard, and modify our reference so that it is modified to include the variant case. So we get a new modified reference sequence (figure 6. the blue rectangular part).
This approach can help us to evaluate the forward strand C to T and the reverse strand G to A cases later on, without the changes of these bases affecting our assumptions and results, i.e. to handle cases like SNPs, which can help us to focus on searching for CpG methylation issues.
---
#### 3.2.2 Read information
在我們的輸入檔案中,除了reference和VCF,還有一項就是Read的資訊。不同於EAGLE,在EAGLE-METH架構中,我們的read必須是經過bisulfite treated,也必須同時包含variant以及methylation的資訊。在我們的實驗中,使用的是模擬的bisulfite sequencing實驗結果,將MethylFASTQ工具產生出的FASTQ file(也就是reads 資訊),透過不同的基因測序工具做處理,來符合EAGLE-METH輸入格式。
首先,我們將read的FASTQ file透過使用BWA-meth的工具,對FASTQ file進行資料前處理,把我們要輸入的reference sequence與read做mapping,並將mapping完的結果產出一個SAM file。接著,使用SAMtools工具,因為BWA-meth產出來的SAM file並不是binary format,所以我們用SAMtools將其轉換成BAM file,轉成BAM file後在做檔案sort 以及給定index的動作。而這個處理完的BAM file就是包含forward strand以及reverse strand的資訊。在EAGLE-METH的機率模型中,我們要處理的就是BAM file裡的forward strand和reverse strand,當計算的read是forward strand時,需要注意的就是C to T的情況;反之,計算的read是reverse strand的情況時,則需要處理G to A的情況(figure 6. 黃色長方形的部分)。
---
In our input file, in addition to the genome reference and the vcf file, there is also the read information. Unlike EAGLE, in the EAGLE-METH framework, our reads must be bisulfite-treated and must contain both variant and methylation information. In our experiments, we use the simulated bisulfite sequencing results, and the FASTQ file generated by the MethylFASTQ tool is processed by different genetic sequencing tools to conform to the EAGLE-METH input format.
First, we use the BWA-meth tool to pre-process the FASTQ file, mapping the reference genome sequence we want to input with the read, and generating a SAM file from the mapping result. Then, we use the SAMtools tool, because the SAM file generated by BWA-meth is not in binary format, so we use SAMtools to convert it to a BAM file, and after converting it to the BAM file, we do the file sort, and the given index action. The processed BAM file is the one that contains the forward strand and the reverse strand information. In the EAGLE-METH probability model, we have to deal with the forward strand and reverse strand in the BAM file. When the calculated read is the forward strand, we need to pay attention to the C to T case; conversely, when the calculated read is the reverse strand case In the case of G to A, it is necessary to deal with the case of G to A (figure 6. the yellow rectangular part).
---
#### 3.2.3 Hypothetical sequences
參考EAGLE的作法,在EAGLE-METH中,也是給定兩種不同的Hypothetical sequences。第一條為Hypothetical reference sequence,在與不同的read比對計算機率的程度,我們會使用之前介紹的modified reference sequence,重新將這條modified reference sequence作假設,將位於這條reference上的所有CpG sites皆假設發生DNA甲基化,如果是forward strand的'C' position我們就以字母'M'取代,反之reverse strand的'C' position則以字母'W'取代,而這條全新的reference就是要輸入EAGLE-METH的Hypothetical reference。所以字母集也發生擴增,目前的nucleotide bases set改變為: A, T, C, G, M, W 。
第二條為Hypothetical alternative sequence,在EAGLE-METH中,在每次判斷要計算假設的甲基化位點時,我們會將鄰近的甲基化位點分成不同的子集合,在每次輸入模型計算前,會將每個子集各自做combination,考慮每種不同甲基化假設的可能性,因為我們認為DNA甲基化發生的情況也許會被鄰近的甲基化影響。所以每條不同的甲基化位點假設,皆是一條各自獨立發生的假設情況,也就是我們的ALTERNATIVE sequence(see figure 7.)。
---
In EAGLE-METH, two different Hypothetical sequences are also given.
The first one is the hypothetical reference sequence, and we will use the previously introduced modified reference sequence in order to compare the calculated probability with different reads. If the 'C' position of the forward strand is replaced by the letter 'M', and the 'C' position of the reverse strand is replaced by the letter 'W'. The new reference is to enter the Hypothetical reference of EAGLE-METH, so the set of letters has been expanded and the current nucleotide bases set has been changed to A, T, C, G, M, W. The second is the hypothetical alternative sequence. In EAGLE-METH, when we determine the methylation sites to be calculated each time, we will divide the neighboring methylation sites into different subsets, and before each input model calculation, we will do a combination of each subset to consider the possibility of each different methylation. We will consider the possibility of each methylation hypothesis because we believe that the occurrence of DNA methylation may be affected by the neighboring methylation. Therefore, each different methylation site hypothesis is a hypothetical situation that occurs independently, which is our alternative sequence (see figure 7).
---
***figure 7 : methylation set & how combination run***

#### 3.2.4 Calculate marginal probability
EAGLE-METH的計算機率模型是參考EAGLE的算法(see figure 8.),分別計算兩條不同的Hypothetical sequences,並挑選出likelihood較高的機率。在EAGLE-METH中,我們需要考慮到正股以及反股會有不同的情況,所以也會有不同的計算機率,分別處理正股發生的C to T以及反股發生的G to A。並且將原本設定的依據quality score產生的probability table修改成符合bisuflite sequence的probability matrix(建表是參考ambigous codes)。
---
The probability model of EAGLE-METH is based on EAGLE's calculation algorithm (see figure 8.), which calculates two different hypothetical sequences separately and selects the probability with higher likelihood. In EAGLE-METH, we need to take into account the different situations of the forward strand and the reverse strand. Therefore, there will be different calculation probabilities, and handle the C to T of the forward strand and the G to A of the reverse strand respectively. The original probability table based on the quality score is modified into a probability matrix that conforms to the bisulfite sequence.
---
***figure 8 : Hypothetical sequences calculate in EAGLE***


計算的過程是先將reference切割成小區域,每次皆計算一個小區域而不是一次計算全部的reference sequences以及read sequences
接著延續之前的內容(3.2.3),我們把combination後的每個組合各自假設成不同的ALTERNATIVE sequence,以Greedy algorithm的作法,保留每次做完likelihood計算的最大值,最後以最大值當作我們ALTERNATIVE sequence的最終值,在以這個結果與 reference sequence 的值做比較,選出機率較高的一方,即是EAGLE-METH的計算過程。
最後EAGLE-METH的輸出為一個.tab檔(see figure 9.),並且能同時保有原本EAGLE的variant評估功能,輸出一個variant評估檔案。EAGLE-METH輸出欄位分別為:Chromosome ID, Position, Reference base, Alternative base, Reads, Read from reference, Read from alternative, Marginal probability, Odds。一些重要欄位像是: chromosome ID 代表實驗的目標是來自哪條chromosome; reference base and alternative base 代表在這著位點,發生甲基化,並且這個甲基化是來自正股的話會以'M'表示,反之則以'W'表示; Reads 代表的是pile-up到這個位置的read總數; Marginal probability則代表的是EAGLE-METH計算的probaility,代表發生甲基化的'C'的機率/(發生甲基化的'C'的機率+位發生甲基化的'C'的機率),如果該值越高,代表它的methylation level越高,反之越低則表示methylation level越低。
---
The calculation process is to first cut the reference into small regions and calculate one region at a time instead of calculating the whole reference sequence at once.
Then, we continue the previous content section 3.2.3, we assume each combination into a different alternative sequence and use the Greedy algorithm to keep the maximum value after each likelihood calculation, and finally use the maximum value as the final value of our alternative sequence. And this result is used to compare with the value of the reference sequence to select the one with higher likelihood probability, which is the calculation process of EAGLE-METH.
The final output of EAGLE-METH is the .tab file including methylation probability information (see figure 9.), and it retains the variant evaluation function of the original EAGLE, outputting a variant evaluation file. Some important fields are: chromosome ID field represents the chromosome from which the target of the experiment is chromosome;
The reference base field and alternative base field represent the site where the methylation occurred, and this methylation is indicated by 'M' if it is from the forward strand, and 'W' if it is from the reverse strand;
Reads field represents the total number of reads pile-up to this position; Marginal probability represents the probability calculated by EAGLE-METH, which represents the probability of methylation of 'C' / (probability of methylation of 'C' + probability of non-methylation of 'C') if the value is higher, it means that the methylation level is higher, and conversely, the lower the methylation level is.
---
***figure 9 : EAGLE-METH output format example***

我們相信藉由將原本EAGLE的機率模型做延伸,修改其計算模式,考慮Bisulfite treated與甲基化特性,我們也能得到一個同時具有基因變異點評估以及評估DNA甲基化位點發生機率的工具。
---
We believe that by extending the original EAGLE probability model and modifying its computational model to take into account bisulfite treated and methylation properties, we can also obtain a tool for both genetic variant assessment and DNA methylation site occurrence assessment.
---
## Chapter 4 : Experiment and Results
在接下來的章節中,我們將介紹實際使用EAGLE-METH這個工具的幾個實驗。
### 4.1 Environment
在EAGLE-METH的實驗中,我們使用一款能彈性的模擬Bisulfite treated的工具,也就是MethylFASTQ,以及一些處理生物基因資訊資料的工具。
---
In the EAGLE-METH experiment, we used the MethylFASTQ that can flexibly simulate bisulfite-treated data and some tools to process biological genetic information.
---
---
***Table 1. Experiment configuration***
<!-- | Tools | Description |
| -------- | -------- |
| EAGLE-METH | 基因變異點評估以及DNA甲基化計算likelihood工具
||Input: reference genome(FASTA file), candidate variant(VCF file), aligned sequencing reads(BAM file)
||Output: Variants table and methylated points table|
| MethylFASTQ | Bisulfite treated data simulator
||Input: reference genome(FASTA file)
||Output: Reads(FASTQ), methylation call file
| BWA-meth | Alignment tool, design for Bisulfite treated & methylated files
||Input: reference genome(FASTA file), reads(FASTQ file)
||Output: aligned sequencing reads(SAM file)
| SAMtools | Change sequence alignment/map file into binary alignment/map file
||Input: SAM file
||Output: BAM file -->

---
用於EAGLE-METH的實驗是選用人類基因資料庫GRCh37第22條染色體片段,以及The Arabidopsis ChrM的染色體片段,經由MethylFASTQ模擬經過bisulfite treatment的sequence以及一份methylation call file。藉由給定不同的methylation level以及不同物種以EAGLE-METH來評估,比較評估的結果是否接近正確的methylation call file。
---
For the EAGLE-METH experiment, the human GRCh37 chromosome 22 fragment and the Arabidopsis chromosome M fragment were used to simulate the bisulfite-treated sequences and a methylation call file by MethylFASTQ. The results were evaluated by EAGLE-METH at different methylation levels and different species to compare whether the results were close to the correct methylation call file.
---
### 4.2 Experiment
***figure 10 : Workflow of my experiment***

圖九展示我們的實驗主要流程,分別處理我們EAGLE-METH三個主要的輸入。首先,看到圖九左邊的部分,我們使用MethylFASTQ,處理感興趣的target reference genome fragment。MethylFASTQ會輸出圖中黃色部分的simulated bisulfite treated reads,檔案格式為FASTQ,另外還會輸出一個methylation call file。接著我們分別使用BWA-meth這個alignment tool,將simulated reads作前處理,將reads與reference genome作alignment並產出一個SAM file。再來使用SAMtools去處理檔案格式,將SAM file轉換成BAM file,才能作為EAGLE-METH的輸入,到此步驟,我們處理好bisulfite sequence。
另外看到圖九右邊的部分,我們將candidate variants資訊也就是紅色區域與reference genome先使用EAGLE將variant發生可能性最高的位點作轉換,得到一條處理完variant sites的sequence,也就是modified reference,此做法能降低之後作methylation計算時,被variant影響的可能性。
最後將read的BAM file以及modified reference輸入EAGLE-METH作運算,以上就是我們的實驗流程。
---
Figure 9 shows the main flow of our experiment, processing each of the three main inputs of our EAGLE-METH. First, see the left part of Figure 9, we use MethylFASTQ to process the target reference genome fragment of interest. Then we use BWA-meth, an alignment tool, to pre-process the simulated reads, align the reads with the reference genome and generate a SAM file. SAMtools is used to process the file format and convert the SAM file to a BAM file, which can be used as the input of EAGLE-METH.
In addition, in the right part of Figure 9, we convert the information of candidate variants. That is, the red area and the reference genome, to the sites with the highest possibility of variant occurrence using EAGLE, and get a sequence that has finished processing the variant sites, namely modified reference, which can reduce the possibility of being affected by variants when doing methylation calculation later.
Finally, the read BAM file and the modified reference will be input into EAGLE-METH for calculation, and the above is the process of our experiment.
---
***Table 2. Experiment Data***
<!-- | Data (species) | Description |
| -------- | -------- |
| Chr 22 | from:GRC37第22條染色體,取其片段連續27920bp |
| Chr M | from:Arabidopsis第M條染色體,取其片段19530bp| -->

在我們的實驗中,我們是選用兩種不同物種的基因片段,其中一種是人類基因體資料庫中的GRCh37,並選取其中第22條染色體,擷取其目標片段。另外一種實驗資料是選取Arabidopsis的第M條染色體,一樣擷取其目標片段。
---
In our experiments, we used two different species of gene fragments, one of which is GRCh37 from the human genome database, and selected chromosome 22 to retrieve the target fragment. The other one is the chromosome M of Arabidopsis, and the target fragment was also extracted.
---
***Table 3. Different cases in our experiment***
<!-- | Case | Description |
| -------- | -------- |
| Case 1 | FASTA: fragment of GRC37 Chr22
||Read sequence: CpG Probability = 70%
||VCF: artificial variant |
| Case 2 | FASTA: fragment of GRC37 Chr22
||Read sequence: CpG Probability = 30%
||VCF: artificial variant |
| Case 3 | FASTA: fragment of Arabidopsis ChrM
||Read sequence: CpG Probability = 70%
||VCF: artificial variant |
| Case 4 | FASTA: fragment of Arabidopsis ChrM
||Read sequence: CpG Probability = 30%
||VCF: artificial variant |
-->

看到Table3,在我們的實驗中總共設計4種不同的情況。分別是人類染色體GRCh37的第22條染色體以及Arabidopsis的第M條染色體,並且經由MethylFASTQ模擬兩種不同的DNA甲基化程度分別是CpG=30%以及CpG=70%,實驗結果如下面展示:
---
As you can see in Figure 3, there are four different cases in our experiment. The human chromosome GRCh37 chromosome 22 and Arabidopsis chromosome M were simulated by MethylFASTQ at two different levels of DNA methylation, CpG=30%, and CpG=70%, and the experimental results are shown below:
---

***figure 11 : Case 1 each of position result***
<!--  -->


根據Table4 可以發現在case1中,在fragment of Chr22且CpG的methylation level為70%的情況下,EAGLE-METH計算的Marginal probability與MethylFASTQ產出的methylation call file比對,可以發現EAGLE-METH的計算結果是蠻接近methylation call file的。
另外可以看到figure 11,我們統計了不同的機率情況下各自發生的probability frequency為多少,從圖中也能發現,藍色的EAGLE-METH的統計數量是落在0.6~0.8,同樣的紅色部分,MethylFASTQ的methylation level的統計數量則是落在0.5~0.9之間。可以發現EAGLE-METH的結果較為集中,但評估的結果也是接近正確的答案。
並且我們將各個CpG site的probability以小排序到大並將兩個工具的結果作比較,紅色的線為MethylFASTQ而藍色的線為EAGLE-METH,可以發現兩工具的結果是接近的。
---
According to Table4, we can find that in the case1, with the fragment of Chr22 and the methylation level of CpG at 70%, the Marginal probability calculated by EAGLE-METH and the methylation call file generated by the MethylFASTQ file, we can find that the calculation result of EAGLE-METH is close to the methylation call file.
In addition, we can see that in Figure 11, we have calculated the probability frequency of each occurrence in different probability cases, and from the figure, we can also find that the statistical quantity of blue EAGLE-METH falls in 0.6~0.8, and the statistical quantity of the red part, the methylation level of MethylFASTQ falls between 0.5 and 0.9. It can be found that the results of EAGLE-METH are more concentrated, but the evaluation results are also close to the correct answers.
We also compared the results of the two tools by ranking the probability of each CpG site from small to large. The red line is MethylFASTQ and the blue line is EAGLE-METH, we can find that the results of the two tools are close to each other.
---

***figure 12 : Case 2 each of position result***


根據Table5,在case2中,fragment of Chr22且CpG的methylation level為30%,可以發現EAGLE-METH的計算結果是蠻接近methylation call file的。
另外可以看到figure 12,從圖中能發現,藍色的EAGLE-METH的統計數量主要是落在0~0.3,同樣的紅色部分,MethylFASTQ的methylation level的統計數量則是落在0~0.5之間。可以發現EAGLE-METH的結果依然較為集中,但評估的結果也是接近正確的答案。
並且我們將各個CpG site的probability以小排序到大並將兩個工具的結果作比較,紅色的線為MethylFASTQ,藍色的線為EAGLE-METH。
---

***figure 13 : Case 3 each of position result***


根據Table6,在case3中,fragment of ChrM且CpG的methylation level為70%,可以發現EAGLE-METH的計算結果是蠻接近methylation call file的。
看到figure 13,藍色的EAGLE-METH的統計數量主要集中在0.5以及0.7~1,紅色部分,MethylFASTQ的methylation level的統計數量則是落在0.5~1之間。
接著同樣將各個CpG site的probability以小排序到大並將兩個工具的結果作比較,紅色的線為MethylFASTQ,藍色的線為EAGLE-METH。可以發現兩者也是很接近。
---
According to Table 6, in case3, the fragment of ChrM and the methylation level of CpG is 70%. And we can find that the calculation result of EAGLE-METH is quite close to the methylation call file. See figure 13, the statistical quantities of EAGLE-METH in blue are mainly concentrated in 0.5 and between 0.7 to 1, and the statistical quantities of the methylation level of MethylFASTQ in red fall in the range of 0.5~1. Then, we also ranked the probability of each CpG site from small to large and compared the results of the two tools. It can be found that they are also close to each other.
---

***figure 14 : Case 4 each of position result***


根據Table7,在case4中,fragment of ChrM且CpG的methylation level為30%,可以發現EAGLE-METH的計算結果是蠻接近methylation call file的。
看到figure 14,藍色的EAGLE-METH的統計數量主要集中在0 ~ 0.3而平均落在0~0.5,紅色部分,MethylFASTQ的methylation level的統計數量則是落在主要落在0.3且平均落在0~0.5之間。
接著同樣將各個CpG site的probability以小排序到大並將兩個工具的結果作比較,紅色的線為MethylFASTQ,藍色的線為EAGLE-METH。可以發現兩者依然是蠻接近的。
---

最後我們看到Table8 ,這個表格主要統計Case1~Case4的結果,我們可以看到,In case1, EAGLE-METH的平均marginal probability輸出結果在70%的情況下為64.1%,與MethylFASTQ的平均72.3%相差大約8.2%; In case2, EAGLE-METH的平均marginal probability輸出結果在30%的情況下為25.1%,與MethylFASTQ的平均29.6%相差大約4.5%; In case3, EAGLE-METH的平均marginal probability輸出結果在70%的情況下為68.1%,與MethylFASTQ的平均70.6%相差大約2.5%; In case4, EAGLE-METH的平均marginal probability輸出結果在30%的情況下為25.7%,與MethylFASTQ的平均29.3%相差大約3.6%。並且由表中可得知,在4個不同的cases條件,EAGLE-METH的每個position與MethylFASTQ輸出的methylation call file對應的position相減取絕對值,並算出其平均,可以發現平均差值大約落在16%左右。
---
Finally, we see Table8, the table mainly statistics Case1 to Case4 results. In case1, the average marginal probability output of EAGLE-METH is 64.1% in 70% simulated case, and the difference with the average 72.3% of MethylFASTQ is about 8.2%; In case2, the average marginal probability output of EAGLE-METH is 25.1% at 30% simulated, which is about 4.5% different from the average 29.6% of MethylFASTQ; In case3, the average marginal probability output of EAGLE-METH is 68.1% at 70% simulated, which is about 2.5% different from the average 70.6% of MethylFASTQ; In case4, the average marginal probability output of EAGLE-METH is 25.7% at 30% simulated, which is about 3.6% different from the average of 29.3% for MethylFASTQ.
Moreover, from the table, it can be seen that for four different cases, the average difference between each position of EAGLE-METH and the position corresponding to the methylation call file of MethylFASTQ is calculated by subtracting the absolute value and the average value can be found to be around 16%.
---
## Chapter 5 : Conclusion and Future work
經過實驗,分析結果,我們認為EAGLE-METH透過它的機率模型,輸入reference genome,vcf和bisulfite treated read,我們有能力評估DNA甲基化發生的情況,並且透過模擬資料,比對結果,也與正確的答案相近,所以我們認為延伸EAGLE研發EAGLE-METH,一款DNA甲基化評估系統,是具有研究價值的。
但是扔然有一些問題需要解決,
因為reference genome sequence是非常長的資料,再加上我們是用the Greedy演算法做計算,會導致計算時間過長。還有記憶體空間不足的原因,促使我們目前只能計算固定長度的區間。
---
After analyzing the results of the experiments, we believe that EAGLE-METH can evaluate the occurrence of DNA methylation through its probability model, inputting the reference genome, the vcf, and the bisulfite-treated read, and comparing the results with the correct answers through simulated data, so we believe that the extension of EAGLE-METH is worthwhile. Therefore, we believe that the development of EAGLE-METH, a DNA methylation assessment system, is of research value.
However, there are still some issues we can improve. For instance, the total calculation time. Because the reference genome sequence is very long data, and we are using the Greedy algorithm to do the calculation, it will cause the calculation time to be too long. Also, the lack of RAM is the reason that we can only calculate fixed-length regions at present.
---
## Reference
[1] Lisa D Moore, Thuc Le & Guoping Fan. (2012). DNA Methylation and Its Basic Function.
https://www.nature.com/articles/npp2012112
[2] Barker, David L., and Richard E. Marsh. "The crystal structure of cytosine." Acta Crystallographica 17.12 (1964): 1581-1587.
https://scripts.iucr.org/cgi-bin/paper?a04431
[3] Tony Kuo, Martin C. Frith, Jun Sese and Paul Horton.EAGLE: Explicit Alternative Genome Likelihood Evaluator
https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-018-0342-1
[4] Krueger, Felix, et al. "DNA methylome analysis using short bisulfite sequencing data." Nature methods 9.2 (2012): 145-151.
https://www.nature.com/articles/nmeth.1828
[5] Darst, Russell P., et al. "Bisulfite sequencing of DNA." Current protocols in molecular biology 91.1 (2010): 7-9.
https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/0471142727.mb0709s91
[6] Felix Krueger∗ and Simon R. Andrews.Bismark: a flexible aligner and methylation caller for Bisulfite-Seq
applications.Vol. 27 no. 11 2011, pages 1571–1572
doi:10.1093/bioinformatics/btr167
[7] https://www.pnas.org/doi/abs/10.1073/pnas.85.8.2444
[8] https://academic.oup.com/nar/article-abstract/38/6/1767/3112533
[11] http://bio-bwa.sourceforge.net/
[12] Brent S. Pedersen, Kenneth Eyring, Subhajyoti De, Ivana V. Yang and David A. Schwartz. (2005). Fast and accurate alignment of long bisulfite-seq reads.
https://arxiv.org/abs/1401.1129
[13] Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics.
https://arxiv.org/abs/1303.3997
[14] Giulia Piaggeschi, Nicola Licheri, Greta Romano, Simone Pernice, Laura Follia, Giulio Ferrero. (2019). MethylFASTQ: A Tool Simulating Bisulfite Sequencing Data.
https://ieeexplore.ieee.org/document/8671567
[15] Heng Li, Bob Handsaker2, Alec Wysoker2, Tim Fennell2, Jue Ruan3, Nils Homer4, Gabor Marth5, Goncalo Abecasis6, Richard Durbin. 2009. The Sequence Alignment/Map format and SAMtools.
https://academic.oup.com/bioinformatics/article/25/16/2078/204688?login=true