---
tags: Deep Learning Based DNA Sequence Fast Decoding
---

# Leopard

## Training, Validation, and Testing

A "crisscross" training and validation strategy is used to build models and avoid overfitting. For each TF, the authors first collected all available training cell types. A pair of cell types was then selected, one for model training and the other for model validation. Meanwhile, the 23 chromosomes (Chr 1–22 and X) were partitioned into training, validation, and testing sets. Chr 1, Chr 8, and Chr 21 were fixed throughout the study as the testing chromosome set, whereas the remaining 20 chromosomes were randomly partitioned into the training and validation sets.

## The CNN Structure

![](https://i.imgur.com/RO5pSGm.png)

The architecture of Leopard was adapted from an image segmentation neural network, U-Net, which generates a label for every pixel in the input two-dimensional image. More specifically, the Leopard architecture consists of an encoder and a decoder. The encoder contains 5 CCP blocks, each made of two convolution layers followed by one pooling layer. The decoder contains 5 UCC blocks, each made of one upscaling layer followed by two convolution layers.

### The Input Layer

![](https://i.imgur.com/ZRTVEqF.png)

1. The six channels in the first dimension are signals from
   1. DNase-seq
   2. $\Delta$DNase-seq: the difference between a specific cell type and the average of all cell types used in this study, designed to capture potential sequencing biases
   3. (channels 3–6) one-hot-encoded DNA sequences: a standard way of encoding the DNA sequence into numeric values, in which each nucleotide is assigned a specific channel, and the value is one only at positions where that nucleotide occurs
2. The columns in the second dimension correspond to the input length of $10240$ successive genomic positions.

### The Convolutional Layer

![](https://i.imgur.com/d5SYN5V.png)

In each convolutional layer, a kernel size of $7$ was used.
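The one-hot DNA encoding described above can be sketched as follows. This is a minimal illustration only; the channel order A/C/G/T is an assumption, not necessarily the order Leopard uses internally.

```python
import numpy as np

# Hypothetical channel order; the actual order in Leopard may differ.
NUCLEOTIDES = "ACGT"

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string into a 4 x len(seq) one-hot matrix.

    Each nucleotide is assigned its own channel (row); a column is 1
    only in the channel of the nucleotide at that position. Ambiguous
    bases (e.g. 'N') are left as all-zero columns.
    """
    encoding = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        channel = NUCLEOTIDES.find(base)
        if channel >= 0:
            encoding[channel, i] = 1.0
    return encoding

# Example: "ACGT" maps to the 4x4 identity matrix.
print(one_hot_encode("ACGT"))
```

In the full input, this $4 \times 10240$ matrix would be stacked with the DNase-seq and $\Delta$DNase-seq tracks to form the six-channel input.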
Each convolution operation was followed by a ReLU activation: $f(x)=\max(0,x)$.

### The Pooling Layer

In each pooling layer, max pooling was used.

![](https://i.imgur.com/TYaC71a.png)

### The Output Layer

In the final layer, a sigmoid activation unit restricts the prediction value between zero and one:

$f(x)=\frac{1}{1+e^{-x}}$

## The Loss Function

$H(\hat{y},y)=-\sum_{i=1}^N[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)]$

where $y_i$ is the gold-standard label ($1$ for TF binding, $0$ for nonbinding) at genomic position $i$, $\hat{y}_i$ is the prediction value at position $i$, $N = 10{,}240$ is the total number of base pairs in each segment, $y$ is the vector of gold-standard labels, and $\hat{y}$ is the vector of predictions.

**References**

[Fast decoding cell type–specific transcription factor binding landscape at single-nucleotide resolution](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8015851/pdf/721.pdf)

[GuanLab/Leopard - GitHub](https://github.com/GuanLab/Leopard)

[What exactly is the principle behind max pooling in CNNs? (in Chinese) - KKNews](https://kknews.cc/news/jzr8yly.html)
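The per-segment binary cross-entropy loss described in "The Loss Function" can be sketched in NumPy. This is a minimal illustration of the formula, not Leopard's actual training code; the clipping constant is a common numerical-stability convention, not taken from the paper.

```python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Summed binary cross-entropy over one genomic segment.

    y_true: gold-standard labels (1 = TF binding, 0 = nonbinding)
            for each genomic position in the segment.
    y_pred: sigmoid outputs in (0, 1), one prediction per position.
    """
    eps = 1e-7  # clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.sum(y_true * np.log(y_pred)
                         + (1 - y_true) * np.log(1 - y_pred)))

# A confident correct prediction incurs little loss;
# a confident wrong one incurs a large penalty.
y = np.array([1.0, 0.0])
print(binary_cross_entropy(y, np.array([0.99, 0.01])))  # small
print(binary_cross_entropy(y, np.array([0.01, 0.99])))  # large
```

In training, this sum would be taken over all $N = 10{,}240$ positions of each input segment.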