# How AlphaFold Works: My Notes (mostly in the article's own English, and not because I'm lazy)
[AlphaFold 2 Explained: A Semi-Deep Dive](https://towardsdatascience.com/alphafold-2-explained-a-semi-deep-dive-fa7618c1a7f6)
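Protein structures used to be determined with experimental methods such as X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy.
But these methods have many limitations, so the structures of many proteins are still unknown.
In theory, a protein's structure is determined by its string of amino acids, and DNA sequencing can give us that string. In practice, even when we know the string of amino acids, predicting the protein's structure is still very hard.
AlphaFold was built to solve this problem.
Because the AlphaFold team has not officially explained how it works, this material is other people's guesswork based on the AlphaFold 2 announcement.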
Two kinds of neural network come up: convolutional neural networks (CNNs), which AlphaFold 1 used, and Transformers, which AlphaFold 2 appears to use.
A convolutional neural network (CNN, or ConvNet) is a class of deep neural network most commonly applied to analyzing visual imagery.
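
To make "convolutional" a bit more concrete, here is a minimal sketch of the core operation in a CNN layer: sliding a small kernel over an image and summing the element-wise products. The function name `conv2d`, the toy image, and the edge-detecting kernel are all made up for illustration. A real CNN learns its kernels and stacks many such layers, and nothing here is taken from AlphaFold itself.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 3x4 "image": dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical hand-written kernel that responds where brightness rises left to right.
edge_kernel = [[-1, 1]]

feature_map = conv2d(image, edge_kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the dark region meets the bright one, which is the sense in which a kernel "detects" a local pattern.
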
The Transformer is a deep learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization.
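
The building block that lets a Transformer handle sequences is self-attention: every position in the sequence looks at every other position and takes a weighted average of them. The sketch below is a deliberately stripped-down version that assumes queries, keys, and values are all just the input vectors, with no learned projection matrices, multiple heads, or positional encoding. The function names are illustrative, and this is not AlphaFold's actual attention mechanism.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input."""
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: attention-weighted average of all input vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        output.append(mixed)
    return output


# Three made-up 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print([round(x, 3) for x in row])
```

Each output row has the same shape as its input row, but now mixes in information from the rest of the sequence.
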
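The AlphaFold team says it trained on a public dataset of 170,000 proteins with known structures and a much larger database of protein sequences with unknown structures.
Since 170,000 examples is not a lot, the article's author guesses that they sliced up the data (hiding part of an input and asking the model to predict it) so that the proteins with unknown structures could be used in training too.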
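
A rough sketch of what that guess could look like for sequences: hide a few amino acids and keep the hidden letters as the answers the model must predict, so that an unlabeled sequence supplies its own training signal. Everything here is an assumption for illustration: the `mask_sequence` helper, the `_` mask symbol, the 15% masking rate, and the example sequence. The article only speculates that something along these lines was done.

```python
import random

MASK = "_"  # hypothetical placeholder symbol for a hidden amino acid


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return (masked sequence, hidden answers)."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the "label" comes from the data itself
        masked[pos] = MASK
    return "".join(masked), targets


# Made-up amino acid sequence in one-letter codes.
masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(masked)
print(targets)
```

A model trained to fill in the blanks never needs a known structure, which is why this kind of setup can make use of the much larger database of sequences with unknown structures.
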
The guess is that it uses embeddings. Embeddings are a way of mapping data to vectors whose positions in space capture meaning.
This helps because proteins with similar amino acid sequences are more likely to share a similar structure.
In biology, sequence similarity is assessed with an MSA, or Multiple Sequence Alignment. One amino acid sequence may be very similar to another, but it may have some extra or “inserted” amino acids that make it longer than the other. MSA is a way of adding gaps to make the sequences line up as closely as possible.

The figure below shows the AlphaFold model architecture.

# How AlphaFold Works
Historically, determining protein structures (via experimental techniques like X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy) has been difficult, slow, and expensive. Plus, for some types of proteins, these techniques don’t work at all.
In theory, a protein’s structure is determined by the string of amino acids that make it up. And we can determine a protein’s amino acid sequence easily, via DNA sequencing (remember from Bio 101 how your DNA codes for amino acid sequences?).
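
As a reminder of that Bio 101 step, here is a small sketch of reading DNA three letters (one codon) at a time and looking up the amino acid each codon codes for. The codon table below is a tiny subset of the real 64-codon genetic code, chosen only to cover the made-up example sequence, and `translate` is an illustrative helper rather than a library function.

```python
# Partial codon table: only the codons needed for the example below.
CODON_TABLE = {
    "ATG": "M",  # methionine (start codon)
    "GCT": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3].upper()
        amino_acid = CODON_TABLE[codon]  # raises KeyError outside this subset
        if amino_acid == "*":
            break  # a stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTAAATGGTAA"))  # MAKW
```

Getting from DNA to the amino acid string is a table lookup. Getting from the amino acid string to a 3D structure is the hard part that AlphaFold addresses.
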
AlphaFold is a neural-network-based algorithm that’s performed astonishingly well on the protein folding problem.
It was trained on a public dataset of 170,000 proteins with known structures, and a much larger database of protein sequences with unknown structures.

AlphaFold 2 uses a type of neural network called a “Transformer.” So one of the major differences between AlphaFold 1 and AlphaFold 2 is that the former used convolutional neural networks (CNNs) and the new version uses Transformers.
... slice an image in half, and ask a model to predict ...
Embeddings are a way of mapping data to vectors whose position in space captures meaning: for example, a tool for taking a word (e.g. “hammer”) and mapping it to n-dimensional space so that similar words (“screwdriver,” “nail”) are mapped nearby. And, like GPT-3, it was trained on a dataset of unlabeled text.
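
To illustrate "mapped nearby", the sketch below compares word vectors with cosine similarity, a common way to measure how closely two vectors point in the same direction. The 3-dimensional vectors are invented by hand for this example (real embeddings are learned and have many more dimensions), and "banana" is added only as an unrelated word for contrast.

```python
import math

# Hand-made toy embeddings; the numbers are invented, not learned.
EMBEDDINGS = {
    "hammer": [0.9, 0.8, 0.1],
    "nail": [0.8, 0.9, 0.2],
    "screwdriver": [0.85, 0.7, 0.15],
    "banana": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector is most similar to `word`'s vector."""
    others = [w for w in EMBEDDINGS if w != word]
    return max(
        others,
        key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]),
    )


for other in ("nail", "screwdriver", "banana"):
    score = cosine_similarity(EMBEDDINGS["hammer"], EMBEDDINGS[other])
    print(f"hammer vs {other}: {score:.3f}")
print("nearest to hammer:", nearest("hammer"))
```

With these toy numbers the tools score close to 1.0 against each other and the unrelated word scores much lower. The same idea would apply to protein sequences if similar sequences are mapped to nearby vectors.
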
The idea is to look at clusters of proteins with similar amino acid sequences. Often, one protein sequence might be similar to another because the two share a similar evolutionary origin. The more similar those amino acid sequences, the more likely those proteins serve a similar purpose for the organisms they’re made in, which means, in turn, they’re more likely to share a similar structure.
That raises the question of how similar two amino acid sequences are. To do that, biologists typically compute something called an MSA or Multiple Sequence Alignment. One amino acid sequence may be very similar to another, but it may have some extra or “inserted” amino acids that make it longer than the other. MSA is a way of adding gaps to make the sequences line up as closely as possible.
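
The "adding gaps" idea is easiest to see with just two sequences. The sketch below uses Needleman-Wunsch global alignment, a classic dynamic-programming method for aligning a pair of sequences. Note that this is pairwise alignment, not a full MSA over many sequences, and it is not what AlphaFold's pipeline runs. The scoring values (+1 match, -1 mismatch, -1 gap) and the two short sequences are arbitrary choices for illustration.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences (Needleman-Wunsch), using '-' for gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


# The first sequence has one extra "inserted" amino acid (the A after MKT).
top, bottom = align("MKTAYIAK", "MKTYIAK")
print(top)     # MKTAYIAK
print(bottom)  # MKT-YIAK
```

The single gap lines up every other position, so the two sequences can be compared column by column. A multiple sequence alignment extends the same idea to many sequences at once.
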