# HW1 - GWAS
###### tags: `c4lab`
HackMD (Recommanded): https://hackmd.io/oTjzyAzFQ9Genk7zsRWq_g
## Q0
> 0) If you are doing genetic counseling, what is the most frequent question you have been asked. Otherwise, what do you want to ask a certified Genetic Counselor? Why? (20%)
### Conseling
I have not been asked ever, but I guess the questions may related to cancer.
### Ask
How long can I live and how can I extend it. What is the action to extend longest lifetime for me within minimal lifestyle changes.
### Why
Gene shows the blueprint of the live. Precise prediction can be made if we know all the mechanism in our body. In addition, we can use environment factor in our model to get better accuracy. Finally, we can find out the best action by the model with gene and environment information. In order to predict our lifetime, we need to predict all of probability of disease that may induce by one of variants in my gene, this is our final goal in future research.
## Q1
> Please give an example from any published GWAS hits and answer the following questions.
> 1) Please describe the sample size of this GWAS hit. Assuming Aa/aa are the same, please describe the data in 2x2 table (10%)
I foucs on β-thalassemia which is a heritable gene disease and common in many Asian countries.
### Reference paper
https://link.springer.com/article/10.1007%2Fs00439-009-0770-2
Nuinoon, M., Makarasara, W., Mushiroda, T. et al. A genome-wide association identified the common genetic variants influence disease severity in β0-thalassemia/hemoglobin E. Hum Genet 127, 303–314 (2010). https://doi.org/10.1007/s00439-009-0770-2
### Data:
A total of 1,330 Thai β0-thalassemia/HbE patients were classified into three groups: mild, moderate, and severe. 235 mildly and 383 severely affected patients were selected for the GWAS study as described.
### Locus
I foucs on the most significant variant `rs2071348` in this research.
| dbSNPID | Position | Gene | Chr | Allele | Severe minor allele | Severe MAF | Mild minor allele | Mild MAF | Risk Allele | Odds ratio | Severity P value | HBF(%) | Abs HbF (g/dL) |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| rs2071348 | 5220722 | HBBP1 | 112 | [A/C] | C | 0.334 | C | 0.496 | A | 4.33 (2.74–6.84) | 2.96E–13 | 3.03E–19 | 5.67E–24 |

### 2x2 Table
The number of each cells are calculated by allele frequency and total number of case/control.
| | Control(CC) | Risk(AA,CA) | Total |
|-|-|-|-|
| Severe | 43 | 340 | 383 |
| Mild | 58 | 177 | 235 |
| Total | | | 618 |
``` python
>>> rount(383 * 0.334 ** 2)
43
>>> rount(235 * 0.496 ** 2)
58
```
## Q2
> 2) The odds ratio (OR) of this variant and interpretation. (Please show me your calculation) (10%)
Odds ratio is the ratio of
* `Odds(case)`(The portion of the variant exists in the case) and
* `Odds(control)`(The portion of the variant exists in the control),
The formula shows the ratio of two portion. If two portion are same, `OR=1` means no difference when the variant exists.
In the other word, OR indicates the effect(mostly hazard) when the variant exist.
``` python
>>> (340/178) / (43/57)
2.5909867297332805
```
In this example, it is 2.56 times hazard when the variant exists.
However, it is different from the paper claim(OR=4.33)
## Q3
> 3) The p-value of this variant in the study (Please show me your calculation) (10%)
Calculate p-value by Chi-sequre
First, build expected table
| Expected | Control(CC) | Risk(AA,CA) | Total |
|-|-|-|-|
| Severe | (383/618)*(101/618)*618 | (383/618)*(517/618)*618 | 383 |
| Mild | (235/618)*(101/618)*618 | (235/618)*(517/618)*618 | 235 |
| Total | 101 | 517 | 618 |
Finally, Sum of `(observe - expected) ** 2 / expected`
``` python
>>> chi = (case_ref - exp_case_ref) ** 2 / exp_case_ref \
+ (case_var - exp_case_var) ** 2 / exp_case_var \
+ (cont_ref - exp_cont_ref) ** 2 / exp_cont_ref \
+ (cont_var - exp_cont_var) ** 2 / exp_cont_var
>>> chi
19.28086351480877
>>> 1 - stats.chi2.cdf(chi, 1))
1.1283157394403887e-05
```
p-value is `1.12e-5` is quick larger than `2.96E–13` which claimed by paper.
## Q4
> 4) Based on 2x2 table from Q1, please write the equation of classic logistic regression and calculate the coefficients (β1) (10%)
Logistic regression:
$$
\ln( \frac{p}{1-p}) = \beta_0 + \beta_1*X_1 + ...
$$
where $p$ is the portion of disease, $\beta_1 = \ln(\text{OR of variant }X_1)$
Calculate log(OR) for `rs2071348`
```
>>> np.log(2.5909867297332805)
0.9520387798892347
```
Thus, $\beta = 0.95$
## Q5
> 5) Please state a clear interpretation of β1 (10%)
To simply the equation, we can assume $X_1=0$ when no variant at the locus and $X_1=1$ when there has variant. $\beta_1X_i$ will not be 0 when there's variant for the person. If beta is larger, which indicates the variant matters, it contributes more to the final answer $\ln( \frac{p}{1-p})$. In brief, we called it effect size.
For example, if $\beta_1=10, X=1$, then $\ln( \frac{p}{1-p})=10$ and solve the equation you will get $p \approx 1$, show the person are predicted to have disease.
## Q6
> 6) Based on the equation from Q4, please write the equation of three models Dominant/Recessive/Additive and calculate the OR and p-value for each model (30%)
### Dominant
So far, I have calculated on Dominant model.
### Recessive
Then, I regroup the samples to control and risk in the below table under recessive model
| | Control(CC,CA) | Risk(AA) | Total |
|-|-|-|-|
| Severe | 213 | 170 | 383 |
| Mild | 175 | 60 | 235 |
| Total | 377 | 230 | 618 |
And the result
```
OR 2.3278560250391234
Chi 22.15776550189664
Chi-p 2.5113807993193404e-06
beta 0.8449476830422499
```
### Additive
This case is more complicated.
I rewrite the equivalent formula like this, I split the original genotype to two variables.
$$
\ln( \frac{p}{1-p})= \beta_{1} X_{Aa} + 2\beta_{2} X_{AA}
$$
where
$$
\left\{\begin{matrix}
X_{Aa} = 1 & \text{when} \text{X=Aa} \\
X_{Aa} = 0 & \text{otherwise}
\end{matrix}\right.
$$
$$
\left\{\begin{matrix}
X_{AA} = 1 & \text{when} \text{X=AA} \\
X_{AA} = 0 & \text{otherwise}
\end{matrix}\right.
$$
And add one more column to 2x2 table
| | Control(CC) | AA | CA | Total |
|-|-|-|-|-|
| Severe | 43 | 170 | 170 | 383 |
| Mild | 58 | 60 | 117 | 235 |
| Total | 101 | 230 | 287 | 618 |
The chi-square test is still the same: `(observe-expected)**2 / expected`.
Note that DoF(Degree of Freedom) should set to 2
```
Chi 30.955882412142437
Chi-p 1.8967738035780002e-07
```
Calculate OR:
```
OR1 for AA+Aa vs aa: 2.3278560250391234
OR2 for AA vs aa+Aa: 2.5909867297332805
>>> "beta1", np.log(OR1)
>>> "beta2", np.log(OR2)
beta1 0.8449476830422499
beta2 0.9520387798892347
```
## Code:
https://gist.github.com/linnil1/7063a73c873bb224d1bd91cddfd5fd16
{%gist linnil1/7063a73c873bb224d1bd91cddfd5fd16 %}