# HW1 - GWAS ###### tags: `c4lab` HackMD (Recommanded): https://hackmd.io/oTjzyAzFQ9Genk7zsRWq_g ## Q0 > 0) If you are doing genetic counseling, what is the most frequent question you have been asked. Otherwise, what do you want to ask a certified Genetic Counselor? Why? (20%) ### Conseling I have not been asked ever, but I guess the questions may related to cancer. ### Ask How long can I live and how can I extend it. What is the action to extend longest lifetime for me within minimal lifestyle changes. ### Why Gene shows the blueprint of the live. Precise prediction can be made if we know all the mechanism in our body. In addition, we can use environment factor in our model to get better accuracy. Finally, we can find out the best action by the model with gene and environment information. In order to predict our lifetime, we need to predict all of probability of disease that may induce by one of variants in my gene, this is our final goal in future research. ## Q1 > Please give an example from any published GWAS hits and answer the following questions. > 1) Please describe the sample size of this GWAS hit. Assuming Aa/aa are the same, please describe the data in 2x2 table (10%) I foucs on β-thalassemia which is a heritable gene disease and common in many Asian countries. ### Reference paper https://link.springer.com/article/10.1007%2Fs00439-009-0770-2 Nuinoon, M., Makarasara, W., Mushiroda, T. et al. A genome-wide association identified the common genetic variants influence disease severity in β0-thalassemia/hemoglobin E. Hum Genet 127, 303–314 (2010). https://doi.org/10.1007/s00439-009-0770-2 ### Data: A total of 1,330 Thai β0-thalassemia/HbE patients were classified into three groups: mild, moderate, and severe. 235 mildly and 383 severely affected patients were selected for the GWAS study as described. ### Locus I foucs on the most significant variant `rs2071348` in this research. | dbSNPID | Position | Gene | Chr | Allele | Severe minor allele | Severe MAF | Mild minor allele | Mild MAF | Risk Allele | Odds ratio | Severity P value | HBF(%) | Abs HbF (g/dL) | |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| | rs2071348 | 5220722 | HBBP1 | 112 | [A/C] | C | 0.334 | C | 0.496 | A | 4.33 (2.74–6.84) | 2.96E–13 | 3.03E–19 | 5.67E–24 | ![](https://i.imgur.com/aO2khXs.png) ### 2x2 Table The number of each cells are calculated by allele frequency and total number of case/control. | | Control(CC) | Risk(AA,CA) | Total | |-|-|-|-| | Severe | 43 | 340 | 383 | | Mild | 58 | 177 | 235 | | Total | | | 618 | ``` python >>> rount(383 * 0.334 ** 2) 43 >>> rount(235 * 0.496 ** 2) 58 ``` ## Q2 > 2) The odds ratio (OR) of this variant and interpretation. (Please show me your calculation) (10%) Odds ratio is the ratio of * `Odds(case)`(The portion of the variant exists in the case) and * `Odds(control)`(The portion of the variant exists in the control), The formula shows the ratio of two portion. If two portion are same, `OR=1` means no difference when the variant exists. In the other word, OR indicates the effect(mostly hazard) when the variant exist. ``` python >>> (340/178) / (43/57) 2.5909867297332805 ``` In this example, it is 2.56 times hazard when the variant exists. However, it is different from the paper claim(OR=4.33) ## Q3 > 3) The p-value of this variant in the study (Please show me your calculation) (10%) Calculate p-value by Chi-sequre First, build expected table | Expected | Control(CC) | Risk(AA,CA) | Total | |-|-|-|-| | Severe | (383/618)*(101/618)*618 | (383/618)*(517/618)*618 | 383 | | Mild | (235/618)*(101/618)*618 | (235/618)*(517/618)*618 | 235 | | Total | 101 | 517 | 618 | Finally, Sum of `(observe - expected) ** 2 / expected` ``` python >>> chi = (case_ref - exp_case_ref) ** 2 / exp_case_ref \ + (case_var - exp_case_var) ** 2 / exp_case_var \ + (cont_ref - exp_cont_ref) ** 2 / exp_cont_ref \ + (cont_var - exp_cont_var) ** 2 / exp_cont_var >>> chi 19.28086351480877 >>> 1 - stats.chi2.cdf(chi, 1)) 1.1283157394403887e-05 ``` p-value is `1.12e-5` is quick larger than `2.96E–13` which claimed by paper. ## Q4 > 4) Based on 2x2 table from Q1, please write the equation of classic logistic regression and calculate the coefficients (β1) (10%) Logistic regression: $$ \ln( \frac{p}{1-p}) = \beta_0 + \beta_1*X_1 + ... $$ where $p$ is the portion of disease, $\beta_1 = \ln(\text{OR of variant }X_1)$ Calculate log(OR) for `rs2071348` ``` >>> np.log(2.5909867297332805) 0.9520387798892347 ``` Thus, $\beta = 0.95$ ## Q5 > 5) Please state a clear interpretation of β1 (10%) To simply the equation, we can assume $X_1=0$ when no variant at the locus and $X_1=1$ when there has variant. $\beta_1X_i$ will not be 0 when there's variant for the person. If beta is larger, which indicates the variant matters, it contributes more to the final answer $\ln( \frac{p}{1-p})$. In brief, we called it effect size. For example, if $\beta_1=10, X=1$, then $\ln( \frac{p}{1-p})=10$ and solve the equation you will get $p \approx 1$, show the person are predicted to have disease. ## Q6 > 6) Based on the equation from Q4, please write the equation of three models Dominant/Recessive/Additive and calculate the OR and p-value for each model (30%) ### Dominant So far, I have calculated on Dominant model. ### Recessive Then, I regroup the samples to control and risk in the below table under recessive model | | Control(CC,CA) | Risk(AA) | Total | |-|-|-|-| | Severe | 213 | 170 | 383 | | Mild | 175 | 60 | 235 | | Total | 377 | 230 | 618 | And the result ``` OR 2.3278560250391234 Chi 22.15776550189664 Chi-p 2.5113807993193404e-06 beta 0.8449476830422499 ``` ### Additive This case is more complicated. I rewrite the equivalent formula like this, I split the original genotype to two variables. $$ \ln( \frac{p}{1-p})= \beta_{1} X_{Aa} + 2\beta_{2} X_{AA} $$ where $$ \left\{\begin{matrix} X_{Aa} = 1 & \text{when} \text{X=Aa} \\ X_{Aa} = 0 & \text{otherwise} \end{matrix}\right. $$ $$ \left\{\begin{matrix} X_{AA} = 1 & \text{when} \text{X=AA} \\ X_{AA} = 0 & \text{otherwise} \end{matrix}\right. $$ And add one more column to 2x2 table | | Control(CC) | AA | CA | Total | |-|-|-|-|-| | Severe | 43 | 170 | 170 | 383 | | Mild | 58 | 60 | 117 | 235 | | Total | 101 | 230 | 287 | 618 | The chi-square test is still the same: `(observe-expected)**2 / expected`. Note that DoF(Degree of Freedom) should set to 2 ``` Chi 30.955882412142437 Chi-p 1.8967738035780002e-07 ``` Calculate OR: ``` OR1 for AA+Aa vs aa: 2.3278560250391234 OR2 for AA vs aa+Aa: 2.5909867297332805 >>> "beta1", np.log(OR1) >>> "beta2", np.log(OR2) beta1 0.8449476830422499 beta2 0.9520387798892347 ``` ## Code: https://gist.github.com/linnil1/7063a73c873bb224d1bd91cddfd5fd16 {%gist linnil1/7063a73c873bb224d1bd91cddfd5fd16 %}