### 1. Simulating a bimodal distribution of a phenotype in a population of mice and a genotype matrix
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set the seed for reproducibility
np.random.seed(42)

# Parameters
n_samples = 149
n_nodes = 20
n_group1 = 70
n_group2 = 79

# Generate phenotypes: two well-separated normal distributions
mean1, sd1 = 2, 0.5
mean2, sd2 = 12, 0.5
phenotype_group1 = np.random.normal(mean1, sd1, n_group1)
phenotype_group2 = np.random.normal(mean2, sd2, n_group2)
phenotype = np.concatenate([phenotype_group1, phenotype_group2])

# Create sample IDs (0341 ... 0489)
sample_ids = [f'{i:04d}' for i in range(341, 341 + n_samples)]

# Genotypes for the first node (node.80): 1 (B) below the phenotype mean, 0 (D) above
genotypes_node80 = np.where(phenotype < np.mean(phenotype), 1, 0)
# The second node (node.81) carries the opposite genotype of node.80
genotypes_node81 = 1 - genotypes_node80
# Random genotypes for the remaining nodes
genotypes_random = np.random.choice([0, 1], size=(n_nodes - 2, n_samples))

# Combine genotypes
genotypes = np.vstack([genotypes_node80, genotypes_node81, genotypes_random])

# Create a DataFrame for genotypes (nodes as rows, samples as columns)
df_genotypes = pd.DataFrame(genotypes, columns=sample_ids)
df_genotypes.index = [f'node.{i}' for i in range(80, 80 + n_nodes)]
df_genotypes.index.name = 'sample'

# Save genotypes to CSV
df_genotypes.to_csv('genotype_matrix.csv')
print("Genotype matrix has been saved to 'genotype_matrix.csv'")

# Create a DataFrame for phenotypes and save it to CSV
df_phenotypes = pd.DataFrame({'Sample': sample_ids, 'Phenotype': phenotype})
df_phenotypes.to_csv('phenotype_data.csv', index=False)
print("Phenotype data has been saved to 'phenotype_data.csv'")
```
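The matplotlib import in the script is there for a quick visual check that the phenotype really is bimodal; continuing from the variables above (the output file name is arbitrary):
```
# Visualize the bimodal phenotype distribution
plt.hist(phenotype, bins=30)
plt.xlabel('Phenotype')
plt.ylabel('Number of mice')
plt.savefig('phenotype_histogram.png')
plt.close()
```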

For the first node (node.80):
- B (1) for samples with a phenotype smaller than the mean
- D (0) for samples with a phenotype higher than the mean
For the second node (node.81), the situation is the opposite of the first node.
All other nodes have random genotypes.
Example of the genotype matrix:
```
sample,0341,0342,...,0489
node.80,1,1,...,0,0
node.81,0,0,...,1,1
...
node.99,1,0,...,0
```
Example of the phenotype matrix:
```
Sample,Phenotype
0341,2.1234
0342,11.9876
...
0489,1.8765
```
##### Run GEMMA
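GEMMA reads the genotypes in BIMBAM mean-genotype format (node ID, two allele columns, then one value per sample, no header) and the phenotypes as a plain list, one value per line with no header. A minimal sketch converting the CSVs written above into the files used by the commands below, with X and Y as placeholder alleles since the nodes are not SNPs:
```
import pandas as pd

# Phenotypes: one value per line, no header, no sample IDs
pheno = pd.read_csv('phenotype_data.csv')
pheno['Phenotype'].to_csv('phenotypes.txt', index=False, header=False)

# Genotypes: insert the two allele columns required by the BIMBAM
# mean-genotype format and drop the header row; written in place so
# the gemma commands below can keep the same file name
geno = pd.read_csv('genotype_matrix.csv', index_col=0)
geno.insert(0, 'allele1', 'X')
geno.insert(1, 'allele0', 'Y')
geno.to_csv('genotype_matrix.csv', header=False)
```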
```
gemma -g genotype_matrix.csv -p phenotypes.txt -gk 1 -o mouse
gemma -g genotype_matrix.csv -p phenotypes.txt -n 1 -k ./output/mouse.cXX.txt -lmm 1 -o mouse_asso_lmm
```
We found the associations that we expected for the first two nodes:
```
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle p_wald
-9 node.80 -9 0 X Y 0.235 -1.004715e+01 7.857953e-02 -9.869112e+01 5.960082e-03 1.385944e-152
-9 node.81 -9 0 X Y 0.265 1.004715e+01 7.857953e-02 -9.869112e+01 5.960082e-03 1.385944e-152
-9 node.82 -9 0 X Y 0.272 3.593632e-02 1.680335e+00 -1.532825e+02 2.529406e+02 9.829664e-01
-9 node.83 -9 0 X Y 0.279 3.187664e-02 1.680566e+00 -1.532852e+02 2.529375e+02 9.848925e-01
-9 node.84 -9 0 X Y 0.235 -2.271920e-02 1.680445e+00 -1.532806e+02 2.529416e+02 9.892315e-01
```
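The script in the next section compares QQ plots across normalizations; a minimal sketch of such a plot from the association results, assuming GEMMA's default ./output/[prefix].assoc.txt output naming:
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# QQ plot: observed Wald p-values against the uniform expectation
assoc = pd.read_csv('output/mouse_asso_lmm.assoc.txt', sep='\t')
n = len(assoc)
observed = -np.log10(np.sort(assoc['p_wald'].values))
expected = -np.log10(np.arange(1, n + 1) / (n + 1))
lim = expected.max()
plt.scatter(expected, observed, s=8)
plt.plot([0, lim], [0, lim], color='red')  # y = x reference line
plt.xlabel('Expected -log10(p)')
plt.ylabel('Observed -log10(p)')
plt.savefig('qqplot.png')
```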
### 2. Simulating a bimodal distribution of a phenotype in a population of mice using the real reads mapped per node
I use the same phenotypes that I generated above.
In addition, I start from a real matrix of reads mapped per node, changing only two rows: the first 70 values of the first node (node.80) are set to 0, and the last 70 values of the second node (node.139) are set to 0.
```
awk '
BEGIN { FS = OFS = "," }
NR==1 {
    # First line (node.80): fields 1-3 are the node ID and the two
    # allele columns, so the samples start at field 4
    printf "node.80,X,Y"
    # Zero the first 70 samples (fields 4-73)
    for (i = 1; i <= 70; i++) printf ",0"
    # Print the remaining samples unchanged
    for (i = 74; i <= NF; i++) printf ",%s", $i
    printf "\n"
    next
}
NR==2 {
    # Second line (node.139): keep all samples except the last 70
    printf "node.139,X,Y"
    for (i = 4; i <= NF - 70; i++) printf ",%s", $i
    # Zero the last 70 samples
    for (i = 1; i <= 70; i++) printf ",0"
    printf "\n"
    next
}
{ print }
' input_matrix.txt > modified_matrix.txt
```
##### Now I generate a script that normalizes the matrix in different ways, runs GEMMA, and generates QQ plots.
Here are all the normalizations I implemented in the script:
- Mean normalization (normalize_by_mean):
  - Calculate the mean of the values in each sample
  - Divide the values by the sample mean
  - Scale to a target mean of 2.0
  - Cap values at 2.0
- CPM (counts per million) normalization (normalize_by_cpm):
  - For each sample (column), divide each value by the sum of all values in that sample
  - Multiply by 1,000,000 to get counts per million
  - This ensures the total normalized value for each sample is 1,000,000
- Length normalization (normalize_by_length):
  - Divide each node's coverage value by its corresponding length from the node-lengths file
  - This adjusts for different node sizes, giving coverage per base
  - Use a default length of 1 for any unmatched nodes
- Zygosity normalization (normalize_by_zygosity):
  - Calculate the mode (most common value) for each sample using kernel density estimation
  - Divide each value by half of this mode value (assuming diploid)
  - This accounts for copy number variation in the genome

The script also creates all possible combinations of these normalizations (e.g., mean+cpm, mean+length, cpm+length, etc.) and applies a final cap of 2.0 to all results.
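A minimal sketch of the four functions, assuming the matrix is a pandas DataFrame with nodes as rows and samples as columns, and the node lengths are a pandas Series indexed by node name (function names follow the list above; the KDE mode uses scipy.stats.gaussian_kde):
```
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def normalize_by_mean(df, target=2.0):
    # Divide each sample (column) by its mean, scale to the target mean, cap
    return (df / df.mean(axis=0) * target).clip(upper=target)

def normalize_by_cpm(df):
    # Counts per million: each column sums to 1,000,000 after scaling
    return df / df.sum(axis=0) * 1_000_000

def normalize_by_length(df, lengths):
    # Coverage per base: divide each node (row) by its length,
    # defaulting to 1 for nodes missing from the lengths table
    return df.div(lengths.reindex(df.index).fillna(1), axis=0)

def normalize_by_zygosity(df):
    # Divide each sample by half of its modal coverage (the diploid peak),
    # with the mode estimated by a kernel density over nonzero values
    out = df.copy()
    for col in df.columns:
        values = df[col][df[col] > 0]
        kde = gaussian_kde(values)
        grid = np.linspace(values.min(), values.max(), 1000)
        mode = grid[np.argmax(kde(grid))]
        out[col] = df[col] / (mode / 2)
    return out
```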
I tried different approaches to obtain a genotype matrix:
##### 1. Simple way
I calculated the mean of all the values in the matrix.
I divided each value in the matrix by that mean and scaled to a target mean of 2.0, so the rescaled values have a mean of 2.0.
After rescaling, I capped any values greater than 2.0 at 2.0.
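In pandas terms, assuming the coverage matrix is loaded as a DataFrame named coverage (a sketch, not the actual script):
```
# Global mean normalization: rescale so the overall mean is 2.0, then cap
genotype_like = (coverage / coverage.values.mean() * 2.0).clip(upper=2.0)
```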
##### 2. Few steps
- Normalization by individual:
  For each sample, I divide each value (coverage per node) by the sum of all values in that sample, so the total coverage within each sample sums to 1. I then multiply these normalized values by 1,000,000; this scaling adjusts the data to counts per million (CPM), so each sample's total normalized value is 1,000,000.
- Normalization by node length:
  Since nodes vary in length, I divide each node's normalized coverage value by its length to get coverage per base. This adjustment ensures that comparisons between nodes are not biased by their size.
- Normalization by zygosity → number of copies:
  I rescale the coverage vector by dividing by half the coverage of the diploid peak (in the example below, the peak for BXD32 is 0.0003535). In a diploid organism (2 copies), this adjusts each node's value to reflect how many copies of the node exist in the genome, and it accounts for copy number variation, such as regions with more or fewer copies.
- Cap at 2.0:
  After all normalizations, any value exceeding 2 (which could indicate overly high coverage) is capped at 2.
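Chaining these steps in order, reusing the sketch functions defined after the normalization list above (coverage and node_lengths are hypothetical names for the reads-per-node DataFrame and the node-length Series):
```
# CPM per sample -> coverage per base -> copy number via the diploid peak,
# then the final cap at 2.0
normalized = normalize_by_cpm(coverage)
normalized = normalize_by_length(normalized, node_lengths)
normalized = normalize_by_zygosity(normalized)
genotype_like = normalized.clip(upper=2.0)
```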
##### 3. Same steps as above, without the normalization by individual