### 1. Simulating a bimodal distribution of a phenotype in a population of mice and a genotype matrix
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set the seed for reproducibility
np.random.seed(42)

# Parameters
n_samples = 149
n_nodes = 20
n_group1 = 70
n_group2 = 79

# Generate phenotypes: two well-separated normal distributions
mean1, sd1 = 2, 0.5
mean2, sd2 = 12, 0.5
phenotype_group1 = np.random.normal(mean1, sd1, n_group1)
phenotype_group2 = np.random.normal(mean2, sd2, n_group2)
phenotype = np.concatenate([phenotype_group1, phenotype_group2])

# Create sample IDs (0341 ... 0489)
sample_ids = [f'{i:04d}' for i in range(341, 341 + n_samples)]

# Genotypes for the first node (node.80): 1 (B) below the phenotype mean, 0 (D) above
genotypes_node80 = np.where(phenotype < np.mean(phenotype), 1, 0)
# The second node (node.81) carries the opposite genotype of node.80
genotypes_node81 = 1 - genotypes_node80
# Random genotypes for the remaining nodes
genotypes_random = np.random.choice([0, 1], size=(n_nodes - 2, n_samples))

# Combine genotypes
genotypes = np.vstack([genotypes_node80, genotypes_node81, genotypes_random])

# Create a DataFrame for genotypes (nodes as rows, samples as columns)
df_genotypes = pd.DataFrame(genotypes, columns=sample_ids)
df_genotypes.index = [f'node.{i}' for i in range(80, 80 + n_nodes)]
df_genotypes.index.name = 'sample'

# Save genotypes to CSV
df_genotypes.to_csv('genotype_matrix.csv')
print("Genotype matrix has been saved to 'genotype_matrix.csv'")

# Create a DataFrame for phenotypes and save it to CSV
df_phenotypes = pd.DataFrame({'Sample': sample_ids, 'Phenotype': phenotype})
df_phenotypes.to_csv('phenotype_data.csv', index=False)
print("Phenotype data has been saved to 'phenotype_data.csv'")
```
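The matplotlib import in the script is there for a quick visual check that the phenotype really is bimodal; continuing from the variables above (the output file name is arbitrary):
```
# Visualize the bimodal phenotype distribution
plt.hist(phenotype, bins=30)
plt.xlabel('Phenotype')
plt.ylabel('Number of mice')
plt.savefig('phenotype_histogram.png')
plt.close()
```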

For the first node (node.80):
- B (1) for samples with a phenotype smaller than the mean
- D (0) for samples with a phenotype higher than the mean
For the second node (node.81), the situation is the opposite of the first node.
All other nodes have random genotypes.
Example of the genotype matrix:
```
sample,0341,0342,...,0489
node.80,1,1,...,0,0
node.81,0,0,...,1,1
...
node.99,1,0,...,0
```
Example of the phenotype matrix:
```
Sample,Phenotype
0341,2.1234
0342,11.9876
...
0489,1.8765
```
##### Run GEMMA
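GEMMA reads the genotypes in BIMBAM mean-genotype format (node ID, two allele columns, then one value per sample, no header) and the phenotypes as a plain list, one value per line with no header. A minimal sketch converting the CSVs written above into the files used by the commands below, with X and Y as placeholder alleles since the nodes are not SNPs:
```
import pandas as pd

# Phenotypes: one value per line, no header, no sample IDs
pheno = pd.read_csv('phenotype_data.csv')
pheno['Phenotype'].to_csv('phenotypes.txt', index=False, header=False)

# Genotypes: insert the two allele columns required by the BIMBAM
# mean-genotype format and drop the header row; written in place so
# the gemma commands below can keep the same file name
geno = pd.read_csv('genotype_matrix.csv', index_col=0)
geno.insert(0, 'allele1', 'X')
geno.insert(1, 'allele0', 'Y')
geno.to_csv('genotype_matrix.csv', header=False)
```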
```
gemma -g genotype_matrix.csv -p phenotypes.txt -gk 1 -o mouse
gemma -g genotype_matrix.csv -p phenotypes.txt -n 1 -k ./output/mouse.cXX.txt -lmm 1 -o mouse_asso_lmm
```
We found the associations that we expected for the first two nodes:
```
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle p_wald
-9 node.80 -9 0 X Y 0.235 -1.004715e+01 7.857953e-02 -9.869112e+01 5.960082e-03 1.385944e-152
-9 node.81 -9 0 X Y 0.265 1.004715e+01 7.857953e-02 -9.869112e+01 5.960082e-03 1.385944e-152
-9 node.82 -9 0 X Y 0.272 3.593632e-02 1.680335e+00 -1.532825e+02 2.529406e+02 9.829664e-01
-9 node.83 -9 0 X Y 0.279 3.187664e-02 1.680566e+00 -1.532852e+02 2.529375e+02 9.848925e-01
-9 node.84 -9 0 X Y 0.235 -2.271920e-02 1.680445e+00 -1.532806e+02 2.529416e+02 9.892315e-01
```
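The script in the next section compares QQ plots across normalizations; a minimal sketch of such a plot from the association results, assuming GEMMA's default ./output/[prefix].assoc.txt output naming:
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# QQ plot: observed Wald p-values against the uniform expectation
assoc = pd.read_csv('output/mouse_asso_lmm.assoc.txt', sep='\t')
n = len(assoc)
observed = -np.log10(np.sort(assoc['p_wald'].values))
expected = -np.log10(np.arange(1, n + 1) / (n + 1))
lim = expected.max()
plt.scatter(expected, observed, s=8)
plt.plot([0, lim], [0, lim], color='red')  # y = x reference line
plt.xlabel('Expected -log10(p)')
plt.ylabel('Observed -log10(p)')
plt.savefig('qqplot.png')
```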
### 2. Simulating a bimodal distribution of a phenotype in a population of mice using the real reads mapped per node
I use the same phenotypes that I generated above.
In addition, I start from a real matrix of reads mapped per node, changing only two rows: the first 70 values of the first node (node.80) are set to 0, and the last 70 values of the second node (node.139) are set to 0.
```
awk '
BEGIN { FS = OFS = "," }
NR==1 {
    # First line (node.80): fields 1-3 are the node ID and the two
    # allele columns, so the samples start at field 4
    printf "node.80,X,Y"
    # Zero the first 70 samples (fields 4-73)
    for (i = 1; i <= 70; i++) printf ",0"
    # Print the remaining samples unchanged
    for (i = 74; i <= NF; i++) printf ",%s", $i
    printf "\n"
    next
}
NR==2 {
    # Second line (node.139): keep all samples except the last 70
    printf "node.139,X,Y"
    for (i = 4; i <= NF - 70; i++) printf ",%s", $i
    # Zero the last 70 samples
    for (i = 1; i <= 70; i++) printf ",0"
    printf "\n"
    next
}
{ print }
' input_matrix.txt > modified_matrix.txt
```
##### Now I generate a script that normalizes the matrix in different ways, runs GEMMA, and generates QQ plots.
Here are all the normalizations I implemented in the script:
- Mean normalization (normalize_by_mean):
  - Calculate the mean of the values in each sample
  - Divide the values by the sample mean
  - Scale to a target mean of 2.0
  - Cap values at 2.0
- CPM (counts per million) normalization (normalize_by_cpm):
  - For each sample (column), divide each value by the sum of all values in that sample
  - Multiply by 1,000,000 to get counts per million
  - This ensures the total normalized value for each sample is 1,000,000
- Length normalization (normalize_by_length):
  - Divide each node's coverage value by its corresponding length from the node-lengths file
  - This adjusts for different node sizes, giving coverage per base
  - Use a default length of 1 for any unmatched nodes
- Zygosity normalization (normalize_by_zygosity):
  - Calculate the mode (most common value) for each sample using kernel density estimation
  - Divide each value by half of this mode value (assuming diploid)
  - This accounts for copy number variation in the genome

The script also creates all possible combinations of these normalizations (e.g., mean+cpm, mean+length, cpm+length, etc.) and applies a final cap of 2.0 to all results.
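A minimal sketch of the four functions, assuming the matrix is a pandas DataFrame with nodes as rows and samples as columns, and the node lengths are a pandas Series indexed by node name (function names follow the list above; the KDE mode uses scipy.stats.gaussian_kde):
```
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def normalize_by_mean(df, target=2.0):
    # Divide each sample (column) by its mean, scale to the target mean, cap
    return (df / df.mean(axis=0) * target).clip(upper=target)

def normalize_by_cpm(df):
    # Counts per million: each column sums to 1,000,000 after scaling
    return df / df.sum(axis=0) * 1_000_000

def normalize_by_length(df, lengths):
    # Coverage per base: divide each node (row) by its length,
    # defaulting to 1 for nodes missing from the lengths table
    return df.div(lengths.reindex(df.index).fillna(1), axis=0)

def normalize_by_zygosity(df):
    # Divide each sample by half of its modal coverage (the diploid peak),
    # with the mode estimated by a kernel density over nonzero values
    out = df.copy()
    for col in df.columns:
        values = df[col][df[col] > 0]
        kde = gaussian_kde(values)
        grid = np.linspace(values.min(), values.max(), 1000)
        mode = grid[np.argmax(kde(grid))]
        out[col] = df[col] / (mode / 2)
    return out
```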
I tried different approaches to obtain a genotype matrix:
##### 1. Simple way
I calculated the mean of all the values in the matrix.
I divided each value in the matrix by that mean and scaled to a target mean of 2.0, so the rescaled values have a mean of 2.0.
After rescaling, I capped any values greater than 2.0 at 2.0.
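In pandas terms, assuming the coverage matrix is loaded as a DataFrame named coverage (a sketch, not the actual script):
```
# Global mean normalization: rescale so the overall mean is 2.0, then cap
genotype_like = (coverage / coverage.values.mean() * 2.0).clip(upper=2.0)
```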
##### 2. Few steps
- Normalization by individual:
  For each sample, I divide each value (coverage per node) by the sum of all values in that sample, so the total coverage within each sample sums to 1. I then multiply these normalized values by 1,000,000; this scaling adjusts the data to counts per million (CPM), so each sample's total normalized value is 1,000,000.
- Normalization by node length:
  Since nodes vary in length, I divide each node's normalized coverage value by its length to get coverage per base. This adjustment ensures that comparisons between nodes are not biased by their size.
- Normalization by zygosity → number of copies:
  I rescale the coverage vector by dividing by half the coverage of the diploid peak (in the example below, the peak for BXD32 is 0.0003535). In a diploid organism (2 copies), this adjusts each node's value to reflect how many copies of the node exist in the genome, and it accounts for copy number variation, such as regions with more or fewer copies.
- Cap at 2.0:
  After all normalizations, any value exceeding 2 (which could indicate overly high coverage) is capped at 2.
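Chaining these steps in order, reusing the sketch functions defined after the normalization list above (coverage and node_lengths are hypothetical names for the reads-per-node DataFrame and the node-length Series):
```
# CPM per sample -> coverage per base -> copy number via the diploid peak,
# then the final cap at 2.0
normalized = normalize_by_cpm(coverage)
normalized = normalize_by_length(normalized, node_lengths)
normalized = normalize_by_zygosity(normalized)
genotype_like = normalized.clip(upper=2.0)
```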
##### 3. Same steps as above, without the normalization by individual