## Pipeline: GEMMA from pangenome graph
### 0. Extracted all the 200 folders of linked-read
```
/lizardfs/flaviav/mouse/assembly_D/GWAS/extract.sh
```
### 0. Build the pangenome graph whole-genome with pggb
### 1. Mapping linked-read vs D+B assembly
### 2. Inject the BAM (alignment) file from the above step into the pangenome graph
### 3. Generated a matrix of coverage using GAFpack tool
```
sbatch -p workers --job-name extractlinked -c 1 --wrap "hostname; cd /scratch; /lizardfs/flaviav/mouse/assembly_D/GWAS/pregemma_all.sh"
```
### 4. Merge all coverage matrixs' in a unique file (153 samples)
- Merge the file follow the same order of the genomic standard approach
- Reheader the file with the same notation of genomic standard file
Adjust genomic standard file:
- Exclude the samples in which we are not interesting
- Rename the samples
### 5. Generating GEMMA's file
We need a genotypes' matrix, phenotype files, and for us a matrix of start-end of the bubbes.
```
/home/flavia/work/DBA2J/GWAS/trait_154-155/
```
a. Generate a genotypes' matrix
```
python gfapacktogenotype.py in_gemma/merged20.tsv in_gemma/genonodes.tsv.gz
bgzip in_gemma/genonodes.tsv.gz
```
b. Generate a phenotype file
Using David's file, put the phenotypes in the same order of the genotype file
```
in_gemma/pheno_gemma.txt
```
Details info about the phenotype here: https://docs.google.com/spreadsheets/d/1tTmRaRZQYBOJf1xXcVegDqGT4UsxVpEv/edit#gid=1269026877
c. Create a annotation file, start-end bubbles with vg and cut only start and end
Information about phenotype:

https://genenetwork.org/show_trait?trait_id=10358973&dataset=UTHSC_SPL_RMAEx_1210
##### 5b. Debugging the matrix of coverage (from pangenome)
The idea is to plot the distribution of the coverage and the median per each node.
```
debugcov.R
```
### 6. Run GEMMA using whole-pangenome and 20 samples
Gemma guide: https://www.xzlab.org/software/GEMMAmanual.pdf
Nodes in the variation graph: 23,363,074
a. Kinship matrix, to obtain a Relatedness Matrix
```
gemma-0.98.5 -g genonodes_gemma.tsv.gz -p in_gemma/pheno_gemma.txt -gk -o output_gemma/mouse_t2t
```
b. Associations, run univariate LMM. Non-variant sites i.e. start/end of the bubbles are removed by GEMMA
```
gemma-0.98.5 -g in_gemma/genonodes_gemma.tsv.gz -p in_gemma/pheno_gemma.txt -n 1 -k ./output_gemma/mouse_t2t.cXX.txt -lmm -o output_gemma/mouse_t2t_lmm
## number of analyzed individuals = 20
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 23,363,074
## number of analyzed SNPs = 6,812,369
Start Eigen-Decomposition...
pve estimate =0.33888
se(pve) =0.591878
```
c. Example's output
```
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle p_wald
-9 node.1 -9 0 A A 0.325 -9.210579e-01 2.292064e+00 -4.395814e+01 8.026216e+00 6.925283e-01....
```
Summary:
```
nodes with p-value >0.05: 6,587,283
nodes with p-value <0.05: 225,086
nodes with p-value <0.001: 18,263
```
### 7. Run GN with only the samples that I am using.
Using only a subset of the data, we're getting a spurious result elsewhere in the genome (e.g. with only 20 strains, it could be that there is somewhere else in the genome that associates by chance with this trait). Still looks like a very strong peak on chr1 (although there is now a potentially spurious peak on chr17).

https://genenetwork.org/run_mapping/403599db0e7af0ae3db7ee15e4ea47c5
- The peak with this number of strains goes from ~151-162 Mb on chr1
- The peak with this number of strains goes from ~15-30 Mb on chr17
What I done again:
- Extract the region of interest from the graph:
```
$ODGI extract -i D_C_mm10.fa.gz.bf3285f.eb0f3d3.867196c.smooth.final.og -r REF#chr1:151000000-162000000 -t 16 -o - | $ODGI view -i - -g > test/151000000-162000000.gfa
```
- Calculate the start/end of the bubbles (markers) on the graph of interest:
```
vg stats -R 151000000-162000000.vg
```
- See the association for that region, it is still high:
```
test/testass.txt
```
- Still high p-wald:
```
summary(gwasResults$p_wald)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5472 0.5472 0.5472 0.5472 0.5472 0.5472
```