# Sorghum Analysis Sandbox v2
- HapMap file: https://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/hapmap/all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
- VCF files: https://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/gvcf
- Phenotype data: https://drive.google.com/drive/folders/1K4OrHbaDvao7vrN0_V2yc0gD_wmvFXdC (more incoming via https://github.com/genophenoenvo/terraref-datasets/issues/24) and https://github.com/genophenoenvo/terraref-datasets/files/4565996/short_format_traits_season_4.txt
- Tassel: https://www.maizegenetics.net/tassel
Endpoints:
- SNP to phenotype associations with p-values and effect sizes
- Manhattan plots to determine p-value cutoff
- Phylogenetic tree
- Documentation on this process
Follow-up analyses (to become future tickets):
- Look for conservation of GWAS SNPs in closely related species (looking at syntenic regions)
- Look for similar phenotype data in QTL datasets in Planteome
-------------------------------------------------
Some notes on the data:
Tassel requires one of the following formats for phenotypes:
Trait format: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual/Load/Load#markdown-header-trait-format
Phenotype format: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual/Load/Load#markdown-header-numerical-data
-------------------------------------------------
##### Downloading the HapMap and VCF files
The CyVerse web download only gets 1.1 GB of the 31 before erroring, so trying iRODS. iRODS isn't supported for this OS, so dockerizing...
iRODS Dockerfile:
```
FROM ubuntu:14.04
VOLUME /data
RUN apt-get -y update && apt-get install -y wget lsb-core
RUN wget -qO - https://packages.irods.org/irods-signing-key.asc | apt-key add -
RUN echo "deb [arch=amd64] https://packages.irods.org/apt/ $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/renci-irods.list
RUN apt-get -y update && apt-get install -y irods-server irods-database-plugin-postgres
ENTRYPOINT ["/bin/sh"]
```
Build and run:
```
docker build . --tag irods
docker run -it -v /home/kshefchek/git/terraref-datasets:/data irods
```
Set up icommands per https://wiki.cyverse.org/wiki/display/DS/Setting+Up+iCommands#SettingUpiCommands-co and run ```iinit``` with:
- hostname: data.cyverse.org
- port: 1247
- zone: iplant
then ```cd data```
Relevant iget options:
- `-K` verify the checksum
- `-r` recursive
For the HapMap file:
```
iget -K /iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/hapmap/all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
```
The checksum should match `adb69f50a3cf1122278f317c3e052e85`:
```
$ md5sum all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
adb69f50a3cf1122278f317c3e052e85  all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
```
The HapMap header contains extra characters compared to the trait file (e.g. `IDUE_PI452692` vs `PI452692`), so strip the prefixes from the header line and append the rest of the file unchanged:
```
head -1 all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt | sed 's/[A-Z]\{4\}_\(PI[0-9]\{5,6\}\)/\1/g' > all_combined_genotyped_lines_SNPS_052217.hmp.txt
tail -n +2 all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt >> all_combined_genotyped_lines_SNPS_052217.hmp.txt
```
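The sed substitution can be sanity-checked on a couple of sample IDs (`IDUE_PI452692` is from the real header; `ABCD_PI12345` is a made-up ID for illustration):

```bash=
# Strip a 4-uppercase-letter prefix and underscore from PI accession IDs
printf 'IDUE_PI452692\tABCD_PI12345\n' \
  | sed 's/[A-Z]\{4\}_\(PI[0-9]\{5,6\}\)/\1/g'
# -> PI452692	PI12345
```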
For the VCF files:
```iget -K -r /iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/gvcf/```
grab some coffee...
##### Merging the VCF files
Unzip, then bgzip and tabix-index:
```
gunzip *
docker pull biocontainers/vcftools:v0.1.16-1-deb_cv1
docker run \
--volume `pwd`:/data \
biocontainers/vcftools:v0.1.16-1-deb_cv1 \
/bin/sh -c 'for F in *.vcf ; do bgzip ${F} ; tabix -f -p vcf ${F}.gz ; done'
```
Download the Sorghum reference: https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Sbicolor
Note that the NCBI download doesn't work: the FASTA headers need to match the contig names (e.g. `>Chr01`).
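A quick way to list which contig names a FASTA actually uses, so they can be compared against the expected names (toy two-record FASTA shown; real headers will differ):

```bash=
# List sequence names in a FASTA: keep header lines, drop the '>'
# and any trailing description
printf '>Chr01 whole chromosome\nACGT\n>Chr02\nTTGA\n' \
  | grep '^>' | sed 's/^>//; s/ .*//'
# prints Chr01 and Chr02, one per line
```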
Make `.fai` and `.dict` indexes of the FASTA file
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/samtools:v1.9-4-deb_cv1 \
/bin/sh -c 'samtools faidx Sbicolor_313_v3.0.fa'
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/picard:v2.3.0_cv2 \
/bin/sh -c 'picard CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict'
```
Merge the VCF files using bcftools (takes 2 days and change):
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/bcftools:v1.9-1-deb_cv1 \
/bin/sh -c 'bcftools merge --merge both --threads 64 --output-type z \
--gvcf Sbicolor_313_v3.0.fa *vcf.gz > bcf-merged.vcf.gz'
```
Apply SNP filters, see
https://docs.terraref.org/protocols/genomic-data#applying-hard-snp-filters-with-gatk-variantfiltration
Run GATK3 GenotypeGVCFs (est. 70 hours):
```
docker run \
-v `pwd`:/gatk/my_data \
-d \
broadinstitute/gatk3:3.5-0 \
/bin/sh -c 'java -Xmx100g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /gatk/my_data/Sbicolor_313_v3.0.fa \
-V /gatk/my_data/bcf-merged.vcf.gz -o /gatk/my_data/all_combined_Genotyped_lines.vcf --max_alternate_alleles 10'
```
Run GATK3 VariantFiltration
```
docker run \
-v `pwd`:/gatk/my_data \
-d \
broadinstitute/gatk3:3.5-0 \
/bin/sh -c 'java -Xmx100g -jar GenomeAnalysisTK.jar -T VariantFiltration -R /gatk/my_data/Sbicolor_313_v3.0.fa \
-V /gatk/my_data/all_combined_Genotyped_lines.vcf -o /gatk/my_data/all_combined_Genotyped_lines_filtered.vcf \
--filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
--filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
--filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"'
```
Extract only the cultivars with phenotype data (for this run, season 4):
```
docker run \
--volume `pwd`:/data \
--user 1001 \
biocontainers/bcftools:v1.9-1-deb_cv1 \
/bin/sh -c 'bcftools view --output-file all_combined_Genotyped_lines_filtered.season4.vcf.gz --output-type z --samples-file season_4_cultivars.txt all_combined_Genotyped_lines_filtered.vcf.gz'
```
Keep only biallelic sites, filter and recode
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/vcftools:v0.1.16-1-deb_cv1 \
/bin/sh -c 'vcftools --gzvcf all_combined_Genotyped_lines_filtered.season4.vcf.gz --min-alleles 2 --max-alleles 2 \
--out all_combined_Genotyped_lines.filtered.season4.recode.vcf --max-missing 0.2 --recode'
```
### Filtering, Imputation, Filtering
Filter out sites with a missing-genotype fraction of 0.7 or more, and a minor allele frequency below 0.01:
```bash=
bcftools view --min-af 0.01:minor -i 'F_MISSING<0.7' all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf
```
`bcftools stats` summary:
```
# SN [2]id [3]key [4]value
SN 0 number of samples: 296
SN 0 number of records: 3018438
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 2738403
SN 0 number of MNPs: 0
SN 0 number of indels: 280035
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
```
#### Impute
First, find the contigs with more than 100 sites:
```bash=
cat all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn | awk '{ if ($1 > 100) { print $2 } }' > contigs_with_100_or_more_variants.txt
```
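On toy data, the count-and-threshold step behaves like this (threshold lowered to > 2 so the toy example filters something):

```bash=
# Count variants per contig (column 1) and keep contigs above a threshold
printf '01\n01\n01\nSUPER_99\n' \
  | sort | uniq -c | sort -rn \
  | awk '{ if ($1 > 2) { print $2 } }'
# -> 01
```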
Get the contigs into the right format to filter with grep (each contig becomes `\|^CONTIG`, joined onto one line to paste into the pattern below):
```bash=
cat contigs_with_100_or_more_variants.txt | sort | xargs -L 1 echo \\\|^ | tr '\n' ' ' | sed 's@ @@g'
```
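An equivalent, slightly more direct way to build the alternation string (toy contig list; the leading `\|` is meant to be appended to the `^#\|^#CHROM\|...` pattern in the grep below):

```bash=
# Prefix each contig with \|^ and join onto one line
printf 'Chr01\nSUPER_16\n' | sed 's@^@\\|^@' | tr -d '\n'
# -> \|^Chr01\|^SUPER_16
```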
```
cat all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf | grep -w "^#\|^#CHROM\|^01\|^02\|^03\|^04\|^05\|^06\|^07\|^08\|^09\|^10\|^SUPER_100\|^SUPER_103\|^SUPER_104\|^SUPER_105\|^SUPER_109\|^SUPER_110\|^SUPER_114\|^SUPER_118\|^SUPER_119\|^SUPER_120\|^SUPER_123\|^SUPER_127\|^SUPER_128\|^SUPER_1305\|^SUPER_132\|^SUPER_133\|^SUPER_136\|^SUPER_141\|^SUPER_142\|^SUPER_144\|^SUPER_146\|^SUPER_16\|^SUPER_18\|^SUPER_20\|^SUPER_22\|^SUPER_25\|^SUPER_26\|^SUPER_27\|^SUPER_28\|^SUPER_29\|^SUPER_30\|^SUPER_31\|^SUPER_32\|^SUPER_33\|^SUPER_35\|^SUPER_36\|^SUPER_368\|^SUPER_37\|^SUPER_38\|^SUPER_39\|^SUPER_398\|^SUPER_40\|^SUPER_403\|^SUPER_41\|^SUPER_42\|^SUPER_43\|^SUPER_44\|^SUPER_45\|^SUPER_46\|^SUPER_47\|^SUPER_48\|^SUPER_49\|^SUPER_50\|^SUPER_51\|^SUPER_52\|^SUPER_522\|^SUPER_53\|^SUPER_54\|^SUPER_55\|^SUPER_56\|^SUPER_58\|^SUPER_60\|^SUPER_61\|^SUPER_62\|^SUPER_63\|^SUPER_637\|^SUPER_64\|^SUPER_65\|^SUPER_66\|^SUPER_68\|^SUPER_680\|^SUPER_69\|^SUPER_71\|^SUPER_72\|^SUPER_73\|^SUPER_74\|^SUPER_75\|^SUPER_76\|^SUPER_77\|^SUPER_78\|^SUPER_79\|^SUPER_80\|^SUPER_81\|^SUPER_82\|^SUPER_83\|^SUPER_84\|^SUPER_85\|^SUPER_86\|^SUPER_87\|^SUPER_88\|^SUPER_89\|^SUPER_90\|^SUPER_91\|^SUPER_92\|^SUPER_93\|^SUPER_94\|^SUPER_95\|^SUPER_96\|^SUPER_97\|^SUPER_99" | pigz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.vcf.gz
```
Now run Beagle
```bash=
java -jar -Xmx12g beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed
```
Now filter minor allele frequency to 0.05:
```bash=
bcftools view -Oz --min-af 0.05:minor all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed.vcf.gz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed.maf_005.vcf.gz
```
`bcftools stats` summary after imputation and the MAF filter:
```
# SN [2]id [3]key [4]value
SN 0 number of samples: 296
SN 0 number of records: 1609194
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 1512852
SN 0 number of MNPs: 0
SN 0 number of indels: 96342
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
```
##### DELETE BELOW (superseded first attempt, kept for reference)
Index before filter
```
singularity exec ~/images/bcftools.img bcftools index all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz
```
Filter by fraction of missing genotypes at 10% (REDO: less stringent now, more stringent later)
```
singularity exec ~/images/bcftools.img bcftools view \
-i 'F_MISSING<0.1' \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz \
-Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz
```
records: 145492, SNPs: 136133, indels: 9359
index again
```
singularity exec ~/images/bcftools.img bcftools index \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz
```
Filter by minor allele frequency at 0.05
```
singularity exec ~/images/bcftools.img bcftools view \
--min-af 0.05:minor \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz \
--output-type z \
-o all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz
```
records: 69035, SNPs: 66726, indels: 2309 👍
Imputation with Beagle
```
java -jar ~/images/beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.imputed.vcf.gz
```
This fails with
``` Exception in thread "main" java.lang.IllegalArgumentException: Window has only one position: CHROM=SUPER_1028 POS=2879 ```
Let's see how many contigs only have one position
```
zcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep SUPER_ | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn
```
```
...
1 SUPER_99
1 SUPER_926
1 SUPER_802
1 SUPER_70
1 SUPER_629
1 SUPER_587
1 SUPER_512
1 SUPER_396
1 SUPER_296
1 SUPER_267
1 SUPER_248
1 SUPER_197
1 SUPER_1623
1 SUPER_1522
1 SUPER_117
1 SUPER_112
1 SUPER_1028
```
At the top of the list, the counts look like:
```
8460 05
7001 07
6685 02
6161 06
5825 03
5560 08
5418 09
5409 04
3976 01
3939 10
980 SUPER_18
831 SUPER_20
707 SUPER_16
576 SUPER_22
557 SUPER_32
412 SUPER_26
330 SUPER_30
307 SUPER_75
300 SUPER_47
256 SUPER_38
234 SUPER_35
222 SUPER_31
205 SUPER_29
190 SUPER_37
180 SUPER_28
171 SUPER_33
163 SUPER_36
150 SUPER_25
148 SUPER_127
137 SUPER_27
121 SUPER_60
113 SUPER_43
112 SUPER_80
110 SUPER_45
104 SUPER_56
95 SUPER_39
92 SUPER_49
```
Going to pick a cutoff: keep contigs with more than 100 variants
```bash=
zcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn | awk '{ if ($1 > 100) { print $2 } }' > contigs_with_100_or_more_variants.txt
```
Now limit the VCF to just these contigs. bcftools didn't seem to like the single-column contig list as a regions file or on the command line, so falling back to grep:
```
gzcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep -w "^#\|^#CHROM\|^01\|^02\|^03\|^04\|^05\|^06\|^07\|^08\|^09\|^10\|^SUPER_18\|^SUPER_20\|^SUPER_16\|^SUPER_22\|^SUPER_32\|^SUPER_26\|^SUPER_30\|^SUPER_75\|^SUPER_47\|^SUPER_38\|^SUPER_35\|^SUPER_31\|^SUPER_29\|^SUPER_37\|^SUPER_28\|^SUPER_33\|^SUPER_36\|^SUPER_25\|^SUPER_127\|^SUPER_27\|^SUPER_60\|^SUPER_43\|^SUPER_80\|^SUPER_45\|^SUPER_56" | pigz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.vcf.gz
```
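The `-w` flag matters here: without it, `^SUPER_99` would also match `SUPER_999`. A toy check (`SUPER_999` is a made-up contig for illustration):

```bash=
# Keep header lines and exact contig matches; -w rejects SUPER_999
# because the match is followed by another word character
printf '##fileformat\n01\t5\nSUPER_99\t7\nSUPER_999\t9\n' \
  | grep -w "^#\|^01\|^SUPER_99"
# prints the first three lines only; SUPER_999 is excluded
```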
Re-Beagling
```bash=
java -jar beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.imputed
```
### Filtering, Imputation, Filtering Retry
```
bcftools view -i 'F_MISSING<0.3' all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.vcf.gz
```
number of records: 5,031,759 => 751,518
number of SNPs: 4,565,515 => 693,157
number of indels: 466,244 => 58,361
Run rMVP with one principal component (PC=1):
```R=
library(rMVP)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098"
)
##attach input file
genotype <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.geno.desc")
phenotype <- read.table("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.phe",head=TRUE)
map <- read.table("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.geno.map" , head = TRUE)
Covariates <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.pc.desc")
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.kin.desc")
##GWAS analysis
for(i in 2:ncol(phenotype)){
imMVP <- MVP(
phe=phenotype[, c(1, i)],
geno=genotype,
map=map,
K=kinship,
nPC.MLM = 1,
nPC.FarmCPU = 1,
threshold=0.05,
method=c("MLM", "FarmCPU")
)
gc()
}
```
Filter the FarmCPU CSV output for a subset of phenotypes.
For files: `PC1.aboveground_dry_biomass.FarmCPU.csv PC1.canopy_height.FarmCPU.csv PC1.flag_leaf_emergence_time.FarmCPU.csv PC1.flowering_time.FarmCPU.csv`
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.aboveground_dry_biomass.FarmCPU.csv > PC1.aboveground_dry_biomass.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.canopy_height.FarmCPU.csv > PC1.canopy_height.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.flag_leaf_emergence_time.FarmCPU.csv > PC1.flag_leaf_emergence_time.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.flowering_time.FarmCPU.csv > PC1.flowering_time.FarmCPU.p05.csv
cat *p05.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > 4pheno.PC1.FarmCPU.p05.unique.csv
```
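The whole filter-then-extract pipeline on a toy FarmCPU CSV (column names and values are made up; column 8 is the p-value and columns 2-3 are CHROM/POS, as in the real output):

```bash=
# Keep rows with p < 0.05 (col 8), then pull the quoted CHROM,POS pairs
# and convert them to tab-separated regions
printf 'SNP,CHROM,POS,REF,ALT,Effect,SE,pvalue\ns1,"01",100,A,T,0.1,0.1,0.01\ns2,"02",200,A,T,0.1,0.1,0.5\n' \
  | awk -F ',' 'NR==1 || $8 < 0.05' \
  | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t'
```

Only the s1 row survives, emitted as a tab-separated `01 100` region line.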
bgzip and tabix-index the VCF file (`bgzip`, then `tabix -p vcf` on the result), then:
```bash=
bcftools view --regions-file 4pheno.PC1.FarmCPU.p05.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf.gz
```
Unzip again...
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf.gz
```
Now produce a new kinship matrix with rMVP
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.tsv")
```
Add labels with pandas, because I can't figure it out in R.
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.labeled.tsv', sep='\t')
```
#### Same as above, but for p < 0.001
Filter the FarmCPU CSV output for a subset of phenotypes.
For files: `PC1.aboveground_dry_biomass.FarmCPU.csv PC1.canopy_height.FarmCPU.csv PC1.flag_leaf_emergence_time.FarmCPU.csv PC1.flowering_time.FarmCPU.csv`
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.aboveground_dry_biomass.FarmCPU.csv > PC1.aboveground_dry_biomass.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.canopy_height.FarmCPU.csv > PC1.canopy_height.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.flag_leaf_emergence_time.FarmCPU.csv > PC1.flag_leaf_emergence_time.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.flowering_time.FarmCPU.csv > PC1.flowering_time.FarmCPU.p0001.csv
cat *p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > 4pheno.PC1.FarmCPU.p0001.unique.csv
```
bgzip and tabix-index the VCF file (`bgzip`, then `tabix -p vcf` on the result), then:
```bash=
bcftools view --regions-file 4pheno.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf.gz
```
Unzip again...
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf.gz
```
Now produce a new kinship matrix with rMVP
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.tsv")
```
Add labels with pandas, because I can't figure it out in R.
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.labeled.tsv', sep='\t')
```
#### Individual matrices for just canopy height & aboveground dry biomass
```bash=
cat PC1.aboveground_dry_biomass.FarmCPU.p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > aboveground_dry_biomass.PC1.FarmCPU.p0001.unique.csv
cat PC1.canopy_height.FarmCPU.p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > canopy_height.PC1.FarmCPU.p0001.unique.csv
```
```bash=
bcftools view --regions-file aboveground_dry_biomass.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf.gz
bcftools view --regions-file canopy_height.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf.gz
```
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf.gz
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf.gz
```
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.tsv")
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.tsv")
```
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.labeled.tsv', sep='\t')
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.labeled.tsv', sep='\t')
```
```bash=
gzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.labeled.tsv
gzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.labeled.tsv
```
#### Produce a VCF file limited to plant height QTL regions
[Search aussorgm](https://aussorgm.org.au/sorghum-qtl-atlas/search/) for `Assembly is v3.0`
and `Trait Subcategory CONTAINS "height"`, export as Excel, convert to TSV, and awk it down to only the rows with `plant height` as the trait (searching for `plant height` directly didn't work, unfortunately). Also: the regions file needs to be tab-delimited, so convert the `1:234-5678` format into 3 columns by replacing `:` and `-`.
```
awk -F '\t' 'BEGIN {OFS="\t"} ($4 ~ /plant height/) {print $5}' height_qtl.tsv | tr ':' '\t' | tr '-' '\t' > plant_height_regions.txt
```
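The same awk match and `:`/`-` conversion on a made-up QTL row (column layout assumed from the command above: trait in column 4, region in column 5):

```bash=
# Match rows whose trait column contains "plant height" and split the
# region string into three tab-separated columns
printf 'QTL1\tpub\tcat\tplant height\t1:234-5678\n' \
  | awk -F '\t' 'BEGIN {OFS="\t"} ($4 ~ /plant height/) {print $5}' \
  | tr ':' '\t' | tr '-' '\t'
```

This yields the tab-separated region `1 234 5678` ready for `bcftools view -R`.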
Filter to those regions with bcftools (and a shorter VCF filename is in order at this point):
```bash=
bcftools view -R plant_height_regions.txt all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.vcf.gz > sorghum_plant_height_qtl.vcf
```
Make a trait file for just canopy height
```bash=
cut -f 1,24 short_format_traits_season_4.txt > canopy_height_trait_season4.txt
```
Run the GWAS
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="sorghum_plant_height_qtl.vcf",
filePhe="canopy_height_trait_season4.txt",
fileKin=TRUE,
filePC=TRUE,
out="sorghum_plant_height_qtl"
)
##attach input file
genotype <- attach.big.matrix("sorghum_plant_height_qtl.geno.desc")
phenotype <- read.table("sorghum_plant_height_qtl.phe",head=TRUE)
map <- read.table("sorghum_plant_height_qtl.geno.map" , head = TRUE)
Covariates <- attach.big.matrix("sorghum_plant_height_qtl.pc.desc")
kinship <- attach.big.matrix("sorghum_plant_height_qtl.kin.desc")
##GWAS analysis
for(i in 2:ncol(phenotype)){
imMVP <- MVP(
phe=phenotype[, c(1, i)],
geno=genotype,
map=map,
K=kinship,
nPC.MLM = 1,
nPC.FarmCPU = 1,
threshold=0.05,
method=c("MLM", "FarmCPU")
)
gc()
}
```
The QQ plot for FarmCPU looks great; filtering to p < 0.05 gives ~6,000 locations.
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' canopy_height.FarmCPU.csv > canopy_height.FarmCPU.p005.csv
cat canopy_height.FarmCPU.p005.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > canopy_height.FarmCPU.p005.unique.csv
```
Filter the VCF based on this regions file, and unzip before sending it back to rMVP:
```bash=
bgzip sorghum_plant_height_qtl.vcf
tabix sorghum_plant_height_qtl.vcf.gz
bcftools view --regions-file canopy_height.FarmCPU.p005.unique.csv sorghum_plant_height_qtl.vcf.gz -Oz -o sorghum_plant_height_qtl.canopy_height_p005.vcf.gz
gunzip sorghum_plant_height_qtl.canopy_height_p005.vcf.gz
```
Make the kinship matrix
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="sorghum_plant_height_qtl.canopy_height_p005.vcf",
filePhe="canopy_height_trait_season4.txt",
fileKin=TRUE,
filePC=TRUE,
out="sorghum_plant_height_qtl.canopy_height_p005"
)
kinship <- attach.big.matrix("sorghum_plant_height_qtl.canopy_height_p005.kin.desc")
write_tsv(as.data.frame(kinship[,]), "sorghum_plant_height_qtl.canopy_height_p005.kin.tsv")
```
Finally, add labels to the matrix
```python=
import pandas as pd
kin = pd.read_csv('sorghum_plant_height_qtl.canopy_height_p005.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('sorghum_plant_height_qtl.canopy_height_p005.kin.labeled.tsv', sep='\t')
```
### Get just the ~90 SNPs from p < 0.001
Two of the steps above are now nicely packaged as scripts:
```bash=
./matrix.R sorghum_plant_height_qtl.canopy_height_p0001.vcf canopy_height_trait_season4.txt
python3 add_matrix_labels.py sorghum_plant_height_qtl.canopy_height_p0001.kin.tsv samples.txt
```