# Sorghum Analysis Sandbox v2
- HapMap file: https://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/hapmap/all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
- VCF files: https://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/gvcf
- Phenotype data: https://drive.google.com/drive/folders/1K4OrHbaDvao7vrN0_V2yc0gD_wmvFXdC (more incoming via https://github.com/genophenoenvo/terraref-datasets/issues/24) and https://github.com/genophenoenvo/terraref-datasets/files/4565996/short_format_traits_season_4.txt
- Tassel: https://www.maizegenetics.net/tassel
Endpoints:
- SNP to phenotype associations with p-values and effect sizes
- Manhattan plots to determine p-value cutoff
- Phylogenetic tree
- Documentation on this process
Follow-up analyses (to become future tickets):
- Look for conservation of GWAS SNPs in closely related species (looking at syntenic regions)
- Look for similar phenotype data in QTL datasets in Planteome
-------------------------------------------------
Some notes on the data:
Tassel requires one of the following formats for phenotypes:
Trait format: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual/Load/Load#markdown-header-trait-format
Phenotype format: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/UserManual/Load/Load#markdown-header-numerical-data
-------------------------------------------------
##### Downloading the HapMap and VCF files
The CyVerse web download only gets 1.1 GB of the 31 before erroring, so trying iRODS. iRODS isn't supported for this OS, so dockerizing...
iRODS Dockerfile:
```
FROM ubuntu:14.04
VOLUME /data
RUN apt-get -y update && apt-get install -y wget lsb-core
RUN wget -qO - https://packages.irods.org/irods-signing-key.asc | apt-key add -
RUN echo "deb [arch=amd64] https://packages.irods.org/apt/ $(lsb_release -sc) main" | tee /etc/apt/sources.list.d/renci-irods.list
RUN apt-get -y update && apt-get install -y irods-server irods-database-plugin-postgres
ENTRYPOINT ["/bin/sh"]
```
Build and run:
```
docker build . --tag irods
docker run -it -v /home/kshefchek/git/terraref-datasets:/data irods
```
Set up icommands per https://wiki.cyverse.org/wiki/display/DS/Setting+Up+iCommands#SettingUpiCommands-co and run ```iinit``` with:
- hostname: data.cyverse.org
- port: 1247
- zone: iplant
then ```cd data```
Relevant iget options:
- `-K` verify the checksum
- `-r` recursive
For the HapMap file:
```
iget -K /iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/hapmap/all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
```
The checksum should match `adb69f50a3cf1122278f317c3e052e85`:
```
$ md5sum all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
adb69f50a3cf1122278f317c3e052e85  all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt
```
The HapMap header contains extra characters compared to the trait file (e.g. `IDUE_PI452692` vs `PI452692`), so strip the prefixes from the header line and append the rest of the file unchanged:
```
head -1 all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt | sed 's/[A-Z]\{4\}_\(PI[0-9]\{5,6\}\)/\1/g' > all_combined_genotyped_lines_SNPS_052217.hmp.txt
tail -n +2 all_combined_genotyped_lines_SNPS_052217.recalibrated_vcftools.filtered.recode.hmp.txt >> all_combined_genotyped_lines_SNPS_052217.hmp.txt
```
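The sed substitution can be sanity-checked on a couple of sample IDs (`IDUE_PI452692` is from the real header; `ABCD_PI12345` is a made-up ID for illustration):

```bash=
# Strip a 4-uppercase-letter prefix and underscore from PI accession IDs
printf 'IDUE_PI452692\tABCD_PI12345\n' \
  | sed 's/[A-Z]\{4\}_\(PI[0-9]\{5,6\}\)/\1/g'
# -> PI452692	PI12345
```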
For the VCF files:
```iget -K -r /iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/gvcf/```
grab some coffee...
##### Merging the VCF files
Unzip, then bgzip and tabix-index:
```
gunzip *
docker pull biocontainers/vcftools:v0.1.16-1-deb_cv1
docker run \
--volume `pwd`:/data \
biocontainers/vcftools:v0.1.16-1-deb_cv1 \
/bin/sh -c 'for F in *.vcf ; do bgzip ${F} ; tabix -f -p vcf ${F}.gz ; done'
```
Download the Sorghum reference: https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Sbicolor
Note that the NCBI download doesn't work: the FASTA headers need to match the contig names (e.g. `>Chr01`).
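A quick way to list which contig names a FASTA actually uses, so they can be compared against the expected names (toy two-record FASTA shown; real headers will differ):

```bash=
# List sequence names in a FASTA: keep header lines, drop the '>'
# and any trailing description
printf '>Chr01 whole chromosome\nACGT\n>Chr02\nTTGA\n' \
  | grep '^>' | sed 's/^>//; s/ .*//'
# prints Chr01 and Chr02, one per line
```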
Make `.fai` and `.dict` indexes of the FASTA file
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/samtools:v1.9-4-deb_cv1 \
/bin/sh -c 'samtools faidx Sbicolor_313_v3.0.fa'
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/picard:v2.3.0_cv2 \
/bin/sh -c 'picard CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict'
```
Merge the VCF files using bcftools (takes 2 days and change):
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/bcftools:v1.9-1-deb_cv1 \
/bin/sh -c 'bcftools merge --merge both --threads 64 --output-type z \
--gvcf Sbicolor_313_v3.0.fa *vcf.gz > bcf-merged.vcf.gz'
```
Apply SNP filters, see
https://docs.terraref.org/protocols/genomic-data#applying-hard-snp-filters-with-gatk-variantfiltration
Run GATK3 GenotypeGVCFs (est. 70 hours):
```
docker run \
-v `pwd`:/gatk/my_data \
-d \
broadinstitute/gatk3:3.5-0 \
/bin/sh -c 'java -Xmx100g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /gatk/my_data/Sbicolor_313_v3.0.fa \
-V /gatk/my_data/bcf-merged.vcf.gz -o /gatk/my_data/all_combined_Genotyped_lines.vcf --max_alternate_alleles 10'
```
Run GATK3 VariantFiltration
```
docker run \
-v `pwd`:/gatk/my_data \
-d \
broadinstitute/gatk3:3.5-0 \
/bin/sh -c 'java -Xmx100g -jar GenomeAnalysisTK.jar -T VariantFiltration -R /gatk/my_data/Sbicolor_313_v3.0.fa \
-V /gatk/my_data/all_combined_Genotyped_lines.vcf -o /gatk/my_data/all_combined_Genotyped_lines_filtered.vcf \
--filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
--filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
--filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"'
```
Extract only the cultivars with phenotype data (for this run, season 4):
```
docker run \
--volume `pwd`:/data \
--user 1001 \
biocontainers/bcftools:v1.9-1-deb_cv1 \
/bin/sh -c 'bcftools view --output-file all_combined_Genotyped_lines_filtered.season4.vcf.gz --output-type z --samples-file season_4_cultivars.txt all_combined_Genotyped_lines_filtered.vcf.gz'
```
Keep only biallelic sites, filter and recode
```
docker run \
--volume `pwd`:/data \
--user 1001 \
-d \
biocontainers/vcftools:v0.1.16-1-deb_cv1 \
/bin/sh -c 'vcftools --gzvcf all_combined_Genotyped_lines_filtered.season4.vcf.gz --min-alleles 2 --max-alleles 2 \
--out all_combined_Genotyped_lines.filtered.season4.recode.vcf --max-missing 0.2 --recode'
```
### Filtering, Imputation, Filtering
Filter out sites with a missing-genotype fraction of 0.7 or more, and a minor allele frequency below 0.01:
```bash=
bcftools view --min-af 0.01:minor -i 'F_MISSING<0.7' all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf
```
`bcftools stats` summary:
```
# SN [2]id [3]key [4]value
SN 0 number of samples: 296
SN 0 number of records: 3018438
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 2738403
SN 0 number of MNPs: 0
SN 0 number of indels: 280035
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
```
#### Impute
First, find the contigs with more than 100 sites:
```bash=
cat all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn | awk '{ if ($1 > 100) { print $2 } }' > contigs_with_100_or_more_variants.txt
```
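On toy data, the count-and-threshold step behaves like this (threshold lowered to > 2 so the toy example filters something):

```bash=
# Count variants per contig (column 1) and keep contigs above a threshold
printf '01\n01\n01\nSUPER_99\n' \
  | sort | uniq -c | sort -rn \
  | awk '{ if ($1 > 2) { print $2 } }'
# -> 01
```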
Get the contigs into the right format to filter with grep (each contig becomes `\|^CONTIG`, joined onto one line to paste into the pattern below):
```bash=
cat contigs_with_100_or_more_variants.txt | sort | xargs -L 1 echo \\\|^ | tr '\n' ' ' | sed 's@ @@g'
```
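An equivalent, slightly more direct way to build the alternation string (toy contig list; the leading `\|` is meant to be appended to the `^#\|^#CHROM\|...` pattern in the grep below):

```bash=
# Prefix each contig with \|^ and join onto one line
printf 'Chr01\nSUPER_16\n' | sed 's@^@\\|^@' | tr -d '\n'
# -> \|^Chr01\|^SUPER_16
```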
```
cat all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.vcf | grep -w "^#\|^#CHROM\|^01\|^02\|^03\|^04\|^05\|^06\|^07\|^08\|^09\|^10\|^SUPER_100\|^SUPER_103\|^SUPER_104\|^SUPER_105\|^SUPER_109\|^SUPER_110\|^SUPER_114\|^SUPER_118\|^SUPER_119\|^SUPER_120\|^SUPER_123\|^SUPER_127\|^SUPER_128\|^SUPER_1305\|^SUPER_132\|^SUPER_133\|^SUPER_136\|^SUPER_141\|^SUPER_142\|^SUPER_144\|^SUPER_146\|^SUPER_16\|^SUPER_18\|^SUPER_20\|^SUPER_22\|^SUPER_25\|^SUPER_26\|^SUPER_27\|^SUPER_28\|^SUPER_29\|^SUPER_30\|^SUPER_31\|^SUPER_32\|^SUPER_33\|^SUPER_35\|^SUPER_36\|^SUPER_368\|^SUPER_37\|^SUPER_38\|^SUPER_39\|^SUPER_398\|^SUPER_40\|^SUPER_403\|^SUPER_41\|^SUPER_42\|^SUPER_43\|^SUPER_44\|^SUPER_45\|^SUPER_46\|^SUPER_47\|^SUPER_48\|^SUPER_49\|^SUPER_50\|^SUPER_51\|^SUPER_52\|^SUPER_522\|^SUPER_53\|^SUPER_54\|^SUPER_55\|^SUPER_56\|^SUPER_58\|^SUPER_60\|^SUPER_61\|^SUPER_62\|^SUPER_63\|^SUPER_637\|^SUPER_64\|^SUPER_65\|^SUPER_66\|^SUPER_68\|^SUPER_680\|^SUPER_69\|^SUPER_71\|^SUPER_72\|^SUPER_73\|^SUPER_74\|^SUPER_75\|^SUPER_76\|^SUPER_77\|^SUPER_78\|^SUPER_79\|^SUPER_80\|^SUPER_81\|^SUPER_82\|^SUPER_83\|^SUPER_84\|^SUPER_85\|^SUPER_86\|^SUPER_87\|^SUPER_88\|^SUPER_89\|^SUPER_90\|^SUPER_91\|^SUPER_92\|^SUPER_93\|^SUPER_94\|^SUPER_95\|^SUPER_96\|^SUPER_97\|^SUPER_99" | pigz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.vcf.gz
```
Now run Beagle
```bash=
java -jar -Xmx12g beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed
```
Now filter minor allele frequency to 0.05:
```bash=
bcftools view -Oz --min-af 0.05:minor all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed.vcf.gz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_070.maf_001.contig-filtered.imputed.maf_005.vcf.gz
```
`bcftools stats` summary after imputation and the MAF filter:
```
# SN [2]id [3]key [4]value
SN 0 number of samples: 296
SN 0 number of records: 1609194
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 1512852
SN 0 number of MNPs: 0
SN 0 number of indels: 96342
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
```
##### DELETE BELOW (superseded first attempt, kept for reference)
Index before filter
```
singularity exec ~/images/bcftools.img bcftools index all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz
```
Filter by fraction of missing genotypes at 10% (REDO: less stringent now, more stringent later)
```
singularity exec ~/images/bcftools.img bcftools view \
-i 'F_MISSING<0.1' \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz \
-Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz
```
records: 145492, SNPs: 136133, indels: 9359
index again
```
singularity exec ~/images/bcftools.img bcftools index \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz
```
Filter by minor allele frequency at 0.05
```
singularity exec ~/images/bcftools.img bcftools view \
--min-af 0.05:minor \
all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.vcf.gz \
--output-type z \
-o all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz
```
records: 69035, SNPs: 66726, indels: 2309 👍
Imputation with Beagle
```
java -jar ~/images/beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.imputed.vcf.gz
```
This fails with
``` Exception in thread "main" java.lang.IllegalArgumentException: Window has only one position: CHROM=SUPER_1028 POS=2879 ```
Let's see how many contigs only have one position
```
zcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep SUPER_ | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn
```
```
...
1 SUPER_99
1 SUPER_926
1 SUPER_802
1 SUPER_70
1 SUPER_629
1 SUPER_587
1 SUPER_512
1 SUPER_396
1 SUPER_296
1 SUPER_267
1 SUPER_248
1 SUPER_197
1 SUPER_1623
1 SUPER_1522
1 SUPER_117
1 SUPER_112
1 SUPER_1028
```
At the top of the list, the counts look like:
```
8460 05
7001 07
6685 02
6161 06
5825 03
5560 08
5418 09
5409 04
3976 01
3939 10
980 SUPER_18
831 SUPER_20
707 SUPER_16
576 SUPER_22
557 SUPER_32
412 SUPER_26
330 SUPER_30
307 SUPER_75
300 SUPER_47
256 SUPER_38
234 SUPER_35
222 SUPER_31
205 SUPER_29
190 SUPER_37
180 SUPER_28
171 SUPER_33
163 SUPER_36
150 SUPER_25
148 SUPER_127
137 SUPER_27
121 SUPER_60
113 SUPER_43
112 SUPER_80
110 SUPER_45
104 SUPER_56
95 SUPER_39
92 SUPER_49
```
Going to pick a cutoff: keep contigs with more than 100 variants
```bash=
zcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep -v "^#" | cut -f 1 | sort | uniq -c | sort -rn | awk '{ if ($1 > 100) { print $2 } }' > contigs_with_100_or_more_variants.txt
```
Now limit the VCF to just these contigs. bcftools didn't seem to like the single-column contig list as a regions file or on the command line, so falling back to grep:
```
gzcat all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.vcf.gz | grep -w "^#\|^#CHROM\|^01\|^02\|^03\|^04\|^05\|^06\|^07\|^08\|^09\|^10\|^SUPER_18\|^SUPER_20\|^SUPER_16\|^SUPER_22\|^SUPER_32\|^SUPER_26\|^SUPER_30\|^SUPER_75\|^SUPER_47\|^SUPER_38\|^SUPER_35\|^SUPER_31\|^SUPER_29\|^SUPER_37\|^SUPER_28\|^SUPER_33\|^SUPER_36\|^SUPER_25\|^SUPER_127\|^SUPER_27\|^SUPER_60\|^SUPER_43\|^SUPER_80\|^SUPER_45\|^SUPER_56" | pigz > all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.vcf.gz
```
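The `-w` flag matters here: without it, `^SUPER_99` would also match `SUPER_999`. A toy check (`SUPER_999` is a made-up contig for illustration):

```bash=
# Keep header lines and exact contig matches; -w rejects SUPER_999
# because the match is followed by another word character
printf '##fileformat\n01\t5\nSUPER_99\t7\nSUPER_999\t9\n' \
  | grep -w "^#\|^01\|^SUPER_99"
# prints the first three lines only; SUPER_999 is excluded
```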
Re-Beagling
```bash=
java -jar beagle.28Jun21.220.jar \
gt=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.vcf.gz \
out=all_combined_Genotyped_lines.filtered.season4.recode.sorted.miss010.maf005.contig-filtered.imputed
```
### Filtering, Imputation, Filtering Retry
```
bcftools view -i 'F_MISSING<0.3' all_combined_Genotyped_lines.filtered.season4.recode.sorted.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.vcf.gz
```
number of records: 5,031,759 => 751,518
number of SNPs: 4,565,515 => 693,157
number of indels: 466,244 => 58,361
Run rMVP with one principal component (PC=1):
```R=
library(rMVP)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098"
)
##attach input file
genotype <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.geno.desc")
phenotype <- read.table("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.phe",head=TRUE)
map <- read.table("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.geno.map" , head = TRUE)
Covariates <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.pc.desc")
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.kin.desc")
##GWAS analysis
for(i in 2:ncol(phenotype)){
imMVP <- MVP(
phe=phenotype[, c(1, i)],
geno=genotype,
map=map,
K=kinship,
nPC.MLM = 1,
nPC.FarmCPU = 1,
threshold=0.05,
method=c("MLM", "FarmCPU")
)
gc()
}
```
Filter the FarmCPU CSV output for a subset of phenotypes.
For files: `PC1.aboveground_dry_biomass.FarmCPU.csv PC1.canopy_height.FarmCPU.csv PC1.flag_leaf_emergence_time.FarmCPU.csv PC1.flowering_time.FarmCPU.csv`
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.aboveground_dry_biomass.FarmCPU.csv > PC1.aboveground_dry_biomass.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.canopy_height.FarmCPU.csv > PC1.canopy_height.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.flag_leaf_emergence_time.FarmCPU.csv > PC1.flag_leaf_emergence_time.FarmCPU.p05.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' PC1.flowering_time.FarmCPU.csv > PC1.flowering_time.FarmCPU.p05.csv
cat *p05.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > 4pheno.PC1.FarmCPU.p05.unique.csv
```
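The whole filter-then-extract pipeline on a toy FarmCPU CSV (column names and values are made up; column 8 is the p-value and columns 2-3 are CHROM/POS, as in the real output):

```bash=
# Keep rows with p < 0.05 (col 8), then pull the quoted CHROM,POS pairs
# and convert them to tab-separated regions
printf 'SNP,CHROM,POS,REF,ALT,Effect,SE,pvalue\ns1,"01",100,A,T,0.1,0.1,0.01\ns2,"02",200,A,T,0.1,0.1,0.5\n' \
  | awk -F ',' 'NR==1 || $8 < 0.05' \
  | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t'
```

Only the s1 row survives, emitted as a tab-separated `01 100` region line.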
bgzip and tabix-index the VCF file (`bgzip`, then `tabix -p vcf` on the result), then:
```bash=
bcftools view --regions-file 4pheno.PC1.FarmCPU.p05.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf.gz
```
Unzip again...
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf.gz
```
Now produce a new kinship matrix with rMVP
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.tsv")
```
Add labels with pandas, because I can't figure it out in R.
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.kin.labeled.tsv', sep='\t')
```
#### Same as above, but for p < 0.001
Filter the FarmCPU CSV output for a subset of phenotypes.
For files: `PC1.aboveground_dry_biomass.FarmCPU.csv PC1.canopy_height.FarmCPU.csv PC1.flag_leaf_emergence_time.FarmCPU.csv PC1.flowering_time.FarmCPU.csv`
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.aboveground_dry_biomass.FarmCPU.csv > PC1.aboveground_dry_biomass.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.canopy_height.FarmCPU.csv > PC1.canopy_height.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.flag_leaf_emergence_time.FarmCPU.csv > PC1.flag_leaf_emergence_time.FarmCPU.p0001.csv
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.001) {print $0}' PC1.flowering_time.FarmCPU.csv > PC1.flowering_time.FarmCPU.p0001.csv
cat *p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > 4pheno.PC1.FarmCPU.p0001.unique.csv
```
bgzip and tabix-index the VCF file (`bgzip`, then `tabix -p vcf` on the result), then:
```bash=
bcftools view --regions-file 4pheno.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf.gz
```
Unzip again...
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf.gz
```
Now produce a new kinship matrix with rMVP
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.4pheno.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.tsv")
```
Add labels with pandas, because I can't figure it out in R.
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.4pheno.p0001.kin.labeled.tsv', sep='\t')
```
#### Individual matrices for just canopy height & aboveground dry biomass
```bash=
cat PC1.aboveground_dry_biomass.FarmCPU.p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > aboveground_dry_biomass.PC1.FarmCPU.p0001.unique.csv
cat PC1.canopy_height.FarmCPU.p0001.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > canopy_height.PC1.FarmCPU.p0001.unique.csv
```
```bash=
bcftools view --regions-file aboveground_dry_biomass.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf.gz
bcftools view --regions-file canopy_height.PC1.FarmCPU.p0001.unique.csv all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.vcf.gz -Oz -o all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf.gz
```
```bash=
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf.gz
gunzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf.gz
```
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.aboveground_dry_biomass.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.tsv")
##input file preparation
MVP.Data(fileVCF="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-085.canopy_height.p0001.vcf",
filePhe="short_format_traits_season_4.txt",
fileKin=TRUE,
filePC=TRUE,
out="all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001"
)
kinship <- attach.big.matrix("all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.desc")
write_tsv(as.data.frame(kinship[,]), "all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.tsv")
```
```python=
import pandas as pd
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.labeled.tsv', sep='\t')
kin = pd.read_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.labeled.tsv', sep='\t')
```
```bash=
gzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.aboveground_dry_biomass.p0001.kin.labeled.tsv
gzip all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.canopy_height.p0001.kin.labeled.tsv
```
#### Produce a VCF file limited to plant height QTL regions
[Search aussorgm](https://aussorgm.org.au/sorghum-qtl-atlas/search/) for `Assembly is v3.0`
and `Trait Subcategory CONTAINS "height"`, export as Excel, convert to TSV, and awk it down to only the rows with `plant height` as the trait (searching for `plant height` directly didn't work, unfortunately). Also: the regions file needs to be tab-delimited, so convert the `1:234-5678` format into 3 columns by replacing `:` and `-`.
```
awk -F '\t' 'BEGIN {OFS="\t"} ($4 ~ /plant height/) {print $5}' height_qtl.tsv | tr ':' '\t' | tr '-' '\t' > plant_height_regions.txt
```
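The same awk match and `:`/`-` conversion on a made-up QTL row (column layout assumed from the command above: trait in column 4, region in column 5):

```bash=
# Match rows whose trait column contains "plant height" and split the
# region string into three tab-separated columns
printf 'QTL1\tpub\tcat\tplant height\t1:234-5678\n' \
  | awk -F '\t' 'BEGIN {OFS="\t"} ($4 ~ /plant height/) {print $5}' \
  | tr ':' '\t' | tr '-' '\t'
```

This yields the tab-separated region `1 234 5678` ready for `bcftools view -R`.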
Filter to those regions with bcftools (and a shorter VCF filename is in order at this point):
```bash=
bcftools view -R plant_height_regions.txt all_combined_Genotyped_lines.filtered.season4.recode.sorted.fmissing_030.maf_001.contig-filtered.imputed.plink-prune-098.vcf.gz > sorghum_plant_height_qtl.vcf
```
Make a trait file for just canopy height
```bash=
cut -f 1,24 short_format_traits_season_4.txt > canopy_height_trait_season4.txt
```
Run the GWAS
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="sorghum_plant_height_qtl.vcf",
filePhe="canopy_height_trait_season4.txt",
fileKin=TRUE,
filePC=TRUE,
out="sorghum_plant_height_qtl"
)
##attach input file
genotype <- attach.big.matrix("sorghum_plant_height_qtl.geno.desc")
phenotype <- read.table("sorghum_plant_height_qtl.phe",head=TRUE)
map <- read.table("sorghum_plant_height_qtl.geno.map" , head = TRUE)
Covariates <- attach.big.matrix("sorghum_plant_height_qtl.pc.desc")
kinship <- attach.big.matrix("sorghum_plant_height_qtl.kin.desc")
##GWAS analysis
for(i in 2:ncol(phenotype)){
imMVP <- MVP(
phe=phenotype[, c(1, i)],
geno=genotype,
map=map,
K=kinship,
nPC.MLM = 1,
nPC.FarmCPU = 1,
threshold=0.05,
method=c("MLM", "FarmCPU")
)
gc()
}
```
The QQ plot for FarmCPU looks great; filtering to p < 0.05 gives ~6,000 locations.
```bash=
awk -F ',' 'BEGIN {OFS="\t"} (NR==1 || $8 < 0.05) {print $0}' canopy_height.FarmCPU.csv > canopy_height.FarmCPU.p005.csv
cat canopy_height.FarmCPU.p005.csv | cut -d, -f 2,3 | grep -v "CHROM" | sort | uniq | sed 's@"@@g' | tr ',' '\t' > canopy_height.FarmCPU.p005.unique.csv
```
Filter the VCF based on this regions file, and unzip before sending it back to rMVP:
```bash=
bgzip sorghum_plant_height_qtl.vcf
tabix sorghum_plant_height_qtl.vcf.gz
bcftools view --regions-file canopy_height.FarmCPU.p005.unique.csv sorghum_plant_height_qtl.vcf.gz -Oz -o sorghum_plant_height_qtl.canopy_height_p005.vcf.gz
gunzip sorghum_plant_height_qtl.canopy_height_p005.vcf.gz
```
Make the kinship matrix
```R=
library(rMVP)
library(readr)
##input file preparation
MVP.Data(fileVCF="sorghum_plant_height_qtl.canopy_height_p005.vcf",
filePhe="canopy_height_trait_season4.txt",
fileKin=TRUE,
filePC=TRUE,
out="sorghum_plant_height_qtl.canopy_height_p005"
)
kinship <- attach.big.matrix("sorghum_plant_height_qtl.canopy_height_p005.kin.desc")
write_tsv(as.data.frame(kinship[,]), "sorghum_plant_height_qtl.canopy_height_p005.kin.tsv")
```
Finally, add labels to the matrix
```python=
import pandas as pd
kin = pd.read_csv('sorghum_plant_height_qtl.canopy_height_p005.kin.tsv', sep="\t")
samples = pd.read_csv('samples.txt', header=None)
kin.columns = samples[0].tolist()
kin.index = samples[0].tolist()
kin.to_csv('sorghum_plant_height_qtl.canopy_height_p005.kin.labeled.tsv', sep='\t')
```
### Get just the ~90 SNPs from p < 0.001
Two of the steps above are now nicely packaged as scripts:
```bash=
./matrix.R sorghum_plant_height_qtl.canopy_height_p0001.vcf canopy_height_trait_season4.txt
python3 add_matrix_labels.py sorghum_plant_height_qtl.canopy_height_p0001.kin.tsv samples.txt
```