###### tags: `unix` `FileJoiner` `plink` `qsub` `cat` `PBS` `GWAS ID remapper` `pedigree` # Collaborators' script files * [FileJoiner](https://drive.google.com/open?id=0B0rTuebw8wohR3JGZVAwQmx0MUk) `/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner` * [qsub_clumping_chr.sh](https://drive.google.com/open?id=0B0rTuebw8wohNm1taW9EUkw2M2c) * [CompileProfileFiles.sh](https://drive.google.com/open?id=0B0rTuebw8wohSE10a2ZCR19tQzA) * [GWAS_ID_remapper](https://drive.google.com/open?id=0B0rTuebw8wohdlFwTUZNNExlU00) * ID must be 1st field * [match.pl]() ### fileChecker.sh * file: `/mnt/lustre/home/lunC/scripts/fileChecker.sh` ```bash= # Output column count, row count and field position of one file sh $locScripts/fileChecker.sh temp_ADHD2017_10rows temp_ADHD2017_10rows has 5 columns and 10 rows Position and field name: 1 CHR 2 SNP 3 BP 4 CHR:BP 5 P ``` ### Perform SQL joins with FileJoiner ![](https://i.imgur.com/YWlOLkL.jpg) #### Inner join `FileJoiner A B` `FileJoiner A -posfilter=B` or `FileJoiner B -posfilter=A` depending on the purpose #### Left join `FileJoiner A B` `FileJoiner A B,outfields=,outemptyfieldstring=` #### Left join minus intersection with B `FileJoiner A -negfilter=B` #### Right join `FileJoiner B A` `FileJoiner B A,outfields=,outemptyfieldstring=` #### Right join minus intersection with A `FileJoiner B -negfilter=A` > Hi Chang Thanks for the feedback. More accurately described the options are : Inner join : FileJoiner A B [-or- FileJoiner A -posfilter=B -or- FileJoiner B -posfilter=A depending on the purpose] Left join : FileJoiner A B,optional [most likely with a ,outfields=…. for file B, and a -outemptyfieldstring= as well, to pad out the missing fields correctly] Left join (minus intersection with B) : FileJoiner A -negfilter=B Right join : FileJoiner B A,optional [most likely with a ,outfields=…. for file A, and a -outemptyfieldstring= as well, to pad out the missing fields correctly] Right join (minus intersection with A) : FileJoiner B -negfilter=A There isn’t really a ‘Full join’ or a ‘Full join (minus intersection)’ as currently implemented; you’d have to combine two runs of the program – it’s the two ‘-negfilter=’ commands above, concatenated and reformatted to make the fields line up correctly. It’s hard to do this automatically without knowing how the final file is meant look, hence not implemented now. It could probably be implemented. I don’t have time to work on it currently (due to the amount of testing needed). Cheers Scott From: LunHsien Chang [mailto:luenhchang@gmail.com] Sent: Friday, September 29, 2017 11:40 AM To: Scott Gordon Subject: Do SQL joins with FileJoiner ### FileJoiner per-file options are separated by comma without white spaces. For example ```bash! locFileJoiner="/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner" $locFileJoiner -quiet $PRSFile,headerline,sep=whitespace,key=2,outfields=1-5,7- $PCFile,headerline,sep=tabs,key=2,outfields=6-15 $imputationCovFile,sep=whitespace,key=2,outfields=6 > PRS_PC1-PC10_impCov ``` * **-quiet** to suppress warning * type **headerline** when files contain headers * **sep=** defines the field separator, default= whitespace * specify merging key with **key=** * specify output fields with **outfields=** Allow the use of a range of fields (e.g. 1-5) and omitting the last field (e.g. 7-) #### 1.1 left-join 2 tables. Copy all columns from matched rows of right table to merged file Testing FileJoiner * left table: clumped.ADHD2017.1.snps * right table: ADHD2017.betas.temp * merged table: clumped.ADHD2017.1.betas.temp ```bash! # location of script files and input, output files [lunC@hpcpbs01 ~]$ cd /mnt/lustre/working/lab_nickm/lunC/testing locClumping="/mnt/lustre/working/lab_nickm/lunC/PRS_test/clumping" locFileJoiner="/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner" locPRScalc= "/mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc" locReference="/working/lab_nickm/lunC/reference" locTest="/mnt/lustre/working/lab_nickm/lunC/testing" # left-join clumped.ADHD2017.1.snps (left) and ADHD2017.betas.temp (right). Copy all columns from matched rows of right table, except merging key, to merged file [lunC@hpcpbs01 testing]$ ${locFileJoiner} -quiet ${locPRScalc}/ADHD2017.betas.temp -posfilter=clumped.ADHD2017.1.snps > clumped.ADHD2017.1.betas.temp #Warning : in positive-filter file clumped.ADHD2017.1.snps : line 17852 is too short to produce all output fields (expected >= 1 but found only 0). Suppressing further warnings for this file. #Warning : in positive-filter file clumped.ADHD2017.1.snps : line 17852 is too short to find the key field but not blank (expected >= 1 field(s) but only found 0). Suppressing further warnings for this file. #Warning : line 17852 of positive-filter file clumped.ADHD2017.1.snps has a different number of fields (0) to the previous (1). Suppressing further warnings for this file. # check how row numbers change [lunC@hpcpbs01 testing]$ wc -l ${locPRScalc}/ADHD2017.betas.temp clumped.ADHD2017.1.snps clumped.ADHD2017.1.betas.temp 8047420 /mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.betas.temp 17853 clumped.ADHD2017.1.snps 17851 clumped.ADHD2017.1.betas.temp 8083124 total [lunC@hpcpbs01 testing]$ head ${locPRScalc}/ADHD2017.betas.temp clumped.ADHD2017.1.snps clumped.ADHD2017.1.betas.temp ==> /mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.betas.temp <== rs202152658 A 0.0307038 rs143225517 T -0.0306963 rs3094315 A -0.0408012 rs3131972 A 0.0399993 rs3131971 T 0.0421012 rs61770173 A -0.0329985 rs3131970 T 0.0329027 rs2073814 C 0.0377003 rs2073813 A 0.0335025 rs3131969 A 0.0323026 ==> clumped.ADHD2017.1.snps <== rs112984125 rs3001723 rs2842198 rs2391769 rs2391734 rs867605 rs9661242 rs17652815 rs1222062 rs143129347 ==> clumped.ADHD2017.1.betas.temp <== rs79211741 T 0.0437968 rs36161240 A -0.0275973 rs186608774 T -0.00339576 rs6603787 T -0.0449032 rs75189781 T -0.0344051 rs2368566 A -0.0453949 rs7531530 T -0.014596 rs11260621 T 0.0107025 rs6676565 T -0.00780034 rs28691882 T -0.00589736 # Every row in merged file should be found in the left table [lunC@hpcpbs01 testing]$ grep -E 'rs112984125|rs3001723|rs2842198|rs2391769|rs2391734|rs867605|rs9661242|rs17652815|rs1222062|rs143129347' clumped.ADHD2017.1.betas.temp rs9661242 A -0.0625991 rs2842198 A -0.0780048 rs867605 T 0.0694981 rs3001723 A 0.0894019 rs112984125 A -0.106005 rs2391734 T 0.0855995 rs1222062 A -0.0610034 rs2391769 A -0.0754027 rs143129347 T 0.1235 rs17652815 T -0.146796 ``` #### 1.2 Inner-join 2 tables. All columns from 2 tables are copied to merged file * left table: meta.data * right table: clumped.ADHD2017.1.betas.temp * merged table: ADHD2017.1.betas.strand ```shell! [lunC@hpcpbs01 testing]$ ${locFileJoiner} -quiet ${locReference}/meta.data,key=5 clumped.ADHD2017.1.betas.temp,key=1 > ADHD2017.1.betas.strand [lunC@hpcpbs01 testing]$ wc -l ${locReference}/meta.data clumped.ADHD2017.1.betas.temp ADHD2017.1.betas.strand 6303050 /working/lab_nickm/lunC/reference/meta.data 17851 clumped.ADHD2017.1.betas.temp 17757 ADHD2017.1.betas.strand 6338658 total [lunC@hpcpbs01 testing]$ head -2 ${locReference}/meta.data clumped.ADHD2017.1.betas.temp ADHD2017.1.betas.strand ==> /working/lab_nickm/lunC/reference/meta.data <== SNP REF ALT bp_Build37 SNP_dbSNP MAF Rsq_rederived 1:998395 A G 998395 rs7526076 0.24938 0.640528 ==> clumped.ADHD2017.1.betas.temp <== rs79211741 T 0.0437968 rs36161240 A -0.0275973 ==> ADHD2017.1.betas.strand <== 1:1034243 C T 1034243 rs79211741 0.120352 0.798223 rs79211741 T 0.0437968 1:1043829 A G 1043829 rs36161240 0.166522 0.782422 rs36161240 A -0.0275973 ``` ### 2 Create a here document that creates multiple script files for submitting a series of similar jobs #### 2.1 qsub_clumping_chr.sh is a script that [passes multi-line string to a file in Bash](https://stackoverflow.com/questions/2500436/how-does-cat-eof-work-in-bash), creating multiple files containing code, to do a bunch of similar jobs * line 7-11: set directories that is referred to in sub script files as variables * line 15-24: 2 for loops * line 27 & 43: create script files that are serially named by the 2 iterators: ${i} and ${j} * line 28-41: contect of individual subscript files * line 30-32: define how you will run jobs on HPC using PBS. Don't put other non-PBS commands before this section * line 34-41: plink commnads * line 46: submit the subscript files as jobs on HPC ```shell! #!/bin/sh ### use for loop to conduct analysis ## define file location genoprefix=/mnt/lustre/reference/genepi/public_reference_panels/1000G_20101123_v3_GIANT/derived_plinkformat genoend=phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL extractfile=/working/lab_nickm/lunC/PRS_test/clumping/allSNPRsNumberGWASUniqueRmINDELs clumpfile=/working/lab_nickm/lunC/PRS_test/clumping/ outfile=/working/lab_nickm/lunC/PRS_test/clumping2/ ## for loop in trait for analysis for i in ADHD2017 ASD2015 do mkdir -p $i cd $i ## for loop in chr for ((j=1; j<=5; j++)) do ## prepare sh file cat << __EOT__ > ${i}.clump_chr${j}.sh #!/bin/sh #PBS -l ncpus=1 #PBS -l mem=8gb #PBS -l walltime=24:00:00 plink_1.90b3.36 --bfile ${genoprefix}/chr${j}.${genoend} \ --extract ${extractfile} \ --clump ${clumpfile}clump${i}.available1 \ --clump-p1 1 \ --clump-p2 1 \ --clump-r2 0.1 \ --clump-kb 10000 \ --out ${outfile}${i}.${j} __EOT__ ## qsub the jobs. Disable this line when checking files. Activate this line after checking each generated script file qsub ${i}.clump_chr${j}.sh done cd .. done ``` ```shell! # run the script that generates subscript files sh qsub_clumping_chr.sh ``` #### 2.2 qsub_summingPRS.sh (better than previous qsub file) * if [ -f file ] note spaces between [] and its content ```shell! # one of 16 files that are generated by qsub_summingPRS.sh [lunC@hpcpbs01 PRScalc]$ cat ADHD2017.S1.summingPRS.sh #!/bin/sh #PBS -l ncpus=2 #PBS -l mem=8gb #PBS -l walltime=24:00:00 awk '{print $2,$1}' /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.1.1.profile.S1.profile > /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.S1.temp ## loop thru each chromosome (can move this loop outside the script) for((chromosome=1;chromosome<=22;chromosome++)) do for((chunk=1; chunk<=82; chunk++)) do if [ -f /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.${chromosome}.${chunk}.profile.S1.profile] then ## check if the file exists echo ADHD2017.S1.${chromosome}.${chunk} ## merge files using external scripts /mnt/lustre/home/lunC/scripts/match.pl -f ADHD2017.${chromosome}.${chunk}.profile.S1.profile -g ADHD2017.S1.temp -k 2 -l 1 -v 4 > ADHD2017.S1.temp2 ## replace temp2 with temp mv ADHD2017.S1.temp2 ADHD2017.S1.temp fi done done awk '{for(i=3;i<=NF;i++) t+=$i; print t; t=0}' ADHD2017.S1.temp > ADHD2017.S1.rawSum #__EOT__ [lunC@hpcpbs01 PRScalc]$ sh ${locScripts}/qsub_summingPRS.sh 2697577.hpcpbs02 2697578.hpcpbs02 2697579.hpcpbs02 2697580.hpcpbs02 2697581.hpcpbs02 2697582.hpcpbs02 2697583.hpcpbs02 2697584.hpcpbs02 2697585.hpcpbs02 2697586.hpcpbs02 2697587.hpcpbs02 2697588.hpcpbs02 2697589.hpcpbs02 2697590.hpcpbs02 2697591.hpcpbs02 2697592.hpcpbs02 [lunC@hpcpbs01 PRScalc]$ qstat -u lunC hpcpbs02: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time --------------- -------- -------- ---------- ------ --- --- ------ ----- - ----- 2697577.hpcpbs0 lunC short ADHD2017.S 173584 1 2 8gb 24:00 R 00:00 2697578.hpcpbs0 lunC short ADHD2017.S 173616 1 2 8gb 24:00 R 00:00 2697579.hpcpbs0 lunC short ADHD2017.S 173625 1 2 8gb 24:00 R 00:00 2697580.hpcpbs0 lunC short ADHD2017.S 173627 1 2 8gb 24:00 R 00:00 2697581.hpcpbs0 lunC short ADHD2017.S 173628 1 2 8gb 24:00 R 00:00 2697582.hpcpbs0 lunC short ADHD2017.S 173633 1 2 8gb 24:00 R 00:00 2697583.hpcpbs0 lunC short ADHD2017.S 173640 1 2 8gb 24:00 R 00:00 2697584.hpcpbs0 lunC short ADHD2017.S 173646 1 2 8gb 24:00 R 00:00 2697585.hpcpbs0 lunC short ASD2015.S1 173654 1 2 8gb 24:00 R 00:00 2697586.hpcpbs0 lunC short ASD2015.S2 173661 1 2 8gb 24:00 R 00:00 2697587.hpcpbs0 lunC short ASD2015.S3 173667 1 2 8gb 24:00 R 00:00 2697588.hpcpbs0 lunC short ASD2015.S4 173673 1 2 8gb 24:00 R 00:00 2697589.hpcpbs0 lunC short ASD2015.S5 173679 1 2 8gb 24:00 R 00:00 2697590.hpcpbs0 lunC short ASD2015.S6 173686 1 2 8gb 24:00 R 00:00 2697591.hpcpbs0 lunC short ASD2015.S7 173692 1 2 8gb 24:00 R 00:00 2697592.hpcpbs0 lunC short ASD2015.S8 173699 1 2 8gb 24:00 R 00:00 ``` ### GWAS ID remapper #### An example of 3 quantitative traits * script file: GWAS_CCO_P6IRT_S6IRT_SP12IRT.txt * input data file (phenotype data file): uniqID_1stNonmissDepVar_keep.txt * output file: ID_remapped_sp12_IRT.txt ##### Order columns of your input phenotype file- ID in first column, followed by traits and then covariates * ID: ID * traits:PSYCH6, SOMA6, SPHERE_sum, PSYCH6_IRT, SOMA6_IRT, SPHERE12_IRT * covariates: age, ageSq (age^2^), nSEX (numeric sex), sexAge (sex by age), sexAgeSq (sex by age^2^) ```shell! [lunC@hpcpbs01 sphere]$ head -1 uniqID_1stNonmissDepVar_keep.txt ID PSYCH6 SOMA6 SPHERE_sum PSYCH6_IRT SOMA6_IRT SPHERE12_IRT age ageSq nSEX sexAge sexAgeSq ``` ### CompileProfileFiles.sh ##### A script to compile PLINK .profile files run separately for different chromosomes or blocks of markers, into a single file with appropriate summing. ##### All files must have been run for the same set of individuals (not necessarily for the same genetic dataset although that would be the usual practice). ##### Files can be specified either as a list on the command line, or as a filename with a preceding '@' symbol in which case the individual files are listed in that file, rather than individually on the command line. The two methods can be mixed. ##### The .profile files can be gzip'd (must have names ending in .gz in this case). ##### Redirect output to a new (destination) file. ##### eg. $0 GRStest1_chr*.profile > GRStest1_compiled.profile ### Remap ID using GWAS_ID_remapper * directory of GWAS_ID_remapper ==/reference/genepi/GWAS_release/Release8/Scripts/GWAS_ID_remapper== * options * **-intabdelim** for tab-separated input file * **-headerline** if input file has header * **-outpedigree** inserts pedigree info, five columns, into output file: * FAMID * ID * FATHERID * MOTHERID * GENDER * directory of phenotype data file: ==/working/lab_nickm/lunC/sphere/uniqID_1stNonmissDepVar_keep.txt== * output file name: ==ID_remapped_sp12_IRT.txt== This file contains pedigree information and phenotype data file ```shell! # remap ID [lunC@hpcpbs01 sphere]$ /reference/genepi/GWAS_release/Release8/Scripts/GWAS_ID_remapper -intabdelim -headerline -outpedigree /working/lab_nickm/lunC/sphere/uniqID_1stNonmissDepVar_keep.txt > ID_remapped_sp12_IRT.txt # headers of pedigree+phenotype data [lunC@hpcpbs01 sphere]$ head -1 ID_remapped_sp12_IRT.txt FAMID ID FATHERID MOTHERID GENDER PSYCH6 SOMA6 SPHERE_sum PSYCH6_IRT SOMA6_IRT SPHERE12_IRT age ageSq nSEX sexAge sexAgeSq ``` ### fieldnumbers #### script content ```shell! #!/bin/bash # By Scott Gordon, August 2015 for QIMR Berghofer # # List the field numbers from the header line of 1 or more files. # Reports the filename if >1 file N=$# cnt=0 for f in $*; do ((++cnt)) if (( N > 1 )); then echo "file ${f}" fi prefilter=`echo $f|awk '{f=toupper($1);print (substr(f,length(f)-2)==".GZ")?"zcat":"cat"}'` ${prefilter} ${f} | head -1 | awk '{for(i=1;i<=NF;++i){print i,$i}}' if (( N > cnt )); then echo fi done ``` #### use fieldnumbers on a tab-separated input file ```shell! # copy the file to my script folder [lunC@hpcpbs01 ~]$ cp /reference/genepi/GWAS_release/Release8/Scripts/fieldnumbers /mnt/lustre/home/lunC/scripts/fieldnumbers # obtain filed numbers [lunC@hpcpbs01 GWAS_summary]$ ${locScripts}/fieldnumbers -sep=tabs scz2.snp.relts.txt file -sep=tabs cat: invalid option -- 'p' Try 'cat --help' for more information. file scz2.snp.results.txt 1 hg19chrc 2 snpid 3 a1 4 a2 5 bp 6 info 7 or 8 se 9 p 10 ngt ``` ### match.pl #### 5.1 Perform right-join with match.pl. * Usage: match.pl -f file1 -g filetwo -k 1 -l 2 -v 3 4 7 * -f specifies name of file1 (left table) * -g specifies name of file2 * -f specifies name of file1 (right table) * -k specifies position of key variable in file1 (1 here) * -l specifies position of key variable in file2 (2 here) * -v specifies position of values in file one (3, 4, 7 here) to be added to file2. * This option cannot be ignored * In the merged file, values of the added columns are set to dash for unmatched row (i.e. not found in file1); otherwise carry original values (i.e. rows common to both file1 and file2) > * you should get an overlap > 1 Milions, maybe need to use another SNP ID for the merge (if I remember well, the QIMR data uses mostly the ID type CHR:BP) #### 5.2 Test match.pl using test data files. Suppose there two files with 2 rows common to both * temp_meta_10rows (file1 or left table in SQL language) * temp_ADHD2017_11rows (file2 or right table in SQL language) ```shell! [lunC@hpcpbs01 testing]$ cd /working/lab_nickm/lunC/testing # content of file 1 (left table) [lunC@hpcpbs01 testing]$ head temp_meta_10rows SNP REF ALT bp_Build37 SNP_dbSNP MAF Rsq_rederived 1:998395 A G 998395 rs7526076 0.24938 0.640528 1:1002387 A G 1002387 rs74048003 0.183707 0.602693 1:1004957 G A 1004957 rs4073176 0.421162 0.686436 1:1004980 G A 1004980 rs4073177 0.421158 0.686789 1:1005806 C T 1005806 rs3934834 0.142079 0.902389 1:1006223 G A 1006223 rs9442394 0.42339 0.691605 1:1006990 G A 1006990 rs4326571 0.278881 0.724348 1:1007222 G T 1007222 rs71628928 0.10088 0.608873 1:1007432 G A 1007432 rs4333796 0.444029 0.728123 # content of file 2 (right table) [lunC@hpcpbs01 testing]$ awk '{print $0}' temp_ADHD2017_11rows CHR SNP BP CHR:BP P 1 rs202152658 751343 1:751343 0.1654 1 rs143225517 751756 1:751756 0.1654 1 rs3094315 752566 1:752566 0.03208 1 rs3131972 752721 1:752721 0.03301 1 rs3131971 752894 1:752894 0.02772 1 rs61770173 753405 1:753405 0.1208 1 rs3131970 753425 1:753425 0.1217 1 rs2073814 753474 1:753474 0.04622 1 rs2073813 753541 1:753541 0.1165 1 rs3934834 1005806 1:1005806 0.54 1 rs4333796 1007432 1:1007432 0.4440 # perform right join using rs number as merging key [lunC@hpcpbs01 testing]$ /mnt/lustre/home/lunC/scripts/match.pl -f temp_meta_10rows -g temp_ADHD2017_11rows -k 5 -l 2 -v 6 > temp_mergeKeyRsNum [lunC@hpcpbs01 testing]$ tail temp_mergeKeyRsNum 1 rs143225517 751756 1:751756 0.1654 - 1 rs3094315 752566 1:752566 0.03208 - 1 rs3131972 752721 1:752721 0.03301 - 1 rs3131971 752894 1:752894 0.02772 - 1 rs61770173 753405 1:753405 0.1208 - 1 rs3131970 753425 1:753425 0.1217 - 1 rs2073814 753474 1:753474 0.04622 - 1 rs2073813 753541 1:753541 0.1165 - 1 rs3934834 1005806 1:1005806 0.54 0.142079 1 rs4333796 1007432 1:1007432 0.4440 0.444029 # perform right join using CHR:BP as merging key. The result is similar to using rs number above [lunC@hpcpbs01 testing]$ /mnt/lustre/home/lunC/scripts/match.pl -f temp_meta_10rows -g temp_ADHD2017_11rows -k 1 -l 4 -v 6 > temp_mergeKeyCHRBP [lunC@hpcpbs01 testing]$ tail temp_mergeKeyCHRBP 1 rs143225517 751756 1:751756 0.1654 - 1 rs3094315 752566 1:752566 0.03208 - 1 rs3131972 752721 1:752721 0.03301 - 1 rs3131971 752894 1:752894 0.02772 - 1 rs61770173 753405 1:753405 0.1208 - 1 rs3131970 753425 1:753425 0.1217 - 1 rs2073814 753474 1:753474 0.04622 - 1 rs2073813 753541 1:753541 0.1165 - 1 rs3934834 1005806 1:1005806 0.54 0.142079 1 rs4333796 1007432 1:1007432 0.4440 0.444029 ```