tags: `unix` `FileJoiner` `plink` `qsub` `cat` `PBS` `GWAS ID remapper` `pedigree`

Collaborators' script files

FileJoiner
/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner
qsub_clumping_chr.sh
CompileProfileFiles.sh
GWAS_ID_remapper
- ID must be 1st field
match.pl

fileChecker.sh

file: /mnt/lustre/home/lunC/scripts/fileChecker.sh











# Output column count, row count and field position of one file
sh $locScripts/fileChecker.sh temp_ADHD2017_10rows

temp_ADHD2017_10rows has 5 columns and 10 rows
Position and field name:
1 CHR
2 SNP
3 BP
4 CHR:BP
5 P

Perform SQL joins with FileJoiner

Inner join

FileJoiner A B
FileJoiner A -posfilter=B or FileJoiner B -posfilter=A depending on the purpose

Left join

FileJoiner A B
FileJoiner A B,outfields=,outemptyfieldstring=

Left join minus intersection with B

FileJoiner A -negfilter=B

Right join

FileJoiner B A
FileJoiner B A,outfields=,outemptyfieldstring=

Right join minus intersection with A

FileJoiner B -negfilter=A

Hi Chang

Thanks for the feedback.

More accurately described the options are :

Inner join : FileJoiner A B [-or- FileJoiner A -posfilter=B -or- FileJoiner B -posfilter=A depending on the purpose]

Left join : FileJoiner A B,optional [most likely with a ,outfields=…. for file B, and a -outemptyfieldstring= as well, to pad out the missing fields correctly]

Left join (minus intersection with B) : FileJoiner A -negfilter=B

Right join : FileJoiner B A,optional [most likely with a ,outfields=…. for file A, and a -outemptyfieldstring= as well, to pad out the missing fields correctly]

Right join (minus intersection with A) : FileJoiner B -negfilter=A

There isn’t really a ‘Full join’ or a ‘Full join (minus intersection)’ as currently implemented; you’d have to combine two runs of the program – it’s the two ‘-negfilter=’ commands above, concatenated and reformatted to make the fields line up correctly. It’s hard to do this automatically without knowing how the final file is meant look, hence not implemented now. It could probably be implemented. I don’t have time to work on it currently (due to the amount of testing needed).

Cheers
Scott

From: LunHsien Chang [mailto:luenhchang@gmail.com]
Sent: Friday, September 29, 2017 11:40 AM
To: Scott Gordon
Subject: Do SQL joins with FileJoiner

FileJoiner per-file options are separated by comma without white spaces. For example

locFileJoiner="/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner"
$locFileJoiner -quiet $PRSFile,headerline,sep=whitespace,key=2,outfields=1-5,7- $PCFile,headerline,sep=tabs,key=2,outfields=6-15 $imputationCovFile,sep=whitespace,key=2,outfields=6 > PRS_PC1-PC10_impCov

-quiet to suppress warning
type headerline when files contain headers
sep= defines the field separator, default= whitespace
specify merging key with key=
specify output fields with outfields= Allow the use of a range of fields (e.g. 1-5) and omitting the last field (e.g. 7-)

1.1 left-join 2 tables. Copy all columns from matched rows of right table to merged file Testing FileJoiner

left table: clumped.ADHD2017.1.snps
right table: ADHD2017.betas.temp
merged table: clumped.ADHD2017.1.betas.temp


# location of script files and input, output files
[lunC@hpcpbs01 ~]$ cd /mnt/lustre/working/lab_nickm/lunC/testing
locClumping="/mnt/lustre/working/lab_nickm/lunC/PRS_test/clumping"
locFileJoiner="/reference/genepi/GWAS_release/Release8/Scripts/FileJoiner"
locPRScalc= "/mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc"
locReference="/working/lab_nickm/lunC/reference"
locTest="/mnt/lustre/working/lab_nickm/lunC/testing"

# left-join clumped.ADHD2017.1.snps (left) and ADHD2017.betas.temp (right). Copy all columns from matched rows of right table, except merging key, to merged file 
[lunC@hpcpbs01 testing]$ ${locFileJoiner} -quiet ${locPRScalc}/ADHD2017.betas.temp -posfilter=clumped.ADHD2017.1.snps > clumped.ADHD2017.1.betas.temp
#Warning : in positive-filter file clumped.ADHD2017.1.snps : line 17852 is too short to produce all output fields (expected >= 1 but found only 0). Suppressing further warnings for this file.
#Warning : in positive-filter file clumped.ADHD2017.1.snps : line 17852 is too short to find the key field but not blank (expected >= 1 field(s) but only found 0). Suppressing further warnings for this file.
#Warning : line 17852 of positive-filter file clumped.ADHD2017.1.snps has a different number of fields (0) to the previous (1). Suppressing further warnings for this file.

# check how row numbers change
[lunC@hpcpbs01 testing]$ wc -l ${locPRScalc}/ADHD2017.betas.temp clumped.ADHD2017.1.snps clumped.ADHD2017.1.betas.temp
  8047420 /mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.betas.temp
    17853 clumped.ADHD2017.1.snps
    17851 clumped.ADHD2017.1.betas.temp
  8083124 total
 
[lunC@hpcpbs01 testing]$ head  ${locPRScalc}/ADHD2017.betas.temp clumped.ADHD2017.1.snps clumped.ADHD2017.1.betas.temp
==> /mnt/lustre/working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.betas.temp <==
rs202152658 A 0.0307038
rs143225517 T -0.0306963
rs3094315 A -0.0408012
rs3131972 A 0.0399993
rs3131971 T 0.0421012
rs61770173 A -0.0329985
rs3131970 T 0.0329027
rs2073814 C 0.0377003
rs2073813 A 0.0335025
rs3131969 A 0.0323026

==> clumped.ADHD2017.1.snps <==
rs112984125
rs3001723
rs2842198
rs2391769
rs2391734
rs867605
rs9661242
rs17652815
rs1222062
rs143129347

==> clumped.ADHD2017.1.betas.temp <==
rs79211741 T 0.0437968
rs36161240 A -0.0275973
rs186608774 T -0.00339576
rs6603787 T -0.0449032
rs75189781 T -0.0344051
rs2368566 A -0.0453949
rs7531530 T -0.014596
rs11260621 T 0.0107025
rs6676565 T -0.00780034
rs28691882 T -0.00589736

# Every row in merged file should be found in the left table
[lunC@hpcpbs01 testing]$ grep -E 'rs112984125|rs3001723|rs2842198|rs2391769|rs2391734|rs867605|rs9661242|rs17652815|rs1222062|rs143129347' clumped.ADHD2017.1.betas.temp
rs9661242 A -0.0625991
rs2842198 A -0.0780048
rs867605 T 0.0694981
rs3001723 A 0.0894019
rs112984125 A -0.106005
rs2391734 T 0.0855995
rs1222062 A -0.0610034
rs2391769 A -0.0754027
rs143129347 T 0.1235
rs17652815 T -0.146796

1.2 Inner-join 2 tables. All columns from 2 tables are copied to merged file

left table: meta.data
right table: clumped.ADHD2017.1.betas.temp
merged table: ADHD2017.1.betas.strand

[lunC@hpcpbs01 testing]$ ${locFileJoiner} -quiet ${locReference}/meta.data,key=5 clumped.ADHD2017.1.betas.temp,key=1 > ADHD2017.1.betas.strand
[lunC@hpcpbs01 testing]$ wc -l  ${locReference}/meta.data clumped.ADHD2017.1.betas.temp ADHD2017.1.betas.strand
  6303050 /working/lab_nickm/lunC/reference/meta.data
    17851 clumped.ADHD2017.1.betas.temp
    17757 ADHD2017.1.betas.strand
  6338658 total
[lunC@hpcpbs01 testing]$ head -2  ${locReference}/meta.data clumped.ADHD2017.1.betas.temp ADHD2017.1.betas.strand
==> /working/lab_nickm/lunC/reference/meta.data <==
SNP REF ALT bp_Build37 SNP_dbSNP MAF Rsq_rederived
1:998395 A G 998395 rs7526076 0.24938 0.640528

==> clumped.ADHD2017.1.betas.temp <==
rs79211741 T 0.0437968
rs36161240 A -0.0275973

==> ADHD2017.1.betas.strand <==
1:1034243 C T 1034243 rs79211741 0.120352 0.798223 rs79211741 T 0.0437968
1:1043829 A G 1043829 rs36161240 0.166522 0.782422 rs36161240 A -0.0275973

2 Create a here document that creates multiple script files for submitting a series of similar jobs

2.1 qsub_clumping_chr.sh is a script that passes multi-line string to a file in Bash, creating multiple files containing code, to do a bunch of similar jobs

line 7-11: set directories that is referred to in sub script files as variables
line 15-24: 2 for loops
line 27 & 43: create script files that are serially named by the 2 iterators: ${i} and ${j}
line 28-41: contect of individual subscript files
line 30-32: define how you will run jobs on HPC using PBS. Don't put other non-PBS commands before this section
line 34-41: plink commnads
line 46: submit the subscript files as jobs on HPC

#!/bin/sh

### use for loop to conduct analysis

## define file location

genoprefix=/mnt/lustre/reference/genepi/public_reference_panels/1000G_20101123_v3_GIANT/derived_plinkformat
genoend=phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL
extractfile=/working/lab_nickm/lunC/PRS_test/clumping/allSNPRsNumberGWASUniqueRmINDELs
clumpfile=/working/lab_nickm/lunC/PRS_test/clumping/
outfile=/working/lab_nickm/lunC/PRS_test/clumping2/

## for loop in trait for analysis

for i in ADHD2017 ASD2015
do

mkdir -p $i

cd $i

## for loop in chr
   for ((j=1; j<=5; j++))
	do

## prepare sh file
cat << __EOT__  > ${i}.clump_chr${j}.sh
#!/bin/sh

#PBS -l ncpus=1
#PBS -l mem=8gb
#PBS -l walltime=24:00:00

plink_1.90b3.36 --bfile  ${genoprefix}/chr${j}.${genoend} \
                --extract ${extractfile} \
                --clump ${clumpfile}clump${i}.available1 \
                --clump-p1 1 \
                --clump-p2 1 \
                --clump-r2 0.1 \
                --clump-kb 10000 \
                --out ${outfile}${i}.${j}

__EOT__

## qsub the jobs. Disable this line when checking files. Activate this line  after checking each generated script file
qsub ${i}.clump_chr${j}.sh

	done
cd ..
done

# run the script that generates subscript files
sh qsub_clumping_chr.sh

2.2 qsub_summingPRS.sh (better than previous qsub file)

if [ -f file ] note spaces between [] and its content

# one of 16 files that are generated by qsub_summingPRS.sh
[lunC@hpcpbs01 PRScalc]$ cat ADHD2017.S1.summingPRS.sh
#!/bin/sh

#PBS -l ncpus=2
#PBS -l mem=8gb
#PBS -l walltime=24:00:00

awk '{print $2,$1}' /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.1.1.profile.S1.profile > /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.S1.temp

## loop thru each chromosome (can move this loop outside the script)
for((chromosome=1;chromosome<=22;chromosome++))
        do for((chunk=1; chunk<=82; chunk++))
                do      if [ -f /working/lab_nickm/lunC/PRS_test/PRScalc/ADHD2017.${chromosome}.${chunk}.profile.S1.profile]
                        then
                                ## check if the file exists
                                echo ADHD2017.S1.${chromosome}.${chunk}
                                ## merge files using external scripts
                                /mnt/lustre/home/lunC/scripts/match.pl -f ADHD2017.${chromosome}.${chunk}.profile.S1.profile -g ADHD2017.S1.temp -k 2 -l 1 -v 4 > ADHD2017.S1.temp2
                                ## replace temp2 with temp
                                mv ADHD2017.S1.temp2 ADHD2017.S1.temp
                        fi
                done
        done
        awk '{for(i=3;i<=NF;i++) t+=$i; print t; t=0}' ADHD2017.S1.temp > ADHD2017.S1.rawSum
#__EOT__
[lunC@hpcpbs01 PRScalc]$ sh ${locScripts}/qsub_summingPRS.sh
2697577.hpcpbs02
2697578.hpcpbs02
2697579.hpcpbs02
2697580.hpcpbs02
2697581.hpcpbs02
2697582.hpcpbs02
2697583.hpcpbs02
2697584.hpcpbs02
2697585.hpcpbs02
2697586.hpcpbs02
2697587.hpcpbs02
2697588.hpcpbs02
2697589.hpcpbs02
2697590.hpcpbs02
2697591.hpcpbs02
2697592.hpcpbs02
[lunC@hpcpbs01 PRScalc]$ qstat -u lunC

hpcpbs02:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
2697577.hpcpbs0 lunC     short    ADHD2017.S 173584   1   2    8gb 24:00 R 00:00
2697578.hpcpbs0 lunC     short    ADHD2017.S 173616   1   2    8gb 24:00 R 00:00
2697579.hpcpbs0 lunC     short    ADHD2017.S 173625   1   2    8gb 24:00 R 00:00
2697580.hpcpbs0 lunC     short    ADHD2017.S 173627   1   2    8gb 24:00 R 00:00
2697581.hpcpbs0 lunC     short    ADHD2017.S 173628   1   2    8gb 24:00 R 00:00
2697582.hpcpbs0 lunC     short    ADHD2017.S 173633   1   2    8gb 24:00 R 00:00
2697583.hpcpbs0 lunC     short    ADHD2017.S 173640   1   2    8gb 24:00 R 00:00
2697584.hpcpbs0 lunC     short    ADHD2017.S 173646   1   2    8gb 24:00 R 00:00
2697585.hpcpbs0 lunC     short    ASD2015.S1 173654   1   2    8gb 24:00 R 00:00
2697586.hpcpbs0 lunC     short    ASD2015.S2 173661   1   2    8gb 24:00 R 00:00
2697587.hpcpbs0 lunC     short    ASD2015.S3 173667   1   2    8gb 24:00 R 00:00
2697588.hpcpbs0 lunC     short    ASD2015.S4 173673   1   2    8gb 24:00 R 00:00
2697589.hpcpbs0 lunC     short    ASD2015.S5 173679   1   2    8gb 24:00 R 00:00
2697590.hpcpbs0 lunC     short    ASD2015.S6 173686   1   2    8gb 24:00 R 00:00
2697591.hpcpbs0 lunC     short    ASD2015.S7 173692   1   2    8gb 24:00 R 00:00
2697592.hpcpbs0 lunC     short    ASD2015.S8 173699   1   2    8gb 24:00 R 00:00

GWAS ID remapper

An example of 3 quantitative traits

script file: GWAS_CCO_P6IRT_S6IRT_SP12IRT.txt
input data file (phenotype data file): uniqID_1stNonmissDepVar_keep.txt
output file: ID_remapped_sp12_IRT.txt

Order columns of your input phenotype file- ID in first column, followed by traits and then covariates

ID: ID
traits:PSYCH6, SOMA6, SPHERE_sum, PSYCH6_IRT, SOMA6_IRT, SPHERE12_IRT
covariates: age, ageSq (age²), nSEX (numeric sex), sexAge (sex by age), sexAgeSq (sex by age²)

[lunC@hpcpbs01 sphere]$ head -1 uniqID_1stNonmissDepVar_keep.txt
ID      PSYCH6  SOMA6   SPHERE_sum      PSYCH6_IRT      SOMA6_IRT       SPHERE12_IRT    age     ageSq   nSEX    sexAge  sexAgeSq

CompileProfileFiles.sh

A script to compile PLINK .profile files run separately for different chromosomes or blocks of markers, into a single file with appropriate summing.

All files must have been run for the same set of individuals (not necessarily for the same genetic dataset although that would be the usual practice).

Files can be specified either as a list on the command line, or as a filename with a preceding '@' symbol in which case the individual files are listed in that file, rather than individually on the command line. The two methods can be mixed.

The .profile files can be gzip'd (must have names ending in .gz in this case).

Redirect output to a new (destination) file.

eg. $0 GRStest1_chr*.profile > GRStest1_compiled.profile

Remap ID using GWAS_ID_remapper

directory of GWAS_ID_remapper /reference/genepi/GWAS_release/Release8/Scripts/GWAS_ID_remapper
options
- -intabdelim for tab-separated input file
- -headerline if input file has header
- -outpedigree inserts pedigree info, five columns, into output file:
  - FAMID
  - ID
  - FATHERID
  - MOTHERID
  - GENDER
directory of phenotype data file: /working/lab_nickm/lunC/sphere/uniqID_1stNonmissDepVar_keep.txt
output file name: ID_remapped_sp12_IRT.txt This file contains pedigree information and phenotype data file

# remap ID
[lunC@hpcpbs01 sphere]$ /reference/genepi/GWAS_release/Release8/Scripts/GWAS_ID_remapper -intabdelim -headerline -outpedigree /working/lab_nickm/lunC/sphere/uniqID_1stNonmissDepVar_keep.txt > ID_remapped_sp12_IRT.txt

# headers of pedigree+phenotype data
[lunC@hpcpbs01 sphere]$ head -1 ID_remapped_sp12_IRT.txt
FAMID ID FATHERID MOTHERID GENDER PSYCH6 SOMA6 SPHERE_sum PSYCH6_IRT SOMA6_IRT SPHERE12_IRT age ageSq nSEX sexAge sexAgeSq

fieldnumbers

script content

#!/bin/bash
# By Scott Gordon, August 2015 for QIMR Berghofer
#
# List the field numbers from the header line of 1 or more files.
# Reports the filename if >1 file

N=$#
cnt=0
for f in $*; do
  ((++cnt))
  if (( N > 1 )); then
    echo "file ${f}"
  fi
  prefilter=`echo $f|awk '{f=toupper($1);print (substr(f,length(f)-2)==".GZ")?"zcat":"cat"}'`
  ${prefilter} ${f} | head -1 | awk '{for(i=1;i<=NF;++i){print i,$i}}'
  if (( N > cnt )); then
    echo
  fi
done

use fieldnumbers on a tab-separated input file

# copy the file to my script folder
[lunC@hpcpbs01 ~]$ cp /reference/genepi/GWAS_release/Release8/Scripts/fieldnumbers /mnt/lustre/home/lunC/scripts/fieldnumbers

# obtain filed numbers
[lunC@hpcpbs01 GWAS_summary]$ ${locScripts}/fieldnumbers -sep=tabs scz2.snp.relts.txt
file -sep=tabs
cat: invalid option -- 'p'
Try 'cat --help' for more information.

file scz2.snp.results.txt
1 hg19chrc
2 snpid
3 a1
4 a2
5 bp
6 info
7 or
8 se
9 p
10 ngt

match.pl

5.1 Perform right-join with match.pl.

Usage: match.pl -f file1 -g filetwo -k 1 -l 2 -v 3 4 7
- -f specifies name of file1 (left table)
- -g specifies name of file2 * -f specifies name of file1 (right table)
- -k specifies position of key variable in file1 (1 here)
- -l specifies position of key variable in file2 (2 here)
- -v specifies position of values in file one (3, 4, 7 here) to be added to file2.
  - This option cannot be ignored
  - In the merged file, values of the added columns are set to dash for unmatched row (i.e. not found in file1); otherwise carry original values (i.e. rows common to both file1 and file2)

you should get an overlap > 1 Milions, maybe need to use another SNP ID for the merge (if I remember well, the QIMR data uses mostly the ID type CHR:BP)

5.2 Test match.pl using test data files. Suppose there two files with 2 rows common to both

temp_meta_10rows (file1 or left table in SQL language)
temp_ADHD2017_11rows (file2 or right table in SQL language)

[lunC@hpcpbs01 testing]$ cd /working/lab_nickm/lunC/testing

# content of file 1 (left table)
[lunC@hpcpbs01 testing]$ head temp_meta_10rows
SNP REF ALT bp_Build37 SNP_dbSNP MAF Rsq_rederived
1:998395 A G 998395 rs7526076 0.24938 0.640528
1:1002387 A G 1002387 rs74048003 0.183707 0.602693
1:1004957 G A 1004957 rs4073176 0.421162 0.686436
1:1004980 G A 1004980 rs4073177 0.421158 0.686789
1:1005806 C T 1005806 rs3934834 0.142079 0.902389
1:1006223 G A 1006223 rs9442394 0.42339 0.691605
1:1006990 G A 1006990 rs4326571 0.278881 0.724348
1:1007222 G T 1007222 rs71628928 0.10088 0.608873
1:1007432 G A 1007432 rs4333796 0.444029 0.728123

# content of file 2 (right table)
[lunC@hpcpbs01 testing]$ awk '{print $0}' temp_ADHD2017_11rows
CHR SNP BP CHR:BP P
1 rs202152658 751343 1:751343 0.1654
1 rs143225517 751756 1:751756 0.1654
1 rs3094315 752566 1:752566 0.03208
1 rs3131972 752721 1:752721 0.03301
1 rs3131971 752894 1:752894 0.02772
1 rs61770173 753405 1:753405 0.1208
1 rs3131970 753425 1:753425 0.1217
1 rs2073814 753474 1:753474 0.04622
1 rs2073813 753541 1:753541 0.1165
1 rs3934834 1005806 1:1005806 0.54
1 rs4333796 1007432 1:1007432 0.4440

# perform right join using rs number as merging key
[lunC@hpcpbs01 testing]$ /mnt/lustre/home/lunC/scripts/match.pl -f temp_meta_10rows -g temp_ADHD2017_11rows -k 5 -l 2 -v 6 > temp_mergeKeyRsNum
[lunC@hpcpbs01 testing]$ tail temp_mergeKeyRsNum
1 rs143225517 751756 1:751756 0.1654    -
1 rs3094315 752566 1:752566 0.03208     -
1 rs3131972 752721 1:752721 0.03301     -
1 rs3131971 752894 1:752894 0.02772     -
1 rs61770173 753405 1:753405 0.1208     -
1 rs3131970 753425 1:753425 0.1217      -
1 rs2073814 753474 1:753474 0.04622     -
1 rs2073813 753541 1:753541 0.1165      -
1 rs3934834 1005806 1:1005806 0.54      0.142079
1 rs4333796 1007432 1:1007432 0.4440    0.444029

# perform right join using CHR:BP as merging key. The result is similar to using rs number above
[lunC@hpcpbs01 testing]$  /mnt/lustre/home/lunC/scripts/match.pl -f temp_meta_10rows -g temp_ADHD2017_11rows -k 1 -l 4 -v 6 > temp_mergeKeyCHRBP
[lunC@hpcpbs01 testing]$ tail temp_mergeKeyCHRBP
1 rs143225517 751756 1:751756 0.1654    -
1 rs3094315 752566 1:752566 0.03208     -
1 rs3131972 752721 1:752721 0.03301     -
1 rs3131971 752894 1:752894 0.02772     -
1 rs61770173 753405 1:753405 0.1208     -
1 rs3131970 753425 1:753425 0.1217      -
1 rs2073814 753474 1:753474 0.04622     -
1 rs2073813 753541 1:753541 0.1165      -
1 rs3934834 1005806 1:1005806 0.54      0.142079
1 rs4333796 1007432 1:1007432 0.4440    0.444029

tags: unix FileJoiner plink qsub cat PBS GWAS ID remapper pedigree