:construction::construction::construction:
# GBS scripts
Set up folders per species:
* "raw" - raw data from the sequencers; each run includes its barcode file (.txt). The tags process_radtags needs come from the Excel spreadsheet, which can be copied over to the server. (Demultiplex in the raw folder.)
* "samples.t100" - holds all the results from process_radtags. Zoe has defaulted to trimming to 100 bp (-t 100 in process_radtags).
[NOTES](https://docs.google.com/document/d/1Rqx2tRw4MvJk0Ixo8YG1C0Yu-Gk5Nuqu/edit)
Buxbaumia 10.2.0.10
* Reattach a screen: `screen -R "name"`
* Find a screen name: `screen -ls`
* Kill a screen: `screen -X -S "name" kill`
### Downloading and Processing Raw Illumina Data
Logging on to Buxbaumia 10.2.0.10
If you are not at the Garden, open the "FortiClient" utility and log into the Garden's VPN with your CBG network password before step 1.
1) Mac/Windows - open the Terminal/bash.exe (Ubuntu) program
2) ssh yourlogin@10.2.0.10
3) Follow the password prompt (note: the cursor will not move while you type your password)
### FROM ZOE: to run process_radtags
```bash
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
# --adapter_mm [mismatches] can be added to allow mismatches in the adapter; the value used was not recorded here
process_radtags -P -p /data/Attalea/raw/batch1.1 -i gzfastq -o /data/Attalea/samples.t100 -b barcodes1.1 --inline_index --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT -t 100
```
* -P, --paired - *we used paired-end reads*
* -p /data/Attalea/raw/batch1.1 - where to find the sequencing reads
* -i gzfastq - input file type (gzipped FASTQ)
* -o /data/Attalea/samples.t100 - where to write the output
* -b barcodes1.1 - the barcode file for this run
* --inline_index - barcode placement: inline with the sequence on the single-end read, in the FASTQ header for the paired-end read
* --renz_1 ecoRI - first restriction enzyme (cut site on the single-end read)
* --renz_2 mspI - second restriction enzyme (cut site on the paired-end read)
* --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT - adapter sequence to filter from the single-end read
* --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT - adapter sequence to filter from the paired-end read
* --adapter_mm [mismatches] - number of mismatches allowed in the adapter sequence (the value used was not recorded)
* -t 100 - truncate the final read length to 100 bp
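The exec/trap lines at the top of Zoe's script capture everything the command prints into a log file and restore the terminal at exit. A minimal sketch of the same pattern, with echo standing in for process_radtags (the file name log.out matches the script above):

```shell
#!/bin/bash
# Save the original stdout (fd 3) and stderr (fd 4)
exec 3>&1 4>&2
# On exit, restore the saved descriptors
trap 'exec 2>&4 1>&3' 0
# From here on, all stdout and stderr goes into log.out
exec 1>log.out 2>&1
echo "this line ends up in log.out, not on the terminal"
```

Because the redirect happens before the long-running command, you can tail log.out from another terminal to watch progress.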
:::danger
ERROR that Dylan gets when running process_radtags:
```
(base) -bash-4.2$ /opt/Software/stacks-2.55/process_radtags -h
/opt/Software/stacks-2.55/process_radtags: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /opt/Software/stacks-2.55/process_radtags)
/opt/Software/stacks-2.55/process_radtags: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /opt/Software/stacks-2.55/process_radtags)
/opt/Software/stacks-2.55/process_radtags: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /opt/Software/stacks-2.55/process_radtags)
(base) -bash-4.2$ cat .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
```
:::
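One way to diagnose that error (a sketch; the library path is taken from the error message above and may differ, and `glibcxx_report` is a made-up helper name) is to list which GLIBCXX versions the system libstdc++ actually exports and compare them against what the binary wants:

```shell
# Report the newest GLIBCXX symbol versions provided by a libstdc++
glibcxx_report() {
  lib=${1:-/lib64/libstdc++.so.6}
  if [ ! -e "$lib" ]; then
    echo "no libstdc++ at $lib"
    return 0
  fi
  # grep -a treats the binary as text; print the last few version tags
  tags=$(grep -ao 'GLIBCXX_[0-9.]*' "$lib" | sort -uV | tail -n 5)
  echo "${tags:-no GLIBCXX tags found in $lib}"
}
glibcxx_report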
:::info
:bulb: From https://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php
* p — path to a directory of files.
* P,--paired — files contained within the directory are paired.
* I,--interleaved — specify that the paired-end reads are interleaved in single files.
* i — input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).
* b — path to a file containing barcodes for this run, omit to ignore any barcoding.
* o — path to output the processed files.
* f — path to the input file if processing single-end sequences.
* 1 — first input file in a set of paired-end sequences.
* 2 — second input file in a set of paired-end sequences.
* c,--clean — clean data, remove any read with an uncalled base.
* q,--quality — discard reads with low quality scores.
* r,--rescue — rescue barcodes and RAD-Tags.
* t — truncate final read length to this value.
* D — capture discarded reads to a file.
* E — specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).
* w — set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).
* s — set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).
* y — output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).
Barcode options:
* --inline_null: barcode is inline with sequence, occurs only on single-end read (default).
* --index_null: barcode is provided in FASTQ header, occurs only on single-end read.
* --inline_inline: barcode is inline with sequence, occurs on single and paired-end read.
* --index_index: barcode is provided in FASTQ header, occurs on single and paired-end read.
* --inline_index: barcode is inline with sequence on single-end read, occurs in FASTQ header for paired-end read.
* --index_inline: barcode occurs in FASTQ header for single-end read, is inline with sequence on paired-end read.
Restriction enzyme options:
* -e [enz], --renz_1 [enz]: provide the restriction enzyme used (cut site occurs on single-end read)
* --renz_2 [enz]: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).
Currently supported enzymes include: 'aciI', 'ageI', 'aluI', 'apaLI', 'apeKI', 'apoI', 'aseI', 'bamHI', 'bbvCI', 'bfaI', 'bfuCI', 'bgIII', 'bsaHI', 'bspDI', 'bstYI', 'btgI', 'cac8I', 'claI', 'csp6I', 'ddeI', 'dpnII', 'eaeI', 'ecoRI', 'ecoRV', 'ecoT22I', 'haeIII', 'hinP1I', 'hindIII', 'hpaII', 'hpyCH4IV', 'kpnI', 'mluCI', 'mseI', 'mslI', 'mspI', 'ncoI', 'ndeI', 'nheI', 'nlaIII', 'notI', 'nsiI', 'nspI', 'pacI', 'pspXI', 'pstI', 'rsaI', 'sacI', 'sau3AI', 'sbfI', 'sexAI', 'sgrAI', 'speI', 'sphI', 'taqI', 'xbaI', or 'xhoI'.
Protocol-specific options:
* --bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially transpose reads.
Adapter options:
* --adapter_1 [sequence]: provide adaptor sequence that may occur on the single-end read for filtering.
* --adapter_2 [sequence]: provide adaptor sequence that may occur on the paired-read for filtering.
* --adapter_mm [mismatches]: number of mismatches allowed in the adapter sequence.
Output options:
* --retain_header: retain unmodified FASTQ headers in the output.
* --merge: if no barcodes are specified, merge all input files into a single output file.
Advanced options:
* --filter_illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing.
* --disable_rad_check: disable checking if the RAD site is intact.
* --len_limit [limit]: specify a minimum sequence length (useful if your data has already been trimmed).
* --barcode_dist_1: the number of allowed mismatches when rescuing single-end barcodes (default 1).
* --barcode_dist_2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode_dist_1).
:::
### Get data from Sequencing facility
DO THIS ONLY ONCE.
Check which directory you are in when running this, as it determines where the download is saved.
Type "pwd"
Make sure you are in Cratoneuron (for data - not for running jobs).
Type "cd /cratoneuron/USERNAME"
Type "pwd" to check.
BaseSpaceRunDownloader_v2.py is currently in /cratoneuron/roverson/RADseq/ and I also copied it to /cratoneuron/acisternas.
That way it is not necessary to add the path to the program; it is in the same folder you will be working in.
cd /cratoneuron/USERNAME
python BaseSpaceRunDownloader_v2.py -r RUN_NUMBER -a 31b72dddefa24d79881168dbdbb74aa7
# This command can be used to download an entire project of raw data from BaseSpace using the run ID (-r) and the access token (-a).
Command lines put the option first (e.g. "-a" or "-r"), followed by the value it needs (ID, file, etc.).
python BaseSpaceRunDownloader_v2.py means run "BaseSpaceRunDownloader_v2.py" with python.
-r 22798807 means download that run ID - get it from the URL.
# Get the run ID from the URL of the project page on BaseSpace for that run
-a 31b72dddefa24d79881168dbdbb74aa7 means access "our" data - the lab access token.
# The access token is always the same (a special password for our lab)
CONVERT DATA to a raw data file we can use
bcl2fastq --runfolder-dir ./22579592_xyloprim_plate2 --output-dir ./22579592_xyloprim_plate2_fastq_out
# mkdir RUN_NUMBER_SPECIES_PLATENUMBER_FASTQ_OUT
# Then run the code
example: bcl2fastq --runfolder-dir ./23564548 --output-dir ./23564548_clarkia_plate1_fastq_out
# It will take less than half an hour.
# This script can then be used to convert the downloaded project above to a raw fastq.gz file
"./" places it in the current directory. (Make sure it matches the directory where BaseSpace placed it, OR move the data to the current directory, OR put the file location in place of "./".)
bcl2fastq - file converter that changes data from BCL to FASTQ
--runfolder-dir ./22579592_xyloprim_plate2 - tells it the run folder to read, "./22579592_xyloprim_plate2"
# Notice that Illumina names the downloaded folder after the run ID, and I leave this in place for transparency
--output-dir ./22579592_xyloprim_plate2_fastq_out - tells it the output folder, "./22579592_xyloprim_plate2_fastq_out"
# If we don't specify a sample sheet, then this script will not demultiplex the samples and will produce one raw file with all reads (which is what we want in this case, because we want process_radtags to do the demultiplexing rather than Illumina)
CONCATENATE all data into one file
mkdir process_radtags
# Place this folder inside the output folder from the bcl2fastq command above (the one containing the *.fastq.gz files from each lane of the sequencer), then concatenate them all together
# In this example you would make it in "./22579592_xyloprim_plate2_fastq_out"
"mkdir" = make a directory called "process_radtags"
Create a folder where we will put all files. It needs to be created in the directory you want to work in.
# make this directory for your output
cat *fastq.gz > ./process_radtags/xyloprim_plate1_AllLanes.fq.gz
# concatenate all lanes into one file inside the process_radtags folder
"cat" = concatenates all files with the fastq.gz extension ("*.fastq.gz")
This will combine all these files into one file in that directory.
"./" places it in the current directory. (Make sure it matches the directory where the process_radtags folder exists, OR put the file location in place of "./".)
cd process_radtags
# change working directory to the process_radtags folder
"cd" = change directory to "process_radtags", which should be in the working directory. If not, use the full path.
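Because gzip files can be concatenated directly, the cat step above really does produce one valid .fq.gz. A self-contained sketch with two dummy lane files (all file and folder names here are made up):

```shell
# Set up a throwaway output folder with two fake "lane" files
mkdir -p demo_fastq_out && cd demo_fastq_out
printf '@read1\nACGT\n+\nIIII\n' | gzip > demo_L001.fastq.gz
printf '@read2\nTTTT\n+\nIIII\n' | gzip > demo_L002.fastq.gz
# The actual steps from above: make the folder, concatenate, move into it
mkdir -p process_radtags
cat *fastq.gz > ./process_radtags/demo_AllLanes.fq.gz
cd process_radtags
# Both lanes' reads survive in the combined file
zcat demo_AllLanes.fq.gz
```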
### Demultiplexing raw data into individual files ready for STACKS
# If raw data is downloaded per the above instructions from BaseSpace then you need to use the program/platform STACKS:process_radtags to clean, filter and demultiplex your data
# see manual for complete options
mkdir ./platename/process_radtags/process_radtags_output
"mkdir" = make a directory called "process_radtags_output" inside "process_radtags"
# Before starting, you need to make two text files:
Barcodes (e.g. "barcodes_xyloprim_plate1.txt") - a list with 1) barcode and 2) individual ID. At the bottom of the text file you need to have a blank line.
Sample names (e.g. "Samples_xyloprim_plate1.txt") - just a list of individual IDs. At the bottom of the text file you need to have a blank line.
# MOVE/SAVE both to e.g. /cratoneuron/username/process_radtags/
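Both files are plain tab-separated text. A hypothetical example (the barcodes and sample IDs are made up; note the trailing blank line in each file):

```shell
# barcodes file: <barcode><TAB><individual ID>, one pair per line, blank line at the end
printf 'ACGTA\tsample_01\nTGCAT\tsample_02\n\n' > barcodes_example.txt
# sample-names file: one individual ID per line, blank line at the end
printf 'sample_01\nsample_02\n\n' > Samples_example.txt
```

Keep the two files in the same order, since --inline_null/-b matches barcodes to samples by position.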
Use an appropriate program so files can be moved easily to the CBG server.
Mac: TextWrangler, Cyberduck, iTerm2
When saving, go to "Save to FTP/SFTP Server"
The first time you might need to add Server "10.0.0.74"; Port "22"; check "SFTP"
Add the path name, e.g. /cratoneuron/acisternas/plate#_Fastq_out/process_radtags/
Windows: Notepad++, Cyberduck, PuTTY (ssh client for Windows)
For Notepad++, make sure NppFTP is downloaded - go to Plugins -> Plugin Manager.
Go to the plugin "NppFTP" and "Show NppFTP window"
Go to the cog symbol, select "Profile settings", and create the path:
Server "10.0.0.74"; Port "22"; check "SFTP"
Do this in iTerm (Mac)/bash, or your current terminal program:
screen -S screen_name
Since the next command takes a long time, create a "screen", which will keep your process running even if you are not logged on.
Tutorial: http://www.tecmint.com/screen-command-examples-to-manage-linux-terminals/
"screen_name" can be anything you want.
Inside a screen, commands change:
Ctrl-A - tells screen you want its attention.
Cheat sheet for the screen command: http://aperiodic.net/screen/quick_reference
# In the new screen write:
process_radtags -f ./*.fq.gz -i gzfastq -e apeKI --inline_null -b ./barcodes_xyloprim_plate1.txt -o ./process_radtags_output -E phred33 -q -r -c
"process_radtags" - in this situation is the program/command
"-f" tells it where to find the file; "./*.fq.gz" matches everything with the .fq.gz extension in the current directory
I did not use "./*.fq.gz", since there were two files (plate 1 and plate 2) ending in fq.gz - instead I specified the exact file to use.
"-i" the type of input file; "gzfastq" in this case
"-e" the restriction enzyme name; "apeKI"
"--inline_null" tells the program each barcode is inline with the sequence on the single-end read
"-b" path to the barcode file (so make sure the barcodes are listed in the same order as the samples)
"-o" where to put the output: "./process_radtags_output", which we just created
"-E" the quality-score encoding, which is "phred33" for modern Illumina data
"-q" discard low-quality reads
"-r" rescue barcodes and RAD-Tags
"-c" I did not add it, but supposedly it will clean the data - remove any read with an uncalled base.
# If all is running fine, then leave the screen:
Press Ctrl-A then type "d"
# Check progress:
htop
This tells you who/what is currently running.
# Kill the screen once done:
screen -r screen_name
This gets you back into the screen.
Press Ctrl-A then type "k"
This kills the screen.
# After process_radtags is done, check how many sequences demultiplexed for each sample. The commands above only run process_radtags, so we need another one to count the sequences afterwards.
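A quick way to count demultiplexed reads per sample is to count FASTQ lines and divide by 4 (one FASTQ record = 4 lines). A sketch using a throwaway directory and a dummy file so it is self-contained; point the glob at your real process_radtags_output instead:

```shell
# Dummy demultiplexed sample with 2 reads, just to demonstrate the loop
mkdir -p demo_radtags_output
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > demo_radtags_output/sample_01.fq.gz
# Count reads in every demultiplexed file
for f in demo_radtags_output/*.fq.gz; do
  n=$(zcat "$f" | wc -l)
  echo "$f: $((n / 4)) reads"
done
```

process_radtags should also write a log in the output directory with per-barcode retained/discarded counts, which is usually easier to read than counting by hand.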
Great news!! Up to here I replicated everything using this guide :) Maybe we could make it more straightforward and organized so a beginner could do it too.
Analysis with STACKS:
TO WORK OUT WHICH PARAMETERS TO USE, DO A RUN WITH A SUBSET OF DATA TO SEE WHAT IS BEST. We did not do this initially but should moving forward. See Paris et al. 2017 or Rochette and Catchen 2017.
# script for USTACKS
# Trim and clean samples. We will do this once first to make sure the command is working - check for errors. Once it works we will run it in "parallel" to get all samples. Make sure you made the directory /data/USERNAME/ustacks_output/ on Treubia (NOT Cratoneuron):
mkdir /data/USERNAME/ustacks_output/
This is where data moves from Cratoneuron to Treubia.
Note there is no "./" to indicate the current directory; the absolute path "/data/..." puts it on Treubia.
ustacks -t gzfastq -i {#} -f ./(unknown).fq.gz -o /data/USERNAME/ustacks_output/ -p 1 -m 2 -M 2 -N 4 -d
"ustacks" - in this situation is the program/command
"-t" input file type, "gzfastq"
"-i" a unique integer ID to identify this sample (unique number "{#}")
"-f" the input path, e.g. "./cratoneuron/username/process_radtags/trimmed_cleaned/{}_1_trimm.fq.gz"
"-o" where to put the output; "./" would be the current directory
The default server is Treubia, so by using /data/USERNAME you put the output in your folder on Treubia.
*The output path defined in the code is where we switch to Treubia, by specifying /data/USERNAME/ustacks_output.
"-p" enables parallel execution with "1" thread
"-m" minimum depth of coverage to create a stack, "2", which is the default
"-M" maximum distance (in nucleotides) allowed between stacks; default is "2"
"-N" maximum distance allowed to align secondary reads to primary stacks (default: 4 = M+2)
"-d" enables the deleveraging algorithm, used for resolving over-merged tags
# If the above works, we will run in "parallel" to get all samples - DO THIS AS A TEST BUT QUIT WHEN DONE.
parallel -j 10 "ustacks -t gzfastq -o /data/acisternas/ustacks_output/ -p 1 -m 2 -M 2 -N 4 -d -i {#} -f ./{}.fq.gz" :::: ../SampleNames.txt
"parallel -j 10" runs a sequence of shell commands in parallel; "-j 10" runs up to ten shell jobs at a time.
"::::" - this tells parallel to read the sample names, in order, from the file that follows.
"../" means one folder back, so it looks in ./platename/process_radtags/ rather than in ./platename/process_radtags/process_radtags_output, which is the current working directory.
SampleNames.txt is the file made above (Analysis with STACKS).
If you get no errors within a few minutes, press "Ctrl+C" to stop.
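To see what parallel will actually run - and what {#} and {} expand to - without launching ustacks, here is a plain-shell equivalent that just prints the commands (SampleNames.txt and the sample names below are stand-ins):

```shell
# Two fake sample names, one per line, as in SampleNames.txt
printf 'sampleA\nsampleB\n' > SampleNames.txt
i=0
while IFS= read -r s; do
  [ -n "$s" ] || continue   # skip the trailing blank line, if any
  i=$((i + 1))              # parallel's {#} = the job sequence number
  # {} = the current line of the input file
  echo "ustacks -t gzfastq -o /data/USERNAME/ustacks_output/ -p 1 -m 2 -M 2 -N 4 -d -i $i -f ./$s.fq.gz"
done < SampleNames.txt > ustacks_commands.txt
cat ustacks_commands.txt
```

GNU parallel can do the same preview itself with its --dry-run flag.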
# As this command can take a long time, we run it through "qsub" - a job-submission program which lets you run the program offline and makes sure people use resources efficiently. This is great because if the machine is busy, your job waits until others finish and then starts immediately.
# Make a text file "QSUB_ustack.pbs" in TextWrangler (Mac) or Notepad++ (PC) or equivalent and place it in the folder ./platename/process_radtags/process_radtags_output
See above for working with TextWrangler (Mac) or Notepad++ (PC).
Paste the script into the text file.
Save it as "QSUB_ustack.pbs"
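The exact contents of QSUB_ustack.pbs were not recorded in these notes; a generic PBS skeleton looks like the following (every directive value here is an assumption to adjust, and the file name is just an example):

```shell
# Write a generic PBS job skeleton to a file
cat > QSUB_ustack_example.pbs <<'EOF'
#!/bin/bash
#PBS -N ustacks_run
#PBS -l nodes=1:ppn=10
#PBS -o USTACK_screen_output.txt
#PBS -j oe
# qsub starts jobs in $HOME, so move back to where you submitted from
cd $PBS_O_WORKDIR
# ...put the parallel/ustacks command from above here...
EOF
```

ppn is the number of cores requested (see the note below), -o names the output file checked later for errors, and -j oe merges stderr into it.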
The ppn value in the PBS file is how many cores you are requesting for your script. It's a queue: the larger the number, the longer you will wait for available space BUT the faster it will run. There are 24 cores, so request accordingly.
qsub QSUB_ustack.pbs
Type qstat
This will tell you the status of your run.
Under "S" you have R = running, C = complete, Q = queued.
If you see C immediately then it did not run properly - check the output file "USTACK_screen_output.txt" for errors.
To cancel a qsub that is currently running, use qstat to find the job number, e.g. 1391.treubia. Go back to the command line and type qdel 1391.treubia - this cancels the job, and in qstat it will appear as C.
UP TO HERE IN NOTE-TAKING (ANITA: I ADDED SOME NOTES ABOVE THIS ABOUT HOW TO CANCEL A QSUB)
gunzip ./*.gz
# Remember to unzip the files from .tsv.gz to just .tsv; the code for that is gunzip ./*.gz (you only have to do this once, after ustacks)
# Script for CSTACKS
# Check the size of the files and choose the two largest per population to build the catalog (if you ran samples on different plates, make sure you select equally from both plates to avoid any problem)
Use ls -lhS to see the sizes of the files in order.
# To build the catalog from the two samples from each population with the highest number of reads:
-b database/batch ID for this catalog; default value is 1
-p enable parallel execution (this will be the number of cores used; try to keep it under 15); -s gives the different samples you want to use to build the catalog
-n number of mismatches allowed between samples; default is 1. I'm sure there is a reason why Rick changed it to 4.
Error: in CLARKIA I used "-n 12" - this is supposed to be the number of mismatches allowed, NOT the number of samples...
The first time I ran cstacks I think I forgot to add the space between "-s" and "./" - I don't think it made a difference, at least with batch 7 (mini run).
b — database/batch ID for this catalog (default 1).
P — path to the directory containing Stacks files.
M — path to a population map file.
g,--aligned — base catalog construction on alignment position, not sequence identity.
n — number of mismatches allowed between sample loci when building the catalog (default 1).
p — enable parallel execution with num_threads threads.
s — sample prefix from which to load loci into the catalog.
o — output path to write results.
This took approximately 6 days.
cstacks -b 2 -p 8 -n 12 -o ./stacks -s ./HR-2158 -s ./HR-2166 -s ./NR-1907 -s ./NR-1917 -s ./DA-2 -s ./DA-11 -s ./WB-22 -s ./WB-16 -s ./ED-3651 -s ./ED-3669 -s ./BH-1959 -s ./BH-1966
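Typing dozens of -s flags by hand is error-prone (see Anita's note at the end). One sketch of an alternative is to build the argument list from a text file of sample prefixes and just print the assembled command (the file name and sample names below are made up):

```shell
# One catalog sample prefix per line, as you would pick from ls -lhS
printf 'HR-2158\nHR-2166\nNR-1907\n' > catalog_samples.txt
args=""
while IFS= read -r s; do
  # append "-s ./<sample>" for every non-blank line
  [ -n "$s" ] && args="$args -s ./$s"
done < catalog_samples.txt
# Print (rather than run) the full cstacks command for inspection
echo "cstacks -b 2 -p 8 -n 4 -o ./stacks$args" > cstacks_cmd.txt
cat cstacks_cmd.txt
```

Once the printed command looks right, it can be run with `bash cstacks_cmd.txt` or pasted into a screen.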
CLARKIA
For Clarkia I decided to make multiple folders:
rawdata - all data, unzipped. This is not to be touched; it is kept as a backup.
catalogs - this holds all the catalogs made (batches).
CATALOG folders - for each catalog, the matches are stored in one folder.
working folder - to run programs which need all files in one place. This should be cleaned out regularly.
qsub QSUB_ustack.pbs
Repeat
BATCH 2-5 all are for “-n 4”
BATCH 6-9 all are for “-n 2”
BATCH 1: (-b 7): Highest two Breweri samples from BM.
cstacks -b 7 -p 5 -n 4 -s ./CB-BM2-10 -s ./CB-BM2-17
BATCH 3: (-b 3): Highest two Breweri samples from each pop. One from each plate. All samples.
cstacks -b 3 -p 5 -n 4 -s ./CB-BM2-10 -s ./CB-BM2-17 -s ./CB-CR-6 -s ./CB-CR-15 -s ./CB-FR-6 -s ./CB-FR-20 -s ./CB-MM-8 -s ./CB-MM-13 -s ./CB-PI-6 -s ./CB-PI-14 -s ./CB-PP-8 -s ./CB-PP-15
BATCH 4: (-b 4): Highest two concinna ssp concinna samples from each pop. One from each plate. All samples.
cstacks -b 4 -p 4 -n 4 -s ./CC-DH-8 -s ./CC-DH-14 -s ./CC-LC-5 -s ./CC-LC -s ./CC-MW-6 -s ./CC-MW-15 -s ./CC-RV-5 -s ./CC-RV-14
BATCH 5 (-b 5): Highest two concinna ssp automixa samples from each pop. One from each plate. All samples.
cstacks -b 5 -p 4 -n 4 -s ./CC-CM2-11 -s ./CC-CM2-20 -s ./CC-MH-6 -s ./CC-MH-20
BATCH 6: (-b 2, n -2): Highest two samples from each pop. One from each plate. All samples.
cstacks -b 6 -p 8 -n 2 -o ./data/acisternas/ustacks_output_clarkia/catalogs -s ./rawdata/CB-BM2-10 -s ./rawdata/CB-BM2-17 -s ./rawdata/CB-CR-6 -s ./rawdata/CB-CR-15 -s ./rawdata/CB-FR-6 -s ./rawdata/CB-FR-20 -s ./rawdata/CB-MM-8 -s ./rawdata/CB-MM-13 -s ./rawdata/CB-PI-6 -s ./rawdata/CB-PI-14 -s ./rawdata/CB-PP-8 -s ./rawdata/CB-PP-15 -s ./rawdata/CC-CM2-11 -s ./rawdata/CC-CM2-20 -s ./rawdata/CC-DH-8 -s ./rawdata/CC-DH-14 -s ./rawdata/CC-LC-5 -s ./rawdata/CC-LC-12 -s ./rawdata/CC-MH-6 -s ./rawdata/CC-MH-20 -s ./rawdata/CC-MW-6 -s ./rawdata/CC-MW-15 -s ./rawdata/CC-RV-5 -s ./rawdata/CC-RV-14
BATCH 7: (-b 3): Highest two Breweri samples from each pop. One from each plate. All samples.
cstacks -b 7 -p 4 -n 4 -o ./data/acisternas/ustacks_output_clarkia/catalogs -s ./CB-BM2-10 -s ./rawdata/CB-BM2-17 -s ./rawdata/CB-CR-6 -s ./rawdata/CB-CR-15 -s ./rawdata/CB-FR-6 -s ./rawdata/CB-FR-20 -s ./rawdata/CB-MM-8 -s ./rawdata/CB-MM-13 -s ./rawdata/CB-PI-6 -s ./rawdata/CB-PI-14 -s ./rawdata/CB-PP-8 -s ./rawdata/CB-PP-15
BATCH 8: (-b 4): Highest two concinna ssp concinna samples from each pop. One from each plate. All samples.
cstacks -b 8 -p 4 -n 2 -o ./data/acisternas/ustacks_output_clarkia/catalogs -s ./rawdata/CC-DH-8 -s ./rawdata/CC-DH-14 -s ./rawdata/CC-LC-5 -s ./rawdata/CC-LC -s ./rawdata/CC-MW-6 -s ./rawdata/CC-MW-15 -s ./rawdata/CC-RV-5 -s ./rawdata/CC-RV-14
BATCH 9 (-b 5): Highest two concinna ssp automixa samples from each pop. One from each plate. All samples.
cstacks -b 9 -p 4 -n 2 -o ./data/acisternas/ustacks_output_clarkia/catalogs -s ./rawdata/CC-CM2-11 -s ./rawdata/CC-CM2-20 -s ./rawdata/CC-MH-6 -s ./rawdata/CC-MH-20
# script for SSTACKS
Sets of stacks, i.e. putative loci, constructed by the ustacks program can be searched against a catalog produced by cstacks. In the case of a genetic map, stacks from the progeny would be matched against the catalog to determine which progeny contain which parental alleles. In the case of a general population, all samples in the population would be matched against the catalog with sstacks.
sstacks -P ./stacks -M ./popmap -p 8
sstacks [--aligned] -P dir [-b batch_id] -M popmap [-p n_threads]
sstacks [--aligned] -c catalog_path -s sample_path [-s sample_path ...] -o path [-p n_threads]
b — batch ID of the catalog to consider (default: guess). - CHECK in CSTACKS notes
P — path to the directory containing Stacks files.
M — path to a population map file from which to take sample names.
s — filename prefix from which to load sample loci.
c — path to the catalog.
g,--aligned — base matching on alignment position, not sequence identity.
p — enable parallel execution with num_threads threads.
o — output path to write results.
x — don't verify haplotype of matching locus.
Gapped assembly options:
--gapped — perform gapped alignments against the catalog loci.
-b batch ID of the catalog; -p enables parallel execution; -c is the location of the catalog; -s the locations of the samples; and the text file lists the files you want to use.
parallel -j 10 "sstacks -b 2 -p 1 -c ./batch_2 -s /data/acisternas/ustacks_output/{}.tsv -o ./" :::: ./primi_plate1.txt
mkdir ./catalogs
mv -v ./batch* ./catalogs/
Move all batches to the catalogs folder.
mkdir ./rawdata
mv -v ./*.tsv ./rawdata/
Moves all raw data to one folder.
mkdir ./Breweri
The name refers to the catalog, NOT the species.
BATCH 3: Just Breweri - Highest two Breweri samples from each pop. One from each plate. All samples
parallel -j 10 "sstacks -b 3 -p 4 -c /data/acisternas/ustacks_output_clarkia/catalogs/batch_3 -o /data/acisternas/ustacks_output_clarkia/BreweriCATALOG -s /data/acisternas/ustacks_output_clarkia/{}.tsv " :::: ./samples_clarkia_combined.txt
TEST
sstacks -b 3 -c ./catalogs/batch_3 -s ./rawdata/CB-CR-12 -o ./BreweriCATALOG
START: sstacks -b 3 -p 4 -c ./catalogs/batch_3 -o ./BreweriCATALOG
DATA: -s ./rawdata/BLANK01 -s ./rawdata/BLANK02 -s ./rawdata/BLANK04 -s ./rawdata/BLANK05 -s ./rawdata/CB-BM2-10 -s ./rawdata/CB-BM2-11 -s ./rawdata/CB-BM2-13 -s ./rawdata/CB-BM2-14 -s ./rawdata/CB-BM2-15 -s ./rawdata/CB-BM2-17 -s ./rawdata/CB-BM2-18 -s ./rawdata/CB-BM2-19 -s ./rawdata/CB-BM2-2 -s ./rawdata/CB-BM2-3 -s ./rawdata/CB-BM2-4 -s ./rawdata/CB-BM2-5 -s ./rawdata/CB-BM2-7 -s ./rawdata/CB-BM2-8 -s ./rawdata/CB-BM2-9 -s ./rawdata/CB-CR-1 -s ./rawdata/CB-CR-10 -s ./rawdata/CB-CR-11 -s ./rawdata/CB-CR-12 -s ./rawdata/CB-CR-13 -s ./rawdata/CB-CR-14 -s ./rawdata/CB-CR-15 -s ./rawdata/CB-CR-2 -s ./rawdata/CB-CR-3 -s ./rawdata/CB-CR-4 -s ./rawdata/CB-CR-5 -s ./rawdata/CB-CR-6 -s ./rawdata/CB-CR-7 -s ./rawdata/CB-CR-8 -s ./rawdata/CB-CR-9 -s ./rawdata/CB-FR-1 -s ./rawdata/CB-FR-11 -s ./rawdata/CB-FR-12 -s ./rawdata/CB-FR-13 -s ./rawdata/CB-FR-14 -s ./rawdata/CB-FR-16 -s ./rawdata/CB-FR-17 -s ./rawdata/CB-FR-2 -s ./rawdata/CB-FR-20 -s ./rawdata/CB-FR-3 -s ./rawdata/CB-FR-4 -s ./rawdata/CB-FR-5 -s ./rawdata/CB-FR-6 -s ./rawdata/CB-FR-7 -s ./rawdata/CB-FR-8 -s ./rawdata/CB-MM-1 -s ./rawdata/CB-MM-10 -s ./rawdata/CB-MM-11 -s ./rawdata/CB-MM-12 -s ./rawdata/CB-MM-13 -s ./rawdata/CB-MM-14 -s ./rawdata/CB-MM-15 -s ./rawdata/CB-MM-2 -s ./rawdata/CB-MM-3 -s ./rawdata/CB-MM-4 -s ./rawdata/CB-MM-5 -s ./rawdata/CB-MM-6 -s ./rawdata/CB-MM-7 -s ./rawdata/CB-MM-8 -s ./rawdata/CB-MM-9 -s ./rawdata/CB-PI-1 -s ./rawdata/CB-PI-10 -s ./rawdata/CB-PI-11 -s ./rawdata/CB-PI-12 -s ./rawdata/CB-PI-13 -s ./rawdata/CB-PI-14 -s ./rawdata/CB-PI-15 -s ./rawdata/CB-PI-2 -s ./rawdata/CB-PI-3 -s ./rawdata/CB-PI-4 -s ./rawdata/CB-PI-5 -s ./rawdata/CB-PI-6 -s ./rawdata/CB-PI-7 -s ./rawdata/CB-PI-8 -s ./rawdata/CB-PI-9 -s ./rawdata/CB-PP-1 -s ./rawdata/CB-PP-10 -s ./rawdata/CB-PP-11 -s ./rawdata/CB-PP-12 -s ./rawdata/CB-PP-13 -s ./rawdata/CB-PP-14 -s ./rawdata/CB-PP-15 -s ./rawdata/CB-PP-2 -s ./rawdata/CB-PP-3 -s ./rawdata/CB-PP-4 -s ./rawdata/CB-PP-6 -s ./rawdata/CB-PP-7 -s ./rawdata/CB-PP-8 -s 
./rawdata/CB-PP-9 -s ./rawdata/CC-CM2-1 -s ./rawdata/CC-CM2-10 -s ./rawdata/CC-CM2-11 -s ./rawdata/CC-CM2-12 -s ./rawdata/CC-CM2-13 -s ./rawdata/CC-CM2-15 -s ./rawdata/CC-CM2-16 -s ./rawdata/CC-CM2-18 -s ./rawdata/CC-CM2-19 -s ./rawdata/CC-CM2-2 -s ./rawdata/CC-CM2-20 -s ./rawdata/CC-CM2-3 -s ./rawdata/CC-CM2-5 -s ./rawdata/CC-CM2-6 -s ./rawdata/CC-DH-1 -s ./rawdata/CC-DH-10 -s ./rawdata/CC-DH-11 -s ./rawdata/CC-DH-12 -s ./rawdata/CC-DH-13 -s ./rawdata/CC-DH-14 -s ./rawdata/CC-DH-15 -s ./rawdata/CC-DH-2 -s ./rawdata/CC-DH-3 -s ./rawdata/CC-DH-4 -s ./rawdata/CC-DH-5 -s ./rawdata/CC-DH-6 -s ./rawdata/CC-DH-7 -s ./rawdata/CC-DH-8 -s ./rawdata/CC-DH-9 -s ./rawdata/CC-LC-1 -s ./rawdata/CC-LC-10 -s ./rawdata/CC-LC-11 -s ./rawdata/CC-LC-12 -s ./rawdata/CC-LC-13 -s ./rawdata/CC-LC-14 -s ./rawdata/CC-LC-15 -s ./rawdata/CC-LC-2 -s ./rawdata/CC-LC-3 -s ./rawdata/CC-LC-4 -s ./rawdata/CC-LC-5 -s ./rawdata/CC-LC-6 -s ./rawdata/CC-LC-7 -s ./rawdata/CC-LC-8 -s ./rawdata/CC-LC-9 -s ./rawdata/CC-MH-1 -s ./rawdata/CC-MH-13 -s ./rawdata/CC-MH-14 -s ./rawdata/CC-MH-15 -s ./rawdata/CC-MH-16 -s ./rawdata/CC-MH-17 -s ./rawdata/CC-MH-18 -s ./rawdata/CC-MH-19 -s ./rawdata/CC-MH-2 -s ./rawdata/CC-MH-20 -s ./rawdata/CC-MH-3 -s ./rawdata/CC-MH-4 -s ./rawdata/CC-MH-5 -s ./rawdata/CC-MH-6 -s ./rawdata/CC-MH-7 -s ./rawdata/CC-MH-8 -s ./rawdata/CC-MW-1 -s ./rawdata/CC-MW-10 -s ./rawdata/CC-MW-11 -s ./rawdata/CC-MW-12 -s ./rawdata/CC-MW-13 -s ./rawdata/CC-MW-14 -s ./rawdata/CC-MW-15 -s ./rawdata/CC-MW-2 -s ./rawdata/CC-MW-3 -s ./rawdata/CC-MW-4 -s ./rawdata/CC-MW-5 -s ./rawdata/CC-MW-6 -s ./rawdata/CC-MW-7 -s ./rawdata/CC-MW-8 -s ./rawdata/CC-MW-9 -s ./rawdata/CC-RV-1 -s ./rawdata/CC-RV-10 -s ./rawdata/CC-RV-11 -s ./rawdata/CC-RV-12 -s ./rawdata/CC-RV-13 -s ./rawdata/CC-RV-14 -s ./rawdata/CC-RV-15 -s ./rawdata/CC-RV-2 -s ./rawdata/CC-RV-3 -s ./rawdata/CC-RV-4 -s ./rawdata/CC-RV-5 -s ./rawdata/CC-RV-6 -s ./rawdata/CC-RV-7 -s ./rawdata/CC-RV-8 -s ./rawdata/CC-RV-9 -s ./rawdata/OHARRI01 -s 
./rawdata/OHARRI02 -s ./rawdata/OHARRI03 -s ./rawdata/OHARRI04 -s ./rawdata/OHARRI05 -s ./rawdata/OHARRI06
"-b 3" - using batch 3, with the catalog made from Breweri data
"-P /data/acisternas/ustacks_output/rawdata" - location of the Stacks files (".tsv")
"-p 4" - run in parallel with 4 threads
"-c /data/acisternas/ustacks_output/catalogs" - location of the batch (catalog) files
"-s /data/acisternas/ustacks_output/{}.tsv" - names of the files (which is all of them)
"-o ./Breweri" - where we are going to put the output
":::: ./primi_plate1.txt" - file listing all sample names
BATCH 4: Just concinna ssp concinna -
sstacks -b 4 -p 4 -c /data/acisternas/ustacks_output_clarkia/catalogs/batch_4 -o /data/acisternas/ustacks_output_clarkia/ConcinnasspCCATALOG
BATCH 5: Just concinna ssp automixia -
sstacks -b 5 -p 4 -c /data/acisternas/ustacks_output_clarkia/catalogs/batch_5 -o /data/acisternas/ustacks_output_clarkia/ConcinnasspACATALOG
BATCH 2: Complete Clarkia dataset -
sstacks -b 2 -p 4 -c /data/acisternas/ustacks_output_clarkia/catalogs/batch_2 -o /data/acisternas/ustacks_output_clarkia/ClarkiaCombinedCATALOG
BATCH 6: Complete Clarkia dataset -
sstacks -b 6 -p 4 -c ./catalogs/batch_6 -o ./ClarkiaCombinedCATALOG6
BATCH 7 Just Breweri - Highest two Breweri samples from each pop. One from each plate. All samples
sstacks -b 7 -p 4 -c ./catalogs/batch_7 -o ./BreweriCATALOG7
BATCH 8: Just concinna ssp concinna -
sstacks -b 8 -p 4 -c ./catalogs/batch_8 -o ./ClarkiasspCCATALOG8
BATCH 9: Just concinna ssp automixia -
sstacks -b 9 -p 4 -c ./catalogs/batch_9 -o ./ClarkiasspACATALOG9
```
drwxr-sr-x. 2 jfant gbs 12288 Aug 4 12:26 BreweriCATALOG
drwxr-sr-x. 2 jfant gbs 4096 Aug 21 15:10 BreweriCATALOG7
drwxr-sr-x. 2 jfant gbs 4096 Aug 21 13:31 catalogs
drwxr-sr-x. 3 jfant gbs 12288 Aug 4 12:26 ClarkiaCombinedCATALOG
drwxr-sr-x. 2 jfant gbs 4096 Aug 21 15:10 ClarkiaCombinedCATALOG6
drwxr-sr-x. 2 jfant gbs 4096 Aug 21 15:10 ClarkiasspACATALOG9
drwxr-sr-x. 2 jfant gbs 4096 Aug 21 15:11 ClarkiasspCCATALOG8
drwxr-sr-x. 2 jfant gbs 12288 Aug 4 12:26 ConcinnasspACATALOG
drwxr-sr-x. 2 jfant gbs 12288 Aug 4 12:26 ConcinnasspCCATALOG
-rw-r--r--. 1 jfant gbs 821 Aug 10 14:57 QSUB_ustackBatch6.pbs
-rw-r--r--. 1 jfant gbs 561 Aug 16 14:51 QSUB_ustackBatch7.pbs
-rw-r--r--. 1 jfant gbs 458 Aug 10 14:57 QSUB_ustackBatch8.pbs
-rw-r--r--. 1 jfant gbs 375 Aug 10 14:57 QSUB_ustackBatch9.pbs
drwxr-sr-x. 2 jfant gbs 24576 Aug 4 12:25 rawdata
-rw-r--r--. 1 jfant gbs 1571 Jul 26 15:55 samples_clarkia_combined.txt
-rw-------. 1 jfant gbs 15225 Jul 17 12:17 USTACK_screen_output_1May.txt
-rw-------. 1 jfant gbs 109762 Aug 3 16:50 USTACK_screen_output_Batch2All.txt
-rw-------. 1 jfant gbs 108116 Jul 27 13:35 USTACK_screen_output_Batch3Breweri.txt
-rw-------. 1 jfant gbs 108974 Jul 27 14:55 USTACK_screen_output_Batch4coccinea.txt
-rw-------. 1 jfant gbs 108803 Jul 27 17:01 USTACK_screen_output_Batch5coccineaA.txt
-rw-------. 1 jfant gbs 12047 Aug 18 09:10 USTACK_screen_output_Batch6ALL.txt
-rw-------. 1 jfant gbs 5912 Aug 17 14:25 USTACK_screen_output_Batch7breweri.txt
-rw-------. 1 jfant gbs 3445 Aug 12 11:57 USTACK_screen_output_Batch8concinnaC.txt
-rw-------. 1 jfant gbs 1833 Aug 10 16:56 USTACK_screen_output_Batch9coccineaA.txt
-rw-------. 1 jfant gbs 780 Jul 19 10:31 USTACK_screen_output_breweri_mini.txt
-rw-------. 1 jfant gbs 1754 Jul 24 22:07 USTACK_screen_output_breweri.txt
-rw-------. 1 jfant gbs 11358 Aug 3 04:14 USTACK_screen_output_clarkiaALL.txt
-rw-------. 1 jfant gbs 3282 Jul 26 09:23 USTACK_screen_output_concinna_sspCon.txt
drwxr-sr-x. 2 jfant gbs 45056 Aug 21 14:21 working
```
### POPULATIONS
I created a new folder in ustacks_output to run the test of primi against the catalog built from just primi. This avoids confusion: each populations run overwrites the output of the previous one, which can be very confusing.
You need to combine the rawdata and matches files into one folder; you may want to do each run separately and then delete the old one.
parallel -j 10 "sstacks -b 1 -p 1 -c ./batch_1 -s ./{}_1_trimm -o ./" :::: ../Elshire_SampleNames.txt
# example scripts for POPULATIONS: -r is the fraction of individuals and -p is the number of populations required to keep a locus
populations -b 3 -P ./ -M ../popmap.txt -t 5 -r 0.25 -k --fstats
populations -b 1 -P ./denovo_map_output_AP -M ./popmap_AP.txt -t 5 -r 0.25 -k --fstats
populations -b 1 -P ./denovo_map_output_AP -M ./popmap_AP.txt -t 5 -k --fstats
populations -b 1 -P ./ -M ../popmap_xylo_prim_test.txt -t 18 -k -r 0.75 -p 3 --fstats
populations -b 6 -P ./ClarkiaCombinedCATALOG6 -M ./population_map_clarkia.txt -t 5 -r 0.5 -k --fstats --genepop
### POPULATION NOTES
Try this code when I have enough space!
* -b: the batch number of the catalog
* -P: the path to the files; requires -b
* -M: the path to the population map
* -t: number of threads to run in parallel
* -k: enable kernel-smoothed statistics
* -r: minimum percentage of individuals in a population required to process a locus
* -p: minimum number of populations a locus must be present in to process it
* -f p_value: apply a p-value correction to FST scores; if an FST score is not significantly different from 0 (according to Fisher's exact test), set it to 0
populations -b 1 -P ./ -M ./primi_popmap.txt -t 10 -k -r 0.75 -p 5 --fstats
Anita's notes: I created the catalog and proceeded with the sstacks step. I couldn't get the code to run once for all samples, so I had to specify every sample separately (the code itself runs quickly, but it takes time to make sure every sample is listed; doing it in chunks of samples, all samples took me about 2 hours). Then I started with populations. It took a while because I created a new directory, but then realized it doesn't run unless the *.tags.tsv files are in the same place as the *.matches.tsv files. I followed what we have done before: present in all 6 populations and in 75% of samples (791 loci), 70% of samples (also 791 loci), and 80% (271 loci). That's lower than what I got using just one plate, but not too bad!
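One way to avoid listing every sample by hand (the problem described above) is to generate the sstacks calls in a loop from a file of sample names. A minimal sketch, assuming a hypothetical samples.txt (one ustacks output prefix per line) and a batch-1 catalog named ./batch_1; the commands are only echoed here so they can be inspected before running:

```shell
# hypothetical fixture: one ustacks sample prefix per line
printf 'sampleA\nsampleB\n' > samples.txt

# generate one sstacks call per sample; drop 'echo' to actually run them
while read -r s; do
  echo sstacks -b 1 -p 4 -c ./batch_1 -s "./$s" -o ./
done < samples.txt > sstacks_commands.txt

cat sstacks_commands.txt
```

This is essentially what the `parallel` one-liner above does, but it lets you review the full command list first.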
The following code extracts output in STRUCTURE and GenePop formats. Only the first SNP from each RAD locus is written, to prevent linked SNPs from being processed by STRUCTURE.
This code is the one I’m currently using for populations of primi
populations -P ./ -M ./primi_plate1.txt -b 1 -k -p 6 -r 0.75 -f p_value -t 15 --fstats --structure --genepop --write_single_snp
Using this with 60 samples from primi plate 1, we have 232 loci present in 75% of individuals and in at least 6 populations, but 766 loci in the second plate!
Here is another option: it uses a whitelist of 1,000 randomly selected loci so that STRUCTURE gets a computationally manageable amount of data.
populations -P ../ -M ./primi_popmap.txt -b 1 -k -p 10 -r 0.75 -f p_value -t 15 --structure --genepop --write_single_snp -W ./wl_1000
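A whitelist like wl_1000 can be built by sampling locus IDs from the catalog. A sketch under the assumption (Stacks v1 tags format) that column 3 of the catalog tags file holds the locus ID; the tiny input file here is fabricated for illustration, and `wl_random` is a hypothetical output name:

```shell
# fabricated stand-in for batch_1.catalog.tags.tsv (3 loci)
printf '0\t0\t%s\tconsensus\n' 1 2 3 > batch_1.catalog.tags.tsv

# sample N random locus IDs (column 3) to use with populations -W
N=2
cut -f 3 batch_1.catalog.tags.tsv | sort -u | shuf -n "$N" > wl_random
wc -l < wl_random
```

In practice you would set N=1000 and point `populations -W` at the resulting file.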
populations -P dir -b batch_id [-O dir] [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)
Move all files to the “working” directory:
cp rawdata/* working
cp ClarkiaCombinedCATALOG/* working
RUNNING against:
Pop 1-6: Breweri (BM, CR, FR, MM, PI, PP)
Pop 7 & 10: Concinna ssp Auto (CM2, MH)
Pop 8, 9, 11, 12: Concinna ssp Conc (DH, LC, MW, RV)
BATCH 2 – catalog made with all samples
populations -P ./ -M ./population_map_clarkia.txt -b 2 -k -p 6 -r 0.85 -f p_value -t 4 --fstats --structure --genepop --write_single_snp -W ./wl_1000
BATCH 6 -
populations -b 6 -P ./ClarkiaCombinedCATALOG6results -M ./population_map_clarkia.txt -t 5 -r 0.5 -k -p 5 -f p_value --fstats --genepop --structure --write_single_snp
* -b 6: the batch number of the catalog (Batch 6)
* -P ./ClarkiaCombinedCATALOG6results: the path to the files; requires -b
* -M ./population_map_clarkia.txt: the path to the population map
* -t 5: number of threads to run in parallel
* -k: enable kernel-smoothed statistics (YES?)
* -r 0.5: minimum percentage of individuals in a population required to process a locus (50%, i.e. 8 individuals)
* -p 4: minimum number of populations a locus must be present in (all but one of each species)
* -f p_value: apply a p-value correction to FST scores; if an FST score is not significantly different from 0 (Fisher's exact test), set it to 0
“-p 6” – at least in one species. & “-r 0.65” – have 15 individuals would like loci in 10
179 samples with 0 loci
“-p 6” – at least in one species. & “-r 0.70” – have 15 individuals would like loci in 10
179 samples with 79 loci
Only a single locus was variable.
“-p 4” – at least in one species. & “-r 0.70” – have 15 individuals would like loci in 10
179 samples with 391 loci
BATCH 4 – catalog made with all samples
populations -P ./ -M ./population_map_clarkia.txt -b 4 -k -p 6 -r 0.85 -f p_value -t 4 --fstats --structure --genepop --write_single_snp -W ./wl_1000
“-p 4” – at least in one species. & “-r 0.75” – have 15 individuals would like loci in 10
179 samples with 84 loci
0 loci were variable? WTF
“-p 11” – at least in one species. & “-r 0.5” – have 15 individuals would like loci in 8
179 samples with 38 loci
[[Notes](https://docs.google.com/document/d/1DEAnOAZ9EakjdknrQHda7sglU9o9CSZx/edit#heading=h.gjdgxs)]
GAME PLAN: The aim is to determine the best values of m (3-7) and M (1-7) (with N = M+2).
STEP 1: Prepare the data for running. (THIS IS NOT IN STACKS)
Convert raw data (bcl2fastq)
Concatenate data by sample
Process radtags – identify cut regions etc.
STEP 2: Identify the samples that will be used to determine the best parameters.
We selected samples spanning each population AND each plate used.
In Clarkia we had 12 populations and two plates, so we selected the 24 samples with the largest file sizes.
Copies of these samples are moved to a directory called “parameter test”.
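Picking the largest files can be scripted with `ls -S` (sort by size, largest first). A sketch with fabricated file names and sizes; in practice you would point it at the real demultiplexed reads, set N=24, and use the real parameter-test directory:

```shell
# fabricated example files (stand-ins for demultiplexed .fq.gz reads)
mkdir -p samples.t100 parameter_test
head -c 300 /dev/zero > samples.t100/big.fq.gz
head -c 10  /dev/zero > samples.t100/small.fq.gz

# copy the N largest files into the parameter-test directory
N=1
ls -S samples.t100 | head -n "$N" | while read -r f; do
  cp "samples.t100/$f" parameter_test/
done
ls parameter_test
```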
STEP 3: Identify loci in each sample, using various parameters to account for error and divergence between alleles.
m is the minimum number of identical reads required before they are counted as a “stack” (i.e. a putative allele). Stacks scans all sequences and groups those that are identical.
m can decrease genotyping error by distinguishing real loci from PCR and sequencing error.
If m is too low, reads with convergent sequencing errors can be erroneously labelled as stacks.
If m is too high, true alleles will not be recovered (allele drop-out), calling loci homozygous rather than heterozygous.
We test m = 3-7.
m=1: everything unique is counted (even sequencing error).
m=3 or higher is a safe choice, as it requires a read to be common enough in the sample.
With high coverage, potential contamination, or phylogenetic work, you might want to go higher.
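The m sweep above can be scripted as a loop over ustacks runs. A sketch that only echoes the commands so they can be inspected first (the sample path and output directories are placeholders; drop `echo` to actually run; N is left at the recommended M+2):

```shell
# one ustacks run per m value; placeholders: sample1.fq.gz, param_test/
M=2
i=1
for m in 3 4 5 6 7; do
  mkdir -p "param_test/m${m}"
  echo ustacks -f ./parameter_test/sample1.fq.gz -i "$i" \
       -o "param_test/m${m}" -m "$m" -M "$M" -N $((M + 2)) -p 4
  i=$((i + 1))
done > ustacks_m_commands.txt
cat ustacks_m_commands.txt
```

Keeping each m value in its own output directory makes it easy to compare locus counts afterwards.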
M is the number of mismatches allowed between your “alleles” (stacks) before they are merged into a “locus”. Stacks scans your stacks/alleles and works out which come from the same locus.
If M is too low, alleles of the same locus will be treated as different loci (undermerging).
If M is too high, paralogs will form large nonsensical loci (overmerging).
We test M = 1-8.
M=0: each stack/allele is a unique locus.
M=2: allows a moderate number of differences between alleles.
M>2: can also make sense, since alleles of the same locus are assumed to share more sequence in common than not.
Increase M if you have high polymorphism, high genomic divergence, or polyploidy.
N is the number of mismatches allowed when aligning secondary reads (reads that did not form stacks) to assembled putative loci, to increase locus depth.
Stacks recommends using N = M+2.
STEP 4: DE NOVO – as we do not have a genome to compare our samples to, we need to make a “catalog” (i.e. a list of all loci identified within the dataset).
Two processes: 1) identify the samples used to build the catalog, and 2) determine how many mismatches are allowed when matching loci between samples (the “n” parameter is analogous to “M”, but across samples).
If you assume diversity WITHIN samples is the same as BETWEEN samples, then n=M. If you expect more mismatches between your “alleles” (higher diversity BETWEEN samples), increase n to M+1 or more.
There are multiple aims in this parameter-testing phase:
1) Finish the data analysis from STEP 3.
2) Determine the best value of n (M-1, M, M+1). n (like M) is the number of mismatches allowed between “alleles” before they are called one “locus”, but across multiple samples.
3) Work out how many samples are needed to make an accurate catalog. We recommend starting with the 2 samples with the most data (one of each species).
Rather than doing all combinations, it is easier to test one factor at a time, keeping the others fixed. Use the information gained at each step to inform the next. It does not matter where you start.
Step 4.1 – Completing the analysis in Step 3.
Use the Stacks default n=M (tested later in Step 4.2).
Use the 2-3 samples with the most data for the catalog (rather than ALL samples; tested later in Step 4.3).
Step 4.2 – Determine the best value of n.
Stacks recommends using n=M.
We will repeat with n=M-1 and n=M+1.
We recommend choosing the default ustacks parameters (m=3, M=2, N=4, n=(324)).
As per Step 4.1, use the 2-3 samples with the most data for the catalog (rather than ALL samples; tested later in Step 4.3).
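The n sweep in Step 4.2 can be sketched the same way: with M=2 this tests n = 1, 2, 3. The catalog sample prefixes are placeholders, and the cstacks commands are only echoed for inspection (drop `echo` to run):

```shell
# one cstacks run per n value (n = M-1, M, M+1)
M=2
for n in $((M - 1)) "$M" $((M + 1)); do
  mkdir -p "catalogs/n${n}"
  echo cstacks -b 1 -n "$n" -p 4 \
       -s ./ustacks_out/sample1 -s ./ustacks_out/sample2 \
       -o "catalogs/n${n}"
done > cstacks_n_commands.txt
cat cstacks_n_commands.txt
```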
Step 4.3 – Determine the number of samples used to make the catalog.
Stacks recommends using all samples, but this can be very time consuming.
We will repeat with n=M and the default ustacks parameters (m=3, M=2, N=4, n=(324)).
We will then make the catalog using 1) half the samples and then 2) all samples.
STEP 5: SSTACKS – scans the putative loci constructed by ustacks in each individual and compares them back to the catalog produced by cstacks.
This creates the matches files, which contain all the loci that match the catalog you are using.
The only decision in this step is which catalog to use.
STEP 6: POPULATIONS – identifies your population parameters from the matches.
The most important selection parameter is “-r”, which dictates how many samples must have a locus before it is included in the analysis.
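To see how sensitive the retained locus count is to -r, the run can be repeated over a range of values, as was done by hand in the batch notes above. A sketch with placeholder paths (./working, ./popmap.txt); the populations commands are echoed for inspection rather than run:

```shell
# one populations run per -r threshold; drop 'echo' to actually run
for r in 0.50 0.65 0.75 0.85; do
  echo populations -b 1 -P ./working -M ./popmap.txt -t 4 \
       -r "$r" -p 6 -k --fstats --genepop --write_single_snp
done > populations_r_commands.txt
cat populations_r_commands.txt
```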