# CBG Metagenomics
On a personal computer.
## Note: R2 is lost somewhere?
by: sagesteppe, & CBG Genetics Team.
Overview: I chose to run a dual boot of Ubuntu Linux alongside Windows 10 on a Lenovo ThinkPad T580.
```sagesteppe@sagesteppe-ThinkPad-T580:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Stepping: 9
CPU MHz: 600.029
CPU max MHz: 3100.0000
```
Just to be clear, the number of threads is:
```
$ nproc --all
> 4
```
This value can be misleading when the CPU uses hyper-threading; if in doubt, defer to the CPU(s) count in the lscpu printout above.
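If you want to cross-check nproc against the physical core count, here is a quick sketch parsing the same lscpu output shown above (adjust the field labels if your lscpu prints them differently):
```
# logical CPUs (threads) the kernel reports
$ nproc --all
# physical cores = Core(s) per socket x Socket(s), parsed from lscpu
$ lscpu | awk -F: '/^Core\(s\) per socket/ {c=$2} /^Socket\(s\)/ {s=$2} END {print c*s, "physical cores"}'
```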
We will use this info occasionally.
The only modification to the computer is that it has two 16 GB RAM sticks (Crucial DDR4 2400).
## 1: Installing Ubuntu
'Free your mind and your software will follow' - George Clinton *mostly*
I was able to largely follow two guides to accomplish this.
My instructions below are brief.
For the dual-boot installation, download the most recent stable version of Ubuntu as an ISO image. I have previously used Balena Etcher to flash the image onto a USB thumb drive, but have switched to Rufus for the last couple of boots; I recommend Rufus now.
Following this, use an administrator account to open the Command Prompt with 'Run as Administrator'.
Now enter ```diskmgmt.msc```; in the GUI, click on the 'C:' drive and select 'Shrink Volume'. Shrink by somewhere between 20,000 and 80,000 MiB. I did 75,000, which should be way overkill.
To dual boot I had to go with the old approach of pressing F12 at startup to choose the boot device. If I remember correctly, this was the same route I had to use for Mint on a Lenovo ThinkPad T530, so it could be a Lenovo thing.
### 1a: Troubleshooting Windows Shutdown
As a consequence of dual booting, it may be hard to shut down your hardware from the Windows OS. This workaround makes logging into Windows take a little longer, but lets your hardware shut down properly.
This Unix & Linux Stack Exchange thread had a solution which worked for me:
https://unix.stackexchange.com/questions/247184/unable-to-shutdown-windows-after-installing-grub
* (1) Go to control panel.
* (2) Find power options.
* (3) From the left menu click on "Choose what the power button does".
* (4) Click on "Change settings that are currently unavailable."
* (5) Go to the bottom of the page, uncheck "Turn on fast startup" and save changes.
After doing the typical updates and processes associated with installing Linux you are good to go!
### 1b: Troubleshooting Time Zones
You'll find the time on your Windows OS is probably wrong. Run this on the Linux side of the computer to fix it:
```
$ timedatectl set-local-rtc 1
```
Now you need to update the time manually in Windows once, and you should be good.
### 1c: Data transfer from Fabronia to Local
Strange things happen here and we do not know why.
Here are some quick lines to inspect the files you transferred over, to figure out whether you are missing anything.
```
$ mkdir /media/sagesteppe/Genomics/data_summaries
# A) find distinct Sample ID's.
$ cd /media/sagesteppe/Genomics/Original_zip
$ find . -regex ".*\_S[0-9][0-9]?[0-9]?_.*" > ../data_summaries/sample_ids.txt
$ sed -i -E "s/.*(S[0-9][0-9]?[0-9]?).*/\1/" ../data_summaries/sample_ids.txt
$ sort -u ../data_summaries/sample_ids.txt > ../data_summaries/unique_seqs.txt
# B) count the distinct samples to determine how many are present.
$ wc -l ../data_summaries/unique_seqs.txt
# C) determine which samples have each of 4 reads associated with them.
$ sort ../data_summaries/sample_ids.txt | uniq -c > ../data_summaries/reads_per_sample.txt
# D) Identify the missing & extra reads.
$ awk ' $1 < 7 { print }' ../data_summaries/reads_per_sample.txt > ../data_summaries/missing_reads.txt
$ awk ' $1 > 8 { print }' ../data_summaries/reads_per_sample.txt > ../data_summaries/extra_reads.txt
# E) Report the missing & extra reads.
$ cat ../data_summaries/missing_reads.txt
$ cat ../data_summaries/extra_reads.txt
# F) Determine which reads are empty.
# (not fully solved here; a sketch follows this code block)
# G) Determine if any samples are missing.
$ sed -i -E 's/S//g' ../data_summaries/unique_seqs.txt # remove 'S'
$ sort -n ../data_summaries/unique_seqs.txt -o ../data_summaries/unique_seqs.txt # arrange numerically
$ min=$(sort -n ../data_summaries/unique_seqs.txt | sed -n '1p') #lowest sample number
$ max=$(sort -n ../data_summaries/unique_seqs.txt | sed -n '$p') # highest sample number
$ seq "$min" 1 "$max" > ../data_summaries/full_seq.txt # create range
$ difference=$(( $(wc -l ../data_summaries/full_seq.txt | sed 's/[^0-9]*//g') - $(wc -l ../data_summaries/unique_seqs.txt | sed 's/[^0-9]*//g') )) # see how many samples in results
$ echo "$difference" # compared to range of all sample numbers
$ comm -3 ../data_summaries/unique_seqs.txt ../data_summaries/full_seq.txt
# you may see a warning like "comm: file 2 is not in sorted order" - for me the file was sorted, so I ignored it.
# any sample numbers printed here are in the full range but missing from your transfer.
# H) Determine the file size of each read (in MiB).
$ du -a --block-size=1M > ../data_summaries/fsize_raw_reads.txt
$ cd
```
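For step F above, here is a minimal sketch for flagging empty read files. It assumes you are still in `Original_zip` and that the raw reads are gzipped (`*.fastq.gz`); adjust the glob to your layout:
```
# files that are literally zero bytes
$ find . -type f -name '*.fastq.gz' -empty > ../data_summaries/empty_reads.txt
# a gzip file keeps a small header even when it holds no reads,
# so also flag anything whose decompressed output is empty
$ for f in *.fastq.gz; do
    [ -z "$(zcat "$f" | head -c 1)" ] && echo "$f"
  done >> ../data_summaries/empty_reads.txt
$ cat ../data_summaries/empty_reads.txt
```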
### 1d: Alternate download from Dropbox
Dropbox is a thing, and it does not like to let you download the amount of data you put on their service... which exists so you can download the data you put on their service...
This section was run on an old Lenovo ThinkPad T530 running Linux Mint.
```
$ sudo apt update
$ sudo apt upgrade
$ git --version
## if you do not have it then.
$ sudo apt install git-core
## if you do have it skip to here
$ sudo apt install python-pip # if you do not have it
$ git clone git@github.com:dpdornseifer/dropbox_download.git
## if git spits back something like:
# "git@github.com: Permission denied (publickey).
#  fatal: Could not read from remote repository."
# then:
$ ssh-keygen
# do not type in a passphrase, just hit 'enter' on your keyboard
# in the output, note where the public key is saved, e.g.
# /home/steppe/.ssh/id_rsa.pub
$ sudo apt-get install xclip
$ ls -al ~/.ssh
## you should see the key now
$ xclip -sel clip < ~/.ssh/id_rsa.pub
## the public key is now on your clipboard.
## on the GitHub website, add it under Settings > SSH and GPG keys > New SSH key (you may need to make an account). Name it after the computer you are using so you know what it is associated with.
$ cd /home/steppe/.local/lib/python3.6/site-packages
$ git clone git@github.com:dpdornseifer/dropbox_download.git
$ ls
$ pip3 install -r dropbox_download/requirements.txt
$ cd /home/steppe/.local/lib/python3.6/site-packages/dropbox_download
$ python3 dropbox_download.py
```
### 1e: FastQC install & usage
```
sudo apt update
sudo apt upgrade
sudo apt install fastqc # install
$ mkdir /media/sagesteppe/Genomics/Pre_trim_FastQC # to hold output
$ cd /media/sagesteppe/Genomics/Original_zip # fastqs to evaluate
$ fastqc *fastq.gz --outdir=/media/sagesteppe/Genomics/Pre_trim_FastQC -t 3 # to run the process
$ threads=$(nproc --all)
$ threads=$(($threads - 1))
$ echo $threads # to print results.
# hopefully dynamic, using all threads - 1
$ fastqc *fastq.gz --outdir=/media/sagesteppe/Genomics/Pre_trim_FastQC -t $threads
```
## 2: Trimmomatic Installation
Most of my trimmomatic advice came from a blog hosted by the University of Texas at Austin, and John Zhang.
https://wikis.utexas.edu/display/bioiteam/Trimmomatic+-+GVA2020
This is what worked for me; I did a few things they did not.
```$ mkdir ~/Bioinformatics``` mkdir makes a directory, here one level below HOME.
```$ cd ~/Bioinformatics``` cd changes the directory to this newly created folder.
``` ~/Bioinformatics$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip ``` wget will download the archive to the current directory. Note that I have included "~/Bioinformatics$" in the code chunk; this indicates that our current directory is the Bioinformatics folder we just created.
```~/Bioinformatics$ unzip Trimmomatic-0.39.zip``` unzip will extract the newly downloaded file.
```~/Bioinformatics$ rm Trimmomatic-0.39.zip``` rm (remove) will delete the original, now unneeded, zip file.
```export PATH=/home/sagesteppe/Bioinformatics/Trimmomatic-0.39:$PATH ``` add the folder containing the trimmomatic directory to our $PATH for this session.
we can verify that we have successfully added this folder to our path via:
``` echo $PATH```
I won't show the readout from running 'echo' here, but the Trimmomatic directory should now be the first one listed (note the directories are separated by colons).
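If the full $PATH readout is hard to scan, here is a quick way to eyeball just the first few entries, one per line:
```
$ echo $PATH | tr ':' '\n' | head -n 3
```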
```java -jar ~/Bioinformatics/Trimmomatic-0.39/trimmomatic-0.39.jar``` We can do a quick test to see if our install worked.
Now, that is a long command to type every time we want to run Trimmomatic, so we will create a small bash script.
To open up the nano terminal, to write a shell script, run this in the linux terminal:
```nano ~/Bioinformatics/Trimmomatic-0.39/trimmomatic```
You will now have a strange new terminal before you. Don't Panic, but do listen to Widespread.
Place this, or a similar command (i.e. for a different directory), into nano.
--NANO INTERFACE START--
```#!/bin/bash```
```java -jar $HOME/Bioinformatics/Trimmomatic-0.39/trimmomatic-0.39.jar $@```
Note that the #!/bin/bash line (the 'shebang') has to be on its own line. It tells the system that the file you are writing is meant to be processed by Bash.
Now to safely exit nano, on your keyboard:
```ctrl + X``` to close nano
```Y``` to say 'YES' (save the file)
Now this is an important part: nano may ask if you want to change the name of the script you just wrote - do not! You already supplied the name and the directory to save it to when you opened nano.
On your keyboard hit ```ENTER```
--NANO INTERFACE END--
Now we will need to make it so that this script is executable. We can do this as so:
```$ chmod +x ~/Bioinformatics/Trimmomatic-0.39/trimmomatic```
Now you should be able to run Trimmomatic by:
```trimmomatic```
Note: if you close your terminal you will need to re-add the folder to the $PATH. You can add it to the $PATH permanently, but you won't be using Trimmomatic for that long anyway, so I do not advise it.
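If you do want it on the $PATH permanently anyway, here is a minimal sketch; it just appends the same export line to your ~/.bashrc so new terminals pick it up (adjust the path if your install lives elsewhere):
```
$ echo 'export PATH=$HOME/Bioinformatics/Trimmomatic-0.39:$PATH' >> ~/.bashrc
$ source ~/.bashrc # reload it in the current terminal
```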
### 2a: Using Trimmomatic.
Trimmomatic has many options and features. The UTA blog has good info on those (actually this section is virtually identical to their write-up, with our specs). For a one-off read pair you can do something like:
```trimmomatic PE -phred33 <READ1> <READ2> -baseout Trim_Reads/<basename> ILLUMINACLIP:<FASTAadapterFILE>:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:36```
my breakdown of the UTA information:
* 'trimmomatic' = call program via bash script
* 'PE' = Paired end mode
* '-phred33' = how quality scores are embedded
* '<READ1>' = the first read to trim
* '<READ2>' = the other read file
* '-baseout' = prefix for the output files
* 'Trim_Reads' = the new folder to hold them
* '/<basename>' = name of the original file to keep
* 'ILLUMINACLIP' = what we want to clip
* '<FASTAadapterFILE>' = directory and name of the file containing the adapters
* '2:30:10' = allow 2 bp mismatches, require 30bp overlap between reads, and require 10bp of sequence to match
* 'MINLEN:36' = discard reads that are less than 36bp after trimming
http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf All of the options are documented by the developers here. Since I bet you are smarter than me, you should figure 'em out. I am running some modified specs recommended by OG plant Hyb-Seq'er Angela McDonnell.
Here we go. Make sure that the trimmomatic program is in your path, let's add it again.
```$ export PATH=/home/sagesteppe/Bioinformatics/Trimmomatic-0.39:$PATH```
Let's create a new folder to keep the reads. I'll use an external hard drive (HDD, but I bet you can afford an SSD - if so, use it without hesitation).
```$ mkdir /media/sagesteppe/Genomics/Trimmed_reads```
we will also change the directory to keep our path names somewhat short.
```$ cd /media/sagesteppe/Genomics```
Now let's copy over the TruSeq adaptor file we used to this folder for easy access
```Genomics$ cp ~/Bioinformatics/Trimmomatic-0.39/adapters/TruSeq3-PE.fa . ```
Now we can run some code that will look like this:
```$ trimmomatic PE -phred33 1a_S76_L002_R1_001_paired.fastq/1a_S76_L002_R1_001_paired.fastq 1a_S76_L002_R2_001_paired.fastq/1a_S76_L002_R2_001_paired.fastq -baseout Trimmed_reads/1a_S76.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 SLIDINGWINDOW:5:20 MINLEN:36```
While the paths are longer than I would like them to be (each file is in its own folder!), working from the Genomics directory keeps the paths short enough that the code stays somewhat readable.
If you get back a message which looks like this:
```Input Read Pairs: 1418558 Both Surviving: 1415397 (99.78%) Forward Only Surviving: 3152 (0.22%) Reverse Only Surviving: 0 (0.00%) Dropped: 9 (0.00%) TrimmomaticPE: Completed successfully```
then Trimmomatic is working!
you can also check to see that the files were written via:
``` $ ls Trimmed_reads```
Now if you only have one sample the above approach works. However, I cannot imagine you only have one sample...
**Trimmomatic with multiple samples**
```
$ cd /media/sagesteppe/Genomics/to_trim
/media/sagesteppe/Genomics/to_trim$ source ~/Bioinformatics/auto_trim.sh *.fastq.gz
```
Let's make all of our directory (=folder) names shorter really quick. First we will remove the file extension name from that of the directory.
```
ls -d *.fastq/
# download a new package mmv
$ sudo apt install mmv
/media/sagesteppe/Genomics$ mmv -r "*.fastq" "#1"
# note: -r is mmv's 'rename' mode (rename in place),
# and #1 keeps everything matched by the * (i.e. the name up to .fastq)
```
We can also remove the 'Lane' indicator from Illumina. 'L002' means these samples were run in the second lane, *all* of my samples were run in the 2nd lane so let's drop this.
```/media/sagesteppe/Genomics$ mmv -r '*_L002_*' '#1_#2'```
To save on space we can also abbreviate 'paired' and 'unpaired' to 'P' and 'U'.
```
/media/sagesteppe/Genomics$ mmv -r '*unpaired' '#1U'
/media/sagesteppe/Genomics$ mmv -r '*paired' '#1P'
# remember order of operations matters here
```
Let us also toss the '001'; this is unnecessary for our usage.
```/media/sagesteppe/Genomics$ mmv -r '*_001_*' '#1_#2'```
Most scripts will assume you have all of the reads you need to trim in the same place. This is not the way we receive our data, so I had to move all of my files into the same folder to get Trimmomatic to work. Copying the files will take some time; if you have a full lane of runs back, let this run overnight or before an hour-long meeting or something.
First we will copy our data to a new folder
```
/media/sagesteppe/Genomics$ mkdir to_trim
/media/sagesteppe/Genomics$ find . -name '*.fastq'
```
Work on a little glob statement for the find command to make sure you can find the files you want - those of a fastq type. Once you have determined that it works, run the code below to start copying the files over.
```find /media/sagesteppe/Genomics -name '*.fastq' -exec cp "{}" /media/sagesteppe/Genomics/to_trim \;```
Now we have all of the fastq's in the same folder, note it took me about 2 hours for a single lane to move.
**Note:** the code below should work to then copy all of these materials into a different folder for Trimmomatic and HybPiper.
```
# you need your sample info to determine which samples are your reference library samples. You can glimpse at it in the terminal like so:
$ column -s, -t < /media/sagesteppe/Genomics/Prospective_samples.csv | less -#2 -N -S
# press 'q' to exit the viewer
$ awk 'BEGIN{FS=","} $4 ~ /library/' /media/sagesteppe/Genomics/Prospective_samples.csv | awk -F "\"*,\"*" '{print $5}' > /media/sagesteppe/Genomics/file2.csv # filter on one column, keep the column with the sample names (works)
# within the directory, use this to find all REFERENCE LIBRARY SAMPLES and move to new folder.
$ find . -type f -regex ".*\/[0-9][0-9]?[a-z]?_*-*[a-z].*" -exec mv "{}" /media/sagesteppe/Genomics/reference_library_to_trim \;
```
Now the relevant fastqs are gathered in one folder.
The code below will move the trimmed reads into a new folder and create trim logs so you can evaluate the process (in most instances you want these).
You can use code like this to copy your adapters into the folder with the files to trim. Note the period at the end of the cp command means 'copy to the current directory'.
```
/media/sagesteppe/Genomics $ cp ~/Bioinformatics/Trimmomatic-0.39/adapters/TruSeq3-PE.fa .
mkdir Trim_Reads
mkdir trimLogs
for r1 in *_R1_*
do
  r2=$(echo $r1|sed 's/_R1_/_R2_/')                              # matching reverse read
  name=$(echo $r1|sed 's/_R1_001.fastq//'|sed 's/testify\///')   # sample basename
  echo "java -jar ~/Bioinformatics/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 6 $r1 $r2 -baseout Trim_Reads/$name.fastq ILLUMINACLIP:TruSeq3-PE.fa:4:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:15 MINLEN:30 >& trimLogs/$name.log" >> uta1_trim.sh
done
$ bash uta1_trim.sh
```
To run the code above, copy the for loop into a text editor, edit any paths you may need to, and then paste into a terminal (in the directory with the files, after $). I kept this trimmomatic runner named after the anonymous homie at UTA who put this pup up on their bioinformatics page.
Also, try removing the '.fastq' after -baseout, to avoid tagging that extension onto the output names (see the tweak below).
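For example, the echo line inside the loop could be tweaked like this (same command as above, just with the extension dropped from -baseout):
```
echo "java -jar ~/Bioinformatics/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 6 $r1 $r2 -baseout Trim_Reads/$name ILLUMINACLIP:TruSeq3-PE.fa:4:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:15 MINLEN:30 >& trimLogs/$name.log" >> uta1_trim.sh
```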
Now that we have trimmed each of the files, and saved their logs into two directories, within our current directory, we will remove our original fastq.
```/media/sagesteppe/Genomics/to_trim/$ rm -fv *.fastq```
We will now truncate the file names of the reads themselves (remember, earlier we did the directory names).
```
sudo apt-get install -y mmv
mmv -r '*_L002_*' '#1_#2'
mmv -r '*_001_*' '#1_#2'
```
### 2b: Run a follow-up FastQC
We will run a follow-up FastQC to see how our results have changed after running Trimmomatic.
```
$ mkdir /media/sagesteppe/Genomics/Post_trim_FastQC # to hold output
$ cd /media/sagesteppe/Genomics/to_trim/Trim_Reads # fastqs to evaluate
$ threads=$(nproc --all)
$ threads=$(($threads - 1))
$ echo $threads # to print results.
$ fastqc *.fastq --outdir=/media/sagesteppe/Genomics/Post_trim_FastQC -t $threads
$ cd '/media/sagesteppe/Genomics/Post_trim_FastQC'
$ unzip -q '*.zip'
$ find . -depth -name '*.zip' -exec rm {} \;
$ cd
```
## 3: HybPiper & Installation of Python, R, RStudio
This section covers installing HybPiper and its dependencies, along with R and RStudio for a few helper steps.
### 3a: HybPiper Installation
Note this is almost *identical* to John Zhang's script.
HybPiper will require Python 3; you can check what version you have:
```
$ python3 --version
# or
$ python3 -V
# if you mess up and use a lower case 'v' (verbose mode)
# and see '>>>', you are in the Python interpreter; use
>>> quit()
# to get out.
```
Likewise, you probably already have **pip** and **curl**. Check this via the same process as above.
Note: **pip** is Python's package manager; **curl** is used to request data from the internet from the command line.
```
$ pip -V
$ curl -V
# if you do not have these installed
# run the code below.
$ sudo dpkg --configure -a
$ sudo apt install python3-pip
$ sudo apt install curl
```
Now let's install Git.
```
$ git --version
# if you do not have it...
$ sudo apt install git-core
```
Now let's install and configure Homebrew (Linuxbrew). This takes up to half an hour or so.
```$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"```
If installation worked, when you come back to your terminal you should see some blue arrows and text like:
==> **Next Steps**
there should be two commands for you to run from your terminal, for me they look like this:
```
$ echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"' >> /home/sagesteppe/.profile
$ eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
```
Run them.
This is a good time for an old...
```
$ sudo apt update
$ sudo apt upgrade
```
Let's grab some dependencies for HybPiper. This takes up to an hour or so.
``` $ brew install blast bwa gcc parallel spades samtools ```
Let's ensure we have these
``` $ brew leaves --installed-on-request | xargs -n1 brew desc```
Let's grab a couple more items
```
$ sudo apt-get install exonerate
$ python3.9 -m pip install biopython
```
Note: as usual, Biopython was absurdly sticky to install as of today, December 12th, 2021. The line above works; hopefully it still works on the 13th. Be sure to change to your version of Python as time passes.
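An optional sanity check that the install actually landed; this just prints the Biopython version (run it with whichever python3 you plan to use for HybPiper):
```
$ python3 -c "import Bio; print(Bio.__version__)"
```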
OK, so NOW we are ready to install HybPiper. Remember folks, MOSS MATTERS !!! We are going to install it to a directory which is already on our $PATH.
```
$ echo $PATH # confirm that ~/.local/bin is already on the path (yep)
$ cd /home/sagesteppe/.local/bin
~/.local/bin$ git clone https://github.com/mossmatters/HybPiper.git
~/.local/bin$ cd ~/.local/bin/HybPiper
~/.local/bin/HybPiper$ python3 reads_first.py --check-depend
```
Remember that HybPiper assembles each locus by read mapping, followed by de novo assembly of the reads which have been successfully mapped to that locus. By increasing the number of representative sequences for each locus in the initial mapping step, and in particular by increasing the number of taxa in the target clades, more of our reads will be mapped successfully, allowing more comprehensive coverage of each locus.
HybPiper ships with a target file which has few taxa per locus (6-18); we will use a file with more taxa per locus. We are going to add the Mega353 target file from its developer, Chris Jackson's GitHub [NewTargets](https://github.com/chrisjackson-pellicle/NewTargets). If you want to read about it, check out the [paper in Applications in Plant Sciences](https://bsapubs.onlinelibrary.wiley.com/share/GMMQPE7AQBCVKIIE223I?target=10.1002/aps3.11420).
```
python3.9 -m pip install pandas # dependency
cd ~/.local/bin
~/.local/bin$ git clone https://github.com/chrisjackson-pellicle/NewTargets.git
$ cd ~/.local/bin/NewTargets
~/.local/bin/NewTargets$ unzip mega353.fasta.zip
# cd'ing into the directory like this is not the most
# elegant way to do the last step, but it is how I did it.
```
### 3b: HybPiper Usage
We want to use HybPiper on our reference taxa, i.e. the plant species which we hope to identify in our environmental samples which we could not find existing sequence data for online and sequenced ourselves. We will not put our environmental samples through this pipeline.
We will need to:
* Identify which reads these are.
* Gather *some* systematic information on them.
* Create a Mega353 '[select_file.txt](https://github.com/chrisjackson-pellicle/NewTargets/wiki/filter_mega353.py:-filtering-the-mega353.fasta-target-file)'
Take a look at what taxa we have reads from real quick.
```
$ cd ~/.local/bin/NewTargets
$ column -s, -t < filtering_options.csv | less -#2 -N -S
$ cd
$ cp filtering_options.csv /media/sagesteppe/Genomics/filtering_options.csv
$ cp .local/bin/NewTargets/select_file_example.txt /media/sagesteppe/Genomics/select_file_example_cp.txt # remove later
```
Type 'q' to quit the viewer. I think the easiest option is to join the taxonomic information for my reference samples to either the Order or the Family column in this file.
My notes on my extractions are in an Excel spreadsheet. While we could read that directly in Python, let's convert it to a CSV.
```
$ cd /media/sagesteppe/Genomics
$ ls *.xlsx # to check file name
$ soffice --headless --convert-to csv Prospective_samples.xlsx
$ rm Prospective_samples.xlsx
```
Let's also install R real quick; it comes in handy in a few places where Python falls short. I will try to keep its usage to a minimum in this workflow, but it pops up from time to time. This takes about 30 minutes total, but you need to keep entering the commands.
```
$ sudo apt update -qq
$ sudo apt install --no-install-recommends software-properties-common dirmngr
$ sudo wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
$ sudo apt install --no-install-recommends r-base
$ sudo usermod -a -G staff sagesteppe
$ sudo apt-get update -y
$ sudo apt-get install -y libssl-dev
$ sudo apt-get install libcurl4-openssl-dev
$ sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
$ sudo apt install --no-install-recommends r-cran-rstan
$ sudo apt-get install libtiff-dev libjpeg-dev libpng-dev
$ sudo apt-get install libblas-dev liblapack-dev
```
Now restart your session; you can do this from the terminal with 'sudo reboot'.
You can also install the RStudio GUI if you want. We go through a fairly heavily-troubleshot install here.
```
$ sudo add-apt-repository universe
$ sudo apt-get install gdebi-core
$ sudo apt install dpkg-sig
$ cd /home/sagesteppe/Downloads
~/Downloads$ gpg --keyserver keyserver.ubuntu.com --recv-keys 3F32EE77E331692F
#$ wget --no-check-certificate https://www.rstudio.com#/products/rstudio/download/#download
#$ curl -k -O -L https://www.rstudio.com/products/rstudio/download/#download" "rstudio-2021.09.1-372-amd64.deb"
# currently: download the .deb by hand :-(
~/Downloads$ dpkg-sig --verify rstudio-2021.09.1-372-amd64.deb
~/Downloads$ ls *.deb
~/Downloads$ sudo gdebi ./rstudio-2021.09.1-372-amd64.deb
~/Downloads$ rstudio # ensure installation works.
> quit() # (in gui)
~/Downloads$ rm rstudio-2021.09.1-372-amd64.deb
~/Downloads$ cd
```
To gather and append taxonomic information to your samples, the following can be done:
```
$ R
> install.packages("taxize") # note 'Y', 'Y' for the two options.
> library(taxize)
> setwd("/media/sagesteppe/Genomics")
> f = read.csv("Prospective_samples.csv")
# > head(f)
> f = f[c('Voucher_specimen','Number','Sample', 'type', 'vials')]
> f = subset(f, type == 'library')
> f = f[c('Number','Sample', 'vials')]
> f$Sample = gsub("_", " ", f$Sample)
> # note NCBI subsumes L. bakeri into synonymy with L. sericeus
> f[f == 'Lupinus bakeri'] = 'Lupinus sericeus'
> f[f == 'Erigeron elatior'] = 'Erigeron grandiflorus'
> # similar to above
> taxa = f$Sample
> ENTREZ_KEY = '1dddef4ac326bf8b74fa69ae430411218f09'
> results = classification(taxa, db = 'ncbi')
> family = lapply(results, subset, rank == "family")
> family = do.call(rbind, family)
> family = cbind(rownames(family), data.frame(family, row.names=NULL))
> family = family[,1:2]
> colnames(family) = c('Sample', 'Family')
# repeat for order.
> order = lapply(results, subset, rank == "order")
> order = do.call(rbind, order)
> order = cbind(rownames(order), data.frame(order, row.names=NULL))
> order = order[,1:2]
> colnames(order) = c('Sample', 'Order')
# combine
> taxonomy = merge(family, order, by = "Sample")
> f = merge(f, taxonomy, by = "Sample")
> f[f == 'Lupinus sericeus'] = 'Lupinus bakeri'
> f[f == 'Erigeron grandiflorus'] = 'Erigeron elatior'
> rm(taxonomy, order, family, taxa)
> write.csv(f, "Taxonomy_ref_lib_samples.csv", row.names = FALSE)
> quit()
```
We can write the files to filter the M353 targets like so:
```
$ python3
import os
import pandas as pd
import numpy
taxonomy_ref = pd.read_csv(r'/media/sagesteppe/Genomics/Taxonomy_ref_lib_samples.csv')
targets = pd.read_csv(r'/media/sagesteppe/Genomics/filtering_options.csv')
df = taxonomy_ref.merge(targets, on= 'Order', how= 'left').drop(columns=['Family_y']).rename(columns={"Family_x": "Family"})
df = df.drop_duplicates(subset=['Target_name'])
orders = df['Order'].values.tolist()
orders = list(sorted(set(orders)))
df['Filepath'] = '/media/sagesteppe/Genomics/to_pipe/'
df["Filepath"] = df["Filepath"] + df["Order"] + '/'
df['Filename'] = df["Order"] + '_filter'
df = df[["Family", "Order", "Target_name", "Group", "Species", "Filepath", 'Filename']]
def m353_select(df):
    """ Write a text file for use with M353 that includes information
    on phylogenetically appropriate samples for read mapping. Useful for
    mapping reads from plant species across multiple orders, such
    as in meta-barcoding projects.

    Parameters
    ----------
    A single dataframe of all 1kp transcriptomes in M353, with
    your sample information and systematic placements (Order - Species) of taxa:
        order  / Order:       taxonomic rank, sensu APG4.
        fam    / Family:      taxonomic rank, sensu APG4.
        sp     / Species:     taxon with sequences to map, generally from NCBI.
        trgt_n / Target_name: M353 target taxa, using their codes.
        fname  / Filename:    filename of the select file to write.
        Filepath:             directory from root to the folder containing the fastqs.

    Returns
    -------
    One text file per Order, formatted like the M353 example select file,
    for use with each fastq which needs to go through HybPiper.
    """
    orders = df['Order'].values.tolist()
    orders = list(sorted(set(orders)))
    for x in orders:
        sub = str(x)
        df_sub = df[df.Order.eq(sub)]
        order = df_sub['Order'].values.tolist()[0]
        fam = df_sub['Family'].values.tolist()[0]
        trgt_n = df_sub['Target_name'].values.tolist()
        sp = df_sub['Species'].values.tolist()
        Filepath = df_sub['Filepath'].values.tolist()[0]
        fname = df_sub['Filename'].values.tolist()[0]
        fname = Filepath + fname
        filename = "%s.txt" % fname
        with open(filename, "x") as f:   # create the file; fails if it already exists
            f.write("[Target_name]\n")
        with open(filename, "a") as f:   # append one target name per line
            for item in trgt_n:
                f.write("%s\n" % item)
        with open(filename, "a") as f:   # append the remaining sections
            f.write("\n[Group]\n \n[Order]\n%s \n\n[Family]\n%s \n\n[Species]\n%s" % (order, fam, sp))

>>> m353_select(df)
>>> quit()
```
We will now move each set of sequence data into a folder associated with its Order, so that we do not have to make many copies of the M353 target file. This should have been done sooner, but hopefully it gives people more autonomy to fit their needs.
```
$ python3
>>> import os
>>> import pandas as pd
>>> import shutil
>>> taxonomy_ref = pd.read_csv(r'/media/sagesteppe/Genomics/Taxonomy_ref_lib_samples.csv')
files = os.listdir('/media/sagesteppe/Genomics/to_trim/Trim_Reads/')
files = pd.DataFrame(files, columns = ['File_path'])
files["Files"] = files["File_path"]
files['Files'] = (files['Files'].replace('.fastq', '', regex=True).replace('_[0-9][P|U]','', regex=True).replace('_[R][1_2]','', regex=True).replace('_unpaired','', regex=True).replace('_paired','', regex=True).replace('_S[0-9]+', '', regex=True).replace('_', '', regex=True).replace('-', '', regex=True))
mask = ~files['File_path'].str.contains('_fastqc', case=False, na=False) # find fastqc
files = files[mask] # remove fastqc
files = files[files['Files'].str.contains('[A-Za-z]')] # remove non reference library.
files['Files'] = files['Files'].str.lower()
files = files.merge(taxonomy_ref, left_on = 'Files', right_on = 'vials')
parent_dir = "/media/sagesteppe/Genomics/to_pipe"
os.mkdir(parent_dir)
directory = files['Order'].values.tolist()
directory = list(set(directory))
paths = [os.path.join(parent_dir, d) for d in directory] # list of the new paths, duhr
for i in directory:
    npath = os.path.join(parent_dir, i)
    os.mkdir(npath)
c_path = '/media/sagesteppe/Genomics/to_trim/Trim_Reads/'
ord = files['Order'].values.tolist()
cu_paths = files['File_path'].values.tolist()
current_paths = []
for i in cu_paths:
    current_paths.append(os.path.join(c_path, i))
new_paths = []
new_paths = [parent_dir + '/' + a + '/' + b for a, b in zip(ord, cu_paths)]
list(map(shutil.move, current_paths, new_paths)) # pretty sure this is the one.
quit()
```
We can copy the M353 file to each directory as so:
```
$ python3
>>> import os
>>> import shutil
>>> m353_path = ("/home/sagesteppe/.local/bin/NewTargets/mega353.fasta.zip")
>>> dest_path = '/media/sagesteppe/Genomics/to_pipe'
>>> os.chdir(dest_path)
>>> directories = next(os.walk('.'))[1]
>>> new_paths = []
>>> new_paths = [dest_path + '/' + a + '/m353.fasta.zip' for a in directories]
>>> for path in new_paths:
...     shutil.copy(m353_path, path)
>>> quit() # all copied
$ cd '/media/sagesteppe/Genomics/to_pipe'
$ find . -name '*.zip' -exec sh -c 'unzip -d `dirname {}` {}' ';' # extract .zip
$ find . -type f -name '*.zip' -delete # delete .zip
```
And we can filter it like so... Note: I could not make this work programmatically (and did not put much effort in).
```
$ cd '/media/sagesteppe/Genomics/to_pipe/Rosales'
$ python3 /home/sagesteppe/.local/bin/NewTargets/filter_mega353.py mega353.fasta Rosales_filter.txt
# if you are done you can remove the original m353.fasta
$ cd '/media/sagesteppe/Genomics/to_pipe'
$ find . -maxdepth 2 -type f -iname 'mega353.fasta' -exec rm -rf {} \;
```
Here is how it should more or less work programmatically; maybe you need to keep cd'ing into each folder and run the script from there? A question for JZhang....
```
$ cd '/media/sagesteppe/Genomics/to_pipe'
$ filters=($(find ~+ -maxdepth 2 -type f -iname '*filter.txt'))
$ files=($(find ~+ -maxdepth 2 -type f -iname 'mega353.fasta'))
# $ echo "${files[@]}" # to check it worked
# note: pairing by index assumes both find commands return the Orders in the same sequence
$ for i in "${!filters[@]}"; do
    python3 /home/sagesteppe/.local/bin/NewTargets/filter_mega353.py "${files[i]}" "${filters[i]}"
  done
# sanity check the pairing of target files and filter files:
$ for index in ${!filters[*]}; do
    echo "${files[$index]} is paired with ${filters[$index]}"
  done
# one-off fallback, run from inside an Order's folder:
$ python3 /home/sagesteppe/.local/bin/NewTargets/filter_mega353.py mega353.fasta Asterales_filter.txt
```
Now we can run HybPiper, using JZ's code with a little alteration.
```
# first we concatenate the unpaired reads
for file in *_1U*
do
  file2=${file//1U/2U}      # the matching reverse unpaired read
  file3=${file//1U/U}       # combined unpaired output
  cat $file $file2 > $file3
  rm $file $file2
done
# then write one reads_first.py command per sample into reads_first.sh
# (assumes the paired reads are named *_1P* / *_2P*, the combined unpaired file *_U*,
#  and that the filtered target file from NewTargets sits in this folder;
#  adjust the globs if your names differ)
for file in *_1P*
do
  file2=${file//1P/2P}
  file3=${file//1P/U}
  file4=${file%%_1P*}       # sample prefix for HybPiper
  echo python3 ~/.local/bin/HybPiper/reads_first.py -r $file $file2 --unpaired $file3 -b filtered_target_file.fasta --prefix $file4 --bwa >> reads_first.sh
done
# the two loops above live in ../bashful_piper.sh; from each Order's folder,
# generate and then run the per-sample commands:
chmod u+x ../bashful_piper.sh
cd '/media/sagesteppe/Genomics/to_pipe/Myrtales'
bash ../bashful_piper.sh
bash ./reads_first.sh
```
## 4: Kraken
From the Kraken paper (page 2, column 1): "Sequences that have no k-mers in the database are left unclassified by Kraken."
A very thorough review of how to install Kraken is provided below.
https://github.com/DerrickWood/kraken2/wiki/Manual
Steps for prepping a novel Kraken database: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases - we will see if we can supplant NCBI with Kew POW for taxids - I kind of doubt it and do not think it should make much of a difference.
### 4a: Build the Kraken database
Let us download the taxonomy which NCBI uses. We have a copy lying around for R, but I do not want to troubleshoot linking it to Kraken myself, and it is not a very large file anyway. We will also download the complete set of plant genomes and proteins.
```
# $pl_mtg holds the name/path of the Kraken2 database directory,
# e.g. pl_mtg=Eastwood (the same database gets built at the end of this section)
kraken2-build --download-taxonomy --db $pl_mtg
kraken2-build --download-library plant --db $pl_mtg
```
### 4a.1: Prep novel sequence data
We will now add our novel sequence data to the database. First we need to find their NCBI taxid.
```
$ R
> install.packages('taxize')
> species = read.csv('/media/sagesteppe/Genomics/Taxonomy_ref_lib_samples.csv')
> species_list = species[,1] # the column holding the species names
> species_list = unique(species_list)
> species_ncbi = taxize::classification(species_list, db = 'ncbi')
> results = lapply(species_ncbi, function(x) x[nrow(x), c(3)]) # taxid of the terminal rank
> results = t(rbind(unlist(results)))
> results = cbind(rownames(results), data.frame(results, row.names = NULL))
> names(results) = c('Sample', 'NCBI_code')
> species = merge(species, results, by = 'Sample')
> # write.csv(species, '/media/sagesteppe/Genomics/Taxonomy_ref_lib_samples.csv', row.names = FALSE)
> rm(species_list, species, results)
> quit()
```
Now we append the taxid to the top line of each file before the first whitespace.
```sequence16|kraken:taxid|32630 Adapter sequence # example```
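A minimal sketch of one way to do that tagging in bulk. It assumes each reference assembly sits in genomes/ named <Sample>.fa, and that Taxonomy_ref_lib_samples.csv has the sample name in its first column and the NCBI taxid (NCBI_code) in its last column; adjust the paths and column numbers to your files. It also tags every header in a file rather than only the top line, which is the safer option when a fasta holds more than one sequence:
```
# rewrite each header:  >seq1  becomes  >seq1|kraken:taxid|<taxid>
while IFS=, read -r sample taxid; do
  f="genomes/${sample}.fa" # hypothetical naming scheme
  [ -f "$f" ] || continue
  sed -i -E "s/^>([^ ]+)/>\1|kraken:taxid|${taxid}/" "$f"
done < <(awk -F, 'NR > 1 {print $1","$NF}' /media/sagesteppe/Genomics/Taxonomy_ref_lib_samples.csv)
```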
Now we can add our genomes to the Kraken database! :-)
``` find genomes/ -name '*.fa' -print0 | xargs -0 -I{} -n1 kraken2-build --add-to-library {} --db $pl_mtg```
### 4a.2: Supplement the database with ToL data
OK now we will see what we have, and what else we want from Kew.
While the repository for the ToL sequence data is the European Nucleotide Archive, it is much easier to access these data via NCBI SRA. It seems easiest to manually navigate to the [SRA RUN Selector](https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=10&WebEnv=MCID_61d4719c8f8e1f11220da754&o=acc_s%3Aa) and manually download the 'Total' 'Metadata ' csv file.
We can open this file in python and compare it to the target geographically filtered material for our study area. We will also focus on flowering plants less Poales.
We will use a table which can look up clades from genera, from "[Three keys to the radiation of angiosperms into freezing environments](https://datadryad.org/stash/dataset/doi:10.5061/dryad.63q27)" (Zanne et al. 2014, *Nature* 506: 89-92). I hoped to use taxizedb here, but it would not install on Ubuntu, and since Scott left rOpenSci I think it has fallen out of maintenance.
```
$ python3
>>> import pandas as pd
sra_table = pd.read_csv("/media/sagesteppe/Genomics/data_summaries/SraRunTable.txt")
sra_table = sra_table.iloc[:, [0,1,2,12,22,33,34,35,36,37,38,39,40,41]]
sra_table = sra_table[~(sra_table.Organism == 'unidentified')]
sra_table = sra_table[~sra_table.Organism.str.contains(" sp\\.")]
sra_table[['Genus','Species']] = sra_table['Organism'].str.split(' ', 1, expand=True)
lookup = pd.read_csv("~/Bioinformatics/files/Spermatophyta_Genera.csv")
lookup = lookup.iloc[:,[0,1,2]]
lookup = lookup.rename(columns={lookup.columns[0]: "Genus"})
sra_table = sra_table.merge(lookup, on = 'Genus', how = 'left')
site_target_taxa = pd.read_csv("/hdd/MS_SDM_RMBL/SDMS_RMBL/data/processed/taxa_predicted_rmbl.csv")
site_target_taxa[['Genus','Epithet']] = site_target_taxa['name'].str.split('_', 1, expand=True)
# alternate route
df2 = pd.DataFrame(site_target_taxa.name.str.split(pat = '_').tolist(), columns="Genus Epithet".split())
genera = df2['Genus'].tolist()
#original route
genera = site_target_taxa['Genus'].tolist()
genus_filter = sra_table['Genus'].isin(genera)
sra_sub = sra_table[genus_filter]
print(sra_sub.to_string())
sra_requests = sra_sub.iloc[:,0]
sra_requests.to_csv("~/sequenceData/files/sra_requests_taxonomy.csv", header=False, index = False)
sra_requests_to_scaffold = sra_sub.iloc[:,[0,3,14,15,16,17]]
sra_requests_to_scaffold.to_csv("/media/reed/Genomics/kew_tol/Trim_Reads/sra_to_scaffold.csv", header=False, index = False)
quit()
```
We will now download a program to make downloading data from SRA easier.
[SRA toolkit](https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit) is installed via:
```
$ cd /home/sagesteppe/.local/bin
$ wget --output-document sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
$ tar -vxzf sratoolkit.tar.gz
$ rm sratoolkit.tar.gz
$ ls # note your extracted folder will have a different name if the toolkit version differs, but be sure to keep the '/bin'
$ export PATH=$PATH:$PWD/sratoolkit.2.11.2-ubuntu64/bin
$ cd /home/sagesteppe/.local/bin/sratoolkit.2.11.2-ubuntu64/bin
$ which ./fastq-dump
```
refer to [this page](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration) for configuration in a bash script
```$ ./fastq-dump --stdout SRR390728 | head -n 8```
and compare it to the output shown on the main wiki page; the only difference is that on Linux you may see an additional message printed after the reads.
[Quick Toolkit configuration](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration)
Now you can start to configure the SRA toolkit refer to the link above for the steps.
```vdb-config -i```
Once you have the SRA toolkit installed and configured, you can do the following to 'prefetch' the reads. It took me about 1.5 hours to get 126 samples on a Friday using Northwestern's internet, over ethernet running at roughly 900 Mbps.
```
./prefetch --option-file ~/Bioinformatics/files/sra_requests.csv
```
We can add all of the data we downloaded from ToL to the database here.
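The prefetched runs still need to be converted to fastq (and ultimately to fasta with taxids, as above) before Kraken will take them. A minimal sketch, run from the directory holding the prefetched accession folders, and assuming you want the output next to the other Kew material in a kew_tol folder (both the SRR* glob and the output path are stand-ins for your own layout):
```
$ mkdir -p /media/sagesteppe/Genomics/kew_tol
# convert each prefetched accession into split (paired) fastq files
$ for acc in SRR*; do
    fasterq-dump "$acc" --split-files --threads 3 --outdir /media/sagesteppe/Genomics/kew_tol
  done
```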
Finally we can build the database now!
```kraken2-build --build --db Eastwood --threads 16```
# Unleash the Kraken
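Once the build finishes, classifying a trimmed environmental sample would look something like this; a sketch only, with the database name, read files, and output paths all standing in for your own:
```
$ kraken2 --db Eastwood --threads 3 --paired \
    --report /media/sagesteppe/Genomics/kraken_out/sample1.report \
    --output /media/sagesteppe/Genomics/kraken_out/sample1.kraken \
    sample1_1P.fastq sample1_2P.fastq
```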