---
title: John HybPiper1 Workflow
breaks: false
tags: HybSeq, John
---
# JZ HybPiper Workflow
==**READ ME**==
> This document was created using HackMD and is for those running the pipeline on a personal computer and not Fabronia.
>
> email: johnathanzhang2022@u.northwestern.edu
## ~~1a\. Setting up a virtual environment~~ see 1b.
:::danger
I found using a VM to be unstable, so I will instead be using a dual-boot approach to setting up a Linux environment on my PC
:::
### Download a Virtual Machine (VM)
HybPiper is designed to run on **Linux platforms**, so to run it on a Windows PC you must emulate a Linux environment using a virtual machine. I used :desktop_computer: Oracle VM VirtualBox to do so.
> Another option is to install Linux on a separate hard drive and boot your computer with that.
To download Oracle's virtual machine, visit **[their website](https://www.virtualbox.org/wiki/Downloads)**.
Select *Windows Host* and it should begin downloading.
:::spoiler What is a virtual machine?
+ In short, you are using a computer to emulate another computer. In this case, we are running a different operating system *(a Linux platform)* on a host computer *(a Windows PC)*.
+ Read more **[here](https://www.vmware.com/topics/glossary/content/virtual-machine)**.
:::
### Download Ubuntu
<i class="fa fa-linux"></i> **Ubuntu** is the specific Linux platform I use to run the virutal machine. We'll need to download a disk file :dvd: and "insert" the disk into our VM to set it up.
<i class="fa fa-download" ></i> **[Download it here](https://ubuntu.com/#download)** and select the ++Ubuntu Desktop 20.04 LTS++ version.
### Installing and setting up the virtual machine:
Run the installer for Oracle's VM. If the program doesn't prompt you to create
a new VM the first time you open it, go to the :hammer_and_pick: tools tab on the
left and press *New*.
#### Setting it up
1. ++Name++: You may enter whatever you'd like
2. ++Machine folder++: Where you would like to store the files that represent your virtual machine
3. ++Type++: Linux
4. ++Version++: Ubuntu (64 bit) // *Troubleshooting for this step below*
5. ++Memory Size++: How much RAM you would like to dedicate to the virtual machine. I'd recommend at least 4 GB.
6. ++Hard Disk++: Create a virtual hard disk now
<i class="fa fa-arrow-right"></i> VDI (VirtualBox Disk Image)
<i class="fa fa-arrow-right"></i> Dynamically Allocated
<i class="fa fa-arrow-right"></i> Specificy the location to wherever you'd like
<i class="fa fa-arrow-right"></i> Choose a size // *See below for how to choose size*
At this point you should see the VM you just created in the left tab. Before you press :arrow_right: *Start*, click on :gear: *settings*.
7. Navigate to ++System++ and click on the ++Processor++ tab. This part depends entirely on your PC specs; I'd recommend at least 2-4 processors *(2-4 threads, not cores)*, but do NOT dedicate too many processors to the VM.
8. Navigate to ++Display++. For ++Graphics Controller++, select VBoxVGA. Ignore the warning message.
9. Navigate to ++Storage++. Right click on ++Controller: IDE++ and select optical drive. In the new window that pops up, select ++Add++, navigate to the :dvd: Ubuntu ISO file that we downloaded earlier, add it, and then choose it.
#####
:desktop_computer: **You can now run the virtual machine!**
Follow the first-time setup instructions for Ubuntu. Recommended (default) settings are fine.
### Troubleshooting
:::spoiler Some extra details on setting up the VM
- ++Step 4++: If you do not see *Ubuntu 64 bit version*, you will need to enter your computer's BIOS and enable virtualization. This differs for practically every single computer so only a very general guide can be written on how to do this. **[Here](https://2nwiki.2n.cz/pages/viewpage.action?pageId=75202968)** is a guide for certain computer models. A safer bet would be to google this using your specific computer model.
- ++Step 6++: *Dynamically allocated disks* have a maximum storage capacity (that you set), but the storage it takes up on the host machine will only grow as the disk itself grows. A *fixed disk* means that the disk will be created and occupy that space on the host machine immediately. Its size also cannot be changed. If you'd like to change your dynamically allocated disk size, google it. Make sure to make a backup copy of the disk file before making any changes.
In terms of choosing disk size, 50 GB of Brighamia data ended up needing ~500 GB. Plan ahead and give plenty of leeway.
:::
## 1b\. Setting up Dual-Boot Windows and Linux
> This is harder/more "dangerous" than setting up a virtual machine. Understand how to partition hard drives for dual-booting computers.
### Set up a flash drive with Ubuntu
First you will need a flash drive to house the Ubuntu installer. I followed **[these instructions](https://ubuntu.com/tutorials/create-a-usb-stick-on-windows#1-overview)** to set up the flash drive, which includes installing **[Rufus](https://rufus.ie/en/)**.
### Install and Set Up Ubuntu
I chose to install Ubuntu on a separate hard drive from the one holding my Windows OS, following **[these](https://askubuntu.com/questions/726972/dual-boot-windows-10-and-linux-ubuntu-on-separate-hard-drives)** instructions.
Of note:
* GRUB does not work for me; I have to go into the BIOS to boot into Ubuntu.
* There are also a number of BIOS settings you may or may not have to tweak, discussed in various forums you can find via Google.
## 2\. Setting up HybPiper
Here is the <i class="fa fa-github"></i>**HybPiper Website**: https://github.com/mossmatters/HybPiper. Installation instructions are on the GitHub page, but the specific procedure for my case is below.
### Installing the dependencies
HybPiper uses a collection of tools developed by others that we will need to install. These are the **dependencies**, and we will install them using the terminal.
####
Ubuntu should come with **python** installed.
To check what version of python is installed type the following into the terminal
```bash=
$ python3 -V
```
and it should output a python version like `Python 3.x.x`
:::info
:bulb: In this document, code with line numbers is *typed into the terminal*. If a code block does not have line numbers, it is the *terminal's output*.
:::
> As long as your python version starts with 3, everything should be fine. If it starts with 2, you will need to update python
####
:::spoiler The next thing to install is **pip** and **curl**.
- **pip** is used to install and manage Python packages; **curl** is a command-line tool for downloading files (we'll use it to fetch the Homebrew installer).
:::
```bash=
$ sudo dpkg --configure -a
$ sudo apt install python3-pip
$ sudo apt install curl
```
:::info
:bulb: When multiple lines of code are presented at once in this document, you will need to type each one into terminal individually and press enter before typing the next line.
:::
####
Next is installing :beer: **Homebrew**, which will be used to install the actual tools that HybPiper needs. We also need to install git for this step and future ones.
```bash=1 !
$ sudo apt install git-core
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
```
Once this is done installing, it will ask you to add homebrew to your `$PATH` and do some other things. You should do as it asks.
```bash=1
$ sudo apt-get install build-essential
$ echo 'eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)' >> /home/jz/.profile
$ eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)
```
:::info
:bulb: You should copy the code that *your* terminal outputs, not my code, as user specific code *(like the path to certain files)* is specific to my VM.
:::
####
Using Homebrew, install the dependencies (most of them) for HybPiper:
```bash=
$ brew install bwa
$ brew install samtools
$ brew install spades
$ brew install blast
$ brew install parallel
$ brew install gcc
$ sudo apt-get install exonerate
```
> Some of these might take a bit; do them one by one and have some patience.
The last dependency, which was personally the most frustrating to install, was **Biopython**, because the install depends on which versions of Python are present on your machine at the time. I needed to install it under Python 3.9, even though Ubuntu's default Python version was 3.8 at the time.
```bash=
$ sudo apt-get install python3-biopython
```
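If, like me, you need Biopython under a specific Python version rather than the default, installing it with that interpreter's own pip is one option. This is only a sketch, assuming Python 3.9 is available as `python3.9` and has pip set up; adjust the version to your machine.
```bash=
$ python3.9 -m pip install biopython
$ python3.9 -c "import Bio; print(Bio.__version__)"
```
The second command simply confirms that `Bio` imports under that interpreter.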
### Installing HybPiper
We will use <i class="fa fa-github"></i>**git** (installed earlier) to download HybPiper. If you skipped that step:
```bash=
$ sudo apt install git-core
```
Before you download the HybPiper pipeline, `cd` to wherever you'd like to download it. For example, to put it in a folder on your desktop, open a terminal and type `cd Desktop`.
Type the following and it will download the HybPiper contents into whichever folder your terminal is in:
```bash=
$ git clone https://github.com/mossmatters/HybPiper.git
```
To check if all of the dependencies installed correctly, cd to the HybPiper folder you created and type the following:
```bash=
$ python3 reads_first.py --check-depend
```
If the output at the bottom contains `Everything looks good!` then everything should be set up and ready to go.
---
### Troubleshooting
If one of the steps along the way has failed, you will get the error message:
:::danger
ERROR: One or more dependencies not found!
:::
#### See the Troubleshooting section at the end of this document for tips on:
- Installing Biopython
## 3\. Setting up FortiClient
> Steps 3 and 4 cover how I download your data off of Fabronia. If you can get the data onto your machine some other way, skip these steps.
>
> Norm or someone at CBG will likely give you login information and help with installation. Here is a very broad overview.
:shield: **FortiClient** is the VPN service that the CBG uses to allow people off site to access CBG servers. In this case we are trying to access the **Fabronia** server.
### FortiClient Installation
1. Download the :shield: FortiClient installer on your host OS, **not the virtual machine**. I was unable to download a Linux version anyway
2. Run the installer and launch FortiClient once it has finished installing
3. Click on remote access and enter the credentials that are given to you
:::info
:bulb: Note that your **FortiClient** credentials may differ from your **Fabronia** credentials
:::
## 4\. Setting up Cyberduck
:baby_chick: **Cyberduck** is an app that will let you visualize server contents. The GUI mimics the Windows File Explorer, but it is only available on Windows and macOS.
To use Cyberduck in Ubuntu, we will have to use command line.
### Installing Cyberduck
Visit the **[Cyberduck website](https://duck.sh)** to download the GUI on Windows, or type the following into the terminal to install it on Ubuntu. If the link is broken, the address is <i class="fa fa-download" ></i> duck.sh
```bash=1 !
$ echo -e "deb https://s3.amazonaws.com/repo.deb.cyberduck.io stable main" | sudo tee /etc/apt/sources.list.d/cyberduck.list > /dev/null
```
```bash=2 !
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys FE7097963FEFBE72
$ sudo apt-get update
$ sudo apt-get install duck
```
### Using Cyberduck
I used :baby_chick: **Cyberduck** to download raw sequencing data that was uploaded to Fabronia from the CBG, but it can broadly be used for <i class="fa fa-file-text-o" ></i> file sharing. I was connected to CBG using the :shield: VPN while downloading files.
Here is the terminal usage:
```bash=
$ duck --download sftp://Username@xx.x.x.xx/[Fabronia Path]/ /[Local Machine Path]/
```
- `Username` refers to your Fabronia username
- The `xx.x.x.xx` is the Fabronia IP, which will be given to you if it is needed
- `[Fabronia Path]` is the absolute path of the folder/file you are trying to download on Fabronia
- `[Local Machine Path]` is the absolute path on your VM to where you would like the files to be downloaded
To download my data, I typed
`duck --download sftp://JZhang@xx.x.x.xx/home/JZhang/JZhang_Brighamia/ /home/jz/Desktop/Brighamia_data`
The terminal will then prompt for my *Fabronia* password and then the download will begin.
:::info
:bulb: Installing Cyberduck onto your Windows PC may also be useful, as you can actually see the Fabronia path and files you want to download. There might be a way to easily transfer files from your PC to the VM, but I'm not sure.
:::
## 5\. FastQC Setup and Usage
**FastQC** is an app developed by the Babraham Institute that allows for a quick assessment of read quality. Read more about the app and how to interpret the data [**here**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
### Installing FastQC
To download FastQC, visit their website and find the download link *or* click <i class="fa fa-download" ></i> [**here**](https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc). Download the Ubuntu version and extract the contents into a folder wherever you'd like.
Make sure java is installed by typing `java -version` into terminal. If it is not installed, type `sudo apt install default-jre` into terminal to install it.
Navigate to the fastqc folder where you extracted the contents into, and type the following in terminal
```bash= !
$ chmod 755 fastqc
$ sudo ln -s /[absolute path to the FastQC folder you're in]/fastqc /usr/local/bin/fastqc
```
:::spoiler Some notes
- `chmod` is a command for assigning permissions; `chmod 755` gives full permissions to the owner and read/execute permissions to everyone else
- the second line of code adds fastqc to your PATH, i.e. you don't need to be in the FastQC folder to run the program from the terminal. My absolute path was `/home/jz/Desktop/FastQC/fastqc`
:::
### Using FastQC
If using the GUI is desired, simply type `fastqc` into the terminal. For a more streamlined process, cd into the data folder and type `fastqc` followed by the names of the files.
This will output FastQC information as a html and zip file per input file.
```bash=
$ fastqc file1.fastq.gz file2.fastq.gz
```
To run fastqc on all the data files in your folder, I first created another folder for the fastqc outputs and then specified the output directory. Here I tell FastQC to also use 6 threads to speed things up.
```bash=
$ fastqc *fastq.gz --outdir=/home/jz/Desktop/brighamia_data/fastqc_outputs -t 6
```
:::spoiler Some Notes
- The path to the output folder needs to be the absolute path, e.g. `/home/jz/Desktop/data/fastqc_outputs`
- -t refers to the number of threads you want FastQC to use. It runs 1 thread per file. In the above command, FastQC will use 6 threads.
:::
######
> At some point you will need to unzip all of your .gz files. I do so at this point by navigating to my data folder and typing `gunzip *.gz`. If you'd like to keep the original compressed files, type `gunzip -k *.gz`.
## 6\. Trimmomatic
**Trimmomatic** is a tool that is used to trim the ends of our reads. You can specify what to trim, such as the adapters, leading/trailing base pairs with low quality reads, etc.
Read more on the <i class="fa fa-github"></i> ++[trimmomatic github page](https://github.com/usadellab/Trimmomatic)++.
### Installing Trimmomatic
Installing trimmomatic is as simple as downloading it from ++[their website](http://www.usadellab.org/cms/?page=trimmomatic)++.
Download the ++binary++ version of their most recent release, and extract it to wherever you'd like.
### Using Trimmomatic
The basic use case of trimmomatic is as follows:
```bash= !
$ java -jar /[path to trimmomatic folder]/trimmomatic-0.39.jar PE -phred33 [input file 1] [input file 2] [output file 1] [output file 2] [output file 3] [output file 4] ILLUMINACLIP:/[path to trimmomatic folder]/adapters/[specific adapter]:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:36
```
- The input file names must follow a naming convention:
    - `*_R1_001.fastq` and `*_R2_001.fastq`
    - `*.f.fastq` and `*.r.fastq`
    - `*.1.sequence.txt` and `*.2.sequence.txt`
- The specific adapter to insert in the `[specific adapter]` field depends on what was used in the sequencing lab. In my case we used `TruSeq3-PE.fa`.
To run trimmomatic on an entire folder of reads, I made a shell script with a for loop. While in the data folder, I typed:
```bash= !
> for R1 in *R1*
> do
> R2=${R1//R1.fastq/R2.fastq}
> R1paired=${R1//.fastq/_paired.fastq}
> R1unpaired=${R1//.fastq/_unpaired.fastq}
> R2paired=${R2//.fastq/_paired.fastq}
> R2unpaired=${R2//.fastq/_unpaired.fastq}
> echo java -jar /home/jz/Desktop/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 6 -phred33 $R1 $R2 $R1paired $R1unpaired $R2paired $R2unpaired ILLUMINACLIP:/home/jz/Desktop/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:36 >> brighamia_trimmomatic.sh
> done
```
:::spoiler Arguments
- `-threads 6` tells trimmomatic how many CPU threads to use
- `-phred33` specifies the quality-score encoding used by the reads
- `LEADING:3, TRAILING:3` tells trimmomatic to remove bases from the start/end of a read while their quality is below 3
- `SLIDINGWINDOW:5:20` tells trimmomatic to scan each read with a 5-base sliding window and cut once the average quality within the window drops below 20
- `MINLEN:36` reads must be at least 36 nucleotides long to not be discarded immediately
- `brighamia_trimmomatic.sh` is just the name of the shell script I give
:::
######
Finally, make the shell script executable and then run it.
```bash=
$ chmod a+x brighamia_trimmomatic.sh
$ bash brighamia_trimmomatic.sh
```
:::info
:bulb: This last part is often unnecessary if running on your own PC. However, if you do get permission errors, try chmod'ing the file
:::
## 7\. Deduplicating reads
:::warning
I'm not entirely sure this section should belong here, in terms of the workflow
:::
Due to duplicated reads in my Brighamia fastq files (*NOT referring to paralogs here*), I was having issues downstream and needed to use **BBTools** to delete the duplicates.
### Installing BBTools
First I visited [**BBTools' website**](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/installation-guide/) and <i class="fa fa-download" ></i> downloaded the zip file at the top of the site.
I extracted the folder onto my desktop. This is one of the simpler installations.
### Concatenating unpaired reads
> This step is required regardless of the issue I was having, as HybPiper calls for only one unpaired reads file.
First I navigated to my data folder with all my <i class="fa fa-file-text-o" ></i> fastq files. To combine my \*R1_unpaired and \*R2_unpaired files, I concatenated them using `cat`.
```bash= !
$ cat BRIN-001_R1_unpaired.fastq BRIN-001_R2_unpaired.fastq > BRIN-001_unpaired.fastq
```
I turned this into a for loop to work on all my <i class="fa fa-file-text"></i> fastq files.
```bash= !
> for file in *R1_unpaired*
> do
> file2=${file//R1/R2}
> file3=${file//R1_unpaired/unpaired}
> cat $file $file2 > $file3
> rm $file $file2
> done
```
If you'd like to keep the original files, you can remove the `rm $file $file2` line.
### Deduping my reads
Dedupe is very versatile in the types of duplicates it can remove. My usage of dedupe is as follows:
```bash= !
$ /home/jz/Desktop/bbmap_38.90/dedupe.sh in=BRIN-001_R1_paired.fastq out=BRIN-001_R1_paired_dd.fastq sort=name ascending=t rmn=t ac=f -da -Xmx8g
```
:::spoiler Arguments
- `in` and `out` specify the names of the input file and the desired output name
- `sort=name` sorts the reads in the output by name, with `ascending=t` meaning ascending order
- `rmn=t` stands for RequireMatchingNames = true; a duplicate read must have matching names and sequences
- `ac=f` means that dedupe will not absorb full containments of contigs
- `-da` stands for "disable assertions"; basically, ignore problems
- `-Xmx8g` tells Java how much RAM to use; here it is set to 8 GB
:::
######
I turned this into a shell script to run on only the R1_paired and R2_paired files, not the unpaired files. Note that writing these for loops only works when you are cd'd into your data folder.
```bash= !
> for file in *_paired.fastq
> do
> file2=${file//_paired.fastq/_paired_dd.fastq}
> echo "/home/jz/Desktop/bbmap_38.90/dedupe.sh in=$file out=$file2 sort=name ascending=t rmn=t ac=f -da -Xmx8g" >> dedupeloop.sh
> echo "rm $file" >> dedupeloop.sh
> done
```
To finally dedupe my reads, I type `bash dedupeloop.sh` into terminal.
:::info
:bulb: Remember if you are having permission errors, `chmod a+x dedupeloop.sh` and then run it.
:::
## 8\. HybPiper
> At this point, I have HybPiper installed and the reads prepared for the pipeline. I was also given baits files.
**HybPiper** is a pipeline meant to capture sequences from libraries using baits. The [**HybPiper wiki**](https://github.com/mossmatters/HybPiper/wiki) goes over installation & tutorial.
### Running reads_first.py
The first step in the pipeline is running a python file called **reads first**. This python script does a lot of heavy lifting, including mapping the reads, assembling contigs, and then assigning contigs to the baits. It uses:
- Both paired end reads (R1, R2)
- The concatenated unpaired reads
- Your baits file
:::info
:bulb: A baits file will probably be given to you if you weren't the one who originally sent the samples to be sequenced.
:::
For me to run one sample through reads_first.py, I cd into my data folder and type:
```bash=1 !
$ python /home/jz/Desktop/HybPiper/reads_first.py -r BRIN-001_R1_paired_dd.fastq BRIN-001_R2_paired_dd.fastq --unpaired BRIN-001_unpaired.fastq -b supercontig_baits.fasta --prefix BRIN-001 --bwa
```
:::spoiler Arguments
- `-r` This specifies the paired end reads files
- `--unpaired` This specifies the unpaired read file
- `-b` This specifies the baits file
- `--prefix` reads_first.py creates an output directory (a folder). This will be the name of the folder
- `--bwa` This tells reads_first.py that we are using nucleotide sequences and not amino acid sequences
:::
######
To create a shell script and run it on a whole folder of data, I type
```bash= !
> for file in *R1_paired*
> do
> file2=${file//R1/R2}
> file3=${file//R1_paired_dd.fastq/unpaired.fastq}
> file4=${file//_R1_paired_dd.fastq/}
> echo python /home/jz/Desktop/HybPiper/reads_first.py -r $file $file2 --unpaired $file3 -b supercontig_baits.fasta --prefix $file4 --bwa >> reads_first.sh
> done
```
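As with the other loops in this workflow, the loop above only writes the commands into `reads_first.sh`; the generated script then has to be executed (and, if you hit permission errors, made executable first):
```bash=
$ bash reads_first.sh
```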
This process is very CPU intensive and will take a while. There is also a lot of information your terminal will output as it's running. For every individual, reads_first will create one folder containing the file outputs. These file outputs will include sequences that were successfully mapped and assembled.
:::info
**A note about the baits file I use:** I ended up using the baits file constructed from supercontigs instead of exons. We think the issue is that the bait length is too short when constructed from the exon data. Also, note an interesting observation in the image below.

The original bait we used contained this part of the sequence, the paralogs contained this part of the sequence, but the baits constructed from the paralogs failed to capture this sequence...
This is an issue that needs to be addressed in the manuscript.
:::
---
### Creating a heatmap of read coverage (optional)
> Wiki has good instructions on this
I wanted to create heatmaps to compare read coverage on two runs of HybPiper. For the first run, we used the original baits (don't remember where they came from) to target the Brighamia reads. For the second run we used the outputs of the first run to develop baits more specific to Brighamia. We used these baits (made from supercontigs, not just exons) and reran HybPiper.
The first step is to use the `get_seq_lengths.py` script to get the length of the sequences reads_first recovered.
```bash= !
python /home/jz/Desktop/HybPiper/get_seq_lengths.py brighamia_baits.fasta name_file.txt dna > brighamia_seq_lengths.txt
```
Next we use an R script called `gene_recovery_heatmap.R` to generate the heatmap in RStudio. This requires a *now archived* package called heatmap.plus, which you must install by downloading the zip file **[here](https://cran.r-project.org/src/contrib/Archive/heatmap.plus/)**. To install the package, navigate to the Packages tab in RStudio and click Install; choose "Install from Package Archive File" and select the package from wherever you downloaded it.
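If you prefer the terminal over the RStudio Packages tab, the archived package can also be installed straight from the downloaded tarball. A sketch, assuming the file was saved to `~/Downloads` (the file name and version here are examples; use whatever you downloaded):
```bash=
Rscript -e 'install.packages("~/Downloads/heatmap.plus_1.3.tar.gz", repos = NULL, type = "source")'
```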
The only line of code that needs to be changed in `gene_recovery_heatmap.R` is line 6, where in quotation marks, you enter the absolute path to the sequence lengths file we just generated.
```bash=6 !
sample.filename = "/home/jz/Desktop/brighamia_scbaits/brigSC_seq_lengths.txt"
```
Source the code and it will generate the heatmap for you.
### Intronerate
Intron (and exon) sequences can be extracted using the provided script `intronerate.py`. This is a script run for each sample, so a for loop is written to run them on all samples.
```bash= !
for i in */
do echo "python ~/Tools/HybPiper/intronerate.py --prefix $i " >> intronscript.sh
done
```
Then run the shell script by typing `bash intronscript.sh`.
## Retrieve Sequences
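The retrieval step gathers the sequences that `reads_first.py` recovered for every sample into one unaligned FASTA (`.FNA`) file per gene; those are the files combined in the next section. A sketch based on the HybPiper wiki's `retrieve_sequences.py` usage, run from the directory that contains all of the sample output folders (the baits file is the same one used for `reads_first.py`, and `dna` requests nucleotide sequences):
```bash=
python /home/jz/Desktop/HybPiper/retrieve_sequences.py supercontig_baits.fasta . dna
```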
## Combine .FNA files with other projects
If combining data from a new run, you can concatenate (retrieved) .FNA files with identical file names. While in a directory with one set of the unaligned .FNA files:
```bash= !
for i in *.FNA; do cat ${i} ../secondDirectoryWithFNAfiles/${i} > ../newDirectoryForFNAfiles/${i}; done
```
---
## HybPhaser
HybPhaser is an extension to HybPiper that aids in detecting hybrids; its GitHub repository can be found **[here](https://github.com/larsnauheimer/HybPhaser)**. Well-written instructions can be found on the GitHub page.
### Install Dependencies
HybPhaser requires some other dependencies to be installed:
- **[BCFtools](https://samtools.github.io/bcftools/bcftools.html)** is a suite of tools designed to extract information from VCFs. To install it:
``` bash= !
sudo apt-get install bcftools
```
######
### 1. SNP Assessment
As you run HybPhaser you will need to fill out corresponding sections of the _config.txt_ file.
#### ++1_generate_consensus_sequences.sh++
The first step is Consensus Sequence Generation using the `1_generate_consensus_sequences.sh` shell script.
```bash= !
bash /home/jz/Desktop/HybPhaser/1_generate_consensus_sequences.sh -n name_file.txt -o /home/jz/Desktop/brighamia_sc_baits -t 12
```
I run HybPhaser only on the HybPiper products produced using the supercontig baits. This creates the output folder **01_data** with the consensus sequences in it.
#### ++1a_count_snps.R++
The R script 1a_count_snps.R is used to generate a table with the proportion of SNPs in each locus and sample, as well as a table with the recovered sequence lengths.
The first step is to locate a file called `config.txt` in the HybPhaser folder. Open it and enter the necessary information in the `# General Settings` section.
:::spoiler Details below
- `path_to_output_folder:` set path for HybPhaser output folder ("/path/to/hybphaser_output_folder/")
- `fasta_file_with_targets`: direct to the file with target sequences (baits) ("/path/to/target_file.fasta")
- `target_file_format`: set whether target file is in DNA ("DNA") or AA ("AA") format
- `path_to_namelist`: direct to a file that contains all sample names ("/path/to/namelist.txt")
- `intronerated_contig`: select whether HybPiper was run with intronerate to fetch the intronerated contigs ("yes"), or whether normal contigs should be used ("no")
- Here are the `# General Settings` I used:
```
path_to_output_folder = "/home/jz/Desktop/brighamia_scbaits/HybPhaserOutputs"
fasta_file_with_targets = "/home/jz/Desktop/brighamia_scbaits/supercontig_baits.fasta"
target_file_format = "DNA"
path_to_namelist = "/home/jz/Desktop/brighamia_scbaits/name_file2.txt"
intronerated_contig = "no"
```
:::
######
Before we run the R script, we need to make sure the required R packages are installed. Start an R session in the terminal by simply typing `R`, install the packages, then exit the session (I don't think it's necessary to save the workspace):
```r=
install.packages("ape")
install.packages("seqinr")
install.packages("stringr")
q()
```
In the normal terminal window, you can now run the R script.
``` bash= !
Rscript 1a_count_snps.R
```
This will produce the output folder 00_R_objects, containing consensus sequence lengths and SNP proportions for samples and loci.
#### ++1b_assess_dataset.R++
This R script generates tables and graphs to assess the dataset with regard to recovered sequence length and the proportion of SNPs per sample and gene. Details on the parameters set in config.txt are below. To run the script, type
```bash= !
Rscript 1b_assess_dataset.R
```
:::spoiler Details below
- `name_for_dataset_optimization_subset = ""`
These are the values that _Nauheimer et al._ use in their paper:
- `remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.2`
- `remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.6`
- `remove_loci_with_less_than_this_propotion_of_samples_recovered = 0.2`
- `remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.75`
For the paralogs:
- `remove_loci_for_all_samples_with_more_than_this_mean_proportion_of_SNPs = "outliers"`
- `file_with_putative_paralogs_to_remove_for_all_samples = ""`
- `remove_outlier_loci_for_each_sample = "no"`
:::
These values should be carefully evaluated per dataset.
#### ++1c_generate_sequence_lists.R++
This R script generates a list of sequences for every locus and sample. These files are used for downstream analyses.
The script itself produced quite a few occurrences of the same error. The Brighamia data did not have any loci that completely failed (I think I removed them earlier on due to poor coverage and issues with some of the scripts). This R script expects an R object called `failed_loci` to exist; I'm unsure where/when in the script this object is supposed to be created, but it did not exist for the Brighamia data. An example of how I fixed this can be found in Troubleshooting.
Running the actual script is again as easy as
```bash=
Rscript 1c_generate_sequence_lists.R
```
### Finding alleles in our data
We deviate from HybPhaser here. We look for alleles in our loci_consensus_sequences by finding the individuals whose files contain the fewest ambiguity codes. We set the step 1b parameter `remove_loci_with_less_than_this_propotion_of_target_sequence_length_recovered` to `0.80`. Then we use MAFFT to align the loci consensus sequences within the files and trimAl to remove the gappy regions.
```bash=
# Running MAFFT on an individual sample:
mafft --auto 2000468_BI005_consensus.fasta > 2000468_BI005_consensus_aligned.fasta
# Creating a loop to run MAFFT on all files in the folder
for file in *
do
file2=${file//.fasta/_aligned.fasta}
# I created a folder called loci_consensus_aligned to house all the aligned files
echo "mafft --auto $file > $file2" >> mafft_loop.sh
echo "mv $file2 loci_consensus_aligned" >> mafft_loop.sh
done
# Running trimAl on an individual sample:
trimal -in 2001114_BI146_consensus_aligned.fasta -out 2001114_BI146_trimmed.fasta -gt 0.85 -st 0.001 -cons 35
# Loop trimAl through the folder of files
for file in *
do
file2=${file//_consensus_aligned.fasta/_consensus_trimmed.fasta}
echo "trimal -in $file -out $file2 -gt 0.85 -st 0.001 -cons 10" >> trimAl_loop.sh
echo "mv $file2 loci_consensus_trimmed" >> trimAl_loop.sh
done
```
**Changed trimAl to `-gt 0.85` and the HybPhaser threshold to 0.8.**
Final parameters used for this reference step:
- HybPhaser: `remove_samples_with_less_than_this_propotion_of_loci_recovered = 0.99` and `remove_samples_with_less_than_this_propotion_of_target_sequence_length_recovered = 0.8`
- trimAl: `-gt 0.85 -cons 10`
### New approach to finding alleles
Maybe look at sequences with many ambiguity codes, assume they represent heterozygosity, and split those sequences into two based on the ambiguity codes?
### Reworking HybPhaser 2
Since HybPhaser 2 makes reference files using an entire individual, we want to reorganize the data so that it is split by gene (a bash sketch of this reorganization follows the structures below).
#### Original Data Structure
/01_data/
- Individual_1/
- Individual_2/
    - reads/
    - Gene_1_combined.fasta
    - Gene_2_combined.fasta
#### New Data Structure
/04_something/
- Gene_1/
- Gene_2/
    - Individual_1_Gene_2_combined.fasta
    - Individual_2_Gene_2_combined.fasta
    - etc...
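A minimal bash sketch of that reorganization, assuming the folder and file names follow the patterns above (the output folder name `04_data_by_gene` and the source layout are only illustrative):
```bash= !
# Reorganize per-individual consensus files into per-gene folders
src=01_data
dst=04_data_by_gene
mkdir -p "$dst"
for ind in "$src"/*/; do
    ind_name=$(basename "$ind")
    for f in "$ind"Gene_*_combined.fasta; do
        [ -e "$f" ] || continue                   # skip individuals with no matching files
        gene=$(basename "$f" _combined.fasta)     # e.g. Gene_1
        mkdir -p "$dst/$gene"
        cp "$f" "$dst/$gene/${ind_name}_${gene}_combined.fasta"
    done
done
```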
## Putative Paralog Detection
Install
```bash= !
git clone https://github.com/Bean061/putative_paralog
sudo apt-get update
sudo apt install default-jre #apparently Java doesn't come with Ubuntu?
###picard install###
#download .jar off of http://broadinstitute.github.io/picard/
#make it an environmental variable
export PICARD=/home/jz/Tools/picard/picard.jar
###gatk install###
# download gatk from github and extract anywhere
# export gatk
export PATH="~/Tools/gatk/:$PATH"
###mafft install###
sudo apt-get install mafft
###trimAl install###
# download trimal and extract anywhere
cd source/
make
export PATH=/home/jz/Tools/trimal-trimAl/source:$PATH
```
Run
```bash= !
bash Step1.sh /home/jz/Tools/brighamia_scbaits/ /home/jz/Tools/brighamia_scbaits/name_file.txt
bash Step2.sh 100 4 /home/jz/Data/ppd_test/supercontig /home/jz/Data/backup_data/ /home/jz/Data/ppd_test/step2/ /home/jz/Data/ppd_test/name_file.txt #If the designated output folder doesn't already exist (`/step2/` or whatever you name it), the output will be put in the folder where Step2.sh is located
python PPD.py -ifa ~/Data/ppd_test/step2/ -ina ~/Data/name_file.txt -iref ~/Data/ppd_test/supercontig_baits.fasta -io ~/Data/ppd_test/outgroup.txt -o ~/Data/ppd_test/
```
Added a number of changes to `Step2.sh`
```bash
#change these to match my file naming scheme
33 | read1fq=$4/${prefix}_R1_paired_dd.fastq.gz
34 | read2fq=$4/${prefix}_R2_paired_dd.fastq.gz
#folders were not being created in the correct directory
79 | cd $5
#call picard correctly
90 | java -jar $PICARD #and 4 other times
```
Added a number of changes to `PPD.py` (add ~2 lines for PPD1.02)
```bash
### issue with opening file under `def input_namelist`
~29 | f = open(namelist_file, 'r') #removed deprecated 'rU'
### file names did not match; added + ".degenerated.fasta"
~64 | input_name = seq_directory + str(file2) + ".degenerated.fasta"
### issue with differences in naming scheme of samples/gene under `def sort_to_genes`
~74 | sequence_name = "-".join(test1.description.split(" ")[1].split("-")[0:2])
~75 | file_name = "Gene"+test1.description.split(" ")[1].split("-")[2].split(":")[0]
### prevent a divide-by-zero error (probably the result of low coverage)
~128 | if sequence.id in namelist_path and len(sequence) > 0:
### multithread mafft + fix sed ''
```
Note that this pipeline asks for an outgroup to be included. The Brighamia data I am using comprises only one species, so the outgroup was arbitrarily assigned to BRIN-001; I'm not sure whether this has an impact downstream.
## Paralog Wizard
#### Download pipeline from **[their github](https://github.com/rufimov/ParalogWizard)**.
``` bash
git clone https://github.com/rufimov/ParalogWizard
```
#### Install Dependencies
- **Python 3.6+** should already be installed
- **Biopython 1.77+** should already be installed
- **BLAST command line tools** Instructions can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK52640/)
- Remember to export path variables
- **NumPy** should already be installed
- **Scikit-learn** Installation instructions found [here](https://scikit-learn.org/stable/install.html)
- Make sure to install it under whichever Python version is your main version (I set python to python3.8 and python3 to python3.9, my main is python3.7); see the pip sketch after this list
- **SciPy** Should be installed with NumPy
- **Matplotlib** Installation instructions found [here](https://matplotlib.org/stable/)
- **BLAT**: download all of the files from [here](http://hgdownload.cse.ucsc.edu/admin/exe/). If the link isn't working, the address is hgdownload.cse.ucsc.edu/admin/exe/
- **SPAdes** should be installed
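For the pip-based dependencies, here is a hedged sketch of installing them under a specific interpreter (the interpreter version is only an example; the package names are the standard PyPI names):
```bash=
python3.9 -m pip install scikit-learn matplotlib
python3.9 -c "import sklearn, matplotlib; print(sklearn.__version__, matplotlib.__version__)"
```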
#### Running Paralog Wizard
Paralog Wizard actually starts with the raw sequencing reads, and performs its own assembly derived from HybPiper. Folder organization is described on the github. The main script for running Paralog Wizard is `ParalogWizard.py`, and it is used for multiple steps.
:::info
Paralog Wizard actually requires **2 probe sets**
:::
The first step is the assembly of raw sequencing reads; the equivalent of running HybPiper
```bash=
python ParalogWizard.py cast_assemble -d test_dataset -pr kew_probes_Malus_exons_concat.fasta
```
Step 2
```bash=
python ParalogWizard.py cast_retrieve -d test_dataset -pe kew_probes_Malus_exons.fasta -c -nc 8
```
Note that the probes used in step 2 are the split exons, not the concatenated ones. Also, `-c` can apparently only be used once? I'm not sure what this means.
Step 3
```bash=
python ParalogWizard.py cast_analyze -d test_dataset -nc 8
```
Step 4
```bash=
python ParalogWizard.py cast_detect -d test_dataset -pe kew_probes_Malus_exons.fasta -p -mi 4 -ma 10
```
Step 5
```bash=
python ParalogWizard.py cast_separate -d test_dataset -pc ./test_dataset/41detected_par/refined_customized_reference_div_5.04_13.98.fas -i 80
```
- This script doesn't work.
---
## Troubleshooting
> In this section, errors that are given will not be verbose (i.e., the errors here will be summarized versions, and you need to dig through terminal output to find the error that matches)
### Issues with setting up the virtual machine
#### Ubuntu 64 bit not available
- ++Step 4++: If you do not see *Ubuntu 64 bit version*, you will need to enter your computer's BIOS and enable virtualization. This differs for practically every single computer so only a very general guide can be written on how to do this. **[Here](https://2nwiki.2n.cz/pages/viewpage.action?pageId=75202968)** is a guide for certain computer models. A safer bet would be to google this using your specific computer model.
#### How much storage should I allocate to the VM?
- ++Step 6++: *Dynamically allocated disks* have a maximum storage capacity (that you set), but the storage it takes up on the host machine will only grow as the disk itself grows. A *fixed disk* means that the disk will be created and occupy that space on the host machine immediately. Its size also cannot be changed. If you'd like to change your dynamically allocated disk size, google it. Make sure to make a backup copy of the disk file before making any changes.
- In terms of choosing disk size, 50 GB of Brighamia data ended up needing 500+ GB. Plan ahead and give plenty of leeway.
### Issues with running HybPiper
#### reads_first.py
:::danger
**Error: No Module named Bio**
- At the root of it, this error means that the reads_first.py script was unable to locate the Biopython package (imported as `Bio`). This could be for a number of reasons:
- It is not installed. This won't be true if you ran `reads_first.py --check-depend` as in the earlier parts of this workflow.
- The more likely issue is that it is installed, but reads_first.py can't find it. A common cause is that there are multiple versions of Python installed on your computer/server, and while Biopython is installed under one version, reads_first.py is using a different one.
- Try typing `python -V` into terminal, and then type `python3 -V`. If your terminal outputs 2 different python versions, this may be the issue.
- If the above is true, type `python` into the terminal, and your terminal will become a Python interpreter, denoted by `>>>`. Type `import Bio` and press enter. If nothing happens, then *this* Python version is where Biopython is installed. If the terminal outputs an error along the lines of `Error: Module Not Found`, then Biopython is probably installed under the other version of Python. At any time, type `quit()` to exit the Python environment. If Biopython isn't installed under this version, try typing `python3` in the terminal to open another Python environment, this time using the other version. If `import Bio` still gives `Error: Module Not Found`, then Biopython might not be properly installed.
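A quick way to check both interpreters at once (a sketch; adjust the interpreter names to whatever is installed on your machine):
```bash=
python -c "import Bio; print(Bio.__file__)" || echo "no Biopython under python"
python3 -c "import Bio; print(Bio.__file__)" || echo "no Biopython under python3"
```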
:::
:::danger
**time: cannot run python**: At some point I was getting the error `time: cannot run python: no directory or file found` in one of the first steps of reads_first. I think this was another issue related to having multiple Python versions installed. I (as a user) called the reads_first script via `python3 reads_first.py ...`, but within the scripts Python is called via `python ...`. Aliasing `python="python3"` did not work. Setting up Python alternatives (below) solved the issue; the key step is registering them under the name `python` rather than `python3`.
```bash= !
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 2
sudo update-alternatives --config python
```
:::
---
### Issues with running HybPhaser
To install R using conda on Fabronia, try following **[these](https://datascience.stackexchange.com/questions/77335/anconda-r-version-how-to-upgrade-to-4-0-and-later)** instructions.
::: danger
**Name File Issues:** my name_file.txt file had incorrect newline characters; to replace them (in vim):
`vi -b filename`
`:%s/\r$`
`:x`
:::
:::danger
**Issue with 1_generate_consensus_sequence**: I think if `reads_first.py` doesn't generate a contig for any bait in an individual, `1_generate_consensus_sequences` is unable to copy over files (it interprets the lack of a file as an error) and skips to the next individual. This (1) potentially leads to lost data and (2) causes downstream problems when future scripts expect folders/files to be there based on the original namelist. Current solution: update the namelist by removing individuals that have no contigs for a bait.
:::
:::danger
**1b_assess_dataset.R**: Line 370 `write(samples[which(!(samples %in% outsamples_missing))], file=file.path(output_asses, "0_namelist_included_samples.txt"))` gave an error; I commented out the line. Unsure how to fix.
:::
:::danger
**1c_generate_sequence_lists.R**: Part of the scripts expect R objects to exist (like failed_samples, failed_loci), but I couldn't find where they are supposed to be created and these two did not exist for me. This is an issue when the script calls on these objects like such:
```r= !
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
```
This line returns the error `object 'failed_loci' not found`. I'm not sure if my fix still leads to the desired outcome of the script, but I added a step to check for the existence of the object. If it doesn't exist, an empty vector is created for it:
```r= !
if(exists("failed_loci")) {
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
} else {
loci_to_remove <- c()
failed_loci <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
```
This solution was applied everywhere the R script expected the existence of an object that did not exist for me (specifically the objects `failed_samples` and `failed_loci`)
:::
---
## Log
```csvpreview {header="true", escapeChar='\'}
Date (mm/dd/yy), Entry
June-July 2021, Set up VM; installed HybPiper & dependencies; downloaded Brighamia data
07/15/21, Began running reads_first using baits file constructed from supercontigs
07/16/21, Finished reads_first script ; ran summary statistics on reads_first outputs
10/08/21, The VM has been scrapped- lost all data due to crashes. Now I'm running Ubuntu through a dual-boot method on my PC. (1) Redownloading data (2) Setting up HybPiper
10/10/21, Set up HybPiper and redownloaded data. Will run reads_first using two different baits: supercontigs + original brighamia baits
10/11/21, Successfully ran HybPiper using supercontig baits file, Need to find original (brighamia) baits file
10/12/21, Successfully ran HybPiper using original baits file
10/20/21, Created heatmaps of recovery percentage for both runs of reads_first
11/10/21, Installed bcftools, started working through HybPhaser pipeline
11/12/21, Troubleshot and ran 1 & 1a of HybPhaser
01/05/22, Troubleshot and ran 1b of HybPhaser
01/09/22, Troubleshot and ran 1c of HybPhaser- this script had a lot of the same type of error
01/21/22, PPD: Ran Intronerate on original HybPiper brighamia data + then ran Step2.sh on testdata (lots of troubleshooting difficulties...); HYBPHASER: reduced data to only BRIN; Uploaded BRIN HybPhaser to Fabronia
01/26/22, PPD: Ran steps 1 and 2 on brighamia test data
01/31/22, PPD: Begin troubleshooting PPD.py on brighamia test data
```