Bioinformatics_NCBI_database

![datalab Nigeria](https://hackmd.io/_uploads/Sk7fwzC5C.png)

---

# Automating Access to NCBI using Entrez Direct and Quality Control

## **Session Overview**

- **Objective**: Equip participants with the skills to automate data retrieval from NCBI using Entrez Direct and E-utils, so they can perform bioinformatics tasks more efficiently.
- **Duration**: 2 hours
- **Format**: Hands-on, interactive exercises with live coding and demonstrations.

---

## **1. Introduction to NCBI and Entrez**

### **What is NCBI?**

The National Center for Biotechnology Information (NCBI) is a vast repository of life-sciences data. Initially focused on DNA data, NCBI now offers access to:

- **DNA and protein sequences**
- **Genomes and gene expression data**
- **Scientific literature, including PubMed**
- **Structural data and more**

### **What is Entrez?**

Entrez is NCBI’s primary search and retrieval system. It integrates the PubMed database of biomedical literature with 39 other molecular databases and allows users to:

- **Search across multiple databases**
- **Retrieve specific records**
- **Download data in various formats**

### **What is Entrez Direct?**

Entrez Direct (EDirect) is a command-line toolset for accessing NCBI databases. It makes it easier to automate data retrieval and integrate it into bioinformatics workflows.

---

## **2. Installing Entrez Direct**

### **Installation Steps**

Entrez Direct can be installed either via Bioconda or directly from the source.

### **A. From Bioconda**

1. **Install via Bioconda**: If you use the Conda package manager, install Entrez Direct from the Bioconda channel:

   ```bash
   conda install bioconda::entrez-direct
   ```

### **B. From Source Code**

#### **Step 1: Install from Source**

- Download the source code.
- Unpack and navigate to the directory.
- Run the installation script.

```bash
cd ~
curl -O ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.tar.gz
tar -xzvf edirect.tar.gz
rm edirect.tar.gz
export PATH=$PATH:$HOME/edirect
```

#### **Step 2: Add to `.bashrc` (Optional but Recommended)**

To ensure Entrez Direct is always available in your PATH:

```bash
echo 'export PATH=$PATH:$HOME/edirect' >> ~/.bashrc
source ~/.bashrc
```

### **Verifying Installation**

After installation, verify it by running:

```bash
esearch -version
```

*Expected Output*: The version number of Entrez Direct should be displayed.

*Participants will be guided through this installation process during the session.*

---

## **3. Basic Entrez Direct Commands**

Entrez Direct includes a suite of tools, but we will focus on two core commands: `efetch` and `esearch`.

### **Fetching Data with `efetch`**

`efetch` retrieves records from NCBI databases in a chosen format.

#### **Example 1: Retrieve GenBank Format Data**

```bash
efetch -db nuccore -format gb -id AF086833 | head
```

*Explanation*: This command retrieves the GenBank record for the accession number AF086833 and displays its first 10 lines.

#### **Example 2: Retrieve FASTA Format Data**

```bash
efetch -db nuccore -format fasta -id AF086833 > AF086833.fa
```

*Explanation*: This command fetches the sequence for the given accession number in FASTA format and saves it to `AF086833.fa`.

### **Searching Databases with `esearch`**

`esearch` queries an NCBI database and returns a short XML summary of the search, including the number of matching records. This summary can be piped into other Entrez Direct commands, such as `efetch`, for further processing.

#### **Example 1: Search Nucleotide Database**

```bash
esearch -db nucleotide -query "Ebola virus 1976"
```

*Explanation*: This command searches the nucleotide database for records matching "Ebola virus 1976".

#### **Interactive Exercise**:

Participants will search for sequences related to an organism of their choice and use `efetch` to retrieve them in both GenBank and FASTA formats.

---

## **4. Working with Sequence Data**

### **Retrieving Sequences in Various Formats**

Entrez Direct can retrieve whole sequences, subsequences, and reverse complements, depending on the research need.

#### **Example 1: Retrieve Subsections of Sequences**

```bash
efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 100
```

*Explanation*: This command fetches the first 100 bases of the sequence.

#### **Example 2: Fetch Reverse Complements**

```bash
efetch -db nuccore -format fasta -id AF086833 -seq_start 1 -seq_stop 100 -strand 2
```

*Explanation*: This retrieves the reverse complement of the first 100 bases.

### **Interactive Exercise**:

Participants will fetch a specific gene or region of interest and practice retrieving the reverse complement.

---

## **5. BLAST: A Practical Introduction**

### **What is BLAST?**

BLAST, which stands for Basic Local Alignment Search Tool, is both an algorithm and a suite of tools designed to compare biological sequences, such as nucleotide or protein sequences. It helps researchers identify similarities between sequences, which can be crucial in understanding genetic relationships, function, and evolution. BLAST can be run via a web interface from NCBI or as standalone tools on your local machine.

### **Types of BLAST Tools**

BLAST consists of various tools, each suited to different types of searches:

- **blastn**: Compares a nucleotide query against nucleotide sequences.
- **blastp**: Compares a protein query against protein sequences.
- **blastx**: Translates a nucleotide query into protein and compares it against protein sequences.
- **tblastn**: Compares a protein query against translated nucleotide sequences.
- **tblastx**: Translates both query and target nucleotide sequences into proteins and compares them.

Each tool can be fine-tuned with the `-task` option, such as `megablast` (a `blastn` task) for highly similar sequences or `blastp-short` (a `blastp` task) for shorter protein sequences.

### **Installing BLAST via Conda**

To install BLAST on your local machine using Conda:

1. **Install Conda (if not installed)**: Download and install from the [official Conda website](https://docs.conda.io/en/latest/miniconda.html).
2. **Set up a Conda Environment**:

   ```bash
   conda create -n bioinfo-env
   conda activate bioinfo-env
   ```

3. **Install BLAST**:

   ```bash
   conda install -c bioconda blast
   ```

4. **Verify Installation**:

   ```bash
   blastn -version
   ```

### **Running a Pairwise Alignment with BLAST**

Pairwise alignment compares two sequences to find regions of similarity. Here's how to use BLAST for this purpose:

1. **Fetch Sequences**:
   - **Ebola Nucleoprotein 1976 Strain**:

     ```bash
     efetch -db protein -format fasta -id AAD14590 > AAD14590.fa
     ```

   - **Ebola Nucleoprotein 2018 Strain**:

     ```bash
     efetch -db protein -format fasta -id ARG44037 > ARG44037.fa
     ```

2. **Run Pairwise BLAST Alignment**:

   ```bash
   blastp -query AAD14590.fa -subject ARG44037.fa
   ```

This command aligns the two sequences and provides detailed output about their similarities, including percent identity and alignment length.

### **Understanding BLAST Output**

BLAST outputs can be complex. For example, you might see that the sequences match at a 99% level, with 728 out of 739 residues being identical. This information is valuable for assessing the similarity between sequences.

To simplify the output and focus on specific details, you can format the results:

```bash
blastp -query AAD14590.fa -subject ARG44037.fa -outfmt '6 pident'
```

This command prints only the percent identity, one value per reported alignment, which makes the result easier to interpret.

### **Conclusion**

BLAST is a versatile tool that can be used for various sequence alignment tasks, from simple pairwise alignments to complex database searches. Understanding how to use and interpret BLAST results is crucial for genomic research and can provide insights into evolutionary relationships and functional genomics.

## **6. Q&A**

### **Open Q&A**

Participants are encouraged to ask questions or discuss any issues they encountered during the session.
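
As a recap, the `efetch` and `blastp` steps from this session can be combined into a few small shell functions. This is a minimal sketch under stated assumptions, not a definitive pipeline: `fetch_protein`, `pairwise_identity`, and `best_hit` are hypothetical helper names introduced here, and the sketch assumes `efetch` and `blastp` are already on your `PATH`. It requests the `length` column alongside `pident` so the longest alignment can be picked out.

```bash
#!/usr/bin/env bash
# Sketch: fetch two protein records and report their pairwise identity.
# The function names below are hypothetical helpers, not part of EDirect or BLAST+.

# Download one protein record in FASTA format to <accession>.fa.
fetch_protein() {
    local acc="$1"
    efetch -db protein -format fasta -id "$acc" > "${acc}.fa"
}

# Align two FASTA files; print "pident<TAB>length" for each reported alignment.
pairwise_identity() {
    blastp -query "$1" -subject "$2" -outfmt '6 pident length'
}

# Read "pident<TAB>length" rows on stdin and print the row with the longest alignment.
best_hit() {
    awk -F '\t' '$2 > max { max = $2; row = $0 } END { if (row != "") print row }'
}

# Example run (needs network access and a BLAST+ installation):
# fetch_protein AAD14590
# fetch_protein ARG44037
# pairwise_identity AAD14590.fa ARG44037.fa | best_hit
```

Keeping the parsing step (`best_hit`) separate from the network and alignment steps means it can be checked on saved BLAST output without re-running the search.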
---

This 30-minute, step-by-step tutorial covers downloading a FASTQ file, installing FastQC, running quality control, and viewing the resulting report in Google Chrome on Ubuntu.

### **Outline**

1. **Introduction to FASTQ and FastQC (5 minutes)**
2. **Downloading a FASTQ file (5 minutes)**
3. **Installing FastQC on Ubuntu (5 minutes)**
4. **Running FastQC Quality Control on FASTQ files (5 minutes)**
5. **Visualizing Quality Control Results in Google Chrome (5 minutes)**
6. **Conclusion (5 minutes)**

---

### **1. Introduction to FASTQ and FastQC (5 minutes)**

**What is a FASTQ file?**

- A FASTQ file is a text-based format for storing nucleotide sequences (such as DNA sequences) along with their corresponding quality scores.
- It is commonly used to store reads from high-throughput sequencing, such as next-generation sequencing (NGS).

**What is FastQC?**

- FastQC is a tool used to assess the quality of sequence data from high-throughput sequencing pipelines.
- It generates visual reports that help you judge whether your sequencing data is of sufficient quality.

---

### **2. Downloading a FASTQ file (5 minutes)**

You can download FASTQ files from public databases like NCBI’s Sequence Read Archive (SRA).

#### **Steps:**

1. Open **Google Chrome** on Ubuntu.
2. Go to the [NCBI SRA Database](https://www.ncbi.nlm.nih.gov/sra).
3. Search for a dataset. For example, search for an accession such as "SRR000123" (replace with the accession of the dataset you want).
4. Click on a sequence record and navigate to the FASTQ download section.
5. Download the file to your desired directory.

Alternatively, you can download FASTQ files using the **SRA Toolkit**:

```bash
# First, install the SRA Toolkit
sudo apt-get install sra-toolkit

# Use prefetch to download the data from SRA
prefetch SRR000123  # Replace with your sequence ID

# Convert the downloaded SRA file to FASTQ
fastq-dump SRR000123
```

---

### **3. Installing FastQC on Ubuntu (5 minutes)**

#### **Steps:**

1. Open a terminal (Ctrl+Alt+T).
2. Update your system’s package list:

   ```bash
   sudo apt-get update
   ```

3. Install FastQC:

   ```bash
   sudo apt-get install fastqc
   ```

4. Verify the installation by typing:

   ```bash
   fastqc --version
   ```

---

### **4. Running FastQC Quality Control on FASTQ Files (5 minutes)**

#### **Steps:**

1. Open the terminal.
2. Navigate to the directory where your FASTQ file is located. For example:

   ```bash
   cd /path/to/your/fastq_file
   ```

3. Run FastQC on the FASTQ file:

   ```bash
   fastqc your_fastq_file.fastq
   ```

4. After processing, FastQC will generate output files including:
   - A `.zip` file with results (for example, `your_fastq_file_fastqc.zip`)
   - An HTML file with a visual report (for example, `your_fastq_file_fastqc.html`)

---

### **5. Visualizing Quality Control Results in Google Chrome (5 minutes)**

#### **Steps:**

1. Open **Google Chrome**.
2. Locate the directory where the FastQC output files are stored.
3. Open the HTML file by either:
   - Right-clicking the file and selecting “Open with Google Chrome.”
   - Typing `file:///path/to/your/output_file.html` in Chrome’s address bar.
4. Review the quality control metrics:
   - **Per base sequence quality:** Box plots of the quality scores at each position in the reads.
   - **Per sequence GC content:** Distribution of GC content across reads.
   - **Adapter content:** Presence of adapter sequences in the reads.
   - Other visual reports generated by FastQC.

You can analyze these plots to identify any quality issues with your sequencing data (e.g., low-quality reads, adapter contamination).

---

### **6. Conclusion (5 minutes)**

Summarize the tutorial:

- **What we did:** Downloaded a FASTQ file, installed FastQC, ran quality control, and visualized the results.
- **Importance of quality control in sequencing:** Ensuring good quality data before proceeding with downstream analysis like alignment and variant calling.

Encourage the audience to explore further by running quality control on their own data or experimenting with different datasets.
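
For participants who want to run quality control on more than one file, the FastQC steps in this tutorial can be wrapped in a loop. The following is a minimal sketch rather than a recommended workflow: `report_name` and `run_fastqc_dir` are hypothetical helper names introduced here, and the report naming assumes FastQC's usual convention of stripping `.gz`, `.fastq`, or `.fq` and appending `_fastqc.html`. It uses FastQC's `-o` option to write reports to a separate directory.

```bash
#!/usr/bin/env bash
# Sketch: run FastQC on every FASTQ file in a directory.
# report_name and run_fastqc_dir are hypothetical helpers, not FastQC commands.

# Print the HTML report name FastQC is expected to produce for a FASTQ file
# (assumes the usual "<name>_fastqc.html" convention).
report_name() {
    local base
    base="$(basename "$1")"
    base="${base%.gz}"      # drop compression suffix, if any
    base="${base%.fastq}"   # drop .fastq ...
    base="${base%.fq}"      # ... or .fq
    printf '%s_fastqc.html\n' "$base"
}

# Run FastQC on all *.fastq and *.fastq.gz files in in_dir, writing to out_dir.
run_fastqc_dir() {
    local in_dir="$1"
    local out_dir="$2"
    local f
    mkdir -p "$out_dir"     # FastQC expects the output directory to exist
    for f in "$in_dir"/*.fastq "$in_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue   # skip patterns that matched nothing
        fastqc -o "$out_dir" "$f"
        echo "Report: $out_dir/$(report_name "$f")"
    done
}

# Example run (needs FastQC installed):
# run_fastqc_dir /path/to/your/fastq_files /path/to/your/qc_reports
```

Each printed report path can be opened in Chrome in the same way as a single report.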
---

### **Additional Notes:**

- **Troubleshooting FastQC errors:** FastQC may report warnings or failures for individual modules if the sequence data has issues. Check the FastQC documentation for guidance on interpreting each module.
- **Large datasets:** Large datasets may need more computational resources. Consider using cloud platforms or HPC resources if necessary.
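
When checking for FastQC warnings, it can help to list the flagged modules from the command line instead of opening each report. The following is a minimal sketch, assuming the FastQC `.zip` output contains a tab-separated `summary.txt` with the status (`PASS`, `WARN`, or `FAIL`) in the first column and the module name in the second. `flag_modules` is a hypothetical helper name, and the archive path in the usage comment is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: list FastQC modules that did not pass.
# flag_modules is a hypothetical helper, not part of FastQC.

# Read a FastQC summary.txt on stdin (status<TAB>module<TAB>filename)
# and print "STATUS: module" for every row whose status is not PASS.
flag_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example run (placeholder file names; unzip -p writes the file to stdout):
# unzip -p your_fastq_file_fastqc.zip your_fastq_file_fastqc/summary.txt | flag_modules
```

A flagged module is a prompt to look at the matching plot in the HTML report, not an automatic reason to discard the data.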