# Applied Bioinformatics
This is a collaborative document that was used on day 4 and day 5 of the Applied Bioinformatics Course 2024.
The code developed and shared on day 5 is at the top; the code from day 4 is at the bottom of the page.
# Day 5 bracken results into db
## What should the format look like?
Article "How to Share Data for Collaboration"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7518408/
What are the general guidelines outlined in the paper?
- send raw data
- tidy, not messy
- documentation
- codes for the variables
- how was the data generated?
What should our table look like?
- we need to add sample identifier to the existing table
- we will use the accession number for that
## Our batch script(s)
Not necessarily in this order, we need to:
1: add the sample identifier to the Bracken output.
use the accession number for this
### Day 5: Group with 4
```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A naiss2024-22-540
#SBATCH --mem=1GB
#SBATCH --time=0-00:15
#SBATCH -o slurm.%A.%a.out
#SBATCH -e slurm.%A.%a.err
#SBATCH --job-name=combine_bracken_report
bracken_out_dir="/path/to/bracken_output" ## path to directory with bracken output files
accnum_file="/path/to/accnum_file" ## path to file with the list of accessions
# add header
echo -e "accession_number\tname\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads" > ${bracken_out_dir}/combined_bracken.out
# append all files, prefixed with an accession_number column (output files should be named {accession_number}.bracken.out)
srun xargs -I{} -a "$accnum_file" awk -v id={} 'NR > 1 {print id "\t" $0}' ${bracken_out_dir}/{}.bracken.out >> ${bracken_out_dir}/combined_bracken.out
```
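The same tagging-and-concatenation step can be sketched in plain Python (a hypothetical helper, assuming each file is named `{accession}.bracken.out` and carries a single header line):

```python
import os

def combine_bracken(accessions, out_dir):
    """Prepend the accession number to every data row of each
    Bracken output file and return the combined lines."""
    header = ("accession_number\tname\ttaxonomy_id\ttaxonomy_lvl\t"
              "kraken_assigned_reads\tadded_reads\tnew_est_reads\t"
              "fraction_total_reads")
    combined = [header]
    for acc in accessions:
        path = os.path.join(out_dir, f"{acc}.bracken.out")
        with open(path) as fh:
            next(fh)  # skip the per-file header line
            for line in fh:
                combined.append(f"{acc}\t{line.rstrip()}")
    return combined
```

Writing `combined` out joined with newlines reproduces `combined_bracken.out`.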
2: gather the samples from everyone.
We will do that by individually uploading our data sets into the database
3: make a table in the SQLite database.
```
SQL_STATEMENT="
CREATE TABLE IF NOT EXISTS bracken_abundances_long (
accession_number TEXT NOT NULL,
name TEXT NOT NULL,
taxonomy_id TEXT NOT NULL,
taxonomy_lvl TEXT NOT NULL,
kraken_assigned_reads INTEGER NOT NULL,
added_reads INTEGER NOT NULL,
new_est_reads INTEGER NOT NULL,
fraction_total_reads REAL NOT NULL
);
"
DB_PATH="/proj/applied_bioinformatics/common_data/sample_collab.db"
sqlite3 "$DB_PATH" "$SQL_STATEMENT"
```
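The same DDL can also be issued from Python's built-in `sqlite3` module; the sketch below targets an in-memory database for illustration, so point `connect()` at `sample_collab.db` to run it for real:

```python
import sqlite3

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS bracken_abundances_long (
    accession_number      TEXT    NOT NULL,
    name                  TEXT    NOT NULL,
    taxonomy_id           TEXT    NOT NULL,
    taxonomy_lvl          TEXT    NOT NULL,
    kraken_assigned_reads INTEGER NOT NULL,
    added_reads           INTEGER NOT NULL,
    new_est_reads         INTEGER NOT NULL,
    fraction_total_reads  REAL    NOT NULL
);
"""

def create_table(db_path=":memory:"):
    """Create the shared table (a no-op if it already exists)."""
    con = sqlite3.connect(db_path)
    con.execute(CREATE_TABLE)
    con.commit()
    return con
```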
4: fill the database table with our data.
```
# alternative separator syntax: sqlite3 -batch -separator "$(printf '\t')"
# note: with an existing target table, .import also loads the header line;
# strip it first (tail -n +2) or use ".import --skip 1" (sqlite3 >= 3.32)
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db ".separator \t" ".import path/to/your/combined_bracken.out bracken_abundances_long"
```
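A sketch of the same import in Python, where skipping the header row is explicit (file path and helper name are hypothetical):

```python
import csv
import sqlite3

def import_bracken(con, tsv_path):
    """Load a combined Bracken TSV (header on line 1) into
    bracken_abundances_long; returns the number of rows inserted."""
    with open(tsv_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header line explicitly
        rows = [tuple(r) for r in reader]
    con.executemany(
        "INSERT INTO bracken_abundances_long VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )
    con.commit()
    return len(rows)
```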
## Questions to answer with the database:
Please add the code you used to get the answers.
1) can you check that the fraction_total_reads proportions add to one per sample ?
```
-- an exact comparison to 1.0 fails on floating-point rounding; use a tolerance
SELECT accession_number, SUM(fraction_total_reads) AS total_fraction
FROM bracken_abundances_long
GROUP BY accession_number
HAVING ABS(total_fraction - 1.0) > 0.001;
```
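Because the stored fractions are rounded, the per-sample sums should be compared to 1 within a tolerance rather than exactly. A minimal Python check (helper name and default tolerance are illustrative):

```python
import sqlite3

def off_samples(con, tol=1e-3):
    """Return samples whose fraction_total_reads do not sum to ~1."""
    return con.execute(
        """SELECT accession_number, SUM(fraction_total_reads) AS total
           FROM bracken_abundances_long
           GROUP BY accession_number
           HAVING ABS(total - 1.0) > ?""",
        (tol,),
    ).fetchall()
```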
2) can you list the first 20 samples with highest SARS-CoV-2 abundance ?
```
-- aggregate per sample in case several rows match the LIKE pattern
SELECT accession_number, name, SUM(fraction_total_reads) AS total_coronavirus_abundance
FROM bracken_abundances_long
WHERE name LIKE '%Severe acute respiratory%'
GROUP BY accession_number
ORDER BY total_coronavirus_abundance DESC
LIMIT 20;
```
```
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT *
FROM bracken_abundances_long
WHERE name = 'Severe acute respiratory syndrome-related coronavirus'
ORDER BY fraction_total_reads DESC
LIMIT 20;"
```
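The same top-20 query can also be run from Python with a parameterized `LIMIT` (hypothetical helper; assumes the table created above):

```python
import sqlite3

def top_sars_samples(con, limit=20):
    """Samples ranked by SARS-CoV-2 relative abundance, highest first."""
    return con.execute(
        """SELECT accession_number, fraction_total_reads
           FROM bracken_abundances_long
           WHERE name LIKE '%Severe acute respiratory%'
           ORDER BY fraction_total_reads DESC
           LIMIT ?""",
        (limit,),
    ).fetchall()
```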
3) can you see a relationship between SARS-CoV-2 abundance and Rt values ?
In Python:
```
# assumes df (bracken_abundances_long) and sample_annot were loaded beforehand,
# e.g. with pandas.read_sql
import matplotlib.pyplot as plt
import seaborn as sns

df_sars = df[df["name"].str.contains("Severe acute respiratory")].sort_values("fraction_total_reads", ascending=False)
combined = df_sars.merge(sample_annot, left_on="accession_number", right_on="run_accession")
print(combined.columns)
plt.figure()
sns.scatterplot(data=combined, x="fraction_total_reads", y="Ct")
```
On command line:
```
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db "
SELECT *
FROM sample_annot
JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number
WHERE name = 'Severe acute respiratory syndrome-related coronavirus'
ORDER BY fraction_total_reads DESC"
```
Or in a nice box:
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT run_accession, name, fraction_total_reads, Ct FROM sample_annot INNER JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number WHERE name LIKE '%Severe acute respiratory%' ORDER BY fraction_total_reads DESC LIMIT 20"
```
4) can you check if SARS-CoV-2 is present in rtPCR negative patients ?
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db \
"SELECT run_accession, name, fraction_total_reads, Ct
FROM sample_annot
INNER JOIN bracken_abundances_long
ON sample_annot.run_accession = bracken_abundances_long.accession_number
WHERE name LIKE '%Severe acute respiratory%' AND Ct = '0.0';"
```
Also on command line
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT run_accession, name, fraction_total_reads, Ct, host_disease_status FROM sample_annot INNER JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number WHERE name LIKE '%Severe acute respiratory%' AND host_disease_status = 'SARS-CoV-2 negative' ORDER BY fraction_total_reads DESC LIMIT 20"
```
5) is SARS-CoV-2 only ever seen in RNA samples (ie not in DNA samples) ?
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db \
"SELECT run_accession, name, fraction_total_reads AS total_coronavirus_abundance, nuc, host_disease_status
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
WHERE nuc='RNA' AND name LIKE '%Severe acute respiratory%' -- rerun with nuc='DNA'; zero rows would mean SARS-CoV-2 is seen in RNA samples only
ORDER BY total_coronavirus_abundance DESC
LIMIT 20"
```
6) what are the 10 most ubiquitous human pathogens found in the 125 patients ? Below is a list of species known to be associated with upper airways and oral/nasal cavities, taken from the study's supplementary tables 3 & 4:
'Dolosigranulum pigrum','Haemophilus influenzae','Haemophilus parainfluenzae',
'Klebsiella pneumoniae','Streptococcus pneumoniae','Staphylococcus aureus','Influenza A virus',
'Severe acute respiratory syndrome-related coronavirus','Chlamydia pneumoniae',
'Serratia marcescens','Enterobacter hormaechei',
'Enterobacter cloacae','Enterobacter asburiae','Burkholderia multivorans',
'Human coronavirus HKU1','Human coronavirus NL63','Rhinovirus A'
* to view the top 10
```
SELECT name, COUNT(name) as prevalence
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
GROUP BY name
ORDER BY prevalence DESC
LIMIT 10;
```
* to compare the top 10 against the list of known pathogens
```
SELECT name, COUNT(name) as prevalence
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
WHERE name IN ('Dolosigranulum pigrum',
'Haemophilus influenzae',
'Haemophilus parainfluenzae',
'Klebsiella pneumoniae',
'Streptococcus pneumoniae',
'Staphylococcus aureus',
'Influenza A virus',
'Severe acute respiratory syndrome-related coronavirus',
'Chlamydia pneumoniae',
'Serratia marcescens',
'Enterobacter hormaechei',
'Enterobacter cloacae',
'Enterobacter asburiae',
'Burkholderia multivorans',
'Human coronavirus HKU1',
'Human coronavirus NL63',
'Rhinovirus A')
GROUP BY name
ORDER BY prevalence DESC
LIMIT 10;
```
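For comparison, the same prevalence count from Python, with the species list deduplicated via a set; the join to `sample_annot` is dropped because counting detections only needs the abundance table (helper name is illustrative):

```python
import sqlite3

KNOWN_PATHOGENS = sorted({
    'Dolosigranulum pigrum', 'Haemophilus influenzae',
    'Haemophilus parainfluenzae', 'Klebsiella pneumoniae',
    'Streptococcus pneumoniae', 'Staphylococcus aureus',
    'Influenza A virus',
    'Severe acute respiratory syndrome-related coronavirus',
    'Chlamydia pneumoniae', 'Serratia marcescens',
    'Enterobacter hormaechei', 'Enterobacter cloacae',
    'Enterobacter asburiae', 'Burkholderia multivorans',
    'Human coronavirus HKU1', 'Human coronavirus NL63',
    'Rhinovirus A',
})

def pathogen_prevalence(con, limit=10):
    """Count, per known pathogen, the number of rows (= samples) it
    was detected in, most prevalent first."""
    marks = ",".join("?" * len(KNOWN_PATHOGENS))
    return con.execute(
        f"""SELECT name, COUNT(*) AS prevalence
            FROM bracken_abundances_long
            WHERE name IN ({marks})
            GROUP BY name
            ORDER BY prevalence DESC
            LIMIT ?""",
        (*KNOWN_PATHOGENS, limit),
    ).fetchall()
```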
### Scripts from other groups:
#### some group
To add the accession number to each file.
In your MedBioInfo folder:
```
# first add the accession number as the last column of each *_bracken.out file
# and concatenate all of them into one file (bracken_sql.out).
# Don't forget to change the username and file pattern.
srun \
cat analyses/username_run_accessions.txt \
| xargs -I % sh -c \
"sed '1d;s/$/\t%/' analyses/kraken2/%_bracken.out" \
> analyses/kraken2/bracken_sql.out
# then add headers to the created file
srun \
sed -i 1i"name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads\taccession_number" analyses/kraken2/bracken_sql.out
```
#### group x
```
# usage: bash your_bash_script_filename_here.sh
# run only once
srun -A naiss2024-22-540 bash -c '
directory_to_search="/proj/applied_bioinformatics/users/x_ronod/MedBioinfo/data/bracken_out/" # Change this to your directory
# Find all files matching ERR*result
find "$directory_to_search" -type f -name "ERR*result" | while read -r file_path; do
# Split the basename using "_" and take the first field
basename=$(basename "$file_path")
accession_number=$(echo "$basename" | cut -d"_" -f1)
# Create a temporary file
tmp_file=$(mktemp)
# Add the accession number column to the start of the file
awk -v acc="$accession_number" "BEGIN {FS=OFS=\"\t\"} NR==1 {\$0=\"accession_number\" OFS \$0} NR>1 {\$0=acc OFS \$0} 1" "$file_path" > "$tmp_file"
# Move the temporary file back to the original file
mv "$tmp_file" "$file_path"
echo "Processed $file_path"
done
'
```
#### Group: Fantastic four
Adding accession number or Sample ID to your Bracken output.
Step by step guide:
1) Make sure that you have the Bracken output files by first going to your output directory `cd /proj/applied_bioinformatics/user/"YOUR PATH HERE"/bracken_output_files`, then locating the files with `ls`.
2) Create the bash script in your directory using `nano add_accession_number.sh`. Copy-paste the code below.
3) Run the script and pray.
4) Make sure that the output is correct by examining your `.tsv` file with the following bash command in your terminal: `head ERR******_bracken.tsv`. If you can now see your accession number or sample ID, you have completed the task. Good job!
```
#!/bin/bash
#SBATCH -A naiss2024-22-540
#SBATCH -n 1
#SBATCH -t 01:00:00
#SBATCH -J blastp
#SBATCH -o blastp.%j.out
#SBATCH -e blastp.%j.err
#SBATCH --mail-user ***********@ki.se
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_80
# To run this script, your Bracken output files should be named like bracken.1.ERR****.out
# Define paths
bracken_dir="/path/to/bracken/files"
cd "$bracken_dir" || exit
# Initialize the output file and add a header
output_all="all_sample.tsv"
echo -e "Accession_Number\tColumn1\tColumn2" > "$output_all"
# Loop through each file in the directory
for file in *.out; do
# Check if the file name matches the pattern we need
if [[ $file =~ bracken\.[0-9]+\.(ERR[0-9]+)\.out ]]; then
# Extract the sample name using the regex capture group
sample="${BASH_REMATCH[1]}"
# Create the output filename
output="${sample}_bracken.tsv"
# Run the awk command (Bracken output is tab-separated)
awk -F'\t' -v sample="$sample" 'NR==1 {print "Accession_Number\t"$1"\t"$2} NR>1 {print sample "\t"$1"\t"$2}' "$file" > "$output"
# Append the generated file content to the all_sample.tsv, excluding the header
tail -n +2 "$output" >> "$output_all"
else
# Print a message if the file does not match the pattern
echo "Skipping file: $file (does not match the required pattern)"
fi
done
```
#### Group: At the End
The following piece of code has two main steps. Firstly, it adds the accession number of every file into the bracken report output, in a column titled 'sample_id'. Secondly, it collates all the now modified reports into a single file, to be loaded into SQL.
The columns of the file correspond to:
* name
* taxonomy_id
* taxonomy_lvl
* kraken_assigned_reads
* added_reads
* new_est_reads
* fraction_total_reads
* sample_id
The script is intended to be loaded as part of the nextflow pipeline such that all produced bracken outputs will be processed.
This first part is to be appended to the BRACKEN process.
```
process BRACKEN...{
input:
path(bracken_output) // is not collected
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/gawk:5.3.0'
publishDir "$params.outdir/06_prepbracken"
script:
"""
awk 'BEGIN {FS=OFS="\t"}{\$0 = \$0 OFS (NR == 1 ? "sample_id" : "${id}")} 1' ${bracken_output} | awk 'NR>1' > ${id}_sortedbrackenreport.txt;
"""
output:
path "${id}_sortedbrackenreport.txt", emit: indiv_modified_bracken_report
}
```
This second part is a process of its own.
```
process PREPBRACKEN{
input:
path(indiv_modified_bracken_report) // is collected
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/gawk:5.3.0'
publishDir "$params.outdir/06_prepbracken"
script:
"""
cat ${indiv_modified_bracken_report} > sortedbrackenreport.txt
"""
output:
path "sortedbrackenreport.txt", emit: modified_bracken_report
}
```
------
# Day 4 Introduction to Nextflow
This document is a living document we can all work on in real time. We can ask questions and answer them, post code for everyone etc.
The document is written in Markdown. Here is a link to some basic syntax: https://www.markdownguide.org/basic-syntax/
> blockquote
code (indented by one tab, or four spaces)
Inline `code` by using backticks (`)
## Flash2: Merging paired end reads
Write your first own process:
```
process FLASH2 {
input:
// directives
script:
"""
"""
output:
}
```
#### Amrei
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2:2.2.00--hed695b0_2'
publishDir "$params.outdir/02_flash"
script:
"""
flash2 \\
$reads \\
--output-prefix="${id}.flash2" \\
--max-overlap=150 \\
| tee -a ${id}_flash2.log
"""
output:
tuple val(id), path ("${id}.flash2.extendedFrags.fastq"), emit: merged_reads
path "${id}.flash2.notCombined*.fastq", emit: notCombined_fastq
path "${id}.flash2.hist", emit: flash2_hist
path "${id}.flash2.histogram", emit: flash2_histogram
path "${id}_flash2.log", emit: log
}
```
#### Group with 5
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container "https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6"
publishDir "$params.outdir/02_flash2"
script:
"""
flash -z -t 2 \\
-o ${id}.flash \\
${reads[0]} ${reads[1]} 2>&1 \\
| tee ${id}.flash.log
"""
output:
path "${id}.flash*", emit: flash
path "${id}.flash.log", emit: log
}
```
#### Group 5
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2%3A2.2.00--he4a0461_6'
publishDir "$params.outdir/02_flash2"
script:
"""
flash2 -o ${id}.flash2 $reads
"""
output:
path "${id}.flash2*", emit: merged_reads
}
```
```
process BOWTIE2 {
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives
container 'https://depot.galaxyproject.org/singularity/bowtie2%3A2.5.4--he20e202_0'
publishDir "$params.outdir/03_bowtie2"
script:
"""
bowtie2 \\
-x $bowtie2_db \\
-U $merged_reads \\
-S ${id}.sam
"""
output:
path "${id}.sam", emit: sam_file
}
```
#### Group the best potato group
##### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_merged"
script:
"""
flash \\
--threads 2 \\
--compress \\
-o ${id}.flash \\
${reads[0]} ${reads[1]}
"""
output:
path "${id}.flash.extendedFrags.fastq.gz", emit: merged_reads
}
```
```
process BOWTIE2 {
input:
tuple val(id), path(merged_reads), path(bowtie_db)
// directives
container 'https://depot.galaxyproject.org/singularity/bowtie2:2.5.2--py39h6fed5c7_0'
publishDir "$params.outdir/03_bowtie"
script:
"""
bowtie2 \\
-x $bowtie_db \\
--no-unal \\
-U $merged_reads \\
-S ${id}_merged2PhiX.sam 2>&1 \\
| tee -a ${id}_merged2PhiX.log
"""
output:
path "${id}_merged2PhiX.sam", emit: bowtie_sam
path "${id}_merged2PhiX.log", emit: bowtie_log
}
```
#### Group the best best group
##### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_flash2"
script:
"""
flash \\
--threads 2 \\
--compress \\
--output-prefix ${id}.flash \\
$reads
"""
output:
path "${id}.flash.extendedFrags.fastq.gz", emit: merged_reads
}
```
```
process BOWTIE2{
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives
container 'https://depot.galaxyproject.org/singularity/bioconductor-crisprbowtie:1.6.0--r43hdfd78af_0'
publishDir "$params.outdir/03_bowtie2"
script:
"""
bowtie2 -x $bowtie2_db -U $merged_reads -S ${id}.sam --threads 8 --no-unal 2>&1 \\
| tee ${id}_bowtie2.log
"""
output:
path "${id}.sam", emit: sam
}
```
## change workflow scope
Here is also the old workflow scope. How do we need to change that one to add the new process?
```
workflow{
ch_input = Channel.fromFilePairs(params.input_read_pairs, checkIfExists: true )
// quality control
FASTQC ( ch_input )
// assembling reports
MULTIQC ( FASTQC.out.qc_zip.collect())
}
```
## Where to find containers?
Here is a good resource on where to find containers:
https://depot.galaxyproject.org/singularity/
Alternatively, you can do an internet search for **nf-core modules** and the name of your tool - in this case flash2. Go to the GitHub repository of that module and copy-paste the link to the Singularity container from the **main.nf** file.
Look up Seqera Containers, a brand-new service that can build Singularity containers for you on the fly: choose which tools should be available and which environment they run in, and it will make a container for you.
## Bowtie2: Map the reads to the database
We need to:
a) change the workflow scope
What do we need as an input for bowtie?
Which channel factory/factories could be useful here?
b) add a new process called BOWTIE2
```
process BOWTIE2 {
input:
// directives
script:
"""
"""
output:
}
```
c) define the run time environment for this process
```
withName:'BOWTIE2'{
}
```
### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
//directives
container 'https://depot.galaxyproject.org/singularity/flash%3A1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_flash"
script:
"""
flash -z -o ${id} $reads
"""
output:
path "${id}.extendedFrags.fastq.gz", emit: merged_pairs
}
```
#### erik
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2:2.2.00--hed695b0_2'
publishDir "$params.outdir/02_flash"
script:
"""
flash2 \\
$reads \\
--output-prefix="${id}.flash2" \\
--max-overlap=150 \\
| tee -a ${id}_flash2.log
"""
output:
tuple val(id), path ("${id}.flash2.extendedFrags.fastq"), emit: merged_reads
path "${id}.flash2.notCombined*.fastq", emit: notCombined_fastq
path "${id}.flash2.hist", emit: flash2_hist
path "${id}.flash2.histogram", emit: flash2_histogram
path "${id}_flash2.log", emit: flash2_log
}
```
#### Amrei Bowtie2
```
process BOWTIE2 {
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/mulled-v2-ac74a7f02cebcfcc07d8e8d1d750af9c83b4d45a:f70b31a2db15c023d641c32f433fb02cd04df5a6-0'
publishDir "$params.outdir/03_bowtie2"
script:
db_name = bowtie2_db.find{it.name.endsWith(".rev.1.bt2")}.name.minus(".rev.1.bt2")
"""
bowtie2 \\
-x $db_name \\
-U $merged_reads \\
-S ${id}_bowtie2_merged_${db_name}.sam \\
--no-unal \\
|& tee -a ${id}_bowtie2_merged_${db_name}.log
"""
output:
path "${id}_bowtie2_merged_${db_name}.log", emit: logs
path "${id}_bowtie2_merged_${db_name}.sam", emit: aligned_reads
}
```
## Kraken
```
process KRAKEN2{
input:
path(kraken2_db)
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/mulled-v2-8706a1dd73c6cc426e12dd4dd33a5e917b3989ae:c8cbdc8ff4101e6745f8ede6eb5261ef98bdaff4-0'
publishDir "$params.outdir/04_kraken2"
script:
"""
kraken2 \\
--db $kraken2_db \\
--paired \\
--gzip-compressed \\
--output ${id}_kraken2_taxonomy.txt \\
--classified-out "${id}_kraken2_classified#.fq" \\
--report ${id}_kraken2_report.txt \\
$reads \\
|& tee -a ${id}_kraken2.log
"""
output:
path "${id}_kraken2.log", emit: logs
path "${id}_kraken2_classified*.fq" , emit: kraken2_classified_reads
path "${id}_kraken2_taxonomy.txt", emit: kraken2_taxonomy
path "${id}_kraken2_report.txt", emit: kraken2_report
}
```
Command to make the output look nicer (it strips mpa-style rank prefixes such as `s__` and `g__`):
```
sed -i -E 's/\b[scogfkp]__//g' $krona_output
```
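A Python equivalent of that clean-up, matching the same rank-prefix pattern:

```python
import re

RANK_PREFIX = re.compile(r"\b[scogfkp]__")

def strip_rank_prefixes(text):
    """Remove mpa-style rank prefixes such as 's__' from report lines."""
    return RANK_PREFIX.sub("", text)
```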