# Applied Bioinformatics
This is a collaborative document that was used on day 4 and day 5 of the Applied Bioinformatics Course 2024.
The code developed and shared on day 5 is at the top; the code from day 4 is at the bottom of the page.
# Day 5 bracken results into db
## What should the format look like?
Article "How to Share Data for Collaboration"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7518408/
What are the general guidelines outlined in the paper?
- send raw data
- tidy, not messy
- documentation
- codes for the variables
- how was the data generated?
What should our table look like?
- we need to add sample identifier to the existing table
- we will use the accession number for that
## Our batch script(s)
Not necessarily in this order, we need to:
1: add the sample identifier to the Bracken output.
use the accession number for this
### Day 5: Group with 4
```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A naiss2024-22-540
#SBATCH --mem=1GB
#SBATCH --time=0-00:15
#SBATCH -o slurm.%A.%a.out
#SBATCH -e slurm.%A.%a.err
#SBATCH --job-name=combine_bracken_report
bracken_out_dir="/path/to/bracken_output" ## path to directory with bracken output files
accnum_file="/path/to/accnum_file" ## path to file with the list of accessions
# add header
echo -e "accession_number\tname\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads" > ${bracken_out_dir}/combined_bracken.out
# append all files, prefixed with an accession_number column (output files should be named {accession_number}.bracken.out)
srun xargs -I{} -a "$accnum_file" awk -v id={} 'NR > 1 {print id "\t" $0}' ${bracken_out_dir}/{}.bracken.out >> ${bracken_out_dir}/combined_bracken.out
```
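The same tagging-and-concatenation step can be sketched in plain Python (a hypothetical helper, assuming each file is named `{accession}.bracken.out` and carries a single header line):

```python
import os

def combine_bracken(accessions, out_dir):
    """Prepend the accession number to every data row of each
    Bracken output file and return the combined lines."""
    header = ("accession_number\tname\ttaxonomy_id\ttaxonomy_lvl\t"
              "kraken_assigned_reads\tadded_reads\tnew_est_reads\t"
              "fraction_total_reads")
    combined = [header]
    for acc in accessions:
        path = os.path.join(out_dir, f"{acc}.bracken.out")
        with open(path) as fh:
            next(fh)  # skip the per-file header line
            for line in fh:
                combined.append(f"{acc}\t{line.rstrip()}")
    return combined
```

Writing `combined` out joined with newlines reproduces `combined_bracken.out`.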
2: gather the samples from everyone.
We will do that by individually uploading our data sets into the database
3: make a table in the SQLite database.
```
SQL_STATEMENT="
CREATE TABLE IF NOT EXISTS bracken_abundances_long (
accession_number TEXT NOT NULL,
name TEXT NOT NULL,
taxonomy_id TEXT NOT NULL,
taxonomy_lvl TEXT NOT NULL,
kraken_assigned_reads INTEGER NOT NULL,
added_reads INTEGER NOT NULL,
new_est_reads INTEGER NOT NULL,
fraction_total_reads REAL NOT NULL
);
"
DB_PATH="/proj/applied_bioinformatics/common_data/sample_collab.db"
sqlite3 "$DB_PATH" "$SQL_STATEMENT"
```
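The same DDL can also be issued from Python's built-in `sqlite3` module; the sketch below targets an in-memory database for illustration, so point `connect()` at `sample_collab.db` to run it for real:

```python
import sqlite3

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS bracken_abundances_long (
    accession_number      TEXT    NOT NULL,
    name                  TEXT    NOT NULL,
    taxonomy_id           TEXT    NOT NULL,
    taxonomy_lvl          TEXT    NOT NULL,
    kraken_assigned_reads INTEGER NOT NULL,
    added_reads           INTEGER NOT NULL,
    new_est_reads         INTEGER NOT NULL,
    fraction_total_reads  REAL    NOT NULL
);
"""

def create_table(db_path=":memory:"):
    """Create the shared table (a no-op if it already exists)."""
    con = sqlite3.connect(db_path)
    con.execute(CREATE_TABLE)
    con.commit()
    return con
```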
4: fill the database table with our data.
```
# alternative separator syntax: sqlite3 -batch -separator "$(printf '\t')"
# note: with an existing target table, .import also loads the header line;
# strip it first (tail -n +2) or use ".import --skip 1" (sqlite3 >= 3.32)
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db ".separator \t" ".import path/to/your/combined_bracken.out bracken_abundances_long"
```
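A sketch of the same import in Python, where skipping the header row is explicit (file path and helper name are hypothetical):

```python
import csv
import sqlite3

def import_bracken(con, tsv_path):
    """Load a combined Bracken TSV (header on line 1) into
    bracken_abundances_long; returns the number of rows inserted."""
    with open(tsv_path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)  # skip the header line explicitly
        rows = [tuple(r) for r in reader]
    con.executemany(
        "INSERT INTO bracken_abundances_long VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )
    con.commit()
    return len(rows)
```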
## Questions to answer with the database:
Please add the code you used to get the answers.
1) can you check that the fraction_total_reads proportions add to one per sample ?
```
-- an exact comparison to 1.0 fails on floating-point rounding; use a tolerance
SELECT accession_number, SUM(fraction_total_reads) AS total_fraction
FROM bracken_abundances_long
GROUP BY accession_number
HAVING ABS(total_fraction - 1.0) > 0.001;
```
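Because the stored fractions are rounded, the per-sample sums should be compared to 1 within a tolerance rather than exactly. A minimal Python check (helper name and default tolerance are illustrative):

```python
import sqlite3

def off_samples(con, tol=1e-3):
    """Return samples whose fraction_total_reads do not sum to ~1."""
    return con.execute(
        """SELECT accession_number, SUM(fraction_total_reads) AS total
           FROM bracken_abundances_long
           GROUP BY accession_number
           HAVING ABS(total - 1.0) > ?""",
        (tol,),
    ).fetchall()
```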
2) can you list the first 20 samples with highest SARS-CoV-2 abundance ?
```
-- aggregate per sample in case several rows match the LIKE pattern
SELECT accession_number, name, SUM(fraction_total_reads) AS total_coronavirus_abundance
FROM bracken_abundances_long
WHERE name LIKE '%Severe acute respiratory%'
GROUP BY accession_number
ORDER BY total_coronavirus_abundance DESC
LIMIT 20;
```
```
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT *
FROM bracken_abundances_long
WHERE name = 'Severe acute respiratory syndrome-related coronavirus'
ORDER BY fraction_total_reads DESC
LIMIT 20;"
```
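The same top-20 query can also be run from Python with a parameterized `LIMIT` (hypothetical helper; assumes the table created above):

```python
import sqlite3

def top_sars_samples(con, limit=20):
    """Samples ranked by SARS-CoV-2 relative abundance, highest first."""
    return con.execute(
        """SELECT accession_number, fraction_total_reads
           FROM bracken_abundances_long
           WHERE name LIKE '%Severe acute respiratory%'
           ORDER BY fraction_total_reads DESC
           LIMIT ?""",
        (limit,),
    ).fetchall()
```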
3) can you see a relationship between SARS-CoV-2 abundance and Rt values ?
In Python:
```
# assumes df (bracken_abundances_long) and sample_annot were loaded beforehand,
# e.g. with pandas.read_sql
import matplotlib.pyplot as plt
import seaborn as sns

df_sars = df[df["name"].str.contains("Severe acute respiratory")].sort_values("fraction_total_reads", ascending=False)
combined = df_sars.merge(sample_annot, left_on="accession_number", right_on="run_accession")
print(combined.columns)
plt.figure()
sns.scatterplot(data=combined, x="fraction_total_reads", y="Ct")
```
On command line:
```
sqlite3 -batch /proj/applied_bioinformatics/common_data/sample_collab.db "
SELECT *
FROM sample_annot
JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number
WHERE name = 'Severe acute respiratory syndrome-related coronavirus'
ORDER BY fraction_total_reads DESC"
```
Or in a nice box:
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT run_accession, name, fraction_total_reads, Ct FROM sample_annot INNER JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number WHERE name LIKE '%Severe acute respiratory%' ORDER BY fraction_total_reads DESC LIMIT 20"
```
4) can you check if SARS-CoV-2 is present in rtPCR negative patients ?
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db \
"SELECT run_accession, name, fraction_total_reads, Ct
FROM sample_annot
INNER JOIN bracken_abundances_long
ON sample_annot.run_accession = bracken_abundances_long.accession_number
WHERE name LIKE '%Severe acute respiratory%' AND Ct = '0.0';"
```
Also on command line
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db "SELECT run_accession, name, fraction_total_reads, Ct, host_disease_status FROM sample_annot INNER JOIN bracken_abundances_long ON sample_annot.run_accession = bracken_abundances_long.accession_number WHERE name LIKE '%Severe acute respiratory%' AND host_disease_status = 'SARS-CoV-2 negative' ORDER BY fraction_total_reads DESC LIMIT 20"
```
5) is SARS-CoV-2 only ever seen in RNA samples (ie not in DNA samples) ?
```
sqlite3 -box -batch /proj/applied_bioinformatics/common_data/sample_collab.db \
"SELECT run_accession, name, fraction_total_reads AS total_coronavirus_abundance, nuc, host_disease_status
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
WHERE nuc='RNA' AND name LIKE '%Severe acute respiratory%' -- rerun with nuc='DNA'; zero rows would mean SARS-CoV-2 is seen in RNA samples only
ORDER BY total_coronavirus_abundance DESC
LIMIT 20"
```
6) what are the 10 most ubiquitous human pathogens found in the 125 patients ? Below is a list of species known to be associated with upper airways and oral/nasal cavities, taken from the study's supplementary tables 3 & 4:
'Dolosigranulum pigrum','Haemophilus influenzae','Haemophilus parainfluenzae',
'Klebsiella pneumoniae','Streptococcus pneumoniae','Staphylococcus aureus','Influenza A virus',
'Severe acute respiratory syndrome-related coronavirus','Chlamydia pneumoniae',
'Serratia marcescens','Enterobacter hormaechei',
'Enterobacter cloacae','Enterobacter asburiae','Burkholderia multivorans',
'Human coronavirus HKU1','Human coronavirus NL63','Rhinovirus A'
* to view the top 10
```
SELECT name, COUNT(name) as prevalence
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
GROUP BY name
ORDER BY prevalence DESC
LIMIT 10;
```
* to compare the top 10 against the list of known pathogens
```
SELECT name, COUNT(name) as prevalence
FROM sample_annot sa
INNER JOIN bracken_abundances_long bal ON sa.run_accession = bal.accession_number
WHERE name IN ('Dolosigranulum pigrum',
'Haemophilus influenzae',
'Haemophilus parainfluenzae',
'Klebsiella pneumoniae',
'Streptococcus pneumoniae',
'Staphylococcus aureus',
'Influenza A virus',
'Severe acute respiratory syndrome-related coronavirus',
'Chlamydia pneumoniae',
'Serratia marcescens',
'Enterobacter hormaechei',
'Enterobacter cloacae',
'Enterobacter asburiae',
'Burkholderia multivorans',
'Human coronavirus HKU1',
'Human coronavirus NL63',
'Rhinovirus A')
GROUP BY name
ORDER BY prevalence DESC
LIMIT 10;
```
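For comparison, the same prevalence count from Python, with the species list deduplicated via a set; the join to `sample_annot` is dropped because counting detections only needs the abundance table (helper name is illustrative):

```python
import sqlite3

KNOWN_PATHOGENS = sorted({
    'Dolosigranulum pigrum', 'Haemophilus influenzae',
    'Haemophilus parainfluenzae', 'Klebsiella pneumoniae',
    'Streptococcus pneumoniae', 'Staphylococcus aureus',
    'Influenza A virus',
    'Severe acute respiratory syndrome-related coronavirus',
    'Chlamydia pneumoniae', 'Serratia marcescens',
    'Enterobacter hormaechei', 'Enterobacter cloacae',
    'Enterobacter asburiae', 'Burkholderia multivorans',
    'Human coronavirus HKU1', 'Human coronavirus NL63',
    'Rhinovirus A',
})

def pathogen_prevalence(con, limit=10):
    """Count, per known pathogen, the number of rows (= samples) it
    was detected in, most prevalent first."""
    marks = ",".join("?" * len(KNOWN_PATHOGENS))
    return con.execute(
        f"""SELECT name, COUNT(*) AS prevalence
            FROM bracken_abundances_long
            WHERE name IN ({marks})
            GROUP BY name
            ORDER BY prevalence DESC
            LIMIT ?""",
        (*KNOWN_PATHOGENS, limit),
    ).fetchall()
```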
### Scripts from other groups:
#### some group
To add the accession number to each file.
In your MedBioInfo folder:
```
# first add the accession number as the last column of each *_bracken.out file
# and concatenate all of them into one file (bracken_sql.out).
# Don't forget to change the username and file pattern.
srun \
cat analyses/username_run_accessions.txt \
| xargs -I % sh -c \
"sed '1d;s/$/\t%/' analyses/kraken2/%_bracken.out" \
> analyses/kraken2/bracken_sql.out
# then add headers to the created file
srun \
sed -i 1i"name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads\taccession_number" analyses/kraken2/bracken_sql.out
```
#### group x
```
# usage: bash your_bash_script_filename_here.sh
# run only once
srun -A naiss2024-22-540 bash -c '
directory_to_search="/proj/applied_bioinformatics/users/x_ronod/MedBioinfo/data/bracken_out/" # Change this to your directory
# Find all files matching ERR*result
find "$directory_to_search" -type f -name "ERR*result" | while read -r file_path; do
# Split the basename using "_" and take the first field
basename=$(basename "$file_path")
accession_number=$(echo "$basename" | cut -d"_" -f1)
# Create a temporary file
tmp_file=$(mktemp)
# Add the accession number column to the start of the file
awk -v acc="$accession_number" "BEGIN {FS=OFS=\"\t\"} NR==1 {\$0=\"accession_number\" OFS \$0} NR>1 {\$0=acc OFS \$0} 1" "$file_path" > "$tmp_file"
# Move the temporary file back to the original file
mv "$tmp_file" "$file_path"
echo "Processed $file_path"
done
'
```
#### Group: Fantastic four
Adding accession number or Sample ID to your Bracken output.
Step by step guide:
1) Make sure that you have the Bracken output files by first going to your output directory `cd /proj/applied_bioinformatics/user/"YOUR PATH HERE"/bracken_output_files`, then locating the files with `ls`.
2) Create the bash script in your directory using `nano add_accession_number.sh`. Copy-paste the code below.
3) Run the script and pray.
4) Make sure that the output is correct by examining your `.tsv` file with the following bash command in your terminal: `head ERR******_bracken.tsv`. If you can now see your accession number or sample ID, you have completed the task. Good job!
```
#!/bin/bash
#SBATCH -A naiss2024-22-540
#SBATCH -n 1
#SBATCH -t 01:00:00
#SBATCH -J blastp
#SBATCH -o blastp.%j.out
#SBATCH -e blastp.%j.err
#SBATCH --mail-user ***********@ki.se
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_80
# To run this script, your Bracken output files should be named like bracken.1.ERR****.out
# Define paths
bracken_dir="/path/to/bracken/files"
cd "$bracken_dir" || exit
# Initialize the output file and add a header
output_all="all_sample.tsv"
echo -e "Accession_Number\tColumn1\tColumn2" > "$output_all"
# Loop through each file in the directory
for file in *.out; do
# Check if the file name matches the pattern we need
if [[ $file =~ bracken\.[0-9]+\.(ERR[0-9]+)\.out ]]; then
# Extract the sample name using the regex capture group
sample="${BASH_REMATCH[1]}"
# Create the output filename
output="${sample}_bracken.tsv"
# Run the awk command (Bracken output is tab-separated)
awk -F'\t' -v sample="$sample" 'NR==1 {print "Accession_Number\t"$1"\t"$2} NR>1 {print sample "\t"$1"\t"$2}' "$file" > "$output"
# Append the generated file content to the all_sample.tsv, excluding the header
tail -n +2 "$output" >> "$output_all"
else
# Print a message if the file does not match the pattern
echo "Skipping file: $file (does not match the required pattern)"
fi
done
```
#### Group: At the End
The following piece of code has two main steps. Firstly, it adds the accession number of every file into the bracken report output, in a column titled 'sample_id'. Secondly, it collates all the now modified reports into a single file, to be loaded into SQL.
The columns of the file correspond to:
* name
* taxonomy_id
* taxonomy_lvl
* kraken_assigned_reads
* added_reads
* new_est_reads
* fraction_total_reads
* sample_id
The script is intended to be loaded as part of the nextflow pipeline such that all produced bracken outputs will be processed.
This first part is to be appended to the BRACKEN process.
```
process BRACKEN...{
input:
path(bracken_output) // is not collected
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/gawk:5.3.0'
publishDir "$params.outdir/06_prepbracken"
script:
"""
awk 'BEGIN {FS=OFS="\t"}{\$0 = \$0 OFS (NR == 1 ? "sample_id" : "${id}")} 1' ${bracken_output} | awk 'NR>1' > ${id}_sortedbrackenreport.txt;
"""
output:
path "${id}_sortedbrackenreport.txt", emit: indiv_modified_bracken_report
}
```
This second part is a process of its own.
```
process PREPBRACKEN{
input:
path(indiv_modified_bracken_report) // is collected
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/gawk:5.3.0'
publishDir "$params.outdir/06_prepbracken"
script:
"""
cat ${indiv_modified_bracken_report} > sortedbrackenreport.txt
"""
output:
path "sortedbrackenreport.txt", emit: modified_bracken_report
}
```
------
# Day 4 Introduction to Nextflow
This document is a living document we can all work on in real time. We can ask questions and answer them, post code for everyone etc.
The document is written in Markdown. Here is a link to some basic syntax: https://www.markdownguide.org/basic-syntax/
> blockquote
code (indented by one tab, or four spaces)
Inline `code` by using backticks (`)
## Flash2: Merging paired end reads
Write your first own process:
```
process FLASH2 {
input:
// directives
script:
"""
"""
output:
}
```
#### Amrei
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2:2.2.00--hed695b0_2'
publishDir "$params.outdir/02_flash"
script:
"""
flash2 \\
$reads \\
--output-prefix="${id}.flash2" \\
--max-overlap=150 \\
| tee -a ${id}_flash2.log
"""
output:
tuple val(id), path ("${id}.flash2.extendedFrags.fastq"), emit: merged_reads
path "${id}.flash2.notCombined*.fastq", emit: notCombined_fastq
path "${id}.flash2.hist", emit: flash2_hist
path "${id}.flash2.histogram", emit: flash2_histogram
path "${id}_flash2.log", emit: log
}
```
#### Group with 5
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container "https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6"
publishDir "$params.outdir/02_flash2"
script:
"""
flash -z -t 2 \\
-o ${id}.flash \\
${reads[0]} ${reads[1]} 2>&1 \\
| tee ${id}.flash.log
"""
output:
path "${id}.flash*", emit: flash
path "${id}.flash.log", emit: log
}
```
#### Group 5
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2%3A2.2.00--he4a0461_6'
publishDir "$params.outdir/02_flash2"
script:
"""
flash2 -o ${id}.flash2 $reads
"""
output:
path "${id}.flash2*", emit: merged_reads
}
```
```
process BOWTIE2 {
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives
container 'https://depot.galaxyproject.org/singularity/bowtie2%3A2.5.4--he20e202_0'
publishDir "$params.outdir/03_bowtie2"
script:
"""
bowtie2 \\
-x $bowtie2_db \\
-U $merged_reads \\
-S ${id}.sam
"""
output:
path "${id}.sam", emit: sam_file
}
```
#### Group the best potato group
##### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_merged"
script:
"""
flash \\
--threads 2 \\
--compress \\
-o ${id}.flash \\
${reads[0]} ${reads[1]}
"""
output:
path "${id}.flash.extendedFrags.fastq.gz", emit: merged_reads
}
```
```
process BOWTIE2 {
input:
tuple val(id), path(merged_reads), path(bowtie_db)
// directives
container 'https://depot.galaxyproject.org/singularity/bowtie2:2.5.2--py39h6fed5c7_0'
publishDir "$params.outdir/03_bowtie"
script:
"""
bowtie2 \\
-x $bowtie_db \\
--no-unal \\
-U $merged_reads \\
-S ${id}_merged2PhiX.sam 2>&1 \\
| tee -a ${id}_merged2PhiX.log
"""
output:
path "${id}_merged2PhiX.sam", emit: bowtie_sam
path "${id}_merged2PhiX.log", emit: bowtie_log
}
```
#### Group the best best group
##### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash:1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_flash2"
script:
"""
flash \\
--threads 2 \\
--compress \\
--output-prefix ${id}.flash \\
$reads
"""
output:
path "${id}.flash.extendedFrags.fastq.gz", emit: merged_reads
}
```
```
process BOWTIE2{
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives
container 'https://depot.galaxyproject.org/singularity/bioconductor-crisprbowtie:1.6.0--r43hdfd78af_0'
publishDir "$params.outdir/03_bowtie2"
script:
"""
bowtie2 -x $bowtie2_db -U $merged_reads -S ${id}.sam --threads 8 --no-unal 2>&1 \\
| tee ${id}_bowtie2.log
"""
output:
path "${id}.sam", emit: sam
}
```
## change workflow scope
Here is also the old workflow scope. How do we need to change that one to add the new process?
```
workflow{
ch_input = Channel.fromFilePairs(params.input_read_pairs, checkIfExists: true )
// quality control
FASTQC ( ch_input )
// assembling reports
MULTIQC ( FASTQC.out.qc_zip.collect())
}
```
## Where to find containers?
Here is a good resource on where to find containers:
https://depot.galaxyproject.org/singularity/
Alternatively, you can do an internet search for **nf-core modules** and the name of your tool - in this case flash2. Go to the GitHub repository of that module and copy-paste the link to the Singularity container from the **main.nf** file.
Look up Seqera Containers, a brand-new service that can build Singularity containers for you on the fly: choose which tools should be available and which environment they run in, and it will make a container for you.
## Bowtie2: Map the reads to the database
We need to:
a) change the workflow scope
What do we need as an input for bowtie?
Which channel factory/factories could be useful here?
b) add a new process called BOWTIE2
```
process BOWTIE2 {
input:
// directives
script:
"""
"""
output:
}
```
c) define the run time environment for this process
```
withName:'BOWTIE2'{
}
```
### Flash2: Merging paired end reads
```
process FLASH2 {
input:
tuple val(id), path(reads)
//directives
container 'https://depot.galaxyproject.org/singularity/flash%3A1.2.11--h5bf99c6_6'
publishDir "$params.outdir/02_flash"
script:
"""
flash -z -o ${id} $reads
"""
output:
path "${id}.extendedFrags.fastq.gz", emit: merged_pairs
}
```
#### erik
```
process FLASH2 {
input:
tuple val(id), path(reads)
// directives
container 'https://depot.galaxyproject.org/singularity/flash2:2.2.00--hed695b0_2'
publishDir "$params.outdir/02_flash"
script:
"""
flash2 \\
$reads \\
--output-prefix="${id}.flash2" \\
--max-overlap=150 \\
| tee -a ${id}_flash2.log
"""
output:
tuple val(id), path ("${id}.flash2.extendedFrags.fastq"), emit: merged_reads
path "${id}.flash2.notCombined*.fastq", emit: notCombined_fastq
path "${id}.flash2.hist", emit: flash2_hist
path "${id}.flash2.histogram", emit: flash2_histogram
path "${id}_flash2.log", emit: flash2_log
}
```
#### Amrei Bowtie2
```
process BOWTIE2 {
input:
path(bowtie2_db)
tuple val(id), path(merged_reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/mulled-v2-ac74a7f02cebcfcc07d8e8d1d750af9c83b4d45a:f70b31a2db15c023d641c32f433fb02cd04df5a6-0'
publishDir "$params.outdir/03_bowtie2"
script:
db_name = bowtie2_db.find{it.name.endsWith(".rev.1.bt2")}.name.minus(".rev.1.bt2")
"""
bowtie2 \\
-x $db_name \\
-U $merged_reads \\
-S ${id}_bowtie2_merged_${db_name}.sam \\
--no-unal \\
|& tee -a ${id}_bowtie2_merged_${db_name}.log
"""
output:
path "${id}_bowtie2_merged_${db_name}.log", emit: logs
path "${id}_bowtie2_merged_${db_name}.sam", emit: aligned_reads
}
```
## Kraken
```
process KRAKEN2{
input:
path(kraken2_db)
tuple val(id), path(reads)
// directives:
container 'https://depot.galaxyproject.org/singularity/mulled-v2-8706a1dd73c6cc426e12dd4dd33a5e917b3989ae:c8cbdc8ff4101e6745f8ede6eb5261ef98bdaff4-0'
publishDir "$params.outdir/04_kraken2"
script:
"""
kraken2 \\
--db $kraken2_db \\
--paired \\
--gzip-compressed \\
--output ${id}_kraken2_taxonomy.txt \\
--classified-out "${id}_kraken2_classified#.fq" \\
--report ${id}_kraken2_report.txt \\
$reads \\
|& tee -a ${id}_kraken2.log
"""
output:
path "${id}_kraken2.log", emit: logs
path "${id}_kraken2_classified*.fq" , emit: kraken2_classified_reads
path "${id}_kraken2_taxonomy.txt", emit: kraken2_taxonomy
path "${id}_kraken2_report.txt", emit: kraken2_report
}
```
Command to make the output look nicer (it strips mpa-style rank prefixes such as `s__` and `g__`):
```
sed -i -E 's/\b[scogfkp]__//g' $krona_output
```
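A Python equivalent of that clean-up, matching the same rank-prefix pattern:

```python
import re

RANK_PREFIX = re.compile(r"\b[scogfkp]__")

def strip_rank_prefixes(text):
    """Remove mpa-style rank prefixes such as 's__' from report lines."""
    return RANK_PREFIX.sub("", text)
```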