---
tags: Pasteur
---
# Whole genome analysis of *Klebsiella pneumoniae* CG45 isolates
This is a summary of my analysis of hybrid assemblies of *Klebsiella pneumoniae* genomes obtained by combining Oxford Nanopore Technologies long reads with Illumina short reads.
I sometimes detail my workflow a bit too much, because I want to be able to come back to it and reproduce it even if I don't remember anything. Feel free to skip some of those sections.
[toc]
## Biological context
CG45 is a widespread *Klebsiella pneumoniae* clonal group that has been sampled in a variety of hosts (humans, other mammals, birds...) as well as in food and in the environment. It is a high-risk clone that includes both antimicrobial-resistant (AMR) and virulent isolates.
The lab already has Illumina sequencing data for numerous CG45 isolates, but it would be interesting to produce a high-quality assembly for a subset of representative isolates by combining Illumina reads (short but accurate) with Nanopore reads (long but less accurate). This is what we are doing for this project: we sequenced 19 CG45 isolates and analyzed the genomes assembled from the sequencing data.
## General technical framework
I worked on the Pasteur cluster, which is easily accessible via a terminal over an ssh connection (if you are using Windows, PuTTY is a simple but efficient tool for that). Working with different software on the cluster is made unbelievably easy with `module`: basically, it allows you to load any version of any software that is already installed on the cluster. Just do `module av my_software` to check whether a package is available, and `module load my_software` to load it. (The only downside of that system is that you need to go through the IT service if you need software that is not already installed.)
Jobs are distributed over nodes through the slurm system: to launch anything that takes computing power (basically any command except things like `cd`, `mv`, `cp`, etc.), you should use `sbatch`.
Commands will look like this:
```
sbatch -c 12 -p bebp --qos=bebp --wrap='<your job here>'
```
The `-c` option tells slurm how many cores you want to use, and `-p bebp` tells it that you want to work on the partition specifically allocated to the Biodiversity and Epidemiology of Pathogen Bacteria (BEBP) unit. The `--qos=bebp` option sets the quality of service (i.e. the time after which your job is automatically killed if it hasn't completed). In this case, `--qos=bebp` actually gives your job unlimited time to complete (so remember to kill your job yourself if it's stuck in an infinite loop or something).
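Since every job repeats the same partition/QoS boilerplate, a tiny wrapper can save typing. This is just a sketch of my own (the `bebp_sbatch` function name is mine, not cluster tooling); it only echoes the command so you can check it before running it for real:

```shell
# Hypothetical helper: builds the sbatch command with the BEBP
# partition/QoS baked in, and echoes it instead of submitting.
bebp_sbatch() {
  local cores="$1"; shift
  echo "sbatch -c ${cores} -p bebp --qos=bebp --wrap='$*'"
}

bebp_sbatch 12 "guppy_basecaller --help"
# → sbatch -c 12 -p bebp --qos=bebp --wrap='guppy_basecaller --help'
```

Once the echoed command looks right, run it directly (or pipe it through `sh`).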
NB: I asked the IT service (more specifically, Eric Deveaud) to install a few useful tools that were not initially available: digIS, Ariba, srst2, and MicrobeAnnotator (for this last one, there were several complications in the installation, and in the end he gave up).
## Key papers for biological context
- Population genomics of *Klebsiella pneumoniae*: 10.1038/s41579-019-0315-1
- Antimicrobial resistance: a One Health perspective: 10.1128/microbiolspec.ARBA-0009-2017
- Molecular mechanisms of resistance to conventional antibiotics in bacteria: 10.29252/IJMR-050305
## Genome assembly
The initial input for this analysis is a set of fast5 files obtained through ONT sequencing (using the MinION or MK1C for instance).
### Basecalling (Guppy)
The first step is basecalling: the fast5 files contain the raw signal (measures of the disruption of the electric current across the nanopore as the DNA strand went through), and we need to convert that signal into a good old-fashioned DNA sequence.
Here, we use Guppy, Nanopore's in-house basecaller (other basecallers exist though). The process can be a bit long, so use as many CPUs as you can to speed the process along (especially if you have A LOT of reads).
The options are fairly straightforward: you need to specify the path to the fast5 files, the path to the output directory, the number of parallel basecallers to create, the number of CPU threads to use for each caller (make sure callers × threads doesn't exceed the cores you requested with `-c`), and the material you used for your sequencing experiment (flowcell type and barcoding kit). There is also a useful option that tells Guppy to compress the output fastq files on the fly.
```
sbatch -c 12 -p bebp --qos=bebp --wrap='guppy_basecaller \
--input_path path_to_fast5_files \
--save_path basecalling_output_dir \
--num_callers 1 --cpu_threads_per_caller 32 \
--flowcell FLO-MIN106 --kit SQK-RBK004 --compress_fastq'
```
Reads will be classified as either 'pass' or 'fail' depending on their quality ('pass' reads have an average quality score > Q7).
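For intuition about that threshold, a read's mean quality can be estimated from the FASTQ quality string (Phred+33 encoding). A quick shell sketch of my own (note: Guppy's reported mean qscore is derived from error probabilities, so its values differ slightly from this plain arithmetic mean):

```shell
# Arithmetic mean of Phred+33 scores in a quality string; 'I' encodes Q40.
qual="IIII"
printf '%s' "$qual" \
  | od -An -tu1 \
  | awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
         END { printf "mean Q = %.1f\n", sum / n }'
# → mean Q = 40.0
```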
### Demultiplexing (Guppy)
If you sequenced multiple isolates, you presumably assigned a distinct barcode to each sample. In the demultiplexing step, we group the reads based on the barcode they carry (so reads coming from the same sample end up in the same directory). Again, I used Guppy (which, as you guessed, not only does basecalling but also demultiplexing).
In the slurm part of the command line, we indicate two files that will contain the output messages (barcoding.o) and potential errors (barcoding.e). In the Guppy part, we indicate the files to demultiplex (you should use the reads classified as 'pass' in the basecalling step), the output directory, and the barcoding kit used in the sequencing experiment. You can also tell Guppy to remove the barcodes in the same step (you don't want them in your final assembly).
```
sbatch -c 40 -p bebp --qos=bebp -o barcoding.o -e barcoding.e --wrap='guppy_barcoder \
-i basecalling_output_dir/pass \
-s demultiplexing_output_dir \
--barcode_kits SQK-RBK004 \
--trim_barcodes'
```
### Filter reads using Illumina data (Filtlong)
Now that we have our fastq files sorted by sample, we can filter reads based on their quality. To do that, we use Filtlong. Filtlong can assess the quality of long reads against Illumina sequencing data (which is more accurate than Nanopore data): this is what we did here. It can also be used to filter long reads on their own, without information from the Illumina reads.
Again, we indicate two files to contain the output messages and potential errors in the slurm part of the command (filter.o and filter.e). In the Filtlong part of the command, we indicate the files containing the forward and reverse Illumina reads, the minimum length below which a read is discarded, and the percentage of reads to keep. The `--trim` option tells Filtlong to trim bases from the start and end of reads that do not match a k-mer in the reference, which ensures that each read starts and ends with a sequence of good quality. The `--split 500` option means that whenever 500 consecutive bases fail to match a k-mer in the reference, the read is split; this allows Filtlong to remove very poor parts of reads while keeping the good parts. The `--target_bases 500000000` option removes the worst reads until only 500 Mbp remain (if you have less than 500 Mbp of data, the option has no effect). At the end of the command, you indicate the file containing the reads to be filtered.
Here, we also added a subsequent command to zip the output.
```
sbatch -c 16 -p bebp --qos=bebp -o filter.o -e filter.e \
--wrap='filtlong -1 illumina_forward.fastq -2 illumina_reverse.fastq \
--min_length 1000 --keep_percent 90 --trim --split 500 \
--target_bases 500000000 SB5168_ONT.fastq \
| gzip > SB5168_ONT_filR.fastq.gz'
```
### Hybrid assembly (Unicycler)
Now that we have only good-quality reads remaining, we can use them for our genome assembly. If you have Nanopore reads AND Illumina reads, you can build a hybrid assembly: basically, the idea is that you build contigs using the Illumina data (which is more accurate), and then bridge them using the Nanopore data (which contains longer reads). The program I use is called Unicycler.
Again, we indicate two files to contain the output messages and potential errors in the slurm part of the command (unicycler.o and unicycler.e). In the unicycler part of the command, we indicate files containing the forward and reverse Illumina reads, the file containing the long reads (i.e. the Nanopore reads), the output directory, how much information we want in the output messages (verbosity), the level of file retention (2 means that we keep SAM files), how many threads should be used, and the bridging mode that Unicycler should use.
```
sbatch -c 64 -p bebp --qos=bebp -o unicycler.o -e unicycler.e \
--wrap="unicycler -1 illumina_forward.fastq -2 illumina_reverse.fastq \
-l long_reads.fastq --out hybrid_assembly --verbosity 2 \
--keep 2 --threads 64 --mode normal"
```
### Long-read assembly (Unicycler)
Unicycler can also be used to assemble long reads only. The options are the same; we simply omit the Illumina reads.
```
sbatch -c 64 -p bebp --qos=bebp -o unicycler.o -e unicycler.e \
--wrap="unicycler -l long_reads.fastq --out long_read_assembly --verbosity 2 \
--keep 2 --threads 64 --mode normal"
```
NB: Even if you have both short and long reads, it can be useful to run the long-read assembly, because it sometimes performs better than the hybrid assembly (i.e. it generates fewer contigs).
### Overview of results


### Polishing a long-read assembly using Illumina data
For genomes for which the long-read assembly was better, I used Polypolish to polish it with the Illumina data: https://github.com/rrwick/Polypolish.
I had it installed on the cluster.
Ryan Wick made a very clear [tutorial](https://github.com/rrwick/Polypolish/wiki/How-to-run-Polypolish) (as usual) that I just had to follow, and everything ran smoothly. I tried it on the SB5234 long-read assembly, which was better than the hybrid assembly.
```
sbatch -c 16 -p bebp --qos=bebp --wrap='bwa index SB5234_l.fasta'
sbatch -c 16 -p bebp --qos=bebp --wrap='bwa mem -t 16 -a SB5234_l.fasta SB5234_R1.fastq.gz > alignments_1.sam'
sbatch -c 16 -p bebp --qos=bebp --wrap='bwa mem -t 16 -a SB5234_l.fasta SB5234_R2.fastq.gz > alignments_2.sam'
sbatch -c 16 -p bebp --qos=bebp --wrap='polypolish_insert_filter.py --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam'
sbatch -c 16 -p bebp --qos=bebp --wrap='polypolish SB5234_l.fasta filtered_1.sam filtered_2.sam > SB5234_l_polished.fasta'
```
Since it worked well, I also applied it to SB5501, SB5462, and SB5970. To check how different the hybrid and polished long-read assemblies were, I looked at the number of SNPs between the two (using Snippy), ran a Kleborate analysis, and compared them with BRIG to check that there was no major difference.
| Strain | SNPs (hybrid vs polished long-read) |
| -------- | -------- |
| SB5234 | 155 |
| SB5462 | 139 |
| SB5501 | 33 |
| SB5970 | 129 |
| Ctrl (2 different strains) | 3515 |
The Kleborate analysis yielded identical results between the two assemblies, except for SB5970, for which there was one more tet(A) gene found in the hybrid assembly and a mutation in PrmB (0%).
Chiara also suggested looking at this [preprint](https://www.biorxiv.org/content/10.1101/2022.06.30.498322v1.full.pdf) by Kat Holt's lab to see how they were comparing assemblies. To complete my analysis, I ran the assemblies through the [Pathogenwatch](https://pathogen.watch/) pipeline, as they did.
Again, the only differences I found between assemblies were for the SB5970 assemblies (hybrid vs polished):
- % core families: 92.4 vs 100
- % non-core: 61.1 vs 60.6
- % GC content: 57.3 vs 57.4
- colistin resistance: 1 vs 0
This is also the strain for which the size difference between the two assemblies is the highest (5084008 vs 5463958).
### Trying out Trycycler
Guidelines here: https://github.com/rrwick/Trycycler/wiki/How-to-run-Trycycler
My first attempt was with SB5404: it turns out the coverage is too shallow for Trycycler to generate enough read subsets from which to build assemblies.
Second attempt with SB6659 (which produced a very good assembly with Unicycler, so I'm assuming the coverage is deeper): it worked.
```
module load R miniasm minimap2 Mash muscle Trycycler
sbatch -c 12 -p bebp --qos=bebp \
--wrap="trycycler subsample --reads SB6659_filR.fastq.gz --out_dir read_subsets"
```
I then generated the assemblies using the commands provided by R. Wick (don't forget to load any2fasta, or the miniasm + minipolish assemblies will fail):
```
mkdir assemblies
module load Flye
module load Minipolish
module load raven
module load any2fasta
sbatch -c 16 -p bebp --qos=bebp --wrap="flye --nano-raw read_subsets/sample_01.fastq --threads 16 --out-dir assembly_01 && cp assembly_01/assembly.fasta assemblies/assembly_01.fasta && rm -r assembly_01"
sbatch -c 16 -p bebp --qos=bebp --wrap="miniasm_and_minipolish.sh read_subsets/sample_02.fastq 16 > assembly_02.gfa && any2fasta assembly_02.gfa > assemblies/assembly_02.fasta && rm assembly_02.gfa"
sbatch -c 16 -p bebp --qos=bebp --wrap="raven --threads 16 read_subsets/sample_03.fastq > assemblies/assembly_03.fasta && rm raven.cereal"
sbatch -c 16 -p bebp --qos=bebp --wrap="flye --nano-raw read_subsets/sample_04.fastq --threads 16 --out-dir assembly_04 && cp assembly_04/assembly.fasta assemblies/assembly_04.fasta && rm -r assembly_04"
sbatch -c 16 -p bebp --qos=bebp --wrap="miniasm_and_minipolish.sh read_subsets/sample_05.fastq 16 > assembly_05.gfa && any2fasta assembly_05.gfa > assemblies/assembly_05.fasta && rm assembly_05.gfa"
sbatch -c 16 -p bebp --qos=bebp --wrap="raven --threads 16 read_subsets/sample_06.fastq > assemblies/assembly_06.fasta && rm raven.cereal"
sbatch -c 16 -p bebp --qos=bebp --wrap="flye --nano-raw read_subsets/sample_07.fastq --threads 16 --out-dir assembly_07 && cp assembly_07/assembly.fasta assemblies/assembly_07.fasta && rm -r assembly_07"
sbatch -c 16 -p bebp --qos=bebp --wrap="miniasm_and_minipolish.sh read_subsets/sample_08.fastq 16 > assembly_08.gfa && any2fasta assembly_08.gfa > assemblies/assembly_08.fasta && rm assembly_08.gfa"
sbatch -c 16 -p bebp --qos=bebp --wrap="raven --threads 16 read_subsets/sample_09.fastq > assemblies/assembly_09.fasta && rm raven.cereal"
sbatch -c 16 -p bebp --qos=bebp --wrap="flye --nano-raw read_subsets/sample_10.fastq --threads 16 --out-dir assembly_10 && cp assembly_10/assembly.fasta assemblies/assembly_10.fasta && rm -r assembly_10"
sbatch -c 16 -p bebp --qos=bebp --wrap="miniasm_and_minipolish.sh read_subsets/sample_11.fastq 16 > assembly_11.gfa && any2fasta assembly_11.gfa > assemblies/assembly_11.fasta && rm assembly_11.gfa"
sbatch -c 16 -p bebp --qos=bebp --wrap="raven --threads 16 read_subsets/sample_12.fastq > assemblies/assembly_12.fasta && rm raven.cereal"
```
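The twelve commands above just rotate the read subsets through the three assemblers. A sketch of that scheduling as a loop (it only prints the subset-to-assembler mapping; the actual flags are the ones shown above):

```shell
# Subsets 01,04,07,10 -> flye; 02,05,08,11 -> miniasm+minipolish;
# 03,06,09,12 -> raven (same rotation as the commands above).
assemblers=(flye miniasm_and_minipolish raven)
for i in $(seq -w 1 12); do
  a=${assemblers[$(( (10#$i - 1) % 3 ))]}   # 10# strips the leading zero
  echo "sample_$i -> $a"
done
```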
The next step is contig clustering:
```
sbatch -c 16 -p bebp --qos=bebp --wrap="trycycler cluster --assemblies assemblies/*.fasta --reads SB6659_filR.fastq.gz --out_dir cluster_output"
```
For the subsequent cluster choice, I only kept cluster 1: from the hybrid assembly, I know this strain carries no plasmids, so I only need the cluster corresponding to the chromosome. Cluster 1 contains several contigs of the right size.
```
sbatch -c 16 -p bebp --qos=bebp --wrap="trycycler reconcile --reads SB6659_filR.fastq.gz --cluster_dir cluster_output/cluster_001"
```
The first launch of contig reconciliation failed because the size differences between contigs in cluster 1 were too large. As suggested in the error message, I removed the contigs that were not ~5.26 Mb long and relaunched the command.
Failed again:
> Error: failed to circularise sequence C_Utg866 for multiple reasons. You must
either repair this sequence or exclude it and then try running trycycler
reconcile again.
Removed the sequence and started it again. Same problem with the F contig. Removed it and started again. This time it seems the program ran correctly.
The next step is a multiple sequence alignment:
```
sbatch -c 16 -p bebp --qos=bebp --wrap="trycycler msa --cluster_dir cluster_output/cluster_001"
```
> Error: MUSCLE failed to complete on 4988 of the 4988 pieces. Please remove the most divergent sequences from this cluster and then try again.
The problem lies in the default MUSCLE version on the cluster: you should load muscle/3.8.31 instead of muscle/5.1 (the default). After switching versions, the command ran correctly.
Partitioning reads:
```
sbatch -c 16 -p bebp --qos=bebp --wrap="trycycler partition --reads SB6659_filR.fastq.gz --cluster_dirs cluster_output/cluster_001"
```
Consensus sequence:
```
sbatch -c 16 -p bebp --qos=bebp --wrap="trycycler consensus --cluster_dir cluster_output/cluster_001"
```
Running Medaka to polish the assembly:
```
sbatch -c 16 -p bebp --qos=bebp --wrap="medaka_consensus -i cluster_output/cluster_001/4_reads.fastq -d cluster_output/cluster_001/7_final_consensus.fasta -o cluster_output/cluster_001/medaka -m r941_min_sup_g507 -t 12"
mv cluster_output/cluster_001/medaka/consensus.fasta cluster_output/cluster_001/8_medaka.fasta
```
Error:
> /opt/gensoft/exe/medaka/1.4.4/bin/medaka_consensus: line 5: python: command not found
Loading python and/or adding `python` at the beginning of the command doesn't solve it.
## Overview of the assembly
### Overview of contigs (Bandage)
To look at the assembly you obtained, you can open the generated .gfa file with Bandage to visualize contigs. Ideally, your assembly should look like this:

Here, both the chromosome and the plasmids are closed and contain only one segment, so there is no ambiguity on either the nature of the DNA (plasmid or chromosome) or the order of the sequences.
### Visualizing contigs content (SnapGene Viewer)
To get a more detailed picture of the different contigs, you can open the assembly (this time the fasta file) in SnapGene Viewer. You can look at the sequence, search for a specific motif, etc.
## Analysis of plasmids
If your assembly is good enough to separate plasmids from the chromosome, you can use several tools to analyze plasmid content.
### Extracting plasmid sequences (SnapGene Viewer)
First, we open the assembly in SnapGene Viewer and take a look at the plasmids. You can annotate probable ORFs with SnapGene Viewer's Detect Features tool.

In the Sequence tab, you can copy the entire sequence of the plasmid and paste it in a text editor to create a separate fasta file.
### Identification of plasmids (abricate)
Now that we have a separate file for the plasmid, we can use abricate to see if this plasmid has already been identified previously. To do so, you need to specify PlasmidFinder as the database to be searched.
```
sbatch -p bebp --qos=bebp -c 2 --wrap="abricate --db plasmidfinder \
plasmid.fasta > abricate_plasmidfinder_output.txt"
```
### Identification of resistance genes carried by plasmids (abricate)
Abricate can also be used to detect plasmid resistance genes. Here, I used the ResFinder database, but you can also search the NCBI resistance genes database or the CARD database. (Results were the same using different databases in my case.)
```
sbatch -p bebp --qos=bebp -c 2 --wrap="abricate --db resfinder \
plasmid.fasta > abricate_resfinder_output.txt"
```
### Annotation of CDS (prokka)
We can also use Prokka to detect CDS and annotate them when they correspond to genes that have already been described. The GenBank (.gbk) output file can then be opened with SnapGene Viewer to visualize the predicted/annotated ORFs. You can use this to verify abricate's predictions of resistance genes.
```
sbatch -p bebp --qos=bebp -c 2 --wrap="prokka --outdir prokka_outputs \
--prefix sample_name plasmid.fasta"
```

### Comparison of similar plasmids (easyfig)
We can also compare the content of similar plasmids with easyfig. Here, I compared 4 plasmids sequenced from different strains that have very similar contents.

Chiara suggested that the middle region that is different in the unidentified plasmid might be the one used by PlasmidFinder to identify the other plasmids (which would explain why it does not yield any identification for this one). To check whether this region is present in the other two plasmids (last two lines), I reorganized the comparison:

It looks like this region is not present in the other two either (which *are* identified by PlasmidFinder).
I compared plasmids carrying similar metal resistance genes or antibiotic resistance genes:

### Clustering plasmids (mafft, FastTree)
Finally, we can cluster plasmids to see which are more similar by aligning plasmid sequences using Mafft and creating a tree from the alignment using FastTree. The resulting .tree file can be visualized in iTOL.
```
sbatch -p bebp --qos=bebp -c 2 --wrap="mafft \
--auto plasmid_seqs.fasta > mafft_output.fasta"
```
NB: it took forever with `-c 2` so next time I'm trying to run it with more cores (`-c 32`?) to see how much faster it goes.
```
sbatch -p bebp --qos=bebp -c 2 --wrap="FastTree -gtr -nt mafft_output.fasta \
> fasttree_plasmids.tree"
```

### Clustering plasmids with Panaroo
Panaroo compares genomes on a gene presence/absence basis rather than on a sequence basis. It then generates a graph summarizing the structure of the pangenome of the analyzed genomes. I ran it on all putative plasmids, on Melanie's suggestion. The command I used was the following (Panaroo runs on the .gff files generated as an output of Prokka, see previous section):
```
panaroo -i /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/analysis_all_CG45/putative_plasmids_analysis/prokka/gff_files/*.gff \
-o /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/analysis_all_CG45/putative_plasmids_analysis/panaroo \
--clean-mode strict
```
(I ran the command from a slurm script.)
The resulting .gml graph file can be visualized in Cytoscape:


Each node corresponds to one gene, which I colored based on the number of plasmids it is found in (between 1 and 13), as suggested here: https://gtonkinhill.github.io/panaroo/#/vis/cytoscape.
Panaroo also generates data tables summarizing where each gene is found.
Distribution of the number of plasmids in which each gene appears:

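That per-gene plasmid count can be recomputed from the presence/absence table. A sketch on toy data (assuming a `gene_presence_absence.csv` layout with three metadata columns followed by one column per sample; check the header of your own Panaroo output first, as the exact column layout may differ):

```shell
cd "$(mktemp -d)"
# Toy table: 3 metadata columns, then one column per plasmid
# (a non-empty cell means the gene is present in that plasmid).
cat > gene_presence_absence.csv <<'EOF'
Gene,Non-unique Gene name,Annotation,p1,p2,p3
geneA,,hypothetical protein,ctg1_1,ctg2_5,ctg3_2
geneB,,transposase,ctg1_9,,
EOF
# Count, for each gene, how many plasmids it appears in
awk -F',' 'NR > 1 {
  n = 0
  for (i = 4; i <= NF; i++) if ($i != "") n++
  print $1, n
}' gene_presence_absence.csv
# → geneA 3
# → geneB 1
```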
## Analyzing the IS content of a genome
Since I've already been using ISEScan over the previous months, let's keep up the good work and look at insertion sequences in the assemblies we generated. ISEScan is very simple to use:
```
sbatch -p bebp --qos=bebp -c 2 --wrap="isescan.py \
--seqfile path_to_assembly_fasta_file --output output_directory --nthread 2"
```

## Running Kleborate
Launch Kleborate from a folder in which you have write permissions, otherwise it won't work.
```
sbatch -c 16 --qos=bebp -p bebp -o kleborate.o -e kleborate.e \
--wrap="kleborate --assemblies /path/to/my/assemblies/*.fasta \
--outfile Kleborate_my_assemblies.txt --all"
```
I ran Kleborate both on the whole assemblies and on isolated plasmids, to check for the location of resistance genes indicated in the Kleborate report.
## Generating a tree of CG45 isolates (parsnp)
I ran parsnp to generate a quick alignment of my assemblies with the following command:
```
sbatch -p bebp --qos=bebp -c 12 --wrap="parsnp -p 12 \
-d /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/analysis_all_CG45/hybrid_assemblies/fasta_files/ \
-r !"
```
The `-r !` option tells parsnp to pick a random genome as reference. I added a ST485 genome to my folder to be used as an outgroup.
NB: don't copy-paste the command from a Windows word processor: hyphens get silently replaced with en-dashes (–), which parsnp doesn't recognize as option prefixes, so it rejects the `-d mygenomes` parameter and won't run. I lost an hour trying to figure out what I did differently when I reran the command by copy-pasting it.
One of the output files is a .tree file.
I started editing the tree in Microreact: https://microreact.org/project/9KXy7dizUQkSLqDwo9Q1Vt-cg45tree
I'm also working on it in Inkscape:

I launched this on whole assemblies because, for some of the messier assemblies, it's difficult to say whether a contig is actually part of the core genome or not. Still, even if it's not ideal, I also launched a parsnp alignment on what I assume to be the core genomes after sorting the contigs as best I could. It changed almost nothing:

## Searching for metal resistance genes
I'll try to use BacMet to see if I missed metal resistance genes. I downloaded the BacMet database of predicted resistance genes and I'm going to blast my plasmids against it.
## To do
- Low-quality hybrid assemblies: use the long-read assembly and polish it with Illumina data? (Polypolish)
- Make a BRIG figure to compare genomes?
- Check the FRC0137 paper: second AMR segment (RG12) is found on the chromosome, so the composite transposon probably does transpose
- Compare results obtained with digIS vs ISEScan
- Write guide to useful genomics tools
## Helpful commands
Looping through a file and executing a command on it:
```
while IFS= read -r line; do
echo "Text read from file: $line"
done < my_filename.txt
```
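A quick self-contained check of the loop (the `</dev/null` redirection in the body is a precaution of mine so that commands which read stdin, such as `ssh`, don't swallow the rest of the file through the loop's input):

```shell
# Demo in a scratch directory: run a command on every line of a file.
cd "$(mktemp -d)"
printf 'sample1\nsample2\n' > my_filename.txt
while IFS= read -r line; do
  echo "Text read from file: $line" </dev/null
done < my_filename.txt
# → Text read from file: sample1
# → Text read from file: sample2
```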
Splitting a multifasta into single-record fasta files (note that the whole header line becomes the filename, so this works best when headers are single words):
```
awk '{
if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fasta")}
print $0 > filename
}' multifasta.fasta
```
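A quick sanity check of that kind of splitter on a toy multifasta, run in a scratch directory:

```shell
cd "$(mktemp -d)"
printf '>seqA\nACGT\n>seqB\nTTTT\n' > multifasta.fasta
# Each '>' header opens a new output file named after the header
awk '{
  if (substr($0, 1, 1) == ">") { filename = (substr($0, 2) ".fasta") }
  print $0 > filename
}' multifasta.fasta
# creates seqA.fasta and seqB.fasta
```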
# Bibliography posters
I created a few summary posters as I read through the *Klebsiella pneumoniae* and Nanopore bibliography:



# Running ISEScan on Gabriele's genomes
Gabriele analyzed diphtheria genomes and found composite transposons in them. The idea is to confirm this finding by running ISEScan on his genomes and seeing whether we detect the insertion sequences.
```
sbatch -c 2 -p bebp --qos=bebp -o ISEScan_nanopore_Gabriele.o \
-e ISEScan_nanopore_Gabriele.e \
--wrap="cat /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_genomes_Gabriele/list_nanopore_genomes.txt \
| xargs -n 1 -P 2 -I{} isescan.py --seqfile {} \
--output /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_genomes_Gabriele/results_nanopore_assemblies \
--nthread 2"
```
Overall, ISEScan confirms his results, and identifies further insertion sequences in the regions he studied.
Here are the IS elements identified by ISEScan + ISFinder in FRC0137 and FRC0466 (I added the resistance genes identified by Gabriele):


# Running ISEScan on the *Bordetella* data
The idea is to look at the IS content of *Bordetella* genomes sequenced for the SARA project.
```
sbatch -c 12 -p bebp --mem=12gb --qos=bebp -o /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_bordetella_results/ISEScan_bordetella_genomes.o \
-e /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_bordetella_results/ISEScan_bordetella_genomes.e \
--wrap="cat /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/list_bordetella_genomes.txt \
| xargs -n 1 -P 12 -I{} isescan.py --seqfile {} \
--output /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_bordetella_results \
--nthread 2"
```
I have a problem with the command: it does not run correctly. In the log, the output stops at the beginning of the run (right after the 'prepare gff file' step). I do get the hmm and proteome intermediary files, but nothing beyond that, and only for the first 12 genomes (I launched a command that processes 12 genomes in parallel). I checked the permissions on the genomes and made a copy in my directory, but that did not solve the problem. It isn't a problem of cluster availability either (Carla had been running a command for days, so I tried running mine on the common partition, but it didn't change anything).
Error message in the .e log:
> isescan.py terminated by signal 9. Detected 2 oom-kill events in StepID= 8639548.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I tried relaunching after specifying the memory requirements (`--mem=12gb`).
Now it runs past the 'prepare gff file' step, but can't manage to open the .out proteome file once it's generated ('no such file or directory', although it's there, I checked). I also checked the permissions, and those files can be opened. For some of the genomes it generates empty .out.tmp.0 (or 1) files, which I can't explain.
Since Carla is done with her process I tried relaunching after reorganizing my folders (results folder in the same folder as the genomes), with the exact same command as the one I used for Gabriele's genomes:
```
sbatch -c 12 -p bebp --mem=12gb --qos=bebp -o /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/Bordetella_genomes/ISEScan_bordetella_results/ISEScan_bordetella_genomes.o \
-e /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/Bordetella_genomes/ISEScan_bordetella_results/ISEScan_bordetella_genomes.e \
--wrap="cat /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/list_bordetella_genomes.txt \
| xargs -n 1 -P 2 -I{} isescan.py --seqfile {} \
--output /pasteur/zeus/projets/p01/Klebsiella-ngs/Flora/ISEScan_bordetella_results \
--nthread 2"
```
This time it worked (although it took several DAYS).
Here is an overview of the IS content in these genomes:

The distribution in most of the genomes looks like this:

For the two remaining genomes (FR6464 and FR6461) it looks like this:

# Submitting a poster abstract to the YRLS conference
> Klebsiella pneumoniae is a widespread multi-host pathogen and one of the major agents responsible for nosocomial infections globally. Multidrug-resistant (MDR) isolates are usually associated with the hospital setting, whilst community-acquired infections are caused by isolates carrying virulence determinants (hv), which are responsible for severe disease. Convergence of these two traits on the same isolates represents a major public health threat, which is why K. pneumoniae strains should be monitored through sequencing and genomic techniques. Given its widespread distribution across hosts and the environment, the understanding of K. pneumoniae ecology and epidemiology benefits from the One Health approach applied to genomics. Untangling the interconnections between ecological niches (humans, animals, environment) through genomics (e.g., monitoring the distribution of lineages and serotypes, as well as the flow of virulence and resistance genes in the different compartments) can greatly benefit public health and improve our understanding of K. pneumoniae dynamics.
>
> Here, we combined short-read Illumina sequencing and long-read Oxford Nanopore sequencing to facilitate genomic surveillance of K. pneumoniae in a One Health framework. Oxford Nanopore sequencing allows rapid generation of high-quality genomic assemblies that can easily be scanned for resistance and virulence determinants. We analyze nineteen isolates belonging to the K. pneumoniae clonal group (CG) 45, which has been reported across continents and can be found in a variety of hosts as well as in the environment. CG45 comprises both MDR and hv isolates and is therefore considered a high-risk clonal group. Our work focuses on plasmid-borne genes found in CG45 isolates, which remain poorly described. This description will contribute to a better understanding of K. pneumoniae population genomics and of the plasmids disseminating resistance genes, which is crucial to proper public health surveillance.
Here is the poster I presented (with a flashtalk on the 10/06/2022):

# Completing Kp host metadata
Chiara asked me to take a look at the file containing metadata on the diverse ST45 isolates that have been sequenced (by the lab or by other entities): not all available isolates are included, and some information is missing, so the aim is to add the missing isolates and information. I completed the host metadata whenever I could (on a copy of the original data table that I put in my own folder) and looked at the distribution of hosts across the main STs.


I added data from Carla's work for ST45 and ST37 (CG45_CG37_infos.xlsx), as well as data from NCBI isolates that were not listed in the table yet (Klebsiella_genomes_14832_NCBI_160222). The table with everything in it is called kp1_hosts_added_ncbi.xlsx. I've started playing around with Excel's graph options but I'm not very good at it. I have a few graphs though:




The NCBI genomes hadn't been typed yet, so I downloaded them using Alexis' script [wgetGenBankWGS](https://gitlab.pasteur.fr/GIPhy/wgetGenBankWGS). 22 genomes are missing in the final file; I'll dig into that later. I renamed the genomes to get simpler names, using a command written by Chiara (see Rename_from_csv.sh). I then launched Kleborate on all genomes:
```
sbatch -c 16 --qos=bebp -p bebp -o kleb.o -e kleb.e --wrap="kleborate --assemblies *.fasta --outfile Kleborate.txt --all"
```
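As an aside, Rename_from_csv.sh is Chiara's script, not reproduced here; this is only a minimal sketch of the rename-from-CSV idea, assuming a two-column `old,new` mapping (all file names below are made up):

```shell
# stand-in genome files and a hypothetical accession -> simple-name mapping
touch GCA_000001.1.fasta GCA_000002.1.fasta
cat > rename_map.csv <<'EOF'
GCA_000001.1,kp_iso_01
GCA_000002.1,kp_iso_02
EOF
# read each line, split on the comma, and rename the corresponding fasta
while IFS=, read -r old new; do
    mv "${old}.fasta" "${new}.fasta"
done < rename_map.csv
```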
I had a few issues with the Kleborate run. First, it stopped after about 1700 genomes with an error caused by the sheer number of genomes passed on the command line:
```
/opt/gensoft/exe/Kleborate/2.0.0/scripts/kleborate: line 51: /usr/bin/logger: Argument list too long
Error: Kaptive failed to run with the following error:
Error: tblastn encountered an error:
free(): invalid pointer
```
I relaunched it after removing the genomes that were already typed (`for genome in $(cat list_of_genomes.txt); do mv "$genome".fasta newfolder; done`). Then it ran into an issue with one of the genomes:
```
Error: Kaptive failed to run with the following error:
Error: tblastn encountered an error:
munmap_chunk(): invalid pointer
```
Again, I relaunched it and set the problematic genome aside.
After digging around for ~5 seconds, it turns out that this is a BLAST issue rather than an issue with my genomes: https://github.com/katholt/Kaptive/issues/13. Since I don't understand how exactly I'm supposed to edit the Kaptive database to remove the warning (and I don't want to meddle with files I don't understand), I'll just keep relaunching the Kleborate analysis, I guess.
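An alternative to relaunching by hand would be to batch the run from the start, so that no single Kleborate invocation sees thousands of file names. This is a dry-run sketch with made-up file names; the batch size of 500 is an arbitrary guess:

```shell
# stand-in genome list (1234 entries); in practice: ls *.fasta > genome_list.txt
printf 'genome_%d.fasta\n' $(seq 1 1234) > genome_list.txt
# split the list into numbered batches of at most 500 genomes each (GNU split)
split -l 500 -d genome_list.txt batch_
for b in batch_*; do
    # the real run would submit one Kleborate job per batch, e.g.:
    # sbatch -c 16 --qos=bebp -p bebp --wrap="kleborate --assemblies $(cat $b) --outfile Kleborate_${b}.txt --all"
    echo "${b}: $(wc -l < "$b") genomes"
done
```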
# Looking at the *Corynebacterium ulcerans* genomes
## *Corynebacterium ulcerans* toxin and PAI
Chiara asked me to look at the region containing the toxin in *ulcerans* genomes that carry the DTU but not the phage usually responsible for its dissemination. I found a pathogenicity island that had already been described in a few other strains (see [this paper](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0113-3)). It is a GC-poor region, distinct from the rest of the *ulcerans* genome.

I isolated the sequence of the PAI (from the genome of *C. ulcerans 05146*) and blasted it against the other *ulcerans* genomes. The results are available in the BLAST_PAI file. The PAI is present only in the genomes that have the toxin but no phage. (Blastn hits that are only about 2000 bp long correspond to the segment containing the toxin, not to the rest of the PAI.)
```
sbatch -c 12 -p bebp --qos=bebp --wrap="blastn -query toxin_specific_region_05146.fasta -db blastdb.fasta -outfmt '6 qseqid sseqid stitle pident qcovs length qlen mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq' -out BLAST_phages.txt"
```
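Given the output format string above, alignment length is column 6 of the tab-separated results, so toxin-only hits can be separated from full-PAI hits with a simple length threshold. A sketch on two made-up demo rows (the 5000 bp cutoff is an arbitrary choice between ~2 kb and the full PAI length):

```shell
# two synthetic outfmt-6 rows: one full-PAI hit, one toxin-segment-only hit
# columns: qseqid sseqid stitle pident qcovs length qlen mismatch gapopen qstart qend sstart send evalue bitscore
printf 'PAI\tgenomeA\tC. ulcerans genomeA\t99.1\t95\t38000\t40000\t120\t4\t1\t38000\t5000\t43000\t0.0\t65000\n' >  blast_demo.txt
printf 'PAI\tgenomeB\tC. ulcerans genomeB\t98.7\t5\t2100\t40000\t10\t1\t100\t2200\t900\t3000\t0.0\t3800\n' >> blast_demo.txt
# column 6 = alignment length: long hits span the PAI, short ones only the toxin
awk -F'\t' '$6 >= 5000 {print $2}' blast_demo.txt   # genomes with the full PAI
awk -F'\t' '$6 <  5000 {print $2}' blast_demo.txt   # genomes with the toxin segment only
```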

NB: FRC0995 and FRC1175 are marked as carrying incomplete toxins, but that is only because the toxin is split between two contigs. They actually contain the complete toxin (and the PAI).
For the few remaining genomes with the toxin but neither phage nor PAI (*C. ulcerans 131001*, *C. ulcerans 210932*, FRC1184, FRC1237, FRC1250, FRC1283), I looked around the toxin gene a bit to see whether I could find anything interesting linked to gene mobility. No luck yet.
## *Corynebacterium ulcerans* IS content
I launched ISEScan on the *ulcerans* genomes:
```
sbatch -c 12 -p bebp --mem 12gb --qos=bebp \
-o /pasteur/zeus/projets/p01/Corynebacterium-ngs/Flora/ISEScan_run/ISEScan_ulcerans.o \
-e /pasteur/zeus/projets/p01/Corynebacterium-ngs/Flora/ISEScan_run/ISEScan_ulcerans.e \
--wrap="cat /pasteur/zeus/projets/p01/Corynebacterium-ngs/Flora/ISEScan_run/list_ulcerans_genomes.list \
| xargs -n 1 -P 12 -I{} isescan.py --seqfile {} \
--output /pasteur/zeus/projets/p01/Corynebacterium-ngs/Flora/ISEScan_run/ \
--nthread 2"
```

Detailed results are available in my ISEScan_run folder.

I also looked at the IS content of the dog outbreak (I picked genomes for which the host is a dog and that carry the DTU_6 allele):

## Phylogeny of *Corynebacterium* phages
Chiara suggested I extract the phage sequences from both the *diphtheriae* and *ulcerans* genomes to build a phylogeny and compare it to the phylogeny of the DTU. I extracted the sequence of the phage from *C. ulcerans 0102* and blasted it against the other genomes. The results are available in the blast_phages directory.
```
sbatch -c 12 -p bebp --qos=bebp --wrap="blastn -query Culcerans_0102_phage.fasta -db blastdb.fasta -outfmt '6 qseqid sseqid stitle pident qcovs length qlen mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq' -out BLAST_phages.txt"
```
I then retrieved the phage sequence as best I could from a few representative genomes and ran mafft and FastTree on the sequences.
```
sbatch -p bebp --qos=bebp -c 16 --wrap="mafft --auto representative_phage_sequences.fasta > mafft_phages.fasta"
sbatch -p bebp --qos=bebp -c 2 --wrap="FastTree -gtr -nt mafft_phages.fasta > fasttree_phages.tree"
```

Here is the phylogeny obtained using the DTU sequences:

# Nanopore analysis pipeline
Chiara asked me to write a comprehensive script containing all the steps to go from raw fast5 data to a hybrid or long-read assembly (so that you only have to launch that one script to perform the analysis). Here is the pipeline that I wrote (I'm still missing a test that would include the basecalling and demultiplexing steps, but the rest is working, provided you organize your files the right way). I adapted it for either the Mk1C or the classic MinION, and for hybrid, long-read-only, or double assembly.
## slurm file
```
#!/bin/bash
#SBATCH --job-name=nanopore_analysis # Job name
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=user@pasteur.fr # Where to send mail
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Number of processes
#SBATCH --cpus-per-task=64
#SBATCH --mem=32gb # Total memory limit
#SBATCH --partition=bebp
#SBATCH --qos=bebp
#SBATCH --output=nanopore_analysis.log # Standard output and error log
./nanopore_analysis_pipeline.sh
```
## bash script
```
#!/bin/bash
### defining variables (this is where the user should change variable names
### according to their own files)
# fast5 data and Illumina data to be used as input
path_to_fast_5='/pasteur/zeus/projets/p01/my_project/Nanopore_data/my_experiment'
path_to_illumina='/pasteur/zeus/projets/p01/my_project/Illumina_data'
# I'm assuming the Illumina files are all put together and in the following format:
# sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz etc
# paths to initially empty directories that will contain intermediary files
path_to_basecalled_data='/pasteur/zeus/projets/p01/my_project/Nanopore_data/basecalled_data'
path_to_demultiplexed_data='/pasteur/zeus/projets/p01/my_project/Nanopore_data/demultiplexed_data'
# other variables
flowcell='FLO-MIN106'
kit='RBK-004'
sample_names=( 'sample_name1' 'sample_name2' 'sample_name3' 'sample_name4' 'sample_name5' 'sample_name6' \
'sample_name7' 'sample_name8' 'sample_name9' 'sample_name10' 'sample_name11' 'negative_control' )
barcodes=( 'barcode01' 'barcode02' 'barcode03' 'barcode04' 'barcode05' 'barcode06' 'barcode07' 'barcode08' \
'barcode09' 'barcode10' 'barcode11' 'barcode12' )
echo '--------------------------------------------------------'
echo 'Printing out variables to check for potential mistakes:'
echo '--------------------------------------------------------'
echo 'Path to fast5 raw data:' $path_to_fast_5
echo 'Path to Illumina data:' $path_to_illumina
echo 'Flowcell type:' $flowcell
echo 'Kit:' $kit
echo 'First sample:' ${sample_names[0]}
echo ' '
### loading modules
echo '--------------------------------------------------------'
echo 'Loading modules.'
echo '--------------------------------------------------------'
echo ' '
module load cuda
module load ont-guppy/4.5.2_gpu
module load Filtlong/0.2.0
module load graalvm/ce-java8-20.0.0
module load SPAdes/3.14.0 samtools/1.10 bowtie2/2.3.5.1 blast+/2.10.0 pilon/1.23 racon/1.4.3
module load Unicycler/0.4.8
#### basecalling
echo '--------------------------------------------------------'
echo 'Launching basecalling.'
echo '--------------------------------------------------------'
echo ' '
guppy_basecaller \
--input_path $path_to_fast_5 \
--save_path $path_to_basecalled_data --num_callers 1 --cpu_threads_per_caller 32 \
--disable_pings --flowcell $flowcell --kit $kit --compress_fastq
echo '--------------------------------------------------------'
echo 'Done basecalling.'
echo '--------------------------------------------------------'
echo ' '
#### demultiplexing
echo '--------------------------------------------------------'
echo 'Launching demultiplexing.'
echo '--------------------------------------------------------'
echo ' '
guppy_barcoder \
-i $path_to_basecalled_data/pass \
-s $path_to_demultiplexed_data \
--barcode_kits $kit \
--trim_barcodes
echo '--------------------------------------------------------'
echo 'Done demultiplexing.'
echo '--------------------------------------------------------'
echo ' '
### Concatenate reads in the same file
echo '--------------------------------------------------------'
echo 'Concatenating reads and reorganizing data.'
echo '--------------------------------------------------------'
echo ' '
cd $path_to_demultiplexed_data
for d in barcode*/ ; do
cd "$d"
# NB: rerunning this step would feed the previously concatenated file back in
cat *.fastq.gz > "${d%/}".fastq.gz
cd ..
done
#### reorganize files: rename folders and outputs & put ONT reads for one barcode
#### in the same directory with Illumina reads
for i in "${!barcodes[@]}";do
mv ${barcodes[i]} ${sample_names[i]}
cd ${sample_names[i]}
mv ${barcodes[i]}.fastq.gz ${sample_names[i]}.fastq.gz
cd ..
done
for i in "${sample_names[@]::${#sample_names[@]}-1}";do
mv $path_to_illumina/${i}_R1.fastq.gz $path_to_demultiplexed_data/${i}
mv $path_to_illumina/${i}_R2.fastq.gz $path_to_demultiplexed_data/${i}
done
echo '--------------------------------------------------------'
echo 'Done reorganizing data.'
echo '--------------------------------------------------------'
echo ' '
#### filter long reads (I exclude the last sample because I assume it is your empty control)
echo '--------------------------------------------------------'
echo 'Filtering reads.'
echo '--------------------------------------------------------'
echo ' '
for i in "${sample_names[@]::${#sample_names[@]}-1}";do
cd $path_to_demultiplexed_data/${i}
filtlong \
--min_length 1000 --keep_percent 90 --target_bases 500000000 \
${i}.fastq.gz | gzip > ${i}_filR.fastq.gz
done
echo '--------------------------------------------------------'
echo 'Done filtering reads.'
echo '--------------------------------------------------------'
echo ' '
#### hybrid assembly (I exclude the last sample because I assume it is your empty control)
echo '--------------------------------------------------------'
echo 'Launching hybrid assembly.'
echo '--------------------------------------------------------'
echo ' '
for i in "${sample_names[@]::${#sample_names[@]}-1}";do
cd $path_to_demultiplexed_data/${i}
unicycler -1 ${i}_R1.fastq.gz -2 ${i}_R2.fastq.gz \
-l ${i}_filR.fastq.gz --out ${i}_h --verbosity 2 \
--keep 2 --threads 64 --mode normal
done
echo '--------------------------------------------------------'
echo 'Done with hybrid assembly.'
echo '--------------------------------------------------------'
echo ' '
#### long-reads assembly
echo '--------------------------------------------------------'
echo 'Launching long-reads assembly.'
echo '--------------------------------------------------------'
echo ' '
for i in "${sample_names[@]::${#sample_names[@]}-1}";do
cd $path_to_demultiplexed_data/$i
unicycler -l ${i}_filR.fastq.gz --out ${i}_l --verbosity 2 \
--keep 2 --threads 64 --mode normal
done
echo '--------------------------------------------------------'
echo 'Done with long-reads assembly.'
echo '--------------------------------------------------------'
echo ' '
#### Control: long-reads assembly with negative control (should fail)
echo '--------------------------------------------------------'
echo 'Launching long-reads assembly for negative control.'
echo '--------------------------------------------------------'
echo ' '
for i in "${sample_names[-1]}";do
cd $path_to_demultiplexed_data/$i
unicycler -l ${i}.fastq.gz --out ${i}_l --verbosity 2 \
--keep 2 --threads 64 --mode normal
done
echo '--------------------------------------------------------'
echo 'Done with long-reads assembly for negative control.'
echo '--------------------------------------------------------'
echo ' '
```
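Once the pipeline has run, a quick sanity check on each Unicycler output is to count contigs: a complete hybrid assembly of a *K. pneumoniae* isolate should give one chromosome plus a few plasmid contigs, while dozens of contigs suggest a fragmented assembly. A sketch on a stand-in fasta (the header format mimics Unicycler's but the file is made up):

```shell
# stand-in assembly.fasta: a chromosome-sized contig plus one plasmid-sized contig
printf '>1 length=5300000 depth=1.00x circular=true\nACGT\n>2 length=120000 depth=2.31x circular=true\nACGT\n' > assembly.fasta
# count contigs by counting fasta headers
grep -c '^>' assembly.fasta
```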