---
tags: JPL-HBCU
---
[TOC]
# Simulating illumina metagenomic datasets from some of our previously classified genomes
## Figuring out some target genomes to include
Working from the file Ran sent me "counts_for_sim.txt", can be pulled grabbed with the following:
```bash
curl -L -o counts_for_sim.tsv https://ndownloader.figshare.com/files/24054926
```
Summing counts of each taxa and ordering in R:
```R
tab <- read.table("counts_for_sim.tsv", sep="\t", header=T, row.names=1, check.names=F)
sums <- data.frame(sort(colSums(tab), decreasing=T))
col_1 <- row.names(sums)
col_2 <- sums[,1]
sums_df <- data.frame("taxa"=col_1, "summed_counts"=col_2)
write.table(sums_df, "ordered-summed-taxa.tsv", sep="\t", quote=F, row.names=F)
```
Downloading NCBI RefSeq summary table (initially done on 24-July-2020):
```bash
curl -L -o ncbi_RefSeq_assembly_info.tsv ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
```
Doing a quick `grep` to see how many (if any) are labeled this way in here in there in the organism_name field:
```bash
IFS=$'\n'
for tax in $(tail -n +2 ordered-summed-taxa.tsv | cut -f 1)
do
count=$(grep -c -w "$tax" ncbi_RefSeq_assembly_info.tsv)
printf "$tax\t$count\n"
done | sort -t $'\t' -nrk 2 > genome_counts_by_org_name.tsv
```
Getting those that have at least 1 in RefSeq by that name:
```bash
awk -F $'\t' ' $2 > 0 ' genome_counts_by_org_name.tsv > present_genome_counts_by_org_name.tsv
# count:
wc -l present_genome_counts_by_org_name.tsv
# 48
```
There are 48 present, the number available per taxa is a precipitous slope of course:
```bash
column -ts $'\t' present_genome_counts_by_org_name.tsv | head | sed 's/^/# /'
```
```
# Pseudomonas aeruginosa 5115
# Staphylococcus epidermidis 783
# Micrococcus luteus 68
# Alcaligenes faecalis 28
# Pseudomonas monteilii 25
# Corynebacterium jeikeium 23
# Bradyrhizobium japonicum 21
# Moraxella osloensis 15
# Sphingobium yanoikuyae 14
# Acidovorax citrulli 11
```
For now, just taking 2 of each of the top 12 which will give us 60 total to simulate:
```bash
head -n 12 present_genome_counts_by_org_name.tsv | cut -f 1 > targets-for-2-genomes.txt
tail -n +13 present_genome_counts_by_org_name.tsv | cut -f 1 > targets-for-1-genome.txt
```
Getting assembly accessions and taxids of each:
```bash
IFS=$'\n'
for tax in $(cat targets-for-2-genomes.txt)
do
grep -w -m 2 "$tax" ncbi_RefSeq_assembly_info.tsv | cut -f 1,6,8
done > p1.tmp
for tax in $(cat targets-for-1-genome.txt)
do
grep -w -m 1 "$tax" ncbi_RefSeq_assembly_info.tsv | cut -f 1,6,8
done > p2.tmp
cat <(printf "accession\ttaxid\torg_name\n") p1.tmp p2.tmp > target-genome-info.tsv
tail -n +2 target-genome-info.tsv | cut -f 1 > target-accessions.txt
rm p1.tmp p2.tmp
```
## Grabbing target genomes and full lineage info
Conda installation of my [`bit`](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) program to help grab the genomes and get lineage info
```bash
conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.08
conda activate bit
```
Downloading genome fasta files by their accessions:
```bash
bit-dl-ncbi-assemblies -w target-accessions.txt -f fasta -j 5
```
Gunzipping the assemblies because inSilicoSeq which we'll be using doesn't take gzipped files:
```bash
gunzip G*.gz
```
Getting lineage info for each taxid (uses the glorious [taxonkit](https://bioinf.shenwei.me/taxonkit/)):
```bash
cut -f 2 target-genome-info.tsv | tail -n +2 > target-taxids.txt
bit-get-lineage-from-taxids -i target-taxids.txt -o target-genome-lineages.tmp
paste target-genome-info.tsv <( cut -f 2- target-genome-lineages.tmp ) > tab.tmp && mv tab.tmp target-genome-info.tsv
rm target-genome-lineages.tmp
conda deactivate
```
### Trimming down primate genomes
There are a handful of misclassified primate genomes in here, like macaques:
```bash
grep "Euk" target-genome-info.tsv | grep "Chordata" | column -ts $'\t' | sed 's/^/# /'
```
```
# GCF_000956065.1 9545 Macaca nemestrina Eukaryota Chordata Mammalia Primates Cercopithecidae Macaca Macaca nemestrina
# GCF_002880775.1 9601 Pongo abelii Eukaryota Chordata Mammalia Primates Hominidae Pongo Pongo abelii
# GCF_013052645.1 9597 Pan paniscus Eukaryota Chordata Mammalia Primates Hominidae Pan Pan paniscus
# GCF_000951045.1 9568 Mandrillus leucophaeus Eukaryota Chordata Mammalia Primates Cercopithecidae Mandrillus Mandrillus leucophaeus
# GCF_000951035.1 336983 Colobus angolensis palliatus Eukaryota Chordata Mammalia Primates Cercopithecidae Colobus Colobus angolensis
# GCF_000409795.2 60711 Chlorocebus sabaeus Eukaryota Chordata Mammalia Primates Cercopithecidae Chlorocebus Chlorocebus sabaeus
```
We want to retain them somewhat, because this happens, but we can't keep the full genomes in there because they are ~1,000 times larger. So going to cut them down to ~3 MB, and keep just that fragment of them:
```bash
grep "Euk" target-genome-info.tsv | grep "Chordata" | cut -f 1 > accs-to-trim.txt
for acc in $(cat accs-to-trim.txt)
do
echo -e "\n Workin on ${acc}"
printf ">${acc}\n" > ${acc}-trimmed.fa
grep -v "^>" ${acc}.fa | tr -d "\n" | cut -c 1-3000000 >> ${acc}-trimmed.fa
rm ${acc}.fa
done
```
## Making map of assembly-accession-to-contigs
The reads output from `inSilicoSeq` will have headers that start with the source contig identifier (up to any whitespace in the initial contig identifier). So here, making a file that has those linked to each assembly accession, which with the `target-genome-info.tsv` file made above, ties to taxonomy. The exact format of how this info will be wanted in the end will be dependent on how things are being done, but here's a start 🙂
```bash
# starting file with header
printf "assembly_accession\tcontig_IDs\n" > assembly-accession-to-contigs-map.tsv
# need to generate differently for the primate ones we trimmed down vs the ones we didn't trim down
# so making a file of just those accs we didn't trim down to iterate over
comm -23 <(sort target-accessions.txt) <(sort accs-to-trim.txt) > accs-not-trimmed.txt
for acc in $(cat accs-not-trimmed.txt)
do
contigs=$(grep ">" ${acc}.fa | tr -d ">" | cut -f 1 -d " " | tr "\n" ";" | sed 's/.$//')
printf "$acc\t$contigs\n" >> assembly-accession-to-contigs-map.tsv
done
# now adding info for those we did trim down, which just have 1 contig named as the accession
for acc in $(cat accs-to-trim.txt)
do
printf "$acc\t$acc\n" >> assembly-accession-to-contigs-map.tsv
done
```
Multiple contigs are delimited by semi-colons in second column, e.g.:
```bash
head assembly-accession-to-contigs-map.tsv | column -ts $'\t' | sed 's/^/# /'
```
```
# assembly_accession contig_IDs
# GCF_000006605.1 NC_007164.1;NC_003080.1
# GCF_000006765.1 NC_002516.2
# GCF_000007645.1 NC_004461.1;NC_005008.1;NC_005007.1;NC_005006.1;NC_005005.1;NC_005004.1;NC_005003.1
# GCF_000011925.1 NC_002976.3;NC_006663.1
# GCF_000014625.1 NC_008463.1
# GCF_000015165.1 NC_009485.1;NC_009475.1
# GCF_000015325.1 NC_008752.1
# GCF_000023145.1 NC_012704.1
# GCF_000023205.1 NC_012803.1
```
## Simulating metagenomic datasets
Conda install of [`inSilicoSeq`](https://github.com/HadrienG/InSilicoSeq#insilicoseq):
```bash
conda create -y -n insilicoseq -c conda-forge -c bioconda -c defaults insilicoseq=1.4.6 biopython=1.77
conda activate insilicoseq
```
Fuller documentation of inSilicoSeq available [here](https://insilicoseq.readthedocs.io/en/latest/).
### No error model used (so perfect reads generated)
Generating dataset with **even** coverage, no errors/variation:
```bash
iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance uniform --mode perfect --output even-perfect-sim -z
mv even-perfect-sim_R1.fastq.gz even-perfect-sim-R1.fq.gz
mv even-perfect-sim_R2.fastq.gz even-perfect-sim-R2.fq.gz
mv even-perfect-sim_abundance.txt even-perfect-sim-abundance.tsv
rm even-perfect-sim_R1.fastq even-perfect-sim_R2.fastq
```
Generating dataset with **uneven** coverage ([exponential](https://insilicoseq.readthedocs.io/en/latest/iss/generate.html#abundance-distribution) used), no errors/variation:
```bash
iss generate --seed 100 -p 40 -n_reads 10M --draft *.fa --abundance exponential --mode perfect --output uneven-perfect-sim -z
mv uneven-perfect-sim_R1.fastq.gz uneven-perfect-sim-R1.fq.gz
mv uneven-perfect-sim_R2.fastq.gz uneven-perfect-sim-R2.fq.gz
mv uneven-perfect-sim_abundance.txt uneven-perfect-sim-abundance.tsv
rm uneven-perfect-sim_R1.fastq uneven-perfect-sim_R2.fastq
```
### Using packaged HiSeq error model
Generating dataset with **even** coverage, using HiSeq error model:
```bash
iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance uniform --model HiSeq --output even-hiseq-sim -z
mv even-hiseq-sim_R1.fastq.gz even-hiseq-sim-R1.fq.gz
mv even-hiseq-sim_R2.fastq.gz even-hiseq-sim-R2.fq.gz
mv even-hiseq-sim_abundance.txt even-hiseq-sim-abundance.tsv
rm even-hiseq-sim_R1.fastq even-hiseq-sim_R2.fastq
```
Generating dataset with **uneven** coverage ([exponential](https://insilicoseq.readthedocs.io/en/latest/iss/generate.html#abundance-distribution) used), and HiSeq error model:
```bash
iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance exponential --model HiSeq --output uneven-hiseq-sim -z
mv uneven-hiseq-sim_R1.fastq.gz uneven-hiseq-sim-R1.fq.gz
mv uneven-hiseq-sim_R2.fastq.gz uneven-hiseq-sim-R2.fq.gz
mv uneven-hiseq-sim_abundance.txt uneven-hiseq-sim-abundance.tsv
rm uneven-hiseq-sim_R1.fastq uneven-hiseq-sim_R2.fastq
```
## Downloading simulated datasets
These and their relative abundance tables can be downloaded from figshare with the following commands:
**even-perfect-sim**
```bash
curl -L -o even-perfect-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058625
curl -L -o even-perfect-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058631
curl -L -o even-perfect-sim-abundances.tsv https://ndownloader.figshare.com/files/24058619
```
**uneven-perfect-sim**
```bash
curl -L -o uneven-perfect-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058634
curl -L -o uneven-perfect-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058637
curl -L -o uneven-perfect-sim-abundances.tsv https://ndownloader.figshare.com/files/24058628
```
**even-hiseq-sim**
```bash
curl -L -o even-hiseq-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058652
curl -L -o even-hiseq-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058667
curl -L -o even-hiseq-sim-abundances.tsv https://ndownloader.figshare.com/files/24058658
```
**uneven-hiseq-sim**
```bash
curl -L -o uneven-hiseq-sim-R1.fq.gz https://ndownloader.figshare.com/files/24065039
curl -L -o uneven-hiseq-sim-R2.fq.gz https://ndownloader.figshare.com/files/24065048
curl -L -o uneven-hiseq-sim-abundances.tsv https://ndownloader.figshare.com/files/24065042
```
## Downloading information tables
Downloading the input genome summary table (`target-genome-info.tsv`) as generated [above](https://hackmd.io/@astrobiomike/simulating-illumina-reads#Grabbing-target-genomes-and-full-lineage-info):
```bash
curl -L -o target-genome-info.tsv https://ndownloader.figshare.com/files/24079580
```
Which looks like this:
```bash
head target-genome-info.tsv | column -ts $'\t' | sed 's/^/# /'
```
```
# accession taxid org_name domain phylum class order family genus specific_name
# GCF_000006765.1 208964 Pseudomonas aeruginosa PAO1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa
# GCF_000014625.1 208963 Pseudomonas aeruginosa UCBPP-PA14 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa
# GCF_000007645.1 176280 Staphylococcus epidermidis ATCC 12228 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis
# GCF_000011925.1 176279 Staphylococcus epidermidis RP62A Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis
# GCF_000023205.1 465515 Micrococcus luteus NCTC 2665 Bacteria Actinobacteria Actinobacteria Micrococcales Micrococcaceae Micrococcus Micrococcus luteus
# GCF_000176875.1 596312 Micrococcus luteus SK58 Bacteria Actinobacteria Actinobacteria Micrococcales Micrococcaceae Micrococcus Micrococcus luteus
# GCF_000275465.1 1156918 Alcaligenes faecalis subsp. faecalis NCIB 8687 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Alcaligenes Alcaligenes faecalis
# GCF_000739855.1 1218102 Alcaligenes faecalis subsp. faecalis NBRC 13111 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Alcaligenes Alcaligenes faecalis
# GCF_000262005.1 1123524 Pseudomonas monteilii QM Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas monteilii
```
And downloading the `assembly-accession-to-contigs-map.tsv` table as generated [above](https://hackmd.io/@astrobiomike/simulating-illumina-reads#Making-map-of-assembly-accession-to-contigs):
```bash
curl -L -o assembly-accession-to-contigs-map.tsv https://ndownloader.figshare.com/files/24079577
```
Which looks like this:
```bash
head assembly-accession-to-contigs-map.tsv | column -ts $'\t' | sed 's/^/# /'
```
```
# assembly_accession contig_IDs
# GCF_000006605.1 NC_007164.1;NC_003080.1
# GCF_000006765.1 NC_002516.2
# GCF_000007645.1 NC_004461.1;NC_005008.1;NC_005007.1;NC_005006.1;NC_005005.1;NC_005004.1;NC_005003.1
# GCF_000011925.1 NC_002976.3;NC_006663.1
# GCF_000014625.1 NC_008463.1
# GCF_000015165.1 NC_009485.1;NC_009475.1
# GCF_000015325.1 NC_008752.1
# GCF_000023145.1 NC_012704.1
# GCF_000023205.1 NC_012803.1
```