Simulating illumina metagenomic datasets from some of our previously classified genomes

--- tags: JPL-HBCU --- [TOC] # Simulating illumina metagenomic datasets from some of our previously classified genomes ## Figuring out some target genomes to include Working from the file Ran sent me "counts_for_sim.txt", can be pulled grabbed with the following: ```bash curl -L -o counts_for_sim.tsv https://ndownloader.figshare.com/files/24054926 ``` Summing counts of each taxa and ordering in R: ```R tab <- read.table("counts_for_sim.tsv", sep="\t", header=T, row.names=1, check.names=F) sums <- data.frame(sort(colSums(tab), decreasing=T)) col_1 <- row.names(sums) col_2 <- sums[,1] sums_df <- data.frame("taxa"=col_1, "summed_counts"=col_2) write.table(sums_df, "ordered-summed-taxa.tsv", sep="\t", quote=F, row.names=F) ``` Downloading NCBI RefSeq summary table (initially done on 24-July-2020): ```bash curl -L -o ncbi_RefSeq_assembly_info.tsv ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt ``` Doing a quick `grep` to see how many (if any) are labeled this way in here in there in the organism_name field: ```bash IFS=$'\n' for tax in $(tail -n +2 ordered-summed-taxa.tsv | cut -f 1) do count=$(grep -c -w "$tax" ncbi_RefSeq_assembly_info.tsv) printf "$tax\t$count\n" done | sort -t $'\t' -nrk 2 > genome_counts_by_org_name.tsv ``` Getting those that have at least 1 in RefSeq by that name: ```bash awk -F $'\t' ' $2 > 0 ' genome_counts_by_org_name.tsv > present_genome_counts_by_org_name.tsv # count: wc -l present_genome_counts_by_org_name.tsv # 48 ``` There are 48 present, the number available per taxa is a precipitous slope of course: ```bash column -ts $'\t' present_genome_counts_by_org_name.tsv | head | sed 's/^/# /' ``` ``` # Pseudomonas aeruginosa 5115 # Staphylococcus epidermidis 783 # Micrococcus luteus 68 # Alcaligenes faecalis 28 # Pseudomonas monteilii 25 # Corynebacterium jeikeium 23 # Bradyrhizobium japonicum 21 # Moraxella osloensis 15 # Sphingobium yanoikuyae 14 # Acidovorax citrulli 11 ``` For now, just taking 2 of each of the top 12 which will give us 60 total to simulate: ```bash head -n 12 present_genome_counts_by_org_name.tsv | cut -f 1 > targets-for-2-genomes.txt tail -n +13 present_genome_counts_by_org_name.tsv | cut -f 1 > targets-for-1-genome.txt ``` Getting assembly accessions and taxids of each: ```bash IFS=$'\n' for tax in $(cat targets-for-2-genomes.txt) do grep -w -m 2 "$tax" ncbi_RefSeq_assembly_info.tsv | cut -f 1,6,8 done > p1.tmp for tax in $(cat targets-for-1-genome.txt) do grep -w -m 1 "$tax" ncbi_RefSeq_assembly_info.tsv | cut -f 1,6,8 done > p2.tmp cat <(printf "accession\ttaxid\torg_name\n") p1.tmp p2.tmp > target-genome-info.tsv tail -n +2 target-genome-info.tsv | cut -f 1 > target-accessions.txt rm p1.tmp p2.tmp ``` ## Grabbing target genomes and full lineage info Conda installation of my [`bit`](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) program to help grab the genomes and get lineage info ```bash conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.08 conda activate bit ``` Downloading genome fasta files by their accessions: ```bash bit-dl-ncbi-assemblies -w target-accessions.txt -f fasta -j 5 ``` Gunzipping the assemblies because inSilicoSeq which we'll be using doesn't take gzipped files: ```bash gunzip G*.gz ``` Getting lineage info for each taxid (uses the glorious [taxonkit](https://bioinf.shenwei.me/taxonkit/)): ```bash cut -f 2 target-genome-info.tsv | tail -n +2 > target-taxids.txt bit-get-lineage-from-taxids -i target-taxids.txt -o target-genome-lineages.tmp paste target-genome-info.tsv <( cut -f 2- target-genome-lineages.tmp ) > tab.tmp && mv tab.tmp target-genome-info.tsv rm target-genome-lineages.tmp conda deactivate ``` ### Trimming down primate genomes There are a handful of misclassified primate genomes in here, like macaques: ```bash grep "Euk" target-genome-info.tsv | grep "Chordata" | column -ts $'\t' | sed 's/^/# /' ``` ``` # GCF_000956065.1 9545 Macaca nemestrina Eukaryota Chordata Mammalia Primates Cercopithecidae Macaca Macaca nemestrina # GCF_002880775.1 9601 Pongo abelii Eukaryota Chordata Mammalia Primates Hominidae Pongo Pongo abelii # GCF_013052645.1 9597 Pan paniscus Eukaryota Chordata Mammalia Primates Hominidae Pan Pan paniscus # GCF_000951045.1 9568 Mandrillus leucophaeus Eukaryota Chordata Mammalia Primates Cercopithecidae Mandrillus Mandrillus leucophaeus # GCF_000951035.1 336983 Colobus angolensis palliatus Eukaryota Chordata Mammalia Primates Cercopithecidae Colobus Colobus angolensis # GCF_000409795.2 60711 Chlorocebus sabaeus Eukaryota Chordata Mammalia Primates Cercopithecidae Chlorocebus Chlorocebus sabaeus ``` We want to retain them somewhat, because this happens, but we can't keep the full genomes in there because they are ~1,000 times larger. So going to cut them down to ~3 MB, and keep just that fragment of them: ```bash grep "Euk" target-genome-info.tsv | grep "Chordata" | cut -f 1 > accs-to-trim.txt for acc in $(cat accs-to-trim.txt) do echo -e "\n Workin on ${acc}" printf ">${acc}\n" > ${acc}-trimmed.fa grep -v "^>" ${acc}.fa | tr -d "\n" | cut -c 1-3000000 >> ${acc}-trimmed.fa rm ${acc}.fa done ``` ## Making map of assembly-accession-to-contigs The reads output from `inSilicoSeq` will have headers that start with the source contig identifier (up to any whitespace in the initial contig identifier). So here, making a file that has those linked to each assembly accession, which with the `target-genome-info.tsv` file made above, ties to taxonomy. The exact format of how this info will be wanted in the end will be dependent on how things are being done, but here's a start 🙂 ```bash # starting file with header printf "assembly_accession\tcontig_IDs\n" > assembly-accession-to-contigs-map.tsv # need to generate differently for the primate ones we trimmed down vs the ones we didn't trim down # so making a file of just those accs we didn't trim down to iterate over comm -23 <(sort target-accessions.txt) <(sort accs-to-trim.txt) > accs-not-trimmed.txt for acc in $(cat accs-not-trimmed.txt) do contigs=$(grep ">" ${acc}.fa | tr -d ">" | cut -f 1 -d " " | tr "\n" ";" | sed 's/.$//') printf "$acc\t$contigs\n" >> assembly-accession-to-contigs-map.tsv done # now adding info for those we did trim down, which just have 1 contig named as the accession for acc in $(cat accs-to-trim.txt) do printf "$acc\t$acc\n" >> assembly-accession-to-contigs-map.tsv done ``` Multiple contigs are delimited by semi-colons in second column, e.g.: ```bash head assembly-accession-to-contigs-map.tsv | column -ts $'\t' | sed 's/^/# /' ``` ``` # assembly_accession contig_IDs # GCF_000006605.1 NC_007164.1;NC_003080.1 # GCF_000006765.1 NC_002516.2 # GCF_000007645.1 NC_004461.1;NC_005008.1;NC_005007.1;NC_005006.1;NC_005005.1;NC_005004.1;NC_005003.1 # GCF_000011925.1 NC_002976.3;NC_006663.1 # GCF_000014625.1 NC_008463.1 # GCF_000015165.1 NC_009485.1;NC_009475.1 # GCF_000015325.1 NC_008752.1 # GCF_000023145.1 NC_012704.1 # GCF_000023205.1 NC_012803.1 ``` ## Simulating metagenomic datasets Conda install of [`inSilicoSeq`](https://github.com/HadrienG/InSilicoSeq#insilicoseq): ```bash conda create -y -n insilicoseq -c conda-forge -c bioconda -c defaults insilicoseq=1.4.6 biopython=1.77 conda activate insilicoseq ``` Fuller documentation of inSilicoSeq available [here](https://insilicoseq.readthedocs.io/en/latest/). ### No error model used (so perfect reads generated) Generating dataset with **even** coverage, no errors/variation: ```bash iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance uniform --mode perfect --output even-perfect-sim -z mv even-perfect-sim_R1.fastq.gz even-perfect-sim-R1.fq.gz mv even-perfect-sim_R2.fastq.gz even-perfect-sim-R2.fq.gz mv even-perfect-sim_abundance.txt even-perfect-sim-abundance.tsv rm even-perfect-sim_R1.fastq even-perfect-sim_R2.fastq ``` Generating dataset with **uneven** coverage ([exponential](https://insilicoseq.readthedocs.io/en/latest/iss/generate.html#abundance-distribution) used), no errors/variation: ```bash iss generate --seed 100 -p 40 -n_reads 10M --draft *.fa --abundance exponential --mode perfect --output uneven-perfect-sim -z mv uneven-perfect-sim_R1.fastq.gz uneven-perfect-sim-R1.fq.gz mv uneven-perfect-sim_R2.fastq.gz uneven-perfect-sim-R2.fq.gz mv uneven-perfect-sim_abundance.txt uneven-perfect-sim-abundance.tsv rm uneven-perfect-sim_R1.fastq uneven-perfect-sim_R2.fastq ``` ### Using packaged HiSeq error model Generating dataset with **even** coverage, using HiSeq error model: ```bash iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance uniform --model HiSeq --output even-hiseq-sim -z mv even-hiseq-sim_R1.fastq.gz even-hiseq-sim-R1.fq.gz mv even-hiseq-sim_R2.fastq.gz even-hiseq-sim-R2.fq.gz mv even-hiseq-sim_abundance.txt even-hiseq-sim-abundance.tsv rm even-hiseq-sim_R1.fastq even-hiseq-sim_R2.fastq ``` Generating dataset with **uneven** coverage ([exponential](https://insilicoseq.readthedocs.io/en/latest/iss/generate.html#abundance-distribution) used), and HiSeq error model: ```bash iss generate --seed 100 -p 40 --n_reads 10M --draft *.fa --abundance exponential --model HiSeq --output uneven-hiseq-sim -z mv uneven-hiseq-sim_R1.fastq.gz uneven-hiseq-sim-R1.fq.gz mv uneven-hiseq-sim_R2.fastq.gz uneven-hiseq-sim-R2.fq.gz mv uneven-hiseq-sim_abundance.txt uneven-hiseq-sim-abundance.tsv rm uneven-hiseq-sim_R1.fastq uneven-hiseq-sim_R2.fastq ``` ## Downloading simulated datasets These and their relative abundance tables can be downloaded from figshare with the following commands: **even-perfect-sim** ```bash curl -L -o even-perfect-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058625 curl -L -o even-perfect-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058631 curl -L -o even-perfect-sim-abundances.tsv https://ndownloader.figshare.com/files/24058619 ``` **uneven-perfect-sim** ```bash curl -L -o uneven-perfect-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058634 curl -L -o uneven-perfect-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058637 curl -L -o uneven-perfect-sim-abundances.tsv https://ndownloader.figshare.com/files/24058628 ``` **even-hiseq-sim** ```bash curl -L -o even-hiseq-sim-R1.fq.gz https://ndownloader.figshare.com/files/24058652 curl -L -o even-hiseq-sim-R2.fq.gz https://ndownloader.figshare.com/files/24058667 curl -L -o even-hiseq-sim-abundances.tsv https://ndownloader.figshare.com/files/24058658 ``` **uneven-hiseq-sim** ```bash curl -L -o uneven-hiseq-sim-R1.fq.gz https://ndownloader.figshare.com/files/24065039 curl -L -o uneven-hiseq-sim-R2.fq.gz https://ndownloader.figshare.com/files/24065048 curl -L -o uneven-hiseq-sim-abundances.tsv https://ndownloader.figshare.com/files/24065042 ``` ## Downloading information tables Downloading the input genome summary table (`target-genome-info.tsv`) as generated [above](https://hackmd.io/@astrobiomike/simulating-illumina-reads#Grabbing-target-genomes-and-full-lineage-info): ```bash curl -L -o target-genome-info.tsv https://ndownloader.figshare.com/files/24079580 ``` Which looks like this: ```bash head target-genome-info.tsv | column -ts $'\t' | sed 's/^/# /' ``` ``` # accession taxid org_name domain phylum class order family genus specific_name # GCF_000006765.1 208964 Pseudomonas aeruginosa PAO1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa # GCF_000014625.1 208963 Pseudomonas aeruginosa UCBPP-PA14 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa # GCF_000007645.1 176280 Staphylococcus epidermidis ATCC 12228 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis # GCF_000011925.1 176279 Staphylococcus epidermidis RP62A Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis # GCF_000023205.1 465515 Micrococcus luteus NCTC 2665 Bacteria Actinobacteria Actinobacteria Micrococcales Micrococcaceae Micrococcus Micrococcus luteus # GCF_000176875.1 596312 Micrococcus luteus SK58 Bacteria Actinobacteria Actinobacteria Micrococcales Micrococcaceae Micrococcus Micrococcus luteus # GCF_000275465.1 1156918 Alcaligenes faecalis subsp. faecalis NCIB 8687 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Alcaligenes Alcaligenes faecalis # GCF_000739855.1 1218102 Alcaligenes faecalis subsp. faecalis NBRC 13111 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Alcaligenes Alcaligenes faecalis # GCF_000262005.1 1123524 Pseudomonas monteilii QM Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas monteilii ``` And downloading the `assembly-accession-to-contigs-map.tsv` table as generated [above](https://hackmd.io/@astrobiomike/simulating-illumina-reads#Making-map-of-assembly-accession-to-contigs): ```bash curl -L -o assembly-accession-to-contigs-map.tsv https://ndownloader.figshare.com/files/24079577 ``` Which looks like this: ```bash head assembly-accession-to-contigs-map.tsv | column -ts $'\t' | sed 's/^/# /' ``` ``` # assembly_accession contig_IDs # GCF_000006605.1 NC_007164.1;NC_003080.1 # GCF_000006765.1 NC_002516.2 # GCF_000007645.1 NC_004461.1;NC_005008.1;NC_005007.1;NC_005006.1;NC_005005.1;NC_005004.1;NC_005003.1 # GCF_000011925.1 NC_002976.3;NC_006663.1 # GCF_000014625.1 NC_008463.1 # GCF_000015165.1 NC_009485.1;NC_009475.1 # GCF_000015325.1 NC_008752.1 # GCF_000023145.1 NC_012704.1 # GCF_000023205.1 NC_012803.1 ```