---
tags: GeneLab
title: GLDS-249 Nick 2-genome probing
---

# GLDS-249 Nick 2-genome probing

[toc]

## Conda environment setup

```bash
# installing bit
mamba create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.46

# installing anvio
# mostly following instructions here: https://anvio.org/install/
# but done exactly as depicted here
mamba create -y -n anvio python=3.6
conda activate anvio

# installing dependencies
mamba install -y -c conda-forge -c bioconda -c defaults "sqlite >=3.31.1"
mamba install -y -c conda-forge -c bioconda -c defaults prodigal
mamba install -y -c conda-forge -c bioconda -c defaults mcl
mamba install -y -c conda-forge -c bioconda -c defaults muscle=3.8.1551
mamba install -y -c conda-forge -c bioconda -c defaults hmmer
mamba install -y -c conda-forge -c bioconda -c defaults diamond
mamba install -y -c conda-forge -c bioconda -c defaults blast
mamba install -y -c conda-forge -c bioconda -c defaults bowtie2 tbb=2019.8
mamba install -y -c conda-forge -c bioconda -c defaults samtools=1.9
mamba install -y -c conda-forge -c bioconda -c defaults trimal
mamba install -y -c conda-forge -c bioconda -c defaults trnascan-se
mamba install -y -c conda-forge -c bioconda -c defaults fasttree
mamba install -y -c conda-forge -c bioconda -c defaults fastani

# getting anvio
curl -L https://github.com/merenlab/anvio/releases/download/v7.1/anvio-7.1.tar.gz \
     --output anvio-7.1.tar.gz

# installing anvio
pip install anvio-7.1.tar.gz

# setting up some anvio-configured ref dbs that will be used
mkdir -p ~/ref-dbs/anvio
anvi-setup-kegg-kofams --kegg-data-dir ~/ref-dbs/anvio/KOs
anvi-setup-ncbi-cogs --cog-data-dir ~/ref-dbs/anvio/COGs -T 20
```

## Getting GenBank files of 2 target ref genomes

[*Extibacter muris*](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_004345005.1/) | GCF_004345005.1

[*Dysosmobacter welbionis*](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_005121165.2/) | GCF_005121165.2

```bash
conda activate bit

printf "GCF_004345005.1\nGCF_005121165.2\n" > target-accs.txt

bit-dl-ncbi-assemblies -w target-accs.txt -f genbank -j 2

gunzip *.gz

mkdir ref-genome-genbank-files
mv GCF_00*gb ref-genome-genbank-files/
```

## Getting reads files

```bash
curl -L -o GL-filenames-json-to-tsv.py https://figshare.com/ndownloader/files/35049817
curl -L -o GLDS-249-files.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/249

python GL-filenames-json-to-tsv.py GLDS-249-files.json GLDS-249-files.tsv

grep "NxtaFlex" GLDS-249-files.tsv | grep "HRremoved" | grep -v "GMetagenomics" > NxtaFlex-read-files-and-links.tsv

wc -l NxtaFlex-read-files-and-links.tsv
# 96

grep -v "_BSL_" NxtaFlex-read-files-and-links.tsv > NxtaFlex-read-files-and-links-no-basal-samples.tsv

wc -l NxtaFlex-read-files-and-links-no-basal-samples.tsv
# 64

############
### at the time of doing this, the links given in the downloaded json table were not working,
### so the links are built further below from a link I know works from the site;
### the non-working approach is commented out here

# # renaming to remove all the redundancy before downloading
# cut -f 1 NxtaFlex-read-files-and-links-no-basal-samples.tsv > orig-names.tmp
# sed 's/GLDS-249_metagenomics_Mmus_C57-6T_FCS_//' orig-names.tmp | sed 's/_NxtaFlex//' | sed 's/_HRremoved//' > new-names.tmp

# # only works because they are all version 1 at time of doing this (not sure how to get around that)
# paste -d " " <( cut -f 2 NxtaFlex-read-files-and-links-no-basal-samples.tsv | sed 's/^/curl /' | sed 's/$/?version=1 -s -o/' ) new-names.tmp > dl-wanted-reads.sh

# rm *.tmp

# # downloading in parallel (done within a screen)
# mkdir reads
# cd reads
# mv ../dl-wanted-reads.sh .
# cat dl-wanted-reads.sh | parallel --xapply -j 10
################

# renaming to remove all the redundancy before downloading
cut -f 1 NxtaFlex-read-files-and-links-no-basal-samples.tsv > orig-names.tmp
sed 's/GLDS-249_metagenomics_Mmus_C57-6T_FCS_//' orig-names.tmp | sed 's/_NxtaFlex//' | sed 's/_HRremoved//' > new-names.tmp

# building links that work (only works because they are all version 1; not sure what to do
# if not, because that info isn't in the json pulled above either)
paste -d " " <( sed 's|^|curl -L https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/|' orig-names.tmp | sed 's/$/?version=1 -s -o/' ) new-names.tmp > dl-wanted-reads.sh

# this file looks like this
head -n 3 dl-wanted-reads.sh | sed 's/^/# /'
# curl -L https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-249_metagenomics_Mmus_C57-6T_FCS_FLT_ISS-T_NxtaFlex_Rep1_F3_R1_HRremoved_raw.fastq.gz?version=1 -s -o FLT_ISS-T_Rep1_F3_R1_raw.fastq.gz
# curl -L https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-249_metagenomics_Mmus_C57-6T_FCS_FLT_ISS-T_NxtaFlex_Rep1_F3_R2_HRremoved_raw.fastq.gz?version=1 -s -o FLT_ISS-T_Rep1_F3_R2_raw.fastq.gz
# curl -L https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-249_metagenomics_Mmus_C57-6T_FCS_FLT_ISS-T_NxtaFlex_Rep2_F4_R1_HRremoved_raw.fastq.gz?version=1 -s -o FLT_ISS-T_Rep2_F4_R1_raw.fastq.gz

rm *.tmp

# downloading in parallel (done within a screen)
mkdir reads
cd reads
cp ../dl-wanted-reads.sh .
cat dl-wanted-reads.sh | parallel --xapply -j 10

cd ../
```

## Processing

Done with the snakemake workflow here: https://github.com/AstrobioMike/GLDS-249-Nick-2-genome-probing

```bash
cp target-accs.txt genome-IDs.txt
ls reads/*R1* | cut -f 2 -d "/" | sed 's/_R1_raw.fastq.gz//' > sample-IDs.txt

snakemake -j 5
```

## Visualizing

The results were copied to a local machine and visualized as follows:

```bash
anvi-interactive -c contigs-dbs/GCF_004345005.1-contigs.db -p GCF_004345005.1-merged-profile/PROFILE.db

anvi-interactive -c contigs-dbs/GCF_005121165.2-contigs.db -p GCF_005121165.2-merged-profile/PROFILE.db
```
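
Before launching the interactive interface on a local copy, it can help to confirm that the contigs database and merged profile for each genome actually made it over. The following is a minimal sketch, not part of the workflow: `check_anvio_inputs` is a hypothetical helper, and it assumes the directory layout implied by the two `anvi-interactive` commands above and that a one-accession-per-line file such as `genome-IDs.txt` is available in the same directory.

```bash
# Hypothetical helper (not part of anvi'o or the snakemake workflow):
# reports any expected anvi'o input that is missing or empty.
# Assumed layout, taken from the anvi-interactive commands above:
#   contigs-dbs/<accession>-contigs.db
#   <accession>-merged-profile/PROFILE.db
check_anvio_inputs() {
    local ids_file="$1"   # file with one genome accession per line
    local missing=0
    local acc f

    while read -r acc; do
        # skip blank lines
        [ -z "${acc}" ] && continue

        for f in "contigs-dbs/${acc}-contigs.db" "${acc}-merged-profile/PROFILE.db"; do
            # -s is true only if the file exists and is not empty
            if [ ! -s "${f}" ]; then
                printf "missing or empty: %s\n" "${f}" >&2
                missing=$((missing + 1))
            fi
        done
    done < "${ids_file}"

    # exit status 0 if everything was found, 1 otherwise
    return "$(( missing > 0 ))"
}
```

With the function defined, `check_anvio_inputs genome-IDs.txt && echo "all inputs present"` prints the confirmation only when both files exist for every accession. The check looks only at file presence and size; it does not verify that the databases are valid anvi'o databases.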