# Bioinformatics Resources

The new NextGenResources:

/exports/igmm/eddie/NextGenResources -> /exports/igmm/eddie/BioinformaticsResources

* John has made BioinformaticsResources writeable to Alison, Gogo, Graeme, Mike, Elvina, and Murray - readable to EVERYONE (uni wide)
* NextGenResources is now frozen - will become read-only to everyone, warning sent out that it will go away entirely in 1 year
* 10TB on eddie
* minimal (500GB) on datastore
* Everyone will be assigned one or more top-level folders to populate
* Each top level folder will have 2 documents
    * `<folder_name>_README.md` - what's in the folder, please use markdown if you can or plain text is fine
    * `<folder_name>_populate.sh` - script to populate the data in the folder
* These documents will be added to the gitlab repository https://git.ecdf.ed.ac.uk/igmmbioinformatics/nextgenresources

## Data

* reference genomes, transcriptomes, proteomes, annotation
* aligner indexes
* gnomad
* GTEx
* nextflow igenomes
    * https://ewels.github.io/AWS-iGenomes/
* GATK
* bcbio

## reference genome resource manager

[refgenie](http://refgenie.databio.org/en/latest/). Refgenie manages storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome "assets", like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another.
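As a rough illustration of the refgenie workflow described above, the sketch below initialises a config, lists the remote assets, pulls one pre-built asset and prints its local path. The `./refgenie` location, the `run` helper and the `DRY_RUN` switch are assumptions made for this example and are not part of the agreed setup; `hg38/bowtie2_index` is only a commonly used asset name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical location for the refgenie folder; adjust to the real one.
REFGENIE_DIR="${REFGENIE_DIR:-./refgenie}"
# refgenie reads its config path from the REFGENIE environment variable.
export REFGENIE="${REFGENIE_DIR}/genome_config.yaml"

# By default only print the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run mkdir -p "${REFGENIE_DIR}"
run refgenie init -c "${REFGENIE}"                     # create an empty genome config
run refgenie listr -c "${REFGENIE}"                    # list assets on the remote server
run refgenie pull hg38/bowtie2_index -c "${REFGENIE}"  # download a pre-built asset
run refgenie seek hg38/bowtie2_index -c "${REFGENIE}"  # print the local path to the asset
```

Running the script as is only prints the commands; `DRY_RUN=0 bash script.sh` would execute them, which needs refgenie installed and network access.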
## NextGenResources current top level folders

* 14713-16PCW.clusters.rds (Mike: to check)
* annotation
* bcbio-1.0.6
* bcbio-1.1.5
* bcbiotx
* bismark_genome_indexes
* crossmap
* find1G
* gatk_bundle
* gnomAD
* gtex_resources
* GTEx_v8
* igenomes
* liftOverResources
* mappability
* Minion
* motif_databases
* nextflow
* ngs.find
* README
* reference
* resources
* shared
* software
* src
* targets
* tcga
* TCGA_SV_CALLS
* TCGA_SV_calls.info.txt
* test_folder
* test_vsvinti
* transcriptome

## nf-core

### singularity cache

nf-core uses Singularity images for software. You can specify a cache directory with the Nextflow environment variable `NXF_SINGULARITY_CACHEDIR`. This should point to the directory holding the Singularity images for the pipeline and version.

#### getting the singularity images

nf-core provides a helper tool, `nf-core`, that lets you download an entire pipeline for offline use, including the Singularity images; see [here](https://nf-co.re/tools/#downloading-pipelines-for-offline-use).

For example, the nf-core/rnaseq release 3.0 images can be downloaded using the commands below.

~~~
# need latest version of nf-core
module load nextflow
module load roslin/singularity/3.5.3
export NXF_SINGULARITY_CACHEDIR=/exports/igmm/eddie/BioinformaticsResources/nfcore/singularity-images/
nf-core download -s -r 3.0 rnaseq --force --singularity --singularity-cache
tar -xzf nf-core-rnaseq-3.0.tar.gz
~~~

## iGenomes

The [iGenomes](https://emea.support.illumina.com/sequencing/sequencing_software/igenome.html) are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC. Chromosome names have been changed to be simple and consistent with the download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

### iGenomes nf-core

nf-core can use the iGenomes folder for its pipelines.
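A minimal sketch of pointing a pipeline at a local iGenomes copy, assuming the mirror keeps the same layout as the S3 bucket. The `igenomes` path under BioinformaticsResources and `samplesheet.csv` are hypothetical; `--genome` and `--igenomes_base` are the standard nf-core parameters for selecting an iGenomes build and its base directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed location of the local iGenomes mirror (hypothetical; adjust as needed).
IGENOMES_BASE="${IGENOMES_BASE:-/exports/igmm/eddie/BioinformaticsResources/igenomes}"

# Assemble the command as an array so it can be inspected before launching.
# samplesheet.csv is a placeholder for a real sample sheet.
cmd=(
    nextflow run nf-core/rnaseq -r 3.0
    -profile singularity
    --input samplesheet.csv
    --genome hg38
    --igenomes_base "${IGENOMES_BASE}"
)

# Print the command; replace this line with "${cmd[@]}" to launch the pipeline.
printf '%s ' "${cmd[@]}"
echo
```

With `--igenomes_base` set, the pipeline resolves the `hg38` paths relative to the local folder instead of `s3://ngi-igenomes/igenomes/`.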
For more information about this resource, please see the GitHub readme [here](https://github.com/ewels/AWS-iGenomes) and https://ewels.github.io/AWS-iGenomes/.

Example command: the command below will download the STARIndex for `Homo_sapiens/UCSC/hg38`.

~~~
module load igmm/apps/awscli/2.1.6
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/ ./Homo_sapiens/UCSC/hg38/Sequence/STARIndex/
~~~

## refgenie

See the description under "reference genome resource manager" above.

## tagging

# Agenda

Meeting 4th May 2021

https://git.ecdf.ed.ac.uk/igmmbioinformatics/nextgenresources

1. 9:00 Existing NextGenResources contents discussion, led by PG
2. AM to show gitlab folder and show how it works e.g. issues etc and how to document the Bioinformatics resources folder.
3. 9:30 assign tasks
    * MW
    * MH
    * PG
    * AM
        * gtex
    * GG
        * igenomes (Human and Mouse)
            * Mouse, mm10, GRCm38, GRCm39
            * Human, hg19, hg38, GRCh38
        * download nextflow singularity images into single cache directory
            * chipseq
            * atacseq
            * rnaseq v3.0
            * sarek

## Questions/Discussion points

* What data do we want to store e.g. Zebrafish genome?
* question of the origin of the genome we want to add: UCSC, Ensembl, NCBI (all being the same with maybe slight differences in chromosome names: "chr" or not, MT vs chrM, etc.), plus the question of which version ("top level", primary, soft masked, etc.)
* specific problem with nf-core/igenomes not including Ensembl GRCh38 (make files available separately as they used an outdated version of STAR)
* refgenie: could be ok for us to use, but might be a bit too obscure for the "users"?
  Especially when trying to find the exact origin of the default assets: this needs to be well documented, perhaps separately. We can still keep the refgenie directory organisation, even if we download individual files.
* How do we deal with new datasets and databases, e.g. someone wants to use dbSNP? It looks like most of the existing data on the old NextGenResources could probably be scrapped/updated.
* if made available to users: question of how/when to update the data (Ensembl GTF, for example)
* on NextGenResources there is a long list of 100-150 software packages. Should we maintain this as well?
* Should we stick to the most "basic" databases (igenomes, Ensembl, UCSC) and leave it to the users to deal with the more specialised ones?

## Proposed folder structure

* README.md
* refgenie/
    * README.md
    * build.sh
* gnomad/
    * README.md
    * build.sh
* gtex/
    * README.md
    * build.sh
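The proposed layout could be scaffolded with something like the sketch below. The `scaffold` function and the placeholder file contents are made up for illustration, and the demo writes into a temporary directory rather than the real BioinformaticsResources folder.

```shell
#!/usr/bin/env bash
set -euo pipefail

# scaffold ROOT FOLDER...
# Creates ROOT/README.md plus README.md and build.sh in each FOLDER.
# Existing files are left untouched.
scaffold() {
    local root="$1"
    shift
    mkdir -p "${root}"
    if [ ! -e "${root}/README.md" ]; then
        printf '# Bioinformatics Resources\n' > "${root}/README.md"
    fi
    local name
    for name in "$@"; do
        mkdir -p "${root}/${name}"
        if [ ! -e "${root}/${name}/README.md" ]; then
            printf '# %s\n\nDescribe what is in this folder.\n' "${name}" \
                > "${root}/${name}/README.md"
        fi
        if [ ! -e "${root}/${name}/build.sh" ]; then
            printf '#!/usr/bin/env bash\nset -euo pipefail\n# TODO: commands that populate %s/\n' "${name}" \
                > "${root}/${name}/build.sh"
            chmod +x "${root}/${name}/build.sh"
        fi
    done
}

# Demo in a scratch directory so nothing real is modified.
demo_root="$(mktemp -d)"
scaffold "${demo_root}" refgenie gnomad gtex
find "${demo_root}" -type f | sort
```

Because existing files are skipped, the function can be re-run when a new top-level folder is added without overwriting READMEs that have already been filled in.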