# Bioinformatics Resources
BioinformaticsResources is the new NextGenResources:
`/exports/igmm/eddie/NextGenResources` -> `/exports/igmm/eddie/BioinformaticsResources`
* John has made BioinformaticsResources writeable by Alison, Gogo, Graeme, Mike, Elvina, and Murray, and readable by everyone (university-wide)
* NextGenResources is now frozen: it will become read-only for everyone, and a warning has been sent out that it will go away entirely in 1 year
* 10TB on eddie
* minimal (500GB) on datastore
* Everyone will be assigned one or more top-level folders to populate
* Each top-level folder will have 2 documents
  * `<folder_name>_README.md` - what's in the folder; please use markdown if you can, otherwise plain text is fine
  * `<folder_name>_populate.sh` - script to populate the data in the folder
* These documents will be added to the gitlab repository https://git.ecdf.ed.ac.uk/igmmbioinformatics/nextgenresources
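
The two-document convention above could be checked mechanically. The following is a minimal sketch, assuming the folders sit directly under a single root directory; the function name `check_folder_docs` and the demo folders are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: report top-level folders that lack the two expected
# documents, <folder_name>_README.md and <folder_name>_populate.sh.
# Returns 0 when every folder is complete, 1 otherwise.
check_folder_docs() {
    root=$1
    missing=0
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue            # root has no sub-folders
        name=$(basename "$dir")
        for doc in "${name}_README.md" "${name}_populate.sh"; do
            if [ ! -f "$dir$doc" ]; then
                echo "missing: $name/$doc"
                missing=1
            fi
        done
    done
    return "$missing"
}

# Demo in a throwaway directory (the folder names are only examples).
demo=$(mktemp -d)
mkdir -p "$demo/gnomad" "$demo/gtex"
touch "$demo/gnomad/gnomad_README.md" "$demo/gnomad/gnomad_populate.sh"
touch "$demo/gtex/gtex_README.md"
check_folder_docs "$demo" || echo "some documents are missing"
```

In practice the root would be `/exports/igmm/eddie/BioinformaticsResources`, and the check could run before committing the documents to the repository.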
## Data
* reference genomes, transcriptomes, proteomes, annotation
* aligner indexes
* gnomad
* GTEx
* nextflow igenomes
  * https://ewels.github.io/AWS-iGenomes/
* GATK
* bcbio
## reference genome resource manager
[Refgenie](http://refgenie.databio.org/en/latest/) manages storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to download pre-built reference genome "assets", like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another.
## NextGenResources current top level folders
* 14713-16PCW.clusters.rds (Mike: to check)
* annotation
* bcbio-1.0.6
* bcbio-1.1.5
* bcbiotx
* bismark_genome_indexes
* crossmap
* find1G
* gatk_bundle
* gnomAD
* gtex_resources
* GTEx_v8
* igenomes
* liftOverResources
* mappability
* Minion
* motif_databases
* nextflow
* ngs.find
* README
* reference
* resources
* shared
* software
* src
* targets
* tcga
* TCGA_SV_CALLS
* TCGA_SV_calls.info.txt
* test_folder
* test_vsvinti
* transcriptome
## nf-core
### singularity cache
nf-core pipelines use Singularity images to provide their software. You can specify a cache directory for these images with the Nextflow environment variable `NXF_SINGULARITY_CACHEDIR`.
This should point to the directory holding the Singularity images for the pipeline and version you want to run.
#### getting the singularity images
nf-core provides a helper tool, also called `nf-core`, that lets you download an entire pipeline for offline use, including the Singularity images; see [here](https://nf-co.re/tools/#downloading-pipelines-for-offline-use).
For example, the nf-core/rnaseq release 3.0 images can be downloaded using the commands below.
~~~
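# Requires a recent version of the nf-core tools
module load nextflow
module load roslin/singularity/3.5.3
# Shared cache directory where the Singularity images are kept
export NXF_SINGULARITY_CACHEDIR=/exports/igmm/eddie/BioinformaticsResources/nfcore/singularity-images/
# Download nf-core/rnaseq release 3.0 together with its Singularity images
nf-core download -s -r 3.0 rnaseq --force --singularity --singularity-cache
# Unpack the downloaded pipeline archive
tar -xzf nf-core-rnaseq-3.0.tar.gz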
~~~
## iGenomes
The [iGenomes](https://emea.support.illumina.com/sequencing/sequencing_software/igenome.html) are a collection of reference sequences and annotation files for commonly analyzed organisms. The files have been downloaded from Ensembl, NCBI, or UCSC. Chromosome names have been changed to be simple and consistent with the download source. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.
### iGenomes nf-core
nf-core can use the iGenomes folder for its pipelines.
For more information about this resource, please see the GitHub readme [here](https://github.com/ewels/AWS-iGenomes) and the project page at https://ewels.github.io/AWS-iGenomes/.
Example: the command below downloads the STARIndex for `Homo_sapiens/UCSC/hg38`.
~~~
module load igmm/apps/awscli/2.1.6
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/ ./Homo_sapiens/UCSC/hg38/Sequence/STARIndex/
~~~
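
The same command pattern applies to other organisms and indexes. The following is a rough sketch of a helper that only prints the `aws` command, so it can be reviewed before running. The function name `igenomes_sync_cmd` is hypothetical, and the sketch assumes every asset follows the `<species>/<source>/<build>/Sequence/<asset>/` layout seen in the example above, so check the AWS-iGenomes page that the asset you want actually exists.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the aws command that syncs one
# iGenomes asset, mirroring the S3 layout in the current directory.
# Assumes the <species>/<source>/<build>/Sequence/<asset>/ layout.
igenomes_sync_cmd() {
    species=$1
    provider=$2    # e.g. UCSC, Ensembl, NCBI
    build=$3
    asset=$4
    rel="$species/$provider/$build/Sequence/$asset/"
    echo "aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/$rel ./$rel"
}

# Print the command from the example above, then one for a mouse index
# (mm10/BWAIndex is assumed to exist in the bucket).
igenomes_sync_cmd Homo_sapiens UCSC hg38 STARIndex
igenomes_sync_cmd Mus_musculus UCSC mm10 BWAIndex
```

Once the printed command looks right, it can be pasted into a shell with the `awscli` module loaded.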
## tagging
# Agenda Meeting 4th May 2021
https://git.ecdf.ed.ac.uk/igmmbioinformatics/nextgenresources
1. 9:00 Existing NextGenResources contents discussion, led by PG
2. AM to show the gitlab folder and how it works (e.g. issues), and how to document the BioinformaticsResources folder.
3. 9:30 assign tasks
   * MW
   * MH
   * PG
   * AM
     * gtex
   * GG
     * igenomes (Human and Mouse)
       * Mouse: mm10, GRCm38, GRCm39
       * Human: hg19, hg38, GRCh38
     * download nextflow singularity images into a single cache directory
       * chipseq
       * atacseq
       * rnaseq v3.0
       * sarek
## Questions/Discussion points.
* What data do we want to store e.g. Zebrafish genome?
* Which source should the genomes we add come from: UCSC, Ensembl, or NCBI? They are all the same, with maybe slight differences in chromosome names ("chr" or not, MT vs chrM, etc.). There is also the question of which version to use ("top level", primary, soft masked, etc.)
* specific problem with nf-core/igenomes not including Ensembl GRCh38 (make files available separately, as they used an outdated version of STAR)
* refgenie: could be OK for us to use, but might be a bit too obscure for the "users", especially when trying to find the exact origin of the default assets: this needs to be well documented, perhaps separately. We can still keep the refgenie directory organisation, even if we download individual files
* How do we deal with new datasets and databases, e.g. someone wants to use dbSNP? It looks like most of the existing data on the old NextGenResources could probably be scrapped/updated
* if made available to users: question of how/when to update the data (Ensembl GTF, for example)
* on NextGenResources: there's a long list of 100-150 software/packages. Should we maintain this as well?
* Should we stick to the most "basic" databases (igenomes, ensembl, UCSC) and leave it to the users to deal with the more specialised ones?
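
To make the chromosome naming point above concrete, here is a minimal sketch that rewrites UCSC-style names (`chr1`, `chrM`) to Ensembl-style names (`1`, `MT`) in the first column of a tab-separated file such as a BED or GTF. The function name `ucsc_to_ensembl` is hypothetical, and the sketch only handles the primary chromosomes correctly: unplaced or alternate contigs have different names in each source and would need a proper mapping table.

```shell
#!/bin/sh
# Hypothetical helper: convert UCSC-style chromosome names in column 1
# (chr1, chrX, chrM) to Ensembl-style (1, X, MT). Reads stdin, writes stdout.
# Lines starting with '#' (headers, comments) pass through unchanged.
ucsc_to_ensembl() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         {
             if ($1 == "chrM") $1 = "MT"      # mitochondrion is renamed, not just stripped
             else sub(/^chr/, "", $1)         # drop the "chr" prefix
             print
         }'
}

# Small demo on two BED-like records.
printf 'chr1\t100\t200\nchrM\t5\t50\n' | ucsc_to_ensembl
```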
## Proposed folder structure
* README.md
* refgenie/
  * README.md
  * build.sh
* gnomad/
  * README.md
  * build.sh
* gtex/
  * README.md
  * build.sh
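
The proposed layout could be created with a small scaffold script. The following is a sketch under stated assumptions: the function name `scaffold_resources` and the placeholder file contents are hypothetical, and existing files are left untouched so the script is safe to re-run.

```shell
#!/bin/sh
# Hypothetical scaffold for the proposed layout: a top-level README.md plus
# one folder per resource, each holding README.md and build.sh.
# Usage: scaffold_resources <root> <resource>...
scaffold_resources() {
    root=$1
    shift
    mkdir -p "$root"
    [ -f "$root/README.md" ] || echo "# Bioinformatics Resources" > "$root/README.md"
    for name in "$@"; do
        mkdir -p "$root/$name"
        [ -f "$root/$name/README.md" ] || echo "# $name" > "$root/$name/README.md"
        if [ ! -f "$root/$name/build.sh" ]; then
            # Placeholder build script; the real commands are filled in by the folder owner.
            printf '#!/bin/sh\n# TODO: commands that populate %s/\n' "$name" > "$root/$name/build.sh"
            chmod +x "$root/$name/build.sh"
        fi
    done
}

# Demo in a throwaway directory, listing the files that were created.
demo=$(mktemp -d)
scaffold_resources "$demo" refgenie gnomad gtex
find "$demo" -type f | sort
```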