# DC Genomics Workshop 06-27/28-2018
###### tags: `dc genomics`
Example DC genomics outline:
https://dib-lab.github.io/2018-06-27-DIBSI-Genomics/
## Workflow suggestions
+ We propose changing the order of lessons as follows:
+ Day 1 = Introduction to command line for bioinformatics
+ Why shell? (use tools, automate)
+ [Why of cloud computing?](http://www.datacarpentry.org/cloud-genomics/01-why-cloud-computing/) (more space; also note that you need the shell to compute on the cloud)
+ Remove the "choosing a cloud" section
+ *NB* this section will be quite short.
+ [Cloud Genomics, Episode 2: Logging onto cloud](http://www.datacarpentry.org/cloud-genomics/)
+ Talk about command structure when `ssh`ing
+ [Shell Genomics](http://www.datacarpentry.org/shell-genomics/), taught on the cloud and written around a text file. This could be the metadata file, which we reveal later. It could include all [2,443 Lenski samples](https://www.ebi.ac.uk/ena/data/view/PRJNA380528). [Metadata here](https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=PRJNA380528&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt). Include a `fasta` file as well.
+ [Episode 3](http://www.datacarpentry.org/shell-genomics/03-working-with-files/index.html) needs a rewrite. We think we need to cover `cd`, `rm`, `head`, `tail`, `cat`, `echo`, `mv`, `cp`, `grep`, `wc`, `less`, `man`, `scp` (teach with `cp`), and `curl`
+ Show `grep` by grepping for our 6 samples.
+ We think this could be named "Exploring the Shell"
+ @tomsing1 pirate treasure hunt to demonstrate folder structure in a rewarding way
+ Add **optional** episode that includes `cut`, `paste`, `sort`, `uniq`, `awk`
+ [Shell Genomics, Episode 4: Pipes & Redirection](http://www.datacarpentry.org/shell-genomics/04-redirection/index.html)
+ Right now it includes `>`, `|`, `sort`, `wc`, `uniq`, `cut`, and `paste`
+ We think it should only include `>` and `|`
+ [Shell Genomics, Episode 05: Writing Scripts](http://www.datacarpentry.org/shell-genomics/05-writing-scripts/index.html).
+ Change name to "Writing For loops & scripts"
+ Don't write a script using `history`.
+ Write the script in `nano`
+ Modify to include `for` loop, addressing variables (`$`) and arguments (`$1`)
+ Also use `echo` in the for loop, like [@ctb's Beginner Unix lesson](http://dib-training.readthedocs.io/en/pub/2016-01-13-adv-beg-shell.html).
+ **For non-novice learners optional:** Introduce `tmux`/`screen`, perhaps with for loops.
+ Consider making two episodes
+ Day 2
+ Move [Shell Genomics, Episode 6: Project Organization](http://www.datacarpentry.org/shell-genomics/06-organization/index.html) to [Wrangling Genomics: Variant Calling Workflow](http://www.datacarpentry.org/wrangling-genomics/03-automation/index.html)
+ Project organization and management
+ Data Tidiness (see https://hackmd.io/DbFMOsvATai6Fb8UbEmZLw#)
+ Drop the formal [Genomics Organization, Episode 02: Planning for NGS Projects](http://www.datacarpentry.org/organization-genomics/02-project-planning/) and roll this information into data tidiness, where we will have a more relevant spreadsheet to work with
+ Download data instead of moving from hidden files. Download a subsampled dataset that we post on figshare. Note that figshare is acting as our backup.
+ 90% of the chromosome can be thrown out, keeping high coverage of the remaining 10%
+ [Wrangling Genomics, Episode 01: Assessing Read Quality](http://www.datacarpentry.org/wrangling-genomics/00-quality-control/index.html)
+ Backup plan with Cyberduck/FileZilla
+ [Wrangling Genomics, Episode 02: Trimming and Filtering](http://www.datacarpentry.org/wrangling-genomics/01-trimming/index.html)
+ Add a section to show the Trimmomatic manual.
+ [Wrangling Genomics, Episode 03: Variant Calling Workflow](http://www.datacarpentry.org/wrangling-genomics/03-automation/index.html)
+ add information on all of the flags used in the different commands
+ [Wrangling Genomics, Episode 04: Automating a Variant Calling Workflow](http://www.datacarpentry.org/wrangling-genomics/03-automation/index.html)
+ Only live code the "Automating QCing" section.
+ Allow the learners to download a full automated script to look at
+ Move "[Genomics Organization, Episode 04: Examining Data on the NCBI SRA Database](http://www.datacarpentry.org/organization-genomics/03-ncbi-sra/)" to the end of Day 2, and include other resources. Demonstrate finding the SRR accession number in the paper, searching for it in the ENA, and downloading a fastq file with `wget`.
+ This is also nice because people are tired at the end of the day, and we can give them goodies here :)
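
For the proposed "Writing For loops & scripts" episode, the script that learners type into `nano` might look roughly like the sketch below. It is only an illustration: the function name `list_fastq`, the demo directory, and the sample file names are made up, and it assumes `echo` is the command used to print inside the loop. The body is wrapped in a function so it can be pasted into a terminal; saved as a script, the same `$1` would refer to the script's first argument.

```shell
# Hypothetical helper: report every FASTQ file found in a directory.
# "$1" is the first argument (the directory to search).
list_fastq() {
    data_dir="$1"
    for filename in "$data_dir"/*.fastq
    do
        # ${filename} addresses the loop variable; basename strips the path
        echo "Found sample file: $(basename "${filename}")"
    done
}

# Usage sketch with made-up sample names in a temporary directory
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_A.fastq" "$demo_dir/sample_B.fastq"
list_fastq "$demo_dir"
```

Keeping the directory in `$1` rather than hard-coding it lets the same script be reused on the subsampled data on Day 2.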
## Additional suggestions
+ Use GitBash instead of PuTTY. Include pasting instructions in GitBash, and note that `open` and `man` don't work in GitBash. Relates to https://github.com/datacarpentry/genomics-workshop/issues/41
+ Change the dataset to longer reads (~150bp) from Lenski lab as suggested in https://github.com/datacarpentry/genomics-workshop/issues/42
+ we suggest removing the use of `df` and `tree` in favor of reinforcing the command `ls`
+ Remove the lesson "Fine tuning your cloud setup"
+ Download subsampled fastq files from an S3 bucket or figshare and talk about write-protecting data once downloaded
+ We need to demonstrate a best practice workflow (for example if we tell people to write-protect files we should then write-protect them in our workflow)
+ we need to find (or make) a new *exemplar* csv file of metadata that answers a biological question and corresponds to the new dataset proposed in https://github.com/datacarpentry/genomics-workshop/issues/42
+ make sure that the SNPs are biologically relevant
+ we think this paper relates to the data https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/
+ We need to create a terrible and beautiful metadata file for these data
+ the workshop template really needs editing so that the default does not populate a workshop webpage with R, Python, and shell instructions
+ List of resources at the end of Day 2
+ NCBI SRA
+ ENA
+ NCBI Taxonomy (help finding a genome)
+ GenBank vs RefSeq
+ The lesson is really lacking in philosophy and overview.
+ Include links to manuals and pull up Trimmomatic manual to dissect the flags we include.
+ Add bash vocab!
+ Some overlapping and confusing terms we hear a lot (at least how Mike Lee thinks of them; not peer reviewed!)
+ “command line” or “terminal” – a text-based environment capable of taking input and providing output
+ “shell” – what runs in a terminal, it is your ambassador to your operating system so you can tell it very specific things; the shell translates our language into the computer’s language and then back again so we can communicate with it
+ “bash” – a specific type of shell (by far the most common these days), this describes the programming language your shell understands; the language you use to talk to the shell so it can talk to the computer for you
+ explain the difference between mapping and aligning
+ broken link at http://www.datacarpentry.org/genomics-workshop/setup/
+ Use `bcftools mpileup` instead of `samtools mpileup`
+ And pipe the pileup output into the next command instead of saving it to a bcf file
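
The last suggestion, piping `bcftools mpileup` output straight into the caller rather than writing an intermediate bcf file, could be sketched as follows. This is a rough illustration, not the lesson's actual command: the function name `call_variants` and the file names `ref_genome.fasta`, `sample.aligned.sorted.bam`, and `sample_variants.vcf` are hypothetical, and the demo only runs if `bcftools` and the input files are present.

```shell
# Hypothetical wrapper: pile up reads and call variants in one pipeline.
# $1 = reference FASTA, $2 = sorted BAM, $3 = output VCF
call_variants() {
    reference="$1"
    bam="$2"
    out_vcf="$3"
    # -O u streams uncompressed BCF through the pipe;
    # -m selects the multiallelic caller, -v keeps variant sites only
    bcftools mpileup -O u -f "$reference" "$bam" \
        | bcftools call -m -v -o "$out_vcf"
}

# Usage sketch; the file names below are placeholders, not workshop data
if command -v bcftools >/dev/null 2>&1 \
    && [ -f ref_genome.fasta ] && [ -f sample.aligned.sorted.bam ]; then
    call_variants ref_genome.fasta sample.aligned.sorted.bam sample_variants.vcf
else
    echo "bcftools or input files not found; skipping demo"
fi
```

Skipping the intermediate file saves disk space and reinforces the pipes taught on Day 1.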