# --- Awk workshop ---
29-30 August, 2022

\* During the workshop, you will be able to register for the second day, if you want to attend.

**To register** - follow the information on the workshop's UPPMAX web page https://www.uppmax.uu.se/support/courses-and-workshops/awk-workshop-winter-2022/

> Linux command line tools survey: https://forms.gle/8brjbNEav6uPSKBe6
> [Results](https://docs.google.com/forms/d/1XgXdb9TGwnqbx2CP_v1RFpH1aXODey_4RQVGaRTc5ps/viewanalytics) (including previous surveys)
> Course material: https://pmitev.github.io/to-awk-or-not/
> Q&A: https://hackmd.io/@pmitev/to-awk-or-not-QA
> [Suggest a topic](https://forms.gle/usYYkbWaZVkNceSK6) or check [recent suggestions](https://docs.google.com/forms/d/1tQYWc504BQ-uYRA7MWgu1pNXM613r4Ua1wP_yBPlNDM/viewanalytics)

Have a brief look at the course web page https://pmitev.github.io/to-awk-or-not/ for a preview of the workshop contents. The course will not cover all the available material; instead, it will present typical examples and solutions for some common problems.

### First day

On the morning of the first day, the course will start with a general introduction and the basic concepts of the tool. The material is not organized linearly, so we can adapt the course to your questions and particular interests.
In the afternoon exercise session we will practice awk on some typical situations, where you will probably find solutions to problems relevant to your own work.
The material covered on the first day should be enough to learn how to use awk for its most common purposes - awk "one-liners" and small scripts.

### Second day

If you decide to attend the second day, we will start with a task from the bioinformatics field and go through a tutorial that demonstrates how to combine awk with other common command-line tools to analyze and manipulate the output from genome analysis.
Then we will focus on more advanced features of awk, illustrated with some easy-to-follow "case studies" from the materials science field, mixed again with examples from the bioinformatics field. The order will depend on the interests you express.

### Technical

The Zoom meeting will be active 30 minutes before the workshop, so we can help you with some trivial setup problems and general questions.
If you have serious trouble with the setup, please let us know in advance (by replying to this email), so we can try to resolve the problem.

## Second day preparations

We will start with this tutorial https://pmitev.github.io/to-awk-or-not/Case_studies/manipulating_vcf/

There are 3 large files that need to be downloaded if you want to follow the tutorial yourself. Here are the direct links to download them with `wget` or another program.

```
# 4.9 MB
wget ftp://ftp.ensembl.org/pub/release-101/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.28.101.gff3.gz
# 41 MB
wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-48/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
# 3.7 GB
wget http://dgrp2.gnets.ncsu.edu/data/website/dgrp2.vcf
```

If you work on Rackham, the files will be available in `/tmp/awk-course/`.

Documentation for the file formats:
GFF3: http://genome.ucsc.edu/FAQ/FAQformat.html#format3
FASTA: http://genetics.bwh.harvard.edu/pph/FASTA.html
VCF: http://genome.ucsc.edu/FAQ/FAQformat.html#format10.1

---

AWK is an interpreted programming language designed for text processing and typically used as a data extraction and reporting tool.
This two-day workshop aims to promote and demonstrate the flexibility of the tool in situations where the overhead of more sophisticated approaches and programming languages is not worth the bother.

**Learn how to**
- use Awk as **an advanced** `grep` command, capable of arithmetic selection rules, with control over the content of the matched lines.
- perform simple conversions and analysis, or filter your data on the fly, making it easy to plot or read into your favorite research tool.
- handle, and take advantage of, data sets split over multiple files.
- use Awk as a simple function or data generator.
- perform simple sanity checks on your results.

## :biohazard_sign: Awk for bioinformaticians

Use what you learn and dive into the basic concepts of bioinformatics with simple exercises on typical scientific problems and tasks.

## :atom_symbol: Awk for computational physicists/chemists

_Second day case studies_
Use Awk to ease typical computational setup scenarios
- pre-parse or modify input data or parameters
- monitor and/or visualize data on the fly

# :clipboard: Schedule

## 1^st^ day

9:15 - 12:00
- [Seminar session](https://pmitev.github.io/to-awk-or-not/)
  - Examples of typical problems suitable for Awk "treatment"
  - Introduction to the basics of the Awk scripting language
  - Solving simple problems interactively
- **Lunch break**
- [Exercises](https://pmitev.github.io/to-awk-or-not/Exercises/Exercises/) 13:15 - 16:00
  - Solving the exercise problems interactively

___

## 2^nd^ day

9:15 - 12:00
- Awk for bioinformaticians - seminar
  - Solving a variety of bioinformatics problems
- Case studies and exercises
  - [Case Study: Manipulating the output from a genome analysis - vcf and gff](https://pmitev.github.io/to-awk-or-not/Case_studies/manipulating_vcf/)
    - Filtering and formatting raw data
    - Counting and piling features
    - Indexing and hashing to compare variants and annotations
- **Lunch break**
- Walk-through session on selected topics: [Vote here](https://forms.gle/71hfaLCbiXrsSmkH9)
  - **Awk parsing "simultaneously" multiple input files** The [Multiple input files - second approach](https://pmitev.github.io/to-awk-or-not/Case_studies/multiple_files_II/) scenario will be discussed.
  - **How to trick awk into accepting options on the command line like a regular program** i.e.
    `$ script.awk filename parameter1 parameter2` [link](https://pmitev.github.io/to-awk-or-not/Furthermore/Command_params/)
  - **Declaring and calling functions with awk** - [link](https://pmitev.github.io/to-awk-or-not/Furthermore/User_defined_functions/)
  - **Input/output to/from external programs** Learn how to send input to an external program (possibly based on your data) and read the result back - [link](https://pmitev.github.io/to-awk-or-not/More_awk/Input_output/)
  - **Running averages - elegant awk solution**, though difficult as an exercise - [link](https://pmitev.github.io/to-awk-or-not/More_awk/Running_average/)
  - Handy tips: using awk one-liners with Vim, gnuplot...

---

# Prerequisites

:::spoiler Awk on MacOS, Linux, Windows 10, etc.

## MacOS

The system-provided awk version will work for most of the examples during the workshop, with a few exceptions, which are noted in the online material.
*Tilde `~` sign on Mac with Swedish keyboard layout - `Alt + ^`*

## Linux

Several distributions have other awk flavors installed by default. The easiest fix is to install the GNU version, `gawk`, e.g. on Ubuntu: `sudo apt install gawk`

## Windows 10

- [Ubuntu for Windows 10](https://docs.microsoft.com/en-us/windows/wsl/install-win10) - it is best to read the original source, even though it might not be the easiest tutorial. In my experience, this is the best Linux environment without virtualization.
- [MobaXterm](https://mobaxterm.mobatek.net/) - use the internal package manager to install gawk. The default awk is provided by [Busybox](https://www.busybox.net/) and is not sufficient for the purposes of the workshop.

## Linux computer center

- Just log in to your account and use the provided awk - any version newer than 4 will work.

```
rackham3:[~] awk -V
GNU Awk 4.0.2
Copyright (C) 1989, 1991-2012 Free Software Foundation.
```

## :cloud: Virtual Linux Machine

Follow a tutorial on how to set up and use a virtual Linux environment:
- [VirtualBox](https://www.virtualbox.org/)
- [Ubuntu on Public Clouds](https://ubuntu.com/public-cloud)
- [GitHub & Binder](https://pmitev.github.io/to-awk-or-not/Other/Binder/) (*you need only a browser*)
- [Singularity](https://sylabs.io/)
  ```bash
  singularity run shub://pmitev/Teoroo-singularity:gawk 'BEGIN{ for(i=1;i<=4;i++) print i}'
  ```
- https://www.onworks.net/ - provides multiple operating systems that you can start for free. **Warning** - sessions are disconnected after 5 minutes of inactivity, and the site is overloaded with advertisements.

:::

# Online meeting - Zoom :zzz:

The meeting will be active 30 minutes before the workshop, so we can help you with some trivial setup problems and general questions.

The workshop is intended to be **interactive**, which can be challenging on computers with small screens. It is best to leave space for both the shared Zoom window and an active terminal.

The material for the workshop is available online https://pmitev.github.io/to-awk-or-not/

# Suggested topics via the online [form](https://forms.gle/usYYkbWaZVkNceSK6)

## Feedback from previous workshops:

| [2022.09](https://docs.google.com/forms/d/1UUZP97qXq3rwxY7VGJsu1w-4QWfRCzEmO1xZWva-CVM/viewanalytics) | [2022.01](https://docs.google.com/forms/d/1mIboAG1nudj1yPN07-HZbQ6L9ghlZxrCLTFbAMJpARg/viewanalytics) | [2021.09](https://docs.google.com/forms/d/1GILWudpKGoZSkyfkyBR-kRGTYieoXC1yPOz0Jn0UrcI/viewanalytics) | [2021.01](https://docs.google.com/forms/d/1be529TgFwsaNnsH_YQ-6qJWFNV15NTl510dWqrqzu1A/viewanalytics) | [2020.08](https://docs.google.com/forms/d/1I6tMA-mXy5kIMEy5H1Nt2fbKcuMZpvxE_WYpJPkAJ5Q/viewanalytics) | [2020.01](https://docs.google.com/forms/d/1Wa9lCwxp0Pes38KFziilNbdcvYfHwxBiou9j3c3hNO0/viewanalytics) | [2019.08](https://docs.google.com/forms/d/1-wha3xg_jkcZ03ljF6HmPnTFQGzGe08Jun5c0IAFfEU/viewanalytics) | [2019.01](https://docs.google.com/forms/d/1O1v8i3f1UDavfmntbEZ9cvm8_U-5Mj5P6GTEHUWyuuk/viewanalytics)
|[2018.08](https://docs.google.com/forms/d/1PG8dt0LSOdp9gv1rFCjEe1kiapx3a-SiSJkvl2MOlyA/viewanalytics) | [2018.01](https://docs.google.com/forms/d/1d85npGj6O5xuQEF9drBRhneqYKjW0yAZJOnTiI1QP0c/viewanalytics) | [2017.01](https://docs.google.com/forms/d/1aTeYzOJTLNVkRYnXqOAOWFbtWIzgigqbt6hvuc4EBoE/viewanalytics) | [2017.08](https://docs.google.com/forms/d/1Y_D8kKDHsVCeu3Hli87iphnxp_ayNXfVJRcmFDiSe7Y/viewanalytics) | [2016.08](https://docs.google.com/forms/d/1PXdyRsABx60Uq6mDwepKv8-0ztur8z9dEkoUOmmfqjg/viewanalytics) | [2016.01](https://docs.google.com/forms/d/11q4-HAOSy7LB8mla0EkP0PhkfuBVdyIpOKb9pSqCkb0/viewanalytics) | [2015.10](https://docs.google.com/forms/d/1KSab3x3IlXdgtTScXPfHbFR81FrEpZ8j__hOgV8P5wU/viewanalytics) |
| -

# Contacts:

- [Pavlin Mitev](https://katalog.uu.se/profile/?id=N3-1425)
- [Voichita Marinescu](https://katalog.uu.se/profile/?id=N12-828)
- [Jonas Söderberg](https://katalog.uu.se/empinfo/?id=N2-1277)
- [Lars Eklund](https://katalog.uu.se/empinfo/?id=N5-89)
- [UPPMAX](https://www.uppmax.uu.se/), [Awk@UPPMAX](https://www.uppmax.uu.se/support/courses-and-workshops/awk-workshop-summer-2020/)

![](https://snic.se/digitalAssets/603/c_603880-l_1-k_image.png =122x38)
![](https://live.webb.uu.se/digitalAssets/207/c_207717-l_3-k_bg-city.png)

*[VCF]: VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

###### tags: `awk`, `UPPMAX`, `intro course`, `SNIC`