2020-06-19
- key skills:
  - search, filter, extract, cross-reference data from large databases - make use of data & knowledge that's already out there!
  - sequence alignment - core concept in evolution/phylogeny, functional genomics, genome assembly, differential expression analyses, transcriptomics, metagenomics, etc.
  - parsing data - reading data from many different (often messy!) file formats
  - organisation - keep track of what/where your data is, which analyses you've run, with what parameters/settings, etc.
- start with web-based tools e.g. EMBL-EBI/NCBI resources
- EBI Train Online has a huge amount of freely-accessible content introducing the fundamental concepts & guiding users on getting started
- once you begin working with larger amounts of data, you'll probably need to learn some command line computing (avoid long waits/costs of uploading data & downloading results)
- some understanding of statistics is also necessary
- other good, free, online resources I know of for learning bioinformatics:
- Finally: know that, if you're spending a lot of time searching the internet for help/answers, you're not alone!
- Python & R are equally popular and great places to start - free, open source, easy to install, huge online community, many resources to help you learn
- Choose whichever language your friends/colleagues are already using - I suspect this is the single biggest predictor of success
- Otherwise: Python is good for image analysis (so is ImageJ/Fiji, which provides a graphical interface) and is more broadly useful outside bioinformatics; R has more cutting-edge statistical methods, thanks to Bioconductor
- if using/learning Python: check out Biopython
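To make the "parsing data" skill a bit more concrete, here is a minimal sketch of a FASTA reader that uses only the Python standard library. The function name `read_fasta` and the toy records are made up for illustration; for real work, Biopython's `Bio.SeqIO.parse` handles FASTA (and many other formats) far more robustly than this.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical example data, standing in for a file opened with open(...)
example = io.StringIO(">seq1 toy record\nACGT\nACGT\n\n>seq2\nGGCC\n")
records = dict(read_fasta(example))
for name, seq in records.items():
    print(name, len(seq))
```

Writing a parser like this once is a useful exercise, but real files tend to be messier than the toy input here, which is a good reason to lean on an established library afterwards.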
3. How to build strong skills in a given programming language for data analysis and visualization.
- study other people's code - how do they do what they do?
- Python: learn numpy; pandas; matplotlib
- Use JupyterLab or Jupyter Notebook
- R: learn Tidyverse (dplyr; readr; tidyr; purrr; ggplot2; etc)
- Use RStudio; work in R Markdown
- Make it Open & Reproducible
- data analysis:
- Rosalind programming challenges will help you to simultaneously develop programming skills and insight into bioinformatic algorithms & approaches
- for data viz: use an interactive environment like Jupyter or RStudio - makes iterating over/exploring new visualisations much more fun.
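To give a flavour of the kind of exercise Rosalind sets, here is a minimal sketch in the spirit of three of its introductory problems: counting nucleotides, reverse-complementing a DNA strand, and counting point mutations between two sequences. The function names and input strings are my own choices for illustration, and the code assumes plain A/C/G/T input with no ambiguity codes.

```python
from collections import Counter

# Translation table mapping each base to its complement (assumes only A/C/G/T)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_nucleotides(dna):
    """Return a dict with the number of A, C, G and T in the sequence."""
    counts = Counter(dna.upper())
    return {base: counts[base] for base in "ACGT"}


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


print(count_nucleotides("AACGTT"))         # {'A': 2, 'C': 1, 'G': 1, 'T': 2}
print(reverse_complement("AAAACCCGGT"))    # ACCGGGTTTT
print(hamming_distance("GATTACA", "GACTATA"))  # 2
```

Small functions like these are easy to paste into a Jupyter cell and poke at interactively, which fits the iterate-and-explore workflow suggested above.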