# cell line to command line version 2 change log
### book update
- [x] add my how to ask questions youtube video https://www.youtube.com/watch?v=UrammW0bmHI
- [x] add genome reference video https://www.youtube.com/watch?v=eIVlSG11umQ
- [x] add my 6-step framework learning computational biology youtube video https://www.youtube.com/watch?v=JYauh4GZExo
- [x] add spreadsheet best practice video https://www.youtube.com/watch?v=D-_C8dQZCVA
- [x] add how to make heatmap youtube video https://www.youtube.com/watch?v=7fQkPUqusTg
- [x] add assessing clustering significance https://github.com/pkimes/sigclust2
- [x] add understanding hclust https://jokergoo.github.io/2023/07/07/what-is-a-hclust-object/
- [x] add ANGUS training https://angus.readthedocs.io/en/2019/
- [x] add Home brew for mac https://brew.sh/
- [x] add programming with dplyr/ggplot, add row-wise operations
- [x] DESeq2 make PCA plot, heatmap, RNAseq video https://www.youtube.com/watch?v=LmZD_a8XZEQ
- [x] add GSEA from RNAseq video https://www.youtube.com/watch?v=7fQkPUqusTg
- [x] use `here()` for reproducible computing in R
- [x] add ncdu
- [x] showcase screen
- [x] add https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html ML intro to biologist figure. add https://www.mdpi.com/1422-0067/22/6/2903 to ML chapter
- [x] add https://osf.io/4pd9n/ for reproducible research and **Building reproducible analytical pipelines with R** https://raps-with-r.dev/
- [x] linux tricks: `head -1 | tr "," "\n" | grep`
- [x] add Integrating Molecular Biology and Bioinformatics Education https://www.degruyter.com/document/doi/10.1515/jib-2019-0005/html
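The `head -1 | tr | grep` trick listed above can be sketched as follows. This is a minimal illustration, not the book's exact example: the file `samples.csv` and its column names are hypothetical and created here only so the snippet is self-contained, and `grep -n` is added so the match is reported together with its column position.

```shell
# Create a small hypothetical CSV in a temporary directory (illustration only).
tmpdir=$(mktemp -d)
printf 'sample_id,tissue,age,tumor_stage\nS1,lung,61,II\n' > "$tmpdir/samples.csv"

# Take the header line, put each column name on its own line,
# then search for the column of interest; -n prefixes the line number,
# which here equals the column position in the header.
col_line=$(head -1 "$tmpdir/samples.csv" | tr "," "\n" | grep -n "tumor")
echo "$col_line"
```

With a wide table this gives a quick way to find which column to pass to tools such as `cut -f` or `awk`.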
# Notes on Ming's book
### Thank you, Harris, for such a detailed review. I made the changes accordingly.
## Preface
- Thank you so much Ming for sharing your personal story and your motivation behind becoming a computational biologist. It is truly moving and inspirational! I can definitely relate.
### Notes
- [x] "We use cloud computing and machine learning" instead of "cloud computing, machine learning"
- [x] (1.1) Most time is spent installing tools and performing data cleaning and wrangling instead of analysing data
- [x] (1.1) "large clinical trials known as..." instead of "large clinical trials know as..."
- [x] (1.1) "Learning resources are available on..." instead of "Learning resource..."
- [x] (1.2) "I got many e-mails from all around the world..."
- [x] (1.2) "asking questions and requesting help..."
- [x] (1.3) "This is not a book that will cover many topics in depth..." instead of "This is not a book that will teach you deep"
- [x] (1.3) "TAKE ACTION" instead of "TAKE ACTIONS"
- [x] (1.4) "it would attract over 6000 visitors every month from around the world..." instead of "attracts over 6000 traffic every month around the world"
- [x] (1.4) "only bigger and with larger flavor variety this time" instead of "only bigger and more flavors this time"
- [x] (1.4) "I realized that I needed to grow and learn" instead of "I realized that I need to grow and learn"
- [x] (1.4) Put the Jobs quote in quotes. (It was already in a quote, `>`.)
## Chapter 2
### Notes
- [x] Excellent book collection. I would also add the Sobell book on Linux and SQL for mere mortals
- [x] Coursera, edX, and DataCamp offer free options and certification; they are excellent ways to learn if you prefer videos over books. O'Reilly is also terrific, if the readers happen to have access. (I mentioned Coursera and edX in chapter 1 for my story; I mentioned them again per suggestion)
3. Dash and friends for easy access to documentation for various programming languages
- [x] Including containers later would be nice given that installing software takes so much time. The section on workflow managers may be the best place for it. Actually, you include them in Chapter 12, which is great. Mention Chapter 12 here as it mentions ways to deal with installation problems!
### Errors
- [x] section 2.3 Hadley's name should be corrected
- [x] 2.3.4. Correct the title: Modern Statistics for Modern Biology
- page 17: Unix proceeded to become one of the most used operating systems
## Chapter 3
- I love the definitions and the Unix resources. It can be confusing for beginners and they may be lost in the vast number of resources available to learn the command line
- I really like the short introduction to file permissions
- I love the ack reference and the reference to your one liners post
### Notes
- [x] A few extra words on zcat and friends would be useful, since we have to deal with compressed files very often
- [x] I also find the `tree` command useful. You may want to include it
- [x] In 3.17, I would say Regular Expressions instead of Regular Expression. Remember what computational biologists do instead of biologist.
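As a rough illustration of the "zcat and friends" note above, the sketch below works on a compressed file without unpacking it to disk. The file `peaks.bed.gz` and its contents are hypothetical and created only for this example; `gzip -dc` is used as the portable spelling of `zcat`, since `zcat` on some systems expects `.Z` files.

```shell
# Build a tiny hypothetical gzip-compressed BED file (illustration only).
tmpdir2=$(mktemp -d)
printf 'chr1\t100\t200\nchr2\t300\t400\nchr1\t500\t600\n' | gzip > "$tmpdir2/peaks.bed.gz"

# Stream the decompressed content to stdout and count the chr1 records.
n_chr1=$(gzip -dc "$tmpdir2/peaks.bed.gz" | grep -c "^chr1")
echo "$n_chr1"

# zgrep searches the compressed file directly and gives the same count.
n_chr1_zgrep=$(zgrep -c "^chr1" "$tmpdir2/peaks.bed.gz")
echo "$n_chr1_zgrep"
```

Streaming this way avoids writing large uncompressed intermediates, which matters for FASTQ-sized files.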
## Chapter 4
1. The notes on purrr and how to batch import files are gold!
2. Excellent job on introducing pivot_wider and pivot_longer. Datacamp also offers excellent courses on Tidyverse that you might want to link to
3. Figure 4.12 is extremely helpful
### Notes
- [x] A few words about Julia too as the new kid on the block (I had it)
- [x] In 4.8.4, it should read Do not reinvent the wheel
- [x] In 4.8.4, Did you notice instead of noticed
- [x] In 4.12, it should read Awesome Quarto
## Chapter 5
1. I particularly like the emphasis on the experimental design and why orthogonal data and sanity checks are important.
2. I also enjoyed the advanced usage of tidyverse in this chapter
3. A potential issue with this chapter is that it tries to cover multiple different topics and this affects the flow of information
### Notes
- [x] In 5.3, deferentially should read differentially
- [x] In 5.6, we should not be obsessed with p-value as shown... instead of show
## Chapter 6
1. I totally agree with the choice to include PCA as a separate chapter due to its central importance in Bioinformatics
2. tSNE and UMAP are key dimensionality reduction methods for single-cell data so it is great that you include them here
### Notes
1. It might be useful to include some other R packages that are useful for publication-quality PCA plots (e.g., ggpubr). **Although it is useful and I use it myself, I believe I read somewhere that it may contain some mistakes**
## Chapter 7
1. Great that you devote a whole chapter on heatmaps. They are indeed ubiquitous and an essential skill for computational biologists.
2. I also like the fact that you list the commonly used packages for heatmap generation including ComplexHeatmap.
3. Using the Polychrom library and getting discrete colors to improve the dendrogram is a useful suggestion.
### Notes
- [x] In many Genomics papers you will see heatmaps (instead of heatmap)
2. "Heatmap is of no mystery" could be omitted
- [x] A very simple using case for heatmap should read a very simple use case.
## Chapter 8
1. Working efficiently with spreadsheets is an important skill to have and thus I agree with dedicating a whole chapter to it
### Notes
1. I have found csvkit ([csvkit.readthedocs.io](http://csvkit.readthedocs.io/)) to be extremely useful for dealing with Excel files and converting them to .csv files (you mention it in the Tools section along with other great tools such as Miller that I had never heard of!)
- [x] I would replace wet biologists with wet-lab biologists or experimental biologists. (I did not change)
- [x] I would replace "to the benefit of your own sake" with "for your own sake".
- [x] I would replace "Tidy a spreadsheet" with "Tidy up a spreadsheet"
## Chapter 11
1. Domain-specific languages and workflow managers become increasingly important as the pipelines become more complicated and we have to deal with various tool dependencies in different compute environments. So, this chapter is really important.
2. The chapter has unbalanced content at the moment. Snakemake is very well covered but there are only a couple of links for Nextflow and almost nothing on WDL (see notes for suggestions).
### Notes
1. I am glad that you include a reference to Brown's tutorial on Snakemake. In my personal experience, it is difficult to find high-quality learning material for Snakemake. This is one of the reasons that I went for Nextflow instead.
- [x] The chapter content looks unbalanced at the moment as there is a lot about Snakemake but only a couple of links for Nextflow in 11.19 and nothing about WDL. I would add a few things about Nextflow (at least what are processes and channels and how the data are fed from one process to the other through channels). Also, it would be very helpful to link the official [Nextflow documentation](https://www.nextflow.io/docs/latest/index.html). A similar approach should be used for WDL. (**I fully agree with you. Snakemake is the one that I am familiar with so I focused on it. I added the links for Nextflow and WDL**)
- [x] You could also add this perspective by Wratten et al. to the readings: [Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers](https://www.nature.com/articles/s41592-021-01254-9). I think it is another excellent resource for anyone interested in understanding the differences between scripts, pipelines, and workflows. (**added**)
## Chapter 12
1. This chapter is fantastic: offering instructions on how to name files and organize computational projects is invaluable. The notes on cron and crontab are also very useful.
### Notes
- [x] A few spelling errors on GitHub and Singularity. Also, rocker should read Docker (**Rocker is docker for R**)
## Chapter 14
1. I totally agree with the "learn by doing" approach. Unfortunately, many times the experimental biologists are eager to get the results and do not give us enough time to experiment... This has to change!
2. I also agree with the reproducibility comment.
### Notes
- [x] 14.1. Instead of "what data needed", "What data is needed?". "What are the axes?".
- [x] 14.2. Instead of "People think they need to be an expert to teach", write "People think they need to be experts...".
- [x] 14.2. Remove "Moreover" from the sentence "Moreover, teaching is also an excellent way for you to learn and grow".
- [x] 14.2. I would replace "In my wildest dream, I can not imagine I am now serving as the" with "In my wildest dreams, I could not have imagined serving as the Chair...".
- [x] 14.3. Tweets instead of tweeps. (**people who write tweets are tweeps**)
- [x] 14.4. "Do the results make biological sense" instead of "does the results...".
- [x] 14.4. Instead of "codes" use "code" or "pieces of code".
- [x] 14.4. Rephrase "stage of formats". It is unclear. (**fixed to Data could come in different formats at different steps of data processing**)
- [x] 14.4. "You may want to split your script in two modules:...".
- [x] 14.4. "need to be documented or written down" instead of "need to be taken down".
- [x] 14.4. "start your journey in computational biology".