Substitute scientific with common species names in a phylogenetic tree file
Step 1 - Generate a table with the scientific-common name correspondence
We need the correspondence between the scientific and common species names as described in the NCBI Taxonomy Database.
We want to do this for any number of species automatically, so we download the entire archive taxdump.tar.gz from the NCBI taxonomy database dump.
This archive contains the `names.dmp` file, in which each line holds a taxon ID, a name, a unique name and a name class, separated by `|`.
Make a new folder for this exercise and `cd` into it. Download the archive and extract the `names.dmp` file.
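To make the layout of `names.dmp` concrete, here is a minimal sketch that builds a tiny hand-written excerpt and splits it with awk. The file names `sample.names.dmp` and `sample.fields.txt` are hypothetical, the four sample lines are typed by hand rather than copied from the archive, and the download URL in the comments is an assumption about the NCBI FTP layout.

```shell
# In a real run, names.dmp would come from the NCBI archive, for example
# (assumed URL, not executed in this sketch):
#   mkdir taxonomy && cd taxonomy
#   wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
#   tar -xzf taxdump.tar.gz names.dmp

# Hand-written excerpt in the names.dmp layout: fields separated by <TAB>|<TAB>.
{
    printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n'
    printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'
    printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n'
    printf '10090\t|\thouse mouse\t|\t\t|\tgenbank common name\t|\n'
} > sample.names.dmp

# Split on "|" and strip the surrounding tabs to see the individual fields.
awk -F'|' '{
    for (i = 1; i <= 4; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
    printf "id=%s name=%s class=%s\n", $1, $2, $4
}' sample.names.dmp > sample.fields.txt
cat sample.fields.txt
```

Each taxon ID appears on several lines, one per name class, which is why the names need to be collected onto a single line before they can be used for a replacement.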
Step 2 - Edit the phylogenetic tree file
The phylogenetic tree file used for the 100way alignment is hg38.100way.scientificNames.nh.
It can be downloaded from here and details can be found here.
Download the file (4.1KB).
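The tree file is in the Newick tree format: nested parentheses group related species, and each species name is followed by `:` and a branch length.
See a description of the Newick tree format here.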
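As an illustration of that layout, the sketch below writes a miniature hand-made tree and lists its leaf names with awk. The file names are hypothetical, the branch lengths are made up, and the pattern assumes that species labels start with a letter and contain only letters, digits and `_`.

```shell
# Hand-written miniature tree in Newick format (branch lengths are made up).
cat > sample.tree.nh <<'EOF'
((Homo_sapiens:0.1,
  Pan_troglodytes:0.1):0.2,
 Mus_musculus:0.5);
EOF

# Print every label that is directly followed by ":" (the leaf names here).
awk '{
    s = $0
    while (match(s, /[A-Za-z][A-Za-z0-9_]*:/)) {
        print substr(s, RSTART, RLENGTH - 1)   # drop the trailing ":"
        s = substr(s, RSTART + RLENGTH)        # continue after this label
    }
}' sample.tree.nh > sample.leaves.txt
cat sample.leaves.txt
```

Listing the labels first is a cheap way to see exactly which strings have to be matched against the taxonomy table.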
The phylogenetic tree can be visualized online at https://itol.embl.de/ (note that this application takes care of removing the `_` from the scientific names).
Before: tree leaves labelled with scientific names.
After: tree leaves labelled with common names.
Step 1
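Line endings are worth checking before any awk step, because a carriage return left at the end of a line becomes part of the last field and silently breaks string comparisons. A minimal sketch, assuming a POSIX shell and awk; the file names are hypothetical and the sample content is hand-written:

```shell
# Hand-written two-line file saved with CRLF line terminators.
printf 'Homo_sapiens:0.1,\r\nMus_musculus:0.5\r\n' > sample.crlf.txt

# Count the lines that end in a carriage return (0 means UNIX format).
awk '/\r$/ { n++ } END { print n + 0 }' sample.crlf.txt

# Strip the trailing carriage return from every line, then count again.
awk '{ sub(/\r$/, ""); print }' sample.crlf.txt > sample.unix.txt
awk '/\r$/ { n++ } END { print n + 0 }' sample.unix.txt
```

Where they are installed, `file` reports the line terminators and `dos2unix` performs the same conversion.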
The files are in Windows/DOS format (`ASCII text, with CRLF line terminators`), which makes awk misbehave. Check your files and convert them to UNIX format if necessary.
Let's first tabulate the NCBI Taxonomy Database in a more convenient format for us: get the relevant information on a single line, replace some spaces with the underscore symbol `_`, remove the extra blanks in front of and after the names, etc. The result is written to `names.tab`.
This might not be the best solution, but it is easy to read and modify for now. Note that we do not need to sort, but the final result looks better in order.
Code will appear here after some discussions.
Just refresh the page when it is revealed.
01.tabulate-names.awk
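The author's script is not reproduced here. As a stand-in, the following is a minimal sketch of one way to do the tabulation, not necessarily the author's method. It assumes the output layout `tax_id|Scientific_name|common name`, takes the common name from the `genbank common name` class, and skips taxa that lack one; these choices and the `sample.*` file names are assumptions for illustration.

```shell
# Hand-written excerpt in the names.dmp layout (stand-in for the real file).
{
    printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n'
    printf '9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n'
    printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n'
    printf '10090\t|\thouse mouse\t|\t\t|\tgenbank common name\t|\n'
    printf '9598\t|\tPan troglodytes\t|\t\t|\tscientific name\t|\n'
} > sample.names.dmp

awk 'BEGIN { FS = "|"; OFS = "|" }
{
    # Remove the blanks and tabs in front of and after each field.
    for (i = 1; i <= 4; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
}
# Scientific names get "_" instead of spaces, as in the tree file.
$4 == "scientific name"     { gsub(/ /, "_", $2); sci[$1] = $2 }
# Assumption: the "genbank common name" class supplies the common name.
$4 == "genbank common name" { com[$1] = $2 }
END {
    # One line per taxon that has both names; taxa without a common name are skipped.
    for (id in sci) if (id in com) print id, sci[id], com[id]
}' sample.names.dmp | sort -t'|' -k1,1n > sample.names.tab
cat sample.names.tab
```

The sort is optional, as noted above; it only orders the table numerically by taxon ID.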
Step 2
Now we can use the tabulated data in `names.tab` and perform the replacement in `hg38.100way.scientificNames.nh` by matching the scientific names in `$2` with the common names in `$3` (we use `FS="|"`).
Again, this might not be the best way, but it works. The suggested solutions could easily be merged into a single script. I prefer to keep them as separate steps, so I can make sure that the first step has completed successfully (it takes some time) before I continue. I can also filter out unnecessary data from the newly tabulated file and use only the relevant data, or alter it further if needed.
Code will appear here after some discussions.
Just refresh the page when it is revealed.
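The author's script is not reproduced here either. The sketch below shows one possible way to do the replacement with a two-file awk pass; it is not necessarily the author's solution. It assumes a table with the scientific name in `$2` and the common name in `$3`, uses hand-written `sample.*` files with made-up branch lengths, and replaces spaces in common names with `_` so that the labels stay valid in Newick; that last step is my assumption.

```shell
# Hand-written stand-ins for names.tab and the tree file (branch lengths made up).
printf '9606|Homo_sapiens|human\n10090|Mus_musculus|house mouse\n' > sample.names.tab
cat > sample.tree.nh <<'EOF'
((Homo_sapiens:0.1,
  Pan_troglodytes:0.1):0.2,
 Mus_musculus:0.5);
EOF

awk 'BEGIN { FS = "|" }
# First file (table): remember the common name ($3) for each scientific name ($2).
NR == FNR {
    common = $3
    gsub(/ /, "_", common)   # assumption: keep labels free of spaces for Newick
    map[$2] = common
    next
}
# Second file (tree): rewrite each label that is followed by ":".
{
    rest = $0; out = ""
    while (match(rest, /[A-Za-z][A-Za-z0-9_]*:/)) {
        name = substr(rest, RSTART, RLENGTH - 1)
        if (name in map) name = map[name]    # names missing from the table stay unchanged
        out  = out substr(rest, 1, RSTART - 1) name ":"
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' sample.names.tab sample.tree.nh > sample.commonNames.nh
cat sample.commonNames.nh
```

Names missing from the table, such as `Pan_troglodytes` in this sample, are left untouched, which makes it easy to spot species that still need a common name.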
tags: `awk` `bioawk` `UPPMAX` `SNIC`