# Linux basics
This is a lesson plan to teach basic commands to new users of the bash command line. For other resources, check out [Software Carpentry](http://swcarpentry.github.io/shell-novice/).
## Bash basics
Bash is a Unix or Linux shell. You can think of it as a local dialect of Linux. There are some small differences in syntax compared to other shells, such as tcsh.
### Take notes!
Note taking is important! Just like you'd keep a lab notebook at the bench, if you're working at the terminal, you should keep detailed records of what you've done and explanatory notes to yourself. There are lots of ways you could do this, including using a paper notebook, Word, Google Docs, or a text editor. I recommend taking notes in [Markdown format](https://www.markdownguide.org/cheat-sheet), a simple way to annotate your notes in a text document. The major advantage to Markdown over something like Google Docs is that it is fundamentally just text, and it can be easily moved to new platforms. Most of our workflows will end up in [RStudio](https://www.rstudio.com/) where Markdown code can seamlessly become [R Markdown](https://rmarkdown.rstudio.com/) code. In this way, your entire workflow, from bash to R, can be recorded in one simple document.
You can write Markdown easily in any plain text editor. My favorite is [Atom](https://atom.io/). However, you may want to use an editor that is specialized for Markdown. Several excellent free options exist.
- [MacDown](https://macdown.uranusjr.com/) runs on your computer and renders your Markdown code in real time in a split screen.
- [StackEdit](https://stackedit.io/) is a web app or [Google Chrome Plugin](https://chrome.google.com/webstore/detail/stackedit/iiooodelglhkcpgbajoejffhijaclcdg) that provides a similar split-screen editor for cloud-based markdown note keeping.
- [HackMD](https://hackmd.io/) is my preferred Markdown editor at the moment. It's cloud-based, but you can download files or back up to Google Drive or Dropbox. You can share notes, instantly publish to the web, or keep things private.
### Every character matters
Linux is case-sensitive. Just get used to it.
More importantly, certain characters have special meaning in the bash command line, and you'll need to learn to treat them carefully.
Spaces separate commands, parameters and filenames, so a space inside a file name needs special handling and is best avoided. If a file has been imported into a Linux system from Windows or macOS with spaces in its name, the spaces must be "escaped", meaning each one is referenced by preceding it with the backslash `\`. (Wrapping the whole file name in quotes also works.)
```bash
cp My\ File.docx my.file.docx
```
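Quoting is an alternative to escaping each space. As a small sketch (the file names here are invented for illustration), both `cp` commands below refer to the same source file:

```bash
# Create a practice file whose name contains a space
echo 'Hello, World!' > "My File.txt"

# Escape the space with a backslash...
cp My\ File.txt copy.one.txt

# ...or wrap the whole name in quotes
cp "My File.txt" copy.two.txt

# Both copies should now be listed
ls
```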
Question marks (`?`) and asterisks (`*`) are "wildcards" that stand in for other characters in file names. A question mark stands for exactly one character; an asterisk stands for any number of contiguous characters.
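A quick way to see wildcards in action is to try them on a few empty files. The sketch below is illustrative only: the folder `wildcard_demo` and the file names are made up for this example, and `touch` (which creates empty files) is not otherwise covered in this lesson.

```bash
# Make a scratch folder (hypothetical name) holding a few empty files
mkdir -p wildcard_demo
cd wildcard_demo
touch sample1.txt sample2.txt sample10.txt notes.md

# ? stands in for exactly one character: matches sample1.txt and sample2.txt
ls sample?.txt

# * stands in for any number of characters: matches all three sample files
ls sample*.txt

# Return to the folder you started in
cd ..
```

Notice that `sample10.txt` is skipped by `sample?.txt`, because two characters sit where the pattern allows only one.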
To run a program (or "command"), enter its name in the command line. The function of a program can be modified by "parameters" (also called "switches" or "arguments"). These are denoted after the program's name by one or two dashes, `-`.
```bash
ls -lh
```
Other characters like the greater than `>` and less than `<` signs and the "pipe" character `|` have special purposes we'll cover below.
### Tab-completion
One of the nice things about Linux is that you don't always need to type the full name of commands or files. Just start typing the name of a long command, file or folder and hit `[TAB]`. Bash will fill it in for you, if there's only one option that fits with what you've started. If there's more than one option that would complete it, it will give you a list of those options. This becomes a real time saver!
### Getting help
#### `man`
Most Linux commands have built-in documentation, a "manual" or "man page" that can be accessed by `man` followed by the command's name. For example, to get help on the list command, type `man ls`.
#### Google it!
There is a huge community of Linux users online who post answers to questions. For most problems, a simple Google search will point the way to a solution.
### Canceling a command
Sometimes you run something that gets out of control. Most programs in Linux can be canceled (or "killed") by pressing `[control]`-`[C]`.
## Access
### Launch terminal
Start by launching a terminal app on your computer. On a Mac this is called "Terminal" and can be found under the Applications/Utilities folder. Once the terminal is running, you are working at a Unix command line on your machine, which behaves much like Linux.
### Log onto **nscc**
Many of the programs you'll need to run for bioinformatic analysis will be impractical to run on your own machine. For that reason, Colby has the [natural science computing cluster (**nscc**)](https://www.colby.edu/arc/hardwaresystems/computer-clusters/nscc/) where you'll want to do most of the intensive applications and store large files. You can log into **nscc** remotely from anywhere. First, be sure you're inside Colby's firewall. This will be the case if you're on campus and logged into the "Colby Access" wireless network. If you're somewhere more interesting, you'll need to run a [VPN client](http://www.colby.edu/its/virtual-private-network-vpn/) that provides secure access.
Only people with active Colby user accounts will be able to log on. Moreover, users are granted access to **nscc**, node 26, or other sensitive applications on the basis of need by faculty request. If you need access, talk to [Dave Angelini](https://www.colby.edu/directory/profile/dave.angelini/), the lab PI, or [Randy Downer](https://www.colby.edu/directory/profile/rhdowner/), Colby's High Performance Computing Applications Manager.
Log on using the secure shell command `ssh`. For example...
```bash
ssh username@nscc
```
When you're done working in Linux, it's good form to log out by typing `exit`. This command also exits out of node 26 and out of a screen.
### Log onto different nodes
**nscc** has separate computing nodes for different high-demand users. Our lab tends to use either node 26 or node 28. All nodes have access to the same file structure, but when logged onto a node, the commands you execute will be run on different processors. This is important to the operation of the whole system, because it means that all users aren't competing for computational resources on the same processors.
When you initially log onto **nscc** you'll be on "node 0" (also called the "head node"). The first time you want to access a different node, you must run `cluster_locksmith`. You'll only ever need to do this once.
To access node 26...
```bash
ssh n26
```
### Using `screen`
Any time you run a command in Linux that might take a while to complete or otherwise be problematic, it's useful to run `screen`. Starting a screen is like creating a new instance of your Linux session, nested within the original one. To start it just type `screen`. The screen will clear, and you'll see the same prompt. Here you can run some command and while it's running you can return to the original session. Type `[control]`-`[A]` and then `[D]` to "detach" from the current screen. Now you're back where you started. To recover the screen again enter `screen -r`. If you have multiple screens running at the same time, you'll be prompted to specify which one you want.
```bash
There are several suitable screens on:
145901.pts-1.n26 (Detached)
145764.pts-1.n26 (Detached)
Type "screen [-d] -r [pid.]tty.host" to resume one of them.
```
In this example we might return to the first one on the list by entering `screen -r 145901`.
Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night.
To end a screen session, just type `exit` from within it. It's good practice to clean up screens when you're done with them. Also avoid creating "nested screens": don't run `screen` when you're already in a screen. It gets confusing, because `[control]`-`[A]` `[D]` will return you to the "base" session.
## Looking around
Below are commands that will let you orient yourself within the bash environment.
### List files
The command `ls` will list the file and folder names in the folder where you are currently. Often you want a little more information than just their names, like their size and date. `ls -lh` provides a nice table. You can make a short "alias" for this command.
```bash
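# bash syntax (name=value, with no spaces around the equals sign)
alias lh="ls -lh"
# In tcsh the equivalent is written without the equals sign: alias lh "ls -lh"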
```
This defines `lh` as a short hand for `ls -lh`. Aliases can be very useful for common, lengthy commands.
### Where am I? - Print working directory
`pwd` prints the working directory. This will show you the path to your current location from the root.
### Storage space
`du -sh` will report the collective size of the files in the folder where you are currently. If you add a folder name to the end of this command, it will focus on that folder's contents. The `du` stands for "disk usage", as in the days when storage was on a magnetic "hard disk".
`df -h` will report free storage space.
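As a small sketch of both commands, the example below measures a folder and then checks free space on the partition that holds it. The folder name `du_demo` and its contents are invented for illustration, and the sizes reported will depend on your system.

```bash
# Create a hypothetical folder containing one small file
mkdir -p du_demo
echo 'Hello, World!' > du_demo/test.txt

# Total size of the current folder, summarized (-s) in human-readable units (-h)
du -sh

# The same report, limited to one folder
du -sh du_demo

# Free space on the partition that holds the current folder
df -h .
```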
#### A note on **nscc**'s directory structure
Just like Windows or Mac OS, Linux systems have a branching tree-like system of folders (also called directories). These folders begin at "root", which is denoted `/`. The partitions that exist at the root are organized by their speed.
`/var` is the fastest storage space on **nscc**, and it's used only by programs for temporary storage as they run. You should never attempt to copy anything there, and you can usually just ignore it. As you run bioinformatics software, keep an eye on `/var` to make sure it does not fill up. If it does the programs relying on it may crash.
`/export` is the working space on **nscc**. It is reasonably fast, and it's the place to keep files that we are actively working on. This is also where you'll find your "home directory", which has the special name `~`.
`/storage` is intermediate in its speed, slower than `/export`. It can be used for intermediate-term file storage.
`/research` is the slowest partition on **nscc**. This is where we keep files for long-term storage.
### Other users and processes that are running
It's often important to be aware of other users on **nscc** and what they're doing. Even on node 26, it's a bad idea (and impolite) to start an intensive process if someone else is already using the system heavily.
`w` is a simple command to list the other users currently logged into **nscc**. If it's someone you know, email them if needed to see when they plan to be done.
`top` provides a real-time list of all the programs running on the node, with information about which user started them, how long they've been running and how many CPUs they're using.
`ps` lists the processes you are currently running. This can be useful to check if something you've forgotten about is still going. How would you ever get into that situation? Well, if you anticipate a command will take a while to run, you can run it "in the background". Just add ` &` to the end of your command and while it runs, you'll have access to the command line prompt again.
`history` provides a list of all your commands, going back to the beginning of your session. You can instantly re-execute an old command by typing `!` followed by the command's number in `history`.
`kill` stops a process you're running. Obviously it should be used carefully. To use it, add the process ID number (PID) from `ps` after the `kill` command. The example below will kill the R run.
```bash
[drangeli@n26 ~]$ ps
PID TTY TIME CMD
142505 pts/1 00:00:00 tcsh
150197 pts/1 00:00:00 ps
110256 pts/1 00:22:57 R
[drangeli@n26 ~]$ kill 110256
```
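Putting the background `&` and `kill` together, here is a minimal sketch that uses `sleep` as a harmless stand-in for a long-running program. Several pieces are assumptions beyond what this lesson covers: the special variable `$!` (the process ID of the most recent background command), `wait` (used only to tidy up), and `kill -0` (which checks for a process without stopping it).

```bash
# Start a harmless 5-minute job in the background
sleep 300 &

# $! holds the process ID (PID) of the most recent background job
job_pid=$!
echo "Started background job with PID $job_pid"

# The job appears in the list of your running processes
ps

# Stop the job by its PID, then let the shell finish cleaning up after it
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true

# kill -0 sends no signal; it only checks whether the process still exists
if kill -0 "$job_pid" 2>/dev/null; then
    echo "job $job_pid is still running"
else
    echo "job $job_pid has stopped"
fi
```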
## Folders
Move around among folders using the `cd` command to change directories. Where you are now is `./` and the "parent" folder "above" you is `../`
Make new directories with `mkdir` followed by the name of the new folder you'd like. Remove it using `rmdir`. As a precaution, the folder must be empty for `rmdir` to delete it. To delete a folder along with everything inside it, use `rm -r` instead, and use it with great care.
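The sketch below walks through these folder commands on throwaway folders. The names `folder_demo` and `full_demo` are made up for the example. It also shows that `rmdir` refuses to remove a folder that still has something in it, while `rm -r` (the recursive form of the `rm` command described in the next section) removes the folder and its contents.

```bash
# Remember where we started
start_dir=$(pwd)

# Make a new folder (hypothetical name), move into it, and check the location
mkdir folder_demo
cd folder_demo
pwd

# Move back up to the parent folder
cd ../
pwd

# rmdir succeeds because the folder is still empty
rmdir folder_demo

# A folder with something inside it cannot be removed by rmdir
mkdir full_demo
echo 'Hello, World!' > full_demo/test.txt
rmdir full_demo || echo "rmdir refused: full_demo is not empty"

# rm -r removes the folder and everything in it. There is no undo!
rm -r full_demo
```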
## Files
### Copying, moving and deleting files
`cp` copies files. To copy a folder and all its contents use `cp -r`
`rm` deletes files. Use this with caution. There is no trash bin in Linux. Once a file is deleted it's gone.
`mv` copies files to a new location and then deletes the originals. It can be used to move files from one folder to another or to rename a single file or folder.
`scp` copies files to or from your computer and **nscc**.
This will only work when executed from your machine, referencing `nscc` as the remote system.
Start a new terminal window on your machine, where you don't log into **nscc**.
```bash
echo 'Hello, World!' > test.txt
scp test.txt username@nscc:~
```
In this example `test.txt` will be copied to your home folder on **nscc**.
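The local commands `cp`, `mv` and `rm` can be tried safely on throwaway files. This is a sketch with invented file and folder names (`copy_demo`, `original.txt` and so on), not part of any real workflow.

```bash
# Set up a hypothetical folder with one file in it
mkdir -p copy_demo
echo 'Hello, World!' > copy_demo/original.txt

# Copy a single file
cp copy_demo/original.txt copy_demo/duplicate.txt

# Rename a file by "moving" it to a new name
mv copy_demo/duplicate.txt copy_demo/renamed.txt

# Copy the folder and everything inside it
cp -r copy_demo copy_demo_backup

# Delete one file from the first folder. There is no undo!
rm copy_demo/renamed.txt

# The backup still holds both files; the first folder holds only the original
ls copy_demo copy_demo_backup
```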
### Examining file contents
Linux provides lots of ways to examine the contents of files, which is excellent for big data.
`cat` displays the entire contents of a file to the screen.
`head` shows the first 10 lines of a file. If you want a number other than 10, for example 4, use `head -n 4`
`tail` shows the last 10 lines. If the file is actively being added to by a program, you can have `tail` continuously write out the contents as they're added in with `tail -F`
`less` opens a program where you can view the entire contents of a file, one page at a time. When you're in `less`, you can also search forward by hitting `/` and entering your search string, then hitting enter. You can also search backward in the file with `?`.
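To try `head` and `tail` on something predictable, the sketch below builds a 20-line file of numbers. The file name `twenty.txt` is invented, and `seq` (which prints a sequence of numbers) is an assumption beyond what this lesson covers.

```bash
# Build a 20-line practice file (hypothetical name) with one number per line
seq 1 20 > twenty.txt

# First 10 lines (the default)
head twenty.txt

# First 4 lines only
head -n 4 twenty.txt

# Last 3 lines only
tail -n 3 twenty.txt
```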
### Redirects
By default, the output generated by programs in Linux goes to "stdout", which is normally the screen. Sometimes you don't want that. You can redirect that output using a few special characters.
The greater than sign `>` redirects the output of a program to a new file. For example, `ls > filenames.txt` will create a new file with the output from `ls`. If you want to add new output to the end of an existing file use `>>`
If you want to send the output of one program to another program use the "pipe" character `|`. We'll see an example in the next section.
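Here is a brief sketch of all three redirects on a scratch file. The file name is made up, and `wc -l` (which counts lines) is an assumption beyond what this lesson covers. Note that `>` overwrites a file that already exists, so use it with the same care as `rm`.

```bash
# > creates the file, or overwrites it if it already exists
echo 'first line' > redirect_demo.txt

# >> appends to the end of the existing file
echo 'second line' >> redirect_demo.txt
cat redirect_demo.txt

# | sends the output of cat to wc -l, which counts the lines it receives
cat redirect_demo.txt | wc -l
```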
### Searching with `grep`
It's often useful to search the contents of files or the output of other programs for particular sequences of characters. To do this in Linux, use the command `grep`. Try this...
```bash
history | grep 'cd'
```
This will filter the output of `history` showing only the lines where you used the `cd` command (or where the characters "cd" occurred).
Grep also works with [regular expressions (regex)](https://en.wikipedia.org/wiki/Regular_expression). Refer to chapters 2-3 of [Haddock & Dunn 2011](https://drive.google.com/drive/folders/1BEHT3rbQi8DQip2FyELXNhKxPPa0evLK?usp=sharing) for more details on regex. The example below will show only lines from `history` that have a two-letter word beginning with 'l', flanked by spaces.
```bash
history | grep '\sl\w\s'
```
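`grep` can also search a file directly, and a few switches change what it reports. The sketch below uses an invented file of short sequences. The switches shown are `-i` (ignore case), `-c` (count matching lines) and `-v` (show lines that do not match); `printf` is used only to write several lines at once.

```bash
# Build a small practice file (hypothetical name and contents)
printf 'ATGCCC\ncccatg\nTTTTAA\n' > grep_demo.txt

# Show lines containing the string exactly as typed
grep 'ATG' grep_demo.txt

# -i ignores the difference between upper and lower case
grep -i 'ATG' grep_demo.txt

# -c counts the matching lines instead of printing them
grep -c -i 'ATG' grep_demo.txt

# -v inverts the search, showing lines that do NOT match
grep -v -i 'ATG' grep_demo.txt
```

Keep in mind that `-c` counts lines, not individual matches, so a line with two matches is still counted once.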
### Editing files
Sometimes the easiest way to edit a file will be to copy it from **nscc** to your computer and then copy it back. Often however, that will be impossible because of the file's size. Linux has several tools for file editing.
#### Interactive text editors
`nano` is a user-friendly interactive text editor for Linux. It works in a separate screen, where you can move the cursor with arrow keys, add new text, or type over old text, loosely similar to Word.
`vim` is an old-school Linux text editor. Unlike `nano`, its default mode is not direct editing. Instead, different keys perform specific commands, like deleting whole words or lines. Read the `man vim` entry for more details. Exiting out of `vim` can be difficult to figure out if you run it without first reading the instructions! (To exit without saving, press [ESC], then type `:q!`. To save and exit, type `:wq` instead.)
#### Text processing commands
Unlike the editors described above, Linux has other tools that can edit file contents or the output of other programs in a predefined way. This can be really useful for bioinformatic work, where you may want to perform the same change to hundreds or millions of lines in a file.
##### `sed`
`sed` is the most commonly used of these text processing commands. It has many applications, but one of the most frequent is to substitute one text string for another. The basic syntax is that the instructions to `sed` must be inside of quotes. `s` tells the program it should make a substitution. Then the search and replacement strings are separated by slashes.
```bash
echo 'Hello, World!' > test.txt
sed 's/Hello/Hola/' test.txt > prueba.txt
cat prueba.txt
```
This becomes really powerful when combined with [regex](https://en.wikipedia.org/wiki/Regular_expression). The example below replaces the first entire word beginning with "H".
```bash
sed 's/H\w*/Greetings/' test.txt > regex.test.txt
cat regex.test.txt
```
`sed` can replace multiple matches if you add `g` after the last slash. This example replaces all vowels with asterisks. (Note that because the asterisk is a special character, acting as a wildcard in the shell and as a repeat operator in regex, it is preceded here by a backslash.)
```bash
sed 's/[aeiou]/\*/g' test.txt
```
You can also use `sed` to delete characters or words by simply leaving the replacement part of the substitution empty. This example will remove any commas or exclamation points.
```bash
sed 's/[\,\!]//g' test.txt
```
`sed` can also be used to delete an entire line from a file, based on the presence of a particular character string.
```bash
echo 'Do, Ra, Mi, Fa, So, La, Ti...' >> test.txt
cat test.txt
sed '/La/d' test.txt
```
You can also delete lines by their number. This example deletes the first line.
```bash
sed '1d' test.txt
```
For longer files, you can delete a range of lines. For example `sed '5,10d;12d' my.long.file.txt` would delete lines 5-10 and 12.
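That range deletion can be tried on a numbered practice file, where it is easy to see which lines disappear. The file name `fifteen.txt` is invented, and `seq` is an assumption beyond what this lesson covers. Note that `sed` only prints the edited text; the original file is left unchanged unless you redirect the output to a new file.

```bash
# Build a 15-line practice file (hypothetical name) with one number per line
seq 1 15 > fifteen.txt

# Delete lines 5 through 10, and also line 12
sed '5,10d;12d' fifteen.txt

# Save the result to a new file; fifteen.txt itself still has all 15 lines
sed '5,10d;12d' fifteen.txt > trimmed.txt
cat trimmed.txt
```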
There are other text processing programs that can be useful for specific purposes.
##### `cut`
`cut` is useful to remove or select individual columns from a CSV or TSV file. The delimiter can be specified using the `-d` switch. (The default assumes that a tab character, `\t`, is the delimiter.) The number of the column is given by the `-f` switch. The example below will output just the third column.
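```bash
# printf interprets \n as a new line (bash's echo does not, unless given -e)
printf '1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > numbers.csv
cat numbers.csv
cut -d "," -f 3 numbers.csv
```
(Note that in the example above `\n` was used to stand in for a new line. `printf` is used rather than `echo` because bash's `echo` does not interpret `\n` by default.)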
If you want multiple columns, specify their range `-f 1-2` or individual numbers `-f 1,3`.
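Both forms of column selection are sketched below on a tiny table. The file name and its contents are made up for illustration.

```bash
# Build a small CSV file (hypothetical name and contents)
printf 'gene,length,count\nA,100,7\nB,250,3\n' > cut_demo.csv

# A range of columns: the first through the second
cut -d "," -f 1-2 cut_demo.csv

# Individual columns: the first and the third
cut -d "," -f 1,3 cut_demo.csv
```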
##### `sort`
In bioinformatics, you may find yourself in a situation where you want to sort long lists in files and perhaps remove duplicates. `sort` is the obvious tool here. It can also be used to eliminate duplicates with the `-u` switch. By default `sort` treats characters as plain text. So, it will place "10", between "1" and "2". If you want it to treat digits as numbers, add the `-g` switch.
```bash
shuf -i 1-10 -r -n 12 > random.txt
cat random.txt
sort random.txt
sort -u random.txt
sort -u -g random.txt
```
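Because the example above uses random numbers, the output changes on every run. A fixed list makes the difference between text and numeric sorting easier to see. The file name and values below are invented for illustration.

```bash
# Build a short practice file (hypothetical name) that includes a duplicate
printf '10\n2\n1\n2\n' > sort_demo.txt

# Plain-text order puts 10 between 1 and 2: 1, 10, 2, 2
sort sort_demo.txt

# Numeric order with duplicates removed: 1, 2, 10
sort -u -g sort_demo.txt
```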
##### `awk`
`awk` is a powerful but arcane tool for text processing. It is a programming language in its own right. It can do some useful things, such as calculate the length of each line in a file.
```bash
cat test.txt | awk '{ print length($0); }'
```
The `awk` code below will calculate mean and standard deviation from a list of numbers.
```bash
shuf -i 1-1000000 -r -n 1000 > random.txt
cat random.txt | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} END {for (i=1;i<=NF;i++) { printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)} }'
```
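The one-liner above is dense. A stripped-down sketch of the same idea, computing only the mean of a single column, may be easier to read. The file name and values are made up. In `awk`, `$1` is the first field on each line, `NR` is the number of lines read so far, and the `END` block runs once after the last line.

```bash
# Build a short practice file (hypothetical name) with one number per line
printf '2\n4\n6\n' > awk_demo.txt

# Add each line's first field to a running total, then divide by the line count
awk '{ sum += $1 } END { print sum / NR }' awk_demo.txt
```

For this input the mean is 4. The longer command extends the same pattern by keeping a second running total of squared values, which is what the standard deviation needs.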
## Practice Problems
Each of the questions applies to the file `/export/groups/drangeli/rnaseq.sample.fq`. Start by copying it to your home directory.
Solutions to these problems can be found [here](https://hackmd.io/@dts8RULgQqi0n0PPDKh7JQ/Sy0CtEq-S).
### Problem B1
How frequent is the [*Eco*RI](https://www.neb.com/products/r0101-ecori) recognition sequence, GAATTC?
### Problem B2
Create a file containing only the ID lines. Replace colons and space characters with tabs.
### Problem B3
In each of the [FASTQ ID lines](https://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers) the 5th number is the "tile number". How many different tile numbers are represented in this file?
## Challenge Problems
### Problem A1
Using bash commands and/or [R](https://hackmd.io/@aphanotus/Rtutorial) determine the median distance between genes in *Oncopeltus fasciatus*. Use the Official Gene Set version 1.2, which is available at `/research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff`
### Problem A2
[Create a plot](https://hackmd.io/@aphanotus/Rtutorial#Plots-with-base-R) of the distribution of inter-gene distances.
### Problem A3
How does this compare to some [other insects](https://www.ncbi.nlm.nih.gov/data-hub/taxonomy/50557/)?
---
*Dave Angelini*, 2018