Linux basics

This is a lesson plan to teach basic commands to new users of the bash command line. For other resources, check out Software Carpentry.

Bash basics

Bash is a Unix or Linux shell. You can think of it as a local dialect of Linux. There are some small differences in syntax compared to other shells, such as tcsh.

Take notes!

Note taking is important! Just like you'd keep a lab notebook at the bench, if you're working at the terminal, you should keep detailed records of what you've done and explanatory notes to yourself. There are lots of ways you could do this, including using a paper notebook, Word, Google Docs, or a text editor. I recommend taking notes in Markdown format, a simple way to annotate your notes in a text document. The major advantage of Markdown over something like Google Docs is that it is fundamentally just text, and it can be easily moved to new platforms. Most of our workflows will end up in RStudio, where Markdown code can seamlessly become R Markdown code. In this way, your entire workflow, from bash to R, can be recorded in one simple document.

You can write Markdown easily in any plain text editor. My favorite is Atom. However, you may want to use an editor that is specialized for Markdown. Several excellent free options exist.

MacDown runs on your computer and renders your Markdown code in real time in a split screen.
StackEdit is a web app or Google Chrome plugin that provides a similar split-screen editor for cloud-based Markdown note keeping.
HackMD is my preferred Markdown editor at the moment. It's cloud-based, but you can download files or back up to Google Drive or Dropbox. You can share notes, instantly publish to the web, or keep things private.

Every character matters

Linux is case-sensitive. Just get used to it.

More importantly, certain characters have special meaning in the bash command line, and you'll need to learn to treat them carefully.

Spaces separate commands, parameters and filenames, so it's best to avoid space characters in file names. If a file has been imported into a Linux system from Windows or Mac OS X, any spaces in its name must be "escaped", meaning each one is preceded by a backslash \ when you refer to the file.

cp My\ File.docx my.file.docx
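As a minimal sketch of the escaping rule above, the block below creates a file with a space in its name inside a throwaway folder and copies it twice: once with the backslash escape and once by quoting the whole name, which bash treats the same way. The folder variable demo_dir and the name my.file.copy.docx are made up for this illustration.

```shell
# Work in a throwaway folder (demo_dir is a name chosen for this sketch).
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Quotes let us create a file whose name contains a space.
touch "My File.docx"

# Escape the space with a backslash, as in the lesson's example.
cp My\ File.docx my.file.docx

# Quoting the whole name has the same effect.
cp "My File.docx" my.file.copy.docx

ls
```

Without the backslash or the quotes, cp would see "My" and "File.docx" as two separate file names.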

The question mark (?) and the asterisk (*) are "wild cards" that stand in for other characters in file names: a question mark for exactly one character, an asterisk for any number of contiguous characters (including none).
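To see the two wild cards side by side, here is a small sketch that makes a few empty files in a throwaway folder and lets bash expand each pattern. The file names and the variables demo_dir, one_char and any_chars are invented for the example.

```shell
# Work in a throwaway folder so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sample1.txt sample2.txt sample10.txt notes.md

# ? stands in for exactly one character,
# so sample10.txt (two characters after "sample") is not matched.
one_char=$(echo sample?.txt)
echo "$one_char"

# * stands in for any number of characters,
# so all three sample files are matched, but notes.md is not.
any_chars=$(echo sample*.txt)
echo "$any_chars"
```

Using echo with a pattern is a safe way to preview what a wild card will match before handing it to a command like rm.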

To run a program (or "command"), enter its name in the command line. The function of a program can be modified by "parameters" (also called "switches" or "arguments"). These are denoted after the program's name by one or two dashes, -.

ls -lh

Other characters like the greater than > and less than < signs and the "pipe" character | have special purposes we'll cover below.

Tab-completion

One of the nice things about Linux is that you don't always need to type the full name of commands or files. Just start typing the name of a long command, file or folder and hit [TAB]. Bash will fill it in for you, if there's only one option that fits with what you've started. If there's more than one option that would complete it, it will give you a list of those options. This becomes a real time saver!

Getting help

`man`

Most Linux commands have built-in documentation, a "manual" or "man page" that can be accessed by man followed by the command's name. For example, to get help on the list command, type man ls.

Google it!

There is a huge community of Linux users online who post answers to questions. For most problems, a simple Google search will point the way to a solution.

Canceling a command

Sometimes you run something that gets out of control. Most programs in Linux can be canceled (or "killed") by pressing [control]-[C].

Access

Launch terminal

Start by launching a terminal app on your computer. On a Mac this is called "Terminal" and can be found under the Applications/Utilities folder. Once the terminal is running, you are working in a Unix shell on your machine, where most of the commands in this lesson behave as they do in Linux.

Log onto nscc

Many of the programs you'll need to run for bioinformatic analysis will be impractical to run on your own machine. For that reason, Colby has the natural science computing cluster (nscc), where you'll want to run most of the intensive applications and store large files. You can log into nscc remotely from anywhere. First, be sure you're inside Colby's firewall. This will be the case if you're on campus and logged into the "Colby Access" wireless network. If you're somewhere more interesting, you'll need to run a VPN client that provides secure access.

Only people with active Colby user accounts will be able to log on. Moreover, users are granted access to nscc, node 26, or other sensitive applications on the basis of need by faculty request. If you need access, talk to Dave Angelini, the lab PI, or Randy Downer, Colby's High Performance Computing Applications Manager.

Log on using the secure shell command ssh. For example…

ssh username@nscc

When you're done working in Linux, it's good form to log out by typing exit. This command also exits out of node 26 and out of a screen.

Log onto different nodes

nscc has separate computing nodes for different high-demand users. Our lab tends to use either node 26 or node 28. All nodes have access to the same file structure, but when logged onto a node, the commands you execute will be run on different processors. This is important to the operation of the whole system, because it means that all users aren't competing for computational resources on the same processors.

When you initially log onto nscc you'll be on "node 0" (also called the "head node"). The first time you want to access a different node, you must run cluster_locksmith. You'll only ever need to do this once.

To access node 26…

ssh n26

Using `screen`

Any time you run a command in Linux that might take a while to complete or otherwise be problematic, it's useful to run screen. Starting a screen is like creating a new instance of your Linux session, nested within the original one. To start it, just type screen. The screen will clear, and you'll see the same prompt. Here you can run some command, and while it's running you can return to the original session. Type [control]-[A] and then [D] to "disconnect" from the current screen. Now you're back where you started. To recover the screen again, enter screen -r. If you have multiple screens running at the same time, you'll be prompted to specify which one you want.

There are several suitable screens on:
	145901.pts-1.n26	(Detached)
	145764.pts-1.n26	(Detached)
Type "screen [-d] -r [pid.]tty.host" to resume one of them.

In this example, we might return to the first one on the list by entering screen -r 145901.

Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night.

To end a screen session, just type exit from within it. It's good practice to clean-up screens when you're done with them. Also avoid creating "nested screens". Don't run screen when you're already in a screen. It gets confusing, because [control]-[A] [D] will return you to the "base" session.

Looking around

Below are commands that will let you orient yourself within the bash environment.

List files

The command ls will list the file and folder names in the folder where you are currently. Often you want a little more information than just their names, like their size and date. ls -lh provides a nice table. You can make a short "alias" for this command.

alias lh="ls -lh"

This defines lh as shorthand for ls -lh. (The syntax above is for bash; in tcsh the equivalent is alias lh "ls -lh".) Aliases can be very useful for common, lengthy commands.
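The short sketch below, written for bash, defines the alias, prints its definition back, and then removes it. The variable name definition is just a placeholder for this example.

```shell
# Define the alias (bash syntax: no spaces around the = sign).
alias lh="ls -lh"

# Given only a name, alias prints the current definition.
definition=$(alias lh)
echo "$definition"

# Remove the alias when it is no longer wanted.
unalias lh
```

An alias typed at the prompt only lasts for the current session. In bash, a common way to keep one around is to add the alias line to the ~/.bashrc file in your home directory.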

Where am I? - Print working directory

pwd prints the working directory. This will show you the path to your current location from the root.

Storage space

du -sh will report the collective size of the files in the folder where you are currently. If you add a folder name to the end of this command, it will focus on that folder's contents. The du stands for "disk usage", as in the days when storage was on a magnetic "hard disk".

df -h will report free storage space.
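As a rough illustration of the two storage commands, the block below points both at a small throwaway folder. The folder variable demo_dir and the file data.txt are invented for the example; on nscc you would name a real folder instead.

```shell
# Make a throwaway folder with one small file in it.
demo_dir=$(mktemp -d)
printf 'some example data\n' > "$demo_dir/data.txt"

# Total size of that folder's contents, in human-readable units.
usage=$(du -sh "$demo_dir")
echo "$usage"

# Free space on the file system that holds that folder.
df -h "$demo_dir"
```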

A note on nscc's directory structure

Just like Windows or Mac OS, Linux systems have a branching tree-like system of folders (also called directories). These folders begin at "root", which is denoted /. On nscc, the partitions that exist at the root are organized by their speed.

/var is the fastest storage space on nscc, and it's used only by programs for temporary storage as they run. You should never attempt to copy anything there, and you can usually just ignore it. As you run bioinformatics software, keep an eye on /var to make sure it does not fill up. If it does the programs relying on it may crash.

/export is the working space on nscc. It is reasonably fast, and it's the place to keep files that we are actively working on. This is also where you'll find your "home directory", which has the special name ~.

/storage is intermediate in its speed, slower than /export. It can be used for intermediate-term file storage.

/research is the slowest partition on nscc. This is where we keep files for long-term storage.

Other users and processes that are running

It's often important to be aware of other users on nscc and what they're doing. Even on node 26, it's a bad idea (and impolite) to start an intensive process if someone else is already using the system heavily.

w is a simple command to list the other users currently logged into nscc. If it's someone you know, email them if needed to see when they plan to be done.

top provides a real-time list of all the programs running on the node, with information about which user started them, how long they've been running and how many CPUs they're using.

ps lists the processes you are currently running. This can be useful to check if something you've forgotten about is still going. How would you ever get into that situation? Well, if you anticipate a command will take a while to run, you can run it "in the background". Just add & to the end of your command and while it runs, you'll have access to the command line prompt again.

kill stops a process you're running, if you need to. Obviously it should be used carefully. To use it, add the process ID number from ps after the kill command. The example below will kill the R run.

history provides a list of all your commands, going back to the beginning of your session. You can instantly re-execute an old command by typing ! followed by the command's number in history.

[drangeli@n26 ~]$ ps
   PID TTY          TIME CMD
142505 pts/1    00:00:00 tcsh
150197 pts/1    00:00:00 ps
110256 pts/1    00:22:57 R
[drangeli@n26 ~]$ kill 110256
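The same start-in-the-background, find, and kill cycle can be tried safely with a harmless command. This is only a sketch in bash: sleep stands in for a long analysis, the variable pid is a name chosen for the example, and the built-in jobs command is used here as a quick way to list what this shell has running in the background.

```shell
# Start a long-running command in the background with &.
sleep 30 &

# $! holds the process ID of the most recent background command.
pid=$!
echo "Background process ID: $pid"

# jobs lists the background commands started from this shell.
jobs

# Stop the process by its ID, then wait until the shell registers that it ended.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

At an interactive prompt you would more often read the ID from the PID column of ps, as in the example above, and type it after kill by hand.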

Folders

Move around among folders using the cd command to change directories. Where you are now is ./ and the "parent" folder "above" you is ../

Make new directories with mkdir followed by the name of the new folder you'd like. Remove a directory using rmdir. As a precaution, the folder must be empty for rmdir to delete it. To delete a folder together with everything inside it, use rm -r instead, with care.
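Here is a small sketch of the folder commands working together in a throwaway location. The folder names project and full, and the variable demo_dir, are made up for the illustration. It also shows rmdir refusing to delete a folder that still has a file in it, and rm -r removing that folder along with its contents.

```shell
# Work in a throwaway folder.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Make a folder, step into it, check where we are, and step back out.
mkdir project
cd project
pwd
cd ..

# An empty folder can be removed with rmdir.
rmdir project

# A folder with contents cannot: rmdir reports an error and leaves it alone.
mkdir full
touch full/file.txt
rmdir full 2>/dev/null || echo "rmdir refused: full is not empty"

# rm -r deletes the folder and everything inside it.
rm -r full
```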

Files

Copying, moving and deleting files

cp copies files. To copy a folder and all its contents use cp -r

rm deletes files. Use this with caution. There is no trash bin in Linux. Once a file is deleted it's gone.

mv copies files to a new location and then deletes the originals. It can be used to move files from one folder to another or to rename a single file or folder.
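A short sketch of cp, mv and rm on disposable files may help make the differences concrete. All file and folder names here (original.txt, archive, and so on) are invented for the example.

```shell
# Work in a throwaway folder.
demo_dir=$(mktemp -d)
cd "$demo_dir"
echo 'Hello, World!' > original.txt

# cp leaves the original in place and makes a second file.
cp original.txt copy.txt

# mv with a new name in the same folder renames the file.
mv copy.txt renamed.txt

# mv with a folder as the destination moves the file into it.
mkdir archive
mv renamed.txt archive/

# cp -r copies a folder and all of its contents.
cp -r archive archive.copy

# rm deletes for good; there is no undo.
rm original.txt

ls archive archive.copy
```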

scp copies files between your computer and nscc.
This will only work when executed from your machine, referencing nscc as the remote system.

Start a new terminal window on your machine, where you don't log into nscc.

echo 'Hello, World!' > test.txt
scp test.txt username@nscc:~

In this example, test.txt will be copied to your home folder on nscc.

Examining file contents

Linux provides lots of ways to examine the contents of files, which is excellent for big data.

cat displays the entire contents of a file to the screen.

head shows the first 10 lines of a file. If you want a number other than 10, for example 4, use head -n 4

tail shows the last 10 lines. If the file is actively being added to by a program, you can have tail continuously write out the contents as they're added in with tail -F

less opens a program where you can view the entire contents of a file, one page at a time. When you're in less, you can also search forward by hitting / and entering your search string, then hitting enter. You can also search backward in the file with ?. Press q to quit and return to the prompt.
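To try head and tail without needing a real data file, this sketch builds a 20-line file of numbers with seq and looks at its two ends. The file name lines.txt and the variable demo_dir are placeholders for the example.

```shell
# Work in a throwaway folder.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# seq writes the numbers 1 to 20, one per line.
seq 1 20 > lines.txt

# The first 10 lines (the default).
head lines.txt

# Only the first 4 lines.
head -n 4 lines.txt

# Only the last 3 lines.
tail -n 3 lines.txt
```

The same -n switch works for tail, so the two commands mirror each other.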

Redirects

By default, the output generated by programs in Linux goes to "stdout", which is normally the screen. Sometimes you don't want that. You can redirect that output using a few special characters.

The greater than sign > redirects the output of a program to a file, creating the file if needed and overwriting any existing file of that name. For example, ls > filenames.txt will create a new file with the output from ls. If you want to add new output to the end of an existing file, use >> instead.

If you want to send the output of one program to another program use the "pipe" character |. We'll see an example in the next section.
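The three redirection characters described above can be compared in one small sketch. The files a.txt and b.txt are empty placeholders created just so that ls has something to list.

```shell
# Work in a throwaway folder with two empty files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.txt b.txt

# > sends the output of ls into a file instead of the screen.
# filenames.txt appears in its own listing, because the shell
# creates the file before ls starts running.
ls > filenames.txt

# >> adds a line to the end of the file without erasing it.
echo 'one more line' >> filenames.txt
cat filenames.txt

# | hands the output of ls straight to another program,
# here wc -l, which counts lines.
ls | wc -l
```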

Searching with `grep`

It's often useful to search the contents of files or the output of other programs for particular sequences of characters. To run such a search in Linux, use the command grep. Try this…

history | grep 'cd'

This will filter the output of history showing only the lines where you used the cd command (or where the characters "cd" occurred).

Grep also works with regular expressions (regex). Refer to chapters 2-3 of Haddock & Dunn 2011 for more details on regex. The example below will show only lines from history that have a two-letter word beginning with 'l', flanked by spaces.

history | grep '\sl\w\s'
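grep works on files in the same way it works on piped output. As a sketch, the block below searches a small made-up file, commands.txt, that imitates a list of typed commands. It also uses two extras that are standard grep features: the -c switch, which counts matching lines instead of printing them, and the regex anchor ^, which matches only at the start of a line.

```shell
# Work in a throwaway folder with a small example file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'cd ~\nls -lh\npwd\ncd ..\nls -a\n' > commands.txt

# Print every line that contains the characters "cd".
grep 'cd' commands.txt

# Count the matching lines rather than printing them.
grep -c 'cd' commands.txt

# Print only the lines that begin with "ls".
cat commands.txt | grep '^ls'
```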

Editing files

Sometimes the easiest way to edit a file will be to copy it from nscc to your computer and then copy it back. Often however, that will be impossible because of the file's size. Linux has several tools for file editing.

Interactive text editors

nano is a user-friendly interactive text editor for Linux. It works in a separate screen, where you can move the cursor with arrow keys, add new text, or type over old text, loosely similar to Word.

vim is an old-school Linux text editor. Unlike nano, its default mode is not as a direct editor. Instead, different keys perform specific commands, such as deleting whole words or lines. Read the man vim entry for more details. Exiting out of vim can be difficult to figure out if you run it without first reading the instructions! (To exit without saving, press [ESC], then type :q! and hit enter. To save your changes and exit, type :wq instead.)

Text processing commands

In addition to the editors described above, Linux has other tools that can edit file contents or the output of other programs in a predefined way. This can be really useful for bioinformatic work, where you may want to perform the same change on hundreds or millions of lines in a file.

`sed`

sed is the most commonly used of these text processing commands. It has many applications, but one of the most frequent is to substitute one text string for another. The basic syntax is that the instructions to sed must be inside of quotes. s tells the program it should make a substitution. Then the search and replacement strings are separated by slashes.

echo 'Hello, World!' > test.txt
sed 's/Hello/Hola/' test.txt > prueba.txt
cat prueba.txt

This becomes really powerful when combined with regex. The example below replaces the first entire word beginning with "H".

sed 's/H\w*/Greetings/' test.txt > regex.test.txt
cat regex.test.txt

sed can replace multiple matches if you add g after the last slash. This example replaces all vowels with asterisks. (Note that the asterisk is a special character in many contexts, so it is preceded by a backslash here to make sure it is treated literally.)

sed 's/[aeiou]/\*/g' test.txt

You can also use sed to delete characters or words by simply leaving the replacement part of the substitution empty. This example will remove any commas or exclamation points.

sed 's/[\,\!]//g' test.txt

sed can also be used to delete an entire line from a file, based on the presence of a particular character string.

echo 'Do, Ra, Mi, Fa, So, La, Ti...' >> test.txt
cat test.txt
sed '/La/d' test.txt

You can also delete lines by their number. This example deletes the first line.

sed '1d' test.txt

For longer files, you can delete a range of lines. For example sed '5,10d;12d' my.long.file.txt would delete lines 5-10 and 12.
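The range deletion can be checked on a file where the line numbers are easy to see. In this sketch, seq fills my.long.file.txt with the numbers 1 to 15, one per line, so each line's content is its own line number. The output file name trimmed.txt is made up for the example.

```shell
# Work in a throwaway folder.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Each line holds its own line number, from 1 to 15.
seq 1 15 > my.long.file.txt

# Delete lines 5 through 10, and also line 12.
sed '5,10d;12d' my.long.file.txt > trimmed.txt

# What remains: lines 1-4, 11, and 13-15.
cat trimmed.txt
```

Note that sed writes the result to stdout and leaves my.long.file.txt itself unchanged, which is why the output is redirected to a new file.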

There are other text processing programs that can be useful for specific purposes.

`cut`

cut is useful to remove or select individual columns from a CSV or TSV file. The delimiter can be specified using the -d switch. (The default assumes that a tab character, \t, is the delimiter.) The number of the column is given by the -f switch. The example below will output just the third column.

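printf '1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > numbers.csv
cat numbers.csv
cut -d "," -f 3 numbers.csv

(Note that in the example above \n was used to stand in for a new line. printf is used instead of echo because, in bash, echo prints \n literally unless you add the -e switch.)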

If you want multiple columns, specify their range -f 1-2 or individual numbers -f 1,3.
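Both ways of asking for several columns can be tried on the same four-line CSV file. This sketch rebuilds numbers.csv in a throwaway folder so that it runs on its own. The output file cols13.csv is a name invented for the example.

```shell
# Work in a throwaway folder with a small CSV file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > numbers.csv

# A range: columns 1 through 2.
cut -d "," -f 1-2 numbers.csv

# A list: columns 1 and 3, skipping column 2.
cut -d "," -f 1,3 numbers.csv > cols13.csv
cat cols13.csv
```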

`sort`

In bioinformatics, you may find yourself in a situation where you want to sort long lists in files and perhaps remove duplicates. sort is the obvious tool here. It can also be used to eliminate duplicates with the -u switch. By default sort treats characters as plain text. So, it will place "10", between "1" and "2". If you want it to treat digits as numbers, add the -g switch.

shuf -i 1-10 -r -n 12 > random.txt
cat random.txt
sort random.txt
sort -u random.txt
sort -u -g random.txt
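Because shuf produces different numbers on every run, it can be easier to see what each switch does with a fixed list. The sketch below uses a small hand-made file, nums.txt (a name chosen for the example), that contains a duplicate and a two-digit number.

```shell
# Work in a throwaway folder with a fixed list of numbers.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '10\n2\n1\n2\n' > nums.txt

# Plain text sorting: "10" lands between "1" and "2".
sort nums.txt

# Numeric sorting: 10 goes last.
sort -g nums.txt

# Numeric sorting with duplicates removed: only one 2 remains.
sort -u -g nums.txt
```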

`awk`

awk is a powerful but arcane tool for text processing. It is a programming language in its own right. It can do some useful things, such as calculate the length of each line in a file.

cat test.txt | awk '{ print length($0); }'

The awk code below will calculate mean and standard deviation from a list of numbers.

shuf -i 1-1000000 -r -n 1000 > random.txt
cat random.txt | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} END {for (i=1;i<=NF;i++) { printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)} }'
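The one-liner above is dense, so a smaller sketch of the same pattern may be a gentler starting point: run a bit of code on every line, then print a result in an END block. Here awk adds up the third column of a small CSV file. The -F switch, which sets the field separator, and the variable names sum and total are choices made for this example.

```shell
# Work in a throwaway folder with a small CSV file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '1,2,3\n4,5,6\n7,8,9\n10,11,12\n' > numbers.csv

# -F "," tells awk that fields are separated by commas.
# The first block runs once per line, adding field 3 to a running sum.
# The END block runs once, after the last line has been read.
total=$(awk -F "," '{ sum += $3 } END { print sum }' numbers.csv)
echo "$total"
```

In the longer example, NF is the number of fields on a line and NR is the number of lines read, which is how it computes a mean for every column at once.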

Practice Problems

Each of the questions applies to the file /export/groups/drangeli/rnaseq.sample.fq. Start by copying it to your home directory.

Solutions to these problems can be found here.

Problem B1

How frequent is the EcoRI recognition sequence, GAATTC?

Problem B2

Create a file containing only the ID lines. Replace colons and space characters with tabs.

Problem B3

In each of the FASTQ ID lines the 5th number is the "tile number". How many different tile numbers are represented in this file?

Challenge Problems

Problem A1

Using bash commands and/or R, determine the median distance between genes in Oncopeltus fasciatus. Use the Official Gene Set version 1.2, which is available at /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff

Problem A2

Create a plot of the distribution of inter-gene distances.

Problem A3

How does this compare to some other insects?

Dave Angelini, 2018