This is a lesson plan to teach basic commands to new users of the bash command line. For other resources, check out Software Carpentry.
Bash is a Unix or Linux shell. You can think of it as a local dialect of Linux. There are some small differences in syntax compared to other shells, such as tcsh.
Note taking is important! Just like you'd keep a lab notebook at the bench, if you're working at the terminal, you should keep detailed records of what you've done and explanatory notes to yourself. There are lots of ways you could do this, including using a paper notebook, Word, Google Docs, or a text editor. I recommend taking notes in Markdown format, a simple way to annotate your notes in a text document. The major advantage of Markdown over something like Google Docs is that it is fundamentally just text, and it can be easily moved to new platforms. Most of our workflows will end up in RStudio, where Markdown can seamlessly become R Markdown. In this way, your entire workflow, from bash to R, can be recorded in one simple document.
You can write Markdown easily in any plain text editor. My favorite is Atom. However, you may want to use an editor that is specialized for Markdown. Several excellent free options exist.
Linux is case-sensitive. Just get used to it.
More importantly, certain characters have special meaning in the bash command line, and you'll need to learn to treat them carefully.
Spaces separate commands, parameters and filenames, so avoid using a space character in a file name. If a file has been imported into a Linux system from Windows or macOS, any spaces in its name must be "escaped", meaning each one is preceded by the backslash \ (or the whole name is enclosed in quotes).
Question marks (?) and asterisks (*) are "wild cards" that stand in for other characters in file names. A question mark matches exactly one character; an asterisk matches any number of contiguous characters, including none.
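A minimal sketch of both wild cards in action. It works in a temporary folder made by mktemp, and the file names are made up for illustration.

```shell
# Work in a throwaway folder so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Create a few empty, hypothetical files
touch sample1.txt sample2.txt sample10.txt notes.md

# ? matches exactly one character: sample1.txt and sample2.txt
one_char=$(ls sample?.txt)
echo "$one_char"

# * matches any run of characters: all three sample files
any_chars=$(ls sample*.txt)
echo "$any_chars"
```

Notice that sample10.txt is skipped by the first pattern, because two characters sit where the single ? is.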
To run a program (or "command"), enter its name in the command line. The function of a program can be modified by "parameters" (also called "switches" or "arguments"). These are denoted after the program's name by one or two dashes (- or --).
Other characters, like the greater than (>) and less than (<) signs and the "pipe" character (|), have special purposes we'll cover below.
One of the nice things about Linux is that you don't always need to type the full name of commands or files. Just start typing the name of a long command, file or folder and hit [TAB]. Bash will fill it in for you, if there's only one option that fits with what you've started. If there's more than one option that would complete it, hitting [TAB] again will give you a list of those options. This becomes a real time saver!
man
Most Linux commands have built-in documentation, a "manual" or "man page" that can be accessed by man followed by the command's name. For example, to get help on the list command, type man ls.
There is a huge community of Linux users online who post answers to questions. For most problems, a simple Google search will point the way to a solution.
Sometimes you run something that gets out of control. Most programs in Linux can be canceled (or "killed") by pressing [control]-[C].
Start by launching a terminal app on your computer. On a Mac this is called "Terminal" and can be found under the Applications/Utilities folder. Once the terminal is running, you are working in a Unix shell on your machine, which behaves much like Linux.
Many of the programs you'll need to run for bioinformatic analysis will be impractical to run on your own machine. For that reason, Colby has the natural science computing cluster (nscc) where you'll want to do most of the intensive applications and store large files. You can log into nscc remotely from anywhere. First, be sure you're inside Colby's firewall. This will be the case if you're on campus and logged into the "Colby Access" wireless network. If you're somewhere more interesting, you'll need to run a VPN client that provides secure access.
Only people with active Colby user accounts will be able to log on. Moreover, users are granted access to nscc, node 26, or other sensitive applications on the basis of need by faculty request. If you need access, talk to Dave Angelini, the lab PI, or Randy Downer, Colby's High Performance Computing Applications Manager.
Log on using the secure shell command ssh. For example…
When you're done working in Linux, it's good form to log out by typing exit. This command also exits out of node 26 and out of a screen.
nscc has separate computing nodes for different high-demand users. Our lab tends to use either node 26 or node 28. All nodes have access to the same file structure, but when logged onto a node, the commands you execute will be run on different processors. This is important to the operation of the whole system, because it means that all users aren't competing for computational resources on the same processors.
When you initially log onto nscc you'll be on "node 0" (also called the "head node"). The first time you want to access a different node, you must run cluster_locksmith. You'll only ever need to do this once.
To access node 26…
screen
Any time you run a command in Linux that might take a while to complete or otherwise be problematic, it's useful to run screen. Starting a screen is like creating a new instance of your Linux session, nested within the original one. To start it just type screen. The screen will clear, and you'll see the same prompt. Here you can run some command and while it's running you can return to the original session. Type [control]-[A] and then [D] to "disconnect" from the current screen. Now you're back where you started. To recover the screen again enter screen -r. If you have multiple screens running at the same time, you'll be prompted to specify which one you want.
In this example we might return to the first one on the list by entering screen -r 145901
Screen is useful because you can keep time-consuming processes running without "breaking the pipe", for example if you close your laptop or go to sleep for the night.
To end a screen session, just type exit from within it. It's good practice to clean up screens when you're done with them. Also avoid creating "nested screens": don't run screen when you're already in a screen. It gets confusing, because [control]-[A] [D] will return you to the "base" session.
Below are commands that will let you orient yourself within the bash environment.
The command ls will list the file and folder names in the folder where you are currently. Often you want a little more information than just their names, like their size and date. ls -lh provides a nice table. You can make a short "alias" for this command.
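A minimal sketch of such an alias, using the name lh. An alias set this way lasts only for the current session; keeping it permanently usually means adding the same line to a startup file such as ~/.bashrc, which is an assumption about your setup.

```shell
# Define lh as a shorthand for ls -lh
alias lh='ls -lh'

# Print the definition back to confirm it was set
alias lh
```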
This defines lh as a shorthand for ls -lh. Aliases can be very useful for common, lengthy commands.
pwd prints the working directory. This will show you the path to your current location from the root.
du -sh will report the collective size of the files in the folder where you are currently. If you add a folder name to the end of this command, it will focus on that folder's contents. The du stands for "disk usage", as in the days when storage was on a magnetic "hard disk".
df -h will report free storage space.
Just like Windows or Mac OS, Linux systems have a branching tree-like system of folders (also called directories). These folders begin at "root", which is denoted /. The partitions that exist at the root are organized by their speed.
/var is the fastest storage space on nscc, and it's used only by programs for temporary storage as they run. You should never attempt to copy anything there, and most of the time you can ignore it. As you run bioinformatics software, however, keep an eye on /var to make sure it does not fill up. If it does, the programs relying on it may crash.
/export is the working space on nscc. It is reasonably fast, and it's the place to keep files that we are actively working on. This is also where you'll find your "home directory", which has the special name ~.
/storage is intermediate in its speed, slower than /export. It can be used for intermediate-term file storage.
/research is the slowest partition on nscc. This is where we keep files for long-term storage.
It's often important to be aware of other users on nscc and what they're doing. Even on node 26, it's a bad idea (and impolite) to start an intensive process if someone else is already using the system heavily.
w is a simple command to list the other users currently logged into nscc. If it's someone you know, email them if needed to see when they plan to be done.
top provides a real-time list of all the programs running on the node, with information about which user started them, how long they've been running and how many CPUs they're using.
ps lists the processes you are currently running. This can be useful to check if something you've forgotten about is still going. How would you ever get into that situation? Well, if you anticipate a command will take a while to run, you can run it "in the background". Just add & to the end of your command and while it runs, you'll have access to the command line prompt again.
kill stops a process you're running, if you need to. Obviously it should be used carefully. To use it, add the process ID number from ps after the kill command. The example below will kill the R run.
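A minimal sketch of the background-and-kill cycle. Here sleep 60 is only a stand-in for a long-running job such as an R run, and $! is bash's variable holding the process ID of the most recent background command. At the prompt you would more likely read the ID from the output of ps.

```shell
# Start a long-running command in the background with &
sleep 60 &

# $! holds the process ID (PID) of the last background job
pid=$!

# kill -0 sends no signal; it only checks that the process exists
kill -0 "$pid" && echo "process $pid is running"

# Stop the process, then wait for it to finish shutting down
kill "$pid"
wait "$pid" 2>/dev/null || true
```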
history provides a list of all your commands, going back to the beginning of your session. You can instantly re-execute an old command by typing ! followed by the command's number in history.
Move around among folders using the cd command to change directories. Where you are now is ./ and the "parent" folder "above" you is ../
Make new directories with mkdir followed by the name of the new folder you'd like. Remove it using rmdir. As a precaution, the folder must be empty to delete it. To delete a folder along with everything in it, use rm -r instead.
cp copies files. To copy a folder and all its contents use cp -r.
rm deletes files. Use this with caution. There is no trash bin in Linux. Once a file is deleted it's gone.
mv copies files to a new location and then deletes the originals. It can be used to move files from one folder to another or to rename a single file or folder.
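The folder and file commands above fit together as in this sketch. It works in a temporary folder made by mktemp, and the names project and notes.txt are made up for illustration.

```shell
# Work in a throwaway folder so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Make a new folder and move into it
mkdir project
cd project

# Create a hypothetical file, then copy it
echo "first draft" > notes.txt
cp notes.txt notes.backup.txt

# Rename the copy with mv
mv notes.backup.txt notes.old.txt

# Go up to the parent folder and copy the whole folder
cd ../
cp -r project project.copy

# Delete a single file, then a folder and everything in it
rm project/notes.old.txt
rm -r project.copy

# Only notes.txt is left
ls project
```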
scp copies files between your computer and nscc.
This will only work when executed from your machine, referencing nscc as the remote system.
Start a new terminal window on your machine, where you don't log into nscc.
In this example test.txt will be copied to your home folder on nscc.
Linux provides lots of ways to examine the contents of files, which is excellent for big data.
cat displays the entire contents of a file to the screen.
head shows the first 10 lines of a file. If you want a number other than 10, for example 4, use head -n 4.
tail shows the last 10 lines. If the file is actively being added to by a program, you can have tail continuously write out the contents as they're added with tail -F.
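A quick sketch of head and tail on a file whose contents are easy to predict. The file name numbers.txt is made up, and seq is used only to fill it with the numbers 1 to 20.

```shell
# Build a 20-line file to look at
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# First 4 lines: 1 through 4
first_four=$(head -n 4 "$workdir/numbers.txt")
echo "$first_four"

# Last 3 lines: 18 through 20
last_three=$(tail -n 3 "$workdir/numbers.txt")
echo "$last_three"
```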
less opens a program where you can view the entire contents of a file, one page at a time. When you're in less, you can also search forward by hitting / and entering your search string, then hitting enter. You can also search backward in the file with ?.
The output generated by programs in Linux goes to "stdout", which by default is the screen. Sometimes you don't want that. You can redirect that output using a few special characters.
The greater than sign > redirects the output of a program to a new file. For example, ls > filenames.txt will create a new file with the output from ls. Be careful: if filenames.txt already exists, > will overwrite it. If you want to add new output to the end of an existing file use >>.
If you want to send the output of one program to another program use the "pipe" character |. We'll see an example in the next section.
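The three redirection characters can be sketched together like this. The work happens in a temporary folder, and a.txt and b.txt are made-up files that exist only to give ls something to list.

```shell
# Work in a throwaway folder with two empty files
workdir=$(mktemp -d)
cd "$workdir"
touch a.txt b.txt

# > creates (or overwrites) filenames.txt with the output of ls.
# The new file exists before ls runs, so it lists itself too.
ls > filenames.txt

# >> appends to the end of the existing file
echo "end of list" >> filenames.txt

# | sends the output of cat to wc -l, which counts the lines
line_count=$(cat filenames.txt | wc -l)
echo "$line_count"
```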
grep
It's often useful to search the contents of files or the output of other programs for particular sequences of characters. To search for one in Linux, use the command grep. Try this…
This will filter the output of history showing only the lines where you used the cd command (or where the characters "cd" occurred).
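At an interactive prompt, a filter like this can be written history | grep cd. Since history is normally not recorded inside a script, the sketch below applies the same pipe to a small, made-up list of commands standing in for the output of history.

```shell
# A made-up stand-in for the output of history
commands=$(printf 'ls -lh\ncd /export\npwd\ncd ../\n')

# Keep only the lines containing the characters "cd"
matches=$(echo "$commands" | grep cd)
echo "$matches"
```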
Grep also works with regular expressions (regex). Refer to chapters 2-3 of Haddock & Dunn 2011 for more details on regex. The example below will show only lines from history that have a two-letter word beginning with 'l', flanked by spaces.
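One way to write such a pattern is a space, the letter l, any one lowercase letter in [a-z], and another space. This is a sketch on a made-up list of commands rather than real history output; at the prompt you would pipe history into the same grep.

```shell
# A made-up stand-in for the output of history
commands=$(printf 'ls -lh\ncd /export ; ls -lh\ncat notes.txt | less\npwd\n')

# Match a space, then l plus one lowercase letter, then a space
matches=$(echo "$commands" | grep ' l[a-z] ')
echo "$matches"
```

Only the second line matches. The ls that starts the first line has no space in front of it, and less is four letters long.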
Sometimes the easiest way to edit a file will be to copy it from nscc to your computer and then copy it back. Often however, that will be impossible because of the file's size. Linux has several tools for file editing.
nano is a user-friendly interactive text editor for Linux. It works in a separate screen, where you can move the cursor with arrow keys, add new text, or type over old text, loosely similar to Word.
vim is an old-school Linux text editor. Unlike nano, its default mode is not as a direct editor. Instead, different keys perform specific commands, like deleting whole words or lines. Read the man vim entry for more details. Exiting out of vim can be difficult to figure out if you run it without first reading the instructions! (To exit without saving, press [ESC], then type :q! and press enter.)
Unlike the editors described above, Linux has other tools that can edit file contents or the output of other programs in a predefined way. This can be really useful for bioinformatic work, where you may want to perform the same change to hundreds or millions of lines in a file.
sed
sed is the most commonly used of these text processing commands. It has many applications, but one of the most frequent is to substitute one text string for another. The basic syntax is that the instructions to sed must be inside of quotes. s tells the program it should make a substitution. Then the search and replacement strings are separated by slashes.
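A minimal sketch of a substitution. The input sentence and the replacement word are made up for illustration.

```shell
# Replace the first occurrence of "World" with "Colby"
result=$(echo 'Hello, World!' | sed 's/World/Colby/')
echo "$result"
```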
This becomes really powerful when combined with regex. The example below replaces the first entire word beginning with "H".
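A sketch of one such regex substitution, on a made-up sentence. The pattern H[a-z]* means a capital H followed by any number of lowercase letters.

```shell
# Only the first word beginning with H is replaced; "Hi" is left alone
result=$(echo 'Hello, World! Hi again.' | sed 's/H[a-z]*/Goodbye/')
echo "$result"
```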
sed can replace multiple matches if you add g after the last slash. This example replaces all vowels with asterisks. (Note that because the asterisk is a special character, it is preceded by a backslash here so that it is treated literally.)
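A sketch of a global replacement on a made-up sentence, matching any one lowercase vowel with [aeiou].

```shell
# The trailing g replaces every match on the line, not just the first
result=$(echo 'Hello, World!' | sed 's/[aeiou]/\*/g')
echo "$result"
```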
You can also use sed to delete characters or words by simply leaving the replacement part of the substitution empty. This example will remove any commas or exclamation points.
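A sketch of a deletion on a made-up sentence. The two slashes at the end have nothing between them, so each match is replaced with nothing.

```shell
# [,!] matches either a comma or an exclamation point
result=$(echo 'Hello, World!' | sed 's/[,!]//g')
echo "$result"
```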
sed can also be used to delete an entire line from a file, based on the presence of a particular character string.
You can also delete lines by their number. This example deletes the first line.
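Both kinds of line deletion can be sketched on a small made-up table. The file table.csv and its contents are hypothetical. Neither command changes the file itself; sed writes the edited text to stdout.

```shell
# Build a small file to edit
workdir=$(mktemp -d)
printf 'id,name\n1,alpha\n2,beta\n3,gamma\n' > "$workdir/table.csv"

# Delete every line containing the string "beta"
no_beta=$(sed '/beta/d' "$workdir/table.csv")
echo "$no_beta"

# Delete line 1 (here, the header)
no_header=$(sed '1d' "$workdir/table.csv")
echo "$no_header"
```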
For longer files, you can delete a range of lines. For example sed '5,10d;12d' my.long.file.txt would delete lines 5-10 and 12.
There are other text processing programs that can be useful for specific purposes.
cut
cut is useful to remove or select individual columns from a CSV or TSV file. The delimiter can be specified using the -d switch. (The default assumes that a tab character, \t, is the delimiter.) The number of the column is given by the -f switch. The example below will output just the third column.
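A sketch of selecting one column, using printf to make two rows of made-up comma-separated values.

```shell
# printf turns each \n into a new line, giving the rows a,b,c and d,e,f
third=$(printf 'a,b,c\nd,e,f\n' | cut -d ',' -f 3)
echo "$third"
```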
(Note that in the example above \n was used to stand in for a new line.)
If you want multiple columns, specify their range -f 1-2 or individual numbers -f 1,3.
sort
In bioinformatics, you may find yourself in a situation where you want to sort long lists in files and perhaps remove duplicates. sort is the obvious tool here. It can also be used to eliminate duplicates with the -u switch. By default sort treats characters as plain text. So, it will place "10" between "1" and "2". If you want it to treat digits as numbers, add the -g switch.
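The three behaviors can be compared on the same short, made-up list of numbers.

```shell
# A made-up list with one duplicate
numbers=$(printf '10\n2\n1\n2\n')

# Plain text order: 1, 10, 2, 2
text_sorted=$(echo "$numbers" | sort)
echo "$text_sorted"

# Duplicates removed: 1, 10, 2
unique=$(echo "$numbers" | sort -u)
echo "$unique"

# Numeric order: 1, 2, 2, 10
numeric=$(echo "$numbers" | sort -g)
echo "$numeric"
```

The switches can be combined, so sort -g -u would give a numerically ordered list without duplicates.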
awk
awk is a powerful but arcane tool for text processing. It is a programming language in its own right. It can do some useful things, such as calculate the length of each line in a file.
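A sketch of measuring line lengths, using a few made-up sequence lines in place of a real file. In awk, $0 is the whole current line and length() counts its characters.

```shell
# Four lines, one of them empty
lengths=$(printf 'GATTACA\nGAATTC\n\nACGT\n' | awk '{ print length($0) }')
echo "$lengths"
```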
The awk code below will calculate mean and standard deviation from a list of numbers.
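One way to write this is sketched here, on a made-up list of eight numbers. It assumes one number per line and reports the population standard deviation (dividing by n, not n - 1), which is a choice made for this illustration.

```shell
stats=$(printf '2\n4\n4\n4\n5\n5\n7\n9\n' | awk '
  # For every line, add to the running sum and sum of squares
  { sum += $1; sumsq += $1 * $1 }
  END {
    # NR is the number of lines read
    mean = sum / NR
    # Population SD: square root of mean square minus squared mean
    sd = sqrt(sumsq / NR - mean * mean)
    printf "mean=%.2f sd=%.2f\n", mean, sd
  }')
echo "$stats"
```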
Each of the questions applies to the file /export/groups/drangeli/rnaseq.sample.fq
Start by copying that file to your home directory.
Solutions to these problems can be found here.
How frequent is the EcoRI recognition sequence, GAATTC?
Create a file containing only the ID lines. Replace colons and space characters with tabs.
In each of the FASTQ ID lines the 5th number is the "tile number". How many different tile numbers are represented in this file?
Using bash commands and/or R, determine the median distance between genes in Oncopeltus fasciatus. Use the Official Gene Set version 1.2, which is available at /research/drangeli/Ofas.genome/oncfas_OGSv1.2_original.gff
Create a plot of the distribution of inter-gene distances.
How does this compare to some other insects?
Dave Angelini, 2018