# A UNIX BOOTCAMP
> [name=Latest change:] [time=Thu, Sep 04, 2025]
> Taken from David Ray's Genomes and Genome Evolution class at Texas Tech University
This is your introduction to the Unix environment.
Several of these tasks are pointless in themselves and really do nothing. Their purpose is to familiarize you with some of the basic commands/file manipulations you will use throughout the course. Keep a cheat sheet for yourself that you can quickly refer to in the future. You can find these on Google, but some of the frequent commands you will use (ex. squeue) are more specific to our system. The following is heavily modified from http://korflab.ucdavis.edu/bootcamp.html.
All of the work below assumes that you have requested and obtained an account on HPCC. If you haven't, you won't be able to participate.
**A warning. The better you understand what's happening with this tutorial, the less confused you will be throughout the rest of this class. Learning how to run commands and navigate in a linux environment is critical for not getting lost. Really take some time to concentrate on this tutorial, please.**
### A note on formatting
Throughout this tutorial, you will notice various text formats. If there is a command you need to type, it will typically be formatted as:
`type this command`
If I'm trying to represent output you should be seeing on your screen, it will be formatted as:
```
This is stuff you should see
on your screen.
```
Sometimes I'll combine the two. You'll get used to it.
At various points in these tutorials, you may see commands with placeholders, which will always be enclosed within the < and > or [ and ] symbols. These placeholders need to be replaced with information specific to you and your computer. For example: later in this tutorial you will see [eraider], which you simply replace with your eraider user name, discarding the [] symbols. Another common example is \<path>, which instructs you to provide the full path (location) of the file or directory.
One more note before you get started. The tutorial also assumes you work through it all in one sitting. If you log out and log back in, you will need to recapitulate all of the actions to get back to where you were when you stopped.
## THE TERMINAL
A terminal is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view files etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program available.
How you get to your terminal will depend on whether you use a Mac or Windows PC. For Mac users, I found Termius to be a good option. Download it from [here](https://termius.com/), and create an account. The free version of the software is more than enough for our work.
1. Click on 'NEW HOST' button.
2. Enter 'login.hpcc.ttu.edu' in the Address section.
3. Name this host as you please.
4. Enter your eraider username and password.
If all went well, you should now see something that looks like this.

Click Connect and now you will be presented with the welcome screen of the HPCC.

Again, some things may be different because these screenshots were specific to me and not you.
There is some valuable information printed out here but we're not going to worry about most of that now. The important text for us at this moment is the part that says, "DO NOT RUN CPU-INTENSIVE JOBS ON THE LOGIN NODES!" That takes us to the next section.
## MOVING FROM THE LOGIN NODE
Now that we are logged on, you should notice something like this. The actual numbers may be slightly different but not by much.
```
[frcastel@login-20-26 ~]$
```
The `[frcastel@login-20-26 ~]$` text that you see is the Unix command prompt. In this case, it contains your eraider, the name of the login node you're working on (‘login-20-26’) and the name of the current directory (the '~' stands for your home directory; more on that later). Note that the command prompt might not look the same on different Unix systems. In this case, the $ sign marks the end of the prompt. For most of the rest of the course, I'm just going to leave the '$' out because it's a pain to include it with every command you will be executing. For this page, though, I'll be using it a lot.
Be aware that you're working on a system that has thousands of processors (~16,000 to be more precise). A single node controls how jobs are distributed to all of those other processors. That node is called the 'login node' (you may also hear it called the 'head node'). DO NOT perform any analyses on the login node. It is a crime punishable by death. It slows everyone down on the entire system. One of the ways to move off of the login node is to request an interactive session (via the 'interactive' command),
```
interactive -p nocona -c 1
```
Notice the change in your command prompt. This tells you that you are now working from a compute node. The change will be something like this:
```
[frcastel@login-20-26 ~]$
```
to
```
[frcastel@cpu-24-31 ~]$
```
This indicates the specific node on which you are now working.
If no nodes are available to use for an interactive session, you will get a message saying your job failed. Sometimes, the processors are just exceptionally busy and you will not be able to get a node to work on. In those cases, generating a submission script is a way to go. You create a text file that says what you want to do and submit it to the 'queue' (the list of pending jobs that's being managed by the head node). Once submitted, the job will run when the resources are available. I will be describing how to do that later in the course.
You'll notice below that much of what we're doing is on the login node. The tasks are so simple that it's really not a problem, but it's still bad practice. Forgive me.
You can exit your interactive session by closing your terminal. But that is a jerk thing to do. Your session will stay in the queue for a minimum of 48 hours, locking out others from using that set of processors. You have to specifically tell the system you want to end your session by typing `exit`. Don't type it now but please do before you leave class or stop working on this assignment.
Any time you are performing computationally intensive work (aka, analyses of any kind) you will want to generate an interactive session.
## UNIX COMMANDS AND THE SHELL
A shell is a basic program that allows the user to interact with the system.
For this class, we will primarily be using a shell called "bash" and this exercise familiarizes you with its usage. Every command you use for the remainder of this exercise is a bash command.
It’s important to note that you will always be inside a single directory when using the terminal. The default behavior is that when you open a new terminal you start in your own home directory (containing files and directories that only you can modify). To see what files and directories are in our home directory, we need to use the ls command. This command stands for 'list' and it lists the contents of a directory. If we run the `ls` command we should see something like:
`$ ls`
```
<some files and things if you've worked on HPCC previously>
$
```
or this if you haven't worked on HPCC before.
`$ ls`
```
$
```
We can resolve the second situation by just copying some things from a directory I created to your home directory. The following command will copy a directory and all its contents to your home directory.
::: info
Note that you will need to replace [eraider] with your eraider id. Mine is 'frcastel'. We will use this same system to indicate your eraider id through the rest of the course.
:::
`cp -r /home/frcastel/bootcamp /home/[eraider]`
Now, redo your ls command. You should see:
```
$ ls
bootcamp
$
```
Again, if you've already worked on HPCC, you may see more.
The output of the ls command lists the contents of your home directory. In this case, it's a single directory, but it could also include files. We’ll learn how to tell them apart later on. This directory was created just for this course. You will therefore probably see something very different on your own computer.
After the ls command finishes it produces a new command prompt, ready for you to type your next command.
The `ls` command is used to list the contents of any directory, not necessarily the one that you are currently in. Try the following:
`ls bootcamp`
You should see:
```
exampledirectory1 exampledirectory2 examplefile1.txt
```
If you're using a quality terminal, these may be in pretty colors. These are the contents of the bootcamp directory, listed even though you aren't actually _in_ that directory. This has to do with the 'path'. More on that later.
## DIRECTORIES, AKA FOLDERS
Looking at directories from within a Unix terminal can often seem confusing, especially if you've grown up using graphical user interfaces (GUIs) like on Windows computers and Macs. But bear in mind that these directories are exactly the same type of folders that you can see if you use any graphical file browser. You might notice some of these names appearing in different colors. Many Unix systems will display files and directories differently by default. Other colors may be used for special types of files.
To see this in a form that you might be more familiar with, we'll use the Secure File Transfer Protocol (SFTP) available in Termius.

Click on the SFTP button and you'll see something like this:

A new window should appear that is divided into two panels. On the left is your local desktop. On the right is your home directory on HPCC.

On the left, you can see the folders in your own computer, but we can also access the folders and files in our HPCC account by clicking 'Connect to host' on the right of the screen. Select the name of the Host that you just created and now you are in!

Note that my home directory has an extra folder. Yours is probably pretty empty and may only include one folder, the 'bootcamp' folder you just copied. Find that directory and double-click it. You should see the same three entries you saw using `ls bootcamp` earlier in this exercise.
## SETTING UP A DIRECTORY STRUCTURE FOR THIS CLASS
As described in a previous tutorial, a coherent directory structure is important to keep you organized. Thus, I want to set one up for this class. Please do the following.
Login to HPCC and get an interactive session with one processor.
`interactive -p nocona -c 1` or simply `interactive -p nocona`. By default, you will be given one processor.
We haven't discussed this part of the structure of HPCC yet: you have three working areas (and their associated paths), 'home' (`/home/[eraider]`), 'work' (`/lustre/work/[eraider]`), and 'scratch' (`/lustre/scratch/[eraider]`). There are important differences among the three.
Home has the smallest storage capacity but it's backed up.
Work has limited but larger storage capacity but it's not backed up.
Scratch has effectively unlimited storage but is also not backed up and is purged periodically.
We will be working in scratch because when I worked through all of these exercises, I needed the storage.
Migrate to your scratch directory.
`cd /lustre/scratch/[eraider]`
Make a directory for this class and for this exercise.
`mkdir -p bootcamp2025/data`
Note what this command does. It creates two directories simultaneously, the directory for this class, 'bootcamp2025' and the subdirectory, 'data'. That is possible using the -p option, which tells unix to create any necessary intermediate directories required to create the last directory in the path.
You will be storing all of your data and results from any exercises from here on out in your 'bootcamp2025' directory and you will be generating new subdirectories as we go.
Before moving on to the next section, go back to your home directory by typing
`cd /home/[eraider]`
## PATHS
Each file on the filesystem can be uniquely identified by a combination of a filename and a path. You can reference any file on the system by giving its full name, which begins with a / indicating the root directory, continues through a list of subdirectories (the components of the path) and ends with the filename. The absolute path describes the relationship of the file to the root directory, /. Each name in the path represents a subdirectory of the prior directory, and '/' characters separate the directory names. The full name, or absolute path, of a file in someone's home directory might look like this:
```
/home/frcastel/bootcamp/exampledirectory1/file1.txt
```
This means that there is a subdirectory of 'root' (aka /) called 'home'. Within 'home', there is a subdirectory called 'frcastel'. Within 'frcastel', there is a subdirectory called 'bootcamp'. Within 'bootcamp' there is a subdirectory called 'exampledirectory1'. Within 'exampledirectory1' there is a file called 'file1.txt'.
You can think of these as ways to direct someone to a specific location as follows. If you wanted to tell some aliens how to find you right now, you could tell them that you're in the universe. Within the universe you're in a galaxy called the Milky Way. In the Milky Way, you're in something we call the Solar System. Within the Solar System, you're on one of the planets, the one called Earth. On Earth, you're on the North American continent and on that continent, you're in the country, the USA. Within the USA, you're in Texas. Within Texas, you're in Lubbock County, in the city of Lubbock, on the campus of Texas Tech University, in the Biology Building, in room 405.
To make that into a path similar to the ones we'll be using we'd write it like this. We'll symbolize the universe with an initial '/' and all of the other parts of the path will be subdivisions of that.
`/MilkyWay/SolarSystem/Earth/NorthAmerica/USA/Texas/LubbockCounty/Lubbock/TexasTechUniversity/BiologyBuilding/Room405` is the path to you that anyone in the universe could use to locate your position. It's your absolute path.
Suppose instead of being here, you were at the top of Olympus Mons on Mars in a spacecraft. The structure of the absolute path would stay the same; only the relevant parts would change.
In other words, `/MilkyWay/SolarSystem/Mars/OlympusMons/YourSpaceCraft`
Every file or directory on the system can be named by its absolute path, but it can also be named by a relative path that describes its relationship to the current working directory. Files in the directory you are in can be uniquely identified just by giving the filename they have in the current working directory. Files in subdirectories of your current directory can be named in relation to the subdirectory they are part of. From frcastel's home directory (/home/frcastel/), he can uniquely identify the file file1.txt as bootcamp/exampledirectory1/file1.txt. The absence of a preceding / means that the path is defined relative to the current directory rather than relative to the root directory.
If our aliens were already in the Solar System, they would only need to refer to the needed part of the path.
`Earth/NorthAmerica/USA/Texas/LubbockCounty/Lubbock/TexasTechUniversity/BiologyBuilding/Room405` or `Mars/OlympusMons/YourSpaceCraft`, depending on your location.
If you want to name a directory that is on the same level or above the current working directory, there is a shorthand for doing so. Each directory on the system contains two links, ./ and ../, which refer to the current directory and its parent directory (the directory it's a subdirectory of), respectively. If user frcastel is working in the directory /home/frcastel/bootcamp/exampledirectory1, he can refer to the directory /home/frcastel/bootcamp/exampledirectory2 as ../exampledirectory2. The '../' backs one out of exampledirectory1 into /home/frcastel/bootcamp/ and then the 'exampledirectory2' directs attention to that folder.
Another shorthand naming convention is the home directory itself. It can be designated simply by ~. For example, if you wanted to identify the path to file1.txt, you could simply type ~/bootcamp/exampledirectory1/file1.txt.
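These shorthands can be tried safely in a throwaway directory. Here is a minimal sketch using invented names under /tmp rather than the real bootcamp files:

```shell
# Build a tiny practice tree (names invented for illustration).
mkdir -p /tmp/pathdemo/exampledirectory1 /tmp/pathdemo/exampledirectory2
touch /tmp/pathdemo/exampledirectory1/file1.txt

cd /tmp/pathdemo/exampledirectory1
ls file1.txt                                   # relative path: resolved from here
ls /tmp/pathdemo/exampledirectory1/file1.txt   # absolute path: works from anywhere
ls ../exampledirectory2                        # '..' backs out to the parent first
```

Both `ls` commands name the same file; only the starting point differs.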
## YOU ARE HERE: 'PWD'
`pwd` stands for "print working directory," and that's exactly what it does. `pwd` sends the full pathname of the directory you are currently in, the current working directory, to standard output - it prints to the screen. You can think of being "in" a directory in this way: if the directory tree is a map of the filesystem, the current working directory is the "you are here" pointer on the map.
When you log in to the system, your "you are here" pointer is automatically placed in your home directory. Your home directory is a unique place. It contains the files you use almost every time you log into your system, as well as the directories that you create to store other files. What if you want to find out where your home directory is in relation to the rest of the system? Typing `pwd` at the command prompt in your home directory should give output something like:
```
[frcastel@login-20-26 ~]$ pwd
/home/[eraider]
```
This means that your particular home directory is a subdirectory of the system 'home' directory (but designated by your user ID). The system 'home' directory is, in turn, a subdirectory of the root (/) directory.
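To watch the "you are here" pointer move, try a quick sketch (the /tmp directory names are invented stand-ins for your home directory):

```shell
# pwd reports the current working directory after every move.
mkdir -p /tmp/pwddemo/bootcamp
cd /tmp/pwddemo
pwd              # prints the full path, ending in /pwddemo
cd bootcamp
pwd              # now one level deeper, ending in /pwddemo/bootcamp
```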
## MAKING NEW DIRECTORIES
If we want to make a new directory (e.g. to store some work related data), we can use the mkdir command:
`mkdir making_a_directory`
Note that the underscores '_' are important. Unix does not like blank spaces.
`ls`
Assuming you're still in your home directory and that you have not worked on HPCC before, you should see:
```
bootcamp making_a_directory
```
## A DIGRESSION ON FILE HIERARCHY
Filesystems can be deep and narrow or broad and shallow. It's best to follow an intuitive scheme for organizing your files. Each level of hierarchy should be related to a step in the process you've used to carry out the project. A filesystem is probably too shallow if the output from numerous processing steps in one large project is all shoved together in one directory. However, a project directory that involves several analyses of just one data object might not need to be broken down into subdirectories. The filesystem is too deep if versions of output of a process are nested beneath each other or if analyses that require the same level of processing are nested in subdirectories. It's much easier for you to remember and for others to understand the paths to your data if they clearly symbolize steps in the process you used to do the work.
As you'll see in the upcoming example, your home directory will probably contain a number of directories, each containing data and documentation for a particular project. Each of these project directories should be organized in a way that reflects the outline of the project. Each directory should contain documentation that relates to the data within it. That documentation typically takes the form of a README file that has text to describe the contents and, possibly, how they were generated.
:::danger
Brief reminder. Are you working in an interactive session? If you don't remember what that is, [click here](##MOVING-FROM-THE-LOGIN-NODE).
:::
## ESTABLISHING FILE-NAMING CONVENTIONS
Unix allows an almost unlimited variability in file naming. Filenames can contain any character other than the / or the null character (the character whose binary representation is all zeros). However, it's important to remember that some characters, such as a space, a backslash, or an ampersand, have special meaning on the command line and may cause problems when naming files. Filenames can be up to 255 characters in length on most systems. However, it's wise to aim for uniformity rather than uniqueness in file naming. Most humans are much better at remembering frequently used patterns than they are at remembering unique 255-character strings, after all.
A common convention in file naming is to name the file with a unique name followed by a dot (.) and then an extension that uniquely indicates the file type.
As you begin working with computers in your research and structuring your data environment, you need to develop your own file-naming conventions, or preferably, find out what naming conventions already exist and use them consistently throughout your project. There's nothing so frustrating as looking through old data sets and finding that the same type of file has been named in several different ways. Have you found all the data or results that belong together? Can the file you are looking for be named something else entirely? In the absence of conventions, there's no way to know this except to open every unidentifiable file and check its format by eye. The next section provides a detailed example of how to set up a filesystem that won't have you tearing out your hair looking for a file you know you put there.
Here are some good rules of thumb to follow for file-naming conventions:
* Files of the same type should have the same extension.
* Files derived from the same source data should have a common element in their unique names.
* The unique name should contain as much information as possible about the experiment.
* Filenames should be as short as is possible without compromising uniqueness.
You'll probably encounter preestablished conventions for file naming in your work. For instance, if you begin working with protein sequence and structure datafiles, you will find that families of files with the same format have common extensions. You may find that others in your group have established local conventions for certain kinds of data files and results. You should attempt to follow any known conventions. Some typical file naming conventions we'll use are:
* .fa - fasta formatted sequence files
* .fq - fastq files that have sequence data and quality scores
* .txt - plain text files
* .sh - shell scripts
* .gz - compressed files
* .bam - binary versions of mapped read files
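Here is a sketch of those rules in practice, using hypothetical sample names (none of these files are part of the course data):

```shell
# Hypothetical names that follow the rules above: a shared prefix for
# files derived from the same source, details in the unique name, and
# an extension that flags the file type.
mkdir -p /tmp/naming_demo && cd /tmp/naming_demo
touch sample01_liver_R1.fq sample01_liver_R2.fq   # paired fastq reads
touch sample01_liver_trimmed_R1.fq                # derived from the same source
touch sample01_readme.txt                         # plain-text documentation
ls sample01_liver*                                # shared prefix groups related files
```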
## STRUCTURING A PROJECT
In a typical genome sequencing and assembly project you will encounter several file types. For example, you may want to keep a record of the sample origination information in a spreadsheet (samples.xlsx). That could be kept in an 'info' directory along with a readme file that describes the project. Then, after getting the initial sequencing reads, you would want to store those in a 'raw_reads' directory. The reads will eventually be assembled, but you may use multiple assemblers and/or perform multiple assemblies using any one assembler. Thus, you should have an 'assemblies' directory and any subdirectories might reflect the different assembly methods you used within them. Once you decide on an assembly to use for downstream analyses, you will want to keep those files in a relevant directory called 'data_analysis'. Finally, you will likely be writing several scripts to use for your assemblies and analyses. Thus, you will want to store them in a 'scripts' directory. Overall, it may look something like this:

Assuming you're working on the TTU HPCC and keeping your files on the /lustre/work/ system, your file hierarchy would look something like this:
```
/lustre/work/[eraider]/species_x_assembly
/lustre/work/[eraider]/species_x_assembly/info
/lustre/work/[eraider]/species_x_assembly/info/readme.txt
/lustre/work/[eraider]/species_x_assembly/raw_reads
/lustre/work/[eraider]/species_x_assembly/raw_reads/illumina
/lustre/work/[eraider]/species_x_assembly/raw_reads/illumina/file1_R1.fastq.gz
/lustre/work/[eraider]/species_x_assembly/raw_reads/illumina/file1_R2.fastq.gz
/lustre/work/[eraider]/species_x_assembly/raw_reads/illumina/file2_R1.fastq.gz
/lustre/work/[eraider]/species_x_assembly/raw_reads/illumina/file2_R2.fastq.gz
```
and so on....
```
/lustre/work/[eraider]/species_x_assembly/raw_reads/pacbio
/lustre/work/[eraider]/species_x_assembly/raw_reads/pacbio/file1.fastq.gz
/lustre/work/[eraider]/species_x_assembly/raw_reads/pacbio/file2.fastq.gz
```
and so on....
```
/lustre/work/[eraider]/species_x_assembly/raw_reads/nanopore
/lustre/work/[eraider]/species_x_assembly/raw_reads/nanopore/file1.fastq.gz
/lustre/work/[eraider]/species_x_assembly/raw_reads/nanopore/file2.fastq.gz
```
and so on....
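Since `mkdir -p` accepts several paths at once, the entire skeleton above can be created with a single command. A sketch (run it from your project area, e.g. your /lustre/work directory; the species_x_assembly name is just the example from the text):

```shell
# One mkdir call with -p builds the whole project skeleton; -p
# creates every intermediate directory along each path.
mkdir -p species_x_assembly/info \
         species_x_assembly/raw_reads/illumina \
         species_x_assembly/raw_reads/pacbio \
         species_x_assembly/raw_reads/nanopore \
         species_x_assembly/assemblies \
         species_x_assembly/data_analysis \
         species_x_assembly/scripts
ls -R species_x_assembly    # confirm the whole tree exists
```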
## GETTING FROM A TO B
We are in the home directory on the computer but we want to work in the bootcamp directory. To change directories in Unix, we use the `cd` command:
`cd bootcamp`
```
[frcastel@cpu-24-31 bootcamp]$
```
Notice that — on this system — the command prompt has expanded to include our current directory. This doesn’t happen by default on all Unix systems, but you should know that you can configure what information appears as part of the command prompt.
Let’s make two new subdirectories and navigate into them:
`mkdir outer_directory`
`cd outer_directory`
```
[frcastel@cpu-24-31 outer_directory]$
```
Now try:
`mkdir inner_directory`
`cd inner_directory`
```
[frcastel@cpu-24-31 inner_directory]$
```
The folder hierarchy is not quite obvious with the recent HPCC update, but type `pwd`.
```
/home/frcastel/bootcamp/outer_directory/inner_directory
```
This reveals that we are three levels beneath the home directory. We created the two directories in separate steps, but it is possible to use the mkdir command to do this all in one step.
Like most Unix commands, `mkdir` supports command-line options which let you alter its behavior and functionality. Command-line options are — as the name suggests — optional arguments that are placed after the command name. They often take the form of single letters (following a dash). If we had used the `-p` option of the `mkdir` command we could have done this in one step. E.g.
`mkdir -p outer_directory/inner_directory`
The -p option means, 'create all intermediate subdirectories in making the ultimate directory in the path.'
## MAKING THE 'LS' COMMAND MORE USEFUL
The `..` operator that we saw earlier can also be used with the `ls` command, e.g. you can list directories that are ‘above’ you:
`ls ../../`
```
exampledirectory1 exampledirectory2 examplefile1.txt outer_directory
```
Time to learn another useful command-line option. If you add the option ‘-l’ to the `ls` command it will give you a longer output compared to the default:
`ls -l ../../`
```
total 12
drwxr-xr-x 2 <eraider> bio 4096 Nov 1 2019 exampledirectory1
drwxr-xr-x 2 <eraider> bio 4096 Nov 1 2019 exampledirectory2
-rw-r--r-- 1 <eraider> bio 0 Jul 17 11:32 examplefile1.txt
drwxr-xr-x 3 <eraider> bio 4096 Jul 17 12:46 outer_directory
```
Note that if you were to enter the following:
`ls -l ~/bootcamp`
You get exactly the same thing. Why?
For each file or directory we now see more information (including file ownership and modification times). A ‘d’ at the start of a line indicates a directory; a ‘-’ indicates a regular file. There are many, many different options for the `ls` command. Try out the following (against any directory of your choice) to see how the output changes.
`ls -R ../../`
`ls -l -t -S ../../`
`ls -l -t -S -r ../../`
`ls -ltSr ../../`
`ls -lh ../../`
Note that the last two examples combine multiple options but only use one dash. This is a very common way of specifying multiple command-line options. You may be wondering what some of these options are doing. It’s time to learn about Unix documentation.
I'm a big fan of `ls -lhrt`. Gives you just about everything you could ask for.
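Here is a sketch of how to read one of those long-format (`ls -l`) lines, tried on a freshly made file (the demo names are invented):

```shell
# Each long-format line breaks down roughly as:
#   drwxr-xr-x  2  frcastel  bio  4096  Nov 1 2019  exampledirectory1
#   [type+perms][links][owner][group][size][last modified][name]
# The first character is 'd' for a directory, '-' for a regular file;
# the next nine are read/write/execute permissions for the owner,
# the group, and everyone else.
mkdir -p /tmp/lsdemo && cd /tmp/lsdemo
touch demo.txt
ls -l demo.txt    # a '-rw-...' line: regular file, size 0
ls -ld .          # -d lists the directory itself: a 'drwx...' line
```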
:::danger
Brief reminder. Are you working in an interactive session? If you don't remember what that is, [click here](##MOVING-FROM-THE-LOGIN-NODE).
:::
## 'MAN' PAGES
If every Unix command has so many options, you might be wondering how you find out what they are and what they do. Well, thankfully every Unix command has an associated ‘manual’ that you can access by using the man command. E.g.
`man ls `
`man cd`
`man man # yes even the man command has a manual page`
When you are using the man command, press space to scroll down a page, b to go back a page, or q to quit. You can also use the up and down arrows to scroll a line at a time. The man command is actually using another Unix program, a text viewer called less, which we’ll come to later on.
## REMOVING DIRECTORIES
We now have a few (empty) directories that we should remove. To do this, use the rmdir command. It will only remove empty directories, so it is quite safe to use. If you want to know more about this command (or any Unix command), then remember that you can just look at its man page.
`cd ~/bootcamp`
`rmdir outer_directory/inner_directory`
Using `ls` correctly will show you that the outer_directory is still present but the inner directory is gone.
`rmdir outer_directory`
## USING TAB COMPLETION
Saving keystrokes may not seem important, but the longer that you spend typing in a terminal window, the happier you will be if you can reduce the time you spend at the keyboard. Especially, as prolonged typing is not good for your body. So the best Unix tip to learn early on is that you can tab complete the names of files and programs on most Unix systems. Type enough letters to uniquely identify the name of a file, directory or program and press tab…Unix will do the rest. E.g. if you type ‘tou’ and then press tab, Unix should autocomplete the word to ‘touch’ (this is a command which we will learn more about in a minute). In this case, tab completion will occur because there are no other Unix commands that start with ‘tou’. If pressing tab doesn’t do anything, then you may not have typed enough unique characters. In this case pressing tab twice will show you all possible completions.
Navigate to your home directory, make a 'Learning_unix' directory with the `mkdir` command, and then use the cd command to change to the Learning_unix directory. Use tab completion to complete the directory name. If there are no other directories starting with ‘L’ in your home directory, then you should only need to type ‘cd’ + ‘L’ + ‘tab’.
Tab completion will make your life easier and make you more productive! This trick can save you a LOT of typing! It can also save you many, many instances of trying to figure out what simple typos you may have made when entering a long path.
Another great time-saver is that Unix stores a list of all the commands that you have typed in each login session. You can access this list by using the `history `command or more simply by using the up and down arrows to access anything from your history. So if you type a long command but make a mistake, press the up arrow and then you can use the left and right arrows to move the cursor in order to make a change.
## CREATING EMPTY FILES WITH 'TOUCH'
The following sections will deal with Unix commands that help us to work with files, i.e. copy files to/from places, move files, rename files, remove files, and most importantly, look at files. First, we need to have some files to play with. The Unix command `touch` will let us create a new, empty file. The `touch` command does other things too (its main purpose is updating file timestamps), but for now we just want a couple of files to work with.
`cd bootcamp`
`touch heaven.txt`
`touch earth.txt`
`ls`
```
earth.txt exampledirectory1 exampledirectory2 examplefile1.txt heaven.txt
```
### A QUICK DIGRESSION.
Go back to your SFTP window and use your mouse to navigate to the bootcamp directory. Look around inside of it and you should see all of the work you've done to this point but in a graphical user format rather than on the command line.
Quickly drag and drop the file earth.txt to your desktop. This is how we will transfer files back and forth from HPCC to the local system.
## MOVING FILES (MOVING HEAVEN AND EARTH)
Now, let’s assume that we want to move these files to a new directory (‘temp’). We will do this using the Unix `mv `(move) command. Remember to use tab completion:
`mkdir temp`
`mv heaven.txt temp/`
`mv earth.txt temp/`
`ls`
```
exampledirectory1 exampledirectory2 examplefile1.txt temp
```
`ls temp/`
```
earth.txt heaven.txt
```
For the mv command, we always have to specify a source file (or directory) that we want to move, and then specify a target location. If we had wanted to, we could have moved both files in one go by typing either of the following commands:
`mv *.txt temp/ `
`mv *ea* temp/`
:::info
The asterisk * acts as a wild-card character, essentially meaning ‘match anything’. The second example works because only those two files contain the letters ‘ea’ in their names. The ‘?’ character is also a wild-card but with a slightly different meaning. See if you can work out what it does.
:::
:::warning
:warning: Using wild-card characters can save you a lot of typing, but be careful with them, especially with commands like `rm`. Once a file or directory is gone, it's gone forever. :warning:
:::
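As a sketch (using throw-away file names) of the difference between `*` and `?`:

```bash
touch fileA.txt fileAB.txt
ls file?.txt    # '?' matches exactly one character, so this lists fileA.txt only
ls file*.txt    # '*' matches any number of characters, so this lists both files
rm fileA.txt fileAB.txt   # clean up
```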
## RENAMING FILES
In the earlier example, the destination for the `mv `command was a directory name (temp). So we moved a file from its source location to a target location, but note that the target could have also been a (different) file name, rather than a directory. E.g. let’s make a new file and move it whilst renaming it at the same time:
`touch rags`
`ls`
```
exampledirectory1 exampledirectory2 examplefile1.txt rags temp
```
`mv rags temp/riches`
`ls temp/`
```
earth.txt heaven.txt riches
```
In this example we create a new file (‘rags’) and move it to a new location and in the process change the name (to ‘riches’). So `mv `can rename a file as well as move it. The logical extension of this is using `mv `to rename a file without moving it (you have to use `mv `to do this as Unix does not have a separate ‘rename’ command):
`mv temp/riches temp/rags`
Use `ls `to see what happened.
## MOVING DIRECTORIES: 'MV'
It is important to understand that as long as you have specified a ‘source’ and a ‘target’ location when you are moving a file, then it doesn’t matter what your current directory is. You can move or copy things within the same directory or between different directories regardless of whether you are in any of those directories. Moving directories is just like moving files:
`mkdir temp2`
`mv temp2 temp`
`ls temp/`
```
earth.txt heaven.txt rags temp2
```
## REMOVING FILES: 'RM'
You’ve seen how to remove a directory with the `rmdir `command, but `rmdir `won’t remove directories if they contain any files. So how can we remove the files we have created (inside bootcamp/temp)? In order to do this, we will have to use the `rm `(remove) command.
Please read the next section VERY carefully. Misuse of the rm command can lead to needless death & destruction
Potentially, `rm `is a very dangerous command; if you delete something with `rm`, you will not get it back! It is possible to delete everything in your home directory (all directories and subdirectories) with `rm`, that is why it is such a dangerous command. Never, NEVER use `rm *`
Let me repeat that last part again. It is possible to delete EVERY file you have ever created with the `rm `command. Are you scared yet? You should be. Luckily there is a way of making `rm `a little bit safer. We can use it with the `-i` command-line option which will ask for confirmation before deleting anything (remember to use tab-completion):
`cd temp`
`ls`
```
earth.txt heaven.txt rags temp2
```
`rm -i earth.txt heaven.txt rags`
You'll need to respond with 'y' to the following prompts.
```
rm: remove regular empty file ‘earth.txt’? y
rm: remove regular empty file ‘heaven.txt’? y
rm: remove regular empty file ‘rags’? y
```
`ls`
All you're left with is
```
temp2
```
We could have simplified this step by using a wild-card (e.g. `rm -i *.txt`) or we could have made things more complex by removing each file with a separate `rm `command. Let’s finish cleaning up:
`rmdir temp2`
`cd ..`
`rmdir temp`
:::warning
You could have gotten rid of both at once by just using rm with the `-rf `option on the top directory,
`rm -rf temp`
But again, this is dangerous territory.
:::
## COPYING FILES: 'CP'
Copying files with the `cp `(copy) command is very similar to moving them. Remember to always specify a source and a target location. Let’s create a new file and make a copy of it:
`mkdir copy`
`cd copy`
`touch file1`
`cp file1 file2`
`ls`
```
file1 file2
```
What if we wanted to copy files from a different directory to our current directory? Let’s put a file in our home directory (specified by ~ remember) and copy it to the current directory (the 'copy' directory we just made):
`touch ~/file3`
`ls ~`
```
bootcamp file3 <and possibly a bunch of other things>
```
`cp ~/file3 .`
`ls`
```
file1 file2 file3
```
This last step introduces another new concept. In Unix, the current directory can be represented by a `.` (dot) character. You will mostly use this only for copying files to the current directory that you are in. Compare the following:
`ls `
`ls . `
`ls ./`
In this case, using the dot is somewhat pointless because `ls` will already list the contents of the current directory by default. Also note how the trailing slash is optional. You can use `rm` to remove the temporary files and `rmdir` to remove the 'copy' directory.
:::info
Remove all three files using `rm`.
:::
## COPYING DIRECTORIES
The `cp `command also allows us (with the use of a command-line option) to copy entire directories. Use man `cp `to see how the `-R` or `-r` options let you copy a directory _recursively_.
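As a minimal sketch with throw-away names:

```bash
mkdir -p original/subdir
touch original/subdir/data.txt
cp -r original duplicate    # copies the directory and everything inside it
ls duplicate/subdir         # shows: data.txt
rm -r original duplicate    # clean up
```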
## VIEWING FILES WITH 'LESS'
So far we have covered listing the contents of directories and moving/copying/deleting either files and/or directories. Now we will quickly cover how you can look at files. The less command lets you view (but not edit) text files. We will use the `echo `command to put some text in a file and then view it:
`echo "Call me Ishmael."`
```
Call me Ishmael.
```
`echo "Call me Ishmael." > opening_lines.txt`
`ls`
```
opening_lines.txt
```
`less opening_lines.txt`
On its own, `echo` isn’t a very exciting Unix command. It just echoes text back to the screen. But we can redirect that text into an output file by using the `>` symbol. This is known as file redirection.
Careful when using file redirection (`>`), it will overwrite any existing file of the same name
When you are using `less`, you can bring up a page of help commands by pressing 'h', scroll forward a page by pressing space, or go forward or backwards one line at a time by pressing 'j' or 'k'. To exit `less`, press 'q' (for quit). The `less `program also does about a million other useful things (including text searching).
:::danger
Brief reminder. Are you working in an interactive session? If you don't remember what that is, [click here](##MOVING-FROM-THE-LOGIN-NODE).
:::
## VIEWING FILES WITH 'CAT'
Let’s add another line to the file:
`echo "The primroses were over." >> opening_lines.txt`
`cat opening_lines.txt`
```
Call me Ishmael.
The primroses were over.
```
Notice that we use `>>` and not just `>`. This operator will append to a file; if we had used `>`, we would have overwritten the file. The `cat` command displays the contents of the file (or files) and then returns you to the command line. Unlike `less`, you have no control over how you view that text (or what you do with it). It is a very simple, but sometimes useful, command. You can use `cat` to quickly combine multiple files or, if you wanted to, make a copy of an existing file:
`cat opening_lines.txt > file_copy.txt`
Use `ls` to view the result and then remove the new file.
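For example, a quick sketch of combining two throw-away files into a third:

```bash
echo "line one" > a.txt
echo "line two" > b.txt
cat a.txt b.txt > combined.txt   # concatenate both files into a new one
cat combined.txt                 # shows both lines, in order
rm a.txt b.txt combined.txt      # clean up
```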
## COUNTING CHARACTERS IN A FILE
`ls`
```
opening_lines.txt
```
`ls -l`
```
total 4
-rw-rw-r-- 1 [eraider] bio 42 Jun 15 04:13 opening_lines.txt
```
`wc opening_lines.txt`
```
2 7 42 opening_lines.txt
```
`wc -l opening_lines.txt`
```
2 opening_lines.txt
```
The `ls -l` option shows us a long listing, which includes the size of the file in bytes (in this case ‘42’). Another way of finding this out is by using Unix’s `wc` command (word count). By default this tells you how many lines, words, and characters are in a specified file (or files), but you can use command line options to give you just one of those statistics (in this case we count lines with `wc -l`).
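Each statistic can also be requested on its own. Assuming opening_lines.txt still contains just the two lines above:

```bash
wc -l opening_lines.txt   # lines only:  2
wc -w opening_lines.txt   # words only:  7
wc -c opening_lines.txt   # bytes only: 42
```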
## EDITING SMALL FILES WITH 'NANO'
Nano is a lightweight editor installed on most Unix systems. There are many more powerful editors (such as ‘emacs’ and ‘vi’), but these have steep learning curves. Nano is very simple. You can edit (or create) files by typing:
`nano opening_lines.txt`
You should see the following appear in your terminal:

The bottom of the nano window shows you a list of simple commands which are all accessible by typing ‘Control’ plus a letter. E.g. Control + X exits the program.
## THE $PATH ENVIRONMENT VARIABLE
One other use of the echo command is for displaying the contents of something known as environment variables. These contain user-specific or system-wide values that either reflect simple pieces of information (your username), or lists of useful locations on the file system. Some examples:
```
echo $USER
[eraider]
echo $HOME
/home/[eraider]
echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games #Yours will likely look different.
```
The last one shows the content of the $PATH environment variable: a colon-separated list of directories that are expected to contain programs that you can run. This includes all of the Unix commands that you have seen so far. These are files that live in those directories and are run like programs (e.g. `ls` is just a special type of file in the /bin directory).
Knowing how to change your $PATH to include custom directories can be necessary sometimes (e.g. if you install some new bioinformatics software in a non-standard location). We will do this at times throughout the class.
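As a sketch, here is how you might append a directory to your $PATH for the current session (my_tools is a made-up directory name; the change disappears when you log out unless you add the line to your ~/.bashrc):

```bash
export PATH="$PATH:$HOME/my_tools"   # my_tools is a hypothetical directory
echo $PATH                           # the new directory now appears at the end
```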
## MATCHING LINES IN FILES WITH 'GREP'
Use `nano` to add the following lines to `opening_lines.txt`:
```
Now is the winter of our discontent.
All children, except one, grow up.
The Galactic Empire was dying.
In a hole in the ground there lived a hobbit.
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
I am an invisible man.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.
```
:::info
You don't have to type all this text. You can copy this text block and paste it with Ctrl+Shift+V on Windows or ⌘+V on macOS.
:::
Once you have added these lines to the file, press Ctrl+X (⌃+X on a Mac). You will be prompted `Save modified buffer?`, so press the Y key. You will then be asked `File Name to Write: opening_lines.txt`; press Enter or Return to accept.
You will often want to search files to find lines that match a certain pattern. The Unix command `grep `does this (and much more). The following examples show how you can use grep’s command-line options to:
* show lines that match a specified pattern
* ignore case when matching (-i)
* only match whole words (-w)
* show lines that don’t match a pattern (-v)
* use wildcard characters and other patterns to allow for alternatives (*, ., and [])
`grep was opening_lines.txt`
```
The Galactic Empire was dying.
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.
```
The `-v` option inverts the search.
`grep -v was opening_lines.txt`
```
Call me Ishmael.
The primroses were over.
Now is the winter of our discontent.
All children, except one, grow up.
In a hole in the ground there lived a hobbit.
I am an invisible man.
```
`grep all opening_lines.txt`
```
Call me Ishmael.
```
The `-i` option allows you to ignore the case of the searched string.
`grep -i all opening_lines.txt`
```
Call me Ishmael.
All children, except one, grow up.
```
`grep in opening_lines.txt`
```
Now is the winter of our discontent.
The Galactic Empire was dying.
In a hole in the ground there lived a hobbit.
It was a bright, cold day in April, and the clocks were striking thirteen.
I am an invisible man.
Marley was dead, to begin with.
```
The `-w` option matches the searched string only as a whole word.
`grep -w in opening_lines.txt`
```
In a hole in the ground there lived a hobbit.
It was a bright, cold day in April, and the clocks were striking thirteen.
```
See if you can figure out what the following searches accomplish.
`grep '[aeiou]t' opening_lines.txt`
```
In a hole in the ground there lived a hobbit.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.
```
`grep -w -i '[aeiou]t' opening_lines.txt`
```
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
```
## COMBINING UNIX COMMANDS WITH PIPES
One of the most powerful features of Unix is that you can send the output from one command or program to any other command (as long as the second command accepts input of some sort). We do this using what is known as a pipe, implemented with the ‘|’ character (which seems to be on a different key depending on the keyboard you are using). Think of the pipe as simply connecting two Unix programs. Here’s an example which introduces some new Unix commands:
`grep was opening_lines.txt | wc -c`
```
316
```
The above command searches the specified file for lines matching ‘was’ and sends the lines that match through a pipe to the `wc` program. We use the `-c` option to count the total number of characters in the matching lines (316).
The following uses some built-in commands that we haven't discussed yet.
`grep was opening_lines.txt | sort | head -n 3 | wc -c`
```
130
```
The second example first sends the output of `grep `to the Unix `sort `command. This sorts a file alphanumerically by default. The sorted output is sent to the `head `command which by default shows the first 10 lines of a file. We use the `-n` option of this command to only show 3 lines. These 3 lines are then sent to the `wc` command as before.
Whenever making a long pipe, test each step as you build it!
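In practice, that means running each stage and eyeballing the output before adding the next pipe:

```bash
grep was opening_lines.txt                             # 1: the matching lines
grep was opening_lines.txt | sort                      # 2: now sorted
grep was opening_lines.txt | sort | head -n 3          # 3: just the first three
grep was opening_lines.txt | sort | head -n 3 | wc -c  # 4: the character count
```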
## MISCELLANEOUS UNIX POWER COMMANDS
The following examples introduce some other Unix commands, and show how they _could be_ used to work on a **fictional** file called file.txt. Remember, you can always learn more about these Unix commands from their respective man pages with the man command. _These are not all real world cases I'm asking you to perform, but rather show the potential diversity of Unix command-line tools._
View the penultimate 10 lines of a file (using `head `and `tail `commands):
`tail -n 20 file.txt | head`
Show lines of a file that begin with a start codon (ATG) (the `^` matches patterns at the start of a line):
`grep "^ATG" file.txt`
Cut out the 3rd column of a tab-delimited text file and sort it to only show unique lines (i.e. remove duplicates):
`cut -f 3 file.txt | sort -u`
Count how many lines in a file contain the words ‘cat’ or ‘bat’ (`-c` option of `grep `counts lines):
`grep -c '[bc]at' file.txt`
Turn lower-case text into upper-case (using `tr `command to ‘transliterate’):
`cat file.txt | tr 'a-z' 'A-Z'`
Change all occurrences of ‘Chr1’ to ‘Chromosome 1’ and write the changed output to a new file (using the `sed` command):
`cat file.txt | sed 's/Chr1/Chromosome 1/' > file2.txt`
:::danger
Brief reminder. Are you working in an interactive session? If you don't remember what that is, [click here](##MOVING-FROM-THE-LOGIN-NODE).
:::
## CHANGING FILE/DIRECTORIES PERMISSIONS WITH 'CHMOD'
> (modified from https://www.guru99.com/file-permissions.html)
Say you do not want your colleague to see your research files. This can be achieved by changing file permissions.
We can use the _chmod _command which stands for 'change mode'. Using the command, we can set permissions (read, write, execute) on a file/directory for the owner, group and the world.
Usage: `chmod permissions filename`
There are 2 ways to use the command: Absolute mode and Symbolic mode
### Absolute (numeric) mode
In this mode, file permissions are not represented as characters but as a three-digit octal number. The table below gives the numbers for all the permission types.
|Number |Permission Type | Symbol |
|---|---|---|
|0 |No Permission |---|
|1 |Execute |--x|
|2 |Write |-w-|
|3 |Execute + Write |-wx|
|4 |Read |r--|
|5 |Read + Execute |r-x|
|6 |Read + Write |rw-|
|7 |Read + Write + Execute |rwx|
Perhaps you have a file, text.txt.
`chmod 764 text.txt`
The above command will change permissions as follows:
* Owner can read, write and execute
* Usergroup can read and write
* World can only read
This is shown as `-rwxrw-r--`.
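A quick sketch you can try yourself (text.txt here is a throw-away file):

```bash
touch text.txt
chmod 764 text.txt
ls -l text.txt    # the permissions column reads -rwxrw-r--
rm text.txt       # clean up
```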
### Symbolic mode
In the Absolute mode, you change permissions for all 3 owners. In the symbolic mode, you can modify permissions of a specific owner. It makes use of mathematical symbols to modify the file permissions.
|Operator |Description|
|---|---|
|+ |Adds a permission to a file or directory|
|- |Removes the permission|
|= |Sets the permission and overrides the permissions set earlier.|
The various owners are represented as
|User|Denotations|
|---|---|
|u| user/owner|
|g| group|
|o| other|
|a| all|
We will not be using permissions as numbers like _755_ but as characters like _rwx_.
`chmod o=rwx text.txt` allows the other users to read, write, and execute the file.
`chmod u-r text.txt` removes read permission from the user (owner).
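You can also combine symbolic changes in one command, separated by commas. A sketch, again on a throw-away file:

```bash
touch text.txt
chmod u=rw,g=r,o= text.txt   # owner: read/write; group: read; others: nothing
chmod g+w text.txt           # then add write permission for the group
ls -l text.txt               # the permissions column now reads -rw-rw----
rm text.txt                  # clean up
```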
## Moving files to and from HPCC
To move files to and from HPCC, you will need to use an FTP client like the one you already used with Termius.
## For you to do
If you've followed this tutorial all the way through, this will be easy.
`history | tail -200 > history_tutorial_01.txt`
Download this file to your local computer using any of the file transfer methods described above.
For additional reading, read through this document (https://kb.iu.edu/d/afsk) to use as a cheat sheet.
---
# WORKING WITHIN THE QUEUE
Working with an interactive session allows you to interact directly with the file system, but many of the jobs we will be running take hours if not days, so keeping a terminal open is not feasible. Thus, you will need to become familiar with working in the queuing system (aka scheduler) that's built in to HPCC. For this exercise, you will NOT get an interactive session. But don't get used to that. Generally, you should get an interactive session for most everything you do on the system other than simply submitting a job.
To do this, we use submission scripts.
Transfer the following script to your bootcamp folder,
`cp /home/frcastel/counter.sh ~/bootcamp`
and then go to that directory.
`cd ~/bootcamp`
Before you run the script, peek at the the queue with
`squeue`
A very long list will appear. It's long because it lists every active and queued job from every user.
Generally, we don't care about anything others are running. So, let's only look at what you're running.
`squeue -u [eraider]`
The -u portion of the command tells the scheduler to show only jobs for a specific user (you, in this case).
Assuming you don't have other jobs running, you'll see nothing but the header line.
```
JOBID PARTITION PRIORI ST USER NAME TIME NODES CPUS QOS NODELIST(REASON)
```
Now, run the counter script using:
`sbatch counter.sh`
and look at the queue again. The details will be different but it should look something like this:
```
login-20-25:$ squeue -u [eraider]
2968376 nocona 3581 R daray counter 6:08 1 1 normal cpu-23-12
```
Each column means something. Here are the column headers that you're not seeing because you've selected only lines with your ID on it.
```
JOBID PARTITION PRIORI ST USER NAME TIME NODES CPUS QOS NODELIST(REASON)
```
This shows that your job is running on whatever compute node to which it was assigned. Important things to notice.
* **'JOBID'** for this job is 2968376. What's yours? It will be important if you want to kill this job or keep up with what's happening with it.
* **'PARTITION'** The partition (part of HPCC) on which you set your job running.
* **'PRIORI'** is the priority status in the queue. The higher the number, the better for getting your job run.
* **'ST'** Your job's status.
* **'R'** (running) is good.
* **'PD'** (pending) still good but you gotta wait your turn in the queue.
* **'CG'** (completing). Your job is ending without finishing. This usually means something was wrong with your script.
* **'USER'** That's you. You can view processes other people are running but that's not important right now.
* **'NAME'** Try to name your jobs something easily identifiable but also note that you'll only see up to 8 characters.
* **'TIME'** Time your script has been running.
* **'NODES'** The number of nodes assigned to this task.
* **'CPUS'** The number of CPUs assigned to this task.
* **'QOS'** Quality of service. This has information on why your job may be being held up or other data.
* **'NODELIST(REASON)'** Will tell you which specific nodes you're using or the reason for your job hold.
More information is available if you type `man squeue`
View `counter.sh` in `less`:
`less counter.sh`
```bash=
#!/bin/bash
#SBATCH --job-name=counter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
for i in `seq 1 1000`;
do
echo $i
sleep 1
done
```
The first 7 lines are the submission script 'header'. Most of this stuff is standard. You'll only change a couple of things for your jobs.
* '#!/bin/bash' Sets the command interpreter path. In this case, we're using bash and that interpeter is located at /bin/bash.
* '#SBATCH --job-name=counter' Name your job 'counter'.
* '#SBATCH --output=%x.%j.out' Indicates the name of the standard output file; %x is the job name and %j is the job ID, e.g. 'counter.1915516.out'.
* '#SBATCH --error=%x.%j.err' Indicates the name of the standard error file, e.g. 'counter.1915516.err'.
* '#SBATCH --partition=nocona' Instructs the scheduler to use the partition named 'nocona'. You could also choose quanah or ivy.
* '#SBATCH --nodes=1' Instructs the scheduler to use one node (on nocona, a node consists of 128 processors).
* '#SBATCH --ntasks-per-node=1' Instructs the scheduler to use one task per node (aka one processor from each node).
There are other possible lines to specialize this setup but this class doesn't really need to go into them.
The next five lines tell the script what to do:
```bash=
for i in `seq 1 1000`;
do
echo $i
sleep 1
done
```
This sets up a variable, 'i', that ranges from 1 to 1000, counting by 1's. On each pass through the loop, 'echo $i' prints the current value to the screen, 'sleep 1' waits one second, and the loop repeats with the next number. It will continue to do that until the number 1000 is reached.
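If you want to see the loop logic on its own, here is the same construct shrunk to five iterations with no sleep, so you can run it directly on the command line and it finishes instantly:

```bash
for i in `seq 1 5`
do
    echo $i     # prints 1 through 5, one number per line
done
```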
To see what's happening, we can look at the standard output file. To find out what that standard output file is named use `ls`.
You should see at least two new files called `counter.<job-ID>.out` and `counter.<job-ID>.err`. The job-ID will be the value you got from the first column of the `squeue` results.
To see what's happening in this file, use `tail` with the `-f` (follow) option.
`tail -f counter.<job-ID>.out`
You should see a growing list of numbers. The actual numbers you see will depend on how far into the run the program has gotten when you issue your command. Below is what you'd see if you started to follow the file after the script had already counted to '19'. Watch it for a few seconds and see what happens.
```
:$ tail -f counter.1915525.out
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
```
To exit this tail command use `ctrl+c`.
Now that you've gotten the idea, we can kill this job. After all, there's no need to use up processors that someone else could be using.
Anytime you want to kill a job just type
`scancel <job-ID>`.
You can also view any interactive sessions you might have active using squeue. Grab 10 processors on nocona with an interactive login.
`interactive -p nocona -c 10`. *Notice that these entries are similar to the last three lines of the submission script header.*
Now check the queue. You'll see something like this.
```
cpu-4-7:$ squeue -u [eraider]
2968515 nocona INTERACT daray R 0:15 1 cpu-25-60
```
Your interactive session is named INTERACT (remember, the queue will only give you the first 8 characters of the job name).
To exit your interactive session, type `exit` or kill that process with `scancel <job-ID>`.
# WORKING WITH VARIABLES
> Written by FXC for the Genomes and Genome Evolution class at TTU
Variables store pieces of information. That information can be temporary or permanent, and its contents may change, hence the name. A variable can contain a character, a string of characters, numbers, or a mix of them (alphanumeric). In bash scripting, it is common practice to use variables because of their practicality and applicability, as you will see in this exercise. The utility of working with variables is most noticeable when you are working with complex scripts or loops.
## Environmental vs local variables
Every operating system has a predefined set of information describing how it should work and what specific software or module it should use to perform certain actions. The predefined variables that control all of this are the **environmental variables**. The HPCC terminal is no exception, and you can see what these are by typing `printenv`.
**Local variables**, on the other hand, are defined by the user, can be changed at any time, and are erased when your session ends. You will make intensive use of local variables in this course.
To make it easy to understand, let's start with one of the most common examples in any programming language.
Type `greeting="Hello World!"`, followed by `echo $greeting`
```
Hello World!
```
Notice that the content of the variable should be written in quotes, although in some cases this is not strictly necessary. You have created a variable named "greeting", which contains two words and the space between them, or, technically speaking, a string of characters. Bash identifies variables by the dollar sign "$" preceding the name.
You can also add as many characters as you want. For example, `introduction="My name is David Ray and I am a professor at TTU"`, and to display its content type `echo $introduction`
```
My name is David Ray and I am a professor at TTU
```
You can type it all together with `echo $greeting $introduction`
```
Hello World! My name is David Ray and I am a professor at TTU
```
If you want to add more text to an existing variable, or rewrite it, you must type the exact same variable name and change the text inside the quotes: `introduction="My name is David Ray, I am a professor at TTU, and will be your GGE teacher in fall 2024!"`. Again, you can run `echo $greeting $introduction` to check the updated variable content.
```
Hello World! My name is David Ray, I am a professor at TTU, and will be your GGE teacher in fall 2024!
```
### Naming conventions
As you noticed, you can write anything as the content of a variable, however, there are some rules to follow when naming them.
It isn't possible, for example, to use numbers as names, or to use a number as the first character of your variable name. Try this `1variable="This will not work"`
```
-bash: 1variable=This will not work: command not found
```
Of course, bash wasn't able to store it and was actually waiting for a command or function instead.
Now try `variable1="This will work"` followed by `echo $variable1`
```
$ variable1="This will work"
$ echo $variable1
This will work
```
Also, you are not able to use spaces if you decide to name your variable with two or more words, try `space variable="some text"`
```
bash: space: command not found
```
The result will be different if you name your variable "space_variable". Try it yourself! Moreover, it is highly recommended, and good practice, to name a variable based on its content. You don't want to have a variable named "greeting" when it contains a path to a folder, or a DNA/amino acid sequence.
One last thing. A habit I've gotten into (and this is not common among people who write code but I find it very useful) is to use all capital letters to assign my variables. This makes it easier to identify the variables in my code, especially if there are many, many lines.
In other words, I prefer `MY_VARIABLE="variable text"` to `my_variable="variable text"`. You don't have to do this but I recommend it.
## Using variables
Using variables allows you to replace long strings of characters with a shorter, easier-to-use element. This is especially handy when dealing with long paths to directories or files that you use constantly and that may be nested in a directory, inside a directory, inside a directory, like:
```
/this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt
```
Instead of typing this over and over again, you can simply assign the variable with
`MY_PATH="/this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt"`
And if you need to access that file, you simply type:
`$MY_PATH`
rather than `/this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt`
### Files and paths
Before proceeding, make sure you get a refresher on working directories in [Exercise 01 - You are here](https://github.com/davidaray/Genomes-and-Genome-Evolution/wiki/01.-Logging-In-to-HPCC-and-an-Intro-to-Bash-and-Linux-Navigation#you-are-here-pwd), if you need it.
Let's make sure you are in your home directory with `pwd`. You should see:
```
/home/[eraider]
```
Now, copy a directory I set up for this exercise using `cp -r /home/frcastel/arcane .`
You now have a copy of that directory tree, and everything in it, in your home directory: /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise
Notice the very long path. It'd be a real pain to type that repeatedly.
However, I put a file you've already used in a previous exercise into the `exercise` directory at the end of that long path and you may want to look at it or use it multiple times during some process.
So, you could type `cat /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter.sh`
```bash=
#!/bin/bash
#SBATCH --job-name=counter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
for i in `seq 1 1000`;
do
echo $i
sleep 1
done
```
Or, you could do the following to avoid having to type that long path over and over again whenever you want to use that file.
```bash
COUNTER_DIR="/home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/"
```
You can now see the file with `cat $COUNTER_DIR/counter.sh`
This allows you to do things like extracting information from that script and saving it in another file, or into another variable.
```
# Check the partition that you used to run the counter script
grep -w partition $COUNTER_DIR/counter.sh
```
Which will display:
```
#SBATCH --partition=nocona
```
Notice that it returns the part of the header where the partition of HPCC is assigned.
If you wanted to change the partition to be used for that particular script without using nano, you could do something like
`sed "s/nocona/quanah/g" $COUNTER_DIR/counter.sh > $COUNTER_DIR/counter_quanah.sh`
Open the new file and corroborate that the change was done.
Compare the command line you just used to one not making use of variables
```bash!
sed "s/nocona/quanah/g" /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter.sh > /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter_quanah.sh
```
Ughhh.. awful!
They do exactly the same thing but which one is easier to understand and to enter on the command line?
:::warning
**NOTE:** You may encounter this error:
```
sed: can't read /lustre/scratch/[eraider]/gge2024/bin/counter.sh: No such file or directory
```
99.9% of the time, this error occurs because you didn't type the directory correctly, so the computer cannot locate the file. You can always check the variable and change its contents to fix this.
:::
### Command substitution
You can also capture the output of a command and store it in a variable. To make that possible, there's a small modification to be done in the syntax you have been using so far.
`VARIABLE=$(command)`
For example, if you wanted to store today's date and time, the `date` command can do that.
`CURRENT_DATE=$(date)` #save date and time in a variable
`echo "Today's date and time is: $CURRENT_DATE"` #Print it
We can do something a bit more complicated. Let's use the opening_lines.txt file from Exercise 01 and do some filtering with commands you have already learned. First, make sure you change to the directory where this file is stored.
`CHARACTER_NUMBER=$(grep was opening_lines.txt | wc -c)`
Breaking it down: you used `grep` to find lines containing the word "was" in the file opening_lines.txt, then used a pipe `|` to send those matching lines to `wc -c`, which counted their characters. The result was stored in the variable `$CHARACTER_NUMBER`.
:::info
**NOTE:** If the last couple of command lines have confused you, please review [Exercise 01 - Matching lines in files with "grep"](https://github.com/davidaray/Genomes-and-Genome-Evolution/wiki/01.-Logging-In-to-HPCC-and-an-Intro-to-Bash-and-Linux-Navigation#matching-lines-in-files-with-grep), and [Combining UNIX commands with pipes](https://github.com/davidaray/Genomes-and-Genome-Evolution/wiki/01.-Logging-In-to-HPCC-and-an-Intro-to-Bash-and-Linux-Navigation#combining-unix-commands-with-pipes) before you move on with this exercise.
:::
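The same pattern works with any pipeline. Here is a self-contained sketch you can run anywhere; the sample file and its three lines are invented, so it does not depend on the course directories:

```shell
# Build an invented three-line sample file for illustration.
printf 'It was a dark night\nCall me Ishmael\nIt was the best of times\n' > lines.txt

CHARACTER_NUMBER=$(grep was lines.txt | wc -c)   # characters in the matching lines
LINE_NUMBER=$(grep -c was lines.txt)             # how many lines matched

echo "Characters: $CHARACTER_NUMBER"
echo "Matching lines: $LINE_NUMBER"

rm lines.txt    # clean up the demo file
```

Anything a pipeline prints to the screen can be captured this way; the variable simply holds whatever text the last command in the pipe produced.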
## Putting it all together
How is any of this useful? In the context of programming, bioinformatics and genomics, you can simplify your code and change the content of a variable and work with several paths at once. I will illustrate how you could take advantage of variables in a very basic and over simplified pipeline to filter and identify protein sequences transcribed by the BRCA2 gene, which are known to increase the risks of breast and ovarian cancer. Read more about it [here](https://www.cancer.gov/about-cancer/causes-prevention/genetics/brca-fact-sheet).
This gene is located on chromosome 13q12. The coding region is typically around 11,000 bp long, which means it encodes a protein of around 3,000 amino acids.
I have created some directories and files to have you practice with variables. Make sure you are located in:
```
/lustre/scratch/[eraider]/bootcamp2025/
```
Now copy the files `cp -r /lustre/work/frcastel/gge2024/variables_exercise .`, and `cd` into the `variables_exercise` directory.
You should see two folders when you type `ls`.
```
annotations sequences
```
Let's create two variables for each one of those:
```bash=
ANNOTATION_DIR="/lustre/scratch/[eraider]/bootcamp2025/variables_exercise/annotations"
SEQ_DIR="/lustre/scratch/[eraider]/bootcamp2025/variables_exercise/sequences"
```
Let's see what's inside the file located in `$ANNOTATION_DIR` with `cat $ANNOTATION_DIR/brca_description.txt`
```
BRCA2 homo_sapiens 3418 P51587 P51587_BRCA2_HUMAN
BRCA2 mus_musculus 3329 P97929 P97929_BRCA2_MOUSE
BRCA2 ursus_americanus 3458 A0A452RG22 A0A452RG22_A0A452RG22_URSAM
BRCA2 panthera_leo 3356 A0A8C8XLK1 A0A8C8XLK1_A0A8C8XLK1_PANLE
PRDX1 myotis_lucifugus 202 Q6B4U9 Q6B4U9_PRDX1_MYOLU
BRCA2 myotis_lucifugus 315 G1NUN6 G1NUN6_G1NUN6_MYOLU
PALB2 myotis_lucifugus 1118 G1Q4H6 G1Q4H6_G1Q4H6_MYOLU
```
This file contains information in columns from left to right:
1) What gene it is
2) What organism it is from
3) Its length
4) An accession number that can be looked up in [this protein database called UNIPROT](https://www.uniprot.org/)
5) The name with which the sequence is identified in the fasta file
Now let's have a look at the BRCA protein sequences `cat $SEQ_DIR/brca_sequences.fasta`.
```
>P51587_BRCA2_HUMAN
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESLKKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVANQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDCILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCENDPKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPT
.... and a lot more
```
First, note that the lines starting with a `>` contain the name of each sequence, and just below each name are letters representing the amino acids that make up the protein. If you look closely at sequence "Q6B4U9_PRDX1_MYOLU", you will notice it has an asterisk `*` among the amino acids. This means there is a stop codon there, which could result from an error in the genome assembly or annotation, or could indicate a pseudogene. This protein might cause issues in some analyses and, besides, it is a different gene called PRDX1. So, let's get rid of it.
You can use `grep` to locate where that `*` is, and store it along the protein sequence name in a variable.
`BAD_PROTEIN=$(grep -B1 "*" "$SEQ_DIR"/brca_sequences.fasta)`
The `-B1` argument tells grep that once it finds a match, it should also include the line above it, which in this case is the name of the protein. Try `echo "$BAD_PROTEIN"`
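Here is a tiny runnable illustration of `-B1`; the two-sequence FASTA below is invented for the demo:

```shell
# Invented two-sequence FASTA; only BAD_SEQ contains a stop codon (*).
printf '>GOOD_SEQ\nMPIGSKERPT\n>BAD_SEQ\nMPIG*KERPT\n' > demo.fasta

# -B1 prints the matching line plus one line Before it (the header).
BAD_PROTEIN=$(grep -B1 '*' demo.fasta)
echo "$BAD_PROTEIN"

rm demo.fasta    # clean up the demo file
```

Only the `>BAD_SEQ` header and its sequence end up in the variable; `GOOD_SEQ` is untouched because it contains no `*`.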
Now you can eliminate the protein that contains the stop codon:
`grep -vF "$BAD_PROTEIN" $SEQ_DIR/brca_sequences.fasta > filtered_brca_sequences.fasta`
Now open the new `filtered_brca_sequences.fasta` file, and you will notice that the sequence is gone.
:::info
Use your UNIX skills and find out what `-vF` does.
:::
**Something you should know**
What you have just done is an extremely basic approximation of the pre-filtering steps you would typically do in a comparative genomics project. In this case, you made sure that you kept only BRCA2 sequences that do not contain stop codons.
To make this more concrete, go to [this link](https://useast.ensembl.org/Homo_sapiens/Location/Compara_Alignments/Image?align=1960&db=core&g=ENSG00000139618&r=13:32311387-32345493&t=ENST00000380152;mr=13:32300404-32358384). You'll see the alignment of the genomic regions in human and mouse where this gene is located (its locus) in each organism. In fact, the proteins you used in this exercise are encoded by the very genes you are seeing; I used this database to build this exercise. With this tool you can explore the structural genetic differences between both organisms.
Notice that the aligned region is 34.43 kb long, and that the lengths of the genes are approximately the same, with some differences in the mouse. The orange/brownish boxes represent exons, and they seem to be conserved in both organisms, with some differences in the exons located close to the 15 kb mark of the alignment. Keep in mind that these lengths are relative and refer to the alignment alone.
Bear in mind that this process can be achieved much more easily with a combination of _for loops_ and a tool called [seqkit](https://bioinf.shenwei.me/seqkit/), replacing the content of the variables in each iteration.
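As a rough sketch of that idea (the species list and file-naming scheme below are invented; in a real pipeline each iteration would call seqkit or a similar tool on the file):

```shell
# Loop over several species, resetting the variables on each iteration.
# Species names and file names are hypothetical, for illustration only.
for SPECIES in homo_sapiens mus_musculus myotis_lucifugus
do
    SEQ_FILE="${SPECIES}_brca2.fasta"    # hypothetical per-species file
    echo "Would process $SEQ_FILE"       # a real pipeline would run a tool here
done
```

Each pass through the loop reassigns `$SPECIES` and `$SEQ_FILE`, so one block of commands serves every species in the list.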
# WORKING WITH CONDA
Conda is an exceptionally useful package and environment manager. But it can be a little tricky if you want to use it for multiple tasks. Sometimes when you try to install multiple packages into a single conda environment, they can conflict. Resolving those conflicts can be difficult and it's best to just avoid them through the use of multiple conda environments. While you don't have to do this, my experience has been that it's worth the trouble.
Thus, for each of the tasks you'll accomplish in this class, I will recommend creating a separate conda environment. Complete instructions for working with conda environments can be found at [this helpful site](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
Over the years of teaching this class, I've noticed that some people have a bit of trouble understanding what's happening with these environments. Here is an analogy that some have found helpful. With conda, you are building a separate workshop for whatever you intend to do. In any workshop you need tools. In a woodworking shop, you need a saw, sandpaper, a drill, etc. In a kitchen (a type of workshop), you need an oven, utensils, a sink, etc. For each task we will use `conda create -n <name of environment>` to build our workshop. We then enter the workshop using `conda activate <name of environment>`. We've entered the workshop to do the work but the workshop is empty. No tools have been brought in yet. We bring in the necessary tools by using `conda install <name of the software being installed>`. Sometimes you need to go to a specialty store (Lowes, Home Depot, Kitchen Suppliers Inc.) to get the right tools. In conda, those specialty stores are called 'channels'. To specify a particular channel you invoke the option `-c`. Two of the channels we will use are [bioconda](https://bioconda.github.io/) and [conda-forge](https://conda-forge.org/).
So, the general pattern will be as follows:
* Build your workshop/conda environment - `conda create`
* Enter your workshop/environment - `conda activate `
* Install all of your tools/software - `conda install`
You're then ready to use your new workshop. From then on, there is no need to rebuild your environment or reinstall the tools; you can just activate your environment and start work. You wouldn't rebuild a new workshop every time you need to drill a hole, would you? No, you just go back to your workshop and use the drill. So, DO NOT create an environment or install software every time you need to use it. You'll just be wasting your time, and doing that will tell me that you don't actually understand what conda is all about.
For each activity in the course, we will create a separate working environment.
The HPCC has recently changed the way Conda is handled, so you will need to [follow these instructions to install it](https://www.depts.ttu.edu/hpcc/userguides/application_guides/Miniforge.php).
## Bedtools, seqkit, and sra-tools
Throughout your tasks, you will be handling lots of genomic data, and viewing and editing those files is a MUST.
Let's install some packages that will come in handy:
```
# Create a new environment for genomics tools
conda create -n genomics_class python=3.9
# Activate the environment
conda activate genomics_class
# Install our key tools
conda install -c bioconda bedtools sra-tools seqkit
conda install -c conda-forge wget curl
# Verify installations
conda list
```
You will be downloading next-generation sequencing data of *E. coli* from the NCBI using the SRA toolkit. These files can be large, so we are going to compress them (gzip) as we download them.
```
# Don't forget to activate your environment every time you login
conda activate genomics_class
# go to your directory
cd /lustre/scratch/[eraider]/bootcamp2025
# Create a working directory
mkdir -p genomics_class/data
cd genomics_class/data
# Small RNA-seq dataset
fastq-dump --split-files --gzip SRR12442220
```
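A fastq record is four lines: an `@id` header, the sequence, a `+` separator, and a quality string. You can peek at your download with `zcat SRR12442220_1.fastq.gz | head -4`; the self-contained demo below builds a tiny invented fastq so you can see the idea even without the download:

```shell
# Build a tiny, invented gzipped fastq (two 8-bp reads) for demonstration.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTTTACGT\n+\nIIIIIIII\n' | gzip > demo.fastq.gz

zcat demo.fastq.gz | head -4   # show just the first 4-line record

rm demo.fastq.gz               # clean up the demo file
```

`zcat` decompresses to the screen without ever writing an uncompressed copy to disk, which is exactly why we keep these files gzipped.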
Read here for more info on fastq files. XXXXX
Now let's get some basic stats with seqkit.
```
# Get basic stats
seqkit stats *.fastq.gz
# More detailed statistics
seqkit stats -a *.fastq.gz
```
```
# Filter sequences by length (keep reads >= 50bp)
seqkit seq -m 50 --remove-gaps SRR12442220_1.fastq.gz > filtered_SRR12442220_1.fastq
# Get sequences whose IDs match a pattern (-r enables regular expressions)
seqkit grep -r -p "^SRR12442220" SRR12442220_1.fastq.gz
# Sample a subset of reads (useful for testing)
seqkit sample -p 0.1 SRR12442220_1.fastq.gz -o subset_SRR12442220_1.fastq.gz
```