UPGG Computational Bootcamp 2025 Day 1

---
tags: course notes
---

# UPGG Computational Bootcamp 2025

## Day 1 - ☕️ Morning ☕️

### Introduction to Unix Shell

#### Instructor: Max Bucklan

#### Goals!

* Intro to Shell
* Navigating files and directories
* Writing Scripts and working with data
* Project Organization

#### Intro to Shell

**Why should you care about shell/command line?**

* The Unix command line is a way to communicate directly with a computer! Most notably, you can interact with the Duke Computing Cluster (DCC). DCC is a powerful computing resource that you can use for analyses and data storage. The only "problem" is that you can only interact with DCC through the command line.
* Many bioinformatics tools can only be used through a command line interface.
* Automation makes your life easier. Let the computer do the repetitive tasks for you, so you can spend your time and energy elsewhere!
* We humans also make a lot of errors, especially when we have to do repetitive tasks. Computers are a lot better at doing that!
* Writing your code down makes your work a lot more reproducible, which is crucial in scientific research.

**Servers** = A single computer that you can use to work or host apps/websites

**Compute clusters** = Many computers/servers combining their power!

**Cloud Computing** = On-demand resources like a cluster sourced from the cloud, very powerful!

**From you to the computer:**

- User
- Window program (Konsole, gnome-terminal, xterm) - for our bootcamp: Git Bash, Terminal
- Shell (bash, zsh, bourne shell)
- GNU/Linux operating system

*Note:* *Windows users need to download Git Bash in order to interact with the correct backend for this workshop - the Windows default terminal (PowerShell) is designed for a different backend system*

#### Opening your terminal

The first line you see probably includes (but this depends on your machine and settings):

* The machine name! This is normally your name and your machine!
* `~` designates where you are (home directory)
* Username: who is accessing the terminal right now!
* `$` separates the information above from the input area where you can type commands

`bash` vs `zsh`

They are mostly similar. `bash` was the default on Mac devices for a long time, before Apple made `zsh` the default shell. Keep an eye on which shell your machine is using, but most common commands are the same.

If you have junk in your terminal, you can type `clear` and hit enter to clear the screen.

If you don't see the dollar sign as the "terminal prompt" before the line, you can type `PS1='$'` to change the terminal prompt.

#### Navigating directories

Remember that spaces have meaning on the command line.

```bash=
pwd # This is short for "print working directory"
ls # This prints a "list" of the files and directories within the working directory
ls -l # Adding an option to the command. The -l option displays long-format info about files in the directory
```

The `-l` is an option! Be sure that you have a space between the command, `ls`, and `-l`. This specific option tells the shell that you want the list to be printed in the "long" format, containing more information.

**Tab completion**

While you are typing commands or file names, you can use `tab completion` to complete the word for you. It will try to match what you are typing to what is available in the file system, and it will complete up to the point where it is unambiguous which file you are looking for.

**Command manual**

To figure out what options are available for a command, you can look up the "manual" of that command by running `man <command>`. For example,

```bash=
man ls
man pwd
```

Sometimes a command does not have a manual page available through `man`. In that case, you can most likely run `<command> -h` to open the "help" page, which functions similarly to a `man` page.
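If you are not sure which of the two routes will work on your machine, a small helper can check first. This is only a minimal sketch: `help_route` is a hypothetical function name (not a standard command), and the fallback assumes the tool accepts `--help`, which not every command does.

```bash
#!/bin/bash
# help_route is a hypothetical helper name, not a standard command.
# It prints which documentation command is likely to work for a given tool.
help_route() {
    local cmd="$1"
    # `command -v man` succeeds only if `man` is installed on this machine.
    # `man -w` prints the location of the manual page instead of opening it,
    # and fails if no page exists for the command.
    if command -v man >/dev/null 2>&1 && man -w "$cmd" >/dev/null 2>&1; then
        echo "man $cmd"
    else
        echo "$cmd --help"
    fi
}

# Ask which route to try for `ls`
help_route ls
```

The function only prints a suggestion, so it is safe to run anywhere; you still type the suggested command yourself.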
*Note:* *Windows users in Git Bash will not have the `man` command at all and should use `-h` or `--help` as above.*

These "help" and "manual" pages might open a "new window" that lists a lot of information about that command, and you will notice that the original terminal has disappeared. To go back to the original terminal, press `q`. What is happening here is that the `man` command (and, for some tools, the `-h` option) opens up a program called `less`. We'll get into it later. Think of it as a "file viewer" within Terminal. You can view files using `less`, but you cannot edit their content.

**Command history**

```bash=
history # this will show commands that you recently ran in terminal
```

**Characters**

- " " (spaces) have meanings
- CAPITAL CASE vs lower case matters
- If special characters that have command meanings are part of the filename, use the escape character (`\`) to tell the shell they are part of the filename

```bash=
cd dungeons\&dna/
```

#### File system

##### home vs root

* Think of your file system as a tree. You can create directories within a directory, in effect making branches for that directory. But if you go upstream, where do the directories end? There must be a directory that contains ALL directories within your system. That directory is called `root`. You generally don't want to touch this directory directly, as it contains a lot of operational files.
* Where you will be most of the time is your `home` directory, denoted by `~`. `home` is where you should keep all your relevant files and folders -- even `Desktop` is probably within your `home`. Think about a system that is shared across multiple users: each user is going to have their own `home` directory, so that users don't get in the way of one another. Changing things in `root` can therefore affect other users' `home` directories as well, so the rule of thumb is to not touch anything beyond `home` if you don't know what you are doing.
* `root` is represented as `/` and `home` is represented as `~` in the shell.

##### Absolute vs. Relative paths

Absolute path

* You can type this address from any location within your computer, and it will take you to the same place
* Absolute paths start with a `/`
* Useful for scripting and for going straight to one location!
* Example: `/Users/raven/Desktop/dungeons&dna/subfolder/file.txt`

Relative path

* Relative to a specific location, which is your current directory by default
* No `/` at the start
* If you type a file or directory that is relative to the current directory, you can omit the current directory. E.g. if `dungeons&dna` is within your current directory, the shell will understand the name on its own.
* You can also use a relative path from a specific location, for example, from your `home` - that looks like this: `~/Desktop/<folder_name>`.
* Example: `subfolder/file.txt`

##### Make directory

```bash=
mkdir UPGG_Camping_Folder # This "makes directory" within your current directory (Desktop)
cd UPGG_Camping_Folder # This takes us into the directory we just made
cd .. # This takes us back into the parent directory (Desktop)
ls .. # This lists the files and directories within the parent directory
ls ~ # You can use shorthand for the `home` directory with `~`
ls -a # Lists all files including hidden files. Hidden files' filenames will start with ., e.g. .hidden_table.csv
```

##### Examining Files

```bash=
cat <file> # Prints the contents of the file to the terminal (stdout)
head -n 3 <file> # Prints the first 3 lines of the file to the terminal (stdout)
tail -n 10 <file> # Prints the last 10 lines of the file to the terminal (stdout)
less <file> # Opens the file in read-only view without printing it out
```

- in `less` viewer
  - `g` to go to the top of the file
  - `/` to go into search mode, e.g.
`/monster` to search for instances of the word "monster" in the file
  - `n` to go to the next instance in the search
  - `N` to go to the previous instance in the search
  - `RETURN` to exit search mode
  - `q` to quit the viewer

**FASTA/FASTQ format**

These are formats for sequence data. They have a specific layout that helps with consistency and optimization, and they are designed to be easy to read and efficient in space and memory. They are technically plain-text files, but in a specialized format. If a tool requires you to input a `fastq` file, you want to input a `fastq` file. The format is more than the suffix of the filename: just renaming a file to `filename.fastq` does **NOT** make it a `fastq` file. FASTA/FASTQ files are designed to give you as much information about sequencing reads as possible while preserving memory.

**FASTA**

First line after the `>` = Read name and information about the read

Second line = The actual DNA sequence in bases

**FASTQ**

First line after the `@` = Read name and information about the read

Second line = The actual DNA sequence in bases

Third line = A `+` separator, optionally followed by the read name again

Fourth line = PHRED quality scores; each character corresponds to one nucleotide and encodes its score value

**BAM/SAM format**

These are sequence alignment formats. Think of them as processed `fasta`/`fastq`. They include information like where these reads are on the genome (chromosome, location etc.)

`bam` - "binary alignment map" and `sam` - "sequence alignment map". The difference is that `bam` files aren't human-readable but are optimized for storage.

These formats include metadata such as read names, mapping quality, and optional tags. The CIGAR string describes how each read aligns to the reference (matches, insertions, deletions, clipping).

Interestingly, the record does not have a field for read length. Either count the length of the sequence field, or calculate it from the CIGAR. Tools like `samtools` can calculate read length.
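Counting the length of the sequence field can be done with standard shell tools. The sketch below is illustrative only: `example.sam` and its two reads are made up, and it relies on the SAM layout in which fields are tab-separated, header lines start with `@`, and the sequence is the 10th field.

```bash
#!/bin/bash
# Build a tiny, made-up SAM file: one header line plus two alignment records.
# Fields are tab-separated; the sequence (SEQ) is the 10th field.
printf '@SQ\tSN:chrI\tLN:230218\n' > example.sam
printf 'read1\t0\tchrI\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> example.sam
printf 'read2\t16\tchrI\t250\t60\t5M\t*\t0\t0\tGGCAT\tIIIII\n' >> example.sam

# Skip header lines (they start with @) and print each read name
# followed by the length of its sequence field.
awk -F'\t' '!/^@/ { print $1, length($10) }' example.sam
# prints:
# read1 8
# read2 5
```

For a `bam` file, the same `awk` command can read the text records produced by `samtools view`, since the binary file itself is not human-readable.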
**GFF/BED format**

"General feature format" and "browser extensible data"

`GFF` is a complex format for detailed genomic feature annotation, e.g., genes, exons etc.

`BED` is a simpler format for specific genomic regions and intervals, with chromosome id and start/end positions, etc.

*Note:* *You may also see the format `GTF` instead of GFF - this is similar but not exactly the same. GTF stands for Gene Transfer Format and is generally a stricter, more specialized version of GFF.*

Remember, you can always Google the most updated documentation for each of these file formats!

### Why Computational Pipelines are Important

It's always a good idea to plan out your experiments as much as possible! The quality of the data that comes from the lab experiments determines how your analysis will go. As genetics and genomics students, we need to use many different software and analysis tools to get from the experimental data to our pretty results.

Visualization helps! Sit down and draw out your pipeline if you need to; getting a bird's-eye view will help you plan your time and focus. Many experimental pipelines have similar steps. Make sure you understand which steps are already well-standardized and which steps may need to be fine-tuned to your dataset.

Making a broad experimental workflow will help you communicate how you analyze your data in publications and presentations. Fellow researchers need to understand how you analyzed your data in order to give helpful critiques.

Don't reinvent the wheel! We have amazing collaborators in the CBB program and beyond who spend all their time developing methods and software, so use the resources that are available! Reading the methods sections of relevant papers and doing some Googling will help you learn which tools are available and what you can use them for.

Choose packages wisely! If possible, use software that is currently supported by an active developer or community.
The more steps you have in your workflow, the more possibilities for errors. Try to keep file formats compatible and have checkpoints to review your work! We don't do research alone, so do your best to make sure your data is error-free and well-documented.

### Writing Scripts and Downloading Data

**Text Editors on the command line**

Text editors are extremely useful for writing and editing your code. Two common editors for the command line are Vim and Nano.

**Nano** - simple and more user-friendly, but less customizable

**Vim** - steep learning curve, but much more customizable

A good example of a text file is a README file. README files are designed for documentation of a project or folder. They are extremely useful for keeping track of what a project's goals were and what the files and scripts do.

Many README files are .txt files. Remember, the extension is important to tell the computer what type of file this should be!

```bash=
nano README.txt
```

Headers are a good thing to add to your README files. Include relevant data like the author, project title, date the file was created, etc.

To peek into our file without opening it again to edit it, we can use `cat` or `head`

```bash=
cat README.txt
```

You can check the first few lines of the file using `head`, and specify the number of lines using the `-n` flag

```bash=
head -n 1 README.txt
```

**Shell Scripts**

To combine command line commands into a file we can run all at once, let's create a shell script! Shell scripts have the .sh file extension. In this case, we will put the command `echo` in our script, which will print a string to the terminal output.

```bash=
nano vicious_mockery.sh

#!/bin/bash
# Vicious Mockery at 3rd Level
echo "I've seen goblins with better fashion sense than you."
sleep 2 # this command generates a 2-second pause
echo "Your brain must have rolled a natural 1 on Intelligence."
sleep 2
echo "Even a gelatinous cube has more charisma."
sleep 2
echo "Your aim is so bad, even Magic Missile would miss out of pity!"
```

The first line of this script contains the "shebang", which is `#!/bin/bash`. This line tells your computer to run the script with bash.

For operations in nano, such as exiting nano, follow the instructions at the bottom of the screen, e.g. control+O `(^O)` to save, control+X `(^X)` to exit

To run our script, we can type `bash` followed by our script name.

```bash=
bash vicious_mockery.sh
```

You can do many things with a shell script, including navigation instructions to access files in different folders.

*Note: Ideally, try to use absolute paths in shell scripts! You can use relative file paths, but things can get messy if you need to move the script or files into different folders*

There are lots of other important commands, like `chmod` to lock and unlock data, and `wget` or `curl` (on Macs) to access links and data online.

**Download Online Files**

To download a file from online, we need to go to the data source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81932

```bash=
curl -O https://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81932/suppl/GSE81932%5FDataset01.txt.gz
```

*Note:* *If you have an error using curl that says something like "unable to check revocation for the certificate", this means that your computer antivirus/internet provider does not trust the connection. If you are downloading from a trusted source, you can add the `-k` flag to curl in addition to the `-O` to allow insecure connections - use with caution!*

**File Permissions**

When you type `ls -l`, you get a list of files with their permissions. These permissions are shown below:

![image](https://hackmd.io/_uploads/rkA64K7i0.png)

The `r` permission is for opening and reading/viewing files, and is usually open to more people.

The `w` permission is for writing or changing the file. You especially don't want write permission turned on for raw data files.
You don't want to *accidentally* modify these files.

The `x` permission is for executing the file, mostly applicable to filetypes such as `.sh` files.

Let's add execute permissions for everyone to our newly created shell script using the `chmod` command and the plus sign to **add** permission to execute.

```bash=
chmod +x vicious_mockery.sh
```

When a script has execute permission, you can run it as an executable like this, without `bash`

```bash=
./vicious_mockery.sh
```

**Creating, Moving, Copying, and Deleting**

In addition to the interface you are used to using, we can move, delete, copy, and rename files and folders on the command line!

You can use the `touch` command to make a new file of any type.

```bash=
cd dungeons\&dna/bag_of_holding
touch decanter_of_endless_coffee.txt
```

You can also simultaneously create and open a file by using nano.

```bash=
nano lab_cloak.txt
```

If we don't need a file anymore, we can delete it with the `rm` or "remove" command. Let's delete our potion of healing file. You will notice that you cannot remove directories using just `rm`. This is because directories often contain things, so bash will try to keep you from deleting useful things. To delete a directory, we need to use the `-r` flag for "recursive", which will delete **everything** in the directory. BE CAREFUL! This can easily delete many files, and they will not come back!

```bash=
rm potion_of_greater_healing.txt # This is permanent!
```

You can use the `-i` flag to have the shell double check and ask whether you are sure you want to delete the file.

```bash=
rm -i pipette_of_arcane_wisdom.txt # Type "n" to cancel
```

Alternatively, you can use the `-f` flag to force delete the file or directory without asking.

```bash=
rm -f pipette_of_arcane_wisdom.txt
```

`mkdir` is "make directory", and is the command to create a new directory.

```bash=
mkdir mimic_chest
```

*Note: It's a good idea to avoid using spaces in file and directory names when working on the command line.
Most people replace spaces with underscores or dashes*

```bash=
rm mimic_chest # Note that this doesn't work on a directory
rm -r mimic_chest # You need `-r` for "recursive", which will delete everything in the folder!
```

`mv` is "move", and can move a file or directory to a new location. `mv` can also rename files and folders! Let's rename our new file to be more vicious:

```bash=
mv decanter_of_endless_coffee.txt decanter_of_endless_tea.txt
```

We can specify the new location after the file name to move the file to that location:

```bash=
mv decanter_of_endless_tea.txt mimic_chest
```

**Saving Scripts and Version Control**

As you work on your research, you will have many different versions of your scripts for different purposes or created by different people. Many labs have different ways to keep track of all these different versions, but whichever one you use, make sure everything is consistent!

Duke has a contract with GitLab, so many labs may use it. It is slightly different from GitHub, but with similar functionality.

**Data Organization**

Consistent file names are extremely helpful, because they allow us to use **wildcards**. In bash, you can replace parts of a file name with a star `*` to access any files with that name structure.

**Opening Zipped Files**

### Download Files for Day 2

- Refer to Max's slide deck
- Note on the yeast gene ontology download
  - https://www.yeastgenome.org/go/GO:0042254
  - Make sure to go to the bottom of the webpage to download the "Computational" table (236 entries), which is bigger than the "Manually Curated" table (25 entries)
- Note on the dryad download
  - https://datadryad.org/dataset/doi:10.5061/dryad.d644f
  - Only download this file: `scer-mrna-protein-absolute-estimate.txt`

## Day 1 - ☀️ Afternoon ☀️

### Introduction to R and Rmd

#### Instructors: Erick Figueroa and Kayla Wilhoit

### R Background

R grew out of a language called S, which was developed at Bell Labs, and it is now a free and open source programming language!
This means anyone can create and edit R tools, known as packages. R is so popular in part for its top-of-the-line data visualization capabilities.

**IDE**

An IDE is an Integrated Development Environment, which provides a more intuitive way to interact with code. The IDE for R is called **RStudio**.

RStudio has a lot of functionality, but you don't need to know how to use all of it yet.

The first window that will usually open is the R console window. Just like bash, you can write and submit commands here.

Like many programming languages, you can do simple arithmetic with common operators:

* `+` for addition
* `-` for subtraction
* `/` for division
* `*` for multiplication

```R=
10 + 5
10 - 5
10 / 2
10 * 5
```

In order to access previous commands you submitted, you can press the up arrow key on your keyboard and scroll back up through your command history.

The upper right panel is the **Environment/History** pane. As we write code and assign variables, those active values will appear here in the Environment tab for easy reference. Let's assign a value to a variable in the console.

You can assign a value to a variable using the `<-` operator. While you can also use an equals sign, which you are probably more familiar with from languages like Python, the recommended standard in R is to use the arrow.

```R=
x <- 10
print(x) # 10
```

Now that we have assigned a value to x, you can see that it appears in our Environment tab in the upper right. That way, if you ever forget what value x has, you can check the upper right.

If you want to assign another value to x, you can overwrite the variable with another value.

```R=
x <- 5*3
print(x) # 15
```

You can see that the value has now updated in the Environment panel.
It's important to note that this Environment panel is associated with your R window/session rather than a specific notebook or script, which means that if you have x assigned to different values in different scripts and you run them one right after another in the same R session, you might have issues.

If you want a completely clean slate, click the broom button on the Environment tab to clean your working Environment. Very satisfying! To clear the console instead (i.e. for visual clutter), control + L is the keyboard shortcut that will remove anything entered into your console. There is a less visible broom on the upper right of the console that has the same function.

The other tab that you might use in the upper right is the **History** tab. You can see here a history of all the commands we just ran in the console, so you don't have to scroll back up in the console or back up in the script. Even if you close and reopen the session a week later, the History tab will still keep a log of what you did.

You probably won't need to use any of the other tabs on the upper right, but I will note that the Tutorial tab will let you load interactive R tutorials if you install the corresponding packages.

Moving on to the lower right panel, this one has several important tabs, so we will go over them one by one.

The **Files** tab is the first, and is probably open to whatever the home directory on your computer is set to. We probably don't want to run code in our home directory, so let's tell R where we want to be by setting the **Working Directory**.

To find a specific folder on your device, click the three dots on the upper right of the panel to open up a file explorer. Let's navigate to the Bootcamp folder you made earlier and open it. You should be able to see the list of files and any other directories within the bootcamp folder. Now we can click the gear icon and select the option **"Set as Working Directory"**.
This will automatically run a line in the console that tells R where to look for any files and where to write any outputs from now on. If you have learned R before, you know that forgetting to set your working directory to the right place is one of the most common early errors when using R!

You could also do this manually in your console, although clicking "Set as Working Directory" will also print this command to your console.

```R=
setwd("/path/to/working/directory")
```

With our working directory set, let's move on to the other tabs.

The **Plots** tab is empty for now, but this is where any plots we make will appear.

The **Packages** tab is where you can find and install all of the user-created packages for R, and we will come back to this in just a second when we install some packages.

The next tab is one of the most important and useful, the **Help** tab. This tab provides documentation and help for base R functions as well as any packages you download. You can type in and search for almost any function, like the print function, and pull up a detailed description with parameters and examples. You can pop it out in a new window if you want by clicking the icon in the tab header.

You can click the home button to go back to the Help home page. I want to draw your attention to the link to "**Posit Cheatsheets**". These are official cheatsheet guides made by the team behind RStudio, and they are really helpful condensed guides to the most common tools and packages in R. I would suggest you open up the RStudio IDE and the RMarkdown cheatsheets for today, but we have a list of the suggested cheatsheets on the resources website that I personally find helpful to print out and hang in my cubicle.

Additional resource: https://bio723-class.github.io/Bio723-book/getting-started-with-r.html

If you want to know how to do something, Google probably has the answer! There are many base R functionalities as well as user-created packages to perform the functions you need.
Finally, before we get started with notebooks and coding, you can change the appearance of RStudio by clicking *Tools -> Global Options -> Appearance* in the top ribbon. You can customize a lot of things in the editor here, including the color scheme. My personal favorite is Tomorrow Night 80's, but I will be sticking with the default white theme for projector purposes today.

### R Markdown

Writing basic commands and doing arithmetic is fine, but it can be useful to create a shareable file and save our written code. You can do this in an R Script, but I would encourage you to write your code in Notebook format for reproducibility.

RStudio has integration with RMarkdown Notebooks, which reproducibly combine well-annotated code and pipelines, **knit** into a shareable file. It can even integrate with Zotero for citation management. You may often share or present documentation of your code as a Markdown document.

Go to **File** on the upper left, then click **New File**. The fifth option down is **R Markdown**.

This will bring up a window where you can set options for your document, including the title, author, date, and output options. Today we will be using the default HTML output.

Submitting this window should open a new tab in your RStudio environment. This is different from the console (now below the new window) because the new notebook tab is a more permanent way to type and edit your code.

The first thing you will usually want to do is to **Save** the file. Click the save icon on the upper ribbon and select the directory that you want to save the file to.

Once you have saved your file in a directory, the lower right window may fill with the current **working directory**, where all the files in the directory should appear.

The window should be populated with a default template. Most of this can be deleted and replaced with your own content.
The first few lines are called the **YAML** header, and contain information about the document itself that should not be deleted!

The second block of grey highlighted text is a **code chunk**. This one is a setup chunk of R code, as signified by the `{r setup}` in the chunk header, and will help the document knit correctly. This chunk won't appear in your final output since the option `include` is set to `FALSE`.

The next bit of text that is not highlighted in grey is Markdown formatted text. This is not code and doesn't run, but it allows for more detailed formatting and descriptions of the code in the chunks below.

You should be able to see that we are in the "**Source**" editing mode. You can click the "**Visual**" button up here to see a different version of the document that is still editable but has a preview of what the markdown formatted text would actually look like. You can write code entirely in visual format if you want, but it's easier to edit the markdown details in Source mode.

For example, `## R Markdown` will translate to a bolded header in visual mode. Everything after the `## R Markdown` header is their example code that can be deleted.

*Notes:*

* To make a newline, end the line with either two spaces or a backslash
* When inserting images, make sure the images are downloaded to your computer or are linked to an image source online
* Remember absolute vs. relative paths when linking to an image elsewhere on your machine!
* You can place the image in the same directory as the RMarkdown file and it can be inserted into the file
* When making bulleted lists, make sure to include a space after each dash and have each entry on a new line

### Installing Packages

Packages are essential for using R to the fullest extent. There are several ways to install packages, but one of the easiest is to use the RStudio framework.

Go to the **Packages** tab on the lower right window and click the grey **Install** button on the upper right of that window.
This should bring up a new window where you can search for packages by name and install them. Let's search for and install the **tidyverse** package. This may take a while, since tidyverse includes many different smaller packages! If a red stop sign is visible in the top right of the console, a task (like installing a package) is still running.

**Loading and naming chunks**

Once we have installed tidyverse, we need to load it to be able to use its functions in the current notebook. Let's make a new chunk where we load the package. To be descriptive, let's first use markdown to make a header by typing multiple hashtags (`## Load Tidyverse`).

To quickly insert a new R code chunk, you can press `Ctrl + Alt + I` on Windows or `Cmd + Option + I` on Mac.

You can edit the header `{r}` to name the chunk and add flags. This can be useful when loading in packages, since many of them can send a warning message upon loading. We will name this chunk "load_packages" and load our packages here. By specifying `include = TRUE`, we are telling R to include this chunk in the final knitted output. A chunk without this flag will still be included by default, but it can be helpful to specify. You'll notice that you can use tab completion in this section, as R will suggest the available parameters for you to use.

Naming code chunks is optional, but is important for knowing which chunk is responsible when the document does not knit! The name of the code chunk will appear in any error message associated with that chunk, making it easier to diagnose the problem.

```R=
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#``` # your chunk won't have this leading #. This is just to get around HackMD formatting, since this is also a markdown document

```{r packages, include = TRUE}
library(tidyverse)
#```
```

Once we have typed our code, there are multiple ways to execute it.
A benefit of using code chunks in notebooks is that you can easily run just a few bits of code at a time instead of having to run the entire file. Click on the green triangle on the upper right of the code chunk to run just the packages chunk for now.

You should see the output printed both in the console down at the bottom and below the individual cell. You should see something like this, which is not an error; it's just a warning letting you know that tidyverse is overwriting some other functions in base R. This is fine!

Let's add another code chunk where we run the classic beginner coding exercise of printing "Hello World".

```R=
{r hello_world}
print("Hello World")
```

We can run this chunk individually again and you can see it print the output "Hello World" in both the console and below the chunk as before.

Use the "Run" button (with green arrow) at the top right of the notebook header to run specific sets of chunks (current chunk, previous chunk, rerun most recent chunk, etc.). There is a second, downward gray arrow pointing to a green line next to the green triangle in the right of each code chunk as well, which allows you to run everything up to your current chunk (but not including the current chunk).

**Inserting Citations**

One of the strengths of notebooks as opposed to plain R scripts is the ability to include notes and citations. RMarkdown has multiple options for inserting citations, so let's add a reference to the yeast mRNA-seq dataset that you downloaded this morning.

In order to insert references, we need to edit the YAML header at the top of our document. Under the information that is already there, such as `title`, add two more lines: `bibliography: {your filename here}` and `link-citations: TRUE`.

```R=
---
title: "Markdown Testing"
author: "Kayla Wilhoit"
output: html_document
bibliography: references.bib
link-citations: TRUE
---
```

This will give us linked citations within the body of the output, and generate a bibliography at the end of our document.
The citation entries are stored in a file called references.bib in your working directory. To get the references, we need a BibTeX-formatted citation in a file that we can insert. You can obtain a BibTeX file from most citation manager software.

**For Zotero Users**

Zotero is integrated with RStudio! ~~(because Zotero is the best citation manager)~~

To insert a citation directly from your Zotero library, you will need to go into the **Visual Editor**. On the upper left of your notebook window, you will notice two buttons saying "Source" and "Visual". You will likely have been working in "Source", so click the "Visual" button. You may need to accept a dialog window, but then a working preview of your document will open.

With the cursor where you want to insert your citation, click the **Insert** option on the same ribbon where the Source/Visual options were. Click on the "@ Citation" option, and a new window should open. Click on "Zotero" in the side menu, and you can search and select items from your Zotero library! Once you have selected all the items you want, click insert and the citations will be inserted automatically. You can hover over the short link to see the full citation, and you can edit the options by experimenting with the import functions.

If you don't have Zotero, you can also insert a citation using DOI lookup. If you go to the course website, there will be a link to the paper for the dataset you downloaded this morning, which we will be using tomorrow. Follow the link and copy the DOI, then go back to "Insert -> @ Citation" and select the "**From DOI**" option in the side menu. You can then paste in your DOI and the correct paper should pop up, which you can insert by clicking the plus sign.

**Reproducibility**

It is a good idea to share the details of your current setup at the end of your document. To do this, use the `sessionInfo()` function.
```R=
sessionInfo()
```

This will automatically print information such as the version of R used and the versions of the packages loaded at the time of document creation.

To see how it will look in the output, click the **Knit** option on the upper ribbon. This will show the "knitted" or "rendered" final output. You can see that it shows our title, our headers and code blocks, and prints the output of all our code chunks. You can see that we have our two duplicate citations here in the text, and if we click them we are taken to the full citation at the bottom of the page. Since we have our output set as an HTML document, if we go to our Files tab you can see that we have both the original .Rmd file and the knitted output .html file.

*Note: In order to change the location of the reference list, you can add the following line to 'force' the list to a specific location:*

```R=
<div id='refs'></div>
```

One last thing before our break: let's make our knitted output look a little nicer by removing the big warning message that appears every time we load tidyverse. We can add another flag, `message = FALSE`, to our packages chunk and it will hide the long output from this chunk specifically.

```R=
{r packages, include = TRUE, message = FALSE}
library(tidyverse)
```

Now you can hit Knit again and see a nice clean code chunk in the output!

RStudio also supports other document formats, such as Quarto, that you can work with instead if you prefer. In addition, in RStudio you can also write a plain R script or even write in Python, Julia, Shell, and more!

------------------------------------------

### BREAK

-----------------------------------------

### Data structure

### Refresher on Data Types

We will do a quick review of data types in R. These are basically categories of values that computers can read and interpret in different ways. If you know other programming languages, these categories tend to overlap, though they might have different names.

#### 1. Numeric

Basically, the numbers that we know very well. You can perform mathematical operations on them. The types of numeric values you are going to encounter the most are:

1a. `Double` - represents real numbers (whole numbers and decimals). It is the default type for any number you type, including whole numbers.

```R=
typeof(10)
```

1b. `Integer` - represents whole numbers only. Whole numbers default to `double`, so you have to tell R explicitly that a value is an `integer`.

You are unlikely to come across a case where `double` versus `integer` makes a significant difference to your code or performance, but it's good to be aware of the different types of numeric values. Another key point is that arithmetic operations will automatically coerce `integer` to `double` if need be.

```R=
x <- as.integer(10)
typeof(x) # Will be integer
typeof(x + 10.0) # Will now be coerced to double
```

### Basic Operations

```R=
x <- 25 # Example values so the lines below can be run
y <- 8
x + y
x * y
x / y # Division
x ** y # Exponent
x^y # Also Exponent
x %% y # Modulo
```

Modulo may be a new operator to some. It refers to the remainder after dividing x by y. For example, if x = 25 and y = 8, then 8 goes into 25 evenly 3 times (for a total of 24) with a remainder of 1 (25 - 24), so 25 modulo 8 is 1.

*Note:* *While seeming niche, modulo can be very useful for logic when dealing with genomic filetypes - for example to check if a line or field in a fastq file has the expected number of entries*

**Valid variable names**

There are some names you ***can't*** use as variable names, and there are names that you ***shouldn't*** use.

1. Start the name with letters

```R=
a.1 <- 5 # Will work
1.a <- 5 # Will not work
_a <- 5 # Will not work
```

2. Don't use weird symbols, use `.` or `_`.
Examples of style guides:

- [Google's](https://google.github.io/styleguide/Rguide.xml)
- [Jean Fan's](http://jef.works/R-style-guide/)
- [Tidyverse's](http://style.tidyverse.org/)

```R=
myHeightInJanuary <- 7 # camelCase = first word lowercase, later words capitalized
my_head_size <- 10 # snake_case = underscores between words, all lowercase
```

3. **DO NOT** use [reserved words](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html)
4. Avoid naming variables the same as function names

#### 2. Logical

Sometimes these values are called `boolean` in other languages. They are common enough in programming that you should know about them! The only possible logical values are `TRUE` and `FALSE` -- case sensitive!!!

```{r}
!TRUE # Not TRUE
TRUE & FALSE # TRUE and FALSE
TRUE | FALSE # TRUE or FALSE
TRUE | TRUE
xor(TRUE, FALSE) # either x or y, but not both
xor(TRUE, TRUE)
```

```R=
x > y
x < y
x >= y
x <= y
x == y # Need double equal sign to check equality
x != y # Negation - is x NOT equal to y
!(x > y) # Negation again - is x NOT greater than y
```

> **Problem with numeric data in logical equivalence**
>
> Without going into the weeds, an irrational number like `sqrt(10)` has a decimal expansion that never ends. So, it is impossible for R to represent all the digits, as it would take an infinite amount of storage. R has a limited degree of `precision` with which it can represent an irrational value, which makes the term `sqrt(10)^2` slightly off from 10.

```R=
10 == sqrt(10)^2 # Mathematically this is 10, but the comparison returns FALSE
```

To circumvent this limitation, you can test for "near equality" in R using the function `all.equal()`.

```R=
all.equal(sqrt(10)^2, 10) # Returns TRUE; tolerates extremely tiny discrepancies in numbers (the default tolerance is close to 1.5e-8, and you can change it with the tolerance parameter)
```

**Logical operator**

#### 3. Character strings

Characters are created by enclosing them in `"` (double quotes) or `'` (single quotes).
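As a small sketch of the quoting rule just described (the variable names are only illustrative and not part of the original notes), both quote styles produce the same kind of value, and using one style on the outside lets the other appear inside the string:

```r
# Illustrative variable names; not from the course materials
double_quoted <- "Hello World"
single_quoted <- 'Hello World'

# Both quoting styles create the same character value
identical(double_quoted, single_quoted)

# Use one quote style outside so the other can appear inside the string
with_apostrophe <- "It's a string"
with_quotes <- 'She said "hi"'

typeof(with_apostrophe) # the type is "character" either way
```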
`nchar()` gives you the number of characters in the variable.

Simple string operations: joining strings using `paste()` and `paste0()`, and splitting strings using `strsplit()`.

* `paste()` automatically includes a **separator** (default being a space)
* `paste0()` doesn't include a separator

```R=
paste("Hello", "World") # returns "Hello World"
paste("Hello", "World", sep = "_") # returns "Hello_World"
paste("Hello", "World", sep = "") # returns "HelloWorld"
paste0("Hello", "World") # returns "HelloWorld"
strsplit("Hello World", split = " ") # returns a list containing "Hello" "World"
```

You cannot directly combine two different data types in one operation. However, functions such as `paste()` get around this by "coercing" the different values into a compatible data type (here, character).

```R=
1 + "hello" #returns an error!!!
mixed_var <- paste(1, "hello")
mixed_var #prints "1 hello"
```

**Order of evaluation**

Be careful if you change the type or value of a variable. Don't short-circuit your code. Options: 1. save the result as a new variable, or 2. always run code sequentially.

### Vectors

A "vector" is an ordered collection of values that, importantly, are **of the same type**. This is the most common data structure in R. To select a range of positions from vectors and many other structures, you can use a colon `:`

**Important!!** R is a 1-based indexing language! This is different from languages such as Python, which are 0-based. This means that to access the first item in a vector in R, you want to use the number 1.

```{r}
v1 <- c() # Empty vector
v2 <- c(1) # Vector of length 1 and of type numeric
v3 <- c("Hello", "World") # Vector of length 2 and of type character
v4 <- c(1, "hi", TRUE, 1+2) # R coerces all elements into one type that can represent all of them -- in this case, character, designated by " marks.
print(v4)
```

Vectors have some specific properties that you can look up handily with built-in functions.
```{r}
# Prints the type of the elements contained within the vector
typeof(v4)
# Prints the length of the vector
length(v4)
```

Concatenate vectors

```{r}
v5 <- c(v3, v4)
print(v5)
```

Vector arithmetic

```{r}
x <- c(1, 2, 3, 4)
x * 2
```

Vector recycling (can skip this)

```{r}
y <- c(5, 6, 7)
x + y # y is recycled to match x; R warns because 4 is not a multiple of 3
```

Common statistical functions

```{r}
sum(x)
min(x)
max(x)
mean(x)
median(x)
sd(x) #standard deviation
```

```{r}
summary(x)
```

**Vector Indexing**

R uses a 1-based indexing system, meaning the first element is at index 1. (Python, for example, uses zero-based indexing.)

```{r}
x
x[1] # obtain the first element
x[-1] # obtain all but first element
x[-2] # all but second element
x[3:4] # 3rd-4th elements
x[4:1] # reverse order
x[x>3] # extract only elements that meet a logical condition, i.e. elements of x greater than 3
rev(x)
x[1:(length(x) - 1)] # exclude last element, flexible for any vector (note the parentheses: `:` binds tighter than `-`)
x[c(TRUE,FALSE,TRUE,TRUE)] # exclude second element
```

Use indexing to manipulate the content of the vector.

```{r}
y <- c(5, 6, 7)
y
y[2] <- 1999
y
```

Get indexes using `which()`

```{r}
x >= 2
which(x >= 2)
```

##### Exercise

Use what you learned about vectors to return a vector containing the elements of `a` that are **greater** than 30.

```{r}
a <- runif(20, 1, 100)
# extract only the elements greater than 30
a[a>30]
# extract the indices of entries with values greater than 30
which(a>30)
```

### List

A list is a more flexible data type than a vector - it can contain a mixture of elements of various data types. We'll breeze through this really quickly.
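Before the walkthrough, here is a minimal sketch of the difference just described, using made-up values: the same three items are coerced to a single type inside a vector but keep their own types inside a list. The variable names are illustrative only.

```r
# Same three values stored two ways (illustrative names)
as_vector <- c("Hello", 1, TRUE)   # vector: everything is coerced to one type
as_list <- list("Hello", 1, TRUE)  # list: each element keeps its own type

typeof(as_vector)        # one type for the whole vector
sapply(as_list, typeof)  # one type per list element
```

Running `typeof()` over each list element with `sapply()` is just one convenient way to inspect the contents; `str(as_list)` gives a similar overview.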
Creating a list

```{r}
my_list <- list("Hello", 1, TRUE) # Mixed types of element
print(my_list)
```

Size of list

```{r}
length(my_list)
```

Indexing a list using `[[]]`

```{r}
my_list[[1]]
```

Appending to a list; a list can even contain a vector

```{r}
my_list[[5]] <- 1:10
my_list
```

Indexing

```{r}
my_list[[5]][7]
```

A list within a list

```{r}
my_list[[4]] <- list("World")
my_list
```

```{r}
my_list[[4]][[1]]
```

Combining lists

```{r}
c(list(1, 2), list(x=3, y=4))
```

### Data Frame

This is arguably the most useful feature of R in data analyses. There's a reason we use spreadsheets: they portray multi-dimensional data and the relationships between data points across columns -- i.e., putting variables from the same observation next to each other.

Think of a data frame as a stack of vectors (each one a column). To be able to make sense of the data, we want to add column names to designate what the values mean.

Creating a data frame

```{r}
df <- data.frame(1:5, 6:10)
df
```

```{r}
names(df)
names(df) <- c("first.column", "second.column")
names(df) # Note how code is evaluated line-by-line sequentially?
```

Or you can add names when you create the data frame.

```{r}
df2 <- data.frame(height = 100:105, weight = 120:125)
df2
```

You can think of a data frame as a specialized type of `list` (with constraints), where each vector is an element of the list. BUT every column **must** have the same number of rows.

```{r, error=TRUE}
df3 <- data.frame(height = 100:105, weight = 120:130) # Error: the columns have different lengths (6 vs 11)
head(df2, 2) # displays first 2 rows
tail(df2, 3) # displays last 3 rows
```

A data frame is a **specialized** list! It has many similarities with lists.

```{r}
typeof(df)
class(df)
```

Hence, you can use list-related functions.
```{r}
length(df)
df[[2]]
```

Properties of a data frame

```{r}
dim(df) #dimension (# of rows, # of columns)
ncol(df) #number of columns
nrow(df) #number of rows
```

Indexing, extracting information from a data frame

```{r}
df[[1]] # Same as a list: double brackets return the column's contents as a vector
df[1] # Single brackets return the same contents as a one-column data frame
```

You can use the `$` operator to access columns by name in the data frame - another reason not to use dollar signs in variable names!

```{r}
df$first.column # Most common way to do it
```

R data frames are indexed row first, meaning the first index refers to the row number and the second to the column number

```{r}
df
```

```{r}
df[1,2] # row 1, column 2
```

Or, you can call by column name, and index like a normal vector.

```{r}
df$second.column[1]
```

Adding columns

```{r}
df$third <- 21:25
fourth <- runif(5, 10, 100)
df$fourth <- fourth
df
```

R has some built-in data frames that you can play with.

```{r}
mtcars
```

```{r}
iris
```

### Tibble

A `tibble` is Tidyverse's attempt to improve on the `data frame`. In general, they are about equivalent in terms of function. For the sake of this lesson, we'll use `tibble` because any of Tidyverse's `readr` functions will output a `tibble` by default. Don't worry about the differences - they are, for the most part, interchangeable.

You can convert a preexisting data frame into a tibble. Just make sure you have the tidyverse library installed (a one-time thing) and then loaded (for any new R session) first!

```{r}
library(tidyverse)
my_tib <- as_tibble(mtcars) # Converting a data frame to tibble
my_tib # looks very similar to data frame
```

Another way to create a tibble. Key components are column names and values. Each column can only contain one data type.
```{r}
my_tib2 <- tibble(name = c("Josh", "Peter"), income = c(100000000, 20))
my_tib2
```

Conventionally, for data frames and tibbles, each row contains an observation, each column contains a variable, and each cell contains a value.

```{r}
my_tib
```

Subsetting, same as data frame

```{r}
tb <- tibble(
  x = runif(5),
  y = rnorm(5))
# Extract by name 1
tb$x
# Extract by name 2
tb[["x"]]
# Extract by column index 1
tb[[1]]
```

```{r}
# If have time, example of differences between data frame vs tibble
mtcars[,1] # data frame: returns a plain vector
as_tibble(mtcars)[,1] # tibble: returns a one-column tibble
```

```{r}
class(tb)
```

### Tidyverse

What is [Tidyverse](https://www.tidyverse.org/)? Tidyverse is a collection of R packages that have been designed and developed for data science. R is not a new language, and Tidyverse's philosophy is about reducing redundancy and improving coding style to be cleaner and more intuitive.

First, even though we have all installed the Tidyverse package, it has not yet been loaded into your environment. You won't have access to Tidyverse's functions unless you run `library(tidyverse)`.

Tidyverse is a literal gold mine, and there's no way we're going to cover everything in just a few hours! We're going to go through a few functions that are particularly useful and most commonly used.

### Github Copilot

Another IDE is VSCode, which is very popular for many languages (not just R). VSCode offers many plug-ins that you can install, spanning everything from easy, secure remote connection to a computing cluster to rainbow columns for your CSV files. You can also install GitHub Copilot, an in-line AI coding agent, but by default its functionality is limited.

Another perk of being a student is that you can apply for [GitHub Education](https://github.com/education). You will need to go to DukeHub for proof of enrollment (see the enrollment verification tab > include by program and plan > include my earned degrees >submit ) and download.
Ensure that the verification shows the date, your name, program, institution/letterhead, and your enrollment dates through graduation. You have to convert the file from PDF to jpeg/png before uploading it to the application. GitHub Education will (hopefully!) approve you and provide you with access to a less limited version of GitHub Copilot, which is part of their Student Developer Pack. The pack includes many learning modules, additional tools, and even listings for jobs. The GitHub Copilot subscription you get with your Education Plan is also a huge perk!

As we will review on Day 3, be MINDFUL when using in-line AI, and take care with sensitive data and privacy policies.

### Introduction to Tomorrow's Data Set

A common situation in genetics research: you get a series of genes from a screen, GWAS, etc., but what do they do? How do they respond to different conditions?

In our example, assume we found a group of relatively conserved genes involved in ribosomal biogenesis. Is their expression consistent, or does it change over the cell cycle?

Cells need to make more lipids and proteins to grow larger, but they also need to time their growth relative to the cell cycle stage. So genes involved in the growth response to the cell cycle are likely to be periodically expressed. If genes are involved in the cell cycle, you would expect mRNA levels for the genes of interest to fluctuate over time.

[This paper](https://pubmed.ncbi.nlm.nih.gov/28057705/) performed RNA-sequencing at different synchronized yeast growth phases.

GSE81932_Dataset01 is a data frame for all 6,000 genes across the time points. The paper identified several hundred as having some sort of periodicity (expression fluctuated in relation to yeast growth cycles), many of which were also associated with other studies.

GSE81932_Dataset02 is the smaller set of genes that overlap with these other studies and are deemed most significant. This is a vector (just the names) of the 144 genes identified as overlapping.
We have also downloaded a data frame of annotations/common names for S. cerevisiae genes and mRNA (scer-mrna-protein-absolute-estimate.txt) and a second data frame of just the annotations for our genes of interest involved in ribosome biogenesis (ribsome_ biogeneis_annotations.txt).

Tomorrow we will plot our genes of interest in different formats and make an expression heatmap for all of the genes to look at periodicity!
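As a preview of how today's data structures fit together, the sketch below shows one way to keep only the rows of a data frame whose gene name appears in a vector of names. Everything in it is a made-up stand-in: the gene names, column names, and values are hypothetical and do not reflect the actual layout of the GSE81932 files.

```r
# Hypothetical toy data; NOT the real GSE81932 layout or values
expression <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  t0 = c(1.2, 5.0, 3.3, 0.4),
  t1 = c(2.4, 4.8, 1.1, 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical vector of gene names we care about
genes_of_interest <- c("GENE_B", "GENE_D")

# %in% returns TRUE/FALSE for each row; use it to index rows, keeping all columns
subset_expr <- expression[expression$gene %in% genes_of_interest, ]
subset_expr
```

This uses only base R indexing (`df[rows, columns]`) with a logical vector, the same idea as `x[x > 3]` for vectors.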
