---
tags: course notes
---
# UPGG Informatics Orientation Bootcamp 2023
:::warning
The [notes for Wednesday and Thursday](https://hackmd.io/@8dEsA7nbQsexUv4D0NazQA/SkwLQNBan/edit) have been moved to a separate document.
:::
## Friday Morning
### Version Control with Git
#### Instructor: Raven
#### Summary and Setup
Today we'll follow the swcarpentry notes: https://swcarpentry.github.io/git-novice/
Create and navigate to a new working directory
````
mkdir bootcamp
cd bootcamp
````
Version control can keep track of both what you ADD and what you SUBTRACT from the document. However, a conflict can arise if two users are making changes to the same section of the document.
It works like an unlimited "undo".
#### Setup Git
You only need to configure `git` one time at the start!
````bash
git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"
````
Your computer encodes the `enter` key press as a character (a "newline"), and different operating systems encode it differently. You can change the way Git recognizes and encodes line endings.
On macOS and Linux:
```bash
git config --global core.autocrlf input
```
And on Windows:
```bash
git config --global core.autocrlf false
```
Set up a default text editor (optional):
```bash
git config --global core.editor "nano -w"
```
Configure the default branch name to be `main`
```bash
git config --global init.defaultBranch main
```
You can view all the configurations you have for `git`.
```bash
git config --list
```
##### Creating a Repository
Make sure you are in your working directory for this lecture.
Create a new directory and initialize a new repository.
```bash
mkdir planets
cd planets
git init
```
Print your git version.
```bash
git version
```
Create `main` branch and make sure you are on it.
```bash
git checkout -b main
```
You can get a report on the repository.
```bash
git status
```
It's recommended **NOT** to initialize a new Git repository inside an existing one, e.g., making a directory called `moons` inside `planets` and calling `git init` inside it. If you accidentally create a nested repository, just remove its `.git` directory.
```bash
rm -rf moons/.git
```
Version control keeps track of files only (not directories).
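A quick way to see this for yourself, sketched in a throwaway repository. The `demo` repository, the temporary directory, and the `phobos.txt` file are assumptions made for illustration and are not part of the lesson.

```bash
# Work in a throwaway repository so nothing in planets/ is touched.
cd "$(mktemp -d)"
git init -q demo
cd demo

mkdir moons                                # an empty directory
empty_status=$(git status --porcelain)     # Git reports nothing: there are no files to track

touch moons/phobos.txt                     # now the directory has content
file_status=$(git status --porcelain)      # the directory shows up as untracked

echo "empty directory: '${empty_status}'"
echo "with a file:     '${file_status}'"
```

The empty directory is invisible to `git status`; it only appears once it contains a file.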
##### Tracking Changes
Make a new file and type some notes.
```bash
nano mars.txt
```
We can see that git detected the new file, but it's not being tracked yet:
```bash
git status
```
To start tracking changes:
```bash
git add mars.txt
```
To **save** the changes, you can run the command `commit` (basically **save** in `git` lexicon).
```bash
git commit -m "Start notes on Mars as a base"
```
The commit records the changes you staged since the last `commit`. The message after the `-m` flag is a note to your future self or collaborators describing what changed in this commit. It should be relatively brief.
Take a look at the commit history.
```bash
git log
```
If you make further changes, you can run `git diff` to show the differences between the current state of the file and the most recently saved version.
```bash
git diff
```
Git add your new changes before committing
```bash
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
```
By default, `git diff` compares your working directory with the staging area.
If you run `git diff` after staging your new changes, it won't find a difference.
```bash
nano mars.txt
git add mars.txt
git diff
```
However, you can compare the committed version and the staged version:
```bash
git diff --staged
```
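The difference between the two comparisons can be checked in a throwaway repository. This is a minimal sketch: the `demo` repository and the file contents are assumptions for illustration, and the identity is set locally so the global configuration is left alone.

```bash
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Vlad Dracula"          # local identity for this demo repository only
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

echo "The two moons may be a problem for Wolfman" >> mars.txt

unstaged_before=$(git diff --name-only)        # working directory vs staging area
git add mars.txt
unstaged_after=$(git diff --name-only)         # nothing is left unstaged
staged_after=$(git diff --staged --name-only)  # staging area vs last commit

echo "before add, git diff:          ${unstaged_before}"
echo "after add,  git diff:          ${unstaged_after:-<empty>}"
echo "after add,  git diff --staged: ${staged_after}"
```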
Change visualization from the line level (default) to word level changes
```bash
git diff --color-words
```
If `git log` is very long:
```bash
git log -1 ## Limits to just last commit, change 1 to any number of logs you want
git log --oneline ## reduces information to just one line
```
You can git add a directory even if directories are not tracked (it tracks the contents).
```bash
mkdir spaceships
touch spaceships/apollo-11 spaceships/sputnik-1
git status
git add spaceships
git status
```
Notice the difference in git status before and after adding spaceships.
Commit the changes:
```bash
git commit -m "Add some initial thoughts on spaceships"
```
It's best practice to be descriptive with your commit messages.
###### Exercises:
###### Which command(s) below would save the changes of myfile.txt to my local Git repository?
Answer number 3 is correct:
```bash
git add myfile.txt
git commit -m "my recent changes"
```
The staging area can hold changes from any number of files that you want to commit as a single snapshot.
###### Adding multiple files
- Add some text to mars.txt noting your decision to consider Venus as a base
```bash
nano mars.txt
```
- Create a new file venus.txt with your initial thoughts about Venus as a base for you and your friends
```bash
nano venus.txt
```
- Add changes from both files to the staging area, and commit those changes.
```bash
git add mars.txt venus.txt
git commit -m "Started considering Venus as a base"
```
###### `bio` Repository
- Create a new Git repository on your computer called bio.
- Write a three-line biography for yourself in a file called me.txt, commit your changes
- Modify one line, add a fourth line
- Display the differences between its updated state and its original state.
```bash
cd .. # .git already exists here (planets)
mkdir bio
cd bio
git init # initializes git
nano me.txt # write biography in the file
git add me.txt
git commit -m "Add biography file"
nano me.txt # adds the fourth line
git diff me.txt # shows the differences
```
##### Exploring History
Make some more changes in `mars.txt`
```bash
nano mars.txt
cat mars.txt
```
See what changed. This will compare to the most recent commit.
```bash
git diff HEAD mars.txt
```
Use `HEAD~1` to step back one commit from the most recent one (i.e., the second-most-recent commit). `HEAD` refers to the most recent commit, and earlier commits can be referred to relative to `HEAD`.
```bash
git diff HEAD~1 mars.txt
```
`git show` displays the commit message together with the changes made in that commit, rather than the differences between a commit and our working directory that `git diff` shows.
```bash
git show HEAD~2 mars.txt
```
`HEAD` is good for looking for a recent commit because it is relative to the most recent one, but if you want to point at a commit with an absolute identifier, you can obtain the ID from `git log`. Each commit gets its own unique one.
```bash
git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt
```
You don't have to use the full 40-character string. As long as the first few characters you type are unique among the commit IDs, Git will know which one you are referring to (similar logic to tab completion).
```bash
git diff f22b25e mars.txt
```
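Relative references and abbreviated IDs can be tried out safely in a scratch repository. The sketch below assumes a throwaway `demo` repository with three made-up commits; none of these names come from the lesson.

```bash
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Vlad Dracula"          # local identity for this demo repository only
git config user.email "vlad@tran.sylvan.ia"

# Make three commits so there is some history to walk through.
for note in first second third; do
    echo "$note" >> mars.txt
    git add mars.txt
    git commit -q -m "Add $note note"
done

git log --oneline                             # newest commit is listed first

# HEAD is the newest commit; HEAD~1 and HEAD~2 step back from it.
head_msg=$(git log -1 --format=%s HEAD)
prev_msg=$(git log -1 --format=%s HEAD~1)
oldest_msg=$(git log -1 --format=%s HEAD~2)
echo "HEAD: $head_msg | HEAD~1: $prev_msg | HEAD~2: $oldest_msg"

# A short prefix resolves to the same commit as the full 40-character ID.
full_id=$(git rev-parse HEAD~1)
short_id=$(git rev-parse --short HEAD~1)
resolved=$(git rev-parse "$short_id")
echo "short: $short_id -> full: $resolved"
```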
Using `git checkout` checks out (i.e., restores) an old version of a file. In this case, we are telling Git we want to recover the version of the file recorded in `HEAD`.
```bash
git checkout HEAD mars.txt
```
##### Don't Lose Your Head
If you want to restore a version of a file but forget to add the filename after `git checkout <commit ID>`, you will enter a state called `detached HEAD`. You are now in an area for experimenting with the `<commit ID>` version of your repository. You can change files and even commit those changes. HOWEVER, once you "reattach" `HEAD`, the changes and commits you made in the `detached HEAD` state won't come with you.
There is a way to retain the commits you made in the `detached HEAD` state: make a new `branch` to hold them, creating an `alternate reality` version of your repository. Once you've done that, you can "reattach" `HEAD` to return to the original repository on the `main` branch. You then have 2 versions of your repository on 2 different branches. The commands are demonstrated below:
```bash
git checkout f22b25e # You are now in the 'detached HEAD' state; the repository looks as it did at commit f22b25e...
nano mars.txt # in this state, you can make changes...
git add mars.txt
git commit -m "made changes in mars.txt" # or even commit in this state
### OPTION 1: return to main WITHOUT retaining those commits
git checkout main
### OPTION 2 (instead of option 1): to retain those commits, make a new branch while still in the detached state
git checkout -b <new-branch-name> # This branch keeps all commits you made in the "detached HEAD" state
git checkout main # brings you back to the state before you ran 'git checkout f22b25e'
```
```
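The "keep the commits" path can be rehearsed end to end in a scratch repository. This is a sketch under stated assumptions: the `demo` repository, the `experiment` branch name, and the commit messages are all made up for illustration.

```bash
cd "$(mktemp -d)"
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/main        # name the first branch main, whatever Git's default is
git config user.name "Vlad Dracula"          # local identity for this demo repository only
git config user.email "vlad@tran.sylvan.ia"

echo "first" > mars.txt && git add mars.txt && git commit -q -m "First note"
echo "second" >> mars.txt && git add mars.txt && git commit -q -m "Second note"

git checkout -q HEAD~1                       # detached HEAD at the first commit
echo "experiment" >> mars.txt
git add mars.txt
git commit -q -m "Experiment from the old version"

git checkout -q -b experiment                # create the branch BEFORE leaving, so the commit is kept
git checkout -q main                         # back on main, which is unchanged

git log --oneline --all                      # both lines of history are still reachable
```

Creating the branch while still detached is the important step; once you are back on `main`, no branch points at the experimental commit.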
###### Exercise: RECOVERING OLDER VERSIONS OF A FILE
Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?
Answer is 2 and 4:
```bash
git checkout HEAD data_cruncher.py
```
or
```bash
git checkout <unique ID of last commit> data_cruncher.py
```
##### **Important!!!** `Checkout` vs `Restore`
`git checkout` does two different things.
1. Restore files from a previous commit
2. Navigate to a branch
This is confusing, so the developers split the two functions into separate commands (`git restore` and `git switch`, available since Git 2.23). Now, to restore a file from a commit, run:
```bash
git restore --source <commit ID> <filename>
```
`git checkout` still works, but its restore function is gradually being replaced by `git restore`. Let's make it a habit to use `restore` for the restore function!!!
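A minimal sketch of `git restore` in a throwaway repository, assuming Git 2.23 or newer. The `demo` repository and file contents are made up for illustration. Without `--source`, the file is restored from the staging area; with `--source`, it is restored from the commit you name.

```bash
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Vlad Dracula"          # local identity for this demo repository only
git config user.email "vlad@tran.sylvan.ia"

echo "version one" > mars.txt
git add mars.txt && git commit -q -m "Version one"
echo "version two" > mars.txt
git add mars.txt && git commit -q -m "Version two"

echo "unwanted edit" >> mars.txt

git restore mars.txt                         # discard the uncommitted edit
after_plain=$(cat mars.txt)

git restore --source HEAD~1 mars.txt         # bring back the file as it was one commit earlier
after_source=$(cat mars.txt)

echo "after plain restore:    $after_plain"
echo "after --source restore: $after_source"
```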
##### Ignoring things
You don't always want `git` to keep track of EVERY file in the repository (e.g., raw files, large files, etc.). You can tell `git` to ignore these files.
Create a new directory and files:
```bash
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
git status
```
`git status` will show the new files.
To ignore them, make a `.gitignore` file.
```bash
nano .gitignore
```
Type:
````
*.dat
results/
````
```bash
git status
```
You also need to keep track of your `.gitignore`:
```bash
git add .gitignore
git commit -m "Ignore data files and the results folder"
git status
```
Take a look at your ignored files:
```bash
git status --ignored
```
You can override the ignored settings. You can use `-f` to force add the file to staging area.
```bash
git add -f a.dat
```
###### Exercise: IGNORING NESTED FILES
If you have a directory structure like this:
````
results/data
results/plots
````
You can ignore only one of the subdirectories by specifying in `.gitignore`
````
results/plots/
````
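To confirm which paths a pattern actually catches, `git check-ignore` can be used. The sketch below runs in a throwaway repository; the file names `position.csv` and `orbit.png` are hypothetical.

```bash
cd "$(mktemp -d)"
git init -q demo && cd demo

mkdir -p results/data results/plots
touch results/data/position.csv results/plots/orbit.png
echo "results/plots/" > .gitignore

# List every untracked file; the ignored plot should not appear.
git status --porcelain --untracked-files=all

# git check-ignore exits 0 when a path is ignored and 1 when it is not.
if git check-ignore -q results/plots/orbit.png; then plots_ignored=yes; else plots_ignored=no; fi
if git check-ignore -q results/data/position.csv; then data_ignored=yes; else data_ignored=no; fi

echo "results/plots ignored: $plots_ignored"
echo "results/data ignored:  $data_ignored"
```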
##### Remotes in GitHub
###### Create a remote repository
Log into GitHub and click create new repository. Click the create repository button.
###### Connect local to remote repository
Copy the SSH URL from your remote repository.
Go to your local directory:
```bash
git remote add origin git@github.com:vlad/planets.git
```
Check that it worked:
```bash
git remote -v
```
###### SSH Background and Setup
First, check whether you already have a private and public key pair on your computer.
```bash=
ls -al ~/.ssh
```
Create an SSH key pair if you don't already have one. Choose whatever password you like for your passphrase. Write down and make sure to remember the passphrase.
```bash
ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"
```
If you type
```bash
ls -al ~/.ssh
```
the new key pair will show.
Copy the public key to GitHub
```bash
cat ~/.ssh/id_ed25519.pub
```
Copy the output and go to GitHub.
Click profile > settings > SSH keys
Add a new one.
Paste the public key.
Check that the key is set up on GitHub:
```bash
ssh -T git@github.com
```
###### Push local changes to a remote
Check you are still in your planets directory.
After authenticating your ssh key pair, let's push your changes from the local to remote repository.
```bash
git push origin main
```
If you go to your GitHub page, you should be able to see all the files that you committed and the version history of your repository, including your local changes.
###### Collaborating
Practice adding a collaborator:
- Go to the settings of your planets repository.
- Go to Collaborators
- Add people
- Search by GitHub username or email
You should receive an email from the person inviting you to collaborate.
Create a new directory called collaboration:
```bash=
cd .. # get out of your own planets directory
mkdir collaboration
cd collaboration
```
Clone your collaborator's repository from GitHub
```bash=
git clone git@github.com:vlad/planets.git # creates a planets directory inside collaboration
```
Go to your collaborator's repository and create a new file
```bash=
cd planets
nano pluto.txt
```
Commit your changes
```bash=
git add pluto.txt
git commit -m "Add notes about Pluto"
```
And push to your collaborator's GitHub repository
```bash=
git push origin main
```
If you go to the repository GitHub page, you can see your new file and your commit in the commit history.
However, your collaborator won't have the new file locally unless they pull from the remote repository.
```bash=
git pull origin main
```
To download the updates without merging them into your local files, so you can inspect them before running `git pull`, you can use:
```bash=
git fetch origin main
```
You can also git diff your local vs your remote:
```bash=
git diff main origin/main
```
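The fetch, diff, and pull sequence can be practiced without GitHub by using a local bare repository as a stand-in for the remote. Everything in this sketch (`remote.git`, the `owner` and `collaborator` directories, the file contents) is an assumption for illustration.

```bash
export GIT_AUTHOR_NAME="Vlad Dracula" GIT_AUTHOR_EMAIL="vlad@tran.sylvan.ia"
export GIT_COMMITTER_NAME="Vlad Dracula" GIT_COMMITTER_EMAIL="vlad@tran.sylvan.ia"

work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git                        # local stand-in for the GitHub repository
git -C remote.git symbolic-ref HEAD refs/heads/main

# Owner creates the repository and pushes the first commit.
git init -q owner
cd owner
git symbolic-ref HEAD refs/heads/main
echo "Cold and dry" > mars.txt
git add mars.txt && git commit -q -m "Start notes on Mars"
git remote add origin "$work/remote.git"
git push -q origin main
cd ..

# Collaborator clones, adds a file, and pushes.
git clone -q "$work/remote.git" collaborator
cd collaborator
echo "It is so a planet!" > pluto.txt
git add pluto.txt && git commit -q -m "Add notes about Pluto"
git push -q origin main
cd ../owner

# Owner: fetch updates origin/main but leaves the working files alone.
git fetch -q origin main
incoming=$(git diff --name-only main origin/main)
has_file_before=$( [ -e pluto.txt ] && echo yes || echo no )

git pull -q --ff-only origin main                    # now the new file arrives locally
has_file_after=$( [ -e pluto.txt ] && echo yes || echo no )

echo "incoming changes: $incoming"
echo "pluto.txt present after fetch: $has_file_before, after pull: $has_file_after"
```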
##### Conflicts
Modify the mars.txt file in your collaborator's repository.
```bash
nano mars.txt
```
Push change to GitHub
```bash=
git add mars.txt
git commit -m "Add a line in our home copy"
git push origin main
```
If the owner also makes parallel changes in their own repository:
```bash=
nano mars.txt
git add mars.txt
git commit -m "Add a line in my copy"
git push origin main
```
The push is rejected because the remote has a commit that the owner's local repository lacks.
You'll need to pull from the remote and, because both copies changed the same part of mars.txt, resolve the conflict:
```bash
git pull origin main
```
Edit the file to resolve the conflict, then commit the changes.
The conflicting sections are marked within the file by `<<<<<<< HEAD`, `=======`, and `>>>>>>>`.
```bash
nano mars.txt
```
```bash
git add mars.txt
git commit -m "Merge changes from GitHub"
```
**Remember to always git pull before you git push.**
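The whole conflict cycle (rejected push, conflicting pull, manual resolution, successful push) can be reproduced locally. This is a sketch under stated assumptions: a bare repository `remote.git` stands in for GitHub, and the directory names and file contents are invented for the demonstration.

```bash
export GIT_AUTHOR_NAME="Vlad Dracula" GIT_AUTHOR_EMAIL="vlad@tran.sylvan.ia"
export GIT_COMMITTER_NAME="Vlad Dracula" GIT_COMMITTER_EMAIL="vlad@tran.sylvan.ia"

work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git                        # local stand-in for the GitHub repository
git -C remote.git symbolic-ref HEAD refs/heads/main

git init -q owner && cd owner
git symbolic-ref HEAD refs/heads/main
echo "Cold and dry" > mars.txt
git add mars.txt && git commit -q -m "Start notes on Mars"
git remote add origin "$work/remote.git"
git push -q origin main
cd ..
git clone -q "$work/remote.git" collaborator

# Collaborator edits mars.txt and pushes first.
cd collaborator
echo "This line added to Wolfman's copy" >> mars.txt
git commit -q -am "Add a line in our home copy"
git push -q origin main
cd ../owner

# Owner edits the same place in mars.txt without pulling first.
echo "We added a different line in the other copy" >> mars.txt
git commit -q -am "Add a line in my copy"

if git push -q origin main 2>/dev/null; then push_result=accepted; else push_result=rejected; fi

# Pulling tries to merge; both edits touch the same spot, so Git stops with a conflict.
if git pull -q --no-rebase origin main >/dev/null 2>&1; then pull_result=clean; else pull_result=conflict; fi
markers=$(grep -c '^<<<<<<<' mars.txt)

# Resolve by writing the final text, then stage and commit to finish the merge.
printf 'Cold and dry\nWe removed the conflict on this line\n' > mars.txt
git add mars.txt
git commit -q -m "Merge changes from GitHub"
git push -q origin main

echo "push: $push_result, pull: $pull_result, conflict markers found: $markers"
```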
##### Branches
You can branch off the main version of the repository into your own space, where you can commit and push freely. Your collaborator can do the same on their own branch. Later, when you want to merge those branches into `main` (i.e., make the changes part of the main version), you can review whether the merge works or has conflicts before approving it.
### Data and Project Organization
#### Instructor: Emiliano
Data organization is an ongoing process!
Setup: Open a terminal, a text editor of your choice, and the "gapminderDataFiveYear_superDirty data.xlsx" file, and create a new directory to work in.
#### Introduction
You just started a new job or rotation, and someone hands you this data file to analyze.
```bash=
# Let's look at the data (quote the filename because it contains a space)
head "gapminderDataFiveYear_superDirty data.xlsx"
```
We can't tell
- What is it?
- Where did it come from?
- When was it collected?
- Has anything been changed? If so, why was it changed?
We'll look into how to store data for yourself and others:
#### Project Structure
Here are some characteristics of files you should pay attention to:
- File History
- File Function
- File Format
- File Origin
Basic intuitive project directory:
- Code Directory (keep scripts here, separate from data)
- README.txt (add info here about the project and its organization: project name, date, contact info, where data came from, other info about project)
- Data Directory keeps all your data
    - Raw data dir: keep raw data separate from everything else
    - Modified data dir: keep data here that has been processed by scripts or represents a stopping point
- Output Directory for files generated from other files
    - Like figures, stats, paper, etc.
Key points:
- Organize files so that they are intuitive
- Have README files within folders to describe the project and give context for the analyses
- Make a copy of the raw file, so you never have to modify the raw version
- Keep a clear record of the modifications that have been made by making sure your scripts are reproducible from raw data
- Compartmentalize your directory
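The layout described above could be created in one go, as in the following sketch. The project name `my_project` and the exact subdirectory names are assumptions; adapt them to your own conventions.

```bash
cd "$(mktemp -d)"                      # scratch location for the demonstration

# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p my_project/code \
         my_project/data/raw \
         my_project/data/modified \
         my_project/output/figures \
         my_project/doc

# Start the README with the basics; fill in contact and data-source details by hand.
printf 'Project: my_project\nCreated: %s\n' "$(date +%Y-%m-%d)" > my_project/README.txt

find my_project | LC_ALL=C sort        # list the resulting structure
```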
##### Helpful Naming conventions
- Keep track of steps by numbers (e.g., `01_file.txt`, `02_file.txt`)
- Use dates and version of files (e.g. `2023-08-24_file.txt`, `2023-08-25_file.txt`)
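The reason zero-padded numbers and year-first dates are helpful is that file listings sort names as text. A small sketch with made-up file names:

```bash
# Unpadded step numbers sort as text, so step 10 lands before step 1.
unpadded=$(printf '%s\n' 1_load.txt 2_clean.txt 10_plot.txt | LC_ALL=C sort)

# Zero-padded numbers keep the intended order.
padded=$(printf '%s\n' 10_plot.txt 02_clean.txt 01_load.txt | LC_ALL=C sort)

# Year-first dates (YYYY-MM-DD) sort chronologically.
dated=$(printf '%s\n' 2023-08-25_file.txt 2022-12-01_file.txt 2023-08-24_file.txt | LC_ALL=C sort)

printf 'unpadded:\n%s\n\npadded:\n%s\n\ndated:\n%s\n' "$unpadded" "$padded" "$dated"
```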
##### Let's organize the directory that we just created.
```bash=
#Inside our working directory that we just created
#Creating data dir with orig and raw subdir
mkdir data
mkdir data/original
mkdir data/raw
mv gapminder* data/original #moving data files to data/original
cd data/original
ls
#gapminderDataFiveYear_superDirty.xlsx
#gapminderDataFiveYear_superDirty.txt
chmod 444 gapminder* #removes write and execute access at all permission levels
ls -l #checking file permissions -r--r--r--@
cd .. #go up one level, to the data directory
nano README.txt
#Inside our README.txt>>
#Project Name: UPGG Bootcamp Data Organization
#Date: 2023-08-25
#Email/Contact info:Emiliano Sotelo-jemilianos@gmail.com
#Data downloaded from: https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/setup/#:~:text=gapminderDataFiveYear_superDirty%20data.xlsx
#Goal is to learn how to organize our data projects.
```
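The effect of `chmod 444` can be checked on a stand-in file before applying it to real data. In this sketch, `example.csv` and its one-line content are placeholders, not the gapminder file.

```bash
cd "$(mktemp -d)"
mkdir -p data/original
echo "placeholder content" > data/original/example.csv   # stand-in for the original data file

chmod 444 data/original/example.csv                      # read-only for user, group, and others
perms=$(ls -l data/original/example.csv | cut -c1-10)
echo "permissions: $perms"

# Appending now fails for a regular user. The root user can still write,
# so read-only permissions are a safeguard against accidents, not a lock.
if { echo "oops" >> data/original/example.csv; } 2>/dev/null; then
    echo "write succeeded (are you running as root?)"
else
    echo "write blocked"
fi
```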
You might want to list your personal email as a contact instead of your Duke email, because your README may outlast your access to your Duke account.
##### Modifying Data
To avoid modifying the original data, we should make a copy and go from there
```bash=
mkdir cleaned #in same dir as original
cp original/gapminder*.xlsx cleaned/.
cd cleaned
chmod 777 gapminder*.xlsx #gives all the permissions
open gapminder*xlsx # this will open the file in excel
nano README.txt
#Inside README.txt>>
#-Data cleaning steps for gapminder:
#-Removed 5th row
#Better to write out a script to modify our data than to manually clean/edit our data
```
After we've cleaned the data in the copied file, let's make some output directories to sort our files:
```bash=
mkdir code
mkdir output
mkdir doc
```
Here's an example of a template for an analysis that you can adapt: https://github.com/jemilianosf/template_analysis.
### Sharing Jupyter Notebooks
#### Instructor: Hilmar Lapp
You already have all the materials needed for this last lesson, but it will apply to any other work as well.
##### Sharing Jupyter Notebooks using GitHub
How do we share work with other people?
1. Static - sharing a snapshot of your work, cannot be changed
2. Dynamic - they can interact with, or change, or run the code without having to ask you to change anything
##### Binder
Running code is harder than displaying code. To run code you need:
- Hardware
- Code, including dependencies (like R or Python, and packages)
Binder provides both.
Example binder link:
https://mybinder.org/v2/gh/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages?filepath=notebooks%2FData_exploration_run.ipynb
You can execute each chunk of code on a "live" notebook in the link above.
Go to your GitHub repository, copy its https link, and paste it into https://mybinder.org/ under "GitHub repository name or URL". Paste the path to your notebook under "Path to a notebook file (optional)".
Mybinder creates a container, a very lightweight virtual machine running on host hardware.
After creating your notebook execution environment, you get a URL that you can share with others.
###### Adding dependencies
If you only have simple code that uses standard Python, it should run without extra steps. But if you have a more complex project, you need to tell Binder what your dependencies are (what packages to import).
Go to your GitHub repository and create a new file called `requirements.txt`. You can add dependencies like `pandas`, `numpy`, `matplotlib`, `seaborn`.
Binder will recognize your `requirements.txt` and load those packages in your notebook's container.
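The file can also be created locally and committed like any other file. A minimal sketch using the four packages named above, written in a scratch directory; in practice you would create it at the top level of your repository.

```bash
cd "$(mktemp -d)"                      # scratch location; use your repository root in practice

# One package name per line.
cat > requirements.txt <<'EOF'
pandas
numpy
matplotlib
seaborn
EOF

# To pin exact versions, append ==<version> to each line, for example using
# the versions that `pip freeze` reports in the environment where the notebook last ran.
cat requirements.txt
```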
###### Adding data
You won't be able to host data on Binder. If your data is large, you won't be able to store it on GitHub either. You can use an external data repository and get links to your data that you can refer to inside your notebook.
Hugging Face is a service for storing and documenting large datasets. Here's an example:
https://huggingface.co/imageomics/BGNN-trait-segmentation
Note: Python gets updated frequently, so be explicit about the versions of Python and packages you use.
FYI initiative to build reproducible containers:
https://codeocean.com/