UPGG Informatics Orientation Bootcamp 2023

The notes for Wednesday and Thursday have been moved to a separate document.

Friday Morning

Version Control with Git

Instructor: Raven

Summary and Setup

Today we'll follow the swcarpentry notes: https://swcarpentry.github.io/git-novice/

Create and navigate to a new working directory

mkdir bootcamp
cd bootcamp

Version control can keep track of both what you ADD and what you SUBTRACT from the document. However, a conflict can arise if two users are making changes to the same section of the document.

It works like an unlimited "undo".

Setup Git

You only need to configure git one time at the start!

git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"

Your computer encodes each press of Enter as a character (a "newline"), and the encoding differs between operating systems. You can change how Git recognizes and encodes line endings.

On MacOS and Linux:

git config --global core.autocrlf input

And on Windows:

git config --global core.autocrlf false

Setup a default text editor (optional):

git config --global core.editor "nano -w"

Configure the default branch name to be main

git config --global init.defaultBranch main

You can view all the configurations you have for git.

git config --list
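
To check one setting instead of scanning the whole list, git config --get prints a single value. A minimal sketch, assuming a throwaway repository (the name config-demo is made up) and repository-level settings so that your real global configuration is left alone:

```shell
# Work in a scratch directory so nothing touches your real setup.
scratch=$(mktemp -d)
cd "$scratch"
git init -q config-demo
cd config-demo

# Without --global, the setting is stored in this repository only.
git config user.name "Vlad Dracula"
git config core.editor "nano -w"

# Look up individual values instead of listing everything.
name=$(git config --get user.name)
editor=$(git config --get core.editor)
echo "name:   $name"
echo "editor: $editor"
```

Dropping --global is also handy when a single project needs a different name or email than the rest of your work.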
Creating a Repository

Make sure you are in your working directory for this lecture.

Create a new directory and initialize a new repository.

mkdir planets
cd planets
git init

Print your git version.

git version

Create main branch and make sure you are on it.

git checkout -b main

You can get a report on the repository.

git status

It's recommended NOT to initialize a new git repository inside a pre-existing one, e.g. by making a directory called moons inside planets and calling git init inside moons. If you accidentally create a nested repository, just remove the .git directory of the nested one.

rm -rf moons/.git

Version control keeps track of files only (not directories).
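
The point about directories can be checked directly. This is a small sketch in a scratch repository; the names tracking-demo, notes and draft.txt are made up for illustration:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q tracking-demo
cd tracking-demo

# An empty directory: git has nothing to report.
mkdir notes
empty_report=$(git status --porcelain)
echo "empty directory: '$empty_report'"

# Once the directory holds a file, git notices it.
touch notes/draft.txt
file_report=$(git status --porcelain)
echo "with a file:     '$file_report'"
```

The --porcelain option prints a short, stable form of git status, where ?? marks untracked content.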

Tracking Changes

Make a new file and type some notes.

nano mars.txt

We can see that git detected the new file, but it's not being tracked yet:

git status

To start tracking changes:

git add mars.txt

To save the changes, run git commit (a commit is basically a "save" in git lexicon).

git commit -m "Start notes on Mars as a base"

The message summarizes the changes you made since the last commit. The text after the -m flag is a note to your future self or a collaborator about what you changed in this commit. It should be relatively brief.

Take a look at the commit history.

git log

If you make further changes, you can run git diff to show the differences between the current state of the file and the most recently saved version.

git diff

Git add your new changes before committing


git add mars.txt

git commit -m "Add concerns about effects of Mars' moons on Wolfman"

By default, git diff compares the working directory with the staging area.

If you use git diff after staging your new changes, it won't find a difference.

nano mars.txt
git add mars.txt
git diff

However, you can compare the committed version and the staged version:

git diff --staged
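
The two comparisons can be seen side by side in a scratch repository. A minimal sketch; the repository name diff-demo and the file contents are placeholders, and echo stands in for editing with nano:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q diff-demo
cd diff-demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "first line of notes" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

# Edit the file without staging it.
echo "second line of notes" >> mars.txt
unstaged_before=$(git diff)          # working directory vs staging area

git add mars.txt
unstaged_after=$(git diff)           # nothing is left unstaged
staged_after=$(git diff --staged)    # staging area vs last commit

echo "$staged_after"
```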

Change visualization from the line level (default) to word level changes

git diff --color-words

If git log is very long:

git log -1 ## Limits to just last commit, change 1 to any number of logs you want
git log --oneline ## reduces information to just one line

You can git add a directory even if directories are not tracked (it tracks the contents).

mkdir spaceships
touch spaceships/apollo-11 spaceships/sputnik-1
git status
git add spaceships
git status

Notice the difference in git status before and after adding spaceships.

Commit the changes:

git commit -m "Add some initial thoughts on spaceships"

It's best practice to be descriptive with your commit messages.

Exercises:
Which command(s) below would save the changes of myfile.txt to my local Git repository?

Answer number 3 is correct:

git add myfile.txt
git commit -m "my recent changes"

The staging area can hold changes from any number of files that you want to commit as a single snapshot.

Adding multiple files
  • Add some text to mars.txt noting your decision to consider Venus as a base
nano mars.txt
  • Create a new file venus.txt with your initial thoughts about Venus as a base for you and your friends
nano venus.txt
  • Add changes from both files to the staging area, and commit those changes.
git add mars.txt venus.txt
git commit -m "Started considering Venus as a base"
bio Repository
  • Create a new Git repository on your computer called bio.
  • Write a three-line biography for yourself in a file called me.txt, commit your changes
  • Modify one line, add a fourth line
  • Display the differences between its updated state and its original state.
cd .. # move out of planets, which already has its own .git

mkdir bio
cd bio

git init # initializes git
nano me.txt # write biography in the file

git add me.txt
git commit -m "Add biography file"

nano me.txt # modify one line and add a fourth line

git diff me.txt # shows the differences
Exploring History

Make some more changes in mars.txt

nano mars.txt
cat mars.txt

See what changed. This will compare to the most recent commit.

git diff HEAD mars.txt

HEAD refers to the most recent commit, and earlier commits can be referred to relative to it. HEAD~1 moves one step back through the log, to the commit just before the most recent one; HEAD~2 moves back two steps.

git diff HEAD~1 mars.txt

git show displays the commit message together with the changes that were made in that commit, rather than the differences between a commit and our working directory that git diff reports.

git show HEAD~2 mars.txt

HEAD is good for looking for a recent commit because it is relative to the most recent one, but if you want to point at a commit with an absolute identifier, you can obtain the ID from git log. Each commit gets its own unique one.

git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt

You don't have to use the full 40-character string. As long as the first few characters you type are unique among your commit IDs, git will know which one you are referring to (similar logic to tab completion).

git diff f22b25e mars.txt
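
Relative and absolute names can be compared in a scratch repository. This is an illustrative sketch; the repository name history-demo and the commit messages are made up:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q history-demo
cd history-demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Three commits, each appending one line to mars.txt.
for n in 1 2 3; do
    echo "note $n" >> mars.txt
    git add mars.txt
    git commit -q -m "Add note $n"
done

git log --oneline

# Relative names: HEAD is the newest commit, HEAD~1 the one before it.
head_msg=$(git log -1 --format=%s HEAD)
prev_msg=$(git log -1 --format=%s HEAD~1)
echo "HEAD:   $head_msg"
echo "HEAD~1: $prev_msg"

# Absolute names: the full ID and its abbreviation name the same commit.
full_id=$(git rev-parse HEAD~1)
short_id=$(git rev-parse --short HEAD~1)
resolved=$(git rev-parse "$short_id")
echo "$short_id resolves to $resolved"
```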

Using git checkout checks out (i.e., restores) an old version of a file. In this case, we are telling Git we want to recover the version of the file recorded in HEAD.

git checkout HEAD mars.txt
Don't Lose Your Head

If you want to restore a version of a file but forget to add the filename after git checkout <commit ID>, you will enter a state called detached HEAD. You have basically entered an area for experimenting with the <commit ID> version of your repository. In this state you can change files and even commit those changes. HOWEVER, once you "reattach" HEAD, the changes and commits you made in the detached HEAD state won't come with you.

There's a way to retain the commits you made in the detached HEAD state: while still detached, make a new branch to hold them, creating an alternate-reality version of your repository. Once you've done that, you can "reattach" HEAD by returning to the main branch. You now have two versions of your repository on two different branches. The commands are shown below:

git checkout f22b25e # Now you entered the 'detached HEAD' state with your repository looking like the stage in which you committed the f22b25e... commit

nano mars.txt # in this state, you can make changes...
git add mars.txt
git commit -m "made changes in mars.txt" # or even commit in this state

### YOU CAN RETURN TO MAIN WITHOUT RETAINING THOSE COMMITS BY
git checkout main

### ALTERNATIVELY, if you wish to retain those changes somewhere, make a new branch
git checkout -b <new-branch-name> # This branch will keep all changes you made inside the "detached HEAD" state
git checkout main # will bring you back to the original state before you ran the 'git checkout f22b25e' line
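
The rescue can be rehearsed end to end in a scratch repository. A sketch under stated assumptions: the repository name detached-demo and the branch name experiment are made up, and HEAD~1 stands in for a commit ID such as f22b25e:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q detached-demo
cd detached-demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git checkout -q -b main

echo "first note" > mars.txt
git add mars.txt
git commit -q -m "First note"
echo "second note" >> mars.txt
git commit -q -a -m "Second note"

# Checking out a commit without a filename detaches HEAD.
git checkout -q HEAD~1
echo "experiment" >> mars.txt
git commit -q -a -m "Experiment in detached HEAD"

# Still detached: put the experimental commit on a branch so it is kept.
git checkout -q -b experiment
git checkout -q main

git branch --list
```

After this, main still ends at "Second note" and the experimental commit lives on the experiment branch.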
Exercise: RECOVERING OLDER VERSIONS OF A FILE

Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?

Answer is 2 and 4:

git checkout HEAD data_cruncher.py

or

git checkout <unique ID of last commit> data_cruncher.py
Important!!! Checkout vs Restore

git checkout does two different things.

  1. Restoring files from a previous commit
  2. Navigating to a branch

This is confusing, so the developers split the two functions into two separate commands, git restore and git switch (available since Git 2.23). Now, to restore a file from a commit, run:

git restore --source <commit ID> <filename>

git checkout still works for both jobs, but restore states the intent more clearly. Let's make it a habit to use restore for the restore function!!!
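
Here is a short sketch of git restore in a scratch repository, assuming Git 2.23 or newer. The repository name restore-demo and the file contents are placeholders:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q restore-demo
cd restore-demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "first note" > mars.txt
git add mars.txt
git commit -q -m "First note"
echo "second note" >> mars.txt
git commit -q -a -m "Second note"

# An unwanted, uncommitted edit.
echo "scribble" >> mars.txt

# Without --source, restore brings back the staged version,
# which here matches the last commit.
git restore mars.txt
after_plain=$(cat mars.txt)

# With --source, restore takes the file from the commit you name.
git restore --source HEAD~1 mars.txt
after_source=$(cat mars.txt)

echo "$after_source"
```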

Ignoring things

You don't always want git to keep track of EVERY file in the repository (e.g., raw files, large files, etc.). You can tell git to ignore these files.

Create a new directory and files:

mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
git status

The git status will show the new files.

To ignore them, make a .gitignore file.

nano .gitignore

Type:

*.dat
results/

Save the file and check the status again:

git status

You also need to keep track of your .gitignore:

git add .gitignore
git commit -m "Ignore data files and the results folder"
git status

Take a look at your ignored files:

git status --ignored

You can override the ignored settings. You can use -f to force add the file to staging area.

git add -f a.dat
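
When it is unclear why a file is (or is not) ignored, git check-ignore reports the rule responsible. A minimal sketch in a scratch repository; ignore-demo and notes.txt are made-up names, and printf stands in for editing .gitignore with nano:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q ignore-demo
cd ignore-demo

mkdir results
touch a.dat b.dat results/a.out notes.txt
printf '*.dat\nresults/\n' > .gitignore

# Only .gitignore and notes.txt should be listed as untracked.
visible=$(git status --porcelain)
echo "$visible"

# -v prints the file, line number and pattern that matched.
git check-ignore -v a.dat results/a.out
```

git check-ignore exits with a non-zero status for paths that are not ignored, which makes it usable in scripts.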
Exercise: IGNORING NESTED FILES

If you have a directory structure like this:

results/data
results/plots

You can ignore only one of the subdirectories by specifying in .gitignore

results/plots/
Remotes in GitHub
Create a remote repository

Log into GitHub and click create new repository. Click the create repository button.

Connect local to remote repository

Copy the SSH URL from your remote repository's page.

Go to your local directory:

git remote add origin git@github.com:vlad/planets.git

Check that it worked:

git remote -v
SSH Background and Setup

First check whether you already have a private and public key pair on your computer.

ls -al ~/.ssh

Create an SSH key pair if you don't already have one. Choose whatever password you like for your passphrase. Write down and make sure to remember the passphrase.

ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"

If you type

ls -al ~/.ssh

the new key pair will show.

Copy the public key to GitHub

cat ~/.ssh/id_ed25519.pub

Copy the output and go to GitHub.
Click profile > settings > SSH keys

Add a new one.

Paste the public key.

Check that the key is set up on GitHub:

ssh -T git@github.com
Push local changes to a remote

Check you are still in your planets directory.

After authenticating your ssh key pair, let's push your changes from the local to remote repository.

git push origin main

If you go to your GitHub page, you should be able to see all the files that you committed and the version history of your repository, including the local changes you just pushed.
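
The remote add and push steps can be practiced without a network connection by using a local bare repository as a stand-in for GitHub. This is only a sketch: the path remote.git is made up, and a real GitHub remote would use the SSH URL instead:

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A bare repository has no working files; it plays the role of GitHub here.
git init -q --bare remote.git

git init -q planets
cd planets
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git checkout -q -b main
echo "notes" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

# Connect the local repository to the stand-in remote and push.
git remote add origin "$scratch/remote.git"
git remote -v
git push -q origin main

local_id=$(git rev-parse main)
remote_id=$(git -C "$scratch/remote.git" rev-parse main)
echo "local:  $local_id"
echo "remote: $remote_id"
```

Matching commit IDs on both sides show that the push transferred the history.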

Collaborating

Practice adding a collaborator:

  • Go to the settings of your planets repository.
  • Go to Collaborators
  • Add people
  • Search by github username or mail

You should receive an email from the person inviting you to collaborate.

Create a new directory called collaboration:

cd .. # get out of your own planets directory
mkdir collaboration
cd collaboration

Clone your collaborator's repository from GitHub

git clone git@github.com:vlad/planets.git

Go to your collaborator's repository and create a new file

cd planets
nano pluto.txt

Commit your changes

git add pluto.txt
git commit -m "Add notes about Pluto"

And push to your collaborator's GitHub repository

git push origin main

If you go to the repository GitHub page, you can see your new file and your commit in the commit history.

However, your collaborator won't have the new file locally unless they pull from the remote repository.

git pull origin main

To download the updates without merging them into your local branch, you can use:

git fetch origin main

You can also git diff your local vs your remote:

git diff main origin/main
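
The fetch, compare, pull sequence can be tried offline with a local bare repository standing in for GitHub. A sketch under assumptions: the directory names owner, collaborator and remote.git and the collaborator's email are made up:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q --bare remote.git

# The owner creates the repository and pushes it.
git init -q owner
cd owner
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git checkout -q -b main
echo "Mars notes" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"
git remote add origin "$scratch/remote.git"
git push -q origin main

# The collaborator clones, adds a file and pushes.
cd "$scratch"
git clone -q -b main remote.git collaborator
cd collaborator
git config user.name "Wolfman"
git config user.email "wolfman@tran.sylvan.ia"
echo "Pluto notes" > pluto.txt
git add pluto.txt
git commit -q -m "Add notes about Pluto"
git push -q origin main

# Back at the owner: fetch downloads the commit but does not merge it.
cd "$scratch/owner"
git fetch -q origin main
incoming=$(git log --oneline main..origin/main)
changed=$(git diff --name-only main origin/main)
echo "incoming commits: $incoming"
echo "changed files:    $changed"

# pluto.txt only appears in the working directory after the pull.
git pull -q --ff-only origin main
ls
```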
Conflicts

Modify the mars.txt file in your collaborator's repository.

nano mars.txt

Push change to GitHub

git add mars.txt
git commit -m "Add a line in our home copy"
git push origin main

If the owner also makes parallel changes in their own repository:

nano mars.txt
git add mars.txt
git commit -m "Add a line in my copy"
git push origin main

The push won't work, because GitHub now has a commit that the owner's local copy doesn't have, and both commits changed mars.txt.

You'll need to pull from the remote and resolve the conflict

git pull origin main

Edit the file and resolve the conflicts and commit the changes.

The conflict will be indicated within the file between the <<<<<<< HEAD, =======, and >>>>>>>.

nano mars.txt
git add mars.txt
git commit -m "Merge changes from GitHub"
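
The whole conflict cycle can be reproduced offline. This is a sketch, not the exact classroom session: a local bare repository (remote.git) stands in for GitHub, the directory names and file contents are made up, and --no-rebase is passed so the pull merges regardless of local configuration:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q --bare remote.git

git init -q owner
cd owner
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git checkout -q -b main
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"
git remote add origin "$scratch/remote.git"
git push -q origin main

# The collaborator adds a line and pushes first.
cd "$scratch"
git clone -q -b main remote.git collaborator
cd collaborator
git config user.name "Wolfman"
git config user.email "wolfman@tran.sylvan.ia"
echo "Line from the collaborator" >> mars.txt
git commit -q -a -m "Add a line in our home copy"
git push -q origin main

# The owner edits the same spot and tries to push.
cd "$scratch/owner"
echo "Line from the owner" >> mars.txt
git commit -q -a -m "Add a line in my copy"
if git push -q origin main 2>/dev/null; then
    push_result=accepted
else
    push_result=rejected
fi
echo "push was $push_result"

# Pulling stops with a conflict and writes markers into the file.
git pull -q --no-rebase origin main || true
markers=$(grep -c '^<<<<<<<' mars.txt)
cat mars.txt

# Resolve by keeping both lines, then finish the merge and push.
printf 'Cold and dry\nLine from the owner\nLine from the collaborator\n' > mars.txt
git add mars.txt
git commit -q -m "Merge changes from GitHub"
git push -q origin main
```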

Remember to always git pull before you git push.

Branches

You can branch off the main version of the repository into your own space, where you can commit and push freely. Your collaborator can do the same on their own branch. Then, when you want to merge those branches into main (i.e., make the changes part of the main version), you can review the merge, resolve any conflicts, and approve it once it works.
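
No commands were covered for this part, so the following is only a sketch of the local branch-and-merge cycle in a scratch repository. The repository name branch-demo and the branch name venus-ideas are made up:

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q branch-demo
cd branch-demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git checkout -q -b main
echo "Mars notes" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

# Work on a separate branch; main is untouched.
git checkout -q -b venus-ideas
echo "Venus notes" > venus.txt
git add venus.txt
git commit -q -m "Start notes on Venus"

# Back on main, the new file is absent until the branch is merged.
git checkout -q main
files_before=$(ls)
git merge -q venus-ideas
files_after=$(ls)

echo "before merge: $files_before"
echo "after merge:  $files_after"
```

On GitHub the review step usually happens through a pull request before the merge.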

Data and Project Organization

Instructor: Emiliano

Data organization is an ongoing process!

Setup: Open terminal, a text editor of choice, "gapminderDataFiveYear_superDirty data.xlsx" file, create a new directory to work in

Introduction

You just started a new job/a new rotation, someone hands you this data file to analyze.

# Let's look at the data
head "gapminderDataFiveYear_superDirty data.xlsx"

We can't tell

  • What is it?
  • Where did it come from?
  • When was it collected?
  • Has anything been changed? If so, why was it changed?

We'll look into how to store data for yourself and others:

Project Structure

Here are some characteristics of files you should pay attention to:

  • File History
  • File Function
  • File Format
  • File Origin

Basic intuitive project directory:

  • Code Directory (keep scripts here, separate from data)
  • README.txt (add info here about the project and its organization: project name, date, contact info, where data came from, other info about project)
  • Data Directory keeps all your data
    • Raw data dir: keep raw data separate from everything else
    • Modified data dir: keep data here that has been modified by scripts or saved at stopping points
  • Output Directory for files generated from other files
    • Like figures, stats, paper etc.

Key points:

  • Organize files so that they are intuitive
  • Have README files within folders to describe the project and give context for the analyses
  • Make a copy of the raw file, so you never have to modify the raw version
  • Keep a clear record of the modifications that have been made by making sure your scripts reproduce every result from the raw data
  • Compartmentalize your directory
Helpful Naming conventions
  • Keep track of steps by numbers (e.g., 01_file.txt, 02_file.txt)
  • Use dates and version of files (e.g. 2023-08-24_file.txt, 2023-08-25_file.txt)
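
Zero-padded numbers and year-first dates matter because file listings sort as text. A small sketch with made-up file names; LC_ALL=C is set so the sort order is the same on every machine:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir unpadded padded
touch unpadded/1_load.txt unpadded/2_clean.txt unpadded/10_plot.txt
touch padded/01_load.txt padded/02_clean.txt padded/10_plot.txt

# Without padding, step 10 sorts ahead of step 1.
unpadded_first=$(LC_ALL=C ls unpadded | head -n 1)
# With padding, the listing follows the order of the steps.
padded_first=$(LC_ALL=C ls padded | head -n 1)
echo "unpadded listing starts with: $unpadded_first"
echo "padded listing starts with:   $padded_first"

# %F prints the date as YYYY-MM-DD, which also sorts chronologically.
today=$(date +%F)
touch "${today}_file.txt"
ls
```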
Let's organize the directory that we just created.
# Inside our working directory that we just created
# Creating data dir with orig and raw subdir
mkdir data
mkdir data/original
mkdir data/raw
mv gapminder* data/original # moving data files to data/original
cd data/original
ls
# gapminderDataFiveYear_superDirty.xlsx
# gapminderDataFiveYear_superDirty.txt
chmod 444 gapminder* # removes write and execute access at all permission levels
ls -l # checking file permissions
# -r--r--r--@
cd .. # moving back up one level
nano README.txt
# Inside our README.txt >>
# Project Name: UPGG Bootcamp Data Organization
# Date: 2023-08-25
# Email/Contact info: Emiliano Sotelo-jemilianos@gmail.com
# Data downloaded from: https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/setup/#:~:text=gapminderDataFiveYear_superDirty%20data.xlsx
# Goal is to learn how to organize our data projects.

You might want to list your personal email as the contact instead of your Duke email: your README might outlast your access to your Duke account, and you will most likely keep your personal email much longer.

Modifying Data

To avoid modifying the original data, we should make a copy and go from there

mkdir cleaned # in same dir as original
cp original/gapminder*.xlsx cleaned/.
cd cleaned
chmod 777 gapminder*.xlsx # gives all the permissions
open gapminder*xlsx # this will open the file in excel
nano README.txt
# Inside README.txt >>
# -Data cleaning steps for gapminder:
# -Removed 5th row
# Better to write out a script to modify our data than to manually clean/edit our data

After we've cleaned the data in the copied file, let's make some output directories to sort our files:

mkdir code
mkdir output
mkdir doc
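
For a new project, the same layout can be created in one go. This is a sketch rather than the template linked below: the project name my_project, the placeholder data file and the README fields in angle brackets are all assumptions to replace with your own.

```shell
project="$(mktemp -d)/my_project"     # hypothetical project location
mkdir -p "$project/code" "$project/doc" "$project/output" \
         "$project/data/original" "$project/data/cleaned"

# A placeholder raw file, made read-only so it cannot be edited by accident.
echo "placeholder" > "$project/data/original/raw_data.txt"
chmod 444 "$project/data/original/raw_data.txt"
perms=$(ls -l "$project/data/original/raw_data.txt" | cut -c1-10)
echo "raw data permissions: $perms"

# A README skeleton to fill in.
cat > "$project/README.txt" <<'EOF'
Project Name: <name>
Date: <YYYY-MM-DD>
Contact: <email>
Data downloaded from: <source>
EOF

# Show the resulting layout.
(cd "$project" && find . | LC_ALL=C sort)
```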

Here's an example of a template for an analysis that you can adapt: https://github.com/jemilianosf/template_analysis.

Sharing Jupyter Notebooks

Instructor: Hilmar Lapp

You already have all the materials needed for this last lesson, but it will apply to any other work as well.

Sharing Jupyter Notebooks using GitHub

How do we share work with other people?

  1. Static - sharing a snapshot of your work, cannot be changed
  2. Dynamic - they can interact with, or change, or run the code without having to ask you to change anything
Binder

Running code is harder than displaying code. To run code you need:

  • Hardware
  • Code, including dependencies (like R or Python, and packages)

Binder provides both.
Example binder link:
https://mybinder.org/v2/gh/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages?filepath=notebooks%2FData_exploration_run.ipynb

You can execute each chunk of code on a "live" notebook in the link above.

Go to your GitHub repository, copy its https link, and paste it into https://mybinder.org/ under "GitHub repository name or URL". Then paste the path to your notebook under "Path to a notebook file (optional)".

Mybinder creates a container, a very lightweight virtual machine running on host hardware.

After creating your notebook's execution environment, you get a URL that you can share with others.

Adding dependencies

If you only have simple code that uses standard Python, it should run without extra steps. But if you have a more complex project, you need to tell Binder what your dependencies are (which packages to install).

Go to your GitHub repository and create a new file called requirements.txt. You can add dependencies like pandas, numpy, matplotlib, seaborn.

Binder will recognize your requirements.txt and load those packages in your notebook's container.
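
A requirements.txt file is plain text with one package per line. The sketch below writes one from the terminal in a scratch directory, using the four packages named above. In a real project the file would sit at the top level of the repository, and each line can pin a version in the form package==version (no versions are filled in here, since they depend on your project).

```shell
scratch=$(mktemp -d)
cd "$scratch"

# One package per line; these are the four packages mentioned above.
cat > requirements.txt <<'EOF'
pandas
numpy
matplotlib
seaborn
EOF

cat requirements.txt
```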

Adding data

You won't be able to host data on Binder. If your data is large, you won't be able to save it on GitHub either. You can use an external data repository and get links to your data that you can refer to inside your notebook.

Hugging Face is a service to store and document large datasets. Here's an example:
https://huggingface.co/imageomics/BGNN-trait-segmentation

Note: Python gets updated frequently, and it's something to be aware of. Be explicit about the versions of Python and packages you use.

FYI: Code Ocean is an initiative to build reproducible containers:
https://codeocean.com/
