The notes for Wednesday and Thursday have been moved to a separate document.
Today we'll follow the swcarpentry notes: https://swcarpentry.github.io/git-novice/
Create and navigate to a new working directory
mkdir bootcamp
cd bootcamp
Version control can keep track of both what you ADD and what you SUBTRACT from the document. However, a conflict can arise if two users are making changes to the same section of the document.
It works like an unlimited "undo".
You only need to configure git one time at the start!
git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"
Your computer encodes the Enter key as a character ("newline"). You can change the way Git recognizes and encodes line endings.
On MacOS and Linux:
git config --global core.autocrlf input
And on Windows:
git config --global core.autocrlf false
Set up a default text editor (optional):
git config --global core.editor "nano -w"
Configure the default branch name to be main
git config --global init.defaultBranch main
You can view all the configurations you have for git.
git config --list
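To check a single setting instead of scanning the whole list, git config --get prints one value. A minimal sketch, assuming a throwaway repository (the demo directory is hypothetical) and repository-level settings so your real global configuration is left untouched:

```shell
# Work in a throwaway directory so nothing outside it changes
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

# Without --global, the setting is stored in this repository only
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# --get prints the value of one key
configured_name=$(git config --get user.name)
echo "$configured_name"

# --show-origin also prints the file each setting comes from
git config --list --show-origin | grep "user.email"
```

Dropping --global is handy when one project needs a different name or email than the rest.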
Make sure you are in your working directory for this lecture.
Create a new directory and initialize a new repository.
mkdir planets
cd planets
git init
Print your git version.
git version
Create the main branch and make sure you are on it.
git checkout -b main
You can get a report on the repository.
git status
It's recommended NOT to initialize a new git repository within a pre-existing one, e.g. making a directory called moons and calling git init inside the moons directory. If you accidentally make nested git repositories, just remove the .git directory of the inner one.
rm -rf moons/.git
Version control keeps track of files only (not directories).
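A quick way to see this for yourself: an empty directory does not show up in git status, but it appears as soon as it contains a file. A small sketch in a throwaway repository (the demo and probes names are made up for illustration):

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

mkdir probes                          # an empty directory
status_empty=$(git status --porcelain)
echo "empty directory: '$status_empty'"

touch probes/voyager-1                # now the directory contains a file
status_with_file=$(git status --porcelain)
echo "with a file:     '$status_with_file'"
```

The --porcelain flag only makes the status output short and stable; the plain git status tells the same story in longer form.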
Make a new file and type some notes.
nano mars.txt
We can see that git detected the new file, but it's not being tracked yet:
git status
To start tracking changes:
git add mars.txt
To save the changes, run git commit (basically "save" in git lexicon).
git commit -m "Start notes on Mars as a base"
This summarizes the changes you made since the last commit. After the -m flag, you give your future self or a collaborator a note on what changed in this commit. It should be relatively brief.
Take a look at the commit history.
git log
If you make further changes, you can run git diff to show the differences between the current state of the file and the most recently saved version.
git diff
Git add your new changes before committing
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
By default, git diff compares your unstaged files (the working directory) with the staging area.
If you use git diff after staging your new changes, it won't find a difference.
nano mars.txt
git add mars.txt
git diff
However, you can compare the committed version and the staged version:
git diff --staged
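The difference between plain git diff and git diff --staged can be checked end to end in a scratch repository. This is only a sketch; the demo directory and the file contents are illustrative:

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry, but everything is my favorite color" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

echo "An extra line of notes" >> mars.txt
unstaged_before=$(git diff --name-only)        # working directory vs staging area

git add mars.txt
unstaged_after=$(git diff --name-only)         # nothing is left unstaged
staged_after=$(git diff --staged --name-only)  # staging area vs last commit

echo "before add, git diff:          '$unstaged_before'"
echo "after add,  git diff:          '$unstaged_after'"
echo "after add,  git diff --staged: '$staged_after'"
```

The --name-only flag just lists the changed files; drop it to see the actual line changes.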
Change visualization from the line level (default) to word level changes
git diff --color-words
If git log is very long:
git log -1 ## Limits to just last commit, change 1 to any number of logs you want
git log --oneline ## reduces information to just one line
You can git add a directory even if directories are not tracked (it tracks the contents).
mkdir spaceships
touch spaceships/apollo-11 spaceships/sputnik-1
git status
git add spaceships
git status
Notice the difference in git status before and after adding spaceships.
Commit the changes:
git commit -m "Add some initial thoughts on spaceships"
It's best practice to be descriptive with your commit messages.
Answer number 3 is correct:
git add myfile.txt
git commit -m "my recent changes"
The staging area can hold changes from any number of files that you want to commit as a single snapshot.
nano mars.txt
nano venus.txt
git add mars.txt venus.txt
git commit -m "Started considering Venus as a base"
bio Repository
cd .. # .git already exists here (planets)
mkdir bio
cd bio
git init # initializes git
nano me.txt # write biography in the file
git add me.txt
git commit -m "Add biography file"
nano me.txt # adds the fourth line
git diff me.txt # shows the differences
Make some more changes in mars.txt
nano mars.txt
cat mars.txt
See what changed. This will compare to the most recent commit.
git diff HEAD mars.txt
Use HEAD~1 to step back one commit from the most recent one (i.e., the commit just before the latest). HEAD refers to the most recent commit, and every earlier commit can be referred to relative to HEAD.
git diff HEAD~1 mars.txt
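To make the HEAD~N notation concrete, here is a sketch that creates three commits in a scratch repository and asks git which commit each name points to (the commit messages and the demo directory are made up):

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

# Three commits, each adding one line to mars.txt
for n in 1 2 3; do
    echo "note $n" >> mars.txt
    git add mars.txt
    git commit -q -m "Commit number $n"
done

# %s prints only the subject line of the commit message
head_subject=$(git log -1 --format=%s HEAD)
parent_subject=$(git log -1 --format=%s HEAD~1)
grandparent_subject=$(git log -1 --format=%s HEAD~2)

echo "HEAD   -> $head_subject"
echo "HEAD~1 -> $parent_subject"
echo "HEAD~2 -> $grandparent_subject"
```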
git show displays the commit message together with the changes made in that commit, rather than the differences between a commit and our working directory that git diff shows.
git show HEAD~2 mars.txt
HEAD is good for finding a recent commit because it is relative to the most recent one, but if you want to point at a commit with an absolute identifier, you can obtain the ID from git log. Each commit gets its own unique one.
git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt
You don't have to use the full 40-character string. As long as the first few letters/digits you type are unique among commit IDs, git will know which one you are referring to (similar logic to tab completion).
git diff f22b25e mars.txt
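If you'd rather not count characters by hand, git rev-parse can print an abbreviated ID for you and expand it back to the full one. A small sketch in a scratch repository (the ID you get will differ from the f22b25e example above):

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"

full_id=$(git rev-parse HEAD)            # the full commit ID
short_id=$(git rev-parse --short HEAD)   # an abbreviation git considers unique
resolved=$(git rev-parse "$short_id")    # expanding the short ID again

echo "full:  $full_id"
echo "short: $short_id"
```

git log --oneline shows the same abbreviated IDs next to each commit message.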
git checkout checks out (i.e., restores) an old version of a file. In this case, we are telling Git we want to recover the version of the file recorded in HEAD.
git checkout HEAD mars.txt
If you want to restore an old version of a file but forget to add the filename after git checkout <commit ID>, you will enter a state called detached HEAD. You have basically entered an area for experimenting with the <commit ID> version of your repository. In this state you can change files and even commit those changes. HOWEVER, once you "reattach" HEAD (e.g., with git checkout main), the commits you made in the detached HEAD state are not on any branch and won't come with you.
There is a way to retain the commits you made in the detached HEAD state: make a new branch to hold them, creating an "alternate reality" version of your repository. Once you've done that, you can "reattach" HEAD to return to the original repository on the main branch. You then have two versions of your repository on two different branches. The commands below demonstrate this:
git checkout f22b25e # Now you entered the 'detached HEAD' state with your repository looking like the stage in which you committed the f22b25e... commit
nano mars.txt # in this state, you can make changes...
git add mars.txt
git commit -m "made changes in mars.txt" # or even commit in this state
### YOU CAN RETURN TO MAIN WITHOUT RETAINING THOSE COMMITS BY
git checkout main
### ALTERNATIVELY, if you wish to retain those changes somewhere, make a new branch
git checkout -b <new-branch-name> # This branch will keep all changes you made inside the "detached HEAD" state
git checkout main # will bring you back to the original state before you ran the 'git checkout f22b25e' line
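The rescue-with-a-branch route can be tried safely in a scratch repository. The sketch below assumes a branch called experiment and made-up commit messages; it checks that the commit made in the detached HEAD state ends up on the new branch and not on main:

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main    # make sure the first branch is called main
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "first note" > mars.txt
git add mars.txt
git commit -q -m "First commit"
echo "second note" >> mars.txt
git add mars.txt
git commit -q -m "Second commit"

# Check out the older commit without a filename: detached HEAD
first_commit=$(git rev-parse HEAD~1)
git checkout -q "$first_commit"

# Commit something while detached
echo "experimental note" >> mars.txt
git add mars.txt
git commit -q -m "Experiment from an old commit"

# Keep that commit on a new branch, then go back to main
git checkout -q -b experiment
git checkout -q main

experiment_tip=$(git log -1 --format=%s experiment)
main_tip=$(git log -1 --format=%s main)
echo "experiment: $experiment_tip"
echo "main:       $main_tip"
```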
Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?
Answer is 2 and 4:
git checkout HEAD data_cruncher.py
or
git checkout <unique ID of last commit> data_cruncher.py
Checkout vs Restore
git checkout does two different things: it switches branches, and it restores files.
This is confusing, so the developers split the two functions into two separate commands (git switch and git restore). Now, to restore a file from a given commit, run:
git restore --source <commit ID> <filename>
The restore function of git checkout is gradually being superseded by git restore. Let's make it a habit to use restore for the restore function!
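Here is a sketch of git restore in action, assuming Git 2.23 or newer (the release that introduced the command) and a scratch repository with made-up file contents. Without --source the file is restored from the staging area; with --source it is taken from the named commit:

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"

echo "version 1" > mars.txt
git add mars.txt
git commit -q -m "First version"
echo "version 2" > mars.txt
git add mars.txt
git commit -q -m "Second version"

# Discard an unstaged edit: the file goes back to the last staged/committed state
echo "accidental edit" > mars.txt
git restore mars.txt
after_plain=$(cat mars.txt)

# Bring back the file as it was one commit earlier
git restore --source HEAD~1 mars.txt
after_source=$(cat mars.txt)

echo "git restore mars.txt                 -> $after_plain"
echo "git restore --source HEAD~1 mars.txt -> $after_source"
```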
You don't always want git to keep track of EVERY file in the repository (e.g., raw files, large files, etc.). You can tell git to ignore these files.
Create a new directory and files:
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
git status
The git status will show the new files.
To ignore them, make a .gitignore file.
nano .gitignore
Type:
*.dat
results/
git status
You also need to keep track of your .gitignore:
git add .gitignore
git commit -m "Ignore data files and the results folder"
git status
Take a look at your ignored files:
git status --ignored
You can override the ignore settings. Use -f to force add a file to the staging area.
git add -f a.dat
If you have a directory structure like this:
results/data
results/plots
You can ignore only one of the subdirectories by specifying it in .gitignore:
results/plots/
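When you are unsure whether a pattern does what you expect, git check-ignore reports which paths are ignored, and with -v which rule is responsible. A sketch using the same patterns in a scratch repository (the file names run1.csv and fig1.png are made up):

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

mkdir -p results/data results/plots
touch a.dat results/data/run1.csv results/plots/fig1.png

# One pattern per line: all .dat files, and only the plots subdirectory
printf '%s\n' '*.dat' 'results/plots/' > .gitignore

# Prints only the paths that are ignored
ignored=$(git check-ignore a.dat results/plots/fig1.png results/data/run1.csv)
echo "$ignored"

# -v also shows the .gitignore file, line number, and pattern that matched
git check-ignore -v a.dat
```

Here results/data/run1.csv is not printed, because nothing in .gitignore matches it.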
Log into GitHub and click create new repository. Click the create repository button.
Copy the SSH URL of your remote repository.
Go to your local directory:
git remote add origin git@github.com:vlad/planets.git
Check that it worked:
git remote -v
Check whether you already have an SSH key pair (private and public keys) on your computer:
ls -al ~/.ssh
Create an SSH key pair if you don't already have one. Choose whatever password you like for your passphrase. Write down and make sure to remember the passphrase.
ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"
If you type
ls -al ~/.ssh
the new key pair will show.
Copy the public key to GitHub
cat ~/.ssh/id_ed25519.pub
Copy the output and go to GitHub.
Click profile > settings > SSH keys
Add a new one.
Paste the public key.
Check that the key is set up on GitHub:
ssh -T git@github.com
Check you are still in your planets directory.
After authenticating your ssh key pair, let's push your changes from the local to remote repository.
git push origin main
If you go to your GitHub page, you should be able to see all the files that you committed and the version history of your repository, including the changes you made locally.
Practice adding a collaborator:
You should receive an email from the person inviting you to collaborate.
Create a new directory called collaboration:
cd .. # get out of your own planets directory
mkdir collaboration
cd collaboration
Clone your collaborator's repository from GitHub
git clone git@github.com:vlad/planets.git # creates a planets directory inside collaboration
Go to your collaborator's repository and create a new file
cd planets
nano pluto.txt
Commit your changes
git add pluto.txt
git commit -m "Add notes about Pluto"
And push to your collaborator's GitHub repository
git push origin main
If you go to the repository GitHub page, you can see your new file and your commit in the commit history.
However, your collaborator won't have the new file locally unless they pull from the remote repository.
git pull origin main
To download the updates without merging them into your files (so you can inspect them before git pull), you can use:
git fetch origin main
You can also git diff your local vs your remote:
git diff main origin/main
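The fetch-then-inspect workflow can be rehearsed without GitHub by letting a local bare repository play the role of the remote. Everything below is a sketch: the directory names (remote.git, owner, collaborator) and the collaborator's identity are hypothetical stand-ins.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A bare repository stands in for GitHub
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main

# Owner's copy: first commit, pushed to the stand-in remote
git init -q owner
cd owner
git symbolic-ref HEAD refs/heads/main
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git remote add origin "$workdir/remote.git"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"
git push -q origin main
cd ..

# Collaborator's copy: adds pluto.txt and pushes it
git clone -q "$workdir/remote.git" collaborator
cd collaborator
git config user.name "Wolfman"
git config user.email "wolfman@example.com"
echo "It is so a planet!" > pluto.txt
git add pluto.txt
git commit -q -m "Add notes about Pluto"
git push -q origin main
cd ../owner

# fetch updates origin/main but leaves main and the working files alone
git fetch -q origin main
incoming=$(git log --format=%s main..origin/main)   # commits we do not have yet
files_before=$(ls)

# pull = fetch + merge; --no-rebase asks for a plain merge
git pull -q --no-rebase origin main
files_after=$(ls | tr '\n' ' ')

echo "incoming commits:  $incoming"
echo "files after fetch: $files_before"
echo "files after pull:  $files_after"
```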
Modify the mars.txt file in your collaborator's repository.
nano mars.txt
Push change to GitHub
git add mars.txt
git commit -m "Add a line in our home copy"
git push origin main
If the owner also makes parallel changes in their own repository:
nano mars.txt
git add mars.txt
git commit -m "Add a line in my copy"
git push origin main
The push won't work because the remote has changes that your local copy does not.
You'll need to pull from the remote repository, which reports a conflict in mars.txt, and resolve the conflict:
git pull origin main
Edit the file, resolve the conflicts, and commit the changes.
The conflict will be indicated within the file between the <<<<<<< HEAD, =======, and >>>>>>> markers.
nano mars.txt
git add mars.txt
git commit -m "Merge changes from GitHub"
Remember to always git pull before you git push.
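The whole conflict cycle (rejected push, pull, markers, resolution) can be reproduced locally. The sketch below is a rehearsal under assumptions: a bare repository stands in for GitHub, and the directory names and the collaborator's identity are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare remote.git                  # stand-in for GitHub
git -C remote.git symbolic-ref HEAD refs/heads/main

git init -q owner
cd owner
git symbolic-ref HEAD refs/heads/main
git config user.name "Vlad Dracula"
git config user.email "vlad@tran.sylvan.ia"
git remote add origin "$workdir/remote.git"
echo "Cold and dry" > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base"
git push -q origin main
cd ..

# Collaborator changes mars.txt and pushes first
git clone -q "$workdir/remote.git" collaborator
cd collaborator
git config user.name "Wolfman"
git config user.email "wolfman@example.com"
echo "Line from collaborator" >> mars.txt
git add mars.txt
git commit -q -m "Add a line in our home copy"
git push -q origin main
cd ../owner

# Owner changes the same place in the file, unaware of the push above
echo "Line from owner" >> mars.txt
git add mars.txt
git commit -q -m "Add a line in my copy"
if git push -q origin main 2>/dev/null; then
    push_result="accepted"
else
    push_result="rejected"
fi

# Pulling merges the remote commit and stops at the conflict
git pull -q --no-rebase origin main || true
conflicted=$(git diff --name-only --diff-filter=U)
marker_lines=$(grep -c -E '^(<<<<<<<|=======|>>>>>>>)' mars.txt)

# Resolve by writing the version we want, then commit and push
printf '%s\n' "Cold and dry" "Line from owner" "Line from collaborator" > mars.txt
git add mars.txt
git commit -q -m "Merge changes from GitHub"
git push -q origin main
final_subject=$(git log -1 --format=%s)

echo "first push: $push_result, conflicted file: $conflicted, marker lines: $marker_lines"
```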
You can branch off the main version of the repository into your own space, where you can commit and push freely. Your collaborator can do the same on their own branch. Then, when you want to merge those branches into main (i.e., make the changes part of the main line), you can review the merge, check whether it has conflicts, and approve it if it works.
Data organization is an ongoing process!
Setup: Open terminal, a text editor of choice, "gapminderDataFiveYear_superDirty data.xlsx" file, create a new directory to work in
You just started a new job/a new rotation, someone hands you this data file to analyze.
#Let's look at the data
head "gapminderDataFiveYear_superDirty data.xlsx" # quotes are needed because the file name contains a space
We can't tell
We'll look into how to store data for yourself and others:
Here are some characteristics of files you should pay attention to:
Basic intuitive project directory:
Key points:
Numbered prefixes (e.g. 01_file.txt, 02_file.txt)
Date prefixes (e.g. 2023-08-24_file.txt, 2023-08-25_file.txt)
#Inside our working directory that we just created
#Creating data dir with orig and raw subdir
mkdir data
mkdir data/original
mkdir data/raw
mv gapminder* data/original #moving data files to data/original
cd data/original
ls
#gapminderDataFiveYear_superDirty.xlsx
#gapminderDataFiveYear_superDirty.txt
chmod 444 gapminder* #removes write and execute access at all permission levels
ls -l #checking file permissions -r--r--r--@
cd .. # going up one level, to the data directory
nano README.txt
#Inside our README.txt>>
#Project Name: UPGG Bootcamp Data Organization
#Date: 2023-08-25
#Email/Contact info:Emiliano Sotelo-jemilianos@gmail.com
#Data downloaded from: https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/setup/#:~:text=gapminderDataFiveYear_superDirty%20data.xlsx
#Goal is to learn how to organize our data projects.
You might want to leave your personal email as a contact instead of your Duke email: your README might outlast your access to the Duke account, and you will most likely keep your personal email much longer.
To avoid modifying the original data, we should make a copy and work from there:
mkdir cleaned #in same dir as original
cp original/gapminder*.xlsx cleaned/.
cd cleaned
chmod 777 gapminder*.xlsx #gives all the permissions
open gapminder*xlsx # this will open the file in excel
nano README.txt
#Inside README.txt>>
#-Data cleaning steps for gapminder:
#-Removed 5th row
#Better to write out a script to modify our data than to manually clean/edit our data
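The effect of the permission changes can be checked with ls -l. The sketch below uses a hypothetical stand-in file (example.csv) in a scratch directory, and shows chmod u+w as a narrower alternative to chmod 777 for the working copy, since it adds write access only for the file's owner:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p data/original data/cleaned
echo "country,year,pop" > data/original/example.csv   # stand-in for the real data file

# Read-only for everyone: protects the original from accidental edits
chmod 444 data/original/example.csv
original_mode=$(ls -l data/original/example.csv | cut -c1-10)

# The copy inherits the read-only mode, so give the owner write access back
cp data/original/example.csv data/cleaned/
chmod u+w data/cleaned/example.csv
cleaned_owner_perms=$(ls -l data/cleaned/example.csv | cut -c2-3)

echo "original: $original_mode"
echo "cleaned copy, owner permissions: $cleaned_owner_perms"
```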
After we've cleaned the data in the copied file, let's make some output directories to sort our files:
mkdir code
mkdir output
mkdir doc
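For a new project, the whole skeleton can be created in one go with mkdir -p, which also creates any missing parent directories. This is one possible layout, assuming a project root called my_project (a made-up name) that holds the directories used in these notes:

```shell
project="$(mktemp -d)/my_project"   # hypothetical project root in a scratch location

# -p creates parent directories as needed and does not complain if they exist
mkdir -p "$project/data/original" \
         "$project/data/cleaned" \
         "$project/code" \
         "$project/output" \
         "$project/doc"
touch "$project/README.txt"

# List what was created, relative to the project root
(cd "$project" && find . -mindepth 1 | sort)
```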
Here's an example of a template for an analysis that you can adapt: https://github.com/jemilianosf/template_analysis.
You already have all the materials needed for this last lesson, but it will apply to any other work as well.
How do we share work with other people?
Running code is harder than displaying code. To run code you need:
Binder provides both.
Example binder link:
https://mybinder.org/v2/gh/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages?filepath=notebooks%2FData_exploration_run.ipynb
You can execute each chunk of code on a "live" notebook in the link above.
Go to your GitHub repository, copy the https link, and paste it into https://mybinder.org/ under "GitHub repository name or URL". Paste the path to your notebook under "Path to a notebook file (optional)".
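The link that mybinder.org generates follows a regular pattern, so it can also be assembled by hand. The sketch below infers the pattern from the example link above (GitHub user or organization, repository, branch, then the notebook path with "/" written as %2F); the variable names are just for illustration:

```shell
gh_user="Reproducible-Science-Curriculum"
gh_repo="data-exploration-RR-Jupyter"
branch="gh-pages"
notebook="notebooks/Data_exploration_run.ipynb"

# "/" in the notebook path is percent-encoded as %2F inside the query string
encoded_path=$(printf '%s' "$notebook" | sed 's|/|%2F|g')

binder_url="https://mybinder.org/v2/gh/${gh_user}/${gh_repo}/${branch}?filepath=${encoded_path}"
echo "$binder_url"
```

Swap in your own user name, repository, branch, and notebook path to get a shareable link for your project.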
Mybinder creates a container, which works like a very lightweight virtual machine running on the host hardware.
After creating your notebook execution environment, you get a URL that you can share with others.
If you only have simple code that uses standard Python, it should run without extra steps. But if you have a more complex project, you need to tell Binder what your dependencies are (what packages to import).
Go to your GitHub repository and create a new file called requirements.txt. You can add dependencies like pandas, numpy, matplotlib, seaborn.
Binder will recognize your requirements.txt and load those packages in your notebook's container.
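A requirements.txt is a plain text file with one package per line, optionally pinned to an exact version with ==. The sketch below writes one from the command line; the version numbers are only illustrative, so use the versions you actually ran your analysis with (pip freeze lists them):

```shell
workdir=$(mktemp -d)
cd "$workdir"

# One package per line; "==" pins an exact version (illustrative versions)
cat > requirements.txt <<'EOF'
pandas==2.0.3
numpy==1.25.2
matplotlib==3.7.2
seaborn==0.12.2
EOF

cat requirements.txt
pinned_count=$(grep -c '==' requirements.txt)
```

Pinning versions means the container is more likely to behave the same way when someone opens your notebook later.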
You won't be able to host data on binder. If your data is large, you won't be able to save on GitHub either. You can use an external data repository, and get links to your data that you can refer to inside of your notebook.
Hugging Face is a service to store and document large datasets. Here's an example:
https://huggingface.co/imageomics/BGNN-trait-segmentation
Note: Python gets updated frequently, and it's something to be aware of. Be explicit about the versions of Python and packages you use.
FYI initiative to build reproducible containers:
https://codeocean.com/