Reproducible Research 2 - Git

Wojciech Hardy

link: https://hackmd.io/@WHardy/RR24-git1


So what's Git? A Version Control System.

  • VCS tracks the history of changes (e.g. within a folder).
  • The history includes: what was done, who did it, when, why, etc.
  • Teams can collaborate on a project and recover a previous version if necessary.
  • More sophisticated workflows include code review steps, automated testing, etc.
  • See: Git handbook

Principles

  • There's a central repository - the one predefined source of truth.
  • You start by grabbing the most up-to-date version from the central repo.
  • Work is split into increments (called commits)
  • Git gives you tools to resolve file conflicts, etc.

How does it work? An example

The project is stored in a central repo.


Contributor 1 grabs the most recent version.


Contributor 1 does some new work locally.


Contributor 1 checks if central version changed in the meantime.


Contributor 1 puts their changes in the central repo.


Central repo now stores the new step on top of the previous one.


Contributor 2 joins in and goes through the same steps.
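
The same round trip can be sketched on a single machine. This is only an illustration under stated assumptions: a bare repository in a scratch directory stands in for the central repo, and the names central.git, contributor1, contributor2 and notes.txt are made up for the example. It uses git clone and git push simply to complete the picture.

```shell
set -e
work=$(mktemp -d)                      # scratch space for the demo
cd "$work"
git init --quiet --bare central.git    # stand-in for the central repo

# Contributor 1 grabs the most recent version (still empty at this point)
git clone --quiet central.git contributor1
cd contributor1
git config user.name "Contributor One"     # placeholder identity
git config user.email "one@example.com"

# Contributor 1 does some new work locally and records it as a commit
echo "first step" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes.txt"

# Contributor 1 puts the changes in the central repo
git push --quiet origin HEAD
cd ..

# Contributor 2 joins in and starts from the new state
git clone --quiet central.git contributor2
ls contributor2
```

Cloning the empty repository prints a warning, which is harmless here.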


This can go on and on.


And involve a lot more people.


How is this helpful?

  • VCS ensures we don't mess anything up permanently.
  • We can use it to collaborate.
  • We can use it to test and develop new versions (branching) without interfering with the working one.
  • We can use it to store our outputs, share them with the public and allow others to contribute (e.g. update our code).
  • We can roll back to a previous version whenever we need.

Git general info

  • Open source distributed version control system
    • Unlike the once-popular centralized VCSs, DVCSs like Git don’t need a constant connection to a central repository
  • Created in 2005 by Linus Torvalds for the Linux kernel project
  • Moderately difficult to learn, very difficult to master
  • Became so popular that it has largely displaced other tools (SVN, Mercurial, CVS)

Before we start: let's install Git!

(Check if we have it?)

  • git --version
    (on macOS, this might prompt installation if you don't have Git)
  • which git (Linux, macOS, Git Bash)
  • where git (Windows Command Prompt)
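
A small sketch of that check for a POSIX shell (Git Bash, macOS, Linux). It uses command -v, the portable counterpart of which; the variable name git_status is just for illustration.

```shell
# Report whether git is on the PATH and, if so, which version it is
if command -v git >/dev/null 2>&1; then
    git_status="installed: $(git --version)"
else
    git_status="not found"
fi
echo "$git_status"
```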

If not there: installation

  • Linux:
    • $ sudo apt install git-all
  • Windows:
  • macOS:
    • git --version might prompt installation in newer OS versions
    • $ xcode-select --install if you don't have Xcode
    • $ brew install git using Homebrew
    • or get the binary installer

Also check here


Why "Git"? You can actually pick (citing Wikipedia):

Torvalds: "I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'."

The man page describes Git as "the stupid content tracker".

The read-me file of the source code: "git" can mean anything, depending on your mood.

  • Random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of "get" may or may not be relevant.
  • Stupid. Contemptible and despicable. Simple. Take your pick from the dictionary of slang.
  • "Global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
  • "Goddamn idiotic truckload of sh*t": when it breaks.

The source code for Git refers to the program as, "the information manager from hell."


Ouch! In Git's defense

  • It actually works!
  • It is said that Git replaced the older solutions because the latter were even worse
  • and its flexibility makes it easier to adopt more modern project workflows (scrum/agile, etc.)
  • Git has both CLI and visual plugins in IDEs (integrated development environments) and integration with various apps, services, etc.
  • A lot of firms build on some Git implementation.

Popularity of different VCS among developers

Source: Stack Overflow, Developer Survey Results 2017


Popularity of collaboration tools among devs


Source: Stack Overflow, Developer Survey Results 2020 [44,328 responses, select all that apply]


Extensively used tools over last year + intention to continue


Source: Stack Overflow, Developer Survey Results 2021
(30% of those not working with Git say they'd like to)


Let's take a look at a simple project in Git

  • repository: the entire collection of the project's files and folders,
    along with its version history (including alternative versions). Short: "repo"
  • commit: basic unit of work (here as a circle)
  • branch: set of units of work (downward line)
  • main (/master): usual name for the main repo branch (production branch)
  • develop here is just a name given to the branch with developed features. You pick the branch names.
  • merge/rebase are used to combine two branches


CLI vs GUI

  • Today we will work with Git through its CLI (Git Bash on Windows) to become familiar with the basic commands


Source: XKCD.


First though, some basic terminal commands

Terminal:

  • cd pathname for navigation (cd .. to go up one level)
  • mkdir foldername to create a new folder
  • dir to list files/folders in the current path (ls in macOS/Linux/bash)
  • echo text to print a message
  • > is a redirection indicator
  • >> for redirection with appending
  • echo text > file to print a text to a file (and overwrite it)
  • echo text >> file to add new lines to a file. See also man echo
  • touch file to create a new, empty file; Note: if the file already exists, touch only updates its modification time and leaves the contents alone
  • cat file to display the contents of a file
  • diff file1 file2 to show the differences between two files
  • rm to remove a file and rmdir to remove an empty folder
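
A short sketch that strings these commands together in a scratch directory (bash syntax; the folder and file names are made up for the illustration):

```shell
set -e
cd "$(mktemp -d)"                   # throwaway location
mkdir demo
cd demo

echo "first line" > notes.txt       # > creates the file (or overwrites it)
echo "second line" >> notes.txt     # >> appends a new line
echo "first line" > other.txt
touch empty.txt                     # new, empty file

cat notes.txt                       # show the file contents
diff notes.txt other.txt || true    # diff exits with status 1 when files differ
rm other.txt                        # remove a file
ls
```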

Exercise 0: terminal (go down for more)

  1. Open up the terminal
  2. Pick a space for your reproducible research materials and navigate there

E.g. you can use cd <pathname> for navigation or do "Git Bash here" in Windows.


  3. Create a new folder named RR_git1

mkdir RR_git1


  4. Navigate to your new folder

cd RR_git1


  5. Create a classes.txt file with a line with today's date.

echo "3/6/2024" > classes.txt


Great! Now that it's covered let's look at Git diagnostics

(also see standard command line options)

  • git, git --help to display git inline help
  • git [cmd] --help to display the manual page for cmd (Git for Windows opens it in the browser)
  • git --version to display diagnostic info (version)
  • git status to display local repository status
  • git log to display history of commits

These commands do not alter anything. Feel free to use them frequently to verify and understand the results of your actions.


Basic git commands - setting up the repository

git init [repo_name] to initialize an empty repository in the current [or specified] directory
git clone [repo_name] [clone_name] to create a linked copy of a repository

git config -l to view all configuration options

Config structure: git config [-l] [--scope] [option_name] [value]

There are three levels of configuration (i.e. scope):
  • --system - applies to the repositories of all users on the system
  • --global - applies to all repositories of the current user, overrides system settings
  • --local (default when setting a value) - applies to the current repository only, overrides global settings

Note: global configuration will be visible only if you've used Git before (and added some options)
Note 2: local configuration will be visible only if we're in a Git repository
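
The override order can be checked with a sketch like the one below. It is an illustration, not part of the exercises: scratch_git is a hypothetical helper that points git at a throwaway home directory, so the --global value lands in a scratch file instead of your real configuration.

```shell
set -e
scratch=$(mktemp -d)
mkdir -p "$scratch/home"

# Hypothetical helper: run git with a throwaway home directory
scratch_git() {
    HOME="$scratch/home" XDG_CONFIG_HOME="$scratch/home/.config" git "$@"
}

scratch_git init --quiet "$scratch/repo"
cd "$scratch/repo"

scratch_git config --global user.name "Name Surname"   # user-wide value
scratch_git config --local user.name "AB"              # this repository only

global_name=$(scratch_git config --global user.name)
effective_name=$(scratch_git config user.name)         # what git will actually use
echo "global:    $global_name"
echo "effective: $effective_name"
```

Inside the repository the local value wins; in any other repository of the same user, the global one would apply.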


Exercise 1: creating a repository (go down for more)

  1. In your RR_git1 directory, initiate a git repository named EX1 and enter it

(hint: you can either initiate it with that name, or create a folder named EX1, enter it, and initiate the repository inside)

git init EX1
cd EX1

or

mkdir EX1
cd EX1
git init


  2. List all available configuration options.

git config -l


  3. List all global options

git config -l --global

(Note: this will only work if you've ever changed any global options)


  4. List all local options

git config -l --local


  5. Set global option 'user.name' to your name

git config --global user.name "Name Surname"


  6. Set global option 'user.email' to your e-mail

git config --global user.email "your.email@smth.smth"


  7. List all global options, check the difference

git config -l --global


  8. Set local option 'user.name' to your initials

git config --local user.name "AB"


  9. List all local options

git config -l --local


The three git states

Unlike most other VCSs, Git has something called the "staging area" or "index". This is an intermediate area where a commit can be assembled and reviewed before it is completed.

Source for this and following slides: https://git-scm.com/


The three trees

See here for a detailed description
And think of a tree as an ordered collection of files.

| Tree              | Role                          |
|-------------------|-------------------------------|
| HEAD              | Last commit snapshot          |
| Index             | Proposed next commit snapshot |
| Working directory | Sandbox                       |

HEAD is a snapshot of your last commit on a given branch.

If you want to see what that snapshot looks like, simply type:

$ git cat-file -p HEAD

If you recall the branch graph, this is the latest commit on the branch.
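
A minimal sketch of inspecting HEAD in a scratch repository (the repository name, file name and identity are placeholders). The first command prints the commit object itself; HEAD^{tree} is standard revision syntax for the file listing that commit points to.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet head_demo
cd head_demo
git config user.name "AB"                  # placeholder identity for the demo
git config user.email "ab@example.com"

echo "one line" > README.md
git add README.md
git commit --quiet -m "Add README.md"

head_type=$(git cat-file -t HEAD)          # type of the object HEAD resolves to
echo "HEAD is a $head_type"
git cat-file -p HEAD                       # tree id, author, committer, message
git cat-file -p 'HEAD^{tree}'              # the files stored in that snapshot
```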


Staging area aka Index

The index is your proposed next commit.

Command that shows you what your index currently holds (also check the options, e.g. -o lists untracked files in the working directory):
git ls-files

It's a box where you put the files you'd like to include in your next commit (sort of a work-in-progress, not-yet-committed commit)
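
As a rough illustration, the sketch below stages one file and leaves another untouched, then asks git ls-files for each group. The repository and file names are invented for the example.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet index_demo
cd index_demo

echo "tracked" > README.md
echo "not staged" > scratch.txt
git add README.md                   # only README.md goes into the index

in_index=$(git ls-files)            # files the index currently holds
untracked=$(git ls-files -o)        # files git is not tracking
echo "index:     $in_index"
echo "untracked: $untracked"
```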


Working directory

Think of the working directory as a sandbox, where you can try changes out before sending them to your staging area (index) and then to history.


Ok, so let's examine this process

#0 At this point, only the working directory tree has any content.


#1 We use git add to take content in the working directory and copy it to the index.

(Note that we keep all boxes. Two currently store the same information)


#2 We use git commit, which takes the contents of the index and saves it as a permanent snapshot, creates a commit object which points to that snapshot, and updates master to point to that commit.

(In general terms, there's now a new commit on the main branch, and there's an indicator pointing to it saying "hey, this is where we're at".)


#3 If we run git status, we’ll see no changes, because all three trees are the same.
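
Steps #0 to #3 can be replayed in a scratch repository. This sketch uses git status --porcelain, a terse, script-friendly form of git status; the repository name and identity are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet trees_demo
cd trees_demo
git config user.name "AB"                  # placeholder identity for the demo
git config user.email "ab@example.com"

echo "one line" > README.md
step0=$(git status --porcelain)            # #0: only the working directory has the file
git add README.md
step1=$(git status --porcelain)            # #1: the file is copied to the index
git commit --quiet -m "Add README.md"
step2=$(git status --porcelain)            # #2/#3: all three trees are the same

printf 'after edit:   [%s]\nafter add:    [%s]\nafter commit: [%s]\n' \
    "$step0" "$step1" "$step2"
```

The ?? marker means untracked, A means added to the index, and an empty result means there is nothing to report.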


Git commands: the workflow

git add [filename(s)] to add files to the staging area
git add . to add all new/modified files to the staging area
git commit -m "<commit description>" to create a new commit with what's in the staging area

At any point you can:
git status to verify where you are, and what are the differences between the three trees
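git diff to compare the working directory with the staging area (git diff --staged compares the staging area with the last commit; git diff HEAD compares the working directory with the last commit)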
git log to view the commit history
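
Which pair of trees git diff compares depends on how it is called. The sketch below, with placeholder names, puts one change in the staging area and one only in the working directory, then counts the added lines each variant reports.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet diff_demo
cd diff_demo
git config user.name "AB"                  # placeholder identity for the demo
git config user.email "ab@example.com"

echo "one line" > README.md
git add README.md
git commit --quiet -m "Add README.md"

echo "a second line" >> README.md
git add README.md                          # second line is staged
echo "a third line" >> README.md           # third line is only in the working directory

unstaged=$(git diff | grep -c '^+a ')           # working directory vs index
staged=$(git diff --staged | grep -c '^+a ')    # index vs last commit
total=$(git diff HEAD | grep -c '^+a ')         # working directory vs last commit
echo "unstaged: $unstaged, staged: $staged, total: $total"
```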


Exercise 2: adding commits (go down for more)

  1. In your RR_git1 directory, initiate a git repository named EX2

cd ..
git init EX2

(check git status and git diff to get a better feel of this)


  2. Go inside the new repository.

cd EX2


  3. Create a file named README.md, add a single line of text inside, save the file [hint: you can use echo or create it manually with e.g. Notepad]

touch README.md
echo "one line" >> README.md

(check git status and git diff to get a better feel of this)


  4. Stage the new file.

git add README.md

(check git status and git diff to get a better feel of this)


  5. Commit the file (remember to include a helpful commit description!)

git commit -m "Added README.md with one line of text"

(check git status and git diff and git log to get a better feel of this)


Exercise 3: adding commits (go down for more)

  1. Add another line of text to the file you created.

echo "a second line" >> README.md


  2. Create a new file named "readme.txt".

touch readme.txt


  3. Create an empty folder named "data"

mkdir data


  4. Run the repository diagnostics as above.

git status


  5. Stage and commit the modified file and the new file.

git add .
git commit -m "Modified README.md and added readme.txt"


  6. Check git log, etc. again.

git log
git status


Exercise 4: using .gitignore

  1. Create data/data1.csv file and fill it with a random data line (can be just comma-separated text, it doesn't matter), check status and diff

printf "var1,var2\n1,2\n" > data/data1.csv
git status


  2. Create a .gitignore file (yes, starting with a dot), put the word 'data' inside (it's the name of our directory), check status and diff

touch .gitignore
echo "data" >> .gitignore
git status

.gitignore is a file that tells Git which files and folders to ignore. Should we commit it? That depends on the workflow and, e.g., on who we're working with (we might not want to share it with collaborators).
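
The effect can be seen in a scratch repository. The sketch below mirrors the exercise with placeholder names and also uses git check-ignore, which reports whether a path matches an ignore rule.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet ignore_demo
cd ignore_demo

mkdir data
printf 'var1,var2\n1,2\n' > data/data1.csv
before=$(git status --porcelain)           # the data folder shows up as untracked

echo "data" > .gitignore
after=$(git status --porcelain)            # now only .gitignore is reported

ignored=no
if git check-ignore -q data/data1.csv; then
    ignored=yes                            # the path matches a rule in .gitignore
fi
printf 'before: [%s]\nafter:  [%s]\nignored: %s\n' "$before" "$after" "$ignored"
```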


Assignment

While in your EX2 repository, run these four commands:
git status
git log
git ls-files
git ls-files -o

Copy the contents of Bash/CLI starting from git status and send them to me in a notepad file (wojciechhardy@uw.edu.pl).

Try to store your files in a safe place so we can pick up where we left off next time.

Hint: if you simply copy the folder to a pendrive or similar, the repository will continue working (everything you need is already inside!)
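
A quick way to convince yourself: the history lives in the hidden .git folder, so a plain copy carries it along. In this sketch EX_demo and EX_backup are made-up names, and the copy stays on the same disk instead of going to a pendrive.

```shell
set -e
cd "$(mktemp -d)"
git init --quiet EX_demo
cd EX_demo
git config user.name "AB"                  # placeholder identity for the demo
git config user.email "ab@example.com"
echo "one line" > README.md
git add README.md
git commit --quiet -m "Add README.md"
cd ..

cp -R EX_demo EX_backup                    # plain folder copy, .git included

original_log=$(git -C EX_demo log --oneline)
backup_log=$(git -C EX_backup log --oneline)
echo "original: $original_log"
echo "backup:   $backup_log"
```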


Stuck in VIM?

If you forgot to add a message to your commit, you might have ended up in Vim. It's a free text editor that sometimes feels like a trap.

Tl;dr: hit [ESC], then type :q and press Enter (use :q! if Vim refuses to quit because of unsaved changes).
Repeat your commit with a helpful description.

You can also type the commit message in Vim and exit with :wq instead, which completes the commit with that message.

See more in this helpful Stack Overflow answer.


Read more on the three trees with the git reset guideline

Cheat sheet 1 (Atlassian)

Git-scm in general

Atlassian in general

If you need more, just Google tutorials/blog posts/YouTube videos until you find one that makes it clear :) Lots to choose from!

