Introduction to Git - Fall 2020

Lecture 1: Why use version control?

Slides: https://hackmd.io/@hpc2n-git-2020/L1-motivation


What is version control?


In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections of information. - Wikipedia


Version control systems (VCS)

systems responsible for managing changes


Why use version control?


In an ideal world, things develop linearly:

  • Every new version is an improvement upon the previous version.
    • No need to backtrack.
  • Everyone knows what everyone else is doing.
  • In the end, things are simply finished.
digraph {
  rankdir=LR
  Mon[label="Monday's\n improvements"] [fixedsize=circle]
  Tue[label="Tuesday's\n improvements"] [fixedsize=circle]
  Wed[label="Wednesday's\n improvements"] [fixedsize=circle]
  Mon -> Tue
  Tue -> Wed
}

In the real world, things develop non-linearly:

  • A new version can be anything between
    • a complete catastrophe and
    • a major breakthrough.
  • People do not know what others are doing
  • Sometimes we are simply fixing earlier mistakes
digraph {
  rankdir=LR
  Mon[label="Monday's\n improvements"] [fixedsize=circle]
  Tue[label="Tuesday's\n mistakes"] [fixedsize=circle]
  Wed[label="Wednesday's\n corrections"] [fixedsize=circle]
  Mon -> Tue
  Tue -> Wed
}

Going back to an earlier version

Sometimes, it is easier to simply backtrack to an earlier version

digraph {
  rankdir=LR
  Mon[label="Monday's\n improvements"] [fixedsize=circle]
  Tue[label="Tuesday's\n mistakes"] [fixedsize=circle]
  Wed[label="Wednesday's\n improvements"] [fixedsize=circle]
  Mon -> Tue
  Mon -> Wed
}

Where is this earlier version?

  • CTRL + Z
  • my_file.txt, my_file.txt.old,
  • My project/
    • 2020-08-12/
    • 2020-08-13/
  • Daily home directory backup

Challenges and obstacles

  • Prone to mistakes
    • CTRL + Z has limits, overwritten/deleted files, human/hardware error
  • How much to save?
    • Individual files? Everything? How much space is required?
  • How to organize versions?
    • What is the difference between different versions?

Overall, difficult to manage!


What about the granularity?

digraph {
  rankdir=LR

  subgraph cluster1 {
    t1a [label="Component A\n improvement"] [fixedsize=circle]
    t1b [label="Component B\n mistake"] [fixedsize=circle]
    t1c [label="Component C\n improvement"] [fixedsize=circle]
    label="Monday's changes"
  }

  subgraph cluster2 {
    t2a [label="Component A\n improvement"] [fixedsize=circle]
    t2b [label="Component B\n correction"] [fixedsize=circle]
    t2c [label="Component C\n mistake"] [fixedsize=circle]
    label="Tuesday's changes"
  }

  subgraph cluster3 {
    t3a [label="Component A\n mistake"] [fixedsize=circle]
    t3b [label="Component B\n improvement"] [fixedsize=circle]
    t3c [label="Component C\n correction"] [fixedsize=circle]
    label="Wednesday's changes"
  }
  
  t1a -> t2a
  t1b -> t2b
  t1c -> t2c

  t2a -> t3a
  t2b -> t3b
  t2c -> t3c
}

This compounds the problems!


How does a VCS solve this?

  • Stores the history using snapshots (commits)
    • Each snapshot represents the project at a given point in time
  • Manages snapshots and associated metadata
    • Naming (tags), comments, dates, authors, etc.
  • Easy to move between different snapshots
  • Can handle different degrees of granularity
  • Can handle multiple development paths (branches)
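
The snapshot idea above can be sketched with a toy in-memory model. This is only an illustration, not how Git or any real VCS is implemented; the names `Snapshot`, `ToyRepository`, `commit`, `tag` and `checkout` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """One commit: the full project state plus its metadata."""
    files: dict        # file name -> file content
    author: str
    comment: str
    date: datetime


@dataclass
class ToyRepository:
    """Hypothetical in-memory history (an assumption for illustration)."""
    snapshots: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)   # tag name -> snapshot index

    def commit(self, files, author, comment):
        # Copy the files so that later edits cannot alter the history.
        snap = Snapshot(dict(files), author, comment, datetime.now(timezone.utc))
        self.snapshots.append(snap)
        return len(self.snapshots) - 1

    def tag(self, name, index):
        # Give a snapshot a human-readable name.
        self.tags[name] = index

    def checkout(self, ref):
        # Accept either a tag name or a snapshot index.
        index = self.tags[ref] if isinstance(ref, str) else ref
        return dict(self.snapshots[index].files)


repo = ToyRepository()
work = {"main.c": "int main() { return 0; }"}
monday = repo.commit(work, "alice", "Monday's improvements")
repo.tag("v0.1", monday)

work["main.c"] = "int main() { return 1; }"
repo.commit(work, "bob", "Tuesday's mistakes")

# Backtrack to the tagged snapshot.
restored = repo.checkout("v0.1")
```

Each call to `commit` records the whole project state together with the author, comment and date, so returning to Monday's version is a lookup rather than a hunt through `.old` files.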

Comparing and joining

  • VCS makes it easy to compare different snapshots
    • Named revisions, comments, time information, author information
    • Diff tools
    • Search tools
    • Bisection search
  • VCS also allows the joining (merging) of different snapshots
    • Easy to experiment with ideas

Collaboration

  • One of the primary functions of VCS is to allow collaboration
  • Usual setup: server (remote) + multiple clients
    • People work locally and send (push) the changes to the server
    • VCS keeps track of what has been done and by whom
  • Safer since mistakes can be easily remedied
  • The contributions of several people can be merged

Backup

  • A VCS functions as a backup
  • Locally, the system maintains a copy of each file
    • Usually only the changes or the files that have changed are stored
  • Globally, lost files can be recovered from the server

Integration

  • VCSs such as Git have been integrated with several services
    • HackMD, Overleaf,
  • Services such as GitHub can do almost everything for you
    • Store history, distribute, testing / continuous integration, bug reports, milestones, website,

Practical use cases

What are the practical use cases for VCS?


Source code

  • Many VCSs are designed for managing source code
  • Manage deployment (production, development, testing, etc.)
  • Manage published versions (v0.1 etc.)
  • Manage (experimental) features
  • Bug hunting

LaTeX files

  • Track which version of a manuscript has been
    • submitted,
    • revised and/or
    • accepted
  • Collaboration between several authors

HPC: batch files and data

  • Track different versions of your batch scripts
    • Easy to check the used configuration afterwards
  • Track input and output files
    • Limited to smallish files
