---
tags: devtools2022
---

# 04-Advanced Git Part 2

Learning more about `git` internals so that we can use them to our advantage.

## Schedule and Learning Objectives

* **1530-1600**: Git Internal Objects and Pruning
* **1600-1630**: Managing dotfiles, packing submodules
* **1630-1700**: Discussion, Q&A

## Prune Dangling Commits

When we do a `hard reset`, it may appear as though we have lost or deleted the commit, but Git is very strict about not deleting history. You can find all dangling commits using:

```
git fsck --lost-found
```

![](https://i.imgur.com/XTHSAms.png)

You can **confirm** that it is still there because you can `checkout` that commit:

![](https://i.imgur.com/AHtvnPR.png)

The command:

```
git prune --dry-run --verbose
```

will display what is set to be pruned without actually pruning it.

![](https://i.imgur.com/lQXyNhO.png)

## `git` internal objects

1. **blob**: the **contents** of files are stored in objects called blobs (binary large objects). Unlike a file, a blob carries no metadata such as a name or timestamps.
   > Every blob in git is identified by its SHA-1 hash. SHA-1 hashes consist of 20 bytes, usually represented by 40 characters in hexadecimal form.
2. **tree**: a tree is basically a directory listing in git, referring to blobs as well as other trees.
3. **commit**: a **snapshot** object that includes a pointer to the main `tree` (like the **root** dir), and other metadata: committer, message, time.

For example, look at the diagram below (taken from [here](https://www.freecodecamp.org/news/git-internals-objects-branches-create-repo/)):

![](https://i.imgur.com/Evu6Okb.png)

The diagram above is equivalent to a file system with a root directory that has one file at `/test.js`, and a **directory** named `/docs` with two files: `/docs/pic.png` and `/docs/1.txt`.

### Commit Data

Doesn’t that mean that we have to store a lot of data every commit? Let’s examine what happens if we **change** the contents of a file.
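To see why content alone determines a blob's identity, here is a minimal sketch that recomputes a blob ID outside of Git. It assumes a repository using SHA-1 object names and that `sha1sum` (GNU coreutils) is installed; `blob_id` is a hypothetical helper name, not a Git command. The sample strings are the `1.txt` contents used in this section.

```shell
# Sketch: derive a blob ID the way Git does, using only standard tools.
# `blob_id` is a made-up helper; it reads the file content from stdin.
blob_id() {
  tmp=$(mktemp)
  cat > "$tmp"                          # buffer stdin so we can measure it
  size=$(wc -c < "$tmp" | tr -d ' ')    # content length in bytes
  # Git hashes the header "blob <size>" + NUL byte, followed by the raw content.
  { printf 'blob %s\000' "$size"; cat "$tmp"; } | sha1sum | cut -d' ' -f1
  rm -f "$tmp"
}

before=$(printf 'HELLO WORLD' | blob_id)
after=$(printf 'HELLO WORLD!' | blob_id)
again=$(printf 'HELLO WORLD' | blob_id)

echo "HELLO WORLD  -> $before"
echo "HELLO WORLD! -> $after"    # one extra character, a different ID
echo "HELLO WORLD  -> $again"    # same content, same ID as the first line
```

If `git` is available, `printf 'HELLO WORLD' | git hash-object --stdin` should print the same ID as the first line.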
Say that we edit `1.txt` and add an **exclamation** mark; that is, we change the content from `HELLO WORLD` to `HELLO WORLD!`. This results in a **new** blob (with a new hash). The **tree** pointing to the hello-world blob now has different content, and therefore a different hash as well. In fact, this change trickles **up** all the way to the root tree.

![](https://i.imgur.com/jCwYhVn.png)

When a new commit is made, it simply **references** the blobs that did not change.

> Since the blobs consist of the same data, they’ll have the **same** SHA-1 values. No need to copy them for the new commit (snapshot).

![](https://i.imgur.com/pcmE5lT.png)

### Branch and HEAD

A branch is nothing more than a **named** reference to a commit.

> Typically, the branch points to the latest commit in the line of development we are currently working on.

How does git know what branch we’re currently on? It keeps a special pointer called HEAD. Usually, HEAD points to a branch, which in turn points to a commit.

> In some cases, HEAD can also point to a commit directly, but we won’t focus on that.

![](https://i.imgur.com/p3ujWj0.png)

### Staging Area

After we make some changes, we want to record them in our **repository**. A repository (in short: repo) is a **collection** of commits, each of which is an **archive** of what the project’s working tree looked like at a past date, whether on our machine or someone else’s.

> A repository also includes things other than our code files, such as HEAD, branches, and so on.

Unlike other, similar tools you may have used, git does not commit changes from the working tree directly into the repository. Instead, changes are first registered in something called the index, or the **staging area**.

![](https://i.imgur.com/AdZ36ez.png)

Files in our working directory can be in one of two states: **tracked** or **untracked**.

## Resume Prune

Git commits can become **inaccessible** when performing history-altering commands like `git reset` or `git rebase`.
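To make "inaccessible" concrete, the following sketch drops a commit with a hard reset in a throwaway repository and then shows that the object is still stored. The repository path, file name, commit messages, and demo identity are all made up for illustration. One detail worth knowing: `git fsck` treats reflog entries as references by default, so the sketch passes `--no-reflogs` to have a freshly reset commit reported as dangling.

```shell
# Sketch: make a commit unreachable, then show that Git still has it.
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Identity used for the demo commits only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"

lost=$(git rev-parse HEAD)      # remember the commit we are about to drop
git reset -q --hard HEAD~1      # the branch now points at "first" again

# The object is still in the object database...
kind=$(git cat-file -t "$lost")
echo "object type: $kind"

# ...but it is no longer part of the branch history.
reachable=$(git rev-list HEAD | grep -c "$lost" || true)
echo "times seen in history: $reachable"

# Ignore the reflog so the dropped commit is reported as dangling.
git fsck --no-reflogs --lost-found
```

Because the reflog still references the dropped commit, `git prune` will not remove it until that reflog entry expires.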
In an effort to preserve history and avoid data loss, Git **will not** delete detached commits. A detached commit can still be checked out, cherry-picked, and examined through the git log.

To conclude our git prune simulation demo, we can just run `git prune` again without `--dry-run`. However, `git prune` is intended to be invoked as a **child** command of `git gc` (garbage collection). We don't really need it on a day-to-day basis, since Git usually runs `git gc` automatically as part of regular commands like `pull`, `merge`, `rebase`, and `commit`.

## Managing dotfiles

What are dotfiles?
- Shell configuration files (`.zshrc`, `.bashrc`)
- Git configuration files (`.gitconfig`)
- Basically, configuration files for programs living on your system
- Sometimes they are placed directly in the `~` (`$HOME`) directory, or in the `~/.config` directory.

Goal: No extra tooling, no symlinks, files are tracked on a version control system, you can use different branches for different computers, and you can replicate your configuration easily on a new installation.

The technique consists of storing a Git **bare** repository in a "side" folder (like `$HOME/.cfg` or `$HOME/.config`) and using a specially crafted **alias** so that commands are run against that repository and **not the usual** `.git` local folder, which would interfere with any other Git repositories around.

### Bare Repository

This is a [great article](https://stegosaurusdormant.com/bare-git-repo/) to read about it in full, relevant to this dotfiles section. In short, a bare git repo is a repo with just the **history** but without a **snapshot** (aka working tree).

A normal git repo like the ones we have been using contains both a **snapshot** of all files in the repo (`working tree`) and a **history** of all the changes ever made to all these tracked files. This **history** is stored in the `.git` folder.
- In a regular repository, the history is stored in the `.git` folder at the top level of your repository
- It does not contain just the metadata, but it also **STORES** the data

That means **we don't really need the snapshot**, and that's the key idea of a bare git repo.

> This is why you can delete all of the files in your snapshot and then restore them with `git reset --hard`: the data is still in the history even if you delete the snapshot.

We store the history slightly differently in the bare git repo:
- In a non-bare repository, all of the history is stored in **subdirectories** of the `.git` directory (e.g. for a project called `ProjectX`, this data would be in `ProjectX/.git/objects`, `ProjectX/.git/refs`, etc.).
- In a bare repository, the history is stored in **multiple** top-level directories at the **project root** (e.g. `ProjectX/objects`, `ProjectX/refs`, etc.).

> The layout is otherwise the same.

### Starting up

```
git init --bare $HOME/.config
alias config='/usr/bin/git --git-dir=$HOME/.config/ --work-tree=$HOME'
config config --local status.showUntrackedFiles no
echo "alias config='/usr/bin/git --git-dir=$HOME/.config/ --work-tree=$HOME'" >> $HOME/.zshrc
```

1. Create a folder called `$HOME/.config` (a git [bare](https://www.saintsjd.com/2011/01/what-is-a-bare-git-repository/) repo)
2. Create the alias `config` (we use this instead of the `git` command)
3. Set a **flag** (local to the repo) to hide files we aren't tracking
4. Echo the alias definition to `.zshrc`

Now, any file within the `$HOME` folder can be **versioned** with normal commands, replacing `git` with your newly created `config` alias, like:

```
config status
config add .zshrc
config commit -m "Add zshrc"
config add .[FILE]
config commit -m "Add .[FILE]"
config push
```

### Migrating

If you already store your configuration/dotfiles in a Git repository using the method above, you can migrate to this setup on a **new** system with the following steps:

#### Clone to bare repo

Now clone your dotfiles into a **bare** repository in any "dot" folder of your choice in your `$HOME`:

```
git clone --bare <git-repo-url> $HOME/.config
```

#### Add alias to shell .rc file or current session

```
alias config='/usr/bin/git --git-dir=$HOME/.config/ --work-tree=$HOME'
```

![](https://i.imgur.com/76eqEM3.png)

#### Ignore .config folder

Make sure that your source repository ignores the folder where you'll clone it, so that you don't create weird recursion problems:

```
echo ".config" >> .gitignore
```

#### Checkout

Check out the actual content from the bare repository to your `$HOME`:

```
config checkout
```

![](https://i.imgur.com/tzmZ5RU.png)

If Git complains because your `$HOME` folder already has some stock configuration files that would be overwritten, the solution is simple: **back** up the files if you care about them, remove them if you don't.

```
mkdir -p .config-backup && \
config checkout 2>&1 | grep -E "\s+\." | awk '{print $1}' | \
xargs -I{} mv {} .config-backup/{}
```

#### Hide Untracked files

```
config config --local status.showUntrackedFiles no
```

#### Use

Now on your new machine, you can do `config [git-command] ..` to update your dotfiles.
```
config status
config add .zshrc
config commit -m "Add zshrc"
config add .[FILE]
config commit -m "Add .[FILE]"
config push
```

#### Summary

Here's the complete bash script that you can use when migrating to a new setup:

```
#!/bin/bash
git clone --bare [GIT_REMOTE_URL] $HOME/.config
# Same role as the `config` alias, but usable inside a script
function config {
   /usr/bin/git --git-dir=$HOME/.config/ --work-tree=$HOME "$@"
}
mkdir -p .config-backup
config checkout
if [ $? = 0 ]; then
  echo "Checked out config.";
else
  echo "Backing up pre-existing dot files.";
  config checkout 2>&1 | grep -E "\s+\." | awk '{print $1}' | xargs -I{} mv {} .config-backup/{}
fi;
config checkout
config config status.showUntrackedFiles no
```

Running `zsh`, for instance, will reload the setup automatically.

![](https://i.imgur.com/16Ejhpp.png)

#### Branch

It is even more convenient to create a local branch for each machine, then set the branch's upstream and push:

```
config push --set-upstream origin aws-ubuntu
```

![](https://i.imgur.com/zA15rjb.png)

![](https://i.imgur.com/yZVtsVL.png)

## Submodules

Sometimes we have many other git repos inside our git repo. Adding them like ordinary files will **not** cause a clone of the outer repo to clone the inner repos as well.

![](https://i.imgur.com/GwxQUjO.png)

We need to add them as submodules instead, using the command:

```
git submodule add [URL] [PATH-IN-REPO]
```

Note that we use the dotfile setup example above, hence `git` was aliased as `config`.

![](https://i.imgur.com/KcggKBx.png)

Then you can add and commit the `.gitmodules` file, which records each submodule's path and URL, to the repo. When you `clone`, you need to `init` and `update` these submodules for use:

```
git submodule init
git submodule update
```

![](https://i.imgur.com/6CtX7S2.png)

Git submodules allow you to **keep** a git repository as a subdirectory of another git repository. Git submodules are simply a **reference** to another repository at a particular snapshot in time.
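The "reference at a particular snapshot" idea can be inspected directly. The sketch below builds two throwaway local repositories (the names `lib`, `host`, and `vendor/lib` are hypothetical) and looks at how the host records the submodule. It assumes a reasonably recent Git; the `-c protocol.file.allow=always` option is only there because newer versions refuse to clone a submodule from a local path by default.

```shell
# Sketch: a submodule is a pinned commit recorded in the host repo's tree.
work=$(mktemp -d)
# Identity used for the demo commits only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# The "external" repository that will be embedded.
git init -q "$work/lib"
echo v1 > "$work/lib/README"
git -C "$work/lib" add README
git -C "$work/lib" commit -q -m "lib: first commit"
first=$(git -C "$work/lib" rev-parse HEAD)

# The host repository, with lib added as a submodule.
git init -q "$work/host"
cd "$work/host"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib
git commit -q -m "Add lib as a submodule"

# The host tree holds a "gitlink" entry (mode 160000) naming one lib commit.
entry=$(git ls-tree HEAD vendor/lib)
mode=$(echo "$entry" | awk '{print $1}')
pinned=$(echo "$entry" | awk '{print $3}')
echo "$entry"

# lib moves on, but the host keeps pointing at the commit it recorded.
echo v2 >> "$work/lib/README"
git -C "$work/lib" commit -q -am "lib: second commit"
still_pinned=$(git ls-tree HEAD vendor/lib | awk '{print $3}')
echo "host still pins: $still_pinned"
```

The host only moves to the newer `lib` commit when someone updates the submodule checkout and commits the changed entry in the host repository.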
* A git submodule is a **record** within a host git repository that points to a **specific** commit in another external repository.
* Submodules are very **static** and only track specific commits. Submodules do not track git refs or branches and are **not** automatically updated when the host repository is updated.

### Use Cases

When an external component or subproject is changing too fast, or upcoming changes will break the API, you can **lock** the code to a specific commit for your own safety. Or, when you have a component that **isn’t updated** very often and you want to track it as a vendor dependency.

### Danger

A common pattern of confusion and error is **forgetting** to **push** updates for the submodule for **remote** users. Suppose we have parent repo A that uses submodule repo B, both available locally. We update B and commit (but don't push). A records the new commit of B, and we `commit` and `push` A. A remote developer who pulls the **latest** A will then be unable to update the submodule, because the commit of B that A points to was never pushed. To avoid this failure scenario, make sure to always commit and push both the submodule and the parent repository.

### Alternative

There's a whole lot of debate online on why we shouldn't use git submodule, but really it depends on the use case. An alternative is to use [`git subtree`](https://blog.developer.atlassian.com/the-power-of-git-subtree/?_ga=2-71978451-1385799339-1568044055-1068396449-1567112770). Git subtree **allows** you to insert any repository as a **sub**-directory of another one.

> It is one of several ways Git projects can manage project dependencies.