---
tags: github
---

> New website: [https://rhythm16.github.io](https://rhythm16.github.io)

# How Git repositories are blockchains

I think most people (even some computer science people, including me) only think of cryptocurrencies whenever "blockchain" is mentioned. However, I recently learned a bit about how git works internally and was very surprised by how similar it is to some of the bitcoin/blockchain introductions I have studied. So, as a note and a way of sharing this interesting finding, I'm going to try to explain how git provides its fundamental functionality. Hopefully, a reader who is somewhat familiar with basic blockchain concepts will easily see what I claim in the title of the post.

Blockchain knowledge is not required to understand this article, though. I'll mostly be talking about git internals. Only basic git knowledge is assumed.

## Toy repo setup

I'll set up a simple git repository for demonstration purposes; all commands in this article will be issued in that repo.

### Setup:

```bash
# be sure to not be in a repository!
$ mkdir simple_repo
$ cd simple_repo
$ git init
Initialized empty Git repository in /your/path/simple_repo/.git
$ echo "FIRST LINE" > my_file.txt
$ git add my_file.txt
$ git commit -m "add my_file.txt"
# commit info..
$ mkdir dir
$ echo "LINE IN INNER FILE" > dir/inner_file.txt
$ git add dir/inner_file.txt
$ git commit -m "add inner_file.txt"
# commit info..
```

## git under the hood

In essence, git is a key-value data store with abstraction layers built on top of it, exposed through convenient commands. The key for a piece of data is generated by hashing the data with SHA-1. All the information git needs in order to work is inside the `.git` directory under the root of every git repository.
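To make the "key-value store keyed by a hash" idea concrete, here is a minimal sketch in Python. It is a toy, not git's actual storage format: `ToyObjectStore` and its methods are names I made up for illustration, and it hashes the raw bytes directly, whereas git hashes a small header together with the data.

```python
import hashlib


class ToyObjectStore:
    """A toy content-addressed store: the key is the SHA-1 of the value.

    Hypothetical illustration only; git's real format differs.
    """

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, never chosen by the caller.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ToyObjectStore()
key = store.put(b"FIRST LINE\n")
same_key = store.put(b"FIRST LINE\n")  # identical content gives an identical key
print(key == same_key, len(store))     # True 1
print(store.get(key))                  # b'FIRST LINE\n'
```

Because the key is a function of the content, storing the same content twice costs nothing extra, and any change to the content produces a different key.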
There are four types of objects that git hashes and keeps in the data store:

* blobs (binary large objects)
* trees
* commits
* annotated tags

I'll only go into the first three because they're enough to paint the picture. In the toy repo, there are 7 objects saved after the operations above, including blobs, trees and commits (though you can't tell their types yet). You can view them under `.git/objects`. The first two characters of each hash are used as the name of a subdirectory and the remaining 38 characters are used as the file name (SHA-1 produces a 40-character hexadecimal output).

> Your commit hashes will not be the same as mine because the input to the hash function for a commit includes the user name, email, time, etc. Blob and tree hashes depend only on file contents, names and modes, so those should match if you followed the same steps.

```bash
$ tree .git/objects/
.git/objects/
├── 17
│   └── 1edfe85c938a6d2c7af2f0bb0db6808951be49
├── 29
│   └── 1caa6e2306eca737a3944824676a16fa6a4a39
├── 6f
│   └── e44c47ff76dad4433f9e0df8e717ad03eb9004
├── 72
│   └── 538f3bfde6c5c309abfab4cc47d9428566a621
├── c0
│   └── e23f04d16a2ccda68a117656d92defa2800885
├── d2
│   └── 8a6d4ef745a7bd6e5e0b013ec17ca00b6fbae6
├── f5
│   └── 5db9d5a884e10f0924529629f2dcb117abcaac
├── info
└── pack

9 directories, 7 files
```

### hash-object and cat-file

`git hash-object` and `git cat-file` are two low-level commands that interact directly with objects: they hash and print objects, respectively. For example, we can tell git to hash a string for us:

```bash
# --stdin means read from stdin
$ echo "hi" | git hash-object --stdin
45b983be36b73c0788dc9cbcb76cbb80fc7b057f
# specify the -w flag to store the data in .git/objects/
$ echo "hi" | git hash-object --stdin -w
# .git/objects/45/b983be... created
45b983be36b73c0788dc9cbcb76cbb80fc7b057f
```

Use `cat-file` to see the data stored under a key (hash):

```bash
$ git cat-file -p 45b983b # -p stands for pretty-print
hi
$ git cat-file -t 45b983b # -t stands for type
blob
```

With these two commands we can manipulate the git data store directly.
If you print out or open the raw files inside `.git/objects`, though, you won't be able to read them, because git compresses them before saving. The compressed data also includes some metadata; otherwise `cat-file` wouldn't be able to report the type.

### Blobs (Binary Large OBjectS)

Blobs are the objects git uses to store the contents of files: only the contents, not the file names. In my local toy repo, the hashes d28a6d4.. and 171edfe.. correspond to the contents of the two text files created. Check this with `cat-file`:

```bash
$ git cat-file -t d28a6d4
blob
$ git cat-file -p d28a6d4
FIRST LINE
$ git cat-file -t 171edfe
blob
$ git cat-file -p 171edfe
LINE IN INNER FILE
```

A blob looks like this:

|hash|file content|
|---|---|
|d28a6d4...|FIRST LINE|

### Trees

Trees are the objects git uses to store the contents of directories; each directory is stored as a tree object. It's easier to understand what's in a tree by looking at an example:

```bash
$ git cat-file -t 72538f3
tree
$ git cat-file -p 72538f3
040000 tree 291caa6e2306eca737a3944824676a16fa6a4a39	dir
100644 blob d28a6d4ef745a7bd6e5e0b013ec17ca00b6fbae6	my_file.txt
$ git cat-file -t 291caa6
tree
$ git cat-file -p 291caa6
100644 blob 171edfe85c938a6d2c7af2f0bb0db6808951be49	inner_file.txt
```

The trees are printed out as tables, looking like this:

|permissions|object type|hash|name|
|---|---|---|---|
|040000|tree|291caa6e2306eca737a3944824676a16fa6a4a39|dir|
|100644|blob|d28a6d4ef745a7bd6e5e0b013ec17ca00b6fbae6|my_file.txt|

This tree corresponds to the root directory of our toy repository. The interesting thing here is that tree objects can reference other tree objects, just as directories can contain directories (see the `dir` row above). Furthermore, you can see that the mapping from hash to file name is stored in the tree objects, not in the blob objects.
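To show how a tree's own hash depends on the hashes of everything beneath it, here is a sketch in Python of how a tree object can be hashed. It rests on my understanding of git's SHA-1 object format: each entry is stored as `<mode> <name>\0` followed by the 20 raw bytes of the child's hash, directory modes are stored as `40000` (the leading zero in `040000` is only added when `cat-file` prints them), and directories sort as if their names ended in `/`. The function name `git_tree_hash` is made up for illustration.

```python
import hashlib


def git_tree_hash(entries):
    """Hash a tree given (mode, name, hex_object_id) tuples.

    Sketch under assumptions: SHA-1 object format, names are plain ASCII/UTF-8.
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: directories are ordered as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hashes of the two blobs, copied from the cat-file listings above.
inner = git_tree_hash([
    ("100644", "inner_file.txt", "171edfe85c938a6d2c7af2f0bb0db6808951be49"),
])
root = git_tree_hash([
    ("40000", "dir", inner),
    ("100644", "my_file.txt", "d28a6d4ef745a7bd6e5e0b013ec17ca00b6fbae6"),
])
print(inner)  # compare with the hash of the `dir` tree reported by cat-file
print(root)   # compare with the hash of the root tree reported by cat-file
```

Note that `root` is computed from `inner`, which is in turn computed from the blob hash, so changing a single byte in `inner_file.txt` would change every hash on the path up to the root.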
The meaning of the permissions field is explained in [this](https://unix.stackexchange.com/questions/450480/file-permission-with-six-bytes-in-git-what-does-it-mean) Unix & Linux Stack Exchange question.

Here's a visualization I made :)

```graphviz
digraph structs {
    node[shape=record]
    struct4 [label="{Tree 72538..|{{<f0> perm|<f4> 040000|<f8>100644}|{<f1> type|<f5>tree|<f9>blob}|{<f2> hash|<f6>291ca..|<f10>d28a6..}|{<f3> name|<f7>dir|<f11>my_file.txt}}}"];
    struct5 [label="{Tree 291ca..|{{<f0> perm|<f8>100644}|{<f1> type|<f9>blob}|{<f2> hash|<f10>171ed..}|{<f3> name|<f11>inner_file.txt}}}"];
    struct6 [label="{Blob 171ed..|\"LINE IN INNER FILE\"}"]
    struct7 [label="{Blob d28a6..|\"FIRST LINE\"}"]
    struct4:f6 -> struct5;
    struct5:f10 -> struct6;
    struct4:f10 -> struct7;
}
```

### Commits

Finally, we come to commits. Commits are the last type of object I'll be talking about, and they bring all the concepts together. As the name suggests, each commit you make is stored by git as a commit object. A commit object contains a tree, an author, zero or more parent commits (none for the first commit, one for an ordinary commit, two or more for a merge commit), a committer, and the commit message. Let's see it:

```bash
$ git cat-file -t c0e23f
commit
$ git cat-file -t f55db9
commit
$ git cat-file -p c0e23f
tree 6fe44c47ff76dad4433f9e0df8e717ad03eb9004
author rhythm <rhythm@music.yes> 1618749200 +0800
committer rhythm <rhythm@music.yes> 1618749200 +0800

add my_file.txt
$ git cat-file -p f55db9
tree 72538f3bfde6c5c309abfab4cc47d9428566a621
parent c0e23f04d16a2ccda68a117656d92defa2800885
author rhythm <rhythm@music.yes> 1618749484 +0800
committer rhythm <rhythm@music.yes> 1618749484 +0800

add inner_file.txt
```

The diagram below shows the structure git uses to keep track of commits and their contents. __NOTE__ that the arrow pointing from `commit f55db`'s parent field to `commit c0e23` is missing, because I couldn't get it to work without the graphviz engine making a mess, so you'll have to imagine it yourself. I hope you get the point, though.
```graphviz
digraph structs {
    newrank=true;
    node[shape=record]
    {rankdir=LR}
    rank1 [style=invisible];
    rank2 [style=invisible];
    rank1 -> rank2 [color=white];
    structcf [label="{Commit f55db..|{{<f0> tree|<f4> parent|<f8>author|committer}|{<f1> 72538..|<f5>c0e23..|<f9>rhythm ...|rhythm ...}}}"];
    structcc [label="{Commit c0e23..|{{<f0> tree|<f4> parent|<f8>author|committer}|{<f1> 6fe44..|<f5>|<f9>rhythm ...|rhythm ...}}}"];
    structt6 [label="{Tree 6fe44..|{{<f0> perm|<f8>100644}|{<f1> type|<f9>blob}|{<f2> hash|<f10>d28a6..}|{<f3> name|<f11>my_file.txt}}}"];
    structt7 [label="{Tree 72538..|{{<f0> perm|<f4> 040000|<f8>100644}|{<f1> type|<f5>tree|<f9>blob}|{<f2> hash|<f6>291ca..|<f10>d28a6..}|{<f3> name|<f7>dir|<f11>my_file.txt}}}"];
    structt2 [label="{Tree 291ca..|{{<f0> perm|<f8>100644}|{<f1> type|<f9>blob}|{<f2> hash|<f10>171ed..}|{<f3> name|<f11>inner_file.txt}}}"];
    structb1 [label="{Blob 171ed..|\"LINE IN INNER FILE\"}"]
    structbd [label="{Blob d28a6..|\"FIRST LINE\"}"]
    structcf:f1 -> structt7;
    //structcf:f5 -> structcc;
    structcc:f1 -> structt6;
    structt6:f10 -> structbd[constraint=false];
    structt7:f6 -> structt2;
    structt2:f10 -> structb1;
    structt7:f10 -> structbd;
    //{rank=same; structcc structcf}
    /*subgraph commit0 {
        structcf:f1 -> structt7;
        structt7:f6 -> structt2;
        structt2:f10 -> structb1;
        structt7:f10 -> structbd;
    }
    subgraph commit1 {
        structcc:f1 -> structt6;
        structt6:f10 -> structbd;
    }
    {rank = same;}
    */
}
```

## Git repos are blockchains

From the explanation above, we can see that the git commit data structure looks something like this:

```graphviz
digraph git {
    {rankdir=LR}
    "commit #3" -> "root tree #3"
    "commit #2" -> "root tree #2"
    "commit #1" -> "root tree #1"
    "commit #0" -> "root tree #0"
    "commit #0" -> "commit #1" [dir=back]
    "commit #1" -> "commit #2" [dir=back]
    "commit #2" -> "commit #3" [dir=back]
    {rank = same; "commit #0" "commit #1" "commit #2" "commit #3"}
    "root tree #0" -> "blob a"
    "root tree #0" -> "blob b"
    "root tree #1" -> "blob c"
    "root tree #1" -> "blob d"
    "root tree #2" -> "blob e"
    "root tree #2" -> "blob f"
    "root tree #3" -> "blob g"
    "root tree #3" -> "blob h"
}
```

The commits have the following properties:

* they form an acyclic graph
* they point back to their parents by hash
* each stores the Merkle root (the root tree hash) of the data it corresponds to
* each contains a timestamp

IMO, these are the properties blockchains possess, but there are voices on both sides; see [this SO question](https://stackoverflow.com/questions/46192377/why-is-git-not-considered-a-block-chain) and [this Medium post](https://medium.com/@shemnon/is-a-git-repository-a-blockchain-35cb1cd2c491). As far as I know, the term blockchain isn't rigorously defined, so there's no definitive answer.

## Acknowledgement

I learned most of the material in this post from the amazing [Udemy course](https://www.udemy.com/course/git-and-github-bootcamp/?utm_source=adwords&utm_medium=udemyads&utm_campaign=LongTail_la.EN_cc.ROW&utm_content=deal4584&utm_term=_._ag_77879424134_._ad_437497333833_._kw__._de_c_._dm__._pl__._ti_dsa-1007766171312_._li_9040379_._pd__._&matchtype=b&gclid=EAIaIQobChMIysO8j8mR8AIVDq2WCh24YgHCEAAYASAAEgLPdPD_BwE) taught by Colt Steele. I highly recommend it to people who want a solid foundation in git/GitHub.

## Some other good references

[git internals](https://www.linkedin.com/pulse/git-internals-how-works-kaushik-rangadurai/)