TL;DR: Check out my STUDY_NOTES.md on Git if you want a quick understanding of (ancient) Git internals!
I’ve been using Git since 2009. In all that time I never really bothered with understanding Git internals, because frankly after learning what a directed acyclic graph (DAG) was, everything just fell into place.
That’s going to change, because in the coming weeks, I will start contributing to Git on a somewhat regular basis (at least, that’s the plan). It won’t be the first time contributing to the project (which I did back in 2014), but I will need to begin studying how Git works under the hood.
To that end, I spent the better part of last weekend trying to understand Git’s internals. The current Git codebase is a bit daunting, and there’s no way that I’m going to read it all any time soon. But the very first commit of Git is small enough to read in one sitting, and so I tried compiling it (there were lots of errors), while taking notes in the source code directly. I also actually used the produced binaries to prove to myself that yes, this system actually does work even at this primitive stage.
Now, there are major differences between this ancient root-commit version of
“Git” and modern Git. However, I’ve taken note of all such differences (at least
as many as I could gather, within reason) by digging into the Git mailing list archive to try to make sense of why things were changed the way they were (e.g.,
How come we have so-called “pack” files? How come the SHA1 hash of an object
(using sha1sum
) is not the same as its directory name plus filename?) You can
see my notes in the STUDY_NOTES.md file for the answers.
I have to admit that I found Linus Torvalds’ initial design decisions to be
impressively elegant. Reading the first commit made me have multiple “ah-ha!”
moments behind why Git has a distinction between the index and the working tree,
why it doesn’t track empty directories, why Git doesn’t care if you blow away
the working tree as long as the .git
folder is intact, etc. And the code is
pretty easy to follow! It’s a great resource for any aspiring C hacker.
Note: I’ve reproduced the study notes below for posterity. Do check out the branch directly, or apply the patches yourself on the root commit.
Root-commit Git study notes
This is a special branch of Git for learning purposes. It is special
because it is based off the absolute minimal “ancient” implementation of
Git (Linus Torvald’s root commit at
e83c5163316f89bfbde7d9ab23ca2e25604af290
),
with some small changes to make it easy to compile with
Nix (see the Makefile changes) and
also “feel” more like modern Git (namely, the use of .git/
instead of
.dircache/
). Yes, you can technically grab the
100th or so commit which basically has all of the changes I made, but you’d
be dealing with a lot more code to read. If you just want to quickly
understand Git’s data structures, there’s honestly nothing faster than
reading the root commit (it’s only ~1000 lines of C, including comments)
and with some additional notes to fill in any missing gaps (which this
document tries to do).
The biggest revelation I had while creating these notes is that Git’s
data structures have proven to be incredibly stable — the initial idea
of an object database (.git/objects/...
) and the cache (.git/index
)
were there from day 1 and are still the main workhorses for Git. Knowing
these two concepts will radically reduce the perceived complexity of
modern Git’s numerous bells and whistles, as every other thing you see
in the .git
folder are mere extensions of these two essential data
structures.
This version comes with a basic Usage Guide to help users actually use
the binaries that shipped with the root commit. Run make
(you need the
Nix package manager) and try to use
the commands in the order described in the Usage Guide below. After (or
perhaps before?) you run each command, read its source code. You might
want to have a look at cache.h
first — but the main “meat” of it all
is in update-cache.c
— which makes sense, because the cache is
always updated first before anything is written to the object database.
Pro tip: use
6ad6d3d36c5924c8ff502ebbb6a6216df01e7efb
as a shortcut to view the first 100ish commits in Git’s history. This is
handy to understand some of the early changes that went into Git. As a
bonus, this commit also updates the README to capture the workflow of
actually using GIT in its ancient form. Perhaps it is obvious, but use
git log -p README
to see its history.
Data structures
The two big data structures are:
- The object database (all files under
.git/objects/...
), and - The index file (aka “cache”, at
.git/index
).
Refer to README for a more thorough discussion of these data structures. But here are a few more interesting notes about each data structure.
The Object Database
The 8-bit fanout
The object database has 256 folders, named 00
through ff
in hex
notation (the first 8 bits of the 20-byte SHA1 hashing scheme used to
generate the object IDs for this database). You may wonder why do we
bother with this structure (after all, all files are already named with
their unique SHA1 hash so the chance of an innocent collision is
virtually zero). Torvalds stated in
April 2005 that he didn’t want hundreds of thousands of files in one
subdirectory:
The 8-bit initial fan-out is very much a
middle ground: we waste some space (and some time) doing it, but it does
make the really horrible case largely go away.
The “horrible case” probably refers to the possibility of hundreds of thousands of files all residing in a single directory, which Torvalds brought up in the linked email.
Object IDs
You can run sha1sum
of any file in the object database and the output
(SHA1 hash) will match the filename path. E.g.,
$ sha1sum .git/objects/cc/41b0dfbe81a71ca922cda3c9de9db3a25a56b4
cc41b0dfbe81a71ca922cda3c9de9db3a25a56b4 .git/objects/cc/41b0dfbe81a71ca922cda3c9de9db3a25a56b4
and notice that cc41b0dfbe81a71ca922cda3c9de9db3a25a56b4
matches the
cc/41b0dfbe81a71ca922cda3c9de9db3a25a56b4
. However, this is no longer
the case and you’ll get a different SHA1 hash using any recent Git
version. The reason is because Git originally hashed the compressed
(post-zlibbed) contents, but now it hashes the decompressed
(pre-zlibbed) content. This switch-over was done in
d98b46f8d9a3daf965a39f8c0089c1401e0081ee
and
f18ca7316631914776136455c151d70318299459
,
just a couple weeks after the root commit, mainly for performance
reasons (because write-tree
was taking too long in applying patches).
See
the original discussion and the
“Object DB conversion” announcement.
Also see this page for a guide on using Python to check the hashes of objects (in case you want to check the hash output independently of Git tooling).
Only basic compression
Modern Git uses at least two additional schemes not present in this initial version to help reduce redundant data: pack files (record deltas of similar objects), and recursive tree objects (that’s right, in the original implementation, a tree object could only refer to blobs).
Pack files
Note that in this version, Git treats a file’s content as an atomic unit of data — it doesn’t perform any form of “chunking” to divide it up into smaller bits (similar to what bittorrent does). So every file will get its own blob, and the only way that a blob will be reused (thus saving disk space) in a subsequent commit is if does not change. It must match identically!
You may think, “why not just divide a file into chunks, and make blobs out of each chunk?” — that way, you’d naturally get some level of deduping, even without any additional work. Torvalds considered this but rejected the idea for two reasons: performance and simplicity.
Just a couple months after the above email though, Git learned about pack files. Basically, pack files compress a range of reachable objects between two commits and puts them all into two files, a pack index (
.idx
) and pack (.pack
) file. The basic idea is that you can put all of these objects together in the.pack
file, allowing you to do some level of compression inside it (assuming you have lots of objects that have similar content). Here is a description of how it would work in Torvald’s own words. Here’s a somewhat retrospective announcement, which explains that the previous “delta object” approach (where Git stored delta objects in the object database) is deprecated (however, do note that the algorithms to find the deltas (diff_delta()
) was re-used in the pack files, so not everything was discarded).If you’re wondering why the pack files have a separate dedicated index file, it basically comes down to performance and simplicity, again.
Recursive tree objects
As trees can currently only refer to blobs only, this means that every commit is somewhat wasteful (although this has the unique property that a single commit refers to a single tree object that has everything in it).
Recursive tree objects were added in
d6d3f9d0125a7215f3cdc2600b2307ca55b69536
.
The Cache
The cache, or index file, represents a tree “snapshot”. It is what is
staged, ready to be committed. More precisely, it is just a
cache_header
followed by a list of cache_entry
values, where each
cache_entry
is a blob
object’s metadata. Among other things, the
cache_header
records how many cache entries there are in the index
file. This is still true in modern Git as of March 2023 — if you run
hexdump -C .git/index | head -n1
you can see, for example:
$ hexdump -C .git/index | head -n1
00000000 44 49 52 43 00 00 00 01 00 00 00 0d 64 15 77 69 |DIRC........d.wi|
where the DIRC
is a magic number (standing for dircache
, the
original name of the .git
folder) followed by 4 bytes (unsigned int)
for the index version and another 4 bytes showing the number of cache
entries, or file paths, that are being “tracked” for purposes of tree
object creation. In the example above the index version is 1
(modern
Git uses version 2
), and there are 0x0d
or 13
cache entries, or
files, that would make up the current tree.
Note that if you run the above on an index file created by the original
update-cache
, you would see instead something like:
$ hexdump -C .git/index | head -n1
00000000 43 52 49 44 01 00 00 00 01 00 00 00 2b 1a 2d 28 |CRID........+.-(|
because the byte order was using little-endian, “host byte order”. This
is what is meant by “native CPU byte format” comment in cache.h
(because most CPUs are Intel, and Intel uses little-endian). The byte
order was changed in
ccc4feb579265266d0a4a73c0c9443ecc0c26ce3
to use big-endian, also called “network byte order”, for convenience
over NFS.
Other missing things vs modern Git
This initial version of Git does not have support for HEAD
(.git/HEAD
) or branches (.git/refs/heads/...
). In fact there are no
human-friendly references at all! But one can easily understand that
references are just pointers to the object store — all you would need
is a way to keep track of the latest commit by saving its object ID
(SHA1 hash) somewhere. The simplest possible thing you could do is to
have a file with this object ID in it — and this is what modern Git
(still) does. The old
README
notes that in practice, the SHA1 hash was written at .git/HEAD
. It was
formally recognized as such just a day later in
839a7a06f35bf8cd563a41d6db97f453ab108129
, as part of the git-prune-script
and git-pull-script
helpers to help with merging.
Usage Guide
This guide explains how the earliest version of Git (root commit) works. You can read these steps and also look up the C source code and read them to get a better sense of how everything works.
Initialize the object database with
init-db
. This is the.git
directory.Make changes to files. These files can be any file except the
.git
directory. We don’t have the concept of.gitignore
yet, and also, all dotfiles (any file that begins with a.
) are ignored and cannot be tracked by Git.Stage modified files with
update-cache <FILE> [...FILES]
. This compresses these files’ contents and saves them to the object database, such that each file gets its own object database file. At this point the files are tracked by Git. It also results in adding this file’s metadata (essentially the filename and SHA1 of its contents) to the.git/index
file.(Optional) Check the diff of what is in the
.git/index
(staged) versus the current working tree withshow-diff
. We are just diffing whatever is in the current cache.git/index
(essentially the last known “tree-to-be-written-to-object-database-but-not-yet”) and what is on disk at those paths that the cache describes. The diffing comparison is basic and is based on timestamps and inodes (presumably for performance).This diff is the ancient equivalent of
git diff
. If we add those files that have been modified withupdate-cache
, thenshow-diff
will show nothing, because the working tree files on disk match what is in the index file (just like how moderngit diff
will show nothing, unless you invokegit diff --cached
, in this situation).Note also that we are not comparing things to a previous commit of any kind. Instead we are always only diffing the files that were touched/modified (during the course of normal development) and what the index file has. It’s even more primitive than the modern “detached HEAD mode” in Git because we do not automatically diff against a “current commit” because the concept of a “current commit” doesn’t exist yet — we literally have blobs, trees, and commits in the object database, the index file (describing whatever paths make up another (perhaps new and unique) tree object), and the working tree (everything except the
.git
folder).Lastly, the
show-diff
command shells out todiff
(so the codebase doesn’t have any fancy diffing algorithms).Run
write-tree
to save the data in.git/index
is its own tree object in the object database. The SHA1 of this tree object is printed to STDOUT. Take a note of this SHA1 hash, as it will be referenced to construct a commit (changeset) object.(Optional) Check the SHA1 from
write-tree
withread-tree <SHA1>
. This will display the tree object (by displaying its blobs).Create a new commit with
echo "my-commit-message" | commit-tree <SHA1>
, using the SHA1 from step 5 above. This will create a new commit object and write it to the object database.(Optional) Check the commit with
cat-file <COMMIT_SHA1>
. This will write the commit message and metadata (including the tree SHA (and parent commit SHAs for non-root commits)) to a temporary file. You can justcat
out this file to see it (commit date, author name, email, etc.).The fact that
cat-file
writes to disk is a bit annoying, and so it learned to output to STDOUT inbf0c6e839c692142784caf07b523cd69442e57a5
.Repeat steps 2-7 above, but for step 5 pass in the
-p <SHA1>
flag to mark it as a child of a previous commit SHA. You can pass in multiple-p
flags to denote multiple parents (e.g., a merge). For the very first merge in Git’s own history, seeb51ad4314078298194d23d46e2b4473ffd32a88a
.