Git Packfiles

Packfiles
- Packs
- Indexes
- Plumbing
Summary

Previously, in Git in Reverse, we learned how Git stores information internally. Namely, we went over the "loose" object format that Git uses for storage. We did not, however, discuss the second storage format Git uses to store files, changes, and ultimately objects more compactly. In this post we will discuss packfiles and how Git uses them primarily to save bandwidth and only secondarily to save storage space for repository contents.

We're only going to discuss the high-level details of packfiles; there are plenty of sources that describe the details better.

Packfiles

Packfiles, like the Git objects discussed before, are an internal set of files for storing objects, but in a more compressed format. That is, instead of storing each version of a file in its entirety, Git can store a single version in full and maintain an internal set of objects that contain patches for deriving the other versions. Furthermore, Git can store an entire repository's objects in a single packfile, eliminating large numbers of small files and improving the efficiency of object access.

The files themselves live in the .git/objects/pack folder of a repository and come in two kinds: pack files (.pack) and index files (.idx).

Here is the packfile that contains this repository (as of this writing):

± find .git/objects/pack -type f
.git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.pack
.git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.idx

Notice that there are not two "packs", but two files that describe the same "pack". The .pack file contains the actual objects. The .idx file provides an "index" of the objects contained in the pack.

We'll take a small moment to describe each in a little more detail.

Packs

Packfiles are relatively straightforward. There is a 12-byte header: the first four bytes spell "PACK", the next four give the version ("2" as of this writing), and the final four give the number of objects in the pack. Following the header are the objects themselves, stored in a very compact but variable-length format. Finally, there is a 20-byte trailer holding the checksum of the packfile's contents (header and objects).
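To make the layout concrete, here is a minimal Python sketch that builds the smallest possible pack (a header announcing zero objects, followed by the trailer) and parses it back. The function names are made up for illustration, and the sketch assumes the integers are big-endian and the trailer is a SHA-1 digest, which matches the 20-byte size mentioned above.

```python
import hashlib
import struct

def build_empty_pack():
    """Build a pack with a 12-byte header, no objects, and a 20-byte trailer."""
    # Signature, version 2, object count 0 (assumed big-endian).
    header = b"PACK" + struct.pack(">II", 2, 0)
    # Assumption: the trailer is the SHA-1 of everything that precedes it.
    return header + hashlib.sha1(header).digest()

def parse_pack(data):
    """Return (version, object_count) after checking signature and trailer."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    body, trailer = data[:-20], data[-20:]
    if hashlib.sha1(body).digest() != trailer:
        raise ValueError("trailer checksum mismatch")
    return version, count

pack = build_empty_pack()
version, count = parse_pack(pack)
```

Pointing parse_pack at the bytes of a real .pack file should work the same way, although reading a large pack fully into memory is not something a real implementation would do.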

In the header, the number of objects is encoded as a 4-byte integer; thus, a packfile can hold at most $2^{32} - 1$, or a little over 4 billion, objects. However, this does not give an upper bound on the size of the pack files themselves on disk. The length of each object is encoded in a variable-length integer prefacing each object in the packfile.
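The variable-length integer can be sketched in a few lines of Python. This is an illustration under the assumption that the first byte carries a 3-bit type code and the low 4 bits of the size, with every following byte contributing 7 more bits while its high bit is set; the helper names are hypothetical.

```python
def encode_object_header(type_code, size):
    """Encode a type code (0-7) and a size into the per-object header bytes."""
    first = (type_code << 4) | (size & 0x0F)   # type in bits 6-4, size bits 3-0
    size >>= 4
    out = bytearray()
    while size:
        out.append(first | 0x80)               # high bit: more size bytes follow
        first = size & 0x7F
        size >>= 7
    out.append(first)
    return bytes(out)

def decode_object_header(data, pos=0):
    """Return (type_code, size, next_pos) for the header starting at pos."""
    byte = data[pos]
    pos += 1
    type_code = (byte >> 4) & 0x07
    size = byte & 0x0F
    shift = 4
    while byte & 0x80:                         # keep reading 7-bit groups
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return type_code, size, pos

# Type code 1 is assumed to mean "commit"; 225 is a size seen later in this post.
header = encode_object_header(1, 225)
decoded = decode_object_header(header)
```

Because the size grows by 7 bits per extra byte, small objects pay for a single header byte while very large ones are still representable.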

Objects in the packfile are not stored exactly as they exist in the loose format; they are packed more tightly, usually resulting in less space used on disk. An object stored in the packfile may be either a base, undeltified object or a deltified object.

Undeltified objects are less interesting here, mainly because they are already [covered][3]. The deltified objects, however, are pretty interesting, and definitely different.

The deltified objects, as the name might imply, contain a delta (or, if you prefer, a patch) together with a reference to the base object, such as its name, needed to recreate the defined object. That is, in place of the object's full contents, Git stores a patch used to derive the object from its base. It only does this in the context of packfiles. Furthermore, the structure allows the base object to itself be a deltified object, making it possible to store only one full version of a file and derive all other versions from deltas or patches.
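To give a feel for what such a patch looks like, here is a small Python sketch of applying a delta to a base. It assumes the delta begins with two variable-length sizes (base and result) followed by "copy from base" and "insert literal bytes" instructions; the function names and the hand-built delta are illustrative only, not taken from a real pack.

```python
def read_varint(delta, pos):
    """Read a little-endian base-128 integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def apply_delta(base, delta):
    """Rebuild the target object from a base and a delta."""
    base_size, pos = read_varint(delta, 0)
    result_size, pos = read_varint(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was made for a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:                       # copy a span out of the base
            offset = size = 0
            for i in range(4):               # optional offset bytes
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):               # optional size bytes
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            size = size or 0x10000           # a size of zero means 64 KiB
            out += base[offset:offset + size]
        elif cmd:                            # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    if len(out) != result_size:
        raise ValueError("unexpected result size")
    return bytes(out)

base = b"hello world\n"
# Hand-built delta: base size 12, result size 12,
# copy 6 bytes from offset 0, then insert the 6 bytes b"there\n".
delta = bytes([12, 12, 0x90, 6, 6]) + b"there\n"
result = apply_delta(base, delta)
```

A chain of deltas is then just repeated application: rebuild the base first, then apply the next delta on top of it.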

While it is entirely possible to use only the packfile itself to access the contained objects, it's not very efficient for random access. Therefore, the index file is created to maintain a way to peer into the packfile efficiently.

Indexes

Packfile indexes solve the random object access efficiency problems caused by heavily compacting objects into a single file.

The contents of the index, however, are a little more complicated than those of the pack file.

Version 1 of the index format has no header. In version 2, the current version, 8 bytes are dedicated to the header: the first 4 bytes are always 255, 116, 79, 99, a combination chosen because it is not a plausible start for a fanout table, which lets Git tell the two versions apart; the other 4 bytes hold the version number, currently 2.
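As a quick illustration, the header check might look like the following Python sketch. The function name is hypothetical, and the version field is assumed to be a big-endian integer.

```python
import struct

# The four magic bytes from the text; they are the same as b"\xfftOc".
IDX_MAGIC = bytes([255, 116, 79, 99])

def index_version(data):
    """Guess the index version from the first 8 bytes of an .idx file."""
    if data[:4] != IDX_MAGIC:
        # No magic: treat the file as version 1, which starts with the fanout table.
        return 1
    (version,) = struct.unpack(">I", data[4:8])
    return version

v2_header = IDX_MAGIC + struct.pack(">I", 2)
detected = index_version(v2_header)
```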

Following the header there is what Git calls a fanout table. This table consists of 256 4-byte integers; entry N records the number of objects whose name's first byte is less than or equal to N.

That is, if the pack has 2 objects whose names start with 00, there will be a 2 in the 00th entry of the table. Furthermore, if there are 3 objects that start with 01, the 01th entry will report 5 objects. Remember, each entry in the table is a running total that includes all previous entries ("less than or equal to this entry"). Examining the 256th and last entry therefore gives the total number of objects in the packfile.
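The running totals are easy to reproduce. The Python sketch below builds a fanout table for the example just described; the object names are invented placeholders, and build_fanout is a name chosen here for illustration.

```python
def build_fanout(names):
    """Return 256 cumulative counts, indexed by the first byte of each name."""
    counts = [0] * 256
    for name in names:
        counts[int(name[:2], 16)] += 1       # first byte = first two hex digits
    fanout, running = [], 0
    for count in counts:
        running += count                     # "less than or equal to this entry"
        fanout.append(running)
    return fanout

# Hypothetical names: 2 objects starting with 00 and 3 starting with 01.
names = ["00" + c * 38 for c in "ab"] + ["01" + c * 38 for c in "cde"]
fanout = build_fanout(names)
```

Every entry after index 01 simply repeats the total, so the last entry doubles as the object count.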

Following the fanout table is a sorted table of 20-byte SHA-1 hashes.
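Together, the fanout table and the sorted names make lookups cheap: the fanout narrows the search to names sharing a first byte, and a binary search finishes the job. Here is a Python sketch of that idea, using a few object names that appear later in this post; find_object and the in-memory lists are illustrative stand-ins for the on-disk tables.

```python
from bisect import bisect_left

def find_object(sorted_names, fanout, name):
    """Return the position of a 20-byte name in the sorted table, or None."""
    first = name[0]
    lo = fanout[first - 1] if first > 0 else 0   # names before this first byte
    hi = fanout[first]                           # names up to and including it
    i = bisect_left(sorted_names, name, lo, hi)
    return i if i < hi and sorted_names[i] == name else None

names = sorted(bytes.fromhex(h) for h in [
    "f2e90bed364168fcca0893437fb569d762cdbbce",
    "d12254d273712af99e0585e7dd9dfea2106d5692",
    "98c37b0fb33a8b2f7ac4c5d94571382071ae859c",
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
])

# Build the cumulative fanout for this small table.
fanout, running = [], 0
for byte in range(256):
    running += sum(1 for n in names if n[0] == byte)
    fanout.append(running)

position = find_object(
    names, fanout, bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
missing = find_object(names, fanout, bytes(20))
```

The position found here is what ties the tables together: the same position is used to read the matching entry in the tables that follow.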

In version 2, there is another table following the sorted hashes that consists of 4-byte CRC32 values of the packed object data. This table enables easier copying of data between packfiles. For example, this improves the efficiency of creating new packfiles for new objects.
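The benefit is that a stored entry can be checked without decompressing it. The Python sketch below illustrates the idea with the standard zlib.crc32 function; the helper names are hypothetical, and the assumption that the checksum covers the entry's header byte plus its compressed data is ours, not something spelled out above.

```python
import zlib

def entry_crc32(packed_entry):
    """CRC32 of an entry's raw bytes as they sit in the pack."""
    return zlib.crc32(packed_entry) & 0xFFFFFFFF

def entry_is_intact(packed_entry, expected_crc):
    """Check raw bytes against the value recorded in the index."""
    return entry_crc32(packed_entry) == expected_crc

# An illustrative entry for an empty blob: one header byte plus compressed data.
entry = b"\x30" + zlib.compress(b"")
stored_crc = entry_crc32(entry)

# Flip the bits of the first byte to simulate corruption during a copy.
damaged = bytes([entry[0] ^ 0xFF]) + entry[1:]
```

When copying an entry from one pack to another, comparing checksums like this is much cheaper than inflating the object and hashing it again.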

Next is another table of 4-byte offset values. An offset normally fits in the lower 31 bits of its entry; when the most significant bit is set, the remaining 31 bits are instead an index into the next table, which holds the larger offsets.

The last table holds 8-byte offset entries; it will be empty if the packfile is smaller than 2 GiB.
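The interplay between the two offset tables can be sketched in Python as follows. The function names and sample values are made up for illustration, and the entries are assumed to be big-endian.

```python
import struct

def resolve_offset(entry, large_offsets):
    """Turn one 4-byte table entry into an actual pack offset."""
    if entry & 0x80000000:                        # high bit set: indirect entry
        return large_offsets[entry & 0x7FFFFFFF]  # low 31 bits index the 8-byte table
    return entry                                  # otherwise the entry is the offset

def parse_offset_tables(small_raw, large_raw):
    """Decode both raw tables and return the resolved offsets in order."""
    small = struct.unpack(">%dI" % (len(small_raw) // 4), small_raw)
    large = struct.unpack(">%dQ" % (len(large_raw) // 8), large_raw)
    return [resolve_offset(entry, large) for entry in small]

# Hypothetical tables: one object at offset 12, one past the 2 GiB mark.
small_raw = struct.pack(">II", 12, 0x80000000 | 0)
large_raw = struct.pack(">Q", 2**31 + 100)
offsets = parse_offset_tables(small_raw, large_raw)
```

Keeping the common case at 4 bytes per object means the index only pays for 8-byte offsets when the pack actually needs them.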

Finally, there is a 20-byte checksum of the packfile and another 20-byte checksum of all of the above data.

All of these tables are used to make sure Git has very quick and efficient access to objects in the repository.

Plumbing

Git will automatically create packfiles when synchronizing a repository (e.g., pushing, pulling, cloning), but they can also be created manually with the git-gc command. Let's assume there are some loose objects in the current repository.

± find .git/objects -type f
.git/objects/f2/e90bed364168fcca0893437fb569d762cdbbce
.git/objects/f4/2946046ed0926d5c7b34772642478390a696c9
.git/objects/87/713bb957eef1ed6a8d12f36b2d8b328a72b453
.git/objects/8c/d57af30ad9bf0f2e0640d0141eb908d276d2f1
.git/objects/1f/846d4278f5741d33111d28c03d29b589dabffe
.git/objects/be/020e47fadb8d80281259b1f886c3940dc51a19
.git/objects/d1/2254d273712af99e0585e7dd9dfea2106d5692
.git/objects/ea/41dba10b54a794284e0be009a11f0ff3716a28
.git/objects/98/c37b0fb33a8b2f7ac4c5d94571382071ae859c
.git/objects/4d/5fcadc293a348e88f777dc0920f11e7d71441c
.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
± git gc
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (11/11), done.
Total 11 (delta 0), reused 0 (delta 0)
± find .git/objects -type f
.git/objects/info/packs
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack

We started with 11 objects in the loose format, ran git-gc, and are left with a single packfile.

The output of git-gc tells us how many objects were packed (11), how many of them were stored as deltas (in this case, 0), and how many objects and deltas were reused from an existing pack (both 0 in this example).
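If we wanted to pick those numbers out of the summary line in a script, a small regular expression would do. This Python sketch only assumes the line looks like the ones shown in this post; parse_total and the dictionary keys are names invented here.

```python
import re

TOTAL_RE = re.compile(
    r"Total (\d+) \(delta (\d+)\), reused (\d+) \(delta (\d+)\)")

def parse_total(line):
    """Extract the four counters from a 'Total ...' summary line, or None."""
    match = TOTAL_RE.search(line)
    if match is None:
        return None
    total, deltas, reused, reused_deltas = map(int, match.groups())
    return {
        "objects": total,                 # objects written to the pack
        "deltas": deltas,                 # of those, stored as deltas
        "reused": reused,                 # objects copied from an existing pack
        "reused_deltas": reused_deltas,   # deltas copied from an existing pack
    }

summary = parse_total("Total 11 (delta 0), reused 0 (delta 0)")
```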

Of course, we can also examine the packfile with the git-verify-pack command:

± git verify-pack -v .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 12
d12254d273712af99e0585e7dd9dfea2106d5692 commit 220 145 165
98c37b0fb33a8b2f7ac4c5d94571382071ae859c commit 172 117 310
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob   0 9 427
be020e47fadb8d80281259b1f886c3940dc51a19 blob   9 18 436
f42946046ed0926d5c7b34772642478390a696c9 tree   93 81 454
87713bb957eef1ed6a8d12f36b2d8b328a72b453 tree   31 40 535
8cd57af30ad9bf0f2e0640d0141eb908d276d2f1 tree   31 40 575
1f846d4278f5741d33111d28c03d29b589dabffe tree   31 42 615
ea41dba10b54a794284e0be009a11f0ff3716a28 tree   62 50 657
4d5fcadc293a348e88f777dc0920f11e7d71441c tree   31 42 707
non delta: 11 objects
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack: ok

It does not matter whether the .pack or the .idx file is passed to the git-verify-pack command; the output will be the same. However, tab completion will prefer the .idx files.

This output carries a lot of information. First, it lists all the objects in the packfile; we see our 11 original objects from before. We are also given each object's type, size, size in the pack, and offset into the packfile, respectively. For undeltified objects, the two sizes won't be very different, but for deltified objects they can differ significantly.
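Since the columns are whitespace-separated, the listing is easy to process further. The Python sketch below parses a few lines copied from the output above into records; PackEntry and parse_verify_pack are illustrative names, and the parser simply skips any line that does not start with a 40-character object name.

```python
from collections import namedtuple

PackEntry = namedtuple("PackEntry", "name type size size_in_pack offset")

def parse_verify_pack(text):
    """Parse object lines from `git verify-pack -v` output into PackEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Object lines have at least five columns and begin with a 40-char name;
        # summary lines such as "non delta: 11 objects" are skipped.
        if len(fields) < 5 or len(fields[0]) != 40:
            continue
        size, size_in_pack, offset = map(int, fields[2:5])
        entries.append(PackEntry(fields[0], fields[1], size, size_in_pack, offset))
    return entries

sample = """\
f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 12
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob   0 9 427
non delta: 11 objects
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack: ok
"""
entries = parse_verify_pack(sample)
```

From here it is a short step to, say, summing size_in_pack per object type to see where the space in a pack goes.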

This output also tells us the pack contains no deltified objects. Let's see what this would look like with deltified objects:

± git gc
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (17/17), done.
Total 17 (delta 1), reused 10 (delta 0)
± git verify-pack -v .git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.idx
47f24ac6ba3af12714f0dbf7219b9d854f269097 commit 219 146 12
8cfd10e321ac6349132ceb93774f0a881a1b9316 commit 219 146 158
f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 304
d12254d273712af99e0585e7dd9dfea2106d5692 commit 220 145 457
98c37b0fb33a8b2f7ac4c5d94571382071ae859c commit 172 117 602
5716ca5987cbf97d6bb54920bea6adde242d87e6 blob   4 13 719
be020e47fadb8d80281259b1f886c3940dc51a19 blob   9 18 732
257cc5642cb1a054f08cc83f2d943e56fd3ebe99 blob   4 13 750
3783c58c8b17ba95b2917e5f92a0395efcec9759 tree   93 100 763
87713bb957eef1ed6a8d12f36b2d8b328a72b453 tree   31 40 863
8cd57af30ad9bf0f2e0640d0141eb908d276d2f1 tree   31 40 903
1f846d4278f5741d33111d28c03d29b589dabffe tree   31 42 943
7470c9c852271284dfb0cb8f3ad9047709847e0d tree   93 101 985
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob   0 9 1086
f42946046ed0926d5c7b34772642478390a696c9 tree   25 37 1095 1 7470c9c852271284dfb0cb8f3ad9047709847e0d
ea41dba10b54a794284e0be009a11f0ff3716a28 tree   62 50 1132
4d5fcadc293a348e88f777dc0920f11e7d71441c tree   31 42 1182
non delta: 16 objects
chain length = 1: 1 object
.git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.pack: ok
± find .git/objects -type f
.git/objects/info/packs
.git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.idx
.git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.pack

Notice that we repacked the repository and then listed the contents of the new pack. Also notice that the old pack is gone, but the objects that were in it are still available in the new pack.

More importantly, notice that f42946 is a deltified object based on 7470c9c. That is, the tree named f42946 is derived by patching 7470c9c with the delta stored in the packfile. This is also evident in the size listings: the same tree was listed at 93 bytes (81 in the pack) in the previous pack, while here the listed size is 25 bytes and the size in the pack is 37. The in-pack size is larger than the listed size because, unfortunately, compression often cannot shrink such small inputs, and each pack entry carries some overhead of its own. This is our first look at what Git calls "chains".

Chains describe how many deltas must be applied in sequence to rebuild an object from an undeltified base. The longest chain in this repository has a length of only 1, but in bigger repositories this number can be much higher. Git itself, for example, has one object with a chain length of 46 and another 6 objects with a chain length of 44 each.
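Chain length can be thought of as the number of hops from an object to its undeltified base. Here is a Python sketch of that view, using the one delta relationship from the listing above; chain_length and the base_of mapping are names invented for this example.

```python
from collections import Counter

def chain_length(name, base_of):
    """Count how many deltas separate an object from an undeltified base."""
    depth = 0
    while name in base_of:          # follow base references until none is left
        name = base_of[name]
        depth += 1
    return depth

# The single delta in our pack: f42946... is stored as a patch against 7470c9c...
base_of = {
    "f42946046ed0926d5c7b34772642478390a696c9":
        "7470c9c852271284dfb0cb8f3ad9047709847e0d",
}
names = [
    "f42946046ed0926d5c7b34772642478390a696c9",
    "7470c9c852271284dfb0cb8f3ad9047709847e0d",
    "ea41dba10b54a794284e0be009a11f0ff3716a28",
]
histogram = Counter(chain_length(name, base_of) for name in names)
```

The histogram mirrors the "chain length = N: M objects" summary lines, restricted here to the three names we listed.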

Another thing to note: unlike with the loose object format, it takes some effort to get at the contents of an object using only the packfile. However, git-cat-file and other plumbing commands still work as expected given an object name, even if the object is contained within a packfile.

Summary

Hopefully, we now have a deeper knowledge of the compact object format Git uses, namely, packfiles. Remember, the primary motivation for these files was not efficiency in storage, but efficiency in network bandwidth when transferring objects and in lookup speed when there is a large number of loose objects. Thus, if working in stealth mode, it can sometimes be important to run git-gc occasionally to keep your private repository quick and efficient.