Granularity of (Git) Commits

Introduction
Single Story -> Single Commit
Discrete Tasks -> Multiple Atomic Commits
Examples
Discussion

There are many blog posts which extol the virtues of writing good, communicative commit messages. These same posts, however, tend to ignore the question of granularity. In this post, I attempt to answer that question. The answer is, of course, it depends.

Introduction

Regardless of our position on "good" or "communicative," we are often left with questions about the commits themselves. How frequently should we commit? What should be included in the commit? What should be excluded from the commit? What should the operational state of the entire code base be at each commit?

What we're asking about is the granularity of commits: how many commits constitute a story, a feature, or a bug fix? There are two schools of thought on the question, although I'm not sure the division has ever been made so explicit. On one side, the discretization of work leads naturally into a series of commits. On the other, the task is singular; therefore, only one commit, seldom more, should land in master.

While both of these answers have tradeoffs, what you, as a developer, do depends entirely on your team. As with good commit messages, if your team has a policy of a single commit per task/story/feature/bug/whatever, be pragmatic and stick with it. Understand why the team arrived at that answer before suggesting the other approach. Conversely, if the team wants discrete chunks of work committed separately, so that a series of commits combines to complete a task, go with it. However, if your team doesn't have a consistent choice, that is what should be fixed.

Single Story -> Single Commit

Using a single commit per task brings a lot of simplicity, but the commits themselves will be large. Integrating the single commit is relatively simple. Resolving merge conflicts is simpler since only the end state has to be considered. Reviewers only have to test the operational build state of the single commit when reviewing or merging. Comprehending the change and its rationale requires reading only a single commit message. In short, single commits buy simplicity.

However, the changes contained in such a commit will, on average, be substantial and far reaching. It may be difficult for reviewers to separate the signal from the noise. Suppose a bisection identifies the single commit as the cause of a bug and, to resolve the issue rapidly, a revert is issued. The whole commit goes, including any refactorings that improved the system. Worse, later work may already depend on those refactorings, causing more issues.

Arguments can be made that "refactoring" constitutes its own task, and therefore is in its own commit. Unfortunately, this is not always the case. The history of code is far messier than even I wish to admit.
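To make the revert trade-off concrete, here is a minimal sketch that models a commit as a set of logical changes. Everything in it is hypothetical and only for illustration: the `Commit` class, the `revert` and `apply_all` helpers, and the change labels are assumptions, not anything git itself provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    message: str
    changes: frozenset  # labels of the logical changes this commit introduces


def apply_all(commits):
    """Return the set of logical changes present after applying every commit."""
    applied = frozenset()
    for commit in commits:
        applied |= commit.changes
    return applied


def revert(applied, commit):
    """Return the changes left after undoing everything `commit` introduced."""
    return applied - commit.changes


# Hypothetical change labels: one refactoring and one feature that has a bug.
refactor = "extract connection helper"
feature = "add socket support (contains the bug)"

# The same work, recorded as one large commit or as two atomic commits.
squashed = [Commit("story: socket support", frozenset({refactor, feature}))]
atomic = [
    Commit("extract connection helper", frozenset({refactor})),
    Commit("add socket support", frozenset({feature})),
]

# Reverting the buggy commit in each history.
after_squashed_revert = revert(apply_all(squashed), squashed[0])
after_atomic_revert = revert(apply_all(atomic), atomic[1])

print(sorted(after_squashed_revert))  # what survives the single-commit revert
print(sorted(after_atomic_revert))    # what survives the atomic revert
```

In this toy model, reverting the single large commit leaves nothing behind, while reverting only the buggy atomic commit keeps the refactoring in place.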

Discrete Tasks -> Multiple Atomic Commits

Instead of using a single commit which contains every change made to accomplish the task, we can use a separate commit for each discrete change. This reduces the size of each commit, but incurs a little complexity in the reviewing and comprehending phases of development. Stringent adherence to atomic commits allows git blame to provide better contextual information when attempting to understand any one specific line of code. By comparison, changes made in one large commit easily become lost in the noise and lose context.

Using atomic commits, however, can make integration particularly difficult when using rebase. For example, a series of atomic commits is made on a topic branch and is ready for merging into master. During the development of this topic branch, other changes were merged into master. Easy: fetch origin/master, merge origin/master into master, and rebase the topic branch onto master. As fate would have it, though, there are conflicts. Worse, the conflict is in one of the first commits, and the resolution needs to be propagated through the later commits for the series to remain sensible. This is not easy. Aside from enabling rerere and setting merge.conflictstyle to diff3, I am personally unaware of a repeatable, universal way to achieve reliable results when rebasing with conflicts.

The difficulty is that each commit must remain atomic and make the changes it states. However, in the middle of the rebase, it is difficult to remember what the change should be for the commit at hand. We easily remember the end result, but using the end result may create empty commits or, worse, more conflicts.

If you, dear reader, are aware of a better approach, please share.

Atomic commits afford more granularity in case a change needs to be reverted, although finding that one commit may be more difficult since there are generally more commits to bisect. This leads to the next issue: each atomic commit MUST be stable and in working order, e.g., the project is buildable without error, tests SHALL pass, etc. Ensuring this is similarly not easy.
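How much harder does the search get with more commits? One rough way to reason about it is to model bisection as a binary search over a linear history. The sketch below is a simplified model, not git's actual implementation: the `first_bad_commit` function and the history sizes (100 and 400 commits) are hypothetical, and it assumes a linear history in which every commit can be built and tested, which is exactly the stability requirement described above.

```python
def first_bad_commit(history, is_bad):
    """Binary-search a linear history for the first bad commit.

    Assumes history[0] is known good, history[-1] is known bad, and every
    commit after the first bad one is also bad.
    Returns (index of the first bad commit, number of commits tested).
    """
    lo, hi = 0, len(history) - 1  # lo is known good, hi is known bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1  # one build-and-test run at the midpoint
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid
    return hi, tested


# Hypothetical histories: the same work recorded as 100 larger commits
# or as 400 atomic ones, with the bug introduced at a made-up position.
small_history = list(range(100))
large_history = list(range(400))

idx_small, steps_small = first_bad_commit(small_history, lambda c: c >= 70)
idx_large, steps_large = first_bad_commit(large_history, lambda c: c >= 280)

print("larger commits:", idx_small, "found after", steps_small, "test runs")
print("atomic commits:", idx_large, "found after", steps_large, "test runs")
```

Under these assumptions the number of test runs grows with the logarithm of the history length, so quadrupling the number of commits adds only about two runs. The extra cost lies less in the search itself than in keeping every commit testable.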

While using a series of commits to communicate a single story can be complex, it does afford flexibility not possible with the former approach. Specifically, pull requests consisting of multiple commits can be partially accepted. GitHub and other source forges do not have tooling for this in their web UIs, but it is certainly possible and is done in larger projects, e.g., Linux.

Examples

Let's examine some pull requests which demonstrate both approaches. I'm going to point to pull requests I have submitted since I don't want to cast any unintentional judgment on someone else's work. They are both for the same project and they are essentially the same set of changes. The first puts all of the changes into a single commit. It tells a larger story about enabling the use of UNIX sockets for a local connection. There are several discrete changes required to get there, but it's all for the larger goal of connecting to a local machine's music player daemon via the AF_UNIX connection type.

Repeated here are the single commit message and summary diff of the first pull request:

libmpdel: enable use of local UNIX sockets for MPD

By using `make-network-process` we can tune the parameters for either
network connections or local UNIX socket connections.

As an added bonus, connections are very fast.  In my limited testing,
`make-network-process` tends to be dramatically faster than the
`open-network-stream` function.  Of course, without passing `:family
'ipv4` to `make-network-process`, the two functions exhibit similar
performance, so it's not as simple as "one is written in C and the other
is written in ELisp."

Deprecate/obsolete the `libmpdel-port` variable in favor of
`libmpdel-service` which semantically makes more sense now that libmpdel
can connect to either a port or a socket.

`libmpdel-profiles` also work with the new local socket connection
behaviour.

Although I don't use `customize`, the new variable (`libmpdel-service`)
and profiles (`libmpdel-profiles`) should be accessible via `customize`
just the same.

1 file changed, 44 insertions(+), 10 deletions(-)
libmpdel.el | 54 ++++++++++++++++++++++++++++++++++++++++++++----------

The second pull request is only slightly different from the first because changes had landed in the upstream master branch that did some of the legwork the first commit proposed. However, at the discretion of the project owner and maintainer, the first large commit was broken down into its component commits. Each commit tells the story of its own change, culminating in the final commit, which closes the larger story of enabling the UNIX socket connection type.

Repeated here is the series of commit messages and summary diffs:

add custom variable `libmpdel-family`

This can be used to switch between IPv4 and IPv6 addresses.

1 file changed, 7 insertions(+)
libmpdel.el | 7 +++++++

Always use `make-network-process`

When making a connection to MPD, whether via local UNIX socket, or TCP
stream, `make-network-process` tends to be faster at establishing the
connection.  In my limited testing, `make-network-process` tends to be
dramatically faster than the `open-network-stream` function.  Of course,
without passing `:family 'ipv4` to `make-network-process`, the two
functions exhibit similar performance, so it's not as simple as "one is
written in C and the other is written in ELisp."

1 file changed, 7 insertions(+), 5 deletions(-)
libmpdel.el | 12 +++++++-----

Add missing element to `limbpdel-profiles` docstring

1 file changed, 1 insertion(+), 1 deletion(-)
libmpdel.el | 2 +-

add libmpdel-family to `libmpdel-profiles`

Add ability to specify the address family for connection profiles.

1 file changed, 7 insertions(+), 4 deletions(-)
libmpdel.el | 11 +++++++----

Clearly, each of the previous four commits is smaller and more focused than the single, all-encompassing commit. Each commit provides focused commentary on the specific lines being changed. Story context is missing in this case, since this is more a personal issue than a product with a board of story cards, or what have you, but it's easy to add that context to the bottom of the commit message so that others reading it can see a more explicit motivation for the smaller commit.

Discussion

It's rather easy to argue for "small, atomic" commits, but I sense we have failed to clearly articulate what we mean by "small". How small is "small"? What are the contours of our definition of "small"? As a result, we have converged on two camps: one says a single story is "small", the other says the individual, discretized changes are "small".

A third (and perhaps fourth) camp, which is out of scope for this discussion, prefers commits which follow the messy history of how everything came together. Not every commit is buildable, tested, etc.; each is merely a step toward the completion of some task: the first draft, if you will. In previous discussions, I've referred to this style as how the changes "actually" came into existence. Here, by contrast, we are talking about commits which are edited to tell the story of the changes: how the changes ought to have happened, if we were perfect and knew everything.

I'm not going to argue either way which strategy is better. As mentioned before, this is a team and/or personal choice. It's a protocol for communicating changes in software, and it needs to be decided in order to collaborate effectively. However, I hope I have provided sufficient context for you and your team to make a decision about which strategy to pursue.