2023-06-20

Git 2.41 release - Here are five of our contributions in detail

Find out how GitLab's Git team helped improve the latest version of Git.


Git 2.41 was officially released on June 1, 2023, and included some improvements
from GitLab's Git team. Git is the foundation of
repository data at GitLab. GitLab's Git team works on everything from new
features and performance improvements to documentation improvements and growing the Git
community. Often our contributions to Git have a lot to do with the way we integrate Git into
our services at GitLab. Here are some highlights from this latest Git release,
and a window into how we use Git on the server side at GitLab.

1. Machine-parseable fetch output

When git-fetch is run, the output is familiar to users of Git and looks
something like this:

> git fetch
remote: Enumerating objects: 296, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 296 (delta 132), reused 84 (delta 84), pack-reused 107
Receiving objects: 100% (296/296), 184.46 KiB | 11.53 MiB/s, done.
Resolving deltas: 100% (173/173), completed with 42 local objects.
From https://gitlab.com/gitlab-org/gitaly
   cfd146b4d..a69cf20ce  master                                                                             -> origin/master
   3a877b8f3..854f25045  15-11-stable                                                                       -> origin/15-11-stable
 * [new branch]          5316-check-metrics-and-decide-if-need-to-context-cancel-the-running-git-process-in -> origin/5316-check-metrics-and-decide-if-need-to-context-cancel-the-running-git-process-in
 + bdd3c05a2...0bcf6f9d4 blanet_default_branch_opt                                                          -> origin/blanet_default_branch_opt  (forced update)
 * [new branch]          jt-object-pool-disconnect-refactor                                                 -> origin/jt-object-pool-disconnect-refactor
 + f2447981c...34e06e106 jt-replicate-repository-alternates                                                 -> origin/jt-replicate-repository-alternates  (forced update)
 * [new branch]          kn-logrus-update                                                                   -> origin/kn-logrus-update
 + 05cea76f3...258543674 kn-smarthttp-docs                                                                  -> origin/kn-smarthttp-docs  (forced update)
 * [new branch]          pks-git-pseudorevision-validation                                                  -> origin/pks-git-pseudorevision-validation
 + 2e8d0ccd5...bf4ed8a52 pks-storage-repository                                                             -> origin/pks-storage-repository  (forced update)
 * [new branch]          qmnguyen0711/expose-another-port-for-pack-rpcs                                     -> origin/qmnguyen0711/expose-another-port-for-pack-rpcs
 + 82473046f...8e23e474c use_head_reference

The problem with this output is that it's not meant for machines to parse.

But why would it be useful to make this output parseable by machines? To understand
this, we need to back up a little bit and talk about Gitaly Cluster. Gitaly Cluster
is a service at GitLab that provides high availability of Git repositories by
replicating repository writes to replica nodes. Each time a write comes in that
changes a Git repository (for example, a push that updates a reference), the write goes to
the primary node and to all replica nodes before it can succeed. A
voting mechanism takes place in which the nodes vote on what their updated
value for the reference would be. The vote succeeds when a quorum of replica
nodes have successfully written the ref, and only then does the write succeed.
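The quorum rule described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Gitaly's implementation: the function name `vote_outcome`, the node names, and the idea that each node votes with a short string describing the new reference value are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def vote_outcome(votes: Dict[str, str], quorum: int) -> Optional[str]:
    """Return the value that reached quorum, or None if no value did.

    `votes` maps a (hypothetical) node name to the value that node
    voted for, for example the object ID it wants the reference to have.
    """
    if not votes:
        return None
    # Find the most frequently voted value and how many nodes chose it.
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if count >= quorum else None


# Three nodes, quorum of two: two nodes agree, so the write can proceed.
votes = {
    "gitaly-1": "a69cf20ce",
    "gitaly-2": "a69cf20ce",
    "gitaly-3": "cfd146b4d",
}
print(vote_outcome(votes, quorum=2))
```

If every node voted for a different value, the function would return `None` and the write would be rejected.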

One of our remote procedure calls (RPCs) in Gitaly runs git-fetch(1) for repository mirroring. By
default, when git-fetch(1) is run, it will update any references that can be
fast-forwarded; any reference that has since diverged will not
be updated.

As mentioned above, whenever there is an operation that modifies a repository, there
is a voting mechanism that ensures the same modification is made to all replica nodes.
To dive in even a little deeper, our voting mechanism leverages Git's reference transaction hook,
which runs an executable once per reference transaction. git-fetch(1) by default will
start a reference transaction per reference it updates. A fetch that updates hundreds or
even thousands of references would thus vote once per reference that gets updated.
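To make the per-transaction vote concrete, here is a minimal Python sketch of what a reference-transaction hook sees. Git passes the hook one line per queued update on standard input, in the form `<old-value> <new-value> <ref-name>`. The function names `parse_hook_input` and `vote_digest` are hypothetical, and hashing the updates with SHA-256 is an assumption made for illustration rather than a description of how Gitaly computes its votes.

```python
import hashlib
from typing import Iterable, List, Tuple

Update = Tuple[str, str, str]  # (old value, new value, reference name)


def parse_hook_input(lines: Iterable[str]) -> List[Update]:
    """Parse the lines a reference-transaction hook reads from stdin."""
    updates = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        old, new, ref = line.split(" ", 2)
        updates.append((old, new, ref))
    return updates


def vote_digest(updates: List[Update]) -> str:
    """Reduce one transaction's updates to a single value to vote on."""
    h = hashlib.sha256()
    for old, new, ref in updates:
        h.update(f"{old} {new} {ref}\n".encode())
    return h.hexdigest()


ZERO = "0" * 40  # all-zero object ID, used when a reference is created
lines = [
    f"{ZERO} {'a' * 40} refs/heads/branch1\n",
    f"{ZERO} {'b' * 40} refs/heads/branch2\n",
    f"{ZERO} {'c' * 40} refs/heads/branch3\n",
]

# One transaction per reference: the hook runs three times, so three votes.
per_ref_votes = [vote_digest(parse_hook_input([line])) for line in lines]
# One transaction for all references: the hook sees all lines at once.
batched_votes = [vote_digest(parse_hook_input(lines))]
print(len(per_ref_votes), len(batched_votes))
```

In a real hook script, the transaction state would arrive as the first command-line argument and the lines would be read from `sys.stdin`.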

In the following sequence diagram, we are only showing one Gitaly node, but for a Gitaly Cluster
with, let's say, three nodes, what happens with the Gitaly primary also happens in
the replicas.

sequenceDiagram
    actor user
    participant Gitlab UI
    participant p as Praefect
    participant g0 as Gitaly (primary)
    participant git as Git
    user->>Gitlab UI: mirror my repository
    Gitlab UI->>p: FetchRemote
    activate p
    p->>g0: FetchRemote
    activate git
    g0->>git: fetch-remote
    git->>g0: vote on refs/heads/branch1 update
    g0->>p: vote on refs/heads/branch1 update
    git->>g0: vote on refs/heads/branch2 update
    g0->>p: vote on refs/heads/branch2 update
    git->>g0: vote on refs/heads/branch3 update
    g0->>p: vote on refs/heads/branch3 update
    deactivate git
    note over p: vote succeeds
    p->>Gitlab UI: success
    deactivate p

This is inefficient. Ideally we would want to vote once per batch of references
updated from one git-fetch(1) call. There is an option --atomic in
git-fetch(1) that will open one reference transaction for all references
updated by git-fetch(1). However, when --atomic is used, a git-fetch call
will fail if any references have since diverged. This is not how we want
repository mirroring to work. We actually want git-fetch to update whichever
refs it can.

So, that means we cannot use the --atomic flag and are thus stuck voting per
reference we update.
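The trade-off between the two modes can be illustrated with a toy model. It does not call Git at all: `simulate_fetch` is a hypothetical helper that only mirrors the behavior described above, where the default mode commits one transaction per updated reference and `--atomic` commits one transaction for everything or nothing at all.

```python
from typing import Dict, List, Tuple


def simulate_fetch(can_fast_forward: Dict[str, bool], atomic: bool) -> Tuple[List[str], int]:
    """Return (references updated, reference transactions committed)."""
    if atomic:
        # All-or-nothing: a single diverged reference fails the whole fetch.
        if all(can_fast_forward.values()):
            return list(can_fast_forward), 1
        return [], 0
    # Default mode: update what can be fast-forwarded, one transaction each.
    updated = [ref for ref, ok in can_fast_forward.items() if ok]
    return updated, len(updated)


refs = {
    "refs/heads/branch1": True,
    "refs/heads/branch2": True,
    "refs/heads/branch3": False,  # diverged, cannot be fast-forwarded
}
print(simulate_fetch(refs, atomic=False))
print(simulate_fetch(refs, atomic=True))
```

With one diverged branch, the default mode still updates two references at the cost of two transactions (and two votes), while the atomic mode updates nothing.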

Solution: Handle the reference update ourselves

The way we are solving this inefficiency is to handle the reference update
ourselves. Instead of relying on git-fetch(1) to both fetch the objects and
update all the references, we can use the --dry-run option of git-fetch(1)
to first fetch the objects into a quarantine directory. Then, if we know
which references would be updated, we can start a reference transaction
ourselves with git-update-ref(1) and update all the refs in one transaction,
thus triggering only a single vote.
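As a rough sketch of the second half of this approach, the snippet below builds the instructions that `git update-ref --stdin` accepts for a single explicit transaction (`start`, one `update` line per reference, then `prepare` and `commit`). The helper name `build_update_ref_script` and the object IDs are hypothetical, and this is not Gitaly's actual code.

```python
from typing import Iterable, Tuple


def build_update_ref_script(updates: Iterable[Tuple[str, str, str]]) -> str:
    """Build stdin for `git update-ref --stdin` as one transaction.

    Each update is (reference name, new object ID, expected old object ID).
    """
    lines = ["start"]
    for ref, new_oid, old_oid in updates:
        # Passing the old ID makes Git verify the reference has not moved.
        lines.append(f"update {ref} {new_oid} {old_oid}")
    lines += ["prepare", "commit"]
    return "\n".join(lines) + "\n"


ZERO = "0" * 40  # all-zero object ID: the reference must not exist yet
script = build_update_ref_script([
    ("refs/heads/branch1", "a" * 40, ZERO),
    ("refs/heads/branch2", "b" * 40, ZERO),
    ("refs/heads/branch3", "c" * 40, ZERO),
])
print(script, end="")
```

Inside a repository, the resulting text could be passed to Git with something like `subprocess.run(["git", "update-ref", "--stdin"], input=script, text=True, check=True)`. Because all three updates belong to one transaction, the reference-transaction hook would see them together.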

sequenceDiagram
    actor user
    participant Gitlab UI
    participant p as Praefect
    participant g0 as Gitaly (primary)
    participant git as Git
    user->>Gitlab UI: mirror my repository
    Gitlab UI->>p: FetchRemote
    activate p
    p->>g0: FetchRemote
    g0->>git: fetch-remote --dry-run --porcelain
    activate git
    note over git: objects are fetched into a quarantine directory
    git->>g0: branch1, branch2, branch3 will be updated
    deactivate git
    g0->>git: update-ref
    activate git
    note over git: update branch1, branch2, branch3 in a single transaction
    git->>g0: reference transaction hook
    deactivate git
    g0->>p: vote on ref updates
    note over p: vote succeeds
    p->>Gitlab UI: success
    deactivate p

A requirement for this, however, is that we are able to parse the output of
git-fetch(1) to tell which refs will be updated and to what values. Until now,
the --dry-run output of git-fetch(1) could not be parsed by a machine.

Patrick Steinhardt, Staff Backend Engineer, Gitaly, added a --porcelain [option to git-fetch](https://git-scm.com/docs/git-fetch#Documentation/git-fetch.txt
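A minimal sketch of consuming that output follows. It assumes the line format given in the git-fetch documentation for `--porcelain`, `<flag> <old-object-id> <new-object-id> <local-reference>`, where the flag is a single character such as a space for a fast-forward, `*` for a new reference, or `+` for a forced update. The `RefUpdate` type, the `parse_porcelain` function, and the object IDs in the sample are made up for illustration.

```python
from typing import List, NamedTuple


class RefUpdate(NamedTuple):
    flag: str
    old_oid: str
    new_oid: str
    ref: str


def parse_porcelain(output: str) -> List[RefUpdate]:
    """Parse `git fetch --porcelain` output into structured updates."""
    updates = []
    for line in output.splitlines():
        if len(line) < 3:
            continue
        # The flag is the first character (it may itself be a space),
        # followed by one separating space and then three fields.
        flag, rest = line[0], line[2:]
        old_oid, new_oid, ref = rest.split(" ", 2)
        updates.append(RefUpdate(flag, old_oid, new_oid, ref))
    return updates


ZERO = "0" * 40  # placeholder old ID for a reference that does not exist yet
sample = (
    f"  {'1' * 40} {'2' * 40} refs/remotes/origin/master\n"
    f"* {ZERO} {'3' * 40} refs/remotes/origin/kn-logrus-update\n"
    f"+ {'4' * 40} {'5' * 40} refs/remotes/origin/kn-smarthttp-docs\n"
)
for update in parse_porcelain(sample):
    print(repr(update.flag), update.ref)
```

Each parsed entry carries exactly what a later git-update-ref(1) call needs: the reference name plus its old and new values.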
