How would Git handle a SHA-1 collision on a blob?


Git Problem Overview


This has probably never happened in the real world yet, and may never happen, but consider this: say you have a Git repository, make a commit, and get very, very unlucky: one of the blobs ends up having the same SHA-1 as another that is already in your repository. How would Git handle this? Simply fail? Find a way to link the two blobs and check which one is needed according to the context?

More a brain-teaser than an actual problem, but I found the issue interesting.

Git Solutions


Solution 1 - Git

I did an experiment to find out exactly how Git would behave in this case. This is with version 2.7.0~rc0+next.20151210 (Debian version). I basically just reduced the hash size from 160 bits to 4 bits by applying the following diff and rebuilding Git:

--- git-2.7.0~rc0+next.20151210.orig/block-sha1/sha1.c
+++ git-2.7.0~rc0+next.20151210/block-sha1/sha1.c
@@ -246,6 +246,8 @@ void blk_SHA1_Final(unsigned char hashou
    blk_SHA1_Update(ctx, padlen, 8);

    /* Output hash */
-   for (i = 0; i < 5; i++)
-       put_be32(hashout + i * 4, ctx->H[i]);
+   for (i = 0; i < 1; i++)
+       put_be32(hashout + i * 4, (ctx->H[i] & 0xf000000));
+   for (i = 1; i < 5; i++)
+       put_be32(hashout + i * 4, 0);
 }
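
To get a feel for how quickly collisions show up with such a patched build, here is a minimal Python sketch. It is not Git's code: `truncated_id` and `first_collision` are hypothetical helper names, and the mask is copied literally from the diff above (0xf000000, which keeps four bits of the first 32-bit word). With only 16 possible IDs, any 17 distinct objects must contain a collision.

```python
import hashlib

# Mask copied literally from the patch (0xf000000): it keeps 4 bits of the
# first 32-bit word of the digest; every other bit is forced to zero.
MASK = 0x0F000000


def truncated_id(obj_type: str, body: bytes) -> str:
    """Hash a Git-style object (type, size, NUL, body) and truncate it."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    digest = hashlib.sha1(header + body).digest()
    first_word = int.from_bytes(digest[:4], "big") & MASK
    # 4 masked bytes followed by 16 zero bytes, printed as 40 hex digits.
    return first_word.to_bytes(4, "big").hex() + "00" * 16


def first_collision(bodies):
    """Return (earlier_body, later_body, object_id) for the first clash, or None."""
    seen = {}
    for body in bodies:
        oid = truncated_id("blob", body)
        if oid in seen:
            return seen[oid], body, oid
        seen[oid] = body
    return None


if __name__ == "__main__":
    # 17 distinct blobs but only 16 possible IDs: a collision is guaranteed.
    blobs = [f"file {i}\n".encode() for i in range(17)]
    print(first_collision(blobs))
```

This only models the ID computation; what Git then does with the colliding objects is what the experiment below explores.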

Then I did a few commits and noticed the following.

  1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be OK, but when you push, someone clones, or you revert, you will lose the latest version (in line with the "earlier object wins" behavior explained in the other answers).
  2. If a tree object already exists and you make a blob with the same hash, everything will seem normal until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
  3. If a commit object already exists and you make a blob with the same hash, the result is the same as #2: the repo is corrupt.
  4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  5. If a blob already exists and you make a tree object with the same hash, it will fail when creating the commit.
  6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
  8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
  9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

For #2 you will typically get an error like this when you run "git push":

error: object 0400000000000000000000000000000000000000 is a tree, not a blob
fatal: bad blob object
error: failed to push some refs to origin

or:

error: unable to read sha1 file of file.txt (0400000000000000000000000000000000000000)

if you delete the file and then run "git checkout file.txt".

For #4 and #6, you will typically get an error like this:

error: Trying to write non-commit object
f000000000000000000000000000000000000000 to branch refs/heads/master
fatal: cannot update HEAD ref

when running "git commit". In this case you can typically just type "git commit" again, since this will create a new hash (because of the changed timestamp).

For #5 and #9, you will typically get an error like this:

fatal: 1000000000000000000000000000000000000000 is not a valid 'tree' object

when running "git commit"

If someone tries to clone your corrupt repository, they will typically see something like:

git clone (one repo with collided blob,
d000000000000000000000000000000000000000 is commit,
f000000000000000000000000000000000000000 is tree)

Cloning into 'clonedversion'...
done.
error: unable to read sha1 file of s (d000000000000000000000000000000000000000)
error: unable to read sha1 file of tullebukk
(f000000000000000000000000000000000000000)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

What "worries" me is that in two cases (2 and 3) the repository becomes corrupt without any warning, and in three cases (1, 7 and 8) everything seems OK, but the repository content differs from what you expect it to be. People cloning or pulling will get content different from what you have. Cases 4, 5, 6 and 9 are OK, since Git stops with an error. I suppose it would be better if Git failed with an error in all cases.

Solution 2 - Git

Original answer (2012) (see shattered.io 2017 SHA1 collision below)

That old (2006) answer from Linus might still be relevant:

> Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

> So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

> However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository.
So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

> So you have two cases of collision:

> - the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1.
At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the old object.
You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something completely different than what you expected.
So you would generally notice this kind of collision fairly quickly.
In related news, the question is what to do about the inadvertent collision..
First off, let me remind people that the inadvertent kind of collision is really really really damn unlikely, so we'll quite likely never ever see it in the full history of the universe.
But if it happens, it's not the end of the world: what you'd most likely have to do is just change the file that collided slightly, and just force a new commit with the changed contents (add a comment saying "/* This line added to avoid collision */") and then teach git about the magic SHA1 that has been shown to be dangerous.
So over a couple of million years, maybe we'll have to add one or two "poisoned" SHA1 values to git. It's very unlikely to be a maintenance problem ;)

> - The attacker kind of collision because somebody broke (or brute-forced) SHA1.
This one is clearly a lot more likely than the inadvertent kind, but by definition it's always a "remote" repository. If the attacker had access to the local repository, he'd have much easier ways to screw you up.
So in this case, the collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's literally no different from the attacker just not having found a collision at all, but just using the object you already had (ie it's 100% equivalent to the "trivial" collision of the identical file generating the same SHA1).

The question of using SHA-256 is regularly raised, but has not been acted upon for now (2012).
Note: starting 2018 and Git 2.19, the code is being refactored to use SHA-256.


Note (Humor): you can force a commit to a particular SHA1 prefix, with the project gitbrute from Brad Fitzpatrick (bradfitz).

> gitbrute brute-forces a pair of author+committer timestamps such that the resulting git commit has your desired prefix.

Example: https://github.com/bradfitz/deadbeef


Daniel Dinnyes points out in the comments the Pro Git section "7.1 Git Tools - Revision Selection", which includes:

> A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.


Even though shattered.io more recently (February 2017) demonstrated the possibility of forging a SHA1 collision
(see much more in my separate answer, including Linus Torvalds' Google+ post), that attack:

  • a/ still requires over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power of 6,500 years of single-CPU computations and 110 years of single-GPU computations.
  • b/ would forge one file (with the same SHA1), but with the additional constraint that its content and size together produce the identical SHA1 (a collision on the content alone is not enough; see "How is the git hash calculated?"): a blob SHA1 is computed based on the content and size.
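
To make point b/ concrete, here is a small Python sketch of how a blob ID is derived from both the size and the content. The function names `git_blob_sha1` and `plain_sha1` are mine, chosen for illustration; the header layout ("blob", a space, the decimal size, a NUL byte) is the one `git hash-object` uses for blobs.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 of a blob as Git stores it: header "blob <size>", NUL, content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """SHA-1 of the raw bytes, which is what sha1sum prints."""
    return hashlib.sha1(content).hexdigest()


if __name__ == "__main__":
    data = b"hello world\n"
    print(plain_sha1(data))     # digest of the file content alone
    print(git_blob_sha1(data))  # the ID that `git hash-object` would report
```

Because the header is prepended before hashing, two files that collide under plain SHA-1 do not automatically collide as Git blobs.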

See "Lifetimes of cryptographic hash functions" from Valerie Anita Aurora for more.
In that page, she notes:

> Google spent 6500 CPU years and 110 GPU years to convince everyone we need to stop using SHA-1 for security critical applications.
Also because it was cool

See more in my separate answer below.

Solution 3 - Git

According to Pro Git:

> If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you’ll always get the data of the first object.

So it wouldn't fail, but it wouldn't save your new object either.
I don't know how that would look on the command line, but that would certainly be confusing.
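
As a rough illustration of that "first object wins" behavior, here is a toy content-addressed store in Python. It is only a sketch under my own assumptions, not how Git is implemented: `ToyObjectStore`, `weak_id` and `silently_lost` are hypothetical names, and the ID is deliberately cut down to one hex digit so that collisions are guaranteed after 17 writes.

```python
import hashlib


def weak_id(data: bytes) -> str:
    """Deliberately weak ID: only the first hex digit of SHA-1 (16 values)."""
    return hashlib.sha1(data).hexdigest()[:1]


class ToyObjectStore:
    """Content-addressed store where an existing ID is never overwritten."""

    def __init__(self):
        self._objects = {}

    def write(self, data: bytes) -> str:
        oid = weak_id(data)
        # If the ID exists, the object is assumed to be already written,
        # so the new data is dropped without any warning.
        self._objects.setdefault(oid, data)
        return oid

    def read(self, oid: str) -> bytes:
        return self._objects[oid]


def silently_lost(versions):
    """Return the versions whose content cannot be read back after writing."""
    store = ToyObjectStore()
    lost = []
    for data in versions:
        oid = store.write(data)
        if store.read(oid) != data:
            lost.append(data)
    return lost


if __name__ == "__main__":
    versions = [f"version {i}\n".encode() for i in range(17)]
    for data in silently_lost(versions):
        print("lost:", data)
```

Every write "succeeds", yet some versions can never be read back, which is the confusing part.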

A bit further down, that same reference attempts to illustrate the likely-ness of such a collision:

> Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
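
The order of magnitude in that quote can be checked with the usual birthday approximation. The sketch below is my own back-of-the-envelope calculation, not something taken from Pro Git: it assumes object IDs behave like uniformly random 160-bit values, and the function names are hypothetical.

```python
import math

HASH_BITS = 160  # size of a SHA-1 digest


def collision_probability(n_objects: float, bits: int = HASH_BITS) -> float:
    """Birthday approximation: p = 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(n_objects ** 2) / (2 * 2 ** bits))


def objects_for_probability(p: float, bits: int = HASH_BITS) -> float:
    """Number of random objects needed to reach collision probability p."""
    return math.sqrt(2 * 2 ** bits * math.log(1 / (1 - p)))


if __name__ == "__main__":
    # Scenario from the quote: 6.5 billion people, 1 million objects per
    # second each, for 5 years.
    seconds = 5 * 365.25 * 24 * 3600
    n_scenario = 6.5e9 * 1e6 * seconds
    print(f"objects in the scenario:  {n_scenario:.3e}")
    print(f"objects for a 50% chance: {objects_for_probability(0.5):.3e}")
    print(f"scenario probability:     {collision_probability(n_scenario):.2f}")
```

Under these assumptions both counts come out around 10^24 objects, so the quoted scenario is in the right ballpark even if the exact percentage depends on the approximation used.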

Solution 4 - Git

To add to my previous answer from 2012, there is now (Feb. 2017, five years later) an example of an actual SHA-1 collision with shattered.io, where you can craft two colliding PDF files: that is, obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file.
See also "At death’s door for years, widely used SHA1 function is now dead", and this illustration.

Update 26th of February: Linus confirmed the following points in a Google+ post:

> (1) First off - the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a "content identifier" for a content-addressable system like git.

> (2) Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there's already been two sets of patches posted for that mitigation.

> (3) And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.

Regarding that transition, see the Q1 2018 Git 2.16 adding a structure representing hash algorithm. The implementation of that transition has started.

Starting with Git 2.19 (Q3 2018), Git has picked SHA-256 as NewHash, and is in the process of integrating it into the code (meaning SHA1 is still the default (Q2 2019, Git 2.21), but SHA2 will be the successor).


Original answer (25th of February) But:

Joey Hess tried those PDFs in a Git repo and found:

> That includes two files with the same SHA and size, which do get different blobs thanks to the way git prepends the header to the content.

joey@darkstar:~/tmp/supercollider>sha1sum  bad.pdf good.pdf 
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  bad.pdf
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  good.pdf
joey@darkstar:~/tmp/supercollider>git ls-tree HEAD
100644 blob ca44e9913faf08d625346205e228e2265dd12b65	bad.pdf
100644 blob 5f90b67523865ad5b1391cb4a1c010d541c816c1	good.pdf

> While appending identical data to these colliding files does generate other collisions, prepending data does not.

So the main vector of attack (forging a commit) would be:

> - Generate a regular commit object;
> - use the entire commit object + NUL as the chosen prefix, and
> - use the identical-prefix collision attack to generate the colliding good/bad objects.
> - ... and this is useless because the good and bad commit objects still point to the same tree!

Plus, you can already detect cryptanalytic collision attacks against SHA-1 present in each file with cr-marcstevens/sha1collisiondetection.

Adding a similar check in Git itself would have some computation cost.

On changing the hash, Linus comments:

> The size of the hash and the choice of the hash algorithm are independent issues.
What you'd probably do is switch to a 256-bit hash, use that internally and in the native git database, and then by default only show the hash as a 40-character hex string (kind of like how we already abbreviate things in many situations).
That way tools around git don't even see the change unless passed in some special "--full-hash" argument (or "--abbrev=64" or whatever - the default being that we abbreviate to 40).

Still, a transition plan (from SHA1 to another hash function) would be complex, but it is being actively studied.
A convert-to-object_id campaign is in progress.


Update 20th of March: GitHub details a possible attack and its protection:

> SHA-1 names can be assigned trust through various mechanisms. For instance, Git allows you to cryptographically sign a commit or tag. Doing so signs only the commit or tag object itself, which in turn points to other objects containing the actual file data by using their SHA-1 names. A collision in those objects could produce a signature which appears valid, but which points to different data than the signer intended. In such an attack the signer only sees one half of the collision, and the victim sees the other half.

Protection:

> The recent attack uses special techniques to exploit weaknesses in the SHA-1 algorithm that find a collision in much less time. These techniques leave a pattern in the bytes which can be detected when computing the SHA-1 of either half of a colliding pair.

> GitHub.com now performs this detection for each SHA-1 it computes, and aborts the operation if there is evidence that the object is half of a colliding pair. That prevents attackers from using GitHub to convince a project to accept the "innocent" half of their collision, as well as preventing them from hosting the malicious half.

See "sha1collisiondetection" by Marc Stevens


Again, with Q1 2018 Git 2.16 adding a structure representing hash algorithm, the implementation of a transition to a new hash has started.
As mentioned above, the new supported Hash will be SHA-256.

Solution 5 - Git

I think cryptographers would celebrate.

Quote from Wikipedia article on SHA-1:

> In February 2005, an attack by Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu was announced. The attacks can find collisions in the full version of SHA-1, requiring fewer than 2^69 operations. (A brute-force search would require 2^80 operations.)
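
To put those exponents in perspective, here is a tiny Python calculation. It only restates figures quoted on this page (2^80 for brute force, 2^69 for the 2005 attack, and the 9,223,372,036,854,775,808 computations cited for shattered.io, which is 2^63); the constant and function names are my own.

```python
BRUTE_FORCE_LOG2 = 80  # generic collision search on a 160-bit hash
WANG_2005_LOG2 = 69    # upper bound quoted for the 2005 attack
SHATTERED_LOG2 = 63    # 9,223,372,036,854,775,808 == 2**63 computations


def speedup(attack_log2: int, baseline_log2: int = BRUTE_FORCE_LOG2) -> int:
    """How many times cheaper an attack is than the baseline search."""
    return 2 ** (baseline_log2 - attack_log2)


if __name__ == "__main__":
    print("2005 attack vs brute force:", speedup(WANG_2005_LOG2))
    print("shattered.io vs brute force:", speedup(SHATTERED_LOG2))
```

Each step down in the exponent halves the work, so a drop from 2^80 to 2^69 is already a factor of a couple of thousand.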

Solution 6 - Git

There are several different attack models for hashes like SHA-1, but the one usually discussed is collision search, including Marc Stevens' HashClash tool.

> "As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers."

As folks pointed out, you could force a hash collision with git, but doing so won't overwrite the existing objects in another repository. I'd imagine even git push -f --no-thin won't overwrite the existing objects, but not 100% sure.

That said, if you hack into a remote repository then you could make your false object the older one there, possibly embedding hacked code into an open source project on github or similar. If you were careful then maybe you could introduce a hacked version that new users downloaded.

I suspect, however, that many things the project's developers might do could either expose or accidentally destroy your multi-million dollar hack. In particular, that's a lot of money down the drain if some developer you didn't hack ever runs the aforementioned git push --no-thin after modifying the affected files, and sometimes even without --no-thin, depending on the circumstances.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Gnurou | View Question on Stackoverflow
Solution 1 - Git | rubund | View Answer on Stackoverflow
Solution 2 - Git | VonC | View Answer on Stackoverflow
Solution 3 - Git | Mat | View Answer on Stackoverflow
Solution 4 - Git | VonC | View Answer on Stackoverflow
Solution 5 - Git | Willem Hengeveld | View Answer on Stackoverflow
Solution 6 - Git | Jeff Burdges | View Answer on Stackoverflow