How does Git(Hub) handle possible collisions from short SHAs?

Git, Cryptography, GitHub, SHA

Git Problem Overview


Both Git and GitHub display short versions of SHAs -- just the first 7 characters instead of all 40 -- and both Git and GitHub support taking these short SHAs as arguments.

E.g. git show 962a9e8

E.g. https://github.com/joyent/node/commit/962a9e8

Seven hex digits leave "just" 16^7 (about 268 million) possible values, many orders of magnitude fewer than the full 160-bit space. How do Git and GitHub protect against collisions here? And how do they handle them?
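
To put a rough number on that worry, here is a minimal sketch of the usual birthday-problem approximation. It assumes object IDs are uniformly distributed, and the function name `collision_probability` is made up for illustration; it is not anything Git itself provides.

```python
import math


def collision_probability(n_objects: int, hex_digits: int = 7) -> float:
    """Approximate chance that at least two of n_objects share the same
    abbreviated ID, assuming IDs are uniformly random (an assumption)."""
    space = 16 ** hex_digits                  # 16^7 = 268,435,456 for 7 digits
    pairs = n_objects * (n_objects - 1) / 2   # number of distinct object pairs
    # Birthday approximation: P ~ 1 - exp(-pairs / space).
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-pairs / space)


if __name__ == "__main__":
    for n in (1_000, 20_000, 100_000):
        print(f"{n:>7} objects: {collision_probability(n):.3f}")
```

Under that assumption, a repository with around 20,000 objects already has slightly better than even odds that some pair of objects shares its first 7 hex digits.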

Git Solutions


Solution 1 - Git

These short forms exist only to simplify visual recognition and make your life easier. Git doesn't truncate anything; internally, everything is handled with the complete value. You can still use a partial SHA-1 at your convenience:

>Git is smart enough to figure out what commit you meant to type if you provide the first few characters, as long as your partial SHA-1 is at least four characters long and unambiguous — that is, only one object in the current repository begins with that partial SHA-1.
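
The rule in that quote can be sketched in a few lines. This is only an illustration of the stated behaviour (at least four characters, exactly one match), not how Git actually implements the lookup; `resolve_abbrev` and `AmbiguousPrefixError` are hypothetical names.

```python
from typing import Iterable

MIN_ABBREV_LEN = 4  # minimum length given in the quoted passage


class AmbiguousPrefixError(ValueError):
    """Raised when more than one object ID starts with the prefix."""


def resolve_abbrev(prefix: str, object_ids: Iterable[str]) -> str:
    """Return the single full ID that starts with `prefix`."""
    if len(prefix) < MIN_ABBREV_LEN:
        raise ValueError(f"prefix {prefix!r} is shorter than {MIN_ABBREV_LEN} characters")
    wanted = prefix.lower()
    matches = [oid for oid in object_ids if oid.startswith(wanted)]
    if not matches:
        raise LookupError(f"no object starts with {prefix!r}")
    if len(matches) > 1:
        raise AmbiguousPrefixError(f"short SHA1 {prefix} is ambiguous")
    return matches[0]
```

The full IDs are never shortened in storage; the prefix is only a search key over them, which is why an ambiguous prefix is reported as an error instead of being resolved to a guess.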

Solution 2 - Git

I have a repository that has a commit with an id of 000182eacf99cde27d5916aa415921924b82972c.

git show 00018

shows the revision, but

git show 0001

prints

error: short SHA1 0001 is ambiguous.
error: short SHA1 0001 is ambiguous.
fatal: ambiguous argument '0001': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions

(If you're curious, it's a clone of the git repository for git itself; that commit is one that Linus Torvalds made in 2005.)
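
The same idea works in the other direction: given a full ID, keep lengthening the abbreviation until no other object shares it. The sketch below is a naive linear scan for illustration only, not Git's actual algorithm; `shortest_unique_prefix` is a hypothetical helper, and the second object ID in the usage example is invented to reproduce the ambiguity above.

```python
from typing import Iterable


def shortest_unique_prefix(target: str, object_ids: Iterable[str], min_len: int = 4) -> str:
    """Return the shortest prefix of `target` (at least `min_len` characters)
    that no other ID in `object_ids` starts with."""
    others = [oid for oid in object_ids if oid != target]
    for length in range(min_len, len(target) + 1):
        prefix = target[:length]
        if not any(oid.startswith(prefix) for oid in others):
            return prefix
    return target  # only reached if `others` is somehow not distinct from target


if __name__ == "__main__":
    commit = "000182eacf99cde27d5916aa415921924b82972c"
    other = "0001a" + "0" * 35  # invented ID that also begins with 0001
    print(shortest_unique_prefix(commit, [commit, other]))
```

With those two IDs, four characters are ambiguous and five are enough, matching the `0001` versus `00018` behaviour shown above.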

Solution 3 - Git

Two notes here:

  • If you type y anywhere on the GitHub page displaying a commit, you will see the full 40 hex characters of that commit's SHA.
    That illustrates emboss's point: GitHub doesn't truncate anything.

  • And 7 hex digits (28 bits) hasn't been enough since 2010 anyway.
    See commit dce9648 by Linus Torvalds himself (Oct 2010, git 1.7.4.4):

> The default of 7 comes from fairly early in git development, when seven hex digits was a lot (it covers about 250+ million hash values). Back then I thought that 65k revisions was a lot (it was what we were about to hit in BK), and each revision tends to be about 5-10 new objects or so, so a million objects was a big number.

(BK = BitKeeper)

> These days, the kernel isn't even the largest git project, and even the kernel has about 220k revisions (much bigger than the BK tree ever was) and we are approaching two million objects. At that point, seven hex digits is still unique for a lot of them, but when we're talking about just two orders of magnitude difference between number of objects and the hash size, there will be collisions in truncated hash values. It's no longer even close to unrealistic - it happens all the time.

> We should both increase the default abbrev that was unrealistically small, and add a way for people to set their own default per-project in the git config file.
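
The arithmetic behind that message can be checked with a short sketch. It assumes uniformly distributed object IDs, and both function names are made up for illustration. It estimates how many pairs of objects are expected to share an abbreviation of a given length, and how long an abbreviation would need to be to push that estimate below a chosen threshold.

```python
def expected_colliding_pairs(n_objects: int, hex_digits: int) -> float:
    """Expected number of object pairs sharing the same `hex_digits`-long
    prefix, assuming uniformly random IDs (an assumption)."""
    pairs = n_objects * (n_objects - 1) / 2
    return pairs / 16 ** hex_digits


def min_digits_for(n_objects: int, max_pairs: float = 1.0) -> int:
    """Smallest abbreviation length (4 to 40) whose expected number of
    colliding pairs does not exceed `max_pairs`."""
    digits = 4
    while digits < 40 and expected_colliding_pairs(n_objects, digits) > max_pairs:
        digits += 1
    return digits


if __name__ == "__main__":
    n = 2_000_000  # "approaching two million objects", per the quote
    for k in (7, 8, 10, 12):
        print(f"{k:>2} digits: {expected_colliding_pairs(n, k):.4g} expected colliding pairs")
    print("digits needed for under one expected pair:", min_digits_for(n))
```

For two million objects, this estimate gives several thousand colliding pairs at 7 digits, which is consistent with the remark that collisions happen "all the time". The per-project default that the message asks for is the `core.abbrev` setting in the git config file.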

Attributions

All content on this page is sourced from the original question and its answers on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Aseem Kishore | View Question on Stack Overflow
Solution 1 - Git | emboss | View Answer on Stack Overflow
Solution 2 - Git | Keith Thompson | View Answer on Stack Overflow
Solution 3 - Git | VonC | View Answer on Stack Overflow