What algorithm does git use to detect changes on your working tree?


Git Problem Overview


This is about the internals of git.

I've been reading the great 'Pro Git' book and learning a little about how git works internally (SHA1 hashes, blobs, references, trees, commits, etc.). Pretty clever architecture, by the way.

So, to put this into context, git references the content of a file by its SHA1 value, so it can tell whether specific content has changed just by comparing hash values. But my question is specifically about how git checks whether the content in the working tree has changed.
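As a small illustration of the hashing mentioned above, here is a minimal sketch of how a blob's SHA1 can be computed outside of git. It assumes a repository using the default SHA-1 object format; the function name git_blob_sha1 is made up for this example. The result should match what git hash-object prints for the same bytes.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git does not hash the raw bytes alone: it first prepends a short
    # header of the form "blob <size in bytes>\0".
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Same bytes as `echo hello | git hash-object --stdin`
    print(git_blob_sha1(b"hello\n"))
    # prints ce013625030ba8dba906f756967f9e9ca394464a
```

Two files with identical content always get the same ID, which is why comparing hashes is enough to know whether content differs.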

The naive approach would be to assume that each time you run git status or a similar command, git walks through every file in the working directory, calculates its SHA1, and compares it with the one recorded in the last commit. But that seems very inefficient for big projects such as the Linux kernel.

Another idea would be to check the last modification date of each file, but I think git does not store that information (when you clone a repository, all the files get a new timestamp).

I'm sure it does this in an efficient way (git is really fast). Does anyone know how that is achieved?

PS: Just to add an interesting link about the git index, specifically stating that the index keeps information about file timestamps, even though the tree objects do not.

Git Solutions


Solution 1 - Git

Git’s index maintains timestamps of when git last wrote each file into the working tree (and updates these whenever files are cached from the working tree or from a commit). You can see the metadata with git ls-files --debug. In addition to the timestamp, it records the size, inode, and other information from lstat to reduce the chance of a false positive.

When you perform git-status, it simply calls lstat on every file in the working tree and compares the metadata in order to quickly determine which files are unchanged. This is described in the documentation under racy-git and update-index.
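The comparison described above can be sketched roughly as follows. This is only a simplified model: CachedStat, snapshot and looks_unchanged are hypothetical names, and the real index entry stores more fields in its own binary format.

```python
import os
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Hypothetical, simplified stand-in for the stat data kept per index entry."""
    mtime_ns: int
    ctime_ns: int
    size: int
    ino: int
    mode: int


def snapshot(path: str) -> CachedStat:
    """Record the lstat fields we want to remember for a file."""
    st = os.lstat(path)  # lstat does not follow symlinks
    return CachedStat(st.st_mtime_ns, st.st_ctime_ns, st.st_size,
                      st.st_ino, st.st_mode)


def looks_unchanged(path: str, cached: CachedStat) -> bool:
    """Cheap check: no file content is read, only metadata is compared."""
    return snapshot(path) == cached
```

The point of the design is that a single lstat call per file is far cheaper than reading and hashing the file, so only files whose metadata differs need a closer look.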

Solution 2 - Git

On a Unix file system, file metadata is tracked and can be accessed with the lstat call. The stat structure contains multiple timestamps, size information, and more:

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};

It seems that initially Git simply relied on this stat structure to decide if a file had been changed (see reference):

> When checking if they differ, Git first runs lstat(2) on the files and compares the result with this information

However, a race condition was reported (racy-git) that occurs when a file is modified in the following manner:

: modify 'foo'
$ git update-index 'foo'
: modify 'foo' again, in-place, without changing its size 
                      (and quickly enough to not change its timestamps)

This leaves the file modified, but in a way that lstat cannot detect.

To fix this issue, in situations where the lstat data is ambiguous, Git now compares the contents of the file to determine whether it has changed.
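A rough sketch of that fallback is shown below. It follows the idea in the racy-git documentation that an entry whose recorded mtime is the same as or newer than the index file's own timestamp cannot be trusted on stat data alone. All names here are hypothetical, the hash is a plain SHA-1 of the bytes used only as a content fingerprint, and Git's actual logic has more cases.

```python
import hashlib
import os


def content_sha1(path: str) -> str:
    """Plain SHA-1 of the file bytes (a fingerprint, not a Git object ID)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def is_modified(path: str, cached_mtime_ns: int, cached_size: int,
                cached_sha1: str, index_mtime_ns: int) -> bool:
    st = os.lstat(path)
    if st.st_mtime_ns != cached_mtime_ns or st.st_size != cached_size:
        # Stat data differs. This sketch simply reports a change here.
        return True
    if cached_mtime_ns < index_mtime_ns:
        # The file was last written clearly before the index was, so a
        # matching stat is trusted and the content is never read.
        return False
    # "Racily clean": the file may have been rewritten within the same
    # timestamp tick as the index, so fall back to comparing contents.
    return content_sha1(path) != cached_sha1
```

Only the last branch pays the cost of reading the file, and it is reached only for entries written at about the same time as the index itself.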


NOTE:

If anyone is confused, like I was, about the st_mtime description, which states that it is updated by writes "of more than zero bytes," this refers to the number of bytes written, not the net change in file size.

For example, in the case of a text file with a single character A: if A is changed to B, there is 0 net change in total byte size, but st_mtime will still be updated (I had to try it myself to verify; use ls -l to see the timestamp).

Solution 3 - Git

My testing on Windows indicates that Git actually calculates and uses only the file contents hash when deciding if a file has changed.

It seems to completely ignore the dates:

I changed the Creation date of a file using (Get-Item "bd.png").CreationTime=$(Get-Date)

git status reported "nothing to commit, working tree clean"

I changed the Modified date of a file using (Get-Item "bd.png").LastWriteTime=$(Get-Date)

git status reported "nothing to commit, working tree clean"

I changed one byte of the file using a hex editor

git status reported "modified: bd.png"

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Khelben | View Question on Stackoverflow
Solution 1 - Git | Josh Lee | View Answer on Stackoverflow
Solution 2 - Git | bcorso | View Answer on Stackoverflow
Solution 3 - Git | Daniel Webb | View Answer on Stackoverflow