Ways to improve git status performance


Performance Problem Overview


I have a 10 GB repo on a Linux machine, stored on NFS. The first git status takes 36 minutes and subsequent runs take 8 minutes. It seems Git depends on the OS for caching files. Only the first Git commands such as commit or status, which involve packing/repacking the whole repo, take this long on a huge repo. I am not sure whether you have used git status on such a large repo, but has anyone come across this issue?

I have tried git gc, git clean, and git repack, but the time taken is almost the same.

Will submodules, or any other approach such as breaking the repo into smaller ones, help? If so, which is the best way to split a large repo? Is there any other way to improve the time taken for git commands on a large repo?

Performance Solutions


Solution 1 - Performance

To be more precise, git depends on the efficiency of the lstat(2) system call, so tweaking your client’s “attribute cache timeout” might do the trick.

The manual for git-update-index (essentially a manual mode for git-status) describes what you can do to alleviate this: use the --assume-unchanged flag to suppress its normal behavior, and manually update the paths that you have changed. You might even program your editor to unset this flag every time you save a file.
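A minimal sketch of the --assume-unchanged workflow described above. It runs in a throwaway repository; the file name big-file.txt and the identity settings are placeholders, not part of the original answer.

```shell
set -eu
# Throwaway repository so the sketch does not touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"
echo "v1" > big-file.txt
git add big-file.txt
git commit -q -m "initial"

# Ask Git to assume this path is unchanged instead of checking it.
git update-index --assume-unchanged big-file.txt

# An edit made now is not reported by git status.
echo "version two" > big-file.txt
hidden_status=$(git status --porcelain)

# Clear the bit (what an editor save hook would do) so the change shows up again.
git update-index --no-assume-unchanged big-file.txt
visible_status=$(git status --porcelain)

printf 'with flag:    [%s]\n' "$hidden_status"
printf 'without flag: [%s]\n' "$visible_status"
```

The trade-off is that Git only sees edits to flagged paths once the flag is cleared, so this suits files you rarely touch.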

The alternative, as you suggest, is to reduce the size of your checkout (the size of the packfiles doesn’t really come into play here). The options are a sparse checkout, submodules, or Google’s repo tool.

(There’s a mailing list thread about using Git with NFS, but it doesn’t answer many questions.)

Solution 2 - Performance

I'm also seeing this problem on a large project shared over NFS.

It took me some time to discover the flag -uno that can be given to both git commit and git status.

This flag disables the search for untracked files, which significantly reduces the number of NFS operations. To discover untracked files, git has to look in every subdirectory, so a tree with many subdirectories will hurt you. Telling git not to look for untracked files eliminates all these NFS operations.

Combine this with the core.preloadindex setting and you can get reasonable performance even on NFS.
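As a rough illustration of the two settings above, the sketch below compares git status with and without -uno in a scratch repository, then makes the behaviour permanent through configuration. The file and directory names are made up for the demo.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "tracked" > tracked.txt
git add tracked.txt
mkdir -p deep/nested/dir
echo "scratch" > deep/nested/dir/untracked.txt

# Default: Git walks the directory tree to find untracked files.
full=$(git status --porcelain)

# -uno (short for --untracked-files=no) skips that directory walk.
fast=$(git status --porcelain -uno)

# Make both behaviours the default for this repository.
git config status.showUntrackedFiles no
git config core.preloadIndex true
configured=$(git status --porcelain)

printf 'default:\n%s\n-uno:\n%s\n' "$full" "$fast"
```

With status.showUntrackedFiles set to no, new files stay invisible to git status until they are added explicitly, so remember to git add them.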

Solution 3 - Performance

Try git gc. Also, git clean may help.

UPDATE - Not sure where the down vote came from, but the git manual specifically states:

> Runs a number of housekeeping tasks within the current repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects which may have been created from prior invocations of git add.

> Users are encouraged to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance.

I always notice a difference after running git gc when git status is slow!

UPDATE II - Not sure how I missed this, but the OP already tried git gc and git clean. I swear that wasn't originally there, but I don't see any changes in the edits. Sorry for that!

Solution 4 - Performance

If your git repo makes heavy use of submodules, you can greatly speed up the performance of git status by editing the config file in the .git directory and setting ignore = dirty on any particularly large/heavy submodules. For example:

[submodule "mysubmodule"]
url = ssh://mysubmoduleURL
ignore = dirty

You'll lose the convenience of a reminder that there are unstaged changes in any of the submodules that you may have forgotten about, but you'll still retain the main convenience of knowing when the submodules are out of sync with the main repo. Plus, you can still change your working directory to the submodule itself and use git status within it as per usual to see more information. See this question for more details about what "dirty" means.
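The same setting can be written with git config instead of editing the file by hand. This is a small sketch in a scratch repository; "mysubmodule" is the placeholder name from the example above, and no real submodule is created here.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Writes ignore = dirty under [submodule "mysubmodule"] in .git/config.
git config submodule.mysubmodule.ignore dirty

# Read the value back to confirm what was stored.
ignore_mode=$(git config --get submodule.mysubmodule.ignore)
printf 'submodule.mysubmodule.ignore = %s\n' "$ignore_mode"
```

In a real repository you would run only the git config line, once for each heavy submodule.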

Solution 5 - Performance

The performance of git status should improve with Git 2.13 (Q2 2017).

See commit 950a234 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 8b6bba6, 24 Apr 2017)

> string-list: use ALLOC_GROW macro when reallocing string_list

> Use ALLOC_GROW() macro when reallocing a string_list array rather than simply increasing it by 32.
This is a performance optimization.

> During status on a very large repo and there are many changes, a significant percentage of the total run time is spent reallocing the wt_status.changes array.

> This change decreases the time in wt_status_collect_changes_worktree() from 125 seconds to 45 seconds on my very large repository.


Plus, Git 2.17 (Q2 2018) will introduce a new trace, for measuring where the time is spent in the index-heavy operations.

See commit ca54d9b (27 Jan 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit 090dbea, 15 Feb 2018)

> ## trace: measure where the time is spent in the index-heavy operations

> All the known heavy code blocks are measured (except object database access). This should help identify if an optimization is effective or not.
An unoptimized git-status would give something like below:

0.001791141 s: read cache ...
0.004011363 s: preload index
0.000516161 s: refresh index
0.003139257 s: git command: ... 'status' '--porcelain=2'
0.006788129 s: diff-files
0.002090267 s: diff-index
0.001885735 s: initialize name hash
0.032013138 s: read directory
0.051781209 s: git command: './git' 'status'
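To collect a trace like the one above, one option is the GIT_TRACE_PERFORMANCE environment variable. The sketch below runs it against a scratch repository; the exact lines and timings depend on the Git version and the repository.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > file.txt
git add file.txt

# With the variable set to 1, trace lines go to stderr
# (an absolute file path can be given instead of 1).
# Capture stderr and discard the normal status output.
trace=$(GIT_TRACE_PERFORMANCE=1 git status --porcelain 2>&1 >/dev/null)
printf '%s\n' "$trace"
```

Running the same command before and after a configuration change gives a simple way to check whether the change helped.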

The same Git 2.17 (Q2 2018) improves git status with:

> ## revision.c: reduce object database queries

> In mark_parents_uninteresting(), we check for the existence of an object file to see if we should treat a commit as parsed. The result is to set the "parsed" bit on the commit.

> Modify the condition to only check has_object_file() if the result would change the parsed bit.

> When a local branch is different from its upstream ref, "git status" will compute ahead/behind counts.
This uses paint_down_to_common() and hits mark_parents_uninteresting().

> On a copy of the Linux repo with a local instance of "master" behind the remote branch "origin/master" by ~60,000 commits, we find the performance of "git status" went from 1.42 seconds to 1.32 seconds, for a relative difference of -7.0%.


Git 2.24 (Q4 2019) proposes another setting to improve git status performance:

See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)

> ## repo-settings: create feature.manyFiles setting

> The feature.manyFiles setting is suitable for repos with many files in the working directory.
By setting index.version=4 and core.untrackedCache=true, commands such as 'git status' should improve.

But:

With Git 2.24 (Q4 2019), the codepath that reads the index.version configuration was broken with a recent update, which has been corrected.

See commit c11e996 (23 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4d6fb2b, 24 Oct 2019)

> ## repo-settings: read an int for index.version
> Signed-off-by: Derrick Stolee

> Several config options were combined into a repo_settings struct in ds/feature-macros, including a move of the "index.version" config setting in 7211b9e ("repo-settings: consolidate some config settings", 2019-08-13, Git v2.24.0-rc1 -- merge listed in batch #0).

> Unfortunately, that file looked like a lot of boilerplate and what is clearly a factor of copy-paste overload, the config setting is parsed with repo_config_get_bool() instead of repo_config_get_int(). This means that a setting "index.version=4" would not register correctly and would revert to the default version of 3.

> I caught this while incorporating v2.24.0-rc0 into the VFS for Git codebase, where we really care that the index is in version 4.

> This was not caught by the codebase because the version checks placed in t1600-index.sh did not test the "basic" scenario enough. Here, we modify the test to include these normal settings to not be overridden by features.manyFiles or GIT_INDEX_VERSION.
While the "default" version is 3, this is demoted to version 2 in do_write_index() when not necessary.
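A hedged sketch of enabling feature.manyFiles and checking which index version Git then writes. It assumes Git 2.24 or later with the fix above. Reading the version from the index header with od is a demo shortcut, not something the quoted commits use.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# One switch instead of setting index.version=4 and core.untrackedCache=true by hand.
git config feature.manyFiles true

# Any command that writes the index picks up the new default version.
echo "x" > a.txt
git add a.txt

# The index header is "DIRC" followed by a 4-byte big-endian version;
# byte offset 7 holds the low byte of that number.
index_version=$(od -An -tu1 -j7 -N1 .git/index | tr -d ' \n')
printf 'index version: %s\n' "$index_version"
```

If the printed version is not 4, the installed Git may predate the setting or the fix described above.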


git status will also compare SHA1 faster, due to Git 2.33 (Q3 2021), using an optimized hashfile API in the codepath that writes the index file.

See commit f6e2cd0, commit 410334e, commit 2ca245f (18 May 2021), and commit 68142e1 (17 May 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 0dd2fd1, 14 Jun 2021)

> ## csum-file.h: increase hashfile buffer size
> Signed-off-by: Derrick Stolee

> The hashfile API uses a hard-coded buffer size of 8KB and has ever since it was introduced in c38138c ("git-pack-objects: write the pack files with a SHA1 csum", 2005-06-26, Git v0.99 -- merge).
> It performs a similar function to the hashing buffers in read-cache.c, but that code was updated from 8KB to 128KB in f279894 ("read-cache: make the index write buffer size 128K", 2021-02-18, Git v2.31.0-rc1 -- merge).
> The justification there was that do_write_index() improves from 1.02s to 0.72s.
> Since our end goal is to have the index writing code use the hashfile API, we need to unify this buffer size to avoid a performance regression.

> Since these buffers are now on the heap, we can adjust their size based on the needs of the consumer.
> In particular, callers to hashfd_throughput() are expecting to report progress indicators as the buffer flushes.
> These callers would prefer the smaller 8k buffer to avoid large delays between updates, especially for users with slower networks.
> When the progress indicator is not used, the larger buffer is preferable.

> By adding a new trace2 region in the chunk-format API, we can see that the writing portion of 'git multi-pack-index write'(man) lowers from ~1.49s to ~1.47s on a Linux machine.
> These effects may be more pronounced or diminished on other filesystems.

Solution 6 - Performance

git config --global core.preloadIndex true

Did the job for me. Check the official documentation here.

Solution 7 - Performance

In our codebase where we have somewhere in the range of 20 - 30 submodules,
git status --ignore-submodules
sped things up for me drastically. Do note that this will not report on the status of submodules.

Solution 8 - Performance

Something that hasn't been mentioned yet is activating the filesystem cache on Windows machines (Linux filesystems are completely different and git was optimized for them, so this probably only helps on Windows).

git config core.fscache true


As a last resort, if git is still slow, one could turn off the modification time inspection that git needs to find out which files have changed.

git config core.ignoreStat true

BUT: changed files then have to be added by the developer with git add. Git no longer finds changes itself.


Solution 9 - Performance

OK, this is hard to believe, and I would not have believed it had I not seen it with my own eyes. I had very bad performance on my brand new work laptop: git status took 5 to 10 seconds to complete, even for the most trivial repository. I tried all the advice in this thread, then noticed that git log was slow too, so I broadened my search to general slowness of a fresh git installation and found this: https://github.com/gitextensions/gitextensions/issues/5314#issuecomment-416081823

In a desperate move I tried updating the graphics driver of my laptop and...

> Holy Santa Claus sh*t... that did the trick!

...for me too!

So apparently the graphics card driver has some relation here. It is hard to understand why, but the performance is now as expected!

Solution 10 - Performance

Leftover index.lock files

git status can be pathologically slow when you have leftover index.lock files.

This happens especially when you have git submodules, because then you often don't notice such leftover files.

Summary: Run find .git/ -name index.lock, and delete the leftover files after checking that they are indeed not used by any currently running program.
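The summary can be sketched as a small script. The directory layout below is a stand-in created for the demo, and the pgrep test is only a rough way to check that no git process is running; neither comes from the original answer.

```shell
set -eu
# Demo layout standing in for a repository with one submodule (hypothetical paths).
work=$(mktemp -d)
mkdir -p "$work/.git/modules/assets"
touch "$work/.git/index" "$work/.git/modules/assets/index.lock"

# List leftover lock files first and review the list before removing anything.
locks=$(find "$work/.git" -name index.lock -type f)
printf 'found: %s\n' "$locks"

# Delete only when no process named git is running
# (pgrep exits non-zero when it finds no match).
if ! pgrep -x git >/dev/null 2>&1; then
    find "$work/.git" -name index.lock -type f -exec rm -f {} +
fi
```

In a real repository, replace "$work/.git" with the path to your own .git directory and skip the setup lines.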


Details

I found that my shell git status was extremely slow in my repo, with git 2.19 on Ubuntu 16.04.

Dug in and found that /usr/bin/time git status in my assets git submodule took 1.7 seconds.

Found with strace that git read all my big files in there with mmap. It doesn't usually do that, usually stat is enough.

I googled the problem and found the Use of index and Racy Git problem.

Tried git update-index somefile (in my case gitignore in the submodule checkout) shown here but it failed with

fatal: Unable to create '/home/niklas/src/myproject/.git/modules/assets/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

This is a classical error. Usually you notice it at any git operation, but for submodules that you don't often commit to, you may not notice it for months, because it only appears when adding something to the index; the warning is not raised on read-only git status.

Removing the index.lock file, git status became fast immediately, mmaps disappeared, and it's now over 1000x faster.

So if your git status is unnaturally slow, check find .git/ -name index.lock and delete the leftovers.

Solution 11 - Performance

This is a pretty old question, but I am surprised that no one has commented on binary files, given the repository size.

You mentioned that your git repo is ~10 GB. Apart from the NFS issue and other git issues (resolvable by git gc and the configuration changes outlined in other answers), git commands (git status, git diff, git add) might be slow because of a large number of binary files in the repository. Git is not good at handling binary files. You can remove unnecessary binary files using the following command (the example is for NetCDF files; back up the git repository first):

# Rewrite every ref, dropping all *.nc files from the index of each commit.
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch "*.nc"' \
--prune-empty --tag-name-filter cat -- --all

Do not forget to add '*.nc' to the .gitignore file to stop git from committing the files again.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Senthil A Kumar | View Question on Stackoverflow |
| Solution 1 - Performance | Josh Lee | View Answer on Stackoverflow |
| Solution 2 - Performance | user1077329 | View Answer on Stackoverflow |
| Solution 3 - Performance | Jabari | View Answer on Stackoverflow |
| Solution 4 - Performance | beno | View Answer on Stackoverflow |
| Solution 5 - Performance | VonC | View Answer on Stackoverflow |
| Solution 6 - Performance | klimat | View Answer on Stackoverflow |
| Solution 7 - Performance | citysurrounded | View Answer on Stackoverflow |
| Solution 8 - Performance | dCSeven | View Answer on Stackoverflow |
| Solution 9 - Performance | Mosè Bottacini | View Answer on Stackoverflow |
| Solution 10 - Performance | nh2 | View Answer on Stackoverflow |
| Solution 11 - Performance | MS_ | View Answer on Stackoverflow |