Git is really slow for 100,000 objects. Any fixes?

Tags: performance, git, git-svn

Performance Problem Overview


I have a "fresh" git-svn repo (11.13 GB) that has over 100,000 objects in it.

I have performed

git fsck
git gc

on the repo after the initial checkout.

I then tried to do a

git status

The time it takes to do a git status is anywhere from 2m25.578s to 2m53.901s.

I tested git status by issuing the command

time git status

five times; all of the runs fell between the two times listed above.
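The timing runs can be scripted; a minimal sketch (it creates a throwaway repo so it runs anywhere — in practice, run the loop inside the repository you want to measure; `date +%s%N` assumes GNU date):

```shell
# Time five consecutive runs of `git status`.
repo=$(mktemp -d)
cd "$repo" && git init -q
for i in 1 2 3 4 5; do
    start=$(date +%s%N)                  # nanoseconds (GNU date)
    git status > /dev/null
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
done
```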

I am doing this on Mac OS X, locally, not through a VM.

There is no way it should be taking this long.

Any ideas? Help?

Thanks.

Edit

I have a co-worker sitting right next to me with a comparable box. Less RAM and running Debian with a jfs filesystem. His git status runs in 0.3 seconds on the same repo (it is also a git-svn checkout).

Also, I recently changed my file permissions (to 777) on this folder and it brought the time down considerably (why, I have no clue). I can now get it done anywhere between 3 and 6 seconds. This is manageable, but still a pain.

Performance Solutions


Solution 1 - Performance

It came down to a couple of items that I can see right now.

  1. git gc --aggressive
  2. Opening up file permissions to 777

There has to be something else going on, but these were the things that clearly made the biggest impact.

Solution 2 - Performance

git status has to look at every file in the repository every time. You can tell it to stop looking at trees that you aren't working on with

git update-index --assume-unchanged <trees to skip>


From the manpage:

> When these flags are specified, the object names recorded for the paths are not updated. Instead, these options set and unset the "assume unchanged" bit for the paths. When the "assume unchanged" bit is on, git stops checking the working tree files for possible modifications, so you need to manually unset the bit to tell git when you change the working tree file. This is sometimes helpful when working with a big project on a filesystem that has very slow lstat(2) system call (e.g. cifs).
>
> This option can be also used as a coarse file-level mechanism to ignore uncommitted changes in tracked files (akin to what .gitignore does for untracked files). Git will fail (gracefully) in case it needs to modify this file in the index e.g. when merging in a commit; thus, in case the assumed-untracked file is changed upstream, you will need to handle the situation manually.
>
> Many operations in git depend on your filesystem to have an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set "assume unchanged" bit to paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it has changed — it makes git to omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping "assume unchanged" bit, either before or after you modify them.
>
> ...
>
> In order to set "assume unchanged" bit, use --assume-unchanged option. To unset, use --no-assume-unchanged.
>
> The command looks at core.ignorestat configuration variable. When this is true, paths updated with git update-index paths… and paths updated with other git commands that update both index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that "assume unchanged" bit is not set if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").


Now, clearly, this solution is only going to work if there are parts of the repo that you can conveniently ignore. I work on a project of similar size, and there are definitely large trees that I don't need to check on a regular basis. The semantics of git-status make it a generally O(n) problem (n being the number of files). You need domain-specific optimizations to do better than that.

Note that if you work in a stitching pattern, that is, if you integrate changes from upstream by merge instead of rebase, then this solution becomes less convenient, because a change to an --assume-unchanged object merging in from upstream becomes a merge conflict. You can avoid this problem with a rebasing workflow.
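A minimal sketch of that workflow, run in a throwaway repo (the path name here is made up for illustration):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
mkdir -p big/tree
echo v1 > big/tree/generated.dat
git add big/tree/generated.dat
git -c user.name=t -c user.email=t@example.com commit -q -m "add file"

# Tell git to stop stat()ing this path on every `git status`.
git update-index --assume-unchanged big/tree/generated.dat

echo v2 > big/tree/generated.dat     # modify the file...
git status --porcelain               # ...and status reports nothing

# Later, when you do want git to see changes to the path again:
git update-index --no-assume-unchanged big/tree/generated.dat
git status --porcelain               # now shows the modification
```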

Solution 3 - Performance

git status should be quicker in Git 2.13 (Q2 2017), in part because verification of the index checksum is now skipped.

See commit a33fc72 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit cdfe138, 24 Apr 2017)

> ## read-cache: force_verify_index_checksum

> Teach git to skip verification of the SHA-1 checksum at the end of the index file in verify_hdr(), which is called from read_index(), unless the "force_verify_index_checksum" global variable is set.
>
> Teach fsck to force this verification.
>
> The checksum verification is for detecting disk corruption, and for small projects, the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command.


Git 2.14 again improves git status performance by better taking into account the "untracked cache", which allows Git to skip reading untracked directories if their stat data has not changed (using the mtime field of the stat structure).

See the Documentation/technical/index-format.txt for more on untracked cache.

See commit edf3b90 (08 May 2017) by David Turner (dturner-tw).
(Merged by Junio C Hamano -- gitster -- in commit fa0624f, 30 May 2017)

> When "git checkout", "git merge", etc. manipulates the in-core index, various pieces of information in the index extensions are discarded from the original state, as it is usually not the case that they are kept up-to-date and in-sync with the operation on the main index.
>
> The untracked cache extension is copied across these operations now, which would speed up "git status" (as long as the cache is properly invalidated).
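
If you want to try the untracked cache yourself, it can be enabled per repository. A sketch in a throwaway repo; note the feature needs a filesystem with reliable directory mtimes, which `git update-index --test-untracked-cache` can check beforehand:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
git config core.untrackedCache true     # let status keep and use the cache
git update-index --untracked-cache      # write the extension into the index
git config core.untrackedCache          # prints: true
```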


More generally, writing to the cache will also be quicker with Git 2.14.x/2.15.

See commit ce012de, commit b50386c, commit 3921a0b (21 Aug 2017) by Kevin Willford.
(Merged by Junio C Hamano -- gitster -- in commit 030faf2, 27 Aug 2017)

> We used to spend more than necessary cycles allocating and freeing piece of memory while writing each index entry out. This has been optimized.
>
> [That] would save anywhere between 3-7% when the index had over a million entries with no performance degradation on small repos.


Update Dec. 2017: Git 2.16 (Q1 2018) will propose an additional enhancement, this time for git log, since the code to iterate over loose object files just got optimized.

See commit 163ee5e (04 Dec 2017) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 97e1f85, 13 Dec 2017)

> ## sha1_file: use strbuf_add() instead of strbuf_addf()
>
> Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex-values of the string before consuming the path, we can prevent extra computation by using the lower-level method.
>
> One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID (object identifier) abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects.
>
> Most repositories do not have many loose objects before repacking, but in the GVFS case (see "Announcing GVFS (Git Virtual File System)") the repos can grow to have millions of loose objects. Profiling 'git log' performance in Git For Windows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf().
>
> Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading. By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager.
>
> For a copy of the Linux repo with two ~512 MB packfiles and 572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:
>
>     HEAD~1            HEAD
>     ----------------------------------------
>     7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%


Update March 2018: Git 2.17 will improve git status some more: see this answer.


Update: Git 2.20 (Q4 2018) adds the Index Entry Offset Table (IEOT), which allows git status to load the index faster.

See commit 77ff112, commit 3255089, commit abb4bb8, commit c780b9c, commit 3b1d9e0, commit 371ed0d (10 Oct 2018) by Ben Peart (benpeart).
See commit 252d079 (26 Sep 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit e27bfaa, 19 Oct 2018)

> ## read-cache: load cache entries on worker threads

> This patch helps address the CPU cost of loading the index by utilizing the Index Entry Offset Table (IEOT) to divide loading and conversion of the cache entries across multiple threads in parallel.
>
> I used p0002-read-cache.sh to generate some performance data:
>
>     Test w/100,000 files reduced the time by 32.24%
>     Test w/1,000,000 files reduced the time by -4.77%
>
> Note that on the 1,000,000 files case, multi-threading the cache entry parsing does not yield a performance win. This is because the cost to parse the index extensions in this repo far outweigh the cost of loading the cache entries.

That allows for:

> ## config: add new index.threads config setting

> Add support for a new index.threads config setting which will be used to control the threading code in do_read_index().
> - A value of 0 will tell the index code to automatically determine the correct number of threads to use.
> - A value of 1 will make the code single-threaded.
> - A value greater than 1 will set the maximum number of threads to use.
>
> For testing purposes, this setting can be overwritten by setting the GIT_TEST_INDEX_THREADS=<n> environment variable to a value greater than 0.
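For example, letting Git pick the thread count automatically (sketched in a throwaway repo; requires Git 2.20+):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
git config index.threads 0     # 0 = auto-detect; 1 = single-threaded; >1 = cap
git config index.threads       # prints: 0
```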


Git 2.21 (Q1 2019) introduces a new improvement: the loose object cache, used to optimize existence look-ups, has been updated.

See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

> ## object-store: use one oid_array per subdirectory for loose cache

> The loose objects cache is filled one subdirectory at a time as needed. It is stored in an oid_array, which has to be resorted after each add operation. So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.
>
> Use one oid_array for each subdirectory. This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.
>
> The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:
>
>     $ git count-objects
>     26733 objects, 68808 kilobytes
>
>     Test                        HEAD^             HEAD
>     --------------------------------------------------------------------
>     4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
>     4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
>     4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
>     4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
>     4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
>     4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
>     4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%


With Git 2.26 (Q1 2020), the object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.

However, there are some cases where they can work together, and Git was taught about them.

See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)

> ## pack-bitmap: implement BLOB_NONE filtering
> Signed-off-by: Jeff King

> We can easily support BLOB_NONE filters with bitmaps. Since we know the types of all of the objects, we just need to clear the result bits of any blobs.
>
> Note two subtleties in the implementation (which I also called out in comments):
>
> - we have to include any blobs that were specifically asked for (and not reached through graph traversal) to match the non-bitmap version
> - we have to handle in-pack and "ext_index" objects separately. Arguably prepare_bitmap_walk() could be adding these ext_index objects to the type bitmaps. But it doesn't for now, so let's match the rest of the bitmap code here (it probably wouldn't be an efficiency improvement to do so, since the cost of extending those bitmaps is about the same as our loop here, but it might make the code a bit simpler).
>
> Here are perf results for the new test on git.git:
>
>     Test                                    HEAD^             HEAD
>     --------------------------------------------------------------------------------
>     5310.9: rev-list count with blob:none   1.67(1.62+0.05)   0.22(0.21+0.02) -86.8%
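The filter itself is exposed through commands such as git rev-list; a small sketch of what blob:none leaves out, in a throwaway repo with a single commit:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
echo data > file.txt
git add file.txt
git -c user.name=t -c user.email=t@example.com commit -q -m one

git rev-list --objects HEAD | wc -l                      # commit + tree + blob = 3
git rev-list --objects --filter=blob:none HEAD | wc -l   # blob omitted        = 2
```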


To know more about oid_array, consider Git 2.27 (Q2 2020).

See commit 0740d0a, commit c79eddf, commit 7383b25, commit ed4b804, commit fe299ec, commit eccce52, commit 600bee4 (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a768f86, 22 Apr 2020)

> ## oid_array: use size_t for count and allocation
> Signed-off-by: Jeff King

> The oid_array object uses an "int" to store the number of items and the allocated size.
>
> It's rather unlikely for somebody to have more than 2^31 objects in a repository (the sha1's alone would be 40GB!), but if they do, we'd overflow our alloc variable.
>
> You can reproduce this case with something like:
>
>     git init repo
>     cd repo
>
>     # make a pack with 2^24 objects
>     perl -e '
>       my $nr = 2**24;
>       for (my $i = 0; $i < $nr; $i++) {
>         print "blob\n";
>         print "data 4\n";
>         print pack("N", $i);
>       }
>     ' | git fast-import
>
>     # now make 256 copies of it; most of these objects will be duplicates,
>     # but oid_array doesn't de-dup until all values are read and it can
>     # sort the result.
>     cd .git/objects/pack/
>     pack=$(echo *.pack)
>     idx=$(echo *.idx)
>     for i in $(seq 0 255); do
>       # no need to waste disk space
>       ln "$pack" "pack-extra-$i.pack"
>       ln "$idx" "pack-extra-$i.idx"
>     done
>
>     # and now force an oid_array to store all of it
>     git cat-file --batch-all-objects --batch-check
>
> which results in:
>
>     fatal: size_t overflow: 32 * 18446744071562067968
>
> So the good news is that st_mult() sees the problem (the large number is because our int wraps negative, and then that gets cast to a size_t), doing the job it was meant to: bailing in crazy situations rather than causing an undersized buffer.
>
> But we should avoid hitting this case at all, and instead limit ourselves based on what malloc() is willing to give us. We can easily do that by switching to size_t.
>
> The cat-file process above made it to ~120GB virtual set size before the integer overflow (our internal hash storage is 32-bytes now in preparation for sha256, so we'd expect ~128GB total needed, plus potentially more to copy from one realloc'd block to another). After this patch (and about 130GB of RAM+swap), it does eventually read in the whole set. No test for obvious reasons.

Note that this object was defined in sha1-array.c, which has since been renamed oid-array.c: a more neutral name, considering Git will eventually transition from SHA-1 to SHA-256.


Another optimization:

With Git 2.31 (Q1 2021), the code around the cache-tree extension in the index has been optimized.

See commit a4b6d20, commit 4bdde33, commit 22ad860, commit 845d15d (07 Jan 2021), and commit 0e5c950, commit 4c3e187, commit fa7ca5d, commit c338898, commit da8be8c (04 Jan 2021) by Derrick Stolee (derrickstolee).
See commit 0b72536 (07 Jan 2021) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit a0a2d75, 05 Feb 2021)

> ## cache-tree: speed up consecutive path comparisons
> Signed-off-by: Derrick Stolee

> The previous change reduced time spent in strlen() while comparing consecutive paths in verify_cache(), but we can do better.
>
> The conditional checks the existence of a directory separator at the correct location, but only after doing a string comparison. Swap the order to be logically equivalent but perform fewer string comparisons.
>
> To test the effect on performance, I used a repository with over three million paths in the index. I then ran the following command on repeat:
>
>     git -c index.threads=1 commit --amend --allow-empty --no-edit
>
> Here are the measurements over 10 runs after a 5-run warmup:
>
>     Benchmark #1: v2.30.0
>       Time (mean ± σ):     854.5 ms ±  18.2 ms
>       Range (min … max):   825.0 ms … 892.8 ms
>
>     Benchmark #2: Previous change
>       Time (mean ± σ):     833.2 ms ±  10.3 ms
>       Range (min … max):   815.8 ms … 849.7 ms
>
>     Benchmark #3: This change
>       Time (mean ± σ):     815.5 ms ±  18.1 ms
>       Range (min … max):   795.4 ms … 849.5 ms
>
> This change is 2% faster than the previous change and 5% faster than v2.30.0.

Solution 4 - Performance

One longer-term solution is to augment git to cache filesystem status internally.

Karsten Blees has done so for msysgit, which dramatically improves performance on Windows. In my experiments, his change has taken the time for "git status" from 25 seconds to 1-2 seconds on my Win7 machine running in a VM.

Karsten's changes: https://github.com/msysgit/git/pull/94

Discussion of the caching approach: https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
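This caching work eventually shipped in Git for Windows as the core.fscache setting; enabling it is a one-liner (sketched in a throwaway repo; the setting is only honored by Git for Windows and is stored but inert on other platforms):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
# core.fscache caches filesystem status in-process (Git for Windows only).
git config core.fscache true
git config core.fscache        # prints: true
```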

Solution 5 - Performance

In general my Mac is OK with git, but if there are a lot of loose objects then it gets much slower. It seems HFS is not so good with lots of files in a single directory.

git repack -ad

Followed by

git gc --prune=now

will make a single pack file and remove any loose objects left over. These commands can take some time to run.
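git count-objects -v is a handy way to confirm the effect (a sketch in a throwaway repo; `count` is the number of loose objects, `in-pack` the packed ones):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
echo hello > f.txt
git add f.txt
git -c user.name=t -c user.email=t@example.com commit -q -m one

git count-objects -v | grep '^count:'     # loose objects before repacking
git repack -a -d -q
git gc --prune=now --quiet
git count-objects -v | grep '^count:'     # 0 once everything is packed
```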

Solution 6 - Performance

For what it's worth, I recently found a large discrepancy in git status times between my master and dev branches.

To cut a long story short, I tracked down the problem to a single 280MB file in the project root directory. It was an accidental check-in of a database dump, so it was fine to delete it.

Here's the before and after:

time git status
# On branch master
nothing to commit (working directory clean)
git status  1.35s user 0.25s system 98% cpu 1.615 total

⚡ rm savedev.sql

⚡ time git status
# On branch master
# Changes not staged for commit:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	deleted:    savedev.sql
#
no changes added to commit (use "git add" and/or "git commit -a")
git status  0.07s user 0.08s system 98% cpu 0.157 total

I have 105,000 objects in store, but it appears that large files are more of a menace than many small files.
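To hunt for such offenders yourself, you can rank every reachable blob in history by size. A sketch (the repo and file names here are made up for illustration; the pipeline is the part to reuse):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
head -c 1048576 /dev/zero > big.bin     # stand-in for an accidental dump file
echo small > small.txt
git add .
git -c user.name=t -c user.email=t@example.com commit -q -m add

# List all reachable blobs with size and path, largest first.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k2 -rn |
  head -5
```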

Solution 7 - Performance

You could try passing the --aggressive switch to git gc and see if that helps:

# this will take a while ...
git gc --aggressive

Also, you could use git filter-branch to delete old commits and/or files if you have things which you don't need in your history (e.g., old binary files).
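A sketch of the filter-branch approach, removing a hypothetical dump.sql from every commit. (Newer Git versions recommend git filter-repo for history rewriting, but filter-branch is what this answer suggests and still works.)

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
echo dump > dump.sql                     # the unwanted file
echo code > app.c
git add .
git -c user.name=t -c user.email=t@example.com commit -q -m initial

# Rewrite all refs, dropping dump.sql from every commit's index.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
  --index-filter 'git rm --cached --ignore-unmatch dump.sql' \
  --prune-empty -- --all

git ls-tree -r HEAD --name-only          # dump.sql is gone; app.c remains
```

Remember that rewritten history has new commit IDs, so anyone who has cloned the repo must re-clone or rebase.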

Solution 8 - Performance

You might also try git repack.

Solution 9 - Performance

Try running a prune command; it will get rid of stale references:

git remote prune origin

(Note: git remote prune removes stale remote-tracking branches. To delete unreachable loose objects themselves, use git prune or git gc.)

Solution 10 - Performance

Maybe Spotlight is trying to index the files. Perhaps disable Spotlight for your code directory. Check Activity Monitor and see which processes are running.

Solution 11 - Performance

I'd create a partition using a different file system. HFS+ has always been sluggish for me compared to doing similar operations on other file systems.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: manumoomoo
Solution 1: manumoomoo
Solution 2: masonk
Solution 3: VonC
Solution 4: Chris Kline
Solution 5: slobobaby
Solution 6: Brendon McLean
Solution 7: David Underhill
Solution 8: baudtack
Solution 9: Devnegikec
Solution 10: neoneye
Solution 11: srparish